Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06: - OQ-01: Remove ONNX Runtime from scope entirely — doesn't support activation extraction natively (optimum #972 closed as not planned), bloated model exports; burn/cublas via safetensors is a better future path - OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py - OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds <100-line adapter packages for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails - OQ-06: TOML for file-based config — standard modern Python, two-way door Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.
593 lines
37 KiB
Markdown
593 lines
37 KiB
Markdown
# Research: Guardrail Integration Patterns for alknet-firewall
|
||
|
||
**Date**: June 2026
|
||
**Scope**: How existing guardrail/integration systems accept external defenses, and which patterns are compatible with alknet-firewall's behavioral signal detection approach
|
||
**Purpose**: Inform the integration strategy — adapters, common interface, or standalone API
|
||
|
||
---
|
||
|
||
## Table of Contents
|
||
|
||
1. [Executive Summary](#1-executive-summary)
|
||
2. [Overview of Each System](#2-overview-of-each-system)
|
||
3. [Comparison Table](#3-comparison-table)
|
||
4. [Analysis for alknet-firewall](#4-analysis-for-alknet-firewall)
|
||
5. [Recommendation](#5-recommendation)
|
||
6. [References](#6-references)
|
||
|
||
---
|
||
|
||
## 1. Executive Summary
|
||
|
||
After analyzing six major guardrail/integration systems (LlamaFirewall, NeMo Guardrails, Guardrails AI, OpenAI Agents SDK, Amazon Bedrock Guardrails, and OpenGuardrails), the evidence strongly supports a **standalone API with thin adapter pattern** for alknet-firewall:
|
||
|
||
- **Phase 1**: Provide a clean, synchronous standalone API (`Firewall.screen(text) → Alarm`) and let users compose it manually with their existing systems. This is the fastest path to adoption and avoids premature abstraction.
|
||
- **Phase 2**: Build thin adapters for the three highest-value integration targets: LlamaFirewall (custom Scanner), NeMo Guardrails (custom input rail via action), and OpenAI Agents SDK (input guardrail). These adapters should be optional packages, not core dependencies.
|
||
|
||
The key insight is that alknet-firewall's **behavioral signal detection** is fundamentally different from text-surface defenses. It requires running a model to extract activations — this means it cannot simply be plugged into regex pipelines, text classifiers, or rule-based rails. It needs its own inference step. The systems that are most compatible with this are those that accept **arbitrary Python callables** as their extension points (LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK). The systems that are least compatible are those that require text-surface validators (Guardrails AI's Validator pattern) or configuration-only DSLs (NeMo Guardrails' Colang flows).
|
||
|
||
---
|
||
|
||
## 2. Overview of Each System
|
||
|
||
### 2.1 LlamaFirewall (Meta)
|
||
|
||
- **Overview**: An open-source, real-time guardrail framework from Meta that orchestrates multiple security scanners across LLM application workflows. It is part of the PurpleLlama project and is used in production at Meta. It provides a modular scanner architecture with role-based assignment (user, assistant, tool messages).
|
||
|
||
- **Integration Pattern**: **Scanner/Plugin Pattern**. LlamaFirewall exposes a `BaseScanner` abstract class. Custom scanners inherit from `BaseScanner` and implement a `scan()` method. Scanners are registered via a `ScannerType` enum and mapped to message roles in a configuration dictionary. The policy engine orchestrates scanner execution and aggregates results.
|
||
|
||
- **API Surface**:
|
||
```python
|
||
# Core interface
|
||
class BaseScanner:
|
||
def scan(self, input_data) -> bool: ...
|
||
|
||
# Result type
|
||
@dataclass
|
||
class ScanResult:
|
||
decision: ScanDecision # ALLOW or BLOCK
|
||
reason: str # scanner identifier
|
||
score: float # confidence 0.0-1.0
|
||
|
||
# Main entry point
|
||
firewall = LlamaFirewall(scanners={
|
||
Role.USER: [ScannerType.PROMPT_GUARD],
|
||
Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],
|
||
})
|
||
result = firewall.scan(UserMessage(content="..."))
|
||
# → ScanResult(decision=ScanDecision.BLOCK, reason='prompt_guard', score=0.95)
|
||
|
||
# Conversation replay
|
||
result = firewall.scan_replay(trace)
|
||
```
|
||
|
||
- **Data Flow**: Synchronous, single-message scanning. Also supports `scan_replay()` for conversation traces. No built-in async or streaming.
|
||
|
||
- **Type System**: Role-based (UserMessage, AssistantMessage, ToolMessage), ScanDecision enum (ALLOW/BLOCK), numeric score (0.0–1.0), string reason.
|
||
|
||
- **Composability**: Multiple scanners can be assigned per role; results are aggregated with the most restrictive decision winning. Custom scanners are first-class citizens.
|
||
|
||
- **License**: Llama 3.2 Community License (for models), MIT (for CodeShield)
|
||
|
||
- **Compatibility with alknet-firewall**: **HIGH**. LlamaFirewall's Scanner pattern is a natural fit. A `BehavioralScanner` subclass of `BaseScanner` that wraps alknet-firewall's `Firewall.screen()` would integrate cleanly. The `ScanResult(decision, reason, score)` maps directly to our `Alarm` output. Key consideration: LlamaFirewall's `scan()` receives `input_data` as a string — alknet-firewall would need to accept that string, run it through our detector model, extract activations, and return a verdict. This is architecturally compatible.
|
||
|
||
### 2.2 NeMo Guardrails (NVIDIA)
|
||
|
||
- **Overview**: An open-source toolkit for adding programmable guardrails to LLM-based conversational applications. Uses a domain-specific language (Colang) to define safety rules, dialog flows, and content policies. Supports five types of rails: input, dialog, retrieval, execution, and output.
|
||
|
||
- **Integration Pattern**: **Configuration-Driven Rails with Custom Actions**. NeMo Guardrails uses YAML configuration files and Colang DSL files to define guardrail behavior. External systems are integrated through **custom Python actions** that are invoked from Colang flows. The system provides an `LLMRails` class that wraps LLM calls and enforces rails before/after processing.
|
||
|
||
- **API Surface**:
|
||
```python
|
||
# Core interface — Python API
|
||
from nemoguardrails import LLMRails, RailsConfig
|
||
config = RailsConfig.from_path("PATH/TO/CONFIG")
|
||
rails = LLMRails(config)
|
||
completion = rails.generate(messages=[{"role": "user", "content": "..."}])
|
||
|
||
# Custom action integration
|
||
# In actions.py:
|
||
async def check_behavioral_alarm(context):
|
||
# Custom Python callable — can call alknet-firewall here
|
||
result = firewall.screen(context["user_input"])
|
||
if result.alarm:
|
||
raise Exception("Behavioral alarm triggered")
|
||
|
||
# In rails.co (Colang flow):
|
||
define flow
|
||
user express input
|
||
execute check_behavioral_alarm
|
||
```
|
||
|
||
- **Data Flow**: Supports both sync (`generate`) and async (`generate_async`). Streaming is supported. Rails are processed in a pipeline: input rails → dialog rails → LLM call → retrieval rails → output rails.
|
||
|
||
- **Type System**: Chat Completions API format (OpenAI-compatible messages), Colang event-driven flows, YAML configuration for rail types.
|
||
|
||
- **Composability**: Multiple rails can be chained. Input/output rails can run in parallel (IORails engine, v0.21+). The system is designed for defense-in-depth composition.
|
||
|
||
- **License**: Apache 2.0
|
||
|
||
- **Compatibility with alknet-firewall**: **MEDIUM-HIGH**. NeMo Guardrails supports arbitrary Python actions, which means alknet-firewall can be called as an input rail action. However, the Colang DSL is designed for text-surface rule matching (pattern matching, LLM-based classification). Our behavioral detection doesn't fit into Colang's natural expression — it would be a "black box" action that returns a pass/fail. The integration point is the **input rail** (pre-LLM processing), which is architecturally correct for alknet-firewall. The main consideration: NeMo Guardrails wraps the entire LLM interaction, so alknet-firewall would need to be configured as an input check that runs before the target LLM is invoked.
|
||
|
||
### 2.3 Guardrails AI
|
||
|
||
- **Overview**: An open-source Python framework (Apache 2.0) focused on two functions: (1) running Input/Output Guards that detect and mitigate specific types of risks, and (2) generating structured data from LLMs. It provides a Validator ecosystem via Guardrails Hub with community-contributed validators.
|
||
|
||
- **Integration Pattern**: **Validator/Guard Pipeline Pattern**. Guardrails AI uses a `Guard` object that wraps LLM calls and applies a chain of `Validator` instances. Each Validator is a Python class that inherits from a base `Validator` class and implements a `validate()` method. Validators are organized into Input Guards (pre-LLM) and Output Guards (post-LLM). The framework also supports a REST API server mode.
|
||
|
||
- **API Surface**:
|
||
```python
|
||
# Core interface — Validator base class
|
||
class Validator:
|
||
def validate(self, value, metadata) -> ValidationResult: ...
|
||
|
||
# Guard composition
|
||
guard = Guard().use(
|
||
RegexMatch, regex="...", on_fail=OnFailAction.EXCEPTION
|
||
)
|
||
result = guard.validate("input text")
|
||
|
||
# Or for structured output:
|
||
guard = Guard.for_pydantic(output_class=Pet, prompt=prompt)
|
||
raw_output, validated_output, *rest = guard(llm_api=openai.completions.create, ...)
|
||
|
||
# Server mode (OpenAI-compatible endpoint)
|
||
guardrails start --config=./config.py
|
||
```
|
||
|
||
- **Data Flow**: Synchronous by default. Supports async (`AsyncGuard`). Streaming validation is supported (chunk-by-chunk processing). Validation can trigger re-asks (LLM re-generation).
|
||
|
||
- **Type System**: Pydantic models for structured output, Validator chain with `on_fail` actions (EXCEPTION, FIX, FILTER, NOOP, REFRAIN, LOG), `ValidationResult` with pass/fail/fix metadata.
|
||
|
||
- **Composability**: Validators are chained within a Guard. Multiple Guards can be composed. Guardrails Hub provides a marketplace of reusable validators. Custom validators can be created and published.
|
||
|
||
- **License**: Apache 2.0
|
||
|
||
- **Compatibility with alknet-firewall**: **MEDIUM**. Guardrails AI's Validator pattern expects a `validate(value, metadata) → ValidationResult` interface. Our `Firewall.screen(text) → Alarm` maps reasonably well to this. However, there's a conceptual mismatch: Guardrails AI Validators operate on **text content** (strings, JSON fields) and expect to either pass, fix, or reject the content. Our behavioral detection doesn't modify content — it produces a binary alarm with multi-dimensional signal data. The `on_fail` actions (FIX, FILTER) don't apply to behavioral detection. We could implement a `BehavioralAlarmValidator` that returns PASS or EXCEPTION, but the richer Alarm data (dimension scores, SVD projections) would be lost in the simplified ValidationResult. Also, the Guard pattern wraps the entire LLM call, which means alknet-firewall would need to intercept the input before it reaches the target model — but Guardrails AI is designed for the Guard to wrap and manage the LLM call itself, not to run an independent pre-check.
|
||
|
||
### 2.4 OpenAI Agents SDK
|
||
|
||
- **Overview**: The OpenAI Agents SDK (released March 2025) provides a minimalist Python framework for creating multi-agent workflows with built-in guardrail support. It defines three types of guardrails: input, output, and tool guardrails, each with a tripwire mechanism.
|
||
|
||
- **Integration Pattern**: **Agent-Level Guardrail Callbacks**. Guardrails are defined as decorated async Python functions (`@input_guardrail`, `@output_guardrail`, `@tool_input_guardrail`, `@tool_output_guardrail`) attached to Agent objects. Each guardrail function receives input/context and returns a `GuardrailFunctionOutput` with a `tripwire_triggered` boolean.
|
||
|
||
- **API Surface**:
|
||
```python
|
||
from agents import (
|
||
Agent, GuardrailFunctionOutput, InputGuardrailTripwireTriggered,
|
||
RunContextWrapper, Runner, input_guardrail
|
||
)
|
||
|
||
@input_guardrail
|
||
async def behavioral_alarm_guardrail(
|
||
ctx: RunContextWrapper, agent: Agent, input: str | list
|
||
) -> GuardrailFunctionOutput:
|
||
alarm = firewall.screen(input)
|
||
return GuardrailFunctionOutput(
|
||
output_info={"dimensions": alarm.dimensions, "score": alarm.score},
|
||
tripwire_triggered=alarm.alarm,
|
||
)
|
||
|
||
agent = Agent(
|
||
name="Agent",
|
||
instructions="...",
|
||
input_guardrails=[behavioral_alarm_guardrail],
|
||
)
|
||
|
||
result = await Runner.run(agent, "user input")
|
||
```
|
||
|
||
- **Data Flow**: Supports two execution modes:
|
||
- **Parallel** (default): Guardrail runs concurrently with agent. If tripwire triggers, agent is cancelled.
|
||
- **Blocking**: Guardrail runs first, blocks agent if triggered. This is the correct mode for alknet-firewall since we want to prevent the target LLM from processing flagged input.
|
||
|
||
- **Type System**: `GuardrailFunctionOutput` with `tripwire_triggered: bool` and `output_info: dict`. `InputGuardrailTripwireTriggered` and `OutputGuardrailTripwireTriggered` exceptions.
|
||
|
||
- **Composability**: Multiple guardrails can be attached per agent. They run independently; any tripwire triggers the exception.
|
||
|
||
- **License**: MIT (OpenAI Agents SDK)
|
||
|
||
- **Compatibility with alknet-firewall**: **HIGH**. The `@input_guardrail` decorator pattern is very clean for integration. Our `Firewall.screen()` returns an `Alarm` which maps naturally to `GuardrailFunctionOutput(tripwire_triggered=alarm.alarm, output_info={...})`. The blocking execution mode (`run_in_parallel=False`) is ideal — it prevents the target LLM from running until the behavioral check completes. This preserves our <10ms latency advantage. Key advantage: this is an **agent framework** pattern, which is exactly where indirect prompt injection is most dangerous (agents processing untrusted content).
|
||
|
||
### 2.5 Amazon Bedrock Guardrails
|
||
|
||
- **Overview**: A managed AWS service for applying content policies, topic denial, PII filtering, and contextual grounding checks to LLM applications. Supports an independent `ApplyGuardrail` API that can evaluate text without invoking a foundation model.
|
||
|
||
- **Integration Pattern**: **Managed API Service with Independent Evaluation**. Bedrock Guardrails can be applied in two ways: (1) inline with model invocation (automatic), or (2) via the independent `ApplyGuardrail` API (decoupled). The independent API is the relevant pattern for alknet-firewall.
|
||
|
||
- **API Surface**:
|
||
```python
|
||
import boto3
|
||
client = boto3.client('bedrock-runtime')
|
||
|
||
# Independent ApplyGuardrail API
|
||
response = client.apply_guardrail(
|
||
guardrailIdentifier='guardrail-id',
|
||
guardrailVersion='DRAFT',
|
||
source='INPUT', # or 'OUTPUT'
|
||
content=[
|
||
{'text': {'text': 'user input to evaluate'}}
|
||
]
|
||
)
|
||
# Returns: action (GUARDRAIL_INTERVENED or NONE),
|
||
# output text, assessments
|
||
```
|
||
|
||
- **Data Flow**: Synchronous HTTP API. No streaming. The API is independent of model invocation.
|
||
|
||
- **Type System**: Text content with source (INPUT/OUTPUT), guardrail configuration via AWS console/API, structured assessment results.
|
||
|
||
- **Composability**: Guardrails are configured as policies (denied topics, content filters, word blocklists, PII, grounding checks). They compose as layered policies within the AWS ecosystem.
|
||
|
||
- **License**: Proprietary AWS service
|
||
|
||
- **Compatibility with alknet-firewall**: **LOW**. Bedrock Guardrails is a closed, managed service with no plugin/extension mechanism. There is no way to add a custom scanner, validator, or detector. alknet-firewall would be a **parallel service** — users would need to call both Bedrock Guardrails and alknet-firewall independently and combine results themselves. The `ApplyGuardrail` API pattern is actually a good model for how alknet-firewall should work (independent, decoupled evaluation), but there's no direct integration point.
|
||
|
||
### 2.6 OpenGuardrails
|
||
|
||
- **Overview**: An open-source AI Security Gateway (formerly from the OpenGuardrails research paper, now at openguardrails.com) that sits between AI applications and model providers. It provides guardrails, multi-tenant configs, and policy-based routing for every LLM call. Evolved from the academic OpenGuardrails paper (arXiv:2510.19169) that proposed a unified, configurable, and scalable guardrail stack.
|
||
|
||
- **Integration Pattern**: **Gateway/Proxy Pattern**. OpenGuardrails operates as an AI Security Gateway — a proxy that intercepts LLM calls, applies guardrails, and forwards them. It handles multi-tenant configuration, policy-based routing, and supports detection, manipulation defense, and privacy protection.
|
||
|
||
- **API Surface**: Gateway proxy that intercepts HTTP calls to LLM providers. Configuration-driven guardrail policies.
|
||
|
||
- **Data Flow**: HTTP proxy model. All LLM calls route through the gateway. Guardrails execute before forwarding.
|
||
|
||
- **Type System**: Policy-based configuration, HTTP request/response interception.
|
||
|
||
- **Composability**: Multi-tenant, multi-policy. Multiple guardrails compose as layered policies.
|
||
|
||
- **License**: Open-source (GitHub repository appears to have moved/been restructured; current website at openguardrails.com)
|
||
|
||
- **Compatibility with alknet-firewall**: **MEDIUM-LOW**. OpenGuardrails is a gateway/proxy that intercepts HTTP calls. alknet-firewall could theoretically be integrated as a guardrail step within the gateway, but the project appears to be in transition (the GitHub repo is not publicly accessible at the time of research, and the website has shifted to promoting "OpenKai" for security teams). This is more of an infrastructure-level integration than an API-level one.
|
||
|
||
---
|
||
|
||
## 3. Comparison Table
|
||
|
||
| Criteria | LlamaFirewall | NeMo Guardrails | Guardrails AI | OpenAI Agents SDK | Bedrock Guardrails | OpenGuardrails |
|
||
|---|---|---|---|---|---|---|
|
||
| **Integration Pattern** | Scanner/Plugin | Config-Driven Rails + Actions | Validator/Guard Pipeline | Agent-Level Callbacks | Managed API Service | Gateway/Proxy |
|
||
| **Extension Mechanism** | `BaseScanner` subclass + `scan()` | Custom Python actions + Colang flows | `Validator` subclass + `validate()` | `@input_guardrail` decorator | None (closed service) | Custom guardrail policies |
|
||
| **API for External Detection** | ✅ Direct (BaseScanner) | ✅ Direct (actions) | ⚠️ Possible but awkward | ✅ Direct (guardrail func) | ❌ None | ⚠️ Gateway-level |
|
||
| **Input Type** | String message (UserMessage, etc.) | Chat messages (OpenAI format) | String value + metadata | String or message list | Text content | HTTP request body |
|
||
| **Output Type** | `ScanResult(decision, reason, score)` | Modified/allowed/rejected message | `ValidationResult` + on_fail actions | `GuardrailFunctionOutput(tripwire_triggered, output_info)` | Assessment + action (INTERVENED/NONE) | Pass/modify/reject |
|
||
| **Async Support** | ❌ No (sync only) | ✅ Yes (async-first) | ✅ Yes (AsyncGuard) | ✅ Yes (native async) | ✅ Yes (HTTP API) | ✅ Yes (HTTP proxy) |
|
||
| **Streaming Support** | ❌ No | ✅ Yes | ✅ Yes (StreamRunner) | ✅ Yes (via Runner) | ❌ No | ⚠️ Unknown |
|
||
| **Batch Support** | ❌ No (single message) | ⚠️ Via conversation traces | ❌ No (per-call) | ❌ No (per-invocation) | ❌ No (per-call) | ⚠️ Unknown |
|
||
| **Composability** | Multi-scanner per role, most-restrictive wins | Multi-rail pipeline, parallel IORails | Validator chain within Guard | Multiple guardrails per agent, any tripwire triggers | Layered policies | Layered policies |
|
||
| **License** | Llama 3.2 Community / MIT | Apache 2.0 | Apache 2.0 | MIT | Proprietary (AWS) | Open-source |
|
||
| **alknet-fit** | **HIGH** | **MEDIUM-HIGH** | **MEDIUM** | **HIGH** | **LOW** | **MEDIUM-LOW** |
|
||
|
||
### Architectural Pattern Comparison
|
||
|
||
| Pattern | Systems Using It | Key Trait | Suitability for alknet-firewall |
|
||
|---|---|---|---|
|
||
| **Scanner/Plugin** | LlamaFirewall | Inherit base class, implement scan method, register in framework | ✅ Ideal — our behavioral detection maps to a scanner |
|
||
| **Config-Driven Rails** | NeMo Guardrails | Define behavior in DSL (Colang) + YAML, call custom Python actions | ⚠️ Workable — behavioral detection would be an opaque action, not expressible in Colang |
|
||
| **Validator Chain** | Guardrails AI | Chain validators around LLM call, each validates content | ⚠️ Awkward — our detection doesn't produce content fixes, just alarms |
|
||
| **Agent Callback** | OpenAI Agents SDK | Decorated async functions attached to agent, tripwire pattern | ✅ Excellent — natural fit for blocking input before target LLM runs |
|
||
| **Managed API** | Bedrock Guardrails | Closed service, no extension, call independently | ❌ Not integrable — parallel service only |
|
||
| **Gateway Proxy** | OpenGuardrails | Intercept HTTP calls to LLM providers | ⚠️ Infrastructure-level — could embed alknet-firewall as a check step |
|
||
|
||
---
|
||
|
||
## 4. Analysis for alknet-firewall
|
||
|
||
### 4.1 What Makes alknet-firewall Different
|
||
|
||
alknet-firewall's behavioral signal detection is **fundamentally different** from every system analyzed above:
|
||
|
||
1. **It inspects model activations, not text**. All other guardrail systems operate on text content — they read input strings and classify/filter them. alknet-firewall runs a small detector model on the input, extracts hidden state activations, and produces an alarm based on multi-dimensional behavioral patterns.
|
||
|
||
2. **It requires its own inference step**. This is the critical architectural difference. A text-surface validator can be a pure function: `text → verdict`. alknet-firewall needs: `text → model forward pass → activation extraction → SVD projection → alarm`. This means it cannot be simply "plugged into" text-processing pipelines without acknowledging the model inference requirement.
|
||
|
||
3. **It produces rich multi-dimensional output**. An `Alarm` contains not just a binary pass/fail, but dimension scores, SVD projections, and confidence metrics. Most guardrail systems expect a simple pass/fail or safe/unsafe label.
|
||
|
||
4. **It's a pre-check, not a post-check**. By design, alknet-firewall screens input **before** it reaches the target LLM. This makes it an input guardrail, not an output guardrail. It's architecturally similar to LlamaFirewall's `Role.USER` scanners or NeMo Guardrails' input rails.
|
||
|
||
5. **It's fast enough to be inline**. With <10ms latency on commodity hardware, it can run synchronously in the request path without requiring async/background processing.
|
||
|
||
### 4.2 Compatible Integration Patterns
|
||
|
||
#### ✅ Directly Compatible: Scanner/Plugin Pattern (LlamaFirewall)
|
||
|
||
LlamaFirewall's `BaseScanner` is the most natural fit:
|
||
|
||
```python
|
||
# Hypothetical LlamaFirewall adapter
|
||
from llamafirewall.scanners.base_scanner import BaseScanner
|
||
from alknet_firewall import Firewall, Alarm
|
||
|
||
class BehavioralScanner(BaseScanner):
|
||
def __init__(self, config):
|
||
super().__init__(config)
|
||
self.firewall = Firewall() # Loads SmolLM2-135M detector
|
||
|
||
def scan(self, input_data: str) -> ScanResult:
|
||
alarm: Alarm = self.firewall.screen(input_data)
|
||
return ScanResult(
|
||
decision=ScanDecision.BLOCK if alarm.alarm else ScanDecision.ALLOW,
|
||
reason='behavioral_signal_detection',
|
||
score=alarm.confidence
|
||
)
|
||
```
|
||
|
||
**Why it works**: The Scanner pattern accepts a string and returns a `ScanResult(decision, reason, score)`. Our `Alarm` maps directly to this. The scanner is registered in LlamaFirewall's configuration and gets called for every user input.
|
||
|
||
**Limitation**: LlamaFirewall is synchronous and doesn't support batch processing. This is fine since our detection is <10ms.
|
||
|
||
#### ✅ Directly Compatible: Agent Callback Pattern (OpenAI Agents SDK)
|
||
|
||
The `@input_guardrail` decorator pattern is clean and ergonomic:
|
||
|
||
```python
|
||
# Hypothetical OpenAI Agents SDK adapter
|
||
from agents import Agent, GuardrailFunctionOutput, input_guardrail
|
||
from alknet_firewall import Firewall
|
||
|
||
firewall = Firewall()
|
||
|
||
@input_guardrail
|
||
async def behavioral_alarm_guardrail(ctx, agent, input):
|
||
text = input if isinstance(input, str) else str(input)
|
||
alarm = firewall.screen(text)
|
||
return GuardrailFunctionOutput(
|
||
output_info={
|
||
"alarm": alarm.alarm,
|
||
"dimensions": alarm.dimension_scores,
|
||
"confidence": alarm.confidence,
|
||
},
|
||
tripwire_triggered=alarm.alarm,
|
||
)
|
||
|
||
agent = Agent(
|
||
name="Safe Agent",
|
||
instructions="...",
|
||
input_guardrails=[behavioral_alarm_guardrail],
|
||
)
|
||
```
|
||
|
||
**Why it works**: The blocking execution mode (`run_in_parallel=False`) prevents the target LLM from running until the behavioral check completes. This is exactly our use case. The `output_info` dict can carry our rich Alarm data.
|
||
|
||
**Limitation**: Tied to the OpenAI Agents SDK ecosystem. Not portable.
|
||
|
||
#### ⚠️ Workable: Custom Action Pattern (NeMo Guardrails)
|
||
|
||
NeMo Guardrails allows custom Python actions within its Colang flow system:
|
||
|
||
```python
|
||
# In actions.py
|
||
from alknet_firewall import Firewall
|
||
firewall = Firewall()
|
||
|
||
async def check_behavioral_alarm(context):
|
||
user_input = context.get("user_input", "")
|
||
alarm = firewall.screen(user_input)
|
||
if alarm.alarm:
|
||
return False # Block the input
|
||
return True # Allow
|
||
```
|
||
|
||
```colang
|
||
# In rails.co
|
||
define flow
|
||
user express input
|
||
execute check_behavioral_alarm
|
||
```
|
||
|
||
**Why it's workable**: Custom actions can call any Python code, including alknet-firewall. The input rail runs before the LLM.
|
||
|
||
**Limitations**: The Colang DSL can't express behavioral detection natively. The action is an opaque call — no visibility into the detection reasoning within the Colang flow. Configuration is split across multiple files (YAML, .co, actions.py). More complex setup than LlamaFirewall or Agents SDK.
|
||
|
||
#### ⚠️ Awkward: Validator Pattern (Guardrails AI)
|
||
|
||
A `BehavioralAlarmValidator` could wrap alknet-firewall:
|
||
|
||
```python
|
||
# Hypothetical Guardrails AI adapter
|
||
from guardrails.validator_base import Validator
|
||
from alknet_firewall import Firewall
|
||
|
||
class BehavioralAlarmValidator(Validator):
|
||
def validate(self, value, metadata):
|
||
alarm = Firewall().screen(value)
|
||
if alarm.alarm:
|
||
return FailResult(
|
||
error_message="Behavioral alarm triggered",
|
||
fix_value="", # Can't fix it, just block
|
||
)
|
||
return PassResult()
|
||
```
|
||
|
||
**Why it's awkward**: The Validator pattern assumes it can fix content (via `on_fail=OnFailAction.FIX` or `FILTER`). Our system can't fix content — it can only pass or alarm. The `on_fail` actions FIX, FILTER, REFRAIN don't map cleanly to "this input exhibits adversarial behavioral patterns." The ValidationResult type doesn't carry multi-dimensional signal data. The Guard pattern wraps the LLM call, which creates an architectural conflict: alknet-firewall should run before the LLM call, not wrap it.
|
||
|
||
### 4.3 Incompatible Patterns
|
||
|
||
#### ❌ Configuration-Only DSL (NeMo Guardrails Colang)
|
||
|
||
Colang flows define conversational patterns in text — "define user express insult", "define bot respond calmly". There's no way to express "run a small model and check activation patterns" in Colang. Our detection must be an opaque Python action.
|
||
|
||
#### ❌ Rule/Regex-Based Composition (LlamaFirewall Regex Scanners, NeMo Topic Rails)
|
||
|
||
Behavioral signal detection cannot be expressed as regex patterns, keyword lists, or topic rules. It requires model inference. Any composition mechanism that only supports text-matching rules is incompatible with our approach.
|
||
|
||
#### ❌ Managed Service APIs (Bedrock Guardrails)
|
||
|
||
Amazon Bedrock Guardrails is a closed service with no extension mechanism. alknet-firewall would need to run as an independent service alongside it, with users responsible for composing results.
|
||
|
||
### 4.4 Key Considerations
|
||
|
||
| Consideration | Impact on Integration Strategy |
|
||
|---|---|
|
||
| **Model inference required** | alknet-firewall needs a model forward pass. This means it can't be a pure text function. Adapter implementations must handle model loading and inference lifecycle. |
|
||
| **<10ms latency** | Fast enough for synchronous, inline pre-checks. No need for async/background processing. This simplifies adapters. |
|
||
| **Rich multi-dimensional output** | Most guardrail systems expect a simple pass/fail. Our dimension scores and SVD projections will be lost or need to be serialized into metadata fields. |
|
||
| **CPU-capable** | Can run without GPU. This makes deployment simpler than systems requiring GPU (like Llama Guard's 8B model). |
|
||
| **Pre-check only** | alknet-firewall is an input guardrail, not an output guardrail. It should only be composed at input screening positions. |
|
||
| **Standalone value** | alknet-firewall provides unique value (behavioral detection) that text-surface systems don't offer. It's complementary, not competing. |
|
||
|
||
---
|
||
|
||
## 5. Recommendation
|
||
|
||
### Phase 1: Standalone API (Ship Fast, Compose Manually)
|
||
|
||
**Approach**: Provide a clean, synchronous Python API and let users compose it with their existing guardrail systems themselves.
|
||
|
||
```python
|
||
# alknet-firewall core API (already designed)
|
||
from alknet_firewall import Firewall
|
||
|
||
firewall = Firewall() # Loads SmolLM2-135M detector model
|
||
alarm = firewall.screen("user input text")
|
||
|
||
if alarm.alarm:
|
||
# User decides what to do — block, log, flag for review
|
||
print(f"Behavioral alarm: {alarm}")
|
||
print(f"Confidence: {alarm.confidence}")
|
||
print(f"Dimension scores: {alarm.dimension_scores}")
|
||
```
|
||
|
||
**Why this first**:
|
||
1. **No premature abstraction**. We don't yet know which guardrail systems our users actually use. Building adapters before understanding demand is wasted effort.
|
||
2. **Maximum flexibility**. Users can call `firewall.screen()` from any Python context — a Flask middleware, a Lambda handler, a Celery task, or inline in their LLM pipeline.
|
||
3. **Simplest mental model**. One function, one type. `screen(text) → Alarm`. Easy to document, easy to test, easy to reason about.
|
||
4. **Validates the core product**. Before investing in adapters, we need validation that the behavioral detection approach works and that users want it.
|
||
|
||
**Deliverables for Phase 1**:
|
||
- `Firewall` class with `screen(text) → Alarm` method
|
||
- `Alarm` dataclass with `alarm: bool`, `confidence: float`, `dimension_scores: dict`, `reason: str`
|
||
- HTTP API endpoint: `POST /v1/screen` with `{"text": "..."}` → `{"alarm": true, "confidence": 0.95, ...}`
|
||
- Docker image for easy deployment
|
||
- Documentation showing manual composition examples with LlamaFirewall, NeMo Guardrails, and OpenAI Agents SDK
|
||
|
||
### Phase 2: Thin Adapters (Highest-Value Integrations)
|
||
|
||
**Approach**: Build adapter packages for the three systems with the highest compatibility and adoption: LlamaFirewall, OpenAI Agents SDK, and NeMo Guardrails.
|
||
|
||
```python
|
||
# alknet-firewall-llamafirewall adapter
|
||
from llamafirewall import LlamaFirewall, Role, ScannerType
|
||
from alknet_firewall.adapters.llamafirewall import BehavioralScanner
|
||
|
||
firewall = LlamaFirewall(scanners={
|
||
Role.USER: [ScannerType.PROMPT_GUARD, BehavioralScanner()],
|
||
Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],
|
||
})
|
||
|
||
# alknet-firewall-agents-sdk adapter
|
||
from agents import Agent, GuardrailFunctionOutput, input_guardrail
|
||
from alknet_firewall.adapters.openai_agents import create_behavioral_guardrail
|
||
|
||
agent = Agent(
|
||
name="Safe Agent",
|
||
instructions="...",
|
||
input_guardrails=[create_behavioral_guardrail(blocking=True)],
|
||
)
|
||
|
||
# alknet-firewall-nemo adapter
|
||
# Custom action in actions.py that calls firewall.screen()
|
||
```
|
||
|
||
**Why these three**:
|
||
1. **LlamaFirewall** — Highest compatibility. Same Scanner pattern, same role-based model, same `ScanResult` output. LlamaFirewall users are already thinking about input safety. Our behavioral scanner adds a fundamentally different detection method.
|
||
2. **OpenAI Agents SDK** — Highest value target. Agent frameworks are where indirect prompt injection is most dangerous (agents process untrusted content). The `@input_guardrail` pattern is a perfect fit. Blocking mode prevents the target LLM from processing flagged input.
|
||
3. **NeMo Guardrails** — Broad enterprise adoption. Apache 2.0, widely deployed in enterprise settings. The custom action pattern is workable even if not as elegant.
|
||
|
||
**Adapter design principles**:
|
||
- **Optional dependency**. Each adapter is a separate `pip install alknet-firewall-llamafirewall` package. Core `alknet-firewall` doesn't depend on any guardrail framework.
|
||
- **Minimal code**. Each adapter is <100 lines. It wraps `Firewall.screen()` and maps `Alarm` to the target system's type.
|
||
- **Lossy but pragmatic**. The adapter maps `Alarm.alarm` → the target system's pass/fail, `Alarm.confidence` → the target system's score, and serializes `dimension_scores` into a metadata/extra field. Rich signal data is preserved where possible but the binary decision is the primary integration point.
|
||
- **Blocking by default**. All adapters default to blocking execution (prevent LLM from processing flagged input). This matches our pre-check design.
|
||
|
||
### Phase 3: Common Interface (Only if Demand Emerges)
|
||
|
||
**Approach**: If users are composing alknet-firewall with multiple guardrail systems and reporting friction, consider defining a common interface abstract.
|
||
|
||
```python
|
||
# Possible Phase 3 interface (NOT recommended yet)
|
||
from abc import ABC, abstractmethod
|
||
from dataclasses import dataclass
|
||
|
||
@dataclass
|
||
class ScreeningResult:
|
||
passed: bool
|
||
confidence: float
|
||
reason: str
|
||
metadata: dict # system-specific data
|
||
|
||
class ScreeningProvider(ABC):
|
||
@abstractmethod
|
||
def screen(self, text: str) -> ScreeningResult: ...
|
||
|
||
class AlknetFirewallProvider(ScreeningProvider):
|
||
def screen(self, text: str) -> ScreeningResult:
|
||
alarm = self.firewall.screen(text)
|
||
return ScreeningResult(
|
||
passed=not alarm.alarm,
|
||
confidence=alarm.confidence,
|
||
reason=alarm.reason,
|
||
metadata={"dimension_scores": alarm.dimension_scores}
|
||
)
|
||
```
|
||
|
||
**Why NOT now**: Premature abstraction. We have one screening provider (alknet-firewall). Defining a common interface requires multiple implementations to validate the abstraction. This should only happen when:
|
||
- We have 3+ guardrail systems integrating via our adapters
|
||
- Users are asking for a unified composition API
|
||
- We have concrete evidence that the interface generalizes correctly
|
||
|
||
### What About Guardrails AI and Others?
|
||
|
||
| System | Phase 2? | Rationale |
|
||
|---|---|---|
|
||
| Guardrails AI | **No** | Validator pattern is awkward for behavioral detection. If demand emerges, a `BehavioralAlarmValidator` adapter could be built, but it's not a priority. |
|
||
| Bedrock Guardrails | **No** | Closed service, no extension mechanism. Users compose manually (call both APIs). |
|
||
| OpenGuardrails | **No** | Project appears to be in transition. Not a stable integration target. |
|
||
| LangChain/LangGraph | **Possible Phase 2.5** | LangGraph agents would benefit from behavioral pre-checks. The integration pattern would be similar to OpenAI Agents SDK — a custom node in the graph that calls `firewall.screen()`. Monitor demand. |
|
||
|
||
---
|
||
|
||
## 6. References
|
||
|
||
### LlamaFirewall
|
||
1. Meta, "LlamaFirewall: An open source guardrail system for building secure AI agents," arXiv:2505.03574, May 2025. https://arxiv.org/abs/2505.03574
|
||
2. LlamaFirewall GitHub Repository: https://github.com/meta-llama/PurpleLlama/tree/main/LlamaFirewall
|
||
3. LlamaFirewall Documentation — Adding a Custom Scanner: https://meta-llama.github.io/PurpleLlama/LlamaFirewall/docs/documentation/advanced-usage/adding-custom-scanner
|
||
4. LlamaFirewall Architecture: https://meta-llama.github.io/PurpleLlama/LlamaFirewall/docs/documentation/llamafirewall-architecture/architecture
|
||
5. LlamaFirewall PyPI: https://pypi.org/project/llamafirewall/
|
||
6. DeepWiki — LlamaFirewall Security Framework: https://deepwiki.com/meta-llama/PurpleLlama/4-llamafirewall-security-framework
|
||
|
||
### NeMo Guardrails
|
||
7. NVIDIA, "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails," EMNLP 2023. https://aclanthology.org/2023.emnlp-demo.40
|
||
8. NeMo Guardrails GitHub Repository: https://github.com/NVIDIA-NeMo/Guardrails
|
||
9. NeMo Guardrails Documentation: https://docs.nvidia.com/nemo/guardrails
|
||
10. NeMo Guardrails LangGraph Integration: https://docs.nvidia.com/nemo/guardrails/latest/integration/langchain/langgraph-integration.html
|
||
11. DeepWiki — NeMo Guardrails System Architecture: https://deepwiki.com/NVIDIA/NeMo-Guardrails/2-system-architecture
|
||
12. DeepWiki — NeMo Guardrails Rails System: https://deepwiki.com/NVIDIA/NeMo-Guardrails/5-rails-system
|
||
|
||
### Guardrails AI
|
||
13. Guardrails AI GitHub Repository: https://github.com/guardrails-ai/guardrails
|
||
14. Guardrails AI Documentation: https://docs.guardrailsai.com/
|
||
15. Guardrails AI Hub: https://guardrailsai.com/hub/
|
||
16. DeepWiki — Guardrails AI Validators and Validation Pipeline: https://deepwiki.com/guardrails-ai/guardrails/2.2-validators-and-validation-pipeline
|
||
17. DeepWiki — Guardrails AI Integration Patterns: https://deepwiki.com/guardrails-ai/guardrails/5-integration-patterns
|
||
|
||
### OpenAI Agents SDK
|
||
18. OpenAI Agents SDK — Guardrails Documentation: https://openai.github.io/openai-agents-python/guardrails/
|
||
19. OpenAI Agents SDK GitHub: https://github.com/openai/openai-agents-python
|
||
20. DeepWiki — OpenAI Agents SDK Input/Output Guardrails: https://deepwiki.com/openai/openai-agents-python/6.2-input-and-output-guardrails
|
||
|
||
### Amazon Bedrock Guardrails
|
||
21. AWS, "Use the ApplyGuardrail API in your application": https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-use-independent-api.html
|
||
|
||
### OpenGuardrails
|
||
22. OpenGuardrails Paper: "A Configurable, Unified, and Scalable Guardrails Stack for LLMs," arXiv:2510.19169, 2025.
|
||
23. OpenGuardrails Website: https://www.openguardrails.com/
|
||
|
||
### General Guardrail Landscape
|
||
24. AI Safety Directory, "LLM Guardrails: The Complete Guide to AI Safety Guardrails (2026)": https://aisecurityandsafety.org/en/guides/llm-guardrails/
|
||
25. DeepInspect, "Open Source LLM Guardrails: The Libraries Available, Where They Sit, and What They Cannot Replace," May 2026: https://www.deepinspect.ai/blog/open-source-llm-guardrails
|
||
|
||
### alknet-firewall Internal References
|
||
26. `docs/research/llm-input-safety-landscape.md` — Existing landscape analysis covering threat model, defense approaches, and the gap that alknet-firewall fills. |