Files

glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure

Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.

2026-06-13 07:27:40 +00:00

37 KiB

Raw Blame History

Research: Guardrail Integration Patterns for alknet-firewall

Date: June 2026
Scope: How existing guardrail/integration systems accept external defenses, and which patterns are compatible with alknet-firewall's behavioral signal detection approach
Purpose: Inform the integration strategy — adapters, common interface, or standalone API

Executive Summary
Overview of Each System
Comparison Table
Analysis for alknet-firewall
Recommendation
References

1. Executive Summary

After analyzing six major guardrail/integration systems (LlamaFirewall, NeMo Guardrails, Guardrails AI, OpenAI Agents SDK, Amazon Bedrock Guardrails, and OpenGuardrails), the evidence strongly supports a standalone API with thin adapter pattern for alknet-firewall:

Phase 1: Provide a clean, synchronous standalone API (Firewall.screen(text) → Alarm) and let users compose it manually with their existing systems. This is the fastest path to adoption and avoids premature abstraction.
Phase 2: Build thin adapters for the three highest-value integration targets: LlamaFirewall (custom Scanner), NeMo Guardrails (custom input rail via action), and OpenAI Agents SDK (input guardrail). These adapters should be optional packages, not core dependencies.

The key insight is that alknet-firewall's behavioral signal detection is fundamentally different from text-surface defenses. It requires running a model to extract activations — this means it cannot simply be plugged into regex pipelines, text classifiers, or rule-based rails. It needs its own inference step. The systems that are most compatible with this are those that accept arbitrary Python callables as their extension points (LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK). The systems that are least compatible are those that require text-surface validators (Guardrails AI's Validator pattern) or configuration-only DSLs (NeMo Guardrails' Colang flows).

2. Overview of Each System

2.1 LlamaFirewall (Meta)

Overview: An open-source, real-time guardrail framework from Meta that orchestrates multiple security scanners across LLM application workflows. It is part of the PurpleLlama project and is used in production at Meta. It provides a modular scanner architecture with role-based assignment (user, assistant, tool messages).
Integration Pattern: Scanner/Plugin Pattern. LlamaFirewall exposes a BaseScanner abstract class. Custom scanners inherit from BaseScanner and implement a scan() method. Scanners are registered via a ScannerType enum and mapped to message roles in a configuration dictionary. The policy engine orchestrates scanner execution and aggregates results.

API Surface:

# Core interface
class BaseScanner:
    def scan(self, input_data) -> bool: ...

# Result type
@dataclass
class ScanResult:
    decision: ScanDecision  # ALLOW or BLOCK
    reason: str              # scanner identifier
    score: float             # confidence 0.0-1.0

# Main entry point
firewall = LlamaFirewall(scanners={
    Role.USER: [ScannerType.PROMPT_GUARD],
    Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],
})
result = firewall.scan(UserMessage(content="..."))
# → ScanResult(decision=ScanDecision.BLOCK, reason='prompt_guard', score=0.95)

# Conversation replay
result = firewall.scan_replay(trace)

Data Flow: Synchronous, single-message scanning. Also supports scan_replay() for conversation traces. No built-in async or streaming.
Type System: Role-based (UserMessage, AssistantMessage, ToolMessage), ScanDecision enum (ALLOW/BLOCK), numeric score (0.0–1.0), string reason.
Composability: Multiple scanners can be assigned per role; results are aggregated with the most restrictive decision winning. Custom scanners are first-class citizens.
License: Llama 3.2 Community License (for models), MIT (for CodeShield)
Compatibility with alknet-firewall: HIGH. LlamaFirewall's Scanner pattern is a natural fit. A BehavioralScanner subclass of BaseScanner that wraps alknet-firewall's Firewall.screen() would integrate cleanly. The ScanResult(decision, reason, score) maps directly to our Alarm output. Key consideration: LlamaFirewall's scan() receives input_data as a string — alknet-firewall would need to accept that string, run it through our detector model, extract activations, and return a verdict. This is architecturally compatible.

2.2 NeMo Guardrails (NVIDIA)

Overview: An open-source toolkit for adding programmable guardrails to LLM-based conversational applications. Uses a domain-specific language (Colang) to define safety rules, dialog flows, and content policies. Supports five types of rails: input, dialog, retrieval, execution, and output.
Integration Pattern: Configuration-Driven Rails with Custom Actions. NeMo Guardrails uses YAML configuration files and Colang DSL files to define guardrail behavior. External systems are integrated through custom Python actions that are invoked from Colang flows. The system provides an LLMRails class that wraps LLM calls and enforces rails before/after processing.

API Surface:

# Core interface — Python API
from nemoguardrails import LLMRails, RailsConfig
config = RailsConfig.from_path("PATH/TO/CONFIG")
rails = LLMRails(config)
completion = rails.generate(messages=[{"role": "user", "content": "..."}])

# Custom action integration
# In actions.py:
async def check_behavioral_alarm(context):
    # Custom Python callable — can call alknet-firewall here
    result = firewall.screen(context["user_input"])
    if result.alarm:
        raise Exception("Behavioral alarm triggered")

# In rails.co (Colang flow):
define flow
  user express input
  execute check_behavioral_alarm

Data Flow: Supports both sync (generate) and async (generate_async). Streaming is supported. Rails are processed in a pipeline: input rails → dialog rails → LLM call → retrieval rails → output rails.
Type System: Chat Completions API format (OpenAI-compatible messages), Colang event-driven flows, YAML configuration for rail types.
Composability: Multiple rails can be chained. Input/output rails can run in parallel (IORails engine, v0.21+). The system is designed for defense-in-depth composition.
License: Apache 2.0
Compatibility with alknet-firewall: MEDIUM-HIGH. NeMo Guardrails supports arbitrary Python actions, which means alknet-firewall can be called as an input rail action. However, the Colang DSL is designed for text-surface rule matching (pattern matching, LLM-based classification). Our behavioral detection doesn't fit into Colang's natural expression — it would be a "black box" action that returns a pass/fail. The integration point is the input rail (pre-LLM processing), which is architecturally correct for alknet-firewall. The main consideration: NeMo Guardrails wraps the entire LLM interaction, so alknet-firewall would need to be configured as an input check that runs before the target LLM is invoked.

2.3 Guardrails AI

Overview: An open-source Python framework (Apache 2.0) focused on two functions: (1) running Input/Output Guards that detect and mitigate specific types of risks, and (2) generating structured data from LLMs. It provides a Validator ecosystem via Guardrails Hub with community-contributed validators.
Integration Pattern: Validator/Guard Pipeline Pattern. Guardrails AI uses a Guard object that wraps LLM calls and applies a chain of Validator instances. Each Validator is a Python class that inherits from a base Validator class and implements a validate() method. Validators are organized into Input Guards (pre-LLM) and Output Guards (post-LLM). The framework also supports a REST API server mode.

API Surface:

# Core interface — Validator base class
class Validator:
    def validate(self, value, metadata) -> ValidationResult: ...

# Guard composition
guard = Guard().use(
    RegexMatch, regex="...", on_fail=OnFailAction.EXCEPTION
)
result = guard.validate("input text")

# Or for structured output:
guard = Guard.for_pydantic(output_class=Pet, prompt=prompt)
raw_output, validated_output, *rest = guard(llm_api=openai.completions.create, ...)

# Server mode (OpenAI-compatible endpoint)
guardrails start --config=./config.py

Data Flow: Synchronous by default. Supports async (AsyncGuard). Streaming validation is supported (chunk-by-chunk processing). Validation can trigger re-asks (LLM re-generation).
Type System: Pydantic models for structured output, Validator chain with on_fail actions (EXCEPTION, FIX, FILTER, NOOP, REFRAIN, LOG), ValidationResult with pass/fail/fix metadata.
Composability: Validators are chained within a Guard. Multiple Guards can be composed. Guardrails Hub provides a marketplace of reusable validators. Custom validators can be created and published.
License: Apache 2.0
Compatibility with alknet-firewall: MEDIUM. Guardrails AI's Validator pattern expects a validate(value, metadata) → ValidationResult interface. Our Firewall.screen(text) → Alarm maps reasonably well to this. However, there's a conceptual mismatch: Guardrails AI Validators operate on text content (strings, JSON fields) and expect to either pass, fix, or reject the content. Our behavioral detection doesn't modify content — it produces a binary alarm with multi-dimensional signal data. The on_fail actions (FIX, FILTER) don't apply to behavioral detection. We could implement a BehavioralAlarmValidator that returns PASS or EXCEPTION, but the richer Alarm data (dimension scores, SVD projections) would be lost in the simplified ValidationResult. Also, the Guard pattern wraps the entire LLM call, which means alknet-firewall would need to intercept the input before it reaches the target model — but Guardrails AI is designed for the Guard to wrap and manage the LLM call itself, not to run an independent pre-check.

2.4 OpenAI Agents SDK

Overview: The OpenAI Agents SDK (released March 2025) provides a minimalist Python framework for creating multi-agent workflows with built-in guardrail support. It defines three types of guardrails: input, output, and tool guardrails, each with a tripwire mechanism.
Integration Pattern: Agent-Level Guardrail Callbacks. Guardrails are defined as decorated async Python functions (@input_guardrail, @output_guardrail, @tool_input_guardrail, @tool_output_guardrail) attached to Agent objects. Each guardrail function receives input/context and returns a GuardrailFunctionOutput with a tripwire_triggered boolean.

API Surface:

from agents import (
    Agent, GuardrailFunctionOutput, InputGuardrailTripwireTriggered,
    RunContextWrapper, Runner, input_guardrail
)

@input_guardrail
async def behavioral_alarm_guardrail(
    ctx: RunContextWrapper, agent: Agent, input: str | list
) -> GuardrailFunctionOutput:
    alarm = firewall.screen(input)
    return GuardrailFunctionOutput(
        output_info={"dimensions": alarm.dimensions, "score": alarm.score},
        tripwire_triggered=alarm.alarm,
    )

agent = Agent(
    name="Agent",
    instructions="...",
    input_guardrails=[behavioral_alarm_guardrail],
)

result = await Runner.run(agent, "user input")

Data Flow: Supports two execution modes:
- Parallel (default): Guardrail runs concurrently with agent. If tripwire triggers, agent is cancelled.
- Blocking: Guardrail runs first, blocks agent if triggered. This is the correct mode for alknet-firewall since we want to prevent the target LLM from processing flagged input.
Type System: GuardrailFunctionOutput with tripwire_triggered: bool and output_info: dict. InputGuardrailTripwireTriggered and OutputGuardrailTripwireTriggered exceptions.
Composability: Multiple guardrails can be attached per agent. They run independently; any tripwire triggers the exception.
License: MIT (OpenAI Agents SDK)
Compatibility with alknet-firewall: HIGH. The @input_guardrail decorator pattern is very clean for integration. Our Firewall.screen() returns an Alarm which maps naturally to GuardrailFunctionOutput(tripwire_triggered=alarm.alarm, output_info={...}). The blocking execution mode (run_in_parallel=False) is ideal — it prevents the target LLM from running until the behavioral check completes. This preserves our <10ms latency advantage. Key advantage: this is an agent framework pattern, which is exactly where indirect prompt injection is most dangerous (agents processing untrusted content).

2.5 Amazon Bedrock Guardrails

Overview: A managed AWS service for applying content policies, topic denial, PII filtering, and contextual grounding checks to LLM applications. Supports an independent ApplyGuardrail API that can evaluate text without invoking a foundation model.
Integration Pattern: Managed API Service with Independent Evaluation. Bedrock Guardrails can be applied in two ways: (1) inline with model invocation (automatic), or (2) via the independent ApplyGuardrail API (decoupled). The independent API is the relevant pattern for alknet-firewall.

API Surface:

import boto3
client = boto3.client('bedrock-runtime')

# Independent ApplyGuardrail API
response = client.apply_guardrail(
    guardrailIdentifier='guardrail-id',
    guardrailVersion='DRAFT',
    source='INPUT',  # or 'OUTPUT'
    content=[
        {'text': {'text': 'user input to evaluate'}}
    ]
)
# Returns: action (GUARDRAIL_INTERVENED or NONE),
#          output text, assessments

Data Flow: Synchronous HTTP API. No streaming. The API is independent of model invocation.
Type System: Text content with source (INPUT/OUTPUT), guardrail configuration via AWS console/API, structured assessment results.
Composability: Guardrails are configured as policies (denied topics, content filters, word blocklists, PII, grounding checks). They compose as layered policies within the AWS ecosystem.
License: Proprietary AWS service
Compatibility with alknet-firewall: LOW. Bedrock Guardrails is a closed, managed service with no plugin/extension mechanism. There is no way to add a custom scanner, validator, or detector. alknet-firewall would be a parallel service — users would need to call both Bedrock Guardrails and alknet-firewall independently and combine results themselves. The ApplyGuardrail API pattern is actually a good model for how alknet-firewall should work (independent, decoupled evaluation), but there's no direct integration point.

2.6 OpenGuardrails

Overview: An open-source AI Security Gateway (formerly from the OpenGuardrails research paper, now at openguardrails.com) that sits between AI applications and model providers. It provides guardrails, multi-tenant configs, and policy-based routing for every LLM call. Evolved from the academic OpenGuardrails paper (arXiv:2510.19169) that proposed a unified, configurable, and scalable guardrail stack.
Integration Pattern: Gateway/Proxy Pattern. OpenGuardrails operates as an AI Security Gateway — a proxy that intercepts LLM calls, applies guardrails, and forwards them. It handles multi-tenant configuration, policy-based routing, and supports detection, manipulation defense, and privacy protection.
API Surface: Gateway proxy that intercepts HTTP calls to LLM providers. Configuration-driven guardrail policies.
Data Flow: HTTP proxy model. All LLM calls route through the gateway. Guardrails execute before forwarding.
Type System: Policy-based configuration, HTTP request/response interception.
Composability: Multi-tenant, multi-policy. Multiple guardrails compose as layered policies.
License: Open-source (GitHub repository appears to have moved/been restructured; current website at openguardrails.com)
Compatibility with alknet-firewall: MEDIUM-LOW. OpenGuardrails is a gateway/proxy that intercepts HTTP calls. alknet-firewall could theoretically be integrated as a guardrail step within the gateway, but the project appears to be in transition (the GitHub repo is not publicly accessible at the time of research, and the website has shifted to promoting "OpenKai" for security teams). This is more of an infrastructure-level integration than an API-level one.

3. Comparison Table

Criteria	LlamaFirewall	NeMo Guardrails	Guardrails AI	OpenAI Agents SDK	Bedrock Guardrails	OpenGuardrails
Integration Pattern	Scanner/Plugin	Config-Driven Rails + Actions	Validator/Guard Pipeline	Agent-Level Callbacks	Managed API Service	Gateway/Proxy
Extension Mechanism	`BaseScanner` subclass + `scan()`	Custom Python actions + Colang flows	`Validator` subclass + `validate()`	`@input_guardrail` decorator	None (closed service)	Custom guardrail policies
API for External Detection	✅ Direct (BaseScanner)	✅ Direct (actions)	⚠️ Possible but awkward	✅ Direct (guardrail func)	❌ None	⚠️ Gateway-level
Input Type	String message (UserMessage, etc.)	Chat messages (OpenAI format)	String value + metadata	String or message list	Text content	HTTP request body
Output Type	`ScanResult(decision, reason, score)`	Modified/allowed/rejected message	`ValidationResult` + on_fail actions	`GuardrailFunctionOutput(tripwire_triggered, output_info)`	Assessment + action (INTERVENED/NONE)	Pass/modify/reject
Async Support	❌ No (sync only)	✅ Yes (async-first)	✅ Yes (AsyncGuard)	✅ Yes (native async)	✅ Yes (HTTP API)	✅ Yes (HTTP proxy)
Streaming Support	❌ No	✅ Yes	✅ Yes (StreamRunner)	✅ Yes (via Runner)	❌ No	⚠️ Unknown
Batch Support	❌ No (single message)	⚠️ Via conversation traces	❌ No (per-call)	❌ No (per-invocation)	❌ No (per-call)	⚠️ Unknown
Composability	Multi-scanner per role, most-restrictive wins	Multi-rail pipeline, parallel IORails	Validator chain within Guard	Multiple guardrails per agent, any tripwire triggers	Layered policies	Layered policies
License	Llama 3.2 Community / MIT	Apache 2.0	Apache 2.0	MIT	Proprietary (AWS)	Open-source
alknet-fit	HIGH	MEDIUM-HIGH	MEDIUM	HIGH	LOW	MEDIUM-LOW

Architectural Pattern Comparison

Pattern	Systems Using It	Key Trait	Suitability for alknet-firewall
Scanner/Plugin	LlamaFirewall	Inherit base class, implement scan method, register in framework	✅ Ideal — our behavioral detection maps to a scanner
Config-Driven Rails	NeMo Guardrails	Define behavior in DSL (Colang) + YAML, call custom Python actions	⚠️ Workable — behavioral detection would be an opaque action, not expressible in Colang
Validator Chain	Guardrails AI	Chain validators around LLM call, each validates content	⚠️ Awkward — our detection doesn't produce content fixes, just alarms
Agent Callback	OpenAI Agents SDK	Decorated async functions attached to agent, tripwire pattern	✅ Excellent — natural fit for blocking input before target LLM runs
Managed API	Bedrock Guardrails	Closed service, no extension, call independently	❌ Not integrable — parallel service only
Gateway Proxy	OpenGuardrails	Intercept HTTP calls to LLM providers	⚠️ Infrastructure-level — could embed alknet-firewall as a check step

4. Analysis for alknet-firewall

4.1 What Makes alknet-firewall Different

alknet-firewall's behavioral signal detection is fundamentally different from every system analyzed above:

It inspects model activations, not text. All other guardrail systems operate on text content — they read input strings and classify/filter them. alknet-firewall runs a small detector model on the input, extracts hidden state activations, and produces an alarm based on multi-dimensional behavioral patterns.
It requires its own inference step. This is the critical architectural difference. A text-surface validator can be a pure function: text → verdict. alknet-firewall needs: text → model forward pass → activation extraction → SVD projection → alarm. This means it cannot be simply "plugged into" text-processing pipelines without acknowledging the model inference requirement.
It produces rich multi-dimensional output. An Alarm contains not just a binary pass/fail, but dimension scores, SVD projections, and confidence metrics. Most guardrail systems expect a simple pass/fail or safe/unsafe label.
It's a pre-check, not a post-check. By design, alknet-firewall screens input before it reaches the target LLM. This makes it an input guardrail, not an output guardrail. It's architecturally similar to LlamaFirewall's Role.USER scanners or NeMo Guardrails' input rails.
It's fast enough to be inline. With <10ms latency on commodity hardware, it can run synchronously in the request path without requiring async/background processing.

4.2 Compatible Integration Patterns

✅ Directly Compatible: Scanner/Plugin Pattern (LlamaFirewall)

LlamaFirewall's BaseScanner is the most natural fit:

# Hypothetical LlamaFirewall adapter
from llamafirewall.scanners.base_scanner import BaseScanner
from alknet_firewall import Firewall, Alarm

class BehavioralScanner(BaseScanner):
    def __init__(self, config):
        super().__init__(config)
        self.firewall = Firewall()  # Loads SmolLM2-135M detector
    
    def scan(self, input_data: str) -> ScanResult:
        alarm: Alarm = self.firewall.screen(input_data)
        return ScanResult(
            decision=ScanDecision.BLOCK if alarm.alarm else ScanDecision.ALLOW,
            reason='behavioral_signal_detection',
            score=alarm.confidence
        )

Why it works: The Scanner pattern accepts a string and returns a ScanResult(decision, reason, score). Our Alarm maps directly to this. The scanner is registered in LlamaFirewall's configuration and gets called for every user input.

Limitation: LlamaFirewall is synchronous and doesn't support batch processing. This is fine since our detection is <10ms.

✅ Directly Compatible: Agent Callback Pattern (OpenAI Agents SDK)

The @input_guardrail decorator pattern is clean and ergonomic:

# Hypothetical OpenAI Agents SDK adapter
from agents import Agent, GuardrailFunctionOutput, input_guardrail
from alknet_firewall import Firewall

firewall = Firewall()

@input_guardrail
async def behavioral_alarm_guardrail(ctx, agent, input):
    text = input if isinstance(input, str) else str(input)
    alarm = firewall.screen(text)
    return GuardrailFunctionOutput(
        output_info={
            "alarm": alarm.alarm,
            "dimensions": alarm.dimension_scores,
            "confidence": alarm.confidence,
        },
        tripwire_triggered=alarm.alarm,
    )

agent = Agent(
    name="Safe Agent",
    instructions="...",
    input_guardrails=[behavioral_alarm_guardrail],
)

Why it works: The blocking execution mode (run_in_parallel=False) prevents the target LLM from running until the behavioral check completes. This is exactly our use case. The output_info dict can carry our rich Alarm data.

Limitation: Tied to the OpenAI Agents SDK ecosystem. Not portable.

⚠️ Workable: Custom Action Pattern (NeMo Guardrails)

NeMo Guardrails allows custom Python actions within its Colang flow system:

# In actions.py
from alknet_firewall import Firewall
firewall = Firewall()

async def check_behavioral_alarm(context):
    user_input = context.get("user_input", "")
    alarm = firewall.screen(user_input)
    if alarm.alarm:
        return False  # Block the input
    return True  # Allow

# In rails.co
define flow
  user express input
  execute check_behavioral_alarm

Why it's workable: Custom actions can call any Python code, including alknet-firewall. The input rail runs before the LLM.

Limitations: The Colang DSL can't express behavioral detection natively. The action is an opaque call — no visibility into the detection reasoning within the Colang flow. Configuration is split across multiple files (YAML, .co, actions.py). More complex setup than LlamaFirewall or Agents SDK.

⚠️ Awkward: Validator Pattern (Guardrails AI)

A BehavioralAlarmValidator could wrap alknet-firewall:

# Hypothetical Guardrails AI adapter
from guardrails.validator_base import Validator
from alknet_firewall import Firewall

class BehavioralAlarmValidator(Validator):
    def validate(self, value, metadata):
        alarm = Firewall().screen(value)
        if alarm.alarm:
            return FailResult(
                error_message="Behavioral alarm triggered",
                fix_value="",  # Can't fix it, just block
            )
        return PassResult()

Why it's awkward: The Validator pattern assumes it can fix content (via on_fail=OnFailAction.FIX or FILTER). Our system can't fix content — it can only pass or alarm. The on_fail actions FIX, FILTER, REFRAIN don't map cleanly to "this input exhibits adversarial behavioral patterns." The ValidationResult type doesn't carry multi-dimensional signal data. The Guard pattern wraps the LLM call, which creates an architectural conflict: alknet-firewall should run before the LLM call, not wrap it.

4.3 Incompatible Patterns

❌ Configuration-Only DSL (NeMo Guardrails Colang)

Colang flows define conversational patterns in text — "define user express insult", "define bot respond calmly". There's no way to express "run a small model and check activation patterns" in Colang. Our detection must be an opaque Python action.

❌ Rule/Regex-Based Composition (LlamaFirewall Regex Scanners, NeMo Topic Rails)

Behavioral signal detection cannot be expressed as regex patterns, keyword lists, or topic rules. It requires model inference. Any composition mechanism that only supports text-matching rules is incompatible with our approach.

❌ Managed Service APIs (Bedrock Guardrails)

Amazon Bedrock Guardrails is a closed service with no extension mechanism. alknet-firewall would need to run as an independent service alongside it, with users responsible for composing results.

4.4 Key Considerations

Consideration	Impact on Integration Strategy
Model inference required	alknet-firewall needs a model forward pass. This means it can't be a pure text function. Adapter implementations must handle model loading and inference lifecycle.
<10ms latency	Fast enough for synchronous, inline pre-checks. No need for async/background processing. This simplifies adapters.
Rich multi-dimensional output	Most guardrail systems expect a simple pass/fail. Our dimension scores and SVD projections will be lost or need to be serialized into metadata fields.
CPU-capable	Can run without GPU. This makes deployment simpler than systems requiring GPU (like Llama Guard's 8B model).
Pre-check only	alknet-firewall is an input guardrail, not an output guardrail. It should only be composed at input screening positions.
Standalone value	alknet-firewall provides unique value (behavioral detection) that text-surface systems don't offer. It's complementary, not competing.

5. Recommendation

Phase 1: Standalone API (Ship Fast, Compose Manually)

Approach: Provide a clean, synchronous Python API and let users compose it with their existing guardrail systems themselves.

# alknet-firewall core API (already designed)
from alknet_firewall import Firewall

firewall = Firewall()  # Loads SmolLM2-135M detector model
alarm = firewall.screen("user input text")

if alarm.alarm:
    # User decides what to do — block, log, flag for review
    print(f"Behavioral alarm: {alarm}")
    print(f"Confidence: {alarm.confidence}")
    print(f"Dimension scores: {alarm.dimension_scores}")

Why this first:

No premature abstraction. We don't yet know which guardrail systems our users actually use. Building adapters before understanding demand is wasted effort.
Maximum flexibility. Users can call firewall.screen() from any Python context — a Flask middleware, a Lambda handler, a Celery task, or inline in their LLM pipeline.
Simplest mental model. One function, one type. screen(text) → Alarm. Easy to document, easy to test, easy to reason about.
Validates the core product. Before investing in adapters, we need validation that the behavioral detection approach works and that users want it.

Deliverables for Phase 1:

Firewall class with screen(text) → Alarm method
Alarm dataclass with alarm: bool, confidence: float, dimension_scores: dict, reason: str
HTTP API endpoint: POST /v1/screen with {"text": "..."} → {"alarm": true, "confidence": 0.95, ...}
Docker image for easy deployment
Documentation showing manual composition examples with LlamaFirewall, NeMo Guardrails, and OpenAI Agents SDK

Phase 2: Thin Adapters (Highest-Value Integrations)

Approach: Build adapter packages for the three systems with the highest compatibility and adoption: LlamaFirewall, OpenAI Agents SDK, and NeMo Guardrails.

# alknet-firewall-llamafirewall adapter
from llamafirewall import LlamaFirewall, Role, ScannerType
from alknet_firewall.adapters.llamafirewall import BehavioralScanner

firewall = LlamaFirewall(scanners={
    Role.USER: [ScannerType.PROMPT_GUARD, BehavioralScanner()],
    Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],
})

# alknet-firewall-agents-sdk adapter
from agents import Agent, GuardrailFunctionOutput, input_guardrail
from alknet_firewall.adapters.openai_agents import create_behavioral_guardrail

agent = Agent(
    name="Safe Agent",
    instructions="...",
    input_guardrails=[create_behavioral_guardrail(blocking=True)],
)

# alknet-firewall-nemo adapter
# Custom action in actions.py that calls firewall.screen()

Why these three:

LlamaFirewall — Highest compatibility. Same Scanner pattern, same role-based model, same ScanResult output. LlamaFirewall users are already thinking about input safety. Our behavioral scanner adds a fundamentally different detection method.
OpenAI Agents SDK — Highest value target. Agent frameworks are where indirect prompt injection is most dangerous (agents process untrusted content). The @input_guardrail pattern is a perfect fit. Blocking mode prevents the target LLM from processing flagged input.
NeMo Guardrails — Broad enterprise adoption. Apache 2.0, widely deployed in enterprise settings. The custom action pattern is workable even if not as elegant.

Adapter design principles:

Optional dependency. Each adapter is a separate pip install alknet-firewall-llamafirewall package. Core alknet-firewall doesn't depend on any guardrail framework.
Minimal code. Each adapter is <100 lines. It wraps Firewall.screen() and maps Alarm to the target system's type.
Lossy but pragmatic. The adapter maps Alarm.alarm → the target system's pass/fail, Alarm.confidence → the target system's score, and serializes dimension_scores into a metadata/extra field. Rich signal data is preserved where possible but the binary decision is the primary integration point.
Blocking by default. All adapters default to blocking execution (prevent LLM from processing flagged input). This matches our pre-check design.

Phase 3: Common Interface (Only if Demand Emerges)

Approach: If users are composing alknet-firewall with multiple guardrail systems and reporting friction, consider defining a common interface abstract.

# Possible Phase 3 interface (NOT recommended yet)
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class ScreeningResult:
    passed: bool
    confidence: float
    reason: str
    metadata: dict  # system-specific data

class ScreeningProvider(ABC):
    @abstractmethod
    def screen(self, text: str) -> ScreeningResult: ...

class AlknetFirewallProvider(ScreeningProvider):
    def screen(self, text: str) -> ScreeningResult:
        alarm = self.firewall.screen(text)
        return ScreeningResult(
            passed=not alarm.alarm,
            confidence=alarm.confidence,
            reason=alarm.reason,
            metadata={"dimension_scores": alarm.dimension_scores}
        )

Why NOT now: Premature abstraction. We have one screening provider (alknet-firewall). Defining a common interface requires multiple implementations to validate the abstraction. This should only happen when:

We have 3+ guardrail systems integrating via our adapters
Users are asking for a unified composition API
We have concrete evidence that the interface generalizes correctly

What About Guardrails AI and Others?

System	Phase 2?	Rationale
Guardrails AI	No	Validator pattern is awkward for behavioral detection. If demand emerges, a `BehavioralAlarmValidator` adapter could be built, but it's not a priority.
Bedrock Guardrails	No	Closed service, no extension mechanism. Users compose manually (call both APIs).
OpenGuardrails	No	Project appears to be in transition. Not a stable integration target.
LangChain/LangGraph	Possible Phase 2.5	LangGraph agents would benefit from behavioral pre-checks. The integration pattern would be similar to OpenAI Agents SDK — a custom node in the graph that calls `firewall.screen()`. Monitor demand.

6. References

OpenGuardrails

OpenGuardrails Paper: "A Configurable, Unified, and Scalable Guardrails Stack for LLMs," arXiv:2510.19169, 2025.
OpenGuardrails Website: https://www.openguardrails.com/

General Guardrail Landscape

AI Safety Directory, "LLM Guardrails: The Complete Guide to AI Safety Guardrails (2026)": https://aisecurityandsafety.org/en/guides/llm-guardrails/
DeepInspect, "Open Source LLM Guardrails: The Libraries Available, Where They Sit, and What They Cannot Replace," May 2026: https://www.deepinspect.ai/blog/open-source-llm-guardrails

alknet-firewall Internal References

docs/research/llm-input-safety-landscape.md — Existing landscape analysis covering threat model, defense approaches, and the gap that alknet-firewall fills.

37 KiB Raw Blame History Unescape Escape

Research: Guardrail Integration Patterns for alknet-firewall

Table of Contents

1. Executive Summary

2. Overview of Each System

2.1 LlamaFirewall (Meta)

2.2 NeMo Guardrails (NVIDIA)

2.3 Guardrails AI

2.4 OpenAI Agents SDK

2.5 Amazon Bedrock Guardrails

2.6 OpenGuardrails

3. Comparison Table

Architectural Pattern Comparison

4. Analysis for alknet-firewall

4.1 What Makes alknet-firewall Different

4.2 Compatible Integration Patterns

✅ Directly Compatible: Scanner/Plugin Pattern (LlamaFirewall)

✅ Directly Compatible: Agent Callback Pattern (OpenAI Agents SDK)

⚠️ Workable: Custom Action Pattern (NeMo Guardrails)

⚠️ Awkward: Validator Pattern (Guardrails AI)

4.3 Incompatible Patterns

❌ Configuration-Only DSL (NeMo Guardrails Colang)

❌ Rule/Regex-Based Composition (LlamaFirewall Regex Scanners, NeMo Topic Rails)

❌ Managed Service APIs (Bedrock Guardrails)

4.4 Key Considerations

5. Recommendation

Phase 1: Standalone API (Ship Fast, Compose Manually)

Phase 2: Thin Adapters (Highest-Value Integrations)

Phase 3: Common Interface (Only if Demand Emerges)

What About Guardrails AI and Others?

6. References

LlamaFirewall

NeMo Guardrails

Guardrails AI

OpenAI Agents SDK

Amazon Bedrock Guardrails

OpenGuardrails

General Guardrail Landscape

alknet-firewall Internal References

37 KiB

Raw Blame History