


status	last_updated
draft	2026-06-13

Firewall

The core firewall component: the public API for screening untrusted inputs and producing behavioral alarms.

What It Is

The Firewall is the primary entry point for alknet-firewall. It receives untrusted text input, runs it through the detector model, extracts behavioral signals from hidden state activations, and produces a structured alarm indicating whether the input exhibits adversarial behavioral patterns.

Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches adversarial inputs before they reach the target model. Text-surface defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal detection catches what text hides — adversarial inputs produce anomalous activation patterns regardless of their surface form (ADR-002).

Data Flow

1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids

3. Detector Model Inference
   model(input_ids) → hidden_states at key layers

4. Activation Extraction
   Extract hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = svd_basis @ activation_vector

6. Codebook Comparison
   For each SVD dimension:
     - Compute distance from normal behavioral region
     - Apply spline scoring (monotonic distribution)
     - Aggregate multi-dimensional signals

7. Alarm Generation
   Combine per-dimension signals → overall alarm
   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
   Include per-dimension breakdown for interpretability
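Steps 4 and 5 can be sketched with plain NumPy arrays standing in for the detector's hidden states. This is a minimal illustration, not the library's implementation: the function names, the choice to concatenate the selected layers into one feature vector, and the random stand-in basis are all assumptions.

```python
import numpy as np

def extract_last_token(hidden_states, layer_indices):
    """Step 4: take the final-token vector from each configured layer.

    hidden_states: sequence of arrays shaped (batch, seq_len, hidden_dim).
    Returns an array shaped (batch, len(layer_indices) * hidden_dim).
    Concatenating layers is an assumption made for this sketch.
    """
    per_layer = [hidden_states[i][:, -1, :] for i in layer_indices]
    return np.concatenate(per_layer, axis=-1)

def project(svd_basis, activations):
    """Step 5: project activations onto a precomputed basis.

    svd_basis: (n_dims, feature_dim); activations: (batch, feature_dim).
    Row-wise this equals svd_basis @ activation_vector.
    """
    return activations @ svd_basis.T

rng = np.random.default_rng(0)
# Stand-in for model output: 4 layers, batch of 1, 7 tokens, hidden size 16.
hidden_states = [rng.normal(size=(1, 7, 16)) for _ in range(4)]
# Hypothetical basis: 3 orthonormal directions over the 32 concatenated features.
q, _ = np.linalg.qr(rng.normal(size=(32, 3)))
svd_basis = q.T

activations = extract_last_token(hidden_states, layer_indices=[1, 2])
z_coords = project(svd_basis, activations)
print(z_coords.shape)  # (1, 3)
```

In a real pipeline the hidden states would come from the detector model and the basis from the codebook; only the indexing and the matrix product carry over from this sketch.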

Key Concepts

Behavioral Alarm

Not a simple safe/unsafe binary. A behavioral alarm contains:

Level: CLEAR, SUSPICIOUS, or DANGEROUS
Score: Continuous 0.0–1.0 composite score
Signals: Per-dimension behavioral signal strengths
Dimensions: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input that simultaneously shifts the refusal direction while activating role-playing dimensions is more suspicious than one that shifts only one dimension.

Score Composition

The overall Alarm.score (0.0–1.0) is computed from per-dimension signals using a weighted maximum:

score = max(w_d * signal_d for d in dimensions)

Where w_d are dimension weights (default: equal, configurable in Thresholds.per_dimension). Using max rather than mean ensures that a single strongly anomalous dimension can trigger an alarm even if other dimensions are normal. This is critical for catching attacks that exploit specific behavioral patterns (e.g., refusal-suppression) while leaving other dimensions unaffected.

The suspicious and dangerous thresholds are applied to this composite score to determine Alarm.level.
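The composition rule and the threshold step can be illustrated as follows. The `Thresholds` field names `suspicious` and `dangerous`, their default values, the inclusive comparison at each threshold, and the use of plain strings for levels are assumptions for this sketch; only `per_dimension` and the weighted maximum come from the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Thresholds:
    # Hypothetical field names and default values, for illustration only.
    suspicious: float = 0.5
    dangerous: float = 0.8
    per_dimension: dict[int, float] = field(default_factory=dict)

def composite_score(signals: dict[int, float], thresholds: Thresholds) -> float:
    """Weighted maximum over per-dimension signals; missing weights default to 1.0."""
    if not signals:
        return 0.0
    return max(thresholds.per_dimension.get(d, 1.0) * s for d, s in signals.items())

def level_for(score: float, thresholds: Thresholds) -> str:
    """Map a composite score to a level name (inclusive thresholds assumed)."""
    if score >= thresholds.dangerous:
        return "dangerous"
    if score >= thresholds.suspicious:
        return "suspicious"
    return "clear"

# One strongly anomalous dimension, two normal ones.
signals = {0: 0.10, 1: 0.92, 2: 0.05}
t = Thresholds()
score = composite_score(signals, t)
mean = sum(signals.values()) / len(signals)
print(score, level_for(score, t))   # 0.92 dangerous
print(level_for(mean, t))           # clear
```

With these example values the mean would classify the input as clear, while the maximum flags it, which is the behavior the max-based rule is meant to guarantee.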

Alarm Levels

Level	Meaning	Action
`CLEAR`	Input exhibits normal behavioral patterns	Pass to target model
`SUSPICIOUS`	Some anomalous signals detected	Flag for review or apply additional checks
`DANGEROUS`	Strong behavioral anomaly in one or more dimensions	Block input or apply strong mitigations

Latency Budget

The firewall must complete screening in <10ms on commodity hardware (ADR-003). The per-step targets below sum to about 6ms, leaving roughly 4ms of headroom:

Step	Target Latency
Tokenization	~0.5ms
Model inference (135M, CPU)	~5ms
Activation extraction	~0.1ms
SVD projection	~0.1ms
Codebook comparison	~0.3ms
Total	~6ms
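One way to check measurements against these targets is a small timing harness. The sketch below uses only the standard library; the step keys, the helper names, and the choice of the median as the statistic are assumptions, while the numbers are the per-step targets and the ADR-003 limit stated above.

```python
import time

# Per-step targets from the latency table, in milliseconds.
BUDGET_MS = {
    "tokenization": 0.5,
    "inference": 5.0,
    "extraction": 0.1,
    "projection": 0.1,
    "codebook": 0.3,
}
TOTAL_BUDGET_MS = 10.0  # hard limit from ADR-003

def time_step(fn, *args, repeats=100):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

def over_budget(measured_ms):
    """List the steps whose measured latency exceeds its target."""
    return [name for name, ms in measured_ms.items() if ms > BUDGET_MS[name]]

target_total = sum(BUDGET_MS.values())
headroom = TOTAL_BUDGET_MS - target_total
print(f"targets sum to {target_total:.1f} ms, headroom {headroom:.1f} ms")
```

Passing each pipeline stage to `time_step` and the results to `over_budget` would show which stage, if any, is consuming the headroom.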

Interfaces

Public API

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"

@dataclass
class DimensionSignal:
    dimension: int
    deviation: float
    score: float
    direction_label: str | None

@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str          # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float

class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...

screen_batch is Phase 2 (see overview.md scope).
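The `input_hash` field is described as the SHA-256 of the raw input string. A minimal sketch of how it might be computed is shown here; the helper name, the UTF-8 encoding step, and the hex-digest representation are assumptions.

```python
import hashlib

def input_hash(raw: str) -> str:
    """SHA-256 hex digest of the raw input, encoded as UTF-8 (assumed encoding)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

first = input_hash("Please summarize this document")
second = input_hash("Please summarize this document")
print(first == second)  # True: identical inputs share a hash, which enables dedup
```

Hashing the raw string rather than the token ids keeps the value independent of the tokenizer, so log entries stay comparable across detector models.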

Constraints

No network calls during screening: the model is loaded lazily, on the first screen() call or via an explicit preload(), and any download happens only at that point, never at import time. Once the model is loaded, screening is entirely local.
Synchronous API: screen() is a blocking call. Async is Phase 2.
No target model dependency: the firewall has no access to the target LLM's internals. It runs its own detector model.
Reproducible: the same input, model revision, and codebook version produce the same level, score, and signals (only the timestamp is expected to differ). Pin the model revision and codebook version.
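The lazy-loading constraint can be illustrated with a small wrapper that defers the expensive load until first use. The class name, the injected `loader` callable, and the `get()` method are hypothetical; the sketch is not thread-safe and only demonstrates the load-once behavior.

```python
class LazyModel:
    """Defers an expensive load until first use; the loader is injected."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable that returns the model
        self._model = None      # nothing is loaded at construction time

    def preload(self) -> None:
        """Load the model now, if it has not been loaded yet."""
        if self._model is None:
            self._model = self._loader()

    def get(self):
        """Return the model, loading it on first access."""
        self.preload()          # no-op once the model is cached
        return self._model

calls = []

def fake_loader():
    # Stand-in for a download plus weight load.
    calls.append("load")
    return object()

lazy = LazyModel(fake_loader)
first = lazy.get()
second = lazy.get()
print(len(calls))  # 1: the loader ran once even though get() was called twice
```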

Error Handling

Failure Mode	Exception Type	Behavior
Model download fails (network)	`ModelDownloadError`	Raised from `preload()` or first `screen()`. User must retry.
Model not loaded when `screen()` called	`ModelNotLoadedError`	Raised if model loading was previously attempted and failed.
Corrupted codebook	`CodebookCorruptedError`	Raised at `Firewall.__init__` if codebook fails validation.
Codebook-model mismatch	`CodebookMismatchError`	Raised if codebook's `model_id` doesn't match loaded model.
Empty input	`ValueError`	Raised if input is empty string.
Non-UTF8 input	`ValueError`	Raised if input cannot be encoded to UTF-8.
Very long input	—	Truncated to model's max sequence length with a `UserWarning`.
Insufficient memory for model	`MemoryError`	Propagated from PyTorch. User must reduce model size or free memory.

All library-defined exception types subclass AlknetFirewallError (the base library exception); ValueError and MemoryError are the Python built-ins.
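The exception hierarchy and the input checks from the table might look like the following. The exception names come from the table; the helper functions `validate_input` and `truncate`, their signatures, and the message strings are assumptions made for this sketch.

```python
import warnings

class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""

class ModelDownloadError(AlknetFirewallError): ...
class ModelNotLoadedError(AlknetFirewallError): ...
class CodebookCorruptedError(AlknetFirewallError): ...
class CodebookMismatchError(AlknetFirewallError): ...

def validate_input(text: str) -> bytes:
    """Reject empty or non-UTF-8-encodable input with ValueError."""
    if text == "":
        raise ValueError("input is empty")
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # Lone surrogates, for example, cannot be encoded to UTF-8.
        raise ValueError("input cannot be encoded to UTF-8") from exc

def truncate(input_ids: list[int], max_len: int) -> list[int]:
    """Cut token ids to max_len, emitting a UserWarning when truncating."""
    if len(input_ids) > max_len:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_len} tokens",
            UserWarning,
            stacklevel=2,
        )
        return input_ids[:max_len]
    return input_ids
```

A caller can then catch `AlknetFirewallError` to handle every library-specific failure in one place while still treating `ValueError` as a caller-side input problem.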

Design Decisions

ADR	Decision	Summary
002	Behavioral signals	Detect how models react, not what text says
003	Small model detector	<10ms latency, CPU-deployable
004	SVD-based detection	Multi-dimensional, interpretable, efficient
008	Three-level alarm	CLEAR/SUSPICIOUS/DANGEROUS with continuous score

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

OQ-03: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; research complete)
~~OQ-05~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved — ADR-011: standalone API + thin adapters Phase 2)
