---
status: draft
last_updated: 2026-06-13
---

# Firewall

The core firewall component: the public API for screening untrusted inputs and producing behavioral alarms.

## What It Is

The Firewall is the primary entry point for alknet-firewall. It receives untrusted text input, runs it through the detector model, extracts behavioral signals from hidden-state activations, and produces a structured alarm indicating whether the input exhibits adversarial behavioral patterns.

## Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches adversarial inputs *before* they reach the target model. Text-surface defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal detection catches what text hides: adversarial inputs produce anomalous activation patterns regardless of their surface form (ADR-002).

## Data Flow

```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids (shape: seq_len)

3. Detector Model Inference
   model(input_ids, output_hidden_states=True) → hidden_states at key layers

4. Activation Extraction
   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates

6. Copula Decomposition
   Transform z-coordinates through CDF → simplex → barycentric:
   z → (x₀, x₁, x₂) via CDF
     → S = x₀+x₁+x₂ (scale)
     → (u, v) via barycentric (position on simplex)

7. Token-Level Smoothing (optional)
   Apply rolling average to (S, u, v) features across token positions
   window=8: smooths per-token signals, reduces noise from single-token spikes

8. Direction Classification
   For each behavioral direction (refusal, injection, etc.):
   logistic_classifier(S, u, v) → P(active | features) per token position

9. Aggregation
   Per direction: mean P(active), max P(active), fraction above threshold
   Flag if any direction exceeds threshold for sufficient token positions

10. Alarm Generation
    Combine per-direction signals → overall alarm
    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
    Include per-direction breakdown for interpretability
```

Note: In Phase 1, step 4 extracts only the last token. The full pipeline (Phase 2) extracts per-token activations, which enables the token-level smoothing and per-position classification in steps 7–9.

## Key Concepts

### Behavioral Alarm

A behavioral alarm is not a simple safe/unsafe binary. It contains:

- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous 0.0–1.0 composite score
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input that shifts the refusal direction while also activating role-playing dimensions is more suspicious than one that shifts only a single dimension.

### Score Composition

The overall `Alarm.score` (0.0–1.0) is computed from per-direction classification results. For each behavioral direction, the logistic classifier produces P(active | features) for every token position. The alarm score aggregates these across directions:

```
direction_score = max(P(active) across token positions)
score = max(w_d * direction_score_d for d in directions)
```

Here `w_d` are direction weights (default: equal, configurable in `Thresholds.per_dimension`). Using `max` at both levels ensures that:

- A single strongly anomalous direction can trigger an alarm even if other directions are normal
- A strong behavioral signal at even one token position surfaces in the alarm

This is critical for catching attacks that exploit specific behavioral patterns (e.g., refusal-suppression) while leaving other directions unaffected.
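
The aggregation step and the composition rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `aggregate_direction` and `composite_score`, the `DirectionAggregate` type, the 0.5 per-position threshold, and the sample probabilities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DirectionAggregate:
    """Per-direction summary statistics (hypothetical helper type)."""
    direction: str
    mean_score: float
    max_score: float
    fraction_above: float


def aggregate_direction(direction, probs, threshold=0.5):
    """Summarize per-token P(active) values for one behavioral direction."""
    if not probs:
        raise ValueError("probs must contain at least one token position")
    n_above = sum(1 for p in probs if p > threshold)
    return DirectionAggregate(
        direction=direction,
        mean_score=sum(probs) / len(probs),
        max_score=max(probs),
        fraction_above=n_above / len(probs),
    )


def composite_score(per_direction_probs, weights=None, threshold=0.5):
    """Weighted max over directions of the max over token positions."""
    if not per_direction_probs:
        raise ValueError("at least one direction is required")
    weights = weights or {}  # a missing weight defaults to 1.0 (equal weighting)
    aggregates = [
        aggregate_direction(name, probs, threshold)
        for name, probs in per_direction_probs.items()
    ]
    score = max(weights.get(a.direction, 1.0) * a.max_score for a in aggregates)
    return score, aggregates


# Illustrative values only: one direction spikes, the other stays quiet.
sample = {
    "refusal": [0.10, 0.20, 0.90, 0.85],
    "injection": [0.05, 0.10, 0.15, 0.10],
}
score, aggregates = composite_score(sample)
print(score)  # 0.9, driven entirely by the refusal direction
```

Because both reductions are `max`, the quiet `injection` direction does not dilute the result, which is the behavior the two bullets above describe. The mean and fraction-above statistics are kept alongside the score for the per-direction breakdown.
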

The `suspicious` and `dangerous` thresholds are applied to this composite score to determine `Alarm.level`.

### Alarm Levels

| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |

### Latency Budget

The firewall must complete screening in <10ms on commodity hardware (ADR-003). The budget breaks down approximately as follows:

| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| **Total** | **~6ms** |

## Interfaces

### Public API

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# Thresholds and DEFAULT_MODEL_REVISION are defined elsewhere in the library.


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    direction: str          # Behavioral direction name (e.g., "refusal", "injection")
    score: float            # P(active) for this direction
    max_score: float        # Max P(active) across token positions
    mean_score: float       # Mean P(active) across token positions
    n_positions_above: int  # Token positions above threshold
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```

> `screen_batch` is Phase 2 (see overview.md scope).

### Constraints

1. **No network calls during screening**: The model is loaded lazily, either on the first `screen()` call or through an explicit `preload()`. Any download happens at that point, never at import time. Once the model is loaded, screening is entirely local.
2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency**: The firewall has no access to the target LLM's internals. It runs its own detector model.
4. **Reproducible**: Same input + same model + same codebook = same alarm. Pin the model revision and codebook version.

## Error Handling

| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is an empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | (none) | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |

All library-defined exception types (`ModelDownloadError`, `ModelNotLoadedError`, `CodebookCorruptedError`, `CodebookMismatchError`) subclass `AlknetFirewallError`, the base library exception. `ValueError` and `MemoryError` are Python built-ins and sit outside that hierarchy.
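
The exception layout and the input checks in the table above can be sketched as follows. This is an illustrative sketch, not the library's code: the exception class names come from the table, but the helper `validate_and_hash` is hypothetical, and folding the `input_hash` computation (SHA-256 of the raw input) into the same helper is an assumption made to keep the example short.

```python
import hashlib


class AlknetFirewallError(Exception):
    """Base class for library-specific errors."""


class ModelDownloadError(AlknetFirewallError):
    """Model download failed (network)."""


class ModelNotLoadedError(AlknetFirewallError):
    """Model loading was attempted earlier and failed."""


class CodebookCorruptedError(AlknetFirewallError):
    """Codebook failed validation."""


class CodebookMismatchError(AlknetFirewallError):
    """Codebook's model_id does not match the loaded model."""


def validate_and_hash(text: str) -> str:
    """Reject empty or non-UTF-8 input, then return its SHA-256 hex digest.

    Hypothetical helper: the checks mirror the ValueError rows in the table.
    """
    if text == "":
        raise ValueError("input must not be empty")
    try:
        raw = text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # Lone surrogates, for example, cannot be encoded as UTF-8.
        raise ValueError("input cannot be encoded as UTF-8") from exc
    return hashlib.sha256(raw).hexdigest()


# Callers can catch the library base class separately from input errors.
try:
    digest = validate_and_hash("abc")
except AlknetFirewallError:
    digest = None  # model or codebook problem
except ValueError:
    digest = None  # malformed input
```

Catching `AlknetFirewallError` covers every model and codebook failure in one clause, while `ValueError` stays available for input validation problems.
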

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document:

- ~~**OQ-03**~~: ~~Should the firewall support streaming/chunked input screening?~~ (resolved by ADR-012: rolling token windows with `screen_document()` in Phase 2)
- ~~**OQ-05**~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved by ADR-011: standalone API + thin adapters in Phase 2)