alknet-firewall/docs/architecture/firewall.md

---
status: draft
last_updated: 2026-06-13
---

# Firewall

The core firewall component: the public API for screening untrusted inputs and
producing behavioral alarms.

## What It Is

The Firewall is the primary entry point for alknet-firewall. It receives
untrusted text input, runs it through the detector model, extracts behavioral
signals from hidden state activations, and produces a structured alarm
indicating whether the input exhibits adversarial behavioral patterns.

## Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches
adversarial inputs *before* they reach the target model. Text-surface
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
detection catches what text hides — adversarial inputs produce anomalous
activation patterns regardless of their surface form (ADR-002).

## Data Flow

```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids

3. Detector Model Inference
   model(input_ids) → hidden_states at key layers

4. Activation Extraction
   Extract hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = svd_basis @ activation_vector

6. Codebook Comparison
   For each SVD dimension:
     - Compute distance from normal behavioral region
     - Apply spline scoring (monotonic distribution)
     - Aggregate multi-dimensional signals

7. Alarm Generation
   Combine per-dimension signals → overall alarm
   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
   Include per-dimension breakdown for interpretability
```
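
Steps 5 and 6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the `project` and `deviation_scores` helpers, the per-dimension `mean`/`std` description of the normal region, and the `saturation` parameter are all hypothetical, and the clamped linear mapping is a simple stand-in for the spline scoring named above.

```python
from math import fsum


def project(svd_basis: list[list[float]], activation: list[float]) -> list[float]:
    """Step 5: z_coords = svd_basis @ activation_vector (one row per SVD dimension)."""
    return [fsum(b * a for b, a in zip(row, activation)) for row in svd_basis]


def deviation_scores(
    z_coords: list[float],
    mean: list[float],
    std: list[float],
    saturation: float = 4.0,
) -> list[float]:
    """Step 6: distance from the normal region, mapped monotonically into [0, 1].

    Assumed stand-in for spline scoring: the absolute z-score is scaled
    linearly and clamped so that `saturation` standard deviations map to 1.0.
    """
    scores = []
    for z, mu, sigma in zip(z_coords, mean, std):
        deviation = abs(z - mu) / sigma
        scores.append(min(deviation / saturation, 1.0))
    return scores


# Toy data: a 2-dimensional basis over a 3-dimensional activation vector.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
activation = [2.0, 0.5, 9.0]

z = project(basis, activation)
scores = deviation_scores(z, mean=[0.0, 0.5], std=[1.0, 1.0])
```

In this toy case the first dimension sits two standard deviations from its normal mean and the second sits exactly on it, so only the first contributes a non-zero signal.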

## Key Concepts

### Behavioral Alarm

A behavioral alarm is not a simple safe/unsafe binary. It contains:

- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous 0.0–1.0 composite score
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
that simultaneously shifts the refusal direction while activating role-playing
dimensions is more suspicious than one that shifts only one dimension.

### Score Composition

The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
using a weighted maximum:

```
score = max(w_d * signal_d for d in dimensions)
```

Where `w_d` are dimension weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
single strongly anomalous dimension can trigger an alarm even if other
dimensions are normal. This is critical for catching attacks that exploit
specific behavioral patterns (e.g., refusal-suppression) while leaving other
dimensions unaffected.

The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.
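
The composition rule can be illustrated with a short sketch. The function names, the threshold values (0.5 and 0.8), and the use of `>=` at the boundaries are assumptions made for the example; the document does not specify them.

```python
from typing import Optional


def composite_score(
    signals: dict[int, float],
    weights: Optional[dict[int, float]] = None,
) -> float:
    """score = max(w_d * signal_d for d in dimensions); weights default to 1.0."""
    if not signals:
        raise ValueError("at least one dimension signal is required")
    weights = weights or {}
    return max(weights.get(d, 1.0) * s for d, s in signals.items())


def alarm_level(score: float, suspicious: float = 0.5, dangerous: float = 0.8) -> str:
    """Apply the two thresholds to the composite score (assumed inclusive)."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"


# One strongly anomalous dimension among otherwise normal ones.
signals = {0: 0.1, 1: 0.9, 2: 0.2}

max_based = composite_score(signals)
mean_based = sum(signals.values()) / len(signals)
```

With these toy signals the `max`-based score is 0.9 and maps to `DANGEROUS`, whereas a mean of the same signals (about 0.4) would fall below the assumed `suspicious` threshold and map to `CLEAR`.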

### Alarm Levels

| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more dimensions | Block input or apply strong mitigations |
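
A caller might map each level to an action along the following lines. The action names and the `route` helper are hypothetical; the table leaves the concrete policy to the integrating system.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


# Hypothetical action names, one per row of the table above.
ACTIONS = {
    AlarmLevel.CLEAR: "pass_to_target_model",
    AlarmLevel.SUSPICIOUS: "flag_for_review",
    AlarmLevel.DANGEROUS: "block_input",
}


def route(level: AlarmLevel) -> str:
    """Return the caller-side action for an alarm level."""
    return ACTIONS[level]
```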

### Latency Budget

The firewall must complete screening in <10ms on commodity hardware
(ADR-003). The per-step targets below sum to about 6ms, leaving roughly 4ms of
headroom:

| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Codebook comparison | ~0.3ms |
| **Total** | **~6ms** |
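
The arithmetic behind the table, plus one way to check a step against its target, can be sketched as below. The `BUDGET_MS` dictionary restates the table; `measure_ms` is a hypothetical helper, and taking the median over repeated runs is an assumption about how latency would be measured.

```python
import statistics
import time

# Per-step targets from the table above, in milliseconds.
BUDGET_MS = {
    "tokenization": 0.5,
    "model_inference": 5.0,
    "activation_extraction": 0.1,
    "svd_projection": 0.1,
    "codebook_comparison": 0.3,
}
LIMIT_MS = 10.0  # overall requirement from ADR-003

total_ms = sum(BUDGET_MS.values())
headroom_ms = LIMIT_MS - total_ms


def measure_ms(fn, *args, repeats: int = 100) -> float:
    """Median wall-clock latency of fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

A step could then be checked with something like `measure_ms(step_fn, sample_input) <= BUDGET_MS["tokenization"]`, where `step_fn` and `sample_input` are placeholders for the real step and a representative input.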

## Interfaces

### Public API

```python
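from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# `Thresholds` and `DEFAULT_MODEL_REVISION` are not defined in this document.

class AlarmLevel(Enum):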
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"

@dataclass
class DimensionSignal:
    dimension: int
    deviation: float
    score: float
    direction_label: str | None

@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str          # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float

class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```

> `screen_batch` is Phase 2 (see overview.md scope).
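
The `input_hash` field is described as the SHA-256 of the raw input string. A minimal sketch of that computation follows; the `compute_input_hash` name, the UTF-8 encoding step, and the hex-digest output format are assumptions, since the API stub only names the hash function.

```python
import hashlib


def compute_input_hash(raw_input: str) -> str:
    """SHA-256 of the raw input, assuming UTF-8 encoding and a hex digest."""
    return hashlib.sha256(raw_input.encode("utf-8")).hexdigest()


digest = compute_input_hash("Please summarize this document")
```

Because the digest depends only on the input bytes, identical inputs always produce the same value, which is what makes it usable for logging and deduplication.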

### Constraints

1. **No network calls during screening** — the model is lazily loaded on
   first `screen()` call or via explicit `preload()`, which is the only point
   at which a download can occur. Nothing is downloaded at import time. Once
   the model is loaded, screening is entirely local.
2. **Synchronous API** — `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency** — the firewall has no access to the target
   LLM's internals. It runs its own detector model.
4. **Reproducible** — Same input + same model + same codebook = same alarm.
   Only `Alarm.timestamp` varies between runs. Pin the model revision and
   codebook version to keep results comparable.
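
The lazy-loading behavior in constraint 1 can be sketched with a small wrapper. `LazyModel` and `fake_loader` are hypothetical names used only for illustration, and the lock is an assumption about thread safety that the document does not state.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Loads the model at most once, on preload() or on first use."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def preload(self) -> None:
        # The lock keeps concurrent first calls from loading twice.
        with self._lock:
            if self._model is None:
                self._model = self._loader()

    def get(self) -> Any:
        if self._model is None:
            self.preload()
        return self._model


calls = []


def fake_loader() -> object:
    """Stand-in for the real download-and-load step."""
    calls.append("load")
    return object()


lazy = LazyModel(fake_loader)  # constructing does not load anything
loads_before_use = len(calls)
first = lazy.get()             # first use triggers the single load
second = lazy.get()            # later calls reuse the loaded model
```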

## Error Handling

| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF-8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | — | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |

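The library-defined exception types (`ModelDownloadError`,
`ModelNotLoadedError`, `CodebookCorruptedError`, `CodebookMismatchError`)
subclass `AlknetFirewallError` (base library exception). `ValueError` and
`MemoryError` are Python built-ins.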
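
The hierarchy and the input checks from the table might look like the sketch below. The exception names come from the table; the `validate_input` and `truncate` helpers, their signatures, and the error messages are hypothetical.

```python
import warnings


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError):
    pass


class ModelNotLoadedError(AlknetFirewallError):
    pass


class CodebookCorruptedError(AlknetFirewallError):
    pass


class CodebookMismatchError(AlknetFirewallError):
    pass


def validate_input(text: str) -> None:
    """Raise ValueError for empty input or input that cannot be encoded to UTF-8."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # Lone surrogates such as "\ud800" are valid in str but not in UTF-8.
        raise ValueError("input cannot be encoded to UTF-8") from exc


def truncate(input_ids: list[int], max_length: int) -> list[int]:
    """Truncate to the model's maximum sequence length, warning when it happens."""
    if len(input_ids) > max_length:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_length} tokens",
            UserWarning,
        )
        return input_ids[:max_length]
    return input_ids
```

Callers that want to handle every library-specific failure in one place can catch `AlknetFirewallError`, while the built-in `ValueError` still signals a problem with the input itself.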

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open — needs research)