feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -0,0 +1,200 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Firewall
+
+The core firewall component: the public API for screening untrusted inputs and
+producing behavioral alarms.
+
+## What It Is
+
+The Firewall is the primary entry point for alknet-firewall. It receives
+untrusted text input, runs it through the detector model, extracts behavioral
+signals from hidden state activations, and produces a structured alarm
+indicating whether the input exhibits adversarial behavioral patterns.
+
+## Why It Exists
+
+LLM-based systems need a fast, pre-inference screening mechanism that catches
+adversarial inputs *before* they reach the target model. Text-surface
+defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
+detection catches what text hides — adversarial inputs produce anomalous
+activation patterns regardless of their surface form (ADR-002).
+
+## Data Flow
+
+```
+1. Input Arrives
+   "Please summarize this document: [hidden injection payload]"
+
+2. Tokenize
+   tokenizer.encode(input) → input_ids
+
+3. Detector Model Inference
+   model(input_ids) → hidden_states at key layers
+
+4. Activation Extraction
+   Extract hidden states from configured layers (early + mid)
+   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors
+
+5. SVD Projection
+   Project activations onto precomputed SVD basis
+   z_coords = svd_basis @ activation_vector
+
+6. Codebook Comparison
+   For each SVD dimension:
+     - Compute distance from normal behavioral region
+     - Apply spline scoring (monotonic distribution)
+     - Aggregate multi-dimensional signals
+
+7. Alarm Generation
+   Combine per-dimension signals → overall alarm
+   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+   Include per-dimension breakdown for interpretability
+```
+
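Steps 4 and 5 above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `last_token_activation` and `project` are hypothetical helper names, nested lists stand in for tensors, and the basis is assumed to store one SVD direction per row so that `svd_basis @ activation_vector` reduces to one dot product per row.

```python
def last_token_activation(hidden_states, layer_idx, batch_idx=0):
    """Mirror hidden_states[layer_idx][:, -1, :] for a single batch item.

    hidden_states is assumed to be indexed [layer][batch][token][feature].
    """
    return hidden_states[layer_idx][batch_idx][-1]


def project(svd_basis, activation_vector):
    """Compute z_coords = svd_basis @ activation_vector (one direction per row)."""
    if any(len(row) != len(activation_vector) for row in svd_basis):
        raise ValueError("basis rows and activation vector must have equal length")
    return [sum(b * a for b, a in zip(row, activation_vector)) for row in svd_basis]


# Toy data: 2 layers, 1 batch item, 2 tokens, 3 features per token.
hidden_states = [
    [[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]],
    [[[9.0, 9.0, 9.0], [2.0, 3.0, 4.0]]],
]
svd_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # keeps the first two features

activation = last_token_activation(hidden_states, layer_idx=1)
z_coords = project(svd_basis, activation)
```

Only the last token's vector is used per layer, so the projection cost depends on the hidden size and the number of retained directions, not on the input length.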
+## Key Concepts
+
+### Behavioral Alarm
+
+A behavioral alarm is not a simple safe/unsafe binary. It contains:
+
+- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
+- **Score**: Continuous 0.0–1.0 composite score
+- **Signals**: Per-dimension behavioral signal strengths
+- **Dimensions**: Which SVD directions are anomalous and by how much
+
+This multi-signal approach reflects that safety is multi-dimensional in
+activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
+that simultaneously shifts the refusal direction while activating role-playing
+dimensions is more suspicious than one that shifts only one dimension.
+
+### Score Composition
+
+The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
+using a weighted maximum:
+
+```
+score = max(w_d * signal_d for d in dimensions)
+```
+
+Here `w_d` is the weight of dimension `d` (default: equal weights,
+configurable in `Thresholds.per_dimension`) and `signal_d` is that
+dimension's signal strength. Using `max` rather than `mean` ensures that a
+single strongly anomalous dimension can trigger an alarm even if the other
+dimensions are normal. This is critical for catching attacks that exploit
+a specific behavioral pattern (e.g., refusal suppression) while leaving other
+dimensions unaffected.
+
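The weighted maximum can be illustrated with a short sketch. `composite_score` is a hypothetical helper, not the library's API; it assumes that signals and weights both lie in [0, 1], which keeps the result in the documented 0.0 to 1.0 range.

```python
def composite_score(signals, weights=None):
    """Weighted maximum over per-dimension signals, each assumed to lie in [0, 1]."""
    if not signals:
        raise ValueError("at least one dimension signal is required")
    if weights is None:
        weights = [1.0] * len(signals)  # default: equal weights
    if len(weights) != len(signals):
        raise ValueError("weights and signals must have the same length")
    return max(w * s for w, s in zip(weights, signals))


signals = [0.05, 0.10, 0.90, 0.08]  # one strongly anomalous dimension
score_max = composite_score(signals)      # dominated by the 0.90 signal
score_mean = sum(signals) / len(signals)  # the same outlier is diluted by the mean
```

With these toy numbers the maximum stays at the outlier's value while the mean drops below 0.3, which is the dilution the `max` rule avoids.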
+The `suspicious` and `dangerous` thresholds are applied to this composite
+score to determine `Alarm.level`.
+
+### Alarm Levels
+
+| Level | Meaning | Action |
+|-------|---------|--------|
+| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
+| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
+| `DANGEROUS` | Strong behavioral anomaly in one or more dimensions | Block input or apply strong mitigations |
+
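A minimal sketch of how a composite score could map to the three levels. The threshold values 0.5 and 0.8 are placeholders chosen for illustration, and treating a score equal to a threshold as reaching it is an assumption; neither is specified in this document. `level_for` is a hypothetical name.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def level_for(score, suspicious=0.5, dangerous=0.8):
    """Map a composite score to a level; the default thresholds are placeholders."""
    if not 0.0 <= suspicious <= dangerous <= 1.0:
        raise ValueError("expected 0 <= suspicious <= dangerous <= 1")
    if score >= dangerous:  # assumption: thresholds are inclusive
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR
```

Checking the `dangerous` threshold first keeps the mapping monotonic: a higher score never yields a lower level.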
+### Latency Budget
+
+The firewall must complete screening in <10ms on commodity hardware
+(ADR-003). The approximate per-step targets below sum to about 6ms, leaving
+roughly 4ms of headroom:
+
+| Step | Target Latency |
+|------|----------------|
+| Tokenization | ~0.5ms |
+| Model inference (125M, CPU) | ~5ms |
+| Activation extraction | ~0.1ms |
+| SVD projection | ~0.1ms |
+| Codebook comparison | ~0.3ms |
+| **Total** | **~6ms** |
+
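The budget arithmetic, plus a small timing helper for checking a step against its target, can be sketched as follows. The dictionary keys and the `elapsed_ms` helper are illustrative names; the numbers are the targets from the table above.

```python
import math
import time

BUDGET_MS = 10.0  # ADR-003 ceiling

step_targets_ms = {
    "tokenization": 0.5,
    "model_inference": 5.0,
    "activation_extraction": 0.1,
    "svd_projection": 0.1,
    "codebook_comparison": 0.3,
}

total_ms = math.fsum(step_targets_ms.values())  # about 6 ms
headroom_ms = BUDGET_MS - total_ms              # about 4 ms


def elapsed_ms(fn, *args, **kwargs):
    """Run fn once and return the wall-clock time it took, in milliseconds."""
    start = time.perf_counter()
    fn(*args, **kwargs)
    return (time.perf_counter() - start) * 1000.0
```

A single call is a noisy measurement; in practice one would repeat the call and look at the median or a high percentile before comparing against a target.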
+## Interfaces
+
+### Public API
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+from pathlib import Path
+
+# Thresholds and DEFAULT_MODEL_REVISION are defined elsewhere in the library.
+
+class AlarmLevel(Enum):
+    CLEAR = "clear"
+    SUSPICIOUS = "suspicious"
+    DANGEROUS = "dangerous"
+
+@dataclass
+class DimensionSignal:
+    dimension: int
+    deviation: float
+    score: float
+    direction_label: str | None
+
+@dataclass
+class Alarm:
+    level: AlarmLevel
+    score: float
+    signals: list[DimensionSignal]
+    input_hash: str          # SHA-256 of raw input string (for logging/dedup)
+    model_id: str
+    timestamp: float
+
+class Firewall:
+    def __init__(
+        self,
+        model_id: str = "HuggingFaceTB/SmolLM2-135M",
+        model_revision: str = DEFAULT_MODEL_REVISION,
+        codebook_path: Path | None = None,
+        thresholds: Thresholds | None = None,
+        device: str = "cpu",
+        cache_dir: str | None = None,
+    ): ...
+
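+    # Load the model now instead of on the first screen() call.
+    def preload(self) -> None: ...
+
+    # Screen one input string and return its behavioral alarm.
+    def screen(self, input: str) -> Alarm: ...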
+```
+
+> `screen_batch` is Phase 2 (see overview.md scope).
+
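The `input_hash` field is described as the SHA-256 of the raw input string. A minimal sketch using the standard library follows; encoding the string as UTF-8 and storing the digest as lowercase hex are assumptions, and `input_hash` as a function name is illustrative.

```python
import hashlib


def input_hash(text: str) -> str:
    """SHA-256 of the raw input, as a 64-character lowercase hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = input_hash("Please summarize this document")
```

Hashing the raw string, before any truncation or normalization, means two inputs deduplicate only when they are byte-for-byte identical.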
+### Constraints
+
+1. **No network calls once the model is loaded**: the model is loaded lazily
+   on the first `screen()` call or via an explicit `preload()`, and any
+   download happens then, never at import time. After loading, screening is
+   entirely local.
+2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
+3. **No target model dependency**: the firewall has no access to the target
+   LLM's internals. It runs its own detector model.
+4. **Reproducible**: same input + same model + same codebook = same alarm
+   level, score, and signals (only `timestamp` differs). Pin the model
+   revision and codebook version.
+
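The lazy-loading constraint can be sketched with a small wrapper. This is an illustration under stated assumptions, not the library's code: `LazyModel` and its injected `loader` callable are hypothetical, and the choice to fail fast with `ModelNotLoadedError` until an explicit `preload()` succeeds is one possible reading of the error-handling table.

```python
class AlknetFirewallError(Exception):
    """Base library exception."""


class ModelNotLoadedError(AlknetFirewallError):
    """Raised when an earlier load attempt failed and was not retried."""


class LazyModel:
    """Defers an expensive load until first use."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable that downloads and builds the model
        self._model = None
        self._load_failed = False

    def preload(self):
        """Load eagerly so the first screening call does not pay the load cost."""
        if self._model is not None:
            return
        try:
            self._model = self._loader()
        except Exception:
            self._load_failed = True
            raise
        self._load_failed = False  # a successful retry clears the failure flag

    def get(self):
        """Return the model, loading on first use; fail fast after a failed load."""
        if self._load_failed:
            raise ModelNotLoadedError("a previous model load attempt failed")
        self.preload()
        return self._model
```

Nothing is loaded when the class is constructed, so importing the package or creating the object triggers no download.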
+## Error Handling
+
+| Failure Mode | Exception Type | Behavior |
+|-------------|---------------|----------|
+| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
+| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
+| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
+| Codebook-model mismatch | `CodebookMismatchError` | Raised if the codebook's `model_id` doesn't match the loaded model. |
+| Empty input | `ValueError` | Raised if the input is an empty string. |
+| Non-UTF8 input | `ValueError` | Raised if the input cannot be encoded to UTF-8. |
+| Very long input | None (warning only) | Truncated to the model's max sequence length with a `UserWarning`. |
+| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |
+
+All library-defined exception types subclass `AlknetFirewallError` (base library exception); `ValueError` and `MemoryError` are the Python built-ins.
+
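The input-related rows of the table can be sketched as two small helpers. `validate_input` and `truncate` are hypothetical names, and keeping the first `max_length` tokens (rather than the last) is an assumption; the table only says the input is truncated.

```python
import warnings


def validate_input(text):
    """Reject empty input and strings that cannot be encoded to UTF-8."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # UnicodeEncodeError already subclasses ValueError; re-raise with a clearer message.
        raise ValueError("input cannot be encoded to UTF-8") from exc
    return text


def truncate(input_ids, max_length):
    """Cut token ids to max_length, warning the caller when anything is dropped."""
    if len(input_ids) > max_length:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_length} tokens",
            UserWarning,
            stacklevel=2,
        )
        return input_ids[:max_length]
    return input_ids
```

A Python `str` can hold lone surrogates such as `"\ud800"`, which is the typical way a string fails UTF-8 encoding.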
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
+| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
+| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
+| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
+- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)