feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/model.md
+++ b/docs/architecture/model.md
@@ -0,0 +1,161 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Model
+
+The model component manages detector model loading, inference, and activation
+extraction. It is the interface between the firewall and the language model
+that provides behavioral signals.
+
+## What It Is
+
+The model component loads a small language model (default: SmolLM2-135M),
+runs inference on untrusted inputs, and extracts hidden state activations at
+configured layers. It is model-agnostic — any transformer model with
+accessible hidden states can serve as a detector.
+
+## Why It Exists
+
+The firewall needs model activations (hidden states) to detect behavioral
+patterns. This component encapsulates the complexity of model loading,
+inference, and activation extraction behind a clean interface that the
+codebook and firewall can consume without knowing model-specific details.
+
+The model-agnostic design (ADR-003) means the firewall is not tied to a
+specific detector model. Switching from SmolLM2-135M to another ~100M model
+requires recomputing the SVD basis and rebuilding the codebook, but no
+changes to the firewall logic.
+
+## Key Concepts
+
+### Activation Extraction
+
+The core operation: running the model on an input and capturing hidden state
+representations at specific layers.
+
+```python
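+# Conceptual: `model`, `input_ids`, and `configured_layers` are assumed to exist.
+import torch
+
+with torch.no_grad():  # inference only, no gradients are needed
+    outputs = model(input_ids, output_hidden_states=True)
+
+# hidden_states[0] is the embedding output; hidden_states[i] is the output
+# of transformer layer i. [:, -1, :] takes the last position of each sequence.
+activations = {
+    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
+    for layer_idx in configured_layers
+}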
+```
+
+Key decisions:
+- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (30-layer model).
+  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
+  Layer 4 provides mid-early context. Layer 8 provides deeper behavioral
+  patterns. Intermediate layers such as 3, 6, and 7 are omitted to reduce
+  dimensionality, because their signals are highly correlated with those of
+  the selected layers.
+- **Which token**: The last token's hidden state carries the model's
+  "conclusion" about the full input sequence (ADR-009). Under causal
+  attention, only the last position attends to every input token, which is
+  why it is the standard choice for autoregressive (LLaMA-family) models.
+- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
+  (576 for SmolLM2-135M).
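+
+The `[:, -1, :]` slice in the conceptual snippet is correct for a single
+unpadded sequence. If inputs are ever batched with right padding, position
+`-1` may be a padding token. The sketch below shows one way to pick the last
+real token per sequence from an attention mask. The function name
+`last_token_states` is hypothetical and right padding is an assumption; it
+is not part of the specified interface.
+
+```python
+import numpy as np
+
+
+def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
+    """Return the hidden state of the last non-padding token per sequence.
+
+    hidden:         (batch, seq_len, hidden_dim)
+    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
+    Assumes right padding and at least one real token in every row.
+    """
+    last_idx = attention_mask.sum(axis=1) - 1              # (batch,)
+    rows = np.arange(hidden.shape[0])
+    return hidden[rows, last_idx]                          # (batch, hidden_dim)
+
+
+# Tiny example: batch of 2, seq_len 3, hidden_dim 2.
+hidden = np.arange(12, dtype=np.float32).reshape(2, 3, 2)
+mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sequence has one pad token
+out = last_token_states(hidden, mask)
+```
+
+For the second row the function returns position 1 rather than the padded
+position 2, which is the behavior last-token extraction needs.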
+
+### Model-Agnostic Interface
+
+The model component exposes a generic interface that works with any
+transformer model:
+
+```python
+class DetectorModel(Protocol):
+    model_id: str
+    hidden_dim: int
+    n_layers: int
+
+    def load(self, device: str = "cpu") -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+```
+
+The `infer` method returns a mapping from layer index to the hidden state
+extracted at that layer, abstracting away whether the backend is PyTorch,
+ONNX Runtime, or a future Rust inference engine.
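+
+Because the interface is a structural `Protocol`, any object with the right
+attributes and methods satisfies it without inheriting from anything. As an
+illustration, a weight-free test double could look like the sketch below.
+`StubDetector`, its seeding scheme, and its attribute values are
+hypothetical; `@runtime_checkable` is added here only so the example can use
+`isinstance`.
+
+```python
+from typing import Protocol, runtime_checkable
+
+import numpy as np
+
+
+@runtime_checkable
+class DetectorModel(Protocol):
+    model_id: str
+    hidden_dim: int
+    n_layers: int
+
+    def load(self, device: str = "cpu") -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+
+
+class StubDetector:
+    """Hypothetical test double: deterministic vectors, no model weights."""
+
+    model_id = "stub/detector"
+    hidden_dim = 8
+    n_layers = 8
+
+    def __init__(self, layers: tuple[int, ...] = (1, 2, 4, 8)) -> None:
+        self._layers = layers
+        self._loaded = False
+
+    def load(self, device: str = "cpu") -> None:
+        self._loaded = True  # nothing to download or allocate
+
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]:
+        if not self._loaded:
+            self.load()
+        # Seed from the input so the same tokens always give the same vectors.
+        rng = np.random.default_rng(sum(input_ids))
+        return {
+            layer: rng.standard_normal(self.hidden_dim).astype(np.float32)
+            for layer in self._layers
+        }
+
+
+stub = StubDetector()
+acts = stub.infer([1, 2, 3])
+```
+
+A stub like this lets codebook and firewall logic be unit-tested without
+torch installed or any model download.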
+
+### Lazy Loading
+
+The model is loaded on first use or on an explicit preload, not at import
+time. This keeps library import fast (on the order of milliseconds) even
+when torch is installed, because torch itself is imported lazily (ADR-006).
+
+```python
+firewall = Firewall()      # Does NOT load model yet
+firewall.preload()         # Explicit: download + load model
+alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
+```
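+
+One way to get this behavior is a small load-once wrapper around the
+expensive loader. The sketch below is illustrative only: `LazyResource` and
+`fake_loader` are hypothetical names, and the lock is an assumption about
+wanting thread-safe first use rather than something this spec requires.
+
+```python
+import threading
+from typing import Callable, Generic, TypeVar
+
+T = TypeVar("T")
+_UNSET = object()  # sentinel, so a loader that returns None still counts as loaded
+
+
+class LazyResource(Generic[T]):
+    """Call `loader` at most once, on first access."""
+
+    def __init__(self, loader: Callable[[], T]) -> None:
+        self._loader = loader
+        self._value: object = _UNSET
+        self._lock = threading.Lock()
+
+    @property
+    def is_loaded(self) -> bool:
+        return self._value is not _UNSET
+
+    def get(self) -> T:
+        if self._value is _UNSET:
+            with self._lock:
+                # Re-check inside the lock: another thread may have loaded it.
+                if self._value is _UNSET:
+                    self._value = self._loader()
+        return self._value  # type: ignore[return-value]
+
+
+calls: list[str] = []
+
+
+def fake_loader() -> dict:
+    calls.append("load")  # stands in for the download + weight loading step
+    return {"weights": "..."}
+
+
+model = LazyResource(fake_loader)
+loaded_before = model.is_loaded   # nothing has been loaded yet
+first = model.get()               # implicit load on first use
+second = model.get()              # reuses the already loaded object
+```
+
+In this shape, `preload()` would simply call `get()` and discard the result.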
+
+### Offline Support
+
+The model component respects the `HF_HUB_OFFLINE` environment variable and
+the `local_files_only` argument. In air-gapped environments, models must be
+downloaded ahead of time. The library provides a CLI command for this:
+
+```bash
+python -m alknet_firewall download
+```
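+
+A minimal sketch of how the offline switch could be resolved before any Hub
+call is shown below. The helper names `offline_mode` and `download_kwargs`
+are hypothetical, and the set of accepted truthy values is an assumption
+modeled on how `huggingface_hub` parses boolean environment variables. The
+keyword names in the returned dict match parameters of
+`huggingface_hub.snapshot_download`, but the function is not called here.
+
+```python
+import os
+from typing import Mapping, Optional
+
+# Assumption: values treated as "true", compared case-insensitively.
+_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}
+
+
+def offline_mode(env: Optional[Mapping[str, str]] = None) -> bool:
+    """Return True when HF_HUB_OFFLINE is set to a truthy value."""
+    env = os.environ if env is None else env
+    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES
+
+
+def download_kwargs(
+    model_id: str,
+    revision: str,
+    env: Optional[Mapping[str, str]] = None,
+) -> dict:
+    """Build keyword arguments for a Hub download that honors offline mode."""
+    return {
+        "repo_id": model_id,
+        "revision": revision,                    # pinned revision, never "main"
+        "local_files_only": offline_mode(env),   # cache only, no network
+    }
+
+
+kwargs = download_kwargs(
+    "HuggingFaceTB/SmolLM2-135M",
+    "<pinned-commit>",                           # placeholder, as in this spec
+    env={"HF_HUB_OFFLINE": "1"},
+)
+```
+
+With `local_files_only=True`, a missing cache entry should surface as an
+error rather than a network request, which is the desired failure mode in
+an air-gapped deployment.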
+
+## Interfaces
+
+### Public API
+
+```python
+from typing import ClassVar, Protocol
+
+import numpy as np
+
+
+class DetectorModel(Protocol):
+    model_id: str
+    hidden_dim: int
+    n_layers: int
+
+    def load(self, device: str = "cpu") -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+
+class HFDetectorModel:
+    """Default implementation using HuggingFace transformers."""
+
+    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit
+
+    def __init__(
+        self,
+        model_id: str = "HuggingFaceTB/SmolLM2-135M",
+        revision: str = DEFAULT_REVISION,
+        device: str = "cpu",
+        cache_dir: str | None = None,
+    ): ...
+
+    def load(self, device: str | None = None) -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+    def is_loaded(self) -> bool: ...
+
+    @property
+    def extraction_layers(self) -> list[int]: ...
+```
+
+### Constraints
+
+1. **safetensors-only** — Model weights are loaded exclusively from
+   safetensors format. Pickle-based `.pt`/`.bin` files are never loaded
+   (ADR-005), because unpickling can execute arbitrary code. This is a
+   security requirement for a security product.
+2. **Model pinning** — Model revision must be pinned for reproducibility.
+   Default revision is a specific commit hash, not `"main"`.
+3. **CPU-first** — Default device is CPU. GPU inference is supported but not
+   required. The <10ms latency target is intended to be achievable on CPU
+   with a model of this size (~135M parameters).
+4. **No training** — The detector model is inference-only. No gradients are
+   computed. No model weights are modified at runtime.
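+
+Constraints 1 and 2 can be enforced with cheap checks before any file is
+opened. The sketch below is one possible shape, not the specified
+implementation: `select_weight_files` and `is_pinned_revision` are
+hypothetical helpers, and treating a pinned revision as a full 40-character
+hexadecimal commit hash is an assumption.
+
+```python
+import re
+from pathlib import PurePosixPath
+
+_COMMIT_RE = re.compile(r"[0-9a-f]{40}")
+
+
+def is_pinned_revision(revision: str) -> bool:
+    """True for a full commit hash; False for branches or tags such as 'main'."""
+    return _COMMIT_RE.fullmatch(revision) is not None
+
+
+def select_weight_files(filenames: list[str]) -> list[str]:
+    """Keep only safetensors weight files; never fall back to pickle formats."""
+    weights = [
+        name for name in filenames
+        if PurePosixPath(name).suffix == ".safetensors"
+    ]
+    if not weights:
+        raise ValueError(
+            "no .safetensors weights found; refusing to load pickle-based files"
+        )
+    return weights
+
+
+repo_files = ["config.json", "model.safetensors", "pytorch_model.bin"]
+selected = select_weight_files(repo_files)
+```
+
+Failing loudly when no safetensors file exists matters here: a silent
+fallback to `.bin` would defeat the purpose of the constraint.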
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [003](decisions/003-small-model-detector.md) | Small model detector | ~125M params, <10ms, CPU-deployable |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
+| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
+| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
+| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)