feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -0,0 +1,71 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# alknet-firewall — Architecture
+
+## Current State
+
+**Phase 0→1 (Exploration → Architecture)** — The project has a working PoC
+demonstrating that behavioral signals from small language models can detect
+adversarial inputs. The core detection logic (~1,745 lines) works reasonably
+well but lacks tests, has excessive codebook size, and needs extraction from
+the research codebase into a properly structured Python package.
+
+This project extracts and productionizes the behavioral signal detection
+approach from the metaspline research project. A ~125M parameter model
+(SmolLM2-135M) processes untrusted inputs and produces hidden state
+activations. SVD-based dimensionality reduction on these activations reveals
+behavioral patterns — normal inputs cluster in expected regions while
+adversarial inputs produce anomalous activation signatures. The system
+raises "behavioral alarms" without needing to know specific attack types.
+
+## Architecture Documents
+
+| Document | Status | Description |
+|----------|--------|-------------|
+| [overview.md](overview.md) | Draft | Vision, scope, package structure, dependencies |
+| [firewall.md](firewall.md) | Draft | Core firewall API, input screening, alarm protocol |
+| [codebook.md](codebook.md) | Draft | SVD basis, detection parameters, codebook compilation |
+| [model.md](model.md) | Draft | Model loading, activation extraction, model-agnostic design |
+| [configuration.md](configuration.md) | Draft | Thresholds, model selection, detection tuning |
+| [open-questions.md](open-questions.md) | Active | Unresolved questions tracker with OQ-IDs |
+
+## ADR Table
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| [001](decisions/001-python-uv.md) | Python with uv | Accepted |
+| [002](decisions/002-behavioral-signals.md) | Behavioral Signal Detection (Not Text Classification) | Accepted |
+| [003](decisions/003-small-model-detector.md) | Small Model (~125M) as Detector | Accepted |
+| [004](decisions/004-svd-based-detection.md) | SVD-Based Anomaly Detection | Accepted |
+| [005](decisions/005-safetensors-only.md) | Safetensors-Only Model Loading | Accepted |
+| [006](decisions/006-optional-pytorch.md) | PyTorch as Optional Dependency | Accepted |
+| [007](decisions/007-runtime-model-download.md) | Runtime Model Download via HuggingFace Hub | Accepted |
+| [008](decisions/008-three-level-alarm.md) | Three-Level Alarm System | Accepted |
+| [009](decisions/009-last-token-extraction.md) | Last-Token Activation Extraction | Accepted |
+| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic Spline Distributions | Accepted |
+
+## Open Questions
+
+See [open-questions.md](open-questions.md) for the full tracker.
+
+| OQ | Question | Priority | Status |
+|----|----------|----------|--------|
+| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
+| OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
+| OQ-03 | Should the firewall support streaming/chunked input screening? | low | open |
+| OQ-04 | Should detection thresholds be per-model or globally configurable? | medium | open |
+| OQ-05 | How should the firewall integrate with existing guardrail systems (LlamaFirewall, NeMo)? | medium | open |
+| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
+| OQ-07 | Is a Rust port feasible given current ML framework maturity? | low | open |
+
+## Document Lifecycle
+
+| Status | Meaning | Transitions |
+|--------|---------|-------------|
+| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
+| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete |
+| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
+| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |
--- a/docs/architecture/codebook.md
+++ b/docs/architecture/codebook.md
@@ -0,0 +1,248 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Codebook
+
+The codebook contains the compiled detection parameters — SVD basis vectors,
+behavioral region boundaries, and scoring distributions — that the firewall
+uses to detect adversarial inputs.
+
+## What It Is
+
+The codebook is the "compiled detector" — the precomputed parameters that
+transform raw model activations into behavioral alarm signals. It is to the
+firewall what a trained model is to a classifier: the result of an offline
+compilation step that produces the runtime detection parameters.
+
+The name "codebook" comes from vector quantization terminology: it defines a
+set of reference points (codewords) in activation space that represent known
+behavioral patterns. New inputs are compared against these reference patterns.
+
+## Why It Exists
+
+Running full SVD decomposition and distribution fitting on every input would be
+prohibitively expensive. The codebook precomputes these offline:
+
+- **SVD basis**: The principal directions in activation space that capture
+  safety-relevant behavioral variance. Computed once from a calibration
+  dataset.
+- **Behavioral regions**: The expected distribution of normal inputs along each
+  SVD dimension. Defined by fitted spline distributions.
+- **Thresholds**: Decision boundaries for alarm levels along each dimension.
+
+At runtime, the firewall only needs to project new activations onto the
+precomputed basis and compare against the precomputed regions. Projection is a
+matrix-vector product costing O(k·d) per layer, where k is the number of
+retained dimensions and d is the hidden dimension; scoring is O(k).
+
+## Key Concepts
+
+### z-Coordinates
+
+The projection of an activation vector onto the SVD basis. Computed as:
+
+```
+z = V^T @ (activation - mean)
+```
+
+Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
+mean activation from the calibration dataset. The centering step is critical
+— without it, projections are offset by the mean and thresholds would be
+incorrect.
+
+z-coordinates are raw (unnormalized) projections. The codebook's spline
+distributions are calibrated for this scale, so threshold values in the
+codebook are specific to the z-coordinate range of the calibration data.
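+
+The projection can be sketched in NumPy as follows. This is a minimal
+illustration for a single layer, assuming the basis is stored with one vector
+per row (shape `(n_dims, hidden_dim)`), so it already plays the role of `V^T`.
+The function name `project` and the toy values are hypothetical.
+
+```python
+import numpy as np
+
+
+def project(activation: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
+    """Return z = V^T @ (activation - mean) for a single layer.
+
+    basis has shape (n_dims, hidden_dim): one basis vector per row.
+    """
+    return basis @ (activation - mean)
+
+
+# Toy data: hidden_dim = 4, two retained dimensions (illustrative values only).
+basis = np.array([[1.0, 0.0, 0.0, 0.0],
+                  [0.0, 1.0, 0.0, 0.0]])
+mean = np.array([0.5, -0.5, 0.0, 0.0])
+activation = np.array([1.5, 0.5, 2.0, -1.0])
+
+z = project(activation, basis, mean)  # one z-coordinate per retained dimension
+```
+
+Projecting the calibration mean itself yields the zero vector, which is why
+skipping the centering step shifts every z-coordinate by a constant offset.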
+
+### SVD Basis
+
+Singular Value Decomposition of the activation space from a calibration dataset
+reveals the principal components (directions) that capture the most variance.
+The top-k components form the basis that the codebook uses for projection.
+
+Key properties:
+- **Interpretable**: Each direction can be inspected for what behavioral
+  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
+- **Efficient**: After decomposition, projection is a matrix multiply
+- **Stable**: SVD basis is deterministic for a given calibration dataset
+- **Model-specific**: The basis is computed for a specific model architecture
+  and weights. Changing the detector model requires recomputing the basis
+
+The SVD basis is computed by the codebook training pipeline
+(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
+
+### Behavioral Regions
+
+For each SVD dimension, the codebook defines the expected distribution of
+normal (non-adversarial) inputs. This is modeled as a monotonic spline
+distribution that captures the shape of the behavioral region along that
+dimension.
+
+Inputs whose projections fall within the normal region score low (CLEAR).
+Inputs whose projections fall near or beyond the region boundary score
+increasingly high (SUSPICIOUS → DANGEROUS).
+
+### Spline Distributions
+
+Monotonic spline distributions model the probability density along each SVD
+dimension (ADR-010). They provide:
+
+- **Smooth scoring**: Continuous score rather than hard threshold
+- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
+  anomalous inputs
+- **Parametric compactness**: A handful of spline knots represent the full
+  distribution shape
+- **Differentiability**: Scores are differentiable for potential future use in
+  adversarial training
+
+The spline distribution approach is adapted from the metaspline PoC
+(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+
+**Formal definition**: The CDF along each dimension is modeled as a monotonic
+cubic spline with 10–20 knots. Knot positions are determined by quantiles of
+the calibration data, which places more knots where data is dense. Beyond the
+extreme knots, the CDF approaches its limits (0 on the left, 1 on the right)
+exponentially, at a rate fitted to the tail data. The scoring function maps a
+z-coordinate to a score in [0, 1] via the CDF's complement:
+`score = 1 - cdf(z)`.
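+
+A rough sketch of the scoring rule, under simplifying assumptions: knots are
+placed at quantiles as described, but the CDF is interpolated linearly with
+`np.interp` rather than with a monotonic cubic spline, and the exponential
+tails are replaced by clamping to 0 and 1. The helper names `fit_cdf_knots`
+and `score` are hypothetical and are not the PoC's `SplineDistribution` API.
+
+```python
+import numpy as np
+
+
+def fit_cdf_knots(samples: np.ndarray, n_knots: int = 10):
+    """Place knots at evenly spaced quantiles of the calibration projections."""
+    probs = np.linspace(0.0, 1.0, n_knots)
+    knots = np.quantile(samples, probs)
+    return knots, probs
+
+
+def score(z: float, knots: np.ndarray, probs: np.ndarray) -> float:
+    """score = 1 - cdf(z), with the CDF interpolated linearly between knots."""
+    cdf = np.interp(z, knots, probs)  # clamps to 0 or 1 outside the knot range
+    return float(1.0 - cdf)
+
+
+rng = np.random.default_rng(0)
+samples = rng.normal(size=5_000)  # synthetic stand-in for one SVD dimension
+knots, probs = fit_cdf_knots(samples)
+```
+
+Because the score is the CDF's complement, it decreases monotonically in `z`:
+values at or below the lowest knot score 1.0 and values at or above the
+highest knot score 0.0.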
+
+**Canonical implementation**: The metaspline PoC files `spline.py`
+(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
+and `space.py` (`unfold`/`fold`) are the reference implementation for the
+codebook compilation pipeline.
+
+### Calibration Dataset
+
+The calibration dataset is the set of normal (non-adversarial) inputs used to
+compute the SVD basis and fit behavioral region distributions. Requirements:
+
+- **Composition**: Diverse normal inputs representative of the deployment
+  domain. No adversarial examples — the basis models *normal* behavior, and
+  anomalies are detected as deviations from it.
+- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
+  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
+  have diminishing returns.
+- **Diversity**: Must cover the range of normal inputs the detector will see
+  in production. A narrow calibration dataset (e.g., only short English
+  queries) will produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: A calibration dataset must be collected for each detector
+  model by running that model on the inputs and extracting activations.
+
+The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
+automates calibration dataset processing.
+
+### Codebook Compilation
+
+The codebook is compiled offline by a training pipeline that:
+
+1. Runs the detector model on a calibration dataset (diverse normal inputs)
+2. Extracts hidden state activations at configured layers
+3. Computes SVD on the activation matrix (`scipy.linalg.svd` for an exact
+   decomposition; not `sklearn.decomposition.TruncatedSVD`, whose default
+   solver is randomized and reproducible only with a fixed `random_state`)
+4. Fits spline distributions along each retained dimension
+5. Computes detection thresholds
+6. Serializes the codebook to a portable format (safetensors + JSON config)
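+
+Step 3 can be sketched for a single layer as shown here. This is an
+illustration under stated assumptions, not the PoC pipeline: the calibration
+matrix is synthetic, and the helper name `compile_basis` is hypothetical. The
+only library call is `scipy.linalg.svd` with `full_matrices=False`.
+
+```python
+import numpy as np
+from scipy.linalg import svd
+
+
+def compile_basis(activations: np.ndarray, n_dims: int):
+    """Return (basis, mean) for one layer.
+
+    activations: (n_samples, hidden_dim) calibration matrix.
+    basis: (n_dims, hidden_dim), rows are the top right-singular vectors.
+    """
+    mean = activations.mean(axis=0)
+    # Thin SVD of the centered matrix; vt has shape (min(n, d), hidden_dim).
+    _, _, vt = svd(activations - mean, full_matrices=False)
+    return vt[:n_dims], mean
+
+
+rng = np.random.default_rng(0)
+calibration = rng.normal(size=(200, 16))  # synthetic stand-in for activations
+basis, mean = compile_basis(calibration, n_dims=4)
+```
+
+The rows of `basis` are orthonormal, so the runtime projection needs no
+further normalization beyond subtracting `mean`.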
+
+This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
+package** as package data (under `src/alknet_firewall/data/codebook/`). This
+keeps the Phase 1 installation simple — no additional download step beyond the
+model. The bundled codebook is specific to the default detector model
+(SmolLM2-135M at the pinned revision). Users who switch to a different
+detector model must provide a matching codebook via `codebook_path`.
+
+## Data Format
+
+The codebook is stored as:
+
+```
+codebook/
+├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
+├── regions.safetensors    # Region boundary parameters
+├── splines.json           # Spline knot positions and coefficients
+└── config.json            # Metadata: model_id, revision, n_dims, thresholds
+```
+
+All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
+
+### Tensor Specifications
+
+**basis.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
+| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |
+
+**regions.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
+| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
+
+**splines.json**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
+| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
+| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
+
+## Interfaces
+
+### Internal API
+
+```python
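+from __future__ import annotations  # allows `-> Codebook` inside the class body
+
+from dataclasses import dataclass
+from pathlib import Path
+
+import numpy as np
+
+# DimensionSignal is defined elsewhere in the package.
+
+@dataclass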
+class CodebookConfig:
+    model_id: str
+    model_revision: str
+    n_dimensions: int
+    layers: list[int]
+    suspicious_threshold: float    # Serialized threshold values
+    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)
+
+class Codebook:
+    def __init__(self, path: Path): ...
+
+    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
+        """Project raw activations onto SVD basis → z-coordinates."""
+        ...
+
+    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
+        """Score z-coordinates against behavioral regions."""
+        ...
+
+    @classmethod
+    def load(cls, path: Path) -> Codebook: ...
+
+    @classmethod
+    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
+```
+
+### Constraints
+
+1. **Immutable at runtime** — The codebook is read-only during screening.
+   Modifying the codebook requires explicit recompilation.
+2. **Model-bound** — A codebook is valid only for the specific model it was
+   compiled for. Loading a codebook with the wrong model produces undefined
+   results.
+3. **Deterministic** — Same codebook + same activations = same scores.
+4. **Portable** — Codebook can be saved to disk and reloaded without
+   recomputation. Can be distributed via HuggingFace Hub.
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
+| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
+| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-02**: What is the minimum viable codebook — can the 1,245-line PoC
+  codebook be compressed? (open)
+- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
--- a/docs/architecture/configuration.md
+++ b/docs/architecture/configuration.md
@@ -0,0 +1,107 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Configuration
+
+Configuration for the firewall: model selection, detection thresholds,
+alarm levels, and operational parameters.
+
+## What It Is
+
+The configuration component defines all tunable parameters for the firewall.
+It controls which model is used, how aggressively inputs are screened, and
+what alarm levels map to what scores.
+
+## Why It Exists
+
+Different deployment contexts need different detection sensitivity. A
+high-security environment (e.g., screening inputs to a system with access to
+sensitive data) may want aggressive thresholds that flag more suspicious
+inputs. A low-risk chatbot may prefer permissive thresholds that minimize
+false positives. The configuration component makes these trade-offs explicit
+and tunable.
+
+## Configuration Structure
+
+### Thresholds
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Thresholds:
+    suspicious: float = 0.3    # Score above which input is SUSPICIOUS
+    dangerous: float = 0.7    # Score above which input is DANGEROUS
+    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
+```
+
+Default thresholds are calibrated against the codebook's behavioral regions.
+Per-dimension overrides allow tuning sensitivity for specific behavioral
+patterns (e.g., lower threshold on the refusal-suppression dimension).
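+
+How the two global thresholds could map a composite score to an alarm level
+is sketched below. The level names come from this specification; the enum
+values, the `classify` helper, and the strict `>` comparison (read from the
+"above which" comments) are assumptions, and per-dimension overrides are
+omitted.
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+
+
+class AlarmLevel(Enum):
+    CLEAR = "clear"            # enum values are illustrative
+    SUSPICIOUS = "suspicious"
+    DANGEROUS = "dangerous"
+
+
+@dataclass(frozen=True)
+class Thresholds:
+    suspicious: float = 0.3
+    dangerous: float = 0.7
+
+
+def classify(score: float, thresholds: Thresholds = Thresholds()) -> AlarmLevel:
+    """Map a composite score in [0, 1] to an alarm level."""
+    if score > thresholds.dangerous:
+        return AlarmLevel.DANGEROUS
+    if score > thresholds.suspicious:
+        return AlarmLevel.SUSPICIOUS
+    return AlarmLevel.CLEAR
+```
+
+A stricter deployment would pass, for example,
+`Thresholds(suspicious=0.2, dangerous=0.5)` to flag more inputs.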
+
+### Model Configuration
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class ModelConfig:
+    model_id: str = "HuggingFaceTB/SmolLM2-135M"
+    revision: str = "<pinned-commit>"   # Specific commit, not "main"
+    device: str = "cpu"
+    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
+    cache_dir: str | None = None
+```
+
+Extraction layers are chosen based on EMNLP 2024 findings that safety signals
+appear in early layers. SmolLM2-135M has 30 transformer layers, so the default
+set (1, 2, 4, 8) samples only the early part of the network.
+
+### Codebook Configuration
+
+```python
+from dataclasses import dataclass
+from pathlib import Path
+
+@dataclass
+class CodebookConfig:
+    source: str = "bundled"         # "bundled" | "hf_hub" | "local"
+    repo_id: str | None = None      # HuggingFace repo if source="hf_hub"
+    revision: str | None = None     # HuggingFace revision
+    path: Path | None = None        # Local path if source="local"
+    n_dimensions: int = 10          # Number of SVD dimensions to retain
+```
+
+### Full Configuration
+
+```python
+@dataclass
+class FirewallConfig:
+    model: ModelConfig = field(default_factory=ModelConfig)
+    codebook: CodebookConfig = field(default_factory=CodebookConfig)
+    thresholds: Thresholds = field(default_factory=Thresholds)
+```
+
+## Defaults
+
+All configuration has sensible defaults. The firewall works out of the box:
+
+```python
+# All defaults
+firewall = Firewall()
+alarm = firewall.screen("Hello, how are you?")
+# alarm.level == AlarmLevel.CLEAR
+```
+
+No configuration file is required. All parameters can be passed via the
+constructor. A future phase may add file-based configuration (TOML or YAML).
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
+| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
+| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
--- a/docs/architecture/decisions/001-python-uv.md
+++ b/docs/architecture/decisions/001-python-uv.md
@@ -0,0 +1,41 @@
+# ADR-001: Python with uv
+
+## Status
+
+Accepted
+
+## Context
+
+The project needs a programming language and build toolchain. The PoC was
+written in Python using PyTorch, sklearn, and transformers. A Rust port using
+burn/cubecl was attempted but failed — the ML framework ecosystem in Rust is
+not yet mature enough for this type of work.
+
+The project needs a fast path to a usable system. The PoC already works in
+Python. Modern Python packaging (uv, pyproject.toml, src layout) provides a
+professional project structure that was not available even a few years ago.
+
+## Decision
+
+Use Python 3.10+ with uv as the package manager and build tool. Use uv_build
+as the build backend. Use src/ layout for the package.
+
+## Consequences
+
+**Positive**:
+- Fast path to working system — PoC code is already Python
+- Rich ML ecosystem (PyTorch, transformers, sklearn, safetensors)
+- uv provides 10-100x faster dependency management than pip
+- Modern packaging standards (pyproject.toml, PEP 735 dependency groups)
+- Easy distribution via PyPI with `pip install alknet-firewall[torch]`
+- Type checking via mypy catches many type errors before runtime
+
+**Negative**:
+- Python is slower than Rust for non-ML code (SVD projection, data wrangling)
+- PyTorch is a large optional dependency (200MB-2.5GB)
+- Rust port remains a future goal (Phase 3, speculative)
+
+## References
+
+- [modern-python-project-setup.md](../research/modern-python-project-setup.md)
+- [python-ml-packaging.md](../research/python-ml-packaging.md)
--- a/docs/architecture/decisions/002-behavioral-signals.md
+++ b/docs/architecture/decisions/002-behavioral-signals.md
@@ -0,0 +1,52 @@
+# ADR-002: Behavioral Signal Detection (Not Text Classification)
+
+## Status
+
+Accepted
+
+## Context
+
+Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
+text-surface approaches — they classify input text as safe or unsafe. This
+fundamentally limits their effectiveness:
+
+- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
+  and pattern matching
+- Novel attack types require retraining classifiers
+- Text that looks natural to a classifier can still be adversarial when
+  processed by a model
+
+Academic research (2024-2025) demonstrates that adversarial inputs produce
+distinctive activation patterns in model internals, regardless of surface form.
+
+## Decision
+
+Build a behavioral signal detection system that monitors how a model processes
+inputs (hidden state activations), not what the inputs say (text surface).
+Adversarial inputs produce anomalous activation patterns that are detectable
+even when the text itself looks innocent.
+
+## Consequences
+
+**Positive**:
+- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
+- Anomalous behavior patterns are attack-type agnostic — novel attacks still
+  produce anomalous patterns
+- Multi-dimensional signals provide interpretable detection (which SVD
+  directions are activated and by how much)
+- Complementary to existing text-surface defenses — can be layered
+
+**Negative**:
+- Requires running a model on every input (adds latency and compute cost)
+- Detection depends on the detector model sharing architectural similarity
+  with likely attack targets
+- False positives possible for unusual but benign inputs (domain-specific
+  language, technical content)
+- No existing production system validates this approach — we are first
+
+## References
+
+- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
+- HiddenDetect (ACL 2025)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- How Alignment and Jailbreak Work (EMNLP 2024)
--- a/docs/architecture/decisions/003-small-model-detector.md
+++ b/docs/architecture/decisions/003-small-model-detector.md
@@ -0,0 +1,56 @@
+# ADR-003: Small Model (~125M) as Detector
+
+## Status
+
+Accepted
+
+## Context
+
+The behavioral signal detection approach requires running a language model on
+every input to extract hidden state activations. The choice of model size
+creates a trade-off:
+
+- **Large model (7B+)**: Better representation quality, more behavioral signal
+  resolution. But requires GPU, adds ~200-500ms latency, costs more per check.
+- **Small model (~125M)**: Sufficient representation quality for early-layer
+  safety signals. Runs on CPU, <10ms latency, negligible cost per check.
+- **Tiny model (<50M)**: Too small for safety-relevant representations to
+  emerge. Lacks the depth where behavioral patterns form.
+
+EMNLP 2024 research confirms that safety signals are detectable in early
+layers, so the model does not need deep processing to produce useful signals.
+A ~125M model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim)
+for safety directions to emerge in early layers.
+
+## Decision
+
+Use a small model (~125M parameters) as the default detector. SmolLM2-135M
+(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
+CPU. Support model-agnostic detection: any compatible model can be used by
+recompiling the codebook.
+
+## Consequences
+
+**Positive**:
+- <10ms latency enables real-time pre-inference screening
+- CPU-deployable — no GPU required for the firewall
+- Can run alongside target model without blocking
+- Fast iteration — training/updating a 125M model takes hours, not days
+- Small enough to embed in API gateways, CDN edges, client applications
+- 269MB model download is feasible via HF Hub with caching
+
+**Negative**:
+- Less representation quality than larger models — may miss subtle signals
+  that a 7B detector would catch
+- Detector model must share some architectural similarity with target models
+  for behavioral signals to transfer
+- SmolLM2-135M is English-focused — multilingual detection requires a
+  multilingual detector model
+- Codebook is model-specific — switching models requires recompilation
+
+## References
+
+- [model.md](../model.md)
+- EMNLP 2024: Safety signals detectable in early layers
+- Subliminal Learning (Nature 2026): Behavioral traits transmit through
+  non-semantic signals
--- a/docs/architecture/decisions/004-svd-based-detection.md
+++ b/docs/architecture/decisions/004-svd-based-detection.md
@@ -0,0 +1,58 @@
+# ADR-004: SVD-Based Anomaly Detection
+
+## Status
+
+Accepted
+
+## Context
+
+After extracting hidden state activations from the detector model, the
+firewall needs a method to distinguish normal behavioral patterns from
+adversarial ones. Options:
+
+- **Single classifier**: Train a binary classifier on activations. Simple but
+  loses the multi-dimensional structure. Black box.
+- **SVD + region comparison**: Decompose activation space into principal
+  directions, model normal behavioral regions along each direction, detect
+  inputs that fall outside normal regions. Interpretable, efficient,
+  multi-dimensional.
+- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
+  detect inputs with high reconstruction error. Complex, not interpretable.
+
+ICML 2025 research shows safety is multi-dimensional in activation space — a
+dominant refusal direction plus secondary dimensions. SVD naturally discovers
+these directions. Region comparison provides interpretable per-dimension
+signals.
+
+## Decision
+
+Use SVD-based anomaly detection: decompose activation space via SVD to
+discover principal behavioral directions, model normal regions along each
+dimension using monotonic spline distributions, and detect inputs whose
+projections fall outside normal regions.
+
+## Consequences
+
+**Positive**:
+- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
+- Efficient: Projection is a single matrix-vector product (O(k·d)) after
+  decomposition, cheap at runtime
+- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
+- Robust: SVD captures structure of entire activation space, not a single
+  boundary
+- Small-model friendly: SVD on 576-dim hidden states is computationally trivial
+- Deterministic: `scipy.linalg.svd` produces an exact, reproducible decomposition
+  (unlike `TruncatedSVD`, whose default solver is randomized)
+
+**Negative**:
+- SVD basis is model-specific — changing detector model requires recomputation
+- Basis quality depends on calibration dataset coverage
+- Linear decomposition may miss non-linear behavioral patterns
+- Requires a codebook compilation pipeline (Phase 2)
+- Full SVD on large calibration datasets may be slow (mitigated by the
+  relatively small hidden dim: 576)
+
+## References
+
+- [codebook.md](../codebook.md)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- HiddenDetect (ACL 2025)
--- a/docs/architecture/decisions/005-safetensors-only.md
+++ b/docs/architecture/decisions/005-safetensors-only.md
@@ -0,0 +1,47 @@
+# ADR-005: Safetensors-Only Model Loading
+
+## Status
+
+Accepted
+
+## Context
+
+Model weight files come in two formats:
+
+- **Pickle-based** (`.pt`, `.bin`, `.pth`): Can execute arbitrary Python code
+  during loading. Known supply chain attack vector.
+- **safetensors**: Simple binary format with JSON header. No code execution.
+  76x faster CPU loading. Zero-copy/lazy loading support.
+
+This is a security product. Loading untrusted pickle files in a security
+product is a contradiction. The LiteLLM supply chain attack (CVE-2026-33634,
+CVSS 9.4) demonstrated that compromised model files can lead to credential
+theft and backdoors.
+
+## Decision
+
+Only load model weights from safetensors format. Never load `.pt`, `.bin`,
+or `.pth` files. Apply this policy to both the detector model and the codebook
+tensors.
+
+## Consequences
+
+**Positive**:
+- Eliminates entire class of supply chain attacks via model files
+- 76x faster model loading on CPU
+- Zero-copy/lazy loading reduces memory usage
+- Cross-framework compatible (PyTorch, TensorFlow, JAX, NumPy)
+- Consistent with HuggingFace's own migration to safetensors-default
+
+**Negative**:
+- Some older models only ship `.bin` weights — must convert before use
+- Safetensors doesn't support saving optimizer state (irrelevant — we only
+  do inference)
+- Explicit `use_safetensors=True` parameter needed in transformers for older
+  versions
+
+## References
+
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 6:
+  safetensors format comparison
+- CVE-2026-33634 — LiteLLM supply chain attack
--- a/docs/architecture/decisions/006-optional-pytorch.md
+++ b/docs/architecture/decisions/006-optional-pytorch.md
@@ -0,0 +1,64 @@
+# ADR-006: PyTorch as Optional Dependency
+
+## Status
+
+Accepted
+
+## Context
+
+PyTorch is the primary inference backend for the detector model. However,
+PyTorch is large:
+
+- `torch` (CPU): ~200MB download, ~700MB installed
+- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
+- `onnxruntime`: ~30-50MB download, ~300MB installed
+
+Making PyTorch a required dependency would force a 200MB-2.5GB download on
+every user, even those who already have PyTorch installed or prefer ONNX
+Runtime. This is the standard problem for ML libraries, and the HuggingFace
+ecosystem has converged on a solution.
+
+## Decision
+
+Make PyTorch an optional dependency via extras (`pip install
+alknet-firewall[torch]`). The base install includes all non-ML dependencies
+(scikit-learn, huggingface-hub, safetensors, tokenizers, numpy). ML inference
+backends are installed separately.
+
+Use lazy imports with clear error messages when PyTorch is not installed:
+
+```python
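+def _require_torch():
+    """Import torch on first use so the base install works without it."""
+    try:
+        import torch
+    except ImportError as exc:
+        raise ImportError(
+            "PyTorch is required for alknet-firewall inference. "
+            "Install with: pip install 'alknet-firewall[torch]' "
+            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
+        ) from exc
+    return torch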
+```
+
+## Consequences
+
+**Positive**:
+- Base install is ~30MB download, ~100MB installed — very lightweight
+- Users with existing PyTorch installations don't re-download
+- ONNX Runtime alternative available for minimal footprint (~100MB total)
+- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
+  hub all use this pattern)
+- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
+  `[[tool.uv.index]]`
+
+**Negative**:
+- More complex dependency specification in pyproject.toml
+- Users must read installation docs to choose the right extra
+- Runtime import errors if users forget to install a backend
+- CPU-only torch requires two-step install or uv configuration (can't be
+  expressed in pip extras alone)
+
+## References
+
+- [modern-python-project-setup.md](../research/modern-python-project-setup.md) —
+  Section 2: PyTorch handling
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 1:
+  PyTorch as dependency
--- a/docs/architecture/decisions/007-runtime-model-download.md
+++ b/docs/architecture/decisions/007-runtime-model-download.md
@@ -0,0 +1,53 @@
+# ADR-007: Runtime Model Download via HuggingFace Hub
+
+## Status
+
+Accepted
+
+## Context
+
+The detector model (SmolLM2-135M) is ~269MB. This is too large to bundle in a
+Python package: PyPI's default limits are 100MB per file and 10GB per project.
+Even if a higher limit were granted, a 269MB wheel download is terrible UX.
+
+Options:
+- **Bundle in package**: Not feasible due to size constraints
+- **Separate package for model**: Possible but awkward, requires users to
+  install two packages
+- **Runtime download via HuggingFace Hub**: Standard approach used by
+  transformers. Provides caching, authentication, offline mode, and
+  checksum verification
+- **Custom download (S3, etc.)**: Works but reinvents the wheel
+
+## Decision
+
+Download the detector model at runtime via HuggingFace Hub (`snapshot_download`
+or `from_pretrained` with automatic caching). Support offline mode via
+`HF_HUB_OFFLINE=1` or `local_files_only=True`. Provide a CLI command for
+pre-downloading models in air-gapped environments.
+
+Pin model revisions to specific commit hashes for reproducibility.
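+
+A minimal sketch of how a pinned, offline-aware download could look, assuming
+the `huggingface_hub` package is installed. `PINNED_REVISION`,
+`download_kwargs`, and `fetch_model` are hypothetical names used only for
+illustration, and the placeholder revision must be replaced with a real
+commit hash.
+
+```python
+import os
+
+MODEL_ID = "HuggingFaceTB/SmolLM2-135M"
+# Hypothetical placeholder: substitute the real pinned commit hash.
+PINNED_REVISION = "<pinned-commit>"
+
+
+def download_kwargs(cache_dir=None, offline=None):
+    """Build keyword arguments for huggingface_hub.snapshot_download."""
+    if offline is None:
+        # Simplified check of the HF_HUB_OFFLINE=1 convention.
+        offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
+    return {
+        "repo_id": MODEL_ID,
+        "revision": PINNED_REVISION,
+        "cache_dir": cache_dir,
+        "local_files_only": offline,
+        # Fetch safetensors weights plus config/tokenizer JSON only.
+        "allow_patterns": ["*.safetensors", "*.json"],
+    }
+
+
+def fetch_model(cache_dir=None, offline=None):
+    """Download the detector model, or reuse the cached copy."""
+    from huggingface_hub import snapshot_download  # imported lazily
+
+    return snapshot_download(**download_kwargs(cache_dir, offline))
+```
+
+Keeping the argument construction separate from the network call makes the
+pinning and offline logic testable without touching the Hub.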
+
+## Consequences
+
+**Positive**:
+- Package stays small (~30MB base install)
+- HuggingFace Hub provides automatic caching, deduplication, and checksum
+  verification
+- Offline mode supported via environment variable
+- Authentication for gated models via `HF_TOKEN`
+- Standard approach — users familiar with transformers will recognize the
+  pattern
+
+**Negative**:
+- First run requires network access and ~269MB download (with progress bar)
+- Model availability depends on HuggingFace Hub uptime
+- Users in restricted networks need to pre-download models
+- Different model versions may produce different detection results — must
+  pin revisions
+
+## References
+
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 2:
+  Model file distribution
+- [model.md](../model.md)
--- a/docs/architecture/decisions/008-three-level-alarm.md
+++ b/docs/architecture/decisions/008-three-level-alarm.md
@@ -0,0 +1,47 @@
+# ADR-008: Three-Level Alarm System
+
+## Status
+
+Accepted
+
+## Context
+
+The firewall needs to communicate detection results to downstream systems. The
+design choice is how many alarm levels and what they mean.
+
+Alternatives:
+- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
+  don't warrant blocking but should be flagged. Binary forces a single
+  threshold that either blocks too much (high false positive) or too little
+  (high false negative).
+- **Numeric-only (0.0–1.0 score)**: Maximum information but requires every
+  consumer to choose their own threshold. No shared vocabulary for what's
+  actionable.
+- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
+  pre-inference screening system. The difference between "low" and "medium"
+  is too subtle for consumers to act on differently.
+- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
+  nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
+  review. Most practical for automated systems.
+
+## Decision
+
+Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
+continuous score (0.0–1.0) for consumers that need fine-grained decisions.
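+
+As an illustration of the pass/flag/block vocabulary, the sketch below shows
+how a consumer might act on an alarm. The `ACTIONS` table, the `decide`
+helper, and the `strict_cutoff` parameter are assumptions made for this
+example and are not part of the specified API.
+
+```python
+from enum import Enum
+
+
+class AlarmLevel(Enum):
+    CLEAR = "clear"
+    SUSPICIOUS = "suspicious"
+    DANGEROUS = "dangerous"
+
+
+# Hypothetical consumer policy: one action per alarm level.
+ACTIONS = {
+    AlarmLevel.CLEAR: "pass",
+    AlarmLevel.SUSPICIOUS: "flag",
+    AlarmLevel.DANGEROUS: "block",
+}
+
+
+def decide(level, score, strict_cutoff=None):
+    """Return the action for an alarm, optionally tightened by the score."""
+    action = ACTIONS[level]
+    # A stricter consumer can escalate a flagged input using the raw score.
+    if strict_cutoff is not None and action == "flag" and score >= strict_cutoff:
+        action = "block"
+    return action
+```
+
+Most consumers would only need the level lookup. The score-based escalation
+shows where the continuous score adds granularity.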
+
+## Consequences
+
+**Positive**:
+- Clear action mapping: pass, flag, block
+- Suspicious level enables defense-in-depth (apply additional checks rather
+  than binary block/allow)
+- Continuous score provides gradient for consumers that need it
+- Simple to document and communicate
+
+**Negative**:
+- Some consumers may need more granularity (but can use the score field)
+- "Suspicious" requires consumers to decide what to do — adds decision burden
+
+## References
+
+- [firewall.md](../firewall.md)
--- a/docs/architecture/decisions/009-last-token-extraction.md
+++ b/docs/architecture/decisions/009-last-token-extraction.md
@@ -0,0 +1,55 @@
+# ADR-009: Last-Token Activation Extraction
+
+## Status
+
+Accepted
+
+## Context
+
+To extract behavioral signals from the detector model, we must choose which
+token's hidden state to use from the sequence of hidden states produced during
+inference. Options:
+
+- **Last token**: The hidden state at the final position, which has attended
+  to the entire sequence. Standard for sequence classification with
+  decoder-only (GPT-style) models, which naturally aggregate context at the
+  last position. BERT-style encoders pool from a leading CLS token instead.
+- **Mean pooling**: Average hidden states across all positions. Smooths out
+  position-specific effects but dilutes signal from safety-relevant tokens.
+- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
+  (LLaMA architecture) does not use a CLS token.
+- **First token**: Has seen only the beginning of the sequence. Misses
+  context from later tokens.
+- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
+  position with extreme activation can dominate.
+
+Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
+models because the last position's hidden state has attended to the full
+sequence via causal attention. For safety detection, this means the last
+token's representation contains the model's "conclusion" about the entire
+input.
+
+## Decision
+
+Extract the last token's hidden state at each configured layer. This is
+standard for LLaMA-family models and provides full-sequence context.
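+
+A small sketch of the selection rule, using plain Python lists as a stand-in
+for tensors. The `last_token_vector` helper and its optional
+`attention_mask` argument are assumptions for illustration. With an unpadded
+single input the rule reduces to taking the final position.
+
+```python
+def last_token_vector(hidden_states, attention_mask=None):
+    """Pick the hidden state of the last real token in one sequence.
+
+    hidden_states: per-position vectors, shape [seq_len][hidden_dim].
+    attention_mask: optional list of 0/1 flags, where 0 marks padding.
+    """
+    if not hidden_states:
+        raise ValueError("sequence is empty")
+    if attention_mask is None:
+        # Unpadded input: the final position is the last token.
+        return hidden_states[-1]
+    real = [i for i, keep in enumerate(attention_mask) if keep]
+    if not real:
+        raise ValueError("attention mask marks every position as padding")
+    return hidden_states[real[-1]]
+```
+
+The mask branch only matters if right-padded batches are ever introduced.
+Single-input screening always takes the first branch.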
+
+## Consequences
+
+**Positive**:
+- Standard approach for autoregressive models — well-validated
+- Full sequence context via causal attention
+- Single vector per layer — simple to project and score
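+- No padding sensitivity for single-input screening (unlike mean pooling,
+  which needs attention masks); batched inference would have to index the
+  last non-padding token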
+
+**Negative**:
+- Position-dependent — the last token's representation is influenced by its
+  position in the sequence, not just its content
+- Very short inputs (1–2 tokens) may not have enough context for meaningful
+  activation patterns
+- May miss patterns in long inputs where the adversarial payload is in the
+  middle rather than the end
+
+## References
+
+- [model.md](../model.md)
+- [codebook.md](../codebook.md)
--- a/docs/architecture/decisions/010-monotonic-spline-distributions.md
+++ b/docs/architecture/decisions/010-monotonic-spline-distributions.md
@@ -0,0 +1,64 @@
+# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling
+
+## Status
+
+Accepted
+
+## Context
+
+After projecting activations onto SVD dimensions, the firewall needs to score
+how "normal" or "anomalous" a projection is relative to the distribution of
+normal inputs. This requires modeling the probability density of normal inputs
+along each dimension.
+
+Alternatives:
+- **Gaussian**: Simple, well-understood. But real behavioral distributions are
+  often skewed, multimodal, or heavy-tailed. Gaussian assumes symmetry.
+- **Kernel Density Estimation (KDE)**: Non-parametric, flexible. But
+  bandwidth selection is tricky, and KDE doesn't provide a parametric form for
+  efficient storage and fast evaluation.
+- **Mixture of Gaussians**: More flexible than single Gaussian. But requires
+  choosing the number of components and risks overfitting.
+- **Empirical CDF**: Non-parametric, no assumptions. But requires storing all
+  calibration data points — not compact.
+- **Monotonic spline distributions**: Parametric CDF modeled as a monotonic
+  spline. Compact (handful of knots), smooth, tail-sensitive, and
+  differentiable. The CDF is naturally monotonic, which enforces a valid
+  probability distribution.
+
+## Decision
+
+Use monotonic spline distributions to model behavioral regions along each SVD
+dimension. The CDF is represented as a monotonic cubic spline with a small
+number of knots (typically 10–20 per dimension). Tail behavior uses
+exponential decay beyond the observed range.
+
+The scoring function computes how far a projection falls in the tail of the
+distribution — projections well within the normal region score low (CLEAR),
+projections near or beyond the tail score increasingly high.
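+
+The scoring idea can be sketched as follows. This is a simplified
+illustration, not the PoC implementation: linear interpolation between knots
+stands in for the monotonic cubic spline, and the two-sided tail score and
+the `tail_scale` parameter are assumptions chosen for the example.
+
+```python
+import bisect
+import math
+
+
+def cdf(x, knots_x, knots_p, tail_scale=1.0):
+    """Evaluate a monotone CDF from increasing knots (x_i, p_i).
+
+    Inside the knot range the CDF is interpolated linearly (a stand-in for
+    the monotonic cubic spline). Outside it decays exponentially.
+    """
+    lo, hi = knots_x[0], knots_x[-1]
+    if x <= lo:
+        return knots_p[0] * math.exp((x - lo) / tail_scale)
+    if x >= hi:
+        return 1.0 - (1.0 - knots_p[-1]) * math.exp(-(x - hi) / tail_scale)
+    i = bisect.bisect_right(knots_x, x)  # knots_x[i-1] <= x < knots_x[i]
+    x0, x1 = knots_x[i - 1], knots_x[i]
+    p0, p1 = knots_p[i - 1], knots_p[i]
+    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
+
+
+def tail_score(x, knots_x, knots_p, tail_scale=1.0):
+    """Score in [0, 1]: 0 at the median, approaching 1 deep in either tail."""
+    p = cdf(x, knots_x, knots_p, tail_scale)
+    return 1.0 - 2.0 * min(p, 1.0 - p)
+```
+
+Because the CDF is continuous at the outermost knots, the score has no jump
+where the interpolated region hands over to the exponential tail.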
+
+## Consequences
+
+**Positive**:
+- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
+  cliff-edge behavior
+- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
+  inputs without flagging the bulk of normal inputs
+- **Parametric compactness**: A handful of spline knots (10–20) represent the
+  full distribution shape. Very small storage footprint.
+- **Differentiability**: Scores are differentiable — potential for future
+  adversarial training or gradient-based analysis
+- **No distributional assumptions**: Unlike Gaussian, spline distributions
+  handle skew, heavy tails, and non-standard shapes
+
+**Negative**:
+- More complex than Gaussian — requires spline fitting during codebook
+  compilation
+- Spline knot selection affects scoring quality — poor knot placement can
+  miss important distribution features
+- Less familiar to most ML practitioners than Gaussian or KDE
+
+## References
+
+- [codebook.md](../codebook.md)
+- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -0,0 +1,200 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Firewall
+
+The core firewall component: the public API for screening untrusted inputs and
+producing behavioral alarms.
+
+## What It Is
+
+The Firewall is the primary entry point for alknet-firewall. It receives
+untrusted text input, runs it through the detector model, extracts behavioral
+signals from hidden state activations, and produces a structured alarm
+indicating whether the input exhibits adversarial behavioral patterns.
+
+## Why It Exists
+
+LLM-based systems need a fast, pre-inference screening mechanism that catches
+adversarial inputs *before* they reach the target model. Text-surface
+defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
+detection catches what text hides — adversarial inputs produce anomalous
+activation patterns regardless of their surface form (ADR-002).
+
+## Data Flow
+
+```
+1. Input Arrives
+   "Please summarize this document: [hidden injection payload]"
+
+2. Tokenize
+   tokenizer.encode(input) → input_ids
+
+3. Detector Model Inference
+   model(input_ids) → hidden_states at key layers
+
+4. Activation Extraction
+   Extract hidden states from configured layers (early + mid)
+   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors
+
+5. SVD Projection
+   Project activations onto precomputed SVD basis
+   z_coords = svd_basis @ activation_vector
+
+6. Codebook Comparison
+   For each SVD dimension:
+     - Compute distance from normal behavioral region
+     - Apply spline scoring (monotonic distribution)
+     - Aggregate multi-dimensional signals
+
+7. Alarm Generation
+   Combine per-dimension signals → overall alarm
+   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+   Include per-dimension breakdown for interpretability
+```
+
+## Key Concepts
+
+### Behavioral Alarm
+
+Not a simple safe/unsafe binary. A behavioral alarm contains:
+
+- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
+- **Score**: Continuous 0.0–1.0 composite score
+- **Signals**: Per-dimension behavioral signal strengths
+- **Dimensions**: Which SVD directions are anomalous and by how much
+
+This multi-signal approach reflects that safety is multi-dimensional in
+activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
+that simultaneously shifts the refusal direction while activating role-playing
+dimensions is more suspicious than one that shifts only one dimension.
+
+### Score Composition
+
+The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
+using a weighted maximum:
+
+```
+score = max(w_d * signal_d for d in dimensions)
+```
+
+Where `w_d` are dimension weights (default: equal, configurable in
+`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
+single strongly anomalous dimension can trigger an alarm even if other
+dimensions are normal. This is critical for catching attacks that exploit
+specific behavioral patterns (e.g., refusal-suppression) while leaving other
+dimensions unaffected.
+
+The `suspicious` and `dangerous` thresholds are applied to this composite
+score to determine `Alarm.level`.
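+
+A minimal sketch of the composition rule and the threshold mapping. The
+function names and the default threshold values (0.5 and 0.8) are
+hypothetical and chosen only to make the example concrete.
+
+```python
+def composite_score(signals, weights=None):
+    """Weighted maximum over per-dimension signals (each in [0, 1])."""
+    if not signals:
+        raise ValueError("at least one dimension signal is required")
+    if weights is None:
+        weights = [1.0] * len(signals)  # default: equal weights
+    if len(weights) != len(signals):
+        raise ValueError("weights and signals must have the same length")
+    return max(w * s for w, s in zip(weights, signals))
+
+
+def alarm_level(score, suspicious=0.5, dangerous=0.8):
+    """Map the composite score to a level name using two thresholds."""
+    if score >= dangerous:
+        return "DANGEROUS"
+    if score >= suspicious:
+        return "SUSPICIOUS"
+    return "CLEAR"
+```
+
+With signals `[0.1, 0.9, 0.2]` the weighted maximum is 0.9, while the mean
+would be 0.4 and would fall below both example thresholds. This is the
+dilution effect that `max` avoids.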
+
+### Alarm Levels
+
+| Level | Meaning | Action |
+|-------|---------|--------|
+| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
+| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
+| `DANGEROUS` | Composite score at or above the `dangerous` threshold; one strongly anomalous dimension is enough | Block input or apply strong mitigations |
+
+### Latency Budget
+
+The firewall must complete screening in <10ms on commodity hardware
+(ADR-003). This budget breaks down approximately:
+
+| Step | Target Latency |
+|------|----------------|
+| Tokenization | ~0.5ms |
+| Model inference (135M, CPU) | ~5ms |
+| Activation extraction | ~0.1ms |
+| SVD projection | ~0.1ms |
+| Codebook comparison | ~0.3ms |
+| **Total** | **~6ms** |
+
+## Interfaces
+
+### Public API
+
+```python
+class AlarmLevel(Enum):
+    CLEAR = "clear"
+    SUSPICIOUS = "suspicious"
+    DANGEROUS = "dangerous"
+
+@dataclass
+class DimensionSignal:
+    dimension: int
+    deviation: float
+    score: float
+    direction_label: str | None
+
+@dataclass
+class Alarm:
+    level: AlarmLevel
+    score: float
+    signals: list[DimensionSignal]
+    input_hash: str          # SHA-256 of raw input string (for logging/dedup)
+    model_id: str
+    timestamp: float
+
+class Firewall:
+    def __init__(
+        self,
+        model_id: str = "HuggingFaceTB/SmolLM2-135M",
+        model_revision: str = DEFAULT_MODEL_REVISION,
+        codebook_path: Path | None = None,
+        thresholds: Thresholds | None = None,
+        device: str = "cpu",
+        cache_dir: str | None = None,
+    ): ...
+
+    def preload(self) -> None: ...
+
+    def screen(self, input: str) -> Alarm: ...
+```
+
+> `screen_batch` is Phase 2 (see overview.md scope).
+
+### Constraints
+
+1. **No network calls during screening** — the model is lazily loaded on
+   first `screen()` call or via explicit `preload()`, and that load is the only
+   step that may download anything. Download never happens at import time.
+   Once loaded, screening is entirely local.
+2. **Synchronous API** — `screen()` is a blocking call. Async is Phase 2.
+3. **No target model dependency** — the firewall has no access to the target
+   LLM's internals. It runs its own detector model.
+4. **Reproducible** — Same input + same model + same codebook = same alarm.
+   Pin model revision and codebook version.
+
+## Error Handling
+
+| Failure Mode | Exception Type | Behavior |
+|-------------|---------------|----------|
+| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
+| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
+| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
+| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
+| Empty input | `ValueError` | Raised if input is empty string. |
+| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
+| Very long input | — | Truncated to model's max sequence length with a `UserWarning`. |
+| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |
+
+All library-defined exception types subclass `AlknetFirewallError` (base
+library exception). `ValueError` and `MemoryError` are the Python built-ins.
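+
+One way the hierarchy and the input checks from the table could be laid out
+is sketched below. The class names come from the table. The `validate_input`
+helper is a hypothetical name introduced for illustration.
+
+```python
+class AlknetFirewallError(Exception):
+    """Base class for library-defined errors."""
+
+
+class ModelDownloadError(AlknetFirewallError):
+    """The detector model could not be downloaded."""
+
+
+class ModelNotLoadedError(AlknetFirewallError):
+    """A previous attempt to load the model failed."""
+
+
+class CodebookCorruptedError(AlknetFirewallError):
+    """The codebook failed validation."""
+
+
+class CodebookMismatchError(AlknetFirewallError):
+    """The codebook was built for a different model."""
+
+
+def validate_input(text):
+    """Reject inputs that the error table maps to ValueError."""
+    if text == "":
+        raise ValueError("input must not be empty")
+    try:
+        text.encode("utf-8")
+    except UnicodeEncodeError as exc:
+        # Lone surrogates such as "\ud800" cannot be encoded to UTF-8.
+        raise ValueError("input cannot be encoded as UTF-8") from exc
+    return text
+```
+
+A shared base class lets callers catch every library failure with a single
+`except AlknetFirewallError` clause while still handling input errors
+separately.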
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
+| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
+| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
+| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
+- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)
--- a/docs/architecture/model.md
+++ b/docs/architecture/model.md
@@ -0,0 +1,161 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Model
+
+The model component manages detector model loading, inference, and activation
+extraction. It is the interface between the firewall and the language model
+that provides behavioral signals.
+
+## What It Is
+
+The model component loads a small language model (default: SmolLM2-135M),
+runs inference on untrusted inputs, and extracts hidden state activations at
+configured layers. It is model-agnostic — any transformer model with
+accessible hidden states can serve as a detector.
+
+## Why It Exists
+
+The firewall needs model activations (hidden states) to detect behavioral
+patterns. This component encapsulates the complexity of model loading,
+inference, and activation extraction behind a clean interface that the
+codebook and firewall can consume without knowing model-specific details.
+
+The model-agnostic design (ADR-003) means the firewall is not tied to a
+specific detector model. Switching from SmolLM2-135M to another ~100M model
+requires recomputing the SVD basis and rebuilding the codebook, but no
+changes to the firewall logic.
+
+## Key Concepts
+
+### Activation Extraction
+
+The core operation: running the model on an input and capturing hidden state
+representations at specific layers.
+
+```python
+# Conceptual
+outputs = model(input_ids, output_hidden_states=True)
+activations = {
+    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
+    for layer_idx in configured_layers
+}
+```
+
+Key decisions:
+- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (a 30-layer model, so
+  all four sit in the first third of the network). Early layers (1, 2)
+  capture safety signals per EMNLP 2024 findings. Layers 4 and 8 add
+  progressively deeper context. Intermediate layers such as 3, 6, and 7 are
+  omitted to reduce dimensionality because their signals are highly
+  correlated with the selected layers.
+- **Which token**: The last token's hidden state carries the model's
+  "conclusion" about the full input sequence (ADR-009). This is the standard
+  choice for autoregressive (LLaMA-family) models.
+- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
+  (576 for SmolLM2-135M).
+
+### Model-Agnostic Interface
+
+The model component exposes a generic interface that works with any
+transformer model:
+
+```python
+class DetectorModel(Protocol):
+    model_id: str
+    hidden_dim: int
+    n_layers: int
+
+    def load(self, device: str = "cpu") -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+```
+
+The `infer` method returns hidden states at key layers, abstracting away
+whether the backend is PyTorch, ONNX Runtime, or a future Rust inference
+engine.
+
+### Lazy Loading
+
+The model is loaded on first use or explicit preload — not at import time.
+This keeps the library import fast (~milliseconds) even when torch is
+installed.
+
+```python
+firewall = Firewall()      # Does NOT load model yet
+firewall.preload()         # Explicit: download + load model
+alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
+```
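+
+The deferred-load behavior can be sketched with a small wrapper. This is an
+assumption about one possible structure, not the actual implementation:
+`LazyModel`, its `loader` callable, and the lock are illustrative choices.
+
+```python
+import threading
+
+
+class LazyModel:
+    """Defer an expensive load until first use and load at most once."""
+
+    def __init__(self, loader):
+        self._loader = loader  # zero-argument callable returning a model
+        self._model = None
+        self._lock = threading.Lock()
+
+    def is_loaded(self):
+        return self._model is not None
+
+    def preload(self):
+        """Explicitly trigger the load (idempotent)."""
+        with self._lock:
+            if self._model is None:
+                self._model = self._loader()
+
+    def get(self):
+        """Return the model, loading it on first access."""
+        if self._model is None:
+            self.preload()
+        return self._model
+```
+
+The lock keeps two threads that call `get()` at the same time from both
+running the loader, which matters when the load involves a large download.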
+
+### Offline Support
+
+The model component respects `HF_HUB_OFFLINE` and `local_files_only` flags.
+In air-gapped environments, models must be pre-downloaded. The library
+provides a CLI command for this:
+
+```bash
+python -m alknet_firewall download
+```
+
+## Interfaces
+
+### Public API
+
+```python
+class DetectorModel(Protocol):
+    model_id: str
+    hidden_dim: int
+    n_layers: int
+
+    def load(self, device: str = "cpu") -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+
+class HFDetectorModel:
+    """Default implementation using HuggingFace transformers."""
+
+    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit
+
+    def __init__(
+        self,
+        model_id: str = "HuggingFaceTB/SmolLM2-135M",
+        revision: str = DEFAULT_REVISION,
+        device: str = "cpu",
+        cache_dir: str | None = None,
+    ): ...
+
+    def load(self, device: str | None = None) -> None: ...
+    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
+    def is_loaded(self) -> bool: ...
+
+    @property
+    def extraction_layers(self) -> list[int]: ...
+```
+
+### Constraints
+
+1. **safetensors-only** — Model weights are loaded exclusively from
+   safetensors format. Pickle-based `.pt`/`.bin` files are never loaded
+   (ADR-005). This is a security requirement for a security product.
+2. **Model pinning** — Model revision must be pinned for reproducibility.
+   Default revision is a specific commit hash, not `"main"`.
+3. **CPU-first** — Default device is CPU. GPU inference is supported but not
+   required. The <10ms latency target is achievable on CPU with a 135M model.
+4. **No training** — The detector model is inference-only. No gradients are
+   computed. No model weights are modified at runtime.
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [003](decisions/003-small-model-detector.md) | Small model detector | ~135M params, <10ms, CPU-deployable |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
+| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
+| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
+| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -0,0 +1,129 @@
+# Open Questions
+
+Centralized tracker for unresolved questions across all architecture documents.
+
+## Theme: Inference Backend
+
+### OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?
+
+- **Origin**: [model.md](model.md), [overview.md](overview.md)
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: (pending)
+- **Cross-references**: ADR-006
+
+ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
+for PyTorch) and is well-suited for inference-only use. HuggingFace's `optimum`
+library provides drop-in replacement classes. However, supporting it in Phase 1
+adds complexity: model must be exported to ONNX format, `optimum` integration
+must be tested, and the activation extraction API may differ from PyTorch.
+
+Consider: Is the smaller footprint worth the integration complexity in Phase 1,
+or should ONNX support wait until Phase 2 when the core API is stable?
+
+---
+
+## Theme: Codebook Design
+
+### OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?
+
+- **Origin**: [codebook.md](codebook.md)
+- **Status**: open
+- **Priority**: high
+- **Resolution**: (pending)
+- **Cross-references**: ADR-004
+
+The PoC codebook is 1,245 lines — much of it may be boilerplate, dead code,
+or excessive parameterization from the research phase. Understanding what's
+essential vs. exploratory is critical for the initial extraction. The codebook
+training pipeline (`run_manifold_projection.py`) should also be analyzed.
+
+Consider: How many SVD dimensions are actually needed? What's the minimum
+calibration dataset? Can spline distributions be simplified?
+
+---
+
+## Theme: API Design
+
+### OQ-03: Should the firewall support streaming/chunked input screening?
+
+- **Origin**: [firewall.md](firewall.md)
+- **Status**: open
+- **Priority**: low
+- **Resolution**: (pending)
+- **Cross-references**: ADR-003
+
+Some inputs arrive in chunks (streaming API responses, large documents). Should
+the firewall support incremental screening as chunks arrive, or require the
+full input before screening? Incremental screening could detect attacks earlier
+but requires buffering and state management.
+
+This is low priority for Phase 1 but affects the internal API design.
+
+---
+
+### OQ-04: Should detection thresholds be per-model or globally configurable?
+
+- **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: (pending)
+- **Cross-references**: ADR-003, ADR-004
+
+Different detector models may produce different score distributions. Thresholds
+that work for SmolLM2-135M may not work for a different model. Should
+thresholds be tied to the codebook (per-model) or set globally by the user?
+
+Consider: Per-model defaults with user overrides? Codebook ships with
+recommended thresholds that the user can adjust?
+
+---
+
+## Theme: Integration
+
+### OQ-05: How should the firewall integrate with existing guardrail systems?
+
+- **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: (pending)
+- **Cross-references**: ADR-002
+
+The behavioral firewall is complementary to text-surface defenses. Users may
+want to run both Llama Guard (text classification) and alknet-firewall
+(behavioral signals) in series. How should these be composed?
+
+Consider: Integration adapters? A common interface? Callback hooks? Or is
+composition the user's responsibility and we just provide a clean standalone API?
+
+---
+
+## Theme: Project Setup
+
+### OQ-06: Should file-based configuration use TOML or YAML?
+
+- **Origin**: [configuration.md](configuration.md)
+- **Status**: open
+- **Priority**: low
+- **Resolution**: (pending)
+- **Cross-references**: None
+
+Phase 1 uses constructor-based configuration only. A future phase may add
+file-based configuration for easier deployment. TOML is consistent with
+Python packaging (pyproject.toml) and increasingly the standard for Python
+config. YAML is more familiar in ops/ML contexts. Either works.
+
+---
+
+### OQ-07: Is a Rust port feasible given current ML framework maturity?
+
+- **Origin**: [overview.md](overview.md), ADR-001
+- **Status**: open
+- **Priority**: low
+- **Resolution**: (pending)
+- **Cross-references**: ADR-001
+
+A Rust port using burn/cubecl was attempted during the PoC phase and failed.
+The ML framework ecosystem in Rust is not yet mature enough for this type
+of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
+matures or alternative Rust ML frameworks emerge.
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -0,0 +1,208 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Overview
+
+## Vision
+
+A pip-installable Python library that screens untrusted inputs for adversarial
+content before they reach a target LLM. The library uses behavioral signals —
+patterns in hidden state activations from a small language model — to detect
+injection attempts, obfuscated payloads, and novel attack types that text-surface
+defenses miss.
+
+This project is open source under the MIT license.
+
+## Why This Exists
+
+LLMs process instructions and data in the same token stream. They cannot
+reliably distinguish trusted system prompts from untrusted user content. This
+architectural weakness enables prompt injection — the #1 LLM vulnerability per
+OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50%
+of the time with just 10 attempts (International AI Safety Report 2026).
+
+Current defenses are **surface-level**: text classifiers (Llama Guard), regex
+filters, perplexity checks, and canary tokens. All examine *what the input
+says*, not *how a model processes it*. Adversarial inputs that look natural to
+text classifiers still produce distinctive activation patterns when a model
+processes them.
+
+Academic research validates this approach:
+- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
+- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
+- **EMNLP 2024**: Safety signals detectable in early layers
+- **Subliminal Learning (Nature 2026)**: Models transmit behavioral signals
+  through non-semantic hidden signals
+
+See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
+for the full threat analysis and academic evidence.
+
+## Scope
+
+### In Scope
+
+- **Phase 1**: Core behavioral firewall library
+  - Input screening via small model activation analysis
+  - SVD-based anomaly detection with configurable thresholds
+  - Model-agnostic detector (works with any compatible small model)
+  - SmolLM2-135M as the default detector model
+  - Multi-dimensional behavioral alarms (not just safe/unsafe)
+  - PyTorch inference backend (optional dependency)
+  - Runtime model download and caching via HuggingFace Hub
+  - safetensors-only model loading (security requirement)
+  - Synchronous API for single-input screening
+  - Interpretable detection signals (SVD direction analysis)
+
+- **Phase 2**: Integration and operational hardening
+  - ONNX Runtime inference backend
+  - Async/batch screening API
+  - Integration adapters for LlamaFirewall, NeMo Guardrails
+  - Metrics and observability
+  - Codebook training pipeline (`run_manifold_projection.py` extraction)
+
+- **Phase 3**: Advanced capabilities
+  - Multi-turn attack detection (payload splitting)
+  - Streaming input screening
+  - Custom model fine-tuning for domain-specific detection
+  - Rust port via burn/cubecl (speculative, requires R&D)
+
+### Out of Scope
+
+- Text-surface classification (that's Llama Guard's job)
+- Rule-based content filtering (that's NeMo Guardrails' job)
+- Output-side safety monitoring
+- Target model training or modification
+- Multimodal (image) input screening
+- Agent orchestration or access control
+- Replacement for comprehensive LLM security programs
+
+## Architecture
+
+```
+                        ┌────────────────────────────────────────────┐
+                        │  alknet-firewall (Python library)          │
+                        │                                            │
+  Untrusted Input ────► │  ┌─ Firewall API ───────────────────────┐  │
+  (text)                │  │  screen(input) → Alarm               │  │
+                        │  │  ├─ Tokenize input                   │  │
+                        │  │  ├─ Run detector model               │  │
+                        │  │  ├─ Extract hidden state activations │  │
+                        │  │  ├─ Project onto SVD basis           │  │
+                        │  │  ├─ Compare against codebook         │  │
+                        │  │  └─ Return behavioral alarm          │  │
+                        │  └──────────────────────────────────────┘  │
+                        │                                            │
+                        │  ┌─ Model Manager ──────────────────────┐  │
+                        │  │  Load model (HF Hub download/cache)  │  │
+                        │  │  Extract activations at key layers   │  │
+                        │  │  Model-agnostic interface            │  │
+                        │  └──────────────────────────────────────┘  │
+                        │                                            │
+                        │  ┌─ Codebook ───────────────────────────┐  │
+                        │  │  SVD basis vectors (compiled)        │  │
+                        │  │  Detection thresholds per dimension  │  │
+                        │  │  Behavioral region boundaries        │  │
+                        │  │  Spline distributions for scoring    │  │
+                        │  └──────────────────────────────────────┘  │
+                        │                                            │
+                        │  ┌─ Configuration ──────────────────────┐  │
+                        │  │  Model selection & revision pinning  │  │
+                        │  │  Detection thresholds                │  │
+                        │  │  Alarm severity levels               │  │
+                        │  └──────────────────────────────────────┘  │
+                        └────────────────────────────────────────────┘
+                                      │
+                               ┌──────┴──────┐
+                               │             │
+                        HF Hub Cache    Detector Model
+                        (~/.cache/)    (SmolLM2-135M)
+```
+
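The projection and comparison steps in the diagram can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the activations are synthetic, the names `basis`, `mu`, `sigma`, and `anomaly_score` are hypothetical, and a plain per-dimension z-score stands in for the spline distributions the codebook is described as using.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for last-token hidden states of benign calibration inputs
# (200 samples, 16 dims); a real run would take these from the detector model.
benign = rng.normal(size=(200, 16))

# Fit a k-dimensional SVD basis on mean-centered benign activations.
mean = benign.mean(axis=0)
_, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
basis = vt[:4]                      # (k, hidden_dim), rows are orthonormal

# Toy "codebook": per-dimension mean and spread of benign projections.
proj = (benign - mean) @ basis.T    # (200, k)
mu, sigma = proj.mean(axis=0), proj.std(axis=0)


def anomaly_score(activation: np.ndarray) -> float:
    """Largest per-dimension |z-score| of one activation vector."""
    z = ((activation - mean) @ basis.T - mu) / sigma
    return float(np.abs(z).max())


typical = benign[0]
shifted = benign[0] + 50.0 * basis[0]   # pushed far along the first direction
print(anomaly_score(typical), anomaly_score(shifted))
```

Running it prints a small score for the benign vector and a much larger one for the vector shifted along the first SVD direction, which is the kind of separation the codebook comparison relies on.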
+## Package Dependencies
+
+### Core (Required)
+
+| Package | Version | Purpose | Notes |
+|---------|---------|---------|-------|
+| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
+| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
+| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
+| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
+| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |
+
+### Optional (Extras)
+
+| Package | Extra | Version | Purpose | Notes |
+|---------|-------|---------|---------|-------|
+| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
+| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
+| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
+| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |
+
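Because the inference backends ship as extras, the library has to discover at runtime which one is installed. One way to do that, sketched here with the standard library only, is to probe for the packages without importing them. The function names and the `alknet-firewall[torch]` install hint are assumptions for illustration, not the project's actual API.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if top-level module `name` is importable, without importing it."""
    return find_spec(name) is not None


def available_backends() -> list[str]:
    """List inference backends whose packages are installed."""
    backends = []
    if has_module("torch") and has_module("transformers"):
        backends.append("torch")
    if has_module("onnxruntime"):
        backends.append("onnx")
    return backends


def require_backend() -> str:
    """Pick the first installed backend or fail with an install hint."""
    backends = available_backends()
    if not backends:
        raise ImportError(
            "No inference backend found. "
            "Install one with: pip install 'alknet-firewall[torch]'"
        )
    return backends[0]
```

Deferring the check to `require_backend()` keeps `import alknet_firewall` cheap and lets the error message point at the missing extra instead of surfacing a bare `ModuleNotFoundError`.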
+### Development (Not Published)
+
+| Package | Purpose |
+|---------|---------|
+| `ruff` | Linting + formatting (replaces flake8, black, isort) |
+| `pytest` | Testing |
+| `pytest-cov` | Coverage |
+| `mypy` | Type checking |
+| `pre-commit` | Git hooks |
+
+## Exports
+
+This is a Python library. Public API surface:
+
+```python
+from alknet_firewall import Firewall, Alarm, AlarmLevel
+
+# Core screening
+firewall = Firewall()  # loads default model + codebook
+alarm: Alarm = firewall.screen("untrusted input text")
+
+# Alarm properties
+alarm.level          # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
+alarm.score          # float, 0.0-1.0
+alarm.signals        # list[DimensionSignal] — per-dimension behavioral signals
+alarm.dimensions     # SVD dimension analysis
+```
+
+See [firewall.md](firewall.md) for the full API specification.
+
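As a rough illustration of how these types might fit together, the sketch below models `Alarm` and `AlarmLevel` with dataclasses and an enum. Everything beyond the names shown above is an assumption: the `DimensionSignal` fields, the 0.5 and 0.8 thresholds, and the max-score composition rule are hypothetical, and `alarm.dimensions` is omitted. The authoritative definitions live in firewall.md.

```python
from dataclasses import dataclass, field
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class DimensionSignal:
    dimension: int      # index of the SVD direction (hypothetical field)
    score: float        # anomaly contribution of that direction, 0.0-1.0


@dataclass(frozen=True)
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal] = field(default_factory=list)


def make_alarm(signals, suspicious=0.5, dangerous=0.8) -> Alarm:
    """Compose per-dimension signals into one alarm using a max-score rule."""
    score = max((s.score for s in signals), default=0.0)
    if score >= dangerous:
        level = AlarmLevel.DANGEROUS
    elif score >= suspicious:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return Alarm(level=level, score=score, signals=list(signals))
```

A caller would typically branch on `alarm.level` and log `alarm.signals` for later inspection, since the per-dimension breakdown is what makes the result interpretable.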
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
+| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
+| [003](decisions/003-small-model-detector.md) | Small model as detector | ~135M params (SmolLM2-135M): <10ms latency, CPU-deployable, early-layer signals |
+| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files — security product must be secure |
+| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | At 200MB-2.5GB, PyTorch is too large to require; the extras pattern is industry standard |
+| [007](decisions/007-runtime-model-download.md) | Runtime model download | The 269MB model is too large to bundle; HF Hub provides caching and auth |
+| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
+| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; the last token's state reflects the full sequence |
+| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
+
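ADR 009 is easiest to picture with a small example. In a causal model the final token attends to every earlier token, so its hidden state summarizes the whole input. The sketch below selects that state per sequence with NumPy; the function name is hypothetical and it assumes right-padded batches with a 0/1 attention mask.

```python
import numpy as np


def last_token_activations(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last real token in each sequence.

    hidden: (batch, seq_len, hidden_dim) activations from one layer.
    mask:   (batch, seq_len) attention mask, 1 = token, 0 = right padding.
    """
    last_idx = mask.sum(axis=1) - 1            # index of the last real token
    if (last_idx < 0).any():
        raise ValueError("every sequence needs at least one real token")
    return hidden[np.arange(hidden.shape[0]), last_idx]


hidden = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 1], [1, 1, 0]])       # second sequence has one pad
print(last_token_activations(hidden, mask))   # rows [4, 5] and [8, 9]
```

With left-padded batches the last real token is always at index `-1`, so the mask arithmetic would not be needed; which padding side the tokenizer uses is a configuration detail this sketch does not settle.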
+## Dependencies on Other Projects
+
+- **metaspline**: The core detection logic (codebook, spline distributions,
+  SVD projection, space transforms) is adapted from the metaspline research
+  project. The PoC validated the behavioral signal approach; this project
+  extracts and productionizes ~1,745 lines of the working subset.
+
+- **reverse-proxy**: The architecture documentation structure and SDD process
+  are adapted from the @alkdev/reverse-proxy project. The documentation
+  conventions, ADR format, and open questions tracking are reused directly.
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
+- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)