feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/codebook.md
+++ b/docs/architecture/codebook.md
@@ -0,0 +1,248 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Codebook
+
+The codebook contains the compiled detection parameters — SVD basis vectors,
+behavioral region boundaries, and scoring distributions — that the firewall
+uses to detect adversarial inputs.
+
+## What It Is
+
+The codebook is the "compiled detector": the precomputed parameters that
+transform raw model activations into behavioral alarm signals. It is to the
+firewall what trained weights are to a classifier: the output of an offline
+compilation step that supplies the runtime detection parameters.
+
+The name "codebook" comes from vector quantization terminology: it defines a
+set of reference points (codewords) in activation space that represent known
+behavioral patterns. New inputs are compared against these reference patterns.
+
+## Why It Exists
+
+Running full SVD decomposition and distribution fitting on every input would be
+prohibitively expensive. The codebook precomputes these offline:
+
+- **SVD basis**: The principal directions in activation space that capture
+  safety-relevant behavioral variance. Computed once from a calibration
+  dataset.
+- **Behavioral regions**: The expected distribution of normal inputs along each
+  SVD dimension. Defined by fitted spline distributions.
+- **Thresholds**: Decision boundaries for alarm levels along each dimension.
+
+At runtime, the firewall only needs to project new activations onto the
+precomputed basis and compare against the precomputed regions. Per layer, the
+projection is a k × d matrix-vector product, O(k·d), and scoring is O(k),
+where k is the number of retained dimensions and d is the hidden size.
+
+## Key Concepts
+
+### z-Coordinates
+
+The projection of an activation vector onto the SVD basis. Computed as:
+
+```
+z = V^T @ (activation - mean)
+```
+
+Where `V` is the SVD right-singular matrix, with the `k` retained basis
+vectors as its columns (stored row-wise, i.e. as `V^T`, in `basis_vectors`),
+and `mean` is the mean activation from the calibration dataset. The centering
+step is critical: without it, every projection is offset by `V^T @ mean` and
+the calibrated thresholds no longer apply.
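
A minimal sketch of the projection above for a single layer. Plain Python lists keep it dependency-free; the function name `project_layer` and the toy numbers are illustrative assumptions, not part of the codebook API. With NumPy the same step would be `basis @ (activation - mean)` for a `(k, d)` basis.

```python
def project_layer(basis, mean, activation):
    """Return z = V^T (activation - mean) for one layer.

    basis:      k rows, each a length-d basis vector (the rows of V^T)
    mean:       length-d calibration mean
    activation: length-d activation vector
    """
    if len(mean) != len(activation):
        raise ValueError("mean and activation must have the same length")
    centered = [a - m for a, m in zip(activation, mean)]
    return [sum(v * c for v, c in zip(row, centered)) for row in basis]


# Toy example: hidden_dim d = 3, k = 2 orthonormal directions.
basis = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
mean = [0.5, 0.5, 0.5]
activation = [1.5, 0.0, 2.0]

z = project_layer(basis, mean, activation)
# Skipping the centering step shifts every coordinate by V^T @ mean.
uncentered = project_layer(basis, [0.0, 0.0, 0.0], activation)
print(z, uncentered)
```

In this toy case the uncentered coordinates differ from `z` by exactly `[0.5, 0.5]`, the projection of the mean, which is the offset the centering step removes.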
+
+z-coordinates are raw (unnormalized) projections. The codebook's spline
+distributions are calibrated for this scale, so threshold values in the
+codebook are specific to the z-coordinate range of the calibration data.
+
+### SVD Basis
+
+Singular Value Decomposition of the activation space from a calibration dataset
+reveals the principal components (directions) that capture the most variance.
+The top-k components form the basis that the codebook uses for projection.
+
+Key properties:
+- **Interpretable**: Each direction can be inspected for what behavioral
+  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
+- **Efficient**: After decomposition, projection is a matrix multiply
+- **Stable**: SVD basis is deterministic for a given calibration dataset and
+  SVD routine (singular vectors are mathematically unique only up to sign)
+- **Model-specific**: The basis is computed for a specific model architecture
+  and weights. Changing the detector model requires recomputing the basis
+
+The SVD basis is computed by the codebook training pipeline
+(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
+
+### Behavioral Regions
+
+For each SVD dimension, the codebook defines the expected distribution of
+normal (non-adversarial) inputs. This is modeled as a monotonic spline
+distribution that captures the shape of the behavioral region along that
+dimension.
+
+Inputs whose projections fall within the normal region score low (CLEAR).
+Inputs whose projections fall near or beyond the region boundary score
+increasingly high (SUSPICIOUS → DANGEROUS).
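
As a rough illustration of the CLEAR → SUSPICIOUS → DANGEROUS progression, the sketch below maps a score in [0, 1] to an alarm level with two ordered thresholds. The `AlarmLevel` enum, the `classify` function, the `>=` comparison, and the threshold values are assumptions made for this example; only the level names and the two threshold field names appear in this document.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def classify(score, suspicious_threshold, dangerous_threshold):
    """Map a score in [0, 1] to an alarm level using two ordered thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if suspicious_threshold > dangerous_threshold:
        raise ValueError("suspicious_threshold must not exceed dangerous_threshold")
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Placeholder thresholds, not values taken from any real codebook.
for s in (0.10, 0.92, 0.995):
    level = classify(s, suspicious_threshold=0.90, dangerous_threshold=0.99)
    print(s, level.name)
```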
+
+### Spline Distributions
+
+Monotonic spline distributions model the distribution of projections along
+each SVD dimension (ADR-010). They provide:
+
+- **Smooth scoring**: Continuous score rather than hard threshold
+- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
+  anomalous inputs
+- **Parametric compactness**: A handful of spline knots represent the full
+  distribution shape
+- **Differentiability**: Scores are differentiable for potential future use in
+  adversarial training
+
+The spline distribution approach is adapted from the metaspline PoC
+(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+
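+**Formal definition**: The CDF along each dimension is modeled as a monotonic
+cubic spline with 10–20 knots. Knot positions are determined by quantiles of
+the calibration data, which places knots densely where the data is dense.
+Beyond the extreme knots, the tail probability (`cdf(z)` below the first
+knot, `1 - cdf(z)` above the last) decays exponentially at a rate fitted to
+the tail data. The scoring function maps a z-coordinate to a score in [0, 1]
+via the CDF's complement: `score = 1 - cdf(z)`, the probability that a
+calibration input projects above `z`. This score is one-sided: it approaches
+1 in the lower tail and 0 in the upper tail.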
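
The definition above can be sketched with the standard library alone. This is a simplified stand-in, not the metaspline implementation: piecewise-linear interpolation replaces the monotonic cubic spline, `tail_decay` is a fixed assumed value rather than a fitted one, and the names `fit_cdf`, `cdf`, and `score` are hypothetical.

```python
import bisect
import math
import random
import statistics


def fit_cdf(samples, n_knots=10):
    """Place knots at evenly spaced quantiles of the calibration samples."""
    # quantiles(n=n_knots + 1) returns n_knots cut points in ascending order.
    knots = statistics.quantiles(samples, n=n_knots + 1)
    # The CDF value at the i-th cut point is (i + 1) / (n_knots + 1).
    probs = [(i + 1) / (n_knots + 1) for i in range(n_knots)]
    return knots, probs


def cdf(z, knots, probs, tail_decay=1.0):
    """Monotone CDF: linear between knots, exponential tails outside them."""
    if z <= knots[0]:
        # Lower tail: cdf(z) itself decays toward 0.
        return probs[0] * math.exp(-tail_decay * (knots[0] - z))
    if z >= knots[-1]:
        # Upper tail: 1 - cdf(z) decays toward 0.
        return 1.0 - (1.0 - probs[-1]) * math.exp(-tail_decay * (z - knots[-1]))
    i = bisect.bisect_right(knots, z)  # knots[i - 1] <= z < knots[i]
    lo, hi = knots[i - 1], knots[i]
    t = (z - lo) / (hi - lo)
    return probs[i - 1] + t * (probs[i] - probs[i - 1])


def score(z, knots, probs, tail_decay=1.0):
    """score = 1 - cdf(z), as in the formal definition."""
    return 1.0 - cdf(z, knots, probs, tail_decay)


random.seed(0)
calibration = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic data
knots, probs = fit_cdf(calibration, n_knots=10)
for z in (-3.0, 0.0, 3.0):
    print(z, round(score(z, knots, probs), 3))
```

Both branches meet the interpolated segment at the extreme knots, so the sketched CDF is continuous and non-decreasing, and the score falls monotonically as `z` increases.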
+
+**Canonical implementation**: The metaspline PoC files `spline.py`
+(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
+and `space.py` (`unfold`/`fold`) are the reference implementation for the
+codebook compilation pipeline.
+
+### Calibration Dataset
+
+The calibration dataset is the set of normal (non-adversarial) inputs used to
+compute the SVD basis and fit behavioral region distributions. Requirements:
+
+- **Composition**: Diverse normal inputs representative of the deployment
+  domain. No adversarial examples — the basis models *normal* behavior, and
+  anomalies are detected as deviations from it.
+- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
+  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
+  have diminishing returns.
+- **Diversity**: Must cover the range of normal inputs the detector will see
+  in production. A narrow calibration dataset (e.g., only short English
+  queries) will produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: A calibration dataset must be collected for each detector
+  model by running that model on the inputs and extracting activations.
+
+The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
+automates calibration dataset processing.
+
+### Codebook Compilation
+
+The codebook is compiled offline by a training pipeline that:
+
+1. Runs the detector model on a calibration dataset (diverse normal inputs)
+2. Extracts hidden state activations at configured layers
+3. Computes SVD on the mean-centered activation matrix (`scipy.linalg.svd`
+   for an exact decomposition; not `sklearn.decomposition.TruncatedSVD`,
+   which defaults to a randomized solver, is reproducible only with a fixed
+   `random_state`, and does not center the data)
+4. Fits spline distributions along each retained dimension
+5. Computes detection thresholds
+6. Serializes the codebook to a portable format (safetensors + JSON config)
+
+This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
+package** as package data (under `src/alknet_firewall/data/codebook/`). This
+keeps the Phase 1 installation simple — no additional download step beyond the
+model. The bundled codebook is specific to the default detector model
+(SmolLM2-135M at the pinned revision). Users who switch to a different
+detector model must provide a matching codebook via `codebook_path`.
+
+## Data Format
+
+The codebook is stored as:
+
+```
+codebook/
+├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
+├── regions.safetensors    # Region boundary parameters
+├── splines.json           # Spline knot positions and coefficients
+└── config.json            # Metadata: model_id, revision, n_dims, thresholds
+```
+
+All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
+
+### Tensor Specifications
+
+**basis.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
+| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |
+
+**regions.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
+| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
+
+**splines.json**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
+| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
+| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
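
A loader might sanity-check `splines.json` against the field table above before use. The sketch below is one possible check under stated assumptions: the function name `validate_splines` is hypothetical, and the requirements that knots be strictly increasing and decay rates be positive are inferred from the monotonic-CDF and exponential-tail descriptions rather than stated by this spec.

```python
import json


def validate_splines(doc, n_dims):
    """Check the per-dimension structure of a parsed splines.json document."""
    for field in ("knots", "coefficients", "tail_decay"):
        if field not in doc:
            raise ValueError(f"missing field: {field}")
        if len(doc[field]) != n_dims:
            raise ValueError(
                f"{field}: expected {n_dims} entries, got {len(doc[field])}"
            )
    for dim, knots in enumerate(doc["knots"]):
        # Assumption: a monotonic CDF needs strictly increasing knot positions.
        if any(b <= a for a, b in zip(knots, knots[1:])):
            raise ValueError(f"knots for dimension {dim} are not strictly increasing")
    for dim, rate in enumerate(doc["tail_decay"]):
        # Assumption: a decaying tail needs a positive rate.
        if rate <= 0:
            raise ValueError(f"tail_decay for dimension {dim} must be positive")


# Toy document with n_dims = 2; every value is made up for illustration.
raw = """
{
  "knots": [[-1.0, 0.0, 1.0], [-2.0, -0.5, 0.5, 2.0]],
  "coefficients": [[0.1, 0.5, 0.9], [0.05, 0.3, 0.7, 0.95]],
  "tail_decay": [1.2, 0.8]
}
"""
splines = json.loads(raw)
validate_splines(splines, n_dims=2)
print("splines.json structure looks valid")
```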
+
+## Interfaces
+
+### Internal API
+
+```python
+from __future__ import annotations  # lets methods annotate `-> Codebook`
+
+from dataclasses import dataclass
+from pathlib import Path
+
+import numpy as np
+
+# `DimensionSignal` (the per-dimension scoring result) is not defined in this
+# document; the deferred annotations above keep this module importable.
+
+
+@dataclass
+class CodebookConfig:
+    model_id: str
+    model_revision: str
+    n_dimensions: int
+    layers: list[int]
+    suspicious_threshold: float    # Serialized threshold values
+    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)
+
+
+class Codebook:
+    def __init__(self, path: Path): ...
+
+    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
+        """Project raw activations onto SVD basis → z-coordinates."""
+        ...
+
+    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
+        """Score z-coordinates against behavioral regions."""
+        ...
+
+    @classmethod
+    def load(cls, path: Path) -> Codebook: ...
+
+    @classmethod
+    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
+```
+
+### Constraints
+
+1. **Immutable at runtime** — The codebook is read-only during screening.
+   Modifying the codebook requires explicit recompilation.
+2. **Model-bound** — A codebook is valid only for the specific model it was
+   compiled for. Loading a codebook with the wrong model produces undefined
+   results.
+3. **Deterministic** — Same codebook + same activations = same scores.
+4. **Portable** — Codebook can be saved to disk and reloaded without
+   recomputation. Can be distributed via HuggingFace Hub.
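
Constraints 1 and 2 could be enforced mechanically rather than by convention. The sketch below is one way to do it, under assumptions: `BoundConfig`, `CodebookMismatchError`, and `check_model_binding` are hypothetical names, and the revision string `rev-a` is a placeholder, not the pinned revision.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)  # frozen: attribute assignment raises after construction
class BoundConfig:
    model_id: str
    model_revision: str


class CodebookMismatchError(ValueError):
    """Raised when a codebook is paired with a model it was not compiled for."""


def check_model_binding(config, model_id, model_revision):
    """Fail fast instead of scoring with a mismatched basis."""
    if (config.model_id, config.model_revision) != (model_id, model_revision):
        raise CodebookMismatchError(
            f"codebook compiled for {config.model_id}@{config.model_revision}, "
            f"got {model_id}@{model_revision}"
        )


config = BoundConfig(model_id="SmolLM2-135M", model_revision="rev-a")
check_model_binding(config, "SmolLM2-135M", "rev-a")  # matches: no error

try:
    check_model_binding(config, "some-other-model", "rev-a")
except CodebookMismatchError as exc:
    print(f"rejected: {exc}")

try:
    config.model_id = "some-other-model"  # read-only at runtime
except FrozenInstanceError:
    print("config is immutable")
```

Raising on a mismatch turns the "undefined results" case into an explicit load-time error, at the cost of requiring callers to pass the model identity alongside the codebook.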
+
+## Design Decisions
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
+| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
+| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-02**: What is the minimum viable codebook — can the 1,245-line PoC
+  codebook be compressed? (open)
+- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)