---
status: draft
last_updated: 2026-06-13
---
# Codebook
The codebook contains the compiled detection parameters — SVD basis vectors,
behavioral region boundaries, and scoring distributions — that the firewall
uses to detect adversarial inputs.
## What It Is
The codebook is the "compiled detector" — the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier: the result of an offline
compilation step that produces the runtime detection parameters.
The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.
## Why It Exists
Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:
- **SVD basis**: The principal directions in activation space that capture
safety-relevant behavioral variance. Computed once from a calibration
dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.
At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Projection is
one matrix-vector product per layer, O(k·d) for k retained dimensions and
hidden size d, followed by k spline evaluations.
## Key Concepts
### z-Coordinates
The projection of an activation vector onto the SVD basis. Computed as:
```
z = V^T @ (activation - mean)
```
Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
mean activation from the calibration dataset. The centering step is critical
— without it, projections are offset by the mean and thresholds would be
incorrect.
z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.
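
The projection formula above can be sketched in NumPy as follows. This is a minimal illustration, not the codebook's implementation: the function name `project`, the toy values, and the convention that `basis` stores one basis vector per row (so it plays the role of `V^T`) are assumptions made for the example.

```python
import numpy as np


def project(activation: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Center one activation vector and project it onto the retained basis.

    basis has shape (n_dims, hidden_dim): one basis vector per row, i.e. V^T.
    """
    return basis @ (activation - mean)


# Toy data (assumed values): hidden_dim = 4, two retained directions.
basis = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
mean = np.array([0.5, -1.0, 2.0, 0.0])
activation = np.array([1.5, 1.0, 2.0, 3.0])

z = project(activation, basis, mean)   # centered projection
z_uncentered = basis @ activation      # what skipping the centering step gives
offset = z_uncentered - z              # equals basis @ mean
```

The `offset` value shows why centering matters: without it, every z-coordinate is shifted by the projection of the calibration mean, so thresholds fitted on centered data no longer line up.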
### SVD Basis
Singular Value Decomposition of the activation space from a calibration dataset
reveals the principal components (directions) that capture the most variance.
The top-k components form the basis that the codebook uses for projection.
Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: With an exact solver, the SVD basis is deterministic for a given calibration dataset (each direction is defined up to sign)
- **Model-specific**: The basis is computed for a specific model architecture
and weights. Changing the detector model requires recomputing the basis
The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
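
As a rough sketch of how a top-k basis could be obtained from a calibration matrix, the example below centers the data and keeps the leading right-singular vectors. It uses `numpy.linalg.svd` as a stand-in for the exact decomposition; the helper name `fit_basis` and the synthetic data are assumptions, and the PoC pipeline may differ in its details.

```python
import numpy as np


def fit_basis(activations: np.ndarray, n_dims: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (basis, mean) for a calibration matrix of shape (n_samples, hidden_dim)."""
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Rows of vt are right-singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dims], mean


rng = np.random.default_rng(0)
# Synthetic stand-in for calibration activations: 200 samples, hidden_dim = 16.
calibration = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)

basis, mean = fit_basis(calibration, n_dims=4)
basis_again, _ = fit_basis(calibration, n_dims=4)  # same data, same basis
```

The returned `basis` has orthonormal rows, so projecting onto it is a single matrix multiply. Because the basis is stored in the codebook, any sign convention chosen at compile time is fixed for runtime.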
### Behavioral Regions
For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.
Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).
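
The mapping from a score to an alarm level could look like the sketch below. The two threshold names mirror the `suspicious_threshold` and `dangerous_threshold` config fields; the enum values, the `alarm_level` helper, and the numeric thresholds are illustrative assumptions rather than the shipped defaults.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def alarm_level(score: float, suspicious_threshold: float, dangerous_threshold: float) -> AlarmLevel:
    """Compare a score in [0, 1] against the two thresholds, highest first."""
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Hypothetical thresholds, chosen only for the example.
levels = [alarm_level(s, suspicious_threshold=0.90, dangerous_threshold=0.99)
          for s in (0.20, 0.95, 0.995)]
```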
### Spline Distributions
Monotonic spline distributions model the probability density along each SVD
dimension (ADR-010). They provide:
- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
adversarial training
The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are determined by quantiles of
the calibration data (so knots are dense where the data is dense). Beyond
the extreme knots, the CDF approaches its limits (0 and 1) exponentially, at a
rate fitted to the tail data. The scoring function maps a z-coordinate to a
score in [0, 1] via the CDF's complement: `score = 1 - cdf(z)`.
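
The shape of this scoring function can be illustrated with a simplified sketch. To stay dependency-free it uses piecewise-linear interpolation between quantile knots as a stand-in for the monotonic cubic spline, and a fixed `tail_decay` instead of a fitted rate; the function names and all numeric values are assumptions for the example.

```python
import numpy as np


def fit_cdf(samples: np.ndarray, n_knots: int = 15) -> tuple[np.ndarray, np.ndarray]:
    """Place knots at evenly spaced quantiles; return (knot positions, CDF values)."""
    probs = np.linspace(0.01, 0.99, n_knots)
    knots = np.quantile(samples, probs)
    return knots, probs


def cdf(z: float, knots: np.ndarray, probs: np.ndarray, tail_decay: float = 1.0) -> float:
    """Monotone CDF: linear between knots, exponential approach to 0 and 1 outside."""
    lo, hi = knots[0], knots[-1]
    if z < lo:
        return float(probs[0] * np.exp(-tail_decay * (lo - z)))
    if z > hi:
        return float(1.0 - (1.0 - probs[-1]) * np.exp(-tail_decay * (z - hi)))
    return float(np.interp(z, knots, probs))


def score(z: float, knots: np.ndarray, probs: np.ndarray, tail_decay: float = 1.0) -> float:
    """Complement of the CDF, as in score = 1 - cdf(z)."""
    return 1.0 - cdf(z, knots, probs, tail_decay)


rng = np.random.default_rng(0)
samples = rng.normal(size=5000)  # synthetic z-coordinates for one dimension
knots, probs = fit_cdf(samples)
scores = [score(z, knots, probs) for z in (-6.0, 0.0, 6.0)]
```

The tails join the interior continuously at the extreme knots, so the score stays monotone in z across the whole real line.
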
**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.
### Calibration Dataset
The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:
- **Composition**: Diverse normal inputs representative of the deployment
domain. No adversarial examples — the basis models *normal* behavior, and
anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
in production. A narrow calibration dataset (e.g., only short English
queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
model by running that model on the inputs and extracting activations.
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.
### Codebook Compilation
The codebook is compiled offline by a training pipeline that:
1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
which uses randomized approximation and may not be deterministic)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
keeps the Phase 1 installation simple — no additional download step beyond the
model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.
## Package Structure
Based on analysis of the PoC codebook
([poc-architecture.md](../research/codebook-analysis/poc-architecture.md)),
the production codebook decomposes into:
```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py      # Public exports
│   ├── codebook.py      # Codebook class (init, load, project, score)
│   ├── transforms.py    # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py       # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py      # DirectionProfile, population stats
│   ├── classifiers.py   # DirectionClassifier (logistic weights)
│   ├── results.py       # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py    # project(), decompose()
│   └── detection.py     # detect(), threshold comparison
├── training/
│   ├── __init__.py
│   ├── compiler.py      # build(): SVD, spline fitting, profile comp
│   ├── stats.py         # pooled_std, cohen_d, silhouette
│   └── data_loader.py   # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json
```
### Extraction from PoC
The PoC `firewall_codebook.py` is 1,245 lines with significant duplication
(the decomposition pipeline z → CDF → simplex → barycentric → (sum, u, v) is
repeated 5 times). Analysis identifies:
- **~480 lines of essential runtime code** in the PoC
- **~178 lines needed from metaspline core** (SplineDistribution,
MonotonicCubicSpline, ensure_strictly_increasing, simplex)
- **~130 lines of histogram classifier** — exploratory alternative, not MVP
(the continuous logistic classifier is superior)
- **~95 lines of AUC evaluation** — testing tool, not runtime
- **~429 lines in `build()`** — must be decomposed: training moves to
`training/compiler.py`, runtime state becomes immutable serialized data
Target: **~400–500 lines runtime + ~150–200 lines training = ~65% compression**
from the PoC's 1,245 lines.
### Key Extraction Decisions
1. **`build()` moves entirely to `training/compiler.py`**: The runtime codebook
is read-only. The codebook class should not have a `build()` method.
2. **`decompose()` becomes a pure function**: `decompose(z, splines)` is a
pure mathematical transform with no state dependencies beyond the splines.
3. **Detection is separate from the codebook class**: `detect()` is a
stateless function given codebook data. This enables swapping detection
strategies without touching the codebook.
4. **Only 4 components of the 502-line metaspline core are needed at runtime**:
`SplineDistribution`, `MonotonicCubicSpline`, `ensure_strictly_increasing`,
and `simplex()`. Everything else (DensitySpline, unfold/fold, dcs_norm) is
dropped entirely.
5. **Saved `.pt` files from the PoC provide golden test data**: Manifold
projection results for Qwen3-0.6B/1.7B can be reused for integration tests.
## Data Format
The codebook is stored as:
```
codebook/
├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors # Region boundary parameters
├── splines.json # Spline knot positions and coefficients
└── config.json # Metadata: model_id, revision, n_dims, thresholds
```
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
### Tensor Specifications
**basis.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |
**regions.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
**splines.json**:
| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
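
A loader for `splines.json` might validate the fields in the table above before handing them to the scoring code. The sketch below is one possible shape for such a check: the `load_splines` name, the error messages, and the requirement that knots be strictly increasing are assumptions (the last one is suggested by the `ensure_strictly_increasing` helper named earlier, not confirmed by the format itself).

```python
import json


def load_splines(text: str, n_dims: int) -> dict:
    """Parse splines.json content and check the per-dimension list lengths."""
    data = json.loads(text)
    for field in ("knots", "coefficients", "tail_decay"):
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if len(data[field]) != n_dims:
            raise ValueError(f"{field}: expected {n_dims} entries, got {len(data[field])}")
    for dim, knots in enumerate(data["knots"]):
        # Assumed invariant: knot positions increase strictly within a dimension.
        if any(b <= a for a, b in zip(knots, knots[1:])):
            raise ValueError(f"knots for dimension {dim} are not strictly increasing")
    return data


# Toy document with two dimensions; all numbers are made up for the example.
example = json.dumps({
    "knots": [[-2.0, 0.0, 2.0], [-1.0, -0.5, 0.5, 1.0]],
    "coefficients": [[0.1, 0.5, 0.9], [0.05, 0.3, 0.7, 0.95]],
    "tail_decay": [1.2, 0.8],
})
splines = load_splines(example, n_dims=2)

try:
    load_splines(example, n_dims=3)  # wrong dimension count
    rejected = False
except ValueError:
    rejected = True
```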
## Interfaces
### Internal API
```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np

from .results import DimensionSignal  # codebook/results.py in the package layout


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```
### Constraints
1. **Immutable at runtime** — The codebook is read-only during screening.
Modifying the codebook requires explicit recompilation.
2. **Model-bound** — A codebook is valid only for the specific model it was
compiled for. Loading a codebook with the wrong model produces undefined
results.
3. **Deterministic** — Same codebook + same activations = same scores.
4. **Portable** — Codebook can be saved to disk and reloaded without
recomputation. Can be distributed via HuggingFace Hub.
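
Two of these constraints lend themselves to cheap guards at load time. The sketch below shows one way they might be enforced: read-only NumPy arrays for immutability, and an explicit model check so that a mismatch fails fast instead of producing undefined results. The trimmed-down `CodebookConfig`, the `freeze` and `check_model_binding` helpers, and the model identifiers are hypothetical and not part of the documented API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class CodebookConfig:
    """Trimmed-down config: only the fields needed for the binding check."""
    model_id: str
    model_revision: str


def freeze(array: np.ndarray) -> np.ndarray:
    """Return a read-only copy so screening code cannot mutate codebook data."""
    frozen = array.copy()
    frozen.setflags(write=False)
    return frozen


def check_model_binding(config: CodebookConfig, model_id: str, model_revision: str) -> None:
    """Raise if the loaded detector model is not the one the codebook was compiled for."""
    if (config.model_id, config.model_revision) != (model_id, model_revision):
        raise ValueError(
            f"codebook compiled for {config.model_id}@{config.model_revision}, "
            f"got {model_id}@{model_revision}"
        )


# Placeholder identifiers, not real model names or revisions.
config = CodebookConfig(model_id="example-org/detector", model_revision="abc123")
frozen_basis = freeze(np.eye(3))

try:
    frozen_basis[0, 0] = 2.0
    write_blocked = False
except ValueError:  # NumPy rejects writes to read-only arrays
    write_blocked = True

try:
    check_model_binding(config, "example-org/other-model", "abc123")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```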
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- ~~**OQ-02**~~: ~~What is the minimum viable codebook: can the 1,245-line PoC codebook be compressed?~~ (resolved: ~65% compression to 500–600 lines; see the Package Structure section)
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both, with model-specific defaults that users can override)