Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely. It doesn't support activation extraction natively (optimum #972 closed as not planned) and its model exports are bloated; burn/cublas via safetensors is a better future path.
- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py.
- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds adapter packages of under 100 lines for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails.
- OQ-06: TOML for file-based config (standard modern Python, two-way door).

Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.

---
status: draft
last_updated: 2026-06-13
---

# Codebook

The codebook contains the compiled detection parameters (SVD basis vectors,
behavioral region boundaries, and scoring distributions) that the firewall
uses to detect adversarial inputs.

## What It Is

The codebook is the "compiled detector", the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier: the result of an offline
compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.

## Why It Exists

Running a full SVD and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:

- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.

At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Per layer, that
is an O(k·d) matrix-vector product followed by O(k) scoring, where k is the
number of retained dimensions and d is the hidden dimension.

## Key Concepts

### z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

```
z = V^T @ (activation - mean)
```

Where `V` is the matrix of right-singular vectors (one basis vector per
column) and `mean` is the mean activation from the calibration dataset. The
centering step is critical: without it, every projection is offset by the
projected mean and the calibrated thresholds no longer apply.

z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.

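The projection and the effect of skipping the centering step can be sketched with NumPy. This is a minimal illustration for a single layer, assuming the basis is stored row-wise as `(n_dims, hidden_dim)` (that is, it already holds `V^T`, matching the `basis_vectors` layout in Tensor Specifications). The function name `project` and all numeric values are illustrative, not taken from the PoC.

```python
import numpy as np


def project(activation: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean) for one layer.

    `basis` has shape (n_dims, hidden_dim), so it plays the role of V^T.
    """
    return basis @ (activation - mean)


# Toy data: hidden_dim = 4, n_dims = 2 (illustrative values only).
mean = np.array([1.0, 1.0, 1.0, 1.0])
basis = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
activation = np.array([3.0, 0.0, 1.0, 1.0])

z = project(activation, basis, mean)

# Skipping the centering step shifts every coordinate by basis @ mean.
z_uncentered = basis @ activation
offset = z_uncentered - z
```

Here `offset` equals `basis @ mean`, which is the constant shift that would invalidate thresholds calibrated on centered projections.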
### SVD Basis

Singular Value Decomposition of the activation matrix from a calibration dataset
reveals the principal components (directions) that capture the most variance.
The top-k components form the basis that the codebook uses for projection.

Key properties:

- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.

### Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.

Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).

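The mapping from a score to an alarm level can be sketched as a simple threshold comparison. This is a hypothetical illustration: it assumes scores lie in [0, 1] with higher meaning more anomalous, that the two thresholds correspond to `suspicious_threshold` and `dangerous_threshold` from the codebook config, and that comparisons are inclusive. The threshold values below are made up for the example.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def alarm_level(score: float, suspicious: float, dangerous: float) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level (higher score = more anomalous)."""
    if not 0.0 <= suspicious <= dangerous <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= suspicious <= dangerous <= 1")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Illustrative thresholds only; real values come from the codebook config.
levels = [alarm_level(s, suspicious=0.90, dangerous=0.99) for s in (0.10, 0.95, 0.995)]
```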
### Spline Distributions

Monotonic spline distributions model the distribution along each SVD
dimension (ADR-010). They provide:

- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training

The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py`, ~280 lines total).

**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are determined by quantiles of
the calibration data (so knots are dense where the data is dense). Beyond
the extreme knots, the CDF approaches its limits (0 below, 1 above)
exponentially, at a rate fitted to the tail data. The scoring function maps a
z-coordinate to a score in [0, 1] via the CDF's complement:
`score = 1 - cdf(z)`.

**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.

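The formal definition above can be illustrated with a small sketch. It is not the metaspline `SplineDistribution`; it swaps in SciPy's `PchipInterpolator` as a stand-in monotone cubic interpolant, places knots at quantiles of the calibration samples, and attaches exponential tails with a fixed, hypothetical `tail_rate` rather than a rate fitted to tail data. The name `fit_cdf` and its parameters are assumptions made for this example.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def fit_cdf(samples: np.ndarray, n_knots: int = 12, tail_rate: float = 1.0):
    """Build cdf(z) from quantile-placed knots with exponential tails."""
    probs = np.linspace(0.01, 0.99, n_knots)   # CDF values at the knots
    knots = np.quantile(samples, probs)        # knots are dense where data is dense
    spline = PchipInterpolator(knots, probs)   # monotone cubic interpolant
    lo, hi = knots[0], knots[-1]
    p_lo, p_hi = probs[0], probs[-1]

    def cdf(z):
        z = np.asarray(z, dtype=float)
        inside = spline(np.clip(z, lo, hi))
        # Below the first knot the CDF approaches 0 exponentially.
        lower = p_lo * np.exp(tail_rate * np.minimum(z - lo, 0.0))
        # Above the last knot the CDF approaches 1 exponentially.
        upper = 1.0 - (1.0 - p_hi) * np.exp(-tail_rate * np.maximum(z - hi, 0.0))
        return np.where(z < lo, lower, np.where(z > hi, upper, inside))

    return cdf


rng = np.random.default_rng(0)
samples = rng.normal(size=2000)      # synthetic stand-in for calibration z-values
cdf = fit_cdf(samples)

grid = np.linspace(-6.0, 6.0, 241)
values = cdf(grid)
scores = 1.0 - values                # score = 1 - cdf(z), as defined above
```

On the synthetic data, `values` is non-decreasing across the whole grid and stays strictly inside (0, 1), including well beyond the extreme knots.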
### Calibration Dataset

The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:

- **Composition**: Diverse normal inputs representative of the deployment
  domain. No adversarial examples: the basis models *normal* behavior, and
  anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
  in production. A narrow calibration dataset (e.g., only short English
  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
  model by running that model on the inputs and extracting activations.

The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.

### Codebook Compilation

The codebook is compiled offline by a training pipeline that:

1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`,
   whose default randomized solver is approximate and not deterministic
   unless seeded)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)

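Step 3 can be sketched for a single layer as follows. This is an illustration under stated assumptions, not the PoC's `build()`: it assumes the activation matrix is centered before decomposition (consistent with the centering in the z-coordinate formula), and the helper name `compute_basis` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.linalg import svd


def compute_basis(activations: np.ndarray, n_dims: int):
    """Compute the top-n_dims SVD basis for one layer.

    activations: (n_samples, hidden_dim) matrix of calibration activations.
    Returns (basis, mean, singular_values) with basis of shape (n_dims, hidden_dim).
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Exact SVD; rows of vt are right-singular vectors, ordered by singular value.
    _, singular_values, vt = svd(centered, full_matrices=False)
    return vt[:n_dims], mean, singular_values[:n_dims]


rng = np.random.default_rng(0)
# Synthetic activations with decreasing variance per hidden unit.
activations = rng.normal(size=(200, 16)) * np.linspace(4.0, 0.5, 16)

basis, mean, sv = compute_basis(activations, n_dims=4)
z = (activations - mean) @ basis.T   # z-coordinates for every calibration input
```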
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
keeps the Phase 1 installation simple, with no additional download step beyond
the model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.

## Package Structure

Based on analysis of the PoC codebook
([poc-architecture.md](../research/codebook-analysis/poc-architecture.md)),
the production codebook decomposes into:

```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py       # Public exports
│   ├── codebook.py       # Codebook class (init, load, project, score)
│   ├── transforms.py     # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py        # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py       # DirectionProfile, population stats
│   ├── classifiers.py    # DirectionClassifier (logistic weights)
│   ├── results.py        # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py     # project(), decompose()
│   └── detection.py      # detect(), threshold comparison
├── training/
│   ├── __init__.py
│   ├── compiler.py       # build(): SVD, spline fitting, profile comp
│   ├── stats.py          # pooled_std, cohen_d, silhouette
│   └── data_loader.py    # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json
```

### Extraction from PoC

The PoC `firewall_codebook.py` is 1,245 lines with significant duplication
(the decomposition pipeline z → CDF → simplex → barycentric → (sum, u, v) is
repeated 5 times). Analysis identifies:

- **~480 lines of essential runtime code** in the PoC
- **~178 lines needed from metaspline core** (SplineDistribution,
  MonotonicCubicSpline, ensure_strictly_increasing, simplex)
- **~130 lines of histogram classifier**: exploratory alternative, not MVP
  (the continuous logistic classifier is superior)
- **~95 lines of AUC evaluation**: testing tool, not runtime
- **~429 lines in `build()`**: must be decomposed, with training moving to
  `training/compiler.py` and runtime state becoming immutable serialized data

Target: **~400–500 lines runtime + ~150–200 lines training = ~65% compression**
from the PoC's 1,245 lines.

### Key Extraction Decisions

1. **`build()` moves entirely to `training/compiler.py`**: The runtime codebook
   is read-only. The codebook class should not have a `build()` method.
2. **`decompose()` becomes a pure function**: `decompose(z, splines)` is a
   pure mathematical transform. No state dependencies beyond splines.
3. **Detection is separate from the codebook class**: `detect()` is a
   stateless function given codebook data. This enables swapping detection
   strategies without touching the codebook.
4. **Only 4 components of metaspline core (~178 of its 502 lines) are needed
   at runtime**: `SplineDistribution`, `MonotonicCubicSpline`,
   `ensure_strictly_increasing`, and `simplex()`. Everything else
   (DensitySpline, unfold/fold, dcs_norm) is dropped entirely.
5. **Saved `.pt` files from the PoC provide golden test data**: manifold
   projection results for Qwen3-0.6B/1.7B can be reused for integration tests.

## Data Format

The codebook is stored as:

```
codebook/
├── basis.safetensors     # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors   # Region boundary parameters
├── splines.json          # Spline knot positions and coefficients
└── config.json           # Metadata: model_id, revision, n_dims, thresholds
```

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.

### Tensor Specifications

**basis.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**splines.json**:

| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

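A load-time consistency check against the tables above might look like the sketch below. It operates on plain dictionaries of NumPy arrays, such as those that `safetensors.numpy.load_file` returns for each file; the loading call is deliberately left out so the example stays self-contained. The helper name `validate_codebook_tensors` is hypothetical.

```python
import numpy as np


def validate_codebook_tensors(basis: dict, regions: dict) -> tuple[int, int, int]:
    """Check shapes and dtypes against the tensor specification.

    Returns (n_layers, n_dims, hidden_dim) inferred from `basis_vectors`.
    """
    vectors = basis["basis_vectors"]
    if vectors.ndim != 3:
        raise ValueError(f"basis_vectors must be 3-D, got shape {vectors.shape}")
    n_layers, n_dims, hidden_dim = vectors.shape

    expected = {
        "basis_vectors": (vectors, (n_layers, n_dims, hidden_dim)),
        "mean": (basis["mean"], (n_layers, hidden_dim)),
        "centroids": (regions["centroids"], (n_layers, n_dims)),
        "scale": (regions["scale"], (n_layers, n_dims)),
    }
    for name, (tensor, shape) in expected.items():
        if tensor.shape != shape:
            raise ValueError(f"{name}: expected shape {shape}, got {tensor.shape}")
        if tensor.dtype != np.float32:
            raise ValueError(f"{name}: expected float32, got {tensor.dtype}")
    return n_layers, n_dims, hidden_dim


# Synthetic tensors with n_layers=2, n_dims=3, hidden_dim=8.
basis = {
    "basis_vectors": np.zeros((2, 3, 8), dtype=np.float32),
    "mean": np.zeros((2, 8), dtype=np.float32),
}
regions = {
    "centroids": np.zeros((2, 3), dtype=np.float32),
    "scale": np.ones((2, 3), dtype=np.float32),
}
dims = validate_codebook_tensors(basis, regions)
```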
## Interfaces

### Internal API

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np

from .results import DimensionSignal  # defined in codebook/results.py


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```

### Constraints

1. **Immutable at runtime**: The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound**: A codebook is valid only for the specific model it was
   compiled for. Loading a codebook with the wrong model produces undefined
   results.
3. **Deterministic**: Same codebook + same activations = same scores.
4. **Portable**: Codebook can be saved to disk and reloaded without
   recomputation. Can be distributed via HuggingFace Hub.

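One way to enforce the first two constraints is to make the loaded metadata immutable and to fail fast on a model mismatch instead of producing undefined results. The sketch below is hypothetical: the `CodebookMetadata` class, the `check_model_binding` helper, and the example identifiers are not part of the documented API; only the `model_id` and `revision` fields come from the `config.json` description in Data Format.

```python
import json
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class CodebookMetadata:
    """Immutable view of the model binding stored in config.json."""
    model_id: str
    revision: str


def check_model_binding(meta: CodebookMetadata, model_id: str, revision: str) -> None:
    """Raise if the codebook was compiled for a different model or revision."""
    if (meta.model_id, meta.revision) != (model_id, revision):
        raise ValueError(
            f"codebook compiled for {meta.model_id}@{meta.revision}, "
            f"but detector is {model_id}@{revision}"
        )


# Illustrative config contents; identifiers are placeholders.
config_text = '{"model_id": "example/detector", "revision": "abc123", "n_dims": 8}'
raw = json.loads(config_text)
meta = CodebookMetadata(model_id=raw["model_id"], revision=raw["revision"])

check_model_binding(meta, "example/detector", "abc123")  # matching model: no error
```

Attempting to assign to a field of `meta` raises `FrozenInstanceError`, which keeps accidental runtime mutation from going unnoticed.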
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-02**~~: ~~What is the minimum viable codebook, and can the 1,245-line PoC codebook be compressed?~~ (resolved: ~65% compression to 500–600 lines; see Package Structure section)
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both, with model-specific defaults that are user-overridable)