The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, and Token-Level Smoothing section; add classifier weights and direction profiles to the data format; update the Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
485 lines
21 KiB
Markdown
---
status: draft
last_updated: 2026-06-13
---

# Codebook

The codebook contains the compiled detection parameters (SVD basis vectors,
behavioral region boundaries, and scoring distributions) that the firewall
uses to detect adversarial inputs.

## What It Is

The codebook is the "compiled detector", the set of precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier: the result of an offline
compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.

## Why It Exists

Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:

- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.

At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. The projection
costs O(k·d) per token position (k dot products over the hidden dimension d),
and the region comparison costs O(k), where k is the number of retained
dimensions.

## Key Concepts

### z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

```
z = V^T @ (activation - mean)
```

Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
mean activation from the calibration dataset. The centering step is critical:
without it, projections are offset by the mean and thresholds would be
incorrect.

z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.

**Training shape**: `(N, 3)` where N is the total number of token positions
across all calibration prompts. Each token position produces its own
z-coordinate, so the population data is a flattened collection of per-token
z-vectors.

**Inference shape**: `(seq_len, 3)` for a single input. Each token position
in the input sequence produces a z-coordinate. The detection pipeline
operates on this per-token sequence, optionally smoothing it before
classification.

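As a minimal sketch of the projection formula above, the following uses plain
Python lists so it runs without dependencies; production code would presumably
vectorize this with NumPy. The function names `project_token` and
`project_sequence` and the toy basis are hypothetical illustrations, not the
PoC's code.

```python
def project_token(activation, mean, basis):
    """Return z = V^T (activation - mean) for one token position.

    basis holds one row per retained SVD direction (k rows of length hidden_dim).
    """
    centered = [a - m for a, m in zip(activation, mean)]
    return [sum(b * c for b, c in zip(direction, centered)) for direction in basis]


def project_sequence(activations, mean, basis):
    """Project every token position: (seq_len, hidden_dim) -> (seq_len, k)."""
    return [project_token(act, mean, basis) for act in activations]


# Toy example (assumed values): hidden_dim = 4, k = 3 axis-aligned directions.
mean = [1.0, 1.0, 1.0, 1.0]
basis = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
activations = [[2.0, 1.0, 0.0, 5.0], [1.0, 1.0, 1.0, 1.0]]
z = project_sequence(activations, mean, basis)  # shape (seq_len=2, 3)
```

The second token equals the mean, so its z-coordinate is the origin. Skipping
the centering step would shift every projection by the same offset, which is
why the stored thresholds would no longer line up.
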
### SVD Basis

Singular Value Decomposition of the activation space from a calibration dataset
reveals the principal components (directions) that capture the most variance.
The top-k components form the basis that the codebook uses for projection.

Key properties:

- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.

### Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.

Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).

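The mapping from a continuous score to the three alarm levels could look like
the sketch below. It assumes a score where higher means more anomalous and two
cut-offs named after the `suspicious_threshold` and `dangerous_threshold`
config fields; the enum values, the `>=` comparison, and the example thresholds
are assumptions made for illustration.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def alarm_level(score, suspicious_threshold, dangerous_threshold):
    """Map a score (higher = more anomalous) to an alarm level."""
    if suspicious_threshold > dangerous_threshold:
        raise ValueError("suspicious_threshold must not exceed dangerous_threshold")
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Hypothetical thresholds, chosen only for the example.
levels = [alarm_level(s, 0.5, 0.9) for s in (0.2, 0.6, 0.95)]
```
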
### Copula Decomposition

Raw z-coordinates are not the detection feature space. The codebook
decomposes z-coordinates through a copula transform that separates **scale**
(how far from normal) from **position** (which behavioral direction):

```
z → CDF → (x₀, x₁, x₂)             # Uniform marginals via CDF transform
  → S = x₀ + x₁ + x₂               # Scale: total CDF magnitude
  → x_norm = simplex(x)            # Normalize to probability simplex
  → (u, v) = barycentric(x_norm)   # Position: 2D simplex coordinates
```

The three derived features `(S, u, v)` form the actual detection space:

- **S (scale)**: How far the input's z-coordinates deviate from the
  population norm, aggregated across all three SVD dimensions. High S means
  the input is anomalous in *magnitude*.
- **u, v (position)**: Where the input sits on the behavioral simplex, that is,
  which *direction* the deviation points. Different behavioral patterns
  (refusal, instruction-following, self-reference) separate along different
  (u, v) axes.

This decomposition is why the codebook can distinguish "this input activates
the refusal direction" from "this input is just generally unusual": the same
S value with different (u, v) coordinates implies different behavioral
patterns.

The PoC's `decompose()` method implements this pipeline as a pure function.
It is called both during codebook compilation (to compute direction
profiles) and during inference (to transform new z-coordinates for
classification).

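The chain z → CDF → simplex → barycentric can be sketched for a single token
as follows. This is not the PoC's `decompose()`: it substitutes an empirical
CDF over a sorted calibration sample for the spline CDF, and it assumes a
particular triangle embedding (vertices at (0, 0), (1, 0), and (0.5, √3/2)) for
the barycentric step, since the PoC's exact convention is not given here.

```python
import bisect
import math


def empirical_cdf(sorted_sample, value):
    """Stand-in for the spline CDF; result stays strictly inside (0, 1)."""
    rank = bisect.bisect_right(sorted_sample, value)
    return (rank + 0.5) / (len(sorted_sample) + 1)


def decompose_token(z, calibration):
    """Map one z-vector (3 values) to (S, u, v).

    calibration: three sorted lists, one per SVD dimension (assumed layout).
    """
    x = [empirical_cdf(col, zi) for col, zi in zip(calibration, z)]
    s = sum(x)                          # scale: total CDF magnitude
    x_norm = [xi / s for xi in x]       # point on the probability simplex
    # Assumed embedding of the simplex into the plane.
    u = x_norm[1] + 0.5 * x_norm[2]
    v = (math.sqrt(3) / 2) * x_norm[2]
    return s, u, v


calibration = [[-2.0, -1.0, 0.0, 1.0, 2.0]] * 3
s_mid, u_mid, v_mid = decompose_token([0.0, 0.0, 0.0], calibration)
s_skew, u_skew, v_skew = decompose_token([3.0, -3.0, -3.0], calibration)
```

A balanced input lands at the centroid of the triangle, while an input that is
extreme only along the first dimension is pulled toward the first vertex. Note
that the Internal API documents a `u_sum` output (the CDF of S); this sketch
returns the raw sum instead.
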
### Direction Profiles and Contrast Pairs

The codebook doesn't just detect "anomalous"; it detects specific behavioral
**directions**. Each direction is defined by a contrast pair of conditions:

| Contrast Pair | Condition A | Condition B | Behavioral Direction |
|---------------|-------------|-------------|---------------------|
| refusal | harmful | harmless | Refusal activation |
| instruction_vs_data | instruction | data | Instruction-following |
| tool_call | tool_call | natural_language | Tool call patterns |
| self_vs_other | self_ref | other_ref | Self-reference |
| semantic_violation | violated | expected | Semantic norm violation |
| uncertainty | uncertain | confident | Uncertainty expression |
| injection | injection | benign_instruction | Prompt injection |

For each contrast pair, the codebook computes a **DirectionProfile**: the
statistical baseline (means, pooled standard deviations, Cohen's d) of the
(S, u, v) features for both conditions. This enables:

1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
   features of condition A vs condition B. Produces P(active | features),
   the probability that the input exhibits the "active" behavioral pattern.
2. **Thresholds**: Midpoints between condition means for each feature, used
   for interpretable rule-based detection as a fallback.

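The per-feature statistics in a direction profile can be sketched with the
standard library. The helper names and the sign convention for Cohen's d
(condition A minus condition B) are assumptions; the sample values are made up
for the example.

```python
import math
import statistics


def pooled_std(a, b):
    """Pooled standard deviation of two samples (sample variances, n - 1)."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    dof = len(a) + len(b) - 2
    return math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b) / dof)


def feature_profile(a, b):
    """Baseline statistics for one feature (S, u, or v) of a contrast pair."""
    mean_a = statistics.fmean(a)
    mean_b = statistics.fmean(b)
    std = pooled_std(a, b)
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std,
        "cohen_d": (mean_a - mean_b) / std,   # assumed sign: A minus B
        "threshold": (mean_a + mean_b) / 2,   # midpoint between condition means
    }


# Hypothetical S values for the "refusal" pair (harmful vs harmless).
profile = feature_profile([0.8, 0.9, 1.0], [0.2, 0.3, 0.4])
```

With these toy values the two conditions are separated by six pooled standard
deviations, and the rule-based fallback threshold sits halfway between the
condition means.
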
### Token-Level Smoothing

During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
The detection pipeline optionally applies a rolling average (uniform kernel)
to the decomposed (S, u, v) features before classification:

- **window=1**: No smoothing. Each token position classified independently.
- **window=8** (PoC default): Smooth features across 8 token positions.
  Reduces noise from individual token fluctuations while preserving
  sustained behavioral signals.

Smoothing is an inference-time parameter: it does not affect codebook
compilation or thresholds. The codebook is calibrated on per-token
z-coordinates (all positions from the calibration data, flattened into
`(N, 3)`), so the classifier weights are valid regardless of the smoothing
window used at inference time.

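A uniform-kernel rolling average over one feature sequence might look like the
sketch below. How the PoC treats the first few positions is not stated here;
this version assumes a trailing window that is simply shorter at the start of
the sequence.

```python
def rolling_mean(values, window=8):
    """Trailing rolling average; the window is truncated at the sequence start."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


unsmoothed = rolling_mean([1, 2, 3, 4], window=1)
smoothed = rolling_mean([1, 2, 3, 4], window=2)
```

The same function would be applied independently to the S, u, and v sequences
before they are passed to the classifiers.
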
### Spline Distributions

Monotonic spline distributions model the distribution of z-coordinates along
each SVD dimension (ADR-010). They serve two roles in the codebook:

1. **CDF transform**: The copula decomposition requires mapping z-coordinates
   to uniform marginals via the CDF. The spline CDF provides this transform.
2. **Scale distribution**: A separate spline distribution models the
   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.

They provide:

- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training

The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `SplineDistribution` class, ~378 lines).

**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with typically 10–64 knots, depending on calibration data
size. Knot positions are determined by quantiles of the calibration data
(ensuring density of knots where data is dense). Beyond the extreme knots,
the CDF approaches 0 (lower tail) and 1 (upper tail) exponentially, at a rate
fitted to the tail data.

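To make the shape of the definition concrete, here is a simplified sketch with
quantile-placed knots and exponential tails. It deliberately swaps the
monotonic cubic spline for linear interpolation between knots, uses a fixed
`tail_rate` instead of one fitted to the tail data, and assumes the knot
positions come out strictly increasing. It is not the `SplineDistribution`
class.

```python
import bisect
import math


def fit_cdf_knots(sample, n_knots=5):
    """Place (x, p) knots at evenly spaced quantiles of the calibration sample."""
    data = sorted(sample)
    n = len(data)
    knots = []
    for j in range(1, n_knots + 1):
        index = min(n - 1, j * n // (n_knots + 1))
        knots.append((data[index], j / (n_knots + 1)))
    return knots


def cdf(x, knots, tail_rate=1.0):
    """Piecewise-linear CDF between knots, exponential tails outside them."""
    xs = [k[0] for k in knots]
    ps = [k[1] for k in knots]
    if x <= xs[0]:      # lower tail: approaches 0
        return ps[0] * math.exp(tail_rate * (x - xs[0]))
    if x >= xs[-1]:     # upper tail: approaches 1
        return 1.0 - (1.0 - ps[-1]) * math.exp(-tail_rate * (x - xs[-1]))
    i = bisect.bisect_right(xs, x)          # xs[i - 1] <= x < xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ps[i - 1] + t * (ps[i] - ps[i - 1])


knots = fit_cdf_knots(range(-10, 11), n_knots=5)
grid = [cdf(x / 2, knots) for x in range(-40, 41)]
```

The result is continuous at the extreme knots and monotonic over the whole
grid, which is the property the copula step relies on.
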
### Calibration Dataset

The calibration dataset serves two purposes: establishing the population
distribution (normal behavioral baseline) and providing contrast pairs
(labeled examples for each behavioral direction).

**Population data**: Diverse normal inputs representative of the deployment
domain. No adversarial examples: the population models *normal* behavior,
and anomalies are detected as deviations from it. Each prompt is processed
by the detector model, and z-coordinates are extracted at every token
position. The flattened `(N, 3)` tensor of all positions forms the population.

**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
instruction/data) that define each behavioral direction. Each condition
produces a set of z-coordinates that, after copula decomposition, reveal
where the conditions separate in (S, u, v) space.

Requirements:

- **Composition**: Population must cover the range of normal inputs the
  detector will see in production. Contrast pairs must be clearly distinct
  along their target behavioral direction.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 prompts for population. Each contrast
  condition needs at least 50–200 prompts.
- **Diversity**: A narrow population (e.g., only short English queries) will
  produce high false positive rates on unusual but benign inputs.
- **Model-specific**: Calibration data must be collected for each detector
  model by running that model on the inputs and extracting activations.

The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing with `max_length=128` tokens per
prompt.

### Codebook Compilation

The codebook is compiled offline by a training pipeline that:

1. Runs the detector model on a calibration dataset (population + contrast
   pairs)
2. Extracts hidden state activations at configured layers for every token
   position (not just last-token)
3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
   deterministic decomposition)
4. Projects activations onto the top-3 SVD components → z-coordinates
5. Fits spline distributions on each SVD dimension and the sum S
6. Applies copula decomposition to all z-coordinates → (S, u, v) features
7. Computes direction profiles (means, pooled std, Cohen's d) for each
   contrast pair
8. Trains logistic classifiers on (S, u, v) for each contrast pair
9. Computes detection thresholds (midpoints between condition means)
10. Serializes the codebook to a portable format (safetensors + JSON config)

This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
keeps the Phase 1 installation simple: no additional download step beyond the
model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.

## Package Structure

Based on analysis of the PoC codebook
([poc-architecture.md](../research/codebook-analysis/poc-architecture.md)),
the production codebook decomposes into:

```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py          # Public exports
│   ├── codebook.py          # Codebook class (init, load, project, score)
│   ├── transforms.py        # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py           # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py          # DirectionProfile, population stats
│   ├── classifiers.py       # DirectionClassifier (logistic weights)
│   ├── results.py           # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py        # project(), decompose()
│   └── detection.py         # detect(), threshold comparison
├── training/
│   ├── __init__.py
│   ├── compiler.py          # build(): SVD, spline fitting, profile comp
│   ├── stats.py             # pooled_std, cohen_d, silhouette
│   └── data_loader.py       # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── classifiers.safetensors
        ├── splines.json
        ├── profiles.json
        └── config.json
```

### Extraction from PoC

The PoC `firewall_codebook.py` is 1,245 lines with significant duplication
(the decomposition pipeline z → CDF → simplex → barycentric → (sum, u, v) is
repeated 5 times). Analysis identifies:

- **~480 lines of essential runtime code** in the PoC
- **~178 lines needed from metaspline core** (SplineDistribution,
  MonotonicCubicSpline, ensure_strictly_increasing, simplex)
- **~130 lines of histogram classifier**: exploratory alternative, not MVP
  (the continuous logistic classifier is superior)
- **~95 lines of AUC evaluation**: testing tool, not runtime
- **~429 lines in `build()`**: must be decomposed. Training moves to
  `training/compiler.py`, runtime state becomes immutable serialized data

Target: **~400–500 lines runtime + ~150–200 lines training** (~550–700 lines
in total). Against the PoC's 1,245 lines, the runtime portion alone is the
**~65% compression** figure (roughly 60–68%); the combined total is closer to a
44–56% reduction.

### Key Extraction Decisions

1. **`build()` moves entirely to `training/compiler.py`**: Runtime codebook
   is read-only. The codebook class should not have a `build()` method.
2. **`decompose()` becomes a pure function**: `decompose(z, splines)` is a
   pure mathematical transform. No state dependencies beyond splines.
3. **Detection is separate from the codebook class**: `detect()` is a
   stateless function given codebook data. Enables swapping detection
   strategies without touching the codebook.
4. **Only four components of the 502-line metaspline core are needed at
   runtime** (~178 lines): `SplineDistribution`, `MonotonicCubicSpline`,
   `ensure_strictly_increasing`, and `simplex()`. Everything else
   (DensitySpline, unfold/fold, dcs_norm) is dropped entirely.
5. **Saved `.pt` files from the PoC provide golden test data**: manifold
   projection results for Qwen3-0.6B/1.7B can be reused for integration tests.

## Data Format

The codebook is stored as:

```
codebook/
├── basis.safetensors        # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors      # Region boundary parameters
├── classifiers.safetensors  # Logistic classifier weights per direction
├── splines.json             # Spline knot positions and coefficients
├── profiles.json            # Direction profiles (means, stds, Cohen's d)
└── config.json              # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
```

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.

### Tensor Specifications

**basis.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**classifiers.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |

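Given the four classifier arrays above, P(active) for every direction can be
evaluated from one (S, u, v) triple, as in this sketch. It assumes the features
are fed to the logistic function unscaled and that the array order matches a
list of direction labels; the function name and all weight values are invented
for the example.

```python
import math


def direction_probabilities(features, labels, weights_sum, weights_u,
                            weights_v, intercepts):
    """Return {label: P(active)} for one token's (S, u, v) features."""
    s, u, v = features
    probs = {}
    for i, label in enumerate(labels):
        logit = (weights_sum[i] * s + weights_u[i] * u
                 + weights_v[i] * v + intercepts[i])
        probs[label] = 1.0 / (1.0 + math.exp(-logit))
    return probs


# Hypothetical weights for two directions.
probs = direction_probabilities(
    features=(1.5, 0.5, 0.2),
    labels=["refusal", "injection"],
    weights_sum=[2.0, 0.0],
    weights_u=[0.0, 4.0],
    weights_v=[0.0, 0.0],
    intercepts=[-3.0, -1.0],
)
```

Storing one scalar per feature per direction keeps the runtime classifier to a
few multiply-adds per token and direction.
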
**splines.json**:

| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

**profiles.json**:

| Field | Type | Description |
|-------|------|-------------|
| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |

Each `DirectionProfile` entry contains:

| Field | Type | Description |
|-------|------|-------------|
| `label` | `str` | Direction name (e.g., "refusal") |
| `sum_mean_a` | `float` | Mean S for condition A |
| `sum_mean_b` | `float` | Mean S for condition B |
| `sum_std_pooled` | `float` | Pooled std of S |
| `u_mean_a` | `float` | Mean u for condition A |
| `u_mean_b` | `float` | Mean u for condition B |
| `u_std_pooled` | `float` | Pooled std of u |
| `v_mean_a` | `float` | Mean v for condition A |
| `v_mean_b` | `float` | Mean v for condition B |
| `v_std_pooled` | `float` | Pooled std of v |
| `cohen_d_sum` | `float` | Effect size for S |
| `cohen_d_u` | `float` | Effect size for u |
| `cohen_d_v` | `float` | Effect size for v |

## Interfaces

### Internal API

```python
from __future__ import annotations  # allows forward references such as -> Codebook

from dataclasses import dataclass
from pathlib import Path

import numpy as np

# DetectionResult is defined in results.py (see Package Structure).


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float
    dangerous_threshold: float
    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
    smoothing_window: int = 8  # Token-level smoothing (inference only)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates.

        Returns: (seq_len, 3) z-coordinates.
        """
        ...

    def decompose(self, z_coords: np.ndarray) -> dict:
        """Copula decomposition: z → CDF → (S, u, v).

        Args:
            z_coords: (seq_len, 3) or (N, 3) z-coordinates

        Returns:
            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
        """
        ...

    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
        """Classify decomposed features using logistic classifiers.

        Args:
            features: Output of decompose(), with (seq_len,) arrays
            window: Smoothing window size (1 = no smoothing)

        Returns:
            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
        """
        ...

    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
               min_positions: int = 3, window: int = 8) -> DetectionResult:
        """Detection pipeline on projected input: decompose → smooth → classify → flag.

        The input is already projected; call project() first on raw activations.

        Args:
            z_coords: (seq_len, 3) z-coordinates for a single input
            threshold_prob: P(active) threshold for flagging a direction
            min_positions: Minimum token positions above threshold to flag
            window: Smoothing window for token-level features
        """
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```

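The final "flag" step of `detect()` can be illustrated with the
`threshold_prob` and `min_positions` parameters from the signature above. This
sketch assumes that a position counts when its probability is greater than or
equal to the threshold and that qualifying positions need not be consecutive;
neither detail is specified here, and `flag_directions` is a hypothetical
helper.

```python
def flag_directions(probs_by_direction, threshold_prob=0.7, min_positions=3):
    """Summarize per-token P(active) and flag directions with enough hits."""
    summary = {}
    for label, probs in probs_by_direction.items():
        hits = sum(1 for p in probs if p >= threshold_prob)
        summary[label] = {
            "mean_prob": sum(probs) / len(probs),
            "max_prob": max(probs),
            "positions_above": hits,
            "flagged": hits >= min_positions,
        }
    return summary


# Hypothetical smoothed probabilities for a four-token input.
result = flag_directions({
    "refusal": [0.9, 0.8, 0.75, 0.2],
    "injection": [0.9, 0.1, 0.1, 0.1],
})
```

Requiring several positions above the threshold means a single high-scoring
token, as in the `injection` row, does not raise a flag on its own.
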
### Constraints

1. **Immutable at runtime**: The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound**: A codebook is valid only for the specific model it was
   compiled for. Loading a codebook with the wrong model produces undefined
   results.
3. **Deterministic**: Same codebook + same activations = same scores.
4. **Portable**: Codebook can be saved to disk and reloaded without
   recomputation. Can be distributed via HuggingFace Hub.

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-02**~~: ~~What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed?~~ (resolved: ~65% compression to 500–600 lines; see Package Structure section)
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both, with model-specific defaults that are user-overridable)