alknet-firewall/docs/architecture/codebook.md

---
status: draft
last_updated: 2026-06-13
---

# Codebook

The codebook contains the compiled detection parameters — SVD basis vectors,
behavioral region boundaries, and scoring distributions — that the firewall
uses to detect adversarial inputs.

## What It Is

The codebook is the "compiled detector": the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what trained weights are to a classifier: the result of an offline
compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.

## Why It Exists

Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:

- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.

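At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Projection
costs O(k · d) per token position, where k is the number of retained
dimensions and d is the hidden dimension; the comparison against the regions
adds work proportional to k.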

## Key Concepts

### z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

```
z = V^T @ (activation - mean)
```

Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
mean activation from the calibration dataset. The centering step is critical
— without it, projections are offset by the mean and thresholds would be
incorrect.

z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.

**Training shape**: `(N, 3)` where N is the total number of token positions
across all calibration prompts. Each token position produces its own
z-coordinate, so the population data is a flattened collection of per-token
z-vectors.

**Inference shape**: `(seq_len, 3)` for a single input. Each token position
in the input sequence produces a z-coordinate. The detection pipeline
operates on this per-token sequence, optionally smoothing it before
classification.
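
The centering and projection described above can be sketched in NumPy. The function name `project_to_z` and the row-major layout of `basis` (one basis vector per row, so the code computes `(activation - mean) @ basis.T`) are assumptions made for illustration, not the PoC's API.

```python
import numpy as np

def project_to_z(activations: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Project per-token activations onto a precomputed basis.

    activations: (seq_len, hidden_dim) hidden states for one input
    basis:       (k, hidden_dim) top-k right-singular vectors, one per row
    mean:        (hidden_dim,) calibration mean used for centering
    returns:     (seq_len, k) raw z-coordinates
    """
    centered = activations - mean      # center first, as the text requires
    return centered @ basis.T          # row-wise form of V^T @ (activation - mean)

# Toy data: hidden_dim=4, k=3, seq_len=2 (values are illustrative only)
basis = np.eye(4)[:3]
mean = np.ones(4)
acts = np.array([[2.0, 1.0, 1.0, 5.0],
                 [1.0, 3.0, 0.0, 1.0]])
z = project_to_z(acts, basis, mean)    # shape (2, 3)
```

Skipping the subtraction in this toy case would shift every coordinate by 1, which is the offset the centering step removes.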

### SVD Basis

Singular Value Decomposition of the activation space from a calibration dataset
reveals the principal components (directions) that capture the most variance.
The top-k components form the basis that the codebook uses for projection.

Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
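
As a rough illustration of how such a basis could be obtained, the sketch below centers a calibration matrix and keeps the top-k right-singular vectors. It uses `numpy.linalg.svd` in place of the `torch.linalg.svd` call the pipeline uses, and `fit_svd_basis` is a hypothetical helper name.

```python
import numpy as np

def fit_svd_basis(calib: np.ndarray, k: int = 3):
    """Return (mean, basis, singular_values) for the top-k directions.

    calib: (N, hidden_dim) activations from the calibration dataset
    basis: (k, hidden_dim), rows are right-singular vectors
    """
    mean = calib.mean(axis=0)
    # Economy SVD; rows of vt are right-singular vectors, sorted by singular value
    _, singular_values, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt[:k], singular_values[:k]

# Random stand-in for real calibration activations
rng = np.random.default_rng(0)
calib = rng.normal(size=(200, 16))
mean, basis, sv = fit_svd_basis(calib, k=3)
```

Singular vectors are only defined up to sign, so a pipeline that compares bases across runs or libraries would likely need a sign convention on top of this.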

### Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.

Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).
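
A minimal sketch of the score-to-level mapping, assuming a scalar score where higher means more anomalous and inclusive thresholds. The threshold values `0.7` and `0.9` are placeholders, not values from the codebook.

```python
from enum import Enum

class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"

def alarm_level(score: float, suspicious_threshold: float,
                dangerous_threshold: float) -> AlarmLevel:
    """Map a score to an alarm level; checks the stricter threshold first."""
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR

# Placeholder thresholds for illustration only
levels = [alarm_level(s, 0.7, 0.9) for s in (0.1, 0.75, 0.95)]
```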

### Copula Decomposition

Raw z-coordinates are not the detection feature space. The codebook
decomposes z-coordinates through a copula transform that separates **scale**
(how far from normal) from **position** (which behavioral direction):

```
z → CDF → (x₀, x₁, x₂)     # Uniform marginals via CDF transform
  → S = x₀ + x₁ + x₂       # Scale: total CDF magnitude
  → x_norm = simplex(x)      # Normalize to probability simplex
  → (u, v) = barycentric(x_norm)  # Position: 2D simplex coordinates
```

The three derived features `(S, u, v)` form the actual detection space:

- **S (scale)**: How far the input's z-coordinates deviate from the
  population norm, aggregated across all three SVD dimensions. High S means
  the input is anomalous in *magnitude*.
- **u, v (position)**: Where the input sits on the behavioral simplex —
  which *direction* the deviation points. Different behavioral patterns
  (refusal, instruction-following, self-reference) separate along different
  (u, v) axes.

This decomposition is why the codebook can distinguish "this input activates
the refusal direction" from "this input is just generally unusual" — the same
S value with different (u, v) coordinates implies different behavioral
patterns.

The PoC's `decompose()` method implements this pipeline as a pure function.
It is called both during codebook compilation (to compute direction
profiles) and during inference (to transform new z-coordinates for
classification).
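
The scale/position split can be sketched as follows. This version starts from CDF-transformed values (the spline step is omitted) and returns raw `S`. The equilateral-triangle embedding used for `(u, v)` is an assumption, since the PoC's exact barycentric mapping is not reproduced here.

```python
import numpy as np

SQRT3_2 = np.sqrt(3.0) / 2.0

def decompose_uniform(x: np.ndarray, eps: float = 1e-12) -> dict:
    """Split CDF-transformed coordinates into scale S and position (u, v).

    x: (n, 3) values in [0, 1], i.e. z-coordinates after the per-dimension CDF
    """
    s = x.sum(axis=1)                            # S = x0 + x1 + x2
    x_norm = x / np.maximum(s, eps)[:, None]     # rows now sum to 1 (simplex)
    # ASSUMPTION: triangle vertices (0, 0), (1, 0), (0.5, sqrt(3)/2)
    # for components 0, 1, 2 respectively.
    u = x_norm[:, 1] + 0.5 * x_norm[:, 2]
    v = SQRT3_2 * x_norm[:, 2]
    return {"S": s, "u": u, "v": v}

x = np.array([[0.2, 0.2, 0.2],    # small, balanced
              [0.9, 0.9, 0.9],    # large, balanced
              [0.9, 0.1, 0.1],    # dominated by dimension 0
              [0.1, 0.9, 0.1]])   # dominated by dimension 1
f = decompose_uniform(x)
```

Rows 0 and 1 land on the same `(u, v)` with different `S`, while rows 2 and 3 share `S` but sit at different positions, which is the separation the section describes.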

### Direction Profiles and Contrast Pairs

The codebook doesn't just detect "anomalous" — it detects specific behavioral
**directions**. Each direction is defined by a contrast pair of conditions:

| Contrast Pair | Condition A | Condition B | Behavioral Direction |
|---------------|-------------|-------------|---------------------|
| refusal | harmful | harmless | Refusal activation |
| instruction_vs_data | instruction | data | Instruction-following |
| tool_call | tool_call | natural_language | Tool call patterns |
| self_vs_other | self_ref | other_ref | Self-reference |
| semantic_violation | violated | expected | Semantic norm violation |
| uncertainty | uncertain | confident | Uncertainty expression |
| injection | injection | benign_instruction | Prompt injection |

For each contrast pair, the codebook computes a **DirectionProfile** — the
statistical baseline (means, pooled standard deviations, Cohen's d) of the
(S, u, v) features for both conditions. This enables:

1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
   features of condition A vs condition B. Produces P(active | features) —
   the probability that the input exhibits the "active" behavioral pattern.
2. **Thresholds**: Midpoints between condition means for each feature, used
   for interpretable rule-based detection as a fallback.
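
The per-feature statistics in a direction profile can be sketched with the standard library. The helper names, the A-minus-B sign convention for Cohen's d, and the sample values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_std(a, b):
    """Pooled standard deviation of two samples (unbiased variances)."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))

def profile_feature(a, b):
    """Baseline statistics for one feature (S, u, or v) of a contrast pair."""
    sd = pooled_std(a, b)
    return {
        "mean_a": mean(a),
        "mean_b": mean(b),
        "std_pooled": sd,
        "cohen_d": (mean(a) - mean(b)) / sd,       # effect size, A minus B
        "threshold": (mean(a) + mean(b)) / 2,      # midpoint rule from the text
    }

# Made-up S values for the two conditions of the "refusal" pair
s_harmful = [2.0, 2.2, 2.4]
s_harmless = [1.0, 1.2, 1.4]
p = profile_feature(s_harmful, s_harmless)
```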

### Token-Level Smoothing

During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
The detection pipeline optionally applies a rolling average (uniform kernel)
to the decomposed (S, u, v) features before classification:

- **window=1**: No smoothing. Each token position classified independently.
- **window=8** (PoC default): Smooth features across 8 token positions.
  Reduces noise from individual token fluctuations while preserving
  sustained behavioral signals.

Smoothing is an inference-time parameter — it does not affect codebook
compilation or thresholds. The codebook is calibrated on per-token
z-coordinates (all positions from the calibration data, flattened into
`(N, 3)`), so the classifier weights are valid regardless of the smoothing
window used at inference time.
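
A rolling average with a uniform kernel might look like the sketch below. Repeating the edge values so the output keeps `seq_len` rows is an assumption; the PoC may treat sequence boundaries differently.

```python
import numpy as np

def smooth_features(features: np.ndarray, window: int = 8) -> np.ndarray:
    """Rolling mean over the token axis with a uniform kernel.

    features: (seq_len, n_features), e.g. columns S, u, v
    window:   1 disables smoothing
    """
    if window <= 1:
        return features.copy()
    left = window // 2
    right = window - 1 - left
    # ASSUMPTION: pad by repeating edge rows so the output length is unchanged
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    kernel = np.full(window, 1.0 / window)
    columns = [np.convolve(padded[:, j], kernel, mode="valid")
               for j in range(features.shape[1])]
    return np.stack(columns, axis=1)

# A single-token spike of 8.0 becomes eight positions of 1.0
spike = np.zeros((16, 3))
spike[8, 0] = 8.0
smoothed = smooth_features(spike, window=8)
```

The example shows why smoothing suppresses isolated fluctuations: a one-token spike is diluted by the window size, whereas a signal sustained over the whole window keeps its level.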

### Spline Distributions

Monotonic spline distributions model the distribution of calibration data
along each SVD dimension by fitting its CDF (ADR-010). They serve two roles
in the codebook:

1. **CDF transform**: The copula decomposition requires mapping z-coordinates
   to uniform marginals via the CDF. The spline CDF provides this transform.
2. **Scale distribution**: A separate spline distribution models the
   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.

They provide:
- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training

The spline distribution approach is adapted from the metaspline PoC
(`spline.py` — `SplineDistribution` class, ~378 lines).

**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline, typically with 10–64 knots depending on calibration data size.
Knot positions are determined by quantiles of the calibration data (ensuring
density of knots where data is dense). Beyond the extreme knots, the tail
probability (the CDF below the lowest knot, and one minus the CDF above the
highest knot) decays exponentially at a rate fitted to the tail data.
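
The sketch below shows the overall shape of such a CDF: quantile-placed knots with exponential tails. To stay short it swaps in linear interpolation between knots for the monotonic cubic spline, and it fits each tail rate as the reciprocal of the mean exceedance. Both choices, and the class name `QuantileCDF`, are assumptions rather than the `SplineDistribution` implementation.

```python
import numpy as np

class QuantileCDF:
    """Quantile-knot CDF with exponential tails (linear stand-in for the spline)."""

    def __init__(self, samples, n_knots: int = 16, tail_q: float = 0.01):
        samples = np.sort(np.asarray(samples, dtype=float))
        self.probs = np.linspace(tail_q, 1.0 - tail_q, n_knots)
        # Quantile knots are dense where the data is dense; continuous data
        # is assumed so that the knots come out strictly increasing.
        self.knots = np.quantile(samples, self.probs)
        lo, hi = self.knots[0], self.knots[-1]
        # ASSUMPTION: exponential rate = 1 / mean exceedance beyond the knot
        self.rate_lo = self._rate(lo - samples[samples < lo])
        self.rate_hi = self._rate(samples[samples > hi] - hi)

    @staticmethod
    def _rate(exceedances: np.ndarray) -> float:
        return 1.0 / exceedances.mean() if exceedances.size else 1.0

    def cdf(self, z) -> np.ndarray:
        z = np.asarray(z, dtype=float)
        lo, hi = self.knots[0], self.knots[-1]
        p_lo, p_hi = self.probs[0], self.probs[-1]
        inner = np.interp(z, self.knots, self.probs)
        # Tails: probability mass outside the knots shrinks exponentially
        lower = p_lo * np.exp(self.rate_lo * np.minimum(z - lo, 0.0))
        upper = 1.0 - (1.0 - p_hi) * np.exp(-self.rate_hi * np.maximum(z - hi, 0.0))
        return np.where(z < lo, lower, np.where(z > hi, upper, inner))

rng = np.random.default_rng(0)
dist = QuantileCDF(rng.normal(size=5000))
grid = np.linspace(-6.0, 6.0, 241)
values = dist.cdf(grid)
```

Because the tails never reach exactly 0 or 1, inputs far outside the calibration range still map to distinct CDF values instead of saturating.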

### Calibration Dataset

The calibration dataset serves two purposes: establishing the population
distribution (normal behavioral baseline) and providing contrast pairs
(labeled examples for each behavioral direction).

**Population data**: Diverse normal inputs representative of the deployment
domain. No adversarial examples — the population models *normal* behavior,
and anomalies are detected as deviations from it. Each prompt is processed
by the detector model, and z-coordinates are extracted at every token
position. The flattened `(N, 3)` tensor of all positions forms the population.

**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
instruction/data) that define each behavioral direction. Each condition
produces a set of z-coordinates that, after copula decomposition, reveal
where the conditions separate in (S, u, v) space.

Requirements:
- **Composition**: Population must cover the range of normal inputs the
  detector will see in production. Contrast pairs must be clearly distinct
  along their target behavioral direction.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 prompts for population. Each contrast
  condition needs at least 50–200 prompts.
- **Diversity**: A narrow population (e.g., only short English queries) will
  produce high false positive rates on unusual but benign inputs.
- **Model-specific**: Calibration data must be collected for each detector
  model by running that model on the inputs and extracting activations.

The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing with `max_length=128` tokens per
prompt.
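
How per-prompt z-coordinates become the flattened `(N, 3)` population can be sketched as below. Truncating the z-arrays directly stands in for truncation at tokenization time, and `flatten_population` is a hypothetical helper name.

```python
import numpy as np

MAX_LENGTH = 128  # tokens per prompt, matching the PoC setting quoted above

def flatten_population(per_prompt_z):
    """Stack per-prompt (seq_len_i, 3) arrays into one (N, 3) population array."""
    return np.concatenate([z[:MAX_LENGTH] for z in per_prompt_z], axis=0)

rng = np.random.default_rng(0)
# Three prompts of 5, 128 and 300 tokens; the last one is cut to 128
prompts = [rng.normal(size=(n, 3)) for n in (5, 128, 300)]
population = flatten_population(prompts)   # N = 5 + 128 + 128
```

Long prompts contribute more rows than short ones, so the length mix of the calibration prompts affects the fitted population as well as their content.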

### Codebook Compilation

The codebook is compiled offline by a training pipeline that:

1. Runs the detector model on a calibration dataset (population + contrast
   pairs)
2. Extracts hidden state activations at configured layers for every token
   position (not just last-token)
3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
   deterministic decomposition)
4. Projects activations onto the top-3 SVD components → z-coordinates
5. Fits spline distributions on each SVD dimension and the sum S
6. Applies copula decomposition to all z-coordinates → (S, u, v) features
7. Computes direction profiles (means, pooled std, Cohen's d) for each
   contrast pair
8. Trains logistic classifiers on (S, u, v) for each contrast pair
9. Computes detection thresholds (midpoints between condition means)
10. Serializes the codebook to a portable format (safetensors + JSON config)
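
Step 8 could be approximated by plain batch gradient descent on the log-loss, as in the sketch below. The optimizer, learning rate, step count, and synthetic data are all assumptions; the PoC may fit its classifiers with a different routine.

```python
import numpy as np

def fit_logistic(features: np.ndarray, labels: np.ndarray,
                 lr: float = 0.5, steps: int = 2000):
    """Fit weights and intercept for P(condition A | S, u, v).

    features: (n, 3) columns S, u, v; labels: (n,) with 1 = condition A
    """
    mu, sigma = features.mean(axis=0), features.std(axis=0) + 1e-12
    x = (features - mu) / sigma                  # standardize for stable steps
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - labels                        # gradient of log-loss w.r.t. logit
        w -= lr * (x.T @ grad) / len(labels)
        b -= lr * grad.mean()
    # Fold the standardization back so the weights apply to raw features
    w_raw = w / sigma
    b_raw = b - np.dot(w_raw, mu)
    return w_raw, b_raw

# Synthetic contrast pair: the conditions differ in S and u but not in v
rng = np.random.default_rng(0)
cond_a = rng.normal([2.2, 0.6, 0.3], [0.2, 0.05, 0.05], size=(100, 3))
cond_b = rng.normal([1.2, 0.4, 0.3], [0.2, 0.05, 0.05], size=(100, 3))
x = np.vstack([cond_a, cond_b])
y = np.r_[np.ones(100), np.zeros(100)]
w, b0 = fit_logistic(x, y)
accuracy = ((x @ w + b0 > 0) == (y == 1)).mean()
```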

This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
keeps the Phase 1 installation simple — no additional download step beyond the
model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.

## Package Structure

Based on analysis of the PoC codebook
([poc-architecture.md](../research/codebook-analysis/poc-architecture.md)),
the production codebook decomposes into:

```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py            # Public exports
│   ├── codebook.py            # Codebook class (init, load, project, score)
│   ├── transforms.py          # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py             # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py            # DirectionProfile, population stats
│   ├── classifiers.py         # DirectionClassifier (logistic weights)
│   ├── results.py             # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py          # project(), decompose()
│   └── detection.py           # detect(), threshold comparison
├── training/
│   ├── __init__.py
│   ├── compiler.py            # build() — SVD, spline fitting, profile comp
│   ├── stats.py               # pooled_std, cohen_d, silhouette
│   └── data_loader.py         # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── classifiers.safetensors
        ├── splines.json
        ├── profiles.json
        └── config.json
```

### Extraction from PoC

The PoC `firewall_codebook.py` is 1,245 lines with significant duplication
(the decomposition pipeline z → CDF → simplex → barycentric → (sum, u, v) is
repeated 5 times). Analysis identifies:

- **~480 lines of essential runtime code** in the PoC
- **~178 lines needed from metaspline core** (SplineDistribution,
  MonotonicCubicSpline, ensure_strictly_increasing, simplex)
- **~130 lines of histogram classifier** — exploratory alternative, not MVP
  (the continuous logistic classifier is superior)
- **~95 lines of AUC evaluation** — testing tool, not runtime
- **~429 lines in `build()`** — must be decomposed: training moves to
  `training/compiler.py`, runtime state becomes immutable serialized data

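Target: **~400–500 lines runtime + ~150–200 lines training**. Against the
PoC's 1,245 lines, the runtime code alone is a ~60–68% reduction (roughly
65%); runtime plus training together is a ~44–56% reduction.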

### Key Extraction Decisions

1. **`build()` moves entirely to `training/compiler.py`** — Runtime codebook
   is read-only. The codebook class should not have a `build()` method.
2. **`decompose()` becomes a pure function** — `decompose(z, splines)` is a
   pure mathematical transform. No state dependencies beyond splines.
3. **Detection is separate from the codebook class** — `detect()` is a
   stateless function given codebook data. Enables swapping detection
   strategies without touching the codebook.
4. **Only four metaspline core components (~178 of 502 lines) are needed at
   runtime**: `SplineDistribution`, `MonotonicCubicSpline`,
   `ensure_strictly_increasing`, and `simplex()`. Everything else
   (DensitySpline, unfold/fold, dcs_norm) is dropped entirely.
5. **Saved `.pt` files from the PoC provide golden test data** — manifold
   projection results for Qwen3-0.6B/1.7B can be reused for integration tests.

## Data Format

The codebook is stored as:

```
codebook/
├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors    # Region boundary parameters
├── classifiers.safetensors # Logistic classifier weights per direction
├── splines.json           # Spline knot positions and coefficients
├── profiles.json          # Direction profiles (means, stds, Cohen's d)
└── config.json            # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
```

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.

### Tensor Specifications

**basis.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**classifiers.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
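
One way the four stored arrays could be combined at inference time is sketched below. Whether the scale input is raw `S` or its CDF-transformed value is not fixed by this table, so the feature column is simply treated as "the S feature"; the function name is hypothetical.

```python
import numpy as np

def direction_probabilities(features, weights_sum, weights_u, weights_v, intercepts):
    """P(active) per token position and direction.

    features: (seq_len, 3) columns S, u, v
    weights_*, intercepts: (n_directions,) arrays as stored in the codebook
    returns:  (seq_len, n_directions)
    """
    w = np.stack([weights_sum, weights_u, weights_v], axis=0)   # (3, n_directions)
    logits = features @ w + intercepts
    return 1.0 / (1.0 + np.exp(-logits))

# Two token positions, two directions; weights are illustrative only
features = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0]])
probs = direction_probabilities(
    features,
    weights_sum=np.array([2.0, 0.0]),
    weights_u=np.array([0.0, 0.0]),
    weights_v=np.array([0.0, 0.0]),
    intercepts=np.array([0.0, -1.0]),
)
summary = {"mean_prob": probs.mean(axis=0), "max_prob": probs.max(axis=0)}
```

Storing one array per feature keeps each tensor one-dimensional, at the cost of a stack at load time.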

**splines.json**:
| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

**profiles.json**:
| Field | Type | Description |
|-------|------|-------------|
| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |

Each `DirectionProfile` entry contains:
| Field | Type | Description |
|-------|------|-------------|
| `label` | `str` | Direction name (e.g., "refusal") |
| `sum_mean_a` | `float` | Mean S for condition A |
| `sum_mean_b` | `float` | Mean S for condition B |
| `sum_std_pooled` | `float` | Pooled std of S |
| `u_mean_a` | `float` | Mean u for condition A |
| `u_mean_b` | `float` | Mean u for condition B |
| `u_std_pooled` | `float` | Pooled std of u |
| `v_mean_a` | `float` | Mean v for condition A |
| `v_mean_b` | `float` | Mean v for condition B |
| `v_std_pooled` | `float` | Pooled std of v |
| `cohen_d_sum` | `float` | Effect size for S |
| `cohen_d_u` | `float` | Effect size for u |
| `cohen_d_v` | `float` | Effect size for v |

## Interfaces

### Internal API

```python
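from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np

from .results import DetectionResult  # results.py in the package layout


@dataclass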
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float
    dangerous_threshold: float
    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
    smoothing_window: int = 8                   # Token-level smoothing (inference only)

class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates.

        Returns: (seq_len, 3) z-coordinates.
        """
        ...

    def decompose(self, z_coords: np.ndarray) -> dict:
        """Copula decomposition: z → CDF → (S, u, v).

        Args:
            z_coords: (seq_len, 3) or (N, 3) z-coordinates

        Returns:
            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
        """
        ...

    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
        """Classify decomposed features using logistic classifiers.

        Args:
            features: Output of decompose(), with (seq_len,) arrays
            window: Smoothing window size (1 = no smoothing)

        Returns:
            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
        """
        ...

    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
               min_positions: int = 3, window: int = 8) -> DetectionResult:
        """Detection pipeline on z-coordinates: decompose → smooth → classify → flag.

        Projection is not performed here; pass the output of project().

        Args:
            z_coords: (seq_len, 3) z-coordinates for a single input
            threshold_prob: P(active) threshold for flagging a direction
            min_positions: Minimum token positions above threshold to flag
            window: Smoothing window for token-level features
        """
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```
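
The interaction of `threshold_prob` and `min_positions` can be illustrated with a small flagging rule. Counting any positions at or above the threshold (not necessarily contiguous) is an assumption about how `detect()` applies them, and `flag_directions` is a hypothetical helper.

```python
def flag_directions(probs_by_direction, threshold_prob=0.7, min_positions=3):
    """Return {direction: hit_count} for directions that meet both criteria."""
    flagged = {}
    for name, probs in probs_by_direction.items():
        hits = sum(p >= threshold_prob for p in probs)
        if hits >= min_positions:
            flagged[name] = hits
    return flagged

probs = {
    "refusal":   [0.1, 0.8, 0.9, 0.75, 0.2],   # three positions at or above 0.7
    "injection": [0.95, 0.1, 0.1, 0.1, 0.1],   # a single spike only
}
result = flag_directions(probs)
```

Under this rule a single very confident token does not raise a flag, which complements the smoothing window in filtering out isolated spikes.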

### Constraints

1. **Immutable at runtime** — The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound** — A codebook is valid only for the specific model it was
   compiled for. Loading a codebook with the wrong model produces undefined
   results.
3. **Deterministic** — Same codebook + same activations = same scores.
4. **Portable** — Codebook can be saved to disk and reloaded without
   recomputation. Can be distributed via HuggingFace Hub.
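
Constraints 1 and 2 suggest checking the codebook's recorded model against the loaded detector and failing fast on a mismatch. The sketch below assumes `config.json` stores the identity under the keys `model_id` and `revision`; the key names, the exception type, and the example identifiers are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen: instances cannot be mutated after loading
class CodebookIdentity:
    model_id: str
    model_revision: str

class CodebookModelMismatch(ValueError):
    """Raised when a codebook was compiled for a different model."""

def check_model_binding(config_json: str, model_id: str,
                        model_revision: str) -> CodebookIdentity:
    raw = json.loads(config_json)
    # ASSUMPTION: key names "model_id" and "revision" in config.json
    ident = CodebookIdentity(raw["model_id"], raw["revision"])
    if (ident.model_id, ident.model_revision) != (model_id, model_revision):
        raise CodebookModelMismatch(
            f"codebook built for {ident.model_id}@{ident.model_revision}, "
            f"got {model_id}@{model_revision}"
        )
    return ident

# Placeholder identifiers, not real model names
cfg = json.dumps({"model_id": "example-org/example-model", "revision": "abc123"})
ident = check_model_binding(cfg, "example-org/example-model", "abc123")
```

Raising at load time turns the "undefined results" case into an explicit error before any input is screened.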

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-02**~~: ~~What is the minimum viable codebook, and can the 1,245-line PoC codebook be compressed?~~ (resolved: ~400–500 lines of runtime code, roughly 65% smaller than the PoC, plus ~150–200 lines of training code; see Package Structure section)
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved — both: model-specific defaults, user-overridable)