---
status: draft
last_updated: 2026-06-13
---

# Codebook

The codebook contains the compiled detection parameters (SVD basis vectors,
behavioral region boundaries, and scoring distributions) that the firewall
uses to detect adversarial inputs.

## What It Is

The codebook is the "compiled detector": the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier, namely the result of an
offline compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.

## Why It Exists

Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:

- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.

At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Per layer, the
projection is a k × d matrix-vector product, O(k·d), and the region comparison
is O(k), where k is the number of retained dimensions and d is the hidden
dimension of the detector model.

## Key Concepts

### z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

```
z = V^T @ (activation - mean)
```

Where `V` is the SVD right-singular matrix (basis vectors as columns) and
`mean` is the mean activation from the calibration dataset. The centering step
is critical: without it, every projection is offset by `V^T @ mean` and the
thresholds would be incorrect.

z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.

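A minimal NumPy sketch of the projection above for a single layer. It assumes the basis is stored with one basis vector per row, so the stored array already plays the role of `V^T`. The function name `project_single_layer` and the toy values are illustrative and not part of the package.

```python
import numpy as np


def project_single_layer(activation: np.ndarray,
                         basis: np.ndarray,
                         mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean) for one layer.

    basis: (n_dims, hidden_dim), rows are basis vectors (i.e. V^T).
    activation, mean: (hidden_dim,).
    """
    if basis.shape[1] != activation.shape[0] or mean.shape != activation.shape:
        raise ValueError("activation, mean and basis disagree on hidden_dim")
    return basis @ (activation - mean)


# Toy example: hidden_dim = 3, two retained dimensions (hypothetical values).
basis = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]], dtype=np.float32)
mean = np.array([0.5, -1.0, 2.0], dtype=np.float32)
activation = np.array([1.5, 1.0, 2.0], dtype=np.float32)

z = project_single_layer(activation, basis, mean)  # centered projection
z_uncentered = basis @ activation                  # centering skipped
offset = z_uncentered - z                          # equals basis @ mean
```

Skipping the centering shifts every coordinate by `basis @ mean`, which is why thresholds fitted on centered data would no longer apply.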
### SVD Basis

Singular Value Decomposition of the activation matrix from a calibration
dataset reveals the principal components (directions) that capture the most
variance. The top-k components form the basis that the codebook uses for
projection.

Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a single matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.

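The basis computation described above can be sketched on synthetic activations as follows. To keep the example dependency-light it uses `numpy.linalg.svd` as a stand-in, which is also an exact, non-randomized decomposition. The helper name `compute_basis`, the matrix sizes, and the sign convention are assumptions for illustration, not the PoC's code.

```python
import numpy as np


def compute_basis(activations: np.ndarray, k: int):
    """Center an (n_samples, hidden_dim) matrix and keep the top-k directions.

    Returns (basis, mean, singular_values) where basis has shape
    (k, hidden_dim) with the top-k right-singular vectors as rows.
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Thin SVD: rows of vt are right-singular vectors, ordered by
    # decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].copy()
    # Assumed convention: flip each vector so that its largest-magnitude
    # entry is positive. Singular vectors are only defined up to sign, so
    # pinning the sign keeps stored bases comparable across runs.
    idx = np.argmax(np.abs(basis), axis=1)
    signs = np.sign(basis[np.arange(k), idx])
    basis *= signs[:, None]
    return basis, mean, singular_values[:k]


# Synthetic calibration activations: 200 inputs, hidden_dim = 16.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 16))
basis, mean, top_singular_values = compute_basis(activations, k=4)
z = (activations - mean) @ basis.T  # (200, 4) z-coordinates
```

The rows of `basis` are orthonormal, and the norm of each z-coordinate column equals the corresponding singular value, which gives a quick sanity check on a compiled basis.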
### Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.

Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).

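One way to read the mapping above, as a sketch: a per-dimension score in [0, 1] is compared against two thresholds. The level names come from the text; the `AlarmLevel` enum, the `classify` helper, the threshold values 0.90 and 0.99, and the "most severe level wins" aggregation are hypothetical placeholders rather than values or rules taken from the codebook.

```python
from enum import IntEnum


class AlarmLevel(IntEnum):
    CLEAR = 0
    SUSPICIOUS = 1
    DANGEROUS = 2


def classify(score: float, suspicious: float, dangerous: float) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level (higher score = more anomalous)."""
    if not 0.0 <= suspicious <= dangerous <= 1.0:
        raise ValueError("expected 0 <= suspicious <= dangerous <= 1")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Placeholder thresholds and scores, for illustration only.
levels = [classify(s, suspicious=0.90, dangerous=0.99)
          for s in (0.10, 0.95, 0.999)]
# Assumed aggregation: the most severe level across dimensions wins.
overall = max(levels)
```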
### Spline Distributions

Monotonic spline distributions model the distribution of z-coordinates along
each SVD dimension (ADR-010). They provide:

- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represents the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training

The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py`; about 280 lines total).

**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are determined by quantiles of
the calibration data (ensuring density of knots where data is dense). Beyond
the extreme knots, the tail probability decays exponentially at a rate fitted
to the tail data. The scoring function maps a z-coordinate to a score in
[0, 1] via the CDF's complement: `score = 1 - cdf(z)`.

**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.

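A sketch of the formal definition, written independently of the metaspline PoC: quantile knots, a monotone cubic interpolant of the empirical CDF, and `score = 1 - cdf(z)`. It uses SciPy's `PchipInterpolator` (a shape-preserving cubic Hermite interpolant) as one possible monotonic spline. The class name `SplineCdf`, the choice of 15 knots, the 1% tail mass on each side, and the maximum-likelihood tail fit are all assumptions; the PoC may fit its knots and tails differently.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


class SplineCdf:
    """Monotone cubic CDF on quantile knots with exponential tails (sketch)."""

    def __init__(self, samples: np.ndarray, n_knots: int = 15):
        # Knots at evenly spaced quantiles: dense where the data is dense.
        # The outermost 1% on each side is left to the exponential tails.
        self.probs = np.linspace(0.01, 0.99, n_knots)
        self.knots = np.quantile(samples, self.probs)
        self.spline = PchipInterpolator(self.knots, self.probs)
        lo, hi = self.knots[0], self.knots[-1]
        # Assumed tail fit: exponential rate = 1 / mean distance of the
        # samples that lie beyond each extreme knot.
        self.rate_lo = 1.0 / np.mean(lo - samples[samples < lo])
        self.rate_hi = 1.0 / np.mean(samples[samples > hi] - hi)

    def cdf(self, z):
        z = np.asarray(z, dtype=float)
        lo, hi = self.knots[0], self.knots[-1]
        p_lo, p_hi = self.probs[0], self.probs[-1]
        inside = self.spline(np.clip(z, lo, hi))
        # Tail mass shrinks exponentially with distance from the end knot.
        lower = p_lo * np.exp(-self.rate_lo * np.maximum(lo - z, 0.0))
        upper = 1.0 - (1.0 - p_hi) * np.exp(-self.rate_hi * np.maximum(z - hi, 0.0))
        return np.where(z < lo, lower, np.where(z > hi, upper, inside))

    def score(self, z):
        return 1.0 - self.cdf(z)


# Synthetic z-coordinates for one dimension.
rng = np.random.default_rng(0)
calibration_z = rng.normal(size=1000)
dist = SplineCdf(calibration_z)
grid = np.linspace(-8.0, 8.0, 401)
cdf_values = dist.cdf(grid)
```

Note that `1 - cdf(z)` as defined is one-sided: scores approach 1 for very negative z and 0 for very positive z, so the orientation of each dimension matters when thresholds are set.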
### Calibration Dataset

The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:

- **Composition**: Diverse normal inputs representative of the deployment
  domain. No adversarial examples: the basis models *normal* behavior, and
  anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
  in production. A narrow calibration dataset (e.g., only short English
  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
  model by running that model on the inputs and extracting activations.

The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.

### Codebook Compilation

The codebook is compiled offline by a training pipeline that:

1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`,
   whose default randomized solver is approximate and only reproducible with
   a fixed `random_state`)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)

This pipeline belongs to Phase 2. In Phase 1, the codebook is **bundled with
the package** as package data (under `src/alknet_firewall/data/codebook/`).
This keeps the Phase 1 installation simple: no additional download step beyond
the model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.

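The selection rule in the paragraph above can be sketched as a small resolver. `DEFAULT_MODEL_ID`, `resolve_codebook_dir`, and the `bundled_dir` argument are hypothetical names; only `codebook_path` and the bundled-default behavior come from the text.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

DEFAULT_MODEL_ID = "SmolLM2-135M"  # placeholder identifier for the default detector


def resolve_codebook_dir(model_id: str,
                         bundled_dir: Path,
                         codebook_path: Path | None = None) -> Path:
    """Pick the codebook directory to load for a detector model."""
    if codebook_path is not None:
        path = Path(codebook_path)
        if not path.is_dir():
            raise FileNotFoundError(f"codebook directory not found: {path}")
        return path
    if model_id != DEFAULT_MODEL_ID:
        # The bundled codebook only matches the default detector model.
        raise ValueError(
            f"no bundled codebook for {model_id!r}; pass codebook_path "
            "pointing at a codebook compiled for that model"
        )
    return bundled_dir


bundled = Path("src/alknet_firewall/data/codebook")
default_dir = resolve_codebook_dir(DEFAULT_MODEL_ID, bundled)

with tempfile.TemporaryDirectory() as tmp:
    custom_dir = resolve_codebook_dir("some-other-model", bundled, Path(tmp))
    custom_matches = custom_dir == Path(tmp)

try:
    resolve_codebook_dir("some-other-model", bundled)
    rejected = False
except ValueError:
    rejected = True
```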
## Data Format

The codebook is stored as:

```
codebook/
├── basis.safetensors     # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors   # Region boundary parameters
├── splines.json          # Spline knot positions and coefficients
└── config.json           # Metadata: model_id, revision, n_dims, thresholds
```

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.

### Tensor Specifications

**basis.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**splines.json**:

| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

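
The shapes and fields above imply some consistency checks that a loader could run before use. The sketch below assumes the tensors have already been read into NumPy arrays keyed as in the tables and that `splines.json` has been parsed into a dict. `validate_codebook` is a hypothetical helper, and the synthetic values only exercise the checks.

```python
import numpy as np


def validate_codebook(basis: dict, regions: dict, splines: dict) -> tuple:
    """Check cross-file consistency; return (n_layers, n_dims, hidden_dim)."""
    vectors = basis["basis_vectors"]
    if vectors.ndim != 3:
        raise ValueError("basis_vectors must be (n_layers, n_dims, hidden_dim)")
    n_layers, n_dims, hidden_dim = vectors.shape
    expected = {
        "basis_vectors": (vectors, (n_layers, n_dims, hidden_dim)),
        "mean": (basis["mean"], (n_layers, hidden_dim)),
        "centroids": (regions["centroids"], (n_layers, n_dims)),
        "scale": (regions["scale"], (n_layers, n_dims)),
    }
    for name, (tensor, shape) in expected.items():
        if tensor.shape != shape:
            raise ValueError(f"{name}: expected shape {shape}, got {tensor.shape}")
        if tensor.dtype != np.float32:
            raise ValueError(f"{name}: expected float32, got {tensor.dtype}")
    # splines.json holds one entry per retained dimension in every field.
    for field in ("knots", "coefficients", "tail_decay"):
        if len(splines[field]) != n_dims:
            raise ValueError(f"splines.json: {field} must have {n_dims} entries")
    return n_layers, n_dims, hidden_dim


# Synthetic stand-ins for the loaded files (2 layers, 3 dims, hidden_dim 8).
basis = {
    "basis_vectors": np.zeros((2, 3, 8), dtype=np.float32),
    "mean": np.zeros((2, 8), dtype=np.float32),
}
regions = {
    "centroids": np.zeros((2, 3), dtype=np.float32),
    "scale": np.ones((2, 3), dtype=np.float32),
}
splines = {
    "knots": [[-1.0, 0.0, 1.0], [-2.0, 2.0], [-1.0, 0.5, 1.0, 3.0]],
    "coefficients": [[0.1, 0.5, 0.9], [0.2, 0.8], [0.1, 0.4, 0.7, 0.95]],
    "tail_decay": [1.0, 0.5, 2.0],
}
dims = validate_codebook(basis, regions, splines)
```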
## Interfaces

### Internal API

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    # DimensionSignal is defined elsewhere in the package (not shown here).
    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```

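A sketch of what `project` could do with the tensors from the data format section: each layer's activation is centered with that layer's mean and multiplied by that layer's basis slice. It is written as a free function over plain arrays rather than as the `Codebook` method. The function name, the `layers` ordering argument, and the `(n_layers, n_dims)` output shape are assumptions, since the interface above does not fix the output layout.

```python
from __future__ import annotations

import numpy as np


def project_layers(activations: dict[int, np.ndarray],
                   layers: list[int],
                   basis_vectors: np.ndarray,
                   mean: np.ndarray) -> np.ndarray:
    """Project per-layer activations onto per-layer bases.

    activations: layer index -> (hidden_dim,) vector.
    layers: layer indices in the order used along axis 0 of the tensors.
    basis_vectors: (n_layers, n_dims, hidden_dim); mean: (n_layers, hidden_dim).
    Returns z-coordinates with shape (n_layers, n_dims).
    """
    missing = [layer for layer in layers if layer not in activations]
    if missing:
        raise KeyError(f"missing activations for layers {missing}")
    stacked = np.stack([activations[layer] for layer in layers])
    centered = stacked - mean  # (n_layers, hidden_dim)
    # Batched matrix-vector product: for each layer l,
    # z[l] = basis_vectors[l] @ centered[l].
    return np.einsum("lkh,lh->lk", basis_vectors, centered)


# Synthetic example: 2 layers, 3 retained dimensions, hidden_dim = 5.
rng = np.random.default_rng(0)
layers = [4, 8]  # hypothetical layer indices
basis_vectors = rng.normal(size=(2, 3, 5)).astype(np.float32)
mean = rng.normal(size=(2, 5)).astype(np.float32)
activations = {layer: rng.normal(size=5).astype(np.float32) for layer in layers}
z = project_layers(activations, layers, basis_vectors, mean)
```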
### Constraints

1. **Immutable at runtime**: The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound**: A codebook is valid only for the specific model it was
   compiled for. Pairing a codebook with any other model produces undefined
   results.
3. **Deterministic**: Same codebook + same activations = same scores.
4. **Portable**: The codebook can be saved to disk and reloaded without
   recomputation. It can be distributed via HuggingFace Hub.

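
Two of these constraints can be enforced mechanically. The sketch below marks loaded arrays read-only and rejects a codebook whose recorded model differs from the loaded detector. `freeze` and `check_model_binding` are hypothetical helpers, and the revision strings are placeholders.

```python
import numpy as np


def freeze(tensors: dict) -> dict:
    """Mark every array read-only so screening code cannot mutate it."""
    for array in tensors.values():
        array.setflags(write=False)
    return tensors


def check_model_binding(codebook_model: tuple, loaded_model: tuple) -> None:
    """Both arguments are (model_id, revision) pairs."""
    if codebook_model != loaded_model:
        raise ValueError(
            f"codebook was compiled for {codebook_model}, "
            f"but the detector is {loaded_model}"
        )


tensors = freeze({"mean": np.zeros(4, dtype=np.float32)})
try:
    tensors["mean"][0] = 1.0  # NumPy refuses writes to a read-only array
    write_blocked = False
except ValueError:
    write_blocked = True

try:
    check_model_binding(("SmolLM2-135M", "rev-a"), ("SmolLM2-135M", "rev-b"))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Failing loudly on a mismatch is one option for turning the "undefined results" case into an explicit error at load time.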
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-02**: What is the minimum viable codebook? Can the 1,245-line PoC
  codebook be compressed? (open)
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)