feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

248
docs/architecture/codebook.md
Normal file
@@ -0,0 +1,248 @@
---
status: draft
last_updated: 2026-06-13
---

# Codebook

The codebook contains the compiled detection parameters (SVD basis vectors,
behavioral region boundaries, and scoring distributions) that the firewall
uses to detect adversarial inputs.

## What It Is

The codebook is the "compiled detector": the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier, namely the result of an
offline compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.

## Why It Exists

Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:

- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.

At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Per layer, the
projection is a matrix-vector product costing O(k·d), followed by k spline
evaluations, where k is the number of retained dimensions and d is the hidden
dimension.

## Key Concepts

### z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

```
z = V^T @ (activation - mean)
```

Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
mean activation from the calibration dataset. The centering step is critical:
without it, projections are offset by the projection of the mean and
thresholds would be incorrect.

z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.

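A minimal NumPy sketch of the projection for a single layer. It assumes `V` holds the retained right-singular vectors as columns, with shape `(hidden_dim, k)`; the function name `project_layer` and the toy values are illustrative and do not come from the PoC.

```python
import numpy as np


def project_layer(activation: np.ndarray, V: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean) for one layer.

    activation, mean: shape (hidden_dim,); V: shape (hidden_dim, k).
    """
    return V.T @ (activation - mean)


# Toy example: hidden_dim = 3, k = 2, basis = the first two coordinate axes.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
mean = np.array([10.0, -5.0, 2.0])
activation = np.array([11.0, -5.0, 7.0])

z = project_layer(activation, V, mean)   # centered projection
z_uncentered = V.T @ activation          # what skipping the centering step gives
offset = z_uncentered - z                # equals V.T @ mean
```

If a layer's basis is stored with one basis vector per row, shape `(n_dims, hidden_dim)`, the same projection would be written `basis @ (activation - mean)`.
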
### SVD Basis

Singular Value Decomposition of the mean-centered activation matrix from a
calibration dataset reveals the principal components (directions) that capture
the most variance. The top-k components form the basis that the codebook uses
for projection.

Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset, up to
  the sign of each singular vector
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.

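The basis computation can be sketched as follows. This is an illustration under stated assumptions, not the PoC's `run_manifold_projection.py`: it uses `numpy.linalg.svd` as the exact solver, the function name `compute_basis` is hypothetical, and the sign convention (largest-magnitude entry of each vector made positive) is one possible way to pin down the sign that SVD leaves free.

```python
import numpy as np


def compute_basis(activations: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (mean, basis) with basis of shape (k, hidden_dim).

    activations: calibration matrix of shape (n_inputs, hidden_dim).
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Exact SVD; rows of vt are right-singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    # Assumed sign convention: largest-magnitude entry of each row is positive.
    pivot = np.argmax(np.abs(basis), axis=1)
    signs = np.sign(basis[np.arange(k), pivot])
    return mean, basis * signs[:, None]


rng = np.random.default_rng(0)
# Synthetic calibration data with most of its variance along the first axis.
X = rng.normal(size=(500, 8)) * np.array([5.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
mean, basis = compute_basis(X, k=2)
```

On this synthetic data the first basis vector aligns closely with the first coordinate axis, which is the direction given the largest variance.
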
### Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.

Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).

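A minimal sketch of how a per-dimension score could be mapped to the three alarm levels. The `AlarmLevel` enum, the function name, and the threshold values are assumptions made for illustration; only the level names and the `suspicious_threshold` / `dangerous_threshold` fields appear in this spec.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def alarm_level(score: float, suspicious_threshold: float,
                dangerous_threshold: float) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level using two ascending thresholds."""
    if not 0.0 <= suspicious_threshold <= dangerous_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= suspicious <= dangerous <= 1")
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Hypothetical threshold values, chosen only for illustration.
levels = [alarm_level(s, 0.90, 0.99) for s in (0.10, 0.95, 0.999)]
```
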
### Spline Distributions

Monotonic spline distributions model the cumulative distribution (CDF) along
each SVD dimension (ADR-010). They provide:

- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training

The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py`; ~280 lines total).

**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are set at quantiles of the
calibration data, which places knots densely where the data is dense. Beyond
the extreme knots, the tail probability (`cdf(z)` below the first knot,
`1 - cdf(z)` above the last) decays exponentially at a rate fitted to the tail
data. The scoring function maps a z-coordinate to a score in [0, 1] via the
CDF's complement: `score = 1 - cdf(z)`. As written, this score is one-sided:
it approaches 1 in the lower tail and 0 in the upper tail.

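The definition can be illustrated with a small NumPy sketch. It is not the metaspline implementation: it interpolates linearly between quantile knots as a stand-in for the monotonic cubic spline, and the class name `TailedCDF`, the knot range (2nd to 98th percentile), and the tail-rate rule (one over the mean excess beyond the extreme knot) are all assumptions.

```python
import numpy as np


class TailedCDF:
    """Monotone CDF through quantile knots, with exponential tails.

    Piecewise-linear between knots (stand-in for the monotonic cubic spline).
    """

    def __init__(self, samples: np.ndarray, n_knots: int = 15):
        samples = np.sort(np.asarray(samples, dtype=float))
        self.probs = np.linspace(0.02, 0.98, n_knots)
        self.knots = np.quantile(samples, self.probs)
        lo, hi = self.knots[0], self.knots[-1]
        # Assumed fitting rule: rate = 1 / mean excess beyond the extreme knot
        # (the maximum-likelihood rate for an exponential tail).
        self.rate_lo = 1.0 / np.mean(lo - samples[samples < lo])
        self.rate_hi = 1.0 / np.mean(samples[samples > hi] - hi)

    def cdf(self, z: np.ndarray) -> np.ndarray:
        z = np.asarray(z, dtype=float)
        lo, hi = self.knots[0], self.knots[-1]
        p_lo, p_hi = self.probs[0], self.probs[-1]
        inner = np.interp(z, self.knots, self.probs)
        # Lower tail: cdf decays toward 0; upper tail: 1 - cdf decays toward 0.
        lower = p_lo * np.exp(-self.rate_lo * np.maximum(lo - z, 0.0))
        upper = 1.0 - (1.0 - p_hi) * np.exp(-self.rate_hi * np.maximum(z - hi, 0.0))
        return np.where(z < lo, lower, np.where(z > hi, upper, inner))

    def score(self, z: np.ndarray) -> np.ndarray:
        """score = 1 - cdf(z), as in the formal definition."""
        return 1.0 - self.cdf(z)


rng = np.random.default_rng(0)
dist = TailedCDF(rng.normal(size=5000))    # synthetic calibration z-coordinates
grid = np.linspace(-8.0, 8.0, 401)
cdf_values = dist.cdf(grid)
scores = dist.score(np.array([-6.0, 0.0, 6.0]))
```

The two tail expressions meet the interior interpolation at the extreme knots, so the sketched CDF stays continuous and non-decreasing over the whole real line.
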
**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.

### Calibration Dataset

The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:

- **Composition**: Diverse normal inputs representative of the deployment
  domain. No adversarial examples: the basis models *normal* behavior, and
  anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
  in production. A narrow calibration dataset (e.g., only short English
  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
  model by running that model on the inputs and extracting activations.

The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.

### Codebook Compilation

The codebook is compiled offline by a training pipeline that:

1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`,
   which defaults to a randomized solver and is reproducible only when
   `random_state` is fixed)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)

This pipeline is planned for Phase 2. In Phase 1, the codebook is **bundled
with the package** as package data (under `src/alknet_firewall/data/codebook/`).
This keeps the Phase 1 installation simple: there is no additional download
step beyond the model. The bundled codebook is specific to the default detector
model (SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.

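One way the Phase 1 lookup could behave is sketched below: an explicit `codebook_path` wins, the bundled data is used only for the default model, and any other model without a path is rejected. The function name, the `DEFAULT_MODEL_ID` constant, and the error message are hypothetical; `importlib.resources.files` is the standard-library call for locating package data.

```python
from importlib.resources import files
from pathlib import Path
from typing import Optional

# Placeholder identifier for the default detector model (an assumption).
DEFAULT_MODEL_ID = "SmolLM2-135M"


def resolve_codebook_dir(model_id: str, codebook_path: Optional[Path] = None,
                         package: str = "alknet_firewall"):
    """Pick the codebook directory: explicit path first, bundled data otherwise."""
    if codebook_path is not None:
        return Path(codebook_path)
    if model_id != DEFAULT_MODEL_ID:
        raise ValueError(
            f"no bundled codebook for {model_id!r}; pass a matching codebook_path"
        )
    # Bundled package data lives under <package>/data/codebook/.
    return files(package) / "data" / "codebook"
```

Called as `resolve_codebook_dir("SmolLM2-135M")`, this would return the bundled directory inside the installed package; with any other `model_id` the caller has to supply `codebook_path`.
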
## Data Format

The codebook is stored as:

```
codebook/
├── basis.safetensors     # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors   # Region boundary parameters
├── splines.json          # Spline knot positions and coefficients
└── config.json           # Metadata: model_id, revision, n_dims, thresholds
```

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.

### Tensor Specifications

**basis.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**splines.json**:

| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

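The shape and dtype rules in these tables lend themselves to a load-time check. The sketch below assumes the tensors are already available as dictionaries of NumPy arrays keyed as in the tables; the function name `validate_codebook_tensors` and the example sizes are hypothetical.

```python
import numpy as np


def validate_codebook_tensors(basis: dict, regions: dict) -> tuple[int, int, int]:
    """Check shapes and dtypes; return (n_layers, n_dims, hidden_dim)."""
    vectors = basis["basis_vectors"]
    if vectors.ndim != 3:
        raise ValueError("basis_vectors must have shape (n_layers, n_dims, hidden_dim)")
    n_layers, n_dims, hidden_dim = vectors.shape
    expected = {
        "mean": (basis["mean"], (n_layers, hidden_dim)),
        "centroids": (regions["centroids"], (n_layers, n_dims)),
        "scale": (regions["scale"], (n_layers, n_dims)),
    }
    for name, (tensor, shape) in expected.items():
        if tensor.shape != shape:
            raise ValueError(f"{name}: expected shape {shape}, got {tensor.shape}")
    for name, tensor in {**basis, **regions}.items():
        if tensor.dtype != np.float32:
            raise ValueError(f"{name}: expected float32, got {tensor.dtype}")
    return n_layers, n_dims, hidden_dim


# Hypothetical sizes: 2 layers, 4 retained dimensions, hidden size 16.
basis = {"basis_vectors": np.zeros((2, 4, 16), np.float32),
         "mean": np.zeros((2, 16), np.float32)}
regions = {"centroids": np.zeros((2, 4), np.float32),
           "scale": np.ones((2, 4), np.float32)}
dims = validate_codebook_tensors(basis, regions)
```

Dictionaries of this form are what `safetensors.numpy.load_file` returns, so a check like this could run directly after reading `basis.safetensors` and `regions.safetensors`.
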
## Interfaces

### Internal API

```python
from __future__ import annotations  # lets methods annotate `-> Codebook`

from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    # DimensionSignal is not defined in this document.
    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```

### Constraints

1. **Immutable at runtime**: The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound**: A codebook is valid only for the specific model it was
   compiled for. Loading a codebook with the wrong model produces undefined
   results.
3. **Deterministic**: Same codebook + same activations = same scores.
4. **Portable**: Codebook can be saved to disk and reloaded without
   recomputation. Can be distributed via HuggingFace Hub.

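Because a mismatched codebook yields undefined results rather than an error, a guard at load time is one possible safeguard. The sketch below is an assumption, not specified behavior: `ModelIdentity`, `CodebookModelMismatch`, and `check_model_binding` are hypothetical names, and the comparison uses the `model_id` and `model_revision` fields from `CodebookConfig`.

```python
from dataclasses import dataclass


class CodebookModelMismatch(ValueError):
    """Raised when a codebook is paired with a model it was not compiled for."""


@dataclass(frozen=True)  # frozen: identity cannot be mutated after loading
class ModelIdentity:
    model_id: str
    model_revision: str


def check_model_binding(codebook: ModelIdentity, loaded: ModelIdentity) -> None:
    """Fail fast instead of producing undefined scores."""
    if codebook != loaded:
        raise CodebookModelMismatch(
            f"codebook compiled for {codebook.model_id}@{codebook.model_revision}, "
            f"but detector is {loaded.model_id}@{loaded.model_revision}"
        )
```
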
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-02**: What is the minimum viable codebook? Can the 1,245-line PoC
  codebook be compressed? (open)
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)