glm-5.1 cf464c2296 feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

2026-06-13 05:17:40 +00:00


status	last_updated
draft	2026-06-13

Codebook

The codebook contains the compiled detection parameters — SVD basis vectors, behavioral region boundaries, and scoring distributions — that the firewall uses to detect adversarial inputs.

What It Is

The codebook is the "compiled detector" — the precomputed parameters that transform raw model activations into behavioral alarm signals. It is to the firewall what a trained model is to a classifier: the result of an offline compilation step that produces the runtime detection parameters.

The name "codebook" comes from vector quantization terminology: it defines a set of reference points (codewords) in activation space that represent known behavioral patterns. New inputs are compared against these reference patterns.

Why It Exists

SVD decomposition and distribution fitting operate on an entire calibration dataset, and repeating them for every screened input would be prohibitively expensive. The codebook precomputes their results offline:

SVD basis: The principal directions in activation space that capture safety-relevant behavioral variance. Computed once from a calibration dataset.
Behavioral regions: The expected distribution of normal inputs along each SVD dimension. Defined by fitted spline distributions.
Thresholds: Decision boundaries for alarm levels along each dimension.

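At runtime, the firewall only needs to project new activations onto the precomputed basis and compare against the precomputed regions. Per layer, projection is a single matrix-vector product costing O(k·d), where k is the number of retained dimensions and d is the hidden size, and scoring the resulting z-coordinates is O(k).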

Key Concepts

z-Coordinates

The projection of an activation vector onto the SVD basis. Computed as:

z = V^T @ (activation - mean)

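where V is the matrix of retained right-singular vectors, with shape (hidden_dim, k) and one basis vector per column, and mean is the mean activation from the calibration dataset. The basis_vectors tensor stores one basis vector per row, so each layer's slice is already V^T. The centering step is critical: without it, every projection is offset by V^T @ mean and the codebook's thresholds no longer apply.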

z-coordinates are raw (unnormalized) projections. The codebook's spline distributions are calibrated for this scale, so threshold values in the codebook are specific to the z-coordinate range of the calibration data.
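The projection formula can be illustrated with a minimal NumPy sketch. The basis, mean, and activation below are synthetic stand-ins (an orthonormal matrix from a QR factorization, not a real codebook), and the helper name project is hypothetical. The example also shows the offset that appears when centering is skipped.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, k = 8, 3

# Synthetic stand-ins for codebook contents: V has orthonormal columns
# (hidden_dim x k) and mean plays the role of the calibration mean.
V, _ = np.linalg.qr(rng.normal(size=(hidden_dim, k)))
mean = rng.normal(loc=2.0, size=hidden_dim)


def project(activation: np.ndarray, V: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean), one coordinate per basis vector."""
    return V.T @ (activation - mean)


# An activation that sits 0.5 units from the mean along the first basis direction.
activation = mean + 0.5 * V[:, 0]

z = project(activation, V, mean)       # centered projection
z_uncentered = V.T @ activation        # skips centering: shifted by V^T @ mean
offset = z_uncentered - z
```

Here z recovers the displacement along the first direction, while the uncentered result differs from it by exactly V^T @ mean.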

SVD Basis

Singular Value Decomposition of the activation space from a calibration dataset reveals the principal components (directions) that capture the most variance. The top-k components form the basis that the codebook uses for projection.

Key properties:

Interpretable: Each direction can be inspected for what behavioral pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
Efficient: After decomposition, projection is a matrix multiply
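Stable: The SVD basis is deterministic for a given calibration dataset and SVD implementation. Singular vectors are mathematically unique only up to sign, so a basis recomputed with a different library can differ from the stored one by sign flips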
Model-specific: The basis is computed for a specific model architecture and weights. Changing the detector model requires recomputing the basis

The SVD basis is computed by the codebook training pipeline (run_manifold_projection.py in the PoC) and stored as part of the codebook.
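As a rough illustration of how a top-k basis could be derived, the sketch below centers a synthetic activation matrix and keeps the leading right-singular vectors. It is not the PoC's run_manifold_projection.py; the function name compute_basis and the random data are assumptions, and numpy.linalg.svd is used here in place of scipy.linalg.svd (both compute an exact decomposition).

```python
import numpy as np


def compute_basis(activations: np.ndarray, k: int):
    """Return (basis, mean, singular_values) for an (n_samples, hidden_dim) matrix.

    basis has shape (k, hidden_dim): one right-singular vector per row.
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Economy-size SVD; rows of vt are right-singular vectors,
    # ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k], mean, s[:k]


rng = np.random.default_rng(0)
# Synthetic "activations" whose variance differs per coordinate.
acts = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.1, 16)

basis, mean, singular_values = compute_basis(acts, k=4)
gram = basis @ basis.T   # identity if the retained vectors are orthonormal
```

The rows of basis are orthonormal, so projecting with them preserves distances within the retained subspace.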

Behavioral Regions

For each SVD dimension, the codebook defines the expected distribution of normal (non-adversarial) inputs. This is modeled as a monotonic spline distribution that captures the shape of the behavioral region along that dimension.

Inputs whose projections fall within the normal region score low (CLEAR). Inputs whose projections fall near or beyond the region boundary score increasingly high (SUSPICIOUS → DANGEROUS).
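The mapping from a score to an alarm level might look like the sketch below. The enum, the function name classify, the threshold values 0.9 and 0.99, and the choice of inclusive comparisons are all illustrative assumptions; the actual alarm protocol is specified in the firewall document.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def classify(
    score: float,
    suspicious_threshold: float = 0.9,    # hypothetical value
    dangerous_threshold: float = 0.99,    # hypothetical value
) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level using two thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if suspicious_threshold > dangerous_threshold:
        raise ValueError("suspicious_threshold must not exceed dangerous_threshold")
    if score >= dangerous_threshold:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_threshold:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


levels = [classify(s) for s in (0.2, 0.95, 0.999)]
```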

Spline Distributions

Monotonic spline distributions model the cumulative distribution, and through its derivative the probability density, along each SVD dimension (ADR-010). They provide:

Smooth scoring: Continuous score rather than hard threshold
Tail sensitivity: Exponential tail behavior captures rare-but-critical anomalous inputs
Parametric compactness: A handful of spline knots represent the full distribution shape
Differentiability: Scores are differentiable for potential future use in adversarial training

The spline distribution approach is adapted from the metaspline PoC (spline.py, transform.py, space.py — ~280 lines total).

Formal definition: The CDF along each dimension is modeled as a monotonic cubic spline with 10–20 knots. Knot positions are set at quantiles of the calibration data, which places knots densely where the data is dense. Beyond the extreme knots, the tail probability (the CDF below the first knot, and 1 - CDF above the last) decays exponentially at a rate fitted to the tail data. The scoring function maps a z-coordinate to a score in [0, 1] via the CDF's complement: score = 1 - cdf(z).
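One way to realize this definition is sketched below. It is not the metaspline SplineDistribution class: it assumes SciPy's PchipInterpolator as the monotone cubic interpolant, places knots at evenly spaced quantiles between 1% and 99%, and estimates each tail rate as the reciprocal of the mean excess beyond the extreme knot. The class name SplineCdf and these fitting choices are illustrative. The score follows the formula literally, so it approaches 1 in the lower tail and 0 in the upper tail.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


class SplineCdf:
    """Monotone cubic-spline CDF with exponential tails for one dimension."""

    def __init__(self, samples, n_knots: int = 12, tail_mass: float = 0.01):
        samples = np.asarray(samples, dtype=float)
        # Knots sit at evenly spaced quantiles, so they are dense where data is dense.
        self.probs = np.linspace(tail_mass, 1.0 - tail_mass, n_knots)
        self.knots = np.quantile(samples, self.probs)
        # PCHIP needs strictly increasing knots and keeps monotone data monotone.
        self._spline = PchipInterpolator(self.knots, self.probs)
        self.lo, self.hi = self.knots[0], self.knots[-1]
        # Exponential rate per tail: 1 / mean excess beyond the extreme knot.
        self.rate_lo = 1.0 / np.mean(self.lo - samples[samples < self.lo])
        self.rate_hi = 1.0 / np.mean(samples[samples > self.hi] - self.hi)

    def cdf(self, z):
        z = np.asarray(z, dtype=float)
        inside = self._spline(np.clip(z, self.lo, self.hi))
        # Tail probabilities decay exponentially away from the extreme knots.
        below = self.probs[0] * np.exp(-self.rate_lo * np.maximum(self.lo - z, 0.0))
        above = 1.0 - (1.0 - self.probs[-1]) * np.exp(
            -self.rate_hi * np.maximum(z - self.hi, 0.0)
        )
        return np.where(z < self.lo, below, np.where(z > self.hi, above, inside))

    def score(self, z):
        """score = 1 - cdf(z), as in the formal definition."""
        return 1.0 - self.cdf(z)


rng = np.random.default_rng(0)
dist = SplineCdf(rng.normal(size=5000))   # synthetic calibration projections
scores = dist.score(np.array([-6.0, 0.0, 6.0]))
```

The sketch assumes continuous data, so the quantile knots are distinct; heavily tied data would need deduplicated knots before fitting.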

Canonical implementation: The metaspline PoC files spline.py (SplineDistribution class), transform.py (dcs_norm, simplex transforms), and space.py (unfold/fold) are the reference implementation for the codebook compilation pipeline.

Calibration Dataset

The calibration dataset is the set of normal (non-adversarial) inputs used to compute the SVD basis and fit behavioral region distributions. Requirements:

Composition: Diverse normal inputs representative of the deployment domain. No adversarial examples — the basis models normal behavior, and anomalies are detected as deviations from it.
Size: At minimum, enough inputs to produce a stable SVD decomposition. Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but have diminishing returns.
Diversity: Must cover the range of normal inputs the detector will see in production. A narrow calibration dataset (e.g., only short English queries) will produce high false positive rates on unusual but benign inputs.
Model-specific: A calibration dataset must be collected for each detector model by running that model on the inputs and extracting activations.
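Whether a calibration set is large enough for a stable basis can be checked empirically. The sketch below is one possible check, not part of the PoC: it fits a top-k basis on two independent synthetic samples and compares the subspaces through their principal angles. The helper names and the synthetic data generator are assumptions.

```python
import numpy as np


def top_k_basis(x: np.ndarray, k: int) -> np.ndarray:
    """Top-k right-singular vectors (rows) of the mean-centered matrix x."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def subspace_overlap(b1: np.ndarray, b2: np.ndarray) -> float:
    """Smallest cosine of the principal angles between two row spaces.

    1.0 means the subspaces coincide; the value is unaffected by sign flips
    or rotations of the basis vectors within each subspace.
    """
    cosines = np.linalg.svd(b1 @ b2.T, compute_uv=False)
    return float(cosines.min())


rng = np.random.default_rng(0)
hidden_dim, k = 32, 4
# Synthetic activations: k strong directions plus weak isotropic noise.
directions = np.linalg.qr(rng.normal(size=(hidden_dim, k)))[0].T   # (k, hidden_dim)
strengths = np.array([8.0, 6.0, 4.0, 3.0])


def sample(n: int) -> np.ndarray:
    signal = (rng.normal(size=(n, k)) * strengths) @ directions
    return signal + 0.5 * rng.normal(size=(n, hidden_dim))


overlap = subspace_overlap(top_k_basis(sample(2000), k), top_k_basis(sample(2000), k))
```

An overlap close to 1.0 across independent splits suggests the dataset is large enough for the chosen k; a low value suggests collecting more inputs or retaining fewer dimensions.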

The codebook compilation pipeline (run_manifold_projection.py in the PoC) automates calibration dataset processing.
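The activation-extraction step can be sketched for the last-token strategy named in ADR-009. The example assumes hidden states arrive as a padded (batch, seq_len, hidden_dim) array with a 0/1 attention mask and right-side padding; the function name and the padding convention are assumptions, and the arrays below are synthetic rather than real model outputs.

```python
import numpy as np


def last_token_activations(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token for each sequence.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right-side padding
    Returns an array of shape (batch, hidden_dim).
    """
    last_idx = attention_mask.sum(axis=1) - 1
    if (last_idx < 0).any():
        raise ValueError("every sequence needs at least one real token")
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]


# Two sequences of padded length 4 with hidden size 3.
hidden_states = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

activations = last_token_activations(hidden_states, attention_mask)
```

Stacking these per-input vectors row by row yields the (n_samples, hidden_dim) activation matrix that the SVD step consumes.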

Codebook Compilation

The codebook is compiled offline by a training pipeline that:

Runs the detector model on a calibration dataset (diverse normal inputs)
Extracts hidden state activations at configured layers
Computes SVD on the mean-centered activation matrix (scipy.linalg.svd for an exact decomposition; not sklearn.decomposition.TruncatedSVD, whose default solver is a randomized approximation that is reproducible only with a fixed random_state)
Fits spline distributions along each retained dimension
Computes detection thresholds
Serializes the codebook to a portable format (safetensors + JSON config)

This pipeline is Phase 2. In Phase 1, the codebook is bundled with the package as package data (under src/alknet_firewall/data/codebook/). This keeps the Phase 1 installation simple — no additional download step beyond the model. The bundled codebook is specific to the default detector model (SmolLM2-135M at the pinned revision). Users who switch to a different detector model must provide a matching codebook via codebook_path.
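The choice between the bundled codebook and a user-supplied one could be resolved along these lines. This is a sketch under assumptions: the function name resolve_codebook_dir, the DEFAULT_MODEL string, and the error message are hypothetical, and only the package-data location src/alknet_firewall/data/codebook/ and the codebook_path option come from the text above.

```python
from __future__ import annotations

from importlib.resources import files
from pathlib import Path

DEFAULT_MODEL = "SmolLM2-135M"   # assumed identifier string for the default detector


def resolve_codebook_dir(
    model_id: str,
    codebook_path: str | Path | None = None,
    package: str = "alknet_firewall",
):
    """Return the directory holding the codebook for model_id."""
    if codebook_path is not None:
        # An explicit path always wins, whatever the model.
        return Path(codebook_path)
    if model_id != DEFAULT_MODEL:
        raise ValueError(
            f"no bundled codebook for {model_id!r}; pass codebook_path explicitly"
        )
    # Bundled package data, e.g. <site-packages>/alknet_firewall/data/codebook
    return files(package) / "data" / "codebook"


explicit = resolve_codebook_dir("some-other-model", codebook_path="/tmp/my-codebook")
```

Using importlib.resources rather than a path relative to __file__ keeps the lookup working when the package is installed in a non-filesystem location such as a zip archive.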

Data Format

The codebook is stored as:

codebook/
├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors    # Region boundary parameters
├── splines.json           # Spline knot positions and coefficients
└── config.json            # Metadata: model_id, revision, n_dims, thresholds

All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
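The JSON half of this layout can be illustrated with a standard-library round trip. Every value below is a placeholder: the config keys mirror the CodebookConfig fields rather than a confirmed on-disk schema, the layer indices and thresholds are invented, and the consistency check (including the requirement that knots be strictly increasing) is an assumption.

```python
import json
import tempfile
from pathlib import Path

# Placeholder metadata; keys follow the CodebookConfig dataclass fields.
config = {
    "model_id": "SmolLM2-135M",
    "model_revision": "<pinned-revision>",
    "n_dimensions": 2,
    "layers": [6, 12],
    "suspicious_threshold": 0.9,
    "dangerous_threshold": 0.99,
}

# Placeholder spline parameters: one entry per retained dimension.
splines = {
    "knots": [[-2.0, 0.0, 2.0], [-1.0, -0.5, 0.5, 1.0]],
    "coefficients": [[0.05, 0.5, 0.95], [0.05, 0.3, 0.7, 0.95]],
    "tail_decay": [1.5, 2.0],
}


def check_splines(config: dict, splines: dict) -> None:
    """Raise ValueError if splines.json disagrees with config.json."""
    n = config["n_dimensions"]
    for field in ("knots", "coefficients", "tail_decay"):
        if len(splines[field]) != n:
            raise ValueError(f"{field} has {len(splines[field])} entries, expected {n}")
    for knots in splines["knots"]:
        if any(b <= a for a, b in zip(knots, knots[1:])):
            raise ValueError("knots must be strictly increasing")


with tempfile.TemporaryDirectory() as tmp:
    codebook_dir = Path(tmp) / "codebook"
    codebook_dir.mkdir()
    (codebook_dir / "config.json").write_text(json.dumps(config, indent=2))
    (codebook_dir / "splines.json").write_text(json.dumps(splines, indent=2))
    loaded_config = json.loads((codebook_dir / "config.json").read_text())
    loaded_splines = json.loads((codebook_dir / "splines.json").read_text())

check_splines(loaded_config, loaded_splines)
```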

Tensor Specifications

basis.safetensors:

Key	Shape	Dtype	Description
`basis_vectors`	`(n_layers, n_dims, hidden_dim)`	float32	SVD right-singular vectors
`mean`	`(n_layers, hidden_dim)`	float32	Mean activation per layer (for centering)

regions.safetensors:

Key	Shape	Dtype	Description
`centroids`	`(n_layers, n_dims)`	float32	Mean projection per dimension
`scale`	`(n_layers, n_dims)`	float32	Standard deviation per dimension

splines.json:

Field	Type	Description
`knots`	`list[list[float]]`	Knot positions per dimension (n_dims lists of varying length)
`coefficients`	`list[list[float]]`	Spline coefficients per dimension
`tail_decay`	`list[float]`	Exponential tail decay rate per dimension
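The tensor specifications above imply shape relationships that a loader could verify before use. The sketch below checks them on in-memory NumPy arrays; it assumes the tensors from both safetensors files have already been read into one dictionary, and the function name validate_shapes is hypothetical.

```python
from __future__ import annotations

import numpy as np


def validate_shapes(tensors: dict[str, np.ndarray]) -> tuple[int, int, int]:
    """Check codebook tensors against the specified shapes and dtype.

    Returns (n_layers, n_dims, hidden_dim) taken from basis_vectors.
    """
    basis = tensors["basis_vectors"]
    if basis.ndim != 3:
        raise ValueError(f"basis_vectors must be 3-D, got shape {basis.shape}")
    n_layers, n_dims, hidden_dim = basis.shape

    expected = {
        "basis_vectors": (n_layers, n_dims, hidden_dim),
        "mean": (n_layers, hidden_dim),
        "centroids": (n_layers, n_dims),
        "scale": (n_layers, n_dims),
    }
    for key, shape in expected.items():
        array = tensors[key]
        if array.shape != shape:
            raise ValueError(f"{key}: expected shape {shape}, got {array.shape}")
        if array.dtype != np.float32:
            raise ValueError(f"{key}: expected float32, got {array.dtype}")
    return n_layers, n_dims, hidden_dim


# Zero-filled tensors with illustrative sizes (2 layers, 4 dims, hidden size 16).
tensors = {
    "basis_vectors": np.zeros((2, 4, 16), dtype=np.float32),
    "mean": np.zeros((2, 16), dtype=np.float32),
    "centroids": np.zeros((2, 4), dtype=np.float32),
    "scale": np.ones((2, 4), dtype=np.float32),
}
dims = validate_shapes(tensors)
```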

Interfaces

Internal API

from __future__ import annotations  # lets methods annotate their return type as Codebook

from dataclasses import dataclass
from pathlib import Path

import numpy as np

# DimensionSignal is referenced below but defined outside this document.

@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float    # Serialized threshold values
    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)

class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
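How project might handle the per-layer dictionary is sketched below as a standalone function. The interface does not state the shape of the returned array, so the (n_layers, n_dims) layout is an assumption, as are the function name project_layers and the layer indices; all inputs are synthetic.

```python
from __future__ import annotations

import numpy as np


def project_layers(
    activations: dict[int, np.ndarray],
    basis_vectors: np.ndarray,   # (n_layers, n_dims, hidden_dim)
    mean: np.ndarray,            # (n_layers, hidden_dim)
    layers: list[int],           # layer index for each row of the tensors above
) -> np.ndarray:
    """Return z-coordinates with shape (n_layers, n_dims)."""
    missing = [layer for layer in layers if layer not in activations]
    if missing:
        raise KeyError(f"activations missing for layers {missing}")
    stacked = np.stack([activations[layer] for layer in layers])   # (n_layers, hidden_dim)
    centered = stacked - mean
    # For each layer l: z[l] = basis_vectors[l] @ centered[l]
    return np.einsum("lkh,lh->lk", basis_vectors, centered)


rng = np.random.default_rng(0)
layers = [6, 12]   # illustrative layer indices
basis_vectors = rng.normal(size=(2, 4, 16)).astype(np.float32)
mean = rng.normal(size=(2, 16)).astype(np.float32)
activations = {layer: rng.normal(size=16).astype(np.float32) for layer in layers}

z = project_layers(activations, basis_vectors, mean, layers)
```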

Constraints

Immutable at runtime — The codebook is read-only during screening. Modifying the codebook requires explicit recompilation.
Model-bound — A codebook is valid only for the specific model it was compiled for. Loading a codebook with the wrong model produces undefined results.
Deterministic — Same codebook + same activations = same scores.
Portable — Codebook can be saved to disk and reloaded without recomputation. Can be distributed via HuggingFace Hub.
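The immutability and model-binding constraints could be enforced roughly as follows. This is a sketch, not the planned Codebook class: the name FrozenCodebook, the check_model method, and the revision strings are hypothetical. It combines a frozen dataclass with a read-only private copy of the tensor.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True, eq=False)
class FrozenCodebook:
    model_id: str
    model_revision: str
    basis_vectors: np.ndarray

    def __post_init__(self):
        # Keep a private read-only copy so later writes by the caller
        # cannot change what the detector sees.
        frozen = np.array(self.basis_vectors, dtype=np.float32)
        frozen.setflags(write=False)
        object.__setattr__(self, "basis_vectors", frozen)

    def check_model(self, model_id: str, model_revision: str) -> None:
        """Refuse to pair this codebook with a different model or revision."""
        if (model_id, model_revision) != (self.model_id, self.model_revision):
            raise ValueError(
                f"codebook compiled for {self.model_id}@{self.model_revision}, "
                f"got {model_id}@{model_revision}"
            )


source = np.eye(2)
codebook = FrozenCodebook("SmolLM2-135M", "rev-a", source)
source[0, 0] = 5.0            # does not affect the codebook's private copy
codebook.check_model("SmolLM2-135M", "rev-a")   # matching model: no error
```

Failing loudly on a model mismatch turns the "undefined results" case into an explicit error at load time.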

Design Decisions

ADR	Decision	Summary
004	SVD-based detection	Interpretable, efficient, multi-dimensional
005	Safetensors-only	Secure format for codebook tensors
009	Last-token extraction	Which activation to use for projection
010	Monotonic spline distributions	Behavioral region scoring

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed? (open)
OQ-04: Should detection thresholds be per-model or globally configurable? (open)
