docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions


@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.
**Training shape**: `(N, 3)` where N is the total number of token positions
across all calibration prompts. Each token position produces its own
z-coordinate, so the population data is a flattened collection of per-token
z-vectors.
**Inference shape**: `(seq_len, 3)` for a single input. Each token position
in the input sequence produces a z-coordinate. The detection pipeline
operates on this per-token sequence, optionally smoothing it before
classification.
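The two shapes can be illustrated with a minimal sketch. It assumes per-prompt z-coordinates are already available as NumPy arrays; the prompt lengths and random values are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-prompt z-coordinates: one (seq_len, 3) array per prompt.
prompt_lengths = [5, 12, 7]
per_prompt_z = [rng.normal(size=(n, 3)) for n in prompt_lengths]

# Training shape: every token position from every prompt, flattened to (N, 3).
population_z = np.concatenate(per_prompt_z, axis=0)

# Inference shape: a single input keeps its per-token sequence, (seq_len, 3).
inference_z = per_prompt_z[1]

print(population_z.shape)  # (24, 3)
print(inference_z.shape)   # (12, 3)
```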
### SVD Basis
Singular Value Decomposition of the activation space from a calibration dataset
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).
### Copula Decomposition
Raw z-coordinates are not the detection feature space. The codebook
decomposes z-coordinates through a copula transform that separates **scale**
(how far from normal) from **position** (which behavioral direction):
```
z → CDF → (x₀, x₁, x₂) # Uniform marginals via CDF transform
→ S = x₀ + x₁ + x₂ # Scale: total CDF magnitude
→ x_norm = simplex(x) # Normalize to probability simplex
→ (u, v) = barycentric(x_norm) # Position: 2D simplex coordinates
```
The three derived features `(S, u, v)` form the actual detection space:
- **S (scale)**: How far the input's z-coordinates deviate from the
population norm, aggregated across all three SVD dimensions. High S means
the input is anomalous in *magnitude*.
- **u, v (position)**: Where the input sits on the behavioral simplex —
which *direction* the deviation points. Different behavioral patterns
(refusal, instruction-following, self-reference) separate along different
(u, v) axes.
This decomposition is why the codebook can distinguish "this input activates
the refusal direction" from "this input is just generally unusual" — the same
S value with different (u, v) coordinates implies different behavioral
patterns.
The PoC's `decompose()` method implements this pipeline as a pure function.
It is called both during codebook compilation (to compute direction
profiles) and during inference (to transform new z-coordinates for
classification).
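The transform chain above can be sketched as follows. This is not the PoC's `decompose()`: it substitutes an empirical CDF for the spline CDF, assumes simplex normalization means dividing by the row sum, and assumes an equilateral-triangle embedding for the barycentric step. The names `empirical_cdf`, `decompose_sketch`, and `VERTICES` are hypothetical, and the sketch returns raw S rather than the CDF of S.

```python
import numpy as np

# Assumed triangle vertices for embedding the simplex in 2D (hypothetical choice).
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3.0) / 2.0]])


def empirical_cdf(population: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Per-dimension empirical CDF, used here as a stand-in for the spline CDF."""
    x = np.empty(z.shape, dtype=float)
    for d in range(z.shape[1]):
        ref = np.sort(population[:, d])
        ranks = np.searchsorted(ref, z[:, d], side="right")
        # Offset keeps every value strictly inside (0, 1).
        x[:, d] = (ranks + 0.5) / (len(ref) + 1.0)
    return x


def decompose_sketch(population: np.ndarray, z: np.ndarray) -> dict:
    """z -> CDF -> (S, u, v) under the assumptions stated above."""
    x = empirical_cdf(population, z)   # (n, 3) approximately uniform marginals
    s = x.sum(axis=1)                  # scale: S = x0 + x1 + x2
    x_norm = x / s[:, None]            # each row now sums to 1
    uv = x_norm @ VERTICES             # barycentric weights -> 2D position
    return {"S": s, "u": uv[:, 0], "v": uv[:, 1]}


rng = np.random.default_rng(0)
population = rng.normal(size=(1000, 3))
features = decompose_sketch(population, rng.normal(size=(16, 3)))
```

Because the function depends only on its inputs, the same call works on the flattened `(N, 3)` population and on a single `(seq_len, 3)` input.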
### Direction Profiles and Contrast Pairs
The codebook doesn't just detect "anomalous" — it detects specific behavioral
**directions**. Each direction is defined by a contrast pair of conditions:
| Contrast Pair | Condition A | Condition B | Behavioral Direction |
|---------------|-------------|-------------|---------------------|
| refusal | harmful | harmless | Refusal activation |
| instruction_vs_data | instruction | data | Instruction-following |
| tool_call | tool_call | natural_language | Tool call patterns |
| self_vs_other | self_ref | other_ref | Self-reference |
| semantic_violation | violated | expected | Semantic norm violation |
| uncertainty | uncertain | confident | Uncertainty expression |
| injection | injection | benign_instruction | Prompt injection |
For each contrast pair, the codebook computes a **DirectionProfile** — the
statistical baseline (means, pooled standard deviations, Cohen's d) of the
(S, u, v) features for both conditions. This enables:
1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
features of condition A vs condition B. Produces P(active | features) —
the probability that the input exhibits the "active" behavioral pattern.
2. **Thresholds**: Midpoints between condition means for each feature, used
for interpretable rule-based detection as a fallback.
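As a rough illustration of the profile statistics, the sketch below computes the means, pooled standard deviation, Cohen's d, and midpoint threshold for one feature. It assumes the textbook pooled-variance formula with sample variances; the PoC may compute these differently, and `profile_feature` is a hypothetical helper.

```python
import numpy as np


def profile_feature(a, b) -> dict:
    """Baseline statistics for one feature (S, u, or v) across two conditions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    pooled_std = float(np.sqrt(pooled_var))
    mean_a, mean_b = float(a.mean()), float(b.mean())
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": pooled_std,
        "cohen_d": (mean_a - mean_b) / pooled_std,
        "threshold": 0.5 * (mean_a + mean_b),  # midpoint between condition means
    }


stats = profile_feature([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(stats["cohen_d"], stats["threshold"])  # 0.5 3.5
```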
### Token-Level Smoothing
During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
The detection pipeline optionally applies a rolling average (uniform kernel)
to the decomposed (S, u, v) features before classification:
- **window=1**: No smoothing. Each token position classified independently.
- **window=8** (PoC default): Smooth features across 8 token positions.
Reduces noise from individual token fluctuations while preserving
sustained behavioral signals.
Smoothing is an inference-time parameter — it does not affect codebook
compilation or thresholds. The codebook is calibrated on per-token
z-coordinates (all positions from the calibration data, flattened into
`(N, 3)`), so the classifier weights are valid regardless of the smoothing
window used at inference time.
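A uniform-kernel rolling average over one feature could look like the sketch below. The trailing (causal) window and the truncated window at the start of the sequence are assumptions; the PoC's alignment and edge handling are not specified here, and `rolling_mean` is a hypothetical name.

```python
import numpy as np


def rolling_mean(values, window: int) -> np.ndarray:
    """Trailing uniform-kernel average; the first positions use a shorter window."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    csum = np.concatenate(([0.0], np.cumsum(values)))
    idx = np.arange(n)
    lo = np.maximum(idx - window + 1, 0)  # window start, clipped at position 0
    return (csum[idx + 1] - csum[lo]) / (idx + 1 - lo)


# window=1 leaves the sequence unchanged.
print(rolling_mean([1.0, 2.0, 3.0, 4.0], window=1))  # [1. 2. 3. 4.]

# A single-token spike is damped, which is the noise reduction described above.
spike = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
smoothed = rolling_mean(spike, window=4)
```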
### Spline Distributions
Monotonic spline distributions model the probability density along each SVD
dimension (ADR-010). They provide:
dimension (ADR-010). They serve two roles in the codebook:
1. **CDF transform**: The copula decomposition requires mapping z-coordinates
to uniform marginals via the CDF. The spline CDF provides this transform.
2. **Scale distribution**: A separate spline distribution models the
sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
They provide:
- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
anomalous inputs
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
adversarial training
The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
(`spline.py` `SplineDistribution` class, ~378 lines).
**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are determined by quantiles of
the calibration data (ensuring density of knots where data is dense). Beyond
the extreme knots, the CDF decays exponentially at a rate fitted to the tail
data. The scoring function maps a z-coordinate to a score in [0, 1] via the
CDF's complement: `score = 1 - cdf(z)`.
**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.
cubic spline with knots (typically 10–64, depending on calibration data
size). Knot positions are determined by quantiles of the calibration data
(ensuring density of knots where data is dense). Beyond the extreme knots,
the CDF decays exponentially at a rate fitted to the tail data.
### Calibration Dataset
The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:
The calibration dataset serves two purposes: establishing the population
distribution (normal behavioral baseline) and providing contrast pairs
(labeled examples for each behavioral direction).
- **Composition**: Diverse normal inputs representative of the deployment
domain. No adversarial examples — the basis models *normal* behavior, and
anomalies are detected as deviations from it.
**Population data**: Diverse normal inputs representative of the deployment
domain. No adversarial examples — the population models *normal* behavior,
and anomalies are detected as deviations from it. Each prompt is processed
by the detector model, and z-coordinates are extracted at every token
position. The flattened `(N, 3)` tensor of all positions forms the population.
**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
instruction/data) that define each behavioral direction. Each condition
produces a set of z-coordinates that, after copula decomposition, reveal
where the conditions separate in (S, u, v) space.
Requirements:
- **Composition**: Population must cover the range of normal inputs the
detector will see in production. Contrast pairs must be clearly distinct
along their target behavioral direction.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
in production. A narrow calibration dataset (e.g., only short English
queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
Practical range: 1,000–10,000 prompts for population. Each contrast
condition needs at least 50–200 prompts.
- **Diversity**: A narrow population (e.g., only short English queries) will
produce high false positive rates on unusual but benign inputs.
- **Model-specific**: Calibration data must be collected for each detector
model by running that model on the inputs and extracting activations.
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.
automates calibration dataset processing with `max_length=128` tokens per
prompt.
### Codebook Compilation
The codebook is compiled offline by a training pipeline that:
1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
which uses randomized approximation and may not be deterministic)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)
1. Runs the detector model on a calibration dataset (population + contrast
pairs)
2. Extracts hidden state activations at configured layers for every token
position (not just last-token)
3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
deterministic decomposition)
4. Projects activations onto the top-3 SVD components → z-coordinates
5. Fits spline distributions on each SVD dimension and the sum S
6. Applies copula decomposition to all z-coordinates → (S, u, v) features
7. Computes direction profiles (means, pooled std, Cohen's d) for each
contrast pair
8. Trains logistic classifiers on (S, u, v) for each contrast pair
9. Computes detection thresholds (midpoints between condition means)
10. Serializes the codebook to a portable format (safetensors + JSON config)
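Steps 3 and 4 can be sketched on synthetic data as follows. Two things here are assumptions rather than the pipeline's confirmed behavior: the sketch uses `numpy.linalg.svd` in place of `torch.linalg.svd`, and it treats mean-centered activations as the "perturbation vectors". The array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: (N token positions, hidden_dim). Real data would come
# from the detector model's hidden states at a configured layer.
activations = rng.normal(size=(200, 16))

# Assumption: perturbation vectors are activations minus their mean.
centered = activations - activations.mean(axis=0)

# Exact SVD; rows of vt are the right singular vectors, ordered by singular value.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

basis = vt[:3]            # top-3 components, shape (3, hidden_dim)
z = centered @ basis.T    # raw (unnormalized) z-coordinates, shape (N, 3)

print(z.shape)  # (200, 3)
```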
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
@@ -224,8 +328,10 @@ The codebook is stored as:
codebook/
├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors # Region boundary parameters
├── classifiers.safetensors # Logistic classifier weights per direction
├── splines.json # Spline knot positions and coefficients
└── config.json # Metadata: model_id, revision, n_dims, thresholds
├── profiles.json # Direction profiles (means, stds, Cohen's d)
└── config.json # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
```
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
**classifiers.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
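One way the stored weights might be applied at inference time is sketched below for a single direction. The weight values are invented, `p_active` and `is_flagged` are hypothetical helpers, and the flagging rule (count positions strictly above the probability threshold) is an assumption based on the `threshold_prob` and `min_positions` parameters of `detect()`.

```python
import numpy as np


def p_active(S, u, v, w_sum: float, w_u: float, w_v: float, intercept: float) -> np.ndarray:
    """Logistic regression on (S, u, v): per-position P(active | features)."""
    logit = w_sum * np.asarray(S) + w_u * np.asarray(u) + w_v * np.asarray(v) + intercept
    return 1.0 / (1.0 + np.exp(-logit))


def is_flagged(prob: np.ndarray, threshold_prob: float = 0.7, min_positions: int = 3) -> bool:
    """Flag a direction when enough token positions exceed the threshold."""
    return int(np.sum(prob > threshold_prob)) >= min_positions


# Invented features for a 4-token input and invented weights for one direction.
S = np.array([1.0, 2.0, 3.0, 4.0])
u = np.zeros(4)
v = np.zeros(4)
prob = p_active(S, u, v, w_sum=1.0, w_u=0.0, w_v=0.0, intercept=-2.0)

# Logits are -1, 0, 1, 2, so only the last two positions exceed 0.7.
print(is_flagged(prob, min_positions=3))  # False
print(is_flagged(prob, min_positions=2))  # True
```

Since each key has shape `(n_directions,)`, indexing all four tensors with the same direction index would yield the scalar weights passed to `p_active`.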
**splines.json**:
| Field | Type | Description |
|-------|------|-------------|
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
**profiles.json**:
| Field | Type | Description |
|-------|------|-------------|
| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |
Each `DirectionProfile` entry contains:
| Field | Type | Description |
|-------|------|-------------|
| `label` | `str` | Direction name (e.g., "refusal") |
| `sum_mean_a` | `float` | Mean S for condition A |
| `sum_mean_b` | `float` | Mean S for condition B |
| `sum_std_pooled` | `float` | Pooled std of S |
| `u_mean_a` | `float` | Mean u for condition A |
| `u_mean_b` | `float` | Mean u for condition B |
| `u_std_pooled` | `float` | Pooled std of u |
| `v_mean_a` | `float` | Mean v for condition A |
| `v_mean_b` | `float` | Mean v for condition B |
| `v_std_pooled` | `float` | Pooled std of v |
| `cohen_d_sum` | `float` | Effect size for S |
| `cohen_d_u` | `float` | Effect size for u |
| `cohen_d_v` | `float` | Effect size for v |
## Interfaces
### Internal API
@@ -262,18 +399,54 @@ class CodebookConfig:
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float # Serialized threshold values
    dangerous_threshold: float # (mapped to Thresholds dataclass at runtime)
    suspicious_threshold: float
    dangerous_threshold: float
    contrast_pairs: list[tuple[str, str, str]] # (cond_a, cond_b, label)
    smoothing_window: int = 8 # Token-level smoothing (inference only)
class Codebook:
    def __init__(self, path: Path): ...
    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        """Project raw activations onto SVD basis → z-coordinates.
        Returns: (seq_len, 3) z-coordinates.
        """
        ...
    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
    def decompose(self, z_coords: np.ndarray) -> dict:
        """Copula decomposition: z → CDF → (S, u, v).
        Args:
            z_coords: (seq_len, 3) or (N, 3) z-coordinates
        Returns:
            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
        """
        ...
    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
        """Classify decomposed features using logistic classifiers.
        Args:
            features: Output of decompose(), with (seq_len,) arrays
            window: Smoothing window size (1 = no smoothing)
        Returns:
            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
        """
        ...
    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
               min_positions: int = 3, window: int = 8) -> DetectionResult:
        """Full detection pipeline: project → decompose → smooth → classify → flag.
        Args:
            z_coords: (seq_len, 3) z-coordinates for a single input
            threshold_prob: P(active) threshold for flagging a direction
            min_positions: Minimum token positions above threshold to flag
            window: Smoothing window for token-level features
        """
        ...
    @classmethod