docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect
  methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification
  steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
 distributions are calibrated for this scale, so threshold values in the
 codebook are specific to the z-coordinate range of the calibration data.
 
+**Training shape**: `(N, 3)` where N is the total number of token positions
+across all calibration prompts. Each token position produces its own
+z-coordinate, so the population data is a flattened collection of per-token
+z-vectors.
+
+**Inference shape**: `(seq_len, 3)` for a single input. Each token position
+in the input sequence produces a z-coordinate. The detection pipeline
+operates on this per-token sequence, optionally smoothing it before
+classification.
+
 ### SVD Basis
 
 Singular Value Decomposition of the activation space from a calibration dataset
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
 Inputs whose projections fall near or beyond the region boundary score
 increasingly high (SUSPICIOUS → DANGEROUS).
 
+### Copula Decomposition
+
+Raw z-coordinates are not the detection feature space. The codebook
+decomposes z-coordinates through a copula transform that separates **scale**
+(how far from normal) from **position** (which behavioral direction):
+
+```
+z → CDF → (x₀, x₁, x₂)           # Uniform marginals via CDF transform
+  → S = x₀ + x₁ + x₂             # Scale: total CDF magnitude
+  → x_norm = simplex(x)          # Normalize to probability simplex
+  → (u, v) = barycentric(x_norm) # Position: 2D simplex coordinates
+```
+
+The three derived features `(S, u, v)` form the actual detection space:
+
+- **S (scale)**: How far the input's z-coordinates deviate from the
+  population norm, aggregated across all three SVD dimensions. High S means
+  the input is anomalous in *magnitude*.
+- **u, v (position)**: Where the input sits on the behavioral simplex —
+  which *direction* the deviation points. Different behavioral patterns
+  (refusal, instruction-following, self-reference) separate along different
+  (u, v) axes.
+
+This decomposition is why the codebook can distinguish "this input activates
+the refusal direction" from "this input is just generally unusual" — the same
+S value with different (u, v) coordinates implies different behavioral
+patterns.
+
+The PoC's `decompose()` method implements this pipeline as a pure function.
+It is called both during codebook compilation (to compute direction
+profiles) and during inference (to transform new z-coordinates for
+classification).
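
The decomposition added above could be sketched roughly as follows. This is a minimal illustration, not the PoC's `decompose()`: the empirical CDF stands in for the spline CDF, the barycentric map onto an equilateral triangle is an assumed convention, and the names `empirical_cdf` and `decompose_point` are hypothetical. The second spline that maps S through its own CDF is omitted.

```python
import math
from bisect import bisect_right


def empirical_cdf(sorted_population, value):
    """Stand-in for the spline CDF: fraction of calibration values <= value."""
    return bisect_right(sorted_population, value) / len(sorted_population)


def decompose_point(z, populations):
    """Map one z-vector (3 floats) to (S, u, v).

    populations: three sorted lists of calibration z-values, one per SVD
    dimension (hypothetical input; the codebook would use fitted splines).
    """
    # z -> CDF -> (x0, x1, x2): approximately uniform marginals
    x = [empirical_cdf(pop, zi) for pop, zi in zip(populations, z)]

    # Scale: total CDF magnitude
    s = sum(x)

    # Normalize to the probability simplex.
    # Assumption: an all-zero input maps to the simplex centre.
    p = [xi / s for xi in x] if s > 0.0 else [1 / 3, 1 / 3, 1 / 3]

    # Assumed barycentric map onto the triangle with vertices
    # (0, 0), (1, 0), (0.5, sqrt(3) / 2).
    u = p[1] + 0.5 * p[2]
    v = (math.sqrt(3) / 2) * p[2]
    return s, u, v


populations = [list(range(10))] * 3
print(decompose_point((4, 4, 4), populations))  # balanced input: simplex centre
print(decompose_point((9, 4, 4), populations))  # same pipeline, larger S, shifted (u, v)
```

Under these assumptions, two inputs with the same S can still land at different (u, v) positions, which is the separation of scale from direction that the section describes.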
+
+### Direction Profiles and Contrast Pairs
+
+The codebook doesn't just detect "anomalous" — it detects specific behavioral
+**directions**. Each direction is defined by a contrast pair of conditions:
+
+| Contrast Pair | Condition A | Condition B | Behavioral Direction |
+|---------------|-------------|-------------|---------------------|
+| refusal | harmful | harmless | Refusal activation |
+| instruction_vs_data | instruction | data | Instruction-following |
+| tool_call | tool_call | natural_language | Tool call patterns |
+| self_vs_other | self_ref | other_ref | Self-reference |
+| semantic_violation | violated | expected | Semantic norm violation |
+| uncertainty | uncertain | confident | Uncertainty expression |
+| injection | injection | benign_instruction | Prompt injection |
+
+For each contrast pair, the codebook computes a **DirectionProfile** — the
+statistical baseline (means, pooled standard deviations, Cohen's d) of the
+(S, u, v) features for both conditions. This enables:
+
+1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
+   features of condition A vs condition B. Produces P(active | features) —
+   the probability that the input exhibits the "active" behavioral pattern.
+2. **Thresholds**: Midpoints between condition means for each feature, used
+   for interpretable rule-based detection as a fallback.
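
The per-feature statistics behind a DirectionProfile could be computed along these lines. This is a sketch under assumptions: it uses the standard sample-variance pooled standard deviation, which the diff does not spell out, and the function name `profile_feature` and its return keys are hypothetical.

```python
import math
import statistics


def profile_feature(values_a, values_b):
    """Summarize one feature (S, u, or v) for condition A vs condition B."""
    n_a, n_b = len(values_a), len(values_b)
    mean_a = statistics.fmean(values_a)
    mean_b = statistics.fmean(values_b)

    # Assumed pooling: weight each sample variance by its degrees of freedom.
    var_a = statistics.variance(values_a)
    var_b = statistics.variance(values_b)
    std_pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

    # Cohen's d: difference of means in units of the pooled std.
    cohen_d = (mean_a - mean_b) / std_pooled if std_pooled > 0 else 0.0

    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std_pooled,
        "cohen_d": cohen_d,
        "threshold": (mean_a + mean_b) / 2,  # midpoint used by the rule-based fallback
    }


print(profile_feature([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))
```

Running this once per feature and per contrast pair would yield the means, pooled standard deviations, effect sizes, and midpoint thresholds that the section lists.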
+
+### Token-Level Smoothing
+
+During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
+The detection pipeline optionally applies a rolling average (uniform kernel)
+to the decomposed (S, u, v) features before classification:
+
+- **window=1**: No smoothing. Each token position classified independently.
+- **window=8** (PoC default): Smooth features across 8 token positions.
+  Reduces noise from individual token fluctuations while preserving
+  sustained behavioral signals.
+
+Smoothing is an inference-time parameter — it does not affect codebook
+compilation or thresholds. The codebook is calibrated on per-token
+z-coordinates (all positions from the calibration data, flattened into
+`(N, 3)`), so the classifier weights are valid regardless of the smoothing
+window used at inference time.
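
A uniform-kernel rolling average over one feature sequence might look like the sketch below. The diff does not say whether the PoC window is trailing or centered, or how sequence edges are handled. This version assumes a trailing window that averages over however many positions are available so far, and `rolling_mean` is a hypothetical helper name.

```python
def rolling_mean(values, window=8):
    """Trailing uniform average over up to `window` positions per token."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        # Early positions average over fewer than `window` values.
        smoothed.append(running / min(i + 1, window))
    return smoothed


s_sequence = [2.0, 4.0, 6.0, 8.0]
print(rolling_mean(s_sequence, window=1))  # window=1 leaves the sequence unchanged
print(rolling_mean(s_sequence, window=2))
```

The same function would be applied to the S, u, and v sequences independently before classification.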
+
 ### Spline Distributions
 
 Monotonic spline distributions model the probability density along each SVD
-dimension (ADR-010). They provide:
+dimension (ADR-010). They serve two roles in the codebook:
 
+1. **CDF transform**: The copula decomposition requires mapping z-coordinates
+   to uniform marginals via the CDF. The spline CDF provides this transform.
+2. **Scale distribution**: A separate spline distribution models the
+   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
+
+They provide:
 - **Smooth scoring**: Continuous score rather than hard threshold
 - **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
   anomalous inputs
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
   adversarial training
 
 The spline distribution approach is adapted from the metaspline PoC
-(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+(`spline.py` — `SplineDistribution` class, ~378 lines).
 
 **Formal definition**: The CDF along each dimension is modeled as a monotonic
-cubic spline with 10–20 knots. Knot positions are determined by quantiles of
-the calibration data (ensuring density of knots where data is dense). Beyond
-the extreme knots, the CDF decays exponentially at a rate fitted to the tail
-data. The scoring function maps a z-coordinate to a score in [0, 1] via the
-CDF's complement: `score = 1 - cdf(z)`.
-
-**Canonical implementation**: The metaspline PoC files `spline.py`
-(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
-and `space.py` (`unfold`/`fold`) are the reference implementation for the
-codebook compilation pipeline.
+cubic spline with knots (typically 10–64, depending on calibration data
+size). Knot positions are determined by quantiles of the calibration data
+(ensuring density of knots where data is dense). Beyond the extreme knots,
+the CDF decays exponentially at a rate fitted to the tail data.
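
Quantile-based knot placement, as described in the formal definition, can be illustrated with the standard library alone. This is only a sketch of the knot-selection step, assuming evenly spaced quantiles with the sample minimum and maximum as the extreme knots. It does not fit the monotonic cubic spline or the exponential tails, and `quantile_knots` is a hypothetical name.

```python
import statistics


def quantile_knots(samples, n_knots):
    """Place n_knots knots at evenly spaced quantiles, including min and max."""
    if n_knots < 2:
        raise ValueError("need at least two knots")
    # statistics.quantiles returns n - 1 cut points, so this yields
    # n_knots - 2 interior knots.
    interior = statistics.quantiles(samples, n=n_knots - 1, method="inclusive")
    return [min(samples)] + interior + [max(samples)]


calibration = list(range(101))  # stand-in for one dimension of z-coordinates
print(quantile_knots(calibration, 5))
```

Because the knots follow the quantiles, regions where the calibration data is dense receive more knots than sparse regions.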
 
 ### Calibration Dataset
 
-The calibration dataset is the set of normal (non-adversarial) inputs used to
-compute the SVD basis and fit behavioral region distributions. Requirements:
+The calibration dataset serves two purposes: establishing the population
+distribution (normal behavioral baseline) and providing contrast pairs
+(labeled examples for each behavioral direction).
 
-- **Composition**: Diverse normal inputs representative of the deployment
-  domain. No adversarial examples — the basis models *normal* behavior, and
-  anomalies are detected as deviations from it.
+**Population data**: Diverse normal inputs representative of the deployment
+domain. No adversarial examples — the population models *normal* behavior,
+and anomalies are detected as deviations from it. Each prompt is processed
+by the detector model, and z-coordinates are extracted at every token
+position. The flattened `(N, 3)` tensor of all positions forms the population.
+
+**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
+instruction/data) that define each behavioral direction. Each condition
+produces a set of z-coordinates that, after copula decomposition, reveal
+where the conditions separate in (S, u, v) space.
+
+Requirements:
+- **Composition**: Population must cover the range of normal inputs the
+  detector will see in production. Contrast pairs must be clearly distinct
+  along their target behavioral direction.
 - **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
-  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
-  have diminishing returns.
-- **Diversity**: Must cover the range of normal inputs the detector will see
-  in production. A narrow calibration dataset (e.g., only short English
-  queries) will produce high false positive rates on unusual but benign inputs.
-- **Model-specific**: A calibration dataset must be collected for each detector
+  Practical range: 1,000–10,000 prompts for population. Each contrast
+  condition needs at least 50–200 prompts.
+- **Diversity**: A narrow population (e.g., only short English queries) will
+  produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: Calibration data must be collected for each detector
   model by running that model on the inputs and extracting activations.
 
 The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
-automates calibration dataset processing.
+automates calibration dataset processing with `max_length=128` tokens per
+prompt.
 
 ### Codebook Compilation
 
 The codebook is compiled offline by a training pipeline that:
 
-1. Runs the detector model on a calibration dataset (diverse normal inputs)
-2. Extracts hidden state activations at configured layers
-3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
-   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
-   which uses randomized approximation and may not be deterministic)
-4. Fits spline distributions along each retained dimension
-5. Computes detection thresholds
-6. Serializes the codebook to a portable format (safetensors + JSON config)
+1. Runs the detector model on a calibration dataset (population + contrast
+   pairs)
+2. Extracts hidden state activations at configured layers for every token
+   position (not just last-token)
+3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
+   deterministic decomposition)
+4. Projects activations onto the top-3 SVD components → z-coordinates
+5. Fits spline distributions on each SVD dimension and the sum S
+6. Applies copula decomposition to all z-coordinates → (S, u, v) features
+7. Computes direction profiles (means, pooled std, Cohen's d) for each
+   contrast pair
+8. Trains logistic classifiers on (S, u, v) for each contrast pair
+9. Computes detection thresholds (midpoints between condition means)
+10. Serializes the codebook to a portable format (safetensors + JSON config)
 
 This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
 package** as package data (under `src/alknet_firewall/data/codebook/`). This
@@ -224,8 +328,10 @@ The codebook is stored as:
 codebook/
 ├── basis.safetensors        # SVD basis vectors (n_layers × n_dims × hidden_dim)
 ├── regions.safetensors      # Region boundary parameters
+├── classifiers.safetensors  # Logistic classifier weights per direction
 ├── splines.json             # Spline knot positions and coefficients
-└── config.json              # Metadata: model_id, revision, n_dims, thresholds
+├── profiles.json            # Direction profiles (means, stds, Cohen's d)
+└── config.json              # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
 ```
 
 All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
 | `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
 
+**classifiers.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
+| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
+| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
+| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
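
With one weight per feature and one intercept per direction, P(active) for a single token position reduces to a logistic function of a weighted sum. The sketch below assumes exactly that form. The function name `p_active` is hypothetical, and whether the scale input is raw S or its CDF-transformed value depends on what the classifier was trained on, which the table does not state.

```python
import math


def p_active(s, u, v, w_sum, w_u, w_v, intercept):
    """Logistic P(active | S, u, v) for one direction at one token position."""
    logit = w_sum * s + w_u * u + w_v * v + intercept
    # Numerically stable sigmoid: avoid overflow in exp() for large |logit|.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Illustrative weights only; real values would come from classifiers.safetensors.
print(p_active(1.5, 0.5, 0.2, w_sum=2.0, w_u=0.0, w_v=0.0, intercept=-3.0))
```

Indexing each of the four arrays at the same direction position would give the per-direction parameters passed in here.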
+
 **splines.json**:
 | Field | Type | Description |
 |-------|------|-------------|
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
 | `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
 
+**profiles.json**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
+| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |
+
+Each `DirectionProfile` entry contains:
+| Field | Type | Description |
+|-------|------|-------------|
+| `label` | `str` | Direction name (e.g., "refusal") |
+| `sum_mean_a` | `float` | Mean S for condition A |
+| `sum_mean_b` | `float` | Mean S for condition B |
+| `sum_std_pooled` | `float` | Pooled std of S |
+| `u_mean_a` | `float` | Mean u for condition A |
+| `u_mean_b` | `float` | Mean u for condition B |
+| `u_std_pooled` | `float` | Pooled std of u |
+| `v_mean_a` | `float` | Mean v for condition A |
+| `v_mean_b` | `float` | Mean v for condition B |
+| `v_std_pooled` | `float` | Pooled std of v |
+| `cohen_d_sum` | `float` | Effect size for S |
+| `cohen_d_u` | `float` | Effect size for u |
+| `cohen_d_v` | `float` | Effect size for v |
+
 ## Interfaces
 
 ### Internal API
@@ -262,18 +399,54 @@ class CodebookConfig:
     model_revision: str
     n_dimensions: int
     layers: list[int]
-    suspicious_threshold: float  # Serialized threshold values
-    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)
+    suspicious_threshold: float
+    dangerous_threshold: float
+    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
+    smoothing_window: int = 8  # Token-level smoothing (inference only)
 
 class Codebook:
     def __init__(self, path: Path): ...
 
     def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
-        """Project raw activations onto SVD basis → z-coordinates."""
+        """Project raw activations onto SVD basis → z-coordinates.
+
+        Returns: (seq_len, 3) z-coordinates.
+        """
         ...
 
-    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
-        """Score z-coordinates against behavioral regions."""
+    def decompose(self, z_coords: np.ndarray) -> dict:
+        """Copula decomposition: z → CDF → (S, u, v).
+
+        Args:
+            z_coords: (seq_len, 3) or (N, 3) z-coordinates
+
+        Returns:
+            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
+        """
         ...
 
+    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
+        """Classify decomposed features using logistic classifiers.
+
+        Args:
+            features: Output of decompose(), with (seq_len,) arrays
+            window: Smoothing window size (1 = no smoothing)
+
+        Returns:
+            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
+        """
+        ...
+
+    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
+               min_positions: int = 3, window: int = 8) -> DetectionResult:
+        """Full detection pipeline: project → decompose → smooth → classify → flag.
+
+        Args:
+            z_coords: (seq_len, 3) z-coordinates for a single input
+            threshold_prob: P(active) threshold for flagging a direction
+            min_positions: Minimum token positions above threshold to flag
+            window: Smoothing window for token-level features
+        """
+        ...
+
     @classmethod
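
The final "flag" step of `detect()` could plausibly work as in the sketch below, based only on the documented `threshold_prob` and `min_positions` parameters. Two assumptions are made here that the diff does not confirm: positions are counted with `>=`, and they do not need to be consecutive. `flag_directions` is a hypothetical helper, not part of the documented API.

```python
def flag_directions(probs_by_direction, threshold_prob=0.7, min_positions=3):
    """Flag each direction whose per-token P(active) clears the threshold
    at no fewer than `min_positions` token positions."""
    flagged = {}
    for name, probs in probs_by_direction.items():
        hits = sum(1 for p in probs if p >= threshold_prob)
        flagged[name] = hits >= min_positions
    return flagged


per_token_probs = {
    "refusal": [0.9, 0.8, 0.75, 0.1],    # three positions at or above 0.7
    "injection": [0.9, 0.2, 0.1, 0.8],   # only two positions at or above 0.7
}
print(flag_directions(per_token_probs))
```

Requiring several positions rather than a single spike complements the smoothing window: both favor sustained signals over one-token fluctuations.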