From 45a0e0798ca4a042d98af8172a95543f96c2df8d Mon Sep 17 00:00:00 2001 From: "glm-5.1" Date: Sat, 13 Jun 2026 08:17:09 +0000 Subject: [PATCH] docs: add copula decomposition pipeline, clarify detection data flow MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition. Key updates: - codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods - codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence - firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps - model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes - ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline --- docs/architecture/codebook.md | 249 +++++++++++++++--- .../decisions/009-last-token-extraction.md | 13 +- docs/architecture/firewall.md | 83 ++++-- docs/architecture/model.md | 27 +- 4 files changed, 300 insertions(+), 72 deletions(-) diff --git a/docs/architecture/codebook.md b/docs/architecture/codebook.md index 4f10c98..e295c94 100644 --- a/docs/architecture/codebook.md +++ b/docs/architecture/codebook.md @@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. 
The codebook's spline distributions are calibrated for this scale, so threshold values in the codebook are specific to the z-coordinate range of the calibration data. +**Training shape**: `(N, 3)` where N is the total number of token positions +across all calibration prompts. Each token position produces its own +z-coordinate, so the population data is a flattened collection of per-token +z-vectors. + +**Inference shape**: `(seq_len, 3)` for a single input. Each token position +in the input sequence produces a z-coordinate. The detection pipeline +operates on this per-token sequence, optionally smoothing it before +classification. + ### SVD Basis Singular Value Decomposition of the activation space from a calibration dataset @@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR). Inputs whose projections fall near or beyond the region boundary score increasingly high (SUSPICIOUS → DANGEROUS). +### Copula Decomposition + +Raw z-coordinates are not the detection feature space. The codebook +decomposes z-coordinates through a copula transform that separates **scale** +(how far from normal) from **position** (which behavioral direction): + +``` +z → CDF → (x₀, x₁, x₂) # Uniform marginals via CDF transform + → S = x₀ + x₁ + x₂ # Scale: total CDF magnitude + → x_norm = simplex(x) # Normalize to probability simplex + → (u, v) = barycentric(x_norm) # Position: 2D simplex coordinates +``` + +The three derived features `(S, u, v)` form the actual detection space: + +- **S (scale)**: How far the input's z-coordinates deviate from the + population norm, aggregated across all three SVD dimensions. High S means + the input is anomalous in *magnitude*. +- **u, v (position)**: Where the input sits on the behavioral simplex — + which *direction* the deviation points. Different behavioral patterns + (refusal, instruction-following, self-reference) separate along different + (u, v) axes. 
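
The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the PoC's `decompose()`: the empirical rank-based CDF stands in for the spline CDF, the equilateral-triangle barycentric mapping is one common convention and may differ from the PoC's, and the names `empirical_cdf` and `decompose_sketch` are hypothetical.

```python
import numpy as np


def empirical_cdf(population: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Rank-based CDF estimate (assumed stand-in for the codebook's spline CDF)."""
    sorted_pop = np.sort(population)
    ranks = np.searchsorted(sorted_pop, values, side="right")
    # Offset keeps results strictly inside (0, 1), so S below is never zero.
    return (ranks + 0.5) / (len(sorted_pop) + 1)


def decompose_sketch(z: np.ndarray, population: np.ndarray) -> dict[str, np.ndarray]:
    """Map (seq_len, 3) z-coordinates to scale S and simplex position (u, v)."""
    # Step 1: per-dimension CDF transform gives approximately uniform marginals.
    x = np.column_stack(
        [empirical_cdf(population[:, d], z[:, d]) for d in range(3)]
    )
    # Step 2: scale is the total CDF magnitude.
    s = x.sum(axis=1)
    # Step 3: normalise onto the probability simplex (rows sum to 1).
    x_norm = x / s[:, None]
    # Step 4: barycentric -> 2D, assuming triangle vertices
    # (0, 0), (1, 0), (0.5, sqrt(3)/2) for x0, x1, x2 respectively.
    u = x_norm[:, 1] + 0.5 * x_norm[:, 2]
    v = (np.sqrt(3.0) / 2.0) * x_norm[:, 2]
    return {"S": s, "u": u, "v": v}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    population = rng.normal(size=(1000, 3))  # flattened (N, 3) calibration data
    z_seq = rng.normal(size=(16, 3))         # one input, (seq_len, 3)
    features = decompose_sketch(z_seq, population)
    print({k: a.shape for k, a in features.items()})
```

Under this convention an input whose three CDF values are equal lands at the simplex centroid regardless of S, which is the sense in which scale and position are separated.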
+ +This decomposition is why the codebook can distinguish "this input activates +the refusal direction" from "this input is just generally unusual" — the same +S value with different (u, v) coordinates implies different behavioral +patterns. + +The PoC's `decompose()` method implements this pipeline as a pure function. +It is called both during codebook compilation (to compute direction +profiles) and during inference (to transform new z-coordinates for +classification). + +### Direction Profiles and Contrast Pairs + +The codebook doesn't just detect "anomalous" — it detects specific behavioral +**directions**. Each direction is defined by a contrast pair of conditions: + +| Contrast Pair | Condition A | Condition B | Behavioral Direction | +|---------------|-------------|-------------|---------------------| +| refusal | harmful | harmless | Refusal activation | +| instruction_vs_data | instruction | data | Instruction-following | +| tool_call | tool_call | natural_language | Tool call patterns | +| self_vs_other | self_ref | other_ref | Self-reference | +| semantic_violation | violated | expected | Semantic norm violation | +| uncertainty | uncertain | confident | Uncertainty expression | +| injection | injection | benign_instruction | Prompt injection | + +For each contrast pair, the codebook computes a **DirectionProfile** — the +statistical baseline (means, pooled standard deviations, Cohen's d) of the +(S, u, v) features for both conditions. This enables: + +1. **DirectionClassifier**: A logistic regression trained on the (S, u, v) + features of condition A vs condition B. Produces P(active | features) — + the probability that the input exhibits the "active" behavioral pattern. +2. **Thresholds**: Midpoints between condition means for each feature, used + for interpretable rule-based detection as a fallback. + +### Token-Level Smoothing + +During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`. 
+The detection pipeline optionally applies a rolling average (uniform kernel) +to the decomposed (S, u, v) features before classification: + +- **window=1**: No smoothing. Each token position classified independently. +- **window=8** (PoC default): Smooth features across 8 token positions. + Reduces noise from individual token fluctuations while preserving + sustained behavioral signals. + +Smoothing is an inference-time parameter — it does not affect codebook +compilation or thresholds. The codebook is calibrated on per-token +z-coordinates (all positions from the calibration data, flattened into +`(N, 3)`), so the classifier weights are valid regardless of the smoothing +window used at inference time. + ### Spline Distributions Monotonic spline distributions model the probability density along each SVD -dimension (ADR-010). They provide: +dimension (ADR-010). They serve two roles in the codebook: +1. **CDF transform**: The copula decomposition requires mapping z-coordinates + to uniform marginals via the CDF. The spline CDF provides this transform. +2. **Scale distribution**: A separate spline distribution models the + sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature. + +They provide: - **Smooth scoring**: Continuous score rather than hard threshold - **Tail sensitivity**: Exponential tail behavior captures rare-but-critical anomalous inputs @@ -97,52 +188,65 @@ dimension (ADR-010). They provide: adversarial training The spline distribution approach is adapted from the metaspline PoC -(`spline.py`, `transform.py`, `space.py` — ~280 lines total). +(`spline.py` — `SplineDistribution` class, ~378 lines). **Formal definition**: The CDF along each dimension is modeled as a monotonic -cubic spline with 10–20 knots. Knot positions are determined by quantiles of -the calibration data (ensuring density of knots where data is dense). Beyond -the extreme knots, the CDF decays exponentially at a rate fitted to the tail -data. 
The scoring function maps a z-coordinate to a score in [0, 1] via the -CDF's complement: `score = 1 - cdf(z)`. - -**Canonical implementation**: The metaspline PoC files `spline.py` -(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms), -and `space.py` (`unfold`/`fold`) are the reference implementation for the -codebook compilation pipeline. +cubic spline with knots (typically 10–64, depending on calibration data +size). Knot positions are determined by quantiles of the calibration data +(ensuring density of knots where data is dense). Beyond the extreme knots, +the CDF decays exponentially at a rate fitted to the tail data. ### Calibration Dataset -The calibration dataset is the set of normal (non-adversarial) inputs used to -compute the SVD basis and fit behavioral region distributions. Requirements: +The calibration dataset serves two purposes: establishing the population +distribution (normal behavioral baseline) and providing contrast pairs +(labeled examples for each behavioral direction). -- **Composition**: Diverse normal inputs representative of the deployment - domain. No adversarial examples — the basis models *normal* behavior, and - anomalies are detected as deviations from it. +**Population data**: Diverse normal inputs representative of the deployment +domain. No adversarial examples — the population models *normal* behavior, +and anomalies are detected as deviations from it. Each prompt is processed +by the detector model, and z-coordinates are extracted at every token +position. The flattened `(N, 3)` tensor of all positions forms the population. + +**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless, +instruction/data) that define each behavioral direction. Each condition +produces a set of z-coordinates that, after copula decomposition, reveal +where the conditions separate in (S, u, v) space. 
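
How far two contrast conditions separate along a single feature can be summarised with the statistics the direction profiles store (means, pooled standard deviation, Cohen's d) plus the midpoint threshold. The sketch below is illustrative only: the helper name `feature_profile` and the use of the sample variance (`ddof=1`) are assumptions, not confirmed details of the PoC.

```python
import numpy as np


def feature_profile(cond_a: np.ndarray, cond_b: np.ndarray) -> dict[str, float]:
    """Summarise one feature (S, u, or v) for a contrast pair."""
    n_a, n_b = len(cond_a), len(cond_b)
    mean_a, mean_b = float(np.mean(cond_a)), float(np.mean(cond_b))
    # Pooled standard deviation from the two sample variances (assumes ddof=1).
    var_a = np.var(cond_a, ddof=1)
    var_b = np.var(cond_b, ddof=1)
    std_pooled = float(
        np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    )
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std_pooled,
        # Cohen's d: difference of means in units of pooled std.
        "cohen_d": (mean_a - mean_b) / std_pooled,
        # Midpoint between condition means, usable as a rule-based threshold.
        "threshold": 0.5 * (mean_a + mean_b),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s_harmful = rng.normal(loc=2.0, scale=0.3, size=200)   # synthetic S values
    s_harmless = rng.normal(loc=1.5, scale=0.3, size=200)  # synthetic S values
    print(feature_profile(s_harmful, s_harmless))
```

The synthetic values in the usage block are placeholders; real profiles would be computed from the decomposed features of each labeled condition.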
+ +Requirements: +- **Composition**: Population must cover the range of normal inputs the + detector will see in production. Contrast pairs must be clearly distinct + along their target behavioral direction. - **Size**: At minimum, enough inputs to produce a stable SVD decomposition. - Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but - have diminishing returns. -- **Diversity**: Must cover the range of normal inputs the detector will see - in production. A narrow calibration dataset (e.g., only short English - queries) will produce high false positive rates on unusual but benign inputs. -- **Model-specific**: A calibration dataset must be collected for each detector + Practical range: 1,000–10,000 prompts for population. Each contrast + condition needs at least 50–200 prompts. +- **Diversity**: A narrow population (e.g., only short English queries) will + produce high false positive rates on unusual but benign inputs. +- **Model-specific**: Calibration data must be collected for each detector model by running that model on the inputs and extracting activations. The codebook compilation pipeline (`run_manifold_projection.py` in the PoC) -automates calibration dataset processing. +automates calibration dataset processing with `max_length=128` tokens per +prompt. ### Codebook Compilation The codebook is compiled offline by a training pipeline that: -1. Runs the detector model on a calibration dataset (diverse normal inputs) -2. Extracts hidden state activations at configured layers -3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact, - deterministic decomposition; not `sklearn.decomposition.TruncatedSVD` - which uses randomized approximation and may not be deterministic) -4. Fits spline distributions along each retained dimension -5. Computes detection thresholds -6. Serializes the codebook to a portable format (safetensors + JSON config) +1. 
Runs the detector model on a calibration dataset (population + contrast + pairs) +2. Extracts hidden state activations at configured layers for every token + position (not just last-token) +3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact, + deterministic decomposition) +4. Projects activations onto the top-3 SVD components → z-coordinates +5. Fits spline distributions on each SVD dimension and the sum S +6. Applies copula decomposition to all z-coordinates → (S, u, v) features +7. Computes direction profiles (means, pooled std, Cohen's d) for each + contrast pair +8. Trains logistic classifiers on (S, u, v) for each contrast pair +9. Computes detection thresholds (midpoints between condition means) +10. Serializes the codebook to a portable format (safetensors + JSON config) This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the package** as package data (under `src/alknet_firewall/data/codebook/`). This @@ -224,8 +328,10 @@ The codebook is stored as: codebook/ ├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim) ├── regions.safetensors # Region boundary parameters +├── classifiers.safetensors # Logistic classifier weights per direction ├── splines.json # Spline knot positions and coefficients -└── config.json # Metadata: model_id, revision, n_dims, thresholds +├── profiles.json # Direction profiles (means, stds, Cohen's d) +└── config.json # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs ``` All tensor data uses safetensors format (ADR-005). Configuration uses JSON. @@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON. 
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension | | `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension | +**classifiers.safetensors**: +| Key | Shape | Dtype | Description | +|-----|-------|-------|-------------| +| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature | +| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature | +| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature | +| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts | + **splines.json**: | Field | Type | Description | |-------|------|-------------| @@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON. | `coefficients` | `list[list[float]]` | Spline coefficients per dimension | | `tail_decay` | `list[float]` | Exponential tail decay rate per dimension | +**profiles.json**: +| Field | Type | Description | +|-------|------|-------------| +| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles | +| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples | + +Each `DirectionProfile` entry contains: +| Field | Type | Description | +|-------|------|-------------| +| `label` | `str` | Direction name (e.g., "refusal") | +| `sum_mean_a` | `float` | Mean S for condition A | +| `sum_mean_b` | `float` | Mean S for condition B | +| `sum_std_pooled` | `float` | Pooled std of S | +| `u_mean_a` | `float` | Mean u for condition A | +| `u_mean_b` | `float` | Mean u for condition B | +| `u_std_pooled` | `float` | Pooled std of u | +| `v_mean_a` | `float` | Mean v for condition A | +| `v_mean_b` | `float` | Mean v for condition B | +| `v_std_pooled` | `float` | Pooled std of v | +| `cohen_d_sum` | `float` | Effect size for S | +| `cohen_d_u` | `float` | Effect size for u | +| `cohen_d_v` | `float` | Effect size for v | + ## Interfaces ### Internal API @@ 
-262,18 +399,54 @@ class CodebookConfig: model_revision: str n_dimensions: int layers: list[int] - suspicious_threshold: float # Serialized threshold values - dangerous_threshold: float # (mapped to Thresholds dataclass at runtime) + suspicious_threshold: float + dangerous_threshold: float + contrast_pairs: list[tuple[str, str, str]] # (cond_a, cond_b, label) + smoothing_window: int = 8 # Token-level smoothing (inference only) class Codebook: def __init__(self, path: Path): ... def project(self, activations: dict[int, np.ndarray]) -> np.ndarray: - """Project raw activations onto SVD basis → z-coordinates.""" + """Project raw activations onto SVD basis → z-coordinates. + + Returns: (seq_len, 3) z-coordinates. + """ ... - def score(self, z_coords: np.ndarray) -> list[DimensionSignal]: - """Score z-coordinates against behavioral regions.""" + def decompose(self, z_coords: np.ndarray) -> dict: + """Copula decomposition: z → CDF → (S, u, v). + + Args: + z_coords: (seq_len, 3) or (N, 3) z-coordinates + + Returns: + dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v) + """ + ... + + def classify(self, features: dict, window: int = 8) -> dict[str, dict]: + """Classify decomposed features using logistic classifiers. + + Args: + features: Output of decompose(), with (seq_len,) arrays + window: Smoothing window size (1 = no smoothing) + + Returns: + dict mapping direction name to {'prob', 'mean_prob', 'max_prob'} + """ + ... + + def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7, + min_positions: int = 3, window: int = 8) -> DetectionResult: + """Detection pipeline on already-projected z-coordinates: decompose → smooth → classify → flag. + + Args: + z_coords: (seq_len, 3) z-coordinates for a single input + threshold_prob: P(active) threshold for flagging a direction + min_positions: Minimum token positions above threshold to flag + window: Smoothing window for token-level features + """ ... 
@classmethod diff --git a/docs/architecture/decisions/009-last-token-extraction.md b/docs/architecture/decisions/009-last-token-extraction.md index 8a95a5b..4337fe4 100644 --- a/docs/architecture/decisions/009-last-token-extraction.md +++ b/docs/architecture/decisions/009-last-token-extraction.md @@ -30,8 +30,14 @@ input. ## Decision -Extract the last token's hidden state at each configured layer. This is -standard for LLaMA-family models and provides full-sequence context. +Extract the last token's hidden state at each configured layer as the Phase 1 +default. This is standard for LLaMA-family models and provides full-sequence +context. + +Phase 2 extends this to per-token extraction (hidden states at every position) +to enable token-level smoothing and per-position behavioral classification. +The training pipeline already uses per-token extraction for calibration data +collection. ## Consequences @@ -40,6 +46,7 @@ standard for LLaMA-family models and provides full-sequence context. - Full sequence context via causal attention - Single vector per layer — simple to project and score - No padding sensitivity (unlike mean pooling with attention masks) +- Phase 1 simplification: reduces implementation complexity and latency **Negative**: - Position-dependent — the last token's representation is influenced by its @@ -48,6 +55,8 @@ standard for LLaMA-family models and provides full-sequence context. activation patterns - May miss patterns in long inputs where the adversarial payload is in the middle rather than the end +- Phase 1 only: misses token-level behavioral signals that require per-token + extraction (addressed in Phase 2) ## References diff --git a/docs/architecture/firewall.md b/docs/architecture/firewall.md index 425c68d..f8967aa 100644 --- a/docs/architecture/firewall.md +++ b/docs/architecture/firewall.md @@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002). "Please summarize this document: [hidden injection payload]" 2. 
Tokenize - tokenizer.encode(input) → input_ids + tokenizer.encode(input) → input_ids (shape: seq_len) 3. Detector Model Inference - model(input_ids) → hidden_states at key layers + model(input_ids, output_hidden_states=True) → hidden_states at key layers 4. Activation Extraction - Extract hidden states from configured layers (early + mid) + Extract last-token hidden states from configured layers (early + mid) hidden_states[layer_idx][:, -1, :] → per-layer activation vectors 5. SVD Projection Project activations onto precomputed SVD basis - z_coords = svd_basis @ activation_vector + z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates -6. Codebook Comparison - For each SVD dimension: - - Compute distance from normal behavioral region - - Apply spline scoring (monotonic distribution) - - Aggregate multi-dimensional signals +6. Copula Decomposition + Transform z-coordinates through CDF → simplex → barycentric: + z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale) + → (u, v) via barycentric (position on simplex) -7. Alarm Generation - Combine per-dimension signals → overall alarm - AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS - Include per-dimension breakdown for interpretability +7. Token-Level Smoothing (optional) + Apply rolling average to (S, u, v) features across token positions + window=8: smooths per-token signals, reduces noise from single-token spikes + +8. Direction Classification + For each behavioral direction (refusal, injection, etc.): + logistic_classifier(S, u, v) → P(active | features) per token position + +9. Aggregation + Per direction: mean P(active), max P(active), fraction above threshold + Flag if any direction exceeds threshold for sufficient token positions + +10. Alarm Generation + Combine per-direction signals → overall alarm + AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS + Include per-direction breakdown for interpretability ``` +Note: Step 4 extracts only the last token in Phase 1. 
The full pipeline +(Phase 2) extracts per-token activations, enabling the token-level smoothing +and per-position classification in steps 7–9. + ## Key Concepts ### Behavioral Alarm @@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension. ### Score Composition -The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals -using a weighted maximum: +The overall `Alarm.score` (0.0–1.0) is computed from per-direction +classification results. For each behavioral direction, the logistic +classifier produces P(active | features) for every token position. The +alarm score aggregates these across directions: ``` -score = max(w_d * signal_d for d in dimensions) +direction_score = max(P(active) across token positions) +score = max(w_d * direction_score_d for d in directions) ``` -Where `w_d` are dimension weights (default: equal, configurable in -`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a -single strongly anomalous dimension can trigger an alarm even if other -dimensions are normal. This is critical for catching attacks that exploit -specific behavioral patterns (e.g., refusal-suppression) while leaving other -dimensions unaffected. +Where `w_d` are direction weights (default: equal, configurable in +`Thresholds.per_dimension`). Using `max` at both levels ensures that: +- A single strongly anomalous direction can trigger an alarm even if other + directions are normal +- A sustained behavioral signal at any token position surfaces in the alarm + +This is critical for catching attacks that exploit specific behavioral +patterns (e.g., refusal-suppression) while leaving other directions +unaffected. The `suspicious` and `dangerous` thresholds are applied to this composite score to determine `Alarm.level`. @@ -94,9 +115,9 @@ score to determine `Alarm.level`. 
| Level | Meaning | Action | |-------|---------|--------| -| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model | -| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks | -| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations | +| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model | +| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks | +| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations | ### Latency Budget @@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware | Model inference (125M, CPU) | ~5ms | | Activation extraction | ~0.1ms | | SVD projection | ~0.1ms | -| Codebook comparison | ~0.3ms | +| Copula decomposition | ~0.05ms | +| Token-level smoothing | ~0.05ms | +| Direction classification | ~0.1ms | | **Total** | **~6ms** | ## Interfaces @@ -124,9 +147,11 @@ class AlarmLevel(Enum): @dataclass class DimensionSignal: - dimension: int - deviation: float - score: float + direction: str # Behavioral direction name (e.g., "refusal", "injection") + score: float # P(active) for this direction + max_score: float # Max P(active) across token positions + mean_score: float # Mean P(active) across token positions + n_positions_above: int # Token positions above threshold direction_label: str | None @dataclass diff --git a/docs/architecture/model.md b/docs/architecture/model.md index d75334f..cda135e 100644 --- a/docs/architecture/model.md +++ b/docs/architecture/model.md @@ -35,15 +35,34 @@ changes to the firewall logic. The core operation: running the model on an input and capturing hidden state representations at specific layers. 
+**Phase 1 (last-token extraction)**: ```python -# Conceptual outputs = model(input_ids, output_hidden_states=True) activations = { layer_idx: outputs.hidden_states[layer_idx][:, -1, :] for layer_idx in configured_layers } +# Shape: (batch, hidden_dim) per layer; take [0] for the single (hidden_dim,) vector ``` +**Phase 2 (per-token extraction)**: Extract hidden states at every token +position to enable token-level smoothing and per-position classification +(see codebook.md: Token-Level Smoothing). +```python +outputs = model(input_ids, output_hidden_states=True) +activations = { + layer_idx: outputs.hidden_states[layer_idx][0, :, :] + for layer_idx in configured_layers +} +# Shape: (seq_len, hidden_dim) per layer — sequence of vectors +``` + +The training pipeline uses per-token extraction (z-coordinates at every +position are collected for population statistics). Phase 1 simplifies to +last-token only for lower latency and simpler implementation. The codebook's +classifiers are trained on per-token data from all positions, so they remain +valid for both extraction modes. + Key decisions: - **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model). Early layers (1, 2) capture safety signals per EMNLP 2024 findings. @@ -52,9 +71,11 @@ Key decisions: signals are highly correlated with the selected layers. - **Which token**: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard - choice for autoregressive (LLaMA-family) models. + choice for autoregressive (LLaMA-family) models and sufficient for Phase 1. + Per-token extraction enables the full detection pipeline in Phase 2. - **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim` - (768 for SmolLM2-135M). + (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)` + in Phase 2. ### Model-Agnostic Interface