docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions


@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
"Please summarize this document: [hidden injection payload]"
2. Tokenize
tokenizer.encode(input) → input_ids
tokenizer.encode(input) → input_ids (shape: seq_len)
3. Detector Model Inference
model(input_ids) → hidden_states at key layers
model(input_ids, output_hidden_states=True) → hidden_states at key layers
4. Activation Extraction
Extract hidden states from configured layers (early + mid)
Extract last-token hidden states from configured layers (early + mid)
hidden_states[layer_idx][:, -1, :] → per-layer activation vectors
5. SVD Projection
Project activations onto precomputed SVD basis
z_coords = svd_basis @ activation_vector
z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates
6. Codebook Comparison
For each SVD dimension:
- Compute distance from normal behavioral region
- Apply spline scoring (monotonic distribution)
- Aggregate multi-dimensional signals
6. Copula Decomposition
Transform z-coordinates through CDF → simplex → barycentric:
z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
→ (u, v) via barycentric (position on simplex)
7. Alarm Generation
Combine per-dimension signals → overall alarm
AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
Include per-dimension breakdown for interpretability
7. Token-Level Smoothing (optional)
Apply rolling average to (S, u, v) features across token positions
window=8: smooths per-token signals, reduces noise from single-token spikes
8. Direction Classification
For each behavioral direction (refusal, injection, etc.):
logistic_classifier(S, u, v) → P(active | features) per token position
9. Aggregation
Per direction: mean P(active), max P(active), fraction above threshold
Flag if any direction exceeds threshold for sufficient token positions
10. Alarm Generation
Combine per-direction signals → overall alarm
AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
Include per-direction breakdown for interpretability
```
Note: Step 4 extracts only the last token in Phase 1. The full pipeline
(Phase 2) extracts per-token activations, enabling the token-level smoothing
and per-position classification in steps 7–9.
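As an illustration of steps 6 to 8, here is a minimal sketch in plain Python. It is not the PoC implementation. It assumes standard-normal marginals for the CDF step, an equilateral-triangle embedding for the barycentric coordinates, and a trailing rolling window. The function names `decompose`, `rolling_mean`, and `p_active` are hypothetical, and the classifier weights would come from the codebook in practice.

```python
import math
from statistics import NormalDist

_CDF = NormalDist().cdf  # assumption: standard-normal marginals per z-dimension


def decompose(z):
    """Map one 3-d z-coordinate to (S, u, v).

    S is the sum of the CDF-transformed coordinates (scale); (u, v) is the
    position of the normalized point on a 2-d equilateral triangle.
    """
    x = [_CDF(zi) for zi in z]            # each x_i lies in [0, 1]
    s = sum(x)                            # S = x0 + x1 + x2
    denom = max(s, 1e-12)                 # guard against CDF underflow to 0
    w = [xi / denom for xi in x]          # barycentric weights, sum to 1
    # Assumed triangle vertices: (0, 0), (1, 0), (0.5, sqrt(3)/2)
    u = w[1] + 0.5 * w[2]
    v = (math.sqrt(3) / 2) * w[2]
    return s, u, v


def rolling_mean(rows, window=8):
    """Trailing rolling average of (S, u, v) rows across token positions."""
    out = []
    for i in range(len(rows)):
        chunk = rows[max(0, i - window + 1): i + 1]
        out.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return out


def p_active(features, weights, bias):
    """Logistic classifier: P(active | S, u, v) for one token position."""
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))
```

In Phase 1 the sequence has a single row (the last token), so the rolling average returns that row unchanged; smoothing only has an effect once per-token activations are extracted.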
## Key Concepts
### Behavioral Alarm
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.
### Score Composition
The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
using a weighted maximum:
The overall `Alarm.score` (0.0–1.0) is computed from per-direction
classification results. For each behavioral direction, the logistic
classifier produces P(active | features) for every token position. The
alarm score aggregates these across directions:
```
score = max(w_d * signal_d for d in dimensions)
direction_score = max(P(active) across token positions)
score = max(w_d * direction_score_d for d in directions)
```
Where `w_d` are dimension weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
single strongly anomalous dimension can trigger an alarm even if other
dimensions are normal. This is critical for catching attacks that exploit
specific behavioral patterns (e.g., refusal-suppression) while leaving other
dimensions unaffected.
Where `w_d` are direction weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` at both levels ensures that:
- A single strongly anomalous direction can trigger an alarm even if other
directions are normal
- A sustained behavioral signal at any token position surfaces in the alarm
This is critical for catching attacks that exploit specific behavioral
patterns (e.g., refusal-suppression) while leaving other directions
unaffected.
The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.
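The two-level `max` can be sketched as follows. This is a hedged illustration, not the firewall's code: `alarm_score` and `alarm_level` are hypothetical helper names, the threshold values 0.5 and 0.8 are placeholders chosen for the example, and treating a score equal to a threshold as crossing it is an assumption.

```python
def alarm_score(p_active_by_direction, weights=None):
    """Compute the composite score from per-token P(active) lists.

    p_active_by_direction maps a direction name to its per-token
    probabilities. Missing weights default to 1.0 (equal weighting).
    """
    per_direction = {}
    for name, probs in p_active_by_direction.items():
        w = 1.0 if weights is None else weights.get(name, 1.0)
        per_direction[name] = w * max(probs)   # max across token positions
    return max(per_direction.values()), per_direction  # max across directions


def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map the composite score to a level name (placeholder thresholds)."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"


# Example: one direction is quiet, the other spikes at a single position.
score, breakdown = alarm_score({
    "refusal": [0.10, 0.20, 0.15],
    "injection": [0.05, 0.90, 0.30],
})
level = alarm_level(score)
```

Because both reductions are `max`, the quiet `refusal` direction does not dilute the `injection` spike, which is the behavior the paragraph above describes.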
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.
| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
### Latency Budget
@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
| Model inference (125M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Codebook comparison | ~0.3ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| **Total** | **~6ms** |
## Interfaces
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):
@dataclass
class DimensionSignal:
dimension: int
deviation: float
score: float
direction: str # Behavioral direction name (e.g., "refusal", "injection")
score: float # P(active) for this direction
max_score: float # Max P(active) across token positions
mean_score: float # Mean P(active) across token positions
n_positions_above: int # Token positions above threshold
direction_label: str | None
@dataclass