docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions


@@ -35,15 +35,34 @@ changes to the firewall logic.
The core operation: running the model on an input and capturing hidden state
representations at specific layers.
**Phase 1 (last-token extraction)**:
```python
# Conceptual sketch; assumes a batch of one input
outputs = model(input_ids, output_hidden_states=True)
activations = {
    # [0, -1, :] selects the first batch item and the last token position
    layer_idx: outputs.hidden_states[layer_idx][0, -1, :]
    for layer_idx in configured_layers
}
# Shape: (hidden_dim,) per layer, a single vector
```
**Phase 2 (per-token extraction)**: Extract hidden states at every token
position to enable token-level smoothing and per-position classification
(see codebook.md: Token-Level Smoothing).
```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
    for layer_idx in configured_layers
}
# Shape: (seq_len, hidden_dim) per layer — sequence of vectors
```
The training pipeline uses per-token extraction: z-coordinates at every
position are collected for population statistics. At inference, Phase 1
extracts only the last token, which lowers latency and simplifies the
implementation. Because the codebook's classifiers are trained on per-token
data from all positions, last-token positions included, they remain valid
for both extraction modes.
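The relationship between the two modes can be sketched with toy data. This is a minimal illustration, not the project's implementation: `extract_activations` is a hypothetical helper, the NumPy arrays stand in for `outputs.hidden_states`, and a batch size of 1 is assumed.

```python
import numpy as np

def extract_activations(hidden_states, layers, per_token):
    """Select activations for the first batch item at the given layers.

    hidden_states: sequence of arrays shaped (batch, seq_len, hidden_dim),
    indexed by layer (a stand-in for `outputs.hidden_states`).
    """
    if per_token:
        # Phase 2: every token position, shape (seq_len, hidden_dim)
        return {i: hidden_states[i][0, :, :] for i in layers}
    # Phase 1: last token position only, shape (hidden_dim,)
    return {i: hidden_states[i][0, -1, :] for i in layers}

# Toy stand-in for model output: 9 entries, batch=1, seq_len=5, hidden_dim=16
rng = np.random.default_rng(0)
hidden_states = tuple(rng.normal(size=(1, 5, 16)) for _ in range(9))
layers = [1, 2, 4, 8]

phase1 = extract_activations(hidden_states, layers, per_token=False)
phase2 = extract_activations(hidden_states, layers, per_token=True)

# The Phase 1 vector is the last row of the Phase 2 sequence.
same_last_row = all(np.array_equal(phase2[i][-1], phase1[i]) for i in layers)
```

Under these assumptions, Phase 1 output is a strict subset of Phase 2 output, so a caller could move between modes by changing only the `per_token` flag.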
Key decisions:
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (30-layer model).
Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
@@ -52,9 +71,11 @@ Key decisions:
signals are highly correlated with the selected layers.
- **Which token**: The last token's hidden state carries the model's
"conclusion" about the full input sequence (ADR-009). This is the standard
choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
Per-token extraction enables the full detection pipeline in Phase 2.
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
(576 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
in Phase 2.
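
One way to keep these decisions in a single place is a small configuration object that also states the shape each mode should produce. The sketch below is an assumption for illustration only: `ExtractionConfig` and `check_activations` are hypothetical names, and the `hidden_dim=8` in the usage lines is a toy value, not the model's real hidden size.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class ExtractionConfig:  # hypothetical name, not part of the spec
    layers: Tuple[int, ...]   # which layers to capture
    hidden_dim: int           # size of one hidden state vector
    per_token: bool = False   # False = Phase 1, True = Phase 2

    def expected_shape(self, seq_len: int) -> Tuple[int, ...]:
        """Shape of one layer's activation for an input of seq_len tokens."""
        if self.per_token:
            return (seq_len, self.hidden_dim)
        return (self.hidden_dim,)

def check_activations(config: ExtractionConfig,
                      shapes: Dict[int, Tuple[int, ...]],
                      seq_len: int) -> None:
    """Raise if any configured layer is missing or has an unexpected shape."""
    expected = config.expected_shape(seq_len)
    for layer in config.layers:
        if layer not in shapes:
            raise KeyError(f"missing activation for layer {layer}")
        if tuple(shapes[layer]) != expected:
            raise ValueError(
                f"layer {layer}: got {tuple(shapes[layer])}, expected {expected}"
            )

# Usage with toy sizes: Phase 2, five tokens, hidden_dim of 8
config = ExtractionConfig(layers=(1, 2, 4, 8), hidden_dim=8, per_token=True)
shapes = {layer: (5, 8) for layer in config.layers}
check_activations(config, shapes, seq_len=5)  # passes silently
```

A check like this would surface a Phase 1 vector passed where a Phase 2 sequence is expected before it reaches downstream components.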
### Model-Agnostic Interface