docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions
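For reviewers, the difference between the two extraction modes added to model.md can be checked with a minimal sketch. It uses NumPy arrays as a stand-in for `outputs.hidden_states`; the helper names `extract_last_token` and `extract_per_token`, the `batch_index` parameter, and the array sizes are illustrative assumptions and are not part of this commit.

```python
import numpy as np


def extract_last_token(hidden_states, layers):
    """Phase 1: keep only the final token position for each configured layer."""
    return {i: hidden_states[i][:, -1, :] for i in layers}


def extract_per_token(hidden_states, layers, batch_index=0):
    """Phase 2: keep every token position of one input for each configured layer."""
    return {i: hidden_states[i][batch_index, :, :] for i in layers}


# Stand-in for outputs.hidden_states: a tuple of arrays shaped
# (batch, seq_len, hidden_dim). The sizes below are arbitrary (assumption).
batch, seq_len, hidden_dim = 1, 5, 16
rng = np.random.default_rng(0)
hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden_dim)) for _ in range(9)
)
configured_layers = [1, 2, 4, 8]

phase1 = extract_last_token(hidden_states, configured_layers)
phase2 = extract_per_token(hidden_states, configured_layers)

print(phase1[1].shape)  # (1, 16): one vector per input in the batch
print(phase2[1].shape)  # (5, 16): one vector per token position
```

In this sketch the last row of each Phase 2 array equals the Phase 1 vector for the same input, which matches the commit's description of Phase 1 as a last-token simplification of the per-token pipeline.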
--- a/docs/architecture/model.md
+++ b/docs/architecture/model.md
@@ -35,15 +35,34 @@ changes to the firewall logic.
 The core operation: running the model on an input and capturing hidden state
 representations at specific layers.

+**Phase 1 (last-token extraction)**:
 ```python
-# Conceptual
 outputs = model(input_ids, output_hidden_states=True)
 activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
 }
+# Shape: (batch, hidden_dim) per layer, i.e. one vector per input sequence
 ```

+**Phase 2 (per-token extraction)**: Extract hidden states at every token
+position to enable token-level smoothing and per-position classification
+(see codebook.md: Token-Level Smoothing).
+```python
+outputs = model(input_ids, output_hidden_states=True)
+activations = {
+    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
+    for layer_idx in configured_layers
+}
+# Shape: (seq_len, hidden_dim) per layer — sequence of vectors
+```
+
+The training pipeline uses per-token extraction (z-coordinates at every
+position are collected for population statistics). Phase 1 simplifies to
+last-token only for lower latency and simpler implementation. The codebook's
+classifiers are trained on per-token data from all positions, so they remain
+valid for both extraction modes.
+
 Key decisions:
 - **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
@@ -52,9 +71,11 @@ Key decisions:
  signals are highly correlated with the selected layers.
 - **Which token**: The last token's hidden state carries the model's
  "conclusion" about the full input sequence (ADR-009). This is the standard
-  choice for autoregressive (LLaMA-family) models.
+  choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
+  Per-token extraction enables the full detection pipeline in Phase 2.
 - **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
-  (768 for SmolLM2-135M).
+  (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
+  in Phase 2.

 ### Model-Agnostic Interface