docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector path
(one activation → one z-coordinate → one alarm), but the PoC operates on
per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
@@ -35,15 +35,34 @@ changes to the firewall logic.
The core operation: running the model on an input and capturing hidden state
representations at specific layers.

**Phase 1 (last-token extraction)**:
```python
# Conceptual: assumes `model`, `input_ids`, and `configured_layers` are defined
outputs = model(input_ids, output_hidden_states=True)
activations = {
    # Batch item 0, last token position
    layer_idx: outputs.hidden_states[layer_idx][0, -1, :]
    for layer_idx in configured_layers
}
# Shape: (hidden_dim,) per layer (single vector)
```

**Phase 2 (per-token extraction)**: Extract hidden states at every token
position to enable token-level smoothing and per-position classification
(see codebook.md: Token-Level Smoothing).
```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    # Batch item 0, every token position
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
    for layer_idx in configured_layers
}
# Shape: (seq_len, hidden_dim) per layer (sequence of vectors)
```
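
Token-level smoothing itself is specified in codebook.md and is not reproduced in this diff. As a rough illustration of why per-token extraction is a prerequisite, the sketch below applies a centered moving average to a list of per-token scores. The moving average, the function name `smooth_scores`, the `window` parameter, and the toy scores are all assumptions made for illustration, not the method codebook.md defines.

```python
from typing import List, Sequence


def smooth_scores(scores: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the sequence edges.

    Hypothetical stand-in for the smoothing step described in codebook.md.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


# Toy per-token scores with a single-position spike (made-up values)
per_token_scores = [0.0, 0.0, 0.9, 0.0, 0.0]
smoothed = smooth_scores(per_token_scores, window=3)
print(smoothed)
```

The spike at one position is spread across its neighbours, which only works when a score exists for every token position. A single last-token vector gives a smoother nothing to average over.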

The training pipeline uses per-token extraction (z-coordinates at every
position are collected for population statistics). Phase 1 simplifies
inference to the last token only, for lower latency and a simpler
implementation. The codebook's classifiers are trained on per-token data
from all positions, so they remain valid for both extraction modes.
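
The relationship between the two modes can be made concrete: the Phase 1 vector for a layer is simply the last row of that layer's Phase 2 array. The sketch below uses plain nested lists in place of tensors; `last_token_view` and the toy values are hypothetical and only illustrate the indexing.

```python
from typing import Dict, List

Vector = List[float]          # (hidden_dim,)
TokenMatrix = List[Vector]    # (seq_len, hidden_dim)


def last_token_view(per_token: Dict[int, TokenMatrix]) -> Dict[int, Vector]:
    """Reduce Phase 2 activations to the Phase 1 last-token vectors."""
    return {layer_idx: rows[-1] for layer_idx, rows in per_token.items()}


# Toy activations: 2 layers, seq_len=3, hidden_dim=2 (real hidden_dim is much larger)
phase2 = {
    1: [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    2: [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
}
phase1 = last_token_view(phase2)
print(phase1)  # {1: [0.5, 0.6], 2: [1.4, 1.5]}
```

Under this view, a Phase 2 extractor can serve Phase 1 consumers without a second forward pass, at the cost of holding the full sequence in memory.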

Key decisions:
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
@@ -52,9 +71,11 @@ Key decisions:
   signals are highly correlated with the selected layers.
 - **Which token**: The last token's hidden state carries the model's
   "conclusion" about the full input sequence (ADR-009). This is the standard
-  choice for autoregressive (LLaMA-family) models.
+  choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
+  Per-token extraction enables the full detection pipeline in Phase 2.
 - **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
-  (768 for SmolLM2-135M).
+  (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
+  in Phase 2.
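
The layer and shape decisions above can be checked mechanically. The sketch below assumes the Hugging Face convention that `hidden_states` holds `num_hidden_layers + 1` entries, with index 0 being the embedding output, so configured layers run from 1 to `num_hidden_layers`. `validate_layers` and `expected_shape` are hypothetical helper names, and the values 12 and 768 are copied from the text rather than read from a model config.

```python
from typing import Iterable, Tuple


def validate_layers(configured_layers: Iterable[int], num_hidden_layers: int) -> None:
    """Reject indices that do not name a transformer layer's output.

    Assumption: index 0 of hidden_states is the embedding output, so valid
    layer indices are 1..num_hidden_layers inclusive.
    """
    for layer_idx in configured_layers:
        if not 1 <= layer_idx <= num_hidden_layers:
            raise ValueError(
                f"layer {layer_idx} is outside 1..{num_hidden_layers}"
            )


def expected_shape(phase: int, seq_len: int, hidden_dim: int) -> Tuple[int, ...]:
    """Per-layer activation shape for each extraction mode."""
    if phase == 1:
        return (hidden_dim,)          # last-token vector
    if phase == 2:
        return (seq_len, hidden_dim)  # one vector per token position
    raise ValueError(f"unknown phase: {phase}")


# Values as stated in the text above; prefer reading them from the model config
validate_layers([1, 2, 4, 8], num_hidden_layers=12)
print(expected_shape(1, seq_len=5, hidden_dim=768))  # (768,)
print(expected_shape(2, seq_len=5, hidden_dim=768))  # (5, 768)
```

In a real integration it is probably safer to read the layer count and hidden size from the loaded model's configuration than to hard-code them, since a mismatch here would silently shift every downstream shape.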

 ### Model-Agnostic Interface