The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
64 lines
2.5 KiB
Markdown
# ADR-009: Last-Token Activation Extraction

## Status

Accepted

## Context

To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:

- **Last token**: The hidden state at the final position, which has attended
  to the entire sequence. Standard for sequence classification with
  decoder-only (GPT-style) models, which naturally aggregate context at the
  last position.
- **Mean pooling**: Average hidden states across all positions. Smooths out
  position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
  (LLaMA architecture) does not use a CLS token.
- **First token**: Under causal attention, the first position sees only
  itself, so it misses all context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy: a single
  position with extreme activation can dominate.

Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because the last position's hidden state has attended to the full
sequence via causal attention. For safety detection, this means the last
token's representation contains the model's "conclusion" about the entire
input.

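To make the options in the Context section concrete, the following is a minimal sketch of each pooling choice applied to one layer's hidden states. It uses plain Python lists rather than a tensor library, and the names `HiddenStates`, `last_token`, `first_token`, `mean_pool`, and `max_pool` are illustrative assumptions, not part of the project's API.

```python
from typing import List

Vector = List[float]
HiddenStates = List[Vector]  # one layer: (seq_len, hidden_dim)


def last_token(hidden: HiddenStates) -> Vector:
    """Hidden state at the final position (has attended to every token)."""
    return list(hidden[-1])


def first_token(hidden: HiddenStates) -> Vector:
    """Hidden state at position 0 (sees only itself under causal attention)."""
    return list(hidden[0])


def mean_pool(hidden: HiddenStates) -> Vector:
    """Per-dimension average across all positions."""
    n = len(hidden)
    return [sum(column) / n for column in zip(*hidden)]


def max_pool(hidden: HiddenStates) -> Vector:
    """Per-dimension maximum across all positions."""
    return [max(column) for column in zip(*hidden)]


# Toy input: 3 tokens, 2 hidden dimensions (values are made up).
toy = [[0.0, 1.0], [9.0, -1.0], [3.0, 3.0]]

print(last_token(toy))   # final position only
print(mean_pool(toy))    # the 9.0 spike is averaged down
print(max_pool(toy))     # the 9.0 spike dominates dimension 0
```

In this toy input a single extreme value at the middle position dominates the max-pooled vector and is diluted by mean pooling, while the last-token vector ignores it entirely, which mirrors the trade-offs listed above.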
## Decision

Extract the last token's hidden state at each configured layer as the Phase 1
default. This is standard for LLaMA-family models and provides full-sequence
context.

Phase 2 extends this to per-token extraction (hidden states at every position)
to enable token-level smoothing and per-position behavioral classification.
The training pipeline already uses per-token extraction for calibration data
collection.

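The two extraction modes in the Decision can be sketched as a single function with a mode switch. This is an illustration under stated assumptions, not the project's implementation: the function name `extract_activations`, the mode strings `"last_token"` and `"per_token"`, and the layer indices 4 and 8 are all hypothetical.

```python
from typing import Dict, List, Union

Vector = List[float]
HiddenStates = List[Vector]  # one layer: (seq_len, hidden_dim)


def extract_activations(
    hidden_by_layer: Dict[int, HiddenStates],
    layers: List[int],
    mode: str = "last_token",
) -> Dict[int, Union[Vector, HiddenStates]]:
    """Return one vector per layer (Phase 1) or the full sequence (Phase 2)."""
    if mode not in ("last_token", "per_token"):
        raise ValueError(f"unknown extraction mode: {mode!r}")
    extracted: Dict[int, Union[Vector, HiddenStates]] = {}
    for layer in layers:
        states = hidden_by_layer[layer]
        if not states:
            raise ValueError(f"layer {layer} has an empty sequence")
        if mode == "last_token":
            # Phase 1: a single (hidden_dim,) vector per configured layer.
            extracted[layer] = list(states[-1])
        else:
            # Phase 2: the whole (seq_len, hidden_dim) sequence per layer.
            extracted[layer] = [list(vector) for vector in states]
    return extracted


# Toy hidden states for two hypothetical configured layers.
toy = {4: [[0.1, 0.2], [0.3, 0.4]], 8: [[1.0, 1.1], [1.2, 1.3]]}
phase1 = extract_activations(toy, layers=[4, 8])
phase2 = extract_activations(toy, layers=[4, 8], mode="per_token")
```

Keeping both modes behind one entry point would let downstream code move from Phase 1 to Phase 2 by changing a configuration value; whether the real pipeline is structured this way is not stated in this ADR.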
## Consequences

**Positive**:
- Standard, well-validated approach for autoregressive models
- Full sequence context via causal attention
- Single vector per layer, so it is simple to project and score
- No padding sensitivity for unbatched or left-padded inputs (unlike mean
  pooling, which needs an attention mask)
- Phase 1 simplification: reduces implementation complexity and latency

**Negative**:
- Position-dependent: the last token's representation is influenced by its
  position in the sequence, not just its content
- Very short inputs (1–2 tokens) may not have enough context for meaningful
  activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end
- Phase 1 only: misses token-level behavioral signals that require per-token
  extraction (addressed in Phase 2)

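The padding consequence listed above holds only when the final position is a real token. If inputs were ever right-padded in a batch (an assumption; this ADR does not describe batching), the last real position would need to be located from the attention mask. A minimal sketch, with hypothetical helper names `last_real_index`, `last_real_token`, and `masked_mean`:

```python
from typing import List

Vector = List[float]
HiddenStates = List[Vector]  # one sequence: (seq_len, hidden_dim)


def last_real_index(attention_mask: List[int]) -> int:
    """Index of the last position whose mask value is 1 (a real token)."""
    for index in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[index] == 1:
            return index
    raise ValueError("attention mask contains no real tokens")


def last_real_token(hidden: HiddenStates, attention_mask: List[int]) -> Vector:
    """Last-token extraction that skips trailing padding positions."""
    return list(hidden[last_real_index(attention_mask)])


def masked_mean(hidden: HiddenStates, attention_mask: List[int]) -> Vector:
    """Mean pooling over real tokens only, for comparison."""
    kept = [vec for vec, keep in zip(hidden, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention mask contains no real tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy right-padded sequence: two real tokens followed by one pad position.
hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

naive = list(hidden[-1])                 # picks the padding position
safe = last_real_token(hidden, mask)     # picks the last real token
pooled = masked_mean(hidden, mask)       # averages the two real tokens
```

With left padding or one input at a time, the naive `hidden[-1]` lookup and the mask-aware lookup return the same vector, so the extra step only matters for right-padded batches.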
## References

- [model.md](../model.md)
- [codebook.md](../codebook.md)