# ADR-009: Last-Token Activation Extraction

## Status

Accepted

## Context

To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:

- **Last token**: The hidden state at the final position, which has attended
  to the entire sequence. Standard for sequence classification with
  decoder-only (GPT-style) models, where causal attention aggregates the
  sequence at the last position.
- **Mean pooling**: Average hidden states across all positions. Smooths out
  position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
  (LLaMA architecture) does not use a CLS token.
- **First token**: Has seen only the beginning of the sequence. Misses
  context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
  position with extreme activation can dominate.
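
To make the trade-offs between these options concrete, here is a minimal sketch of four of the pooling strategies applied to a toy sequence of hidden states. The helper names and the numeric values are illustrative assumptions and are not taken from the detector code; plain Python lists stand in for tensors.

```python
from typing import List

Vector = List[float]


def last_token(states: List[Vector]) -> Vector:
    """Hidden state at the final position."""
    return list(states[-1])


def first_token(states: List[Vector]) -> Vector:
    """Hidden state at the first position."""
    return list(states[0])


def mean_pool(states: List[Vector]) -> Vector:
    """Per-dimension average across all positions."""
    n = len(states)
    return [sum(col) / n for col in zip(*states)]


def max_pool(states: List[Vector]) -> Vector:
    """Per-dimension maximum across all positions."""
    return [max(col) for col in zip(*states)]


# Toy sequence: 4 positions, hidden size 3 (made-up values).
states = [
    [1.0, 0.0, 2.0],
    [0.0, 8.0, 2.0],  # a single extreme activation in dimension 1
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 2.0],
]

pooled = {
    "last": last_token(states),
    "first": first_token(states),
    "mean": mean_pool(states),
    "max": max_pool(states),
}

for name, vector in pooled.items():
    print(f"{name:>5}: {vector}")
```

In this toy case the single extreme value at position 1 sets dimension 1 of the max-pooled vector outright, while mean pooling spreads it across the sequence length.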

Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because, under causal attention, the last position is the only one
whose hidden state has attended to the full sequence. For safety detection,
this means the last token's representation is the one that reflects the
model's "conclusion" about the entire input.
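
The causal-attention argument can be illustrated with a small sketch of the attention mask itself. The function name `causal_mask` is a hypothetical helper for this illustration; it only shows which positions each position is allowed to attend to.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[int]]:
    """mask[i][j] == 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)

# Number of positions visible to each position, from first to last.
visible = [sum(row) for row in mask]

for row in mask:
    print(row)
print("visible positions per token:", visible)
```

Only the final row of the mask is all ones, so the last position is the only one with access to the whole input.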

## Decision

Extract the last token's hidden state at each configured layer as the Phase 1
default. This is standard for LLaMA-family models and provides full-sequence
context.
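
As a rough sketch of this decision, the following shows last-token extraction at a set of configured layers. It assumes hidden states are available as `hidden_states[layer][position]` and that an attention mask marks real tokens with 1 and padding with 0; the function names, the layer indices, and the toy values are hypothetical, and how layer indices map onto the real model's outputs is not specified here.

```python
from typing import Dict, List, Sequence

Vector = List[float]
LayerStates = List[Vector]  # one vector per position


def last_real_index(attention_mask: Sequence[int]) -> int:
    """Index of the last non-padding position (mask value 1)."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("attention mask contains no real tokens")


def extract_last_token(
    hidden_states: Sequence[LayerStates],
    attention_mask: Sequence[int],
    layers: Sequence[int],
) -> Dict[int, Vector]:
    """Return one vector per configured layer, taken at the last real token."""
    idx = last_real_index(attention_mask)
    return {layer: list(hidden_states[layer][idx]) for layer in layers}


# Toy input: 3 layers, 4 positions, hidden size 2 (made-up values).
hidden_states = [
    [[float(layer * 10 + pos), 0.5] for pos in range(4)]
    for layer in range(3)
]
attention_mask = [1, 1, 1, 0]  # right-padded: position 3 is padding

vectors = extract_last_token(hidden_states, attention_mask, layers=[0, 2])
print(vectors)
```

Selecting the index through the mask rather than taking position `-1` keeps the extraction correct for right-padded batches, where the final position holds a padding token.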

Phase 2 extends this to per-token extraction (hidden states at every position)
to enable token-level smoothing and per-position behavioral classification.
The training pipeline already uses per-token extraction for calibration data
collection.
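
One way to picture per-token extraction with token-level smoothing is sketched below. The scoring rule (a dot product with a probe direction) and the exponential moving average are assumptions chosen for illustration; this document does not specify how Phase 2 scores or smooths tokens, and all names and values here are hypothetical.

```python
from typing import List, Sequence


def score_tokens(
    states: Sequence[Sequence[float]], direction: Sequence[float]
) -> List[float]:
    """Project every position's hidden state onto a (hypothetical) probe direction."""
    return [sum(h * w for h, w in zip(vec, direction)) for vec in states]


def ema(scores: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Exponential moving average over per-token scores."""
    smoothed: List[float] = []
    prev = None
    for s in scores:
        prev = s if prev is None else alpha * s + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Toy sequence: only position 1 carries signal along the probe direction.
states = [[0.0, 0.0], [4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
direction = [1.0, 0.0]

scores = score_tokens(states, direction)
smoothed = ema(scores, alpha=0.5)

print("per-token scores:", scores)
print("smoothed scores: ", smoothed)
print("last-token score:", scores[-1])
```

In this toy setup the per-token scores keep the position of the signal visible, which a single last-token score does not.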

## Consequences

**Positive**:
- Standard approach for autoregressive models — well-validated
- Full sequence context via causal attention
- Single vector per layer — simple to project and score
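- No padding dilution: a single position is selected instead of averaging
  over the sequence, provided it is the last non-padding position (with
  right-padded batches, locate it via the attention mask)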
- Phase 1 simplification: reduces implementation complexity and latency

**Negative**:
- Position-dependent — the last token's representation is influenced by its
  position in the sequence, not just its content
- Very short inputs (1–2 tokens) may not have enough context for meaningful
  activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end
- Phase 1 only: misses token-level behavioral signals that require per-token
  extraction (addressed in Phase 2)

## References

- [model.md](../model.md)
- [codebook.md](../codebook.md)