
glm-5.1 45a0e0798c docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline

2026-06-13 08:17:09 +00:00

2.5 KiB


ADR-009: Last-Token Activation Extraction

Status

Accepted

Context

To extract behavioral signals from the detector model, we must choose which token's hidden state to use from the sequence of hidden states produced during inference. Options:

Last token: The hidden state at the final position, which has attended to the entire sequence. Standard for sequence classification with decoder-only models (GPT-style models naturally aggregate context at the last position; BERT-style encoders pool from the CLS token instead).
Mean pooling: Average hidden states across all positions. Smooths out position-specific effects but dilutes signal from safety-relevant tokens.
CLS token: A dedicated classification token (BERT-style). SmolLM2-135M (LLaMA architecture) does not use a CLS token.
First token: Under causal attention it attends only to itself, so it carries no context from later tokens.
Max pooling: Per-dimension maximum across positions. Noisy — a single position with extreme activation can dominate.

Last-token extraction is the standard for autoregressive (GPT/LLaMA-style) models because the last position's hidden state has attended to the full sequence via causal attention. For safety detection, this means the last token's representation contains the model's "conclusion" about the entire input.
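The pooling options above can be compared on a toy array. This is a minimal sketch, not the project's code: the function name `pool_hidden_states` and the example values are hypothetical, and it assumes a single unpadded sequence of shape (seq_len, hidden_dim).

```python
import numpy as np


def pool_hidden_states(hidden, strategy="last"):
    """Reduce a (seq_len, hidden_dim) array to one (hidden_dim,) vector.

    Hypothetical helper for illustration; assumes no padding positions.
    """
    if hidden.ndim != 2 or hidden.shape[0] == 0:
        raise ValueError("expected a non-empty (seq_len, hidden_dim) array")
    if strategy == "last":   # final position: has attended to the full sequence
        return hidden[-1]
    if strategy == "first":  # first position: no context from later tokens
        return hidden[0]
    if strategy == "mean":   # average over positions: dilutes localized signal
        return hidden.mean(axis=0)
    if strategy == "max":    # per-dimension maximum: one extreme position dominates
        return hidden.max(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")


# Toy input: 4 positions, hidden size 3, one extreme activation at position 1.
hidden = np.array([
    [0.0, 1.0, 0.0],
    [9.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])

last_vec = pool_hidden_states(hidden, "last")
mean_vec = pool_hidden_states(hidden, "mean")
max_vec = pool_hidden_states(hidden, "max")
```

In this toy input the extreme value at position 1 takes over the first dimension of the max-pooled vector and shifts the mean, while the last-token vector is simply the final row.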

Decision

Extract the last token's hidden state at each configured layer as the Phase 1 default. This is standard for LLaMA-family models and provides full-sequence context.

Phase 2 extends this to per-token extraction (hidden states at every position) to enable token-level smoothing and per-position behavioral classification. The training pipeline already uses per-token extraction for calibration data collection.
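The difference between the two phases is mostly one of output shape. The sketch below assumes per-layer hidden states are already available as (seq_len, hidden_dim) arrays; `extract_activations`, its `per_token` flag, and the toy dimensions are hypothetical names and values chosen for illustration, not the project's API.

```python
import numpy as np


def extract_activations(hidden_by_layer, layers, per_token=False):
    """Select hidden states at the configured layers.

    hidden_by_layer: sequence of (seq_len, hidden_dim) arrays, one per layer.
    Returns (len(layers), hidden_dim) in last-token mode (Phase 1), or
    (len(layers), seq_len, hidden_dim) in per-token mode (Phase 2).
    """
    selected = np.stack([hidden_by_layer[i] for i in layers])
    if per_token:
        return selected          # every position, for token-level processing
    return selected[:, -1, :]    # final position only


# Toy data: 5 layers, 7 tokens, hidden size 8 (arbitrary illustrative sizes).
rng = np.random.default_rng(0)
hidden_by_layer = [rng.normal(size=(7, 8)) for _ in range(5)]

phase1 = extract_activations(hidden_by_layer, layers=[1, 3])
phase2 = extract_activations(hidden_by_layer, layers=[1, 3], per_token=True)
```

The Phase 1 result is the last row of the Phase 2 result at each layer, so a per-token pipeline can recover the last-token vector without a second forward pass.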

Consequences

Positive:

Standard approach for autoregressive models — well-validated
Full sequence context via causal attention
Single vector per layer — simple to project and score
No dilution from padding positions (unlike mean pooling, which must be masked), provided the last non-padding position is selected via the attention mask
Phase 1 simplification: reduces implementation complexity and latency
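With batched, padded inputs, last-token extraction needs the index of the last real token rather than the final array position. The following is a minimal sketch under stated assumptions: right padding, a 0/1 attention mask, and a hypothetical helper name `last_token_states`. Left-padded batches would need different indexing.

```python
import numpy as np


def last_token_states(hidden, attention_mask):
    """Pick each sequence's last non-padding hidden state.

    hidden: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len) of 0/1 values, assumed right-padded.
    Returns an array of shape (batch, hidden_dim).
    """
    lengths = attention_mask.sum(axis=1)
    if np.any(lengths == 0):
        raise ValueError("every sequence needs at least one real token")
    last_idx = lengths - 1                   # index of the last real token
    batch_idx = np.arange(hidden.shape[0])
    return hidden[batch_idx, last_idx]       # advanced indexing, one row per sequence


# Toy batch: two sequences of length 4; the second has two padding positions.
hidden = np.arange(2 * 4 * 2, dtype=float).reshape(2, 4, 2)
attention_mask = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
])

selected = last_token_states(hidden, attention_mask)
naive = hidden[:, -1, :]  # would read a padding position for the second sequence
```

For the second sequence the naive slice returns the state at a padding position, whereas the mask-based index returns the state at position 1, the last real token.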

Negative:

Position-dependent — the last token's representation is influenced by its position in the sequence, not just its content
Very short inputs (1–2 tokens) may not have enough context for meaningful activation patterns
May miss patterns in long inputs where the adversarial payload is in the middle rather than the end
Phase 1 only: misses token-level behavioral signals that require per-token extraction (addressed in Phase 2)
