# ADR-009: Last-Token Activation Extraction

## Status

Accepted

## Context

To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:

- **Last token**: The hidden state at the final position, which has attended
  to the entire sequence. Standard for sequence classification with
  decoder-only models (GPT-style models naturally aggregate at the last
  position).
- **Mean pooling**: Average hidden states across all positions. Smooths out
  position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
  (LLaMA architecture) does not use a CLS token.
- **First token**: Under causal attention it has seen only the start of the
  sequence, so it misses context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy: a single
  position with extreme activation can dominate.

Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because, under causal attention, the last position is the only one
whose hidden state has attended to the full sequence. For safety detection,
this means the last token's representation contains the model's "conclusion"
about the entire input.

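To make the trade-offs between the pooling options concrete, here is a minimal sketch that applies four of them to a toy hidden-state matrix. It uses plain Python lists to stay dependency-free; real code would operate on tensors. The values in `hidden` and the helper names (`last_token`, `first_token`, `mean_pool`, `max_pool`) are illustrative assumptions, not part of the library.

```python
from statistics import fmean

# Toy hidden states for one layer: 4 token positions x 3 hidden dimensions.
# Values are made up; position 1 carries one extreme activation in dimension 0.
hidden = [
    [0.1, 0.2, 0.0],
    [9.0, 0.1, 0.1],
    [0.2, 0.3, 0.1],
    [0.4, 0.5, 0.6],
]


def last_token(states):
    """Hidden state at the final position."""
    return list(states[-1])


def first_token(states):
    """Hidden state at the first position."""
    return list(states[0])


def mean_pool(states):
    """Per-dimension average across all positions."""
    return [fmean(column) for column in zip(*states)]


def max_pool(states):
    """Per-dimension maximum across all positions."""
    return [max(column) for column in zip(*states)]


pooled = {
    "last": last_token(hidden),
    "first": first_token(hidden),
    "mean": mean_pool(hidden),
    "max": max_pool(hidden),
}

for name, vector in pooled.items():
    print(name, [round(value, 3) for value in vector])
```

In this toy input, the first dimension of the max-pooled vector is 9.0, taken from position 1 alone, while mean pooling spreads that same spike across all four positions. The last-token vector is simply the final row.
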
## Decision

Extract the last token's hidden state at each configured layer. This is
standard for LLaMA-family models and provides full-sequence context.

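A minimal sketch of this decision, under stated assumptions: hidden states are modeled as one `[seq_len][hidden_dim]` list per layer, loosely mirroring the per-layer tuple that Hugging Face Transformers returns with `output_hidden_states=True` (there each entry is a tensor shaped `(batch, seq_len, hidden_dim)`). The function names, the `layers` argument, and the toy data are hypothetical and do not describe the library's actual interface. The attention mask is used to locate the last real token, so the same code handles left- and right-padded input.

```python
def last_token_index(attention_mask):
    """Return the index of the last position whose mask value is 1.

    Scanning from the end works for both left- and right-padded sequences.
    """
    for index in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[index] == 1:
            return index
    raise ValueError("attention mask contains no real tokens")


def extract_last_token(hidden_states, attention_mask, layers):
    """Return {layer: vector} holding the last real token's state per layer.

    hidden_states: list indexed by layer, each a [seq_len][hidden_dim] list
                   (hypothetical layout for a single sequence).
    layers:        layer indices to read (hypothetical configuration value).
    """
    index = last_token_index(attention_mask)
    return {layer: list(hidden_states[layer][index]) for layer in layers}


# Toy data: 3 layers, 4 positions, 2 hidden dimensions.
# Each vector is [layer, position] so the selected entry is easy to verify.
hidden_states = [
    [[float(layer), float(position)] for position in range(4)]
    for layer in range(3)
]

# The final position is right padding, so the last real token is at index 2.
attention_mask = [1, 1, 1, 0]

vectors = extract_last_token(hidden_states, attention_mask, layers=[1, 2])
print(vectors)
```

Indexing with a plain `-1` would pick the padding position in this example, which is why the sketch derives the index from the mask instead.
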
## Consequences

**Positive**:

- Standard approach for autoregressive models: well-validated
- Full sequence context via causal attention
- Single vector per layer: simple to project and score
- Low padding sensitivity: only the index of the last non-padding token is
  needed, whereas mean pooling must mask every padded position

**Negative**:

- Position-dependent: the last token's representation is influenced by its
  position in the sequence, not just its content
- Very short inputs (1–2 tokens) may not have enough context for meaningful
  activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end

## References

- [model.md](../model.md)
- [codebook.md](../codebook.md)