Files
alknet-firewall/docs/architecture/decisions/009-last-token-extraction.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00

55 lines
2.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-009: Last-Token Activation Extraction
## Status
Accepted
## Context
To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:
- **Last token**: The hidden state at the final position, which has attended
to the entire sequence. Standard for sequence classification (used by BERT
pools, GPT-style models naturally aggregate at the last position).
- **Mean pooling**: Average hidden states across all positions. Smooths out
position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
(LLaMA architecture) does not use a CLS token.
- **First token**: Has seen only the beginning of the sequence. Misses
context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
position with extreme activation can dominate.
Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because the last position's hidden state has attended to the full
sequence via causal attention. For safety detection, this means the last
token's representation contains the model's "conclusion" about the entire
input.
## Decision
Extract the last token's hidden state at each configured layer. This is
standard for LLaMA-family models and provides full-sequence context.
## Consequences
**Positive**:
- Standard approach for autoregressive models — well-validated
- Full sequence context via causal attention
- Single vector per layer — simple to project and score
- No padding sensitivity (unlike mean pooling with attention masks)
**Negative**:
- Position-dependent — the last token's representation is influenced by its
position in the sequence, not just its content
- Very short inputs (12 tokens) may not have enough context for meaningful
activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
middle rather than the end
## References
- [model.md](../model.md)
- [codebook.md](../codebook.md)