feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
55
docs/architecture/decisions/009-last-token-extraction.md
Normal file
@@ -0,0 +1,55 @@
# ADR-009: Last-Token Activation Extraction

## Status

Accepted
## Context
To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:

- **Last token**: The hidden state at the final position, which has attended
  to the entire sequence. Standard for sequence classification with
  decoder-only models (BERT pools a dedicated token instead; GPT-style models
  naturally aggregate at the last position).
- **Mean pooling**: Average hidden states across all positions. Smooths out
  position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
  (LLaMA architecture) does not use a CLS token.
- **First token**: Under causal attention it attends only to itself, so it
  misses context from all later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy: a single
  position with extreme activation can dominate.

Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because the last position's hidden state has attended to the full
sequence via causal attention. For safety detection, this means the last
token's representation contains the model's "conclusion" about the entire
input.
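The pooling options above can be compared on a toy example. This is a minimal sketch in dependency-free Python, not the library's implementation: the `hidden` matrix and all function names are hypothetical, and the values are made up to show how one extreme activation affects each strategy.

```python
from statistics import fmean

# Toy hidden states for ONE layer: 4 token positions x 3 hidden dimensions.
# The numbers are invented for illustration only.
hidden = [
    [0.1, 0.2, 0.3],  # position 0 (first token)
    [0.0, 9.0, 0.1],  # position 1 (a single extreme activation)
    [0.2, 0.1, 0.4],  # position 2
    [0.5, 0.3, 0.7],  # position 3 (last token)
]


def last_token(states):
    """Hidden state at the final position."""
    return list(states[-1])


def first_token(states):
    """Hidden state at position 0."""
    return list(states[0])


def mean_pool(states):
    """Average over positions; zip(*states) iterates over hidden dimensions."""
    return [fmean(column) for column in zip(*states)]


def max_pool(states):
    """Per-dimension maximum over positions."""
    return [max(column) for column in zip(*states)]


if __name__ == "__main__":
    for name, fn in [("last", last_token), ("first", first_token),
                     ("mean", mean_pool), ("max", max_pool)]:
        print(f"{name:>5}: {fn(hidden)}")
```

In this toy data the outlier at position 1 dominates dimension 1 under max pooling and shifts the mean, while the last-token vector is taken from a single position and is a single vector per layer.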
## Decision
Extract the last token's hidden state at each configured layer. This is
standard for LLaMA-family models and provides full-sequence context.
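As a rough illustration of the decision, the sketch below selects the last position's vector at each configured layer. It assumes hidden states are available as a nested sequence indexed `[layer][position][hidden_dim]` (similar in spirit to the per-layer tuple Hugging Face Transformers returns with `output_hidden_states=True`, minus the batch dimension). The function name `extract_last_token` and the `layers` parameter are hypothetical and are not taken from model.md.

```python
from typing import Dict, List, Sequence


def extract_last_token(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    layers: Sequence[int],
) -> Dict[int, List[float]]:
    """Return {layer_index: last-position hidden vector} for the given layers.

    hidden_states is indexed as [layer][position][hidden_dim].
    """
    if not hidden_states or not hidden_states[0]:
        raise ValueError("expected at least one layer and one token position")
    # One vector per configured layer, taken from the final position.
    return {layer: list(hidden_states[layer][-1]) for layer in layers}


if __name__ == "__main__":
    # 3 layers x 2 positions x 2 dimensions of made-up values.
    toy = [
        [[0.0, 0.1], [0.2, 0.3]],
        [[1.0, 1.1], [1.2, 1.3]],
        [[2.0, 2.1], [2.2, 2.3]],
    ]
    print(extract_last_token(toy, layers=[1, 2]))
```

Returning a mapping keyed by layer index keeps the output at one vector per layer, which is the shape the later projection and scoring steps would consume.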
## Consequences
**Positive**:
- Standard approach for autoregressive models, well-validated
- Full sequence context via causal attention
- Single vector per layer, simple to project and score
- No padding sensitivity for single unpadded inputs (unlike mean pooling,
  which needs attention masks); batched inputs must select the last
  non-padding position

**Negative**:
- Position-dependent: the last token's representation is influenced by its
  position in the sequence, not just its content
- Very short inputs (1–2 tokens) may not have enough context for meaningful
  activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end
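Two of these consequences lend themselves to a small guard. The sketch below is an assumption-laden illustration, not part of the specified API: if inputs are ever batched and padded, the final position may hold a padding token, so the last real token is located from the attention mask. Inputs below a hypothetical `min_tokens` threshold are flagged rather than scored. The threshold of 3 simply follows the "1–2 tokens" remark and is not a calibrated value.

```python
from typing import List, Optional, Sequence


def last_real_index(attention_mask: Sequence[int]) -> int:
    """Index of the last position with mask value 1 (left or right padding)."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("attention mask contains no real tokens")


def last_token_vector(
    states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    min_tokens: int = 3,  # hypothetical threshold, not from the ADR
) -> Optional[List[float]]:
    """Return the last real token's vector, or None if the input is too short."""
    if sum(attention_mask) < min_tokens:
        return None  # caller decides how to treat very short inputs
    return list(states[last_real_index(attention_mask)])


if __name__ == "__main__":
    states = [[0.1], [0.2], [0.3], [0.0], [0.0]]  # 5 positions x 1 dimension
    print(last_token_vector(states, [1, 1, 1, 0, 0]))  # right-padded input
```

Returning `None` for short inputs is one possible design; raising an error or emitting a low-confidence score would be equally reasonable depending on the alarm protocol.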
## References
- [model.md](../model.md)
- [codebook.md](../codebook.md)