feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/decisions/009-last-token-extraction.md
+++ b/docs/architecture/decisions/009-last-token-extraction.md
@@ -0,0 +1,55 @@
+# ADR-009: Last-Token Activation Extraction
+
+## Status
+
+Accepted
+
+## Context
+
+To extract behavioral signals from the detector model, we must choose which
+token's hidden state to use from the sequence of hidden states produced during
+inference. Options:
+
+- **Last token**: The hidden state at the final position, which has attended
+  to the entire sequence. Standard for sequence classification with
+  decoder-only models (GPT-style models aggregate at the last position).
+- **Mean pooling**: Average hidden states across all positions. Smooths out
+  position-specific effects but dilutes signal from safety-relevant tokens.
+- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
+  (LLaMA architecture) does not use a CLS token.
+- **First token**: Has seen only the beginning of the sequence. Misses
+  context from later tokens.
+- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
+  position with extreme activation can dominate.
+
+Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
+models because the last position's hidden state has attended to the full
+sequence via causal attention. For safety detection, this means the last
+token's representation contains the model's "conclusion" about the entire
+input.
+
+## Decision
+
+Extract the last token's hidden state at each configured layer. This is
+standard for LLaMA-family models and provides full-sequence context.
+
+## Consequences
+
+**Positive**:
+- Standard approach for autoregressive models — well-validated
+- Full sequence context via causal attention
+- Single vector per layer — simple to project and score
+- No averaging over padding; batches still need the last non-padding index
+
+**Negative**:
+- Position-dependent — the last token's representation is influenced by its
+  position in the sequence, not just its content
+- Very short inputs (1–2 tokens) may not have enough context for meaningful
+  activation patterns
+- May miss patterns in long inputs where the adversarial payload is in the
+  middle rather than the end
+
+## References
+
+- [model.md](../model.md)
+- [codebook.md](../codebook.md)
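
As an illustration of the decision recorded in ADR-009, the following is a minimal NumPy sketch of last-token extraction, not the project's implementation. The function names `last_token_states` and `extract_per_layer`, the array shapes, and the 0/1 attention-mask convention are assumptions made for this example. It selects the last non-padding position per sequence, so it gives the same result for left-padded and right-padded batches.

```python
import numpy as np


def last_token_states(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence.

    hidden_states:  array of shape (batch, seq_len, hidden_dim)
    attention_mask: 0/1 array of shape (batch, seq_len); 1 marks a real token
    (both shapes are assumptions of this sketch).
    """
    hidden_states = np.asarray(hidden_states)
    attention_mask = np.asarray(attention_mask)
    seq_len = attention_mask.shape[1]
    # argmax over the reversed mask gives the distance from the end of the
    # sequence to the last real token; this handles left and right padding.
    # Note: an all-zero mask row falls back to the final position.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]  # shape (batch, hidden_dim)


def extract_per_layer(all_hidden_states, attention_mask, layers):
    """Apply last-token extraction at each configured layer index.

    all_hidden_states: sequence of per-layer arrays, each (batch, seq, dim).
    layers: iterable of layer indices (hypothetical configuration value).
    """
    return {
        layer: last_token_states(all_hidden_states[layer], attention_mask)
        for layer in layers
    }


# Two sequences of length 4; the first one is right-padded by one position.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
vectors = last_token_states(hidden, mask)
```

Here `vectors` has one row per input: position 2 for the padded sequence and position 3 for the full one. Taking `hidden[:, -1]` directly would instead return the padding position for the first sequence, which is why the mask is consulted.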