feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
docs/architecture/decisions/003-small-model-detector.md
@@ -0,0 +1,56 @@
# ADR-003: Small Model (~125M) as Detector

## Status

Accepted

## Context

The behavioral signal detection approach requires running a language model on
every input to extract hidden-state activations. The choice of model size
creates a trade-off:

- **Large model (7B+)**: Better representation quality and finer behavioral
  signal resolution, but requires a GPU, adds ~200-500ms latency, and costs
  more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer
  safety signals. Runs on CPU with <10ms latency and negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to
  emerge. Lacks the depth where behavioral patterns form.

EMNLP 2024 research confirms that safety signals are detectable in early
layers, so the model doesn't need deep processing to produce useful signals.
A ~125M model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim)
for safety directions to emerge in early layers.
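
The CPU-versus-GPU side of this trade-off can be made concrete with a rough weight-footprint estimate. The sketch below assumes 16-bit weights (2 bytes per parameter), which this ADR does not state but which is consistent with the 269MB figure quoted for SmolLM2-135M. The helper name `weight_footprint_mb` is illustrative only.

```python
def weight_footprint_mb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in MB (1 MB = 10**6 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an assumption,
    not something the ADR specifies.
    """
    return n_params * bytes_per_param / 1e6


# Nominal parameter counts for the three size classes discussed above.
SIZE_CLASSES = {
    "large (7B)": 7_000_000_000,
    "small (135M)": 135_000_000,
    "tiny (50M)": 50_000_000,
}

for name, n_params in SIZE_CLASSES.items():
    print(f"{name}: ~{weight_footprint_mb(n_params):,.0f} MB of weights")
```

Under that assumption a 7B model needs roughly 14 GB for weights alone, while a 135M model needs roughly 270 MB, which is why only the small class is a realistic fit for CPU-only hosts.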

## Decision

Use a small model (~125M parameters) as the detector, with SmolLM2-135M
(269MB, 30 layers, 576 hidden dim) as the default. Target <10ms latency on
CPU. Support model-agnostic detection: any compatible model can be used by
recompiling the codebook.
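
One way to keep the <10ms target honest is to measure it. The following is a minimal sketch, not part of the specified API: `measure_latency_ms`, `within_budget`, and the stand-in `fake_check` are hypothetical names, and using the 95th percentile rather than the mean is an assumption about how the target should be read.

```python
import statistics
import time
from typing import Callable

LATENCY_BUDGET_MS = 10.0  # the <10ms CPU target from this ADR


def measure_latency_ms(
    check: Callable[[str], object],
    text: str,
    runs: int = 50,
    warmup: int = 5,
) -> list[float]:
    """Time repeated calls to `check` and return per-call latency in ms."""
    for _ in range(warmup):  # warm caches before timing
        check(text)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        check(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def within_budget(samples: list[float], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True if the 95th-percentile latency is under the budget."""
    p95 = statistics.quantiles(samples, n=20)[-1]  # last of 19 cut points
    return p95 < budget_ms


def fake_check(text: str) -> int:
    """Stand-in for the real detector call (hypothetical)."""
    return sum(ord(c) for c in text)


if __name__ == "__main__":
    samples = measure_latency_ms(fake_check, "ignore previous instructions")
    print("within budget:", within_budget(samples))
```

In a real benchmark `fake_check` would be replaced by the firewall's screening call, run on the CPU class the deployment actually targets.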

## Consequences

**Positive**:

- <10ms latency enables real-time pre-inference screening
- CPU-deployable: no GPU required for the firewall
- Can run alongside the target model without blocking it
- Fast iteration: training or updating a 125M model takes hours, not days
- Small enough to embed in API gateways, CDN edges, and client applications
- 269MB model download is feasible via HF Hub with caching

**Negative**:

- Lower representation quality than larger models, so it may miss subtle
  signals that a 7B detector would catch
- Detector model must share some architectural similarity with target models
  for behavioral signals to transfer
- SmolLM2-135M is English-focused, so multilingual detection requires a
  multilingual detector model
- Codebook is model-specific, so switching models requires recompilation
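
Because the codebook is model-specific, loading a codebook against the wrong detector is a failure mode worth guarding against. Below is a minimal sketch of such a guard. `ModelSpec`, `CodebookMeta`, `compatibility_problems`, and every field and value shown are hypothetical and are not taken from the codebook or model specs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Shape of a detector model (hypothetical fields)."""
    model_id: str
    num_layers: int
    hidden_dim: int


@dataclass(frozen=True)
class CodebookMeta:
    """Metadata recorded when a codebook is compiled (hypothetical fields)."""
    model_id: str
    layer: int       # zero-based layer the activations were taken from
    hidden_dim: int  # width of the activations the basis was fit on


def compatibility_problems(codebook: CodebookMeta, model: ModelSpec) -> list[str]:
    """Return reasons the codebook cannot be used with the model.

    An empty list means no mismatch was found by these checks.
    """
    problems = []
    if codebook.model_id != model.model_id:
        problems.append(
            f"compiled for {codebook.model_id!r}, got {model.model_id!r}"
        )
    if codebook.hidden_dim != model.hidden_dim:
        problems.append(
            f"hidden dim {codebook.hidden_dim} != {model.hidden_dim}"
        )
    if not 0 <= codebook.layer < model.num_layers:
        problems.append(
            f"layer {codebook.layer} out of range for {model.num_layers} layers"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values only, not real model dimensions.
    codebook = CodebookMeta(model_id="model-a", layer=4, hidden_dim=512)
    swapped = ModelSpec(model_id="model-b", num_layers=2, hidden_dim=1024)
    for problem in compatibility_problems(codebook, swapped):
        print("recompile needed:", problem)
```

Returning a list of reasons rather than raising on the first mismatch lets a caller report everything that would need to change before recompiling.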

## References

- [model.md](../model.md)
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Behavioral traits transmit through
  non-semantic signals