alknet-firewall/docs/architecture/decisions/003-small-model-detector.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00


ADR-003: Small Model (~125M) as Detector

Status

Accepted

Context

The behavioral signal detection approach requires running a language model on every input to extract hidden state activations. The choice of model size creates a trade-off:

  • Large model (7B+): Better representation quality and finer behavioral signal resolution, but requires a GPU, adds ~200-500ms latency, and costs more per check.
  • Small model (~125M): Sufficient representation quality for early-layer safety signals. Runs on CPU, targets <10ms latency, and has negligible cost per check.
  • Tiny model (<50M): Too small for safety-relevant representations to emerge. Lacks the depth where behavioral patterns form.

EMNLP 2024 research (see References) reports that safety signals are detectable in early layers, so the model does not need deep processing to produce useful signals. A ~125M-class model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim) for safety directions to emerge in early layers.
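As a rough illustration of what "extracting early-layer activations" could look like, the sketch below uses the Hugging Face `transformers` API (`output_hidden_states=True`) and mean-pools each early layer over the token dimension. The function names, the choice of the first 25% of layers, and mean pooling are all assumptions made for illustration; they are not taken from the alknet-firewall specs. PyTorch is imported lazily so that importing the module does not require it.

```python
from typing import List


def early_layer_indices(num_layers: int, fraction: float = 0.25) -> List[int]:
    """Indices into `hidden_states` covering the first `fraction` of blocks.

    Index 0 of `hidden_states` is the embedding output, so blocks start at 1.
    The 25% default is an illustrative assumption, not a project setting.
    """
    if num_layers < 1 or not 0.0 < fraction <= 1.0:
        raise ValueError("need num_layers >= 1 and 0 < fraction <= 1")
    count = max(1, int(num_layers * fraction))
    return list(range(1, count + 1))


def extract_early_activations(text: str, model_id: str = "HuggingFaceTB/SmolLM2-135M"):
    """Return {layer_index: mean-pooled hidden state} for the early layers."""
    # Imported lazily so this module can be imported without PyTorch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states holds num_hidden_layers + 1 tensors of shape [batch, seq, hidden].
    layers = early_layer_indices(model.config.num_hidden_layers)
    return {i: outputs.hidden_states[i].mean(dim=1).squeeze(0) for i in layers}
```

Calling `extract_early_activations("some untrusted input")` downloads the model on first use and returns one vector per selected layer. The layer count is read from the model config rather than hard-coded, so the same sketch applies to other causal language models on the Hub.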

Decision

Use a small model (~125M parameters) as the detector, with SmolLM2-135M (269MB, 30 layers, 576 hidden dim) as the default. Target <10ms latency on CPU. Support model-agnostic detection: any compatible model can be used by recompiling the codebook.
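One way to check the <10ms CPU target is a small timing harness around whatever callable performs a single check. The sketch below is a minimal example using only the standard library. The function names and the decision to gate on the 95th percentile are assumptions, since this ADR does not say which latency statistic the target refers to.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    check: Callable[[], object], runs: int = 50, warmup: int = 5
) -> Dict[str, float]:
    """Time repeated calls to `check` and summarize them in milliseconds."""
    for _ in range(warmup):
        check()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        check()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Simple index-based p95 estimate; adequate for a smoke test.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


def within_budget(stats: Dict[str, float], budget_ms: float = 10.0) -> bool:
    """True if the p95 latency fits the budget (gating on p95 is an assumption)."""
    return stats["p95"] <= budget_ms
```

In practice `check` would wrap one detector call on a representative input, for example `lambda: detector.score(sample_text)`, where `detector.score` is a placeholder name.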

Consequences

Positive:

  • <10ms latency enables real-time pre-inference screening
  • CPU-deployable — no GPU required for the firewall
  • Can run alongside target model without blocking
  • Fast iteration — training/updating a 125M model takes hours, not days
  • Small enough to embed in API gateways, CDN edges, client applications
  • 269MB model download is feasible via HF Hub with caching
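The 269MB figure is consistent with a back-of-envelope estimate of parameter count times bytes per parameter. The sketch below assumes roughly 135 million parameters stored as 16-bit weights and uses decimal megabytes; the helper name and the dtype table are illustrative only.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_size_mb(num_params: int, dtype: str = "bf16") -> float:
    """Approximate weight file size in decimal megabytes (metadata ignored)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


# Assumed round parameter count for a ~135M model.
print(weight_size_mb(135_000_000, "bf16"))  # 270.0
print(weight_size_mb(135_000_000, "fp32"))  # 540.0
```

The same arithmetic shows why a 7B detector is a different deployment class: at 16 bits per weight it needs about 14 GB for weights alone.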

Negative:

  • Less representation quality than larger models — may miss subtle signals that a 7B detector would catch
  • Detector model must share some architectural similarity with target models for behavioral signals to transfer
  • SmolLM2-135M is English-focused — multilingual detection requires a multilingual detector model
  • Codebook is model-specific — switching models requires recompilation
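Because the codebook is tied to the model it was compiled from, a model-agnostic interface benefits from an explicit compatibility check. The sketch below is one possible shape for that guard. `ActivationModel`, `Codebook`, and `check_compatible` are hypothetical names, not the API described in model.md or the codebook spec.

```python
from dataclasses import dataclass
from typing import List, Protocol


class ActivationModel(Protocol):
    """Hypothetical interface: anything that maps text to an activation vector."""

    model_id: str
    hidden_dim: int

    def activations(self, text: str) -> List[float]: ...


@dataclass(frozen=True)
class Codebook:
    """Hypothetical stub that records which model the codebook was compiled for."""

    model_id: str
    hidden_dim: int


def check_compatible(model: ActivationModel, codebook: Codebook) -> None:
    """Refuse to score with a codebook compiled for a different model."""
    if (model.model_id, model.hidden_dim) != (codebook.model_id, codebook.hidden_dim):
        raise ValueError(
            f"codebook compiled for {codebook.model_id!r} "
            f"(dim {codebook.hidden_dim}) cannot be used with "
            f"{model.model_id!r} (dim {model.hidden_dim}); recompile it"
        )


@dataclass
class ToyModel:
    """Stand-in model for demonstration; returns a constant vector."""

    model_id: str
    hidden_dim: int

    def activations(self, text: str) -> List[float]:
        return [0.0] * self.hidden_dim
```

With this guard, swapping the detector model without recompiling the codebook fails loudly at load time instead of producing meaningless scores.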

References

  • model.md
  • EMNLP 2024: Safety signals detectable in early layers
  • Subliminal Learning (Nature 2026): Behavioral traits transmit through non-semantic signals