alknet-firewall/docs/architecture/decisions/003-small-model-detector.md

# ADR-003: Small Model (~125M) as Detector

## Status

Accepted

## Context

The behavioral signal detection approach requires running a language model on
every input to extract hidden state activations. The choice of model size
creates a trade-off:

- **Large model (7B+)**: Better representation quality and finer behavioral
  signal resolution, but requires a GPU, adds ~200-500ms latency, and costs
  more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer
  safety signals. Runs on CPU with <10ms latency and negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to
  emerge; lacks the depth where behavioral patterns form.

EMNLP 2024 research (see References) reports that safety signals are
detectable in early layers, so the model does not need deep processing to
produce useful signals. A ~125M-class model such as SmolLM2-135M (30 layers,
576 hidden dim) has enough depth for safety directions to emerge in its early
layers.
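
As an illustration of what "reading a signal from early layers" could look like, the sketch below averages the cosine similarity between per-layer activations and per-layer reference directions over the first few layers only. The function names, the pooled per-layer vectors, and the idea of one direction per layer are assumptions made for this example; they are not taken from the firewall's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def early_layer_score(hidden_states, directions, max_layer):
    """Average cosine similarity over layers 0..max_layer-1.

    hidden_states: one pooled activation vector per layer (hypothetical input).
    directions: one reference direction per layer (hypothetical codebook entry).
    """
    layers = range(min(max_layer, len(hidden_states), len(directions)))
    scores = [cosine(hidden_states[i], directions[i]) for i in layers]
    return sum(scores) / len(scores) if scores else 0.0


# Toy 3-layer example with 2-dimensional activations.
states = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
dirs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = early_layer_score(states, dirs, max_layer=2)
```

Restricting `max_layer` is what keeps the check cheap: later layers never need to be evaluated if the early ones already carry the signal.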

## Decision

Use a small model (~125M parameters) as the default detector. SmolLM2-135M
(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
CPU. Detection stays model-agnostic: any compatible model can be substituted
by recompiling the codebook.
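
The 269MB figure is consistent with roughly 135M parameters stored at 2 bytes each. The short calculation below makes that arithmetic explicit; the rounded parameter count and the 16-bit storage format are assumptions for the sake of the estimate, not values read from the checkpoint.

```python
def weight_size_mb(num_params, bytes_per_param):
    """Approximate size of the raw weights in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


PARAMS = 135_000_000  # assumption: rounded from the "135M" in the model name

sizes = {
    "float32": weight_size_mb(PARAMS, 4),
    "bfloat16": weight_size_mb(PARAMS, 2),  # assumed storage format
    "int8": weight_size_mb(PARAMS, 1),
}
for dtype, mb in sizes.items():
    print(f"{dtype}: ~{mb:.0f} MB")
```

Under these assumptions the 16-bit estimate lands at about 270 MB, close to the 269MB quoted above.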

## Consequences

**Positive**:
- <10ms latency enables real-time pre-inference screening
- CPU-deployable — no GPU required for the firewall
- Can run alongside target model without blocking
- Fast iteration — training/updating a 125M model takes hours, not days
- Small enough to embed in API gateways, CDN edges, client applications
- 269MB model download is feasible via HF Hub with caching
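
One way to keep the <10ms target honest is to measure it on the deployment hardware. The harness below is a minimal sketch: `fake_detector` is a hypothetical stand-in for the real forward pass, and the choice of p95 as the gating statistic is an assumption rather than something this ADR specifies.

```python
import statistics
import time


def measure_latency_ms(fn, payload, runs=200, warmup=20):
    """Time fn(payload) and return (median_ms, p95_ms)."""
    for _ in range(warmup):
        fn(payload)  # warm caches before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


def within_budget(p95_ms, budget_ms=10.0):
    """True if the measured p95 latency is under the budget."""
    return p95_ms < budget_ms


def fake_detector(text):
    # Hypothetical stand-in for the detector forward pass.
    return sum(ord(c) for c in text) % 2


median_ms, p95_ms = measure_latency_ms(fake_detector, "example input")
print(f"median={median_ms:.3f} ms, p95={p95_ms:.3f} ms, ok={within_budget(p95_ms)}")
```

Swapping `fake_detector` for the real detector call gives a quick regression check when the model or codebook changes.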

**Negative**:
- Lower representation quality than larger models, so it may miss subtle
  signals that a 7B detector would catch
- Detector model must share some architectural similarity with target models
  for behavioral signals to transfer
- SmolLM2-135M is English-focused — multilingual detection requires a
  multilingual detector model
- Codebook is model-specific — switching models requires recompilation
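
Because the codebook is tied to the model it was compiled against, a loader can fail fast on a mismatch instead of silently producing meaningless scores. The guard below is a sketch under assumed names: the `Codebook` fields, `CodebookMismatch`, and `check_compatible` are hypothetical and do not describe the project's actual codebook format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codebook:
    model_id: str    # model the codebook was compiled against
    num_layers: int
    hidden_dim: int


class CodebookMismatch(ValueError):
    """Raised when a codebook is used with a model it was not compiled for."""


def check_compatible(codebook, model_id, num_layers, hidden_dim):
    """Raise CodebookMismatch unless the model matches the codebook's metadata."""
    actual = (model_id, num_layers, hidden_dim)
    expected = (codebook.model_id, codebook.num_layers, codebook.hidden_dim)
    if actual != expected:
        raise CodebookMismatch(
            f"codebook compiled for {expected}, got {actual}; recompile required"
        )


# Toy values for illustration only.
book = Codebook(model_id="model-a", num_layers=4, hidden_dim=8)
check_compatible(book, "model-a", 4, 8)  # passes silently

try:
    check_compatible(book, "model-b", 4, 8)
    mismatch_detected = False
except CodebookMismatch:
    mismatch_detected = True
```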

## References

- [model.md](../model.md)
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Behavioral traits transmit through
  non-semantic signals