# ADR-003: Small Model (~125M) as Detector

## Status

Accepted

## Context

The behavioral signal detection approach requires running a language model on every input to extract hidden state activations. The choice of model size creates a trade-off:

- **Large model (7B+)**: Better representation quality and more behavioral signal resolution, but requires a GPU, adds ~200-500ms latency, and costs more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer safety signals. Runs on CPU with <10ms latency and negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to emerge. Lacks the depth where behavioral patterns form.

EMNLP 2024 research reports that safety signals are detectable in early layers, so the model does not need deep processing to produce useful signals. A small model such as SmolLM2-135M (30 layers, 576 hidden dim) has enough depth for safety directions to emerge in its early layers.

## Decision

Use a small model (~125M parameters) as the default detector. SmolLM2-135M (269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on CPU. Support model-agnostic detection: any compatible model can be used by recompiling the codebook.

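The <10ms CPU target in the decision above is easiest to hold onto if it is checked rather than assumed. The following is a minimal sketch of such a check, not the project's actual benchmark: `measure_latency`, `within_budget`, and `fake_detector` are hypothetical names introduced here, and `fake_detector` is only a stand-in for the real forward pass through the detector model.

```python
import statistics
import time
from typing import Callable, Dict, List

LATENCY_BUDGET_MS = 10.0  # the <10ms CPU target stated in this ADR


def measure_latency(detect: Callable[[str], object],
                    inputs: List[str],
                    warmup: int = 3) -> Dict[str, float]:
    """Time `detect` on each input; return p50 and p95 latency in milliseconds."""
    if not inputs:
        raise ValueError("inputs must not be empty")

    # Warm-up calls are not timed, so one-time costs (lazy initialisation,
    # cache fills) do not skew the measured samples.
    for text in inputs[:warmup]:
        detect(text)

    samples_ms = []
    for text in inputs:
        start = time.perf_counter()
        detect(text)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


def within_budget(stats: Dict[str, float],
                  budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True when the p95 latency is strictly under the budget."""
    return stats["p95_ms"] < budget_ms


def fake_detector(text: str) -> float:
    # Hypothetical placeholder: a real detector would run the small model
    # and score its early-layer hidden states here.
    return float(len(text) % 7)


if __name__ == "__main__":
    stats = measure_latency(fake_detector, ["hello world"] * 200)
    print(stats, "within budget:", within_budget(stats))
```

Judging the budget on p95 rather than the mean is a design choice made for this sketch: a pre-inference screen sits on the request path, so tail latency matters more than the average.
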
## Consequences

**Positive**:

- <10ms latency enables real-time pre-inference screening
- CPU-deployable: no GPU required for the firewall
- Can run alongside the target model without blocking
- Fast iteration: training or updating a 125M model takes hours, not days
- Small enough to embed in API gateways, CDN edges, and client applications
- 269MB model download is feasible via HF Hub with caching

**Negative**:

- Lower representation quality than larger models, so it may miss subtle signals that a 7B detector would catch
- Detector model must share some architectural similarity with target models for behavioral signals to transfer
- SmolLM2-135M is English-focused, so multilingual detection requires a multilingual detector model
- Codebook is model-specific, so switching models requires recompilation

## References

- [model.md](../model.md)
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Behavioral traits transmit through non-semantic signals