# ADR-002: Behavioral Signal Detection (Not Text Classification)

## Status

Accepted

## Context

Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are text-surface approaches: they classify input text as safe or unsafe. This fundamentally limits their effectiveness:

- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when processed by a model

Academic research (2024-2025) demonstrates that adversarial inputs produce distinctive activation patterns in model internals, regardless of surface form.

## Decision

Build a behavioral signal detection system that monitors how a model processes inputs (hidden state activations), not what the inputs say (text surface). Adversarial inputs produce anomalous activation patterns that are detectable even when the text itself looks innocent.

## Consequences

**Positive**:

- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic: novel attacks still produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD directions are activated and by how much)
- Complementary to existing text-surface defenses, so the two can be layered

**Negative**:

- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific language, technical content)
- No existing production system validates this approach; we are first

## References

- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)
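
## Appendix: Illustrative Scoring Sketch

The Decision and the "interpretable detection" consequence describe scoring inputs by how strongly they activate particular SVD directions of the hidden states. This ADR does not specify an algorithm, so the following is only a minimal sketch of one way that could work. It makes several assumptions: activations are already available as fixed-size vectors (synthetic random data stands in for real hidden states here), the function names `fit_benign_subspace` and `anomaly_score` are hypothetical, and the choice of `k=2` directions and a 99th-percentile threshold is arbitrary and for illustration only.

```python
import numpy as np


def fit_benign_subspace(benign_acts, k):
    """Fit the mean, top-k SVD directions, and per-direction spread of benign activations."""
    mean = benign_acts.mean(axis=0)
    centered = benign_acts - mean
    # Rows of vt are the right singular vectors, i.e. directions in activation space.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:k]
    # Standard deviation of the benign projections along each kept direction.
    scale = s[:k] / np.sqrt(len(benign_acts) - 1)
    return mean, directions, scale


def anomaly_score(acts, mean, directions, scale):
    """Return per-direction z-scores and their Euclidean norm as an overall score."""
    z = (acts - mean) @ directions.T / scale
    return z, np.linalg.norm(z, axis=-1)


# Synthetic stand-in for hidden states (assumption: 16-dimensional, 500 benign samples).
rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 16))
benign[:, 0] *= 5.0  # give two dimensions more variance so they dominate the SVD
benign[:, 1] *= 3.0

mean, directions, scale = fit_benign_subspace(benign, k=2)
_, benign_scores = anomaly_score(benign, mean, directions, scale)

# Illustrative threshold: flags roughly 1% of the benign calibration inputs.
threshold = np.quantile(benign_scores, 0.99)

# A synthetic "suspicious" activation pushed far along the dominant dimension.
suspicious = np.zeros(16)
suspicious[0] = 40.0
z, suspicious_score = anomaly_score(suspicious, mean, directions, scale)
flagged = bool(suspicious_score > threshold)
```

The per-direction values in `z` are what would make a detection interpretable in the sense used above: they show which directions deviate and by how much. The percentile threshold also makes the false-positive trade-off explicit, since any benign input that is unusual along these directions will be flagged at the chosen rate.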