# ADR-002: Behavioral Signal Detection (Not Text Classification)

## Status

Accepted

## Context

Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
text-surface approaches — they classify input text as safe or unsafe. This
fundamentally limits their effectiveness:

- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
  and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
  processed by a model

Academic research (2024-2025) demonstrates that adversarial inputs produce
distinctive activation patterns in model internals, regardless of surface form.

## Decision

Build a behavioral signal detection system that monitors how a model processes
inputs (hidden state activations), not what the inputs say (text surface).
Adversarial inputs produce anomalous activation patterns that are detectable
even when the text itself looks innocent.

## Consequences

**Positive**:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic — novel attacks still
  produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
  directions are activated and by how much)
- Complementary to existing text-surface defenses — can be layered

**Negative**:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
  with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
  language, technical content)
- No existing production system validates this approach — we are first

## References

- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)