# ADR-004: SVD-Based Anomaly Detection

## Status

Accepted

## Context

After extracting hidden state activations from the detector model, the firewall needs a method to distinguish normal behavioral patterns from adversarial ones. Options:

- **Single classifier**: Train a binary classifier on activations. Simple but loses the multi-dimensional structure. Black box.
- **SVD + region comparison**: Decompose activation space into principal directions, model normal behavioral regions along each direction, detect inputs that fall outside normal regions. Interpretable, efficient, multi-dimensional.
- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs, detect inputs with high reconstruction error. Complex, not interpretable.

ICML 2025 research shows safety is multi-dimensional in activation space: a dominant refusal direction plus secondary dimensions. SVD naturally discovers these directions. Region comparison provides interpretable per-dimension signals.

## Decision

Use SVD-based anomaly detection: decompose activation space via SVD to discover principal behavioral directions, model normal regions along each dimension using monotonic spline distributions, and detect inputs whose projections fall outside normal regions.

## Consequences

**Positive**:

- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projecting a d-dimensional activation onto k directions is O(k·d) after decomposition, trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single boundary
- Small-model friendly: SVD on 768-dim hidden states is computationally trivial
- Deterministic: `scipy.linalg.svd` computes the full decomposition deterministically, so results are reproducible (unlike scikit-learn's `TruncatedSVD`, whose default solver is randomized)

**Negative**:

- SVD basis is model-specific: changing the detector model requires recomputation
- Basis quality depends on calibration dataset coverage
- Linear decomposition may miss non-linear behavioral patterns
- Requires a codebook compilation pipeline (Phase 2)
- Full SVD on large calibration datasets may be slow (mitigated by the relatively small hidden dimension: 768)

## References

- [codebook.md](../codebook.md)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- HiddenDetect (ACL 2025)
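## Appendix: Illustrative Sketch

The decision above (decompose, model a normal region per direction, flag projections that fall outside) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the firewall's implementation: the function names `fit_basis`, `fit_regions`, and `out_of_region` are hypothetical, the data is synthetic, and plain empirical quantile bounds stand in for the monotonic spline distributions named in the decision. It uses `numpy.linalg.svd` to keep the sketch to a single dependency; `scipy.linalg.svd` returns the same `(U, s, Vh)` triple.

```python
import numpy as np


def fit_basis(calib: np.ndarray, k: int):
    """Center the calibration activations and return (mean, top-k directions)."""
    mean = calib.mean(axis=0)
    # Thin SVD: rows of vh are orthonormal principal directions in activation space.
    _, _, vh = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vh[:k]


def fit_regions(calib, mean, basis, lo_q=0.005, hi_q=0.995):
    """Per-direction normal region as an empirical quantile interval.

    Assumption: quantile bounds are a simple stand-in for the spline-based
    distributions described in the ADR.
    """
    proj = (calib - mean) @ basis.T  # shape (n_samples, k)
    return np.quantile(proj, lo_q, axis=0), np.quantile(proj, hi_q, axis=0)


def out_of_region(x, mean, basis, lo, hi):
    """Boolean flag per direction: True where the projection leaves the region."""
    z = basis @ (x - mean)  # k dot products of length d, i.e. O(k*d)
    return (z < lo) | (z > hi)


# Usage on synthetic data (hypothetical stand-in for real hidden states).
rng = np.random.default_rng(0)
d, k = 64, 4
scales = np.ones(d)
scales[:k] = [8.0, 6.0, 4.0, 3.0]  # a few high-variance directions
calib = rng.normal(size=(500, d)) * scales

mean, basis = fit_basis(calib, k)
lo, hi = fit_regions(calib, mean, basis)

typical = mean                      # projects to zero on every direction
shifted = mean + 100.0 * basis[0]   # pushed far along the first direction

print(out_of_region(typical, mean, basis, lo, hi))
print(out_of_region(shifted, mean, basis, lo, hi))
```

Returning one flag per direction, rather than a single score, mirrors the per-dimension interpretability the ADR lists as a benefit: a labeled direction that fires says which behavior moved.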