alknet-firewall/docs/architecture/decisions/004-svd-based-detection.md

# ADR-004: SVD-Based Anomaly Detection

## Status

Accepted

## Context

After extracting hidden state activations from the detector model, the
firewall needs a method to distinguish normal behavioral patterns from
adversarial ones. Options:

- **Single classifier**: Train a binary classifier on activations. Simple but
  loses the multi-dimensional structure. Black box.
- **SVD + region comparison**: Decompose activation space into principal
  directions, model normal behavioral regions along each direction, detect
  inputs that fall outside normal regions. Interpretable, efficient,
  multi-dimensional.
- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
  detect inputs with high reconstruction error. Complex, not interpretable.

Hidden Dimensions of LLM Alignment (ICML 2025) reports that safety is
multi-dimensional in activation space: a dominant refusal direction plus
several secondary directions. SVD is a natural way to recover such directions
from calibration activations, and region comparison then yields an
interpretable signal for each dimension.

## Decision

Use SVD-based anomaly detection: decompose activation space via SVD to
discover principal behavioral directions, model normal regions along each
dimension using monotonic spline distributions, and detect inputs whose
projections fall outside normal regions.
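
The decision above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the firewall's implementation: the function names (`fit_basis`, `fit_regions`, `detect`), the choice of `k = 8`, and the synthetic calibration data are all hypothetical. Two simplifications are swapped in deliberately: empirical quantile bounds stand in for the monotonic spline distributions, and `numpy.linalg.svd` is used in place of `scipy.linalg.svd` to keep the sketch dependency-light.

```python
import numpy as np


def fit_basis(calib, k):
    """Center calibration activations; return (mean, top-k right singular vectors)."""
    mean = calib.mean(axis=0)
    # Thin SVD: rows of vt are principal directions in hidden-state space.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    return mean, vt[:k]


def fit_regions(calib, mean, basis, lo_q=0.005, hi_q=0.995):
    """Per-direction [low, high] bounds from quantiles of normal projections.

    Assumption: quantile bounds are a simple stand-in for the monotonic
    spline distributions named in the decision.
    """
    proj = (calib - mean) @ basis.T  # shape (n_samples, k)
    return np.quantile(proj, lo_q, axis=0), np.quantile(proj, hi_q, axis=0)


def detect(x, mean, basis, low, high):
    """Return (is_anomalous, per-direction flags) for one activation vector."""
    z = (x - mean) @ basis.T  # k dot products of length d
    outside = (z < low) | (z > high)
    return bool(outside.any()), outside


# Synthetic stand-in for calibration activations (hypothetical data):
# Gaussian noise with extra variance along the first k coordinates.
rng = np.random.default_rng(0)
d, k = 768, 8
scales = np.ones(d)
scales[:k] = [10, 9, 8, 7, 6, 5, 4, 3]
calib = rng.normal(size=(2000, d)) * scales

mean, basis = fit_basis(calib, k)
low, high = fit_regions(calib, mean, basis)

# An input pushed far along the first principal direction.
shifted = mean + 100.0 * basis[0]
is_anomalous, flags = detect(shifted, mean, basis, low, high)
print(is_anomalous, np.flatnonzero(flags))
```

Returning the per-direction flags alongside the overall verdict is what keeps the signal interpretable: a caller can report which labeled direction an input left, not only that it was rejected.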

## Consequences

**Positive**:
- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projecting a d-dimensional activation onto k directions costs
  O(k·d) once the basis is computed, which is trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single
  boundary
- Small-model friendly: SVD on 768-dim hidden states is computationally trivial
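- Deterministic: `scipy.linalg.svd` computes a full decomposition with no
  random component, so the same calibration data yields the same basis
  (unlike scikit-learn's `TruncatedSVD`, whose default `randomized` solver
  depends on a random seed)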
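
One caveat on reproducibility is that singular vectors are only defined up to sign, so a label such as "refusal" could end up attached to a flipped axis after recomputation. The sketch below shows one possible way to pin the sign. The helper name `canonical_basis` and the convention itself (largest-magnitude component made positive) are assumptions for illustration, not something this ADR specifies, and `numpy.linalg.svd` again stands in for `scipy.linalg.svd`.

```python
import numpy as np


def canonical_basis(x, k):
    """Top-k right singular vectors with a fixed sign convention."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].copy()
    # Flip each direction so its largest-magnitude component is positive.
    idx = np.abs(basis).argmax(axis=1)
    signs = np.sign(basis[np.arange(k), idx])
    return basis * signs[:, None]


# Hypothetical calibration data with decaying variance per coordinate.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 64)) * np.linspace(5.0, 1.0, 64)

a = canonical_basis(x, 4)
b = canonical_basis(x.copy(), 4)
print(np.allclose(a, b))
```

Any fixed convention would do. What matters is that the same one is applied every time the basis is recomputed, so stored direction labels keep their orientation.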

**Negative**:
- SVD basis is model-specific: changing the detector model requires
  recomputing it
- Basis quality depends on calibration dataset coverage
- Linear decomposition may miss non-linear behavioral patterns
- Requires a codebook compilation pipeline (Phase 2)
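- Full SVD cost grows with the number of calibration samples n (roughly
  O(n·d²) for n ≥ d), so large calibration datasets may be slow; this is
  mitigated by the relatively small hidden dimension (d = 768)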

## References

- [codebook.md](../codebook.md)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- HiddenDetect (ACL 2025)