Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
ADR-004: SVD-Based Anomaly Detection
Status
Accepted
Context
After extracting hidden state activations from the detector model, the firewall needs a method to distinguish normal behavioral patterns from adversarial ones. Options:
- Single classifier: Train a binary classifier on activations. Simple but loses the multi-dimensional structure. Black box.
- SVD + region comparison: Decompose activation space into principal directions, model normal behavioral regions along each direction, detect inputs that fall outside normal regions. Interpretable, efficient, multi-dimensional.
- Autoencoder anomaly detection: Train an autoencoder on normal inputs, detect inputs with high reconstruction error. Complex, not interpretable.
ICML 2025 research shows safety is multi-dimensional in activation space — a dominant refusal direction plus secondary dimensions. SVD naturally discovers these directions. Region comparison provides interpretable per-dimension signals.
Decision
Use SVD-based anomaly detection: decompose activation space via SVD to discover principal behavioral directions, model normal regions along each dimension using monotonic spline distributions, and detect inputs whose projections fall outside normal regions.
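The decision can be illustrated with a minimal sketch. It is not the project's implementation: the names `fit_codebook`, `tail_scores`, `k`, and `n_knots` are hypothetical, the calibration data is synthetic, and a PCHIP interpolant over empirical quantiles stands in for whatever monotonic spline form codebook.md actually specifies. The two-sided tail score is likewise an assumption about how "outside the normal region" might be quantified.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.linalg import svd


def fit_codebook(calib, k=8, n_knots=33):
    """Fit an SVD basis plus one monotone spline CDF per direction (illustrative)."""
    mean = calib.mean(axis=0)
    # Rows of vt are the principal directions of the centered activations.
    _, _, vt = svd(calib - mean, full_matrices=False)
    basis = vt[:k]                       # shape (k, d)
    proj = (calib - mean) @ basis.T      # shape (n, k)
    probs = np.linspace(0.0, 1.0, n_knots)
    cdfs = []
    for j in range(k):
        knots = np.quantile(proj[:, j], probs)
        # PCHIP needs strictly increasing x, so drop repeated quantile values.
        knots, idx = np.unique(knots, return_index=True)
        # PCHIP preserves monotonicity, so the fitted CDF is nondecreasing.
        cdf = PchipInterpolator(knots, probs[idx])
        cdfs.append((knots[0], knots[-1], cdf))
    return mean, basis, cdfs


def tail_scores(x, mean, basis, cdfs):
    """Two-sided tail probability per direction; values near 0 are anomalous."""
    z = (x - mean) @ basis.T
    scores = np.empty(len(cdfs))
    for j, (lo, hi, cdf) in enumerate(cdfs):
        # Projections beyond the calibrated range are clamped to its edge.
        u = float(np.clip(cdf(np.clip(z[j], lo, hi)), 0.0, 1.0))
        scores[j] = 2.0 * min(u, 1.0 - u)
    return scores


# Synthetic stand-in for calibration activations (assumption: 32-dim, Gaussian).
rng = np.random.default_rng(0)
calib = rng.normal(size=(2000, 32)) * np.linspace(3.0, 0.5, 32)
mean, basis, cdfs = fit_codebook(calib, k=4)

typical = tail_scores(mean, mean, basis, cdfs)
outlier = tail_scores(mean + 50.0 * basis[0], mean, basis, cdfs)
```

In this sketch an input is scored per direction, so an alarm could report which direction fired rather than a single opaque number; how the per-direction scores are combined is left to the score composition spec.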
Consequences
Positive:
- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projection is O(k·d) per input (k directions over a d-dimensional hidden state) after decomposition, trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single boundary
- Small-model friendly: SVD on 768-dim hidden states is computationally trivial
- Deterministic: scipy.linalg.svd produces an exact, reproducible decomposition (unlike TruncatedSVD, which uses a randomized solver by default)
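The reproducibility claim can be checked with a small sketch. The activation matrix here is random stand-in data with the 768-dim hidden size mentioned above, and the comparison uses a tolerance rather than bitwise equality, since bit-for-bit results can depend on the BLAS/LAPACK build and threading setup.

```python
import numpy as np
from scipy.linalg import svd

# Stand-in for calibration activations (assumption: 300 samples, 768-dim).
rng = np.random.default_rng(0)
acts = rng.normal(size=(300, 768))
centered = acts - acts.mean(axis=0)

# Two independent runs on the same input; no random seed is involved.
_, s1, vt1 = svd(centered, full_matrices=False)
_, s2, vt2 = svd(centered, full_matrices=False)

reproducible = np.allclose(s1, s2) and np.allclose(vt1, vt2)
```

Singular vectors are only defined up to sign, so a compiled codebook would likely want to pin a sign convention if the basis must match across machines; that detail is not covered by this ADR.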
Negative:
- SVD basis is model-specific — changing detector model requires recomputation
- Basis quality depends on calibration dataset coverage
- Linear decomposition may miss non-linear behavioral patterns
- Requires a codebook compilation pipeline (Phase 2)
- Full SVD on large calibration datasets may be slow (mitigated by relatively small hidden dim: 768)
References
- codebook.md
- Hidden Dimensions of LLM Alignment (ICML 2025)
- HiddenDetect (ACL 2025)