feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/decisions/004-svd-based-detection.md
+++ b/docs/architecture/decisions/004-svd-based-detection.md
@@ -0,0 +1,58 @@
+# ADR-004: SVD-Based Anomaly Detection
+
+## Status
+
+Accepted
+
+## Context
+
+After extracting hidden state activations from the detector model, the
+firewall needs a method to distinguish normal behavioral patterns from
+adversarial ones. Options:
+
+- **Single classifier**: Train a binary classifier on activations. Simple but
+  loses the multi-dimensional structure. Black box.
+- **SVD + region comparison**: Decompose activation space into principal
+  directions, model normal behavioral regions along each direction, detect
+  inputs that fall outside normal regions. Interpretable, efficient,
+  multi-dimensional.
+- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
+  detect inputs with high reconstruction error. Complex, not interpretable.
+
+*Hidden Dimensions of LLM Alignment* (ICML 2025) reports that safety is
+multi-dimensional in activation space: a dominant refusal direction plus
+secondary dimensions. SVD is a natural way to discover such directions, and
+region comparison provides interpretable per-dimension signals.
+
+## Decision
+
+Use SVD-based anomaly detection: decompose activation space via SVD to
+discover principal behavioral directions, model normal regions along each
+dimension using monotonic spline distributions, and detect inputs whose
+projections fall outside normal regions.
+
+## Consequences
+
+**Positive**:
+- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
+- Efficient: Projection is O(k·d) per input for hidden size d, trivial at runtime
+- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
+- Robust: SVD captures structure of entire activation space, not a single
+  boundary
+- Small-model friendly: SVD on 768-dim hidden states is computationally trivial
+- Deterministic: `scipy.linalg.svd` is reproducible up to singular-vector sign
+  (unlike scikit-learn's `TruncatedSVD`, whose default solver is randomized)
+
+**Negative**:
+- SVD basis is model-specific — changing detector model requires recomputation
+- Basis quality depends on calibration dataset coverage
+- Linear decomposition may miss non-linear behavioral patterns
+- Requires a codebook compilation pipeline (Phase 2)
+- Full SVD costs O(n·d²) for n calibration samples and may be slow on large
+  datasets (mitigated by the relatively small hidden dim, d = 768)
+
+## References
+
+- [codebook.md](../codebook.md)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- HiddenDetect (ACL 2025)
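
The decision recorded in ADR-004 (decompose, model normal regions per direction, flag projections outside them) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `fit_basis`, `fit_regions`, and `flag_outliers` are hypothetical, `numpy.linalg.svd` is used in place of `scipy.linalg.svd` to keep the sketch dependency-light, plain empirical quantile bounds stand in for the monotonic spline distributions, and the calibration data is synthetic Gaussian noise with a toy hidden size of 16 rather than 768.

```python
import numpy as np


def fit_basis(calib, k):
    """Return the calibration mean and the top-k right singular vectors."""
    mean = calib.mean(axis=0)
    # Thin SVD of the centered (n, d) matrix; rows of vt are orthonormal
    # directions in activation space, ordered by singular value.
    _, _, vt = np.linalg.svd(calib - mean, full_matrices=False)
    basis = vt[:k]
    # Singular vectors are unique only up to sign, so pin a convention:
    # the largest-magnitude component of each direction is made positive.
    pivot = np.argmax(np.abs(basis), axis=1)
    signs = np.sign(basis[np.arange(k), pivot])
    return mean, basis * signs[:, None]


def fit_regions(calib, mean, basis, lo_q=0.005, hi_q=0.995):
    """Per-direction 'normal' interval from calibration projections.

    Assumption: simple quantile bounds replace the spline distributions.
    """
    proj = (calib - mean) @ basis.T  # shape (n, k)
    return np.quantile(proj, lo_q, axis=0), np.quantile(proj, hi_q, axis=0)


def flag_outliers(x, mean, basis, lo, hi):
    """Project one activation vector (O(k*d)) and flag out-of-region dims."""
    proj = basis @ (x - mean)  # shape (k,)
    return proj, (proj < lo) | (proj > hi)


# Synthetic calibration set: 2000 samples, toy hidden size 16, with
# decreasing variance per coordinate so the leading directions are distinct.
rng = np.random.default_rng(0)
calib = rng.normal(size=(2000, 16)) * np.linspace(3.0, 0.5, 16)

mean, basis = fit_basis(calib, k=4)
lo, hi = fit_regions(calib, mean, basis)

# An input at the calibration mean versus one pushed far along direction 0.
_, flags_normal = flag_outliers(mean, mean, basis, lo, hi)
_, flags_odd = flag_outliers(mean + 100.0 * basis[0], mean, basis, lo, hi)
print(flags_normal, flags_odd)
```

Because the flags are reported per direction, a caller can see which behavioral dimension tripped rather than receiving a single opaque score, which is the interpretability property the ADR lists as a positive consequence.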