feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/decisions/002-behavioral-signals.md
+++ b/docs/architecture/decisions/002-behavioral-signals.md
@@ -0,0 +1,52 @@
+# ADR-002: Behavioral Signal Detection (Not Text Classification)
+
+## Status
+
+Accepted
+
+## Context
+
+Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
+text-surface approaches — they classify input text as safe or unsafe. This
+fundamentally limits their effectiveness:
+
+- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
+  and pattern matching
+- Novel attack types require retraining classifiers
+- Text that looks natural to a classifier can still be adversarial when
+  processed by a model
+
+Academic research (2024-2025; see References) reports that adversarial inputs
+produce distinctive internal activation patterns, regardless of surface form.
+
+## Decision
+
+Build a behavioral signal detection system that monitors how a model processes
+inputs (hidden state activations), not what the inputs say (text surface).
+Adversarial inputs produce anomalous activation patterns that are detectable
+even when the text itself looks innocent.
+
+## Consequences
+
+**Positive**:
+- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
+- Detection is attack-type agnostic: it keys on anomalous activation patterns,
+  which previously unseen attacks are still expected to produce
+- Multi-dimensional signals provide interpretable detection (which SVD
+  directions are activated and by how much)
+- Complementary to existing text-surface defenses — can be layered
+
+**Negative**:
+- Requires running a model on every input (adds latency and compute cost)
+- Detection quality depends on the detector model being architecturally
+  similar to the likely attack targets
+- False positives possible for unusual but benign inputs (domain-specific
+  language, technical content)
+- No production system we know of has validated this approach; we would be first
+
+## References
+
+- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
+- HiddenDetect (ACL 2025)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- How Alignment and Jailbreak Work (EMNLP 2024)
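
The decision in ADR-002 (score how a model processes an input rather than what the text says) can be illustrated with a minimal sketch. It is not the alknet-firewall API: the function names `fit_baseline`, `anomaly_score`, and `is_anomalous` are hypothetical, the activation vectors are assumed to have been extracted from a model already, and plain per-dimension z-scores stand in for the SVD basis and spline distributions that the codebook spec describes.

```python
import math
from statistics import mean, pstdev


def fit_baseline(benign_activations):
    """Compute per-dimension mean and std over benign activation vectors.

    `benign_activations` is a list of equal-length lists of floats
    (hypothetical hidden-state vectors; extraction is out of scope here).
    """
    dims = list(zip(*benign_activations))
    means = [mean(d) for d in dims]
    # Floor the std so constant dimensions do not cause division by zero.
    stds = [max(pstdev(d), 1e-8) for d in dims]
    return means, stds


def anomaly_score(activation, baseline):
    """Root-mean-square z-score of one activation vector against the baseline."""
    means, stds = baseline
    z = [(a - m) / s for a, m, s in zip(activation, means, stds)]
    return math.sqrt(sum(v * v for v in z) / len(z))


def is_anomalous(activation, baseline, threshold=3.0):
    """Flag an input whose activations deviate strongly from benign ones.

    The default threshold is an arbitrary illustrative value, not a
    calibrated one.
    """
    return anomaly_score(activation, baseline) > threshold


if __name__ == "__main__":
    baseline = fit_baseline([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
    print(is_anomalous([1.0, 1.0], baseline))  # within the benign range
    print(is_anomalous([5.0, 5.0], baseline))  # far outside the benign range
```

The score depends only on activation statistics, so an obfuscated input is judged by how the model reacts to it rather than by its surface text. The same property underlies the false-positive risk listed under Negative: an unusual but benign input can also land far from the baseline.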