Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
ADR-002: Behavioral Signal Detection (Not Text Classification)
Status
Accepted
Context
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are text-surface approaches — they classify input text as safe or unsafe. This fundamentally limits their effectiveness:
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when processed by a model
Academic research from 2024-2025 (see References) reports that adversarial inputs produce distinctive activation patterns in model internals, regardless of surface form.
Decision
Build a behavioral signal detection system that monitors how a model processes inputs (hidden state activations), not what the inputs say (text surface). Adversarial inputs produce anomalous activation patterns that are detectable even when the text itself looks innocent.
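To make the decision concrete, here is a minimal sketch of scoring one hidden-state activation vector against a small set of basis directions. Everything named here (`Direction`, `direction_scores`, `anomaly_score`, the toy `CODEBOOK`) is hypothetical and not the alknet-firewall API. The sketch assumes the directions and their benign statistics were computed elsewhere, and it uses plain z-scores with a Euclidean norm as a stand-in for whatever score composition the codebook spec defines.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Direction:
    """One basis direction plus benign calibration statistics (hypothetical)."""

    vector: tuple          # unit-length direction in activation space
    benign_mean: float     # mean projection over benign calibration inputs
    benign_std: float      # standard deviation of those projections


def project(activation, direction):
    """Dot product of an activation vector with one basis direction."""
    return sum(a * d for a, d in zip(activation, direction.vector))


def direction_scores(activation, codebook):
    """Per-direction z-scores: distance of each projection from benign behavior."""
    return [
        (project(activation, d) - d.benign_mean) / d.benign_std
        for d in codebook
    ]


def anomaly_score(activation, codebook):
    """Combine per-direction z-scores into one number (Euclidean norm)."""
    return math.sqrt(sum(z * z for z in direction_scores(activation, codebook)))


# Toy 3-dimensional activation space with two directions (made-up numbers).
CODEBOOK = [
    Direction(vector=(1.0, 0.0, 0.0), benign_mean=0.0, benign_std=1.0),
    Direction(vector=(0.0, 1.0, 0.0), benign_mean=2.0, benign_std=0.5),
]

benign_like = (0.0, 2.0, 5.0)   # projections match the benign means
unusual = (3.0, 4.0, 5.0)       # projections deviate on both directions

if __name__ == "__main__":
    print(direction_scores(unusual, CODEBOOK))
    print(anomaly_score(benign_like, CODEBOOK), anomaly_score(unusual, CODEBOOK))
```

The per-direction scores are kept separate from the combined score so that a caller can see which directions deviated and by how much, not only whether the input was flagged.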
Consequences
Positive:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Detection is attack-type agnostic: it keys on anomalous activation patterns rather than known attack signatures, so novel attacks are expected to be caught as well
- Multi-dimensional signals provide interpretable detection (which SVD directions are activated and by how much)
- Complementary to existing text-surface defenses — can be layered
Negative:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific language, technical content)
- No existing production system validates this approach, so we would be first and have no operational track record to draw on
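The false-positive trade-off above can be illustrated with a small calibration sketch: pick the alarm threshold from anomaly scores of known-benign inputs so that at most a chosen fraction of them would raise an alarm. The function names and the sample scores are hypothetical, and this is one simple empirical-quantile approach under those assumptions, not the project's calibration procedure.

```python
import math


def calibrate_threshold(benign_scores, target_fpr):
    """Return a threshold that at most `target_fpr` of benign scores exceed.

    An input alarms when its score is strictly greater than the threshold.
    """
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(benign_scores)
    # Number of benign inputs we tolerate above the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed - 1]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign scores that would alarm at this threshold."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


# Made-up anomaly scores for ten benign inputs, including one outlier
# (for example, dense domain-specific or technical text).
BENIGN_SCORES = [0.5, 0.7, 0.9, 1.1, 1.2, 1.4, 1.6, 2.0, 2.8, 4.5]

threshold = calibrate_threshold(BENIGN_SCORES, target_fpr=0.1)
observed = false_positive_rate(BENIGN_SCORES, threshold)

if __name__ == "__main__":
    print(threshold, observed)
```

A lower target rate pushes the threshold toward the benign outliers and reduces false alarms, at the cost of missing adversarial inputs whose scores fall below it.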
References
- llm-input-safety-landscape.md
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)