Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that
screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

# ADR-002: Behavioral Signal Detection (Not Text Classification)

## Status

Accepted

## Context

Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
text-surface approaches: they classify input text as safe or unsafe. This
fundamentally limits their effectiveness:

- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
  and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
  processed by a model

Academic research (2024-2025) demonstrates that adversarial inputs produce
distinctive activation patterns in model internals, regardless of surface form.

## Decision

Build a behavioral signal detection system that monitors how a model processes
inputs (hidden state activations), not what the inputs say (text surface).
Adversarial inputs produce anomalous activation patterns that are detectable
even when the text itself looks innocent.
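
To make the idea concrete, here is a minimal sketch of scoring one activation
vector against a benign baseline. Everything in it is an assumption made for
illustration: the function names (`project`, `fit_baseline`, `anomaly_score`),
the toy 3-dimensional "activations", the hand-written basis, and the use of
plain z-scores as a stand-in for the calibrated distributions described in the
codebook spec. It is not the library's API.

```python
import math
from statistics import mean, pstdev


def project(activation, basis):
    """Coordinates of one activation vector along each basis direction."""
    return [sum(a * b for a, b in zip(activation, direction)) for direction in basis]


def fit_baseline(benign_activations, basis):
    """Per-direction (mean, std) over benign calibration inputs."""
    coords = [project(act, basis) for act in benign_activations]
    # `or 1.0` avoids division by zero when a direction never varies.
    return [(mean(c), pstdev(c) or 1.0) for c in zip(*coords)]


def anomaly_score(activation, basis, baseline):
    """Root-mean-square z-score across directions; higher means more unusual."""
    coords = project(activation, basis)
    zs = [(x - mu) / sigma for x, (mu, sigma) in zip(coords, baseline)]
    return math.sqrt(sum(z * z for z in zs) / len(zs))


# Hypothetical toy data: two orthonormal directions in a 3-dimensional space.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
benign = [[0.9, 0.1, 0.3], [1.1, -0.1, 0.2], [1.0, 0.0, 0.4], [1.0, 0.0, 0.1]]
baseline = fit_baseline(benign, basis)

typical_score = anomaly_score([1.0, 0.05, 0.9], basis, baseline)
unusual_score = anomaly_score([3.0, 2.0, 0.3], basis, baseline)
```

The score depends only on where the activation lands relative to the benign
baseline, never on the input text, which is the property this decision relies
on.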

## Consequences

**Positive**:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic: novel attacks still
  produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
  directions are activated and by how much)
- Complementary to existing text-surface defenses; can be layered
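
The interpretability point above can be illustrated with a small sketch that
reports which directions deviate most from the benign baseline and by how
much. The function name `direction_report`, the identity basis, and the
`(mean, std)` baseline values are hypothetical placeholders; a real basis would
come from an SVD of calibration activations.

```python
def direction_report(activation, basis, baseline, top_k=3):
    """Return (direction index, z-score) pairs, largest magnitude first."""
    report = []
    for index, (direction, (mu, sigma)) in enumerate(zip(basis, baseline)):
        coordinate = sum(a * b for a, b in zip(activation, direction))
        report.append((index, (coordinate - mu) / sigma))
    report.sort(key=lambda item: abs(item[1]), reverse=True)
    return report[:top_k]


# Hypothetical basis and per-direction (mean, std) from benign calibration data.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
baseline = [(1.0, 0.5), (0.0, 0.5), (0.2, 0.1)]

report = direction_report([1.5, -2.0, 0.2], basis, baseline, top_k=2)
# report == [(1, -4.0), (0, 1.0)]: direction 1 deviates most, then direction 0
```

A report like this can accompany an alarm so that a reviewer sees which
directions triggered it, rather than a single opaque score.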

**Negative**:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
  with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
  language, technical content)
- No known production system validates this approach; we are the first to
  attempt it
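
One common way to manage the false-positive risk listed above is to calibrate
the alarm threshold on benign inputs. The sketch below assumes a hypothetical
list of anomaly scores from benign calibration data and a target
false-positive rate; the names `threshold_for_fpr` and `false_positive_rate`
are illustrative and do not come from the configuration spec.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Pick a threshold from the benign scores such that at most a
    target_fpr fraction of them exceed it."""
    ordered = sorted(benign_scores)
    # Number of benign inputs allowed to raise an alarm, clamped so the
    # index below always stays inside the list.
    allowed = min(math.floor(target_fpr * len(ordered)), len(ordered) - 1)
    return ordered[len(ordered) - allowed - 1]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign scores strictly above the threshold."""
    return sum(score > threshold for score in benign_scores) / len(benign_scores)


# Hypothetical anomaly scores measured on benign calibration inputs.
benign_scores = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 1.6, 2.4, 3.8]
threshold = threshold_for_fpr(benign_scores, target_fpr=0.1)
observed_fpr = false_positive_rate(benign_scores, threshold)
```

Domain-specific deployments would likely need their own calibration set, since
technical or unusual benign content shifts the score distribution.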

## References

- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)