feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection library that screens untrusted LLM inputs using small model activations. Architecture docs (5 specs, 10 ADRs, 7 open questions): - overview: vision, scope, dependencies, package structure - firewall: core API, alarm protocol, score composition, error handling - codebook: SVD basis, spline distributions, calibration, tensor format - model: activation extraction, model-agnostic interface, lazy loading - configuration: thresholds, model selection, detection tuning Research reports: - modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI - python-ml-packaging: optional PyTorch, HF Hub download, safetensors - llm-input-safety-landscape: threat taxonomy, defenses, academic evidence Agent role adaptations for Python project (replaced Rust conventions).
This commit is contained in:
52
docs/architecture/decisions/002-behavioral-signals.md
Normal file
52
docs/architecture/decisions/002-behavioral-signals.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# ADR-002: Behavioral Signal Detection (Not Text Classification)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
|
||||
text-surface approaches — they classify input text as safe or unsafe. This
|
||||
fundamentally limits their effectiveness:
|
||||
|
||||
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
|
||||
and pattern matching
|
||||
- Novel attack types require retraining classifiers
|
||||
- Text that looks natural to a classifier can still be adversarial when
|
||||
processed by a model
|
||||
|
||||
Academic research (2024-2025) demonstrates that adversarial inputs produce
|
||||
distinctive activation patterns in model internals, regardless of surface form.
|
||||
|
||||
## Decision
|
||||
|
||||
Build a behavioral signal detection system that monitors how a model processes
|
||||
inputs (hidden state activations), not what the inputs say (text surface).
|
||||
Adversarial inputs produce anomalous activation patterns that are detectable
|
||||
even when the text itself looks innocent.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
|
||||
- Anomalous behavior patterns are attack-type agnostic — novel attacks still
|
||||
produce anomalous patterns
|
||||
- Multi-dimensional signals provide interpretable detection (which SVD
|
||||
directions are activated and by how much)
|
||||
- Complementary to existing text-surface defenses — can be layered
|
||||
|
||||
**Negative**:
|
||||
- Requires running a model on every input (adds latency and compute cost)
|
||||
- Detection depends on the detector model sharing architectural similarity
|
||||
with likely attack targets
|
||||
- False positives possible for unusual but benign inputs (domain-specific
|
||||
language, technical content)
|
||||
- No existing production system validates this approach — we are first
|
||||
|
||||
## References
|
||||
|
||||
- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
|
||||
- HiddenDetect (ACL 2025)
|
||||
- Hidden Dimensions of LLM Alignment (ICML 2025)
|
||||
- How Alignment and Jailbreak Work (EMNLP 2024)
|
||||
Reference in New Issue
Block a user