Files
alknet-firewall/docs/architecture/decisions/002-behavioral-signals.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00

52 lines
1.9 KiB
Markdown

# ADR-002: Behavioral Signal Detection (Not Text Classification)
## Status
Accepted
## Context
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
text-surface approaches — they classify input text as safe or unsafe. This
fundamentally limits their effectiveness:
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
processed by a model
Academic research (2024-2025) demonstrates that adversarial inputs produce
distinctive activation patterns in model internals, regardless of surface form.
## Decision
Build a behavioral signal detection system that monitors how a model processes
inputs (hidden state activations), not what the inputs say (text surface).
Adversarial inputs produce anomalous activation patterns that are detectable
even when the text itself looks innocent.
## Consequences
**Positive**:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic — novel attacks still
produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
directions are activated and by how much)
- Complementary to existing text-surface defenses — can be layered
**Negative**:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
language, technical content)
- No existing production system validates this approach — we are first
## References
- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)