alknet-firewall/docs/architecture/decisions/008-three-level-alarm.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00

# ADR-008: Three-Level Alarm System
## Status
Accepted
## Context
The firewall needs to communicate detection results to downstream systems. The
design choice is how many alarm levels and what they mean.
Alternatives:
- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
don't warrant blocking but should be flagged. Binary forces a single
threshold that either blocks too much (high false-positive rate) or too
little (high false-negative rate).
- **Numeric-only (0.0–1.0 score)**: Maximum information but requires every
consumer to choose their own threshold. No shared vocabulary for what's
actionable.
- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
pre-inference screening system. The difference between "low" and "medium"
is too subtle for consumers to act on differently.
- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
review. Most practical for automated systems.
## Decision
Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
continuous score (0.0–1.0) for consumers that need fine-grained decisions.
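
A minimal sketch of how a continuous score could map onto the three levels. Only the level names `CLEAR`, `SUSPICIOUS`, and `DANGEROUS` and the 0.0–1.0 score range come from this ADR. The `AlarmResult` type, the `classify` function, and both threshold values are hypothetical and chosen purely for illustration; the actual API and calibrated thresholds are specified elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


# Hypothetical thresholds for illustration only; real values would come
# from calibration and configuration, not from this ADR.
SUSPICIOUS_THRESHOLD = 0.4
DANGEROUS_THRESHOLD = 0.8


@dataclass(frozen=True)
class AlarmResult:
    """Carries both the discrete level and the continuous score."""
    level: AlarmLevel
    score: float


def classify(score: float) -> AlarmResult:
    """Map a score in [0.0, 1.0] to one of the three alarm levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0.0, 1.0], got {score}")
    if score >= DANGEROUS_THRESHOLD:
        level = AlarmLevel.DANGEROUS
    elif score >= SUSPICIOUS_THRESHOLD:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return AlarmResult(level=level, score=score)
```

Returning the score alongside the level lets consumers that are satisfied with the shared vocabulary read `level`, while those that need finer control can apply their own cut-offs to `score`.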
## Consequences
**Positive**:
- Clear action mapping: pass, flag, block
- Suspicious level enables defense-in-depth (apply additional checks rather
than binary block/allow)
- Continuous score provides gradient for consumers that need it
- Simple to document and communicate
**Negative**:
- Some consumers may need more granularity (but can use the score field)
- "Suspicious" requires consumers to decide what to do, which adds a decision burden
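
As an illustration of the pass/flag/block mapping and of the decision a consumer must make for the suspicious level, a consumer-side policy might look like the sketch below. The `choose_action` function and its `strict` flag are hypothetical and not part of the firewall API; the enum is redefined here only so the example is self-contained.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def choose_action(level: AlarmLevel, strict: bool = False) -> str:
    """Return "pass", "flag", or "block" for a given alarm level.

    Hypothetical consumer policy: a strict deployment treats
    SUSPICIOUS like DANGEROUS; a lenient one flags it for
    additional checks instead of blocking.
    """
    if level is AlarmLevel.DANGEROUS:
        return "block"
    if level is AlarmLevel.SUSPICIOUS:
        return "block" if strict else "flag"
    return "pass"
```

The `strict` switch is one way a consumer could resolve the decision burden once, in configuration, rather than at every call site.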
## References
- [firewall.md](../firewall.md)