feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
47
docs/architecture/decisions/008-three-level-alarm.md
Normal file
@@ -0,0 +1,47 @@
# ADR-008: Three-Level Alarm System

## Status

Accepted

## Context

The firewall needs to communicate detection results to downstream systems. The
design choice is how many alarm levels to expose and what each one means.

Alternatives:
- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
  don't warrant blocking but should be flagged. Binary forces a single
  threshold that either blocks too much (high false positive) or too little
  (high false negative).
- **Numeric-only (0.0–1.0 score)**: Maximum information but requires every
  consumer to choose their own threshold. No shared vocabulary for what's
  actionable.
- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
  pre-inference screening system. The difference between "low" and "medium"
  is too subtle for consumers to act on differently.
- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
  nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
  review. Most practical for automated systems.

## Decision

Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
continuous score (0.0–1.0) for consumers that need fine-grained decisions.
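
A minimal sketch of how the three levels and the continuous score could sit together in one result object. The names `AlarmLevel`, `Verdict`, and `classify`, and the cut-offs 0.3 and 0.7, are illustrative assumptions; this ADR does not fix the API shape or the threshold values.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    """The three levels named in this ADR."""
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Verdict:
    """Hypothetical result type: a discrete level plus the raw score."""
    level: AlarmLevel
    score: float  # continuous score in [0.0, 1.0]


def classify(score: float,
             suspicious_at: float = 0.3,   # assumed threshold, not from the ADR
             dangerous_at: float = 0.7) -> Verdict:  # assumed threshold
    """Map a score to a level using two thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score!r}")
    if suspicious_at > dangerous_at:
        raise ValueError("suspicious_at must not exceed dangerous_at")
    if score >= dangerous_at:
        level = AlarmLevel.DANGEROUS
    elif score >= suspicious_at:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return Verdict(level=level, score=score)
```

Keeping the score on the result lets a consumer ignore the level and apply its own cut-off when the shared vocabulary is too coarse.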

## Consequences

**Positive**:
- Clear action mapping: pass, flag, block
- Suspicious level enables defense-in-depth (apply additional checks rather
  than binary block/allow)
- Continuous score provides gradient for consumers that need it
- Simple to document and communicate

**Negative**:
- Some consumers may need more granularity (but can use the score field)
- "Suspicious" requires consumers to decide what to do, which adds decision burden
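
The pass/flag/block mapping above could look like this on the consumer side. This is one possible policy, not something the ADR prescribes: the function `decide`, the `extra_check` callback, and the choice to block when the extra check fails are all hypothetical.

```python
from enum import Enum
from typing import Callable


class AlarmLevel(Enum):
    """The three levels named in this ADR (redeclared to keep the sketch self-contained)."""
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def decide(level: AlarmLevel, text: str,
           extra_check: Callable[[str], bool]) -> str:
    """Return "pass" or "block" for an input, given its alarm level.

    extra_check is a consumer-supplied secondary screen (hypothetical);
    it returns True when the input is considered acceptable.
    """
    if level is AlarmLevel.CLEAR:
        return "pass"
    if level is AlarmLevel.DANGEROUS:
        return "block"
    # SUSPICIOUS: run the additional check instead of a binary block/allow.
    return "pass" if extra_check(text) else "block"
```

A consumer with no secondary screen could pass `lambda text: True` to treat suspicious inputs as flagged-but-allowed, or `lambda text: False` to treat them as blocked.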

## References

- [firewall.md](../firewall.md)