
glm-5.1 cf464c2296 feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

2026-06-13 05:17:40 +00:00

ADR-008: Three-Level Alarm System

Status

Accepted

Context

The firewall needs to communicate detection results to downstream systems. The design choice is how many alarm levels to expose and what each level means.

Alternatives:

Binary (safe/unsafe): Simple but loses nuance. Many suspicious inputs don't warrant blocking but should still be flagged. Binary forces a single threshold that either blocks too much (many false positives) or too little (many false negatives).
Numeric-only (0.0–1.0 score): Maximum information but requires every consumer to choose their own threshold. No shared vocabulary for what's actionable.
Five-tier (safe/low/medium/high/critical): Over-engineered for a pre-inference screening system. The difference between "low" and "medium" is too subtle for consumers to act on differently.
Three-tier (clear/suspicious/dangerous): Balances simplicity with nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional review. Most practical for automated systems.

Decision

Use three alarm levels: CLEAR, SUSPICIOUS, DANGEROUS. Include a continuous score (0.0–1.0) for consumers that need fine-grained decisions.
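As a minimal sketch of how this decision could look in code, the example below pairs a three-valued level with the continuous score. The names `AlarmLevel`, `ScreeningResult`, and `classify`, and both threshold values, are assumptions made for illustration; the ADR does not specify them, and the actual thresholds belong to the configuration spec.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class ScreeningResult:
    """Hypothetical result type: a discrete level plus the raw score."""
    level: AlarmLevel
    score: float  # continuous score in [0.0, 1.0]


# Hypothetical cut-offs, chosen only for this example.
SUSPICIOUS_THRESHOLD = 0.5
DANGEROUS_THRESHOLD = 0.8


def classify(score: float) -> ScreeningResult:
    """Map a continuous score onto one of the three alarm levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    if score >= DANGEROUS_THRESHOLD:
        level = AlarmLevel.DANGEROUS
    elif score >= SUSPICIOUS_THRESHOLD:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return ScreeningResult(level=level, score=score)
```

Keeping the score on the result object means a consumer that finds three levels too coarse can still apply its own cut-off without a second API.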

Consequences

Positive:

Clear action mapping: pass, flag, block
Suspicious level enables defense-in-depth (apply additional checks rather than binary block/allow)
Continuous score provides gradient for consumers that need it
Simple to document and communicate

Negative:

Some consumers may need more granularity, though they can fall back on the continuous score
"Suspicious" leaves the follow-up action to each consumer, which adds decision burden
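The consequences above can be sketched from the consumer's side. In this hypothetical example, `choose_action` and its `strict_cutoff` parameter are illustrative names, not part of the documented API: the default policy follows the pass/flag/block mapping, and a stricter consumer uses the score to resolve the suspicious case on its own terms.

```python
from enum import Enum
from typing import Optional


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def choose_action(
    level: AlarmLevel,
    score: float,
    strict_cutoff: Optional[float] = None,
) -> str:
    """Return "pass", "flag", or "block" for a screening result.

    strict_cutoff is a hypothetical consumer-side setting: when given,
    suspicious inputs scoring at or above it are blocked, not flagged.
    """
    if level is AlarmLevel.DANGEROUS:
        return "block"
    if level is AlarmLevel.CLEAR:
        return "pass"
    # SUSPICIOUS: the consumer decides what "flag" means for them.
    if strict_cutoff is not None and score >= strict_cutoff:
        return "block"
    return "flag"
```

A consumer with no extra policy gets the plain three-way mapping; one that needs finer control passes its own cut-off and never has to reinterpret the levels themselves.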

References

firewall.md
