Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
| status | last_updated |
|---|---|
| draft | 2026-06-13 |
alknet-firewall — Architecture
Current State
Phase 0→1 (Exploration → Architecture): The project has a working PoC demonstrating that behavioral signals from small language models can detect adversarial inputs. The core detection logic (~1,745 lines) works reasonably well but has no tests, carries an oversized codebook, and still needs to be extracted from the research codebase into a properly structured Python package.
This project extracts and productionizes the behavioral signal detection approach from the metaspline research project. A ~125M parameter model (SmolLM2-135M) processes untrusted inputs and produces hidden state activations. SVD-based dimensionality reduction on these activations reveals behavioral patterns — normal inputs cluster in expected regions while adversarial inputs produce anomalous activation signatures. The system raises "behavioral alarms" without needing to know specific attack types.
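The idea in the paragraph above can be illustrated with a minimal sketch. It is not the project's implementation: the activations are synthetic stand-ins rather than SmolLM2-135M output, the names `fit_basis` and `anomaly_score` are hypothetical, and a plain z-score in the reduced space is swapped in for the spline-based distributions the codebook spec describes.

```python
import numpy as np

def fit_basis(normal_acts, k):
    """Fit a rank-k SVD basis and per-component spread on 'normal' activations."""
    mean = normal_acts.mean(axis=0)
    centered = normal_acts - mean
    # Rows of vt are the orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                       # shape (k, hidden_dim)
    proj = centered @ basis.T            # shape (n_samples, k)
    std = proj.std(axis=0) + 1e-8        # avoid division by zero
    return mean, basis, std

def anomaly_score(act, mean, basis, std):
    """RMS z-score of one activation vector in the reduced space."""
    z = ((act - mean) @ basis.T) / std
    return float(np.sqrt(np.mean(z ** 2)))

rng = np.random.default_rng(0)
hidden_dim, k = 64, 8

# Synthetic "normal" activations with anisotropic variance (assumption:
# stands in for hidden states of benign inputs).
scales = np.linspace(3.0, 0.1, hidden_dim)
normal = rng.normal(size=(500, hidden_dim)) * scales

mean, basis, std = fit_basis(normal, k)

# Synthetic outlier: 10 standard deviations along the top direction.
outlier = mean + 10.0 * std[0] * basis[0]

normal_score = max(anomaly_score(x, mean, basis, std) for x in normal)
outlier_score = anomaly_score(outlier, mean, basis, std)
```

In this toy setup the outlier scores well above every synthetic normal sample, which is the property a threshold on the score would rely on. How the real system composes scores and maps them to alarms is defined in firewall.md and codebook.md, not here.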
Architecture Documents
| Document | Status | Description |
|---|---|---|
| overview.md | Draft | Vision, scope, package structure, dependencies |
| firewall.md | Draft | Core firewall API, input screening, alarm protocol |
| codebook.md | Draft | SVD basis, detection parameters, codebook compilation |
| model.md | Draft | Model loading, activation extraction, model-agnostic design |
| configuration.md | Draft | Thresholds, model selection, detection tuning |
| open-questions.md | Active | Unresolved questions tracker with OQ-IDs |
ADR Table
| ADR | Title | Status |
|---|---|---|
| 001 | Python with uv | Accepted |
| 002 | Behavioral Signal Detection (Not Text Classification) | Accepted |
| 003 | Small Model (~125M) as Detector | Accepted |
| 004 | SVD-Based Anomaly Detection | Accepted |
| 005 | Safetensors-Only Model Loading | Accepted |
| 006 | PyTorch as Optional Dependency | Accepted |
| 007 | Runtime Model Download via HuggingFace Hub | Accepted |
| 008 | Three-Level Alarm System | Accepted |
| 009 | Last-Token Activation Extraction | Accepted |
| 010 | Monotonic Spline Distributions | Accepted |
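As an illustration of what ADR 009 (last-token activation extraction) refers to, the sketch below selects the hidden state of the last non-padding token in each sequence. It uses NumPy arrays in place of real model output; the function name `last_token_activations`, the right-padding assumption, and the array shapes are assumptions for this example, not details taken from the ADR.

```python
import numpy as np

def last_token_activations(hidden_states, attention_mask):
    """Pick the hidden state of the last real token in each sequence.

    hidden_states:  (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    Assumes right padding and at least one real token per sequence.
    """
    last_idx = attention_mask.sum(axis=1) - 1        # index of last real token
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]        # (batch, hidden_dim)

# Toy batch: 2 sequences, 4 positions, hidden size 3.
hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1],     # no padding
                 [1, 1, 0, 0]])    # two padding positions
acts = last_token_activations(hidden, mask)
```

With left padding the last real token is always at position `seq_len - 1`, so the indexing step would differ; which convention the project uses is not stated in this index.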
Open Questions
See open-questions.md for the full tracker.
| OQ | Question | Priority | Status |
|---|---|---|---|
| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
| OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
| OQ-03 | Should the firewall support streaming/chunked input screening? | low | open |
| OQ-04 | Should detection thresholds be per-model or globally configurable? | medium | open |
| OQ-05 | How should the firewall integrate with existing guardrail systems (LlamaFirewall, NeMo)? | medium | open |
| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
| OQ-07 | Is a Rust port feasible given current ML framework maturity? | low | open |
Document Lifecycle
| Status | Meaning | Transitions |
|---|---|---|
| draft | Under active development. May change significantly. | → reviewed when open questions are resolved |
| reviewed | Architecture is final. Implementation may begin. Changes require review. | → stable when implementation is complete |
| stable | Locked. Changes require review and may warrant an ADR. | → deprecated when superseded |
| deprecated | Superseded. Kept for reference. | Removed when no longer referenced |
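The lifecycle above can be read as a small state machine. The sketch below encodes it under one assumption the table does not state explicitly: transitions are strictly forward, with no skipping and no moving back. The names `DocStatus`, `ALLOWED`, and `can_transition` are hypothetical.

```python
from enum import Enum

class DocStatus(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    STABLE = "stable"
    DEPRECATED = "deprecated"

# Allowed next statuses, taken from the Transitions column.
# Assumption: only the listed forward steps are valid.
ALLOWED = {
    DocStatus.DRAFT: {DocStatus.REVIEWED},
    DocStatus.REVIEWED: {DocStatus.STABLE},
    DocStatus.STABLE: {DocStatus.DEPRECATED},
    DocStatus.DEPRECATED: set(),   # terminal; the doc is eventually removed
}

def can_transition(current, target):
    """Return True if a document may move from `current` to `target`."""
    return target in ALLOWED[current]
```

A check like this could run in CI against the `status` field of each document's front matter, though whether the project enforces the lifecycle mechanically is not specified.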