feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library that
screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -0,0 +1,208 @@
+---
+status: draft
+last_updated: 2026-06-13
+---
+
+# Overview
+
+## Vision
+
+A pip-installable Python library that screens untrusted inputs for adversarial
+content before they reach a target LLM. The library uses behavioral signals —
+patterns in hidden state activations from a small language model — to detect
+injection attempts, obfuscated payloads, and novel attack types that text-surface
+defenses miss.
+
+This project is open source under the MIT license.
+
+## Why This Exists
+
+LLMs process instructions and data in the same token stream. They cannot
+reliably distinguish trusted system prompts from untrusted user content. This
+architectural weakness enables prompt injection, ranked first (LLM01:2025) in
+the OWASP Top 10 for LLM Applications. Sophisticated attackers bypass the
+best-defended models ~50% of the time with just 10 attempts (International AI
+Safety Report 2026).
+
+Current defenses are **surface-level**: text classifiers (Llama Guard), regex
+filters, perplexity checks, and canary tokens. All examine *what the input
+says*, not *how a model processes it*. Adversarial inputs that look natural to
+text classifiers still produce distinctive activation patterns when a model
+processes them.
+
+Academic research validates this approach:
+- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
+- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
+- **EMNLP 2024**: Safety signals are detectable in early layers
+- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
+  through non-semantic hidden signals
+
+See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
+for the full threat analysis and academic evidence.
+
+## Scope
+
+### In Scope
+
+- **Phase 1**: Core behavioral firewall library
+  - Input screening via small model activation analysis
+  - SVD-based anomaly detection with configurable thresholds
+  - Model-agnostic detector (works with any compatible small model)
+  - SmolLM2-135M as the default detector model
+  - Multi-dimensional behavioral alarms (not just safe/unsafe)
+  - PyTorch inference backend (optional dependency)
+  - Runtime model download and caching via HuggingFace Hub
+  - safetensors-only model loading (security requirement)
+  - Synchronous API for single-input screening
+  - Interpretable detection signals (SVD direction analysis)
+
+- **Phase 2**: Integration and operational hardening
+  - ONNX Runtime inference backend
+  - Async/batch screening API
+  - Integration adapters for LlamaFirewall, NeMo Guardrails
+  - Metrics and observability
+  - Codebook training pipeline (`run_manifold_projection.py` extraction)
+
+- **Phase 3**: Advanced capabilities
+  - Multi-turn attack detection (payload splitting)
+  - Streaming input screening
+  - Custom model fine-tuning for domain-specific detection
+  - Rust port via burn/cubecl (speculative, requires R&D)
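+
+The Phase 1 items on runtime download and safetensors-only loading could be
+combined roughly as shown here. This is a minimal sketch, not the project's
+implementation: `ensure_safetensors`, `fetch_tensors`, and the default
+filename are hypothetical names chosen for illustration, while
+`hf_hub_download` and `safetensors.numpy.load_file` are existing library calls.
+
+```python
+from pathlib import Path
+
+
+def ensure_safetensors(filename: str) -> str:
+    """Reject any tensor file that is not in safetensors format."""
+    if Path(filename).suffix != ".safetensors":
+        raise ValueError(f"refusing to load non-safetensors file: {filename}")
+    return filename
+
+
+def fetch_tensors(repo_id: str, revision: str, filename: str = "model.safetensors"):
+    """Download (or reuse the cached copy of) a pinned file and load it.
+
+    Hypothetical helper. NumPy has no bfloat16 dtype, so this path suits
+    float32/float16 tensors; bf16 weights would need a different loader.
+    """
+    # Imported lazily so the guard above works without these packages.
+    from huggingface_hub import hf_hub_download
+    from safetensors.numpy import load_file
+
+    path = hf_hub_download(
+        repo_id=repo_id,
+        filename=ensure_safetensors(filename),
+        revision=revision,  # pin a commit hash or tag for reproducibility
+    )
+    return load_file(path)  # dict mapping tensor names to np.ndarray
+```
+
+Checking the file suffix before any download keeps pickle-based formats out of
+the loading path entirely, which is the point of the safetensors-only rule.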
+
+### Out of Scope
+
+- Text-surface classification (that's Llama Guard's job)
+- Rule-based content filtering (that's NeMo Guardrails' job)
+- Output-side safety monitoring
+- Target model training or modification
+- Multimodal (image) input screening
+- Agent orchestration or access control
+- Replacement for comprehensive LLM security programs
+
+## Architecture
+
+```
+                        ┌──────────────────────────────────────────┐
+                        │  alknet-firewall (Python library)          │
+                        │                                            │
+  Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐   │
+  (text)                │  │  screen(input) → Alarm              │   │
+                        │  │  ├─ Tokenize input                   │   │
+                        │  │  ├─ Run detector model              │   │
+                        │  │  ├─ Extract hidden state activations│   │
+                        │  │  ├─ Project onto SVD basis           │   │
+                        │  │  ├─ Compare against codebook         │   │
+                        │  │  └─ Return behavioral alarm          │   │
+                        │  └────────────────────────────────────┘   │
+                        │                                            │
+                        │  ┌─ Model Manager ────────────────────┐   │
+                        │  │  Load model (HF Hub download/cache) │   │
+                        │  │  Extract activations at key layers   │   │
+                        │  │  Model-agnostic interface            │   │
+                        │  └────────────────────────────────────┘   │
+                        │                                            │
+                        │  ┌─ Codebook ──────────────────────────┐   │
+                        │  │  SVD basis vectors (compiled)        │   │
+                        │  │  Detection thresholds per dimension  │   │
+                        │  │  Behavioral region boundaries        │   │
+                        │  │  Spline distributions for scoring    │   │
+                        │  └────────────────────────────────────┘   │
+                        │                                            │
+                        │  ┌─ Configuration ─────────────────────┐   │
+                        │  │  Model selection & revision pinning  │   │
+                        │  │  Detection thresholds               │   │
+                        │  │  Alarm severity levels              │   │
+                        │  └────────────────────────────────────┘   │
+                        └──────────────────────────────────────────┘
+                                      │
+                               ┌──────┴──────┐
+                               │             │
+                        HF Hub Cache    Detector Model
+                        (~/.cache/)    (SmolLM2-135M)
+```
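+
+The projection and comparison steps in the diagram can be illustrated with
+NumPy alone. The sketch below rests on several assumptions: random vectors
+stand in for real detector activations, the basis is fit with `np.linalg.svd`
+on benign calibration data, and a per-dimension z-score with a max over
+dimensions is a placeholder for the codebook's spline-based scoring. The
+function names are hypothetical.
+
+```python
+import numpy as np
+
+
+def fit_basis(benign: np.ndarray, k: int):
+    """Fit a k-dimensional SVD basis plus per-dimension statistics."""
+    mean = benign.mean(axis=0)
+    # Rows of vt are the right singular vectors (principal directions).
+    _, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
+    basis = vt[:k]                       # shape (k, hidden)
+    coords = (benign - mean) @ basis.T   # benign coordinates, shape (n, k)
+    return mean, basis, coords.mean(axis=0), coords.std(axis=0) + 1e-8
+
+
+def score(activation, mean, basis, mu, sigma):
+    """Return per-dimension z-scores and their maximum as an anomaly score."""
+    coords = (activation - mean) @ basis.T
+    z = np.abs((coords - mu) / sigma)
+    return z, float(z.max())
+
+
+rng = np.random.default_rng(0)
+benign = rng.normal(size=(500, 64))      # stand-in for benign activations
+mean, basis, mu, sigma = fit_basis(benign, k=8)
+
+typical = rng.normal(size=64)
+shifted = typical + 25.0 * basis[0]      # pushed far along the first direction
+_, s_typical = score(typical, mean, basis, mu, sigma)
+_, s_shifted = score(shifted, mean, basis, mu, sigma)
+```
+
+Because each score is tied to one SVD direction, the per-dimension values in
+`z` indicate which direction triggered the alarm, not only that one fired.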
+
+## Package Dependencies
+
+### Core (Required)
+
+| Package | Version | Purpose | Notes |
+|---------|---------|---------|-------|
+| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
+| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
+| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
+| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
+| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |
+
+### Optional (Extras)
+
+| Package | Extra | Version | Purpose | Notes |
+|---------|-------|---------|---------|-------|
+| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
+| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
+| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
+| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |
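+
+Optional backends are usually guarded at the point of use so that a missing
+extra produces an actionable error instead of a bare `ModuleNotFoundError`. A
+minimal sketch, assuming hypothetical helper names and assuming the published
+distribution is named `alknet-firewall`:
+
+```python
+import importlib.util
+
+
+def backend_available(module: str) -> bool:
+    """Return True if a top-level optional module can be imported."""
+    return importlib.util.find_spec(module) is not None
+
+
+def require_backend(module: str, extra: str) -> None:
+    """Raise a helpful error when an optional extra is not installed."""
+    if not backend_available(module):
+        raise ImportError(
+            f"'{module}' is not installed. "
+            f"Install it with: pip install 'alknet-firewall[{extra}]'"
+        )
+```
+
+A call such as `require_backend("torch", "torch")` would sit at the top of the
+PyTorch inference path, keeping the core import free of heavy dependencies.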
+
+### Development (Not Published)
+
+| Package | Purpose |
+|---------|---------|
+| `ruff` | Linting + formatting (replaces flake8, black, isort) |
+| `pytest` | Testing |
+| `pytest-cov` | Coverage |
+| `mypy` | Type checking |
+| `pre-commit` | Git hooks |
+
+## Exports
+
+This is a Python library. Public API surface:
+
+```python
+from alknet_firewall import Firewall, Alarm, AlarmLevel
+
+# Core screening
+firewall = Firewall()  # loads default model + codebook
+alarm: Alarm = firewall.screen("untrusted input text")
+
+# Alarm properties
+alarm.level          # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
+alarm.score          # float, 0.0-1.0
+alarm.signals        # list[DimensionSignal] — per-dimension behavioral signals
+alarm.dimensions     # SVD dimension analysis
+```
+
+See [firewall.md](firewall.md) for the full API specification.
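+
+To make the shape of the result types concrete, here is one way they could be
+modeled. This is an illustrative sketch only: the fields of `DimensionSignal`,
+the `classify` helper, and the 0.5/0.8 thresholds are assumptions, and the
+authoritative definitions belong to the firewall specification.
+
+```python
+from dataclasses import dataclass, field
+from enum import Enum
+
+
+class AlarmLevel(Enum):
+    CLEAR = "clear"
+    SUSPICIOUS = "suspicious"
+    DANGEROUS = "dangerous"
+
+
+@dataclass(frozen=True)
+class DimensionSignal:
+    dimension: int   # index of the SVD direction (assumed field)
+    score: float     # per-dimension anomaly score in [0.0, 1.0] (assumed field)
+
+
+@dataclass(frozen=True)
+class Alarm:
+    level: AlarmLevel
+    score: float
+    signals: list[DimensionSignal] = field(default_factory=list)
+
+
+def classify(score: float, suspicious: float = 0.5, dangerous: float = 0.8) -> AlarmLevel:
+    """Map a composite score to a level; the thresholds are placeholders."""
+    if not 0.0 <= score <= 1.0:
+        raise ValueError("score must be within [0.0, 1.0]")
+    if score >= dangerous:
+        return AlarmLevel.DANGEROUS
+    if score >= suspicious:
+        return AlarmLevel.SUSPICIOUS
+    return AlarmLevel.CLEAR
+```
+
+Frozen dataclasses keep an alarm immutable once produced, which suits a value
+that callers log or pass to downstream policy code.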
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
+| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
+| [003](decisions/003-small-model-detector.md) | Small model as detector | ~135M params (SmolLM2-135M): <10ms latency, CPU-deployable, early-layer signals |
+| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
+| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files — security product must be secure |
+| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | PyTorch can exceed 2GB installed, too heavy to require; extras pattern is industry standard |
+| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
+| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
+| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; the last token's state attends to the full sequence |
+| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
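+
+ADR 009's last-token extraction can be illustrated on a plain array of hidden
+states. The sketch assumes right-padded batches and an attention mask of ones
+and zeros; `last_token_states` is a hypothetical helper, not the library's API.
+
+```python
+import numpy as np
+
+
+def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
+    """Select the hidden state of the last non-padding token per sequence.
+
+    hidden: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len),
+    1 for real tokens and 0 for padding. Assumes right padding.
+    """
+    last_idx = attention_mask.sum(axis=1) - 1          # shape (batch,)
+    if (last_idx < 0).any():
+        raise ValueError("every sequence needs at least one real token")
+    return hidden[np.arange(hidden.shape[0]), last_idx]
+
+
+hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
+mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
+picked = last_token_states(hidden, mask)   # shape (2, 3)
+```
+
+With causal attention, the final real token is the only position whose state
+has seen every earlier token, so it serves as a one-vector summary per input.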
+
+## Dependencies on Other Projects
+
+- **metaspline**: The core detection logic (codebook, spline distributions,
+  SVD projection, space transforms) is adapted from the metaspline research
+  project. The PoC validated the behavioral signal approach; this project
+  extracts and productionizes ~1,745 lines of the working subset.
+
+- **reverse-proxy**: The architecture documentation structure and SDD process
+  are adapted from the @alkdev/reverse-proxy project. The documentation
+  conventions, ADR format, and open questions tracking are reused directly.
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
+- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)