feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

208
docs/architecture/overview.md
Normal file
@@ -0,0 +1,208 @@
---
status: draft
last_updated: 2026-06-13
---

# Overview

## Vision

A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals
(patterns in hidden state activations from a small language model) to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.

This project is open source under the MIT license.

## Why This Exists

LLMs process instructions and data in the same token stream. They cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection, the #1 LLM vulnerability per
OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50%
of the time with just 10 attempts (International AI Safety Report 2026).

Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. Adversarial inputs that look natural to
text classifiers still produce distinctive activation patterns when a model
processes them.

Academic research validates this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
  through non-semantic hidden signals

See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.

## Scope

### In Scope

- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)

- **Phase 2**: Integration and operational hardening
  - ONNX Runtime inference backend
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)

- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Streaming input screening
  - Custom model fine-tuning for domain-specific detection
  - Rust port via burn/cubecl (speculative, requires R&D)

### Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

## Architecture

```
                      ┌──────────────────────────────────────────┐
                      │  alknet-firewall (Python library)        │
                      │                                          │
Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐  │
     (text)           │  │ screen(input) → Alarm              │  │
                      │  │ ├─ Tokenize input                  │  │
                      │  │ ├─ Run detector model              │  │
                      │  │ ├─ Extract hidden state activations│  │
                      │  │ ├─ Project onto SVD basis          │  │
                      │  │ ├─ Compare against codebook        │  │
                      │  │ └─ Return behavioral alarm         │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Model Manager ────────────────────┐  │
                      │  │ Load model (HF Hub download/cache) │  │
                      │  │ Extract activations at key layers  │  │
                      │  │ Model-agnostic interface           │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Codebook ─────────────────────────┐  │
                      │  │ SVD basis vectors (compiled)       │  │
                      │  │ Detection thresholds per dimension │  │
                      │  │ Behavioral region boundaries       │  │
                      │  │ Spline distributions for scoring   │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Configuration ────────────────────┐  │
                      │  │ Model selection & revision pinning │  │
                      │  │ Detection thresholds               │  │
                      │  │ Alarm severity levels              │  │
                      │  └────────────────────────────────────┘  │
                      └──────────────────────────────────────────┘
                                            │
                                     ┌──────┴──────┐
                                     │             │
                              HF Hub Cache   Detector Model
                               (~/.cache/)   (SmolLM2-135M)
```

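The projection and comparison steps in the diagram can be illustrated with a small NumPy sketch. This is not the library's implementation: the function name `grade_activation`, the per-dimension z-score comparison, and the threshold values `3.0` and `5.0` are assumptions made for illustration. The actual codebook uses spline distributions (see ADR 010), which this sketch replaces with a plain mean/standard-deviation reference.

```python
import numpy as np


def grade_activation(activation, basis, center, ref_mean, ref_std,
                     suspicious_z=3.0, dangerous_z=5.0):
    """Project one activation vector onto an SVD basis and grade its deviation.

    activation: (d,) hidden state vector from the detector model
    basis:      (k, d) matrix whose rows are SVD basis vectors
    center:     (d,) mean activation removed before projection
    ref_mean, ref_std: (k,) statistics of benign inputs per SVD dimension
    The z-score thresholds are hypothetical values, not calibrated ones.
    """
    # Coordinates of the input in the k-dimensional SVD subspace.
    coords = basis @ (activation - center)
    # Per-dimension deviation from the benign reference, in standard deviations.
    z = np.abs(coords - ref_mean) / ref_std
    worst = float(z.max())
    if worst >= dangerous_z:
        level = "DANGEROUS"
    elif worst >= suspicious_z:
        level = "SUSPICIOUS"
    else:
        level = "CLEAR"
    return level, z
```

Returning the full `z` vector alongside the level keeps the per-dimension signal available, which is what makes the result interpretable rather than a single opaque score.
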
## Package Dependencies

### Core (Required)

| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |

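As a rough illustration of how `huggingface-hub` and the safetensors-only rule could fit together, the sketch below downloads a pinned revision while allowing only safetensors and JSON files, then re-checks the snapshot directory. The helper names `is_pickle_based` and `fetch_detector`, the pattern list, and the suffix list are assumptions for illustration, not the project's API. `snapshot_download` and its `revision` and `allow_patterns` parameters are part of `huggingface_hub`.

```python
from pathlib import Path

# Hypothetical allow-list: weights as safetensors, plus config/tokenizer JSON.
ALLOWED_PATTERNS = ["*.safetensors", "*.json"]
# Common pickle-based weight formats that the safetensors-only rule excludes.
PICKLE_SUFFIXES = (".bin", ".pt", ".pth", ".pkl", ".ckpt")


def is_pickle_based(filename: str) -> bool:
    """Return True if the file name looks like a pickle-based weight file."""
    return filename.lower().endswith(PICKLE_SUFFIXES)


def fetch_detector(repo_id: str, revision: str) -> str:
    """Download a pinned model revision, refusing pickle-based files."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id=repo_id,
        revision=revision,
        allow_patterns=ALLOWED_PATTERNS,
    )
    # Defensive re-check: the cache may hold files from earlier downloads.
    for path in Path(local_dir).rglob("*"):
        if path.is_file() and is_pickle_based(path.name):
            raise RuntimeError(f"pickle-based file found in snapshot: {path.name}")
    return local_dir
```

Pinning `revision` to a commit hash rather than a branch name means a later change to the upstream repository cannot silently swap the detector weights.
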
### Optional (Extras)

| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |

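Optional extras are usually paired with a lazy import guard so that a missing backend produces an actionable error instead of a bare `ModuleNotFoundError`. The sketch below shows one common way to do this. The helper name `require_extra` and the install command in the message (which assumes the distribution is published as `alknet-firewall`) are illustrative assumptions, not part of the specification.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import an optional backend, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The package name in the hint is an assumption for illustration.
        raise ImportError(
            f"'{module_name}' is required for this backend. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        ) from exc
```

A backend module would call `require_extra("torch", "torch")` at the point of first use rather than at import time, so importing the library itself never requires PyTorch.
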
### Development (Not Published)

| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |

## Exports

This is a Python library. Public API surface:

```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level       # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score       # float, 0.0-1.0
alarm.signals     # list[DimensionSignal]: per-dimension behavioral signals
alarm.dimensions  # SVD dimension analysis
```

See [firewall.md](firewall.md) for the full API specification.

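A caller would typically branch on `alarm.level` before forwarding input to the target LLM. The sketch below shows that pattern using stand-in definitions, because the library does not exist yet: the `AlarmLevel` and `Alarm` classes here mirror only the names shown above, and the `decide` function and its `"forward"`, `"review"`, and `"block"` outcomes are hypothetical. The real types are specified in firewall.md and may differ.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    """Stand-in for the library's three-level alarm (see ADR 008)."""
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class Alarm:
    """Stand-in carrying only the two fields this example needs."""
    level: AlarmLevel
    score: float


def decide(alarm: Alarm) -> str:
    """Map an alarm to a hypothetical application-level action."""
    if alarm.level is AlarmLevel.DANGEROUS:
        return "block"      # never forward to the target LLM
    if alarm.level is AlarmLevel.SUSPICIOUS:
        return "review"     # e.g. log it and apply stricter handling
    return "forward"        # CLEAR: pass the input through unchanged
```

Keeping the policy in application code, separate from detection, lets each deployment choose how strictly to treat the middle level.
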
## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~125M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files; security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |

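To make ADR 009 concrete, the sketch below picks the hidden state of the last real token for each sequence in a batch. In an autoregressive model that position has attended to the whole input. The function name `last_token_activations` and the assumption of right-padded batches with a 0/1 attention mask are illustrative. The real extraction code lives behind the model interface and may differ.

```python
import numpy as np


def last_token_activations(hidden_states: np.ndarray,
                           attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state at the last non-padding token of each sequence.

    hidden_states:  (batch, seq_len, hidden_dim) activations from one layer
    attention_mask: (batch, seq_len) integers, 1 for real tokens, 0 for padding
    Assumes right padding; an all-padding row would index position -1.
    """
    # Index of the last real token: number of real tokens minus one.
    last_idx = attention_mask.sum(axis=1) - 1
    batch_idx = np.arange(hidden_states.shape[0])
    # Integer-array indexing picks one (hidden_dim,) vector per sequence.
    return hidden_states[batch_idx, last_idx]
```

With left-padded batches the last real token is always at position `-1`, so the mask arithmetic would be unnecessary. Which convention applies depends on the tokenizer configuration.
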
## Dependencies on Other Projects

- **metaspline**: The core detection logic (codebook, spline distributions,
  SVD projection, space transforms) is adapted from the metaspline research
  project. The PoC validated the behavioral signal approach; this project
  extracts and productionizes ~1,745 lines of the working subset.

- **reverse-proxy**: The architecture documentation structure and SDD process
  are adapted from the @alkdev/reverse-proxy project. The documentation
  conventions, ADR format, and open questions tracking are reused directly.

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)