---
status: draft
last_updated: 2026-06-13
---

# Overview

## Vision

A pip-installable Python library that screens untrusted inputs for adversarial content before they reach a target LLM. The library uses behavioral signals (patterns in hidden state activations from a small language model) to detect injection attempts, obfuscated payloads, and novel attack types that text-surface defenses miss.

This project is open source under the MIT license.

## Why This Exists

LLMs process instructions and data in the same token stream. They cannot reliably distinguish trusted system prompts from untrusted user content. This architectural weakness enables prompt injection, the #1 LLM vulnerability per OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50% of the time with just 10 attempts (International AI Safety Report 2026).

Current defenses are **surface-level**: text classifiers (Llama Guard), regex filters, perplexity checks, and canary tokens. All examine *what the input says*, not *how a model processes it*. Adversarial inputs that look natural to text classifiers still produce distinctive activation patterns when a model processes them.

Academic research validates this approach:

- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral signals through non-semantic hidden signals

See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md) for the full threat analysis and academic evidence.

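The activation-space idea described above can be illustrated with a toy sketch. This is not the library's detection logic: the activations are synthetic random vectors standing in for a detector model's hidden states, and the names `fit_basis`, `anomaly_score`, and `hidden_size` are hypothetical. It only shows the general shape of the technique: fit an SVD basis on benign activations, then score a new vector by how far its projection deviates from the benign distribution along each basis direction.

```python
import numpy as np


def fit_basis(benign: np.ndarray, k: int):
    """Fit a rank-k SVD basis and per-direction spread on benign activations."""
    mean = benign.mean(axis=0)
    centered = benign - mean
    # Rows of vt are orthonormal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    std = (centered @ basis.T).std(axis=0)
    return mean, basis, std


def anomaly_score(x: np.ndarray, mean, basis, std) -> float:
    """Largest absolute z-score of x across the k basis directions."""
    z = ((x - mean) @ basis.T) / std
    return float(np.max(np.abs(z)))


rng = np.random.default_rng(0)
hidden_size = 64  # assumption: stands in for the detector's hidden width
benign = rng.normal(size=(500, hidden_size))
benign[:, 0] *= 5.0  # give the benign data one dominant direction of variation

mean, basis, std = fit_basis(benign, k=8)

typical = benign[0]
shifted = benign[0].copy()
shifted[0] += 60.0  # synthetic stand-in for an unusual activation pattern

typical_score = anomaly_score(typical, mean, basis, std)
shifted_score = anomaly_score(shifted, mean, basis, std)
print(f"typical={typical_score:.2f} shifted={shifted_score:.2f}")
```

In this toy setup the shifted vector scores far higher than the typical one, because its projection lands well outside the benign spread along the dominant direction. A real detector would replace the synthetic vectors with activations extracted from the model and the z-score with whatever scoring the codebook defines.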
## Scope

### In Scope

- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)

- **Phase 2**: Integration and operational hardening
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)
  - Streaming/rolling-window input screening (granular detection for documents)

- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Custom model fine-tuning for domain-specific detection
  - Alternative inference backends (burn/cublas via safetensors)

### Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

## Architecture

```
                        ┌──────────────────────────────────────────┐
                        │  alknet-firewall (Python library)        │
                        │                                          │
Untrusted Input ────►   │  ┌─ Firewall API ─────────────────────┐  │
    (text)              │  │ screen(input) → Alarm              │  │
                        │  │ ├─ Tokenize input                  │  │
                        │  │ ├─ Run detector model              │  │
                        │  │ ├─ Extract hidden state activations│  │
                        │  │ ├─ Project onto SVD basis          │  │
                        │  │ ├─ Compare against codebook        │  │
                        │  │ └─ Return behavioral alarm         │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Model Manager ────────────────────┐  │
                        │  │ Load model (HF Hub download/cache) │  │
                        │  │ Extract activations at key layers  │  │
                        │  │ Model-agnostic interface           │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Codebook ─────────────────────────┐  │
                        │  │ SVD basis vectors (compiled)       │  │
                        │  │ Detection thresholds per dimension │  │
                        │  │ Behavioral region boundaries       │  │
                        │  │ Spline distributions for scoring   │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Configuration ────────────────────┐  │
                        │  │ Model selection & revision pinning │  │
                        │  │ Detection thresholds               │  │
                        │  │ Alarm severity levels              │  │
                        │  └────────────────────────────────────┘  │
                        └──────────────────────────────────────────┘
                                             │
                                      ┌──────┴──────┐
                                      │             │
                                HF Hub Cache  Detector Model
                                (~/.cache/)   (SmolLM2-135M)
```

## Package Dependencies

### Core (Required)

| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |

### Optional (Extras)

| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |

### Development (Not Published)

| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |

## Exports

This is a Python library.
Public API surface:

```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level       # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score       # float, 0.0-1.0
alarm.signals     # list[DimensionSignal]: per-dimension behavioral signals
alarm.dimensions  # SVD dimension analysis
```

See [firewall.md](firewall.md) for the full API specification.

## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~125M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files; security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
| [012](decisions/012-rolling-window-screening.md) | Rolling token window screening | Phase 2 `screen_document()` with 25% overlap, max pooling |

## Dependencies on Other Projects

- **metaspline**: The core detection logic (codebook, spline distributions, SVD projection, space transforms) is adapted from the metaspline research project. The PoC validated the behavioral signal approach; this project extracts and productionizes ~1,745 lines of the working subset.
- **reverse-proxy**: The architecture documentation structure and SDD process are adapted from the @alkdev/reverse-proxy project. The documentation conventions, ADR format, and open questions tracking are reused directly.

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document:

- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved: removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (resolved: ADR-011, standalone API + thin adapters in Phase 2)
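
As an illustration of the rolling-window screening summarized under ADR-012 (25% overlap, max pooling), the following is a minimal sketch rather than the planned implementation. The window size, the `score_window` callback, and the names `window_spans` and `screen_document_sketch` are assumptions made for the example; only the overlap fraction and the max-pooling step come from the ADR summary.

```python
from typing import Callable, List, Sequence, Tuple


def window_spans(n_tokens: int, window: int, overlap: float = 0.25) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that cover n_tokens with fractional overlap."""
    if window <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be positive and overlap must be in [0, 1)")
    # Each window starts one stride after the previous one.
    stride = max(1, int(window * (1.0 - overlap)))
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def screen_document_sketch(
    tokens: Sequence[int],
    score_window: Callable[[Sequence[int]], float],  # hypothetical per-window scorer
    window: int = 128,  # assumption: window size is not given in the source
    overlap: float = 0.25,
) -> float:
    """Score every window and max-pool, so one bad window flags the document."""
    if not tokens:
        return 0.0
    spans = window_spans(len(tokens), window, overlap)
    return max(score_window(tokens[s:e]) for s, e in spans)


# Toy usage: a scorer that only reacts to token id 8.
tokens = list(range(10))
doc_score = screen_document_sketch(
    tokens, lambda w: 1.0 if 8 in w else 0.1, window=4
)
print(window_spans(len(tokens), 4), doc_score)
```

Max pooling means a single high-scoring window determines the document score, which suits payloads buried in otherwise benign text; the overlap reduces the chance that a payload is split across a window boundary.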