Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06: - OQ-01: Remove ONNX Runtime from scope entirely — doesn't support activation extraction natively (optimum #972 closed as not planned), bloated model exports; burn/cublas via safetensors is a better future path - OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py - OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds <100-line adapter packages for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails - OQ-06: TOML for file-based config — standard modern Python, two-way door Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.
11 KiB
status, last_updated
| status | last_updated |
|---|---|
| draft | 2026-06-13 |
Overview
Vision
A pip-installable Python library that screens untrusted inputs for adversarial content before they reach a target LLM. The library uses behavioral signals — patterns in hidden state activations from a small language model — to detect injection attempts, obfuscated payloads, and novel attack types that text-surface defenses miss.
This project is open source under the MIT license.
Why This Exists
LLMs process instructions and data in the same token stream. They cannot reliably distinguish trusted system prompts from untrusted user content. This architectural weakness enables prompt injection — the #1 LLM vulnerability per OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50% of the time with just 10 attempts (International AI Safety Report 2026).
Current defenses are surface-level: text classifiers (Llama Guard), regex filters, perplexity checks, and canary tokens. All examine what the input says, not how a model processes it. Adversarial inputs that look natural to text classifiers still produce distinctive activation patterns when a model processes them.
Academic research validates this approach:
- HiddenDetect (ACL 2025): Activation-based detection outperforms SOTA
- Hidden Dimensions (ICML 2025): Safety is multi-dimensional in activation space
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Models transmit behavioral signals through non-semantic hidden signals
See llm-input-safety-landscape.md for the full threat analysis and academic evidence.
Scope
In Scope
-
Phase 1: Core behavioral firewall library
- Input screening via small model activation analysis
- SVD-based anomaly detection with configurable thresholds
- Model-agnostic detector (works with any compatible small model)
- SmolLM2-135M as the default detector model
- Multi-dimensional behavioral alarms (not just safe/unsafe)
- PyTorch inference backend (optional dependency)
- Runtime model download and caching via HuggingFace Hub
- safetensors-only model loading (security requirement)
- Synchronous API for single-input screening
- Interpretable detection signals (SVD direction analysis)
-
Phase 2: Integration and operational hardening
- Async/batch screening API
- Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
- Metrics and observability
- Codebook training pipeline (
run_manifold_projection.pyextraction) - Streaming/rolling-window input screening (granular detection for documents)
-
Phase 3: Advanced capabilities
- Multi-turn attack detection (payload splitting)
- Custom model fine-tuning for domain-specific detection
- Alternative inference backends (burn/cublas via safetensors)
Out of Scope
- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs
Architecture
┌──────────────────────────────────────────┐
│ alknet-firewall (Python library) │
│ │
Untrusted Input ────► │ ┌─ Firewall API ─────────────────────┐ │
(text) │ │ screen(input) → Alarm │ │
│ │ ├─ Tokenize input │ │
│ │ ├─ Run detector model │ │
│ │ ├─ Extract hidden state activations│ │
│ │ ├─ Project onto SVD basis │ │
│ │ ├─ Compare against codebook │ │
│ │ └─ Return behavioral alarm │ │
│ └────────────────────────────────────┘ │
│ │
│ ┌─ Model Manager ────────────────────┐ │
│ │ Load model (HF Hub download/cache) │ │
│ │ Extract activations at key layers │ │
│ │ Model-agnostic interface │ │
│ └────────────────────────────────────┘ │
│ │
│ ┌─ Codebook ──────────────────────────┐ │
│ │ SVD basis vectors (compiled) │ │
│ │ Detection thresholds per dimension │ │
│ │ Behavioral region boundaries │ │
│ │ Spline distributions for scoring │ │
│ └────────────────────────────────────┘ │
│ │
│ ┌─ Configuration ─────────────────────┐ │
│ │ Model selection & revision pinning │ │
│ │ Detection thresholds │ │
│ │ Alarm severity levels │ │
│ └────────────────────────────────────┘ │
└──────────────────────────────────────────┘
│
┌──────┴──────┐
│ │
HF Hub Cache Detector Model
(~/.cache/) (SmolLM2-135M)
Package Dependencies
Core (Required)
| Package | Version | Purpose | Notes |
|---|---|---|---|
huggingface-hub |
>=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
safetensors |
>=0.4.3 | Safe model weight loading | No arbitrary code execution |
tokenizers |
>=0.20 | Text tokenization | Fast Rust-based tokenizer |
numpy |
>=1.24 | Tensor operations | Core numerical dependency |
scikit-learn |
>=1.3 | SVD computations | TruncatedSVD for basis projection |
Optional (Extras)
| Package | Extra | Version | Purpose | Notes |
|---|---|---|---|---|
torch |
[torch] |
>=2.2 | Model inference | 200MB-2.5GB; optional dependency |
transformers |
[torch] |
>=4.40 | Model loading pipeline | Required with torch extra |
Development (Not Published)
| Package | Purpose |
|---|---|
ruff |
Linting + formatting (replaces flake8, black, isort) |
pytest |
Testing |
pytest-cov |
Coverage |
mypy |
Type checking |
pre-commit |
Git hooks |
Exports
This is a Python library. Public API surface:
from alknet_firewall import Firewall, Alarm, AlarmLevel
# Core screening
firewall = Firewall() # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")
# Alarm properties
alarm.level # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score # float, 0.0-1.0
alarm.signals # list[DimensionSignal] — per-dimension behavioral signals
alarm.dimensions # SVD dimension analysis
See firewall.md for the full API specification.
Design Decisions
All design decisions are documented as ADRs in decisions/.
| ADR | Decision | Summary |
|---|---|---|
| 001 | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| 002 | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| 003 | Small model as detector | ~125M params: <10ms latency, CPU-deployable, early-layer signals |
| 004 | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| 005 | Safetensors-only loading | No pickle-based model files — security product must be secure |
| 006 | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| 007 | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| 008 | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| 009 | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| 010 | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| 011 | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
Dependencies on Other Projects
-
metaspline: The core detection logic (codebook, spline distributions, SVD projection, space transforms) is adapted from the metaspline research project. The PoC validated the behavioral signal approach; this project extracts and productionizes ~1,745 lines of the working subset.
-
reverse-proxy: The architecture documentation structure and SDD process are adapted from the @alkdev/reverse-proxy project. The documentation conventions, ADR format, and open questions tracking are reused directly.
Open Questions
Open questions are tracked in open-questions.md. Key questions affecting this document:
- OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved — removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path)
- OQ-05: How should the firewall integrate with existing guardrail systems? (resolved — ADR-011: standalone API + thin adapters in Phase 2)