Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely. It doesn't support activation extraction natively (optimum #972 closed as not planned) and its model exports are bloated; burn/cublas via safetensors is a better future path.
- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py.
- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds <100-line adapter packages for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails.
- OQ-06: TOML for file-based config (standard modern Python, two-way door).

Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.

---
status: draft
last_updated: 2026-06-13
---

# Overview

## Vision

A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals
(patterns in hidden state activations from a small language model) to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.

This project is open source under the MIT license.

## Why This Exists

LLMs process instructions and data in the same token stream. They cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection, the #1 LLM vulnerability per
OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50%
of the time with just 10 attempts (International AI Safety Report 2026).

Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. Adversarial inputs that look natural to
text classifiers still produce distinctive activation patterns when a model
processes them.

Academic research supports this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals are detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
  through non-semantic hidden signals

See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.

## Scope

### In Scope

- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)

- **Phase 2**: Integration and operational hardening
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)
  - Streaming/rolling-window input screening (granular detection for documents)

- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Custom model fine-tuning for domain-specific detection
  - Alternative inference backends (burn/cublas via safetensors)

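The Phase 2 rolling-window item can be pictured with a small sketch. This is only an illustration of the general idea (split a long token sequence into overlapping windows and screen each one), not the project's design; the function name `rolling_windows` and its `size` and `stride` parameters are hypothetical.

```python
from typing import Iterator, Sequence


def rolling_windows(
    tokens: Sequence[int], size: int, stride: int
) -> Iterator[tuple[int, Sequence[int]]]:
    """Yield (start_offset, window) pairs that cover every token at least once."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if stride > size:
        raise ValueError("stride larger than size would leave gaps")
    if len(tokens) <= size:
        # Short inputs fit in a single window.
        yield 0, tokens
        return
    start = 0
    while start + size < len(tokens):
        yield start, tokens[start:start + size]
        start += stride
    # Anchor the final window to the end so the tail is never dropped.
    yield len(tokens) - size, tokens[len(tokens) - size:]


# Hypothetical usage: each window would be screened independently, and the
# offset identifies which part of the document raised an alarm.
for offset, window in rolling_windows(list(range(10)), size=4, stride=3):
    print(offset, list(window))
```

Overlapping windows trade extra detector passes for finer localization; the stride is the knob that controls that trade-off.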
### Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

## Architecture

```
                        ┌──────────────────────────────────────────┐
                        │     alknet-firewall (Python library)     │
                        │                                          │
  Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐  │
  (text)                │  │ screen(input) → Alarm              │  │
                        │  │ ├─ Tokenize input                  │  │
                        │  │ ├─ Run detector model              │  │
                        │  │ ├─ Extract hidden state activations│  │
                        │  │ ├─ Project onto SVD basis          │  │
                        │  │ ├─ Compare against codebook        │  │
                        │  │ └─ Return behavioral alarm         │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Model Manager ────────────────────┐  │
                        │  │ Load model (HF Hub download/cache) │  │
                        │  │ Extract activations at key layers  │  │
                        │  │ Model-agnostic interface           │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Codebook ─────────────────────────┐  │
                        │  │ SVD basis vectors (compiled)       │  │
                        │  │ Detection thresholds per dimension │  │
                        │  │ Behavioral region boundaries       │  │
                        │  │ Spline distributions for scoring   │  │
                        │  └────────────────────────────────────┘  │
                        │                                          │
                        │  ┌─ Configuration ────────────────────┐  │
                        │  │ Model selection & revision pinning │  │
                        │  │ Detection thresholds               │  │
                        │  │ Alarm severity levels              │  │
                        │  └────────────────────────────────────┘  │
                        └──────────────────────────────────────────┘
                                             │
                                      ┌──────┴──────┐
                                      │             │
                               HF Hub Cache   Detector Model
                               (~/.cache/)    (SmolLM2-135M)
```

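The projection and comparison steps in the diagram can be illustrated with a toy example. This is a minimal sketch under loud assumptions: random vectors stand in for last-token hidden states, `numpy.linalg.svd` stands in for whatever SVD routine the library uses, and a plain per-dimension z-score threshold stands in for the codebook's spline-based scoring. The names `fit_basis` and `screen_vector` are hypothetical.

```python
import numpy as np


def fit_basis(benign: np.ndarray, k: int):
    """Fit a rank-k SVD basis and per-dimension spread from benign activations."""
    mean = benign.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
    basis = vt[:k]
    std = ((benign - mean) @ basis.T).std(axis=0) + 1e-8
    return mean, basis, std


def screen_vector(activation, mean, basis, std, threshold=4.0):
    """Project one activation vector onto the basis and flag outlying dimensions."""
    z = ((activation - mean) @ basis.T) / std  # per-dimension z-scores
    flagged = np.flatnonzero(np.abs(z) > threshold)
    return z, flagged


rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 64))  # stand-in for benign hidden states
mean, basis, std = fit_basis(benign, k=8)

# An input pushed far along the first SVD direction should trip dimension 0.
shifted = benign[0] + 25.0 * basis[0]
z, flagged = screen_vector(shifted, mean, basis, std)
print("flagged dimensions:", flagged.tolist())
```

Because the basis rows are orthonormal, each flagged index maps back to one direction in activation space, which is what makes this style of detection interpretable.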
## Package Dependencies

### Core (Required)

| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |

### Optional (Extras)

| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |

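Keeping `torch` behind an extra means the library has to fail clearly when the extra is missing. The following is a minimal sketch of one common way to do that, not the project's actual implementation; the helper name `require_extra` is hypothetical, and the distribution name `alknet-firewall` is assumed from the architecture diagram.

```python
import importlib
import importlib.util
from types import ModuleType


def require_extra(
    module_name: str, extra: str, package: str = "alknet-firewall"
) -> ModuleType:
    """Import an optional dependency or fail with an actionable install hint."""
    # find_spec returns None for a missing top-level module without importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "{package}[{extra}]"'
        )
    return importlib.import_module(module_name)
```

A backend module could call `require_extra("torch", "torch")` at the point where inference is first needed, so that importing the library itself stays cheap.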
### Development (Not Published)

| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |

## Exports

This is a Python library. Public API surface:

```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level       # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score       # float, 0.0-1.0
alarm.signals     # list[DimensionSignal]: per-dimension behavioral signals
alarm.dimensions  # SVD dimension analysis
```

See [firewall.md](firewall.md) for the full API specification.

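On the caller's side, a three-level alarm typically maps onto a small routing decision. The sketch below uses stand-in `Alarm` and `AlarmLevel` types that only mirror the names in the export list, so that it runs without the library installed; the real classes are specified in firewall.md and may differ. The `decide` policy and its return values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class AlarmLevel(Enum):
    """Stand-in for the library's three alarm levels."""
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Alarm:
    """Stand-in carrying only the fields this example needs."""
    level: AlarmLevel
    score: float
    signals: list = field(default_factory=list)


def decide(alarm: Alarm) -> str:
    """Map an alarm level to a caller-side action (hypothetical policy)."""
    if alarm.level is AlarmLevel.DANGEROUS:
        return "block"
    if alarm.level is AlarmLevel.SUSPICIOUS:
        return "review"
    return "forward"


print(decide(Alarm(AlarmLevel.SUSPICIOUS, score=0.62)))
```

Branching on `level` rather than on the raw `score` keeps threshold tuning inside the firewall's configuration instead of scattering it across callers.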
## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~125M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files; a security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |

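The safetensors-only rule in ADR-005 can be enforced before anything is downloaded, by filtering a repository's file listing. The sketch below shows one possible policy, not the library's actual one; the function name `select_safe_files` and the suffix allowlist are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumed allowlist: safetensors weights plus plain-text config/tokenizer files.
# Pickle-capable formats (.bin, .pt, .pkl, ...) are simply never selected.
ALLOWED_SUFFIXES = {".safetensors", ".json", ".txt"}


def select_safe_files(repo_files: list[str]) -> list[str]:
    """Return the files considered safe to fetch; refuse repos without safetensors."""
    safe = [
        name for name in repo_files
        if PurePosixPath(name).suffix.lower() in ALLOWED_SUFFIXES
    ]
    if not any(name.endswith(".safetensors") for name in safe):
        raise ValueError(
            "repository has no .safetensors weights; refusing pickle fallback"
        )
    return safe


files = ["config.json", "model.safetensors", "pytorch_model.bin", "tokenizer.json"]
print(select_safe_files(files))
```

A list like this could plausibly be passed as `allow_patterns` to `huggingface_hub.snapshot_download`, so that disallowed files never reach the local cache.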
## Dependencies on Other Projects

- **metaspline**: The core detection logic (codebook, spline distributions,
  SVD projection, space transforms) is adapted from the metaspline research
  project. The PoC validated the behavioral signal approach; this project
  extracts and productionizes ~1,745 lines of the working subset.

- **reverse-proxy**: The architecture documentation structure and SDD process
  are adapted from the @alkdev/reverse-proxy project. The documentation
  conventions, ADR format, and open questions tracking are reused directly.

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (Resolved: removed from scope; ONNX Runtime doesn't support activation extraction natively, and burn/cublas is a better future path.)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (Resolved in ADR-011: standalone API + thin adapters in Phase 2.)