glm-5.1 c225cf420c docs: resolve OQ-03 — adopt rolling token window screening (ADR-012)
Research confirmed rolling token windows as the right approach for long
document screening. ADR-012 formalizes the decision: Phase 2 implements
screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max
pooling aggregation, and character offset tracking. Short inputs fall
through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:
- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config
2026-06-13 08:25:12 +00:00

---
status: draft
last_updated: 2026-06-13
---
# Overview
## Vision
A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals —
patterns in hidden state activations from a small language model — to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.
This project is open source under the MIT license.
## Why This Exists
LLMs process instructions and data in the same token stream, so they cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection, ranked first in the OWASP Top 10
for LLM Applications (LLM01:2025). According to the International AI Safety
Report 2026, sophisticated attackers bypass even the best-defended models ~50%
of the time within just 10 attempts.
Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. Adversarial inputs that look natural to
text classifiers still produce distinctive activation patterns when a model
processes them.
Academic research validates this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
  through non-semantic hidden signals
See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.
## Scope
### In Scope
- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)
- **Phase 2**: Integration and operational hardening
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)
  - Streaming/rolling-window input screening (granular detection for documents)
- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Custom model fine-tuning for domain-specific detection
  - Alternative inference backends (burn/cublas via safetensors)
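
The safetensors-only requirement in Phase 1 can be pictured as a filter over a model repository's file list. The sketch below is one possible shape, not the library's implementation: `select_weight_files` and `UNSAFE_SUFFIXES` are hypothetical names, and the file list is assumed to come from a call such as `huggingface_hub.list_repo_files`.

```python
from pathlib import PurePosixPath

# Pickle-based or pickle-capable weight formats (illustrative, not exhaustive).
UNSAFE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".pickle", ".ckpt"}


def select_weight_files(filenames):
    """Return only .safetensors files; fail if the repo offers none."""
    safe = [f for f in filenames if PurePosixPath(f).suffix == ".safetensors"]
    unsafe = [f for f in filenames if PurePosixPath(f).suffix in UNSAFE_SUFFIXES]
    if not safe:
        raise ValueError(
            f"no .safetensors weights found; refusing pickle-based files: {unsafe}"
        )
    return safe


repo_files = ["config.json", "model.safetensors", "pytorch_model.bin"]
weights = select_weight_files(repo_files)
```

Failing closed when no safetensors file exists keeps a pickle-based checkpoint from being loaded as a silent fallback.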
### Out of Scope
- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs
## Architecture
```
                      ┌──────────────────────────────────────────┐
                      │  alknet-firewall (Python library)        │
                      │                                          │
Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐  │
(text)                │  │ screen(input) → Alarm              │  │
                      │  │ ├─ Tokenize input                  │  │
                      │  │ ├─ Run detector model              │  │
                      │  │ ├─ Extract hidden state activations│  │
                      │  │ ├─ Project onto SVD basis          │  │
                      │  │ ├─ Compare against codebook        │  │
                      │  │ └─ Return behavioral alarm         │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Model Manager ────────────────────┐  │
                      │  │ Load model (HF Hub download/cache) │  │
                      │  │ Extract activations at key layers  │  │
                      │  │ Model-agnostic interface           │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Codebook ─────────────────────────┐  │
                      │  │ SVD basis vectors (compiled)       │  │
                      │  │ Detection thresholds per dimension │  │
                      │  │ Behavioral region boundaries       │  │
                      │  │ Spline distributions for scoring   │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Configuration ────────────────────┐  │
                      │  │ Model selection & revision pinning │  │
                      │  │ Detection thresholds               │  │
                      │  │ Alarm severity levels              │  │
                      │  └────────────────────────────────────┘  │
                      └──────────────────────────────────────────┘
                                    ┌──────┴──────┐
                                    │             │
                            HF Hub Cache     Detector Model
                            (~/.cache/)      (SmolLM2-135M)
```
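
The screening steps in the diagram can be sketched numerically. This is a minimal illustration under stated assumptions, not the project's detection logic. It treats the last-token activation as already extracted, builds a toy codebook from synthetic data with `numpy.linalg.svd`, and uses a simple per-dimension deviation test in place of the spline distributions. All function and variable names are hypothetical.

```python
import numpy as np


def project_and_score(activation, mean, basis, scale, thresholds):
    """Score one activation vector against a toy codebook.

    activation: (hidden,) last-token activation (assumed already extracted)
    mean:       (hidden,) mean activation of benign calibration data
    basis:      (k, hidden) orthonormal SVD directions
    scale:      (k,) spread of benign projections per direction
    thresholds: (k,) deviation limit per direction
    """
    coords = basis @ (activation - mean)   # project onto the SVD basis
    deviation = np.abs(coords) / scale     # normalised distance per dimension
    flagged = deviation > thresholds       # dimensions that exceed their limit
    # Squash the worst deviation into [0, 1); this mapping is illustrative only.
    score = float(1.0 - np.exp(-deviation.max()))
    return score, flagged


# Toy codebook built from synthetic "benign" activations.
rng = np.random.default_rng(0)
benign = rng.normal(size=(200, 16))
mean = benign.mean(axis=0)
_, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
basis = vt[:4]                             # keep the top 4 directions
scale = ((benign - mean) @ basis.T).std(axis=0)
thresholds = np.full(4, 3.0)

benign_score, _ = project_and_score(benign[0], mean, basis, scale, thresholds)
# An input pushed far along the first direction should trip that dimension.
shifted = mean + 10.0 * scale[0] * basis[0]
shifted_score, shifted_flags = project_and_score(shifted, mean, basis, scale, thresholds)
```

Returning the per-dimension flags alongside the scalar score mirrors the document's goal of multi-dimensional alarms rather than a single safe/unsafe bit.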
## Package Dependencies
### Core (Required)
| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |
### Optional (Extras)
| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
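
An extras-based layout usually pairs with a guard that turns a missing optional package into an actionable error. The helper below is a hedged sketch of that pattern. `require_extra` is a hypothetical name, and the distribution name `alknet-firewall` is assumed from the architecture diagram rather than confirmed by the source.

```python
import importlib
import importlib.util


def require_extra(module_name, extra, package="alknet-firewall"):
    """Import a top-level optional module or raise an actionable ImportError."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        )
    return importlib.import_module(module_name)


# A standard-library module stands in for torch so the sketch runs anywhere.
json_module = require_extra("json", extra="torch")
```

Calling the guard inside the inference backend, rather than at package import time, keeps `import alknet_firewall` working for users who installed only the core dependencies.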
### Development (Not Published)
| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |
## Exports
This is a Python library. Public API surface:
```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level       # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score       # float, 0.0-1.0
alarm.signals     # list[DimensionSignal], per-dimension behavioral signals
alarm.dimensions  # SVD dimension analysis
```
See [firewall.md](firewall.md) for the full API specification.
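
As a rough illustration of how a 0.0-1.0 score could relate to the three alarm levels, the sketch below uses stand-in types that only mirror the names above. They are not the library's classes. `level_for`, `make_alarm`, and both cutoff values are assumptions for the example; the real thresholds are configurable per the architecture.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = 0
    SUSPICIOUS = 1
    DANGEROUS = 2


@dataclass(frozen=True)
class Alarm:
    score: float
    level: AlarmLevel


def level_for(score, suspicious_at=0.5, dangerous_at=0.8):
    """Map a score in [0, 1] to a level; both cutoffs are made-up defaults."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= dangerous_at:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_at:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


def make_alarm(score):
    """Bundle a score with its derived level."""
    return Alarm(score=score, level=level_for(score))
```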
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~125M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files — security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; the last token's hidden state has attended to the full sequence |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
| [012](decisions/012-rolling-window-screening.md) | Rolling token window screening | Phase 2 `screen_document()` with 25% overlap, max pooling |
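
The rolling-window decision in ADR-012 (25% overlap, max pooling, character offset tracking) can be sketched as follows. This is an illustration under assumptions, not the Phase 2 implementation: `window_spans` and `screen_document_sketch` are hypothetical names, the scoring callable stands in for the real per-window screen, and token character offsets are assumed to be supplied by the tokenizer.

```python
def window_spans(n_tokens, window, overlap=0.25):
    """Return [start, end) token spans that cover n_tokens with fractional overlap."""
    if n_tokens <= window:
        return [(0, n_tokens)]             # short input: a single window
    stride = max(1, window - int(window * overlap))
    spans, start = [], 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            return spans
        start += stride


def screen_document_sketch(offsets, score_window, window, overlap=0.25):
    """Max-pool window scores and report the character span of the worst window.

    offsets:      list of (char_start, char_end), one pair per token
    score_window: callable (tok_start, tok_end) -> float in [0, 1]
    """
    if not offsets:
        return 0.0, (0, 0)
    best_score, best_chars = -1.0, (0, 0)
    for start, end in window_spans(len(offsets), window, overlap):
        score = score_window(start, end)
        if score > best_score:             # max pooling across windows
            best_score = score
            best_chars = (offsets[start][0], offsets[end - 1][1])
    return best_score, best_chars


# Ten two-character tokens; only windows containing token 5 look suspicious.
toy_offsets = [(2 * i, 2 * i + 1) for i in range(10)]
toy_scorer = lambda s, e: 0.9 if s <= 5 < e else 0.1
result = screen_document_sketch(toy_offsets, toy_scorer, window=4)
```

Max pooling means one hostile window is enough to raise the document-level score, and the overlap reduces the chance that a payload straddling a window boundary is split across two windows and diluted.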
## Dependencies on Other Projects
- **metaspline**: The core detection logic (codebook, spline distributions,
  SVD projection, space transforms) is adapted from the metaspline research
  project. The PoC validated the behavioral signal approach; this project
  extracts and productionizes ~1,745 lines of the working subset.
- **reverse-proxy**: The architecture documentation structure and SDD process
  are adapted from the @alkdev/reverse-proxy project. The documentation
  conventions, ADR format, and open questions tracking are reused directly.
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved — removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (resolved — ADR-011: standalone API + thin adapters in Phase 2)
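
To make the "thin adapter" idea from OQ-05 concrete, the sketch below wraps any object exposing `screen(text)` as a plain allow/deny callable. It is a generic illustration and does not use any real guardrail framework's API. `Screener`, `as_guardrail`, and the default blocked level are assumptions.

```python
from typing import Callable, Protocol


class Screener(Protocol):
    """Anything with a screen(text) method that returns an alarm-like object."""

    def screen(self, text: str): ...


def as_guardrail(screener: Screener, blocked_levels=("DANGEROUS",)) -> Callable[[str], bool]:
    """Wrap a screener as a callable that returns True when input may pass."""

    def allow(text: str) -> bool:
        alarm = screener.screen(text)
        # Compare by enum member name so the adapter does not import the enum.
        return alarm.level.name not in blocked_levels

    return allow
```

Keeping the adapter to a few lines means a framework-specific package would only need to translate this boolean (or the underlying alarm) into that framework's own result type.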