---
status: draft
last_updated: 2026-06-13
---

# Overview

## Vision

A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals —
patterns in hidden state activations from a small language model — to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.

This project is open source under the MIT license.

## Why This Exists

LLMs process instructions and data in the same token stream. They cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection, ranked first in the OWASP Top
10 for LLM Applications (LLM01:2025). Sophisticated attackers bypass the
best-defended models ~50% of the time with just 10 attempts (International AI
Safety Report 2026).

Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. The working hypothesis, supported by the
research below, is that adversarial inputs that look natural to text
classifiers can still produce distinctive activation patterns when a model
processes them.

Academic research supports this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms prior state-of-the-art methods
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals are detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
  through non-semantic hidden signals

See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.

## Scope

### In Scope

- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)

- **Phase 2**: Integration and operational hardening
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)
  - Streaming/rolling-window input screening (granular detection for documents)

- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Custom model fine-tuning for domain-specific detection
  - Alternative inference backends (burn/cublas via safetensors)

### Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

## Architecture

```
                        ┌──────────────────────────────────────────┐
                        │  alknet-firewall (Python library)          │
                        │                                            │
  Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐   │
  (text)                │  │  screen(input) → Alarm              │   │
                        │  │  ├─ Tokenize input                   │   │
                        │  │  ├─ Run detector model              │   │
                        │  │  ├─ Extract hidden state activations│   │
                        │  │  ├─ Project onto SVD basis           │   │
                        │  │  ├─ Compare against codebook         │   │
                        │  │  └─ Return behavioral alarm          │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Model Manager ────────────────────┐   │
                        │  │  Load model (HF Hub download/cache) │   │
                        │  │  Extract activations at key layers   │   │
                        │  │  Model-agnostic interface            │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Codebook ──────────────────────────┐   │
                        │  │  SVD basis vectors (compiled)        │   │
                        │  │  Detection thresholds per dimension  │   │
                        │  │  Behavioral region boundaries        │   │
                        │  │  Spline distributions for scoring    │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Configuration ─────────────────────┐   │
                        │  │  Model selection & revision pinning  │   │
                        │  │  Detection thresholds               │   │
                        │  │  Alarm severity levels              │   │
                        │  └────────────────────────────────────┘   │
                        └──────────────────────────────────────────┘
                                      │
                               ┌──────┴──────┐
                               │             │
                        HF Hub Cache    Detector Model
                        (~/.cache/)    (SmolLM2-135M)
```
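
The screening steps in the diagram can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the library's implementation: the activations are synthetic stand-ins for detector-model hidden states, the function names and sizes are hypothetical, and a per-dimension z-score with a fixed threshold stands in for the codebook's spline-based scoring.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, K = 64, 8  # hypothetical hidden size and number of SVD directions


def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
    of 0/1 values, assuming right padding.
    """
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]


def fit_codebook(benign: np.ndarray, k: int) -> dict:
    """Fit a toy codebook: mean, top-k right singular vectors, per-dimension spread."""
    mean = benign.mean(axis=0)
    _, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
    basis = vt[:k]                       # (k, hidden_size), orthonormal rows
    coords = (benign - mean) @ basis.T   # benign coordinates in the basis
    return {"mean": mean, "basis": basis, "std": coords.std(axis=0)}


def dimension_scores(x: np.ndarray, codebook: dict) -> np.ndarray:
    """Project activations onto the basis; return absolute z-scores per dimension."""
    coords = (x - codebook["mean"]) @ codebook["basis"].T
    return np.abs(coords) / codebook["std"]


# Synthetic benign activations with a few dominant directions.
scales = np.linspace(3.0, 0.5, HIDDEN)
benign = rng.normal(size=(500, HIDDEN)) * scales
codebook = fit_codebook(benign, K)

# A right-padded batch of two sequences; only the last real token is scored.
hidden = rng.normal(size=(2, 5, HIDDEN)) * scales
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
# Push the first sequence's last real token far along the first SVD direction.
hidden[0, 2] += 12.0 * codebook["std"][0] * codebook["basis"][0]

states = last_token_states(hidden, mask)
scores = dimension_scores(states, codebook)   # shape (2, K)
flagged = scores.max(axis=1) > 6.0            # hypothetical alarm threshold
```

Keeping the per-dimension scores, rather than collapsing them to a single number early, is what would allow an alarm to report which directions fired.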

## Package Dependencies

### Core (Required)

| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |
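
The safetensors-only requirement can be enforced before any weights are fetched by filtering a repository's file listing. The sketch below assumes a plain list of filenames as input; the function name `select_weight_files` and the suffix deny-list are hypothetical, not part of the library.

```python
from pathlib import PurePosixPath

# Hypothetical deny-list: formats that may execute code when deserialized.
UNSAFE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".pickle", ".ckpt"}


def select_weight_files(repo_files: list[str]) -> list[str]:
    """Return only the safetensors weight files, refusing repos without any.

    Pickle-based files are never returned, so they are never downloaded
    or loaded by the caller.
    """
    safe = [f for f in repo_files if PurePosixPath(f).suffix == ".safetensors"]
    if not safe:
        unsafe = [f for f in repo_files if PurePosixPath(f).suffix in UNSAFE_SUFFIXES]
        raise ValueError(f"no safetensors weights found; unsafe candidates: {unsafe}")
    return safe


files = ["config.json", "model.safetensors", "pytorch_model.bin", "tokenizer.json"]
weights = select_weight_files(files)  # only the safetensors file is selected
```

In practice the listing could come from `huggingface_hub.list_repo_files`, and the same restriction could be expressed through the `allow_patterns` argument of `snapshot_download`.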

### Optional (Extras)

| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | ~200MB to ~2.5GB depending on platform and CUDA build |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |

### Development (Not Published)

| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |

## Exports

This is a Python library. Public API surface:

```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level          # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score          # float, 0.0-1.0
alarm.signals        # list[DimensionSignal] — per-dimension behavioral signals
alarm.dimensions     # SVD dimension analysis
```

See [firewall.md](firewall.md) for the full API specification.
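
To show how a caller might act on an alarm, here is a sketch using local stand-in types so it runs without the library. `AlarmLevel` and `Alarm` below only mirror the attributes listed above; the thresholds (0.4 and 0.8), `level_for`, `route`, and the action names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):  # stand-in for alknet_firewall.AlarmLevel
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Alarm:  # stand-in carrying only the fields used below
    level: AlarmLevel
    score: float


def level_for(score: float, suspicious_at: float = 0.4, dangerous_at: float = 0.8) -> AlarmLevel:
    """Map a 0.0-1.0 score to a level using hypothetical thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if score >= dangerous_at:
        return AlarmLevel.DANGEROUS
    if score >= suspicious_at:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


def route(alarm: Alarm) -> str:
    """Decide what happens to the input before it reaches the target LLM."""
    if alarm.level is AlarmLevel.DANGEROUS:
        return "block"
    if alarm.level is AlarmLevel.SUSPICIOUS:
        return "review"  # e.g. log, sandbox, or strip tool access
    return "forward"


decision = route(Alarm(level=level_for(0.55), score=0.55))  # "review"
```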

## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~135M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files — security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; the last token's state reflects the full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
| [012](decisions/012-rolling-window-screening.md) | Rolling token window screening | Phase 2 `screen_document()` with 25% overlap, max pooling |

## Dependencies on Other Projects

- **metaspline**: The core detection logic (codebook, spline distributions,
  SVD projection, space transforms) is adapted from the metaspline research
  project. The PoC validated the behavioral signal approach; this project
  extracts and productionizes ~1,745 lines of the working subset.

- **reverse-proxy**: The architecture documentation structure and SDD process
  are adapted from the @alkdev/reverse-proxy project. The documentation
  conventions, ADR format, and open questions tracking are reused directly.

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved — removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (resolved — ADR-011: standalone API + thin adapters in Phase 2)