feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions


@@ -0,0 +1,71 @@
---
status: draft
last_updated: 2026-06-13
---
# alknet-firewall — Architecture
## Current State
**Phase 0→1 (Exploration → Architecture)** — The project has a working PoC
demonstrating that behavioral signals from small language models can detect
adversarial inputs. The core detection logic (~1,745 lines) works reasonably
well but lacks tests, has excessive codebook size, and needs extraction from
the research codebase into a properly structured Python package.
This project extracts and productionizes the behavioral signal detection
approach from the metaspline research project. A ~125M parameter model
(SmolLM2-135M) processes untrusted inputs and produces hidden state
activations. SVD-based dimensionality reduction on these activations reveals
behavioral patterns — normal inputs cluster in expected regions while
adversarial inputs produce anomalous activation signatures. The system
raises "behavioral alarms" without needing to know specific attack types.
## Architecture Documents
| Document | Status | Description |
|----------|--------|-------------|
| [overview.md](overview.md) | Draft | Vision, scope, package structure, dependencies |
| [firewall.md](firewall.md) | Draft | Core firewall API, input screening, alarm protocol |
| [codebook.md](codebook.md) | Draft | SVD basis, detection parameters, codebook compilation |
| [model.md](model.md) | Draft | Model loading, activation extraction, model-agnostic design |
| [configuration.md](configuration.md) | Draft | Thresholds, model selection, detection tuning |
| [open-questions.md](open-questions.md) | Active | Unresolved questions tracker with OQ-IDs |
## ADR Table
| ADR | Title | Status |
|-----|-------|--------|
| [001](decisions/001-python-uv.md) | Python with uv | Accepted |
| [002](decisions/002-behavioral-signals.md) | Behavioral Signal Detection (Not Text Classification) | Accepted |
| [003](decisions/003-small-model-detector.md) | Small Model (~125M) as Detector | Accepted |
| [004](decisions/004-svd-based-detection.md) | SVD-Based Anomaly Detection | Accepted |
| [005](decisions/005-safetensors-only.md) | Safetensors-Only Model Loading | Accepted |
| [006](decisions/006-optional-pytorch.md) | PyTorch as Optional Dependency | Accepted |
| [007](decisions/007-runtime-model-download.md) | Runtime Model Download via HuggingFace Hub | Accepted |
| [008](decisions/008-three-level-alarm.md) | Three-Level Alarm System | Accepted |
| [009](decisions/009-last-token-extraction.md) | Last-Token Activation Extraction | Accepted |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic Spline Distributions | Accepted |
## Open Questions
See [open-questions.md](open-questions.md) for the full tracker.
| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
| OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
| OQ-03 | Should the firewall support streaming/chunked input screening? | low | open |
| OQ-04 | Should detection thresholds be per-model or globally configurable? | medium | open |
| OQ-05 | How should the firewall integrate with existing guardrail systems (LlamaFirewall, NeMo)? | medium | open |
| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
| OQ-07 | Is a Rust port feasible given current ML framework maturity? | low | open |
## Document Lifecycle
| Status | Meaning | Transitions |
|--------|---------|-------------|
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |


@@ -0,0 +1,248 @@
---
status: draft
last_updated: 2026-06-13
---
# Codebook
The codebook contains the compiled detection parameters — SVD basis vectors,
behavioral region boundaries, and scoring distributions — that the firewall
uses to detect adversarial inputs.
## What It Is
The codebook is the "compiled detector" — the precomputed parameters that
transform raw model activations into behavioral alarm signals. It is to the
firewall what a trained model is to a classifier: the result of an offline
compilation step that produces the runtime detection parameters.
The name "codebook" comes from vector quantization terminology: it defines a
set of reference points (codewords) in activation space that represent known
behavioral patterns. New inputs are compared against these reference patterns.
## Why It Exists
Running full SVD decomposition and distribution fitting on every input would be
prohibitively expensive. The codebook precomputes these offline:
- **SVD basis**: The principal directions in activation space that capture
safety-relevant behavioral variance. Computed once from a calibration
dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.
At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Projection
costs O(k·d) per layer, where k is the number of retained dimensions and d is
the hidden dimension; scoring the resulting z-coordinates costs O(k).
## Key Concepts
### z-Coordinates
The projection of an activation vector onto the SVD basis. Computed as:
```
z = V^T @ (activation - mean)
```
Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
mean activation from the calibration dataset. The centering step is critical
— without it, projections are offset by the mean and thresholds would be
incorrect.
z-coordinates are raw (unnormalized) projections. The codebook's spline
distributions are calibrated for this scale, so threshold values in the
codebook are specific to the z-coordinate range of the calibration data.
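A minimal NumPy sketch of the projection, using toy values, may make the role of centering concrete. The function name `project` and the toy shapes are illustrative assumptions; the basis is stored with one vector per row, so it already plays the role of `V^T`.

```python
import numpy as np


def project(activation: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean).

    basis has shape (n_dims, hidden_dim): one basis vector per row (i.e. V^T).
    """
    centered = activation - mean  # centering step from the formula above
    return basis @ centered       # n_dims * hidden_dim multiply-adds


# Toy example (assumed values): 4-dim hidden state, 2 retained dimensions.
hidden_dim, n_dims = 4, 2
basis = np.eye(hidden_dim, dtype=np.float32)[:n_dims]
mean = np.ones(hidden_dim, dtype=np.float32)
activation = np.array([3.0, 0.0, 5.0, 1.0], dtype=np.float32)

z = project(activation, basis, mean)
uncentered = basis @ activation  # what you would get if centering were skipped
```

Here `z` and `uncentered` differ by exactly `basis @ mean`, which is the offset that would invalidate thresholds calibrated on centered data.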
### SVD Basis
Singular Value Decomposition of the activation space from a calibration dataset
reveals the principal components (directions) that capture the most variance.
The top-k components form the basis that the codebook uses for projection.
Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
and weights. Changing the detector model requires recomputing the basis
The SVD basis is computed by the codebook training pipeline
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
### Behavioral Regions
For each SVD dimension, the codebook defines the expected distribution of
normal (non-adversarial) inputs. This is modeled as a monotonic spline
distribution that captures the shape of the behavioral region along that
dimension.
Inputs whose projections fall within the normal region score low (CLEAR).
Inputs whose projections fall near or beyond the region boundary score
increasingly high (SUSPICIOUS → DANGEROUS).
### Spline Distributions
Monotonic spline distributions model the probability density along each SVD
dimension (ADR-010). They provide:
- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
adversarial training
The spline distribution approach is adapted from the metaspline PoC
(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10 to 20 knots. Knot positions are determined by quantiles of
the calibration data (ensuring density of knots where data is dense). Beyond
the extreme knots, the CDF decays exponentially at a rate fitted to the tail
data. The scoring function maps a z-coordinate to a score in [0, 1] via the
CDF's complement: `score = 1 - cdf(z)`.
**Canonical implementation**: The metaspline PoC files `spline.py`
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
and `space.py` (`unfold`/`fold`) are the reference implementation for the
codebook compilation pipeline.
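The scoring rule in the formal definition can be sketched in a few lines of standard-library Python. This is not the metaspline implementation: it swaps the monotonic cubic spline for piecewise-linear interpolation between quantile knots, which is also monotonic but not smooth, and it uses a single assumed `tail_decay` for both tails. The names `fit_knots`, `cdf`, and `score` are hypothetical.

```python
import bisect
import math


def fit_knots(samples, n_knots=10):
    """Place knots at evenly spaced quantiles of the calibration samples."""
    xs = sorted(samples)
    n = len(xs)
    knots, probs = [], []
    for i in range(n_knots):
        # Mid-bin quantiles keep probability mass available for both tails.
        idx = ((2 * i + 1) * n) // (2 * n_knots)
        knots.append(xs[min(idx, n - 1)])
        probs.append((2 * i + 1) / (2 * n_knots))
    return knots, probs


def cdf(z, knots, probs, tail_decay=1.0):
    """Monotonic CDF: linear between knots, exponential beyond the extremes."""
    if z <= knots[0]:
        return probs[0] * math.exp(-tail_decay * (knots[0] - z))
    if z >= knots[-1]:
        return 1.0 - (1.0 - probs[-1]) * math.exp(-tail_decay * (z - knots[-1]))
    i = bisect.bisect_right(knots, z)  # knots[i - 1] <= z < knots[i]
    x0, x1 = knots[i - 1], knots[i]
    t = (z - x0) / (x1 - x0)
    return probs[i - 1] + t * (probs[i] - probs[i - 1])


def score(z, knots, probs, tail_decay=1.0):
    """Score in [0, 1] via the CDF complement, as in the definition above."""
    return 1.0 - cdf(z, knots, probs, tail_decay)


# Synthetic calibration data for one dimension (assumed, for illustration).
samples = [i / 100 for i in range(1000)]
knots, probs = fit_knots(samples, n_knots=10)
scores = [score(z, knots, probs) for z in (-5.0, 0.0, 2.5, 5.0, 7.5, 10.0, 15.0)]
```

Because the CDF is monotonic, the score never increases as `z` grows, and the exponential tails keep it strictly inside (0, 1) for inputs far outside the calibration range.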
### Calibration Dataset
The calibration dataset is the set of normal (non-adversarial) inputs used to
compute the SVD basis and fit behavioral region distributions. Requirements:
- **Composition**: Diverse normal inputs representative of the deployment
domain. No adversarial examples — the basis models *normal* behavior, and
anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
Practical range: 1,000 to 10,000 inputs. More inputs stabilize the basis but
have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
in production. A narrow calibration dataset (e.g., only short English
queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
model by running that model on the inputs and extracting activations.
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
automates calibration dataset processing.
### Codebook Compilation
The codebook is compiled offline by a training pipeline that:
1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`,
whose default randomized solver is approximate and reproducible only with a
fixed `random_state`)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)
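Step 3 can be sketched for a single layer as follows. This is an illustration under stated assumptions, not the PoC pipeline: it uses `numpy.linalg.svd` rather than `scipy.linalg.svd` to keep the example dependency-light, the calibration matrix is synthetic, and `compile_basis` is a hypothetical name.

```python
import numpy as np


def compile_basis(activations: np.ndarray, n_dims: int):
    """Compute the centering mean and top-k SVD basis for one layer.

    activations has shape (n_samples, hidden_dim).
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Economy SVD: rows of vt are right-singular vectors, ordered by
    # descending singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dims], mean, singular_values[:n_dims]


# Synthetic stand-in for calibration activations (assumed shapes and scales).
rng = np.random.default_rng(0)
calibration = rng.normal(size=(200, 16)) * np.linspace(4.0, 0.5, 16)

basis, mean, sv = compile_basis(calibration, n_dims=4)
z = (calibration - mean) @ basis.T  # z-coordinates of the calibration set
```

The returned `basis` and `mean` correspond to the per-layer slices of `basis_vectors` and `mean` described under Tensor Specifications; steps 4 and 5 would then fit distributions to the columns of `z`.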
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
package** as package data (under `src/alknet_firewall/data/codebook/`). This
keeps the Phase 1 installation simple — no additional download step beyond the
model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.
## Data Format
The codebook is stored as:
```
codebook/
├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors # Region boundary parameters
├── splines.json # Spline knot positions and coefficients
└── config.json # Metadata: model_id, revision, n_dims, thresholds
```
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
### Tensor Specifications
**basis.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |
**regions.safetensors**:
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
**splines.json**:
| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
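A loader would plausibly validate `splines.json` before use. The sketch below checks only what the table states (three fields, one entry per dimension) plus two labeled assumptions that the spec does not spell out: knots are strictly increasing and decay rates are positive. `validate_splines` is a hypothetical helper.

```python
import json


def validate_splines(payload: dict, n_dims: int) -> dict:
    """Check the structure of a parsed splines.json document."""
    for field in ("knots", "coefficients", "tail_decay"):
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        if len(payload[field]) != n_dims:
            raise ValueError(f"{field} must have {n_dims} entries")
    for dim, knots in enumerate(payload["knots"]):
        # Assumption: knot positions are strictly increasing per dimension.
        if any(b <= a for a, b in zip(knots, knots[1:])):
            raise ValueError(f"knots for dimension {dim} are not increasing")
    for dim, rate in enumerate(payload["tail_decay"]):
        # Assumption: a decaying tail needs a positive rate.
        if rate <= 0:
            raise ValueError(f"tail_decay for dimension {dim} must be positive")
    return payload


# Example document with two dimensions and varying knot counts (toy values).
text = json.dumps({
    "knots": [[-1.0, 0.0, 1.0], [0.0, 2.0]],
    "coefficients": [[0.1, 0.2, 0.3], [0.5, 0.5]],
    "tail_decay": [1.0, 0.5],
})
splines = validate_splines(json.loads(text), n_dims=2)
```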
## Interfaces
### Internal API
```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    # DimensionSignal is defined elsewhere in the package.
    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```
### Constraints
1. **Immutable at runtime** — The codebook is read-only during screening.
Modifying the codebook requires explicit recompilation.
2. **Model-bound** — A codebook is valid only for the specific model it was
compiled for. Loading a codebook with the wrong model produces undefined
results.
3. **Deterministic** — Same codebook + same activations = same scores.
4. **Portable** — Codebook can be saved to disk and reloaded without
recomputation. Can be distributed via HuggingFace Hub.
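Constraints 1 and 2 could be enforced rather than only documented. The sketch below is one possible approach, not the package's API: `CodebookIdentity` and `ensure_compatible` are hypothetical names, and the check turns the "undefined results" case into an explicit error at load time.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned at runtime
class CodebookIdentity:
    model_id: str
    model_revision: str


def ensure_compatible(codebook: CodebookIdentity, model_id: str, model_revision: str) -> None:
    """Raise if the codebook was compiled for a different model or revision."""
    expected = (codebook.model_id, codebook.model_revision)
    if expected != (model_id, model_revision):
        raise ValueError(
            f"codebook was compiled for {expected[0]}@{expected[1]}, "
            f"but the detector is {model_id}@{model_revision}"
        )


# Placeholder revision string, for illustration only.
identity = CodebookIdentity("HuggingFaceTB/SmolLM2-135M", "0" * 40)
ensure_compatible(identity, "HuggingFaceTB/SmolLM2-135M", "0" * 40)  # passes
```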
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-02**: What is the minimum viable codebook — can the 1,245-line PoC
codebook be compressed? (open)
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)


@@ -0,0 +1,107 @@
---
status: draft
last_updated: 2026-06-13
---
# Configuration
Configuration for the firewall: model selection, detection thresholds,
alarm levels, and operational parameters.
## What It Is
The configuration component defines all tunable parameters for the firewall.
It controls which model is used, how aggressively inputs are screened, and
what alarm levels map to what scores.
## Why It Exists
Different deployment contexts need different detection sensitivity. A
high-security environment (e.g., screening inputs to a system with access to
sensitive data) may want aggressive thresholds that flag more suspicious
inputs. A low-risk chatbot may prefer permissive thresholds that minimize
false positives. The configuration component makes these trade-offs explicit
and tunable.
## Configuration Structure
### Thresholds
```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    suspicious: float = 0.3  # Score above which input is SUSPICIOUS
    dangerous: float = 0.7   # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
```
Default thresholds are calibrated against the codebook's behavioral regions.
Per-dimension overrides allow tuning sensitivity for specific behavioral
patterns (e.g., lower threshold on the refusal-suppression dimension).
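How the two global thresholds might map a score to an alarm level is sketched below. It is a simplified illustration: the `Thresholds` class here is a trimmed copy without `per_dimension` (whose exact override semantics this document does not define), the enum values are assumed, and `classify` is a hypothetical name. "Above" is read as strictly greater than.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class Thresholds:
    suspicious: float = 0.3
    dangerous: float = 0.7


def classify(score: float, thresholds: Thresholds) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score > thresholds.dangerous:
        return AlarmLevel.DANGEROUS
    if score > thresholds.suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


default = Thresholds()
aggressive = Thresholds(suspicious=0.1, dangerous=0.4)  # high-security profile
```

With these toy profiles, a score of 0.5 is SUSPICIOUS under the defaults but DANGEROUS under the aggressive profile, which is the trade-off the configuration is meant to expose.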
### Model Configuration
```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
```
Extraction layers are chosen based on EMNLP 2024 findings that safety signals
appear in early layers. SmolLM2-135M has 30 transformer layers, so the default
set (1, 2, 4, 8) falls entirely within the first third of the model.
### Codebook Configuration
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodebookConfig:
    source: str = "bundled"      # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None   # HuggingFace repo if source="hf_hub"
    revision: str | None = None  # HuggingFace revision
    path: Path | None = None     # Local path if source="local"
    n_dimensions: int = 10       # Number of SVD dimensions to retain
```
### Full Configuration
```python
from dataclasses import dataclass, field


@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
```
## Defaults
All configuration has sensible defaults. The firewall works out of the box:
```python
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
```
No configuration file is required. All parameters can be passed via the
constructor. A future phase may add file-based configuration (TOML or YAML).
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)


@@ -0,0 +1,41 @@
# ADR-001: Python with uv
## Status
Accepted
## Context
The project needs a programming language and build toolchain. The PoC was
written in Python using PyTorch, sklearn, and transformers. A Rust port using
burn/cubecl was attempted but failed — the ML framework ecosystem in Rust is
not yet mature enough for this type of work.
The project needs a fast path to a usable system. The PoC already works in
Python. Modern Python packaging (uv, pyproject.toml, src layout) provides a
professional project structure that was not available even a few years ago.
## Decision
Use Python 3.10+ with uv as the package manager and build tool. Use uv_build
as the build backend. Use src/ layout for the package.
## Consequences
**Positive**:
- Fast path to working system — PoC code is already Python
- Rich ML ecosystem (PyTorch, transformers, sklearn, safetensors)
- uv provides 10-100x faster dependency management than pip
- Modern packaging standards (pyproject.toml, PEP 735 dependency groups)
- Easy distribution via PyPI with `pip install alknet-firewall[torch]`
- Static type checking via mypy catches many type errors before runtime
**Negative**:
- Python is slower than Rust for non-ML code (SVD projection, data wrangling)
- PyTorch is a large optional dependency (200MB-2.5GB)
- Rust port remains a future goal (Phase 3, speculative)
## References
- [modern-python-project-setup.md](../research/modern-python-project-setup.md)
- [python-ml-packaging.md](../research/python-ml-packaging.md)


@@ -0,0 +1,52 @@
# ADR-002: Behavioral Signal Detection (Not Text Classification)
## Status
Accepted
## Context
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
text-surface approaches — they classify input text as safe or unsafe. This
fundamentally limits their effectiveness:
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
processed by a model
Academic research (2024-2025) demonstrates that adversarial inputs produce
distinctive activation patterns in model internals, regardless of surface form.
## Decision
Build a behavioral signal detection system that monitors how a model processes
inputs (hidden state activations), not what the inputs say (text surface).
Adversarial inputs produce anomalous activation patterns that are detectable
even when the text itself looks innocent.
## Consequences
**Positive**:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic — novel attacks still
produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
directions are activated and by how much)
- Complementary to existing text-surface defenses — can be layered
**Negative**:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
language, technical content)
- To our knowledge, no production system has validated this approach yet
## References
- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)


@@ -0,0 +1,56 @@
# ADR-003: Small Model (~125M) as Detector
## Status
Accepted
## Context
The behavioral signal detection approach requires running a language model on
every input to extract hidden state activations. The choice of model size
creates a trade-off:
- **Large model (7B+)**: Better representation quality, more behavioral signal
resolution. But requires GPU, adds ~200-500ms latency, costs more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer
safety signals. Runs on CPU, <10ms latency, negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to
emerge. Lacks the depth where behavioral patterns form.
EMNLP 2024 research confirms that safety signals are detectable in early
layers — the model doesn't need deep processing to produce useful signals.
A ~125M model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim)
for safety directions to emerge in early layers.
## Decision
Use a small model (~125M parameters) as the default detector. SmolLM2-135M
(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
CPU. Support model-agnostic detection — any compatible model can be used by
recompiling the codebook.
## Consequences
**Positive**:
- <10ms latency enables real-time pre-inference screening
- CPU-deployable — no GPU required for the firewall
- Can run alongside target model without blocking
- Fast iteration — training/updating a 125M model takes hours, not days
- Small enough to embed in API gateways, CDN edges, client applications
- 269MB model download is feasible via HF Hub with caching
**Negative**:
- Less representation quality than larger models — may miss subtle signals
that a 7B detector would catch
- Detector model must share some architectural similarity with target models
for behavioral signals to transfer
- SmolLM2-135M is English-focused — multilingual detection requires a
multilingual detector model
- Codebook is model-specific — switching models requires recompilation
## References
- [model.md](../model.md)
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Behavioral traits transmit through
non-semantic signals


@@ -0,0 +1,58 @@
# ADR-004: SVD-Based Anomaly Detection
## Status
Accepted
## Context
After extracting hidden state activations from the detector model, the
firewall needs a method to distinguish normal behavioral patterns from
adversarial ones. Options:
- **Single classifier**: Train a binary classifier on activations. Simple but
loses the multi-dimensional structure. Black box.
- **SVD + region comparison**: Decompose activation space into principal
directions, model normal behavioral regions along each direction, detect
inputs that fall outside normal regions. Interpretable, efficient,
multi-dimensional.
- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
detect inputs with high reconstruction error. Complex, not interpretable.
ICML 2025 research shows safety is multi-dimensional in activation space — a
dominant refusal direction plus secondary dimensions. SVD naturally discovers
these directions. Region comparison provides interpretable per-dimension
signals.
## Decision
Use SVD-based anomaly detection: decompose activation space via SVD to
discover principal behavioral directions, model normal regions along each
dimension using monotonic spline distributions, and detect inputs whose
projections fall outside normal regions.
## Consequences
**Positive**:
- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projection is a single k × d matrix-vector product per layer, trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single
boundary
- Small-model friendly: SVD on 576-dim hidden states is computationally trivial
- Deterministic: `scipy.linalg.svd` produces exact, reproducible decomposition
(unlike `TruncatedSVD`, whose default solver is a randomized approximation)
**Negative**:
- SVD basis is model-specific — changing detector model requires recomputation
- Basis quality depends on calibration dataset coverage
- Linear decomposition may miss non-linear behavioral patterns
- Requires a codebook compilation pipeline (Phase 2)
- Full SVD on large calibration datasets may be slow (mitigated by
relatively small hidden dim: 576)
## References
- [codebook.md](../codebook.md)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- HiddenDetect (ACL 2025)


@@ -0,0 +1,47 @@
# ADR-005: Safetensors-Only Model Loading
## Status
Accepted
## Context
Model weight files come in two formats:
- **Pickle-based** (`.pt`, `.bin`, `.pth`): Can execute arbitrary Python code
during loading. Known supply chain attack vector.
- **safetensors**: Simple binary format with JSON header. No code execution.
Roughly 76x faster CPU loading in the safetensors project's own benchmark. Zero-copy/lazy loading support.
This is a security product. Loading untrusted pickle files in a security
product is a contradiction. The LiteLLM supply chain attack (CVE-2026-33634,
CVSS 9.4) demonstrated that compromised model files can lead to credential
theft and backdoors.
## Decision
Only load model weights from safetensors format. Never load `.pt`, `.bin`,
or `.pth` files. Apply this policy to both the detector model and the codebook
tensors.
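The policy could be enforced with a small guard that runs before any file is opened. This is a sketch under assumptions: `require_safetensors` is a hypothetical helper, and a suffix check is only a first line of defense that complements, rather than replaces, loading through the safetensors library.

```python
from pathlib import Path

# Pickle-based formats named in the Context section above.
PICKLE_SUFFIXES = {".pt", ".bin", ".pth"}


def require_safetensors(path) -> Path:
    """Return the path if it is a .safetensors file, otherwise raise."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix in PICKLE_SUFFIXES:
        raise ValueError(f"refusing to load pickle-based weights: {path.name}")
    if suffix != ".safetensors":
        raise ValueError(f"unsupported weight format: {path.name}")
    return path


checked = require_safetensors("codebook/basis.safetensors")
```

The same guard would apply to both the detector model files and the codebook tensors, matching the scope of the decision.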
## Consequences
**Positive**:
- Eliminates entire class of supply chain attacks via model files
- Substantially faster model loading on CPU (about 76x in the safetensors project's own benchmark)
- Zero-copy/lazy loading reduces memory usage
- Cross-framework compatible (PyTorch, ONNX, numpy)
- Consistent with HuggingFace's own migration to safetensors-default
**Negative**:
- Some older models only ship `.bin` weights — must convert before use
- Safetensors doesn't support saving optimizer state (irrelevant — we only
do inference)
- Explicit `use_safetensors=True` parameter needed in transformers for older
versions
## References
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 6:
safetensors format comparison
- CVE-2026-33634 — LiteLLM supply chain attack


@@ -0,0 +1,64 @@
# ADR-006: PyTorch as Optional Dependency
## Status
Accepted
## Context
PyTorch is the primary inference backend for the detector model. However,
PyTorch is large:
- `torch` (CPU): ~200MB download, ~700MB installed
- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
- `onnxruntime`: ~30-50MB download, ~300MB installed
Making PyTorch a required dependency would force a 200MB-2.5GB download on
every user, even those who already have PyTorch installed or prefer ONNX
Runtime. This is the standard problem for ML libraries, and the HuggingFace
ecosystem has converged on a solution.
## Decision
Make PyTorch an optional dependency via extras (`pip install
alknet-firewall[torch]`). The base install includes all non-ML dependencies
(sklearn, huggingface-hub, safetensors, tokenizers, numpy). ML inference
backends are installed separately.
Use lazy imports with clear error messages when PyTorch is not installed:
```python
try:
    import torch
except ImportError:
    raise ImportError(
        "PyTorch is required for alknet-firewall inference. "
        "Install with: pip install 'alknet-firewall[torch]' "
        "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
    )
```
## Consequences
**Positive**:
- Base install is ~30MB download, ~100MB installed — very lightweight
- Users with existing PyTorch installations don't re-download
- ONNX Runtime could serve as a smaller-footprint alternative backend (~100MB total) if it is adopted (see OQ-01)
- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
hub all use this pattern)
- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
`[[tool.uv.index]]`
**Negative**:
- More complex dependency specification in pyproject.toml
- Users must read installation docs to choose the right extra
- Runtime import errors if users forget to install a backend
- CPU-only torch requires two-step install or uv configuration (can't be
expressed in pip extras alone)
## References
- [modern-python-project-setup.md](../research/modern-python-project-setup.md) —
Section 2: PyTorch handling
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 1:
PyTorch as dependency


@@ -0,0 +1,53 @@
# ADR-007: Runtime Model Download via HuggingFace Hub
## Status
Accepted
## Context
The detector model (SmolLM2-135M) is ~269MB. This is too large to bundle in a
Python package: PyPI's default limits are 100MB per file and 10GB per project,
so the weights would not fit in a single wheel. Even if it were allowed, a
269MB wheel download is terrible UX.
Options:
- **Bundle in package**: Not feasible due to size constraints
- **Separate package for model**: Possible but awkward, requires users to
install two packages
- **Runtime download via HuggingFace Hub**: Standard approach used by
transformers. Provides caching, authentication, offline mode, and
checksum verification
- **Custom download (S3, etc.)**: Works but reinvents the wheel
## Decision
Download the detector model at runtime via HuggingFace Hub (`snapshot_download`
or `from_pretrained` with automatic caching). Support offline mode via
`HF_HUB_OFFLINE=1` or `local_files_only=True`. Provide a CLI command for
pre-downloading models in air-gapped environments.
Pin model revisions to specific commit hashes for reproducibility.
## Consequences
**Positive**:
- Package stays small (~30MB base install)
- HuggingFace Hub provides automatic caching, deduplication, and checksum
verification
- Offline mode supported via environment variable
- Authentication for gated models via `HF_TOKEN`
- Standard approach — users familiar with transformers will recognize the
pattern
**Negative**:
- First run requires network access and ~269MB download (with progress bar)
- Model availability depends on HuggingFace Hub uptime
- Users in restricted networks need to pre-download models
- Different model versions may produce different detection results — must
pin revisions
## References
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 2:
Model file distribution
- [model.md](../model.md)

@@ -0,0 +1,47 @@
# ADR-008: Three-Level Alarm System
## Status
Accepted
## Context
The firewall needs to communicate detection results to downstream systems. The
design choice is how many alarm levels and what they mean.
Alternatives:
- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
don't warrant blocking but should be flagged. Binary forces a single
threshold that either blocks too much (high false positive) or too little
(high false negative).
- **Numeric-only (0.0-1.0 score)**: Maximum information but requires every
consumer to choose their own threshold. No shared vocabulary for what's
actionable.
- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
pre-inference screening system. The difference between "low" and "medium"
is too subtle for consumers to act on differently.
- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
review. Most practical for automated systems.
## Decision
Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
continuous score (0.0-1.0) for consumers that need fine-grained decisions.
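
As a rough illustration of the decision, the mapping from a continuous score to the three levels and their default actions might look like the sketch below. The threshold values (0.5 and 0.8) and the `level_for` and `ACTIONS` names are hypothetical, chosen only for the example.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


# Hypothetical defaults, for illustration only.
SUSPICIOUS_THRESHOLD = 0.5
DANGEROUS_THRESHOLD = 0.8

# Action vocabulary taken from the ADR: pass, flag, block.
ACTIONS = {
    AlarmLevel.CLEAR: "pass",
    AlarmLevel.SUSPICIOUS: "flag",
    AlarmLevel.DANGEROUS: "block",
}


def level_for(
    score: float,
    suspicious: float = SUSPICIOUS_THRESHOLD,
    dangerous: float = DANGEROUS_THRESHOLD,
) -> AlarmLevel:
    """Map a composite score in [0, 1] to one of the three alarm levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0.0, 1.0]")
    if suspicious > dangerous:
        raise ValueError("suspicious threshold must not exceed dangerous threshold")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR
```

Consumers that need finer control can ignore the level and branch on the score directly.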
## Consequences
**Positive**:
- Clear action mapping: pass, flag, block
- Suspicious level enables defense-in-depth (apply additional checks rather
than binary block/allow)
- Continuous score provides gradient for consumers that need it
- Simple to document and communicate
**Negative**:
- Some consumers may need more granularity (but can use the score field)
- "Suspicious" requires consumers to decide what to do — adds decision burden
## References
- [firewall.md](../firewall.md)

@@ -0,0 +1,55 @@
# ADR-009: Last-Token Activation Extraction
## Status
Accepted
## Context
To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:
- **Last token**: The hidden state at the final position, which has attended
to the entire sequence. Standard for sequence classification with causal
models (GPT-style models naturally aggregate at the last position; BERT
instead pools its first, CLS, token).
- **Mean pooling**: Average hidden states across all positions. Smooths out
position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
(LLaMA architecture) does not use a CLS token.
- **First token**: Has seen only the beginning of the sequence. Misses
context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
position with extreme activation can dominate.
Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because the last position's hidden state has attended to the full
sequence via causal attention. For safety detection, this means the last
token's representation contains the model's "conclusion" about the entire
input.
## Decision
Extract the last token's hidden state at each configured layer. This is
standard for LLaMA-family models and provides full-sequence context.
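
The sketch below shows last-token selection for one sequence in plain Python, assuming hidden states arrive as a list of per-position vectors together with an attention mask. The function names are hypothetical. The mask-based lookup matters only when inputs are right-padded (for example in a future batch API); for a single unpadded input it reduces to taking position `-1`.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Return the position of the last real (non-padding) token."""
    for pos in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[pos] == 1:
            return pos
    raise ValueError("sequence contains no real tokens")


def last_token_state(
    layer_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Pick the hidden state at the last real token for one layer.

    layer_states has shape [seq_len][hidden_dim] for a single sequence.
    """
    if len(layer_states) != len(attention_mask):
        raise ValueError("hidden states and attention mask differ in length")
    return list(layer_states[last_token_index(attention_mask)])
```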
## Consequences
**Positive**:
- Standard approach for autoregressive models — well-validated
- Full sequence context via causal attention
- Single vector per layer — simple to project and score
- No padding sensitivity for single, unpadded inputs (unlike mean pooling,
which needs attention masks); batched inputs would need left padding or
mask-based indexing
**Negative**:
- Position-dependent — the last token's representation is influenced by its
position in the sequence, not just its content
- Very short inputs (1-2 tokens) may not have enough context for meaningful
activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
middle rather than the end
## References
- [model.md](../model.md)
- [codebook.md](../codebook.md)

@@ -0,0 +1,64 @@
# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling
## Status
Accepted
## Context
After projecting activations onto SVD dimensions, the firewall needs to score
how "normal" or "anomalous" a projection is relative to the distribution of
normal inputs. This requires modeling the probability density of normal inputs
along each dimension.
Alternatives:
- **Gaussian**: Simple, well-understood. But real behavioral distributions are
often skewed, multimodal, or heavy-tailed. Gaussian assumes symmetry.
- **Kernel Density Estimation (KDE)**: Non-parametric, flexible. But
bandwidth selection is tricky, and KDE doesn't provide a parametric form for
efficient storage and fast evaluation.
- **Mixture of Gaussians**: More flexible than single Gaussian. But requires
choosing the number of components and risks overfitting.
- **Empirical CDF**: Non-parametric, no assumptions. But requires storing all
calibration data points — not compact.
- **Monotonic spline distributions**: Parametric CDF modeled as a monotonic
spline. Compact (handful of knots), smooth, tail-sensitive, and
differentiable. The CDF is naturally monotonic, which enforces a valid
probability distribution.
## Decision
Use monotonic spline distributions to model behavioral regions along each SVD
dimension. The CDF is represented as a monotonic cubic spline with a small
number of knots (typically 10-20 per dimension). Tail behavior uses
exponential decay beyond the observed range.
The scoring function computes how far a projection falls in the tail of the
distribution — projections well within the normal region score low (CLEAR),
projections near or beyond the tail score increasingly high.
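
To make the idea concrete, here is a self-contained sketch of a monotone cubic CDF with exponential tails. It assumes PCHIP-style (Fritsch-Carlson) slopes to keep the cubic monotone, a single hypothetical `tail_scale` parameter for both tails, and a two-sided tail score; none of these choices is confirmed by the PoC code, so treat them as placeholders.

```python
from __future__ import annotations

import bisect
import math
from typing import Sequence


def monotone_slopes(xs: Sequence[float], ys: Sequence[float]) -> list[float]:
    """Knot slopes that keep a cubic Hermite interpolant monotone (PCHIP-style)."""
    h = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
    delta = [(ys[i + 1] - ys[i]) / h[i] for i in range(len(h))]
    slopes = [0.0] * len(xs)
    slopes[0], slopes[-1] = delta[0], delta[-1]
    for k in range(1, len(xs) - 1):
        if delta[k - 1] > 0 and delta[k] > 0:
            # Weighted harmonic mean of the neighbouring secant slopes.
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            slopes[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
    return slopes


def spline_cdf(x, xs, ys, slopes, tail_scale=1.0):
    """Evaluate the CDF: cubic Hermite inside the knots, exponential tails outside."""
    if x <= xs[0]:
        return ys[0] * math.exp(-(xs[0] - x) / tail_scale)
    if x >= xs[-1]:
        return 1.0 - (1.0 - ys[-1]) * math.exp(-(x - xs[-1]) / tail_scale)
    k = bisect.bisect_right(xs, x) - 1
    h = xs[k + 1] - xs[k]
    t = (x - xs[k]) / h
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return (h00 * ys[k] + h10 * h * slopes[k]
            + h01 * ys[k + 1] + h11 * h * slopes[k + 1])


def anomaly_score(x, xs, ys, slopes, tail_scale=1.0):
    """Two-sided tail score: 0 at the median, approaching 1 deep in either tail."""
    f = spline_cdf(x, xs, ys, slopes, tail_scale)
    return 1.0 - 2.0 * min(f, 1.0 - f)
```

With knots at, say, the 2nd, 16th, 50th, 84th and 98th percentiles of the calibration data, a projection at the median scores 0 and the score rises smoothly toward 1 as the projection moves past the outermost knots.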
## Consequences
**Positive**:
- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
cliff-edge behavior
- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
inputs without flagging the bulk of normal inputs
- **Parametric compactness**: A handful of spline knots (10-20) represent the
full distribution shape. Very small storage footprint.
- **Differentiability**: Scores are differentiable — potential for future
adversarial training or gradient-based analysis
- **No distributional assumptions**: Unlike Gaussian, spline distributions
handle skew, heavy tails, and non-standard shapes
**Negative**:
- More complex than Gaussian — requires spline fitting during codebook
compilation
- Spline knot selection affects scoring quality — poor knot placement can
miss important distribution features
- Less familiar to most ML practitioners than Gaussian or KDE
## References
- [codebook.md](../codebook.md)
- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)

@@ -0,0 +1,200 @@
---
status: draft
last_updated: 2026-06-13
---
# Firewall
The core firewall component: the public API for screening untrusted inputs and
producing behavioral alarms.
## What It Is
The Firewall is the primary entry point for alknet-firewall. It receives
untrusted text input, runs it through the detector model, extracts behavioral
signals from hidden state activations, and produces a structured alarm
indicating whether the input exhibits adversarial behavioral patterns.
## Why It Exists
LLM-based systems need a fast, pre-inference screening mechanism that catches
adversarial inputs *before* they reach the target model. Text-surface
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
detection catches what text hides — adversarial inputs produce anomalous
activation patterns regardless of their surface form (ADR-002).
## Data Flow
```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids

3. Detector Model Inference
   model(input_ids) → hidden_states at key layers

4. Activation Extraction
   Extract hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = svd_basis @ activation_vector

6. Codebook Comparison
   For each SVD dimension:
   - Compute distance from normal behavioral region
   - Apply spline scoring (monotonic distribution)
   - Aggregate multi-dimensional signals

7. Alarm Generation
   Combine per-dimension signals → overall alarm
   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
   Include per-dimension breakdown for interpretability
```
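
Step 5 is a matrix-vector product. The sketch below spells it out in plain Python so it runs without dependencies; a real implementation would presumably use NumPy. The optional `mean` argument (centering the activation before projection) is an assumption, since the flow above only shows `svd_basis @ activation_vector`.

```python
from __future__ import annotations

from typing import Sequence


def project(
    activation: Sequence[float],
    basis: Sequence[Sequence[float]],
    mean: Sequence[float] | None = None,
) -> list[float]:
    """Project one activation vector onto the rows of an SVD basis.

    basis has shape [n_dims][hidden_dim]; the result has n_dims coordinates.
    """
    if mean is None:
        mean = [0.0] * len(activation)
    if len(mean) != len(activation):
        raise ValueError("mean and activation differ in length")
    centered = [a - m for a, m in zip(activation, mean)]
    coords = []
    for row in basis:
        if len(row) != len(centered):
            raise ValueError("basis row and activation differ in length")
        coords.append(sum(b * c for b, c in zip(row, centered)))
    return coords
```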
## Key Concepts
### Behavioral Alarm
Not a simple safe/unsafe binary. A behavioral alarm contains:
- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous 0.0-1.0 composite score
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much
This multi-signal approach reflects that safety is multi-dimensional in
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
that simultaneously shifts the refusal direction while activating role-playing
dimensions is more suspicious than one that shifts only one dimension.
### Score Composition
The overall `Alarm.score` (0.0-1.0) is computed from per-dimension signals
using a weighted maximum:
```
score = max(w_d * signal_d for d in dimensions)
```
Where `w_d` are dimension weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
single strongly anomalous dimension can trigger an alarm even if other
dimensions are normal. This is critical for catching attacks that exploit
specific behavioral patterns (e.g., refusal-suppression) while leaving other
dimensions unaffected.
The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.
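
A small sketch of the weighted maximum, contrasted with a mean, under the assumption that signals and weights are keyed by dimension index. The clamp to 1.0 is an added safeguard for weights above 1 and is not stated in the spec.

```python
from __future__ import annotations


def composite_score(
    signals: dict[int, float],
    weights: dict[int, float] | None = None,
) -> float:
    """Weighted maximum over per-dimension signals (missing weights default to 1.0)."""
    if not signals:
        raise ValueError("at least one dimension signal is required")
    weights = weights or {}
    raw = max(weights.get(dim, 1.0) * value for dim, value in signals.items())
    return min(1.0, raw)  # assumption: keep the result inside [0, 1]


def mean_score(signals: dict[int, float]) -> float:
    """Plain mean, shown only to contrast with the weighted maximum."""
    return sum(signals.values()) / len(signals)
```

For signals `{0: 0.05, 1: 0.10, 2: 0.90}` the weighted maximum is 0.90 while the mean is 0.35, so a single strongly anomalous dimension is not diluted by the quiet ones.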
### Alarm Levels
| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in at least one dimension | Block input or apply strong mitigations |
### Latency Budget
The firewall must complete screening in <10ms on commodity hardware
(ADR-003). This budget breaks down approximately:
| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Codebook comparison | ~0.3ms |
| **Total** | **~6ms** |
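
One way to check the budget empirically is a small timing harness like the sketch below. The function name, the warm-up count, and the nearest-rank percentile are assumptions; `fn` stands in for `Firewall.screen` or any single pipeline step.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable

BUDGET_MS = 10.0  # the <10ms target from ADR-003


def measure_latency_ms(
    fn: Callable[[str], object],
    inputs: Iterable[str],
    warmup: int = 3,
) -> Dict[str, float]:
    """Time fn on each input and report p50/p95/max in milliseconds."""
    texts = list(inputs)
    if not texts:
        raise ValueError("need at least one input")
    for text in texts[:warmup]:
        fn(text)  # warm caches and lazy initialisation before timing
    timings = []
    for text in texts:
        start = time.perf_counter()
        fn(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_rank = math.ceil(0.95 * len(timings))  # nearest-rank percentile
    return {
        "p50": statistics.median(timings),
        "p95": timings[p95_rank - 1],
        "max": timings[-1],
    }
```

Comparing `p95` rather than the mean against `BUDGET_MS` gives a more honest picture, since tokenization and inference times vary with input length.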
## Interfaces
### Public API
```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# Placeholder; the real value is a pinned SmolLM2-135M commit hash.
DEFAULT_MODEL_REVISION = "<pinned-commit>"


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    dimension: int
    deviation: float
    score: float
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,  # Thresholds: see configuration.md
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```
> `screen_batch` is Phase 2 (see overview.md scope).
### Constraints
1. **No network calls during screening** — the model is lazily loaded on
first `screen()` call or via explicit `preload()`. Download never happens at
import time. Once loaded, screening is entirely local.
2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency** — the firewall has no access to the target
LLM's internals. It runs its own detector model.
4. **Reproducible** — Same input + same model + same codebook = same alarm.
Pin model revision and codebook version.
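
The `input_hash` field of `Alarm` can be computed with the standard library alone. A minimal sketch, assuming the hash is taken over the UTF-8 bytes of the raw string and rendered as lowercase hex (the spec says only "SHA-256 of raw input string"):

```python
import hashlib


def input_hash(raw_input: str) -> str:
    """SHA-256 hex digest of the raw input, usable as a logging or dedup key."""
    return hashlib.sha256(raw_input.encode("utf-8")).hexdigest()
```

Because the digest depends only on the input bytes, two alarms with the same `input_hash`, `model_id`, and codebook version should be identical under the reproducibility constraint, which makes the hash a convenient key for regression tests.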
## Error Handling
| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | — | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |
All library-defined exception types subclass `AlknetFirewallError` (base
library exception). The built-in `ValueError` and `MemoryError` are raised or
propagated as-is.
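
The table can be sketched as a small exception hierarchy plus two input checks. The class names come from the table; `validate_input`, `truncate_ids`, and the choice to keep the leading tokens on truncation are hypothetical.

```python
import warnings
from typing import List


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError): ...
class ModelNotLoadedError(AlknetFirewallError): ...
class CodebookCorruptedError(AlknetFirewallError): ...
class CodebookMismatchError(AlknetFirewallError): ...


def validate_input(text: str) -> str:
    """Reject empty strings and strings that cannot be encoded as UTF-8."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:  # e.g. lone surrogate code points
        raise ValueError("input cannot be encoded as UTF-8") from exc
    return text


def truncate_ids(input_ids: List[int], max_length: int) -> List[int]:
    """Truncate to the model's maximum length, warning when tokens are dropped."""
    if len(input_ids) > max_length:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_length} tokens",
            UserWarning,
            stacklevel=2,
        )
        return input_ids[:max_length]  # assumption: keep the leading tokens
    return input_ids
```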
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)

docs/architecture/model.md Normal file
@@ -0,0 +1,161 @@
---
status: draft
last_updated: 2026-06-13
---
# Model
The model component manages detector model loading, inference, and activation
extraction. It is the interface between the firewall and the language model
that provides behavioral signals.
## What It Is
The model component loads a small language model (default: SmolLM2-135M),
runs inference on untrusted inputs, and extracts hidden state activations at
configured layers. It is model-agnostic — any transformer model with
accessible hidden states can serve as a detector.
## Why It Exists
The firewall needs model activations (hidden states) to detect behavioral
patterns. This component encapsulates the complexity of model loading,
inference, and activation extraction behind a clean interface that the
codebook and firewall can consume without knowing model-specific details.
The model-agnostic design (ADR-003) means the firewall is not tied to a
specific detector model. Switching from SmolLM2-135M to another ~100M model
requires recomputing the SVD basis and rebuilding the codebook, but no
changes to the firewall logic.
## Key Concepts
### Activation Extraction
The core operation: running the model on an input and capturing hidden state
representations at specific layers.
```python
# Conceptual
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
}
```
Key decisions:
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (30-layer model).
Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral
patterns. Layers 3, 6, 7 are omitted to reduce dimensionality — their
signals are highly correlated with the selected layers.
- **Which token**: The last token's hidden state carries the model's
"conclusion" about the full input sequence (ADR-009). This is the standard
choice for autoregressive (LLaMA-family) models.
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
(576 for SmolLM2-135M).
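
When indexing `outputs.hidden_states` from HuggingFace transformers, the tuple has `n_layers + 1` entries: index 0 is the embedding output and index `i` is the output of layer `i`. A small guard like the sketch below (the function name is hypothetical) can catch off-by-one layer configurations early.

```python
from typing import Iterable, List


def validate_extraction_layers(layers: Iterable[int], n_layers: int) -> List[int]:
    """Check layer indices against a hidden_states tuple of length n_layers + 1."""
    seen = set()
    for idx in layers:
        if not 0 <= idx <= n_layers:
            raise ValueError(f"layer index {idx} outside 0..{n_layers}")
        if idx in seen:
            raise ValueError(f"duplicate layer index {idx}")
        seen.add(idx)
    return sorted(seen)
```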
### Model-Agnostic Interface
The model component exposes a generic interface that works with any
transformer model:
```python
from typing import Protocol

import numpy as np

class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```
The `infer` method returns hidden states at key layers, abstracting away
whether the backend is PyTorch, ONNX Runtime, or a future Rust inference
engine.
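
One practical payoff of the protocol is that tests can swap in a fake backend. The sketch below is a deterministic stand-in, not a real model: `FakeDetector` is hypothetical, and it returns plain lists of floats instead of `np.ndarray` so the example has no third-party dependencies.

```python
from __future__ import annotations

import hashlib
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, Sequence[float]]: ...


class FakeDetector:
    """Deterministic pseudo-activations derived from a hash of the input ids."""

    def __init__(self, hidden_dim: int = 4, n_layers: int = 2, layers=(1, 2)):
        self.model_id = "fake/detector"
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.layers = tuple(layers)
        self._loaded = False

    def load(self, device: str = "cpu") -> None:
        self._loaded = True  # nothing to load; mirrors the real lifecycle

    def infer(self, input_ids: list[int]) -> dict[int, list[float]]:
        if not self._loaded:
            raise RuntimeError("call load() before infer()")
        out = {}
        for layer in self.layers:
            digest = hashlib.sha256(repr((layer, input_ids)).encode()).digest()
            out[layer] = [byte / 255.0 for byte in digest[: self.hidden_dim]]
        return out
```

Because `FakeDetector` satisfies the protocol structurally, firewall and codebook logic can be exercised in unit tests without downloading a model.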
### Lazy Loading
The model is loaded on first use or explicit preload — not at import time.
This keeps the library import fast (~milliseconds) even when torch is
installed.
```python
firewall = Firewall() # Does NOT load model yet
firewall.preload() # Explicit: download + load model
alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
```
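
Internally, lazy loading of this kind is often implemented with a guarded, load-once wrapper. The sketch below is one possible shape, assuming a thread lock is wanted so that concurrent first calls trigger only one load; `LazyResource` is a hypothetical helper, not a specified class.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Run the loader at most once, on first access."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    @property
    def is_loaded(self) -> bool:
        return self._loaded

    def get(self) -> T:
        if not self._loaded:
            with self._lock:
                if not self._loaded:  # re-check after acquiring the lock
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```

With this shape, `preload()` would simply call `get()` and discard the result, while `screen()` calls `get()` on every invocation and pays the load cost only once.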
### Offline Support
The model component respects `HF_HUB_OFFLINE` and `local_files_only` flags.
In air-gapped environments, models must be pre-downloaded. The library
provides a CLI command for this:
```bash
python -m alknet_firewall download
```
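
A possible skeleton for that entry point, using `argparse`. Only the `download` subcommand comes from the text above; the `--model-id`, `--revision`, and `--cache-dir` flags are hypothetical and shown purely to illustrate the shape.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for `python -m alknet_firewall <command>`."""
    parser = argparse.ArgumentParser(prog="alknet_firewall")
    commands = parser.add_subparsers(dest="command", required=True)

    download = commands.add_parser(
        "download", help="pre-download the detector model into the local cache"
    )
    # Hypothetical flags; defaults mirror the Firewall constructor.
    download.add_argument("--model-id", default="HuggingFaceTB/SmolLM2-135M")
    download.add_argument("--revision", default=None)
    download.add_argument("--cache-dir", default=None)
    return parser
```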
## Interfaces
### Public API
```python
from __future__ import annotations

from typing import ClassVar, Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...
```
### Constraints
1. **safetensors-only** — Model weights are loaded exclusively from
safetensors format. Pickle-based `.pt`/`.bin` files are never loaded
(ADR-005). This is a security requirement for a security product.
2. **Model pinning** — Model revision must be pinned for reproducibility.
Default revision is a specific commit hash, not `"main"`.
3. **CPU-first** — Default device is CPU. GPU inference is supported but not
required. The <10ms latency target is achievable on CPU with a 135M model.
4. **No training** — The detector model is inference-only. No gradients are
computed. No model weights are modified at runtime.
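
Constraint 1 can be enforced with a filename filter before anything is opened. The sketch below assumes weight files are identified by suffix; the suffix lists and the function name are illustrative. With the transformers backend, passing `use_safetensors=True` to `from_pretrained` is a complementary guard.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

SAFE_SUFFIXES = {".safetensors"}
# Assumption: common pickle-based checkpoint suffixes to refuse.
PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".ckpt", ".pkl", ".pickle"}


def select_weight_files(filenames: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split repository files into loadable safetensors and rejected pickle weights."""
    safe, rejected = [], []
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in SAFE_SUFFIXES:
            safe.append(name)
        elif suffix in PICKLE_SUFFIXES:
            rejected.append(name)
    if not safe:
        raise ValueError("no safetensors weights found; refusing to load")
    return safe, rejected
```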
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | ~135M params, <10ms, CPU-deployable |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)

@@ -0,0 +1,129 @@
# Open Questions
Centralized tracker for unresolved questions across all architecture documents.
## Theme: Inference Backend
### OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?
- **Origin**: [model.md](model.md), [overview.md](overview.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Cross-references**: ADR-006
ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
for PyTorch) and is well-suited for inference-only use. HuggingFace's `optimum`
library provides drop-in replacement classes. However, supporting it in Phase 1
adds complexity: model must be exported to ONNX format, `optimum` integration
must be tested, and the activation extraction API may differ from PyTorch.
Consider: Is the smaller footprint worth the integration complexity in Phase 1,
or should ONNX support wait until Phase 2 when the core API is stable?
---
## Theme: Codebook Design
### OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?
- **Origin**: [codebook.md](codebook.md)
- **Status**: open
- **Priority**: high
- **Resolution**: (pending)
- **Cross-references**: ADR-004
The PoC codebook is 1,245 lines — much of it may be boilerplate, dead code,
or excessive parameterization from the research phase. Understanding what's
essential vs. exploratory is critical for the initial extraction. The codebook
training pipeline (`run_manifold_projection.py`) should also be analyzed.
Consider: How many SVD dimensions are actually needed? What's the minimum
calibration dataset? Can spline distributions be simplified?
---
## Theme: API Design
### OQ-03: Should the firewall support streaming/chunked input screening?
- **Origin**: [firewall.md](firewall.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-003
Some inputs arrive in chunks (streaming API responses, large documents). Should
the firewall support incremental screening as chunks arrive, or require the
full input before screening? Incremental screening could detect attacks earlier
but requires buffering and state management.
This is low priority for Phase 1 but affects the internal API design.
---
### OQ-04: Should detection thresholds be per-model or globally configurable?
- **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Cross-references**: ADR-003, ADR-004
Different detector models may produce different score distributions. Thresholds
that work for SmolLM2-135M may not work for a different model. Should
thresholds be tied to the codebook (per-model) or set globally by the user?
Consider: Per-model defaults with user overrides? Codebook ships with
recommended thresholds that the user can adjust?
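
One possible shape for "per-model defaults with user overrides", sketched purely to make the option concrete. This `Thresholds` dataclass is a simplified stand-in with hypothetical `suspicious` and `dangerous` fields and default values; the question itself remains open.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    suspicious: float = 0.5
    dangerous: float = 0.8

    def __post_init__(self) -> None:
        if not 0.0 <= self.suspicious <= self.dangerous <= 1.0:
            raise ValueError("require 0 <= suspicious <= dangerous <= 1")


def resolve_thresholds(codebook_defaults: Thresholds, **user_overrides: float) -> Thresholds:
    """Start from the codebook's recommended values and apply explicit user overrides."""
    return replace(codebook_defaults, **user_overrides)
```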
---
## Theme: Integration
### OQ-05: How should the firewall integrate with existing guardrail systems?
- **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Cross-references**: ADR-002
The behavioral firewall is complementary to text-surface defenses. Users may
want to run both Llama Guard (text classification) and alknet-firewall
(behavioral signals) in series. How should these be composed?
Consider: Integration adapters? A common interface? Callback hooks? Or is
composition the user's responsibility and we just provide a clean standalone API?
---
## Theme: Project Setup
### OQ-06: Should file-based configuration use TOML or YAML?
- **Origin**: [configuration.md](configuration.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: None
Phase 1 uses constructor-based configuration only. A future phase may add
file-based configuration for easier deployment. TOML is consistent with
Python packaging (pyproject.toml) and increasingly the standard for Python
config. YAML is more familiar in ops/ML contexts. Either works.
---
### OQ-07: Is a Rust port feasible given current ML framework maturity?
- **Origin**: [overview.md](overview.md), ADR-001
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-001
A Rust port using burn/cubecl was attempted during the PoC phase and failed.
The ML framework ecosystem in Rust is not yet mature enough for this type
of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
matures or alternative Rust ML frameworks emerge.

@@ -0,0 +1,208 @@
---
status: draft
last_updated: 2026-06-13
---
# Overview
## Vision
A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals —
patterns in hidden state activations from a small language model — to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.
This project is open source under the MIT license.
## Why This Exists
LLMs process instructions and data in the same token stream. They cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection — the #1 LLM vulnerability per
OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50%
of the time with just 10 attempts (International AI Safety Report 2026).
Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. Adversarial inputs that look natural to
text classifiers still produce distinctive activation patterns when a model
processes them.
Academic research validates this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral signals
through non-semantic hidden signals
See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.
## Scope
### In Scope
- **Phase 1**: Core behavioral firewall library
- Input screening via small model activation analysis
- SVD-based anomaly detection with configurable thresholds
- Model-agnostic detector (works with any compatible small model)
- SmolLM2-135M as the default detector model
- Multi-dimensional behavioral alarms (not just safe/unsafe)
- PyTorch inference backend (optional dependency)
- Runtime model download and caching via HuggingFace Hub
- safetensors-only model loading (security requirement)
- Synchronous API for single-input screening
- Interpretable detection signals (SVD direction analysis)
- **Phase 2**: Integration and operational hardening
- ONNX Runtime inference backend
- Async/batch screening API
- Integration adapters for LlamaFirewall, NeMo Guardrails
- Metrics and observability
- Codebook training pipeline (`run_manifold_projection.py` extraction)
- **Phase 3**: Advanced capabilities
- Multi-turn attack detection (payload splitting)
- Streaming input screening
- Custom model fine-tuning for domain-specific detection
- Rust port via burn/cubecl (speculative, requires R&D)
### Out of Scope
- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs
## Architecture
```
                      ┌──────────────────────────────────────────┐
                      │  alknet-firewall (Python library)        │
                      │                                          │
Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐  │
    (text)            │  │ screen(input) → Alarm              │  │
                      │  │ ├─ Tokenize input                  │  │
                      │  │ ├─ Run detector model              │  │
                      │  │ ├─ Extract hidden state activations│  │
                      │  │ ├─ Project onto SVD basis          │  │
                      │  │ ├─ Compare against codebook        │  │
                      │  │ └─ Return behavioral alarm         │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Model Manager ────────────────────┐  │
                      │  │ Load model (HF Hub download/cache) │  │
                      │  │ Extract activations at key layers  │  │
                      │  │ Model-agnostic interface           │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Codebook ─────────────────────────┐  │
                      │  │ SVD basis vectors (compiled)       │  │
                      │  │ Detection thresholds per dimension │  │
                      │  │ Behavioral region boundaries       │  │
                      │  │ Spline distributions for scoring   │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Configuration ────────────────────┐  │
                      │  │ Model selection & revision pinning │  │
                      │  │ Detection thresholds               │  │
                      │  │ Alarm severity levels              │  │
                      │  └────────────────────────────────────┘  │
                      └──────────────────────────────────────────┘
                                    ┌──────┴──────┐
                                    │             │
                              HF Hub Cache  Detector Model
                               (~/.cache/)  (SmolLM2-135M)
```
## Package Dependencies
### Core (Required)
| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |
### Optional (Extras)
| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |
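
Optional extras are usually paired with a lazy import that fails with an actionable message. A minimal sketch, assuming the distribution is named `alknet-firewall` and the extras are named as in the table; `require_optional` is a hypothetical helper.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, pointing at the matching extra on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        ) from exc
```

Calling `require_optional("torch", "torch")` inside the model loader, rather than at module import time, keeps `import alknet_firewall` working on installs without the extra.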
### Development (Not Published)
| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |
## Exports
This is a Python library. Public API surface:
```python
from alknet_firewall import Firewall, Alarm, AlarmLevel
# Core screening
firewall = Firewall() # loads default codebook; model loads lazily on first screen()
alarm: Alarm = firewall.screen("untrusted input text")
# Alarm properties
alarm.level # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score # float, 0.0-1.0
alarm.signals # list[DimensionSignal] — per-dimension behavioral signals
alarm.dimensions # SVD dimension analysis
```
See [firewall.md](firewall.md) for the full API specification.
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~135M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files — security product must be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
## Dependencies on Other Projects
- **metaspline**: The core detection logic (codebook, spline distributions,
SVD projection, space transforms) is adapted from the metaspline research
project. The PoC validated the behavioral signal approach; this project
extracts and productionizes ~1,745 lines of the working subset.
- **reverse-proxy**: The architecture documentation structure and SDD process
are adapted from the @alkdev/reverse-proxy project. The documentation
conventions, ADR format, and open questions tracking are reused directly.
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)