feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
71
docs/architecture/README.md
Normal file
@@ -0,0 +1,71 @@
|
||||
---
status: draft
last_updated: 2026-06-13
---
|
||||
|
||||
# alknet-firewall — Architecture
|
||||
|
||||
## Current State
|
||||
|
||||
**Phase 0→1 (Exploration → Architecture)**: The project has a working PoC
demonstrating that behavioral signals from small language models can detect
adversarial inputs. The core detection logic (~1,745 lines) works reasonably
well but lacks tests, has an oversized codebook, and needs to be extracted from
the research codebase into a properly structured Python package.
|
||||
|
||||
This project extracts and productionizes the behavioral signal detection
approach from the metaspline research project. A small language model
(SmolLM2-135M, roughly 135M parameters) processes untrusted inputs and
produces hidden state activations. SVD-based dimensionality reduction on these
activations reveals behavioral patterns: normal inputs cluster in expected
regions, while adversarial inputs produce anomalous activation signatures. The
system raises "behavioral alarms" without needing to know specific attack types.
|
||||
|
||||
## Architecture Documents
|
||||
|
||||
| Document | Status | Description |
|----------|--------|-------------|
| [overview.md](overview.md) | Draft | Vision, scope, package structure, dependencies |
| [firewall.md](firewall.md) | Draft | Core firewall API, input screening, alarm protocol |
| [codebook.md](codebook.md) | Draft | SVD basis, detection parameters, codebook compilation |
| [model.md](model.md) | Draft | Model loading, activation extraction, model-agnostic design |
| [configuration.md](configuration.md) | Draft | Thresholds, model selection, detection tuning |
| [open-questions.md](open-questions.md) | Active | Unresolved questions tracker with OQ-IDs |
|
||||
|
||||
## ADR Table
|
||||
|
||||
| ADR | Title | Status |
|-----|-------|--------|
| [001](decisions/001-python-uv.md) | Python with uv | Accepted |
| [002](decisions/002-behavioral-signals.md) | Behavioral Signal Detection (Not Text Classification) | Accepted |
| [003](decisions/003-small-model-detector.md) | Small Model (~125M) as Detector | Accepted |
| [004](decisions/004-svd-based-detection.md) | SVD-Based Anomaly Detection | Accepted |
| [005](decisions/005-safetensors-only.md) | Safetensors-Only Model Loading | Accepted |
| [006](decisions/006-optional-pytorch.md) | PyTorch as Optional Dependency | Accepted |
| [007](decisions/007-runtime-model-download.md) | Runtime Model Download via HuggingFace Hub | Accepted |
| [008](decisions/008-three-level-alarm.md) | Three-Level Alarm System | Accepted |
| [009](decisions/009-last-token-extraction.md) | Last-Token Activation Extraction | Accepted |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic Spline Distributions | Accepted |
|
||||
|
||||
## Open Questions
|
||||
|
||||
See [open-questions.md](open-questions.md) for the full tracker.
|
||||
|
||||
| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
| OQ-02 | What is the minimum viable codebook: can the 1,245-line codebook be compressed? | high | open |
| OQ-03 | Should the firewall support streaming/chunked input screening? | low | open |
| OQ-04 | Should detection thresholds be per-model or globally configurable? | medium | open |
| OQ-05 | How should the firewall integrate with existing guardrail systems (LlamaFirewall, NeMo)? | medium | open |
| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
| OQ-07 | Is a Rust port feasible given current ML framework maturity? | low | open |
|
||||
|
||||
## Document Lifecycle
|
||||
|
||||
| Status | Meaning | Transitions |
|--------|---------|-------------|
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |
|
||||
248
docs/architecture/codebook.md
Normal file
@@ -0,0 +1,248 @@
|
||||
---
status: draft
last_updated: 2026-06-13
---
|
||||
|
||||
# Codebook
|
||||
|
||||
The codebook contains the compiled detection parameters — SVD basis vectors,
|
||||
behavioral region boundaries, and scoring distributions — that the firewall
|
||||
uses to detect adversarial inputs.
|
||||
|
||||
## What It Is
|
||||
|
||||
The codebook is the "compiled detector" — the precomputed parameters that
|
||||
transform raw model activations into behavioral alarm signals. It is to the
|
||||
firewall what a trained model is to a classifier: the result of an offline
|
||||
compilation step that produces the runtime detection parameters.
|
||||
|
||||
The name "codebook" comes from vector quantization terminology: it defines a
|
||||
set of reference points (codewords) in activation space that represent known
|
||||
behavioral patterns. New inputs are compared against these reference patterns.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
Running full SVD decomposition and distribution fitting on every input would be
|
||||
prohibitively expensive. The codebook precomputes these offline:
|
||||
|
||||
- **SVD basis**: The principal directions in activation space that capture
  safety-relevant behavioral variance. Computed once from a calibration
  dataset.
- **Behavioral regions**: The expected distribution of normal inputs along each
  SVD dimension. Defined by fitted spline distributions.
- **Thresholds**: Decision boundaries for alarm levels along each dimension.
|
||||
|
||||
At runtime, the firewall only needs to project new activations onto the
precomputed basis and compare against the precomputed regions. Per layer, that
is one k × d matrix-vector product (O(k·d), where k is the number of retained
dimensions and d is the hidden dimension) followed by k spline evaluations.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### z-Coordinates
|
||||
|
||||
The projection of an activation vector onto the SVD basis. Computed as:
|
||||
|
||||
```
z = V^T @ (activation - mean)
```
|
||||
|
||||
Where `V` is the SVD right-singular matrix (basis vectors) and `mean` is the
|
||||
mean activation from the calibration dataset. The centering step is critical
|
||||
— without it, projections are offset by the mean and thresholds would be
|
||||
incorrect.
|
||||
|
||||
z-coordinates are raw (unnormalized) projections. The codebook's spline
|
||||
distributions are calibrated for this scale, so threshold values in the
|
||||
codebook are specific to the z-coordinate range of the calibration data.
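
The projection above can be sketched in NumPy as follows. This is a minimal illustration for a single layer, assuming the basis is stored with one right-singular vector per row (shape `(k, hidden_dim)`); the function name `project` and the toy values are assumptions made for the example, not the PoC's code.

```python
import numpy as np


def project(activation: np.ndarray, basis: np.ndarray, mean: np.ndarray) -> np.ndarray:
    """Return z = V^T @ (activation - mean) for one layer.

    basis holds the top-k right-singular vectors as rows, shape (k, hidden_dim),
    so multiplying by it is the same as multiplying by V^T.
    """
    return basis @ (activation - mean)


# Toy example with hidden_dim = 4 and k = 2 (illustrative values only).
basis = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
mean = np.array([0.5, 0.5, 0.5, 0.5])
activation = np.array([1.5, 0.0, 2.0, 0.5])

z = project(activation, basis, mean)
# Skipping the centering step shifts every coordinate by basis @ mean.
z_uncentered = basis @ activation
```

The difference `z_uncentered - z` equals `basis @ mean`, which is the offset that would invalidate thresholds calibrated on centered data.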
|
||||
|
||||
### SVD Basis
|
||||
|
||||
Singular Value Decomposition of the activation space from a calibration dataset
|
||||
reveals the principal components (directions) that capture the most variance.
|
||||
The top-k components form the basis that the codebook uses for projection.
|
||||
|
||||
Key properties:
- **Interpretable**: Each direction can be inspected for what behavioral
  pattern it represents (refusal, role-playing, hypothetical narrative, etc.)
- **Efficient**: After decomposition, projection is a matrix multiply
- **Stable**: SVD basis is deterministic for a given calibration dataset
- **Model-specific**: The basis is computed for a specific model architecture
  and weights. Changing the detector model requires recomputing the basis
|
||||
|
||||
The SVD basis is computed by the codebook training pipeline
|
||||
(`run_manifold_projection.py` in the PoC) and stored as part of the codebook.
|
||||
|
||||
### Behavioral Regions
|
||||
|
||||
For each SVD dimension, the codebook defines the expected distribution of
|
||||
normal (non-adversarial) inputs. This is modeled as a monotonic spline
|
||||
distribution that captures the shape of the behavioral region along that
|
||||
dimension.
|
||||
|
||||
Inputs whose projections fall within the normal region score low (CLEAR).
|
||||
Inputs whose projections fall near or beyond the region boundary score
|
||||
increasingly high (SUSPICIOUS → DANGEROUS).
|
||||
|
||||
### Spline Distributions
|
||||
|
||||
Monotonic spline distributions model the distribution of normal inputs along
each SVD dimension (ADR-010). They provide:

- **Smooth scoring**: Continuous score rather than hard threshold
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
- **Parametric compactness**: A handful of spline knots represent the full
  distribution shape
- **Differentiability**: Scores are differentiable for potential future use in
  adversarial training
|
||||
|
||||
The spline distribution approach is adapted from the metaspline PoC
|
||||
(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
|
||||
|
||||
**Formal definition**: The CDF along each dimension is modeled as a monotonic
cubic spline with 10–20 knots. Knot positions are set at quantiles of the
calibration data, so knots are concentrated where the data is dense. Beyond
the extreme knots, the CDF approaches its limits (0 below the lowest knot, 1
above the highest) exponentially, at a rate fitted to the tail data. The
scoring function maps a z-coordinate to a score in [0, 1] via the CDF's
complement: `score = 1 - cdf(z)`.
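
To make the scoring rule concrete, here is a simplified sketch. It is not the PoC's `SplineDistribution`: piecewise-linear interpolation (`np.interp`) stands in for the monotonic cubic spline, and the tail decay rate is a fixed assumed value rather than one fitted to tail data. The names `fit_cdf`, `cdf`, and `score` are hypothetical.

```python
import numpy as np


def fit_cdf(samples: np.ndarray, n_knots: int = 15) -> tuple[np.ndarray, np.ndarray]:
    """Place knots at evenly spaced quantiles of the calibration samples."""
    probs = np.linspace(0.01, 0.99, n_knots)
    knots = np.quantile(samples, probs)
    return knots, probs


def cdf(z: float, knots: np.ndarray, probs: np.ndarray, tail_decay: float) -> float:
    """Monotone CDF: interpolate between knots, exponential tails outside them."""
    if z < knots[0]:
        # Lower tail: approaches 0 exponentially below the first knot.
        return float(probs[0] * np.exp(-tail_decay * (knots[0] - z)))
    if z > knots[-1]:
        # Upper tail: approaches 1 exponentially above the last knot.
        return float(1.0 - (1.0 - probs[-1]) * np.exp(-tail_decay * (z - knots[-1])))
    # Linear stand-in for the monotonic cubic spline between knots.
    return float(np.interp(z, knots, probs))


def score(z: float, knots: np.ndarray, probs: np.ndarray, tail_decay: float = 1.0) -> float:
    """score = 1 - cdf(z), as in the formal definition."""
    return 1.0 - cdf(z, knots, probs, tail_decay)


# Synthetic calibration data for one dimension (assumption: standard normal).
rng = np.random.default_rng(0)
samples = rng.normal(size=5000)
knots, probs = fit_cdf(samples)
median_score = score(float(np.median(samples)), knots, probs)
```

Because the score is the complement of the CDF, it decreases monotonically in `z` and stays within [0, 1]; the tails are continuous with the interpolated segment at the extreme knots.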
|
||||
|
||||
**Canonical implementation**: The metaspline PoC files `spline.py`
|
||||
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
|
||||
and `space.py` (`unfold`/`fold`) are the reference implementation for the
|
||||
codebook compilation pipeline.
|
||||
|
||||
### Calibration Dataset
|
||||
|
||||
The calibration dataset is the set of normal (non-adversarial) inputs used to
|
||||
compute the SVD basis and fit behavioral region distributions. Requirements:
|
||||
|
||||
- **Composition**: Diverse normal inputs representative of the deployment
  domain. No adversarial examples: the basis models *normal* behavior, and
  anomalies are detected as deviations from it.
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
  in production. A narrow calibration dataset (e.g., only short English
  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
  model by running that model on the inputs and extracting activations.
|
||||
|
||||
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
|
||||
automates calibration dataset processing.
|
||||
|
||||
### Codebook Compilation
|
||||
|
||||
The codebook is compiled offline by a training pipeline that:
|
||||
|
||||
1. Runs the detector model on a calibration dataset (diverse normal inputs)
2. Extracts hidden state activations at configured layers
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`,
   whose default solver is a randomized approximation that is reproducible
   only with a fixed `random_state`)
4. Fits spline distributions along each retained dimension
5. Computes detection thresholds
6. Serializes the codebook to a portable format (safetensors + JSON config)
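
Steps 1 to 3 can be sketched for a single layer as below. This is an illustration under stated assumptions, not the PoC's `run_manifold_projection.py`: the calibration activations are synthetic, `compute_basis` is a hypothetical helper, and `numpy.linalg.svd` is used in place of `scipy.linalg.svd` to keep the example dependency-light (both are exact, LAPACK-backed decompositions that return `U, s, Vh`).

```python
import numpy as np


def compute_basis(activations: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (mean, basis) for one layer.

    activations: (n_samples, hidden_dim) calibration activations.
    basis: (k, hidden_dim); rows are the top-k right-singular vectors.
    """
    mean = activations.mean(axis=0)
    centered = activations - mean
    # Exact (non-randomized) SVD; singular values come back in descending order.
    _, _, vh = np.linalg.svd(centered, full_matrices=False)
    return mean, vh[:k]


rng = np.random.default_rng(0)
calibration = rng.normal(size=(200, 16))  # synthetic stand-in for real activations
mean, basis = compute_basis(calibration, k=4)

# z-coordinates of the calibration set, shape (200, 4); these are the samples
# that the per-dimension spline distributions would be fitted to in step 4.
z = (calibration - mean) @ basis.T
```

The rows of `basis` are orthonormal, and the spread of `z` along each retained dimension decreases from the first dimension to the last, which is why truncating to the top-k rows keeps the directions with the most variance.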
|
||||
|
||||
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
|
||||
package** as package data (under `src/alknet_firewall/data/codebook/`). This
|
||||
keeps the Phase 1 installation simple — no additional download step beyond the
|
||||
model. The bundled codebook is specific to the default detector model
|
||||
(SmolLM2-135M at the pinned revision). Users who switch to a different
|
||||
detector model must provide a matching codebook via `codebook_path`.
|
||||
|
||||
## Data Format
|
||||
|
||||
The codebook is stored as:
|
||||
|
||||
```
codebook/
├── basis.safetensors    # SVD basis vectors (n_layers × n_dims × hidden_dim)
├── regions.safetensors  # Region boundary parameters
├── splines.json         # Spline knot positions and coefficients
└── config.json          # Metadata: model_id, revision, n_dims, thresholds
```
|
||||
|
||||
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
|
||||
|
||||
### Tensor Specifications
|
||||
|
||||
**basis.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `basis_vectors` | `(n_layers, n_dims, hidden_dim)` | float32 | SVD right-singular vectors |
| `mean` | `(n_layers, hidden_dim)` | float32 | Mean activation per layer (for centering) |

**regions.safetensors**:

| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

**splines.json**:

| Field | Type | Description |
|-------|------|-------------|
| `knots` | `list[list[float]]` | Knot positions per dimension (n_dims lists of varying length) |
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
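
Given the `basis_vectors` and `mean` shapes above, all layers can be projected in one call. The sketch below assumes one activation vector per layer stacked into an `(n_layers, hidden_dim)` array; the function name and the small toy shapes are illustrative, not part of the specification.

```python
import numpy as np


def project_all_layers(
    activations: np.ndarray,    # (n_layers, hidden_dim)
    basis_vectors: np.ndarray,  # (n_layers, n_dims, hidden_dim)
    mean: np.ndarray,           # (n_layers, hidden_dim)
) -> np.ndarray:
    """Return z-coordinates with shape (n_layers, n_dims)."""
    centered = activations - mean
    # For each layer l: z[l] = basis_vectors[l] @ centered[l]
    return np.einsum("lkd,ld->lk", basis_vectors, centered)


# Toy shapes chosen for readability, not taken from any real model.
n_layers, n_dims, hidden_dim = 4, 3, 8
rng = np.random.default_rng(0)
basis_vectors = rng.normal(size=(n_layers, n_dims, hidden_dim)).astype(np.float32)
mean = rng.normal(size=(n_layers, hidden_dim)).astype(np.float32)
activations = rng.normal(size=(n_layers, hidden_dim)).astype(np.float32)

z = project_all_layers(activations, basis_vectors, mean)
```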
|
||||
|
||||
## Interfaces
|
||||
|
||||
### Internal API
|
||||
|
||||
```python
from __future__ import annotations  # allows the forward references below

from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class CodebookConfig:
    model_id: str
    model_revision: str
    n_dimensions: int
    layers: list[int]
    suspicious_threshold: float  # Serialized threshold values
    dangerous_threshold: float   # (mapped to Thresholds dataclass at runtime)


class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
        """Project raw activations onto SVD basis → z-coordinates."""
        ...

    # DimensionSignal is not defined in this document.
    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
        """Score z-coordinates against behavioral regions."""
        ...

    @classmethod
    def load(cls, path: Path) -> Codebook: ...

    @classmethod
    def from_hf_hub(cls, repo_id: str, revision: str = "main") -> Codebook: ...
```
|
||||
|
||||
### Constraints
|
||||
|
||||
1. **Immutable at runtime**: The codebook is read-only during screening.
   Modifying the codebook requires explicit recompilation.
2. **Model-bound**: A codebook is valid only for the specific model it was
   compiled for. Loading a codebook with the wrong model produces undefined
   results.
3. **Deterministic**: Same codebook + same activations = same scores.
4. **Portable**: Codebook can be saved to disk and reloaded without
   recomputation. Can be distributed via HuggingFace Hub.
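
One way the model-bound constraint could be enforced at load time is to compare the `model_id` and `revision` recorded in `config.json` against the detector model in use. The sketch below is an assumption about how such a guard might look; `CodebookMismatchError` and `check_model_binding` are hypothetical names, and the revision string is a placeholder.

```python
import json
import tempfile
from pathlib import Path


class CodebookMismatchError(ValueError):
    """Hypothetical error type, not part of the documented API."""


def check_model_binding(codebook_dir: Path, model_id: str, revision: str) -> dict:
    """Load config.json and refuse to proceed if it names a different model."""
    config = json.loads((codebook_dir / "config.json").read_text(encoding="utf-8"))
    compiled_for = (config["model_id"], config["revision"])
    if compiled_for != (model_id, revision):
        raise CodebookMismatchError(
            f"codebook compiled for {compiled_for}, got {(model_id, revision)}"
        )
    return config


# Demo with a throwaway codebook directory.
with tempfile.TemporaryDirectory() as tmp:
    codebook_dir = Path(tmp)
    (codebook_dir / "config.json").write_text(
        json.dumps(
            {"model_id": "HuggingFaceTB/SmolLM2-135M", "revision": "abc123", "n_dims": 10}
        ),
        encoding="utf-8",
    )
    ok = check_model_binding(codebook_dir, "HuggingFaceTB/SmolLM2-135M", "abc123")
    try:
        check_model_binding(codebook_dir, "some-org/other-model", "abc123")
        mismatch_detected = False
    except CodebookMismatchError:
        mismatch_detected = True
```

Failing fast here turns "undefined results" into an explicit error, at the cost of requiring the metadata to be kept accurate when a codebook is recompiled.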
|
||||
|
||||
## Design Decisions
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Interpretable, efficient, multi-dimensional |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Secure format for codebook tensors |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Which activation to use for projection |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Behavioral region scoring |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-02**: What is the minimum viable codebook: can the 1,245-line PoC
  codebook be compressed? (open)
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
|
||||
107
docs/architecture/configuration.md
Normal file
@@ -0,0 +1,107 @@
|
||||
---
status: draft
last_updated: 2026-06-13
---
|
||||
|
||||
# Configuration
|
||||
|
||||
Configuration for the firewall: model selection, detection thresholds,
|
||||
alarm levels, and operational parameters.
|
||||
|
||||
## What It Is
|
||||
|
||||
The configuration component defines all tunable parameters for the firewall.
|
||||
It controls which model is used, how aggressively inputs are screened, and
|
||||
what alarm levels map to what scores.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
Different deployment contexts need different detection sensitivity. A
|
||||
high-security environment (e.g., screening inputs to a system with access to
|
||||
sensitive data) may want aggressive thresholds that flag more suspicious
|
||||
inputs. A low-risk chatbot may prefer permissive thresholds that minimize
|
||||
false positives. The configuration component makes these trade-offs explicit
|
||||
and tunable.
|
||||
|
||||
## Configuration Structure
|
||||
|
||||
### Thresholds
|
||||
|
||||
```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    suspicious: float = 0.3  # Score above which input is SUSPICIOUS
    dangerous: float = 0.7   # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
```
|
||||
|
||||
Default thresholds are calibrated against the codebook's behavioral regions.
|
||||
Per-dimension overrides allow tuning sensitivity for specific behavioral
|
||||
patterns (e.g., lower threshold on the refusal-suppression dimension).
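
As a rough illustration of how these thresholds could map a score to an alarm level, consider the sketch below. It is an assumption rather than the documented implementation: the `AlarmLevel` enum values, the validation in `__post_init__`, and the `classify` helper are hypothetical, and per-dimension overrides are left out because their exact semantics are not specified here.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    # Level names follow the document; the string values are assumptions.
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class Thresholds:
    suspicious: float = 0.3
    dangerous: float = 0.7

    def __post_init__(self) -> None:
        # Reject configurations where the levels would overlap or invert.
        if not 0.0 <= self.suspicious < self.dangerous <= 1.0:
            raise ValueError("expected 0 <= suspicious < dangerous <= 1")


def classify(score: float, thresholds: Thresholds = Thresholds()) -> AlarmLevel:
    """Map a score in [0, 1] to an alarm level ("above" is read as strictly greater)."""
    if score > thresholds.dangerous:
        return AlarmLevel.DANGEROUS
    if score > thresholds.suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


levels = [classify(s) for s in (0.1, 0.5, 0.9)]
```

A stricter deployment could pass, for example, `Thresholds(suspicious=0.2, dangerous=0.5)` to flag more inputs, which is the trade-off this component is meant to expose.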
|
||||
|
||||
### Model Configuration
|
||||
|
||||
```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
```
|
||||
|
||||
Extraction layers are chosen based on EMNLP 2024 findings that safety signals
appear in early layers. The default set (1, 2, 4, 8) samples the first third
of the 30-layer SmolLM2-135M model.
|
||||
|
||||
### Codebook Configuration
|
||||
|
||||
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodebookConfig:
    source: str = "bundled"      # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None   # HuggingFace repo if source="hf_hub"
    revision: str | None = None  # HuggingFace revision
    path: Path | None = None     # Local path if source="local"
    n_dimensions: int = 10       # Number of SVD dimensions to retain
```
|
||||
|
||||
### Full Configuration
|
||||
|
||||
```python
from dataclasses import dataclass, field


# ModelConfig, CodebookConfig, and Thresholds are the dataclasses defined above.
@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
```
|
||||
|
||||
## Defaults
|
||||
|
||||
All configuration has sensible defaults. The firewall works out of the box:
|
||||
|
||||
```python
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
```
|
||||
|
||||
No configuration file is required. All parameters can be passed via the
|
||||
constructor. A future phase may add file-based configuration (TOML or YAML).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
|
||||
41
docs/architecture/decisions/001-python-uv.md
Normal file
@@ -0,0 +1,41 @@
|
||||
# ADR-001: Python with uv
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The project needs a programming language and build toolchain. The PoC was
|
||||
written in Python using PyTorch, sklearn, and transformers. A Rust port using
|
||||
burn/cubecl was attempted but failed — the ML framework ecosystem in Rust is
|
||||
not yet mature enough for this type of work.
|
||||
|
||||
The project needs a fast path to a usable system. The PoC already works in
|
||||
Python. Modern Python packaging (uv, pyproject.toml, src layout) provides a
|
||||
professional project structure that was not available even a few years ago.
|
||||
|
||||
## Decision
|
||||
|
||||
Use Python 3.10+ with uv as the package manager and build tool. Use uv_build
|
||||
as the build backend. Use src/ layout for the package.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Fast path to working system: PoC code is already Python
- Rich ML ecosystem (PyTorch, transformers, sklearn, safetensors)
- uv provides 10-100x faster dependency management than pip
- Modern packaging standards (pyproject.toml, PEP 735 dependency groups)
- Easy distribution via PyPI with `pip install alknet-firewall[torch]`
- Type checking via mypy catches many type errors before runtime
|
||||
|
||||
**Negative**:
|
||||
- Python is slower than Rust for non-ML code (SVD projection, data wrangling)
|
||||
- PyTorch is a large optional dependency (200MB-2.5GB)
|
||||
- Rust port remains a future goal (Phase 3, speculative)
|
||||
|
||||
## References
|
||||
|
||||
- [modern-python-project-setup.md](../research/modern-python-project-setup.md)
|
||||
- [python-ml-packaging.md](../research/python-ml-packaging.md)
|
||||
52
docs/architecture/decisions/002-behavioral-signals.md
Normal file
@@ -0,0 +1,52 @@
|
||||
# ADR-002: Behavioral Signal Detection (Not Text Classification)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
|
||||
text-surface approaches — they classify input text as safe or unsafe. This
|
||||
fundamentally limits their effectiveness:
|
||||
|
||||
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
  and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
  processed by a model
|
||||
|
||||
Academic research (2024-2025) demonstrates that adversarial inputs produce
|
||||
distinctive activation patterns in model internals, regardless of surface form.
|
||||
|
||||
## Decision
|
||||
|
||||
Build a behavioral signal detection system that monitors how a model processes
|
||||
inputs (hidden state activations), not what the inputs say (text surface).
|
||||
Adversarial inputs produce anomalous activation patterns that are detectable
|
||||
even when the text itself looks innocent.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic: novel attacks still
  produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
  directions are activated and by how much)
- Complementary to existing text-surface defenses, so the two can be layered
|
||||
|
||||
**Negative**:
|
||||
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
  with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
  language, technical content)
- No existing production system validates this approach; to our knowledge,
  this project is the first
|
||||
|
||||
## References
|
||||
|
||||
- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
|
||||
- HiddenDetect (ACL 2025)
|
||||
- Hidden Dimensions of LLM Alignment (ICML 2025)
|
||||
- How Alignment and Jailbreak Work (EMNLP 2024)
|
||||
56
docs/architecture/decisions/003-small-model-detector.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# ADR-003: Small Model (~125M) as Detector
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The behavioral signal detection approach requires running a language model on
|
||||
every input to extract hidden state activations. The choice of model size
|
||||
creates a trade-off:
|
||||
|
||||
- **Large model (7B+)**: Better representation quality, more behavioral signal
  resolution. But requires GPU, adds ~200-500ms latency, costs more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer
  safety signals. Runs on CPU, <10ms latency, negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to
  emerge. Lacks the depth where behavioral patterns form.
|
||||
|
||||
EMNLP 2024 research indicates that safety signals are detectable in early
layers, so the model does not need deep processing to produce useful signals.
A small model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim)
for safety directions to emerge in early layers.
|
||||
|
||||
## Decision
|
||||
|
||||
Use a small model (~125M parameters) as the default detector. SmolLM2-135M
(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
CPU. Support model-agnostic detection: any compatible model can be used by
recompiling the codebook.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- <10ms latency enables real-time pre-inference screening
|
||||
- CPU-deployable — no GPU required for the firewall
|
||||
- Can run alongside target model without blocking
|
||||
- Fast iteration — training/updating a 125M model takes hours, not days
|
||||
- Small enough to embed in API gateways, CDN edges, client applications
|
||||
- 269MB model download is feasible via HF Hub with caching
|
||||
|
||||
**Negative**:
|
||||
- Less representation quality than larger models — may miss subtle signals
|
||||
that a 7B detector would catch
|
||||
- Detector model must share some architectural similarity with target models
|
||||
for behavioral signals to transfer
|
||||
- SmolLM2-135M is English-focused — multilingual detection requires a
|
||||
multilingual detector model
|
||||
- Codebook is model-specific — switching models requires recompilation
|
||||
|
||||
## References
|
||||
|
||||
- [model.md](../model.md)
|
||||
- EMNLP 2024: Safety signals detectable in early layers
|
||||
- Subliminal Learning (Nature 2026): Behavioral traits transmit through
|
||||
non-semantic signals
|
||||
58
docs/architecture/decisions/004-svd-based-detection.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# ADR-004: SVD-Based Anomaly Detection
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
After extracting hidden state activations from the detector model, the
|
||||
firewall needs a method to distinguish normal behavioral patterns from
|
||||
adversarial ones. Options:
|
||||
|
||||
- **Single classifier**: Train a binary classifier on activations. Simple but
|
||||
loses the multi-dimensional structure. Black box.
|
||||
- **SVD + region comparison**: Decompose activation space into principal
|
||||
directions, model normal behavioral regions along each direction, detect
|
||||
inputs that fall outside normal regions. Interpretable, efficient,
|
||||
multi-dimensional.
|
||||
- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
|
||||
detect inputs with high reconstruction error. Complex, not interpretable.
|
||||
|
||||
ICML 2025 research shows safety is multi-dimensional in activation space — a
|
||||
dominant refusal direction plus secondary dimensions. SVD naturally discovers
|
||||
these directions. Region comparison provides interpretable per-dimension
|
||||
signals.
|
||||
|
||||
## Decision
|
||||
|
||||
Use SVD-based anomaly detection: decompose activation space via SVD to
|
||||
discover principal behavioral directions, model normal regions along each
|
||||
dimension using monotonic spline distributions, and detect inputs whose
|
||||
projections fall outside normal regions.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projection is a single k × d matrix-vector product (O(k·d)) after
  decomposition, trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single
  boundary
- Small-model friendly: SVD on 576-dim hidden states is computationally trivial
- Deterministic: `scipy.linalg.svd` produces exact, reproducible decomposition
  (unlike `TruncatedSVD`, whose default solver is randomized)
|
||||
|
||||
**Negative**:
|
||||
- SVD basis is model-specific — changing detector model requires recomputation
|
||||
- Basis quality depends on calibration dataset coverage
|
||||
- Linear decomposition may miss non-linear behavioral patterns
|
||||
- Requires a codebook compilation pipeline (Phase 2)
|
||||
- Full SVD on large calibration datasets may be slow (mitigated by
  relatively small hidden dim: 576)
|
||||
|
||||
## References
|
||||
|
||||
- [codebook.md](../codebook.md)
|
||||
- Hidden Dimensions of LLM Alignment (ICML 2025)
|
||||
- HiddenDetect (ACL 2025)
|
||||
47
docs/architecture/decisions/005-safetensors-only.md
Normal file
@@ -0,0 +1,47 @@
|
||||
# ADR-005: Safetensors-Only Model Loading
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Model weight files come in two formats:
|
||||
|
||||
- **Pickle-based** (`.pt`, `.bin`, `.pth`): Can execute arbitrary Python code
|
||||
during loading. Known supply chain attack vector.
|
||||
- **safetensors**: Simple binary format with JSON header. No code execution.
  Up to ~76x faster CPU loading in HuggingFace's published benchmark.
  Zero-copy/lazy loading support.
|
||||
|
||||
This is a security product. Loading untrusted pickle files in a security
|
||||
product is a contradiction. The LiteLLM supply chain attack (CVE-2026-33634,
|
||||
CVSS 9.4) demonstrated that compromised model files can lead to credential
|
||||
theft and backdoors.
|
||||
|
||||
## Decision
|
||||
|
||||
Only load model weights from safetensors format. Never load `.pt`, `.bin`,
|
||||
or `.pth` files. Apply this policy to both the detector model and the codebook
|
||||
tensors.
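
One way this policy could be enforced before any file is opened is a suffix check, sketched below. The function `ensure_safetensors` and the error type `UnsafeWeightFormatError` are hypothetical names introduced for illustration; the rejected suffixes are the ones listed in this ADR.

```python
from pathlib import Path

# Suffixes named in this ADR as pickle-based and therefore rejected.
PICKLE_SUFFIXES = {".pt", ".bin", ".pth"}


class UnsafeWeightFormatError(ValueError):
    """Hypothetical error type for weight files that are not safetensors."""


def ensure_safetensors(path: str) -> Path:
    """Return the path if it names a .safetensors file, otherwise raise."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in PICKLE_SUFFIXES:
        raise UnsafeWeightFormatError(f"refusing pickle-based weights: {p.name}")
    if suffix != ".safetensors":
        raise UnsafeWeightFormatError(f"unsupported weight format: {p.name}")
    return p
```

A suffix check is only a first gate: it would sit in front of the actual loader for both the detector model and the codebook tensors.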
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Eliminates entire class of supply chain attacks via model files
|
||||
- Up to ~76x faster model loading on CPU (benchmark-dependent)
|
||||
- Zero-copy/lazy loading reduces memory usage
|
||||
- Cross-framework compatible (PyTorch, TensorFlow, JAX, NumPy)
|
||||
- Consistent with HuggingFace's own migration to safetensors-default
|
||||
|
||||
**Negative**:
|
||||
- Some older models only ship `.bin` weights — must convert before use
|
||||
- Safetensors doesn't support saving optimizer state (irrelevant — we only
|
||||
do inference)
|
||||
- Explicit `use_safetensors=True` parameter needed in transformers for older
|
||||
versions
|
||||
|
||||
## References
|
||||
|
||||
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 6:
|
||||
safetensors format comparison
|
||||
- CVE-2026-33634 — LiteLLM supply chain attack
|
||||
64
docs/architecture/decisions/006-optional-pytorch.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# ADR-006: PyTorch as Optional Dependency
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
PyTorch is the primary inference backend for the detector model. However,
|
||||
PyTorch is large:
|
||||
|
||||
- `torch` (CPU): ~200MB download, ~700MB installed
|
||||
- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
|
||||
- `onnxruntime`: ~30-50MB download, ~300MB installed
|
||||
|
||||
Making PyTorch a required dependency would force a 200MB-2.5GB download on
|
||||
every user, even those who already have PyTorch installed or prefer ONNX
|
||||
Runtime. This is the standard problem for ML libraries, and the HuggingFace
|
||||
ecosystem has converged on a solution.
|
||||
|
||||
## Decision
|
||||
|
||||
Make PyTorch an optional dependency via extras (`pip install
|
||||
alknet-firewall[torch]`). The base install includes all non-ML dependencies
|
||||
(scikit-learn, huggingface-hub, safetensors, tokenizers, numpy). ML inference
|
||||
backends are installed separately.
|
||||
|
||||
Use lazy imports with clear error messages when PyTorch is not installed:
|
||||
|
||||
```python
# Run this where torch is first needed (not at package import time) so the
# base install stays importable without PyTorch.
try:
    import torch
except ImportError as exc:
    raise ImportError(
        "PyTorch is required for alknet-firewall inference. "
        "Install with: pip install 'alknet-firewall[torch]' "
        "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
    ) from exc
```
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Base install is ~30MB download, ~100MB installed — very lightweight
|
||||
- Users with existing PyTorch installations don't re-download
|
||||
- ONNX Runtime alternative available for minimal footprint (~100MB total)
|
||||
- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
|
||||
hub all use this pattern)
|
||||
- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
|
||||
`[[tool.uv.index]]`
|
||||
|
||||
**Negative**:
|
||||
- More complex dependency specification in pyproject.toml
|
||||
- Users must read installation docs to choose the right extra
|
||||
- Runtime import errors if users forget to install a backend
|
||||
- CPU-only torch requires two-step install or uv configuration (can't be
|
||||
expressed in pip extras alone)
|
||||
|
||||
## References
|
||||
|
||||
- [modern-python-project-setup.md](../research/modern-python-project-setup.md) —
|
||||
Section 2: PyTorch handling
|
||||
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 1:
|
||||
PyTorch as dependency
|
||||
53
docs/architecture/decisions/007-runtime-model-download.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# ADR-007: Runtime Model Download via HuggingFace Hub
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The detector model (SmolLM2-135M) is ~269MB. This is too large to bundle in a
Python package: PyPI's default limits are 100MB per file and 10GB per project,
and even with a limit increase a 269MB wheel download is poor UX.
|
||||
|
||||
Options:
|
||||
- **Bundle in package**: Not feasible due to size constraints
|
||||
- **Separate package for model**: Possible but awkward, requires users to
|
||||
install two packages
|
||||
- **Runtime download via HuggingFace Hub**: Standard approach used by
|
||||
transformers. Provides caching, authentication, offline mode, and
|
||||
checksum verification
|
||||
- **Custom download (S3, etc.)**: Works but reinvents the wheel
|
||||
|
||||
## Decision
|
||||
|
||||
Download the detector model at runtime via HuggingFace Hub (`snapshot_download`
|
||||
or `from_pretrained` with automatic caching). Support offline mode via
|
||||
`HF_HUB_OFFLINE=1` or `local_files_only=True`. Provide a CLI command for
|
||||
pre-downloading models in air-gapped environments.
|
||||
|
||||
Pin model revisions to specific commit hashes for reproducibility.
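
The decision above could be wired up roughly as follows. This is a sketch under assumptions: `build_download_kwargs` and `download_model` are hypothetical helper names, the `allow_patterns` list is a guess at which files the detector needs, and only the `HF_HUB_OFFLINE=1` value named in this ADR is checked. `snapshot_download` and the keyword arguments passed to it are part of `huggingface_hub`.

```python
import os
import re

_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def build_download_kwargs(repo_id, revision, cache_dir=None, offline=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    if not _COMMIT_RE.fullmatch(revision):
        # Branch names such as "main" can move, so they are rejected here.
        raise ValueError(f"revision must be a 40-character commit hash, got {revision!r}")
    if offline is None:
        offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
    return {
        "repo_id": repo_id,
        "revision": revision,
        "cache_dir": cache_dir,
        "local_files_only": offline,
        # Assumed file set: safetensors weights plus config/tokenizer files.
        "allow_patterns": ["*.safetensors", "*.json", "*.txt"],
    }


def download_model(repo_id, revision, **options):
    """Download (or reuse the cached copy of) a pinned model snapshot."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(**build_download_kwargs(repo_id, revision, **options))
```

Restricting `allow_patterns` to safetensors weights also keeps pickle-based files out of the local cache, in line with ADR-005.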
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Package stays small (~30MB base install)
|
||||
- HuggingFace Hub provides automatic caching, deduplication, and checksum
|
||||
verification
|
||||
- Offline mode supported via environment variable
|
||||
- Authentication for gated models via `HF_TOKEN`
|
||||
- Standard approach — users familiar with transformers will recognize the
|
||||
pattern
|
||||
|
||||
**Negative**:
|
||||
- First run requires network access and ~269MB download (with progress bar)
|
||||
- Model availability depends on HuggingFace Hub uptime
|
||||
- Users in restricted networks need to pre-download models
|
||||
- Different model versions may produce different detection results — must
|
||||
pin revisions
|
||||
|
||||
## References
|
||||
|
||||
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 2:
|
||||
Model file distribution
|
||||
- [model.md](../model.md)
|
||||
47
docs/architecture/decisions/008-three-level-alarm.md
Normal file
@@ -0,0 +1,47 @@
|
||||
# ADR-008: Three-Level Alarm System
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The firewall needs to communicate detection results to downstream systems. The
|
||||
design choice is how many alarm levels and what they mean.
|
||||
|
||||
Alternatives:
|
||||
- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
|
||||
don't warrant blocking but should be flagged. Binary forces a single
|
||||
threshold that either blocks too much (high false positive) or too little
|
||||
(high false negative).
|
||||
- **Numeric-only (0.0–1.0 score)**: Maximum information but requires every
|
||||
consumer to choose their own threshold. No shared vocabulary for what's
|
||||
actionable.
|
||||
- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
|
||||
pre-inference screening system. The difference between "low" and "medium"
|
||||
is too subtle for consumers to act on differently.
|
||||
- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
|
||||
nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
|
||||
review. Most practical for automated systems.
|
||||
|
||||
## Decision
|
||||
|
||||
Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
|
||||
continuous score (0.0–1.0) for consumers that need fine-grained decisions.
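
A minimal sketch of how the continuous score might map onto the three levels is shown below. The function name `classify` and the default thresholds `0.5` and `0.8` are placeholders chosen for illustration, not calibrated values from the project.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def classify(score: float, suspicious: float = 0.5, dangerous: float = 0.8) -> AlarmLevel:
    """Map a composite score in [0, 1] to one of the three alarm levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    if not suspicious < dangerous:
        raise ValueError("suspicious threshold must be below dangerous threshold")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Action mapping described in this ADR: pass, flag, block.
ACTIONS = {
    AlarmLevel.CLEAR: "pass",
    AlarmLevel.SUSPICIOUS: "flag",
    AlarmLevel.DANGEROUS: "block",
}
```

Consumers that need finer control can ignore the level and apply their own cutoffs to the score.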
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Clear action mapping: pass, flag, block
|
||||
- Suspicious level enables defense-in-depth (apply additional checks rather
|
||||
than binary block/allow)
|
||||
- Continuous score provides gradient for consumers that need it
|
||||
- Simple to document and communicate
|
||||
|
||||
**Negative**:
|
||||
- Some consumers may need more granularity (but can use the score field)
|
||||
- "Suspicious" requires consumers to decide what to do — adds decision burden
|
||||
|
||||
## References
|
||||
|
||||
- [firewall.md](../firewall.md)
|
||||
55
docs/architecture/decisions/009-last-token-extraction.md
Normal file
@@ -0,0 +1,55 @@
|
||||
# ADR-009: Last-Token Activation Extraction
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
To extract behavioral signals from the detector model, we must choose which
|
||||
token's hidden state to use from the sequence of hidden states produced during
|
||||
inference. Options:
|
||||
|
||||
- **Last token**: The hidden state at the final position, which has attended
  to the entire sequence. Standard for sequence classification with
  decoder-only models (GPT-style classification heads read the last
  non-padding position).
|
||||
- **Mean pooling**: Average hidden states across all positions. Smooths out
|
||||
position-specific effects but dilutes signal from safety-relevant tokens.
|
||||
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
|
||||
(LLaMA architecture) does not use a CLS token.
|
||||
- **First token**: Has seen only the beginning of the sequence. Misses
|
||||
context from later tokens.
|
||||
- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
|
||||
position with extreme activation can dominate.
|
||||
|
||||
Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
|
||||
models because the last position's hidden state has attended to the full
|
||||
sequence via causal attention. For safety detection, this means the last
|
||||
token's representation contains the model's "conclusion" about the entire
|
||||
input.
|
||||
|
||||
## Decision
|
||||
|
||||
Extract the last token's hidden state at each configured layer. This is
|
||||
standard for LLaMA-family models and provides full-sequence context.
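
For a single unpadded input, indexing position `-1` is enough. If inputs are ever padded, the last real token has to be located through the attention mask. The sketch below shows that lookup on plain Python lists; `last_token_index` and `last_token_state` are hypothetical helper names, and real code would operate on tensors instead.

```python
def last_token_index(attention_mask: list[int]) -> int:
    """Index of the last real (mask == 1) token, regardless of padding side."""
    for position in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[position] == 1:
            return position
    raise ValueError("attention mask contains no real tokens")


def last_token_state(hidden_states: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Pick the hidden-state row for the last real token of one sequence."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask lengths differ")
    return hidden_states[last_token_index(attention_mask)]
```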
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- Standard approach for autoregressive models — well-validated
|
||||
- Full sequence context via causal attention
|
||||
- Single vector per layer — simple to project and score
|
||||
- No padding sensitivity for single, unpadded inputs (batched inputs must index the last non-padding position)
|
||||
|
||||
**Negative**:
|
||||
- Position-dependent — the last token's representation is influenced by its
|
||||
position in the sequence, not just its content
|
||||
- Very short inputs (1–2 tokens) may not have enough context for meaningful
|
||||
activation patterns
|
||||
- May miss patterns in long inputs where the adversarial payload is in the
|
||||
middle rather than the end
|
||||
|
||||
## References
|
||||
|
||||
- [model.md](../model.md)
|
||||
- [codebook.md](../codebook.md)
|
||||
@@ -0,0 +1,64 @@
|
||||
# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
After projecting activations onto SVD dimensions, the firewall needs to score
|
||||
how "normal" or "anomalous" a projection is relative to the distribution of
|
||||
normal inputs. This requires modeling the probability density of normal inputs
|
||||
along each dimension.
|
||||
|
||||
Alternatives:
|
||||
- **Gaussian**: Simple, well-understood. But real behavioral distributions are
|
||||
often skewed, multimodal, or heavy-tailed. Gaussian assumes symmetry.
|
||||
- **Kernel Density Estimation (KDE)**: Non-parametric, flexible. But
|
||||
bandwidth selection is tricky, and KDE doesn't provide a parametric form for
|
||||
efficient storage and fast evaluation.
|
||||
- **Mixture of Gaussians**: More flexible than single Gaussian. But requires
|
||||
choosing the number of components and risks overfitting.
|
||||
- **Empirical CDF**: Non-parametric, no assumptions. But requires storing all
|
||||
calibration data points — not compact.
|
||||
- **Monotonic spline distributions**: Parametric CDF modeled as a monotonic
|
||||
spline. Compact (handful of knots), smooth, tail-sensitive, and
|
||||
differentiable. The CDF is naturally monotonic, which enforces a valid
|
||||
probability distribution.
|
||||
|
||||
## Decision
|
||||
|
||||
Use monotonic spline distributions to model behavioral regions along each SVD
|
||||
dimension. The CDF is represented as a monotonic cubic spline with a small
|
||||
number of knots (typically 10–20 per dimension). Tail behavior uses
|
||||
exponential decay beyond the observed range.
|
||||
|
||||
The scoring function computes how far a projection falls in the tail of the
|
||||
distribution — projections well within the normal region score low (CLEAR),
|
||||
projections near or beyond the tail score increasingly high.
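
To make the scoring idea concrete, the sketch below models a CDF through quantile knots with exponential tails and turns it into a tail score. It deliberately swaps in piecewise-linear interpolation between knots as a simpler stand-in for the monotonic cubic spline, and the two-sided score `1 - 2 * min(p, 1 - p)` is an assumed form. The names `KnotCDF` and `fit_knots` are illustrative and do not come from the metaspline PoC.

```python
import bisect
import math
import random


class KnotCDF:
    """Monotone CDF through (x, p) knots with exponential tails outside them."""

    def __init__(self, xs: list[float], ps: list[float]):
        if len(xs) < 2 or len(xs) != len(ps):
            raise ValueError("need at least two matching knots")
        if any(b <= a for a, b in zip(xs, xs[1:])) or any(b <= a for a, b in zip(ps, ps[1:])):
            raise ValueError("knots must be strictly increasing in x and p")
        if not (0.0 < ps[0] and ps[-1] < 1.0):
            raise ValueError("knot probabilities must lie strictly inside (0, 1)")
        self.xs, self.ps = xs, ps
        # Tail scales chosen so the CDF slope is continuous at the end knots.
        self.lo_scale = ps[0] * (xs[1] - xs[0]) / (ps[1] - ps[0])
        self.hi_scale = (1.0 - ps[-1]) * (xs[-1] - xs[-2]) / (ps[-1] - ps[-2])

    def cdf(self, x: float) -> float:
        if x <= self.xs[0]:
            return self.ps[0] * math.exp((x - self.xs[0]) / self.lo_scale)
        if x >= self.xs[-1]:
            return 1.0 - (1.0 - self.ps[-1]) * math.exp(-(x - self.xs[-1]) / self.hi_scale)
        i = bisect.bisect_right(self.xs, x) - 1
        t = (x - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ps[i] + t * (self.ps[i + 1] - self.ps[i])

    def score(self, x: float) -> float:
        """0.0 at the median, approaching 1.0 deep in either tail."""
        p = self.cdf(x)
        return 1.0 - 2.0 * min(p, 1.0 - p)


def fit_knots(samples: list[float], n_knots: int = 12) -> KnotCDF:
    """Place knots at evenly spaced empirical quantiles of the samples."""
    ordered = sorted(samples)
    ps = [(i + 1) / (n_knots + 1) for i in range(n_knots)]
    xs = [ordered[round(p * (len(ordered) - 1))] for p in ps]
    return KnotCDF(xs, ps)


rng = random.Random(0)
region = fit_knots([rng.gauss(0.0, 1.0) for _ in range(2000)])
```

Only the knot positions and probabilities need to be stored per dimension, which is where the compactness claimed below comes from.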
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive**:
|
||||
- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
|
||||
cliff-edge behavior
|
||||
- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
|
||||
inputs without flagging the bulk of normal inputs
|
||||
- **Parametric compactness**: A handful of spline knots (10–20) represent the
|
||||
full distribution shape. Very small storage footprint.
|
||||
- **Differentiability**: Scores are differentiable — potential for future
|
||||
adversarial training or gradient-based analysis
|
||||
- **Few distributional assumptions**: Unlike a Gaussian, spline distributions
  handle skew, heavy tails, and non-standard shapes
|
||||
|
||||
**Negative**:
|
||||
- More complex than Gaussian — requires spline fitting during codebook
|
||||
compilation
|
||||
- Spline knot selection affects scoring quality — poor knot placement can
|
||||
miss important distribution features
|
||||
- Less familiar to most ML practitioners than Gaussian or KDE
|
||||
|
||||
## References
|
||||
|
||||
- [codebook.md](../codebook.md)
|
||||
- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)
|
||||
200
docs/architecture/firewall.md
Normal file
@@ -0,0 +1,200 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-13
|
||||
---
|
||||
|
||||
# Firewall
|
||||
|
||||
The core firewall component: the public API for screening untrusted inputs and
|
||||
producing behavioral alarms.
|
||||
|
||||
## What It Is
|
||||
|
||||
The Firewall is the primary entry point for alknet-firewall. It receives
|
||||
untrusted text input, runs it through the detector model, extracts behavioral
|
||||
signals from hidden state activations, and produces a structured alarm
|
||||
indicating whether the input exhibits adversarial behavioral patterns.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
LLM-based systems need a fast, pre-inference screening mechanism that catches
|
||||
adversarial inputs *before* they reach the target model. Text-surface
|
||||
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
|
||||
detection catches what text hides — adversarial inputs produce anomalous
|
||||
activation patterns regardless of their surface form (ADR-002).
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids

3. Detector Model Inference
   model(input_ids) → hidden_states at key layers

4. Activation Extraction
   Extract hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = svd_basis @ activation_vector

6. Codebook Comparison
   For each SVD dimension:
   - Compute distance from normal behavioral region
   - Apply spline scoring (monotonic distribution)
   - Aggregate multi-dimensional signals

7. Alarm Generation
   Combine per-dimension signals → overall alarm
   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
   Include per-dimension breakdown for interpretability
```
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Behavioral Alarm
|
||||
|
||||
Not a simple safe/unsafe binary. A behavioral alarm contains:
|
||||
|
||||
- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
|
||||
- **Score**: Continuous 0.0–1.0 composite score
|
||||
- **Signals**: Per-dimension behavioral signal strengths
|
||||
- **Dimensions**: Which SVD directions are anomalous and by how much
|
||||
|
||||
This multi-signal approach reflects that safety is multi-dimensional in
|
||||
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
|
||||
that simultaneously shifts the refusal direction while activating role-playing
|
||||
dimensions is more suspicious than one that shifts only one dimension.
|
||||
|
||||
### Score Composition
|
||||
|
||||
The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
|
||||
using a weighted maximum:
|
||||
|
||||
```
|
||||
score = max(w_d * signal_d for d in dimensions)
|
||||
```
|
||||
|
||||
Where `w_d` are dimension weights (default: equal, configurable in
|
||||
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
|
||||
single strongly anomalous dimension can trigger an alarm even if other
|
||||
dimensions are normal. This is critical for catching attacks that exploit
|
||||
specific behavioral patterns (e.g., refusal-suppression) while leaving other
|
||||
dimensions unaffected.
|
||||
|
||||
The `suspicious` and `dangerous` thresholds are applied to this composite
|
||||
score to determine `Alarm.level`.
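
The weighted maximum can be sketched in a few lines. The helper name `compose_score` is hypothetical, and the sketch assumes signals and weights both lie in [0, 1] so the result stays in that range.

```python
from typing import Optional


def compose_score(signals: dict[int, float], weights: Optional[dict[int, float]] = None) -> float:
    """Weighted maximum of per-dimension signals, each expected in [0, 1]."""
    if not signals:
        raise ValueError("at least one dimension signal is required")
    weights = weights or {}
    # Missing weights default to 1.0, i.e. equal weighting.
    return max(weights.get(dim, 1.0) * value for dim, value in signals.items())


signals = {0: 0.10, 1: 0.92, 2: 0.05}  # dimension 1 is strongly anomalous
max_score = compose_score(signals)
mean_score = sum(signals.values()) / len(signals)
```

With these toy values the maximum keeps the anomaly on dimension 1 visible, while the mean would dilute it to roughly a third of that.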
|
||||
|
||||
### Alarm Levels
|
||||
|
||||
| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
| `DANGEROUS` | Composite score at or above the `dangerous` threshold; one strongly anomalous dimension is enough | Block input or apply strong mitigations |
|
||||
|
||||
### Latency Budget
|
||||
|
||||
The firewall must complete screening in <10ms on commodity hardware
|
||||
(ADR-003). This budget breaks down approximately:
|
||||
|
||||
| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (125M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Codebook comparison | ~0.3ms |
| **Total** | **~6ms** |
|
||||
|
||||
## Interfaces
|
||||
|
||||
### Public API
|
||||
|
||||
```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# `Thresholds` and `DEFAULT_MODEL_REVISION` are defined elsewhere in the package.


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    dimension: int
    deviation: float
    score: float
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```
|
||||
|
||||
> `screen_batch` is Phase 2 (see overview.md scope).
|
||||
|
||||
### Constraints
|
||||
|
||||
1. **No network calls during screening** — the model is lazily loaded on
|
||||
first `screen()` call or via explicit `preload()`. Download never happens at
|
||||
import time. Once loaded, screening is entirely local.
|
||||
2. **Synchronous API** — `screen()` is a blocking call. Async is Phase 2.
|
||||
3. **No target model dependency** — the firewall has no access to the target
|
||||
LLM's internals. It runs its own detector model.
|
||||
4. **Reproducible** — Same input + same model + same codebook = same alarm.
|
||||
Pin model revision and codebook version.
|
||||
|
||||
## Error Handling
|
||||
|
||||
| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | — | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |
|
||||
|
||||
All library-defined exception types subclass `AlknetFirewallError` (base library exception); the built-in `ValueError` and `MemoryError` are raised unchanged.
|
||||
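
The table can be read as a small exception hierarchy plus an input check, sketched below. The class names come from the table; `validate_input` is a hypothetical helper, and returning the SHA-256 digest from it is an assumption made to tie the check to `Alarm.input_hash`.

```python
import hashlib


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError):
    """Raised when the detector model cannot be downloaded."""


class CodebookMismatchError(AlknetFirewallError):
    """Raised when the codebook was built for a different model."""


def validate_input(text: str) -> str:
    """Apply the input checks from the table and return the SHA-256 hex digest."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        raw = text.encode("utf-8")
    except UnicodeEncodeError as exc:  # e.g. lone surrogate code points
        raise ValueError("input cannot be encoded as UTF-8") from exc
    return hashlib.sha256(raw).hexdigest()
```

Callers can catch `AlknetFirewallError` for library failures while still handling `ValueError` for malformed input separately.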
|
||||
## Design Decisions
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
|
||||
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)
|
||||
161
docs/architecture/model.md
Normal file
@@ -0,0 +1,161 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-13
|
||||
---
|
||||
|
||||
# Model
|
||||
|
||||
The model component manages detector model loading, inference, and activation
|
||||
extraction. It is the interface between the firewall and the language model
|
||||
that provides behavioral signals.
|
||||
|
||||
## What It Is
|
||||
|
||||
The model component loads a small language model (default: SmolLM2-135M),
|
||||
runs inference on untrusted inputs, and extracts hidden state activations at
|
||||
configured layers. It is model-agnostic — any transformer model with
|
||||
accessible hidden states can serve as a detector.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
The firewall needs model activations (hidden states) to detect behavioral
|
||||
patterns. This component encapsulates the complexity of model loading,
|
||||
inference, and activation extraction behind a clean interface that the
|
||||
codebook and firewall can consume without knowing model-specific details.
|
||||
|
||||
The model-agnostic design (ADR-003) means the firewall is not tied to a
|
||||
specific detector model. Switching from SmolLM2-135M to another ~100M model
|
||||
requires recomputing the SVD basis and rebuilding the codebook, but no
|
||||
changes to the firewall logic.
|
||||
|
||||
## Key Concepts
|
||||
|
||||
### Activation Extraction
|
||||
|
||||
The core operation: running the model on an input and capturing hidden state
|
||||
representations at specific layers.
|
||||
|
||||
```python
# Conceptual
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
}
```
|
||||
|
||||
Key decisions:
|
||||
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
|
||||
Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
|
||||
Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral
|
||||
patterns. Layers 3, 6, 7 are omitted to reduce dimensionality — their
|
||||
signals are highly correlated with the selected layers.
|
||||
- **Which token**: The last token's hidden state carries the model's
|
||||
"conclusion" about the full input sequence (ADR-009). This is the standard
|
||||
choice for autoregressive (LLaMA-family) models.
|
||||
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
  (576 for SmolLM2-135M, per its published `hidden_size`).
|
||||
|
||||
### Model-Agnostic Interface
|
||||
|
||||
The model component exposes a generic interface that works with any
|
||||
transformer model:
|
||||
|
||||
```python
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```
|
||||
|
||||
The `infer` method returns hidden states at key layers, abstracting away
|
||||
whether the backend is PyTorch, ONNX Runtime, or a future Rust inference
|
||||
engine.
|
||||
|
||||
### Lazy Loading
|
||||
|
||||
The model is loaded on first use or explicit preload — not at import time.
|
||||
This keeps the library import fast (~milliseconds) even when torch is
|
||||
installed.
|
||||
|
||||
```python
|
||||
firewall = Firewall() # Does NOT load model yet
|
||||
firewall.preload() # Explicit: download + load model
|
||||
alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
|
||||
```
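
The deferred-load behavior can be sketched with a small holder object. `LazyModel` and its `get` method are hypothetical names for illustration, and the loader here is a stand-in lambda instead of a real model download.

```python
class LazyModel:
    """Hypothetical holder that defers an expensive load until first use."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable returning the model
        self._model = None

    def is_loaded(self) -> bool:
        return self._model is not None

    def preload(self) -> None:
        # Explicit load; a no-op if the model is already in memory.
        if self._model is None:
            self._model = self._loader()

    def get(self):
        self.preload()  # implicit load on first use
        return self._model


calls = []
lazy = LazyModel(lambda: calls.append("load") or "fake-model")
```

Constructing `lazy` does not invoke the loader; the first `get()` does, and later calls reuse the cached object.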
|
||||
|
||||
### Offline Support
|
||||
|
||||
The model component respects `HF_HUB_OFFLINE` and `local_files_only` flags.
|
||||
In air-gapped environments, models must be pre-downloaded. The library
|
||||
provides a CLI command for this:
|
||||
|
||||
```bash
|
||||
python -m alknet_firewall download
|
||||
```
|
||||
|
||||
## Interfaces
|
||||
|
||||
### Public API
|
||||
|
||||
```python
from typing import ClassVar, Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...
```
|
||||
|
||||
### Constraints
|
||||
|
||||
1. **safetensors-only** — Model weights are loaded exclusively from
|
||||
safetensors format. Pickle-based `.pt`/`.bin` files are never loaded
|
||||
(ADR-005). This is a security requirement for a security product.
|
||||
2. **Model pinning** — Model revision must be pinned for reproducibility.
|
||||
Default revision is a specific commit hash, not `"main"`.
|
||||
3. **CPU-first** — Default device is CPU. GPU inference is supported but not
|
||||
required. The <10ms latency target is achievable on CPU with a 125M model.
|
||||
4. **No training** — The detector model is inference-only. No gradients are
|
||||
computed. No model weights are modified at runtime.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | ~125M params, <10ms, CPU-deployable |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
|
||||
129
docs/architecture/open-questions.md
Normal file
@@ -0,0 +1,129 @@
|
||||
# Open Questions
|
||||
|
||||
Centralized tracker for unresolved questions across all architecture documents.
|
||||
|
||||
## Theme: Inference Backend
|
||||
|
||||
### OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?
|
||||
|
||||
- **Origin**: [model.md](model.md), [overview.md](overview.md)
|
||||
- **Status**: open
|
||||
- **Priority**: medium
|
||||
- **Resolution**: (pending)
|
||||
- **Cross-references**: ADR-006
|
||||
|
||||
ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
|
||||
for PyTorch) and is well-suited for inference-only use. HuggingFace's `optimum`
|
||||
library provides drop-in replacement classes. However, supporting it in Phase 1
|
||||
adds complexity: model must be exported to ONNX format, `optimum` integration
|
||||
must be tested, and the activation extraction API may differ from PyTorch.
|
||||
|
||||
Consider: Is the smaller footprint worth the integration complexity in Phase 1,
|
||||
or should ONNX support wait until Phase 2 when the core API is stable?
|
||||
|
||||
---
|
||||
|
||||
## Theme: Codebook Design
|
||||
|
||||
### OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?
|
||||
|
||||
- **Origin**: [codebook.md](codebook.md)
|
||||
- **Status**: open
|
||||
- **Priority**: high
|
||||
- **Resolution**: (pending)
|
||||
- **Cross-references**: ADR-004
|
||||
|
||||
The PoC codebook is 1,245 lines — much of it may be boilerplate, dead code,
|
||||
or excessive parameterization from the research phase. Understanding what's
|
||||
essential vs. exploratory is critical for the initial extraction. The codebook
|
||||
training pipeline (`run_manifold_projection.py`) should also be analyzed.
|
||||
|
||||
Consider: How many SVD dimensions are actually needed? What's the minimum
|
||||
calibration dataset? Can spline distributions be simplified?
|
||||
|
||||
---
|
||||
|
||||
## Theme: API Design

### OQ-03: Should the firewall support streaming/chunked input screening?

- **Origin**: [firewall.md](firewall.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-003

Some inputs arrive in chunks (streaming API responses, large documents). Should
the firewall support incremental screening as chunks arrive, or require the
full input before screening? Incremental screening could detect attacks earlier
but requires buffering and state management.

This is low priority for Phase 1 but affects the internal API design.

---

### OQ-04: Should detection thresholds be per-model or globally configurable?

- **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Cross-references**: ADR-003, ADR-004

Different detector models may produce different score distributions. Thresholds
that work for SmolLM2-135M may not work for a different model. Should
thresholds be tied to the codebook (per-model) or set globally by the user?
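To make the two options concrete, here is a minimal sketch in which thresholds live with the codebook as per-model defaults and the user can override individual values. The `Thresholds` class, the `CODEBOOK_DEFAULTS` table, the `resolve_thresholds` helper, and the numeric cut-offs are all hypothetical; nothing here is a decided API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """Cut-offs on the 0.0-1.0 alarm score (field names are hypothetical)."""

    suspicious: float
    dangerous: float

    def __post_init__(self) -> None:
        # Reject inverted or out-of-range cut-offs early.
        if not 0.0 <= self.suspicious <= self.dangerous <= 1.0:
            raise ValueError("expected 0.0 <= suspicious <= dangerous <= 1.0")


# Hypothetical defaults a codebook could ship for the model it was calibrated on.
# The numbers are placeholders, not calibrated values.
CODEBOOK_DEFAULTS = {
    "SmolLM2-135M": Thresholds(suspicious=0.5, dangerous=0.8),
}


def resolve_thresholds(model_id: str, **overrides: float) -> Thresholds:
    """Start from the per-model defaults, then apply any user overrides."""
    return replace(CODEBOOK_DEFAULTS[model_id], **overrides)


defaults = resolve_thresholds("SmolLM2-135M")
stricter = resolve_thresholds("SmolLM2-135M", suspicious=0.3)
```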

Consider: Per-model defaults with user overrides? Codebook ships with
recommended thresholds that the user can adjust?

---

## Theme: Integration

### OQ-05: How should the firewall integrate with existing guardrail systems?

- **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Cross-references**: ADR-002

The behavioral firewall is complementary to text-surface defenses. Users may
want to run both Llama Guard (text classification) and alknet-firewall
(behavioral signals) in series. How should these be composed?
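As an illustration of running both systems in series, the sketch below treats each screener as a plain callable and stops at the first one that blocks. The `Screener` alias, `screen_in_series`, and both stand-in screeners are hypothetical; they do not call Llama Guard or alknet-firewall.

```python
from typing import Callable, Iterable, Optional, Tuple

# A screener takes the untrusted text and returns True when it would block it.
Screener = Callable[[str], bool]


def screen_in_series(
    text: str, screeners: Iterable[Tuple[str, Screener]]
) -> Optional[str]:
    """Run screeners in order; return the name of the first one that blocks."""
    for name, blocks in screeners:
        if blocks(text):
            return name
    return None  # every screener let the input through


def text_classifier(text: str) -> bool:
    # Stand-in for a text-surface classifier: a naive phrase match.
    return "ignore previous instructions" in text.lower()


def behavioral_firewall(text: str) -> bool:
    # Stand-in for the behavioral check; a real adapter would wrap the firewall.
    return False


pipeline = [
    ("text-classifier", text_classifier),
    ("alknet-firewall", behavioral_firewall),
]
blocked_by = screen_in_series("Please IGNORE previous instructions.", pipeline)
passed = screen_in_series("What is the capital of France?", pipeline)
```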

Consider: Integration adapters? A common interface? Callback hooks? Or is
composition the user's responsibility and we just provide a clean standalone API?

---

## Theme: Project Setup

### OQ-06: Should file-based configuration use TOML or YAML?

- **Origin**: [configuration.md](configuration.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: None

Phase 1 uses constructor-based configuration only. A future phase may add
file-based configuration for easier deployment. TOML is consistent with
Python packaging (pyproject.toml) and increasingly the standard for Python
config. YAML is more familiar in ops/ML contexts. Either works.

---

### OQ-07: Is a Rust port feasible given current ML framework maturity?

- **Origin**: [overview.md](overview.md), ADR-001
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-001

A Rust port using burn/cubecl was attempted during the PoC phase and failed.
The ML framework ecosystem in Rust is not yet mature enough for this type
of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
matures or alternative Rust ML frameworks emerge.

208
docs/architecture/overview.md
Normal file
@@ -0,0 +1,208 @@
---
status: draft
last_updated: 2026-06-13
---

# Overview

## Vision

A pip-installable Python library that screens untrusted inputs for adversarial
content before they reach a target LLM. The library uses behavioral signals
(patterns in hidden state activations from a small language model) to detect
injection attempts, obfuscated payloads, and novel attack types that text-surface
defenses miss.

This project is open source under the MIT license.

## Why This Exists

LLMs process instructions and data in the same token stream. They cannot
reliably distinguish trusted system prompts from untrusted user content. This
architectural weakness enables prompt injection, ranked the #1 LLM vulnerability
by OWASP (LLM01:2025). Sophisticated attackers bypass the best-defended models
~50% of the time with just 10 attempts (International AI Safety Report 2026).

Current defenses are **surface-level**: text classifiers (Llama Guard), regex
filters, perplexity checks, and canary tokens. All examine *what the input
says*, not *how a model processes it*. Adversarial inputs that look natural to
text classifiers can still produce distinctive activation patterns when a model
processes them.

Academic research validates this approach:
- **HiddenDetect (ACL 2025)**: Activation-based detection outperforms SOTA
- **Hidden Dimensions (ICML 2025)**: Safety is multi-dimensional in activation space
- **EMNLP 2024**: Safety signals detectable in early layers
- **Subliminal Learning (Nature 2026)**: Models transmit behavioral traits
  through non-semantic hidden signals

See [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
for the full threat analysis and academic evidence.

## Scope

### In Scope

- **Phase 1**: Core behavioral firewall library
  - Input screening via small model activation analysis
  - SVD-based anomaly detection with configurable thresholds
  - Model-agnostic detector (works with any compatible small model)
  - SmolLM2-135M as the default detector model
  - Multi-dimensional behavioral alarms (not just safe/unsafe)
  - PyTorch inference backend (optional dependency)
  - Runtime model download and caching via HuggingFace Hub
  - safetensors-only model loading (security requirement)
  - Synchronous API for single-input screening
  - Interpretable detection signals (SVD direction analysis)

- **Phase 2**: Integration and operational hardening
  - ONNX Runtime inference backend
  - Async/batch screening API
  - Integration adapters for LlamaFirewall, NeMo Guardrails
  - Metrics and observability
  - Codebook training pipeline (`run_manifold_projection.py` extraction)

- **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
  - Streaming input screening
  - Custom model fine-tuning for domain-specific detection
  - Rust port via burn/cubecl (speculative, requires R&D)

### Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

## Architecture

```
                      ┌──────────────────────────────────────────┐
                      │  alknet-firewall (Python library)        │
                      │                                          │
Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐  │
(text)                │  │ screen(input) → Alarm              │  │
                      │  │ ├─ Tokenize input                  │  │
                      │  │ ├─ Run detector model              │  │
                      │  │ ├─ Extract hidden state activations│  │
                      │  │ ├─ Project onto SVD basis          │  │
                      │  │ ├─ Compare against codebook        │  │
                      │  │ └─ Return behavioral alarm         │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Model Manager ────────────────────┐  │
                      │  │ Load model (HF Hub download/cache) │  │
                      │  │ Extract activations at key layers  │  │
                      │  │ Model-agnostic interface           │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Codebook ─────────────────────────┐  │
                      │  │ SVD basis vectors (compiled)       │  │
                      │  │ Detection thresholds per dimension │  │
                      │  │ Behavioral region boundaries       │  │
                      │  │ Spline distributions for scoring   │  │
                      │  └────────────────────────────────────┘  │
                      │                                          │
                      │  ┌─ Configuration ────────────────────┐  │
                      │  │ Model selection & revision pinning │  │
                      │  │ Detection thresholds               │  │
                      │  │ Alarm severity levels              │  │
                      │  └────────────────────────────────────┘  │
                      └──────────────────────────────────────────┘
                                           │
                                    ┌──────┴──────┐
                                    │             │
                              HF Hub Cache  Detector Model
                               (~/.cache/)  (SmolLM2-135M)
```
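The projection and comparison steps in the diagram can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the library's implementation: the function names are hypothetical, the numbers are made up, and a per-dimension z-score stands in for the spline distributions that the codebook uses for scoring.

```python
def project(activation, mean, basis):
    """Center the activation, then take its coordinate along each basis vector."""
    centered = [a - m for a, m in zip(activation, mean)]
    return [sum(c * b for c, b in zip(centered, vector)) for vector in basis]


def anomalous_dimensions(coords, ref_mean, ref_std, z_threshold):
    """Return indices of dimensions whose z-score exceeds the threshold."""
    return [
        i
        for i, (x, mu, sigma) in enumerate(zip(coords, ref_mean, ref_std))
        if abs(x - mu) / sigma > z_threshold
    ]


# Toy 4-dimensional "hidden state" and a 2-vector orthonormal basis.
activation_mean = [0.0, 0.0, 0.0, 0.0]
basis = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0],
]
activation = [0.5, 3.0, 4.0, 1.0]

# Coordinates in the reduced space: roughly [0.5, 5.0].
coords = project(activation, activation_mean, basis)

# Reference statistics that a calibration run would supply (placeholders here).
flagged = anomalous_dimensions(
    coords, ref_mean=[0.0, 0.0], ref_std=[1.0, 1.0], z_threshold=3.0
)
```

In this toy case only the second dimension lands outside its expected range, which is the kind of per-dimension signal the alarm is meant to expose.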

## Package Dependencies

### Core (Required)

| Package | Version | Purpose | Notes |
|---------|---------|---------|-------|
| `huggingface-hub` | >=1.5.0,<2.0 | Model download, caching | ~15MB, handles auth and offline mode |
| `safetensors` | >=0.4.3 | Safe model weight loading | No arbitrary code execution |
| `tokenizers` | >=0.20 | Text tokenization | Fast Rust-based tokenizer |
| `numpy` | >=1.24 | Tensor operations | Core numerical dependency |
| `scikit-learn` | >=1.3 | SVD computations | TruncatedSVD for basis projection |

### Optional (Extras)

| Package | Extra | Version | Purpose | Notes |
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |
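One common way to keep a heavy backend optional is to check for it at the point of use and fail with an install hint. A minimal sketch, assuming hypothetical helper names `backend_available` and `require_backend`; the extra names are the ones listed in the table above.

```python
import importlib.util


def backend_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_backend(module_name: str, extra: str) -> None:
    """Raise an ImportError with an install hint when a backend is missing."""
    if not backend_available(module_name):
        raise ImportError(
            f"The '{module_name}' backend is not installed. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        )


# Typical call site: check right before the backend is first needed.
# require_backend("torch", extra="torch")
```

Checking lazily keeps `import alknet_firewall` cheap for users who only need the core dependencies.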

### Development (Not Published)

| Package | Purpose |
|---------|---------|
| `ruff` | Linting + formatting (replaces flake8, black, isort) |
| `pytest` | Testing |
| `pytest-cov` | Coverage |
| `mypy` | Type checking |
| `pre-commit` | Git hooks |

## Exports

This is a Python library. Public API surface:

```python
from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level       # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score       # float, 0.0-1.0
alarm.signals     # list[DimensionSignal]: per-dimension behavioral signals
alarm.dimensions  # SVD dimension analysis
```

See [firewall.md](firewall.md) for the full API specification.
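A caller would typically branch on `alarm.level`. The sketch below uses stand-in definitions of `Alarm`, `AlarmLevel`, and `DimensionSignal` so that it runs on its own; the field names on `DimensionSignal` and the `handle` policy are assumptions, and the real types are the ones specified in firewall.md.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class DimensionSignal:
    dimension: int    # index of the SVD dimension (hypothetical field)
    deviation: float  # distance from the expected region (hypothetical field)


@dataclass(frozen=True)
class Alarm:
    level: AlarmLevel
    score: float
    signals: List[DimensionSignal] = field(default_factory=list)


def handle(alarm: Alarm) -> str:
    """Example policy: block DANGEROUS, queue SUSPICIOUS for review, allow CLEAR."""
    if alarm.level is AlarmLevel.DANGEROUS:
        return "block"
    if alarm.level is AlarmLevel.SUSPICIOUS:
        return "review"
    return "allow"


decision = handle(Alarm(level=AlarmLevel.SUSPICIOUS, score=0.62))
```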

## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-python-uv.md) | Python with uv | Python enables direct ML ecosystem integration; uv provides modern packaging |
| [002](decisions/002-behavioral-signals.md) | Behavioral signal detection | Detect how models process inputs, not what inputs say |
| [003](decisions/003-small-model-detector.md) | Small model as detector | ~135M params: <10ms latency, CPU-deployable, early-layer signals |
| [004](decisions/004-svd-based-detection.md) | SVD-based anomaly detection | Interpretable, efficient, small-model-friendly |
| [005](decisions/005-safetensors-only.md) | Safetensors-only loading | No pickle-based model files; a security product must itself be secure |
| [006](decisions/006-optional-pytorch.md) | PyTorch as optional dependency | 2GB+ dependency can't be required; extras pattern is industry standard |
| [007](decisions/007-runtime-model-download.md) | Runtime model download | 269MB model can't be bundled; HF Hub provides caching and auth |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
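ADR-009's last-token extraction can be illustrated without any ML framework. In a causal model the final real token has attended to every earlier token, so its hidden state summarizes the whole input. The sketch assumes right-padded inputs and a hypothetical helper name `last_token_activation`; hidden states are plain lists here rather than tensors.

```python
def last_token_activation(hidden_states, attention_mask):
    """Return the hidden-state vector of the last non-padding token.

    hidden_states: one vector per token position (list of lists of floats).
    attention_mask: 1 for real tokens, 0 for padding; padding is on the right.
    """
    last_index = sum(attention_mask) - 1
    if last_index < 0:
        raise ValueError("input contains no real tokens")
    return hidden_states[last_index]


# Four positions, the last one is padding, hidden size 3 (made-up values).
hidden_states = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [0.0, 0.0, 0.0],
]
attention_mask = [1, 1, 1, 0]

activation = last_token_activation(hidden_states, attention_mask)
```

With left-padded batches the last position is always a real token, so the index arithmetic would differ; the padding side is a tokenizer setting worth pinning down explicitly.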

## Dependencies on Other Projects

- **metaspline**: The core detection logic (codebook, spline distributions,
  SVD projection, space transforms) is adapted from the metaspline research
  project. The PoC validated the behavioral signal approach; this project
  extracts and productionizes ~1,745 lines of the working subset.

- **reverse-proxy**: The architecture documentation structure and SDD process
  are adapted from the @alkdev/reverse-proxy project. The documentation
  conventions, ADR format, and open questions tracking are reused directly.

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)