feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions


@@ -0,0 +1,41 @@
# ADR-001: Python with uv
## Status
Accepted
## Context
The project needs a programming language and build toolchain. The PoC was
written in Python using PyTorch, sklearn, and transformers. A Rust port using
burn/cubecl was attempted but failed — the ML framework ecosystem in Rust is
not yet mature enough for this type of work.
The project needs a fast path to a usable system. The PoC already works in
Python. Modern Python packaging (uv, pyproject.toml, src layout) provides a
professional project structure that was not available even a few years ago.
## Decision
Use Python 3.10+ with uv as the package manager and build tool. Use uv_build
as the build backend. Use src/ layout for the package.
## Consequences
**Positive**:
- Fast path to working system — PoC code is already Python
- Rich ML ecosystem (PyTorch, transformers, sklearn, safetensors)
- uv provides 10-100x faster dependency management than pip
- Modern packaging standards (pyproject.toml, PEP 735 dependency groups)
- Easy distribution via PyPI with `pip install alknet-firewall[torch]`
- Static type checking via mypy catches many type errors before runtime
**Negative**:
- Python is slower than Rust for non-ML code (SVD projection, data wrangling)
- PyTorch is a large optional dependency (200MB-2.5GB)
- Rust port remains a future goal (Phase 3, speculative)
## References
- [modern-python-project-setup.md](../research/modern-python-project-setup.md)
- [python-ml-packaging.md](../research/python-ml-packaging.md)


@@ -0,0 +1,52 @@
# ADR-002: Behavioral Signal Detection (Not Text Classification)
## Status
Accepted
## Context
Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
text-surface approaches — they classify input text as safe or unsafe. This
fundamentally limits their effectiveness:
- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
and pattern matching
- Novel attack types require retraining classifiers
- Text that looks natural to a classifier can still be adversarial when
processed by a model
Academic research (2024-2025) demonstrates that adversarial inputs produce
distinctive activation patterns in model internals, regardless of surface form.
## Decision
Build a behavioral signal detection system that monitors how a model processes
inputs (hidden state activations), not what the inputs say (text surface).
Adversarial inputs produce anomalous activation patterns that are detectable
even when the text itself looks innocent.
## Consequences
**Positive**:
- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
- Anomalous behavior patterns are attack-type agnostic — novel attacks still
produce anomalous patterns
- Multi-dimensional signals provide interpretable detection (which SVD
directions are activated and by how much)
- Complementary to existing text-surface defenses — can be layered
**Negative**:
- Requires running a model on every input (adds latency and compute cost)
- Detection depends on the detector model sharing architectural similarity
with likely attack targets
- False positives possible for unusual but benign inputs (domain-specific
language, technical content)
- No existing production system validates this approach — we are first
## References
- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
- HiddenDetect (ACL 2025)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- How Alignment and Jailbreak Work (EMNLP 2024)


@@ -0,0 +1,56 @@
# ADR-003: Small Model (~125M) as Detector
## Status
Accepted
## Context
The behavioral signal detection approach requires running a language model on
every input to extract hidden state activations. The choice of model size
creates a trade-off:
- **Large model (7B+)**: Better representation quality, more behavioral signal
resolution. But requires GPU, adds ~200-500ms latency, costs more per check.
- **Small model (~125M)**: Sufficient representation quality for early-layer
safety signals. Runs on CPU, <10ms latency, negligible cost per check.
- **Tiny model (<50M)**: Too small for safety-relevant representations to
emerge. Lacks the depth where behavioral patterns form.
EMNLP 2024 research confirms that safety signals are detectable in early
layers — the model doesn't need deep processing to produce useful signals.
A ~125M model like SmolLM2-135M has enough depth (30 layers, 576 hidden dim)
for safety directions to emerge in early layers.
## Decision
Use a small model (~125M parameters) as the default detector. SmolLM2-135M
(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
CPU. Support model-agnostic detection — any compatible model can be used by
recompiling the codebook.
## Consequences
**Positive**:
- <10ms latency enables real-time pre-inference screening
- CPU-deployable — no GPU required for the firewall
- Can run alongside target model without blocking
- Fast iteration — training/updating a 125M model takes hours, not days
- Small enough to embed in API gateways, CDN edges, client applications
- 269MB model download is feasible via HF Hub with caching
**Negative**:
- Less representation quality than larger models — may miss subtle signals
that a 7B detector would catch
- Detector model must share some architectural similarity with target models
for behavioral signals to transfer
- SmolLM2-135M is English-focused — multilingual detection requires a
multilingual detector model
- Codebook is model-specific — switching models requires recompilation
## References
- [model.md](../model.md)
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Behavioral traits transmit through
non-semantic signals


@@ -0,0 +1,58 @@
# ADR-004: SVD-Based Anomaly Detection
## Status
Accepted
## Context
After extracting hidden state activations from the detector model, the
firewall needs a method to distinguish normal behavioral patterns from
adversarial ones. Options:
- **Single classifier**: Train a binary classifier on activations. Simple but
loses the multi-dimensional structure. Black box.
- **SVD + region comparison**: Decompose activation space into principal
directions, model normal behavioral regions along each direction, detect
inputs that fall outside normal regions. Interpretable, efficient,
multi-dimensional.
- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
detect inputs with high reconstruction error. Complex, not interpretable.
ICML 2025 research shows safety is multi-dimensional in activation space — a
dominant refusal direction plus secondary dimensions. SVD naturally discovers
these directions. Region comparison provides interpretable per-dimension
signals.
## Decision
Use SVD-based anomaly detection: decompose activation space via SVD to
discover principal behavioral directions, model normal regions along each
dimension using monotonic spline distributions, and detect inputs whose
projections fall outside normal regions.
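
The runtime half of this decision can be sketched as follows. This is a minimal illustration, assuming the basis (the top-k singular directions) and the calibration mean were computed offline, for example with `scipy.linalg.svd`. The names `project`, `mean`, and `basis` are hypothetical and not part of the project's API, and the sketch uses plain Python lists where a real implementation would likely use numpy.

```python
from typing import List, Sequence

Vector = Sequence[float]


def project(activation: Vector, mean: Vector, basis: Sequence[Vector]) -> List[float]:
    """Project a centered activation onto k basis directions.

    Cost is O(k * d): one d-length dot product per direction.
    """
    if len(activation) != len(mean):
        raise ValueError("activation and mean must have the same length")
    centered = [a - m for a, m in zip(activation, mean)]
    coords = []
    for direction in basis:
        if len(direction) != len(centered):
            raise ValueError("basis direction has the wrong length")
        coords.append(sum(c * w for c, w in zip(centered, direction)))
    return coords


# Toy example with d = 3 and k = 2 orthonormal directions (hypothetical values).
mean = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
coords = project([3.0, 0.5, 7.0], mean, basis)
```

Each entry of `coords` is the input's position along one behavioral direction, which is the value the per-dimension region model would then score.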
## Consequences
**Positive**:
- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
- Efficient: Projection is O(k·d) for k directions over d-dim activations, trivial at runtime
- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
- Robust: SVD captures structure of entire activation space, not a single
boundary
- Small-model friendly: SVD on 576-dim hidden states is computationally trivial
- Deterministic: `scipy.linalg.svd` computes a full decomposition that is
reproducible for a given input (unlike sklearn's `TruncatedSVD`, whose
default randomized solver depends on a random seed)
**Negative**:
- SVD basis is model-specific — changing detector model requires recomputation
- Basis quality depends on calibration dataset coverage
- Linear decomposition may miss non-linear behavioral patterns
- Requires a codebook compilation pipeline (Phase 2)
- Full SVD on large calibration datasets may be slow (mitigated by
relatively small hidden dim: 576)
## References
- [codebook.md](../codebook.md)
- Hidden Dimensions of LLM Alignment (ICML 2025)
- HiddenDetect (ACL 2025)


@@ -0,0 +1,47 @@
# ADR-005: Safetensors-Only Model Loading
## Status
Accepted
## Context
Model weight files come in two formats:
- **Pickle-based** (`.pt`, `.bin`, `.pth`): Can execute arbitrary Python code
during loading. Known supply chain attack vector.
- **safetensors**: Simple binary format with JSON header. No code execution.
76x faster CPU loading. Zero-copy/lazy loading support.
This is a security product. Loading untrusted pickle files in a security
product is a contradiction. The LiteLLM supply chain attack (CVE-2026-33634,
CVSS 9.4) demonstrated that compromised model files can lead to credential
theft and backdoors.
## Decision
Only load model weights from safetensors format. Never load `.pt`, `.bin`,
or `.pth` files. Apply this policy to both the detector model and the codebook
tensors.
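
One way to enforce this policy before any file is opened is a path check like the sketch below. The names `require_safetensors`, `PICKLE_SUFFIXES`, and `UnsafeWeightFormatError` are hypothetical, chosen for illustration. An extension check is only a first gate: the actual load would still go through the safetensors library.

```python
from pathlib import Path

# Extensions of pickle-based checkpoints that the policy rejects.
PICKLE_SUFFIXES = {".pt", ".bin", ".pth"}


class UnsafeWeightFormatError(ValueError):
    """Raised when a weight file is not in safetensors format."""


def require_safetensors(path: str) -> Path:
    """Return the path if it names a .safetensors file, otherwise raise."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in PICKLE_SUFFIXES:
        raise UnsafeWeightFormatError(
            f"Refusing to load pickle-based weights: {p.name}"
        )
    if suffix != ".safetensors":
        raise UnsafeWeightFormatError(
            f"Unsupported weight format {suffix!r}: {p.name}"
        )
    return p
```

Calling the same check for both the detector model and the codebook tensors keeps the policy in one place.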
## Consequences
**Positive**:
- Eliminates entire class of supply chain attacks via model files
- 76x faster model loading on CPU
- Zero-copy/lazy loading reduces memory usage
- Cross-framework compatible (PyTorch, ONNX, numpy)
- Consistent with HuggingFace's own migration to safetensors-default
**Negative**:
- Some older models only ship `.bin` weights — must convert before use
- Safetensors doesn't support saving optimizer state (irrelevant — we only
do inference)
- Explicit `use_safetensors=True` parameter needed in transformers for older
versions
## References
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 6:
safetensors format comparison
- CVE-2026-33634 — LiteLLM supply chain attack


@@ -0,0 +1,64 @@
# ADR-006: PyTorch as Optional Dependency
## Status
Accepted
## Context
PyTorch is the primary inference backend for the detector model. However,
PyTorch is large:
- `torch` (CPU): ~200MB download, ~700MB installed
- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
- `onnxruntime`: ~30-50MB download, ~300MB installed
Making PyTorch a required dependency would force a 200MB-2.5GB download on
every user, even those who already have PyTorch installed or prefer ONNX
Runtime. This is the standard problem for ML libraries, and the HuggingFace
ecosystem has converged on a solution.
## Decision
Make PyTorch an optional dependency via extras (`pip install
alknet-firewall[torch]`). The base install includes all non-ML dependencies
(sklearn, huggingface-hub, safetensors, tokenizers, numpy). ML inference
backends are installed separately.
Use lazy imports with clear error messages when PyTorch is not installed:
```python
try:
    import torch
except ImportError as exc:
    # Re-raise with install instructions for the optional backend.
    raise ImportError(
        "PyTorch is required for alknet-firewall inference. "
        "Install with: pip install 'alknet-firewall[torch]' "
        "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
    ) from exc
```
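
The same pattern can be wrapped in a helper so the import happens on first use rather than at package import time. This is a sketch under assumptions: `require_backend` and `get_torch` are hypothetical names, not part of the project's API.

```python
import importlib
import importlib.util
from types import ModuleType


def require_backend(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional backend on first use, or fail with an install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is required for alknet-firewall inference. "
            f"Install with: {install_hint}"
        )
    return importlib.import_module(module_name)


def get_torch() -> ModuleType:
    # Called from inference code paths only, so importing the package
    # itself never requires PyTorch.
    return require_backend("torch", "pip install 'alknet-firewall[torch]'")
```

Deferring the import this way lets the base install work for tasks that never touch inference.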
## Consequences
**Positive**:
- Base install is ~30MB download, ~100MB installed — very lightweight
- Users with existing PyTorch installations don't re-download
- ONNX Runtime alternative available for minimal footprint (~100MB total)
- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
hub all use this pattern)
- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
`[[tool.uv.index]]`
**Negative**:
- More complex dependency specification in pyproject.toml
- Users must read installation docs to choose the right extra
- Runtime import errors if users forget to install a backend
- CPU-only torch requires two-step install or uv configuration (can't be
expressed in pip extras alone)
## References
- [modern-python-project-setup.md](../research/modern-python-project-setup.md) —
Section 2: PyTorch handling
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 1:
PyTorch as dependency


@@ -0,0 +1,53 @@
# ADR-007: Runtime Model Download via HuggingFace Hub
## Status
Accepted
## Context
The detector model (SmolLM2-135M) is ~269MB. This is too large to bundle in a
Python package: it exceeds PyPI's default 100MB per-file limit (the default
total project limit is 10GB). Even if it were allowed, a 269MB wheel download
is terrible UX.
Options:
- **Bundle in package**: Not feasible due to size constraints
- **Separate package for model**: Possible but awkward, requires users to
install two packages
- **Runtime download via HuggingFace Hub**: Standard approach used by
transformers. Provides caching, authentication, offline mode, and
checksum verification
- **Custom download (S3, etc.)**: Works but reinvents the wheel
## Decision
Download the detector model at runtime via HuggingFace Hub (`snapshot_download`
or `from_pretrained` with automatic caching). Support offline mode via
`HF_HUB_OFFLINE=1` or `local_files_only=True`. Provide a CLI command for
pre-downloading models in air-gapped environments.
Pin model revisions to specific commit hashes for reproducibility.
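
A minimal sketch of a pinned, offline-capable download follows. It relies on `huggingface_hub.snapshot_download`, which this ADR names. The repo id, the revision placeholder, the file patterns, and the helper names `build_download_kwargs` and `download_model` are assumptions made for illustration.

```python
from typing import Any, Dict

# Assumed values: this ADR does not state the repo id or the pinned commit.
DEFAULT_REPO_ID = "HuggingFaceTB/SmolLM2-135M"
PINNED_REVISION = "<commit-hash>"  # placeholder; replace with a real commit hash


def build_download_kwargs(repo_id: str, revision: str, offline: bool = False) -> Dict[str, Any]:
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "revision": revision,  # pin to a commit for reproducible detection
        "allow_patterns": ["*.safetensors", "*.json"],  # assumed file set
        "local_files_only": offline,  # True: use the local cache, no network
    }


def download_model(
    repo_id: str = DEFAULT_REPO_ID,
    revision: str = PINNED_REVISION,
    offline: bool = False,
) -> str:
    """Download (or locate in cache) the detector model; returns a local path."""
    # Imported lazily so this module loads without the hub client installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(repo_id, revision, offline))
```

Setting `HF_HUB_OFFLINE=1` is handled by the hub client itself, so the `offline` flag here only covers the explicit `local_files_only=True` path.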
## Consequences
**Positive**:
- Package stays small (~30MB base install)
- HuggingFace Hub provides automatic caching, deduplication, and checksum
verification
- Offline mode supported via environment variable
- Authentication for gated models via `HF_TOKEN`
- Standard approach — users familiar with transformers will recognize the
pattern
**Negative**:
- First run requires network access and ~269MB download (with progress bar)
- Model availability depends on HuggingFace Hub uptime
- Users in restricted networks need to pre-download models
- Different model versions may produce different detection results — must
pin revisions
## References
- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 2:
Model file distribution
- [model.md](../model.md)


@@ -0,0 +1,47 @@
# ADR-008: Three-Level Alarm System
## Status
Accepted
## Context
The firewall needs to communicate detection results to downstream systems. The
design choice is how many alarm levels and what they mean.
Alternatives:
- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
don't warrant blocking but should be flagged. Binary forces a single
threshold that either blocks too much (high false positive) or too little
(high false negative).
- **Numeric-only (0.0-1.0 score)**: Maximum information but requires every
consumer to choose their own threshold. No shared vocabulary for what's
actionable.
- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
pre-inference screening system. The difference between "low" and "medium"
is too subtle for consumers to act on differently.
- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
review. Most practical for automated systems.
## Decision
Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
continuous score (0.0-1.0) for consumers that need fine-grained decisions.
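
The mapping from score to level can be sketched as below. The level names come from this ADR. The `Alarm` dataclass, the `classify` function, and the threshold defaults (0.5 and 0.8) are illustrative assumptions, not the project's calibrated values.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"            # pass
    SUSPICIOUS = "suspicious"  # flag for additional review
    DANGEROUS = "dangerous"    # block


@dataclass(frozen=True)
class Alarm:
    level: AlarmLevel
    score: float  # continuous score in [0.0, 1.0]


def classify(score: float, suspicious_at: float = 0.5, dangerous_at: float = 0.8) -> Alarm:
    """Map a continuous score to an alarm level (placeholder thresholds)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if suspicious_at > dangerous_at:
        raise ValueError("suspicious_at must not exceed dangerous_at")
    if score >= dangerous_at:
        level = AlarmLevel.DANGEROUS
    elif score >= suspicious_at:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return Alarm(level=level, score=score)
```

Returning the score alongside the level lets a consumer apply its own cutoff while still sharing the three-level vocabulary.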
## Consequences
**Positive**:
- Clear action mapping: pass, flag, block
- Suspicious level enables defense-in-depth (apply additional checks rather
than binary block/allow)
- Continuous score provides gradient for consumers that need it
- Simple to document and communicate
**Negative**:
- Some consumers may need more granularity (but can use the score field)
- "Suspicious" requires consumers to decide what to do — adds decision burden
## References
- [firewall.md](../firewall.md)


@@ -0,0 +1,55 @@
# ADR-009: Last-Token Activation Extraction
## Status
Accepted
## Context
To extract behavioral signals from the detector model, we must choose which
token's hidden state to use from the sequence of hidden states produced during
inference. Options:
- **Last token**: The hidden state at the final position, which has attended
to the entire sequence. Standard for sequence classification with
decoder-only models (GPT-style models naturally aggregate at the last
position; BERT-style encoders instead pool a dedicated CLS token).
- **Mean pooling**: Average hidden states across all positions. Smooths out
position-specific effects but dilutes signal from safety-relevant tokens.
- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
(LLaMA architecture) does not use a CLS token.
- **First token**: Has seen only the beginning of the sequence. Misses
context from later tokens.
- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
position with extreme activation can dominate.
Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
models because the last position's hidden state has attended to the full
sequence via causal attention. For safety detection, this means the last
token's representation contains the model's "conclusion" about the entire
input.
## Decision
Extract the last token's hidden state at each configured layer. This is
standard for LLaMA-family models and provides full-sequence context.
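
A minimal sketch of the selection step for one layer and one sequence follows, using plain Python lists in place of tensors. The function names are hypothetical. It assumes an attention mask with 1 for real tokens and 0 for padding, and it picks the last real token so that padding on either side does not change the result.

```python
from typing import List, Sequence


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Index of the last real (non-padding) token, for either padding side."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("attention mask contains no real tokens")


def last_token_state(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Select one layer's hidden state at the last non-padding position."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask lengths differ")
    return list(hidden_states[last_token_index(attention_mask)])


# Toy layer output: 3 positions, 2 hidden dims; the final position is padding.
layer = [[0.1, 0.2], [0.3, 0.4], [9.0, 9.0]]
vector = last_token_state(layer, [1, 1, 0])
```

Repeating this per configured layer yields one vector per layer, ready for projection and scoring.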
## Consequences
**Positive**:
- Standard approach for autoregressive models — well-validated
- Full sequence context via causal attention
- Single vector per layer — simple to project and score
- No averaging over padded positions (unlike mean pooling, which must apply
the attention mask), as long as the last non-padding position is selected
**Negative**:
- Position-dependent — the last token's representation is influenced by its
position in the sequence, not just its content
- Very short inputs (1-2 tokens) may not have enough context for meaningful
activation patterns
- May miss patterns in long inputs where the adversarial payload is in the
middle rather than the end
## References
- [model.md](../model.md)
- [codebook.md](../codebook.md)


@@ -0,0 +1,64 @@
# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling
## Status
Accepted
## Context
After projecting activations onto SVD dimensions, the firewall needs to score
how "normal" or "anomalous" a projection is relative to the distribution of
normal inputs. This requires modeling the probability density of normal inputs
along each dimension.
Alternatives:
- **Gaussian**: Simple, well-understood. But real behavioral distributions are
often skewed, multimodal, or heavy-tailed. Gaussian assumes symmetry.
- **Kernel Density Estimation (KDE)**: Non-parametric, flexible. But
bandwidth selection is tricky, and KDE doesn't provide a parametric form for
efficient storage and fast evaluation.
- **Mixture of Gaussians**: More flexible than single Gaussian. But requires
choosing the number of components and risks overfitting.
- **Empirical CDF**: Non-parametric, no assumptions. But requires storing all
calibration data points — not compact.
- **Monotonic spline distributions**: Parametric CDF modeled as a monotonic
spline. Compact (handful of knots), smooth, tail-sensitive, and
differentiable. The CDF is naturally monotonic, which enforces a valid
probability distribution.
## Decision
Use monotonic spline distributions to model behavioral regions along each SVD
dimension. The CDF is represented as a monotonic cubic spline with a small
number of knots (typically 10-20 per dimension). Tail behavior uses
exponential decay beyond the observed range.
The scoring function computes how far a projection falls in the tail of the
distribution — projections well within the normal region score low (CLEAR),
projections near or beyond the tail score increasingly high.
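
The sketch below shows one way such a distribution could look; it is not the metaspline PoC. It assumes monotone cubic Hermite interpolation with harmonic-mean knot slopes (a Fritsch-Carlson style construction), exponential tails matched to the value and slope at the outermost knots, and a two-sided tail score of `1 - 2 * min(F, 1 - F)`. The class name, the scoring formula, and the toy knots are all illustrative assumptions.

```python
import bisect
import math
from typing import List, Sequence


class MonotoneSplineCDF:
    """CDF through (x, F(x)) knots via monotone cubic Hermite interpolation."""

    def __init__(self, xs: Sequence[float], ps: Sequence[float]) -> None:
        if len(xs) != len(ps) or len(xs) < 2:
            raise ValueError("need at least two knots of equal length")
        if any(b <= a for a, b in zip(xs, xs[1:])):
            raise ValueError("knot positions must be strictly increasing")
        if any(b <= a for a, b in zip(ps, ps[1:])) or not (0.0 < ps[0] and ps[-1] < 1.0):
            raise ValueError("CDF values must increase strictly within (0, 1)")
        self.xs, self.ps = list(xs), list(ps)
        # Secant slope of each interval.
        d = [(p1 - p0) / (x1 - x0)
             for x0, x1, p0, p1 in zip(xs, xs[1:], ps, ps[1:])]
        # Harmonic mean of neighbouring secants: each knot slope stays at most
        # twice the smaller secant, which is sufficient for monotonicity.
        inner = [2.0 * a * b / (a + b) for a, b in zip(d, d[1:])]
        self.ms: List[float] = [d[0]] + inner + [d[-1]]

    def cdf(self, x: float) -> float:
        xs, ps, ms = self.xs, self.ps, self.ms
        if x < xs[0]:   # lower exponential tail, matching value and slope
            return ps[0] * math.exp((x - xs[0]) * ms[0] / ps[0])
        if x > xs[-1]:  # upper exponential tail
            q = 1.0 - ps[-1]
            return 1.0 - q * math.exp(-(x - xs[-1]) * ms[-1] / q)
        i = min(bisect.bisect_right(xs, x) - 1, len(xs) - 2)
        h = xs[i + 1] - xs[i]
        t = (x - xs[i]) / h
        h00 = 2 * t**3 - 3 * t**2 + 1   # cubic Hermite basis functions
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        return h00 * ps[i] + h10 * h * ms[i] + h01 * ps[i + 1] + h11 * h * ms[i + 1]

    def tail_score(self, x: float) -> float:
        """0.0 at the median, approaching 1.0 deep in either tail."""
        f = self.cdf(x)
        return 1.0 - 2.0 * min(f, 1.0 - f)


# Toy knots, roughly Gaussian-shaped (hypothetical calibration output).
dist = MonotoneSplineCDF([-2.0, -1.0, 0.0, 1.0, 2.0], [0.02, 0.16, 0.5, 0.84, 0.98])
```

With these knots, a projection at the median scores 0.0 and the score rises smoothly toward 1.0 as the projection moves past the outermost knots, which is the tail-sensitive behavior the decision describes.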
## Consequences
**Positive**:
- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
cliff-edge behavior
- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
inputs without flagging the bulk of normal inputs
- **Parametric compactness**: A handful of spline knots (10-20) represent the
full distribution shape. Very small storage footprint.
- **Differentiability**: Scores are differentiable — potential for future
adversarial training or gradient-based analysis
- **No distributional assumptions**: Unlike Gaussian, spline distributions
handle skew, heavy tails, and non-standard shapes
**Negative**:
- More complex than Gaussian — requires spline fitting during codebook
compilation
- Spline knot selection affects scoring quality — poor knot placement can
miss important distribution features
- Less familiar to most ML practitioners than Gaussian or KDE
## References
- [codebook.md](../codebook.md)
- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)