feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00
parent 141628bae4
commit cf464c2296
23 changed files with 3900 additions and 44 deletions
--- a/docs/architecture/decisions/001-python-uv.md
+++ b/docs/architecture/decisions/001-python-uv.md
@@ -0,0 +1,41 @@
+# ADR-001: Python with uv
+
+## Status
+
+Accepted
+
+## Context
+
+The project needs a programming language and build toolchain. The PoC was
+written in Python using PyTorch, sklearn, and transformers. A Rust port using
+burn/cubecl was attempted but failed — the ML framework ecosystem in Rust is
+not yet mature enough for this type of work.
+
+The project needs a fast path to a usable system. The PoC already works in
+Python. Modern Python packaging (uv, pyproject.toml, src layout) provides a
+professional project structure that was not available even a few years ago.
+
+## Decision
+
+Use Python 3.10+ with uv as the package manager and build tool. Use uv_build
+as the build backend. Use src/ layout for the package.
+
+## Consequences
+
+**Positive**:
+- Fast path to working system — PoC code is already Python
+- Rich ML ecosystem (PyTorch, transformers, sklearn, safetensors)
+- uv provides 10-100x faster dependency management than pip
+- Modern packaging standards (pyproject.toml, PEP 735 dependency groups)
+- Easy distribution via PyPI with `pip install alknet-firewall[torch]`
+- Static type checking via mypy catches a class of type errors before runtime
+
+**Negative**:
+- Python is slower than Rust for non-ML code (SVD projection, data wrangling)
+- PyTorch is a large optional dependency (200MB-2.5GB)
+- Rust port remains a future goal (Phase 3, speculative)
+
+## References
+
+- [modern-python-project-setup.md](../research/modern-python-project-setup.md)
+- [python-ml-packaging.md](../research/python-ml-packaging.md)
--- a/docs/architecture/decisions/002-behavioral-signals.md
+++ b/docs/architecture/decisions/002-behavioral-signals.md
@@ -0,0 +1,52 @@
+# ADR-002: Behavioral Signal Detection (Not Text Classification)
+
+## Status
+
+Accepted
+
+## Context
+
+Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are
+text-surface approaches — they classify input text as safe or unsafe. This
+fundamentally limits their effectiveness:
+
+- Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword
+  and pattern matching
+- Novel attack types require retraining classifiers
+- Text that looks natural to a classifier can still be adversarial when
+  processed by a model
+
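+Academic research from 2024-2025 (HiddenDetect, ACL 2025; Hidden Dimensions of
+LLM Alignment, ICML 2025; How Alignment and Jailbreak Work, EMNLP 2024)
+reports that adversarial inputs produce distinctive activation patterns in
+model internals, largely independent of surface form.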
+
+## Decision
+
+Build a behavioral signal detection system that monitors how a model processes
+inputs (hidden state activations), not what the inputs say (text surface).
+Adversarial inputs produce anomalous activation patterns that are detectable
+even when the text itself looks innocent.
+
+## Consequences
+
+**Positive**:
+- Catches obfuscated, multilingual, and novel attacks that text classifiers miss
+- Anomalous behavior patterns are attack-type agnostic — novel attacks still
+  produce anomalous patterns
+- Multi-dimensional signals provide interpretable detection (which SVD
+  directions are activated and by how much)
+- Complementary to existing text-surface defenses — can be layered
+
+**Negative**:
+- Requires running a model on every input (adds latency and compute cost)
+- Detection depends on the detector model sharing architectural similarity
+  with likely attack targets
+- False positives possible for unusual but benign inputs (domain-specific
+  language, technical content)
+- No production system that we know of validates this approach; we appear to
+  be first
+
+## References
+
+- [llm-input-safety-landscape.md](../research/llm-input-safety-landscape.md)
+- HiddenDetect (ACL 2025)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- How Alignment and Jailbreak Work (EMNLP 2024)
--- a/docs/architecture/decisions/003-small-model-detector.md
+++ b/docs/architecture/decisions/003-small-model-detector.md
@@ -0,0 +1,56 @@
+# ADR-003: Small Model (~125M) as Detector
+
+## Status
+
+Accepted
+
+## Context
+
+The behavioral signal detection approach requires running a language model on
+every input to extract hidden state activations. The choice of model size
+creates a trade-off:
+
+- **Large model (7B+)**: Better representation quality, more behavioral signal
+  resolution. But requires GPU, adds ~200-500ms latency, costs more per check.
+- **Small model (~125M)**: Sufficient representation quality for early-layer
+  safety signals. Runs on CPU, <10ms latency, negligible cost per check.
+- **Tiny model (<50M)**: Too small for safety-relevant representations to
+  emerge. Lacks the depth where behavioral patterns form.
+
+The EMNLP 2024 study "How Alignment and Jailbreak Work" reports that safety
+signals are detectable in early layers, so the model doesn't need deep
+processing to produce useful signals. A ~125M-class model like SmolLM2-135M
+(30 layers, 576 hidden dim) has enough depth for safety directions to emerge
+in early layers.
+
+## Decision
+
+Use a small model (~125M parameters) as the default detector. SmolLM2-135M
+(269MB, 30 layers, 576 hidden dim) is the default. Target <10ms latency on
+CPU. Support model-agnostic detection — any compatible model can be used by
+recompiling the codebook.
+
+## Consequences
+
+**Positive**:
+- <10ms latency enables real-time pre-inference screening
+- CPU-deployable — no GPU required for the firewall
+- Can run alongside target model without blocking
+- Fast iteration — training/updating a 125M model takes hours, not days
+- Small enough to embed in API gateways, CDN edges, client applications
+- 269MB model download is feasible via HF Hub with caching
+
+**Negative**:
+- Less representation quality than larger models — may miss subtle signals
+  that a 7B detector would catch
+- Detector model must share some architectural similarity with target models
+  for behavioral signals to transfer
+- SmolLM2-135M is English-focused — multilingual detection requires a
+  multilingual detector model
+- Codebook is model-specific — switching models requires recompilation
+
+## References
+
+- [model.md](../model.md)
+- How Alignment and Jailbreak Work (EMNLP 2024): safety signals detectable in
+  early layers
+- Subliminal Learning (Nature 2026): Behavioral traits transmit through
+  non-semantic signals
--- a/docs/architecture/decisions/004-svd-based-detection.md
+++ b/docs/architecture/decisions/004-svd-based-detection.md
@@ -0,0 +1,58 @@
+# ADR-004: SVD-Based Anomaly Detection
+
+## Status
+
+Accepted
+
+## Context
+
+After extracting hidden state activations from the detector model, the
+firewall needs a method to distinguish normal behavioral patterns from
+adversarial ones. Options:
+
+- **Single classifier**: Train a binary classifier on activations. Simple but
+  loses the multi-dimensional structure. Black box.
+- **SVD + region comparison**: Decompose activation space into principal
+  directions, model normal behavioral regions along each direction, detect
+  inputs that fall outside normal regions. Interpretable, efficient,
+  multi-dimensional.
+- **Autoencoder anomaly detection**: Train an autoencoder on normal inputs,
+  detect inputs with high reconstruction error. Complex, not interpretable.
+
+Hidden Dimensions of LLM Alignment (ICML 2025) reports that safety is
+multi-dimensional in activation space: a dominant refusal direction plus
+secondary dimensions. SVD is a natural way to surface orthogonal directions of
+this kind from calibration activations, and region comparison provides
+interpretable per-dimension signals.
+
+## Decision
+
+Use SVD-based anomaly detection: decompose activation space via SVD to
+discover principal behavioral directions, model normal regions along each
+dimension using monotonic spline distributions, and detect inputs whose
+projections fall outside normal regions.
+
+## Consequences
+
+**Positive**:
+- Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
+- Efficient: Projecting a d-dim activation onto k directions is O(k·d),
+  trivial at runtime
+- Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
+- Robust: SVD captures structure of entire activation space, not a single
+  boundary
+- Small-model friendly: SVD on 576-dim hidden states is computationally trivial
+- Deterministic: `scipy.linalg.svd` computes a full, reproducible
+  decomposition (up to floating-point effects), unlike scikit-learn's
+  `TruncatedSVD`, whose default solver is randomized
+
+**Negative**:
+- SVD basis is model-specific — changing detector model requires recomputation
+- Basis quality depends on calibration dataset coverage
+- Linear decomposition may miss non-linear behavioral patterns
+- Requires a codebook compilation pipeline (Phase 2)
+- Full SVD on large calibration datasets may be slow (mitigated by the
+  relatively small hidden dim: 576)
+
+## References
+
+- [codebook.md](../codebook.md)
+- Hidden Dimensions of LLM Alignment (ICML 2025)
+- HiddenDetect (ACL 2025)
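
A minimal sketch of the ADR-004 decision, assuming calibration activations arrive as a `(n_samples, hidden_dim)` array. The names `fit_basis` and `project` are hypothetical, mean-centering is an assumption the ADR does not state, and `numpy.linalg.svd` stands in for the `scipy.linalg.svd` call the ADR mentions. Random data replaces real last-token activations.

```python
import numpy as np


def fit_basis(calibration: np.ndarray, k: int):
    """Return (mean, basis); basis has shape (k, hidden_dim)."""
    mean = calibration.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(calibration - mean, full_matrices=False)
    return mean, vt[:k]


def project(activation: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Coordinates of one activation vector along the k kept directions."""
    # One (k, d) by (d,) product: O(k * d) work per input.
    return basis @ (activation - mean)


rng = np.random.default_rng(0)
calib = rng.normal(size=(200, 16))  # stand-in for real calibration activations
mean, basis = fit_basis(calib, k=4)
coords = project(calib[0], mean, basis)
```

Each entry of `coords` would then be scored against the normal region modeled for that direction.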
--- a/docs/architecture/decisions/005-safetensors-only.md
+++ b/docs/architecture/decisions/005-safetensors-only.md
@@ -0,0 +1,47 @@
+# ADR-005: Safetensors-Only Model Loading
+
+## Status
+
+Accepted
+
+## Context
+
+Model weight files come in two formats:
+
+- **Pickle-based** (`.pt`, `.bin`, `.pth`): Can execute arbitrary Python code
+  during loading. Known supply chain attack vector.
+- **safetensors**: Simple binary format with JSON header. No code execution.
+  About 76x faster CPU loading in the safetensors project's own benchmark.
+  Zero-copy/lazy loading support.
+
+This is a security product. Loading untrusted pickle files in a security
+product is a contradiction. The LiteLLM supply chain attack (CVE-2026-33634,
+CVSS 9.4) demonstrated that compromised model files can lead to credential
+theft and backdoors.
+
+## Decision
+
+Only load model weights from safetensors format. Never load `.pt`, `.bin`,
+or `.pth` files. Apply this policy to both the detector model and the codebook
+tensors.
+
+## Consequences
+
+**Positive**:
+- Eliminates entire class of supply chain attacks via model files
+- About 76x faster model loading on CPU (safetensors project benchmark)
+- Zero-copy/lazy loading reduces memory usage
+- Cross-framework compatible (PyTorch, TensorFlow, JAX, NumPy)
+- Consistent with HuggingFace's own migration to safetensors-default
+
+**Negative**:
+- Some older models only ship `.bin` weights — must convert before use
+- Safetensors doesn't support saving optimizer state (irrelevant — we only
+  do inference)
+- Explicit `use_safetensors=True` parameter needed in transformers for older
+  versions
+
+## References
+
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 6:
+  safetensors format comparison
+- CVE-2026-33634 — LiteLLM supply chain attack
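
The ADR-005 policy can be enforced before any file is opened. The sketch below is one possible guard, using only the standard library; `check_weight_path` and `UnsafeWeightFormatError` are hypothetical names, not part of the specified API.

```python
from pathlib import Path

# Pickle-based suffixes named in ADR-005.
PICKLE_SUFFIXES = {".pt", ".bin", ".pth"}


class UnsafeWeightFormatError(ValueError):
    """Raised when a weight file is not in safetensors format."""


def check_weight_path(path) -> Path:
    """Return the path if it is a .safetensors file, otherwise raise."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in PICKLE_SUFFIXES:
        raise UnsafeWeightFormatError(
            f"{p.name}: pickle-based weights are never loaded; convert to safetensors"
        )
    if suffix != ".safetensors":
        raise UnsafeWeightFormatError(f"{p.name}: expected a .safetensors file")
    return p
```

A suffix check only rejects files by name. The actual parsing would still go through the safetensors library, which does not execute code on load.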
--- a/docs/architecture/decisions/006-optional-pytorch.md
+++ b/docs/architecture/decisions/006-optional-pytorch.md
@@ -0,0 +1,64 @@
+# ADR-006: PyTorch as Optional Dependency
+
+## Status
+
+Accepted
+
+## Context
+
+PyTorch is the primary inference backend for the detector model. However,
+PyTorch is large:
+
+- `torch` (CPU): ~200MB download, ~700MB installed
+- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
+- `onnxruntime`: ~30-50MB download, ~300MB installed
+
+Making PyTorch a required dependency would force a 200MB-2.5GB download on
+every user, even those who already have PyTorch installed or prefer ONNX
+Runtime. This is the standard problem for ML libraries, and the HuggingFace
+ecosystem has converged on a solution.
+
+## Decision
+
+Make PyTorch an optional dependency via extras (`pip install
+alknet-firewall[torch]`). The base install includes everything except the
+inference backend (scikit-learn, huggingface-hub, safetensors, tokenizers,
+numpy). ML inference backends are installed separately.
+
+Use lazy imports with clear error messages when PyTorch is not installed:
+
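+```python
+# `_require_torch` is an illustrative helper name.
+def _require_torch():
+    """Import torch on first use so the base install works without it."""
+    try:
+        import torch
+    except ImportError as exc:
+        # Chain the original error so the root cause stays visible.
+        raise ImportError(
+            "PyTorch is required for alknet-firewall inference. "
+            "Install with: pip install 'alknet-firewall[torch]' "
+            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
+        ) from exc
+    return torch
+```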
+
+## Consequences
+
+**Positive**:
+- Base install is ~30MB download, ~100MB installed — very lightweight
+- Users with existing PyTorch installations don't re-download
+- ONNX Runtime alternative available for minimal footprint (under ~100MB total
+  download together with the base install)
+- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
+  hub all use this pattern)
+- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
+  `[[tool.uv.index]]`
+
+**Negative**:
+- More complex dependency specification in pyproject.toml
+- Users must read installation docs to choose the right extra
+- Runtime import errors if users forget to install a backend
+- CPU-only torch requires two-step install or uv configuration (can't be
+  expressed in pip extras alone)
+
+## References
+
+- [modern-python-project-setup.md](../research/modern-python-project-setup.md) —
+  Section 2: PyTorch handling
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 1:
+  PyTorch as dependency
--- a/docs/architecture/decisions/007-runtime-model-download.md
+++ b/docs/architecture/decisions/007-runtime-model-download.md
@@ -0,0 +1,53 @@
+# ADR-007: Runtime Model Download via HuggingFace Hub
+
+## Status
+
+Accepted
+
+## Context
+
+The detector model (SmolLM2-135M) is ~269MB. This is too large to bundle in a
+Python package: PyPI's default limits are 100 MiB per file and 10 GiB per
+project. Even if a larger file were allowed, a 269MB wheel download is
+terrible UX.
+
+Options:
+- **Bundle in package**: Not feasible due to size constraints
+- **Separate package for model**: Possible but awkward, requires users to
+  install two packages
+- **Runtime download via HuggingFace Hub**: Standard approach used by
+  transformers. Provides caching, authentication, offline mode, and
+  checksum verification
+- **Custom download (S3, etc.)**: Works but reinvents the wheel
+
+## Decision
+
+Download the detector model at runtime via HuggingFace Hub (`snapshot_download`
+or `from_pretrained` with automatic caching). Support offline mode via
+`HF_HUB_OFFLINE=1` or `local_files_only=True`. Provide a CLI command for
+pre-downloading models in air-gapped environments.
+
+Pin model revisions to specific commit hashes for reproducibility.
+
+## Consequences
+
+**Positive**:
+- Package stays small (~30MB base install)
+- HuggingFace Hub provides automatic caching, deduplication, and checksum
+  verification
+- Offline mode supported via environment variable
+- Authentication for gated models via `HF_TOKEN`
+- Standard approach — users familiar with transformers will recognize the
+  pattern
+
+**Negative**:
+- First run requires network access and ~269MB download (with progress bar)
+- Model availability depends on HuggingFace Hub uptime
+- Users in restricted networks need to pre-download models
+- Different model versions may produce different detection results — must
+  pin revisions
+
+## References
+
+- [python-ml-packaging.md](../research/python-ml-packaging.md) — Section 2:
+  Model file distribution
+- [model.md](../model.md)
--- a/docs/architecture/decisions/008-three-level-alarm.md
+++ b/docs/architecture/decisions/008-three-level-alarm.md
@@ -0,0 +1,47 @@
+# ADR-008: Three-Level Alarm System
+
+## Status
+
+Accepted
+
+## Context
+
+The firewall needs to communicate detection results to downstream systems. The
+design choice is how many alarm levels and what they mean.
+
+Alternatives:
+- **Binary (safe/unsafe)**: Simple but loses nuance. Many suspicious inputs
+  don't warrant blocking but should be flagged. Binary forces a single
+  threshold that either blocks too much (high false positive) or too little
+  (high false negative).
+- **Numeric-only (0.0–1.0 score)**: Maximum information but requires every
+  consumer to choose their own threshold. No shared vocabulary for what's
+  actionable.
+- **Five-tier** (safe/low/medium/high/critical): Over-engineered for a
+  pre-inference screening system. The difference between "low" and "medium"
+  is too subtle for consumers to act on differently.
+- **Three-tier** (clear/suspicious/dangerous): Balances simplicity with
+  nuance. Clear = pass. Dangerous = block. Suspicious = flag for additional
+  review. Most practical for automated systems.
+
+## Decision
+
+Use three alarm levels: `CLEAR`, `SUSPICIOUS`, `DANGEROUS`. Include a
+continuous score (0.0–1.0) for consumers that need fine-grained decisions.
+
+## Consequences
+
+**Positive**:
+- Clear action mapping: pass, flag, block
+- Suspicious level enables defense-in-depth (apply additional checks rather
+  than binary block/allow)
+- Continuous score provides gradient for consumers that need it
+- Simple to document and communicate
+
+**Negative**:
+- Some consumers may need more granularity (but can use the score field)
+- "Suspicious" requires consumers to decide what to do — adds decision burden
+
+## References
+
+- [firewall.md](../firewall.md)
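
The three levels plus a continuous score from ADR-008 could be modeled as below. This is a sketch only: `AlarmLevel`, `Verdict`, and `classify` are hypothetical names, and the thresholds 0.5 and 0.8 are placeholders rather than calibrated values.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"            # pass
    SUSPICIOUS = "suspicious"  # flag for additional review
    DANGEROUS = "dangerous"    # block


@dataclass(frozen=True)
class Verdict:
    level: AlarmLevel
    score: float  # continuous score in [0.0, 1.0]


def classify(score: float, suspicious_at: float = 0.5, dangerous_at: float = 0.8) -> Verdict:
    """Map a score to an alarm level using two placeholder thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if suspicious_at > dangerous_at:
        raise ValueError("suspicious_at must not exceed dangerous_at")
    if score >= dangerous_at:
        level = AlarmLevel.DANGEROUS
    elif score >= suspicious_at:
        level = AlarmLevel.SUSPICIOUS
    else:
        level = AlarmLevel.CLEAR
    return Verdict(level=level, score=score)
```

Consumers that need finer granularity can ignore `level` and act on `score` directly.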
--- a/docs/architecture/decisions/009-last-token-extraction.md
+++ b/docs/architecture/decisions/009-last-token-extraction.md
@@ -0,0 +1,55 @@
+# ADR-009: Last-Token Activation Extraction
+
+## Status
+
+Accepted
+
+## Context
+
+To extract behavioral signals from the detector model, we must choose which
+token's hidden state to use from the sequence of hidden states produced during
+inference. Options:
+
+- **Last token**: The hidden state at the final position, which has attended
+  to the entire sequence. Standard for sequence classification with
+  decoder-only models (GPT-style models naturally aggregate at the last
+  position, whereas BERT-style encoders pool the CLS token).
+- **Mean pooling**: Average hidden states across all positions. Smooths out
+  position-specific effects but dilutes signal from safety-relevant tokens.
+- **CLS token**: A dedicated classification token (BERT-style). SmolLM2-135M
+  (LLaMA architecture) does not use a CLS token.
+- **First token**: Has seen only the beginning of the sequence. Misses
+  context from later tokens.
+- **Max pooling**: Per-dimension maximum across positions. Noisy — a single
+  position with extreme activation can dominate.
+
+Last-token extraction is the standard for autoregressive (GPT/LLaMA-style)
+models because the last position's hidden state has attended to the full
+sequence via causal attention. For safety detection, this means the last
+token's representation contains the model's "conclusion" about the entire
+input.
+
+## Decision
+
+Extract the last token's hidden state at each configured layer. This is
+standard for LLaMA-family models and provides full-sequence context.
+
+## Consequences
+
+**Positive**:
+- Standard approach for autoregressive models — well-validated
+- Full sequence context via causal attention
+- Single vector per layer — simple to project and score
+- No averaging over padded positions (unlike mean pooling); batched inputs
+  still need left padding or mask-based indexing to locate the last real token
+
+**Negative**:
+- Position-dependent — the last token's representation is influenced by its
+  position in the sequence, not just its content
+- Very short inputs (1–2 tokens) may not have enough context for meaningful
+  activation patterns
+- May miss patterns in long inputs where the adversarial payload is in the
+  middle rather than the end
+
+## References
+
+- [model.md](../model.md)
+- [codebook.md](../codebook.md)
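
For batched inputs, the "last token" of ADR-009 is the last non-padding position, which depends on the padding side. The NumPy sketch below assumes hidden states of shape `(batch, seq, dim)` and a 0/1 attention mask. `last_token_states` is a hypothetical name, and the arrays are toy data rather than model output.

```python
import numpy as np


def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state at each row's last real (mask == 1) position."""
    seq_len = attention_mask.shape[1]
    # argmax on the reversed mask finds the distance from the end to the last 1,
    # which works for both left- and right-padded rows.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]


hidden = np.arange(30).reshape(2, 5, 3)  # toy (batch=2, seq=5, dim=3) tensor
mask = np.array([
    [1, 1, 1, 0, 0],  # right-padded: last real token at index 2
    [0, 0, 1, 1, 1],  # left-padded: last real token at index 4
])
picked = last_token_states(hidden, mask)
```

Rows whose mask is all zeros fall back to the final position, so empty inputs would need separate handling.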
--- a/docs/architecture/decisions/010-monotonic-spline-distributions.md
+++ b/docs/architecture/decisions/010-monotonic-spline-distributions.md
@@ -0,0 +1,64 @@
+# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling
+
+## Status
+
+Accepted
+
+## Context
+
+After projecting activations onto SVD dimensions, the firewall needs to score
+how "normal" or "anomalous" a projection is relative to the distribution of
+normal inputs. This requires modeling the probability density of normal inputs
+along each dimension.
+
+Alternatives:
+- **Gaussian**: Simple, well-understood. But real behavioral distributions are
+  often skewed, multimodal, or heavy-tailed. Gaussian assumes symmetry.
+- **Kernel Density Estimation (KDE)**: Non-parametric, flexible. But
+  bandwidth selection is tricky, and KDE doesn't provide a parametric form for
+  efficient storage and fast evaluation.
+- **Mixture of Gaussians**: More flexible than single Gaussian. But requires
+  choosing the number of components and risks overfitting.
+- **Empirical CDF**: Non-parametric, no assumptions. But requires storing all
+  calibration data points — not compact.
+- **Monotonic spline distributions**: Parametric CDF modeled as a monotonic
+  spline. Compact (handful of knots), smooth, tail-sensitive, and
+  differentiable. Constraining the spline to be monotonic keeps the CDF
+  non-decreasing, so the implied density is never negative.
+
+## Decision
+
+Use monotonic spline distributions to model behavioral regions along each SVD
+dimension. The CDF is represented as a monotonic cubic spline with a small
+number of knots (typically 10–20 per dimension). Tail behavior uses
+exponential decay beyond the observed range.
+
+The scoring function computes how far a projection falls in the tail of the
+distribution — projections well within the normal region score low (CLEAR),
+projections near or beyond the tail score increasingly high.
+
+## Consequences
+
+**Positive**:
+- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
+  cliff-edge behavior
+- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
+  inputs without flagging the bulk of normal inputs
+- **Parametric compactness**: A handful of spline knots (10–20) represent the
+  full distribution shape. Very small storage footprint.
+- **Differentiability**: Scores are differentiable — potential for future
+  adversarial training or gradient-based analysis
+- **No distributional assumptions**: Unlike Gaussian, spline distributions
+  handle skew, heavy tails, and non-standard shapes
+
+**Negative**:
+- More complex than Gaussian — requires spline fitting during codebook
+  compilation
+- Spline knot selection affects scoring quality — poor knot placement can
+  miss important distribution features
+- Less familiar to most ML practitioners than Gaussian or KDE
+
+## References
+
+- [codebook.md](../codebook.md)
+- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)
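
One way to realize the ADR-010 idea is sketched below. It is not the metaspline PoC: SciPy's `PchipInterpolator` (a monotone cubic interpolant) is swapped in for the spline, knots are placed at sample quantiles, and the exponential tail rate is taken from the slope of the outermost knot interval. `make_cdf`, `tail_score`, `n_knots=12`, and `tail_mass=0.01` are all illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def make_cdf(samples: np.ndarray, n_knots: int = 12, tail_mass: float = 0.01):
    """Fit a monotone cubic CDF through quantile knots, with exponential tails."""
    probs = np.linspace(tail_mass, 1.0 - tail_mass, n_knots)
    knots = np.quantile(samples, probs)  # must be strictly increasing
    spline = PchipInterpolator(knots, probs)
    lo, hi = knots[0], knots[-1]
    # Decay rates from the secant slope of the first and last knot interval.
    rate_lo = (probs[1] - probs[0]) / (knots[1] - knots[0]) / tail_mass
    rate_hi = (probs[-1] - probs[-2]) / (knots[-1] - knots[-2]) / tail_mass

    def cdf(x: float) -> float:
        if x < lo:
            return float(tail_mass * np.exp(-rate_lo * (lo - x)))
        if x > hi:
            return float(1.0 - tail_mass * np.exp(-rate_hi * (x - hi)))
        return float(spline(x))

    return cdf


def tail_score(x: float, cdf) -> float:
    """Two-sided score: 0 at the median, approaching 1 deep in either tail."""
    return abs(2.0 * cdf(x) - 1.0)


rng = np.random.default_rng(0)
cdf = make_cdf(rng.normal(size=5000))  # stand-in for one SVD dimension
typical, extreme = tail_score(0.0, cdf), tail_score(6.0, cdf)
```

The CDF is continuous at both boundary knots, but its slope is only approximately matched there. A production fit would likely treat knot placement and tail matching more carefully.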