feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
@@ -0,0 +1,64 @@
# ADR-010: Monotonic Spline Distributions for Behavioral Region Modeling

## Status

Accepted

## Context

After projecting activations onto SVD dimensions, the firewall needs to score
how "normal" or "anomalous" a projection is relative to the distribution of
normal inputs. This requires modeling the probability density of normal inputs
along each dimension.

Alternatives:
- **Gaussian**: Simple and well understood, but it assumes a symmetric,
  unimodal, light-tailed shape. Real behavioral distributions are often
  skewed, multimodal, or heavy-tailed.
- **Kernel Density Estimation (KDE)**: Non-parametric and flexible, but
  bandwidth selection is tricky, and KDE doesn't provide a parametric form for
  efficient storage and fast evaluation.
- **Mixture of Gaussians**: More flexible than a single Gaussian, but requires
  choosing the number of components and risks overfitting.
- **Empirical CDF**: Non-parametric with no shape assumptions, but requires
  storing all calibration data points, so it is not compact.
- **Monotonic spline distributions**: The CDF is modeled as a monotonic
  spline. Compact (a handful of knots), smooth, tail-sensitive, and
  differentiable. Because the spline is constrained to be non-decreasing, the
  implied density is never negative, which keeps the fit a valid probability
  distribution.
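
The validity argument in the last bullet can be stated explicitly. For a differentiable candidate CDF $F$ with density $f = F'$, the requirements are as follows (standard probability facts, written here only as a reminder):

```latex
x_1 < x_2 \;\Rightarrow\; F(x_1) \le F(x_2)
\quad\Longleftrightarrow\quad
f(x) = F'(x) \ge 0 \ \text{for all } x,
\qquad
\lim_{x \to -\infty} F(x) = 0,
\qquad
\lim_{x \to +\infty} F(x) = 1 .
```

A monotonic spline satisfies the first condition by construction. The two limits are not implied by monotonicity alone and have to be supplied separately, for example by a tail model outside the knot range.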

## Decision

Use monotonic spline distributions to model behavioral regions along each SVD
dimension. The CDF is represented as a monotonic cubic spline with a small
number of knots (typically 10–20 per dimension). Tail behavior uses
exponential decay beyond the observed range.
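
A minimal sketch of this representation, under stated assumptions: knots sit at evenly spaced empirical ranks of the calibration samples, the spline is a monotone cubic Hermite interpolant with Fritsch-Carlson style (PCHIP) tangents, and each exponential tail is matched to the spline's value and slope at the boundary knot. The names `SplineCDF` and `pchip_slopes` and the tail-matching rule are illustrative choices, not the metaspline PoC's API. The sketch uses only the standard library; `scipy.interpolate.PchipInterpolator` offers a comparable interpolant.

```python
import bisect
import math
import random


def pchip_slopes(xs, ys):
    """Knot tangents that keep a cubic Hermite interpolant monotone.

    Interior tangents are a weighted harmonic mean of the neighbouring
    secant slopes; end tangents simply reuse the adjacent secant.
    Assumes strictly increasing xs (no duplicate knot positions).
    """
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    d = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]
    m = [0.0] * n
    m[0], m[-1] = d[0], d[-1]
    for i in range(1, n - 1):
        if d[i - 1] > 0 and d[i] > 0:
            w1 = 2 * h[i] + h[i - 1]
            w2 = h[i] + 2 * h[i - 1]
            m[i] = (w1 + w2) / (w1 / d[i - 1] + w2 / d[i])
    return m


class SplineCDF:
    """Hypothetical per-dimension CDF: monotone cubic spline plus exponential tails."""

    def __init__(self, samples, n_knots=12):
        s = sorted(samples)
        n = len(s)
        # Knots at evenly spaced ranks, from the smallest to the largest sample.
        idx = [round(j * (n - 1) / (n_knots - 1)) for j in range(n_knots)]
        self.xs = [s[i] for i in idx]
        self.ps = [(i + 0.5) / n for i in idx]  # mid-rank probabilities in (0, 1)
        self.ms = pchip_slopes(self.xs, self.ps)

    def cdf(self, x):
        xs, ps, ms = self.xs, self.ps, self.ms
        if x < xs[0]:
            # Left tail: decays to 0, matching value and slope at the first knot.
            return ps[0] * math.exp((x - xs[0]) * ms[0] / ps[0])
        if x > xs[-1]:
            # Right tail: approaches 1, matching value and slope at the last knot.
            q = 1.0 - ps[-1]
            return 1.0 - q * math.exp(-(x - xs[-1]) * ms[-1] / q)
        i = min(bisect.bisect_right(xs, x) - 1, len(xs) - 2)
        h = xs[i + 1] - xs[i]
        t = (x - xs[i]) / h
        # Cubic Hermite basis functions on [0, 1].
        h00 = (1 + 2 * t) * (1 - t) ** 2
        h10 = t * (1 - t) ** 2
        h01 = t * t * (3 - 2 * t)
        h11 = t * t * (t - 1)
        return h00 * ps[i] + h10 * h * ms[i] + h01 * ps[i + 1] + h11 * h * ms[i + 1]


# Skewed synthetic "calibration" data, for illustration only.
random.seed(0)
samples = [random.lognormvariate(0.0, 0.5) for _ in range(5000)]
model = SplineCDF(samples, n_knots=12)
for x in (0.2, 1.0, 3.0, 8.0):
    print(f"x={x:4.1f}  F(x)={model.cdf(x):.4f}")
```

Only the knot positions and probabilities need to be stored (24 numbers for 12 knots here); the tangents can be recomputed on load.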

The scoring function measures how far into the tail of the distribution a
projection falls: projections well within the normal region score low (CLEAR),
while projections near or beyond the tail score increasingly high.
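
One way such a score could be derived from a CDF value is sketched below. The mapping (two-sided tail probability turned into a base-10 surprisal) and the names `tail_score` and `logistic_cdf` are assumptions made for illustration; the ADR does not fix the exact formula, and the logistic CDF is only a stand-in for a fitted per-dimension spline.

```python
import math


def tail_score(cdf_value, floor=1e-12):
    """Map a CDF value F(x) in [0, 1] to a non-negative anomaly score.

    Uses the two-sided tail probability p = 2 * min(F, 1 - F) and returns
    log10(1 / p): 0 at the median, 1 when p = 0.1, 2 when p = 0.01, and so on.
    The floor caps the score when p underflows to zero.
    """
    p = 2.0 * min(cdf_value, 1.0 - cdf_value)
    return math.log10(1.0 / max(p, floor))


def logistic_cdf(x):
    """Stand-in for a fitted per-dimension CDF (not the real codebook)."""
    return 1.0 / (1.0 + math.exp(-x))


for x in (0.0, 1.0, 3.0, 6.0, -6.0):
    print(f"x={x:+.1f}  score={tail_score(logistic_cdf(x)):.3f}")
```

The score grows smoothly as the projection moves away from the bulk of the distribution, so any cutoff applied downstream only decides where on this curve an alarm level changes.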

## Consequences

**Positive**:
- **Smooth scoring**: Continuous score rather than hard threshold, avoiding
  cliff-edge behavior
- **Tail sensitivity**: Exponential tails capture rare-but-critical anomalous
  inputs without flagging the bulk of normal inputs
- **Parametric compactness**: A handful of spline knots (10–20) represent the
  full distribution shape, so the storage footprint is very small
- **Differentiability**: Scores are differentiable, which leaves room for
  future adversarial training or gradient-based analysis
- **No fixed distributional shape**: Unlike a Gaussian, spline distributions
  handle skew, heavy tails, and other non-standard shapes

**Negative**:
- More complex than a Gaussian: requires spline fitting during codebook
  compilation
- Spline knot selection affects scoring quality: poor knot placement can
  miss important distribution features
- Less familiar to most ML practitioners than Gaussian or KDE
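
The knot-placement risk can be illustrated with a small experiment. The sketch below compares two hypothetical strategies on skewed synthetic data: knots evenly spaced in value versus knots at evenly spaced ranks (quantiles). Neither is confirmed as the project's method; the function names are made up for this example.

```python
import bisect
import random


def uniform_knots(sorted_samples, k):
    """k knots evenly spaced in value between the min and max sample."""
    lo, hi = sorted_samples[0], sorted_samples[-1]
    return [lo + (hi - lo) * j / (k - 1) for j in range(k)]


def quantile_knots(sorted_samples, k):
    """k knots at evenly spaced ranks, so each interval holds similar mass."""
    n = len(sorted_samples)
    return [sorted_samples[round(j * (n - 1) / (k - 1))] for j in range(k)]


def counts_per_interval(sorted_samples, knots):
    """Number of calibration samples falling between consecutive knots."""
    counts = [0] * (len(knots) - 1)
    for x in sorted_samples:
        i = bisect.bisect_right(knots, x) - 1
        counts[min(max(i, 0), len(counts) - 1)] += 1
    return counts


# Heavily right-skewed synthetic data, for illustration only.
random.seed(0)
samples = sorted(random.lognormvariate(0.0, 1.0) for _ in range(5000))
u_counts = counts_per_interval(samples, uniform_knots(samples, 12))
q_counts = counts_per_interval(samples, quantile_knots(samples, 12))
print("uniform :", u_counts)
print("quantile:", q_counts)
```

With value-spaced knots, most of the samples land in the first interval, so the spline has almost no resolution where the data actually lives. Rank-spaced knots spread the samples evenly across intervals, at the cost of coarse coverage in the sparse tail.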

## References

- [codebook.md](../codebook.md)
- metaspline PoC: `spline.py`, `transform.py`, `space.py` (~280 lines total)