The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.
Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
and Token-Level Smoothing section; add classifier weights and direction
profiles to the data format; update Internal API with
decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
decomposition, smoothing, and direction classification; update score
composition to use direction-level P(active); update DimensionSignal
dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note that last-token extraction is the Phase 1 simplification and
per-token extraction is the full pipeline
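
The shape conventions in the list above can be illustrated with a short NumPy sketch. The sequence lengths, the random values, and the variable names (`z_train`, `z_infer`, `z_last`) are hypothetical and not taken from the codebase; only the (N, 3) and (seq_len, 3) shapes and the last-token versus per-token distinction come from the spec updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token z-coordinates for three training prompts of
# different lengths; each prompt contributes a (seq_len_i, 3) array.
train_seqs = [rng.normal(size=(n, 3)) for n in (5, 8, 4)]

# Training: token positions from all prompts are flattened into a single
# (N, 3) array, where N is the total number of tokens (5 + 8 + 4 = 17).
z_train = np.concatenate(train_seqs, axis=0)

# Inference: one prompt at a time, with token order preserved,
# giving a (seq_len, 3) sequence.
z_infer = rng.normal(size=(6, 3))

# Assumed reading of the Phase 1 simplification: keep only the final
# token's coordinates, a single vector of shape (3,).
z_last = z_infer[-1]

print(z_train.shape, z_infer.shape, z_last.shape)
```

Under this reading, Phase 2 would hand the whole `z_infer` sequence to the downstream steps, while Phase 1 would hand over only `z_last`.
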

Phase 0→1 (Exploration → Architecture): The project has a working PoC
demonstrating that behavioral signals from small language models can detect
adversarial inputs. The core detection logic (~1,745 lines) works reasonably
well but lacks tests, carries an oversized codebook, and still needs to be
extracted from the research codebase into a properly structured Python package.

This project extracts and productionizes the behavioral signal detection
approach from the metaspline research project. A ~135M-parameter model
(SmolLM2-135M) processes untrusted inputs and produces hidden state
activations. SVD-based dimensionality reduction on these activations reveals
behavioral patterns — normal inputs cluster in expected regions while
adversarial inputs produce anomalous activation signatures. The system
raises "behavioral alarms" without needing to know specific attack types.
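
As a rough illustration of the idea described above, the following sketch fits a 3-component SVD basis on synthetic "normal" activations, standardizes the projections into z-coordinates, and flags an input whose coordinates fall far outside the reference range. This is a minimal sketch under stated assumptions, not the project's implementation: the function names `fit_svd_basis` and `z_coordinates`, the standardization step, the synthetic data, and the alarm threshold of 4.0 are all hypothetical.

```python
import numpy as np

def fit_svd_basis(acts: np.ndarray, k: int = 3):
    """Fit a k-dimensional basis from reference activations (N, hidden_dim)."""
    mean = acts.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing
    # singular value; the first k span the dominant directions.
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    basis = vt[:k].T                       # (hidden_dim, k)
    scale = ((acts - mean) @ basis).std(axis=0)
    return mean, basis, scale

def z_coordinates(acts, mean, basis, scale):
    """Project onto the basis and standardize each axis (assumed definition)."""
    return ((acts - mean) @ basis) / scale

rng = np.random.default_rng(0)
hidden_dim = 16  # toy size, chosen only to keep the example small

# Synthetic stand-in for activations on normal inputs, with variance
# concentrated in the first few dimensions.
normal = rng.normal(size=(500, hidden_dim)) * np.linspace(3.0, 0.5, hidden_dim)
mean, basis, scale = fit_svd_basis(normal)
z_normal = z_coordinates(normal, mean, basis, scale)

# Synthetic anomalous activation, displaced by 10 reference standard
# deviations along the first component.
shifted = mean + 10.0 * scale[0] * basis[:, 0]
z_shifted = z_coordinates(shifted[None, :], mean, basis, scale)

# Hypothetical alarm rule: any coordinate more than 4 standard deviations out.
alarm = bool(np.abs(z_shifted).max() > 4.0)
print(z_shifted.round(3), alarm)
```

The alarm here depends only on distance from the reference distribution, which is one way to read "without needing to know specific attack types"; the full pipeline described in the specs adds copula decomposition, smoothing, and direction classification on top of the raw z-coordinates.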