- OQ-04 resolved: thresholds are both model-specific (shipped with
codebook) and user-overridable. Inspired by platonic representation
hypothesis — calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant
to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection
in documents (PDF→markdown use case, academic paper injection detection).
Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic
representation context.
Phase 0→1 (Exploration → Architecture) — The project has a working PoC
demonstrating that behavioral signals from small language models can detect
adversarial inputs. The core detection logic (~1,745 lines) works reasonably
well but lacks tests, has excessive codebook size, and needs extraction from
the research codebase into a properly structured Python package.
This project extracts and productionizes the behavioral signal detection
approach from the metaspline research project. A ~125M parameter model
(SmolLM2-135M) processes untrusted inputs and produces hidden state
activations. SVD-based dimensionality reduction on these activations reveals
behavioral patterns — normal inputs cluster in expected regions while
adversarial inputs produce anomalous activation signatures. The system
raises "behavioral alarms" without needing to know specific attack types.