Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:
- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
activation extraction natively (optimum #972 closed as not planned),
bloated model exports; burn/cublas via safetensors is a better future path
- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
Structure and Extraction from PoC sections to codebook.md based on PoC
analysis of metaspline firewall_codebook.py
- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
Firewall.screen() only, Phase 2 adds <100-line adapter packages for
LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails
- OQ-06: TOML for file-based config — standard modern Python, two-way door
Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.
Phase 0→1 (Exploration → Architecture) — The project has a working PoC
demonstrating that behavioral signals from small language models can detect
adversarial inputs. The core detection logic (~1,745 lines) works reasonably
well but lacks tests, has excessive codebook size, and needs extraction from
the research codebase into a properly structured Python package.
This project extracts and productionizes the behavioral signal detection
approach from the metaspline research project. A ~125M parameter model
(SmolLM2-135M) processes untrusted inputs and produces hidden state
activations. SVD-based dimensionality reduction on these activations reveals
behavioral patterns — normal inputs cluster in expected regions while
adversarial inputs produce anomalous activation signatures. The system
raises "behavioral alarms" without needing to know specific attack types.