
glm-5.1 11620e8398 docs: resolve OQ-04, remove OQ-07, enrich OQ-03 with rolling windows

- OQ-04 resolved: thresholds are both model-specific (shipped with
  codebook) and user-overridable. Inspired by platonic representation
  hypothesis — calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant
  to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection
  in documents (PDF→markdown use case, academic paper injection detection).
  Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic
  representation context.

2026-06-13 05:47:44 +00:00


| status | last_updated |
| --- | --- |
| draft | 2026-06-13 |

alknet-firewall — Architecture

Current State

Phase 0→1 (Exploration → Architecture): The project has a working proof of concept (PoC) demonstrating that behavioral signals from small language models can detect adversarial inputs. The core detection logic (~1,745 lines) works reasonably well, but it has no tests, its codebook is oversized (1,245 lines; see OQ-02), and it still needs to be extracted from the research codebase into a properly structured Python package.

This project extracts and productionizes the behavioral signal detection approach from the metaspline research project. A small model (SmolLM2-135M, roughly 135M parameters) processes untrusted inputs and produces hidden-state activations. SVD-based dimensionality reduction on these activations reveals behavioral patterns: normal inputs cluster in expected regions, while adversarial inputs produce anomalous activation signatures. The system raises "behavioral alarms" without needing to know the specific attack type.
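
The idea can be illustrated with a minimal NumPy sketch. It is not the project's implementation: the activations are synthetic stand-ins for real hidden states, the helper names `fit_basis` and `anomaly_score` are hypothetical, and reconstruction error is only one possible anomaly measure chosen here for illustration.

```python
import numpy as np

def fit_basis(normal_acts: np.ndarray, k: int):
    """Fit a rank-k SVD basis on activations from known-benign inputs."""
    mean = normal_acts.mean(axis=0)
    # Rows of vt are the right singular vectors (principal directions).
    _, _, vt = np.linalg.svd(normal_acts - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_score(act: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> float:
    """Reconstruction error: distance of an activation from the rank-k subspace."""
    centered = act - mean
    projected = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - projected))

rng = np.random.default_rng(0)
dim, k = 64, 4

# Synthetic assumption: "normal" activations lie near a k-dimensional subspace.
latent = rng.normal(size=(500, k))
mixing = rng.normal(size=(k, dim))
normal_acts = latent @ mixing + 0.01 * rng.normal(size=(500, dim))

mean, basis = fit_basis(normal_acts, k)

# A benign-like sample stays close to the subspace; an unstructured one does not.
normal_score = anomaly_score(normal_acts[0], mean, basis)
odd_score = anomaly_score(3.0 * rng.normal(size=dim), mean, basis)
print(f"normal={normal_score:.3f} odd={odd_score:.3f}")
```

In this toy setup the detector never sees a labeled attack: it only learns where benign activations live and flags inputs that fall far from that region, which mirrors the "no knowledge of specific attack types" property described above.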

Architecture Documents

| Document | Status | Description |
| --- | --- | --- |
| overview.md | Draft | Vision, scope, package structure, dependencies |
| firewall.md | Draft | Core firewall API, input screening, alarm protocol |
| codebook.md | Draft | SVD basis, detection parameters, codebook compilation |
| model.md | Draft | Model loading, activation extraction, model-agnostic design |
| configuration.md | Draft | Thresholds, model selection, detection tuning |
| open-questions.md | Active | Unresolved questions tracker with OQ-IDs |

ADR Table

| ADR | Title | Status |
| --- | --- | --- |
| 001 | Python with uv | Accepted |
| 002 | Behavioral Signal Detection (Not Text Classification) | Accepted |
| 003 | Small Model (~125M) as Detector | Accepted |
| 004 | SVD-Based Anomaly Detection | Accepted |
| 005 | Safetensors-Only Model Loading | Accepted |
| 006 | PyTorch as Optional Dependency | Accepted |
| 007 | Runtime Model Download via HuggingFace Hub | Accepted |
| 008 | Three-Level Alarm System | Accepted |
| 009 | Last-Token Activation Extraction | Accepted |
| 010 | Monotonic Spline Distributions | Accepted |
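
As a rough illustration of how ADR 008 and ADR 009 could fit together, the sketch below picks the activation at the last non-padding token and maps an anomaly score onto three levels. Everything named here is an assumption: the level names (`NONE`, `WARN`, `BLOCK`), the two-threshold scheme, and the helpers `last_token_activation` and `classify` are hypothetical and are not taken from the ADRs.

```python
from enum import IntEnum

class AlarmLevel(IntEnum):
    # Hypothetical names; the ADR title only says there are three levels.
    NONE = 0
    WARN = 1
    BLOCK = 2

def last_token_activation(hidden_states, attention_mask):
    """Return the hidden-state vector at the last non-padding position.

    hidden_states: list of per-token vectors, shape (seq_len, dim)
    attention_mask: list of 0/1 flags, 1 marks a real (non-padding) token
    """
    real = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real:
        raise ValueError("input has no non-padding tokens")
    return hidden_states[real[-1]]

def classify(score, warn_at, block_at):
    """Map an anomaly score onto three levels using two ordered thresholds."""
    if warn_at > block_at:
        raise ValueError("warn_at must not exceed block_at")
    if score >= block_at:
        return AlarmLevel.BLOCK
    if score >= warn_at:
        return AlarmLevel.WARN
    return AlarmLevel.NONE

# Toy input: three token positions, the last one is padding.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
mask = [1, 1, 0]
vec = last_token_activation(hidden, mask)

levels = [classify(s, warn_at=1.0, block_at=2.0) for s in (0.5, 1.5, 2.5)]
print(vec, [lvl.name for lvl in levels])
```

Selecting the last index with a mask value of 1, rather than simply the final position, keeps the extraction correct when sequences in a batch are padded to a common length.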

Open Questions

See open-questions.md for the full tracker.

| OQ | Question | Priority | Status |
| --- | --- | --- | --- |
| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
| OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
| OQ-03 | Should the firewall support streaming/chunked input screening? | medium | open |
| ~~OQ-04~~ | ~~Should detection thresholds be per-model or globally configurable?~~ | ~~medium~~ | resolved (both: model-specific defaults, user-overridable) |
| OQ-05 | How should the firewall integrate with existing guardrail systems? | medium | open |
| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
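
The OQ-04 resolution (model-specific defaults, user-overridable) could look roughly like the sketch below. The `Codebook` container, the `resolve_thresholds` helper, and the threshold keys and values are all hypothetical; the actual configuration schema is defined in configuration.md, not here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Codebook:
    # Hypothetical container: defaults calibrated for one model, shipped with it.
    model_id: str
    default_thresholds: dict = field(default_factory=dict)

def resolve_thresholds(codebook, overrides=None):
    """Start from the codebook's model-specific defaults, then apply user overrides."""
    overrides = overrides or {}
    # Reject keys the codebook does not define, so typos do not pass silently.
    unknown = set(overrides) - set(codebook.default_thresholds)
    if unknown:
        raise KeyError(f"unknown threshold(s): {sorted(unknown)}")
    return {**codebook.default_thresholds, **overrides}

# Illustrative values only; not calibrated numbers from the project.
codebook = Codebook(
    model_id="SmolLM2-135M",
    default_thresholds={"warn_at": 1.0, "block_at": 2.0},
)

defaults = resolve_thresholds(codebook)
tuned = resolve_thresholds(codebook, {"block_at": 2.5})
print(defaults, tuned)
```

Keeping the defaults inside the codebook means a threshold always travels with the model it was calibrated for, while the override layer lets a deployment tighten or loosen detection without recompiling the codebook.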

Document Lifecycle

| Status | Meaning | Transitions |
| --- | --- | --- |
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |
