glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure

Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.

2026-06-13 07:27:40 +00:00

status	last_updated
draft	2026-06-13

Overview

Vision

A pip-installable Python library that screens untrusted inputs for adversarial content before they reach a target LLM. The library uses behavioral signals — patterns in hidden state activations from a small language model — to detect injection attempts, obfuscated payloads, and novel attack types that text-surface defenses miss.

This project is open source under the MIT license.

Why This Exists

LLMs process instructions and data in the same token stream. They cannot reliably distinguish trusted system prompts from untrusted user content. This architectural weakness enables prompt injection — the #1 LLM vulnerability per OWASP LLM01:2025. Sophisticated attackers bypass the best-defended models ~50% of the time with just 10 attempts (International AI Safety Report 2026).

Current defenses are surface-level: text classifiers (Llama Guard), regex filters, perplexity checks, and canary tokens. All examine what the input says, not how a model processes it. Adversarial inputs that look natural to text classifiers still produce distinctive activation patterns when a model processes them.

Academic research validates this approach:

- HiddenDetect (ACL 2025): Activation-based detection outperforms SOTA
- Hidden Dimensions (ICML 2025): Safety is multi-dimensional in activation space
- EMNLP 2024: Safety signals detectable in early layers
- Subliminal Learning (Nature 2026): Models transmit behavioral signals through non-semantic hidden signals

See llm-input-safety-landscape.md for the full threat analysis and academic evidence.

Scope

In Scope

Phase 1: Core behavioral firewall library
- Input screening via small model activation analysis
- SVD-based anomaly detection with configurable thresholds
- Model-agnostic detector (works with any compatible small model)
- SmolLM2-135M as the default detector model
- Multi-dimensional behavioral alarms (not just safe/unsafe)
- PyTorch inference backend (optional dependency)
- Runtime model download and caching via HuggingFace Hub
- safetensors-only model loading (security requirement)
- Synchronous API for single-input screening
- Interpretable detection signals (SVD direction analysis)
Phase 2: Integration and operational hardening
- Async/batch screening API
- Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
- Metrics and observability
- Codebook training pipeline (run_manifold_projection.py extraction)
- Streaming/rolling-window input screening (granular detection for documents)
Phase 3: Advanced capabilities
- Multi-turn attack detection (payload splitting)
- Custom model fine-tuning for domain-specific detection
- Alternative inference backends (burn/cublas via safetensors)
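
The rolling-window screening listed under Phase 2 could split a long document into overlapping token windows and screen each one separately. The sketch below is only an illustration of that idea in plain Python: `rolling_windows`, the window size, and the stride are hypothetical and are not taken from the project's reference code.

```python
from typing import List, Sequence


def rolling_windows(tokens: Sequence[int], size: int, stride: int) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most `size` tokens.

    Hypothetical helper: a final window is appended when the stride would
    otherwise leave the tail of the sequence uncovered.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not tokens:
        return []
    if len(tokens) <= size:
        # Short inputs are screened as a single window.
        return [list(tokens)]

    last_start = len(tokens) - size
    windows = [list(tokens[s:s + size]) for s in range(0, last_start + 1, stride)]
    if last_start % stride != 0:
        # The regular stride stopped short of the end; cover the tail.
        windows.append(list(tokens[last_start:]))
    return windows


windows = rolling_windows(list(range(10)), size=4, stride=3)
```

A caller could then score every window and report the highest-scoring one, which localizes a detection to a region of the document rather than to the document as a whole.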

Out of Scope

- Text-surface classification (that's Llama Guard's job)
- Rule-based content filtering (that's NeMo Guardrails' job)
- Output-side safety monitoring
- Target model training or modification
- Multimodal (image) input screening
- Agent orchestration or access control
- Replacement for comprehensive LLM security programs

Architecture

                        ┌──────────────────────────────────────────┐
                        │  alknet-firewall (Python library)          │
                        │                                            │
  Untrusted Input ────► │  ┌─ Firewall API ─────────────────────┐   │
  (text)                │  │  screen(input) → Alarm              │   │
                        │  │  ├─ Tokenize input                   │   │
                        │  │  ├─ Run detector model              │   │
                        │  │  ├─ Extract hidden state activations│   │
                        │  │  ├─ Project onto SVD basis           │   │
                        │  │  ├─ Compare against codebook         │   │
                        │  │  └─ Return behavioral alarm          │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Model Manager ────────────────────┐   │
                        │  │  Load model (HF Hub download/cache) │   │
                        │  │  Extract activations at key layers   │   │
                        │  │  Model-agnostic interface            │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Codebook ──────────────────────────┐   │
                        │  │  SVD basis vectors (compiled)        │   │
                        │  │  Detection thresholds per dimension  │   │
                        │  │  Behavioral region boundaries        │   │
                        │  │  Spline distributions for scoring    │   │
                        │  └────────────────────────────────────┘   │
                        │                                            │
                        │  ┌─ Configuration ─────────────────────┐   │
                        │  │  Model selection & revision pinning  │   │
                        │  │  Detection thresholds               │   │
                        │  │  Alarm severity levels              │   │
                        │  └────────────────────────────────────┘   │
                        └──────────────────────────────────────────┘
                                      │
                               ┌──────┴──────┐
                               │             │
                        HF Hub Cache    Detector Model
                        (~/.cache/)    (SmolLM2-135M)
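
The projection and comparison steps in the diagram can be sketched numerically. This is a minimal illustration with NumPy on synthetic data, not the library's implementation: `fit_basis` and `score` are hypothetical helpers, random vectors stand in for detector activations, and a per-dimension z-score threshold stands in for the codebook's spline-based scoring.

```python
import numpy as np


def fit_basis(benign: np.ndarray, k: int):
    """Fit a k-dimensional SVD basis on benign activations (rows are samples)."""
    mean = benign.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(benign - mean, full_matrices=False)
    basis = vt[:k]                          # shape (k, hidden)
    coords = (benign - mean) @ basis.T      # benign samples in SVD coordinates
    scale = coords.std(axis=0) + 1e-8       # per-dimension spread of benign data
    return mean, basis, scale


def score(activation: np.ndarray, mean, basis, scale) -> np.ndarray:
    """Per-dimension z-scores of one activation vector in the SVD basis."""
    coords = (activation - mean) @ basis.T
    return np.abs(coords) / scale


rng = np.random.default_rng(0)
hidden = 16                                 # toy hidden size, not the real model's
benign = rng.normal(size=(200, hidden))     # stand-in for benign activations
mean, basis, scale = fit_basis(benign, k=4)

# Synthetic anomaly: 25 benign standard deviations along the first direction.
outlier = mean + 25.0 * scale[0] * basis[0]
outlier_z = score(outlier, mean, basis, scale)

THRESHOLD = 3.0                             # hypothetical per-dimension threshold
flagged = outlier_z > THRESHOLD
```

Because each dimension is scored separately, the result says which SVD direction fired rather than giving a single safe/unsafe verdict, which is the kind of per-dimension signal the alarm is meant to expose.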

Package Dependencies

Core (Required)

Package	Version	Purpose	Notes
`huggingface-hub`	>=1.5.0,<2.0	Model download, caching	~15MB, handles auth and offline mode
`safetensors`	>=0.4.3	Safe model weight loading	No arbitrary code execution
`tokenizers`	>=0.20	Text tokenization	Fast Rust-based tokenizer
`numpy`	>=1.24	Tensor operations	Core numerical dependency
`scikit-learn`	>=1.3	SVD computations	TruncatedSVD for basis projection

Optional (Extras)

Package	Extra	Version	Purpose	Notes
`torch`	`[torch]`	>=2.2	Model inference	200MB-2.5GB; optional dependency
`transformers`	`[torch]`	>=4.40	Model loading pipeline	Required with torch extra
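
With PyTorch as an optional extra, the library needs to fail with a clear message when inference is requested but the extra is not installed. One common way to do that is sketched below using only the standard library; `has_module` and `require_extra` are hypothetical names, and the install hint assumes the distribution is published as `alknet-firewall`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_extra(module: str, extra: str, package: str = "alknet-firewall") -> None:
    """Raise ImportError with an install hint when an optional dependency is missing."""
    if not has_module(module):
        raise ImportError(
            f"{module!r} is not installed. Install it with: "
            f"pip install '{package}[{extra}]'"
        )
```

A backend would call `require_extra("torch", "torch")` at the point where it first needs PyTorch, so that importing the package itself stays cheap for users who only need the core dependencies.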

Development (Not Published)

Package	Purpose
`ruff`	Linting + formatting (replaces flake8, black, isort)
`pytest`	Testing
`pytest-cov`	Coverage
`mypy`	Type checking
`pre-commit`	Git hooks

Exports

This is a Python library. Public API surface:

from alknet_firewall import Firewall, Alarm, AlarmLevel

# Core screening
firewall = Firewall()  # loads default model + codebook
alarm: Alarm = firewall.screen("untrusted input text")

# Alarm properties
alarm.level          # AlarmLevel.CLEAR | SUSPICIOUS | DANGEROUS
alarm.score          # float, 0.0-1.0
alarm.signals        # list[DimensionSignal] — per-dimension behavioral signals
alarm.dimensions     # SVD dimension analysis

See firewall.md for the full API specification.
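
To make the shape of the result concrete, the sketch below defines stand-in types and shows how a caller might branch on the alarm level. The names `Alarm`, `AlarmLevel`, and `DimensionSignal` come from the snippet above; the field layout, the score thresholds, and the `level_for` and `route` helpers are assumptions for illustration, and firewall.md remains the authoritative specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass(frozen=True)
class DimensionSignal:
    dimension: int    # index of the SVD dimension (assumed field)
    score: float      # how anomalous the input is along it (assumed field)


@dataclass(frozen=True)
class Alarm:
    level: AlarmLevel
    score: float                                  # overall score in [0.0, 1.0]
    signals: List[DimensionSignal] = field(default_factory=list)


def level_for(score: float, suspicious: float = 0.5, dangerous: float = 0.8) -> AlarmLevel:
    """Map an overall score to a level; both thresholds are hypothetical defaults."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


def route(alarm: Alarm) -> str:
    """Example caller policy: forward, hold for review, or block."""
    if alarm.level is AlarmLevel.DANGEROUS:
        return "block"
    if alarm.level is AlarmLevel.SUSPICIOUS:
        return "review"
    return "forward"


alarm = Alarm(level=level_for(0.92), score=0.92, signals=[DimensionSignal(0, 0.92)])
decision = route(alarm)
```

Keeping the policy in the caller, as `route` does here, matches the standalone-API approach: the firewall reports what it saw and the application decides what to do with it.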

Design Decisions

All design decisions are documented as ADRs in decisions/.

ADR	Decision	Summary
001	Python with uv	Python enables direct ML ecosystem integration; uv provides modern packaging
002	Behavioral signal detection	Detect how models process inputs, not what inputs say
003	Small model as detector	~135M params: <10ms latency, CPU-deployable, early-layer signals
004	SVD-based anomaly detection	Interpretable, efficient, small-model-friendly
005	Safetensors-only loading	No pickle-based model files — security product must be secure
006	PyTorch as optional dependency	2GB+ dependency can't be required; extras pattern is industry standard
007	Runtime model download	269MB model can't be bundled; HF Hub provides caching and auth
008	Three-level alarm system	CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance
009	Last-token activation extraction	Standard for autoregressive models; full sequence context
010	Monotonic spline distributions	Compact, smooth, tail-sensitive behavioral region modeling
011	Standalone API + thin adapters	Phase 1 standalone, Phase 2 thin adapter packages
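
Last-token extraction (ADR 009) can be illustrated with plain array indexing. The sketch below assumes right-padded batches and a 0/1 attention mask, and uses NumPy arrays in place of real model outputs; `last_token_activations` is a hypothetical helper, not the project's API.

```python
import numpy as np


def last_token_activations(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Select the hidden state of the last real (non-padding) token per sequence.

    hidden_states:  (batch, seq_len, hidden) activations from one layer
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding,
                    assumed right-padded
    """
    lengths = attention_mask.sum(axis=1)
    if (lengths == 0).any():
        raise ValueError("every sequence needs at least one real token")
    last_idx = lengths - 1                          # position of the last real token
    batch_idx = np.arange(hidden_states.shape[0])
    return hidden_states[batch_idx, last_idx]       # shape (batch, hidden)


hidden_states = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])
last = last_token_activations(hidden_states, attention_mask)
```

In a causal model the last real token attends to every earlier position, so its hidden state summarizes the whole input. With left-padded batches the same vector is simply the final position of each sequence.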

Dependencies on Other Projects

- metaspline: The core detection logic (codebook, spline distributions, SVD projection, space transforms) is adapted from the metaspline research project. The PoC validated the behavioral signal approach; this project extracts and productionizes ~1,745 lines of the working subset.
- reverse-proxy: The architecture documentation structure and SDD process are adapted from the @alkdev/reverse-proxy project. The documentation conventions, ADR format, and open questions tracking are reused directly.

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

- OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1? (Resolved: removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path.)
- OQ-05: How should the firewall integrate with existing guardrail systems? (Resolved by ADR-011: standalone API + thin adapters in Phase 2.)
