Files

glm-5.1 cf464c2296 feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

2026-06-13 05:17:40 +00:00

1.9 KiB

Raw Blame History

ADR-002: Behavioral Signal Detection (Not Text Classification)

Status

Accepted

Context

Existing LLM input defenses (Llama Guard, NeMo Guardrails, Rebuff) are text-surface approaches — they classify input text as safe or unsafe. This fundamentally limits their effectiveness:

Obfuscated inputs (Base64, multilingual, synonym substitution) evade keyword and pattern matching
Novel attack types require retraining classifiers
Text that looks natural to a classifier can still be adversarial when processed by a model

Academic research (2024-2025) demonstrates that adversarial inputs produce distinctive activation patterns in model internals, regardless of surface form.

Decision

Build a behavioral signal detection system that monitors how a model processes inputs (hidden state activations), not what the inputs say (text surface). Adversarial inputs produce anomalous activation patterns that are detectable even when the text itself looks innocent.

Consequences

Positive:

Catches obfuscated, multilingual, and novel attacks that text classifiers miss
Anomalous behavior patterns are attack-type agnostic — novel attacks still produce anomalous patterns
Multi-dimensional signals provide interpretable detection (which SVD directions are activated and by how much)
Complementary to existing text-surface defenses — can be layered

Negative:

Requires running a model on every input (adds latency and compute cost)
Detection depends on the detector model sharing architectural similarity with likely attack targets
False positives possible for unusual but benign inputs (domain-specific language, technical content)
No existing production system validates this approach — we are first

References

llm-input-safety-landscape.md
HiddenDetect (ACL 2025)
Hidden Dimensions of LLM Alignment (ICML 2025)
How Alignment and Jailbreak Work (EMNLP 2024)

1.9 KiB Raw Blame History