
glm-5.1 cf464c2296 feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

2026-06-13 05:17:40 +00:00

2.3 KiB


ADR-004: SVD-Based Anomaly Detection

Status

Accepted

Context

After extracting hidden state activations from the detector model, the firewall needs a method to distinguish normal behavioral patterns from adversarial ones. Options:

Single classifier: Train a binary classifier on activations. Simple but loses the multi-dimensional structure. Black box.
SVD + region comparison: Decompose activation space into principal directions, model normal behavioral regions along each direction, detect inputs that fall outside normal regions. Interpretable, efficient, multi-dimensional.
Autoencoder anomaly detection: Train an autoencoder on normal inputs, detect inputs with high reconstruction error. Complex, not interpretable.

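Hidden Dimensions of LLM Alignment (ICML 2025) reports that safety is multi-dimensional in activation space: a dominant refusal direction plus secondary dimensions. SVD is a natural way to surface such directions, since it ranks orthogonal directions by how much variance they explain in the calibration activations. Region comparison then provides an interpretable signal per dimension.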

Decision

Use SVD-based anomaly detection: decompose activation space via SVD to discover principal behavioral directions, model normal regions along each dimension using monotonic spline distributions, and detect inputs whose projections fall outside normal regions.
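The decision above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the codebook format specified in codebook.md: the names `fit_codebook` and `score`, the number of knots, the two-sided tail score, and the `alpha` threshold are all hypothetical. SciPy's `PchipInterpolator` stands in for the "monotonic spline" because it preserves monotonicity of the fitted points.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator
from scipy.linalg import svd


def fit_codebook(acts, k=4, n_knots=33):
    """Fit mean, top-k SVD directions, and one monotone CDF per direction."""
    mean = acts.mean(axis=0)
    # Thin SVD of the centered calibration matrix; rows of vt are directions.
    _, _, vt = svd(acts - mean, full_matrices=False)
    basis = vt[:k]                      # shape (k, d)
    proj = (acts - mean) @ basis.T      # shape (n, k)
    probs = np.linspace(0.0, 1.0, n_knots)
    cdfs = []
    for j in range(k):
        knots = np.quantile(proj[:, j], probs)
        # PCHIP needs strictly increasing x, so drop repeated quantiles.
        knots, keep = np.unique(knots, return_index=True)
        cdf = PchipInterpolator(knots, probs[keep])
        cdfs.append((knots[0], knots[-1], cdf))
    return mean, basis, cdfs


def score(codebook, act, alpha=0.01):
    """Return per-direction two-sided tail mass and a boolean flag per direction."""
    mean, basis, cdfs = codebook
    z = (act - mean) @ basis.T
    tails = np.empty(len(cdfs))
    for j, (lo, hi, cdf) in enumerate(cdfs):
        # Clip so projections beyond the calibrated range map to CDF 0 or 1.
        c = float(cdf(np.clip(z[j], lo, hi)))
        tails[j] = 2.0 * min(c, 1.0 - c)
    return tails, tails < alpha
```

A projection near the calibration median gets a tail mass close to 1 on every direction, while a projection beyond the calibrated range on some direction gets a tail mass of 0 there and is flagged. How the per-direction signals are combined into one alarm is left to the score composition described in the firewall spec.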

Consequences

Positive:

Interpretable: Each SVD direction can be labeled (refusal, role-playing, etc.)
Efficient: Projecting one activation onto k retained directions costs O(k·d) for hidden dimension d (a single small matrix-vector product), which is trivial at runtime
Multi-dimensional: Captures the multi-directional nature of safety (ICML 2025)
Robust: SVD captures structure of entire activation space, not a single boundary
Small-model friendly: SVD on 768-dim hidden states is computationally trivial
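Deterministic: scipy.linalg.svd computes the full decomposition with a direct LAPACK solver, so results are reproducible for a given input and build (unlike scikit-learn's TruncatedSVD, whose default randomized solver depends on random_state). Singular vectors are still only defined up to sign.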
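Because singular vectors are only defined up to sign, a reproducible basis benefits from an explicit sign convention. The sketch below shows one possible convention (flip each direction so its largest-magnitude component is positive). The function name `canonical_basis` and the convention itself are assumptions for illustration, not something the ADR prescribes.

```python
import numpy as np
from scipy.linalg import svd


def canonical_basis(acts, k):
    """Top-k right singular vectors with a fixed sign convention."""
    centered = acts - acts.mean(axis=0)
    # Thin SVD: full_matrices=False avoids allocating an n x n U matrix.
    _, s, vt = svd(centered, full_matrices=False)
    basis = vt[:k].copy()
    # Locate the largest-magnitude component of each direction ...
    pivot = np.argmax(np.abs(basis), axis=1)
    signs = np.sign(basis[np.arange(k), pivot])
    signs[signs == 0] = 1.0
    # ... and flip the direction so that component is positive.
    return basis * signs[:, None], s[:k]
```

With this convention, two decompositions that differ only by a sign flip of a direction produce the same stored basis, so a label attached to a direction (for example "refusal") keeps a stable orientation across recomputations.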

Negative:

SVD basis is model-specific — changing detector model requires recomputation
Basis quality depends on calibration dataset coverage
Linear decomposition may miss non-linear behavioral patterns
Requires a codebook compilation pipeline (Phase 2)
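SVD over a large calibration dataset may be slow: cost grows roughly as O(n·d²) for n samples, and scipy.linalg.svd defaults to full_matrices=True, which materializes an n × n U matrix. This is mitigated by the relatively small hidden dimension (d = 768) and by requesting a thin decomposition (full_matrices=False).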
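If the calibration set ever becomes too large to hold in memory, one possible mitigation is to accumulate the d × d scatter matrix in batches and eigendecompose it instead of running SVD on the full n × d matrix. This is an alternative sketch, not the approach the ADR selects: `streaming_basis` is a hypothetical name, and the method trades some numerical precision on small singular values (it squares the condition number) for constant memory in n.

```python
import numpy as np
from scipy.linalg import eigh


def streaming_basis(batches, k):
    """Top-k directions from batches of activations, using O(d^2) memory."""
    n, total, gram = 0, None, None
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if gram is None:
            d = batch.shape[1]
            total = np.zeros(d)
            gram = np.zeros((d, d))
        n += batch.shape[0]
        total += batch.sum(axis=0)
        gram += batch.T @ batch
    mean = total / n
    # Equals centered.T @ centered without a second pass over the data.
    scatter = gram - n * np.outer(mean, mean)
    evals, evecs = eigh(scatter)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]
    basis = evecs[:, order].T                 # rows match right singular vectors up to sign
    sing = np.sqrt(np.clip(evals[order], 0.0, None))
    return mean, basis, sing
```

The eigenvectors of the scatter matrix coincide with the right singular vectors of the centered data up to sign, and the singular values are the square roots of the eigenvalues, so the result can feed the same projection step as a direct SVD.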

References

codebook.md
Hidden Dimensions of LLM Alignment (ICML 2025)
HiddenDetect (ACL 2025)
