| status | last_updated |
|---|---|
| draft | 2026-06-13 |
Configuration
Configuration for the firewall: model selection, detection thresholds, alarm levels, and operational parameters.
What It Is
The configuration component defines all tunable parameters for the firewall. It controls which model is used, how aggressively inputs are screened, and which score ranges map to which alarm levels.
Why It Exists
Different deployment contexts need different detection sensitivity. A high-security environment (e.g., screening inputs to a system with access to sensitive data) may want aggressive thresholds that flag more suspicious inputs. A low-risk chatbot may prefer permissive thresholds that minimize false positives. The configuration component makes these trade-offs explicit and tunable.
Configuration Structure
Thresholds
from dataclasses import dataclass

@dataclass
class Thresholds:
    suspicious: float = 0.3  # Score above which input is SUSPICIOUS
    dangerous: float = 0.7   # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
Default thresholds are calibrated against the codebook's behavioral regions. Per-dimension overrides, keyed by SVD dimension index, allow tuning sensitivity for specific behavioral patterns (e.g., a lower threshold on the refusal-suppression dimension).
Model Configuration
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
Extraction layers are chosen based on EMNLP 2024 findings that safety signals appear in early layers. The default set (1, 2, 4, 8) samples the first third of the 30-layer SmolLM2-135M model at doubling depths.
Codebook Configuration
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CodebookConfig:
    source: str = "bundled"       # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None    # HuggingFace repo if source="hf_hub"
    revision: str | None = None   # HuggingFace revision
    path: Path | None = None      # Local path if source="local"
    n_dimensions: int = 10        # Number of SVD dimensions to retain
Full Configuration
from dataclasses import dataclass, field

@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
Defaults
Every configuration parameter has a default, so the firewall works out of the box:
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
No configuration file is required. All parameters can be passed via the constructor. A future phase may add file-based configuration (TOML or YAML).
Design Decisions
| ADR | Decision | Summary |
|---|---|---|
| 003 | Small model detector | Defaults to SmolLM2-135M |
| 006 | Optional PyTorch | Device config allows CPU-only |
| 007 | Runtime download | Model revision must be pinned |
Open Questions
Open questions are tracked in open-questions.md. Key questions affecting this document:
- OQ-04: Should detection thresholds be per-model or globally configurable? (open)