---
status: draft
last_updated: 2026-06-13
---

# Configuration

Configuration for the firewall: model selection, detection thresholds,
alarm levels, and operational parameters.

## What It Is

The configuration component defines all tunable parameters for the firewall.
It controls which model is used, how aggressively inputs are screened, and
which scores map to which alarm levels.

## Why It Exists

Different deployment contexts need different detection sensitivity. A
high-security environment (e.g., screening inputs to a system with access to
sensitive data) may want aggressive thresholds that flag more suspicious
inputs. A low-risk chatbot may prefer permissive thresholds that minimize
false positives. The configuration component makes these trade-offs explicit
and tunable.

## Configuration Structure

### Thresholds

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    suspicious: float = 0.3   # Score above which input is SUSPICIOUS
    dangerous: float = 0.7    # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
```

Default thresholds are calibrated against the codebook's behavioral regions.
Per-dimension overrides allow tuning sensitivity for specific behavioral
patterns (e.g., a lower threshold on the refusal-suppression dimension).

### Model Configuration

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
```

Extraction layers are chosen based on EMNLP 2024 findings that safety signals
appear in early layers. The default set (1, 2, 4, 8) samples the first third
of SmolLM2-135M's 30 transformer layers at doubling depths.

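To illustrate how `extraction_layers` might be consumed, the sketch below picks per-layer activations out of a hidden-state sequence. It assumes the Hugging Face Transformers convention for `output_hidden_states=True`, where entry 0 is the embedding output and entry `i` is the output of layer `i`; whether the model component indexes layers this way is an assumption, and `select_layers` is a hypothetical helper.

```python
from collections.abc import Sequence
from typing import TypeVar

T = TypeVar("T")


def select_layers(hidden_states: Sequence[T], layers: list[int]) -> dict[int, T]:
    """Pick the activations for the requested layers (hypothetical helper).

    `hidden_states` is assumed to hold one entry for the embedding output
    followed by one entry per transformer layer.
    """
    n_layers = len(hidden_states) - 1  # entry 0 is the embedding output
    out_of_range = [i for i in layers if not 1 <= i <= n_layers]
    if out_of_range:
        raise ValueError(f"layers {out_of_range} outside 1..{n_layers}")
    return {i: hidden_states[i] for i in layers}


# Stand-in for real tensors: 1 embedding entry + 30 layer entries.
states = [f"h{i}" for i in range(31)]
selected = select_layers(states, [1, 2, 4, 8])
```

Validating the indices against the model depth up front means a layer list written for a different model fails with a clear error instead of an `IndexError` deep inside extraction.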
### Codebook Configuration

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodebookConfig:
    source: str = "bundled"       # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None    # HuggingFace repo if source="hf_hub"
    revision: str | None = None   # HuggingFace revision
    path: Path | None = None      # Local path if source="local"
    n_dimensions: int = 10        # Number of SVD dimensions to retain
```

### Full Configuration

```python
from dataclasses import dataclass, field


@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
```

## Defaults

Every configuration field has a default, so the firewall works out of the box:

```python
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
```

No configuration file is required. All parameters can be passed via the
constructor. A future phase may add file-based configuration (TOML or YAML).

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)