---
status: draft
last_updated: 2026-06-13
---

# Configuration

Configuration for the firewall: model selection, detection thresholds,
alarm levels, and operational parameters.

## What It Is

The configuration component defines all tunable parameters for the firewall.
It controls which model is used, how aggressively inputs are screened, and
how scores map to alarm levels.

## Why It Exists

Different deployment contexts need different detection sensitivity. A
high-security environment (e.g., screening inputs to a system with access to
sensitive data) may want aggressive thresholds that flag more suspicious
inputs. A low-risk chatbot may prefer permissive thresholds that minimize
false positives. The configuration component makes these trade-offs explicit
and tunable.

## Configuration Structure

### Thresholds

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    suspicious: float = 0.3  # Score above which input is SUSPICIOUS
    dangerous: float = 0.7   # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
```

Default thresholds are calibrated against the codebook's behavioral regions
and shipped with each codebook, so they are model-specific. The approach is
inspired by the "platonic representation hypothesis" (the idea that different
models converge on similar internal representations): once calibrated, models
are expected to produce similar behavioral patterns. Per-dimension overrides
allow tuning sensitivity for specific behavioral patterns (e.g., a lower
threshold on the refusal-suppression dimension). Users can always override
the codebook's recommended thresholds.

### Model Configuration

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
```

Extraction layers are chosen based on EMNLP 2024 findings that safety signals
appear in early layers. The default set samples layers 1, 2, 4, and 8, all
within the first third of SmolLM2-135M, whose published configuration lists
30 transformer layers.

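As a rough sketch of how `extraction_layers` might be applied, the helper below picks per-layer outputs out of a sequence of hidden states and rejects out-of-range indices. It assumes the indices follow the Hugging Face `hidden_states` convention, where index 0 is the embedding output and index `i` is the output of layer `i`; the source does not state which convention the firewall uses, and `select_layers` is a hypothetical name.

```python
from collections.abc import Sequence
from typing import TypeVar

T = TypeVar("T")


def select_layers(hidden_states: Sequence[T], extraction_layers: Sequence[int]) -> dict[int, T]:
    """Return {layer_index: hidden_state} for the requested layers.

    Assumes hidden_states[0] is the embedding output, so a model with
    N layers yields N + 1 entries and valid layer indices are 1..N.
    """
    n_layers = len(hidden_states) - 1
    bad = [i for i in extraction_layers if not 1 <= i <= n_layers]
    if bad:
        raise ValueError(f"extraction layers {bad} are outside 1..{n_layers}")
    return {i: hidden_states[i] for i in extraction_layers}


# Stand-in values instead of real tensors: embeddings plus 8 layers.
fake_states = [f"h{i}" for i in range(9)]
selected = select_layers(fake_states, [1, 2, 4, 8])
```

Validating the indices up front means a configuration written for one model fails loudly, rather than silently indexing the wrong layer, when `model_id` is changed to a shallower model.
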
### Codebook Configuration

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CodebookConfig:
    source: str = "bundled"      # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None   # HuggingFace repo if source="hf_hub"
    revision: str | None = None  # HuggingFace revision
    path: Path | None = None     # Local path if source="local"
    n_dimensions: int = 10       # Number of SVD dimensions to retain
```

### Full Configuration

```python
from dataclasses import dataclass, field

@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
```

## Defaults

All configuration has sensible defaults. The firewall works out of the box:

```python
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
```

No configuration file is required. All parameters can be passed via the
constructor. A future phase may add file-based configuration (TOML or YAML).

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both; model-specific defaults are shipped with the codebook and are user-overridable)