Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely: it doesn't support activation extraction natively (optimum #972 closed as not planned) and produces bloated model exports; burn/cublas via safetensors is a better future path
- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py
- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds <100-line adapter packages for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails
- OQ-06: TOML for file-based config: standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.

113 lines
4.0 KiB
Markdown
---
status: draft
last_updated: 2026-06-13
---

# Configuration

Configuration for the firewall: model selection, detection thresholds,
alarm levels, and operational parameters.

## What It Is

The configuration component defines all tunable parameters for the firewall.
It controls which model is used, how aggressively inputs are screened, and
what alarm levels map to what scores.

## Why It Exists

Different deployment contexts need different detection sensitivity. A
high-security environment (e.g., screening inputs to a system with access to
sensitive data) may want aggressive thresholds that flag more suspicious
inputs. A low-risk chatbot may prefer permissive thresholds that minimize
false positives. The configuration component makes these trade-offs explicit
and tunable.

## Configuration Structure

### Thresholds

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    suspicious: float = 0.3  # Score above which input is SUSPICIOUS
    dangerous: float = 0.7  # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
```

Default thresholds are calibrated against the codebook's behavioral regions
and shipped with each codebook. Once calibrated, models produce remarkably
similar behavioral patterns, an expectation inspired by the "platonic
representation hypothesis" (the idea that different models converge on similar
internal representations). Per-dimension overrides, keyed by SVD dimension
index, allow tuning sensitivity for specific behavioral patterns (e.g., a
lower threshold on the refusal-suppression dimension). Users can always
override the codebook's recommended thresholds.

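As a rough illustration of how these thresholds could turn a score into an alarm level, the sketch below assumes one aggregate score plus optional per-dimension scores. The `classify` helper, the `AlarmLevel` members other than `CLEAR`, and the rule that a per-dimension override escalates an input to SUSPICIOUS are assumptions made for illustration, not the firewall's actual scoring logic.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = 0
    SUSPICIOUS = 1
    DANGEROUS = 2


@dataclass
class Thresholds:
    suspicious: float = 0.3
    dangerous: float = 0.7
    per_dimension: dict[int, float] | None = None


def classify(score: float, dim_scores: dict[int, float], t: Thresholds) -> AlarmLevel:
    """Hypothetical mapping from scores to an alarm level."""
    if score > t.dangerous:
        return AlarmLevel.DANGEROUS
    if score > t.suspicious:
        return AlarmLevel.SUSPICIOUS
    # Assumption: a per-dimension override can escalate an otherwise clear input.
    for dim, limit in (t.per_dimension or {}).items():
        if dim_scores.get(dim, 0.0) > limit:
            return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Dimension 3 is an arbitrary example index, not a documented dimension.
strict = Thresholds(per_dimension={3: 0.1})
print(classify(0.2, {3: 0.15}, strict))
print(classify(0.2, {3: 0.05}, strict))
```

In this sketch the global thresholds are checked first, so a per-dimension override can only raise the alarm level, never lower it.
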
### Model Configuration

```python
from dataclasses import dataclass, field


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"  # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None
```

Extraction layers are chosen based on EMNLP 2024 findings that safety signals
appear in early layers. The default set covers the earliest layers (1, 2) and
somewhat deeper ones (4, 8) of the 30-layer SmolLM2-135M model.

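Because layer indices only make sense relative to a specific model's depth, it can help to validate them when the configuration is built. The following is a minimal sketch: `validate_extraction_layers` is a hypothetical helper, and it assumes layers are numbered from 1 up to the model's layer count, which this document does not specify.

```python
def validate_extraction_layers(layers: list[int], num_layers: int) -> list[int]:
    """Return the layers sorted and de-duplicated, or raise if any is out of range.

    Assumption: valid indices run from 1 to num_layers inclusive.
    """
    bad = sorted({i for i in layers if not 1 <= i <= num_layers})
    if bad:
        raise ValueError(f"extraction layers {bad} are outside 1..{num_layers}")
    return sorted(set(layers))


# num_layers would normally come from the loaded model's config;
# 16 is an arbitrary depth chosen for illustration.
print(validate_extraction_layers([8, 1, 2, 4, 2], num_layers=16))
```

Failing early here keeps a mistyped layer index from surfacing later as an obscure indexing error during activation extraction.
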
### Codebook Configuration

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodebookConfig:
    source: str = "bundled"  # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None  # HuggingFace repo if source="hf_hub"
    revision: str | None = None  # HuggingFace revision
    path: Path | None = None  # Local path if source="local"
    n_dimensions: int = 10  # Number of SVD dimensions to retain
```

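The field comments imply that `repo_id` and `path` are only meaningful for particular `source` values. One way to make that dependency explicit is a `__post_init__` check, sketched below. The validation rules and the example path are assumptions for illustration; the actual class may not enforce any of them.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class CodebookConfig:
    source: str = "bundled"
    repo_id: str | None = None
    revision: str | None = None
    path: Path | None = None
    n_dimensions: int = 10

    def __post_init__(self) -> None:
        # Hypothetical validation; the real class may not enforce any of this.
        if self.source not in ("bundled", "hf_hub", "local"):
            raise ValueError(f"unknown codebook source: {self.source!r}")
        if self.source == "hf_hub" and not self.repo_id:
            raise ValueError('source="hf_hub" requires repo_id')
        if self.source == "local" and self.path is None:
            raise ValueError('source="local" requires path')
        if self.n_dimensions < 1:
            raise ValueError("n_dimensions must be at least 1")


# "codebooks/example" is a made-up path used only for this example.
local = CodebookConfig(source="local", path=Path("codebooks/example"))
print(local.source, local.path)
```
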
### Full Configuration

```python
from dataclasses import dataclass, field


# Composes the ModelConfig, CodebookConfig, and Thresholds dataclasses above.
@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)
```

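Since every field has a default, a deployment only needs to state what it changes. The sketch below uses trimmed copies of the dataclasses (fewer fields, for brevity) to show one way of building a stricter profile and deriving a variant from it with `dataclasses.replace`. The specific threshold values and the `"cuda"` device string are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field, replace


@dataclass
class Thresholds:
    suspicious: float = 0.3
    dangerous: float = 0.7
    per_dimension: dict[int, float] | None = None


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    device: str = "cpu"


@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)


# A stricter profile for a high-security deployment (values are illustrative).
strict = FirewallConfig(thresholds=Thresholds(suspicious=0.2, dangerous=0.5))

# Derive a variant without mutating the original.
strict_gpu = replace(strict, model=replace(strict.model, device="cuda"))

print(asdict(strict_gpu))
```

Using `default_factory` for the nested configs means each `FirewallConfig` gets its own sub-objects, so changing one instance does not leak into another.
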
## Defaults

All configuration has sensible defaults. The firewall works out of the box:

```python
# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR
```

No configuration file is required. All parameters can be passed via the
constructor. A future phase may add file-based configuration (TOML, consistent
with Python packaging conventions and `pyproject.toml`).

## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | Defaults to SmolLM2-135M |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Device config allows CPU-only |
| [007](decisions/007-runtime-model-download.md) | Runtime download | Model revision must be pinned |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both, with model-specific defaults shipped with the codebook and user-overridable)
- ~~**OQ-06**~~: ~~Should file-based configuration use TOML or YAML?~~ (resolved: TOML, consistent with modern Python packaging)