glm-5.1 11620e8398 docs: resolve OQ-04, remove OQ-07, enrich OQ-03 with rolling windows

- OQ-04 resolved: thresholds are both model-specific (shipped with
  codebook) and user-overridable. Inspired by platonic representation
  hypothesis — calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant
  to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection
  in documents (PDF→markdown use case, academic paper injection detection).
  Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic
  representation context.

2026-06-13 05:47:44 +00:00


status	last_updated
draft	2026-06-13

Configuration

Configuration for the firewall: model selection, detection thresholds, alarm levels, and operational parameters.

What It Is

The configuration component defines all tunable parameters for the firewall. It controls which model is used, how aggressively inputs are screened, and which score ranges map to which alarm levels.

Why It Exists

Different deployment contexts need different detection sensitivity. A high-security environment (e.g., screening inputs to a system with access to sensitive data) may want aggressive thresholds that flag more suspicious inputs. A low-risk chatbot may prefer permissive thresholds that minimize false positives. The configuration component makes these trade-offs explicit and tunable.

Configuration Structure

Thresholds

from dataclasses import dataclass

@dataclass
class Thresholds:
    suspicious: float = 0.3    # Score above which input is SUSPICIOUS
    dangerous: float = 0.7     # Score above which input is DANGEROUS
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension

Default thresholds are calibrated against the codebook's behavioral regions and shipped with each codebook, so the recommended values are model-specific. The approach is inspired by the "platonic representation hypothesis", the idea that different models converge on similar internal representations: once calibrated, models are expected to produce similar behavioral patterns. Per-dimension overrides allow tuning sensitivity for specific behavioral patterns (e.g., a lower threshold on the refusal-suppression dimension). Users can always override the codebook's recommended thresholds.

Model Configuration

from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    revision: str = "<pinned-commit>"   # Specific commit, not "main"
    device: str = "cpu"
    extraction_layers: list[int] = field(default_factory=lambda: [1, 2, 4, 8])
    cache_dir: str | None = None

Extraction layers are chosen based on EMNLP 2024 findings that safety signals appear in early layers. The default set samples layers 1, 2, 4, and 8, all within the first third of SmolLM2-135M, whose published model configuration lists 30 transformer layers.
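
As an illustration of how `extraction_layers` might be applied, the sketch below picks the configured layers out of a per-layer sequence of activations and rejects indices the model does not have. It assumes the indexing used by Hugging Face `transformers` when `output_hidden_states=True` is set, where entry 0 is the embedding output and entry `i` is the output of layer `i`; whether the firewall indexes layers this way is not stated in the source. `select_layers` is a hypothetical helper, and strings stand in for tensors.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_layers(hidden_states: Sequence[T], layers: list[int]) -> dict[int, T]:
    """Pick the configured layers out of a per-layer sequence of activations.

    Assumes hidden_states[0] is the embedding output and hidden_states[i]
    is the output of transformer layer i (1-based).
    """
    n_layers = len(hidden_states) - 1
    bad = [i for i in layers if not 1 <= i <= n_layers]
    if bad:
        raise ValueError(f"layers {bad} out of range for a {n_layers}-layer model")
    return {i: hidden_states[i] for i in layers}


# Stand-in activations for a toy 8-layer model (strings instead of tensors).
fake_states = ["embeddings"] + [f"layer{i}" for i in range(1, 9)]
picked = select_layers(fake_states, [1, 2, 4, 8])
```

Validating the indices up front means a configuration written for one model fails loudly, rather than silently reading the wrong layer, when `model_id` is switched to a shallower model.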

Codebook Configuration

from pathlib import Path

@dataclass
class CodebookConfig:
    source: str = "bundled"         # "bundled" | "hf_hub" | "local"
    repo_id: str | None = None      # HuggingFace repo if source="hf_hub"
    revision: str | None = None     # HuggingFace revision
    path: Path | None = None        # Local path if source="local"
    n_dimensions: int = 10          # Number of SVD dimensions to retain

Full Configuration

@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    codebook: CodebookConfig = field(default_factory=CodebookConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)

Defaults

Every configuration field has a default, so the firewall works out of the box:

# All defaults
firewall = Firewall()
alarm = firewall.screen("Hello, how are you?")
# alarm.level == AlarmLevel.CLEAR

No configuration file is required. All parameters can be passed via the constructor. A future phase may add file-based configuration (TOML or YAML).
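
If file-based configuration is added, one possible shape is a loader that turns an already-parsed mapping into the dataclasses, falling back to defaults for missing sections. This is a sketch only: `config_from_dict` is a hypothetical helper, the dataclasses are trimmed-down copies, and the key layout (one table per section) is an assumption rather than a planned format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Thresholds:
    suspicious: float = 0.3
    dangerous: float = 0.7


@dataclass
class ModelConfig:
    model_id: str = "HuggingFaceTB/SmolLM2-135M"
    device: str = "cpu"


@dataclass
class FirewallConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    thresholds: Thresholds = field(default_factory=Thresholds)


def config_from_dict(data: dict[str, Any]) -> FirewallConfig:
    """Build a config from a parsed file; missing keys fall back to defaults."""
    return FirewallConfig(
        model=ModelConfig(**data.get("model", {})),
        thresholds=Thresholds(**data.get("thresholds", {})),
    )


# What a TOML or YAML parser might return for a small override file.
parsed = {"thresholds": {"suspicious": 0.2}, "model": {"device": "cuda"}}
config = config_from_dict(parsed)
```

On Python 3.11 and later, the standard-library `tomllib.loads` returns a plain `dict` of this kind. Unknown keys raise `TypeError` from the dataclass constructor, which surfaces typos in a config file early.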

Design Decisions

ADR	Decision	Summary
003	Small model detector	Defaults to SmolLM2-135M
006	Optional PyTorch	Device config allows CPU-only
007	Runtime download	Model revision must be pinned

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

~~OQ-04~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved — both: model-specific defaults shipped with codebook, user-overridable)
