docs: resolve OQ-04, remove OQ-07, enrich OQ-03 with rolling windows

- OQ-04 resolved: thresholds are both model-specific (shipped with codebook) and user-overridable. Inspired by the platonic representation hypothesis: calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection in documents (PDF→markdown use case, academic paper injection detection). Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic representation context.
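
As a rough illustration of the OQ-04 resolution (codebook-shipped defaults that a user can always override), the sketch below shows one way the merge could work. Only the `per_dimension` field appears in the diff; the `default` field, its placeholder value, and the helpers `resolve_thresholds` and `threshold_for` are hypothetical names introduced here, not the package's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative stand-in; only `per_dimension` is taken from the diff."""

    default: float = 0.5  # hypothetical global cutoff; the value is a placeholder
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension


def resolve_thresholds(
    codebook: Thresholds,
    user_default: float | None = None,
    user_per_dimension: dict[int, float] | None = None,
) -> Thresholds:
    """Start from the codebook's recommended thresholds, then apply user overrides."""
    # User entries win on key collisions because they are unpacked last.
    merged = {**(codebook.per_dimension or {}), **(user_per_dimension or {})}
    return Thresholds(
        default=codebook.default if user_default is None else user_default,
        per_dimension=merged or None,
    )


def threshold_for(thresholds: Thresholds, dimension: int) -> float:
    """Return the per-dimension threshold if one is set, else the global default."""
    if thresholds.per_dimension and dimension in thresholds.per_dimension:
        return thresholds.per_dimension[dimension]
    return thresholds.default
```

With a codebook that ships `Thresholds(default=0.8, per_dimension={3: 0.6})`, passing `user_per_dimension={3: 0.4}` would lower the threshold on dimension 3 only and leave every other dimension at the codebook's values.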
@@ -35,9 +35,13 @@ class Thresholds:
     per_dimension: dict[int, float] | None = None # Override per SVD dimension
 ```
 
-Default thresholds are calibrated against the codebook's behavioral regions.
-Per-dimension overrides allow tuning sensitivity for specific behavioral
-patterns (e.g., lower threshold on the refusal-suppression dimension).
+Default thresholds are calibrated against the codebook's behavioral regions
+and shipped with each codebook. Once calibrated, models produce remarkably
+similar behavioral patterns (inspired by the "platonic representation hypothesis":
+different models converge on similar internal representations). Per-dimension
+overrides allow tuning sensitivity for specific behavioral patterns (e.g.,
+lower threshold on the refusal-suppression dimension). Users can always
+override the codebook's recommended thresholds.
 
 ### Model Configuration
 
@@ -104,4 +108,4 @@ constructor. A future phase may add file-based configuration (TOML or YAML).
 Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:
 
-- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
+- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved: both, with model-specific defaults shipped with codebook and user-overridable)