docs: resolve OQ-04, remove OQ-07, enrich OQ-03 with rolling windows

- OQ-04 resolved: thresholds are both model-specific (shipped with codebook) and user-overridable. Inspired by the platonic representation hypothesis: calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection in documents (PDF→markdown use case, academic paper injection detection). Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic representation context.
2026-06-13 05:47:44 +00:00
parent cf464c2296
commit 11620e8398
5 changed files with 70 additions and 52 deletions
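
The rolling window screening described for OQ-03 can be pictured with a minimal sketch. Everything here is an assumption for illustration: `rolling_windows`, `flag_windows`, `score_fn`, and the window/stride parameters are hypothetical names, not part of the alknet-firewall API, and the scoring function stands in for whatever the detector model would compute.

```python
from typing import Callable, Iterator, Sequence


def rolling_windows(
    tokens: Sequence[int], window_size: int, stride: int
) -> Iterator[tuple[int, Sequence[int]]]:
    """Yield (start_offset, window) pairs of overlapping token windows."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride must not exceed window_size, or tokens are skipped")
    if not tokens:
        return
    start = 0
    while True:
        yield start, tokens[start : start + window_size]
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride


def flag_windows(
    tokens: Sequence[int],
    window_size: int,
    stride: int,
    score_fn: Callable[[Sequence[int]], float],  # hypothetical detector score
    threshold: float,
) -> list[tuple[int, int]]:
    """Return (start, end) token spans whose score exceeds the threshold."""
    flagged = []
    for start, window in rolling_windows(tokens, window_size, stride):
        # Each window is scored independently, so this loop could be parallelized.
        if score_fn(window) > threshold:
            flagged.append((start, start + len(window)))
    return flagged
```

Because each flagged span carries its token offsets, a caller could map it back to a section of the source document rather than rejecting the whole input.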
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -55,11 +55,10 @@ See [open-questions.md](open-questions.md) for the full tracker.
 |----|----------|----------|--------|
 | OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
 | OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
-| OQ-03 | Should the firewall support streaming/chunked input screening? | low | open |
-| OQ-04 | Should detection thresholds be per-model or globally configurable? | medium | open |
-| OQ-05 | How should the firewall integrate with existing guardrail systems (LlamaFirewall, NeMo)? | medium | open |
+| OQ-03 | Should the firewall support streaming/chunked input screening? | medium | open |
+| ~~OQ-04~~ | ~~Should detection thresholds be per-model or globally configurable?~~ | ~~medium~~ | **resolved** (both: model-specific defaults, user-overridable) |
+| OQ-05 | How should the firewall integrate with existing guardrail systems? | medium | open |
 | OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
-| OQ-07 | Is a Rust port feasible given current ML framework maturity? | low | open |

 ## Document Lifecycle

--- a/docs/architecture/configuration.md
+++ b/docs/architecture/configuration.md
@@ -35,9 +35,13 @@ class Thresholds:
    per_dimension: dict[int, float] | None = None  # Override per SVD dimension
 ```

-Default thresholds are calibrated against the codebook's behavioral regions.
-Per-dimension overrides allow tuning sensitivity for specific behavioral
-patterns (e.g., lower threshold on the refusal-suppression dimension).
+Default thresholds are calibrated against the codebook's behavioral regions
+and shipped with each codebook. Once calibrated, models produce remarkably
+similar behavioral patterns (inspired by the "platonic representation hypothesis"
+— different models converge on similar internal representations). Per-dimension
+overrides allow tuning sensitivity for specific behavioral patterns (e.g.,
+lower threshold on the refusal-suppression dimension). Users can always
+override the codebook's recommended thresholds.

 ### Model Configuration

@@ -104,4 +108,4 @@ constructor. A future phase may add file-based configuration (TOML or YAML).
 Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:

-- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
+- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved — both: model-specific defaults shipped with codebook, user-overridable)
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -196,5 +196,5 @@ All exception types subclass `AlknetFirewallError` (base library exception).
 Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:

-- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
-- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)
+- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising)
+- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open — needs research)
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -9,7 +9,7 @@ Centralized tracker for unresolved questions across all architecture documents.
 - **Origin**: [model.md](model.md), [overview.md](overview.md)
 - **Status**: open
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: (pending — needs research into ONNX export path)
 - **Cross-references**: ADR-006

 ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
@@ -18,8 +18,9 @@ library provides drop-in replacement classes. However, supporting it in Phase 1
 adds complexity: model must be exported to ONNX format, `optimum` integration
 must be tested, and the activation extraction API may differ from PyTorch.

-Consider: Is the smaller footprint worth the integration complexity in Phase 1,
-or should ONNX support wait until Phase 2 when the core API is stable?
+The likely path is: build with PyTorch first, then export to ONNX by default.
+This needs research to confirm the activation extraction API compatibility and
+ONNX export quality for SmolLM2-135M. Leave open for now.

 ---

@@ -30,7 +31,7 @@ or should ONNX support wait until Phase 2 when the core API is stable?
 - **Origin**: [codebook.md](codebook.md)
 - **Status**: open
 - **Priority**: high
-- **Resolution**: (pending)
+- **Resolution**: (pending — dedicated research session needed)
 - **Cross-references**: ADR-004

 The PoC codebook is 1,245 lines — much of it may be boilerplate, dead code,
@@ -39,7 +40,8 @@ essential vs. exploratory is critical for the initial extraction. The codebook
 training pipeline (`run_manifold_projection.py`) should also be analyzed.

 Consider: How many SVD dimensions are actually needed? What's the minimum
-calibration dataset? Can spline distributions be simplified?
+calibration dataset? Can spline distributions be simplified? This needs a
+dedicated session to analyze the PoC codebase.

 ---

@@ -49,34 +51,54 @@ calibration dataset? Can spline distributions be simplified?

 - **Origin**: [firewall.md](firewall.md)
 - **Status**: open
-- **Priority**: low
-- **Resolution**: (pending)
-- **Cross-references**: ADR-003
+- **Priority**: medium
+- **Cross-references**: ADR-003, OQ-05

 Some inputs arrive in chunks (streaming API responses, large documents). Should
 the firewall support incremental screening as chunks arrive, or require the
 full input before screening? Incremental screening could detect attacks earlier
 but requires buffering and state management.

-This is low priority for Phase 1 but affects the internal API design.
+**Rolling window approach**: One promising direction is rolling windows of
+tokens — chunking large text into overlapping windows and screening each
+window independently. This enables:
+
+1. **Granular detection**: For the instruction firewall use case (screening
+   academic papers converted from PDF to markdown), rolling windows can
+   red-flag specific *sections* of a document rather than the whole thing.
+   This is directly useful for catching hidden prompt injections in academic
+   research papers (~20 real examples found of researchers slipping injections
+   past peer review).
+2. **Parallel processing**: Windows can be screened in parallel, enabling
+   throughput scaling.
+3. **Large input handling**: No need to truncate long documents; each window
+   is independently screened within the model's context length.
+
+The PoC has directional (but buggy) Rust code for creating rolling windows
+that can be referenced when designing this feature. This connects to OQ-05
+because streaming/chunking affects how the firewall composes with other
+guardrail systems in a pipeline.
+
+Leave open for Phase 1 design, but the rolling window approach is the leading
+candidate for Phase 2.

 ---

-### OQ-04: Should detection thresholds be per-model or globally configurable?
+### ~~OQ-04: Should detection thresholds be per-model or globally configurable?~~

 - **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
-- **Status**: open
+- **Status**: **resolved**
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: Both — thresholds are **model-specific by default** (shipped
+  with the codebook) but **globally overridable by the user**. Once calibrated,
+  models produce remarkably similar behavioral patterns across models (inspired
+  by the "platonic representation hypothesis" — different models converge on
+  similar internal representations of the same data). The individual activation
+  spaces differ, but the behavioral patterns they encode are consistent enough
+  that thresholds transfer reasonably well. The codebook ships recommended
+  thresholds calibrated for its model; users can adjust.
 - **Cross-references**: ADR-003, ADR-004

-Different detector models may produce different score distributions. Thresholds
-that work for SmolLM2-135M may not work for a different model. Should
-thresholds be tied to the codebook (per-model) or set globally by the user?
-
-Consider: Per-model defaults with user overrides? Codebook ships with
-recommended thresholds that the user can adjust?
-
 ---

 ## Theme: Integration
@@ -86,15 +108,23 @@ recommended thresholds that the user can adjust?
 - **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
 - **Status**: open
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: (pending — needs deep dive into current guardrail landscape)
 - **Cross-references**: ADR-002

 The behavioral firewall is complementary to text-surface defenses. Users may
 want to run both Llama Guard (text classification) and alknet-firewall
-(behavioral signals) in series. How should these be composed?
+(behavioral signals) in series. However, what we're doing is fundamentally
+different — it requires having the model and having trained on its specific
+behavioral signals. This means direct API-level integration with other systems
+may not be straightforward.

-Consider: Integration adapters? A common interface? Callback hooks? Or is
-composition the user's responsibility and we just provide a clean standalone API?
+A deep dive into the current state of guardrail integration patterns
+(LlamaFirewall's scanner interface, NeMo Guardrails' Colang DSL, etc.) is
+needed to determine whether we should build adapters, define a common
+interface, or simply provide a clean standalone API and let users compose
+systems themselves.
+
+Leave open — will research soon.

 ---

@@ -105,25 +135,10 @@ composition the user's responsibility and we just provide a clean standalone API
 - **Origin**: [configuration.md](configuration.md)
 - **Status**: open
 - **Priority**: low
-- **Resolution**: (pending)
+- **Resolution**: (pending — Phase 2 concern)
 - **Cross-references**: None

 Phase 1 uses constructor-based configuration only. A future phase may add
 file-based configuration for easier deployment. TOML is consistent with
 Python packaging (pyproject.toml) and increasingly the standard for Python
-config. YAML is more familiar in ops/ML contexts. Either works.
-
----
-
-### OQ-07: Is a Rust port feasible given current ML framework maturity?
-
-- **Origin**: [overview.md](overview.md), ADR-001
-- **Status**: open
-- **Priority**: low
-- **Resolution**: (pending)
-- **Cross-references**: ADR-001
-
-A Rust port using burn/cubecl was attempted during the PoC phase and failed.
-The ML framework ecosystem in Rust is not yet mature enough for this type
-of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
-matures or alternative Rust ML frameworks emerge.
+config. YAML is more familiar in ops/ML contexts. Either works.
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -64,9 +64,9 @@ for the full threat analysis and academic evidence.

 - **Phase 3**: Advanced capabilities
  - Multi-turn attack detection (payload splitting)
-  - Streaming input screening
+  - Streaming/rolling-window input screening (granular detection for documents)
  - Custom model fine-tuning for domain-specific detection
-  - Rust port via burn/cubecl (speculative, requires R&D)
+  - ONNX Runtime inference backend (export from PyTorch)

 ### Out of Scope