docs: resolve OQ-03 — adopt rolling token window screening (ADR-012)

Research confirmed rolling token windows as the right approach for long document screening. ADR-012 formalizes the decision: Phase 2 implements screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max pooling aggregation, and character offset tracking. Short inputs fall through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:

- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config
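
The windowing scheme named in the commit message (overlapping token windows, max pooling, character offset tracking) can be illustrated with a minimal Rust sketch. This is not the ADR-012 implementation or the PoC code: `Window`, `rolling_windows`, `max_pool`, and `window_span` are hypothetical names, per-window scores are assumed to come from an existing single-window `screen()` call, and token character offsets are assumed to be supplied by the tokenizer as `(start, end)` pairs. The 2048-token window used in the note below is inferred from "25% overlap (512 tokens)" and is not stated in the commit.

```rust
/// Half-open token range [start, end) covered by one window.
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Window {
    pub start: usize,
    pub end: usize,
}

/// Split `n_tokens` into windows of at most `window_len` tokens, where
/// consecutive windows share `overlap` tokens. An input that fits in one
/// window yields a single window, mirroring the short-input path.
pub fn rolling_windows(n_tokens: usize, window_len: usize, overlap: usize) -> Vec<Window> {
    assert!(window_len > 0 && overlap < window_len, "overlap must be smaller than the window");
    let mut windows = Vec::new();
    if n_tokens == 0 {
        return windows;
    }
    let stride = window_len - overlap;
    let mut start = 0;
    loop {
        let end = (start + window_len).min(n_tokens);
        windows.push(Window { start, end });
        if end == n_tokens {
            break;
        }
        start += stride;
    }
    windows
}

/// Max pooling: the document-level score is the highest window score.
/// Returns `None` when there are no windows.
pub fn max_pool(scores: &[f32]) -> Option<f32> {
    scores.iter().copied().reduce(f32::max)
}

/// Map a token window back to a character span, given per-token
/// (start, end) character offsets (assumed to come from the tokenizer).
pub fn window_span(w: Window, token_offsets: &[(usize, usize)]) -> (usize, usize) {
    (token_offsets[w.start].0, token_offsets[w.end - 1].1)
}
```

Under these assumptions, a 5000-token document with a 2048-token window and 512-token overlap produces three windows starting at tokens 0, 1536, and 3072. The span of whichever window holds the maximum score is what would be reported as the suspicious section.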
2026-06-13 08:25:12 +00:00
parent 45a0e0798c
commit c225cf420c
5 changed files with 96 additions and 33 deletions
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -42,40 +42,22 @@ Centralized tracker for unresolved questions across all architecture documents.

 ## Theme: API Design

-### OQ-03: Should the firewall support streaming/chunked input screening?
+### ~~OQ-03: Should the firewall support streaming/chunked input screening?~~

 - **Origin**: [firewall.md](firewall.md)
-- **Status**: open
+- **Status**: **resolved**
 - **Priority**: medium
-- **Cross-references**: ADR-003, OQ-05
-
-Some inputs arrive in chunks (streaming API responses, large documents). Should
-the firewall support incremental screening as chunks arrive, or require the
-full input before screening? Incremental screening could detect attacks earlier
-but requires buffering and state management.
-
-**Rolling window approach**: One promising direction is rolling windows of
-tokens — chunking large text into overlapping windows and screening each
-window independently. This enables:
-
-1. **Granular detection**: For the instruction firewall use case (screening
-   academic papers converted from PDF to markdown), rolling windows can
-   red-flag specific *sections* of a document rather than the whole thing.
-   This is directly useful for catching hidden prompt injections in academic
-   research papers (~20 real examples found of researchers slipping injections
-   past peer review).
-2. **Parallel processing**: Windows can be screened in parallel, enabling
-   throughput scaling.
-3. **Large input handling**: No need to truncate long documents; each window
-   is independently screened within the model's context length.
-
-The PoC has directional (but buggy) Rust code for creating rolling windows
-that can be referenced when designing this feature. This connects to OQ-05
-because streaming/chunking affects how the firewall composes with other
-guardrail systems in a pipeline.
-
-Leave open for Phase 1 design, but the rolling window approach is the leading
-candidate for Phase 2.
+- **Resolution**: Rolling token window approach (ADR-012). Phase 2 implements
+  `screen_document()` with overlapping token windows (25% overlap, model's
+  full context length per window), max pooling for score aggregation, and
+  character offset tracking for granular "which sections are suspicious"
+  reporting. Short inputs fall through to the single-window `screen()` path.
+  The research doc includes a directionally correct implementation sketch.
+  Two distinct windowing concepts are now clearly separated: token-level
+  smoothing (within a single forward pass, already in codebook) vs
+  input-level rolling windows (multiple forward passes for long documents,
+  Phase 2).
+- **Cross-references**: ADR-003, ADR-012

 ---