docs: resolve OQ-03 — adopt rolling token window screening (ADR-012)

Research confirmed rolling token windows as the right approach for
long-document screening. ADR-012 formalizes the decision: Phase 2 implements
screen_document() with 25% window overlap (512 tokens for SmolLM2-135M), max
pooling for score aggregation, and character offset tracking. Short inputs
fall through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:
- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Model-specific defaults, user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config
2026-06-13 08:25:12 +00:00
parent 45a0e0798c
commit c225cf420c
5 changed files with 96 additions and 33 deletions

@@ -42,40 +42,22 @@ Centralized tracker for unresolved questions across all architecture documents.
## Theme: API Design
### OQ-03: Should the firewall support streaming/chunked input screening?
### ~~OQ-03: Should the firewall support streaming/chunked input screening?~~
- **Origin**: [firewall.md](firewall.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: medium
- **Cross-references**: ADR-003, OQ-05
Some inputs arrive in chunks (streaming API responses, large documents). Should
the firewall support incremental screening as chunks arrive, or require the
full input before screening? Incremental screening could detect attacks earlier
but requires buffering and state management.
**Rolling window approach**: One promising direction is rolling windows of
tokens — chunking large text into overlapping windows and screening each
window independently. This enables:
1. **Granular detection**: For the instruction firewall use case (screening
academic papers converted from PDF to markdown), rolling windows can
red-flag specific *sections* of a document rather than the whole thing.
This is directly useful for catching hidden prompt injections in academic
research papers (~20 real examples found of researchers slipping injections
past peer review).
2. **Parallel processing**: Windows can be screened in parallel, enabling
throughput scaling.
3. **Large input handling**: No need to truncate long documents; each window
is independently screened within the model's context length.
The PoC has directional (but buggy) Rust code for creating rolling windows
that can be referenced when designing this feature. This connects to OQ-05
because streaming/chunking affects how the firewall composes with other
guardrail systems in a pipeline.
Leave open for Phase 1 design, but the rolling window approach is the leading
candidate for Phase 2.
- **Resolution**: Rolling token window approach (ADR-012). Phase 2 implements
`screen_document()` with overlapping token windows (25% overlap, model's
full context length per window), max pooling for score aggregation, and
character offset tracking for granular "which sections are suspicious"
reporting. Short inputs fall through to the single-window `screen()` path.
The research doc includes a directionally correct implementation sketch.
Two distinct windowing concepts are now clearly separated: token-level
smoothing (within a single forward pass, already in codebook) vs
input-level rolling windows (multiple forward passes for long documents,
Phase 2).
- **Cross-references**: ADR-003, ADR-012
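
The windowing, max pooling, and offset tracking described in the resolution can be sketched roughly as follows. This is a minimal illustration, not the ADR-012 implementation: `Window`, `DocumentVerdict`, `rolling_windows`, the `threshold` parameter, and the `score_window` callback are hypothetical names, the real `screen_document()` signature may differ, and scores are assumed to be non-negative. The per-token offsets are assumed to come from the tokenizer.

```rust
use std::ops::Range;

/// One window over the token sequence, plus the span of source text it covers.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Window {
    pub tokens: Range<usize>,
    pub chars: Range<usize>,
}

/// Aggregated result for a whole document. (Hypothetical type.)
#[derive(Debug)]
pub struct DocumentVerdict {
    /// Max-pooled score over all windows.
    pub score: f32,
    /// Source-text spans whose window score reached the threshold.
    pub flagged: Vec<(Range<usize>, f32)>,
}

/// Splits the tokens into windows of `window_len` tokens, each sharing
/// `overlap` tokens with its predecessor. `offsets[i]` is the source-text
/// span of token `i`. The last window may be shorter than `window_len`.
pub fn rolling_windows(offsets: &[Range<usize>], window_len: usize, overlap: usize) -> Vec<Window> {
    assert!(window_len > 0 && overlap < window_len, "overlap must be smaller than the window");
    let n = offsets.len();
    if n == 0 {
        return Vec::new();
    }
    let stride = window_len - overlap;
    let mut windows = Vec::new();
    let mut start = 0;
    loop {
        let end = (start + window_len).min(n);
        windows.push(Window {
            tokens: start..end,
            chars: offsets[start].start..offsets[end - 1].end,
        });
        if end == n {
            break;
        }
        start += stride;
    }
    windows
}

/// Screens every window with `score_window` (one forward pass per window in
/// the real system), max-pools the scores, and records which spans to flag.
pub fn screen_document<F>(
    offsets: &[Range<usize>],
    window_len: usize,
    threshold: f32,
    mut score_window: F,
) -> DocumentVerdict
where
    F: FnMut(Range<usize>) -> f32,
{
    let overlap = window_len / 4; // 25% overlap, as in the resolution text
    let mut score = 0.0_f32; // assumes scores are non-negative
    let mut flagged = Vec::new();
    for w in rolling_windows(offsets, window_len, overlap) {
        let s = score_window(w.tokens.clone());
        score = score.max(s);
        if s >= threshold {
            flagged.push((w.chars, s));
        }
    }
    DocumentVerdict { score, flagged }
}
```

An input that fits in one window produces a single window here, which mirrors the short-input fall-through to `screen()`. Because windows are scored independently, the loop could also be run in parallel.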
---