docs: resolve OQ-03 — adopt rolling token window screening (ADR-012)

Research confirmed rolling token windows as the right approach for long document screening. ADR-012 formalizes the decision: Phase 2 implements screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max pooling aggregation, and character offset tracking. Short inputs fall through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:

- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config
2026-06-13 08:25:12 +00:00
parent 45a0e0798c
commit c225cf420c
5 changed files with 96 additions and 33 deletions
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -47,6 +47,7 @@ raises "behavioral alarms" without needing to know specific attack types.
 | [009](decisions/009-last-token-extraction.md) | Last-Token Activation Extraction | Accepted |
 | [010](decisions/010-monotonic-spline-distributions.md) | Monotonic Spline Distributions | Accepted |
 | [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + Thin Adapter Integration | Accepted |
 | [012](decisions/012-rolling-window-screening.md) | Rolling Token Window Screening | Accepted |
 ## Open Questions
@@ -56,7 +57,7 @@ See [open-questions.md](open-questions.md) for the full tracker.
 |----|----------|----------|--------|
 | ~~OQ-01~~ | ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ | ~~medium~~ | **resolved** (removed from scope; burn/cublas is better future path) |
 | ~~OQ-02~~ | ~~What is the minimum viable codebook — can the 1,245-line codebook be compressed?~~ | ~~high~~ | **resolved** (~65% compression to 500–600 lines) |
-| OQ-03 | Should the firewall support streaming/chunked input screening? | medium | open (research complete, Phase 2) |
+| ~~OQ-03~~ | ~~Should the firewall support streaming/chunked input screening?~~ | ~~medium~~ | **resolved** (ADR-012: rolling token windows in Phase 2) |
 | ~~OQ-04~~ | ~~Should detection thresholds be per-model or globally configurable?~~ | ~~medium~~ | **resolved** (both: model-specific defaults, user-overridable) |
 | ~~OQ-05~~ | ~~How should the firewall integrate with existing guardrail systems?~~ | ~~medium~~ | **resolved** (ADR-011: standalone API + thin adapters) |
 | ~~OQ-06~~ | ~~Should file-based configuration use TOML or YAML?~~ | ~~low~~ | **resolved** (TOML) |
--- a/docs/architecture/decisions/012-rolling-window-screening.md
+++ b/docs/architecture/decisions/012-rolling-window-screening.md
@@ -0,0 +1,79 @@
 # ADR-012: Rolling Token Window Screening for Long Documents
 ## Status
 Accepted
 ## Context
 The Phase 1 `screen()` API processes the full input as a single forward pass
 through the detector model. This works for inputs within the model's context
 window (2048 tokens for SmolLM2-135M) but fails for longer documents. Two
 distinct windowing concepts exist in the detection pipeline:
 1. **Token-level smoothing** (already in the codebook): Within a single
   forward pass, per-token z-coordinates are smoothed with a rolling average
   (window=8) before classification. This operates on the `(seq_len, 3)` z
   coordinate sequence.
 2. **Input-level rolling windows** (this ADR): For long documents that exceed
   the model's context window, chunk the text into overlapping token windows
   and screen each window independently. Each window produces its own z-vector
   and alarm. Windows are aggregated into a document-level verdict.
 Research ([rolling-window-analysis.md](../../research/streaming-screening-patterns/rolling-window-analysis.md))
 confirmed that:
 - Meta's PromptGuard 2 uses a similar approach (512-token segments)
 - Max pooling is the correct aggregation strategy (consistent with existing
  weighted-max score composition)
 - 25% overlap (512 tokens for SmolLM2-135M) balances detection quality vs
  throughput — enough to catch boundary-spanning injections
 - Character offset mapping (from HuggingFace tokenizer `offset_mapping`)
  enables granular "section X is suspicious" reporting
 - The Rust reference implementation in taskgraph-semantic validates the
  window creation algorithm
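
As an illustration of the window creation and character offset points above (this is not part of the ADR text), here is a minimal Python sketch. The names `rolling_windows` and `char_span` are hypothetical, the defaults are the SmolLM2-135M figures quoted in this ADR, and `offsets` is assumed to be a list of per-token `(char_start, char_end)` pairs such as a HuggingFace fast tokenizer returns as `offset_mapping`, with no special tokens included.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def rolling_windows(n_tokens: int, window: int = 2048, overlap: int = 512) -> List[Span]:
    """Return half-open (start, end) token spans that cover n_tokens tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_tokens <= window:
        # Short input: a single span, i.e. the plain single-window path.
        return [(0, n_tokens)]
    stride = window - overlap  # 1536 tokens for the defaults above
    spans: List[Span] = []
    start = 0
    while True:
        end = min(start + window, n_tokens)  # the last window may be shorter
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def char_span(offsets: Sequence[Span], token_span: Span) -> Span:
    """Map a non-empty token span to the character range it covers."""
    start, end = token_span
    return offsets[start][0], offsets[end - 1][1]
```

With the defaults, consecutive windows start 1536 tokens apart, so any injection of up to 512 tokens appears whole in at least one window.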
 ## Decision
 Implement rolling token window screening as the Phase 2 `screen_document()`
 API, with the following parameters:
 - **Window size**: Model's max sequence length (2048 for SmolLM2-135M)
 - **Overlap**: 25% of the window (512 tokens for SmolLM2-135M), which equals PromptGuard 2's entire 512-token segment size
 - **Aggregation**: Max pooling across per-window, per-direction P(active)
  scores
 - **Short input handling**: Inputs that fit within a single window fall
  through to `screen()` with no added overhead
 - **Character offset tracking**: Token-to-character mapping for granular
  reporting of flagged sections
 The two windowing concepts (token-level smoothing, input-level rolling windows)
 are composable and solve different problems at different levels.
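
The max pooling aggregation in the decision above could look roughly like the following sketch (not part of the ADR text). It assumes each window yields a dict mapping a direction name to its P(active) score. The function names, the direction labels in the test data, and the `0.5` threshold are all hypothetical placeholders, not values from the project.

```python
from typing import Dict, List


def aggregate_max(window_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Document-level score per direction: the maximum over all windows."""
    doc_scores: Dict[str, float] = {}
    for scores in window_scores:
        for direction, p_active in scores.items():
            doc_scores[direction] = max(p_active, doc_scores.get(direction, 0.0))
    return doc_scores


def flagged_windows(window_scores: List[Dict[str, float]], threshold: float = 0.5) -> List[int]:
    """Indices of windows where any direction reaches the (assumed) threshold."""
    return [
        index
        for index, scores in enumerate(window_scores)
        if max(scores.values(), default=0.0) >= threshold
    ]
```

Max pooling means a single suspicious window is enough to raise the document-level score, while `flagged_windows` keeps the per-window detail needed for section-level reporting.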
 ## Consequences
 **Positive**:
 - Long documents (academic papers, reports) can be screened without truncation
 - Granular reporting identifies which sections are suspicious, not just the
  whole document
 - Windows can be processed in parallel for throughput scaling
 - Natural fallback: short inputs get the fast single-window path
 - Character offsets enable UI integration (highlighting flagged sections)
 - Pattern translates directly to Rust for future embedding system integration
 **Negative**:
 - Throughput cost: N windows = N forward passes. A 10K-token document needs
  7 windows at 25% overlap (2048-token windows, 1536-token stride).
 - Overlap regions are processed multiple times, increasing compute
 - API surface expands — users must choose between `screen()` and
  `screen_document()`
 - Edge cases around window boundaries (partial word tokens, very short
  windows) need careful handling
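
The window count quoted for a 10K-token document can be checked with a short calculation (not part of the ADR text). This sketch assumes the stride is `window - overlap` and that the final window is simply shorter rather than padded. Both function names are hypothetical.

```python
import math


def window_count(n_tokens: int, window: int = 2048, overlap: int = 512) -> int:
    """Number of forward passes needed to cover n_tokens tokens."""
    if n_tokens <= window:
        return 1
    stride = window - overlap
    return math.ceil((n_tokens - window) / stride) + 1


def tokens_processed(n_tokens: int, window: int = 2048, overlap: int = 512) -> int:
    """Total tokens fed to the model, counting overlap regions twice."""
    count = window_count(n_tokens, window, overlap)
    if count == 1:
        return n_tokens
    stride = window - overlap
    # count - 1 full windows, plus whatever remains for the last one.
    return (count - 1) * window + (n_tokens - (count - 1) * stride)


print(window_count(10_000))      # 7
print(tokens_processed(10_000))  # 13072
```

Under these assumptions a 10,000-token document costs 7 forward passes over 13,072 tokens in total, about 31% more tokens than the document itself contains.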
 ## References
 - [rolling-window-analysis.md](../../research/streaming-screening-patterns/rolling-window-analysis.md) — Full research with API design and implementation sketch
 - [OQ-03](../open-questions.md) — Original open question
 - [firewall.md](../firewall.md) — Current screening API
 - [codebook.md](../codebook.md) — Token-level smoothing (separate from this)
 - taskgraph-semantic: `/workspace/@alkimiadev/taskgraph-semantic/src/embedding.rs` — Rust reference for `create_rolling_windows()`
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -221,5 +221,5 @@ All exception types subclass `AlknetFirewallError` (base library exception).
 Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; [research complete](../research/streaming-screening-patterns/rolling-window-analysis.md))
+- ~~**OQ-03**~~: ~~Should the firewall support streaming/chunked input screening?~~ (resolved — ADR-012: rolling token windows with `screen_document()` in Phase 2)
 - ~~**OQ-05**~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved — ADR-011: standalone API + thin adapters Phase 2)
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -42,40 +42,22 @@ Centralized tracker for unresolved questions across all architecture documents.
 ## Theme: API Design
-### OQ-03: Should the firewall support streaming/chunked input screening?
+### ~~OQ-03: Should the firewall support streaming/chunked input screening?~~
 - **Origin**: [firewall.md](firewall.md)
-- **Status**: open
+- **Status**: **resolved**
 - **Priority**: medium
-- **Cross-references**: ADR-003, OQ-05
-
-Some inputs arrive in chunks (streaming API responses, large documents). Should
-the firewall support incremental screening as chunks arrive, or require the
-full input before screening? Incremental screening could detect attacks earlier
-but requires buffering and state management.
-
-**Rolling window approach**: One promising direction is rolling windows of
-tokens — chunking large text into overlapping windows and screening each
-window independently. This enables:
-
-1. **Granular detection**: For the instruction firewall use case (screening
-   academic papers converted from PDF to markdown), rolling windows can
-   red-flag specific *sections* of a document rather than the whole thing.
-   This is directly useful for catching hidden prompt injections in academic
-   research papers (~20 real examples found of researchers slipping injections
-   past peer review).
-2. **Parallel processing**: Windows can be screened in parallel, enabling
-   throughput scaling.
-3. **Large input handling**: No need to truncate long documents; each window
-   is independently screened within the model's context length.
-
-The PoC has directional (but buggy) Rust code for creating rolling windows
-that can be referenced when designing this feature. This connects to OQ-05
-because streaming/chunking affects how the firewall composes with other
-guardrail systems in a pipeline.
-
-Leave open for Phase 1 design, but the rolling window approach is the leading
-candidate for Phase 2.
+- **Resolution**: Rolling token window approach (ADR-012). Phase 2 implements
+  `screen_document()` with overlapping token windows (25% overlap, model's
+  full context length per window), max pooling for score aggregation, and
+  character offset tracking for granular "which sections are suspicious"
+  reporting. Short inputs fall through to the single-window `screen()` path.
+  The research doc includes a directionally correct implementation sketch.
+  Two distinct windowing concepts are now clearly separated: token-level
+  smoothing (within a single forward pass, already in codebook) vs
+  input-level rolling windows (multiple forward passes for long documents,
+  Phase 2).
+- **Cross-references**: ADR-003, ADR-012
 ---
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -185,6 +185,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
 | [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
 | [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
 | [012](decisions/012-rolling-window-screening.md) | Rolling token window screening | Phase 2 `screen_document()` with 25% overlap, max pooling |
 ## Dependencies on Other Projects