Research confirmed rolling token windows as the right approach for long document screening. ADR-012 formalizes the decision: Phase 2 implements screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max pooling aggregation, and character offset tracking. Short inputs fall through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:
- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config
ADR-012: Rolling Token Window Screening for Long Documents
Status
Accepted
Context
The Phase 1 screen() API processes the full input as a single forward pass
through the detector model. This works for inputs within the model's context
window (2048 tokens for SmolLM2-135M) but fails for longer documents. Two
distinct windowing concepts exist in the detection pipeline:
- Token-level smoothing (already in the codebook): Within a single forward pass, per-token z-coordinates are smoothed with a rolling average (window=8) before classification. This operates on the (seq_len, 3) z-coordinate sequence.
- Input-level rolling windows (this ADR): For long documents that exceed the model's context window, chunk the text into overlapping token windows and screen each window independently. Each window produces its own z-vector and alarm. Windows are aggregated into a document-level verdict.
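To make the first concept concrete, here is a minimal sketch of a rolling average over a (seq_len, 3) sequence. The function name `smooth_z` and the choice of a trailing window that shrinks at the start of the sequence are assumptions for illustration; the codebook's actual smoothing may be centered or padded differently.

```python
from typing import List, Sequence


def smooth_z(z: Sequence[Sequence[float]], window: int = 8) -> List[List[float]]:
    """Trailing rolling mean over a (seq_len, 3) sequence of z-coordinates.

    Assumption: position i averages tokens max(0, i - window + 1) .. i,
    so the first few positions use fewer than `window` tokens.
    """
    smoothed = []
    for i in range(len(z)):
        start = max(0, i - window + 1)
        chunk = z[start:i + 1]
        # zip(*chunk) regroups the rows into one tuple per coordinate axis.
        smoothed.append([sum(axis) / len(chunk) for axis in zip(*chunk)])
    return smoothed
```

This smoothing happens inside one forward pass and never crosses a window boundary, which is why it is independent of the input-level windowing this ADR introduces.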
Research (rolling-window-analysis.md) confirmed that:
- Meta's PromptGuard 2 uses a similar approach (512-token segments)
- Max pooling is the correct aggregation strategy (consistent with existing weighted-max score composition)
- 25% overlap (512 tokens for SmolLM2-135M) balances detection quality vs throughput — enough to catch boundary-spanning injections
- Character offset mapping (from HuggingFace tokenizer offset_mapping) enables granular "section X is suspicious" reporting
- The Rust reference implementation in taskgraph-semantic validates the window creation algorithm
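The max pooling finding can be illustrated with a short sketch. It assumes each window yields a dict mapping a direction name to its P(active) score; the function name `aggregate_max` and the direction labels in the usage note are hypothetical, not part of the existing API.

```python
from typing import Dict, List


def aggregate_max(window_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Document-level score per direction = max over all windows.

    Assumption: every entry in `window_scores` holds the per-direction
    P(active) values produced by screening one window.
    """
    doc_scores: Dict[str, float] = {}
    for scores in window_scores:
        for direction, p_active in scores.items():
            doc_scores[direction] = max(p_active, doc_scores.get(direction, 0.0))
    return doc_scores
```

For example, `aggregate_max([{"a": 0.1, "b": 0.9}, {"a": 0.7, "b": 0.2}])` takes the highest score seen for each direction. A mean would dilute a single suspicious window in a long, otherwise benign document, whereas the maximum keeps it visible.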
Decision
Implement rolling token window screening as the Phase 2 screen_document()
API, with the following parameters:
- Window size: Model's max sequence length (2048 for SmolLM2-135M)
- Overlap: 25% (512 tokens), which equals the full 512-token segment size used by PromptGuard 2
- Aggregation: Max pooling across per-window, per-direction P(active) scores
- Short input handling: Inputs shorter than one window fall through to screen() with no overhead
- Character offset tracking: Token-to-character mapping for granular reporting of flagged sections
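The parameters above can be sketched as a window planner plus an offset lookup. This is an illustration under stated assumptions, not the Phase 2 implementation: `create_windows` and `char_span` are hypothetical names, the final window is allowed to be shorter than the others, and `offsets` stands in for the per-token (start, end) character pairs a tokenizer's offset mapping provides. The Rust `create_rolling_windows()` reference may handle the tail differently.

```python
from typing import List, Tuple


def create_windows(n_tokens: int, window: int = 2048,
                   overlap: int = 512) -> List[Tuple[int, int]]:
    """Return half-open [start, end) token ranges that cover n_tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_tokens <= window:
        # Short input: a single window, i.e. the plain screen() path.
        return [(0, n_tokens)]
    stride = window - overlap
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def char_span(offsets: List[Tuple[int, int]],
              token_span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a non-empty token range to the character range it covers."""
    first, end = token_span
    return offsets[first][0], offsets[end - 1][1]
```

With the defaults, consecutive windows start 1536 tokens apart and share 512 tokens, so an injection that straddles one window's end appears whole in the next window as long as it is no longer than the overlap.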
The two windowing concepts (token-level smoothing, input-level rolling windows) are composable and solve different problems at different levels.
Consequences
Positive:
- Long documents (academic papers, reports) can be screened without truncation
- Granular reporting identifies which sections are suspicious, not just the whole document
- Windows can be processed in parallel for throughput scaling
- Natural fallback: short inputs get the fast single-window path
- Character offsets enable UI integration (highlighting flagged sections)
- Pattern translates directly to Rust for future embedding system integration
Negative:
- Throughput cost: N windows = N forward passes. A 10K-token document needs ~7 windows at 25% overlap (2048-token windows advancing by a 1536-token stride).
- Overlap regions are processed multiple times, increasing compute
- API surface expands: users must choose between screen() and screen_document()
- Edge cases around window boundaries (partial word tokens, very short windows) need careful handling
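The throughput cost can be checked with a few lines of arithmetic. The helper names `window_count` and `processed_tokens` are hypothetical, and the sketch assumes the last window is simply truncated at the end of the document.

```python
import math


def window_count(n_tokens: int, window: int = 2048, overlap: int = 512) -> int:
    """Number of forward passes needed to cover n_tokens."""
    if n_tokens <= window:
        return 1
    stride = window - overlap
    return 1 + math.ceil((n_tokens - window) / stride)


def processed_tokens(n_tokens: int, window: int = 2048, overlap: int = 512) -> int:
    """Total tokens fed to the model; each extra window re-reads `overlap` tokens."""
    return n_tokens + (window_count(n_tokens, window, overlap) - 1) * overlap


print(window_count(10_000))      # 7
print(processed_tokens(10_000))  # 13072
```

Under these assumptions a 10,000-token document costs 7 forward passes and about 31% more tokens than the document contains, which is the compute overhead attributed to the overlap regions.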
References
- rolling-window-analysis.md — Full research with API design and implementation sketch
- OQ-03 — Original open question
- firewall.md — Current screening API
- codebook.md — Token-level smoothing (separate from this)
- taskgraph-semantic: /workspace/@alkimiadev/taskgraph-semantic/src/embedding.rs, Rust reference for create_rolling_windows()