Research confirmed rolling token windows as the right approach for long document screening. ADR-012 formalizes the decision: Phase 2 implements screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max pooling aggregation, and character offset tracking. Short inputs fall through to screen() unchanged. This resolves the last open question. All 6 original OQs are now resolved: - OQ-01: ONNX removed (burn/cublas better future path) - OQ-02: 65% codebook compression achievable - OQ-03: Rolling token windows for Phase 2 (ADR-012) - OQ-04: Both model-specific defaults + user-overridable - OQ-05: Standalone API + thin adapters (ADR-011) - OQ-06: TOML for file-based config
79 lines
3.6 KiB
Markdown
79 lines
3.6 KiB
Markdown
# ADR-012: Rolling Token Window Screening for Long Documents
|
|
|
|
## Status
|
|
|
|
Accepted
|
|
|
|
## Context
|
|
|
|
The Phase 1 `screen()` API processes the full input as a single forward pass
|
|
through the detector model. This works for inputs within the model's context
|
|
window (2048 tokens for SmolLM2-135M) but fails for longer documents. Two
|
|
distinct windowing concepts exist in the detection pipeline:
|
|
|
|
1. **Token-level smoothing** (already in the codebook): Within a single
|
|
forward pass, per-token z-coordinates are smoothed with a rolling average
|
|
(window=8) before classification. This operates on the `(seq_len, 3)` z
|
|
coordinate sequence.
|
|
|
|
2. **Input-level rolling windows** (this ADR): For long documents that exceed
|
|
the model's context window, chunk the text into overlapping token windows
|
|
and screen each window independently. Each window produces its own z-vector
|
|
and alarm. Windows are aggregated into a document-level verdict.
|
|
|
|
Research ([rolling-window-analysis.md](../../research/streaming-screening-patterns/rolling-window-analysis.md))
|
|
confirmed that:
|
|
- Meta's PromptGuard 2 uses a similar approach (512-token segments)
|
|
- Max pooling is the correct aggregation strategy (consistent with existing
|
|
weighted-max score composition)
|
|
- 25% overlap (512 tokens for SmolLM2-135M) balances detection quality vs
|
|
throughput — enough to catch boundary-spanning injections
|
|
- Character offset mapping (from HuggingFace tokenizer `offset_mapping`)
|
|
enables granular "section X is suspicious" reporting
|
|
- The Rust reference implementation in taskgraph-semantic validates the
|
|
window creation algorithm
|
|
|
|
## Decision
|
|
|
|
Implement rolling token window screening as the Phase 2 `screen_document()`
|
|
API, with the following parameters:
|
|
|
|
- **Window size**: Model's max sequence length (2048 for SmolLM2-135M)
|
|
- **Overlap**: 25% (512 tokens) — same as PromptGuard's entire context window
|
|
- **Aggregation**: Max pooling across per-window, per-direction P(active)
|
|
scores
|
|
- **Short input handling**: Inputs shorter than one window fall through to
|
|
`screen()` with no overhead
|
|
- **Character offset tracking**: Token-to-character mapping for granular
|
|
reporting of flagged sections
|
|
|
|
The two windowing concepts (token-level smoothing, input-level rolling windows)
|
|
are composable and solve different problems at different levels.
|
|
|
|
## Consequences
|
|
|
|
**Positive**:
|
|
- Long documents (academic papers, reports) can be screened without truncation
|
|
- Granular reporting identifies which sections are suspicious, not just the
|
|
whole document
|
|
- Windows can be processed in parallel for throughput scaling
|
|
- Natural fallback: short inputs get the fast single-window path
|
|
- Character offsets enable UI integration (highlighting flagged sections)
|
|
- Pattern translates directly to Rust for future embedding system integration
|
|
|
|
**Negative**:
|
|
- Throughput cost: N windows = N forward passes. A 10K-token document needs
|
|
~7 windows at 25% overlap.
|
|
- Overlap regions are processed multiple times, increasing compute
|
|
- API surface expands — users must choose between `screen()` and
|
|
`screen_document()`
|
|
- Edge cases around window boundaries (partial word tokens, very short
|
|
windows) need careful handling
|
|
|
|
## References
|
|
|
|
- [rolling-window-analysis.md](../../research/streaming-screening-patterns/rolling-window-analysis.md) — Full research with API design and implementation sketch
|
|
- [OQ-03](../open-questions.md) — Original open question
|
|
- [firewall.md](../firewall.md) — Current screening API
|
|
- [codebook.md](../codebook.md) — Token-level smoothing (separate from this)
|
|
- taskgraph-semantic: `/workspace/@alkimiadev/taskgraph-semantic/src/embedding.rs` — Rust reference for `create_rolling_windows()` |