glm-5.1 c225cf420c docs: resolve OQ-03 — adopt rolling token window screening (ADR-012)

Research confirmed rolling token windows as the right approach for long
document screening. ADR-012 formalizes the decision: Phase 2 implements
screen_document() with 25% overlap (512 tokens for SmolLM2-135M), max
pooling aggregation, and character offset tracking. Short inputs fall
through to screen() unchanged.

This resolves the last open question. All 6 original OQs are now resolved:
- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config

2026-06-13 08:25:12 +00:00

Open Questions

Centralized tracker for unresolved questions across all architecture documents.

Theme: Inference Backend

OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?

Origin: model.md, overview.md
Status: resolved
Priority: medium
Resolution: Removed from scope entirely. ONNX Runtime does not support output_hidden_states=True natively (HuggingFace optimum issue #972 was closed as "not planned"), making activation extraction — the core operation — impractical without a custom ONNX graph modification pipeline. The ONNX model format also produces bloated exports. A future alternative inference path using burn/cublas with safetensors is more promising since it supports all platforms and uses the same model format we already require.
Cross-references: ADR-006

Theme: Codebook Design

OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?

Origin: codebook.md
Status: resolved
Priority: high
Resolution: Yes: ~65% compression to 500–600 lines total (400–500 runtime + 150–200 training). The PoC contains ~480 lines of essential runtime code plus ~178 lines needed from metaspline core. The 5x-repeated decomposition pipeline collapses into a single decompose() function (~50 lines saved). The histogram classifier (~130 lines) is exploratory and not MVP. The build() method (429 lines) is decomposed: training logic moves to training/compiler.py, runtime state becomes immutable serialized data. See poc-architecture.md and the Package Structure section in codebook.md.
Cross-references: ADR-004

Theme: API Design

OQ-03: Should the firewall support streaming/chunked input screening?

Origin: firewall.md
Status: resolved
Priority: medium
Resolution: Rolling token window approach (ADR-012). Phase 2 implements screen_document() with overlapping token windows (25% overlap, model's full context length per window), max pooling for score aggregation, and character offset tracking for granular "which sections are suspicious" reporting. Short inputs fall through to the single-window screen() path. The research doc includes a directionally correct implementation sketch. Two distinct windowing concepts are now clearly separated: token-level smoothing (within a single forward pass, already in codebook) vs input-level rolling windows (multiple forward passes for long documents, Phase 2).
Cross-references: ADR-003, ADR-012
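
The windowing described in OQ-03 can be sketched as follows. This is a minimal illustration, not the sketch from the research doc or the planned screen_document() signature: the function names, the WindowScore type, and the score_window callback are all assumptions, and a whitespace splitter stands in for the real tokenizer so the example runs without a model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class WindowScore:
    char_start: int  # offset of the window's first character in the input
    char_end: int    # offset one past the window's last character
    score: float


def window_starts(n_tokens: int, window: int, overlap: float = 0.25) -> List[int]:
    """Start indices of overlapping token windows that cover n_tokens."""
    if window <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be > 0 and overlap in [0, 1)")
    if n_tokens <= window:
        return [0]  # short input: a single window, i.e. the screen() path
    stride = max(1, window - int(window * overlap))
    starts = list(range(0, n_tokens - window, stride))
    starts.append(n_tokens - window)  # last window sits flush with the end
    return starts


def screen_document(
    offsets: Sequence[Tuple[int, int]],
    score_window: Callable[[int, int], float],
    window: int,
    overlap: float = 0.25,
) -> Tuple[float, List[WindowScore]]:
    """Score each window, then max-pool into one document score.

    offsets[i] is the (char_start, char_end) span of token i;
    score_window(start, end) is a hypothetical per-window scorer.
    """
    n = len(offsets)
    if n == 0:
        return 0.0, []
    results = []
    for start in window_starts(n, window, overlap):
        end = min(start + window, n)
        results.append(
            WindowScore(offsets[start][0], offsets[end - 1][1], score_window(start, end))
        )
    return max(r.score for r in results), results


# Toy usage: whitespace "tokens" and a scorer that fires on one marker word.
text = "alpha beta gamma delta epsilon zeta eta theta iota kappa"
offsets = [m.span() for m in re.finditer(r"\S+", text)]
tokens = [text[a:b] for a, b in offsets]


def toy_score(start: int, end: int) -> float:
    return 1.0 if "theta" in tokens[start:end] else 0.0


doc_score, windows = screen_document(offsets, toy_score, window=4)
flagged = [text[w.char_start:w.char_end] for w in windows if w.score >= 0.5]
```

Max pooling means a single suspicious window is enough to raise the document score, and the per-window character spans are what allow reporting which sections triggered it.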

OQ-04: Should detection thresholds be per-model or globally configurable?

Origin: configuration.md, codebook.md
Status: resolved
Priority: medium
Resolution: Both. Thresholds are model-specific by default (shipped with the codebook) but globally overridable by the user. Once calibrated, different models produce remarkably similar behavioral patterns (inspired by the "platonic representation hypothesis": different models converge on similar internal representations of the same data). The individual activation spaces differ, but the behavioral patterns they encode are consistent enough that thresholds transfer reasonably well. The codebook ships recommended thresholds calibrated for its model; users can adjust.
Cross-references: ADR-003, ADR-004
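
One way to read the "model-specific defaults, user-overridable" rule in OQ-04 is as a simple layered merge. The sketch below assumes thresholds are a flat name-to-float mapping; the function name and the threshold names ("alarm", "warn") are hypothetical and do not come from the codebook format.

```python
from typing import Dict, Mapping, Optional


def resolve_thresholds(
    codebook_defaults: Mapping[str, float],
    user_overrides: Optional[Mapping[str, float]] = None,
) -> Dict[str, float]:
    """Start from the codebook's calibrated defaults, then apply user overrides."""
    resolved = dict(codebook_defaults)  # copy so the shipped defaults stay untouched
    for name, value in (user_overrides or {}).items():
        if name not in resolved:
            # Reject typos rather than silently adding an unused threshold.
            raise KeyError(f"unknown threshold: {name!r}")
        resolved[name] = float(value)
    return resolved


# Hypothetical values for illustration only.
defaults = {"alarm": 0.8, "warn": 0.5}
effective = resolve_thresholds(defaults, {"warn": 0.4})
```

Rejecting unknown names is a design choice in this sketch, not a documented requirement; it keeps a misspelled override from being ignored.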

Theme: Integration

OQ-05: How should the firewall integrate with existing guardrail systems?

Origin: firewall.md, overview.md
Status: resolved
Priority: medium
Resolution: Standalone API + thin adapter pattern (ADR-011). Phase 1: ship the standalone Firewall.screen(text) → Alarm API only. Phase 2: build thin adapter packages (<100 lines each) for LlamaFirewall, OpenAI Agents SDK, and NeMo Guardrails as optional dependencies. Do NOT build a common ScreeningProvider interface — behavioral detection is fundamentally different from text-surface defenses and premature abstraction would be constraining.
Cross-references: ADR-002, ADR-011
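
The thin adapter pattern from OQ-05 might look roughly like the sketch below. Only Firewall.screen(text) → Alarm comes from this document; the Alarm fields, StubFirewall, HostVerdict, and HostAdapter are hypothetical stand-ins, and no real LlamaFirewall, OpenAI Agents SDK, or NeMo Guardrails API is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Assumed fields; the real Alarm type may differ.
    triggered: bool
    score: float


class StubFirewall:
    """Stand-in for the standalone firewall so the sketch runs without a model."""

    def __init__(self, score: float) -> None:
        self._score = score

    def screen(self, text: str) -> Alarm:
        return Alarm(triggered=self._score >= 0.5, score=self._score)


@dataclass(frozen=True)
class HostVerdict:
    """Hypothetical result type expected by some host guardrail framework."""

    allow: bool
    reason: str


class HostAdapter:
    """Thin adapter: translate an Alarm into the host's own verdict type."""

    def __init__(self, firewall) -> None:
        self._firewall = firewall  # anything exposing screen(text) -> Alarm

    def check(self, text: str) -> HostVerdict:
        alarm = self._firewall.screen(text)
        if alarm.triggered:
            return HostVerdict(False, f"behavioral alarm (score={alarm.score:.2f})")
        return HostVerdict(True, "no alarm")


blocked = HostAdapter(StubFirewall(0.9)).check("some input")
allowed = HostAdapter(StubFirewall(0.1)).check("some input")
```

Each adapter owns the translation to its host's types, so no shared ScreeningProvider interface is needed across frameworks.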

Theme: Project Setup

OQ-06: Should file-based configuration use TOML or YAML?

Origin: configuration.md
Status: resolved
Priority: low
Resolution: TOML. Consistent with modern Python packaging conventions (pyproject.toml) and increasingly the standard for Python configuration. This is a two-way door decision — reverting to YAML later is straightforward.
Cross-references: None
