docs: resolve OQ-04, remove OQ-07, enrich OQ-03 with rolling windows
- OQ-04 resolved: thresholds are both model-specific (shipped with codebook) and user-overridable. Inspired by platonic representation hypothesis: calibrated models converge on similar behavioral patterns.
- OQ-07 removed: Rust port is an alknet project concern, not relevant to the Python package architecture. Removed from overview.md Phase 3.
- OQ-03 enriched: rolling window token screening for granular detection in documents (PDF→markdown use case, academic paper injection detection). Upgraded from low to medium priority.
- OQ-01 updated: likely path is PyTorch first, ONNX export by default.
- OQ-05 updated: needs deep dive into guardrail landscape.
- Updated threshold description in configuration.md with platonic representation context.
@@ -9,7 +9,7 @@ Centralized tracker for unresolved questions across all architecture documents.
 - **Origin**: [model.md](model.md), [overview.md](overview.md)
 - **Status**: open
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: (pending; needs research into ONNX export path)
 - **Cross-references**: ADR-006

 ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
@@ -18,8 +18,9 @@ library provides drop-in replacement classes. However, supporting it in Phase 1
 adds complexity: model must be exported to ONNX format, `optimum` integration
 must be tested, and the activation extraction API may differ from PyTorch.

-Consider: Is the smaller footprint worth the integration complexity in Phase 1,
-or should ONNX support wait until Phase 2 when the core API is stable?
+The likely path is: build with PyTorch first, then export to ONNX by default.
+This needs research to confirm the activation extraction API compatibility and
+ONNX export quality for SmolLM2-135M. Leave open for now.

 ---

@@ -30,7 +31,7 @@ or should ONNX support wait until Phase 2 when the core API is stable?
 - **Origin**: [codebook.md](codebook.md)
 - **Status**: open
 - **Priority**: high
-- **Resolution**: (pending)
+- **Resolution**: (pending; dedicated research session needed)
 - **Cross-references**: ADR-004

 The PoC codebook is 1,245 lines; much of it may be boilerplate, dead code,
@@ -39,7 +40,8 @@ essential vs. exploratory is critical for the initial extraction. The codebook
 training pipeline (`run_manifold_projection.py`) should also be analyzed.

 Consider: How many SVD dimensions are actually needed? What's the minimum
-calibration dataset? Can spline distributions be simplified?
+calibration dataset? Can spline distributions be simplified? This needs a
+dedicated session to analyze the PoC codebase.

 ---

@@ -49,34 +51,54 @@ calibration dataset? Can spline distributions be simplified?

 - **Origin**: [firewall.md](firewall.md)
 - **Status**: open
-- **Priority**: low
-- **Resolution**: (pending)
-- **Cross-references**: ADR-003
+- **Priority**: medium
+- **Cross-references**: ADR-003, OQ-05

 Some inputs arrive in chunks (streaming API responses, large documents). Should
 the firewall support incremental screening as chunks arrive, or require the
 full input before screening? Incremental screening could detect attacks earlier
 but requires buffering and state management.

-This is low priority for Phase 1 but affects the internal API design.
+**Rolling window approach**: One promising direction is rolling windows of
+tokens, i.e. chunking large text into overlapping windows and screening each
+window independently. This enables:
+
+1. **Granular detection**: For the instruction firewall use case (screening
+academic papers converted from PDF to markdown), rolling windows can
+red-flag specific *sections* of a document rather than the whole thing.
+This is directly useful for catching hidden prompt injections in academic
+research papers (~20 real examples found of researchers slipping injections
+past peer review).
+2. **Parallel processing**: Windows can be screened in parallel, enabling
+throughput scaling.
+3. **Large input handling**: No need to truncate long documents; each window
+is independently screened within the model's context length.
+
+The PoC has directional (but buggy) Rust code for creating rolling windows
+that can be referenced when designing this feature. This connects to OQ-05
+because streaming/chunking affects how the firewall composes with other
+guardrail systems in a pipeline.
+
+Leave open for Phase 1 design, but the rolling window approach is the leading
+candidate for Phase 2.

 ---
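The rolling window idea in OQ-03 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PoC's Rust code or a settled API: `Window`, `rolling_windows`, `flag_windows`, and the `score_fn` callback are hypothetical names, and the window size, stride, and threshold are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass(frozen=True)
class Window:
    """Hypothetical half-open token span [start, end)."""
    start: int
    end: int


def rolling_windows(n_tokens: int, size: int, stride: int) -> Iterator[Window]:
    """Yield overlapping spans that together cover all n_tokens tokens."""
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("need size > 0 and 0 < stride <= size")
    start = 0
    while True:
        end = min(start + size, n_tokens)
        if end > start:
            yield Window(start, end)
        if end >= n_tokens:  # the last window reached the end of the input
            break
        start += stride


def flag_windows(
    tokens: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],  # assumed per-window scorer
    size: int,
    stride: int,
    threshold: float,
) -> List[Window]:
    """Return the windows whose score meets or exceeds the threshold."""
    return [
        w
        for w in rolling_windows(len(tokens), size, stride)
        if score_fn(tokens[w.start:w.end]) >= threshold
    ]
```

Because each window is scored independently, the list comprehension in `flag_windows` could be swapped for a parallel map, and the returned spans point at the specific sections of a document to red-flag.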

-### OQ-04: Should detection thresholds be per-model or globally configurable?
+### ~~OQ-04: Should detection thresholds be per-model or globally configurable?~~

 - **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
-- **Status**: open
+- **Status**: **resolved**
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: Both. Thresholds are **model-specific by default** (shipped
+with the codebook) but **globally overridable by the user**. Once calibrated,
+different models produce remarkably similar behavioral patterns (inspired
+by the "platonic representation hypothesis": different models converge on
+similar internal representations of the same data). The individual activation
+spaces differ, but the behavioral patterns they encode are consistent enough
+that thresholds transfer reasonably well. The codebook ships recommended
+thresholds calibrated for its model; users can adjust.
 - **Cross-references**: ADR-003, ADR-004

-Different detector models may produce different score distributions. Thresholds
-that work for SmolLM2-135M may not work for a different model. Should
-thresholds be tied to the codebook (per-model) or set globally by the user?
-
-Consider: Per-model defaults with user overrides? Codebook ships with
-recommended thresholds that the user can adjust?
-
 ---

 ## Theme: Integration
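The OQ-04 resolution (codebook-shipped defaults plus user overrides) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Codebook` class, the `recommended_thresholds` field, the `resolve_thresholds` helper, and the threshold names and values are hypothetical and not taken from the package.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Codebook:
    """Hypothetical stand-in for a calibrated codebook and its shipped defaults."""
    model_name: str
    recommended_thresholds: Mapping[str, float] = field(default_factory=dict)


def resolve_thresholds(
    codebook: Codebook,
    overrides: Optional[Mapping[str, float]] = None,
) -> Dict[str, float]:
    """Start from the codebook's defaults, then apply any user overrides."""
    resolved = dict(codebook.recommended_thresholds)
    for name, value in (overrides or {}).items():
        if name not in resolved:
            # Reject typos rather than silently adding an unused threshold.
            raise KeyError(f"unknown threshold: {name!r}")
        resolved[name] = float(value)
    return resolved


# Illustrative values only; real thresholds would come from calibration.
codebook = Codebook("SmolLM2-135M", {"injection": 0.8, "jailbreak": 0.7})
defaults = resolve_thresholds(codebook)
stricter = resolve_thresholds(codebook, {"injection": 0.9})
```

Copying the defaults before applying overrides keeps the shipped values intact, so a user can always fall back to the codebook's recommendation.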
@@ -86,15 +108,23 @@ recommended thresholds that the user can adjust?
 - **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
 - **Status**: open
 - **Priority**: medium
-- **Resolution**: (pending)
+- **Resolution**: (pending; needs deep dive into current guardrail landscape)
 - **Cross-references**: ADR-002

 The behavioral firewall is complementary to text-surface defenses. Users may
 want to run both Llama Guard (text classification) and alknet-firewall
-(behavioral signals) in series. How should these be composed?
+(behavioral signals) in series. However, our approach is fundamentally
+different: it requires access to the model and training on its specific
+behavioral signals. This means direct API-level integration with other systems
+may not be straightforward.

-Consider: Integration adapters? A common interface? Callback hooks? Or is
-composition the user's responsibility and we just provide a clean standalone API?
+A deep dive into the current state of guardrail integration patterns
+(LlamaFirewall's scanner interface, NeMo Guardrails' Colang DSL, etc.) is
+needed to determine whether we should build adapters, define a common
+interface, or simply provide a clean standalone API and let users compose
+systems themselves.
+
+Leave open; will research soon.

 ---

@@ -105,25 +135,10 @@ composition the user's responsibility and we just provide a clean standalone API
 - **Origin**: [configuration.md](configuration.md)
 - **Status**: open
 - **Priority**: low
-- **Resolution**: (pending)
+- **Resolution**: (pending; Phase 2 concern)
 - **Cross-references**: None

 Phase 1 uses constructor-based configuration only. A future phase may add
 file-based configuration for easier deployment. TOML is consistent with
 Python packaging (pyproject.toml) and increasingly the standard for Python
-config. YAML is more familiar in ops/ML contexts. Either works.
-
----
-
-### OQ-07: Is a Rust port feasible given current ML framework maturity?
-
-- **Origin**: [overview.md](overview.md), ADR-001
-- **Status**: open
-- **Priority**: low
-- **Resolution**: (pending)
-- **Cross-references**: ADR-001
-
-A Rust port using burn/cubecl was attempted during the PoC phase and failed.
-The ML framework ecosystem in Rust is not yet mature enough for this type
-of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
-matures or alternative Rust ML frameworks emerge.
+config. YAML is more familiar in ops/ML contexts. Either works.