@@ -9,7 +9,7 @@ Centralized tracker for unresolved questions across all architecture documents.
- **Origin**: [model.md](model.md), [overview.md](overview.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Resolution**: (pending; needs research into ONNX export path)
- **Cross-references**: ADR-006

ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
@@ -18,8 +18,9 @@ library provides drop-in replacement classes. However, supporting it in Phase 1
adds complexity: model must be exported to ONNX format, `optimum` integration
must be tested, and the activation extraction API may differ from PyTorch.

Consider: Is the smaller footprint worth the integration complexity in Phase 1,
or should ONNX support wait until Phase 2 when the core API is stable?
The likely path is: build with PyTorch first, then export to ONNX by default.
This needs research to confirm the activation extraction API compatibility and
ONNX export quality for SmolLM2-135M. Leave open for now.

---

@@ -30,7 +31,7 @@ or should ONNX support wait until Phase 2 when the core API is stable?
- **Origin**: [codebook.md](codebook.md)
- **Status**: open
- **Priority**: high
- **Resolution**: (pending)
- **Resolution**: (pending; dedicated research session needed)
- **Cross-references**: ADR-004

The PoC codebook is 1,245 lines; much of it may be boilerplate, dead code,
@@ -39,7 +40,8 @@ essential vs. exploratory is critical for the initial extraction. The codebook
training pipeline (`run_manifold_projection.py`) should also be analyzed.

Consider: How many SVD dimensions are actually needed? What's the minimum
calibration dataset? Can spline distributions be simplified?
calibration dataset? Can spline distributions be simplified? This needs a
dedicated session to analyze the PoC codebase.

---

@@ -49,34 +51,54 @@ calibration dataset? Can spline distributions be simplified?

- **Origin**: [firewall.md](firewall.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-003
- **Priority**: medium
- **Cross-references**: ADR-003, OQ-05

Some inputs arrive in chunks (streaming API responses, large documents). Should
the firewall support incremental screening as chunks arrive, or require the
full input before screening? Incremental screening could detect attacks earlier
but requires buffering and state management.

This is low priority for Phase 1 but affects the internal API design.
**Rolling window approach**: One promising direction is rolling windows of
tokens: chunking large text into overlapping windows and screening each
window independently. This enables:

1. **Granular detection**: For the instruction firewall use case (screening
   academic papers converted from PDF to markdown), rolling windows can
   red-flag specific *sections* of a document rather than the whole thing.
   This is directly useful for catching hidden prompt injections in academic
   research papers (~20 real examples found of researchers slipping injections
   past peer review).
2. **Parallel processing**: Windows can be screened in parallel, enabling
   throughput scaling.
3. **Large input handling**: No need to truncate long documents; each window
   is independently screened within the model's context length.
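The rolling window idea above can be sketched in a few lines of Python. This is a minimal illustration, not the PoC's Rust code: the names `rolling_windows` and `flagged_spans`, the `score_window` callable, and the default `size`/`stride` values are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def rolling_windows(
    tokens: Sequence[int], size: int, stride: int
) -> Iterator[Tuple[int, Sequence[int]]]:
    """Yield (start_offset, window) pairs that together cover `tokens`.

    Consecutive windows overlap by `size - stride` tokens. The final window
    is shifted back so that it ends exactly at the last token.
    """
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("require size > 0 and 0 < stride <= size")
    if not tokens:
        return
    if len(tokens) <= size:
        # Short input: a single window holds everything.
        yield 0, tokens
        return
    start = 0
    while start + size < len(tokens):
        yield start, tokens[start:start + size]
        start += stride
    # Final window, aligned to the end of the input.
    yield len(tokens) - size, tokens[len(tokens) - size:]


def flagged_spans(
    tokens: Sequence[int],
    score_window: Callable[[Sequence[int]], float],  # hypothetical detector
    threshold: float,
    size: int = 256,    # assumed default, not taken from the PoC
    stride: int = 128,  # assumed default, not taken from the PoC
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans whose window score exceeds `threshold`."""
    spans = []
    for start, window in rolling_windows(tokens, size, stride):
        if score_window(window) > threshold:
            spans.append((start, start + len(window)))
    return spans
```

Because each window is scored independently, the loop in `flagged_spans` could be swapped for a parallel map without changing the result, and the returned offsets are what would let a caller point at a specific section of a document.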

The PoC has directional (but buggy) Rust code for creating rolling windows
that can be referenced when designing this feature. This connects to OQ-05
because streaming/chunking affects how the firewall composes with other
guardrail systems in a pipeline.

Leave open for Phase 1 design, but the rolling window approach is the leading
candidate for Phase 2.

---

### OQ-04: Should detection thresholds be per-model or globally configurable?
### ~~OQ-04: Should detection thresholds be per-model or globally configurable?~~

- **Origin**: [configuration.md](configuration.md), [codebook.md](codebook.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: medium
- **Resolution**: (pending)
- **Resolution**: Both: thresholds are **model-specific by default** (shipped
  with the codebook) but **globally overridable by the user**. Once calibrated,
  different models produce remarkably similar behavioral patterns (inspired
  by the "platonic representation hypothesis", the idea that different models
  converge on similar internal representations of the same data). The
  individual activation spaces differ, but the behavioral patterns they encode
  are consistent enough that thresholds transfer reasonably well. The codebook
  ships recommended thresholds calibrated for its model; users can adjust.
- **Cross-references**: ADR-003, ADR-004

Different detector models may produce different score distributions. Thresholds
that work for SmolLM2-135M may not work for a different model. Should
thresholds be tied to the codebook (per-model) or set globally by the user?
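One way to picture per-model defaults combined with user overrides is the sketch below. It is illustrative only: the `Codebook` class, the `recommended_thresholds` field, `resolve_thresholds`, and the signal names in the usage note are hypothetical and do not describe the real codebook format.

```python
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Codebook:
    """Hypothetical stand-in for a shipped codebook and its calibrated defaults."""

    model_id: str
    recommended_thresholds: Mapping[str, float] = field(default_factory=dict)


def resolve_thresholds(
    codebook: Codebook, overrides: Optional[Mapping[str, float]] = None
) -> Dict[str, float]:
    """Start from the codebook's per-model defaults, then apply user overrides."""
    resolved = dict(codebook.recommended_thresholds)
    for name, value in (overrides or {}).items():
        if name not in resolved:
            # Reject unknown names so a typo cannot silently leave a default active.
            raise KeyError(f"unknown threshold: {name!r}")
        resolved[name] = float(value)
    return resolved
```

For example, with a codebook built as `Codebook("SmolLM2-135M", {"injection": 0.8, "drift": 0.6})`, calling `resolve_thresholds(codebook, {"drift": 0.5})` keeps the shipped `injection` value and replaces only `drift`.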

Consider: Per-model defaults with user overrides? Codebook ships with
recommended thresholds that the user can adjust?

---

## Theme: Integration
@@ -86,15 +108,23 @@ recommended thresholds that the user can adjust?
- **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: (pending)
- **Resolution**: (pending; needs deep dive into current guardrail landscape)
- **Cross-references**: ADR-002

The behavioral firewall is complementary to text-surface defenses. Users may
want to run both Llama Guard (text classification) and alknet-firewall
(behavioral signals) in series. How should these be composed?
(behavioral signals) in series. However, what we're doing is fundamentally
different: it requires having the model and having trained on its specific
behavioral signals. This means direct API-level integration with other systems
may not be straightforward.

Consider: Integration adapters? A common interface? Callback hooks? Or is
composition the user's responsibility and we just provide a clean standalone API?
A deep dive into the current state of guardrail integration patterns
(LlamaFirewall's scanner interface, NeMo Guardrails' Colang DSL, etc.) is
needed to determine whether we should build adapters, define a common
interface, or simply provide a clean standalone API and let users compose
systems themselves.
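To make the "common interface" option concrete, here is a minimal sketch of series composition. Everything in it is hypothetical: `Verdict`, `Screener`, and `screen_in_series` are illustration names, and nothing here reflects the actual APIs of Llama Guard, LlamaFirewall, NeMo Guardrails, or alknet-firewall.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Verdict:
    """Hypothetical common result type that every screener would return."""

    source: str    # which screener produced the verdict
    flagged: bool  # whether the input should be blocked
    score: float   # screener-specific confidence


# Hypothetical common interface: any callable from input text to a Verdict.
Screener = Callable[[str], Verdict]


def screen_in_series(text: str, screeners: Sequence[Screener]) -> List[Verdict]:
    """Run screeners in order and stop at the first one that flags the input."""
    verdicts: List[Verdict] = []
    for screener in screeners:
        verdict = screener(text)
        verdicts.append(verdict)
        if verdict.flagged:
            break  # later (possibly more expensive) screeners are skipped
    return verdicts
```

Under this assumption a text classifier and a behavioral detector would each be wrapped in a small adapter function. Whether such adapters belong in the library or in user code is exactly the open question.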

Leave open; will research soon.

---

@@ -105,25 +135,10 @@ composition the user's responsibility and we just provide a clean standalone API
- **Origin**: [configuration.md](configuration.md)
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Resolution**: (pending; Phase 2 concern)
- **Cross-references**: None

Phase 1 uses constructor-based configuration only. A future phase may add
file-based configuration for easier deployment. TOML is consistent with
Python packaging (pyproject.toml) and increasingly the standard for Python
config. YAML is more familiar in ops/ML contexts. Either works.

---

### OQ-07: Is a Rust port feasible given current ML framework maturity?

- **Origin**: [overview.md](overview.md), ADR-001
- **Status**: open
- **Priority**: low
- **Resolution**: (pending)
- **Cross-references**: ADR-001

A Rust port using burn/cubecl was attempted during the PoC phase and failed.
The ML framework ecosystem in Rust is not yet mature enough for this type
of work. This remains a speculative Phase 3 goal. Revisit when burn/cubecl
matures or alternative Rust ML frameworks emerge.