docs: resolve 4 open questions, add research, spec codebook package structure

Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.
2026-06-13 07:27:40 +00:00
parent 11620e8398
commit 7d8a39a88a
13 changed files with 2576 additions and 83 deletions


@@ -46,6 +46,7 @@ raises "behavioral alarms" without needing to know specific attack types.
| [008](decisions/008-three-level-alarm.md) | Three-Level Alarm System | Accepted |
| [009](decisions/009-last-token-extraction.md) | Last-Token Activation Extraction | Accepted |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic Spline Distributions | Accepted |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + Thin Adapter Integration | Accepted |
## Open Questions
@@ -53,12 +54,12 @@ See [open-questions.md](open-questions.md) for the full tracker.
| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should ONNX Runtime be a supported inference backend in Phase 1? | medium | open |
| OQ-02 | What is the minimum viable codebook — can the 1,245-line codebook be compressed? | high | open |
| OQ-03 | Should the firewall support streaming/chunked input screening? | medium | open |
| ~~OQ-01~~ | ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ | ~~medium~~ | **resolved** (removed from scope; burn/cublas is better future path) |
| ~~OQ-02~~ | ~~What is the minimum viable codebook — can the 1,245-line codebook be compressed?~~ | ~~high~~ | **resolved** (~65% compression to 500-600 lines) |
| OQ-03 | Should the firewall support streaming/chunked input screening? | medium | open (research complete, Phase 2) |
| ~~OQ-04~~ | ~~Should detection thresholds be per-model or globally configurable?~~ | ~~medium~~ | **resolved** (both: model-specific defaults, user-overridable) |
| OQ-05 | How should the firewall integrate with existing guardrail systems? | medium | open |
| OQ-06 | Should file-based configuration use TOML or YAML? | low | open |
| ~~OQ-05~~ | ~~How should the firewall integrate with existing guardrail systems?~~ | ~~medium~~ | **resolved** (ADR-011: standalone API + thin adapters) |
| ~~OQ-06~~ | ~~Should file-based configuration use TOML or YAML?~~ | ~~low~~ | **resolved** (TOML) |
## Document Lifecycle


@@ -151,6 +151,71 @@ model. The bundled codebook is specific to the default detector model
(SmolLM2-135M at the pinned revision). Users who switch to a different
detector model must provide a matching codebook via `codebook_path`.
## Package Structure
Based on analysis of the PoC codebook
([poc-architecture.md](../research/codebook-analysis/poc-architecture.md)),
the production codebook decomposes into:
```
src/alknet_firewall/
├── codebook/
│ ├── __init__.py # Public exports
│ ├── codebook.py # Codebook class (init, load, project, score)
│ ├── transforms.py # simplex, reverse_bary3d, bary_to_simplex
│ ├── splines.py # MonotonicCubicSpline, SplineDistribution
│ ├── profiles.py # DirectionProfile, population stats
│ ├── classifiers.py # DirectionClassifier (logistic weights)
│ ├── results.py # DetectionResult, DimensionSignal, AlarmLevel
│ ├── projection.py # project(), decompose()
│ └── detection.py # detect(), threshold comparison
├── training/
│ ├── __init__.py
│ ├── compiler.py # build() — SVD, spline fitting, profile comp
│ ├── stats.py # pooled_std, cohen_d, silhouette
│ └── data_loader.py # Condition catalog, prompt sets, data loading
└── data/
└── codebook/
├── basis.safetensors
├── regions.safetensors
├── splines.json
└── config.json
```
### Extraction from PoC
The PoC `firewall_codebook.py` is 1,245 lines with significant duplication
(the decomposition pipeline z → CDF → simplex → barycentric → (sum, u, v) is
repeated 5 times). Analysis identifies:
- **~480 lines of essential runtime code** in the PoC
- **~178 lines needed from metaspline core** (SplineDistribution,
MonotonicCubicSpline, ensure_strictly_increasing, simplex)
- **~130 lines of histogram classifier** — exploratory alternative, not MVP
(the continuous logistic classifier is superior)
- **~95 lines of AUC evaluation** — testing tool, not runtime
- **~429 lines in `build()`** — must be decomposed: training moves to
`training/compiler.py`, runtime state becomes immutable serialized data
Target: **~400-500 lines runtime + ~150-200 lines training = ~65% compression**
from the PoC's 1,245 lines.
### Key Extraction Decisions
1. **`build()` moves entirely to `training/compiler.py`** — Runtime codebook
is read-only. The codebook class should not have a `build()` method.
2. **`decompose()` becomes a pure function** — `decompose(z, splines)` is a
pure mathematical transform. No state dependencies beyond splines.
3. **Detection is separate from the codebook class**: `detect()` is a
stateless function given codebook data. Enables swapping detection
strategies without touching the codebook.
4. **Only 4 metaspline core components (~178 of 502 lines) are needed at runtime**:
`SplineDistribution`, `MonotonicCubicSpline`, `ensure_strictly_increasing`,
and `simplex()`. Everything else (DensitySpline, unfold/fold, dcs_norm) is
dropped entirely.
5. **Saved `.pt` files from the PoC provide golden test data** — manifold
projection results for Qwen3-0.6B/1.7B can be reused for integration tests.
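
Decisions 1 and 3 could look roughly like the sketch below. It is illustrative only and not taken from the PoC: the `Thresholds` fields, the single scalar `score`, and the comparison rule are assumptions. Only the `detect()` name and the CLEAR/SUSPICIOUS/DANGEROUS levels (ADR-008) come from these docs.

```python
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    """Three-level alarm from ADR-008."""
    CLEAR = 0
    SUSPICIOUS = 1
    DANGEROUS = 2


@dataclass(frozen=True)  # frozen: runtime codebook data is read-only
class Thresholds:
    # Hypothetical field names; the real codebook schema may differ.
    suspicious: float
    dangerous: float


def detect(score: float, thresholds: Thresholds) -> AlarmLevel:
    """Stateless threshold comparison: the result depends only on the arguments."""
    if score >= thresholds.dangerous:
        return AlarmLevel.DANGEROUS
    if score >= thresholds.suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


limits = Thresholds(suspicious=0.5, dangerous=0.9)
levels = [detect(s, limits) for s in (0.1, 0.6, 0.95)]
```

Because `detect()` holds no state, a different strategy can be substituted by passing another function with the same signature, without touching the codebook class.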
## Data Format
The codebook is stored as:
@@ -243,6 +308,5 @@ class Codebook:
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-02**: What is the minimum viable codebook — can the 1,245-line PoC
codebook be compressed? (open)
- **OQ-04**: Should detection thresholds be per-model or globally configurable? (open)
- **OQ-02**: ~~What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?~~ (resolved — ~65% compression to 500-600 lines; see Package Structure section)
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved — both: model-specific defaults, user-overridable)


@@ -93,7 +93,8 @@ alarm = firewall.screen("Hello, how are you?")
```
No configuration file is required. All parameters can be passed via the
constructor. A future phase may add file-based configuration (TOML or YAML).
constructor. A future phase may add file-based configuration (TOML, consistent
with Python packaging conventions and `pyproject.toml`).
## Design Decisions
@@ -108,4 +109,5 @@ constructor. A future phase may add file-based configuration (TOML or YAML).
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- ~~**OQ-04**~~: ~~Should detection thresholds be per-model or globally configurable?~~ (resolved — both: model-specific defaults shipped with codebook, user-overridable)
- ~~**OQ-06**~~: ~~Should file-based configuration use TOML or YAML?~~ (resolved — TOML, consistent with modern Python packaging)


@@ -6,17 +6,16 @@ Accepted
## Context
PyTorch is the primary inference backend for the detector model. However,
PyTorch is large:
PyTorch is the inference backend for the detector model. However, PyTorch is
large:
- `torch` (CPU): ~200MB download, ~700MB installed
- `torch` (CUDA): ~2.5GB download, ~5GB+ installed
- `onnxruntime`: ~30-50MB download, ~300MB installed
Making PyTorch a required dependency would force a 200MB-2.5GB download on
every user, even those who already have PyTorch installed or prefer ONNX
Runtime. This is the standard problem for ML libraries, and the HuggingFace
ecosystem has converged on a solution.
every user, even those who already have PyTorch installed. This is the
standard problem for ML libraries, and the HuggingFace ecosystem has
converged on a solution.
## Decision
@@ -43,7 +42,6 @@ except ImportError:
**Positive**:
- Base install is ~30MB download, ~100MB installed — very lightweight
- Users with existing PyTorch installations don't re-download
- ONNX Runtime alternative available for minimal footprint (~100MB total)
- Follows HuggingFace ecosystem conventions (transformers, safetensors, HF
hub all use this pattern)
- uv supports CPU/GPU torch variant selection via `[tool.uv.sources]` and
@@ -55,6 +53,8 @@ except ImportError:
- Runtime import errors if users forget to install a backend
- CPU-only torch requires two-step install or uv configuration (can't be
expressed in pip extras alone)
- PyTorch is the only supported inference backend; future alternatives
(burn/cublas via safetensors) would require separate integration work
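
One way to soften the "forgot to install a backend" failure is to check for the backend up front and raise an error that names the fix. This is a minimal sketch, not the library's actual guard: `require_backend` is a hypothetical helper, and the install hint assumes the package is published as `alknet-firewall` with the `[torch]` extra listed in the dependency table.

```python
import importlib.util


def require_backend(module_name: str = "torch", extra: str = "torch") -> None:
    """Raise ImportError with an install hint if a top-level module is missing."""
    # find_spec returns None for a top-level module that is not installed,
    # without importing the (potentially very large) module itself.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        )


try:
    require_backend("surely_not_an_installed_module_xyz")
    message = ""
except ImportError as exc:
    message = str(exc)
```

Calling such a check at construction time rather than at first inference would surface the problem before any input is screened.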
## References


@@ -0,0 +1,75 @@
# ADR-011: Standalone API with Thin Adapter Integration Strategy
## Status
Accepted
## Context
alknet-firewall provides behavioral signal detection — fundamentally different
from text-surface defenses like Llama Guard, NeMo Guardrails, or Guardrails AI.
It requires running a small detector model and extracting hidden state
activations, not classifying input text. Users may want to run both text-surface
defenses and behavioral detection in series.
Research into existing guardrail systems ([patterns-analysis.md](../../research/guardrail-integration-patterns/patterns-analysis.md))
identified three viable integration targets with high compatibility:
- **LlamaFirewall**: `BaseScanner.scan()` → `ScanResult` maps directly to
`Firewall.screen()` → `Alarm`
- **OpenAI Agents SDK**: `@input_guardrail` decorator pattern with blocking
execution
- **NeMo Guardrails**: Custom Python action in input rails (Colang DSL can't
express behavioral detection natively)
Two systems have low compatibility: Guardrails AI (expects text-surface
validators with content fixes, not alarms) and Amazon Bedrock Guardrails
(closed service, no extension mechanism).
## Decision
**Phase 1**: Ship a standalone API only. No adapters, no common interface.
```python
# The core API — simple, composable, no framework dependencies
firewall = Firewall()
alarm = firewall.screen("untrusted input text")
```
**Phase 2**: Build thin adapter packages as optional dependencies. Each adapter
is <100 lines and has no impact on the core library:
- `alknet-firewall-llamafirewall`: Custom `BaseScanner` subclass
- `alknet-firewall-agents`: `@input_guardrail` wrapper
- `alknet-firewall-nemo`: Custom NeMo input rail action
Do NOT build a common `ScreeningProvider` interface. The integration patterns
differ enough between systems that a shared abstraction would be premature and
constraining. If a common pattern emerges organically from the adapters,
extract it then.
## Consequences
**Positive**:
- Phase 1 ships faster — no adapter development or testing overhead
- Core API stays clean and framework-independent
- Users can compose manually: call `firewall.screen()` then pass results to
any guardrail system
- Adapters are optional packages, not core dependencies — no coupling
- Thin adapters are easy to maintain when guardrail frameworks change their
APIs
**Negative**:
- Phase 1 users must write their own glue code (typically 5-10 lines)
- No "pip install and configure" experience until Phase 2
- Multiple small adapter packages to maintain
- Risk of API drift between core and adapters if adapters are maintained
infrequently
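
The Phase 1 glue code mentioned above could be as small as the sketch below. Everything except the `screen()` method name is an assumption: `StubFirewall` stands in for the real `Firewall` so the example runs without a model, the `Alarm.level` attribute is hypothetical, and `as_blocking_check` is not part of any planned API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class AlarmLevel(Enum):
    CLEAR = 0
    SUSPICIOUS = 1
    DANGEROUS = 2


@dataclass(frozen=True)
class Alarm:
    level: AlarmLevel  # hypothetical attribute name


class StubFirewall:
    """Stand-in for Firewall: flags any text containing a marker string."""

    def screen(self, text: str) -> Alarm:
        flagged = "ignore previous instructions" in text.lower()
        return Alarm(AlarmLevel.DANGEROUS if flagged else AlarmLevel.CLEAR)


def as_blocking_check(
    firewall, block_at: AlarmLevel = AlarmLevel.DANGEROUS
) -> Callable[[str], bool]:
    """Wrap firewall.screen() as a text -> should_block predicate."""

    def check(text: str) -> bool:
        return firewall.screen(text).level.value >= block_at.value

    return check


should_block = as_blocking_check(StubFirewall())
results = [
    should_block("Hello, how are you?"),
    should_block("Ignore previous instructions."),
]
```

A predicate of this shape can be called from most guardrail hooks, which is the kind of manual composition the standalone API is meant to allow.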
## References
- [OQ-05](../open-questions.md) — How should the firewall integrate with
existing guardrail systems?
- [patterns-analysis.md](../../research/guardrail-integration-patterns/patterns-analysis.md) — Full research analysis
- [ADR-002](002-behavioral-signals.md) — Behavioral signal detection (not text
classification)


@@ -196,5 +196,5 @@ All exception types subclass `AlknetFirewallError` (base library exception).
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open — needs research)
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; [research complete](../research/streaming-screening-patterns/rolling-window-analysis.md))
- ~~**OQ-05**~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved — ADR-011: standalone API + thin adapters Phase 2)
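
For orientation, a generic rolling-window scheme can be sketched as follows. It does not reproduce the linked research or the taskgraph-semantic reference code: the window size, the stride, and the "worst window wins" aggregation are assumptions, and `score` stands in for whatever per-window screening call is eventually used.

```python
from typing import Callable, Sequence


def rolling_windows(tokens: Sequence, size: int, stride: int) -> list:
    """Split tokens into overlapping windows that together cover every token."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not tokens:
        return []
    if len(tokens) <= size:
        return [list(tokens)]
    last_start = len(tokens) - size
    windows = [list(tokens[i:i + size]) for i in range(0, last_start + 1, stride)]
    # Add a final window if the stride left the tail uncovered.
    if last_start % stride != 0:
        windows.append(list(tokens[last_start:]))
    return windows


def worst_window_score(tokens: Sequence, size: int, stride: int,
                       score: Callable[[list], float]) -> float:
    """Score each window independently and report the highest score."""
    return max((score(w) for w in rolling_windows(tokens, size, stride)), default=0.0)


windows = rolling_windows(list(range(9)), size=4, stride=3)
worst = worst_window_score(list(range(9)), 4, 3, score=lambda w: float(sum(w)))
```

Taking the maximum means a single anomalous window flags the whole document, at the cost of one screening call per window.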


@@ -72,8 +72,7 @@ class DetectorModel(Protocol):
```
The `infer` method returns hidden states at key layers, abstracting away
whether the backend is PyTorch, ONNX Runtime, or a future Rust inference
engine.
whether the backend is PyTorch or a future alternative inference engine.
### Lazy Loading
@@ -158,4 +157,4 @@ class HFDetectorModel:
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
- **OQ-01**: ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ (resolved — removed from scope; burn/cublas is a better future path)


@@ -4,45 +4,40 @@ Centralized tracker for unresolved questions across all architecture documents.
## Theme: Inference Backend
### OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?
### ~~OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1?~~
- **Origin**: [model.md](model.md), [overview.md](overview.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: medium
- **Resolution**: (pending — needs research into ONNX export path)
- **Resolution**: Removed from scope entirely. ONNX Runtime does not support
`output_hidden_states=True` natively (HuggingFace optimum issue #972 was
closed as "not planned"), making activation extraction — the core operation —
impractical without a custom ONNX graph modification pipeline. The ONNX
model format also produces bloated exports. A future alternative inference
path using burn/cublas with safetensors is more promising since it supports
all platforms and uses the same model format we already require.
- **Cross-references**: ADR-006
ONNX Runtime provides a much smaller install footprint (~30-50MB vs 200MB-2.5GB
for PyTorch) and is well-suited for inference-only use. HuggingFace's `optimum`
library provides drop-in replacement classes. However, supporting it in Phase 1
adds complexity: model must be exported to ONNX format, `optimum` integration
must be tested, and the activation extraction API may differ from PyTorch.
The likely path is: build with PyTorch first, then export to ONNX by default.
This needs research to confirm the activation extraction API compatibility and
ONNX export quality for SmolLM2-135M. Leave open for now.
---
## Theme: Codebook Design
### OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?
### ~~OQ-02: What is the minimum viable codebook — can the 1,245-line PoC codebook be compressed?~~
- **Origin**: [codebook.md](codebook.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: high
- **Resolution**: (pending — dedicated research session needed)
- **Resolution**: Yes — ~65% compression to 500-600 lines total (400-500 runtime
+ 150-200 training). The PoC contains ~480 lines of essential runtime code
plus ~178 lines needed from metaspline core. The 5x-repeated decomposition
pipeline collapses into a single `decompose()` function (~50 lines saved).
The histogram classifier (~130 lines) is exploratory and not MVP. The
`build()` method (429 lines) is decomposed: training logic moves to
`training/compiler.py`, runtime state becomes immutable serialized data.
See [poc-architecture.md](../research/codebook-analysis/poc-architecture.md)
and the Package Structure section in [codebook.md](codebook.md).
- **Cross-references**: ADR-004
The PoC codebook is 1,245 lines — much of it may be boilerplate, dead code,
or excessive parameterization from the research phase. Understanding what's
essential vs. exploratory is critical for the initial extraction. The codebook
training pipeline (`run_manifold_projection.py`) should also be analyzed.
Consider: How many SVD dimensions are actually needed? What's the minimum
calibration dataset? Can spline distributions be simplified? This needs a
dedicated session to analyze the PoC codebase.
---
## Theme: API Design
@@ -103,42 +98,30 @@ candidate for Phase 2.
## Theme: Integration
### OQ-05: How should the firewall integrate with existing guardrail systems?
### ~~OQ-05: How should the firewall integrate with existing guardrail systems?~~
- **Origin**: [firewall.md](firewall.md), [overview.md](overview.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: medium
- **Resolution**: (pending — needs deep dive into current guardrail landscape)
- **Cross-references**: ADR-002
The behavioral firewall is complementary to text-surface defenses. Users may
want to run both Llama Guard (text classification) and alknet-firewall
(behavioral signals) in series. However, what we're doing is fundamentally
different — it requires having the model and having trained on its specific
behavioral signals. This means direct API-level integration with other systems
may not be straightforward.
A deep dive into the current state of guardrail integration patterns
(LlamaFirewall's scanner interface, NeMo Guardrails' Colang DSL, etc.) is
needed to determine whether we should build adapters, define a common
interface, or simply provide a clean standalone API and let users compose
systems themselves.
Leave open — will research soon.
- **Resolution**: Standalone API + thin adapter pattern (ADR-011). Phase 1:
ship the standalone `Firewall.screen(text) → Alarm` API only. Phase 2:
build thin adapter packages (<100 lines each) for LlamaFirewall,
OpenAI Agents SDK, and NeMo Guardrails as optional dependencies. Do NOT
build a common `ScreeningProvider` interface — behavioral detection is
fundamentally different from text-surface defenses and premature abstraction
would be constraining.
- **Cross-references**: ADR-002, ADR-011
---
## Theme: Project Setup
### OQ-06: Should file-based configuration use TOML or YAML?
### ~~OQ-06: Should file-based configuration use TOML or YAML?~~
- **Origin**: [configuration.md](configuration.md)
- **Status**: open
- **Status**: **resolved**
- **Priority**: low
- **Resolution**: (pending — Phase 2 concern)
- **Cross-references**: None
Phase 1 uses constructor-based configuration only. A future phase may add
file-based configuration for easier deployment. TOML is consistent with
Python packaging (pyproject.toml) and increasingly the standard for Python
config. YAML is more familiar in ops/ML contexts. Either works.
- **Resolution**: TOML. Consistent with modern Python packaging conventions
(`pyproject.toml`) and increasingly the standard for Python configuration.
This is a two-way door decision — reverting to YAML later is straightforward.
- **Cross-references**: None


@@ -56,17 +56,16 @@ for the full threat analysis and academic evidence.
- Interpretable detection signals (SVD direction analysis)
- **Phase 2**: Integration and operational hardening
- ONNX Runtime inference backend
- Async/batch screening API
- Integration adapters for LlamaFirewall, NeMo Guardrails
- Integration adapters for LlamaFirewall, NeMo Guardrails, OpenAI Agents SDK
- Metrics and observability
- Codebook training pipeline (`run_manifold_projection.py` extraction)
- Streaming/rolling-window input screening (granular detection for documents)
- **Phase 3**: Advanced capabilities
- Multi-turn attack detection (payload splitting)
- Streaming/rolling-window input screening (granular detection for documents)
- Custom model fine-tuning for domain-specific detection
- ONNX Runtime inference backend (export from PyTorch)
- Alternative inference backends (burn/cublas via safetensors)
### Out of Scope
@@ -138,8 +137,6 @@ for the full threat analysis and academic evidence.
|---------|-------|---------|---------|-------|
| `torch` | `[torch]` | >=2.2 | Model inference | 200MB-2.5GB; optional dependency |
| `transformers` | `[torch]` | >=4.40 | Model loading pipeline | Required with torch extra |
| `onnxruntime` | `[onnx]` | >=1.17 | Alternative inference | ~30-50MB; Phase 2 |
| `optimum` | `[onnx]` | latest | ONNX Runtime integration | Phase 2 |
### Development (Not Published)
@@ -187,6 +184,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [008](decisions/008-three-level-alarm.md) | Three-level alarm system | CLEAR/SUSPICIOUS/DANGEROUS balances simplicity with nuance |
| [009](decisions/009-last-token-extraction.md) | Last-token activation extraction | Standard for autoregressive models; full sequence context |
| [010](decisions/010-monotonic-spline-distributions.md) | Monotonic spline distributions | Compact, smooth, tail-sensitive behavioral region modeling |
| [011](decisions/011-guardrail-integration-strategy.md) | Standalone API + thin adapters | Phase 1 standalone, Phase 2 thin adapter packages |
## Dependencies on Other Projects
@@ -204,5 +202,5 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)
- **OQ-01**: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved — removed from scope; ONNX doesn't support activation extraction natively, and burn/cublas is a better future path)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (resolved — ADR-011: standalone API + thin adapters in Phase 2)