Research: ONNX Runtime as Inference Backend for alknet-firewall
Date: 2026-06-13
Question: Should ONNX Runtime be a supported inference backend in Phase 1?
Status: Open question OQ-01
Executive Summary
ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2. The core challenge is that ONNX Runtime's standard inference pipeline does not natively expose intermediate layer hidden states — the critical data alknet-firewall needs for activation-based detection. While there is a workable path (custom ONNX graph modification to add intermediate outputs), it requires significant additional engineering, testing, and maintenance compared to the PyTorch path where output_hidden_states=True is a single flag. The install-size advantage is real (~180MB vs ~700MB for CPU-only torch), but not decisive for Phase 1 when the activation extraction problem is unsolved in the ONNX ecosystem.
1. ONNX Runtime Overview
What It Is
ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference — no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.
Install Footprint
| Package | Wheel Size | Installed Size | Notes |
|---|---|---|---|
| onnxruntime (CPU) | ~18 MB | ~180-200 MB | Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings |
| torch (CPU-only) | ~200 MB | ~700 MB | libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB) |
| torch (CUDA) | ~2.5 GB | ~5+ GB | Default pip install torch since PyTorch 2.11 |
| optimum[onnxruntime] | ~5 MB | ~20 MB | Python wrapper; depends on onnxruntime + transformers |
Sources: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB. PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).
Revised claim: ADR-006 states "onnxruntime: ~30-50MB download, ~300MB installed". Both figures are too high: the measured wheel is ~18 MB and the installed size is closer to 180-200 MB (not 300 MB). The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.
Performance Characteristics
- CPU inference: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
- Warm start: ORT session creation has overhead (~100ms-1s depending on model), but inference calls are fast
- Memory: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
- Thread scaling: Good multi-threaded CPU performance via OpenMP/MLAS
CPU Deployment Story
ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:
- No CUDA/GPU dependency
- Cross-platform (Linux, macOS, Windows, ARM)
- Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
- Well-suited for containerized and embedded deployments
2. HuggingFace Optimum Integration
How Optimum Works
HuggingFace's optimum-onnx (formerly optimum[onnxruntime]) provides drop-in replacement classes for HuggingFace transformers models:
# PyTorch path
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)

# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)
Export Process
The ONNX export can be done via:
- CLI: optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/
- Programmatic: ORTModelForCausalLM.from_pretrained("...", export=True)
- Pre-exported: Use existing ONNX models from onnx-community/ on HuggingFace Hub
For causal LMs, the export produces:
- A decoder model (with or without past key values)
- Optionally a merged decoder combining initial pass and cached pass into one model
Model Compatibility
SmolLM2-135M uses the LLaMA architecture. The optimum ONNX export supports LLaMA-family models:
| Architecture | Export Support | ORTModelForCausalLM Support |
|---|---|---|
| llama (SmolLM2) | ✓ Supported | ✓ Supported |
| gpt2 | ✓ Supported | ✓ Supported |
| bloom | ✓ Supported | ✓ Supported |
| mistral | ✓ Supported | ✓ Supported |
Pre-exported model available: onnx-community/SmolLM2-135M-ONNX exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.
3. Activation Extraction Feasibility ⚠️ CRITICAL
This is the make-or-break question for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}
The Problem
ORTModelForCausalLM does NOT support output_hidden_states. This is confirmed by:
- GitHub Issue #972 on huggingface/optimum: "Add output of output_hidden_states for onnx model export", filed April 2023 and closed as "not planned". The request was to add hidden state outputs to the ONNX export for ORTModelForCausalLM, noting that the merged decoder only outputs logits + past key/values.
- ORTModelForCausalLM.forward() documentation: The forward() method signature includes input_ids, attention_mask, past_key_values, position_ids, use_cache, and **kwargs, but no output_hidden_states parameter. The return type is logits + past key values only.
- ONNX graph structure: Standard ONNX exports of causal LMs define outputs as logits and past_key_values. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.
Why This Is Hard
ONNX is a static graph format. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set output_hidden_states=True at runtime, ONNX requires the graph to explicitly include those output connections.
The sklearn-onnx documentation explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
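The constraint described above can be illustrated with a toy model of a static graph. This is a minimal sketch and not the ONNX Runtime API: the `StaticGraph` class, its node names, and the arithmetic in each node are all hypothetical stand-ins. It only shows that a runtime which honors declared outputs cannot return an intermediate value until the graph itself is edited to declare it.

```python
from typing import Callable


class StaticGraph:
    """Toy stand-in for a static inference graph (not the ONNX Runtime API)."""

    def __init__(self, nodes: list[tuple[str, Callable[[float], float]]], outputs: list[str]):
        self.nodes = nodes            # ordered (value_name, fn) pairs, fixed at "export" time
        self.outputs = list(outputs)  # the only names the runtime may return

    def run(self, requested: list[str], x: float) -> list[float]:
        undeclared = [name for name in requested if name not in self.outputs]
        if undeclared:
            raise KeyError(f"not declared as graph outputs: {undeclared}")
        values: dict[str, float] = {}
        for name, fn in self.nodes:
            x = fn(x)          # every node is computed...
            values[name] = x   # ...but only declared outputs are handed back
        return [values[name] for name in requested]


graph = StaticGraph(
    nodes=[
        ("layer_1", lambda v: v + 1.0),
        ("layer_2", lambda v: v * 2.0),
        ("logits", lambda v: v - 3.0),
    ],
    outputs=["logits"],
)

final = graph.run(["logits"], 1.0)   # works: logits is a declared output
graph.outputs.append("layer_1")       # "graph surgery": declare an intermediate value
with_hidden = graph.run(["logits", "layer_1"], 1.0)
```

Requesting `layer_2` from this graph would still raise, because it was never declared. The paths below are variations on doing that declaration step against a real exported model.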
Workable Paths (All Require Extra Engineering)
Path A: Custom ONNX Export with Hidden State Outputs
Approach: Modify the ONNX export configuration to include intermediate layer hidden states as graph outputs.
import onnx

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Find the intermediate layer output names in the graph
# For LLaMA/SmolLM2, each transformer layer outputs hidden states
# Names follow patterns like: "/model/layers.0/output_0"

# Add intermediate outputs to the graph
for layer_idx in [1, 2, 4, 8]:
    # Find the node output for each layer
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    model.graph.output.append(
        onnx.helper.make_tensor_value_info(
            intermediate_name,
            onnx.TensorProto.FLOAT,
            ["batch", "seq_len", "hidden_dim"]
        )
    )

onnx.save(model, "model_with_hidden_states.onnx")
Then use onnxruntime.InferenceSession directly (not through ORTModelForCausalLM) to request these outputs:
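import onnxruntime

session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")
# Request logits plus the intermediate outputs added to the graph above
hidden_names = [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
outputs = session.run(
    ["logits"] + hidden_names,
    {"input_ids": input_ids, "attention_mask": attention_mask}
)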
Pros: Works with standard ONNX Runtime; no PyTorch dependency at inference time.
Cons:
- Requires careful ONNX graph manipulation (naming conventions vary by export version)
- Must validate that intermediate node names are stable across export runs
- Must handle the merged decoder model correctly (past key values branch)
- Loss of ORTModelForCausalLM convenience (manual session management, no generate(), no caching)
- Must discover intermediate node names via onnx library inspection
- Graph modifications may invalidate ONNX Runtime optimizations
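Discovering the per-layer value names, listed above as a cost of Path A, might be handled by a small lookup that fails loudly when a name is missing. The sketch below is stdlib-only and works on a plain list of names; the regular expression assumes the "/model/layers.N/output_0" pattern used in the example above, which real exports may not follow, and `find_layer_outputs` is a hypothetical helper rather than part of onnx or optimum.

```python
import re

# Assumed naming pattern, copied from the example above; real exports may differ.
LAYER_OUTPUT = re.compile(r"^/model/layers\.(\d+)/output_0$")


def find_layer_outputs(value_names: list[str], layers: list[int]) -> dict[int, str]:
    """Map each requested layer index to its graph value name, or raise if any is missing."""
    found: dict[int, str] = {}
    for name in value_names:
        match = LAYER_OUTPUT.match(name)
        if match and int(match.group(1)) in layers:
            found[int(match.group(1))] = name
    missing = sorted(set(layers) - set(found))
    if missing:
        raise LookupError(f"no output found for layers {missing}")
    return found


# With the onnx package, value_names would be collected from a loaded model as:
#   [out for node in model.graph.node for out in node.output]
names = ["/model/embed_tokens/Gather_output_0"] + [
    f"/model/layers.{i}/output_0" for i in range(10)
]
selected = find_layer_outputs(names, [1, 2, 4, 8])
```

Raising on a missing layer, instead of silently returning a partial mapping, would surface a renamed node at export time, not as a wrong activation at inference time.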
Path B: Separate Encoder-Style ONNX Export
Approach: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers.
This would require writing a custom torch.onnx.export call that traces the model with output_hidden_states=True and captures the intermediate outputs.
Pros: Clean separation of concerns; each sub-model can be optimized independently.
Cons:
- Requires PyTorch for the initial export (but not at runtime)
- Significant custom code to manage multiple ONNX sub-models
- Past key value caching becomes much more complex with sub-models
- Not supported by optimum CLI or ORTModel classes
Path C: Direct ONNX Runtime with Modified Graph (Recommended Path)
Approach: Combine a custom ONNX export with direct onnxruntime.InferenceSession usage, bypassing ORTModelForCausalLM entirely.
import onnxruntime as ort
from transformers import AutoTokenizer

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
inputs = tokenizer("Hello world", return_tensors="np")

output_names = [o.name for o in session.get_outputs()]
# Includes: logits, past_key_values, hidden_state_1, hidden_state_2, ...
results = session.run(output_names, dict(inputs))

# Keep the last-token activation for each monitored layer
hidden_states = {
    layer: results[output_names.index(f"hidden_state_{layer}")][:, -1, :]
    for layer in [1, 2, 4, 8]
}
Pros: Full control; no PyTorch at runtime; smallest possible footprint.
Cons:
- Must write and maintain custom ONNX graph modification code
- Must re-export whenever the model architecture changes
- Must validate numerical equivalence against PyTorch reference
- Bypasses the ORTModelForCausalLM abstraction entirely
- Past key value handling must be manual (no generate() support)
- This is essentially a custom inference backend, not a drop-in replacement
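Validating numerical equivalence against the PyTorch reference, listed above as a cost of Path C, could look roughly like the check below. It is a sketch under stated assumptions: activations are represented as plain lists keyed by layer index to keep it stdlib-only, the function names are hypothetical, and the tolerances (`atol=1e-4`, `min_cosine=0.9999`) are placeholders that would need to be chosen empirically.

```python
import math


def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def activations_match(
    reference: dict[int, list[float]],
    candidate: dict[int, list[float]],
    atol: float = 1e-4,         # assumed tolerance, not a project requirement
    min_cosine: float = 0.9999,  # assumed threshold, not a project requirement
) -> bool:
    """True if every layer's candidate activation is close to the reference."""
    if reference.keys() != candidate.keys():
        return False
    for layer, ref_vec in reference.items():
        cand_vec = candidate[layer]
        if len(ref_vec) != len(cand_vec):
            return False
        if max_abs_diff(ref_vec, cand_vec) > atol:
            return False
        if cosine_similarity(ref_vec, cand_vec) < min_cosine:
            return False
    return True


pytorch_ref = {1: [0.5, -1.0, 2.0], 2: [0.1, 0.2, 0.3]}
onnx_close = {1: [0.50001, -1.00001, 2.00002], 2: [0.1, 0.2, 0.3]}
onnx_drifted = {1: [0.5, -1.0, 2.5], 2: [0.1, 0.2, 0.3]}

close_ok = activations_match(pytorch_ref, onnx_close)
drift_ok = activations_match(pytorch_ref, onnx_drifted)
```

Checking both an absolute tolerance and a cosine threshold is a design choice for this sketch: the first catches per-element drift, the second catches a change in direction that a loose tolerance might miss.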
Comparison with PyTorch
| Aspect | PyTorch | ONNX Runtime (Standard) | ONNX Runtime (Custom) |
|---|---|---|---|
| output_hidden_states=True | ✅ Native, one flag | ❌ Not supported | ⚠️ Requires graph modification |
| Activation extraction API | outputs.hidden_states[layer][:, -1, :] | N/A | Manual session.run() with named outputs |
| Effort to implement | Minimal (built-in) | N/A | High (custom export + graph hacking) |
| Numerical accuracy | Ground truth | Must validate | Must validate against PyTorch |
| Maintenance burden | Low | N/A | High (graph names change, ONNX spec evolves) |
4. SmolLM2-135M ONNX Export
Known Status
- Pre-exported model exists: onnx-community/SmolLM2-135M-ONNX on HuggingFace Hub
- Architecture: LLaMA family, which is well-supported by optimum ONNX export
- Export method: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
- Model card: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction
Export Configuration
The LLaMA architecture maps to optimum's LlamaOnnxConfig (SmolLM2 uses the LLaMA architecture). The standard export produces:
- decoder_model.onnx: for the initial forward pass (no past key values)
- decoder_with_past_model.onnx: for subsequent generation steps (with past key values)
- Or decoder_model_merged.onnx: a combined model with conditional branching
Known Issues
- Hidden states not in standard export: The default optimum export for causal LMs does not include intermediate hidden states as outputs. This is by design: the export configuration only specifies logits and past key values as outputs.
- Merged decoder complexity: The merged decoder model uses a use_cache_branch flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.
- Node naming stability: Internal ONNX node names (e.g., /model/layers.0/output_0) may change between optimum versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.
5. Comparison Table
| Criteria | PyTorch (CPU-only) | ONNX Runtime (Standard) | ONNX Runtime (Custom Graph) |
|---|---|---|---|
| Install size (download) | ~200 MB | ~18 MB | ~18 MB |
| Install size (disk) | ~700 MB | ~180-200 MB | ~180-200 MB |
| output_hidden_states=True | ✅ Built-in | ❌ Not supported | ⚠️ Custom graph modification |
| Activation extraction API | model(**inputs, output_hidden_states=True) | N/A | Manual session.run() with named outputs |
| Drop-in with optimum | ✅ AutoModelForCausalLM | ⚠️ ORTModelForCausalLM but no hidden states | ❌ Must bypass ORTModel classes |
| Past key value caching | ✅ Automatic | ✅ Automatic via ORTModel | ❌ Must handle manually |
| Numerical equivalence | Ground truth | Must validate | Must validate |
| Implementation effort | Low (built-in) | N/A (doesn't work) | High (custom export + graph mod) |
| Maintenance burden | Low | N/A | High (brittle node names) |
| Runtime performance | Good | Better (graph-optimized) | Better (graph-optimized) |
| CPU deployment | ✅ Supported | ✅ Excellent | ✅ Excellent |
| safetensors loading | ✅ Via transformers | ✅ Via optimum | ❌ Requires separate model loading |
| Model pinning (revision) | ✅ Via transformers | ✅ Via optimum | ⚠️ Custom handling |
| Offline/air-gapped | ✅ HF Hub cache | ✅ HF Hub cache | ⚠️ Custom export files |
| License | BSD-3 | MIT | MIT |
6. Recommendation
Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.
Rationale
- The activation extraction problem is unsolved for ORTModelForCausalLM. Issue #972 requesting output_hidden_states support was closed as "not planned" by the optimum team. This means the standard, supported path does not work for alknet-firewall's core requirement.
- Custom ONNX graph modification is a significant engineering effort with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across optimum version updates.
- The install-size advantage is real but not decisive. While onnxruntime (~180 MB installed) is significantly smaller than torch CPU-only (~700 MB installed), the difference is manageable:
  - The model weights (269 MB for SmolLM2-135M) are a fixed cost on both paths and are larger than the onnxruntime install itself
  - The total installed size for PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
  - The total installed size for ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
  - Savings: ~500 MB, which is meaningful but not transformative
- PyTorch is already optional. ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.
- The DetectorModel protocol already accommodates multiple backends. The architecture is designed for this:

      class DetectorModel(Protocol):
          def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

  Adding an ONNXDetectorModel implementation in Phase 2 is a clean extension.
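The footprint totals in the rationale above can be reproduced with a few lines of arithmetic. The sketch only restates the installed-size estimates quoted in this document (using the lower bound of 180 MB for onnxruntime); the dictionary and variable names are illustrative.

```python
# Installed-size estimates in MB, as quoted in this document.
PYTORCH_PATH = {"torch (CPU-only)": 700, "transformers": 50, "SmolLM2-135M weights": 269}
ONNX_PATH = {"onnxruntime": 180, "optimum": 50, "SmolLM2-135M weights": 269}

pytorch_total = sum(PYTORCH_PATH.values())  # roughly 1 GB
onnx_total = sum(ONNX_PATH.values())        # roughly 500 MB
savings = pytorch_total - onnx_total

print(f"PyTorch path: {pytorch_total} MB")
print(f"ONNX path:    {onnx_total} MB")
print(f"Savings:      {savings} MB ({savings / pytorch_total:.0%} of the PyTorch path)")
```

The model weights appear in both totals, so the saving comes entirely from the runtime and its wrapper library.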
Phase 2 Plan
When ONNX Runtime support is added in Phase 2, the recommended approach is:
- Create a custom ONNX export pipeline that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
- Store the custom-exported model on HuggingFace Hub (e.g., alknet/smollm2-135m-onnx-activations) with the modified graph
- Use onnxruntime.InferenceSession directly (bypassing ORTModelForCausalLM) for inference, requesting the hidden state outputs by name
- Validate numerical equivalence against the PyTorch reference implementation at each model version
- Pin the optimum version used for the initial export to ensure node name stability
Alternatively, if optimum adds output_hidden_states support in a future version (the issue could be reopened), the implementation becomes much simpler and could use ORTModelForCausalLM directly.
Phase 1 Actions
- Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
- Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
- Update pyproject.toml to remove the [onnx] extra from Phase 1 scope (or mark it as experimental/unstable)
- Ensure the DetectorModel protocol and HFDetectorModel implementation are clean enough to extend with an ONNXDetectorModel in Phase 2
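As a rough illustration of the last action, the sketch below shows how calling code can depend only on the protocol, so that a second backend slots in without changes. Several things here are assumptions: the project's protocol returns np.ndarray, while this stdlib-only version uses plain lists; `StubONNXDetectorModel` and `layers_returned` are hypothetical names; and the stub returns zeros in place of running an inference session.

```python
from typing import Protocol, runtime_checkable

# The project's protocol returns np.ndarray; lists keep this sketch stdlib-only.
Activations = dict[int, list[float]]


@runtime_checkable
class DetectorModel(Protocol):
    def infer(self, input_ids: list[int]) -> Activations: ...


class StubONNXDetectorModel:
    """Hypothetical placeholder for a Phase 2 backend; returns zeros, runs no session."""

    def __init__(self, layers: list[int], hidden_dim: int) -> None:
        self.layers = layers
        self.hidden_dim = hidden_dim

    def infer(self, input_ids: list[int]) -> Activations:
        return {layer: [0.0] * self.hidden_dim for layer in self.layers}


def layers_returned(model: DetectorModel, input_ids: list[int]) -> list[int]:
    """Caller code sees only the protocol, never a concrete backend class."""
    return sorted(model.infer(input_ids))


backend = StubONNXDetectorModel(layers=[1, 2, 4, 8], hidden_dim=4)
conforms = isinstance(backend, DetectorModel)  # structural check, no inheritance needed
layers_seen = layers_returned(backend, [101, 2023, 102])
```

Because the check is structural, the stub never imports or subclasses the protocol, which is the property that would let a future ONNX backend live behind an optional extra.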
7. References
- HuggingFace optimum Issue #972: "Add output of output_hidden_states for onnx model export". https://github.com/huggingface/optimum/issues/972. Closed as "not planned". The key issue documenting the lack of hidden state output support.
- ONNX Runtime InferenceSession API: https://onnxruntime.ai/docs/api/python/api_summary.html. Documents that session.run() can only return values declared as graph outputs.
- sklearn-onnx intermediate outputs: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html. Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
- Stack Overflow, "Extract intermediate layer outputs from ONNX": https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode. Shows the approach of adding ValueInfoProto to model.graph.output to expose intermediate values.
- optimum-onnx GitHub: https://github.com/huggingface/optimum-onnx. The ONNX integration library for HuggingFace models.
- ORTModelForCausalLM documentation: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort. Documents the forward() method; notably absent is an output_hidden_states parameter.
- SmolLM2-135M ONNX on HuggingFace Hub: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX. Pre-exported ONNX version of SmolLM2-135M.
- optimum ONNX export documentation: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model. Documents the export process and configuration.
- DeepWiki, ORTModelForCausalLM text generation models: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models. Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.
- DeepWiki, ONNX Model Export: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export. Documents the export system architecture, validation, and graph transformations.
- ONNX Runtime performance: https://onnxruntime.ai/docs/performance/. Official performance documentation.
- OpenNN deployment size comparison: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/. Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.
- onnxruntime PyPI: https://pypi.org/project/onnxruntime/. Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.
- onnx-modifier: https://github.com/ZhangGe6/onnx-modifier. Tool for modifying ONNX models, including adding intermediate outputs.
- ONNX graph surgery: https://tlbvr.com/blog/onnx-graph-surgery/. Techniques for embedding custom operations in ONNX graphs.
- ADR-006, Optional PyTorch: /docs/architecture/decisions/006-optional-pytorch.md. The ADR documenting why PyTorch is optional and the install size comparison.
- Model architecture doc: /docs/architecture/model.md. Documents activation extraction design, DetectorModel protocol, and layer selection.