alknet-firewall/docs/research/onnx-inference-backend/feasibility-analysis.md
glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure
Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.
2026-06-13 07:27:40 +00:00


# Research: ONNX Runtime as Inference Backend for alknet-firewall
**Date**: 2026-06-13
**Question**: Should ONNX Runtime be a supported inference backend in Phase 1?
**Status**: Open question OQ-01
## Executive Summary
**ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2.** The core challenge is that ONNX Runtime's standard inference pipeline does not natively expose intermediate layer hidden states — the critical data alknet-firewall needs for activation-based detection. While there is a workable path (custom ONNX graph modification to add intermediate outputs), it requires significant additional engineering, testing, and maintenance compared to the PyTorch path where `output_hidden_states=True` is a single flag. The install-size advantage is real (~180MB vs ~700MB for CPU-only torch), but not decisive for Phase 1 when the activation extraction problem is unsolved in the ONNX ecosystem.
---
## 1. ONNX Runtime Overview
### What It Is
ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference — no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.
### Install Footprint
| Package | Wheel Size | Installed Size | Notes |
|---------|-----------|---------------|-------|
| `onnxruntime` (CPU) | ~18 MB | ~180-200 MB | Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings |
| `torch` (CPU-only) | ~200 MB | ~700 MB | libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB) |
| `torch` (CUDA) | ~2.5 GB | ~5+ GB | Default `pip install torch` since PyTorch 2.11 |
| `optimum[onnxruntime]` | ~5 MB | ~20 MB | Python wrapper; depends on onnxruntime + transformers |
**Sources**: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB. PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).
**Revised claim**: The ADR-006 claim of "onnxruntime: ~30-50MB download, ~300MB installed" overstates both figures: the measured wheel is ~18 MB, and the installed size is closer to 180-200 MB than 300 MB. The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.
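
The installed-size figures above can be re-checked on a target machine. The following is a minimal sketch using only the standard library. The helper name `installed_size_mb` is hypothetical, and it counts only the files recorded in a distribution's own metadata, so dependencies such as NumPy are not included in the total.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def installed_size_mb(dist_name: str) -> Optional[float]:
    """Sum the on-disk size of every file a distribution installed, in MB.

    Returns None if the distribution is not installed or has no file record.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:  # the installer did not write a file record
        return None
    total_bytes = 0
    for entry in files:
        path = Path(entry.locate())
        if path.is_file():
            total_bytes += path.stat().st_size
    return total_bytes / 1_000_000


if __name__ == "__main__":
    for name in ("onnxruntime", "torch"):
        size = installed_size_mb(name)
        print(name, "not installed" if size is None else f"{size:.0f} MB")
```

Running this inside the deployment image, rather than on a developer machine, gives the figure that matters for the comparison.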
### Performance Characteristics
- **CPU inference**: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
- **Warm start**: ORT session creation has overhead (~100ms-1s depending on model), but inference calls are fast
- **Memory**: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
- **Thread scaling**: Good multi-threaded CPU performance via OpenMP/MLAS
### CPU Deployment Story
ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:
- No CUDA/GPU dependency
- Cross-platform (Linux, macOS, Windows, ARM)
- Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
- Well-suited for containerized and embedded deployments
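
As a rough illustration of the CPU deployment story, the sketch below picks execution providers in a preferred order and configures threading and graph optimization on the session. The preference order, the default thread count, and the helper names `choose_providers` and `make_session` are assumptions for illustration; `onnxruntime` is imported lazily so the selection logic can be exercised without it.

```python
from typing import Iterable, List

# Execution provider names as registered by ONNX Runtime.
# The ordering here is an assumed preference, not a recommendation.
PREFERRED = (
    "OpenVINOExecutionProvider",
    "CoreMLExecutionProvider",
    "ACLExecutionProvider",
    "CPUExecutionProvider",
)


def choose_providers(
    available: Iterable[str], preferred: Iterable[str] = PREFERRED
) -> List[str]:
    """Keep preferred providers that are available, always ending with CPU."""
    available_set = set(available)
    chosen = [name for name in preferred if name in available_set]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")  # CPU is the universal fallback
    return chosen


def make_session(model_path: str, threads: int = 4):
    """Create a CPU-oriented InferenceSession (requires onnxruntime)."""
    import onnxruntime as ort  # lazy import: choose_providers needs no dependency

    options = ort.SessionOptions()
    options.intra_op_num_threads = threads
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, sess_options=options, providers=providers)
```
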
---
## 2. HuggingFace Optimum Integration
### How Optimum Works
HuggingFace's `optimum-onnx` (formerly `optimum[onnxruntime]`) provides drop-in replacement classes for HuggingFace transformers models:
```python
# PyTorch path
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)

# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)
```
### Export Process
The ONNX export can be done via:
1. **CLI**: `optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/`
2. **Programmatic**: `ORTModelForCausalLM.from_pretrained("...", export=True)`
3. **Pre-exported**: Use existing ONNX models from `onnx-community/` on HuggingFace Hub
For causal LMs, the export produces:
- A **decoder model** (with or without past key values)
- Optionally a **merged decoder** combining initial pass and cached pass into one model
### Model Compatibility
SmolLM2-135M uses the LLaMA architecture. The `optimum` ONNX export supports LLaMA-family models:
| Architecture | Export Support | ORTModelForCausalLM Support |
|---|---|---|
| `llama` (SmolLM2) | ✓ Supported | ✓ Supported |
| `gpt2` | ✓ Supported | ✓ Supported |
| `bloom` | ✓ Supported | ✓ Supported |
| `mistral` | ✓ Supported | ✓ Supported |
**Pre-exported model available**: `onnx-community/SmolLM2-135M-ONNX` exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.
---
## 3. Activation Extraction Feasibility ⚠️ CRITICAL
This is the **make-or-break question** for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:
```python
outputs = model(input_ids, output_hidden_states=True)
# Last-token hidden state for each layer of interest
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}
```
### The Problem
**ORTModelForCausalLM does NOT support `output_hidden_states`.** This is confirmed by:
1. **GitHub Issue #972** on `huggingface/optimum`: "Add output of `output_hidden_states` for onnx model export" — filed April 2023, **closed as "not planned"**. The request was to add hidden state outputs to the ONNX export for `ORTModelForCausalLM`, noting that the merged decoder only outputs logits + past key/values.
2. **ORTModelForCausalLM.forward() documentation**: The `forward()` method signature includes `input_ids`, `attention_mask`, `past_key_values`, `position_ids`, `use_cache`, and `**kwargs` — but **no `output_hidden_states` parameter**. The return type is logits + past key values only.
3. **ONNX graph structure**: Standard ONNX exports of causal LMs define outputs as `logits` and `past_key_values`. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.
### Why This Is Hard
ONNX is a **static graph format**. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set `output_hidden_states=True` at runtime, ONNX requires the graph to explicitly include those output connections.
The `sklearn-onnx` documentation explicitly states: *"There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."*
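
Because only declared graph outputs are retrievable, a cheap guard is to compare the names a caller wants against `session.get_outputs()` before calling `session.run()`. This is a minimal sketch: `undeclared_outputs` is a hypothetical helper, the tensor name is illustrative, and the stand-in session mimics only the `get_outputs()` method of `onnxruntime.InferenceSession`.

```python
from types import SimpleNamespace
from typing import Iterable, List


def undeclared_outputs(session, requested: Iterable[str]) -> List[str]:
    """Return the requested names that are not declared graph outputs."""
    declared = {out.name for out in session.get_outputs()}
    return [name for name in requested if name not in declared]


# Stand-in for a session over a standard causal-LM export, which declares
# logits (and past key values, omitted here) but no hidden states.
fake_session = SimpleNamespace(
    get_outputs=lambda: [SimpleNamespace(name="logits")]
)

missing = undeclared_outputs(fake_session, ["logits", "/model/layers.1/output_0"])
print(missing)  # ['/model/layers.1/output_0']
```

With a real session, a non-empty result means the graph has to be modified before those tensors can be requested.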
### Workable Paths (All Require Extra Engineering)
#### Path A: Custom ONNX Export with Hidden State Outputs
**Approach**: Modify the ONNX export configuration to include intermediate layer hidden states as graph outputs.
```python
import onnx

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Intermediate tensor names must be discovered by inspecting the graph.
# For LLaMA/SmolLM2 they follow patterns like "/model/layers.0/output_0",
# but the exact names depend on the exporter version.
for layer_idx in [1, 2, 4, 8]:
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    # Declare the intermediate tensor as an additional graph output
    model.graph.output.append(
        onnx.helper.make_tensor_value_info(
            intermediate_name,
            onnx.TensorProto.FLOAT,
            ["batch", "seq_len", "hidden_dim"],
        )
    )

onnx.save(model, "model_with_hidden_states.onnx")
```
Then use `onnxruntime.InferenceSession` directly (not through `ORTModelForCausalLM`) to request these outputs:
```python
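import onnxruntime

session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")

# input_ids and attention_mask are int64 NumPy arrays from the tokenizer
hidden_state_names = [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
outputs = session.run(
    ["logits", *hidden_state_names],
    {"input_ids": input_ids, "attention_mask": attention_mask},
)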
```
**Pros**: Works with standard ONNX Runtime; no PyTorch dependency at inference time.
**Cons**:
- Requires careful ONNX graph manipulation (naming conventions vary by export version)
- Must validate that intermediate node names are stable across export runs
- Must handle the merged decoder model correctly (past key values branch)
- Loss of `ORTModelForCausalLM` convenience (manual session management, no `generate()`, no caching)
- Must discover intermediate node names via `onnx` library inspection
- Graph modifications may invalidate ONNX Runtime optimizations
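
The name-discovery step in the list above could be sketched as follows. The regular expression encodes the naming pattern assumed in the example (`/model/layers.N/output_0`) and has to be checked against the actual export. `layer_outputs` and `tensor_names_from_file` are hypothetical helpers, and `onnx` is imported lazily so the matching logic runs without it.

```python
import re
from typing import Dict, Iterable, List

# Assumed naming pattern; inspect the real export before relying on it.
LAYER_OUTPUT = re.compile(r"^/model/layers\.(\d+)/output_0$")


def layer_outputs(tensor_names: Iterable[str], pattern=LAYER_OUTPUT) -> Dict[int, str]:
    """Map layer index -> tensor name for every name matching the pattern."""
    found: Dict[int, str] = {}
    for name in tensor_names:
        match = pattern.match(name)
        if match:
            found[int(match.group(1))] = name
    return found


def tensor_names_from_file(path: str) -> List[str]:
    """Collect every node output name in an ONNX file (requires onnx)."""
    import onnx  # lazy import: layer_outputs has no dependency

    model = onnx.load(path)
    return [out for node in model.graph.node for out in node.output]


names = [
    "/model/layers.1/output_0",
    "/model/layers.1/attn/MatMul_output_0",
    "logits",
]
found = layer_outputs(names)
missing_layers = sorted(set([1, 2, 4, 8]) - found.keys())
print(found, missing_layers)
```

Failing loudly when `missing_layers` is non-empty turns a silent naming drift into an export-time error.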
#### Path B: Separate Encoder-Style ONNX Export
**Approach**: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers.
This would require writing a custom `torch.onnx.export` call that traces the model with `output_hidden_states=True` and captures the intermediate outputs.
**Pros**: Clean separation of concerns; each sub-model can be optimized independently.
**Cons**:
- Requires PyTorch for the initial export (but not at runtime)
- Significant custom code to manage multiple ONNX sub-models
- Past key value caching becomes much more complex with sub-models
- Not supported by `optimum` CLI or ORTModel classes
#### Path C: Direct ONNX Runtime with Modified Graph (Recommended Path)
**Approach**: Combine a custom ONNX export with direct `onnxruntime.InferenceSession` usage, bypassing `ORTModelForCausalLM` entirely.
```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
inputs = tokenizer("Hello world", return_tensors="np")

output_names = [o.name for o in session.get_outputs()]
# Includes: logits, past_key_values, hidden_state_1, hidden_state_2, ...
results = session.run(output_names, dict(inputs))

# Last-token activation per layer; "hidden_state_<n>" are the output names
# assigned when the graph was modified
hidden_states = {
    layer_idx: results[output_names.index(f"hidden_state_{layer_idx}")][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}
```
**Pros**: Full control; no PyTorch at runtime; smallest possible footprint.
**Cons**:
- Must write and maintain custom ONNX graph modification code
- Must re-export whenever the model architecture changes
- Must validate numerical equivalence against PyTorch reference
- Bypasses the `ORTModelForCausalLM` abstraction entirely
- Past key value handling must be manual (no generate() support)
- This is essentially a custom inference backend, not a drop-in replacement
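
The numerical-equivalence check listed above might look like the following dependency-free sketch, which compares one activation vector from each backend. The tolerances `atol=1e-4` and `min_cosine=0.9999` are placeholder assumptions, not values from this analysis, and the function names are hypothetical.

```python
import math
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    if len(reference) != len(candidate):
        raise ValueError("activation vectors differ in length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def cosine_similarity(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = math.fsum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(math.fsum(r * r for r in reference)) * math.sqrt(
        math.fsum(c * c for c in candidate)
    )
    return dot / norm if norm else 0.0


def is_equivalent(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-4,        # assumed tolerance
    min_cosine: float = 0.9999,  # assumed threshold
) -> bool:
    """True if the candidate backend reproduces the reference activation."""
    return (
        max_abs_diff(reference, candidate) <= atol
        and cosine_similarity(reference, candidate) >= min_cosine
    )


print(is_equivalent([0.5, -1.25, 2.0], [0.5, -1.25, 2.00001]))
print(is_equivalent([1.0, 0.0], [0.0, 1.0]))
```

In practice the same comparison would run over NumPy arrays (for example with `numpy.allclose`) for every extracted layer and a representative set of prompts.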
### Comparison with PyTorch
| Aspect | PyTorch | ONNX Runtime (Standard) | ONNX Runtime (Custom) |
|--------|---------|--------------------------|----------------------|
| `output_hidden_states=True` | ✅ Native, one flag | ❌ Not supported | ⚠️ Requires graph modification |
| Activation extraction API | `outputs.hidden_states[layer][:, -1, :]` | N/A | Manual `session.run()` with named outputs |
| Effort to implement | Minimal (built-in) | N/A | High (custom export + graph hacking) |
| Numerical accuracy | Ground truth | Must validate | Must validate against PyTorch |
| Maintenance burden | Low | N/A | High (graph names change, ONNX spec evolves) |
---
## 4. SmolLM2-135M ONNX Export
### Known Status
- **Pre-exported model exists**: `onnx-community/SmolLM2-135M-ONNX` on HuggingFace Hub
- **Architecture**: LLaMA family, which is well-supported by `optimum` ONNX export
- **Export method**: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
- **Model card**: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction
### Export Configuration
SmolLM2 uses the LLaMA architecture, which maps to `optimum`'s `LlamaOnnxConfig`. The standard export produces:
- `decoder_model.onnx` — for initial forward pass (no past key values)
- `decoder_with_past_model.onnx` — for subsequent generation steps (with past key values)
- Or `decoder_model_merged.onnx` — combined model with conditional branching
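
Since the export can take any of these shapes, loading code may want to detect which one it was given. This is a small sketch using the file names listed above. The helper `export_layout` and its return labels are hypothetical, and other exporter versions may use different file names, in which case the result is `"unknown"`.

```python
from pathlib import Path

MERGED = "decoder_model_merged.onnx"
WITHOUT_PAST = "decoder_model.onnx"
WITH_PAST = "decoder_with_past_model.onnx"


def export_layout(export_dir) -> str:
    """Classify an export directory by which decoder files it contains."""
    directory = Path(export_dir)
    if (directory / MERGED).is_file():
        return "merged"
    has_initial = (directory / WITHOUT_PAST).is_file()
    has_cached = (directory / WITH_PAST).is_file()
    if has_initial and has_cached:
        return "split"
    if has_initial:
        return "no-cache"
    return "unknown"
```

For activation extraction on a single forward pass, the "no-cache" decoder is the simplest graph to modify because it has no past key value inputs.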
### Known Issues
1. **Hidden states not in standard export**: The default `optimum` export for causal LMs does not include intermediate hidden states as outputs. This is by design — the export configuration only specifies logits and past key values as outputs.
2. **Merged decoder complexity**: The merged decoder model uses a `use_cache_branch` flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.
3. **Node naming stability**: Internal ONNX node names (e.g., `/model/layers.0/output_0`) may change between `optimum` versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.
---
## 5. Comparison Table
| Criteria | PyTorch (CPU-only) | ONNX Runtime (Standard) | ONNX Runtime (Custom Graph) |
|---|---|---|---|
| **Install size (download)** | ~200 MB | ~18 MB | ~18 MB |
| **Install size (disk)** | ~700 MB | ~180-200 MB | ~180-200 MB |
| **`output_hidden_states=True`** | ✅ Built-in | ❌ Not supported | ⚠️ Custom graph modification |
| **Activation extraction API** | `model(**inputs, output_hidden_states=True)` | N/A | Manual `session.run()` with named outputs |
| **Drop-in with optimum** | ✅ `AutoModelForCausalLM` | ⚠️ `ORTModelForCausalLM` but no hidden states | ❌ Must bypass ORTModel classes |
| **Past key value caching** | ✅ Automatic | ✅ Automatic via ORTModel | ❌ Must handle manually |
| **Numerical equivalence** | Ground truth | Must validate | Must validate |
| **Implementation effort** | Low (built-in) | N/A (doesn't work) | High (custom export + graph mod) |
| **Maintenance burden** | Low | N/A | High (brittle node names) |
| **Runtime performance** | Good | Better (graph-optimized) | Better (graph-optimized) |
| **CPU deployment** | ✅ Supported | ✅ Excellent | ✅ Excellent |
| **safetensors loading** | ✅ Via transformers | ✅ Via optimum | ❌ Requires separate model loading |
| **Model pinning (revision)** | ✅ Via transformers | ✅ Via optimum | ⚠️ Custom handling |
| **Offline/air-gapped** | ✅ HF Hub cache | ✅ HF Hub cache | ⚠️ Custom export files |
| **License** | BSD-3 | MIT | MIT |
---
## 6. Recommendation
### **Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.**
### Rationale
1. **The activation extraction problem is unsolved for ORTModelForCausalLM.** Issue #972 requesting `output_hidden_states` support was closed as "not planned" by the `optimum` team. This means the standard, supported path does not work for alknet-firewall's core requirement.
2. **Custom ONNX graph modification is a significant engineering effort** with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across `optimum` version updates.
3. **The install-size advantage is real but not decisive.** `onnxruntime` (~180 MB installed) is significantly smaller than CPU-only `torch` (~700 MB installed), but the end-to-end difference is manageable:
   - The model weights (269 MB for SmolLM2-135M) are a fixed cost on both paths, which dilutes the relative saving
   - Total installed size for the PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
   - Total installed size for the ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
   - Savings: ~500 MB, which is meaningful but not transformative
4. **PyTorch is already optional.** ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.
5. **The `DetectorModel` protocol already accommodates multiple backends.** The architecture is designed for this:
```python
from typing import Protocol
import numpy as np

class DetectorModel(Protocol):
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```
Adding an `ONNXDetectorModel` implementation in Phase 2 is a clean extension.
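
The size estimate in item 3 can be reproduced with a few lines of arithmetic. The figures are the approximate installed sizes quoted in this document; only the variable names are new.

```python
# Approximate installed sizes in MB, as quoted in item 3.
MODEL_WEIGHTS = 269

pytorch_path = {"torch (CPU-only)": 700, "transformers": 50, "model weights": MODEL_WEIGHTS}
onnx_path = {"onnxruntime": 180, "optimum": 50, "model weights": MODEL_WEIGHTS}

pytorch_total = sum(pytorch_path.values())  # 1019 MB, roughly 1 GB
onnx_total = sum(onnx_path.values())        # 499 MB, roughly 500 MB
saving = pytorch_total - onnx_total         # 520 MB

print(f"PyTorch path: {pytorch_total} MB")
print(f"ONNX path:    {onnx_total} MB")
print(f"Saving:       {saving} MB ({saving / pytorch_total:.0%} of the PyTorch path)")
```

The saving comes to about half of the PyTorch footprint, with the model weights accounting for more than half of what remains on the ONNX path.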
### Phase 2 Plan
When ONNX Runtime support is added in Phase 2, the recommended approach is:
1. **Create a custom ONNX export pipeline** that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
2. **Store the custom-exported model** on HuggingFace Hub (e.g., `alknet/smollm2-135m-onnx-activations`) with the modified graph
3. **Use `onnxruntime.InferenceSession` directly** (bypassing `ORTModelForCausalLM`) for inference, requesting the hidden state outputs by name
4. **Validate numerical equivalence** against the PyTorch reference implementation at each model version
5. **Pin the `optimum` version** used for the initial export to ensure node name stability
Alternatively, if `optimum` adds `output_hidden_states` support in a future version (the issue could be reopened), the implementation becomes much simpler and could use `ORTModelForCausalLM` directly.
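
Steps 2 and 5 of the plan depend on knowing that the model file loaded at runtime is the one that was validated. One possible approach, sketched here with the standard library only, is to store a small manifest next to the export and verify it at load time. The manifest fields and the function names are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    model_path: Path,
    manifest_path: Path,
    output_names: List[str],
    exporter_version: str,
) -> None:
    """Record what the export looked like when it was validated."""
    manifest = {
        "sha256": sha256_of(model_path),
        "output_names": output_names,
        "exporter_version": exporter_version,  # e.g. the pinned optimum version
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))


def matches_manifest(model_path: Path, manifest_path: Path) -> bool:
    """True if the model file is byte-identical to the validated export."""
    manifest = json.loads(manifest_path.read_text())
    return manifest["sha256"] == sha256_of(model_path)
```

A mismatch would indicate a re-export or a corrupted download, either of which should trigger the numerical-equivalence validation again.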
### Phase 1 Actions
- Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
- Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
- Update `pyproject.toml` to remove the `[onnx]` extra from Phase 1 scope (or mark it as experimental/unstable)
- Ensure the `DetectorModel` protocol and `HFDetectorModel` implementation are clean enough to extend with an `ONNXDetectorModel` in Phase 2
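
One way to keep the backend list extensible while PyTorch stays optional is to detect installed backends at import time without importing them. This is a sketch under assumptions: the backend labels `"hf"` and `"onnx"` and the helper `available_backends` are hypothetical, and the module lists reflect the dependencies discussed in this document.

```python
from importlib.util import find_spec
from typing import Dict, List

# Backend label -> top-level modules it needs.
# "onnx" stands for the Phase 2 backend and is listed for illustration only.
BACKEND_REQUIREMENTS: Dict[str, List[str]] = {
    "hf": ["torch", "transformers"],
    "onnx": ["onnxruntime"],
}


def available_backends(
    requirements: Dict[str, List[str]] = BACKEND_REQUIREMENTS,
) -> List[str]:
    """List backends whose required modules can all be found."""
    return [
        name
        for name, modules in requirements.items()
        if all(find_spec(module) is not None for module in modules)
    ]


if __name__ == "__main__":
    found = available_backends()
    if not found:
        print("No inference backend installed; install the PyTorch extra.")
    else:
        print("Available backends:", ", ".join(found))
```
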
---
## 7. References
1. **HuggingFace optimum Issue #972**: "Add output of `output_hidden_states` for onnx model export" — https://github.com/huggingface/optimum/issues/972 — Closed as "not planned". The key issue documenting the lack of hidden state output support.
2. **ONNX Runtime InferenceSession API**: https://onnxruntime.ai/docs/api/python/api_summary.html — Documents that `session.run()` can only return values declared as graph outputs.
3. **sklearn-onnx intermediate outputs**: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html — Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
4. **Stack Overflow: Extract intermediate layer outputs from ONNX**: https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode — Shows the approach of adding `ValueInfoProto` to `model.graph.output` to expose intermediate values.
5. **optimum-onnx GitHub**: https://github.com/huggingface/optimum-onnx — The ONNX integration library for HuggingFace models.
6. **ORTModelForCausalLM documentation**: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort — Documents the `forward()` method; notably absent is `output_hidden_states` parameter.
7. **SmolLM2-135M ONNX on HuggingFace Hub**: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX — Pre-exported ONNX version of SmolLM2-135M.
8. **optimum ONNX export documentation**: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model — Documents the export process and configuration.
9. **DeepWiki: ORTModelForCausalLM text generation models**: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models — Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.
10. **DeepWiki: ONNX Model Export**: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export — Documents the export system architecture, validation, and graph transformations.
11. **ONNX Runtime performance**: https://onnxruntime.ai/docs/performance/ — Official performance documentation.
12. **OpenNN deployment size comparison**: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/ — Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.
13. **onnxruntime PyPI**: https://pypi.org/project/onnxruntime/ — Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.
14. **onnx-modifier**: https://github.com/ZhangGe6/onnx-modifier — Tool for modifying ONNX models, including adding intermediate outputs.
15. **ONNX graph surgery**: https://tlbvr.com/blog/onnx-graph-surgery/ — Techniques for embedding custom operations in ONNX graphs.
16. **ADR-006: Optional PyTorch**: `/docs/architecture/decisions/006-optional-pytorch.md` — The ADR documenting why PyTorch is optional and the install size comparison.
17. **Model architecture doc**: `/docs/architecture/model.md` — Documents activation extraction design, `DetectorModel` protocol, and layer selection.