# Research: ONNX Runtime as Inference Backend for alknet-firewall

**Date**: 2026-06-13
**Question**: Should ONNX Runtime be a supported inference backend in Phase 1?
**Status**: Open question OQ-01

## Executive Summary

**ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2.**

The core challenge is that ONNX Runtime's standard inference pipeline does not natively expose intermediate layer hidden states, the critical data alknet-firewall needs for activation-based detection. While there is a workable path (custom ONNX graph modification to add intermediate outputs), it requires significant additional engineering, testing, and maintenance compared to the PyTorch path where `output_hidden_states=True` is a single flag. The install-size advantage is real (~180MB vs ~700MB for CPU-only torch), but not decisive for Phase 1 when the activation extraction problem is unsolved in the ONNX ecosystem.

---

## 1. ONNX Runtime Overview

### What It Is

ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference: no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.

### Install Footprint

| Package | Wheel Size | Installed Size | Notes |
|---------|-----------|---------------|-------|
| `onnxruntime` (CPU) | ~18 MB | ~180-200 MB | Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings |
| `torch` (CPU-only) | ~200 MB | ~700 MB | libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB) |
| `torch` (CUDA) | ~2.5 GB | ~5+ GB | Default `pip install torch` since PyTorch 2.11 |
| `optimum[onnxruntime]` | ~5 MB | ~20 MB | Python wrapper; depends on onnxruntime + transformers |

**Sources**: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB.
PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).

**Revised claim**: The ADR-006 claim of "onnxruntime: ~30-50MB download, ~300MB installed" is in the right order of magnitude for the wheel (measured: ~18 MB), but the installed size is closer to 180-200 MB (not 300 MB). The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.

### Performance Characteristics

- **CPU inference**: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
- **Warm start**: ORT session creation has overhead (~100ms-1s depending on model), but inference calls are fast
- **Memory**: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
- **Thread scaling**: Good multi-threaded CPU performance via OpenMP/MLAS

### CPU Deployment Story

ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:

- No CUDA/GPU dependency
- Cross-platform (Linux, macOS, Windows, ARM)
- Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
- Well-suited for containerized and embedded deployments

---

## 2. HuggingFace Optimum Integration

### How Optimum Works

HuggingFace's `optimum-onnx` (formerly `optimum[onnxruntime]`) provides drop-in replacement classes for HuggingFace transformers models:

```python
# PyTorch path
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)

# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)
```

### Export Process

The ONNX export can be done via:

1. **CLI**: `optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/`
2. **Programmatic**: `ORTModelForCausalLM.from_pretrained("...", export=True)`
3. **Pre-exported**: Use existing ONNX models from `onnx-community/` on HuggingFace Hub

For causal LMs, the export produces:

- A **decoder model** (with or without past key values)
- Optionally a **merged decoder** combining initial pass and cached pass into one model

### Model Compatibility

SmolLM2-135M uses the LLaMA architecture. The `optimum` ONNX export supports LLaMA-family models:

| Architecture | Export Support | ORTModelForCausalLM Support |
|---|---|---|
| `llama` (SmolLM2) | ✓ Supported | ✓ Supported |
| `gpt2` | ✓ Supported | ✓ Supported |
| `bloom` | ✓ Supported | ✓ Supported |
| `mistral` | ✓ Supported | ✓ Supported |

**Pre-exported model available**: `onnx-community/SmolLM2-135M-ONNX` exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.

---

## 3. Activation Extraction Feasibility ⚠️ CRITICAL

This is the **make-or-break question** for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:

```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}
```

### The Problem

**ORTModelForCausalLM does NOT support `output_hidden_states`.** This is confirmed by:

1. **GitHub Issue #972** on `huggingface/optimum`: "Add output of `output_hidden_states` for onnx model export", filed April 2023, **closed as "not planned"**. The request was to add hidden state outputs to the ONNX export for `ORTModelForCausalLM`, noting that the merged decoder only outputs logits + past key/values.
2. **ORTModelForCausalLM.forward() documentation**: The `forward()` method signature includes `input_ids`, `attention_mask`, `past_key_values`, `position_ids`, `use_cache`, and `**kwargs`, but **no `output_hidden_states` parameter**. The return type is logits + past key values only.
3. **ONNX graph structure**: Standard ONNX exports of causal LMs define outputs as `logits` and `past_key_values`. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.

### Why This Is Hard

ONNX is a **static graph format**. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set `output_hidden_states=True` at runtime, ONNX requires the graph to explicitly include those output connections.

The `sklearn-onnx` documentation explicitly states: *"There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."*

### Workable Paths (All Require Extra Engineering)

#### Path A: Custom ONNX Export with Hidden State Outputs

**Approach**: Modify the ONNX export configuration to include intermediate layer hidden states as graph outputs.
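
Before any modification, the candidate tensor names have to be located in the exported graph. The following is a minimal sketch, assuming layer outputs follow the `/model/layers.<N>/output_0` pattern used in this section's examples (real exports may name them differently). `layer_output_names` and `list_graph_tensor_names` are hypothetical helpers, and `onnx` is imported lazily so the name filter can be exercised without it.

```python
import re
from typing import Iterable

# Assumed naming pattern, taken from this document's examples; verify per export.
LAYER_OUTPUT_RE = re.compile(r"^/model/layers\.(\d+)/output_0$")


def layer_output_names(tensor_names: Iterable[str], layers: Iterable[int]) -> dict[int, str]:
    """Map each requested layer index to its tensor name; fail loudly if one is missing."""
    wanted = set(layers)
    found: dict[int, str] = {}
    for name in tensor_names:
        match = LAYER_OUTPUT_RE.match(name)
        if match and int(match.group(1)) in wanted:
            found[int(match.group(1))] = name
    missing = sorted(wanted - set(found))
    if missing:
        raise KeyError(f"no tensor matching the assumed pattern for layers {missing}")
    return found


def list_graph_tensor_names(path: str) -> list[str]:
    """Collect every node output name in an ONNX file (requires the `onnx` package)."""
    import onnx  # lazy import: only needed when inspecting a real model

    model = onnx.load(path)
    return [out for node in model.graph.node for out in node.output]
```

With a real export, `list_graph_tensor_names("model.onnx")` would supply the names, and a `KeyError` from `layer_output_names` would be the signal that the naming changed between export versions. The graph modification itself then looks like this: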
```python
import onnx

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Find the intermediate layer output names in the graph
# For LLaMA/SmolLM2, each transformer layer outputs hidden states
# Names follow patterns like: "/model/layers.0/output_0"

# Add intermediate outputs to the graph
for layer_idx in [1, 2, 4, 8]:
    # Find the node output for each layer
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    model.graph.output.append(
        onnx.helper.make_tensor_value_info(
            intermediate_name,
            onnx.TensorProto.FLOAT,
            ["batch", "seq_len", "hidden_dim"],
        )
    )

onnx.save(model, "model_with_hidden_states.onnx")
```

Then use `onnxruntime.InferenceSession` directly (not through `ORTModelForCausalLM`) to request these outputs:

```python
import onnxruntime

session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")

# Same layer indices and names as in the graph modification above
layer_outputs = [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
outputs = session.run(
    ["logits", *layer_outputs],
    # input_ids and attention_mask are NumPy arrays produced by the tokenizer
    {"input_ids": input_ids, "attention_mask": attention_mask},
)
```

**Pros**: Works with standard ONNX Runtime; no PyTorch dependency at inference time.

**Cons**:

- Requires careful ONNX graph manipulation (naming conventions vary by export version)
- Must validate that intermediate node names are stable across export runs
- Must handle the merged decoder model correctly (past key values branch)
- Loss of `ORTModelForCausalLM` convenience (manual session management, no `generate()`, no caching)
- Must discover intermediate node names via `onnx` library inspection
- Graph modifications may invalidate ONNX Runtime optimizations

#### Path B: Separate Encoder-Style ONNX Export

**Approach**: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers. This would require writing a custom `torch.onnx.export` call that traces the model with `output_hidden_states=True` and captures the intermediate outputs.
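
A rough sketch of what such a custom export could look like, assuming a hypothetical wrapper module that returns logits plus the selected hidden states as a flat tuple. The names `export_with_hidden_states`, `hidden_state_output_names`, and `WithHiddenStates` are illustrative, the opset version is an arbitrary choice, and the export path has not been validated against SmolLM2.

```python
from typing import Sequence


def hidden_state_output_names(layers: Sequence[int]) -> list[str]:
    """ONNX output names: logits first, then one entry per requested layer."""
    return ["logits"] + [f"hidden_state_{i}" for i in layers]


def export_with_hidden_states(model_id: str, out_path: str, layers: Sequence[int]) -> None:
    """One-time export; needs torch and transformers, which stay out of the runtime path."""
    import torch  # lazy imports keep this module importable without PyTorch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    class WithHiddenStates(torch.nn.Module):
        """Hypothetical wrapper that flattens the model output for tracing."""

        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, input_ids, attention_mask):
            out = self.model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                output_hidden_states=True,
                use_cache=False,  # no past key values, so outputs stay a flat tuple
            )
            return (out.logits, *[out.hidden_states[i] for i in layers])

    model = AutoModelForCausalLM.from_pretrained(model_id).eval()
    sample = AutoTokenizer.from_pretrained(model_id)("Hello world", return_tensors="pt")
    names = hidden_state_output_names(layers)
    # Mark batch and sequence dimensions as dynamic on every input and output
    axes = {n: {0: "batch", 1: "seq_len"} for n in ["input_ids", "attention_mask", *names]}
    torch.onnx.export(
        WithHiddenStates(model),
        (sample["input_ids"], sample["attention_mask"]),
        out_path,
        input_names=["input_ids", "attention_mask"],
        output_names=names,
        dynamic_axes=axes,
        opset_version=17,  # assumption: any opset supported by the target runtime
    )
```

The `hidden_state_<N>` naming is chosen to match the output names used in the Path C example. Disabling `use_cache` keeps the sketch simple, at the cost of leaving past key value handling unaddressed.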
**Pros**: Clean separation of concerns; each sub-model can be optimized independently.

**Cons**:

- Requires PyTorch for the initial export (but not at runtime)
- Significant custom code to manage multiple ONNX sub-models
- Past key value caching becomes much more complex with sub-models
- Not supported by `optimum` CLI or ORTModel classes

#### Path C: Direct ONNX Runtime with Modified Graph (Recommended Path)

**Approach**: Combine a custom ONNX export with direct `onnxruntime.InferenceSession` usage, bypassing `ORTModelForCausalLM` entirely.

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

inputs = tokenizer("Hello world", return_tensors="np")
output_names = [o.name for o in session.get_outputs()]
# Includes: logits, past_key_values, hidden_state_1, hidden_state_2, ...

results = session.run(output_names, dict(inputs))

# Keep only the last-token activation for each monitored layer
hidden_states = {
    layer: results[output_names.index(f"hidden_state_{layer}")][:, -1, :]
    for layer in [1, 2, 4, 8]
}
```

**Pros**: Full control; no PyTorch at runtime; smallest possible footprint.

**Cons**:

- Must write and maintain custom ONNX graph modification code
- Must re-export whenever the model architecture changes
- Must validate numerical equivalence against PyTorch reference
- Bypasses the `ORTModelForCausalLM` abstraction entirely
- Past key value handling must be manual (no `generate()` support)
- This is essentially a custom inference backend, not a drop-in replacement

### Comparison with PyTorch

| Aspect | PyTorch | ONNX Runtime (Standard) | ONNX Runtime (Custom) |
|--------|---------|--------------------------|----------------------|
| `output_hidden_states=True` | ✅ Native, one flag | ❌ Not supported | ⚠️ Requires graph modification |
| Activation extraction API | `outputs.hidden_states[layer][:, -1, :]` | N/A | Manual `session.run()` with named outputs |
| Effort to implement | Minimal (built-in) | N/A | High (custom export + graph hacking) |
| Numerical accuracy | Ground truth | Must validate | Must validate against PyTorch |
| Maintenance burden | Low | N/A | High (graph names change, ONNX spec evolves) |

---

## 4. SmolLM2-135M ONNX Export

### Known Status

- **Pre-exported model exists**: `onnx-community/SmolLM2-135M-ONNX` on HuggingFace Hub
- **Architecture**: LLaMA family, which is well-supported by `optimum` ONNX export
- **Export method**: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
- **Model card**: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction

### Export Configuration

The LLaMA architecture maps to `optimum`'s `LlamaOnnxConfig` (SmolLM2 uses the LLaMA architecture). The standard export produces:

- `decoder_model.onnx`: for the initial forward pass (no past key values)
- `decoder_with_past_model.onnx`: for subsequent generation steps (with past key values)
- Or `decoder_model_merged.onnx`: a combined model with conditional branching

### Known Issues

1. **Hidden states not in standard export**: The default `optimum` export for causal LMs does not include intermediate hidden states as outputs. This is by design: the export configuration only specifies logits and past key values as outputs.
2. **Merged decoder complexity**: The merged decoder model uses a `use_cache_branch` flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.
3. **Node naming stability**: Internal ONNX node names (e.g., `/model/layers.0/output_0`) may change between `optimum` versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.

---

## 5. Comparison Table

| Criteria | PyTorch (CPU-only) | ONNX Runtime (Standard) | ONNX Runtime (Custom Graph) |
|---|---|---|---|
| **Install size (download)** | ~200 MB | ~18 MB | ~18 MB |
| **Install size (disk)** | ~700 MB | ~180-200 MB | ~180-200 MB |
| **`output_hidden_states=True`** | ✅ Built-in | ❌ Not supported | ⚠️ Custom graph modification |
| **Activation extraction API** | `model(**inputs, output_hidden_states=True)` | N/A | Manual `session.run()` with named outputs |
| **Drop-in with optimum** | ✅ `AutoModelForCausalLM` | ⚠️ `ORTModelForCausalLM` but no hidden states | ❌ Must bypass ORTModel classes |
| **Past key value caching** | ✅ Automatic | ✅ Automatic via ORTModel | ❌ Must handle manually |
| **Numerical equivalence** | Ground truth | Must validate | Must validate |
| **Implementation effort** | Low (built-in) | N/A (doesn't work) | High (custom export + graph mod) |
| **Maintenance burden** | Low | N/A | High (brittle node names) |
| **Runtime performance** | Good | Better (graph-optimized) | Better (graph-optimized) |
| **CPU deployment** | ✅ Supported | ✅ Excellent | ✅ Excellent |
| **safetensors loading** | ✅ Via transformers | ✅ Via optimum | ❌ Requires separate model loading |
| **Model pinning (revision)** | ✅ Via transformers | ✅ Via optimum | ⚠️ Custom handling |
| **Offline/air-gapped** | ✅ HF Hub cache | ✅ HF Hub cache | ⚠️ Custom export files |
| **License** | BSD-3 | MIT | MIT |

---

## 6. Recommendation

### **Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.**

### Rationale

1. **The activation extraction problem is unsolved for ORTModelForCausalLM.** Issue #972 requesting `output_hidden_states` support was closed as "not planned" by the `optimum` team. This means the standard, supported path does not work for alknet-firewall's core requirement.

2. **Custom ONNX graph modification is a significant engineering effort** with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across `optimum` version updates.

3. **The install-size advantage is real but not decisive.** While `onnxruntime` (~180 MB installed) is significantly smaller than `torch` CPU-only (~700 MB installed), the difference is manageable:
   - The model weights (269 MB for SmolLM2-135M) are needed on both paths and are larger than the entire `onnxruntime` install
   - The total installed size for PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
   - The total installed size for ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
   - Savings: ~500 MB, which is meaningful but not transformative

4. **PyTorch is already optional.** ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.

5. **The `DetectorModel` protocol already accommodates multiple backends.** The architecture is designed for this:

   ```python
   from typing import Protocol

   import numpy as np

   class DetectorModel(Protocol):
       def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
   ```

   Adding an `ONNXDetectorModel` implementation in Phase 2 is a clean extension.

### Phase 2 Plan

When ONNX Runtime support is added in Phase 2, the recommended approach is:

1. **Create a custom ONNX export pipeline** that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
2. **Store the custom-exported model** on HuggingFace Hub (e.g., `alknet/smollm2-135m-onnx-activations`) with the modified graph
3. **Use `onnxruntime.InferenceSession` directly** (bypassing `ORTModelForCausalLM`) for inference, requesting the hidden state outputs by name
4. **Validate numerical equivalence** against the PyTorch reference implementation at each model version
5. **Pin the `optimum` version** used for the initial export to ensure node name stability

Alternatively, if `optimum` adds `output_hidden_states` support in a future version (the issue could be reopened), the implementation becomes much simpler and could use `ORTModelForCausalLM` directly.

### Phase 1 Actions

- Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
- Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
- Update `pyproject.toml` to remove the `[onnx]` extra from Phase 1 scope (or mark it as experimental/unstable)
- Ensure the `DetectorModel` protocol and `HFDetectorModel` implementation are clean enough to extend with an `ONNXDetectorModel` in Phase 2

---

## 7. References

1. **HuggingFace optimum Issue #972**: "Add output of `output_hidden_states` for onnx model export", https://github.com/huggingface/optimum/issues/972. Closed as "not planned". The key issue documenting the lack of hidden state output support.
2. **ONNX Runtime InferenceSession API**: https://onnxruntime.ai/docs/api/python/api_summary.html. Documents that `session.run()` can only return values declared as graph outputs.
3. **sklearn-onnx intermediate outputs**: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html. Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
4. **Stack Overflow: Extract intermediate layer outputs from ONNX**: https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode. Shows the approach of adding `ValueInfoProto` to `model.graph.output` to expose intermediate values.
5. **optimum-onnx GitHub**: https://github.com/huggingface/optimum-onnx. The ONNX integration library for HuggingFace models.
6. **ORTModelForCausalLM documentation**: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort. Documents the `forward()` method; notably absent is an `output_hidden_states` parameter.
7. **SmolLM2-135M ONNX on HuggingFace Hub**: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX. Pre-exported ONNX version of SmolLM2-135M.
8. **optimum ONNX export documentation**: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model. Documents the export process and configuration.
9. **DeepWiki: ORTModelForCausalLM text generation models**: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models. Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.
10. **DeepWiki: ONNX Model Export**: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export. Documents the export system architecture, validation, and graph transformations.
11. **ONNX Runtime performance**: https://onnxruntime.ai/docs/performance/. Official performance documentation.
12. **OpenNN deployment size comparison**: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/. Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.
13. **onnxruntime PyPI**: https://pypi.org/project/onnxruntime/. Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.
14. **onnx-modifier**: https://github.com/ZhangGe6/onnx-modifier. Tool for modifying ONNX models, including adding intermediate outputs.
15. **ONNX graph surgery**: https://tlbvr.com/blog/onnx-graph-surgery/. Techniques for embedding custom operations in ONNX graphs.
16. **ADR-006: Optional PyTorch**: `/docs/architecture/decisions/006-optional-pytorch.md`. The ADR documenting why PyTorch is optional and the install size comparison.
17. **Model architecture doc**: `/docs/architecture/model.md`. Documents activation extraction design, `DetectorModel` protocol, and layer selection.
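
The numerical-equivalence validation called for in the Phase 2 plan can be sketched as a per-layer comparison of last-token activations. This is a minimal sketch under stated assumptions: `max_abs_diff`, `validate_equivalence`, and the `1e-4` tolerance are hypothetical, and each activation is taken as a flat sequence of floats (for a NumPy array, something like `array.ravel().tolist()`).

```python
from typing import Mapping, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def validate_equivalence(
    reference: Mapping[int, Sequence[float]],
    candidate: Mapping[int, Sequence[float]],
    atol: float = 1e-4,  # assumed tolerance; tune against real measurements
) -> dict[int, float]:
    """Return {layer: max_abs_diff} for every layer exceeding atol (empty means pass)."""
    if reference.keys() != candidate.keys():
        raise KeyError("backends returned different layer sets")
    diffs = {
        layer: max_abs_diff(reference[layer], candidate[layer])
        for layer in reference
    }
    return {layer: diff for layer, diff in diffs.items() if diff > atol}
```

Here `reference` would come from the PyTorch backend and `candidate` from the ONNX backend, both keyed by layer index as in the `DetectorModel.infer` return type. Returning the offending layers rather than a bare boolean makes it easier to see which layer drifted after a re-export.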