alknet-firewall/docs/research/onnx-inference-backend/feasibility-analysis.md

# Research: ONNX Runtime as Inference Backend for alknet-firewall

**Date**: 2026-06-13
**Question**: Should ONNX Runtime be a supported inference backend in Phase 1?
**Status**: Open question OQ-01

## Executive Summary

**ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2.** The core challenge is that ONNX Runtime's standard inference pipeline does not expose intermediate-layer hidden states, which are the data alknet-firewall needs for activation-based detection. There is a workable path (modifying the ONNX graph to declare intermediate outputs), but it requires significantly more engineering, testing, and maintenance than the PyTorch path, where `output_hidden_states=True` is a single flag. The install-size advantage is real (~180 MB vs ~700 MB installed for CPU-only torch), but it is not decisive for Phase 1 while activation extraction has no supported solution in `optimum`'s ONNX Runtime integration.

---

## 1. ONNX Runtime Overview

### What It Is

ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference — no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.

### Install Footprint

| Package | Wheel Size | Installed Size | Notes |
|---------|-----------|---------------|-------|
| `onnxruntime` (CPU) | ~18 MB | ~180-200 MB | Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings |
| `torch` (CPU-only) | ~200 MB | ~700 MB | libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB) |
| `torch` (CUDA) | ~2.5 GB | ~5+ GB | Default `pip install torch` since PyTorch 2.11 |
| `optimum[onnxruntime]` | ~5 MB | ~20 MB | Python wrapper; depends on onnxruntime + transformers |

**Sources**: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB. PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).

**Revised claim**: The ADR-006 claim of "onnxruntime: ~30-50MB download, ~300MB installed" overstates both figures: the measured wheel is ~18 MB and the installed size is closer to 180-200 MB. The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.
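
The installed-size figures above can be re-checked on any machine. The following is a minimal sketch using only the standard library; the helper name `installed_size_bytes` is hypothetical, and it counts only the files recorded for the named distribution itself, not its dependencies, so results will differ somewhat from a whole-environment measurement.

```python
from importlib import metadata
from typing import Optional


def installed_size_bytes(dist_name: str) -> Optional[int]:
    """Sum the on-disk size of every file recorded for an installed distribution.

    Returns None if the distribution is not installed or has no file record.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:  # the distribution ships no RECORD-style file list
        return None
    total = 0
    for entry in files:
        path = entry.locate()
        if path.is_file():  # skip entries that no longer exist on disk
            total += path.stat().st_size
    return total


# Report sizes for the packages discussed in the table (if present locally).
for name in ("onnxruntime", "torch", "optimum"):
    size = installed_size_bytes(name)
    label = "not installed" if size is None else f"{size / 1e6:.0f} MB"
    print(f"{name}: {label}")
```

Running this inside a fresh virtual environment for each backend gives a like-for-like comparison with the table.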

### Performance Characteristics

- **CPU inference**: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
- **Warm start**: ORT session creation has overhead (~100ms-1s depending on model), but inference calls are fast
- **Memory**: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
- **Thread scaling**: Good multi-threaded CPU performance via ORT's own intra-op thread pool and MLAS kernels (default builds no longer rely on OpenMP)

### CPU Deployment Story

ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:
- No CUDA/GPU dependency
- Cross-platform (Linux, macOS, Windows, ARM)
- Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
- Well-suited for containerized and embedded deployments

---

## 2. HuggingFace Optimum Integration

### How Optimum Works

HuggingFace's `optimum-onnx` (formerly `optimum[onnxruntime]`) provides drop-in replacement classes for HuggingFace transformers models:

```python
# PyTorch path
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)
# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)
```

### Export Process

The ONNX export can be done via:
1. **CLI**: `optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/`
2. **Programmatic**: `ORTModelForCausalLM.from_pretrained("...", export=True)`
3. **Pre-exported**: Use existing ONNX models from `onnx-community/` on HuggingFace Hub

For causal LMs, the export produces:
- A **decoder model** (with or without past key values)
- Optionally a **merged decoder** combining initial pass and cached pass into one model

### Model Compatibility

SmolLM2-135M uses the LLaMA architecture. The `optimum` ONNX export supports LLaMA-family models:

| Architecture | Export Support | ORTModelForCausalLM Support |
|---|---|---|
| `llama` (SmolLM2) | ✓ Supported | ✓ Supported |
| `gpt2` | ✓ Supported | ✓ Supported |
| `bloom` | ✓ Supported | ✓ Supported |
| `mistral` | ✓ Supported | ✓ Supported |

**Pre-exported model available**: `onnx-community/SmolLM2-135M-ONNX` exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.

---

## 3. Activation Extraction Feasibility ⚠️ CRITICAL

This is the **make-or-break question** for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:

```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}
```

### The Problem

**ORTModelForCausalLM does NOT support `output_hidden_states`.** This is confirmed by:

1. **GitHub Issue #972** on `huggingface/optimum`: "Add output of `output_hidden_states` for onnx model export" — filed April 2023, **closed as "not planned"**. The request was to add hidden state outputs to the ONNX export for `ORTModelForCausalLM`, noting that the merged decoder only outputs logits + past key/values.

2. **ORTModelForCausalLM.forward() documentation**: The `forward()` method signature includes `input_ids`, `attention_mask`, `past_key_values`, `position_ids`, `use_cache`, and `**kwargs` — but **no `output_hidden_states` parameter**. The return type is logits + past key values only.

3. **ONNX graph structure**: Standard ONNX exports of causal LMs define outputs as `logits` and `past_key_values`. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.

### Why This Is Hard

ONNX is a **static graph format**. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set `output_hidden_states=True` at runtime, ONNX requires the graph to explicitly include those output connections.

The `sklearn-onnx` documentation explicitly states: *"There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."*
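
To make the static-graph constraint concrete, here is a toy, dependency-free model of it. This is not the ONNX or ONNX Runtime API; `ToyGraph` and its node names are hypothetical and exist only to illustrate why an intermediate value that is computed internally still cannot be requested until it is declared as an output.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyGraph:
    """A toy static graph: ordered nodes plus an explicit list of declared outputs."""

    # Each node is (output_name, function, input_name).
    nodes: list[tuple[str, Callable[[float], float], str]]
    outputs: list[str] = field(default_factory=list)

    def run(self, requested: list[str], feeds: dict[str, float]) -> list[float]:
        # Like a static-graph runtime, refuse names that are not declared
        # outputs, even though the values are computed internally.
        undeclared = [name for name in requested if name not in self.outputs]
        if undeclared:
            raise ValueError(f"not declared as graph outputs: {undeclared}")
        values = dict(feeds)
        for out_name, fn, in_name in self.nodes:
            values[out_name] = fn(values[in_name])
        return [values[name] for name in requested]


graph = ToyGraph(
    nodes=[
        ("layer_0", lambda x: x + 1.0, "input"),
        ("layer_1", lambda x: x * 2.0, "layer_0"),
        ("logits", lambda x: x - 3.0, "layer_1"),
    ],
    outputs=["logits"],
)

print(graph.run(["logits"], {"input": 1.0}))  # [1.0]

try:
    graph.run(["layer_1"], {"input": 1.0})
except ValueError as exc:
    print("rejected:", exc)

# Declaring the intermediate value as an output is what graph modification does.
graph.outputs.append("layer_1")
print(graph.run(["logits", "layer_1"], {"input": 1.0}))  # [1.0, 4.0]
```

The workable paths below all amount to the last step: changing the declared outputs before the graph reaches the runtime.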

### Workable Paths (All Require Extra Engineering)

#### Path A: Custom ONNX Export with Hidden State Outputs

**Approach**: Post-process the standard exported ONNX graph so that the hidden states of selected intermediate layers are declared as graph outputs.

```python
import onnx

LAYERS = [1, 2, 4, 8]

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Collect every tensor name produced by a node so we can fail loudly
# if an expected name is absent from this particular export.
produced = {name for node in model.graph.node for name in node.output}

# ASSUMPTION: layer outputs follow a pattern like "/model/layers.0/output_0".
# Names vary by exporter version; inspect the graph to confirm them.
for layer_idx in LAYERS:
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    if intermediate_name not in produced:
        raise KeyError(f"{intermediate_name} not found in graph")
    # Declare the existing tensor as a graph output (dtype assumes an fp32 export)
    model.graph.output.append(
        onnx.helper.make_tensor_value_info(
            intermediate_name,
            onnx.TensorProto.FLOAT,
            ["batch", "seq_len", "hidden_dim"],
        )
    )

onnx.checker.check_model(model)
onnx.save(model, "model_with_hidden_states.onnx")
```

Then use `onnxruntime.InferenceSession` directly (not through `ORTModelForCausalLM`) to request these outputs:

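```python
import numpy as np
import onnxruntime

session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")

# Placeholder inputs; in practice these come from the tokenizer (int64 arrays).
# Some exports also require position_ids and past key value inputs.
input_ids = np.array([[1, 2, 3]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

hidden_names = [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
outputs = session.run(
    ["logits", *hidden_names],
    {"input_ids": input_ids, "attention_mask": attention_mask},
)
```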

**Pros**: Works with standard ONNX Runtime; no PyTorch dependency at inference time.
**Cons**:
- Requires careful ONNX graph manipulation (naming conventions vary by export version)
- Must validate that intermediate node names are stable across export runs
- Must handle the merged decoder model correctly (past key values branch)
- Loss of `ORTModelForCausalLM` convenience (manual session management, no `generate()`, no caching)
- Must discover intermediate node names via `onnx` library inspection
- Graph modifications may invalidate ONNX Runtime optimizations
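
Because node names are the fragile part of this path, name discovery is best done with a check that fails loudly rather than silently picking the wrong tensor. A minimal sketch follows; the function name `resolve_layer_outputs` and the default pattern are assumptions for illustration, and in a real pipeline `tensor_names` would be gathered from the loaded graph (for example from each node's `output` list) rather than synthesized.

```python
import re


def resolve_layer_outputs(
    tensor_names: list[str],
    layers: list[int],
    pattern: str = r"^/model/layers\.(\d+)/output_0$",
) -> dict[int, str]:
    """Map each requested layer index to exactly one tensor name, or raise."""
    regex = re.compile(pattern)
    found: dict[int, list[str]] = {}
    for name in tensor_names:
        match = regex.match(name)
        if match:
            found.setdefault(int(match.group(1)), []).append(name)

    resolved: dict[int, str] = {}
    for layer in layers:
        candidates = found.get(layer, [])
        if len(candidates) != 1:
            # Zero matches means the naming changed; several means the
            # pattern is too loose. Either way, refuse to guess.
            raise LookupError(
                f"layer {layer}: expected exactly one match, got {candidates}"
            )
        resolved[layer] = candidates[0]
    return resolved


# Synthetic names standing in for a real graph's tensor names.
names = [f"/model/layers.{i}/output_0" for i in range(10)] + ["logits"]
print(resolve_layer_outputs(names, [1, 2, 4, 8]))
```

If a new exporter version renames the tensors, this raises at export time instead of producing a model with missing or mismatched activations.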

#### Path B: Separate Encoder-Style ONNX Export

**Approach**: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers.

This would require writing a custom `torch.onnx.export` call that traces the model with `output_hidden_states=True` and captures the intermediate outputs.

**Pros**: Clean separation of concerns; each sub-model can be optimized independently.
**Cons**:
- Requires PyTorch for the initial export (but not at runtime)
- Significant custom code to manage multiple ONNX sub-models
- Past key value caching becomes much more complex with sub-models
- Not supported by `optimum` CLI or ORTModel classes

#### Path C: Direct ONNX Runtime with Modified Graph (Recommended Path)

**Approach**: Combine a custom ONNX export with direct `onnxruntime.InferenceSession` usage, bypassing `ORTModelForCausalLM` entirely.

```python
import onnxruntime as ort
from transformers import AutoTokenizer

LAYERS = [1, 2, 4, 8]

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

inputs = tokenizer("Hello world", return_tensors="np")
output_names = [o.name for o in session.get_outputs()]
# Expected to include: logits, past key values, hidden_state_1, hidden_state_2, ...
# (the hidden_state_N names are whatever the custom export assigned)

# Some exports declare extra inputs (e.g. position_ids, past key values);
# those must be added to the feed as well.
results = dict(zip(output_names, session.run(output_names, dict(inputs))))

# Keep only the last-token activation for each layer of interest
hidden_states = {
    layer: results[f"hidden_state_{layer}"][:, -1, :]
    for layer in LAYERS
}
```

**Pros**: Full control; no PyTorch at runtime; smallest possible footprint.
**Cons**:
- Must write and maintain custom ONNX graph modification code
- Must re-export whenever the model architecture changes
- Must validate numerical equivalence against PyTorch reference
- Bypasses the `ORTModelForCausalLM` abstraction entirely
- Past key value handling must be manual (no generate() support)
- This is essentially a custom inference backend, not a drop-in replacement

### Comparison with PyTorch

| Aspect | PyTorch | ONNX Runtime (Standard) | ONNX Runtime (Custom) |
|--------|---------|--------------------------|----------------------|
| `output_hidden_states=True` | ✅ Native, one flag | ❌ Not supported | ⚠️ Requires graph modification |
| Activation extraction API | `outputs.hidden_states[layer][:, -1, :]` | N/A | Manual `session.run()` with named outputs |
| Effort to implement | Minimal (built-in) | N/A | High (custom export + graph hacking) |
| Numerical accuracy | Ground truth | Must validate | Must validate against PyTorch |
| Maintenance burden | Low | N/A | High (graph names change, ONNX spec evolves) |

---

## 4. SmolLM2-135M ONNX Export

### Known Status

- **Pre-exported model exists**: `onnx-community/SmolLM2-135M-ONNX` on HuggingFace Hub
- **Architecture**: LLaMA family, which is well-supported by `optimum` ONNX export
- **Export method**: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
- **Model card**: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction

### Export Configuration

SmolLM2 uses the LLaMA architecture, which maps to `optimum`'s `LlamaOnnxConfig`. The standard export produces:

- `decoder_model.onnx` — for initial forward pass (no past key values)
- `decoder_with_past_model.onnx` — for subsequent generation steps (with past key values)
- Or `decoder_model_merged.onnx` — combined model with conditional branching

### Known Issues

1. **Hidden states not in standard export**: The default `optimum` export for causal LMs does not include intermediate hidden states as outputs. This is by design — the export configuration only specifies logits and past key values as outputs.

2. **Merged decoder complexity**: The merged decoder model uses a `use_cache_branch` flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.

3. **Node naming stability**: Internal ONNX node names (e.g., `/model/layers.0/output_0`) may change between `optimum` versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.
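
One way to contain the naming-stability risk is to record what the export produced and verify it at load time. The sketch below uses only the standard library; the manifest format, the `.manifest.json` file name, and the `x.y.z` version string are hypothetical placeholders, not an existing `optimum` or ONNX Runtime feature.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_manifest(model_path: Path, output_names: list[str], exporter_version: str) -> Path:
    """Record the model hash, expected output names, and exporter version."""
    manifest = {
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
        "output_names": sorted(output_names),
        "exporter_version": exporter_version,
    }
    manifest_path = model_path.with_suffix(".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def verify_manifest(model_path: Path, actual_output_names: list[str]) -> None:
    """Raise if the model file or its declared outputs differ from the manifest."""
    manifest = json.loads(model_path.with_suffix(".manifest.json").read_text())
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    if digest != manifest["sha256"]:
        raise ValueError("model file does not match recorded sha256")
    missing = set(manifest["output_names"]) - set(actual_output_names)
    if missing:
        raise ValueError(f"expected outputs missing from graph: {sorted(missing)}")


# Demonstration with placeholder bytes standing in for a real ONNX file.
drift_detected = False
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.onnx"
    model_file.write_bytes(b"placeholder bytes standing in for an ONNX file")
    expected = ["logits", "/model/layers.1/output_0"]
    write_manifest(model_file, expected, exporter_version="x.y.z")

    verify_manifest(model_file, expected)  # passes silently
    try:
        verify_manifest(model_file, ["logits"])  # a layer output disappeared
    except ValueError as exc:
        drift_detected = True
        print("detected drift:", exc)
```

At load time, `actual_output_names` would come from the session's declared outputs, so a re-export that silently renames a layer output is caught before any detection runs.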

---

## 5. Comparison Table

| Criteria | PyTorch (CPU-only) | ONNX Runtime (Standard) | ONNX Runtime (Custom Graph) |
|---|---|---|---|
| **Install size (download)** | ~200 MB | ~18 MB | ~18 MB |
| **Install size (disk)** | ~700 MB | ~180-200 MB | ~180-200 MB |
| **`output_hidden_states=True`** | ✅ Built-in | ❌ Not supported | ⚠️ Custom graph modification |
| **Activation extraction API** | `model(**inputs, output_hidden_states=True)` | N/A | Manual `session.run()` with named outputs |
| **Drop-in with optimum** | ✅ `AutoModelForCausalLM` | ⚠️ `ORTModelForCausalLM` but no hidden states | ❌ Must bypass ORTModel classes |
| **Past key value caching** | ✅ Automatic | ✅ Automatic via ORTModel | ❌ Must handle manually |
| **Numerical equivalence** | Ground truth | Must validate | Must validate |
| **Implementation effort** | Low (built-in) | N/A (doesn't work) | High (custom export + graph mod) |
| **Maintenance burden** | Low | N/A | High (brittle node names) |
| **Runtime performance** | Good | Better (graph-optimized) | Better (graph-optimized) |
| **CPU deployment** | ✅ Supported | ✅ Excellent | ✅ Excellent |
| **safetensors loading** | ✅ Via transformers | ✅ Via optimum | ❌ Requires separate model loading |
| **Model pinning (revision)** | ✅ Via transformers | ✅ Via optimum | ⚠️ Custom handling |
| **Offline/air-gapped** | ✅ HF Hub cache | ✅ HF Hub cache | ⚠️ Custom export files |
| **License** | BSD-3 | MIT | MIT |

---

## 6. Recommendation

### **Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.**

### Rationale

1. **The activation extraction problem is unsolved for ORTModelForCausalLM.** Issue #972 requesting `output_hidden_states` support was closed as "not planned" by the `optimum` team. This means the standard, supported path does not work for alknet-firewall's core requirement.

2. **Custom ONNX graph modification is a significant engineering effort** with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across `optimum` version updates.

3. **The install-size advantage is real but not decisive.** `onnxruntime` (~180 MB installed) is much smaller than CPU-only `torch` (~700 MB installed), but the end-to-end difference is manageable:
   - The model weights (269 MB for SmolLM2-135M) are a fixed cost on both paths and are larger than the entire `onnxruntime` install
   - Total installed size for the PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
   - Total installed size for the ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
   - Savings: ~520 MB (roughly half), which is meaningful but not transformative

4. **PyTorch is already optional.** ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.

5. **The `DetectorModel` protocol already accommodates multiple backends.** The architecture is designed for this:
   ```python
   class DetectorModel(Protocol):
       def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
   ```
   Adding an `ONNXDetectorModel` implementation in Phase 2 is a clean extension.
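
The totals in rationale item 3 can be reproduced with a few lines of arithmetic. The figures are the approximate installed sizes quoted in that item, in MB; the variable names are only for illustration.

```python
# Approximate installed sizes in MB, as quoted in rationale item 3.
MODEL_WEIGHTS = 269

pytorch_path = {"torch (CPU-only)": 700, "transformers": 50, "model weights": MODEL_WEIGHTS}
onnx_path = {"onnxruntime": 180, "optimum": 50, "model weights": MODEL_WEIGHTS}

pytorch_total = sum(pytorch_path.values())  # 1019
onnx_total = sum(onnx_path.values())        # 499
savings = pytorch_total - onnx_total        # 520

print(f"PyTorch path: {pytorch_total} MB")
print(f"ONNX path:    {onnx_total} MB")
print(f"Savings:      {savings} MB ({savings / pytorch_total:.0%} of the PyTorch path)")
```

The model weights appear in both totals, which is why the relative saving is about half rather than the roughly 4x suggested by comparing the two runtimes alone.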

### Phase 2 Plan

When ONNX Runtime support is added in Phase 2, the recommended approach is:

1. **Create a custom ONNX export pipeline** that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
2. **Store the custom-exported model** on HuggingFace Hub (e.g., `alknet/smollm2-135m-onnx-activations`) with the modified graph
3. **Use `onnxruntime.InferenceSession` directly** (bypassing `ORTModelForCausalLM`) for inference, requesting the hidden state outputs by name
4. **Validate numerical equivalence** against the PyTorch reference implementation at each model version
5. **Pin the `optimum` version** used for the initial export to ensure node name stability

Alternatively, if `optimum` adds `output_hidden_states` support in a future version (the issue could be reopened), the implementation becomes much simpler and could use `ORTModelForCausalLM` directly.
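
Step 4 of the plan (validating numerical equivalence) could start from a check like the one below. This is a dependency-free sketch operating on flat sequences of floats, such as one last-token activation vector per layer; the names `Comparison` and `compare_activations` and the tolerance defaults are assumptions, and with NumPy available `numpy.allclose` would serve the same purpose.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Comparison:
    max_abs_diff: float
    cosine_similarity: float
    within_tolerance: bool


def compare_activations(
    reference: Sequence[float],
    candidate: Sequence[float],
    atol: float = 1e-4,
    rtol: float = 1e-3,
) -> Comparison:
    """Compare a candidate activation vector against the PyTorch reference."""
    if len(reference) != len(candidate):
        raise ValueError("activation vectors differ in length")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # Element-wise criterion: |r - c| <= atol + rtol * |r|
    within = all(d <= atol + rtol * abs(r) for d, r in zip(diffs, reference))
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(
        sum(c * c for c in candidate)
    )
    cosine = dot / norm if norm else float("nan")
    return Comparison(max(diffs, default=0.0), cosine, within)


# Illustrative values only, not measured activations.
reference = [0.10, -0.25, 0.40]
candidate = [0.10001, -0.25002, 0.39999]
result = compare_activations(reference, candidate)
print(result.within_tolerance)  # True
```

Running this per layer and per test prompt at each model or exporter version gives a simple regression gate; appropriate tolerances would need to be chosen empirically, especially if the ONNX model is quantized.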

### Phase 1 Actions

- Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
- Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
- Update `pyproject.toml` to remove the `[onnx]` extra from Phase 1 scope (or mark it as experimental/unstable)
- Ensure the `DetectorModel` protocol and `HFDetectorModel` implementation are clean enough to extend with an `ONNXDetectorModel` in Phase 2

---

## 7. References

1. **HuggingFace optimum Issue #972**: "Add output of `output_hidden_states` for onnx model export" — https://github.com/huggingface/optimum/issues/972 — Closed as "not planned". The key issue documenting the lack of hidden state output support.

2. **ONNX Runtime InferenceSession API**: https://onnxruntime.ai/docs/api/python/api_summary.html — Documents that `session.run()` can only return values declared as graph outputs.

3. **sklearn-onnx intermediate outputs**: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html — Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."

4. **Stack Overflow: Extract intermediate layer outputs from ONNX**: https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode — Shows the approach of adding `ValueInfoProto` to `model.graph.output` to expose intermediate values.

5. **optimum-onnx GitHub**: https://github.com/huggingface/optimum-onnx — The ONNX integration library for HuggingFace models.

6. **ORTModelForCausalLM documentation**: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort — Documents the `forward()` method; notably absent is `output_hidden_states` parameter.

7. **SmolLM2-135M ONNX on HuggingFace Hub**: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX — Pre-exported ONNX version of SmolLM2-135M.

8. **optimum ONNX export documentation**: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model — Documents the export process and configuration.

9. **DeepWiki: ORTModelForCausalLM text generation models**: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models — Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.

10. **DeepWiki: ONNX Model Export**: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export — Documents the export system architecture, validation, and graph transformations.

11. **ONNX Runtime performance**: https://onnxruntime.ai/docs/performance/ — Official performance documentation.

12. **OpenNN deployment size comparison**: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/ — Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.

13. **onnxruntime PyPI**: https://pypi.org/project/onnxruntime/ — Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.

14. **onnx-modifier**: https://github.com/ZhangGe6/onnx-modifier — Tool for modifying ONNX models, including adding intermediate outputs.

15. **ONNX graph surgery**: https://tlbvr.com/blog/onnx-graph-surgery/ — Techniques for embedding custom operations in ONNX graphs.

16. **ADR-006: Optional PyTorch**: `/docs/architecture/decisions/006-optional-pytorch.md` — The ADR documenting why PyTorch is optional and the install size comparison.

17. **Model architecture doc**: `/docs/architecture/model.md` — Documents activation extraction design, `DetectorModel` protocol, and layer selection.