alknet-firewall/docs/research/onnx-inference-backend/feasibility-analysis.md
glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure
Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.
2026-06-13 07:27:40 +00:00


Research: ONNX Runtime as Inference Backend for alknet-firewall

Date: 2026-06-13
Question: Should ONNX Runtime be a supported inference backend in Phase 1?
Status: Open question OQ-01

Executive Summary

ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2. The core challenge is that ONNX Runtime's standard inference pipeline does not natively expose intermediate layer hidden states — the critical data alknet-firewall needs for activation-based detection. While there is a workable path (custom ONNX graph modification to add intermediate outputs), it requires significant additional engineering, testing, and maintenance compared to the PyTorch path where output_hidden_states=True is a single flag. The install-size advantage is real (~180MB vs ~700MB for CPU-only torch), but not decisive for Phase 1 when the activation extraction problem is unsolved in the ONNX ecosystem.


1. ONNX Runtime Overview

What It Is

ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference — no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.

Install Footprint

| Package | Wheel Size | Installed Size | Notes |
|---|---|---|---|
| onnxruntime (CPU) | ~18 MB | ~180-200 MB | Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings |
| torch (CPU-only) | ~200 MB | ~700 MB | libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB) |
| torch (CUDA) | ~2.5 GB | ~5+ GB | Default pip install torch since PyTorch 2.11 |
| optimum[onnxruntime] | ~5 MB | ~20 MB | Python wrapper; depends on onnxruntime + transformers |

Sources: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB. PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).

Revised claim: The ADR-006 claim of "onnxruntime: ~30-50MB download, ~300MB installed" overstates both figures. The measured wheel is about 18 MB, and the installed size is closer to 180-200 MB (not 300 MB). The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.
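The installed-size figures above can be re-measured on any machine. The following is a minimal sketch using only the standard library; the helper name `installed_size_mb` is hypothetical, and it counts only files listed in the distribution's install record, so it may under-report packages that unpack extra files at runtime.

```python
from importlib import metadata
from pathlib import Path
from typing import Optional


def installed_size_mb(dist_name: str) -> Optional[float]:
    """Sum the on-disk size of every file a distribution recorded at install time.

    Returns None when the distribution is not installed or has no file record.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:  # install record missing, nothing to measure
        return None
    total_bytes = 0
    for entry in files:
        path = Path(entry.locate())
        if path.is_file():
            total_bytes += path.stat().st_size
    return total_bytes / 1_000_000


if __name__ == "__main__":
    for name in ("onnxruntime", "torch", "optimum"):
        size = installed_size_mb(name)
        print(name, "not installed" if size is None else f"{size:.0f} MB")
```

Running this in a fresh virtual environment per backend keeps shared dependencies (for example numpy) from being attributed to the wrong package.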

Performance Characteristics

  • CPU inference: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
  • Startup cost: ORT session creation has one-time overhead (~100ms-1s depending on model); inference calls after that are fast
  • Memory: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
  • Thread scaling: Good multi-threaded CPU performance via ORT's intra-op thread pool and MLAS kernels

CPU Deployment Story

ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:

  • No CUDA/GPU dependency
  • Cross-platform (Linux, macOS, Windows, ARM)
  • Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
  • Well-suited for containerized and embedded deployments

2. HuggingFace Optimum Integration

How Optimum Works

HuggingFace's optimum-onnx (formerly optimum[onnxruntime]) provides drop-in replacement classes for HuggingFace transformers models:

# PyTorch path
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)
# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)

Export Process

The ONNX export can be done via:

  1. CLI: optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/
  2. Programmatic: ORTModelForCausalLM.from_pretrained("...", export=True)
  3. Pre-exported: Use existing ONNX models from onnx-community/ on HuggingFace Hub

For causal LMs, the export produces:

  • A decoder model (with or without past key values)
  • Optionally a merged decoder combining initial pass and cached pass into one model

Model Compatibility

SmolLM2-135M uses the LLaMA architecture. The optimum ONNX export supports LLaMA-family models:

| Architecture | Export Support | ORTModelForCausalLM Support |
|---|---|---|
| llama (SmolLM2) | ✓ Supported | ✓ Supported |
| gpt2 | ✓ Supported | ✓ Supported |
| bloom | ✓ Supported | ✓ Supported |
| mistral | ✓ Supported | ✓ Supported |

Pre-exported model available: onnx-community/SmolLM2-135M-ONNX exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.


3. Activation Extraction Feasibility ⚠️ CRITICAL

This is the make-or-break question for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:

outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}

The Problem

ORTModelForCausalLM does NOT support output_hidden_states. This is confirmed by:

  1. GitHub Issue #972 on huggingface/optimum: "Add output of output_hidden_states for onnx model export" — filed April 2023, closed as "not planned". The request was to add hidden state outputs to the ONNX export for ORTModelForCausalLM, noting that the merged decoder only outputs logits + past key/values.

  2. ORTModelForCausalLM.forward() documentation: The forward() method signature includes input_ids, attention_mask, past_key_values, position_ids, use_cache, and **kwargs — but no output_hidden_states parameter. The return type is logits + past key values only.

  3. ONNX graph structure: Standard ONNX exports of causal LMs define outputs as logits and past_key_values. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.

Why This Is Hard

ONNX is a static graph format. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set output_hidden_states=True at runtime, ONNX requires the graph to explicitly include those output connections.

The sklearn-onnx documentation explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."

Workable Paths (All Require Extra Engineering)

Path A: Custom ONNX Export with Hidden State Outputs

Approach: Post-process the exported ONNX graph so that intermediate layer hidden states are declared as graph outputs.

import onnx
from onnx import TensorProto, helper

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Collect every tensor name produced by a node so a wrong name fails loudly
produced = {name for node in model.graph.node for name in node.output}

# Expose the chosen intermediate tensors as graph outputs.
# The name pattern below is illustrative; inspect the graph to find the real
# per-layer output names, which vary by exporter version.
for layer_idx in [1, 2, 4, 8]:
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    if intermediate_name not in produced:
        raise KeyError(f"{intermediate_name} is not produced by any node")
    model.graph.output.append(
        helper.make_tensor_value_info(
            intermediate_name,
            TensorProto.FLOAT,  # must match the tensor's actual dtype
            ["batch", "seq_len", "hidden_dim"],
        )
    )

onnx.save(model, "model_with_hidden_states.onnx")

Then use onnxruntime.InferenceSession directly (not through ORTModelForCausalLM) to request these outputs:

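import onnxruntime

session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")
# Request logits plus the hidden-state outputs added to the graph
output_names = ["logits"] + [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
# input_ids and attention_mask are int64 NumPy arrays of shape (batch, seq_len)
outputs = session.run(
    output_names,
    {"input_ids": input_ids, "attention_mask": attention_mask},
)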

Pros: Works with standard ONNX Runtime; no PyTorch dependency at inference time.

Cons:

  • Requires careful ONNX graph manipulation (naming conventions vary by export version)
  • Must validate that intermediate node names are stable across export runs
  • Must handle the merged decoder model correctly (past key values branch)
  • Loss of ORTModelForCausalLM convenience (manual session management, no generate(), no caching)
  • Must discover intermediate node names via onnx library inspection
  • Graph modifications may invalidate ONNX Runtime optimizations
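The name-discovery step listed above can be sketched without committing to a specific exporter. The helper below is hypothetical and duck-typed: it accepts anything shaped like an `onnx.ModelProto` (a `graph.node` sequence whose nodes carry an `output` list of tensor names), and the default regex is an assumption about how layer tensors are named. It returns every tensor produced inside a layer, so choosing which one is the layer's hidden state still requires inspecting the graph.

```python
import re
from types import SimpleNamespace


def find_layer_outputs(model, layer_indices, pattern=r"/model/layers\.(\d+)/"):
    """Map each requested layer index to the tensor names produced inside it.

    Raises LookupError when a requested layer has no matching tensor, which
    catches naming changes between export runs early.
    """
    wanted = set(layer_indices)
    regex = re.compile(pattern)  # assumed naming scheme, adjust per exporter
    found = {idx: [] for idx in layer_indices}
    for node in model.graph.node:
        for name in node.output:
            match = regex.search(name)
            if match and int(match.group(1)) in wanted:
                found[int(match.group(1))].append(name)
    missing = [idx for idx, names in found.items() if not names]
    if missing:
        raise LookupError(f"no tensors found for layers {missing}")
    return found


# Stand-in for a loaded model so the sketch runs without the onnx package.
fake_model = SimpleNamespace(
    graph=SimpleNamespace(
        node=[
            SimpleNamespace(output=["/model/layers.1/Add_1_output_0"]),
            SimpleNamespace(output=["/model/layers.2/Add_1_output_0"]),
            SimpleNamespace(output=["logits"]),
        ]
    )
)
print(find_layer_outputs(fake_model, [1, 2]))
```

With a real export, `fake_model` would be replaced by the result of `onnx.load(...)`; running the same check after every re-export gives a cheap signal that node names have not drifted.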

Path B: Separate Encoder-Style ONNX Export

Approach: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers.

This would require writing a custom torch.onnx.export call that traces the model with output_hidden_states=True and captures the intermediate outputs.

Pros: Clean separation of concerns; each sub-model can be optimized independently.

Cons:

  • Requires PyTorch for the initial export (but not at runtime)
  • Significant custom code to manage multiple ONNX sub-models
  • Past key value caching becomes much more complex with sub-models
  • Not supported by optimum CLI or ORTModel classes

Path C: Custom Export with Direct InferenceSession Usage

Approach: Combine a custom ONNX export with direct onnxruntime.InferenceSession usage, bypassing ORTModelForCausalLM entirely.

import onnxruntime as ort
from transformers import AutoTokenizer

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

inputs = tokenizer("Hello world", return_tensors="np")
output_names = [o.name for o in session.get_outputs()]
# Includes: logits, past_key_values, hidden_state_1, hidden_state_2, ...

# Pair each output name with its array so lookups are by name, not position
results = dict(zip(output_names, session.run(output_names, dict(inputs))))
# Keep the last-token activation for each monitored layer
hidden_states = {
    layer_idx: results[f"hidden_state_{layer_idx}"][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}

Pros: Full control; no PyTorch at runtime; smallest possible footprint.

Cons:

  • Must write and maintain custom ONNX graph modification code
  • Must re-export whenever the model architecture changes
  • Must validate numerical equivalence against PyTorch reference
  • Bypasses the ORTModelForCausalLM abstraction entirely
  • Past key value handling must be manual (no generate() support)
  • This is essentially a custom inference backend, not a drop-in replacement
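The equivalence validation named in the list above could look like the following. This is a sketch, not the project's test harness: the function name `compare_activations` and the tolerances are assumptions, and appropriate tolerances depend on dtype and on which graph optimizations the runtime applies.

```python
import numpy as np


def compare_activations(reference, candidate, atol=1e-4, rtol=1e-3):
    """Compare per-layer activations from two backends.

    Both arguments map layer index -> array of identical shape.
    Returns {layer: max absolute difference}; raises on any mismatch.
    """
    report = {}
    for layer, ref in reference.items():
        if layer not in candidate:
            raise KeyError(f"candidate returned no activations for layer {layer}")
        ref = np.asarray(ref, dtype=np.float64)
        cand = np.asarray(candidate[layer], dtype=np.float64)
        if ref.shape != cand.shape:
            raise ValueError(
                f"layer {layer}: shape {cand.shape} != reference {ref.shape}"
            )
        report[layer] = float(np.max(np.abs(ref - cand)))
        # Tolerances are illustrative, not measured for SmolLM2-135M
        if not np.allclose(cand, ref, atol=atol, rtol=rtol):
            raise AssertionError(
                f"layer {layer}: max abs diff {report[layer]:.3g} exceeds tolerance"
            )
    return report


# Tiny synthetic example standing in for real PyTorch and ONNX outputs.
pytorch_acts = {1: np.ones((1, 4)), 2: np.zeros((1, 4))}
onnx_acts = {1: np.ones((1, 4)) + 1e-6, 2: np.zeros((1, 4))}
print(compare_activations(pytorch_acts, onnx_acts))
```

Running such a check over a fixed prompt set at every re-export turns the "must validate" item into a repeatable regression test.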

Comparison with PyTorch

| Aspect | PyTorch | ONNX Runtime (Standard) | ONNX Runtime (Custom) |
|---|---|---|---|
| output_hidden_states=True | Native, one flag | Not supported | ⚠️ Requires graph modification |
| Activation extraction API | outputs.hidden_states[layer][:, -1, :] | N/A | Manual session.run() with named outputs |
| Effort to implement | Minimal (built-in) | N/A | High (custom export + graph hacking) |
| Numerical accuracy | Ground truth | Must validate | Must validate against PyTorch |
| Maintenance burden | Low | N/A | High (graph names change, ONNX spec evolves) |

4. SmolLM2-135M ONNX Export

Known Status

  • Pre-exported model exists: onnx-community/SmolLM2-135M-ONNX on HuggingFace Hub
  • Architecture: LLaMA family, which is well-supported by optimum ONNX export
  • Export method: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
  • Model card: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction

Export Configuration

SmolLM2 uses the LLaMA architecture, which maps to optimum's LlamaOnnxConfig. The standard export produces:

  • decoder_model.onnx — for initial forward pass (no past key values)
  • decoder_with_past_model.onnx — for subsequent generation steps (with past key values)
  • Or decoder_model_merged.onnx — combined model with conditional branching

Known Issues

  1. Hidden states not in standard export: The default optimum export for causal LMs does not include intermediate hidden states as outputs. This is by design — the export configuration only specifies logits and past key values as outputs.

  2. Merged decoder complexity: The merged decoder model uses a use_cache_branch flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.

  3. Node naming stability: Internal ONNX node names (e.g., /model/layers.0/output_0) may change between optimum versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.


5. Comparison Table

| Criteria | PyTorch (CPU-only) | ONNX Runtime (Standard) | ONNX Runtime (Custom Graph) |
|---|---|---|---|
| Install size (download) | ~200 MB | ~18 MB | ~18 MB |
| Install size (disk) | ~700 MB | ~180-200 MB | ~180-200 MB |
| output_hidden_states=True | Built-in | Not supported | ⚠️ Custom graph modification |
| Activation extraction API | model(**inputs, output_hidden_states=True) | N/A | Manual session.run() with named outputs |
| Drop-in with optimum | AutoModelForCausalLM | ⚠️ ORTModelForCausalLM but no hidden states | Must bypass ORTModel classes |
| Past key value caching | Automatic | Automatic via ORTModel | Must handle manually |
| Numerical equivalence | Ground truth | Must validate | Must validate |
| Implementation effort | Low (built-in) | N/A (doesn't work) | High (custom export + graph mod) |
| Maintenance burden | Low | N/A | High (brittle node names) |
| Runtime performance | Good | Better (graph-optimized) | Better (graph-optimized) |
| CPU deployment | Supported | Excellent | Excellent |
| safetensors loading | Via transformers | Via optimum | Requires separate model loading |
| Model pinning (revision) | Via transformers | Via optimum | ⚠️ Custom handling |
| Offline/air-gapped | HF Hub cache | HF Hub cache | ⚠️ Custom export files |
| License | BSD-3 | MIT | MIT |

6. Recommendation

Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.

Rationale

  1. The activation extraction problem is unsolved for ORTModelForCausalLM. Issue #972 requesting output_hidden_states support was closed as "not planned" by the optimum team. This means the standard, supported path does not work for alknet-firewall's core requirement.

  2. Custom ONNX graph modification is a significant engineering effort with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across optimum version updates.

  3. The install-size advantage is real but not decisive. While onnxruntime (~180 MB installed) is significantly smaller than torch CPU-only (~700 MB installed), the difference is manageable:

    • The model weights (269 MB for SmolLM2-135M) are a fixed cost on both paths, which dilutes the relative savings
    • The total installed size for PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
    • The total installed size for ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
    • Savings: ~500 MB, which is meaningful but not transformative
  4. PyTorch is already optional. ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.

  5. The DetectorModel protocol already accommodates multiple backends. The architecture is designed for this:

    class DetectorModel(Protocol):
        def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    

    Adding an ONNXDetectorModel implementation in Phase 2 is a clean extension.
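As a rough illustration of that extension, the sketch below shows what such an adapter might look like. Everything beyond the `DetectorModel` protocol is an assumption: the constructor arguments, the output names, and the two-input feed. Real exports often also require `position_ids` or past key value inputs. The session only needs a `run(output_names, input_feed)` method, as `onnxruntime.InferenceSession` provides, so a fake stands in for it here.

```python
from typing import Any, Protocol

import numpy as np


class DetectorModel(Protocol):
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class ONNXDetectorModel:
    """Hypothetical adapter over a session whose graph exposes hidden states."""

    def __init__(self, session: Any, layer_outputs: dict[int, str]):
        self._session = session
        # Maps layer index -> graph output name, e.g. {1: "hidden_state_1"}
        self._layer_outputs = dict(layer_outputs)

    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]:
        ids = np.asarray([input_ids], dtype=np.int64)  # shape (1, seq_len)
        feed = {"input_ids": ids, "attention_mask": np.ones_like(ids)}
        names = list(self._layer_outputs.values())
        results = self._session.run(names, feed)
        # Keep the last-token activation per layer, matching the PyTorch path
        return {
            layer: out[:, -1, :]
            for layer, out in zip(self._layer_outputs, results)
        }


class FakeSession:
    """Stand-in for onnxruntime.InferenceSession so the sketch runs anywhere."""

    def run(self, output_names, input_feed):
        seq_len = input_feed["input_ids"].shape[1]
        block = np.arange(seq_len * 3, dtype=np.float32).reshape(1, seq_len, 3)
        return [block.copy() for _ in output_names]


detector: DetectorModel = ONNXDetectorModel(
    FakeSession(), {1: "hidden_state_1", 2: "hidden_state_2"}
)
activations = detector.infer([5, 6])
print({layer: act.shape for layer, act in activations.items()})
```

Because the adapter depends only on the `run` method, unit tests can exercise it without onnxruntime installed, which keeps the backend optional in the same way PyTorch is.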

Phase 2 Plan

When ONNX Runtime support is added in Phase 2, the recommended approach is:

  1. Create a custom ONNX export pipeline that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
  2. Store the custom-exported model on HuggingFace Hub (e.g., alknet/smollm2-135m-onnx-activations) with the modified graph
  3. Use onnxruntime.InferenceSession directly (bypassing ORTModelForCausalLM) for inference, requesting the hidden state outputs by name
  4. Validate numerical equivalence against the PyTorch reference implementation at each model version
  5. Pin the optimum version used for the initial export to ensure node name stability

Alternatively, if optimum adds output_hidden_states support in a future version (the issue could be reopened), the implementation becomes much simpler and could use ORTModelForCausalLM directly.

Phase 1 Actions

  • Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
  • Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
  • Update pyproject.toml to remove the [onnx] extra from Phase 1 scope (or mark it as experimental/unstable)
  • Ensure the DetectorModel protocol and HFDetectorModel implementation are clean enough to extend with an ONNXDetectorModel in Phase 2

7. References

  1. HuggingFace optimum Issue #972: "Add output of output_hidden_states for onnx model export" — https://github.com/huggingface/optimum/issues/972 — Closed as "not planned". The key issue documenting the lack of hidden state output support.

  2. ONNX Runtime InferenceSession API: https://onnxruntime.ai/docs/api/python/api_summary.html — Documents that session.run() can only return values declared as graph outputs.

  3. sklearn-onnx intermediate outputs: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html — Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."

  4. Stack Overflow: Extract intermediate layer outputs from ONNX: https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode — Shows the approach of adding ValueInfoProto to model.graph.output to expose intermediate values.

  5. optimum-onnx GitHub: https://github.com/huggingface/optimum-onnx — The ONNX integration library for HuggingFace models.

  6. ORTModelForCausalLM documentation: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort — Documents the forward() method; notably absent is output_hidden_states parameter.

  7. SmolLM2-135M ONNX on HuggingFace Hub: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX — Pre-exported ONNX version of SmolLM2-135M.

  8. optimum ONNX export documentation: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model — Documents the export process and configuration.

  9. DeepWiki: ORTModelForCausalLM text generation models: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models — Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.

  10. DeepWiki: ONNX Model Export: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export — Documents the export system architecture, validation, and graph transformations.

  11. ONNX Runtime performance: https://onnxruntime.ai/docs/performance/ — Official performance documentation.

  12. OpenNN deployment size comparison: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/ — Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.

  13. onnxruntime PyPI: https://pypi.org/project/onnxruntime/ — Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.

  14. onnx-modifier: https://github.com/ZhangGe6/onnx-modifier — Tool for modifying ONNX models, including adding intermediate outputs.

  15. ONNX graph surgery: https://tlbvr.com/blog/onnx-graph-surgery/ — Techniques for embedding custom operations in ONNX graphs.

  16. ADR-006: Optional PyTorch: /docs/architecture/decisions/006-optional-pytorch.md — The ADR documenting why PyTorch is optional and the install size comparison.

  17. Model architecture doc: /docs/architecture/model.md — Documents activation extraction design, DetectorModel protocol, and layer selection.