glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure

Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.

2026-06-13 07:27:40 +00:00

Research: ONNX Runtime as Inference Backend for alknet-firewall

Date: 2026-06-13
Question: Should ONNX Runtime be a supported inference backend in Phase 1?
Status: Open question OQ-01

Executive Summary

ONNX Runtime is feasible as an inference backend but should be deferred to Phase 2. The core challenge is that ONNX Runtime's standard inference pipeline does not natively expose intermediate layer hidden states — the critical data alknet-firewall needs for activation-based detection. While there is a workable path (custom ONNX graph modification to add intermediate outputs), it requires significant additional engineering, testing, and maintenance compared to the PyTorch path where output_hidden_states=True is a single flag. The install-size advantage is real (~180MB vs ~700MB for CPU-only torch), but not decisive for Phase 1 when the activation extraction problem is unsolved in the ONNX ecosystem.

1. ONNX Runtime Overview

What It Is

ONNX Runtime (ORT) is Microsoft's cross-platform, high-performance inference engine for ONNX (Open Neural Network Exchange) format models. It is purpose-built for inference — no training, no autograd, no JIT compiler. This focus makes it significantly smaller and faster to load than PyTorch.

Install Footprint

Package	Wheel Size	Installed Size	Notes
`onnxruntime` (CPU)	~18 MB	~180-200 MB	Measured from onnxruntime 1.26.0 PyPI wheel; includes libonnxruntime.so (~22 MB) plus Python bindings
`torch` (CPU-only)	~200 MB	~700 MB	libtorch_cpu.so ~442 MB; pip default since 2.11 ships CUDA wheels (~2.5 GB)
`torch` (CUDA)	~2.5 GB	~5+ GB	Default `pip install torch` since PyTorch 2.11
`optimum[onnxruntime]`	~5 MB	~20 MB	Python wrapper; depends on onnxruntime + transformers

Sources: onnxruntime 1.26.0 PyPI wheel for Linux x86_64 is 18.2 MB. The libonnxruntime.so shared library is 22.0 MB. PyTorch CPU libtorch_cpu.so is 441.8 MB per download.pytorch.org (measured 2026-06-07 by OpenNN benchmarks).
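
Installed-size figures like these can be re-measured locally. The following is a minimal sketch, assuming the package was installed by pip so that its recorded file list is available. `total_size_bytes` and `installed_size_mb` are hypothetical helper names, not part of any library.

```python
from importlib import metadata
from pathlib import Path
from typing import Iterable


def total_size_bytes(paths: Iterable[Path]) -> int:
    """Sum the on-disk sizes of the entries in `paths` that are regular files."""
    return sum(p.stat().st_size for p in paths if p.is_file())


def installed_size_mb(dist_name: str) -> float:
    """Approximate installed size of a distribution in MB (10**6 bytes).

    Relies on the distribution's recorded file list, which can be
    missing for some installation methods.
    """
    files = metadata.files(dist_name)
    if files is None:
        raise RuntimeError(f"no file manifest recorded for {dist_name!r}")
    return total_size_bytes(Path(f.locate()) for f in files) / 1e6
```

Calling `installed_size_mb("onnxruntime")` in an environment where the package is installed gives a number comparable to the Installed Size column. Files owned by dependencies are not counted.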

Revised claim: The ADR-006 claim of "onnxruntime: ~30-50MB download, ~300MB installed" overstates both figures: the measured Linux x86_64 wheel is ~18 MB and the installed size is closer to 180-200 MB. The PyTorch CPU-only claim of "200MB download, ~700MB installed" is accurate.

Performance Characteristics

CPU inference: ORT is generally faster than PyTorch for CPU inference due to graph optimization, operator fusion, and quantization support
Warm start: ORT session creation has overhead (~100ms-1s depending on model), but inference calls are fast
Memory: Lower peak memory usage than PyTorch (no autograd graph, no gradient buffers)
Thread scaling: Good multi-threaded CPU performance via OpenMP/MLAS

CPU Deployment Story

ONNX Runtime excels at CPU deployment, which is alknet-firewall's target:

No CUDA/GPU dependency
Cross-platform (Linux, macOS, Windows, ARM)
Hardware acceleration via execution providers (Intel OpenVINO, ARM Compute Library, Apple CoreML)
Well-suited for containerized and embedded deployments
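
Execution providers are selected by name when a session is created. The sketch below picks providers from whatever a given build reports as available. The preference order is an assumption for illustration, and `choose_providers` is a hypothetical helper.

```python
from typing import Sequence

# Preference order is an assumption for illustration; the CPU provider
# is the fallback that a default onnxruntime build includes.
PREFERRED = (
    "OpenVINOExecutionProvider",
    "CoreMLExecutionProvider",
    "ACLExecutionProvider",
    "CPUExecutionProvider",
)


def choose_providers(available: Sequence[str],
                     preferred: Sequence[str] = PREFERRED) -> list[str]:
    """Return the preferred providers that are available, in preference order."""
    chosen = [p for p in preferred if p in available]
    # Always keep the CPU provider as the last-resort fallback.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

With onnxruntime installed, `available` would come from `onnxruntime.get_available_providers()`, and the result would be passed as the `providers` argument of `onnxruntime.InferenceSession`.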

2. HuggingFace Optimum Integration

How Optimum Works

HuggingFace's optimum-onnx (formerly optimum[onnxruntime]) provides drop-in replacement classes for HuggingFace transformers models:

# PyTorch path
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# ONNX Runtime path (drop-in replacement)
from optimum.onnxruntime import ORTModelForCausalLM
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/SmolLM2-135M-ONNX",
    export=False,  # Use pre-exported ONNX model
)
# OR: export on the fly from PyTorch weights
model = ORTModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M",
    export=True,  # Auto-export to ONNX at load time
)

Export Process

The ONNX export can be done via:

CLI: optimum-cli export onnx --model HuggingFaceTB/SmolLM2-135M onnx_output/
Programmatic: ORTModelForCausalLM.from_pretrained("...", export=True)
Pre-exported: Use existing ONNX models from onnx-community/ on HuggingFace Hub

For causal LMs, the export produces:

A decoder model (with or without past key values)
Optionally a merged decoder combining initial pass and cached pass into one model

Model Compatibility

SmolLM2-135M uses the LLaMA architecture. The optimum ONNX export supports LLaMA-family models:

Architecture	Export Support	ORTModelForCausalLM Support
`llama` (SmolLM2)	✓ Supported	✓ Supported
`gpt2`	✓ Supported	✓ Supported
`bloom`	✓ Supported	✓ Supported
`mistral`	✓ Supported	✓ Supported

Pre-exported model available: onnx-community/SmolLM2-135M-ONNX exists on HuggingFace Hub, confirming successful export of SmolLM2-135M to ONNX format.

3. Activation Extraction Feasibility ⚠️ CRITICAL

This is the make-or-break question for ONNX Runtime support. alknet-firewall needs hidden state activations from intermediate layers. In PyTorch, this is trivial:

outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in [1, 2, 4, 8]
}

The Problem

ORTModelForCausalLM does NOT support output_hidden_states. This is confirmed by:

GitHub Issue #972 on huggingface/optimum: "Add output of output_hidden_states for onnx model export" — filed April 2023, closed as "not planned". The request was to add hidden state outputs to the ONNX export for ORTModelForCausalLM, noting that the merged decoder only outputs logits + past key/values.
ORTModelForCausalLM.forward() documentation: The forward() method signature includes input_ids, attention_mask, past_key_values, position_ids, use_cache, and **kwargs — but no output_hidden_states parameter. The return type is logits + past key values only.
ONNX graph structure: Standard ONNX exports of causal LMs define outputs as logits and past_key_values. Hidden states at intermediate layers are not included in the graph outputs. ONNX Runtime can only return values that are declared as graph outputs.

Why This Is Hard

ONNX is a static graph format. The computation graph is defined at export time, and only declared outputs can be retrieved at inference time. Unlike PyTorch's dynamic computation where you can set output_hidden_states=True at runtime, ONNX requires the graph to explicitly include those output connections.

The sklearn-onnx documentation explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
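
The constraint can be illustrated with a toy model of a static graph. This is plain Python, not the ONNX or ONNX Runtime API. `StaticGraph` and its value names are hypothetical and exist only to show why an intermediate value must be declared as an output before it can be requested.

```python
from typing import Callable


class StaticGraph:
    """Toy stand-in for an exported graph: values exist internally,
    but only names listed in `outputs` can be requested by callers."""

    def __init__(self, nodes: list[tuple[str, Callable[[float], float]]],
                 outputs: list[str]) -> None:
        self.nodes = nodes            # (value name, function) pairs, run in order
        self.outputs = list(outputs)  # names a caller is allowed to request

    def run(self, requested: list[str], x: float) -> list[float]:
        undeclared = [n for n in requested if n not in self.outputs]
        if undeclared:
            raise KeyError(f"not declared as graph outputs: {undeclared}")
        values = {}
        for name, fn in self.nodes:
            x = fn(x)
            values[name] = x
        return [values[n] for n in requested]


graph = StaticGraph(
    nodes=[("layer_1", lambda v: v + 1.0),
           ("layer_2", lambda v: v * 2.0),
           ("logits", lambda v: v - 3.0)],
    outputs=["logits"],
)

final = graph.run(["logits"], 1.0)                    # allowed: declared output
graph.outputs.append("layer_1")                       # the "graph modification" step
with_hidden = graph.run(["logits", "layer_1"], 1.0)   # now layer_1 is retrievable
```

Requesting `layer_2` still raises `KeyError` here, because it was never declared. That mirrors the situation where every layer of interest has to be added to the graph outputs explicitly.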

Workable Paths (All Require Extra Engineering)

Path A: Custom ONNX Export with Hidden State Outputs

Approach: Modify the ONNX export configuration to include intermediate layer hidden states as graph outputs.

import onnx

# Load the standard exported ONNX model
model = onnx.load("model.onnx")

# Find the intermediate layer output names in the graph
# For LLaMA/SmolLM2, each transformer layer outputs hidden states
# Names follow patterns like: "/model/layers.0/output_0"

# Add intermediate outputs to the graph
for layer_idx in [1, 2, 4, 8]:
    # Find the node output for each layer
    intermediate_name = f"/model/layers.{layer_idx}/output_0"
    model.graph.output.append(
        onnx.helper.make_tensor_value_info(
            intermediate_name,
            onnx.TensorProto.FLOAT,
            ["batch", "seq_len", "hidden_dim"]
        )
    )

onnx.save(model, "model_with_hidden_states.onnx")

Then use onnxruntime.InferenceSession directly (not through ORTModelForCausalLM) to request these outputs:

import onnxruntime

# Same layer names that were appended to the graph outputs above
output_names = ["logits"] + [f"/model/layers.{i}/output_0" for i in [1, 2, 4, 8]]
session = onnxruntime.InferenceSession("model_with_hidden_states.onnx")
outputs = session.run(
    output_names,
    {"input_ids": input_ids, "attention_mask": attention_mask},
)

Pros: Works with standard ONNX Runtime; no PyTorch dependency at inference time.

Cons:

Requires careful ONNX graph manipulation (naming conventions vary by export version)
Must validate that intermediate node names are stable across export runs
Must handle the merged decoder model correctly (past key values branch)
Loss of ORTModelForCausalLM convenience (manual session management, no generate(), no caching)
Must discover intermediate node names via onnx library inspection
Extra graph outputs may block some ONNX Runtime optimizations, since a value exposed as an output cannot be fused away
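
Because the node names are not guaranteed, a discovery step is safer than hard-coding them. The sketch below groups candidate value names by layer index and fails loudly if a layer has no match. The `/model/layers.N/` pattern is an assumption taken from the example names in this document, and `find_layer_outputs` is a hypothetical helper.

```python
import re
from typing import Iterable

# Assumed naming pattern, taken from the example names in this document;
# real exports may differ, which is exactly why discovery is needed.
LAYER_PATTERN = re.compile(r"^/model/layers\.(\d+)/")


def find_layer_outputs(value_names: Iterable[str],
                       layers: Iterable[int]) -> dict[int, list[str]]:
    """Group candidate graph value names by transformer layer index."""
    wanted = set(layers)
    found: dict[int, list[str]] = {idx: [] for idx in sorted(wanted)}
    for name in value_names:
        match = LAYER_PATTERN.match(name)
        if match and int(match.group(1)) in wanted:
            found[int(match.group(1))].append(name)
    missing = [idx for idx, names in found.items() if not names]
    if missing:
        raise LookupError(f"no graph values found for layers {missing}")
    return found
```

With the `onnx` package, the input names could be collected as `[out for node in model.graph.node for out in node.output]`. A human still has to pick which candidate per layer is the residual-stream hidden state.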

Path B: Separate Encoder-Style ONNX Export

Approach: Create a custom export that treats each transformer layer as a separate ONNX model, or export a modified model that outputs hidden states at specific layers.

This would require writing a custom torch.onnx.export call that traces the model with output_hidden_states=True and captures the intermediate outputs.

Pros: Clean separation of concerns; each sub-model can be optimized independently.

Cons:

Requires PyTorch for the initial export (but not at runtime)
Significant custom code to manage multiple ONNX sub-models
Past key value caching becomes much more complex with sub-models
Not supported by optimum CLI or ORTModel classes

Path C: Direct ONNX Runtime with Modified Graph (Recommended Path)

Approach: Combine a custom ONNX export with direct onnxruntime.InferenceSession usage, bypassing ORTModelForCausalLM entirely.

import onnxruntime as ort
from transformers import AutoTokenizer

# Step 1: Export with hidden state outputs (one-time, requires PyTorch)
# Use optimum CLI or programmatic export, then modify the graph

# Step 2: Load modified model and run inference
session = ort.InferenceSession("smollm2_with_hidden_states.onnx")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

inputs = tokenizer("Hello world", return_tensors="np")
# Some exports declare extra inputs (e.g. position_ids); check session.get_inputs()
output_names = [o.name for o in session.get_outputs()]
# Expected to include: logits, past_key_values, hidden_state_1, hidden_state_2, ...
# (the hidden_state_N names are whatever the custom export assigned)

results = dict(zip(output_names, session.run(output_names, dict(inputs))))
# Last-token activation for each monitored layer
hidden_states = {
    layer: results[f"hidden_state_{layer}"][:, -1, :]
    for layer in [1, 2, 4, 8]
}

Pros: Full control; no PyTorch at runtime; smallest possible footprint.

Cons:

Must write and maintain custom ONNX graph modification code
Must re-export whenever the model architecture changes
Must validate numerical equivalence against PyTorch reference
Bypasses the ORTModelForCausalLM abstraction entirely
Past key value handling must be manual (no generate() support)
This is essentially a custom inference backend, not a drop-in replacement
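
The numerical-equivalence check against the PyTorch reference could look roughly like the sketch below. It compares per-layer activation vectors using a maximum absolute difference and a cosine similarity. The thresholds are illustrative assumptions, not project requirements, and plain float sequences stand in for NumPy arrays to keep the example dependency-free.

```python
import math
from typing import Sequence


def max_abs_diff(ref: Sequence[float], cand: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    if len(ref) != len(cand):
        raise ValueError("vectors must have the same length")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


def cosine_similarity(ref: Sequence[float], cand: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(r * c for r, c in zip(ref, cand))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(c * c for c in cand))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def activations_match(ref: dict[int, Sequence[float]],
                      cand: dict[int, Sequence[float]],
                      atol: float = 1e-4,
                      min_cosine: float = 0.999) -> bool:
    """True if every reference layer is present and close in the candidate."""
    return all(
        layer in cand
        and max_abs_diff(vec, cand[layer]) <= atol
        and cosine_similarity(vec, cand[layer]) >= min_cosine
        for layer, vec in ref.items()
    )
```

In a real test suite the same comparison would normally be done on arrays, for example with `numpy.allclose`, over a fixed set of prompts for each exported model version.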

Comparison with PyTorch

Aspect	PyTorch	ONNX Runtime (Standard)	ONNX Runtime (Custom)
`output_hidden_states=True`	✅ Native, one flag	❌ Not supported	⚠️ Requires graph modification
Activation extraction API	`outputs.hidden_states[layer][:, -1, :]`	N/A	Manual `session.run()` with named outputs
Effort to implement	Minimal (built-in)	N/A	High (custom export + graph hacking)
Numerical accuracy	Ground truth	Must validate	Must validate against PyTorch
Maintenance burden	Low	N/A	High (graph names change, ONNX spec evolves)

4. SmolLM2-135M ONNX Export

Known Status

Pre-exported model exists: onnx-community/SmolLM2-135M-ONNX on HuggingFace Hub
Architecture: LLaMA family, which is well-supported by optimum ONNX export
Export method: Automated by HuggingFace's ONNX conversion space (convert-to-onnx)
Model card: Lists Transformers.js as primary usage, indicating the ONNX model is set up for text generation (logits output), not hidden state extraction

Export Configuration

SmolLM2 uses the LLaMA architecture, which maps to optimum's LlamaOnnxConfig. The standard export produces:

decoder_model.onnx — for initial forward pass (no past key values)
decoder_with_past_model.onnx — for subsequent generation steps (with past key values)
Or decoder_model_merged.onnx — combined model with conditional branching
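
A loader has to decide which of these files to open. Below is a minimal sketch, assuming the file names listed above and assuming the plain decoder is preferred because activation screening needs a single forward pass without cached keys. `pick_decoder_file` is a hypothetical helper.

```python
from pathlib import Path

# File names as listed above. The preference order is an assumption:
# the plain decoder avoids the merged model's cache branching.
CANDIDATES = ("decoder_model.onnx", "decoder_model_merged.onnx")


def pick_decoder_file(export_dir: Path) -> Path:
    """Return the first candidate ONNX file found in `export_dir`."""
    for name in CANDIDATES:
        candidate = export_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(
        f"none of {CANDIDATES} found in {export_dir}"
    )
```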

Known Issues

Hidden states not in standard export: The default optimum export for causal LMs does not include intermediate hidden states as outputs. This is by design — the export configuration only specifies logits and past key values as outputs.
Merged decoder complexity: The merged decoder model uses a use_cache_branch flag for conditional execution. Adding hidden state outputs to this graph requires understanding the branching structure.
Node naming stability: Internal ONNX node names (e.g., /model/layers.0/output_0) may change between optimum versions or ONNX opset versions. Relying on these names for activation extraction creates a maintenance burden.

5. Comparison Table

Criteria	PyTorch (CPU-only)	ONNX Runtime (Standard)	ONNX Runtime (Custom Graph)
Install size (download)	~200 MB	~18 MB	~18 MB
Install size (disk)	~700 MB	~180-200 MB	~180-200 MB
`output_hidden_states=True`	✅ Built-in	❌ Not supported	⚠️ Custom graph modification
Activation extraction API	`model(**inputs, output_hidden_states=True)`	N/A	Manual `session.run()` with named outputs
Drop-in with optimum	✅ `AutoModelForCausalLM`	⚠️ `ORTModelForCausalLM` but no hidden states	❌ Must bypass ORTModel classes
Past key value caching	✅ Automatic	✅ Automatic via ORTModel	❌ Must handle manually
Numerical equivalence	Ground truth	Must validate	Must validate
Implementation effort	Low (built-in)	N/A (doesn't work)	High (custom export + graph mod)
Maintenance burden	Low	N/A	High (brittle node names)
Runtime performance	Good	Better (graph-optimized)	Better (graph-optimized)
CPU deployment	✅ Supported	✅ Excellent	✅ Excellent
safetensors loading	✅ Via transformers	✅ Via optimum	❌ Requires separate model loading
Model pinning (revision)	✅ Via transformers	✅ Via optimum	⚠️ Custom handling
Offline/air-gapped	✅ HF Hub cache	✅ HF Hub cache	⚠️ Custom export files
License	BSD-3	MIT	MIT
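
The install-size rows can be combined with the other per-path estimates this document uses (about 50 MB for transformers or optimum, 269 MB of model weights) to compare total footprints. This is a quick arithmetic check using those rounded figures, not new measurements:

```python
# All figures in MB, taken from this document's estimates.
MODEL_WEIGHTS = 269  # SmolLM2-135M

pytorch_path = {"torch (CPU-only)": 700, "transformers": 50, "model": MODEL_WEIGHTS}
onnx_path = {"onnxruntime": 180, "optimum": 50, "model": MODEL_WEIGHTS}

pytorch_total = sum(pytorch_path.values())  # 1019, roughly 1 GB
onnx_total = sum(onnx_path.values())        # 499, roughly 500 MB
savings = pytorch_total - onnx_total        # 520

print(f"PyTorch path: {pytorch_total} MB, ONNX path: {onnx_total} MB, "
      f"savings: {savings} MB ({savings / pytorch_total:.0%} of the PyTorch total)")
```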

6. Recommendation

Defer ONNX Runtime to Phase 2. Use PyTorch for Phase 1.

Rationale

The activation extraction problem is unsolved for ORTModelForCausalLM. Issue #972 requesting output_hidden_states support was closed as "not planned" by the optimum team. This means the standard, supported path does not work for alknet-firewall's core requirement.
Custom ONNX graph modification is a significant engineering effort with ongoing maintenance burden. It would essentially require alknet-firewall to maintain a custom ONNX export pipeline, validate numerical equivalence, and keep node names synchronized across optimum version updates.
The install-size advantage is real but not decisive. While onnxruntime (~180 MB installed) is significantly smaller than torch CPU-only (~700 MB installed), the difference is manageable:
- The model weights (269 MB for SmolLM2-135M) are a fixed cost on both paths, which shrinks the relative saving
- The total installed size for PyTorch path: ~700 MB (torch) + ~50 MB (transformers) + ~269 MB (model) ≈ 1 GB
- The total installed size for ONNX path: ~180 MB (onnxruntime) + ~50 MB (optimum) + ~269 MB (model) ≈ 500 MB
- Savings: ~500 MB, about half the total footprint, which is meaningful but not transformative
PyTorch is already optional. ADR-006 correctly made PyTorch optional via extras. Users who can't install PyTorch simply won't have a working inference backend until Phase 2 adds ONNX support.
The DetectorModel protocol already accommodates multiple backends. The architecture is designed for this:
```
from typing import Protocol
import numpy as np

class DetectorModel(Protocol):
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```
Adding an ONNXDetectorModel implementation in Phase 2 is a clean extension.
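
To show how a second backend would slot in, here is a sketch of the protocol with a stand-in implementation. `FakeDetectorModel` and `produced_layers` are hypothetical, and `Vector` substitutes plain float sequences for `np.ndarray` so the example has no dependencies. A real ONNX backend would return arrays from its session instead.

```python
from typing import Protocol, Sequence, runtime_checkable

Vector = Sequence[float]  # stand-in for np.ndarray to keep the sketch stdlib-only


@runtime_checkable
class DetectorModel(Protocol):
    def infer(self, input_ids: list[int]) -> dict[int, Vector]: ...


class FakeDetectorModel:
    """Hypothetical backend returning fixed-size dummy activations."""

    def __init__(self, layers: Sequence[int] = (1, 2, 4, 8),
                 hidden_dim: int = 4) -> None:
        self.layers = tuple(layers)
        self.hidden_dim = hidden_dim

    def infer(self, input_ids: list[int]) -> dict[int, Vector]:
        # Dummy values derived from the last token id; this mimics the
        # "one last-token vector per layer" shape only, not real activations.
        last = float(input_ids[-1])
        return {layer: [last + layer] * self.hidden_dim for layer in self.layers}


def produced_layers(model: DetectorModel, input_ids: list[int]) -> list[int]:
    """Return the layer indices a backend produced, whatever its concrete type."""
    return sorted(model.infer(input_ids))
```

Code written against `DetectorModel` does not need to know which backend it received. That is what would keep a Phase 2 ONNX implementation a local change.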

Phase 2 Plan

When ONNX Runtime support is added in Phase 2, the recommended approach is:

Create a custom ONNX export pipeline that includes hidden state outputs for layers 1, 2, 4, 8 in the ONNX graph definition
Store the custom-exported model on HuggingFace Hub (e.g., alknet/smollm2-135m-onnx-activations) with the modified graph
Use onnxruntime.InferenceSession directly (bypassing ORTModelForCausalLM) for inference, requesting the hidden state outputs by name
Validate numerical equivalence against the PyTorch reference implementation at each model version
Pin the optimum version used for the initial export to ensure node name stability

Alternatively, if optimum adds output_hidden_states support in a future version (the issue could be reopened), the implementation becomes much simpler and could use ORTModelForCausalLM directly.

Phase 1 Actions

Update ADR-006 to note that ONNX Runtime is deferred to Phase 2
Resolve OQ-01 as "ONNX Runtime deferred to Phase 2 due to hidden state extraction gap"
Update pyproject.toml to remove the [onnx] extra from Phase 1 scope (or mark it as experimental/unstable)
Ensure the DetectorModel protocol and HFDetectorModel implementation are clean enough to extend with an ONNXDetectorModel in Phase 2

7. References

HuggingFace optimum Issue #972: "Add output of output_hidden_states for onnx model export" — https://github.com/huggingface/optimum/issues/972 — Closed as "not planned". The key issue documenting the lack of hidden state output support.
ONNX Runtime InferenceSession API: https://onnxruntime.ai/docs/api/python/api_summary.html — Documents that session.run() can only return values declared as graph outputs.
sklearn-onnx intermediate outputs: https://onnx.ai/sklearn-onnx/auto_examples/plot_intermediate_outputs.html — Explicitly states: "There is actually no way to ask onnxruntime to retrieve the output of intermediate nodes. We need to modify the ONNX [graph] before it is given to onnxruntime."
Stack Overflow: Extract intermediate layer outputs from ONNX: https://stackoverflow.com/questions/69658166/get-intermediate-layer-output-for-onnx-mode — Shows the approach of adding ValueInfoProto to model.graph.output to expose intermediate values.
optimum-onnx GitHub: https://github.com/huggingface/optimum-onnx — The ONNX integration library for HuggingFace models.
ORTModelForCausalLM documentation: https://huggingface.co/docs/optimum-onnx/onnxruntime/package_reference/modeling_ort — Documents the forward() method; notably absent is output_hidden_states parameter.
SmolLM2-135M ONNX on HuggingFace Hub: https://huggingface.co/onnx-community/SmolLM2-135M-ONNX — Pre-exported ONNX version of SmolLM2-135M.
optimum ONNX export documentation: https://huggingface.co/docs/optimum-onnx/onnx/usage_guides/export_a_model — Documents the export process and configuration.
DeepWiki: ORTModelForCausalLM text generation models: https://deepwiki.com/huggingface/optimum-onnx/3.3-text-generation-models — Documents past key value caching, merged/non-merged model variants, and architecture-specific handling.
DeepWiki: ONNX Model Export: https://deepwiki.com/huggingface/optimum-onnx/2-onnx-model-export — Documents the export system architecture, validation, and graph transformations.
ONNX Runtime performance: https://onnxruntime.ai/docs/performance/ — Official performance documentation.
OpenNN deployment size comparison: https://www.opennn.net/blog/deployment-size-on-cpu-opennn-vs-pytorch-vs-tensorflow/ — Measured deployment sizes: ONNX Runtime libonnxruntime.so = 22 MB, PyTorch libtorch_cpu.so = 442 MB.
onnxruntime PyPI: https://pypi.org/project/onnxruntime/ — Wheel sizes: onnxruntime 1.26.0 for Linux x86_64 = 18.2 MB.
onnx-modifier: https://github.com/ZhangGe6/onnx-modifier — Tool for modifying ONNX models, including adding intermediate outputs.
ONNX graph surgery: https://tlbvr.com/blog/onnx-graph-surgery/ — Techniques for embedding custom operations in ONNX graphs.
ADR-006: Optional PyTorch: /docs/architecture/decisions/006-optional-pytorch.md — The ADR documenting why PyTorch is optional and the install size comparison.
Model architecture doc: /docs/architecture/model.md — Documents activation extraction design, DetectorModel protocol, and layer selection.
