alknet-firewall/docs/research/python-ml-packaging.md

# Research: Packaging Python Libraries with PyTorch Dependencies

## Question

How should we package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference with a ~135M-parameter model (SmolLM2-135M), scikit-learn for SVD computations, and safetensors for loading model weights, while keeping the package lean, pip-installable, and reliable?

---

## 1. PyTorch as a Dependency

### How Mature ML Packages Handle It

The three major HuggingFace packages each take a different approach:

#### `transformers` — Torch as Optional Extra

From `setup.py` (v5.x), `transformers` does **NOT** include `torch` in `install_requires`. Instead:

```python
# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")
```

Users install with `pip install "transformers[torch]"`. If you just `pip install transformers` without the extra, you get the library but it will fail at runtime if you try to use torch-dependent code.

**Key insight**: `transformers` is designed as a multi-framework library (torch/tf/jax), so making torch optional is a necessity, not just a convenience. It also uses `dummy_*.py` modules that provide placeholder classes when a framework isn't installed, giving better error messages.
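
The placeholder idea can be sketched in a few lines. This is a simplified illustration of the pattern, not the actual `transformers` implementation; `make_placeholder`, `backend_available`, and the class name `FirewallClassifier` are hypothetical names introduced here.

```python
import importlib.util


def backend_available(module_name: str) -> bool:
    """Check whether a top-level package is importable without importing it."""
    return importlib.util.find_spec(module_name) is not None


def make_placeholder(class_name: str, backend: str, extra: str):
    """Build a stand-in class that raises a helpful ImportError when instantiated."""

    class _Placeholder:
        def __init__(self, *args, **kwargs):
            raise ImportError(
                f"{class_name} requires the '{backend}' package. "
                f"Install it with: pip install 'alknet-firewall[{extra}]'"
            )

    _Placeholder.__name__ = class_name
    return _Placeholder


# Hypothetical usage: export a placeholder so `import` succeeds and only use fails.
FirewallClassifier = make_placeholder("FirewallClassifier", "torch", "torch")
```

Importing the package then always works, and the error only surfaces at the point where a torch-backed class is actually constructed.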

#### `safetensors` — Framework-Specific Optional Extras

From `pyproject.toml`:

```toml
[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]
```

The base `safetensors` package (no extras) reads the file format itself. The `numpy` extra lets it return tensors as numpy arrays, and each framework extra adds that framework's save/load functions. The `convert` extra chains to `torch`.

**Key insight**: Safetensors uses a **chained extras** pattern — `torch` depends on `numpy`, so `safetensors[torch]` pulls both. This is clean and explicit.

#### `huggingface_hub` — Minimal Core, Framework Extras

From `setup.py`:

```python
install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]
```

**Key insight**: `huggingface_hub` is deliberately minimal. Torch is only needed for certain features. The `hf_xet` dependency uses platform markers for conditional installation.

### Options Summary

| Approach | Used By | Pros | Cons |
|----------|---------|------|------|
| **Optional extra** (`package[torch]`) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| **Required dependency** | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| **Lazy imports + graceful error** | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; can't type-check torch-dependent code |
| **Platform-conditional** | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |

### Recommendation for alknet-firewall

**Use optional extras with lazy imports.** This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:

1. **`pip install alknet-firewall`** — minimal install, downloads model at first run, requires torch to already be present
2. **`pip install "alknet-firewall[torch]"`** — installs torch as a dependency

In your code, use lazy imports with a clear error message:

```python
def _require_torch():
    """Import torch lazily, raising an actionable error if it is missing."""
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch
```

---

## 2. Model File Distribution

### Size Reality Check: SmolLM2-135M

The SmolLM2-135M model consists of:
- `model.safetensors` — ~269MB (model weights)
- `config.json` — ~700 bytes
- `tokenizer.json` — ~2-4MB
- `tokenizer_config.json` — ~1KB
- `generation_config.json` — ~200 bytes

**Total: ~272MB+**

This is far too large to bundle in a Python package. PyPI's default limits are 100MB per file and 10GB per project; anything larger requires requesting a limit increase. Even if it were allowed, a 272MB wheel download is terrible UX.
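
As a sanity check on the ~269MB figure, the arithmetic below assumes roughly 135M parameters (taken from the model name; the exact count may differ slightly) stored at 2 bytes each. The 2-byte dtype is an assumption about the checkpoint, not something stated above, and `weight_file_megabytes` is a hypothetical helper.

```python
def weight_file_megabytes(n_params: int, bytes_per_param: int) -> float:
    """Approximate weights-file size in MB (10**6 bytes), ignoring the small header."""
    return n_params * bytes_per_param / 1_000_000


# Assumption: ~135M parameters, as suggested by the model name.
N_PARAMS = 135_000_000

half_precision_mb = weight_file_megabytes(N_PARAMS, 2)  # 16-bit weights
full_precision_mb = weight_file_megabytes(N_PARAMS, 4)  # 32-bit weights
print(f"16-bit: ~{half_precision_mb:.0f} MB, 32-bit: ~{full_precision_mb:.0f} MB")
# prints: 16-bit: ~270 MB, 32-bit: ~540 MB
```

The 16-bit estimate lands within a megabyte or so of the listed file size, which also means converting the weights to 32-bit on disk would roughly double the download.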

### Distribution Options

| Approach | Feasibility | When to Use |
|----------|-------------|-------------|
| **Bundled in package_data** | ❌ Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| **Runtime download via huggingface_hub** | ✅ **Recommended** | Default approach for any model >10MB |
| **Separate package for model artifacts** | ⚠️ Possible but awkward | When you need offline-first install |
| **Custom download (S3, etc.)** | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |

### Recommended Approach: Runtime Download via huggingface_hub

This is exactly what `transformers` does. The pattern:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)
```

### Caching Strategy

`huggingface_hub` handles caching automatically:

- **Default cache location**: `~/.cache/huggingface/hub/`
- **Configurable via**: `HF_HOME`, `HF_HUB_CACHE`, or `cache_dir` parameter
- **Structure**: Content-addressed storage with symlinks (blobs + snapshots)
- **Deduplication**: Same file across revisions → single blob on disk
- **No re-downloads**: Cached files are checked before download
- **Offline mode**: Set `HF_HUB_OFFLINE=1` to skip all network calls

The cache structure:
```
~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│   ├── blobs/           # actual files, named by hash
│   ├── refs/            # branch/tag → commit mappings
│   └── snapshots/       # symlinks to blobs, one per revision
```
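
To make the layout concrete, the sketch below derives the expected cache folder for a repo from the environment variables listed above. It is a simplified approximation for illustration (for example, it ignores `XDG_CACHE_HOME`); `hub_cache_root` and `repo_cache_dir` are hypothetical helpers, and real code should rely on `huggingface_hub` itself to resolve paths.

```python
import os
from pathlib import Path


def hub_cache_root(env=None) -> Path:
    """Resolve the hub cache: HF_HUB_CACHE first, then HF_HOME/hub, then the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def repo_cache_dir(repo_id: str, env=None) -> Path:
    """Folder for a model repo: 'models--' plus the repo id with '/' replaced by '--'."""
    return hub_cache_root(env) / ("models--" + repo_id.replace("/", "--"))


# Example with an explicit environment mapping instead of the real os.environ.
print(repo_cache_dir("HuggingFaceTB/SmolLM2-135M", env={"HF_HOME": "/data/hf"}))
```

A helper like this is mainly useful in diagnostics, for instance to tell users where to look when disk usage grows.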

### Pinning Model Versions

To ensure reproducibility, pin the model revision:

```python
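from huggingface_hub import snapshot_download

# Pin to a specific commit hash for reproducibility.
# Confirm this value against the model repo's commit history before relying on it.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"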

model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)
```

Or pin to a tag if the model has version tags.
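
Branch names and tags can be moved, so a full commit hash is the only truly immutable pin. A check along these lines could be run in CI or at startup to enforce that; `is_pinned_revision` is a hypothetical helper, and the 40-character lowercase hex form is the usual git SHA-1 format.

```python
import re

# A full git commit hash: exactly 40 lowercase hexadecimal characters.
_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True only for a full commit hash, not a branch or tag name."""
    return _COMMIT_HASH.fullmatch(revision) is not None


for candidate in ("main", "v1.0", "a" * 40):
    print(candidate[:12], is_pinned_revision(candidate))
```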

### Gated Model Authentication

If your model requires authentication (accepting license terms on HF Hub):

1. User sets the `HF_TOKEN` environment variable or logs in via `hf auth login` (`huggingface-cli login` in older releases)
2. `hf_hub_download()` automatically picks up the token
3. Document this requirement clearly

```python
# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)
```

SmolLM2-135M is **not gated** as of this writing, but your own fine-tuned version could be.

---

## 3. Inference-Only Considerations

### CPU-Only PyTorch

**Yes, you can install torch without CUDA.** The official method:

```bash
# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

**Problem**: You can't express this in `pyproject.toml` extras. The CPU-only torch is served from a different index URL (`https://download.pytorch.org/whl/cpu`), not from PyPI. This means:

1. `pip install "alknet-firewall[torch]"` will install the default torch build from PyPI, which on Linux bundles CUDA (~2GB)
2. To get CPU-only torch, users must do a two-step install:
   ```bash
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   pip install alknet-firewall
   ```

**Workaround**: Document both installation paths clearly:

```markdown
## Installation

# With CUDA (default torch):
pip install "alknet-firewall[torch]"

# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
```
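
Because the two install paths produce different builds, it can help to report which one is active in diagnostics. The sketch below relies on `torch.version.cuda`, which is `None` when the build has no CUDA support; `describe_torch_build` is a hypothetical helper that takes the module as an argument so it can be exercised without torch installed.

```python
def describe_torch_build(torch_module) -> str:
    """Summarize whether the given torch module was built with CUDA support."""
    cuda_version = torch_module.version.cuda  # None when the build has no CUDA
    if cuda_version is None:
        return "no CUDA"
    return f"cuda {cuda_version}"


try:
    import torch
except ImportError:
    print("torch is not installed")
else:
    print(torch.__version__, describe_torch_build(torch))
```

Note that "no CUDA" covers CPU-only wheels as well as other non-CUDA builds, so treat it as a hint rather than proof of a CPU-only install.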

### torch.compile() for Faster Inference

`torch.compile()` (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:

```python
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.compile(model)  # JIT compile for faster inference
```

**Caveats**:
- First run is slow (compilation overhead)
- Best for repeated inference in a long-lived process (compiled graphs are reused across calls with matching input shapes)
- CPU-only works but benefits are smaller than on GPU
- Adds complexity; not worth it for a ~135M model unless latency is critical

**Recommendation**: Make this optional. Don't `torch.compile()` by default — offer it as a performance tuning option.
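
One way to keep compilation opt-in is a small wrapper that leaves the model untouched unless the caller asks for it. This is a sketch only: `maybe_compile` and the `ALKNET_FIREWALL_COMPILE` environment variable are hypothetical names, not an existing interface.

```python
import os


def maybe_compile(model, enabled=None):
    """Return torch.compile(model) only when explicitly enabled; otherwise the model as-is."""
    if enabled is None:
        # ALKNET_FIREWALL_COMPILE is a hypothetical variable name used for illustration.
        enabled = os.environ.get("ALKNET_FIREWALL_COMPILE", "0") == "1"
    if not enabled:
        return model
    import torch  # imported lazily so the default path never needs torch

    return torch.compile(model)
```

With this shape, the default code path has no compilation overhead, and users who care about steady-state latency can switch it on without code changes.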

### torch.export() / TorchDynamo

`torch.export()` (PyTorch 2.1+) produces a portable model artifact:

```python
exported_model = torch.export.export(model, (input_ids,))
```

This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.

### ONNX Runtime as an Alternative

**This is the most compelling alternative to raw PyTorch for inference-only use cases.**

HuggingFace's `optimum` library provides seamless ONNX Runtime integration:

```python
# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)
```

**Benefits**:
- `onnxruntime` package is ~30-50MB vs `torch` at ~200-2000MB+
- ONNX Runtime is optimized for inference (no autograd, no training overhead)
- Often faster inference on CPU than PyTorch
- Cross-platform (CPU, GPU, mobile, edge devices)

**Drawbacks**:
- Need to export model to ONNX format first (one-time step)
- Not all model architectures support ONNX export equally
- Quantization/int8 support varies by architecture
- Adds `onnxruntime` + `optimum` as dependencies (still much smaller than torch)

**Size comparison**:

| Package | Install Size |
|---------|-------------|
| `torch` (CUDA) | ~2.5GB |
| `torch` (CPU only) | ~200MB |
| `onnxruntime` | ~30-50MB |
| `onnxruntime-gpu` | ~500MB |

**Recommendation**: Consider offering ONNX Runtime as an **alternative inference backend** via an extra:

```toml
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=5.0"]  # transformers 4.x requires huggingface-hub<1.0
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=5.0"]
```

For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
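
With two extras, the library needs some way to decide which backend to use at runtime. The sketch below probes for the runtime packages without importing them and prefers torch, matching the primary/alternative split described here. `select_backend` and its `is_available` hook are hypothetical names, and the hook exists only to make the function testable.

```python
import importlib.util

# Backend name -> top-level module that must be importable.
_BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def _module_available(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


def select_backend(preferred=None, is_available=_module_available) -> str:
    """Return 'torch' or 'onnx' depending on what is installed (torch first by default)."""
    order = [preferred] if preferred else ["torch", "onnx"]
    for backend in order:
        if backend not in _BACKEND_MODULES:
            raise ValueError(f"Unknown backend: {backend!r}")
        if is_available(_BACKEND_MODULES[backend]):
            return backend
    raise ImportError(
        "No inference backend found. Install one with: "
        "pip install 'alknet-firewall[torch]' or pip install 'alknet-firewall[onnx]'"
    )
```

Passing `preferred="onnx"` forces the lighter backend and fails loudly if it is missing, rather than silently falling back to torch.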

### Using transformers Without Training Dependencies

`transformers` is already split this way. The base `pip install transformers` does NOT include torch. You need `pip install "transformers[torch]"` to get torch support.

Additional ways to keep transformers lean:
- Don't install `accelerate` unless you need multi-GPU / device_map="auto"
- Don't install training extras (`deepspeed`, `peft`, etc.)
- For inference only, you don't need: `scipy`, `scikit-learn` (from transformers extras), `tensorboard`, etc.

**What transformers needs for basic inference**:
- `torch` (or `tensorflow`, or `flax`)
- `safetensors`
- `tokenizers`
- `huggingface-hub`
- `numpy`
- `packaging`
- `pyyaml`
- `regex`
- `tqdm`

---

## 4. sklearn + PyTorch Coexistence

### Compatibility: Generally Fine

sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.

**Potential concerns**:

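1. **numpy version**: scikit-learn requires numpy, while torch treats it as an optional runtime dependency rather than listing it in its install requirements. The practical risk is the NumPy 1.x/2.x ABI split: older torch wheels built against NumPy 1.x can fail to interoperate with NumPy 2.x. As of 2025-2026:
   - torch>=2.4 works with both NumPy 1.x and 2.x
   - scikit-learn>=1.5 requires `numpy>=1.19.5` and supports NumPy 2.x
   - These are compatible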

2. **Dependency tree size**: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.

3. **BLAS/LAPACK**: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.

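4. **Thread oversubscription**: torch manages its own intra-op thread pool, while the linear algebra behind sklearn's SVD runs on BLAS/OpenMP threads (joblib only comes into play for estimators that take `n_jobs`). If running sklearn SVD and torch inference in the same process, consider capping thread counts to avoid oversubscription:
   ```python
   import torch
   from threadpoolctl import threadpool_limits  # installed as a scikit-learn dependency

   torch.set_num_threads(4)  # limit torch intra-op threads

   with threadpool_limits(limits=4):  # cap BLAS/OpenMP threads for the SVD
       ...  # run the sklearn SVD here
   ```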

**Recommendation**: No special handling needed. Just include both as dependencies. Set `torch.set_num_threads()` if you notice CPU contention.

---

## 5. Package Size Optimization

### What to Make Required vs Optional

For alknet-firewall, here's a practical breakdown:

| Component | Required? | Rationale |
|-----------|-----------|-----------|
| `huggingface_hub` | ✅ Required | Model downloading, caching |
| `safetensors` | ✅ Required | Loading model weights |
| `tokenizers` | ✅ Required | Text preprocessing |
| `numpy` | ✅ Required | Tensor operations, sklearn dependency |
| `scikit-learn` | ✅ Required | SVD computations (core feature) |
| `packaging` | ✅ Required | Version comparisons |
| `filelock` | ✅ Required | File locking for cache |
| `tqdm` | ✅ Required | Progress bars |
| `pyyaml` | ✅ Required | Config parsing |
| `torch` | ❌ Optional (extra) | Large; user may already have it |
| `transformers` | ❌ Optional (extra) | Pulls many deps; only for model loading |
| `onnxruntime` | ❌ Optional (extra) | Alternative inference backend |
| `optimum` | ❌ Optional (extra) | ONNX Runtime integration |

### Practical pyproject.toml Structure

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
# Full torch-based inference
torch = [
    "torch>=2.4",
    "transformers>=5.0",  # 4.x releases require huggingface-hub<1.0, conflicting with the core pin
]
# ONNX Runtime inference (lighter)
onnx = [
    "onnxruntime>=1.17",
    "optimum[onnxruntime]",
    "transformers>=5.0",
]
# Development
dev = [
    "pytest>=7",
    "ruff>=0.9",
    "mypy",
]
```

### Estimated Install Sizes

| Install Command | Download Size | Disk Size |
|----------------|---------------|-----------|
| `pip install alknet-firewall` | ~30MB | ~100MB |
| `pip install "alknet-firewall[torch]"` | ~2GB+ | ~5GB+ |
| `pip install "alknet-firewall[onnx]"` | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |
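
These figures are estimates and vary by platform and version, so it is worth measuring them in a clean virtual environment. A rough way to do that with only the standard library is to sum the files each installed distribution records; `installed_size_bytes` is a hypothetical helper, and the result excludes anything not listed in the distribution's metadata.

```python
from importlib import metadata


def installed_size_bytes(dist_name: str):
    """Sum on-disk sizes of files recorded for an installed distribution, or None if absent."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    total = 0
    for entry in dist.files or []:  # files can be None when no RECORD is available
        path = entry.locate()
        if path.is_file():
            total += path.stat().st_size
    return total


for name in ("torch", "onnxruntime", "scikit-learn", "numpy"):
    size = installed_size_bytes(name)
    print(name, "not installed" if size is None else f"{size / 1_000_000:.1f} MB")
```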

---

## 6. safetensors Format

### Why safetensors Over PyTorch Pickle

| Property | `.safetensors` | `.pt` / `.bin` (pickle) |
|----------|---------------|------------------------|
| **Security** | ✅ No arbitrary code execution | ❌ Pickle can execute arbitrary code |
| **Speed (CPU)** | ~76x faster than pickle in the safetensors docs' benchmark | Baseline |
| **Speed (GPU)** | ~2x faster than pickle in the same benchmark | Baseline |
| **Zero-copy** | ✅ Memory-mapped loading | ❌ Extra copies |
| **Lazy loading** | ✅ Load only needed tensors | ❌ Must load entire file |
| **Cross-framework** | ✅ pt, tf, jax, numpy, mlx | ❌ Framework-specific |
| **File size limit** | ✅ No practical limit | ⚠️ Practical limits exist |
| **Layout control** | ✅ Deterministic | ❌ Non-deterministic |

### Security Implications

**Pickle-based `.pt` / `.bin` files are a known security risk.** Loading a `.pt` file with `torch.load()` can execute arbitrary Python code embedded in the file unless unpickling is restricted with `weights_only=True` (the default since PyTorch 2.6). This is a supply chain attack vector.

`safetensors` eliminates this entirely — the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.
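
To show how little there is to parse, the sketch below builds a tiny file in this layout by hand and reads its header back using only the standard library. It assumes the published layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and skips the validation the real library performs; `build_safetensors` and `read_header` are illustrative names, not part of the `safetensors` API.

```python
import json
import struct


def build_safetensors(tensors: dict) -> bytes:
    """Serialize {name: (dtype, shape, raw_bytes)} into the safetensors byte layout."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob: bytes) -> dict:
    """Parse only the JSON header; tensor bytes are never interpreted as code."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + header_len])


# Two float32 values (1.0 and 2.0) as a one-dimensional tensor named "w".
blob = build_safetensors({"w": ("F32", [2], struct.pack("<2f", 1.0, 2.0))})
print(read_header(blob))
# prints: {'w': {'dtype': 'F32', 'shape': [2], 'data_offsets': [0, 8]}}
```

Everything a loader needs is plain data: names, dtypes, shapes, and byte offsets. There is no object graph to reconstruct, which is why the pickle attack surface does not exist here.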

**For a security-focused product (firewall)**, this is critical. You should:
1. **Only load model weights from safetensors format** — never `.pt` or `.bin`
2. **Verify checksums** of downloaded weights against known-good SHA-256 values (the Hub cache names large-file blobs by their SHA-256, which makes the expected value easy to record)
3. **Pin model revisions** to specific commit hashes
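
An explicit checksum check is cheap to add on top of revision pinning. The sketch below streams a file through SHA-256 and compares it to a pinned digest; `sha256_of_file` and `verify_weights` are hypothetical helpers, and where the expected digest comes from (for example, a constant shipped with the package) is left as a design decision.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path, expected_sha256: str) -> None:
    """Raise ValueError when the file's digest differs from the pinned value."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Checksum mismatch for {path}: got {actual}")
```

Calling `verify_weights()` on `model.safetensors` before loading turns a tampered or truncated download into an immediate, explicit failure.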

### Loading safetensors in Practice

```python
# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")
```

**Recommendation**: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Use `use_safetensors=True` explicitly for safety documentation purposes (it's the default in modern transformers, but being explicit shows intent).

---

## 7. HuggingFace Integration

### How to Depend on huggingface_hub

`huggingface_hub` is lightweight (~15MB installed) and well-maintained. It should be a **required dependency** for any package that downloads models from the Hub.

```toml
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
]
```

The version pin `>=1.5.0,<2.0` follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes.

### Key Features to Use

1. **`hf_hub_download()`** — Download a single file with caching
2. **`snapshot_download()`** — Download an entire repo with caching
3. **`try_to_load_from_cache()`** — Check if a file is already cached (no network call)
4. **Offline mode** — `HF_HUB_OFFLINE=1` or `local_files_only=True`
5. **Authentication** — Automatic via `HF_TOKEN` env var or `huggingface-cli login`
6. **Filtering** — `allow_patterns` / `ignore_patterns` to download only what's needed

### Download Pattern for alknet-firewall

```python
import os
from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash

def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.

    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "").upper() in {"1", "ON", "YES", "TRUE"}

    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path
```

### Caching

`huggingface_hub` caching is automatic and robust:
- **Content-addressed**: Files are stored as blobs named by content hash (SHA-256 for LFS-tracked files such as model weights)
- **Symlink-based**: Multiple revisions share the same blob
- **No redundant downloads**: Already-cached files are never re-downloaded
- **Cache inspection**: `hf cache ls` CLI or `scan_cache_dir()` Python API
- **Cache cleanup**: `hf cache prune` removes unreferenced revisions

You don't need to implement your own caching layer. Just use `huggingface_hub` and let it handle everything.

### Authentication for Gated Models

If your fine-tuned model is gated (requires license acceptance):

```python
# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set HF_TOKEN environment variable or run: hf auth login (huggingface-cli login in older releases)

# Your code just works — huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)
```

**Recommendation**: Use the public, ungated SmolLM2-135M model for the base use case. If you fine-tune and need access control, document the authentication steps clearly.

### Environment Variables

Key environment variables your users might need:

| Variable | Purpose | Default |
|----------|---------|---------|
| `HF_HOME` | Root cache directory | `~/.cache/huggingface` |
| `HF_HUB_CACHE` | Specific cache directory for hub files | `$HF_HOME/hub` |
| `HF_HUB_OFFLINE` | Skip all network calls | `0` |
| `HF_TOKEN` | Authentication token | None |
| `HF_HUB_DOWNLOAD_TIMEOUT` | Download timeout in seconds | `10` |
| `TRANSFORMERS_CACHE` | Transformers-specific cache | Deprecated; use `HF_HUB_CACHE` |
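
If alknet-firewall reads any of these flags itself, it should interpret them the same way the Hub library does. The sketch below treats `1`, `ON`, `YES`, and `TRUE` (case-insensitive) as enabled. That set is an assumption about `huggingface_hub`'s behaviour that should be checked against its documentation, and `env_flag` is a hypothetical helper.

```python
import os

# Assumed to mirror the values huggingface_hub treats as "true"; verify against its docs.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def env_flag(name: str, env=None) -> bool:
    """Interpret an environment variable as a boolean flag (case-insensitive)."""
    env = os.environ if env is None else env
    value = env.get(name)
    return value is not None and value.upper() in _TRUE_VALUES


print(env_flag("HF_HUB_OFFLINE", env={"HF_HUB_OFFLINE": "true"}))  # prints: True
print(env_flag("HF_HUB_OFFLINE", env={}))  # prints: False
```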

---

## Summary of Recommendations

### Dependency Strategy

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=5.0"]  # transformers 4.x requires huggingface-hub<1.0
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=5.0"]
cpu = ["torch>=2.4", "transformers>=5.0"]  # alias of torch; extras cannot select the CPU wheel index, so document the CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]
```

### Model Distribution

- **Runtime download** via `huggingface_hub.snapshot_download()`
- **Cache** in default HF cache (`~/.cache/huggingface/hub/`)
- **Pin model revision** for reproducibility
- **Filter downloads** with `allow_patterns` (skip `.bin`, `.msgpack`, etc.)
- **Support offline mode** via `HF_HUB_OFFLINE` / `local_files_only=True`

### Inference Backend

- **Primary**: PyTorch + transformers (via `[torch]` extra)
- **Alternative**: ONNX Runtime (via `[onnx]` extra) — much smaller footprint
- **CPU-only**: Document two-step install for CPU-only torch
- **Don't torch.compile() by default** — make it opt-in

### Security

- **Only load safetensors format** — never pickle-based `.pt`/`.bin`
- **Verify model provenance** — pin to specific HF revisions
- **Don't bundle model weights** — runtime download with checksums

### Installation Paths (for docs)

```bash
# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"

# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"

# Pre-download model for offline use
alknet-firewall download  # proposed CLI command to pre-fetch the model
# Or set HF_HUB_OFFLINE=1 after first download
```

---

## References

- [HuggingFace Transformers setup.py](https://github.com/huggingface/transformers/blob/main/setup.py) — torch as optional extra pattern
- [HuggingFace Safetensors pyproject.toml](https://github.com/huggingface/safetensors/blob/main/bindings/python/pyproject.toml) — chained extras pattern
- [HuggingFace Hub setup.py](https://github.com/huggingface/huggingface_hub/blob/main/setup.py) — minimal core with extras
- [HuggingFace Hub caching docs](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache)
- [HuggingFace Hub download docs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
- [HuggingFace Safetensors docs](https://huggingface.co/docs/safetensors/index)
- [Safetensors speed comparison](https://huggingface.co/docs/safetensors/en/speed) — 76x faster CPU load than pickle
- [HuggingFace Optimum](https://github.com/huggingface/optimum) — ONNX Runtime integration
- [HuggingFace Optimum ONNX quickstart](https://huggingface.co/docs/optimum-onnx/en/quickstart)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime) — cross-platform inference engine
- [PyTorch installation](https://pytorch.org/get-started/locally/) — CPU-only install via `--index-url`
- [Transformers installation docs](https://huggingface.co/docs/transformers/installation) — CPU-only torch install pattern