alknet-firewall/docs/research/python-ml-packaging.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00


# Research: Packaging Python Libraries with PyTorch Dependencies
## Question
How to package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference of a ~125M parameter model (SmolLM2-135M), sklearn for SVD computations, and safetensors for model weight loading — while keeping the package lean, pip-installable, and reliable.
---
## 1. PyTorch as a Dependency
### How Mature ML Packages Handle It
The three major HuggingFace packages each take a different approach:
#### `transformers` — Torch as Optional Extra
From `setup.py` (v5.x), `transformers` does **NOT** include `torch` in `install_requires`. Instead:
```python
# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")
```
Users install with `pip install "transformers[torch]"`. If you just `pip install transformers` without the extra, you get the library but it will fail at runtime if you try to use torch-dependent code.
**Key insight**: `transformers` has historically been a multi-framework library (torch/tf/jax), so making torch optional was a necessity, not just a convenience. It also uses `dummy_*.py` modules that provide placeholder classes when a framework isn't installed, so a missing framework produces a clear error message instead of an import failure.
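The placeholder-class idea can be sketched in a few lines. This is a minimal illustration of the pattern, not the actual `transformers` implementation; the names `_RequiresBackend`, `make_placeholder`, and `ActivationExtractor` are hypothetical.

```python
class _RequiresBackend:
    """Base for placeholder classes: importable, but unusable without the backend."""

    _backend = "torch"  # name of the missing package (hypothetical attribute)

    def __init__(self, *args, **kwargs):
        raise ImportError(
            f"{type(self).__name__} requires '{self._backend}', which is not installed. "
            f"Try: pip install 'alknet-firewall[{self._backend}]'"
        )


def make_placeholder(class_name: str, backend: str) -> type:
    """Build a stand-in class that raises a helpful error on instantiation."""
    return type(class_name, (_RequiresBackend,), {"_backend": backend})


# Importing this name always succeeds; only instantiating it fails.
ActivationExtractor = make_placeholder("ActivationExtractor", "torch")
```

A package would export the real class when the backend is importable and the placeholder otherwise, so `from package import ActivationExtractor` never fails at import time.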
#### `safetensors` — Framework-Specific Optional Extras
From `pyproject.toml`:
```toml
[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]
```
The base `safetensors` package (no extras) can load files and return raw tensor data (as numpy arrays via the `numpy` extra). Each framework extra adds the framework-specific save/load functions. The `convert` extra specifically chains to `torch`.
**Key insight**: Safetensors uses a **chained extras** pattern — `torch` depends on `numpy`, so `safetensors[torch]` pulls both. This is clean and explicit.
#### `huggingface_hub` — Minimal Core, Framework Extras
From `setup.py`:
```python
install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]
```
**Key insight**: `huggingface_hub` is deliberately minimal. Torch is only needed for certain features. The `hf_xet` dependency uses platform markers for conditional installation.
### Options Summary
| Approach | Used By | Pros | Cons |
|----------|---------|------|------|
| **Optional extra** (`package[torch]`) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| **Required dependency** | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| **Lazy imports + graceful error** | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; can't type-check torch-dependent code |
| **Platform-conditional** | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |
### Recommendation for alknet-firewall
**Use optional extras with lazy imports.** This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:
1. **`pip install alknet-firewall`** — minimal install, downloads model at first run, requires torch to already be present
2. **`pip install "alknet-firewall[torch]"`** — installs torch as a dependency
In your code, use lazy imports with a clear error message:
```python
def _require_torch():
    """Import torch lazily so the base package works without it."""
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch
```
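When the code only needs to know whether torch is present (for example to pick a code path or skip a test), it can probe for the module without importing it. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` and the constant `TORCH_AVAILABLE` are illustrative.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Cheap to evaluate at import time: torch itself is not loaded here.
TORCH_AVAILABLE = is_installed("torch")
```

This keeps `import alknet_firewall` fast, since the heavy torch import is deferred until inference is actually requested.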
---
## 2. Model File Distribution
### Size Reality Check: SmolLM2-135M
The SmolLM2-135M model consists of:
- `model.safetensors` — ~269MB (model weights)
- `config.json` — ~700 bytes
- `tokenizer.json` — ~2-4MB
- `tokenizer_config.json` — ~1KB
- `generation_config.json` — ~200 bytes
**Total: ~272MB+**
This is far too large to bundle in a Python package. PyPI's default limits are 100 MB per uploaded file and 10 GB per project, and raising them requires a manual request. Even if it were allowed, a 272MB wheel download is poor UX for anyone who only wants to upgrade the code.
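The ~269MB figure follows from the parameter count and the storage dtype. The arithmetic below is a back-of-the-envelope sketch; the parameter count (about 134.5 million) and the assumption that the weights are stored in bfloat16 are mine, not taken from the model card.

```python
# Bytes used per parameter for common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_size_mb(n_params: int, dtype: str) -> float:
    """Approximate on-disk size of the raw weights in megabytes (10^6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


N_PARAMS = 134_500_000  # assumption: approximate SmolLM2-135M parameter count

print(weight_size_mb(N_PARAMS, "bfloat16"))  # 269.0
print(weight_size_mb(N_PARAMS, "float32"))   # 538.0
```

The same estimate is useful when budgeting memory: loading the weights in float32 roughly doubles the footprint.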
### Distribution Options
| Approach | Feasibility | When to Use |
|----------|-------------|-------------|
| **Bundled in package_data** | ❌ Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| **Runtime download via huggingface_hub** | ✅ **Recommended** | Default approach for any model >10MB |
| **Separate package for model artifacts** | ⚠️ Possible but awkward | When you need offline-first install |
| **Custom download (S3, etc.)** | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |
### Recommended Approach: Runtime Download via huggingface_hub
This is exactly what `transformers` does. The pattern:
```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)
```
### Caching Strategy
`huggingface_hub` handles caching automatically:
- **Default cache location**: `~/.cache/huggingface/hub/`
- **Configurable via**: `HF_HOME`, `HF_HUB_CACHE`, or `cache_dir` parameter
- **Structure**: Content-addressed storage with symlinks (blobs + snapshots)
- **Deduplication**: Same file across revisions → single blob on disk
- **No re-downloads**: Cached files are checked before download
- **Offline mode**: Set `HF_HUB_OFFLINE=1` to skip all network calls
The cache structure:
```
~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│ ├── blobs/ # actual files, named by hash
│ ├── refs/ # branch/tag → commit mappings
│ └── snapshots/ # symlinks to blobs, one per revision
```
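The per-repo folder name in the tree above is derived from the repo id by replacing `/` with `--` and prefixing the repo type. A small sketch of that mapping, useful for diagnostics or tests (the helper name `repo_cache_dir` is hypothetical; `huggingface_hub` computes this internally):

```python
from pathlib import Path


def repo_cache_dir(cache_root: str, repo_id: str, repo_type: str = "model") -> Path:
    """Map a Hub repo id to its folder inside the hub cache directory."""
    folder = f"{repo_type}s--" + repo_id.replace("/", "--")
    return Path(cache_root) / folder


print(repo_cache_dir("/tmp/hub", "HuggingFaceTB/SmolLM2-135M").name)
# models--HuggingFaceTB--SmolLM2-135M
```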
### Pinning Model Versions
To ensure reproducibility, pin the model revision:
```python
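from huggingface_hub import snapshot_download

# Pin to a specific commit hash for reproducibility.
# Illustrative value: replace it with a real commit hash from the model repo.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)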
```
Or pin to a tag if the model has version tags.
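Branch names and tags can move, while a full commit hash cannot. A library could warn when the configured revision is not a full hash. The check below is a sketch; the function name `is_pinned_revision` is hypothetical.

```python
import re

# A full git commit hash is 40 lowercase hexadecimal characters.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True only for full commit hashes, not branches or tags."""
    return _COMMIT_RE.fullmatch(revision) is not None


print(is_pinned_revision("main"))    # False
print(is_pinned_revision("a" * 40))  # True
```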
### Gated Model Authentication
If your model requires authentication (accepting license terms on HF Hub):
1. User sets the `HF_TOKEN` environment variable or logs in via `hf auth login` (`huggingface-cli login` in older `huggingface_hub` releases)
2. `hf_hub_download()` automatically picks up the token
3. Document this requirement clearly
```python
# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)
```
SmolLM2-135M is **not gated** as of this writing, but your own fine-tuned version could be.
---
## 3. Inference-Only Considerations
### CPU-Only PyTorch
**Yes, you can install torch without CUDA.** The official method:
```bash
# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```
**Problem**: You can't express this in the extras of a published package. The CPU-only wheels are served from a different index URL (`https://download.pytorch.org/whl/cpu`), not from PyPI, and dependency metadata cannot name an index. This means:
1. `pip install "alknet-firewall[torch]"` installs whatever torch build PyPI serves for the platform (on Linux that is the CUDA build, ~2GB)
2. To get CPU-only torch, users must do a two-step install:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
```
**Workaround**: Document both installation paths clearly:
```markdown
## Installation
# With CUDA (default torch):
pip install "alknet-firewall[torch]"
# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
```
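For diagnostics it can help to report which torch build a user ended up with. Wheels from the PyTorch index usually carry a local version label such as `+cpu` or `+cu121` in `torch.__version__`. The sketch below parses that label from a version string; it relies on a naming convention rather than a guaranteed API, and the function name is hypothetical. At runtime, `torch.cuda.is_available()` remains the authoritative check.

```python
def torch_build_flavor(version: str) -> str:
    """Classify a torch version string by its local label (the part after '+')."""
    _, _, local = version.partition("+")
    if local == "cpu":
        return "cpu"
    if local.startswith("cu"):
        return "cuda"
    return "unknown"  # no label (e.g. macOS wheels) or another backend


print(torch_build_flavor("2.4.0+cpu"))    # cpu
print(torch_build_flavor("2.4.0+cu121"))  # cuda
print(torch_build_flavor("2.4.0"))        # unknown
```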
### torch.compile() for Faster Inference
`torch.compile()` (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:
```python
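import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id)  # model_id: Hub repo id or local path
model = torch.compile(model)  # JIT compile for faster inference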
```
**Caveats**:
- First run is slow (compilation overhead)
- Best for repeated inference (the compiled model is cached)
- CPU-only works but benefits are smaller than on GPU
- Adds complexity; not worth it for a ~135M model unless latency is critical
**Recommendation**: Make this optional. Don't `torch.compile()` by default — offer it as a performance tuning option.
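One way to make compilation opt-in is a small wrapper that leaves the model untouched unless the caller asks for it. This is a sketch under assumptions: the name `maybe_compile` and its parameters are hypothetical, and `compile_fn` exists only so the helper can be exercised without torch installed.

```python
from typing import Any, Callable, Optional


def maybe_compile(
    model: Any,
    enabled: bool = False,
    compile_fn: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Return the model unchanged unless compilation was explicitly enabled."""
    if not enabled:
        return model
    if compile_fn is None:
        import torch  # imported lazily, only when compilation is requested

        compile_fn = torch.compile
    try:
        return compile_fn(model)
    except Exception:
        # Fall back to eager mode if wrapping fails. Note that torch.compile
        # compiles lazily, so many failures only surface on the first call.
        return model
```

A configuration flag (for example a `compile` option in the firewall settings) could then be passed straight through as `enabled`.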
### torch.export() / TorchDynamo
`torch.export()` (PyTorch 2.1+) produces a portable model artifact:
```python
exported_model = torch.export.export(model, (input_ids,))
```
This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.
### ONNX Runtime as an Alternative
**This is the most compelling alternative to raw PyTorch for inference-only use cases.**
HuggingFace's `optimum` library provides seamless ONNX Runtime integration:
```python
# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)
```
**Benefits**:
- `onnxruntime` package is ~30-50MB vs `torch` at ~200-2000MB+
- ONNX Runtime is optimized for inference (no autograd, no training overhead)
- Often faster inference on CPU than PyTorch
- Cross-platform (CPU, GPU, mobile, edge devices)
**Drawbacks**:
- Need to export model to ONNX format first (one-time step)
- Not all model architectures support ONNX export equally
- Quantization/int8 support varies by architecture
- Adds `onnxruntime` + `optimum` as dependencies (still much smaller than torch)
**Size comparison**:
| Package | Install Size |
|---------|-------------|
| `torch` (CUDA) | ~2.5GB |
| `torch` (CPU only) | ~200MB |
| `onnxruntime` | ~30-50MB |
| `onnxruntime-gpu` | ~500MB |
**Recommendation**: Consider offering ONNX Runtime as an **alternative inference backend** via an extra:
```toml
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40", "accelerate>=1.0"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]"]
```
For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
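With two extras, the library needs a rule for picking a backend at runtime. The sketch below chooses the first preferred backend whose module is installed; the names `BACKEND_MODULES` and `choose_backend` and the default preference order are assumptions for illustration.

```python
from typing import Iterable, Sequence

# Backend name -> top-level module that must be importable for it to work.
BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def choose_backend(
    installed: Iterable[str],
    preference: Sequence[str] = ("torch", "onnx"),
) -> str:
    """Pick the first preferred backend whose module is installed."""
    available = set(installed)
    for name in preference:
        if BACKEND_MODULES[name] in available:
            return name
    raise ImportError(
        "No inference backend found. Install one with "
        "pip install 'alknet-firewall[torch]' or 'alknet-firewall[onnx]'."
    )


print(choose_backend({"onnxruntime"}))           # onnx
print(choose_backend({"torch", "onnxruntime"}))  # torch
```

In practice the `installed` set would come from probing with `importlib.util.find_spec` rather than being passed by hand.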
### Using transformers Without Training Dependencies
`transformers` is already split this way. The base `pip install transformers` does NOT include torch. You need `pip install "transformers[torch]"` to get torch support.
Additional ways to keep transformers lean:
- Don't install `accelerate` unless you need multi-GPU / device_map="auto"
- Don't install training extras (`deepspeed`, `peft`, etc.)
- For inference only, you don't need: `scipy`, `scikit-learn` (from transformers extras), `tensorboard`, etc.
**What transformers needs for basic inference**:
- `torch` (4.x releases also supported `tensorflow` and `flax` backends)
- `safetensors`
- `tokenizers`
- `huggingface-hub`
- `numpy`
- `packaging`
- `pyyaml`
- `regex`
- `tqdm`
---
## 4. sklearn + PyTorch Coexistence
### Compatibility: Generally Fine
sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.
**Potential concerns**:
1. **numpy version**: Both sklearn and torch depend on numpy. torch historically pinned numpy tightly, but recent versions (2.4+) are more flexible. As of 2025-2026:
- torch>=2.4 requires `numpy>=1.17` (no upper bound in practice)
- scikit-learn>=1.5 requires `numpy>=1.19.5`
- These are compatible
2. **Dependency tree size**: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.
3. **BLAS/LAPACK**: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.
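4. **Joblib vs torch parallelism**: sklearn parallelizes through joblib (`n_jobs`) and, for linear algebra such as SVD, through BLAS/OpenMP thread pools; torch uses its own intra-op thread pool. If running sklearn SVD and torch inference in the same process, consider capping thread counts to avoid oversubscription:
```python
import torch
from threadpoolctl import threadpool_limits

torch.set_num_threads(4)  # limit torch intra-op threads

# sklearn's SVD runs on BLAS/OpenMP thread pools; cap them per call
# with threadpoolctl (installed as a scikit-learn dependency)
with threadpool_limits(limits=4):
    ...  # e.g. TruncatedSVD(...).fit(X)
```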
**Recommendation**: No special handling needed. Just include both as dependencies. Set `torch.set_num_threads()` if you notice CPU contention.
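If contention does show up, one simple policy is to split the available cores between the two thread pools. The helper below is a sketch of such a policy; the name `split_thread_budget` and the 50/50 default are assumptions, not measured recommendations.

```python
import os
from typing import Optional, Tuple


def split_thread_budget(
    total: Optional[int] = None, torch_share: float = 0.5
) -> Tuple[int, int]:
    """Split CPU threads between torch and BLAS so they don't oversubscribe."""
    if total is None:
        total = os.cpu_count() or 1
    torch_threads = max(1, int(total * torch_share))
    blas_threads = max(1, total - torch_threads)
    return torch_threads, blas_threads


print(split_thread_budget(8))        # (4, 4)
print(split_thread_budget(8, 0.75))  # (6, 2)
```

The first value would go to `torch.set_num_threads()` and the second to the BLAS thread limit.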
---
## 5. Package Size Optimization
### What to Make Required vs Optional
For alknet-firewall, here's a practical breakdown:
| Component | Required? | Rationale |
|-----------|-----------|-----------|
| `huggingface_hub` | ✅ Required | Model downloading, caching |
| `safetensors` | ✅ Required | Loading model weights |
| `tokenizers` | ✅ Required | Text preprocessing |
| `numpy` | ✅ Required | Tensor operations, sklearn dependency |
| `scikit-learn` | ✅ Required | SVD computations (core feature) |
| `packaging` | ✅ Required | Version comparisons |
| `filelock` | ✅ Required | File locking for cache |
| `tqdm` | ✅ Required | Progress bars |
| `pyyaml` | ✅ Required | Config parsing |
| `torch` | ❌ Optional (extra) | Large; user may already have it |
| `transformers` | ❌ Optional (extra) | Pulls many deps; only for model loading |
| `onnxruntime` | ❌ Optional (extra) | Alternative inference backend |
| `optimum` | ❌ Optional (extra) | ONNX Runtime integration |
### Practical pyproject.toml Structure
```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
"safetensors>=0.4.3",
"tokenizers>=0.20",
"numpy>=1.24",
"scikit-learn>=1.3",
"packaging>=20.0",
"filelock>=3.10",
"tqdm>=4.60",
"pyyaml>=5.1",
]
[project.optional-dependencies]
# Full torch-based inference
torch = [
"torch>=2.4",
"transformers>=4.40",
]
# ONNX Runtime inference (lighter)
onnx = [
"onnxruntime>=1.17",
"optimum[onnxruntime]",
"transformers>=4.40",
]
# Development
dev = [
"pytest>=7",
"ruff>=0.9",
"mypy",
]
```
### Estimated Install Sizes
| Install Command | Download Size | Disk Size |
|----------------|---------------|-----------|
| `pip install alknet-firewall` | ~30MB | ~100MB |
| `pip install "alknet-firewall[torch]"` | ~2GB+ | ~5GB+ |
| `pip install "alknet-firewall[onnx]"` | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |
---
## 6. safetensors Format
### Why safetensors Over PyTorch Pickle
| Property | `.safetensors` | `.pt` / `.bin` (pickle) |
|----------|---------------|------------------------|
| **Security** | ✅ No arbitrary code execution | ❌ Pickle can execute arbitrary code |
| **Speed (CPU)** | ~76x faster than pickle | Baseline |
| **Speed (GPU)** | ~2x faster than pickle | Baseline |
| **Zero-copy** | ✅ Memory-mapped loading | ❌ Extra copies |
| **Lazy loading** | ✅ Load only needed tensors | ❌ Must load entire file |
| **Cross-framework** | ✅ pt, tf, jax, numpy, mlx | ❌ Framework-specific |
| **File size limit** | ✅ No practical limit | ⚠️ Practical limits exist |
| **Layout control** | ✅ Deterministic | ❌ Non-deterministic |
### Security Implications
**Pickle-based `.pt` / `.bin` files are a known security risk.** Loading one with `torch.load()` can execute arbitrary Python code embedded in the file unless `weights_only=True` is in effect, which is the default only since PyTorch 2.6. This is a supply chain attack vector.
`safetensors` eliminates this entirely — the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.
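The layout is simple enough to inspect with the standard library alone: an 8-byte little-endian unsigned integer giving the header length, followed by that many bytes of UTF-8 JSON, followed by the raw tensor data. The reader below is a sketch for inspection only; the function name and the size cap are illustrative choices, and real loading should go through the `safetensors` library.

```python
import json
import struct


def read_safetensors_header(path: str, max_header_bytes: int = 100_000_000) -> dict:
    """Read only the JSON header of a .safetensors file; no tensor data is loaded."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        if header_len > max_header_bytes:  # illustrative sanity cap
            raise ValueError(f"header too large: {header_len} bytes")
        # Next header_len bytes: JSON mapping tensor names to dtype, shape, offsets.
        return json.loads(f.read(header_len).decode("utf-8"))
```

Each entry in the returned mapping carries `dtype`, `shape`, and `data_offsets` fields, so tensor names and shapes can be audited without touching the weights.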
**For a security-focused product (firewall)**, this is critical. You should:
1. **Only load model weights from safetensors format** — never `.pt` or `.bin`
2. **Verify integrity** when downloading models (huggingface_hub names cached blobs by their content hash; combine this with a pinned revision)
3. **Pin model revisions** to specific commit hashes
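The first rule can be enforced in code before anything is loaded. The sketch below scans a model directory, refuses it if pickle-style weight files are present, and returns only the `.safetensors` files; the function name and the suffix list are assumptions for illustration.

```python
from pathlib import Path
from typing import List

# Suffixes commonly used for pickle-based checkpoints (illustrative list).
UNSAFE_SUFFIXES = {".pt", ".pth", ".bin", ".pkl", ".ckpt"}


def find_safe_weight_files(model_dir: str) -> List[Path]:
    """Return the .safetensors files in model_dir, refusing pickle-based weights."""
    root = Path(model_dir)
    unsafe = sorted(p.name for p in root.iterdir() if p.suffix in UNSAFE_SUFFIXES)
    if unsafe:
        raise ValueError(f"Refusing pickle-based weight files: {unsafe}")
    safe = sorted(root.glob("*.safetensors"))
    if not safe:
        raise FileNotFoundError(f"No .safetensors files found in {root}")
    return safe
```

Filtering the download with `allow_patterns=["*.safetensors", ...]` keeps such files out of the cache in the first place; this check is a second line of defense for user-supplied directories.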
### Loading safetensors in Practice
```python
# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")
```
**Recommendation**: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Use `use_safetensors=True` explicitly for safety documentation purposes (it's the default in modern transformers, but being explicit shows intent).
---
## 7. HuggingFace Integration
### How to Depend on huggingface_hub
`huggingface_hub` is lightweight (~15MB installed) and well-maintained. It should be a **required dependency** for any package that downloads models from the Hub.
```toml
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
]
```
The version pin `>=1.5.0,<2.0` follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes.
### Key Features to Use
1. **`hf_hub_download()`** — Download a single file with caching
2. **`snapshot_download()`** — Download an entire repo with caching
3. **`try_to_load_from_cache()`** — Check if a file is already cached (no network call)
4. **Offline mode** — `HF_HUB_OFFLINE=1` or `local_files_only=True`
5. **Authentication**: automatic via the `HF_TOKEN` env var or `hf auth login` (`huggingface-cli login` in older releases)
6. **Filtering** — `allow_patterns` / `ignore_patterns` to download only what's needed
### Download Pattern for alknet-firewall
```python
import os

from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash


def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.

    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path
```
### Caching
`huggingface_hub` caching is automatic and robust:
- **Content-addressed**: Blobs are named by their content hash (SHA-256 for LFS-tracked files such as model weights)
- **Symlink-based**: Multiple revisions share the same blob
- **No redundant downloads**: Already-cached files are never re-downloaded
- **Cache inspection**: `hf cache ls` CLI or `scan_cache_dir()` Python API
- **Cache cleanup**: `hf cache prune` removes unreferenced revisions
You don't need to implement your own caching layer. Just use `huggingface_hub` and let it handle everything.
### Authentication for Gated Models
If your fine-tuned model is gated (requires license acceptance):
```python
# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set the HF_TOKEN environment variable or run: hf auth login
#    (huggingface-cli login in older huggingface_hub releases)
# After that, huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)
```
**Recommendation**: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.
### Environment Variables
Key environment variables your users might need:
| Variable | Purpose | Default |
|----------|---------|---------|
| `HF_HOME` | Root cache directory | `~/.cache/huggingface` |
| `HF_HUB_CACHE` | Specific cache directory for hub files | `$HF_HOME/hub` |
| `HF_HUB_OFFLINE` | Skip all network calls | `0` |
| `HF_TOKEN` | Authentication token | None |
| `HF_HUB_DOWNLOAD_TIMEOUT` | Download timeout in seconds | `10` |
| `TRANSFORMERS_CACHE` | Transformers-specific cache | Deprecated; use `HF_HUB_CACHE` |
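The precedence between the two cache variables in the table can be sketched as a small resolver, handy for logging where the model will land. This is a simplified illustration: the function name is hypothetical, and the real library also takes other settings (such as `XDG_CACHE_HOME`) into account.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the hub cache dir: HF_HUB_CACHE, else $HF_HOME/hub, else the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache({"HF_HOME": "/data/hf"}))       # /data/hf/hub
print(resolve_hub_cache({"HF_HUB_CACHE": "/data/hub"}))  # /data/hub
```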
---
## Summary of Recommendations
### Dependency Strategy
```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
"safetensors>=0.4.3",
"tokenizers>=0.20",
"numpy>=1.24",
"scikit-learn>=1.3",
"packaging>=20.0",
"filelock>=3.10",
"tqdm>=4.60",
"pyyaml>=5.1",
]
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"] # same as torch; document CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]
```
### Model Distribution
- **Runtime download** via `huggingface_hub.snapshot_download()`
- **Cache** in default HF cache (`~/.cache/huggingface/hub/`)
- **Pin model revision** for reproducibility
- **Filter downloads** with `allow_patterns` (skip `.bin`, `.msgpack`, etc.)
- **Support offline mode** via `HF_HUB_OFFLINE` / `local_files_only=True`
### Inference Backend
- **Primary**: PyTorch + transformers (via `[torch]` extra)
- **Alternative**: ONNX Runtime (via `[onnx]` extra) — much smaller footprint
- **CPU-only**: Document two-step install for CPU-only torch
- **Don't torch.compile() by default** — make it opt-in
### Security
- **Only load safetensors format** — never pickle-based `.pt`/`.bin`
- **Verify model provenance** — pin to specific HF revisions
- **Don't bundle model weights** — runtime download with checksums
### Installation Paths (for docs)
```bash
# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"
# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"
# Pre-download model for offline use
alknet-firewall download # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
```
---
## References
- [HuggingFace Transformers setup.py](https://github.com/huggingface/transformers/blob/main/setup.py) — torch as optional extra pattern
- [HuggingFace Safetensors pyproject.toml](https://github.com/huggingface/safetensors/blob/main/bindings/python/pyproject.toml) — chained extras pattern
- [HuggingFace Hub setup.py](https://github.com/huggingface/huggingface_hub/blob/main/setup.py) — minimal core with extras
- [HuggingFace Hub caching docs](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache)
- [HuggingFace Hub download docs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
- [HuggingFace Safetensors docs](https://huggingface.co/docs/safetensors/index)
- [Safetensors speed comparison](https://huggingface.co/docs/safetensors/en/speed) — 76x faster CPU load than pickle
- [HuggingFace Optimum](https://github.com/huggingface/optimum) — ONNX Runtime integration
- [HuggingFace Optimum ONNX quickstart](https://huggingface.co/docs/optimum-onnx/en/quickstart)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime) — cross-platform inference engine
- [PyTorch installation](https://pytorch.org/get-started/locally/) — CPU-only install via `--index-url`
- [Transformers installation docs](https://huggingface.co/docs/transformers/installation) — CPU-only torch install pattern