Research: Packaging Python Libraries with PyTorch Dependencies
Question
How should a Python library (alknet-firewall) be packaged and distributed when it depends on PyTorch/transformers for inference of a ~135M parameter model (SmolLM2-135M), sklearn for SVD computations, and safetensors for model weight loading, while staying lean, pip-installable, and reliable?
1. PyTorch as a Dependency
How Mature ML Packages Handle It
The three major HuggingFace packages each take a different approach:
transformers — Torch as Optional Extra
From setup.py (v5.x), transformers does NOT include torch in install_requires. Instead:
# Hard dependencies (install_requires)
install_requires = [
"huggingface-hub>=1.5.0,<2.0",
"numpy>=1.17",
"packaging>=20.0",
"pyyaml>=5.1",
"regex>=2025.10.22",
"tokenizers>=0.22.0,<=0.23.0",
"safetensors>=0.4.3",
"tqdm>=4.60",
"typer",
]
# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")
Users install with pip install "transformers[torch]". If you just pip install transformers without the extra, you get the library but it will fail at runtime if you try to use torch-dependent code.
Key insight: transformers was designed as a multi-framework library (torch/tf/jax; the v5 line focuses on PyTorch), so making torch optional was a necessity, not just a convenience. It also uses dummy_*.py modules that provide placeholder classes when a framework isn't installed, giving better error messages.
safetensors — Framework-Specific Optional Extras
From pyproject.toml:
[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]
The base safetensors package (no extras) can load files and return raw tensor data (as numpy arrays via the numpy extra). Each framework extra adds the framework-specific save/load functions. The convert extra specifically chains to torch.
Key insight: Safetensors uses a chained extras pattern — torch depends on numpy, so safetensors[torch] pulls both. This is clean and explicit.
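To make the chaining concrete, here is a minimal sketch of how a self-referencing extra flattens into concrete requirements. The `EXTRAS` table copies three entries from the listing above; `resolve_extra` is a hypothetical helper written for illustration and is not how pip resolves extras internally.

```python
import re

# Subset of the extras table shown above, used as illustrative input.
EXTRAS = {
    "numpy": ["numpy>=1.24.6"],
    "torch": ["safetensors[numpy]", "torch>=2.4"],
    "convert": ["safetensors[torch]", "huggingface_hub>=1.4"],
}

# Matches a self-reference such as "safetensors[numpy]".
_SELF_REF = re.compile(r"^safetensors\[([A-Za-z0-9_,-]+)\]$")


def resolve_extra(name, extras=EXTRAS, _seen=None):
    """Flatten one extra into concrete requirements, following self-references."""
    seen = set() if _seen is None else _seen
    if name in seen:  # guard against accidental cycles
        return []
    seen.add(name)
    resolved = []
    for requirement in extras[name]:
        match = _SELF_REF.match(requirement)
        if match:
            for chained in match.group(1).split(","):
                resolved.extend(resolve_extra(chained, extras, seen))
        else:
            resolved.append(requirement)
    return resolved


print(resolve_extra("convert"))
# prints: ['numpy>=1.24.6', 'torch>=2.4', 'huggingface_hub>=1.4']
```

The same pattern would let an alknet-firewall extra reference another of its own extras instead of repeating version pins.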
huggingface_hub — Minimal Core, Framework Extras
From setup.py:
install_requires = [
"click>=8.4.0",
"filelock>=3.10.0",
"fsspec>=2023.5.0",
"hf-xet>=1.5.1,<2.0.0", # conditional on platform
"httpx>=0.23.0, <1",
"packaging>=20.9",
"pyyaml>=5.1",
"tqdm>=4.42.1",
"typer>=0.20.0,<0.26.0",
"typing-extensions>=4.1.0",
]
extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]
Key insight: huggingface_hub is deliberately minimal. Torch is only needed for certain features. The hf_xet dependency uses platform markers for conditional installation.
Options Summary
| Approach | Used By | Pros | Cons |
|---|---|---|---|
| Optional extra (package[torch]) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| Required dependency | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| Lazy imports + graceful error | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; can't type-check torch-dependent code |
| Platform-conditional | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |
Recommendation for alknet-firewall
Use optional extras with lazy imports. This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:
- pip install alknet-firewall: minimal install, downloads model at first run, requires torch to already be present
- pip install "alknet-firewall[torch]": installs torch as a dependency
In your code, use lazy imports with a clear error message:
def _require_torch():
    """Import torch lazily so the package itself imports without it."""
    try:
        import torch
        return torch
    except ImportError as exc:
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
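The same idea generalizes to any optional backend. The sketch below is one possible shape, assuming a hypothetical `require` helper and that each optional module maps to an extra of the same package; it uses only `importlib` from the standard library.

```python
import importlib


def require(module_name, extra):
    """Import module_name lazily, or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        ) from exc


# A stdlib module stands in for an optional dependency that happens to be installed.
json_module = require("json", extra="torch")
print(json_module.dumps({"ok": True}))
```

Calling `require` inside the function that needs the backend, rather than at module import time, keeps `import alknet_firewall` working on a minimal install.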
2. Model File Distribution
Size Reality Check: SmolLM2-135M
The SmolLM2-135M model consists of:
- model.safetensors: ~269MB (model weights)
- config.json: ~700 bytes
- tokenizer.json: ~2-4MB
- tokenizer_config.json: ~1KB
- generation_config.json: ~200 bytes
Total: ~272MB+
This is far too large to bundle in a Python package. PyPI's default limits are 100MB per file and 10GB per project, and raising them requires a manual request. Even if it were allowed, a 272MB wheel download is terrible UX.
Distribution Options
| Approach | Feasibility | When to Use |
|---|---|---|
| Bundled in package_data | ❌ Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| Runtime download via huggingface_hub | ✅ Recommended | Default approach for any model >10MB |
| Separate package for model artifacts | ⚠️ Possible but awkward | When you need offline-first install |
| Custom download (S3, etc.) | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |
Recommended Approach: Runtime Download via huggingface_hub
This is exactly what transformers does. The pattern:
from huggingface_hub import hf_hub_download, snapshot_download
# Download entire model (with caching)
model_path = snapshot_download(
repo_id="HuggingFaceTB/SmolLM2-135M",
allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
# Users can set HF_HOME or HF_HUB_CACHE to control cache location
)
# Or download individual files
safetensors_path = hf_hub_download(
repo_id="HuggingFaceTB/SmolLM2-135M",
filename="model.safetensors",
)
Caching Strategy
huggingface_hub handles caching automatically:
- Default cache location: ~/.cache/huggingface/hub/
- Configurable via: HF_HOME, HF_HUB_CACHE, or cache_dir parameter
- Structure: Content-addressed storage with symlinks (blobs + snapshots)
- Deduplication: Same file across revisions → single blob on disk
- No re-downloads: Cached files are checked before download
- Offline mode: Set HF_HUB_OFFLINE=1 to skip all network calls
The cache structure:
~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│ ├── blobs/ # actual files, named by hash
│ ├── refs/ # branch/tag → commit mappings
│ └── snapshots/ # symlinks to blobs, one per revision
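The top-level folder name in that tree follows directly from the repo id. Here is a small sketch that reproduces the naming shown above; `repo_cache_folder` is a hypothetical helper written for this report, useful mainly for locating or asserting on cache entries in tests.

```python
def repo_cache_folder(repo_id, repo_type="model"):
    """Return the cache folder name for a Hub repo, e.g. models--org--name."""
    parts = [f"{repo_type}s", *repo_id.split("/")]
    return "--".join(parts)


print(repo_cache_folder("HuggingFaceTB/SmolLM2-135M"))
# prints: models--HuggingFaceTB--SmolLM2-135M
```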
Pinning Model Versions
To ensure reproducibility, pin the model revision:
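# Pin to a specific commit hash for reproducibility.
# The value below is a placeholder: copy the real 40-character SHA from the model repo's commit history.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"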
model_path = snapshot_download(
repo_id="HuggingFaceTB/SmolLM2-135M",
revision=MODEL_REVISION,
)
Or pin to a tag if the model has version tags.
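A branch name such as "main" moves over time, so it can be worth rejecting unpinned revisions in release builds. The check below is a minimal sketch, assuming that a pinned revision means a full 40-character hexadecimal commit SHA; `is_pinned_revision` is a hypothetical helper, and tags would need a separate allow-list.

```python
import re

# A full git commit SHA-1: exactly 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision):
    """Return True only for a full commit SHA, not a branch or tag name."""
    return bool(_FULL_SHA.fullmatch(revision))


print(is_pinned_revision("main"))      # prints: False
print(is_pinned_revision("a" * 40))    # prints: True
```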
Gated Model Authentication
If your model requires authentication (accepting license terms on HF Hub):
- User sets HF_TOKEN environment variable or logs in via hf auth login (huggingface-cli login in older releases)
- hf_hub_download() automatically picks up the token
- Document this requirement clearly
# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
repo_id="YourOrg/YourGatedModel",
token=True, # explicitly use stored token
)
SmolLM2-135M is not gated as of this writing, but your own fine-tuned version could be.
3. Inference-Only Considerations
CPU-Only PyTorch
Yes, you can install torch without CUDA. The official method:
# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu
Problem: You can't express this in pyproject.toml extras. The CPU-only torch is served from a different index URL (https://download.pytorch.org/whl/cpu), not from PyPI. This means:
- pip install "alknet-firewall[torch]" installs the default torch wheel from PyPI, which on Linux is the CUDA build (~2GB)
- To get CPU-only torch, users must do a two-step install:
  pip install torch --index-url https://download.pytorch.org/whl/cpu
  pip install alknet-firewall
Workaround: Document both installation paths clearly:
## Installation
# With CUDA (default torch):
pip install "alknet-firewall[torch]"
# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
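Because the two install paths produce different torch builds, a diagnostic that reports which one is present can help with support requests. This is a sketch under stated assumptions: `torch_build` is a hypothetical helper, and it relies on `torch.version.cuda` being `None` for CPU-only builds (a ROCm build would also be reported as "cpu" here).

```python
import importlib.util


def torch_build():
    """Return 'missing', 'cpu', or 'cuda' for the torch build visible to this interpreter."""
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch  # imported lazily so this module loads without torch

    # torch.version.cuda is None on builds compiled without CUDA support.
    return "cpu" if torch.version.cuda is None else "cuda"


print(torch_build())
```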
torch.compile() for Faster Inference
torch.compile() (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.compile(model)  # JIT compile for faster inference
Caveats:
- First run is slow (compilation overhead)
- Best for repeated inference (the compiled model is cached)
- CPU-only works but benefits are smaller than on GPU
- Adds complexity; not worth it for a ~135M model unless latency is critical
Recommendation: Make this optional. Don't torch.compile() by default — offer it as a performance tuning option.
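One way to keep compilation opt-in is a thin wrapper that leaves the model untouched unless the caller asks for it. The helper name `maybe_compile` and its `enabled` flag are assumptions for illustration; only `torch.compile` itself comes from PyTorch.

```python
def maybe_compile(model, enabled=False):
    """Return torch.compile(model) only when the caller opts in; otherwise the model unchanged."""
    if not enabled:
        return model
    import torch  # lazy: only needed on the opt-in path

    return torch.compile(model)


# Default path: nothing is compiled and torch is never imported.
placeholder_model = object()
assert maybe_compile(placeholder_model) is placeholder_model
```

A configuration option could feed the `enabled` flag so that latency-sensitive deployments can turn it on without code changes.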
torch.export() / TorchDynamo
torch.export() (PyTorch 2.1+) produces a portable model artifact:
exported_model = torch.export.export(model, (input_ids,))
This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.
ONNX Runtime as an Alternative
This is the most compelling alternative to raw PyTorch for inference-only use cases.
HuggingFace's optimum library provides seamless ONNX Runtime integration:
# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)
Benefits:
- onnxruntime package is ~30-50MB vs torch at ~200-2000MB+
- ONNX Runtime is optimized for inference (no autograd, no training overhead)
- Often faster inference on CPU than PyTorch
- Cross-platform (CPU, GPU, mobile, edge devices)
Drawbacks:
- Need to export model to ONNX format first (one-time step)
- Not all model architectures support ONNX export equally
- Quantization/int8 support varies by architecture
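- Adds onnxruntime + optimum as dependencies. Check the resolved dependency tree before counting on the size savings: optimum has historically declared torch as one of its own requirements, in which case only a direct onnxruntime + tokenizers path stays torch-free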
Size comparison:
| Package | Install Size |
|---|---|
| torch (CUDA) | ~2.5GB |
| torch (CPU only) | ~200MB |
| onnxruntime | ~30-50MB |
| onnxruntime-gpu | ~500MB |
Recommendation: Consider offering ONNX Runtime as an alternative inference backend via an extra:
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40", "accelerate>=1.0"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]"]
For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
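With two extras, the library needs some way to decide which backend to use at runtime. The following is a sketch only: the backend names mirror the extras above, while `available_backends`, `select_backend`, and the torch-first preference order are assumptions made for illustration.

```python
import importlib.util

# Extra name -> importable module that signals the backend is installed.
# The ordering (torch first) is an assumed preference, not a project decision.
BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def available_backends():
    """List backends whose marker module can be found, without importing it."""
    return [
        name
        for name, module in BACKEND_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


def select_backend(preferred=None, available=None):
    """Pick the preferred backend if installed, else the first available one."""
    available = available_backends() if available is None else available
    if preferred is not None:
        if preferred not in available:
            raise RuntimeError(
                f"Backend {preferred!r} is not installed. "
                f"Install it with: pip install 'alknet-firewall[{preferred}]'"
            )
        return preferred
    if not available:
        raise RuntimeError(
            "No inference backend installed. "
            "Install one with: pip install 'alknet-firewall[torch]'"
        )
    return available[0]


print(select_backend(available=["onnx"]))  # prints: onnx
```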
Using transformers Without Training Dependencies
transformers is already split this way. The base pip install transformers does NOT include torch. You need pip install "transformers[torch]" to get torch support.
Additional ways to keep transformers lean:
- Don't install accelerate unless you need multi-GPU / device_map="auto"
- Don't install training extras (deepspeed, peft, etc.)
- For inference only, you don't need: scipy, scikit-learn (from transformers extras), tensorboard, etc.
What transformers needs for basic inference:
- torch (or tensorflow, or flax)
- safetensors
- tokenizers
- huggingface-hub
- numpy
- packaging
- pyyaml
- regex
- tqdm
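A quick way to confirm that an environment has this inference set is to ask the installed package metadata. This is a minimal sketch using the standard library's `importlib.metadata`; `missing_distributions` is a hypothetical helper, and the names passed in are distribution names as pip knows them.

```python
from importlib import metadata


def missing_distributions(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


inference_set = ["torch", "safetensors", "tokenizers", "huggingface-hub", "numpy"]
print(missing_distributions(inference_set))
```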
4. sklearn + PyTorch Coexistence
Compatibility: Generally Fine
sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.
Potential concerns:
- numpy version: Both sklearn and torch depend on numpy. torch historically pinned numpy tightly, but recent versions (2.4+) are more flexible. As of 2025-2026:
  - torch>=2.4 requires numpy>=1.17 (no upper bound in practice)
  - scikit-learn>=1.5 requires numpy>=1.19.5
  - These are compatible
- Dependency tree size: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.
- BLAS/LAPACK: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.
- Joblib vs torch parallelism: sklearn uses joblib for parallelism; torch uses its own threading. If running sklearn SVD and torch inference in the same process, consider setting thread counts to avoid oversubscription:

  import torch
  torch.set_num_threads(4)  # limit torch intra-op threads
  # scikit-learn: joblib workers are set per call via n_jobs; native BLAS/OpenMP
  # threads follow OMP_NUM_THREADS or threadpoolctl.threadpool_limits()
Recommendation: No special handling needed. Just include both as dependencies. Set torch.set_num_threads() if you notice CPU contention.
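If contention does show up, one simple policy is to split the available cores between the two libraries. The sketch below only computes the split; `split_thread_budget` and the 50/50 default are assumptions for illustration, and the comments show where the numbers would be applied.

```python
import os


def split_thread_budget(total=None, torch_share=0.5):
    """Split a core budget into (torch_threads, other_threads), each at least 1."""
    total = total or os.cpu_count() or 1
    torch_threads = max(1, int(total * torch_share))
    other_threads = max(1, total - torch_threads)
    return torch_threads, other_threads


torch_threads, blas_threads = split_thread_budget(8)
print(torch_threads, blas_threads)  # prints: 4 4

# Applying the budget (requires the optional packages, so shown as comments):
#   torch.set_num_threads(torch_threads)
#   with threadpoolctl.threadpool_limits(limits=blas_threads):
#       ...run the SVD step...
```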
5. Package Size Optimization
What to Make Required vs Optional
For alknet-firewall, here's a practical breakdown:
| Component | Required? | Rationale |
|---|---|---|
| huggingface_hub | ✅ Required | Model downloading, caching |
| safetensors | ✅ Required | Loading model weights |
| tokenizers | ✅ Required | Text preprocessing |
| numpy | ✅ Required | Tensor operations, sklearn dependency |
| scikit-learn | ✅ Required | SVD computations (core feature) |
| packaging | ✅ Required | Version comparisons |
| filelock | ✅ Required | File locking for cache |
| tqdm | ✅ Required | Progress bars |
| pyyaml | ✅ Required | Config parsing |
| torch | ❌ Optional (extra) | Large; user may already have it |
| transformers | ❌ Optional (extra) | Pulls many deps; only for model loading |
| onnxruntime | ❌ Optional (extra) | Alternative inference backend |
| optimum | ❌ Optional (extra) | ONNX Runtime integration |
Practical pyproject.toml Structure
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
"safetensors>=0.4.3",
"tokenizers>=0.20",
"numpy>=1.24",
"scikit-learn>=1.3",
"packaging>=20.0",
"filelock>=3.10",
"tqdm>=4.60",
"pyyaml>=5.1",
]
[project.optional-dependencies]
# Full torch-based inference
torch = [
"torch>=2.4",
"transformers>=4.40",
]
# ONNX Runtime inference (lighter)
onnx = [
"onnxruntime>=1.17",
"optimum[onnxruntime]",
"transformers>=4.40",
]
# Development
dev = [
"pytest>=7",
"ruff>=0.9",
"mypy",
]
Estimated Install Sizes
| Install Command | Download Size | Disk Size |
|---|---|---|
| pip install alknet-firewall | ~30MB | ~100MB |
| pip install "alknet-firewall[torch]" | ~2GB+ | ~5GB+ |
| pip install "alknet-firewall[onnx]" | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |
6. safetensors Format
Why safetensors Over PyTorch Pickle
| Property | .safetensors | .pt / .bin (pickle) |
|---|---|---|
| Security | ✅ No arbitrary code execution | ❌ Pickle can execute arbitrary code |
| Speed (CPU) | ~76x faster than pickle | Baseline |
| Speed (GPU) | ~2x faster than pickle | Baseline |
| Zero-copy | ✅ Memory-mapped loading | ❌ Extra copies |
| Lazy loading | ✅ Load only needed tensors | ❌ Must load entire file |
| Cross-framework | ✅ pt, tf, jax, numpy, mlx | ❌ Framework-specific |
| File size limit | ✅ No practical limit | ⚠️ Practical limits exist |
| Layout control | ✅ Deterministic | ❌ Non-deterministic |
Security Implications
Pickle-based .pt / .bin files are a known security risk. Loading a .pt file with torch.load() can execute arbitrary Python code embedded in the file unless loading is restricted with weights_only=True (the default since PyTorch 2.6). This is a supply chain attack vector.
safetensors eliminates this entirely — the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.
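The layout is simple enough to inspect with the standard library alone: an 8-byte little-endian unsigned integer giving the header length, then that many bytes of UTF-8 JSON, then the raw tensor bytes. The sketch below parses only the header; `read_safetensors_header` is a hypothetical helper, and the demo file is built by hand so the example runs without safetensors installed.

```python
import json
import struct
import tempfile


def read_safetensors_header(path):
    """Parse only the JSON header of a .safetensors file; tensor bytes are never read."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len).decode("utf-8"))


# Build a tiny hand-made file: one 2x2 float32 tensor of zeros (16 bytes of data).
header = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as tmp:
    tmp.write(struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(16))
    demo_path = tmp.name

print(read_safetensors_header(demo_path)["weight"]["shape"])  # prints: [2, 2]
```

Nothing in this path deserializes objects, which is the property the security argument rests on.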
For a security-focused product (firewall), this is critical. You should:
- Only load model weights from safetensors format, never .pt or .bin
- Verify checksums when downloading models (huggingface_hub does this automatically)
- Pin model revisions to specific commit hashes
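For an independent check on top of what the Hub client does, a file can be hashed locally and compared against a digest recorded at release time. This is a minimal sketch using `hashlib`; `sha256_file`, `verify_file`, and the idea of shipping an expected digest with the package are assumptions for illustration.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_sha256):
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_file(path) == expected_sha256.lower()
```

A caller would run `verify_file(weights_path, EXPECTED_DIGEST)` before loading and refuse to proceed on a mismatch; both names in that call are placeholders.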
Loading safetensors in Practice
# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)
# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)
# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")
Recommendation: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Use use_safetensors=True explicitly for safety documentation purposes (it's the default in modern transformers, but being explicit shows intent).
7. HuggingFace Integration
How to Depend on huggingface_hub
huggingface_hub is lightweight (~15MB installed) and well-maintained. It should be a required dependency for any package that downloads models from the Hub.
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
]
The version pin >=1.5.0,<2.0 follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes.
Key Features to Use
- hf_hub_download(): Download a single file with caching
- snapshot_download(): Download an entire repo with caching
- try_to_load_from_cache(): Check if a file is already cached (no network call)
- Offline mode: HF_HUB_OFFLINE=1 or local_files_only=True
- Authentication: Automatic via HF_TOKEN env var or hf auth login (huggingface-cli login in older releases)
- Filtering: allow_patterns / ignore_patterns to download only what's needed
Download Pattern for alknet-firewall
import os
from huggingface_hub import snapshot_download, try_to_load_from_cache

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash


def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.

    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path
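The offline check above only recognizes the literal string "1". Users often write such flags as "true" or "yes", so a more tolerant parser may be worth having. The accepted spellings below are an assumption about what users will type, not a statement of what huggingface_hub accepts; `is_offline` is a hypothetical helper.

```python
import os

# Spellings treated as "on"; compared case-insensitively. Assumed for illustration.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def is_offline(environ=None):
    """Return True when HF_HUB_OFFLINE is set to a truthy spelling."""
    environ = os.environ if environ is None else environ
    return environ.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


print(is_offline({"HF_HUB_OFFLINE": "true"}))  # prints: True
print(is_offline({}))                          # prints: False
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.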
Caching
huggingface_hub caching is automatic and robust:
- Content-addressed: Files are stored by SHA256 hash
- Symlink-based: Multiple revisions share the same blob
- No redundant downloads: Already-cached files are never re-downloaded
- Cache inspection: hf cache ls CLI or scan_cache_dir() Python API
- Cache cleanup: hf cache prune removes unreferenced revisions
You don't need to implement your own caching layer. Just use huggingface_hub and let it handle everything.
Authentication for Gated Models
If your fine-tuned model is gated (requires license acceptance):
# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set HF_TOKEN environment variable or run: hf auth login (huggingface-cli login in older releases)
# Your code just works — huggingface_hub reads the token automatically
model_path = snapshot_download(
repo_id="YourOrg/GatedModel",
token=True, # explicitly use stored token
)
Recommendation: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.
Environment Variables
Key environment variables your users might need:
| Variable | Purpose | Default |
|---|---|---|
| HF_HOME | Root cache directory | ~/.cache/huggingface |
| HF_HUB_CACHE | Specific cache directory for hub files | $HF_HOME/hub |
| HF_HUB_OFFLINE | Skip all network calls | 0 |
| HF_TOKEN | Authentication token | None |
| HF_HUB_DOWNLOAD_TIMEOUT | Download timeout in seconds | 10 |
| TRANSFORMERS_CACHE | Transformers-specific cache | Deprecated; use HF_HUB_CACHE |
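The defaults in the table imply a precedence order for the cache location: HF_HUB_CACHE wins, then HF_HOME with a hub subfolder, then the home-directory default. The sketch below encodes just that reading of the table for diagnostics or log messages; `resolve_hub_cache` is a hypothetical helper and ignores any other variables the real client may consult.

```python
import os
from pathlib import Path


def resolve_hub_cache(environ=None):
    """Resolve the hub cache directory from HF_HUB_CACHE, HF_HOME, or the default."""
    environ = os.environ if environ is None else environ
    if environ.get("HF_HUB_CACHE"):
        return Path(environ["HF_HUB_CACHE"]).expanduser()
    hf_home = environ.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
```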
Summary of Recommendations
Dependency Strategy
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
"huggingface-hub>=1.5.0,<2.0",
"safetensors>=0.4.3",
"tokenizers>=0.20",
"numpy>=1.24",
"scikit-learn>=1.3",
"packaging>=20.0",
"filelock>=3.10",
"tqdm>=4.60",
"pyyaml>=5.1",
]
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"] # same as torch; document CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]
Model Distribution
- Runtime download via huggingface_hub.snapshot_download()
- Cache in default HF cache (~/.cache/huggingface/hub/)
- Pin model revision for reproducibility
- Filter downloads with allow_patterns (skip .bin, .msgpack, etc.)
- Support offline mode via HF_HUB_OFFLINE / local_files_only=True
Inference Backend
- Primary: PyTorch + transformers (via [torch] extra)
- Alternative: ONNX Runtime (via [onnx] extra), with a much smaller footprint
- CPU-only: Document two-step install for CPU-only torch
- Don't torch.compile() by default; make it opt-in
Security
- Only load safetensors format: never pickle-based .pt/.bin
- Verify model provenance: pin to specific HF revisions
- Don't bundle model weights: runtime download with checksums
Installation Paths (for docs)
# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"
# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"
# Pre-download model for offline use
alknet-firewall download # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
References
- HuggingFace Transformers setup.py: torch as optional extra pattern
- HuggingFace Safetensors pyproject.toml: chained extras pattern
- HuggingFace Hub setup.py: minimal core with extras
- HuggingFace Hub caching docs
- HuggingFace Hub download docs
- HuggingFace Safetensors docs
- Safetensors speed comparison: 76x faster CPU load than pickle
- HuggingFace Optimum: ONNX Runtime integration
- HuggingFace Optimum ONNX quickstart
- ONNX Runtime: cross-platform inference engine
- PyTorch installation: CPU-only install via --index-url
- Transformers installation docs: CPU-only torch install pattern