# Research: Packaging Python Libraries with PyTorch Dependencies

## Question

How to package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference of a ~135M parameter model (SmolLM2-135M), sklearn for SVD computations, and safetensors for model weight loading, while keeping the package lean, pip-installable, and reliable.

---

## 1. PyTorch as a Dependency

### How Mature ML Packages Handle It

The three major HuggingFace packages all keep torch out of their required dependencies, each with a slightly different layout:

#### `transformers`: Torch as Optional Extra

From `setup.py` (v5.x), `transformers` does **NOT** include `torch` in `install_requires`. Instead:

```python
# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")
```

Users install with `pip install "transformers[torch]"`. If you run a plain `pip install transformers` without the extra, the library installs and imports, but torch-dependent code fails at runtime.

**Key insight**: `transformers` grew up as a multi-framework library (torch/tf/jax), so making torch optional was a necessity, not just a convenience. It also uses `dummy_*.py` modules that provide placeholder classes when a framework isn't installed, giving better error messages.
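The placeholder-class idea described above can be sketched in a few lines. This is a simplified illustration of the pattern, not the actual `transformers` implementation; `_MissingBackend`, `make_placeholder`, `FirewallModel` and the install hint are hypothetical names chosen for this example.

```python
import importlib.util


def backend_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found (without importing it)."""
    return importlib.util.find_spec(name) is not None


class _MissingBackend:
    """Placeholder base class: instantiating it raises a helpful ImportError."""

    _backend = "torch"
    _hint = "pip install 'alknet-firewall[torch]'"

    def __init__(self, *args, **kwargs):
        raise ImportError(
            f"{type(self).__name__} requires {self._backend}, which is not installed. "
            f"Install it with: {self._hint}"
        )


def make_placeholder(class_name: str, backend: str, hint: str) -> type:
    """Build a placeholder class that stands in for a backend-dependent class."""
    return type(class_name, (_MissingBackend,), {"_backend": backend, "_hint": hint})


# Hypothetical usage: fall back to the placeholder only when torch is missing.
if not backend_available("torch"):
    FirewallModel = make_placeholder(
        "FirewallModel", "torch", "pip install 'alknet-firewall[torch]'"
    )
```

With this layout, `import alknet_firewall` would succeed without torch, and the error would appear only when a user actually tries to construct the torch-backed class.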
#### `safetensors`: Framework-Specific Optional Extras

From `pyproject.toml`:

```toml
[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]
```

The base `safetensors` package (no extras) can load files and return raw tensor data (as numpy arrays via the `numpy` extra). Each framework extra adds the framework-specific save/load functions. The `convert` extra specifically chains to `torch`.

**Key insight**: Safetensors uses a **chained extras** pattern: `torch` depends on `numpy`, so `safetensors[torch]` pulls both. This is clean and explicit.

#### `huggingface_hub`: Minimal Core, Framework Extras

From `setup.py`:

```python
install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]
```

**Key insight**: `huggingface_hub` is deliberately minimal. Torch is only needed for certain features. The `hf_xet` dependency uses platform markers for conditional installation.
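A library that offers several framework extras, as the three packages above do, usually has to discover at runtime which of them the user installed. The following is a minimal sketch of that check, assuming alknet-firewall ends up with a `torch` and an `onnx` extra; `BACKENDS`, `select_backend` and the injected `is_installed` probe are hypothetical names, not part of any existing API.

```python
import importlib.util
from typing import Callable, Optional

# Hypothetical mapping: backend name -> (module probed, extra that provides it).
BACKENDS = {
    "torch": ("torch", "alknet-firewall[torch]"),
    "onnx": ("onnxruntime", "alknet-firewall[onnx]"),
}


def _module_installed(module: str) -> bool:
    """Check for a top-level module without importing it."""
    return importlib.util.find_spec(module) is not None


def select_backend(
    preferred: Optional[str] = None,
    is_installed: Callable[[str], bool] = _module_installed,
) -> str:
    """Return the first usable backend name, restricted to `preferred` if given."""
    if preferred is not None:
        if preferred not in BACKENDS:
            raise ValueError(
                f"Unknown backend {preferred!r}; choose from {sorted(BACKENDS)}"
            )
        candidates = [preferred]
    else:
        candidates = list(BACKENDS)  # dict order: torch first, then onnx

    for name in candidates:
        module, _extra = BACKENDS[name]
        if is_installed(module):
            return name

    hints = " or ".join(f"pip install '{BACKENDS[n][1]}'" for n in candidates)
    raise ImportError(f"No inference backend is installed. Try: {hints}")
```

Passing the probe in as a parameter keeps the selection logic testable on machines where neither backend is installed.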
### Options Summary

| Approach | Used By | Pros | Cons |
|----------|---------|------|------|
| **Optional extra** (`package[torch]`) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| **Required dependency** | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| **Lazy imports + graceful error** | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; type-checking torch-dependent code needs extra guards |
| **Platform-conditional** | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |

### Recommendation for alknet-firewall

**Use optional extras with lazy imports.** This is the dominant pattern in the HuggingFace ecosystem.

Since this project specifically needs torch for inference (it's the core function), you have two sub-options:

1. **`pip install alknet-firewall`**: minimal install, downloads model at first run, requires torch to already be present
2. **`pip install "alknet-firewall[torch]"`**: installs torch as a dependency

In your code, use lazy imports with a clear error message:

```python
def _require_torch():
    """Import torch lazily and fail with an actionable message if it is missing."""
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch
```

---

## 2. Model File Distribution

### Size Reality Check: SmolLM2-135M

The SmolLM2-135M model consists of:

- `model.safetensors`: ~269MB (model weights)
- `config.json`: ~700 bytes
- `tokenizer.json`: ~2-4MB
- `tokenizer_config.json`: ~1KB
- `generation_config.json`: ~200 bytes

**Total: ~272MB+**

This is far too large to bundle in a Python package.
PyPI's default limits are 100 MB per uploaded file and 10 GB per project, and raising them requires a manual request. Even if a larger upload were approved, a 272MB wheel download is terrible UX.

### Distribution Options

| Approach | Feasibility | When to Use |
|----------|-------------|-------------|
| **Bundled in package_data** | ❌ Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| **Runtime download via huggingface_hub** | ✅ **Recommended** | Default approach for any model >10MB |
| **Separate package for model artifacts** | ⚠️ Possible but awkward | When you need offline-first install |
| **Custom download (S3, etc.)** | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |

### Recommended Approach: Runtime Download via huggingface_hub

This is exactly what `transformers` does. The pattern:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)
```

### Caching Strategy

`huggingface_hub` handles caching automatically:

- **Default cache location**: `~/.cache/huggingface/hub/`
- **Configurable via**: `HF_HOME`, `HF_HUB_CACHE`, or `cache_dir` parameter
- **Structure**: Content-addressed storage with symlinks (blobs + snapshots)
- **Deduplication**: Same file across revisions → single blob on disk
- **No re-downloads**: Cached files are checked before download
- **Offline mode**: Set `HF_HUB_OFFLINE=1` to skip all network calls

The cache structure:

```
~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│   ├── blobs/        # actual files, named by hash
│   ├── refs/         # branch/tag → commit mappings
│   └── snapshots/    # symlinks to blobs, one per revision
```

### Pinning Model Versions

To ensure
reproducibility, pin the model revision:

```python
from huggingface_hub import snapshot_download

# Pin to a specific commit hash for reproducibility.
# The value below is a placeholder; copy the real 40-character SHA from the model repo.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"

model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)
```

Or pin to a tag if the model has version tags.

### Gated Model Authentication

If your model requires authentication (accepting license terms on HF Hub):

1. User sets the `HF_TOKEN` environment variable or logs in via `hf auth login` (`huggingface-cli login` in older `huggingface_hub` releases)
2. `hf_hub_download()` automatically picks up the token
3. Document this requirement clearly

```python
from huggingface_hub import snapshot_download

# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)
```

SmolLM2-135M is **not gated** as of this writing, but your own fine-tuned version could be.

---

## 3. Inference-Only Considerations

### CPU-Only PyTorch

**Yes, you can install torch without CUDA.** The official method:

```bash
# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

**Problem**: You can't express this in `pyproject.toml` extras. The CPU-only torch is served from a different index URL (`https://download.pytorch.org/whl/cpu`), not from PyPI. This means:

1. `pip install "alknet-firewall[torch]"` will install the default torch from PyPI (on Linux, the CUDA build at ~2GB)
2. To get CPU-only torch, users must do a two-step install:

   ```bash
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   pip install alknet-firewall
   ```

**Workaround**: Document both installation paths clearly:

```markdown
## Installation

# With CUDA (default torch):
pip install "alknet-firewall[torch]"

# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
```

### torch.compile() for Faster Inference

`torch.compile()` (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.compile(model)  # JIT compile for faster inference
```

**Caveats**:

- First run is slow (compilation overhead)
- Best for repeated inference (the compiled graph is reused across calls in the same process)
- CPU-only works but benefits are smaller than on GPU
- Adds complexity; not worth it for a ~135M model unless latency is critical

**Recommendation**: Make this optional. Don't `torch.compile()` by default; offer it as a performance tuning option.

### torch.export() / TorchDynamo

`torch.export()` (PyTorch 2.1+) produces a portable model artifact:

```python
exported_model = torch.export.export(model, (input_ids,))
```

This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.
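Returning to the recommendation that `torch.compile()` stay opt-in: one way to wire that up is an environment-variable switch that leaves the default path in eager mode. This is only a sketch under stated assumptions; the variable name `ALKNET_TORCH_COMPILE` and the helpers `compile_requested` and `maybe_compile` are hypothetical and do not exist in any library.

```python
import os
from typing import Any, Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def compile_requested(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the (hypothetical) ALKNET_TORCH_COMPILE flag is set."""
    env = os.environ if env is None else env
    return env.get("ALKNET_TORCH_COMPILE", "").strip().lower() in _TRUTHY


def maybe_compile(model: Any, env: Optional[Mapping[str, str]] = None) -> Any:
    """Wrap `model` with torch.compile() only when the user opted in."""
    if not compile_requested(env):
        return model  # default path: eager mode, no compilation overhead

    import torch  # imported lazily so the flag check works without torch

    return torch.compile(model)
```

A user who cares about steady-state latency would then run with `ALKNET_TORCH_COMPILE=1`, while everyone else avoids the slow first call.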
### ONNX Runtime as an Alternative

**This is the most compelling alternative to raw PyTorch for inference-only use cases.**

HuggingFace's `optimum` library provides seamless ONNX Runtime integration:

```python
# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)
```

**Benefits**:

- `onnxruntime` package is ~30-50MB vs `torch` at ~200-2000MB+
- ONNX Runtime is optimized for inference (no autograd, no training overhead)
- Often faster inference on CPU than PyTorch
- Cross-platform (CPU, GPU, mobile, edge devices)

**Drawbacks**:

- Need to export model to ONNX format first (one-time step)
- Not all model architectures support ONNX export equally
- Quantization/int8 support varies by architecture
- Adds `onnxruntime` + `optimum` as dependencies (still much smaller than torch)

**Size comparison**:

| Package | Install Size |
|---------|-------------|
| `torch` (CUDA) | ~2.5GB |
| `torch` (CPU only) | ~200MB |
| `onnxruntime` | ~30-50MB |
| `onnxruntime-gpu` | ~500MB |

**Recommendation**: Consider offering ONNX Runtime as an **alternative inference backend** via an extra:

```toml
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]"]
```

For a ~135M parameter model, ONNX Runtime on CPU should be fast enough for most uses, though it is worth benchmarking against eager PyTorch on your own inputs before committing.

### Using transformers Without Training Dependencies

`transformers` is already split this way. The base `pip install transformers` does NOT include torch. You need `pip install "transformers[torch]"` to get torch support.

Additional ways to keep transformers lean:

- Don't install `accelerate` unless you need multi-GPU / device_map="auto"
- Don't install training extras (`deepspeed`, `peft`, etc.)
- For inference only, you don't need: `scipy`, `scikit-learn` (from transformers extras), `tensorboard`, etc.

**What transformers needs for basic inference**:

- `torch` (older releases also accepted `tensorflow` or `flax`)
- `safetensors`
- `tokenizers`
- `huggingface-hub`
- `numpy`
- `packaging`
- `pyyaml`
- `regex`
- `tqdm`

---

## 4. sklearn + PyTorch Coexistence

### Compatibility: Generally Fine

sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.

**Potential concerns**:

1. **numpy version**: scikit-learn declares numpy as a hard requirement, while torch uses numpy when it is present but, as far as its published wheel metadata shows, does not pin it. As of 2025-2026:
   - scikit-learn>=1.5 requires `numpy>=1.19.5`
   - torch>=2.4 works with both NumPy 1.x and 2.x
   - The `numpy>=1.24` floor used in the `pyproject.toml` below satisfies both

2. **Dependency tree size**: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.

3. **BLAS/LAPACK**: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.

4. **Thread oversubscription**: sklearn uses joblib for explicit parallelism, and its SVD routines run on BLAS/OpenMP threads; torch uses its own thread pool. If running sklearn SVD and torch inference in the same process, consider capping thread counts:

   ```python
   import torch
   from threadpoolctl import threadpool_limits  # installed with scikit-learn

   torch.set_num_threads(4)  # limit torch intra-op threads
   with threadpool_limits(limits=4):  # cap BLAS/OpenMP threads used by sklearn
       ...  # run the SVD here
   ```

**Recommendation**: No special handling needed. Just include both as dependencies. Set `torch.set_num_threads()` if you notice CPU contention.

---

## 5. Package Size Optimization

### What to Make Required vs Optional

For alknet-firewall, here's a practical breakdown:

| Component | Required? | Rationale |
|-----------|-----------|-----------|
| `huggingface_hub` | ✅ Required | Model downloading, caching |
| `safetensors` | ✅ Required | Loading model weights |
| `tokenizers` | ✅ Required | Text preprocessing |
| `numpy` | ✅ Required | Tensor operations, sklearn dependency |
| `scikit-learn` | ✅ Required | SVD computations (core feature) |
| `packaging` | ✅ Required | Version comparisons |
| `filelock` | ✅ Required | File locking for cache |
| `tqdm` | ✅ Required | Progress bars |
| `pyyaml` | ✅ Required | Config parsing |
| `torch` | ❌ Optional (extra) | Large; user may already have it |
| `transformers` | ❌ Optional (extra) | Pulls many deps; only for model loading |
| `onnxruntime` | ❌ Optional (extra) | Alternative inference backend |
| `optimum` | ❌ Optional (extra) | ONNX Runtime integration |

### Practical pyproject.toml Structure

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
# Full torch-based inference
torch = [
    "torch>=2.4",
    "transformers>=4.40",
]

# ONNX Runtime inference (lighter)
onnx = [
    "onnxruntime>=1.17",
    "optimum[onnxruntime]",
    "transformers>=4.40",
]

# Development
dev = [
    "pytest>=7",
    "ruff>=0.9",
    "mypy",
]
```

### Estimated Install Sizes

| Install Command | Download Size | Disk Size |
|----------------|---------------|-----------|
| `pip install alknet-firewall` | ~30MB | ~100MB |
| `pip install "alknet-firewall[torch]"` | ~2GB+ | ~5GB+ |
| `pip install "alknet-firewall[onnx]"` | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |

---

## 6. safetensors Format

### Why safetensors Over PyTorch Pickle

| Property | `.safetensors` | `.pt` / `.bin` (pickle) |
|----------|---------------|------------------------|
| **Security** | ✅ No arbitrary code execution | ❌ Pickle can execute arbitrary code |
| **Speed (CPU)** | ~76x faster than pickle | Baseline |
| **Speed (GPU)** | ~2x faster than pickle | Baseline |
| **Zero-copy** | ✅ Memory-mapped loading | ❌ Extra copies |
| **Lazy loading** | ✅ Load only needed tensors | ❌ Must load entire file |
| **Cross-framework** | ✅ pt, tf, jax, numpy, mlx | ❌ Framework-specific |
| **File size limit** | ✅ No practical limit | ⚠️ Practical limits exist |
| **Layout control** | ✅ Deterministic | ❌ Non-deterministic |

### Security Implications

**Pickle-based `.pt` / `.bin` files are a known security risk.** Loading a `.pt` file with `torch.load()` can execute arbitrary Python code embedded in the file unless `weights_only=True` is in effect (the default since PyTorch 2.6). This is a supply chain attack vector.

`safetensors` eliminates this entirely: the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.

**For a security-focused product (firewall)**, this is critical. You should:

1. **Only load model weights from safetensors format**, never `.pt` or `.bin`
2. **Verify checksums** when downloading models (huggingface_hub does this automatically)
3. **Pin model revisions** to specific commit hashes

### Loading safetensors in Practice

```python
# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")
```

**Recommendation**: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Pass `use_safetensors=True` explicitly to document intent (it's the default in modern transformers, but being explicit makes the security posture visible in code).

---

## 7. HuggingFace Integration

### How to Depend on huggingface_hub

`huggingface_hub` is lightweight (~15MB installed) and well-maintained. It should be a **required dependency** for any package that downloads models from the Hub.

```toml
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
]
```

The version pin `>=1.5.0,<2.0` follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes.

### Key Features to Use

1. **`hf_hub_download()`**: Download a single file with caching
2. **`snapshot_download()`**: Download an entire repo with caching
3. **`try_to_load_from_cache()`**: Check if a file is already cached (no network call)
4. **Offline mode**: `HF_HUB_OFFLINE=1` or `local_files_only=True`
5. **Authentication**: Automatic via `HF_TOKEN` env var or `hf auth login` (`huggingface-cli login` in older releases)
6. **Filtering**: `allow_patterns` / `ignore_patterns` to download only what's needed

### Download Pattern for alknet-firewall

```python
import os

from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash


def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.

    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"

    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path
```

### Caching

`huggingface_hub` caching is automatic and robust:

- **Content-addressed**: Files are stored by SHA256 hash
- **Symlink-based**: Multiple revisions share the same blob
- **No redundant downloads**: Already-cached files are never re-downloaded
- **Cache inspection**: `hf cache ls` CLI or `scan_cache_dir()` Python API
- **Cache cleanup**: `hf cache prune` removes unreferenced revisions

You don't need to implement your own caching layer. Just use `huggingface_hub` and let it handle everything.

### Authentication for Gated Models

If your fine-tuned model is gated (requires license acceptance):

```python
from huggingface_hub import snapshot_download

# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set the HF_TOKEN environment variable or run: hf auth login

# Your code just works; huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)
```

**Recommendation**: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.

### Environment Variables

Key environment variables your users might need:

| Variable | Purpose | Default |
|----------|---------|---------|
| `HF_HOME` | Root cache directory | `~/.cache/huggingface` |
| `HF_HUB_CACHE` | Specific cache directory for hub files | `$HF_HOME/hub` |
| `HF_HUB_OFFLINE` | Skip all network calls | `0` |
| `HF_TOKEN` | Authentication token | None |
| `HF_HUB_DOWNLOAD_TIMEOUT` | Download timeout in seconds | `10` |
| `TRANSFORMERS_CACHE` | Transformers-specific cache | Deprecated; use `HF_HUB_CACHE` |

---

## Summary of Recommendations

### Dependency Strategy

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"]  # same as torch; document CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]
```

### Model Distribution

- **Runtime download** via `huggingface_hub.snapshot_download()`
- **Cache** in default HF cache (`~/.cache/huggingface/hub/`)
- **Pin model revision** for reproducibility
- **Filter downloads** with `allow_patterns` (skip `.bin`, `.msgpack`, etc.)
- **Support offline mode** via `HF_HUB_OFFLINE` / `local_files_only=True`

### Inference Backend

- **Primary**: PyTorch + transformers (via `[torch]` extra)
- **Alternative**: ONNX Runtime (via `[onnx]` extra), with a much smaller footprint
- **CPU-only**: Document two-step install for CPU-only torch
- **Don't torch.compile() by default**; make it opt-in

### Security

- **Only load safetensors format**, never pickle-based `.pt`/`.bin`
- **Verify model provenance** by pinning to specific HF revisions
- **Don't bundle model weights**; use runtime download with checksums

### Installation Paths (for docs)

```bash
# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"

# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"

# Pre-download model for offline use
alknet-firewall download  # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
```

---

## References

- [HuggingFace Transformers setup.py](https://github.com/huggingface/transformers/blob/main/setup.py): torch as optional extra pattern
- [HuggingFace Safetensors pyproject.toml](https://github.com/huggingface/safetensors/blob/main/bindings/python/pyproject.toml): chained extras pattern
- [HuggingFace Hub setup.py](https://github.com/huggingface/huggingface_hub/blob/main/setup.py): minimal core with extras
- [HuggingFace Hub caching docs](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache)
- [HuggingFace Hub download docs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
- [HuggingFace Safetensors docs](https://huggingface.co/docs/safetensors/index)
- [Safetensors speed comparison](https://huggingface.co/docs/safetensors/en/speed): 76x faster CPU load than pickle
- [HuggingFace Optimum](https://github.com/huggingface/optimum): ONNX Runtime integration
- [HuggingFace Optimum ONNX quickstart](https://huggingface.co/docs/optimum-onnx/en/quickstart)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime): cross-platform inference engine
- [PyTorch installation](https://pytorch.org/get-started/locally/): CPU-only install via `--index-url`
- [Transformers installation docs](https://huggingface.co/docs/transformers/installation): CPU-only torch install pattern