alknet-firewall/docs/research/python-ml-packaging.md
glm-5.1 cf464c2296 feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
2026-06-13 05:17:40 +00:00


Research: Packaging Python Libraries with PyTorch Dependencies

Question

How to package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference of a ~135M parameter model (SmolLM2-135M), scikit-learn for SVD computations, and safetensors for model weight loading, while keeping the package lean, pip-installable, and reliable.


1. PyTorch as a Dependency

How Mature ML Packages Handle It

The three major HuggingFace packages each take a different approach:

transformers — Torch as Optional Extra

From setup.py (v5.x), transformers does NOT include torch in install_requires. Instead:

# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")

Users install with pip install "transformers[torch]". If you just pip install transformers without the extra, you get the library but it will fail at runtime if you try to use torch-dependent code.

Key insight: transformers was designed as a multi-framework library (torch/tf/jax), so making torch optional was a necessity, not just a convenience. It also uses dummy_*.py modules that provide placeholder classes when a framework isn't installed, so users get an actionable error message at the point of use instead of an opaque failure at import time.

safetensors — Framework-Specific Optional Extras

From pyproject.toml:

[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]

The base safetensors package (no extras) has no framework dependency. The numpy extra adds loading tensors as numpy arrays, and each framework extra adds the framework-specific save/load functions. The convert extra chains to torch.

Key insight: safetensors uses a chained extras pattern: the torch extra depends on the numpy extra, so safetensors[torch] pulls both. This is clean and explicit.

huggingface_hub — Minimal Core, Framework Extras

From setup.py:

install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]

Key insight: huggingface_hub is deliberately minimal. Torch is only needed for certain features. The hf_xet dependency uses platform markers for conditional installation.

Options Summary

| Approach | Used By | Pros | Cons |
|---|---|---|---|
| Optional extra (package[torch]) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| Required dependency | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| Lazy imports + graceful error | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; can't type-check torch-dependent code |
| Platform-conditional | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |

Recommendation for alknet-firewall

Use optional extras with lazy imports. This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:

  1. pip install alknet-firewall — minimal install, downloads model at first run, requires torch to already be present
  2. pip install "alknet-firewall[torch]" — installs torch as a dependency

In your code, use lazy imports with a clear error message:

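def _require_torch():
    """Import torch on first use so the package itself imports without it."""
    try:
        import torch
    except ImportError as exc:
        # Chain the original error so the root cause stays visible in tracebacks.
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch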

2. Model File Distribution

Size Reality Check: SmolLM2-135M

The SmolLM2-135M model consists of:

  • model.safetensors — ~269MB (model weights)
  • config.json — ~700 bytes
  • tokenizer.json — ~2-4MB
  • tokenizer_config.json — ~1KB
  • generation_config.json — ~200 bytes

Total: ~272MB+

This is far too large to bundle in a Python package. PyPI's default limits are 100 MiB per uploaded file and 10 GiB per project, and anything larger requires a manually approved limit increase. Even if it were allowed, a 272MB wheel download is terrible UX.
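
The ~269MB figure is consistent with storing roughly 135M parameters at 2 bytes each. The check below is a back-of-the-envelope sketch: the round parameter count and the 16-bit storage format are assumptions made for the arithmetic, not values read from the checkpoint.

```python
PARAMS = 135_000_000  # assumed round figure for SmolLM2-135M


def weights_size_mb(params: int, bytes_per_param: int) -> float:
    """Approximate on-disk size of raw weights in decimal megabytes."""
    return params * bytes_per_param / 1_000_000


size_16bit = weights_size_mb(PARAMS, 2)  # bf16 / fp16 storage (assumed)
size_32bit = weights_size_mb(PARAMS, 4)  # fp32 storage, for comparison

print(f"16-bit: ~{size_16bit:.0f} MB, 32-bit: ~{size_32bit:.0f} MB")
# prints "16-bit: ~270 MB, 32-bit: ~540 MB"
```

Either way the weights alone are several times larger than any per-file upload limit a package index is likely to grant by default.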

Distribution Options

| Approach | Feasibility | When to Use |
|---|---|---|
| Bundled in package_data | Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| Runtime download via huggingface_hub | Recommended | Default approach for any model >10MB |
| Separate package for model artifacts | ⚠️ Possible but awkward | When you need offline-first install |
| Custom download (S3, etc.) | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |

Runtime download via huggingface_hub is exactly what transformers does. The pattern:

from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)

Caching Strategy

huggingface_hub handles caching automatically:

  • Default cache location: ~/.cache/huggingface/hub/
  • Configurable via: HF_HOME, HF_HUB_CACHE, or cache_dir parameter
  • Structure: Content-addressed storage with symlinks (blobs + snapshots)
  • Deduplication: Same file across revisions → single blob on disk
  • No re-downloads: Cached files are checked before download
  • Offline mode: Set HF_HUB_OFFLINE=1 to skip all network calls

The cache structure:

~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│   ├── blobs/           # actual files, named by hash
│   ├── refs/            # branch/tag → commit mappings
│   └── snapshots/       # symlinks to blobs, one per revision
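
The layout above can be predicted without touching the network, which is handy for tests and for telling users where files will land. The following is a minimal sketch of the resolution order described in this section; it ignores other inputs the real library may consult (for example XDG_CACHE_HOME), and both function names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Resolve the hub cache directory from HF_HUB_CACHE, then HF_HOME, then the default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def repo_cache_folder(repo_id: str, repo_type: str = "model") -> str:
    """Folder name for a repo inside the cache, e.g. models--org--name."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


print(hub_cache_dir() / repo_cache_folder("HuggingFaceTB/SmolLM2-135M"))
```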

Pinning Model Versions

To ensure reproducibility, pin the model revision:

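# Pin to a specific commit hash for reproducibility.
# The value below is illustrative; copy the real 40-character hash from the model's commit history.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"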

model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)

Or pin to a tag if the model has version tags.
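
Because a branch name such as main silently moves, it can help to reject anything that is not a full commit hash when reproducibility matters. A minimal sketch, assuming Hub revisions are lowercase 40-character hex git hashes; `is_pinned_revision` is a hypothetical helper name.

```python
import re

# A full git commit hash: exactly 40 lowercase hexadecimal characters.
_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True only for a full commit hash, not a branch or tag name."""
    return _COMMIT_HASH.fullmatch(revision) is not None


for candidate in ("main", "v1.0", "a" * 40):
    print(candidate[:12], is_pinned_revision(candidate))
```

A library could log a warning rather than fail when the check returns False, so that development against main remains possible.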

Gated Model Authentication

If your model requires authentication (accepting license terms on HF Hub):

  1. User sets HF_TOKEN environment variable or logs in via hf auth login (huggingface-cli login in older huggingface_hub releases)
  2. hf_hub_download() automatically picks up the token
  3. Document this requirement clearly

# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)

SmolLM2-135M is not gated as of this writing, but your own fine-tuned version could be.


3. Inference-Only Considerations

CPU-Only PyTorch

Yes, you can install torch without CUDA. The official method:

# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu

Problem: You can't express this in pyproject.toml extras. The CPU-only torch is served from a different index URL (https://download.pytorch.org/whl/cpu), not from PyPI. This means:

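  1. pip install "alknet-firewall[torch]" will install the default torch wheel from PyPI, which on Linux bundles the CUDA libraries (~2GB); the PyPI wheels for macOS and Windows are already CPU-only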
  2. To get CPU-only torch, users must do a two-step install:
    pip install torch --index-url https://download.pytorch.org/whl/cpu
    pip install alknet-firewall
    

Workaround: Document both installation paths clearly:

## Installation

# With CUDA (default torch):
pip install "alknet-firewall[torch]"

# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
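
Since the package cannot control which torch build ends up installed, it can at least report what it finds. The sketch below relies on the local version label that wheels from download.pytorch.org carry (for example 2.4.0+cpu or 2.4.0+cu121); wheels from PyPI have no such label, so the result for them is "unknown". `torch_build_flavor` is a hypothetical helper.

```python
from importlib import metadata


def torch_build_flavor(version=None) -> str:
    """Classify the installed torch wheel as 'cpu', 'cuda', 'unknown', or 'missing'."""
    if version is None:
        try:
            version = metadata.version("torch")
        except metadata.PackageNotFoundError:
            return "missing"
    # PEP 440 local version label, e.g. "2.4.0+cpu" -> "cpu".
    _, _, local = version.partition("+")
    if local == "cpu":
        return "cpu"
    if local.startswith("cu"):
        return "cuda"
    return "unknown"


print(torch_build_flavor())
```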

torch.compile() for Faster Inference

torch.compile() (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:

import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.compile(model)  # JIT compile for faster inference

Caveats:

  • First run is slow (compilation overhead)
  • Best for repeated inference (the compiled graph is reused on later calls with compatible input shapes)
  • CPU-only works but benefits are smaller than on GPU
  • Adds complexity; not worth it for a ~135M model unless latency is critical

Recommendation: Make this optional. Don't torch.compile() by default — offer it as a performance tuning option.
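
One way to keep compilation opt-in is a small wrapper that leaves the model untouched unless the caller or an environment variable asks for it. This is a sketch only; the ALKNET_FIREWALL_COMPILE variable and the `maybe_compile` name are assumptions for illustration, not part of any existing API.

```python
import os


def maybe_compile(model, enabled=None):
    """Return torch.compile(model) only when explicitly enabled, else the model as-is."""
    if enabled is None:
        # Hypothetical opt-in switch; off unless set to "1".
        enabled = os.environ.get("ALKNET_FIREWALL_COMPILE", "0") == "1"
    if not enabled:
        return model
    import torch  # imported lazily so the default path never needs torch.compile

    return torch.compile(model)
```

Callers that care about latency pass `enabled=True` once at load time and accept the slow first inference; everyone else gets eager mode with no compile overhead.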

torch.export() / TorchDynamo

torch.export() (PyTorch 2.1+) produces a portable model artifact:

exported_model = torch.export.export(model, (input_ids,))

This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.

ONNX Runtime as an Alternative

This is the most compelling alternative to raw PyTorch for inference-only use cases.

HuggingFace's optimum library provides seamless ONNX Runtime integration:

# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)

Benefits:

  • onnxruntime package is ~30-50MB vs torch at ~200-2000MB+
  • ONNX Runtime is optimized for inference (no autograd, no training overhead)
  • Often faster inference on CPU than PyTorch
  • Cross-platform (CPU, GPU, mobile, edge devices)

Drawbacks:

  • Need to export model to ONNX format first (one-time step)
  • Not all model architectures support ONNX export equally
  • Quantization/int8 support varies by architecture
  • Adds onnxruntime + optimum as dependencies (still much smaller than torch)

Size comparison:

| Package | Install Size |
|---|---|
| torch (CUDA) | ~2.5GB |
| torch (CPU only) | ~200MB |
| onnxruntime | ~30-50MB |
| onnxruntime-gpu | ~500MB |

Recommendation: Consider offering ONNX Runtime as an alternative inference backend via an extra:

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]

For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
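
With two extras on offer, the library needs some rule for picking a backend at runtime. A minimal sketch, assuming a simple "torch first, then ONNX Runtime" preference order; the `BACKEND_MODULES` mapping and `select_backend` function are hypothetical names.

```python
from __future__ import annotations

import importlib.util
from typing import Callable

# Hypothetical mapping from extra name to the module that extra provides.
BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def _module_installed(module: str) -> bool:
    """Check importability without paying the import cost."""
    return importlib.util.find_spec(module) is not None


def select_backend(
    preferred: str | None = None,
    is_installed: Callable[[str], bool] = _module_installed,
) -> str:
    """Return the first usable backend name, honouring an explicit preference."""
    candidates = [preferred] if preferred else list(BACKEND_MODULES)
    for name in candidates:
        if name not in BACKEND_MODULES:
            raise ValueError(f"Unknown backend {name!r}; expected one of {sorted(BACKEND_MODULES)}")
        if is_installed(BACKEND_MODULES[name]):
            return name
    raise ImportError(
        "No inference backend found. Install one with "
        "pip install 'alknet-firewall[torch]' or pip install 'alknet-firewall[onnx]'."
    )
```

Injecting `is_installed` keeps the function testable on machines that have neither backend.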

Using transformers Without Training Dependencies

transformers is already split this way. The base pip install transformers does NOT include torch. You need pip install "transformers[torch]" to get torch support.

Additional ways to keep transformers lean:

  • Don't install accelerate unless you need multi-GPU / device_map="auto"
  • Don't install training extras (deepspeed, peft, etc.)
  • For inference only, you don't need: scipy, scikit-learn (from transformers extras), tensorboard, etc.

What transformers needs for basic inference:

  • torch (or tensorflow, or flax)
  • safetensors
  • tokenizers
  • huggingface-hub
  • numpy
  • packaging
  • pyyaml
  • regex
  • tqdm

4. sklearn + PyTorch Coexistence

Compatibility: Generally Fine

sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.

Potential concerns:

  1. numpy version: Both scikit-learn and torch interoperate with numpy, so the versions must overlap. As of 2025-2026:

    • torch>=2.4 does not declare numpy as a hard dependency in its wheel metadata and works with both numpy 1.x and 2.x
    • scikit-learn>=1.5 requires numpy>=1.19.5
    • These are compatible
  2. Dependency tree size: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.

  3. BLAS/LAPACK: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.

  4. Thread oversubscription: torch manages its own intra-op thread pool, while scikit-learn's SVD runs on the BLAS/OpenMP pools behind numpy and scipy (joblib only matters for estimators that take n_jobs). If running sklearn SVD and torch inference in the same process, consider capping thread counts to avoid oversubscription:

    import torch
    torch.set_num_threads(4)  # limit torch intra-op threads
    
    # The BLAS/OpenMP pools used by sklearn's SVD are capped separately,
    # via OMP_NUM_THREADS or threadpoolctl.threadpool_limits()
    

Recommendation: No special handling needed. Just include both as dependencies. Set torch.set_num_threads() if you notice CPU contention.
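
If contention does show up, one simple policy is to split the available cores between the two thread pools. The helper below is a sketch of that arithmetic only; the even split is an arbitrary assumption, and `split_thread_budget` is a hypothetical name.

```python
import os


def split_thread_budget(total=None, torch_share=0.5):
    """Split a core budget into (torch_threads, blas_threads), each at least 1."""
    total = total or os.cpu_count() or 1
    torch_threads = max(1, int(total * torch_share))
    blas_threads = max(1, total - torch_threads)
    return torch_threads, blas_threads


torch_threads, blas_threads = split_thread_budget()
# Applying the budget (requires the optional packages, so shown as comments):
#   torch.set_num_threads(torch_threads)
#   with threadpoolctl.threadpool_limits(limits=blas_threads):
#       ...run the SVD...
print(torch_threads, blas_threads)
```

If the SVD and the inference never run concurrently, splitting is unnecessary and each stage can use the full budget.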


5. Package Size Optimization

What to Make Required vs Optional

For alknet-firewall, here's a practical breakdown:

| Component | Required? | Rationale |
|---|---|---|
| huggingface_hub | Required | Model downloading, caching |
| safetensors | Required | Loading model weights |
| tokenizers | Required | Text preprocessing |
| numpy | Required | Tensor operations, sklearn dependency |
| scikit-learn | Required | SVD computations (core feature) |
| packaging | Required | Version comparisons |
| filelock | Required | File locking for cache |
| tqdm | Required | Progress bars |
| pyyaml | Required | Config parsing |
| torch | Optional (extra) | Large; user may already have it |
| transformers | Optional (extra) | Pulls many deps; only for model loading |
| onnxruntime | Optional (extra) | Alternative inference backend |
| optimum | Optional (extra) | ONNX Runtime integration |

Practical pyproject.toml Structure

[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
# Full torch-based inference
torch = [
    "torch>=2.4",
    "transformers>=4.40",
]
# ONNX Runtime inference (lighter)
onnx = [
    "onnxruntime>=1.17",
    "optimum[onnxruntime]",
    "transformers>=4.40",
]
# Development
dev = [
    "pytest>=7",
    "ruff>=0.9",
    "mypy",
]

Estimated Install Sizes

| Install Command | Download Size | Disk Size |
|---|---|---|
| pip install alknet-firewall | ~30MB | ~100MB |
| pip install "alknet-firewall[torch]" | ~2GB+ | ~5GB+ |
| pip install "alknet-firewall[onnx]" | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |
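
These figures are estimates and drift between releases, so it is worth measuring them in a clean virtual environment. A minimal sketch using only the standard library: it sums the files a distribution recorded at install time, which approximates disk usage but does not include dependencies. `installed_size_bytes` is a hypothetical helper.

```python
import os
from importlib import metadata


def installed_size_bytes(dist_name: str):
    """Sum the sizes of files recorded for an installed distribution, or None if absent."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    total = 0
    for entry in dist.files or []:  # files can be None when no RECORD exists
        path = str(entry.locate())
        if os.path.isfile(path):
            total += os.path.getsize(path)
    return total


for name in ("torch", "onnxruntime", "scikit-learn"):
    size = installed_size_bytes(name)
    print(name, "not installed" if size is None else f"{size / 1e6:.0f} MB")
```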

6. safetensors Format

Why safetensors Over PyTorch Pickle

| Property | .safetensors | .pt / .bin (pickle) |
|---|---|---|
| Security | No arbitrary code execution | Pickle can execute arbitrary code |
| Speed (CPU) | ~76x faster than pickle | Baseline |
| Speed (GPU) | ~2x faster than pickle | Baseline |
| Zero-copy | Memory-mapped loading | Extra copies |
| Lazy loading | Load only needed tensors | Must load entire file |
| Cross-framework | pt, tf, jax, numpy, mlx | Framework-specific |
| File size limit | No practical limit | ⚠️ Practical limits exist |
| Layout control | Deterministic | Non-deterministic |

Security Implications

Pickle-based .pt / .bin files are a known security risk. Loading a .pt file with torch.load() can execute arbitrary Python code embedded in the file unless weights_only=True is in effect (the default since PyTorch 2.6). This is a supply chain attack vector.

safetensors eliminates this entirely — the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.
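
The layout is simple enough to inspect with the standard library alone: an 8-byte little-endian unsigned integer giving the header length, then that many bytes of UTF-8 JSON, then the raw tensor bytes. The sketch below hand-assembles a one-tensor file in memory purely for illustration and reads its header back; both function names are hypothetical, and real code should use the safetensors library.

```python
import io
import json
import struct


def read_safetensors_header(stream) -> dict:
    """Return the JSON header of a safetensors stream without reading tensor data."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    return json.loads(stream.read(header_len).decode("utf-8"))


def build_minimal_safetensors(name: str, values: list) -> bytes:
    """Hand-assemble a one-tensor float32 file (illustration only)."""
    data = struct.pack(f"<{len(values)}f", *values)
    header = {name: {"dtype": "F32", "shape": [len(values)], "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


blob = build_minimal_safetensors("w", [1.0, 2.0, 3.0])
header = read_safetensors_header(io.BytesIO(blob))
print(header["w"]["shape"], header["w"]["data_offsets"])  # prints "[3] [0, 12]"
```

Nothing in this path evaluates code from the file: the header is parsed as JSON and the payload is only ever interpreted as numbers.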

For a security-focused product (firewall), this is critical. You should:

  1. Only load model weights from safetensors format — never .pt or .bin
  2. Verify checksums when downloading models (huggingface_hub does this automatically)
  3. Pin model revisions to specific commit hashes
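
The first two points can be enforced by a small guard that runs before any weights are opened. This is a sketch under stated assumptions: the expected digest would come from a value pinned alongside the model revision, and `check_weight_file` and `sha256_of` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

ALLOWED_WEIGHT_SUFFIXES = {".safetensors"}


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_weight_file(path, expected_sha256=None) -> Path:
    """Reject pickle-based formats and, optionally, files whose digest does not match."""
    path = Path(path)
    if path.suffix not in ALLOWED_WEIGHT_SUFFIXES:
        raise ValueError(f"Refusing to load {path.name}: only .safetensors weights are allowed")
    if expected_sha256 is not None and sha256_of(path) != expected_sha256.lower():
        raise ValueError(f"Checksum mismatch for {path.name}")
    return path
```

The suffix check runs first, so a .bin or .pt file is rejected without ever being opened.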

Loading safetensors in Practice

# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")

Recommendation: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Use use_safetensors=True explicitly for safety documentation purposes (it's the default in modern transformers, but being explicit shows intent).


7. HuggingFace Integration

How to Depend on huggingface_hub

huggingface_hub is lightweight (~15MB installed) and well-maintained. It should be a required dependency for any package that downloads models from the Hub.

dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
]

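The version pin >=1.5.0,<2.0 follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes. Note that transformers 4.x releases cap huggingface-hub below 1.0, so with this floor the resolver can only select transformers 5.x; a transformers>=4.40 floor in the extras is effectively satisfied only by 5.x releases.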

Key Features to Use

  1. hf_hub_download(): Download a single file with caching
  2. snapshot_download(): Download an entire repo with caching
  3. try_to_load_from_cache(): Check if a file is already cached (no network call)
  4. Offline mode: HF_HUB_OFFLINE=1 or local_files_only=True
  5. Authentication: Automatic via HF_TOKEN env var or hf auth login (huggingface-cli login in older releases)
  6. Filtering: allow_patterns / ignore_patterns to download only what's needed

Download Pattern for alknet-firewall

import os
from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash

def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.
    
    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
    
    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path

Caching

huggingface_hub caching is automatic and robust:

  • Content-addressed: Files are stored by SHA256 hash
  • Symlink-based: Multiple revisions share the same blob
  • No redundant downloads: Already-cached files are never re-downloaded
  • Cache inspection: hf cache ls CLI or scan_cache_dir() Python API
  • Cache cleanup: hf cache prune removes unreferenced revisions

You don't need to implement your own caching layer. Just use huggingface_hub and let it handle everything.

Authentication for Gated Models

If your fine-tuned model is gated (requires license acceptance):

# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set HF_TOKEN environment variable or run: hf auth login
#    (huggingface-cli login in older huggingface_hub releases)

# Your code just works — huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)

Recommendation: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.

Environment Variables

Key environment variables your users might need:

| Variable | Purpose | Default |
|---|---|---|
| HF_HOME | Root cache directory | ~/.cache/huggingface |
| HF_HUB_CACHE | Specific cache directory for hub files | $HF_HOME/hub |
| HF_HUB_OFFLINE | Skip all network calls | 0 |
| HF_TOKEN | Authentication token | None |
| HF_HUB_DOWNLOAD_TIMEOUT | Download timeout in seconds | 10 |
| TRANSFORMERS_CACHE | Transformers-specific cache | Deprecated; use HF_HUB_CACHE |

Summary of Recommendations

Dependency Strategy

[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"]  # identical to torch; an extra cannot select the CPU wheel index, so document the CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]

Model Distribution

  • Runtime download via huggingface_hub.snapshot_download()
  • Cache in default HF cache (~/.cache/huggingface/hub/)
  • Pin model revision for reproducibility
  • Filter downloads with allow_patterns (skip .bin, .msgpack, etc.)
  • Support offline mode via HF_HUB_OFFLINE / local_files_only=True

Inference Backend

  • Primary: PyTorch + transformers (via [torch] extra)
  • Alternative: ONNX Runtime (via [onnx] extra) — much smaller footprint
  • CPU-only: Document two-step install for CPU-only torch
  • Don't torch.compile() by default — make it opt-in

Security

  • Only load safetensors format — never pickle-based .pt/.bin
  • Verify model provenance — pin to specific HF revisions
  • Don't bundle model weights — runtime download with checksums

Installation Paths (for docs)

# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"

# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"

# Pre-download model for offline use
alknet-firewall download  # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
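
The alknet-firewall download command shown above does not exist yet. A minimal sketch of how it could be wired up with argparse follows; the flag names, the default model ID, and the file patterns are assumptions for illustration, and huggingface_hub is imported lazily so that parsing arguments never needs the network stack.

```python
import argparse

DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # assumed default, as used elsewhere in this report


def build_parser() -> argparse.ArgumentParser:
    """Build the hypothetical alknet-firewall command-line parser."""
    parser = argparse.ArgumentParser(prog="alknet-firewall")
    commands = parser.add_subparsers(dest="command", required=True)
    download = commands.add_parser("download", help="Pre-fetch model files into the HF cache")
    download.add_argument("--model-id", default=DEFAULT_MODEL_ID)
    download.add_argument("--revision", default="main")
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "download":
        from huggingface_hub import snapshot_download  # lazy: only needed for this command

        path = snapshot_download(
            repo_id=args.model_id,
            revision=args.revision,
            allow_patterns=["*.safetensors", "*.json"],
        )
        print(path)
    return 0
```

Exposing `main` through a `[project.scripts]` entry in pyproject.toml would make the command available after installation.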

References