glm-5.1 cf464c2296 feat: initial architecture specification and research

Phase 0→1 setup for alknet-firewall — a behavioral signal detection
library that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).

2026-06-13 05:17:40 +00:00

Research: Packaging Python Libraries with PyTorch Dependencies

Question

How to package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference of a ~135M parameter model (SmolLM2-135M), sklearn for SVD computations, and safetensors for model weight loading, while keeping the package lean, pip-installable, and reliable.

1. PyTorch as a Dependency

How Mature ML Packages Handle It

The three major HuggingFace packages each take a different approach:

`transformers` — Torch as Optional Extra

From setup.py (v5.x), transformers does NOT include torch in install_requires. Instead:

# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")

Users install with pip install "transformers[torch]". If you just pip install transformers without the extra, you get the library but it will fail at runtime if you try to use torch-dependent code.

Key insight: transformers grew up as a multi-framework library (torch/tf/jax), which is why torch became an optional extra rather than a hard dependency; the v5 line focuses on PyTorch, but the extra remains. It also uses dummy_*.py modules that provide placeholder classes when a framework isn't installed, giving better error messages.
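The placeholder idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual transformers implementation: `make_placeholder` and `FirewallModel` are hypothetical names introduced here.

```python
import importlib.util


def make_placeholder(name: str, backend: str, extra: str):
    """Build a stand-in class that fails with a helpful message when used."""

    class _Placeholder:
        def __init__(self, *args, **kwargs):
            raise ImportError(
                f"{name} requires the '{backend}' package, which is not installed. "
                f"Install it with: pip install '{extra}'"
            )

    _Placeholder.__name__ = name
    return _Placeholder


# Importing this module never fails; only instantiating the class does.
if importlib.util.find_spec("torch") is None:
    # Hypothetical public class name, used only for illustration.
    FirewallModel = make_placeholder(
        "FirewallModel", "torch", "alknet-firewall[torch]"
    )
else:
    class FirewallModel:
        """Stub: a real implementation would import torch lazily in its methods."""
```

The benefit is that `import alknet_firewall` succeeds in a torch-free environment, and the user sees an actionable install hint at the point of use instead of a bare `ModuleNotFoundError` at import time.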

`safetensors` — Framework-Specific Optional Extras

From pyproject.toml:

[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]

The base safetensors package (no extras) can load files and return raw tensor data (as numpy arrays via the numpy extra). Each framework extra adds the framework-specific save/load functions. The convert extra specifically chains to torch.

Key insight: Safetensors uses a chained extras pattern — torch depends on numpy, so safetensors[torch] pulls both. This is clean and explicit.

`huggingface_hub` — Minimal Core, Framework Extras

From setup.py:

install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]

Key insight: huggingface_hub is deliberately minimal. Torch is only needed for certain features. The hf_xet dependency uses platform markers for conditional installation.

Options Summary

Approach	Used By	Pros	Cons
Optional extra (`package[torch]`)	transformers, safetensors, huggingface_hub	Users control their torch version; avoids forcing 2GB+ install	Must document clearly; code must handle missing torch gracefully
Required dependency	Few mature packages	Simpler code; guaranteed torch available	Forces 2GB+ download; version conflicts with user's torch
Lazy imports + graceful error	transformers (internal)	Good UX when torch missing; no crashes on import	More code complexity; can't type-check torch-dependent code
Platform-conditional	huggingface_hub (hf_xet)	Right dependency for right platform	Complex setup.py; torch doesn't support this well

Recommendation for alknet-firewall

Use optional extras with lazy imports. This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:

pip install alknet-firewall — minimal install, downloads model at first run, requires torch to already be present
pip install "alknet-firewall[torch]" — installs torch as a dependency

In your code, use lazy imports with a clear error message:

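def _require_torch():
    """Import torch on first use, raising a helpful error if it is missing."""
    try:
        import torch
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch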

2. Model File Distribution

Size Reality Check: SmolLM2-135M

The SmolLM2-135M model consists of:

model.safetensors — ~269MB (model weights)
config.json — ~700 bytes
tokenizer.json — ~2-4MB
tokenizer_config.json — ~1KB
generation_config.json — ~200 bytes

Total: ~272MB+

This is far too large to bundle in a Python package. PyPI's default limits are 100MB per uploaded file and 10GB per project, so a 269MB weights file would be rejected unless a limit increase were granted. Even if it were allowed, a 272MB wheel download is terrible UX.
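The weight-file size above follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming a parameter count of roughly 134.5 million and 2-byte (bfloat16) storage; both figures are assumptions chosen for illustration, and `estimate_weights_mb` is a hypothetical helper.

```python
# Bytes used to store one parameter in common checkpoint dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weights_mb(num_params: int, dtype: str = "bfloat16") -> float:
    """Estimate raw tensor storage in megabytes (10**6 bytes), ignoring the header."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


# Assumed parameter count (~134.5M); check the model card for the exact figure.
size_mb = estimate_weights_mb(134_500_000, "bfloat16")
print(f"{size_mb:.0f} MB")  # prints "269 MB"
```

The same model stored as float32 would be about twice that size, which is worth keeping in mind when publishing a fine-tuned checkpoint.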

Distribution Options

Approach	Feasibility	When to Use
Bundled in package_data	❌ Not feasible at 269MB	Only for files <10MB (configs, tokenizers)
Runtime download via huggingface_hub	✅ Recommended	Default approach for any model >10MB
Separate package for model artifacts	⚠️ Possible but awkward	When you need offline-first install
Custom download (S3, etc.)	⚠️ Works but reinvents the wheel	When HF Hub isn't available

Recommended Approach: Runtime Download via huggingface_hub

This is exactly what transformers does. The pattern:

from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)

Caching Strategy

huggingface_hub handles caching automatically:

Default cache location: ~/.cache/huggingface/hub/
Configurable via: HF_HOME, HF_HUB_CACHE, or cache_dir parameter
Structure: Content-addressed storage with symlinks (blobs + snapshots)
Deduplication: Same file across revisions → single blob on disk
No re-downloads: Cached files are checked before download
Offline mode: Set HF_HUB_OFFLINE=1 to skip all network calls

The cache structure:

~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│   ├── blobs/           # actual files, named by hash
│   ├── refs/            # branch/tag → commit mappings
│   └── snapshots/       # symlinks to blobs, one per revision
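For illustration, the layout above can be navigated with the standard library alone. This is a sketch that assumes the `models--{org}--{name}` folder naming shown in the tree; `repo_cache_dir` and `cached_revisions` are hypothetical helpers, and the supported way to inspect the cache remains the `scan_cache_dir()` API.

```python
from pathlib import Path


def repo_cache_dir(cache_root: Path, repo_id: str, repo_type: str = "model") -> Path:
    """Map a repo id such as 'org/name' to its folder under the hub cache."""
    return cache_root / f"{repo_type}s--{repo_id.replace('/', '--')}"


def cached_revisions(cache_root: Path, repo_id: str) -> list[str]:
    """List the snapshot folder names (one per cached revision), if any."""
    snapshots = repo_cache_dir(cache_root, repo_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p.name for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    root = Path.home() / ".cache" / "huggingface" / "hub"
    print(cached_revisions(root, "HuggingFaceTB/SmolLM2-135M"))
```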

Pinning Model Versions

To ensure reproducibility, pin the model revision:

# Pin to a specific commit hash for reproducibility.
# Illustrative value only: replace it with a real 40-character commit hash from the model repo.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"

model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)

Or pin to a tag if the model has version tags.
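A library can also guard against accidentally shipping a floating reference such as `main`. The following is a minimal sketch, assuming that "pinned" means a full 40-character lowercase hexadecimal commit hash; `is_pinned_revision` is a hypothetical helper, not part of huggingface_hub.

```python
import re

# A full git commit hash: exactly 40 lowercase hexadecimal characters.
_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """Return True only for full commit hashes, not branches or tags."""
    return _COMMIT_HASH.fullmatch(revision) is not None


if __name__ == "__main__":
    for candidate in ("main", "v1.0", "a" * 40):
        print(candidate[:12], is_pinned_revision(candidate))
```

Tags are deliberately rejected here because they can be moved after publication; whether that strictness is appropriate depends on how much the project trusts the upstream repository.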

Gated Model Authentication

If your model requires authentication (accepting license terms on HF Hub):

User sets HF_TOKEN environment variable or logs in via hf auth login (huggingface-cli login in pre-1.0 releases of huggingface_hub)
hf_hub_download() automatically picks up the token
Document this requirement clearly

# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)

SmolLM2-135M is not gated as of this writing, but your own fine-tuned version could be.

3. Inference-Only Considerations

CPU-Only PyTorch

Yes, you can install torch without CUDA. The official method:

# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu

Problem: You can't express this in pyproject.toml extras. The CPU-only torch is served from a different index URL (https://download.pytorch.org/whl/cpu), not from PyPI. This means:

pip install "alknet-firewall[torch]" will install the default torch wheel from PyPI, which on Linux bundles CUDA and is ~2GB (the default macOS and Windows wheels are CPU-only)

To get CPU-only torch, users must do a two-step install:

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

Workaround: Document both installation paths clearly:

## Installation

# With CUDA (default torch):
pip install "alknet-firewall[torch]"

# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

torch.compile() for Faster Inference

torch.compile() (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:

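import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)  # model_id: Hub repo id or local path
model = torch.compile(model)  # JIT compile for faster inference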

Caveats:

First run is slow (compilation overhead)
Best for repeated inference (the compiled model is cached)
CPU-only works but benefits are smaller than on GPU
Adds complexity; not worth it for a ~135M model unless latency is critical

Recommendation: Make this optional. Don't torch.compile() by default — offer it as a performance tuning option.
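One way to express the opt-in is a small wrapper that leaves the model untouched unless the caller asks for compilation. This is a sketch, not a prescribed design: `maybe_compile` and its `enabled` flag are hypothetical names.

```python
def maybe_compile(model, enabled: bool = False, **compile_kwargs):
    """Return the model unchanged unless compilation is explicitly requested."""
    if not enabled:
        return model
    # Imported lazily so the default path works without torch being importable here.
    import torch

    return torch.compile(model, **compile_kwargs)
```

A configuration option (for example a `compile` boolean in the firewall settings, also hypothetical) could then be passed straight through as `enabled`, so users who care about latency pay the warm-up cost knowingly.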

torch.export() / TorchDynamo

torch.export() (PyTorch 2.1+) produces a portable model artifact:

exported_model = torch.export.export(model, (input_ids,))

This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.

ONNX Runtime as an Alternative

This is the most compelling alternative to raw PyTorch for inference-only use cases.

HuggingFace's optimum library provides seamless ONNX Runtime integration:

# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(model_id)

Benefits:

onnxruntime package is ~30-50MB vs torch at ~200-2000MB+
ONNX Runtime is optimized for inference (no autograd, no training overhead)
Often faster inference on CPU than PyTorch
Cross-platform (CPU, GPU, mobile, edge devices)

Drawbacks:

Need to export model to ONNX format first (one-time step)
Not all model architectures support ONNX export equally
Quantization/int8 support varies by architecture
Adds onnxruntime + optimum as dependencies (still much smaller than torch)

Size comparison:

Package	Install Size
`torch` (CUDA)	~2.5GB
`torch` (CPU only)	~200MB
`onnxruntime`	~30-50MB
`onnxruntime-gpu`	~500MB

Recommendation: Consider offering ONNX Runtime as an alternative inference backend via an extra:

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40", "accelerate>=1.0"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]"]

For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
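With two extras on offer, the library needs some way to decide which backend to use at runtime. A minimal sketch, assuming the backend names `torch` and `onnx` map to the importable modules `torch` and `onnxruntime`; `select_backend` is a hypothetical function, and the preference order is an arbitrary choice for illustration.

```python
import importlib.util
from typing import Callable, Optional

# Backend name -> module whose presence indicates the backend is usable.
BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def _is_installed(module: str) -> bool:
    """Check for a module without importing it (cheap, no side effects)."""
    return importlib.util.find_spec(module) is not None


def select_backend(
    preferred: Optional[str] = None,
    is_installed: Callable[[str], bool] = _is_installed,
) -> str:
    """Pick an inference backend, honoring an explicit preference if given."""
    if preferred is not None:
        if preferred not in BACKEND_MODULES:
            raise ValueError(f"Unknown backend: {preferred!r}")
        if not is_installed(BACKEND_MODULES[preferred]):
            raise ImportError(
                f"Backend {preferred!r} is not installed. "
                f"Try: pip install 'alknet-firewall[{preferred}]'"
            )
        return preferred
    for name, module in BACKEND_MODULES.items():
        if is_installed(module):
            return name
    raise ImportError(
        "No inference backend found. Install one with "
        "pip install 'alknet-firewall[torch]' or 'alknet-firewall[onnx]'."
    )
```

Passing `is_installed` as a parameter keeps the function testable without either backend present.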

Using transformers Without Training Dependencies

transformers is already split this way. The base pip install transformers does NOT include torch. You need pip install "transformers[torch]" to get torch support.

Additional ways to keep transformers lean:

Don't install accelerate unless you need multi-GPU / device_map="auto"
Don't install training extras (deepspeed, peft, etc.)
For inference only, you don't need: scipy, scikit-learn (from transformers extras), tensorboard, etc.

What transformers needs for basic inference:

torch (the supported backend in transformers v5; earlier releases also accepted tensorflow or flax)
safetensors
tokenizers
huggingface-hub
numpy
packaging
pyyaml
regex
tqdm

4. sklearn + PyTorch Coexistence

Compatibility: Generally Fine

sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.

Potential concerns:

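numpy version: sklearn requires numpy, and torch interoperates with it (torch.from_numpy, Tensor.numpy()) even though recent torch wheels do not list numpy as a hard install requirement. Older torch releases were pickier about numpy versions, but recent versions (2.4+) are more flexible. As of 2025-2026:
- torch>=2.4 works with both numpy 1.x and 2.x (no upper bound in practice)
- scikit-learn>=1.5 requires numpy>=1.19.5
- These are compatible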
Dependency tree size: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.
BLAS/LAPACK: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.
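Joblib vs torch parallelism: sklearn uses joblib for estimator-level parallelism, while its SVD routines run on the BLAS/OpenMP thread pools shared with numpy and scipy; torch uses its own intra-op thread pool. If running sklearn SVD and torch inference in the same process, consider setting thread counts to avoid oversubscription:
```
import torch
from threadpoolctl import threadpool_limits  # installed as a scikit-learn dependency

torch.set_num_threads(4)  # limit torch intra-op threads

# Cap BLAS/OpenMP threads used by numpy, scipy, and sklearn inside this block
with threadpool_limits(limits=4):
    ...  # run the sklearn SVD here
```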

Recommendation: No special handling needed. Just include both as dependencies. Set torch.set_num_threads() if you notice CPU contention.

5. Package Size Optimization

What to Make Required vs Optional

For alknet-firewall, here's a practical breakdown:

Component	Required?	Rationale
`huggingface_hub`	✅ Required	Model downloading, caching
`safetensors`	✅ Required	Loading model weights
`tokenizers`	✅ Required	Text preprocessing
`numpy`	✅ Required	Tensor operations, sklearn dependency
`scikit-learn`	✅ Required	SVD computations (core feature)
`packaging`	✅ Required	Version comparisons
`filelock`	✅ Required	File locking for cache
`tqdm`	✅ Required	Progress bars
`pyyaml`	✅ Required	Config parsing
`torch`	❌ Optional (extra)	Large; user may already have it
`transformers`	❌ Optional (extra)	Pulls many deps; only for model loading
`onnxruntime`	❌ Optional (extra)	Alternative inference backend
`optimum`	❌ Optional (extra)	ONNX Runtime integration

Practical pyproject.toml Structure

[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
# Full torch-based inference
torch = [
    "torch>=2.4",
    "transformers>=4.40",
]
# ONNX Runtime inference (lighter)
onnx = [
    "onnxruntime>=1.17",
    "optimum[onnxruntime]",
    "transformers>=4.40",
]
# Development
dev = [
    "pytest>=7",
    "ruff>=0.9",
    "mypy",
]

Estimated Install Sizes

Install Command	Download Size	Disk Size
`pip install alknet-firewall`	~30MB	~100MB
`pip install "alknet-firewall[torch]"`	~2GB+	~5GB+
`pip install "alknet-firewall[onnx]"`	~100MB	~300MB
+ model download (first run)	~269MB	~269MB

6. safetensors Format

Why safetensors Over PyTorch Pickle

Property	`.safetensors`	`.pt` / `.bin` (pickle)
Security	✅ No arbitrary code execution	❌ Pickle can execute arbitrary code
Speed (CPU)	~76x faster than pickle	Baseline
Speed (GPU)	~2x faster than pickle	Baseline
Zero-copy	✅ Memory-mapped loading	❌ Extra copies
Lazy loading	✅ Load only needed tensors	❌ Must load entire file
Cross-framework	✅ pt, tf, jax, numpy, mlx	❌ Framework-specific
File size limit	✅ No practical limit	⚠️ Practical limits exist
Layout control	✅ Deterministic	❌ Non-deterministic

Security Implications

Pickle-based .pt / .bin files are a known security risk. Loading a .pt file with torch.load() can execute arbitrary Python code embedded in the file unless weights_only=True is in effect (the default since PyTorch 2.6). This is a supply chain attack vector.

safetensors eliminates this entirely — the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.
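To make the "simple binary layout" concrete: a safetensors file starts with an 8-byte little-endian unsigned integer giving the header length, followed by that many bytes of UTF-8 JSON, followed by the raw tensor bytes. The sketch below reads only the header using the standard library. `read_safetensors_header` is a hypothetical helper, and the `max_header_bytes` cap is an illustrative safety limit rather than a value taken from the source.

```python
import json
import struct


def read_safetensors_header(path: str, max_header_bytes: int = 100_000_000) -> dict:
    """Parse the JSON header of a safetensors file without loading any tensors."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError("File too short to contain a safetensors header")
        # First 8 bytes: header length as a little-endian unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > max_header_bytes:
            raise ValueError(f"Header length {header_len} exceeds safety cap")
        raw = f.read(header_len)
        if len(raw) != header_len:
            raise ValueError("File truncated inside the header")
    # The header is plain JSON: tensor name -> dtype, shape, data_offsets.
    return json.loads(raw.decode("utf-8"))
```

Nothing in this path deserializes objects or evaluates code: the worst a malformed file can do is fail JSON parsing or a length check.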

For a security-focused product (firewall), this is critical. You should:

Only load model weights from safetensors format — never .pt or .bin
Verify checksums when downloading models (huggingface_hub does this automatically)
Pin model revisions to specific commit hashes
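The first rule in the list above can be enforced with a small guard that runs before any weight file is opened. This is a sketch under assumptions: the set of pickle-style suffixes is illustrative rather than exhaustive, and `require_safetensors` and `find_pickle_files` are hypothetical helpers.

```python
from pathlib import PurePosixPath

ALLOWED_WEIGHT_SUFFIX = ".safetensors"
# Common suffixes for pickle-based checkpoints (illustrative, not exhaustive).
PICKLE_SUFFIXES = {".pt", ".pth", ".bin", ".pkl", ".ckpt"}


def find_pickle_files(filenames):
    """Return the names in a repo listing that look like pickle checkpoints."""
    return [n for n in filenames if PurePosixPath(n).suffix.lower() in PICKLE_SUFFIXES]


def require_safetensors(filename: str) -> str:
    """Raise unless the weight file uses the safetensors suffix."""
    if PurePosixPath(filename).suffix.lower() != ALLOWED_WEIGHT_SUFFIX:
        raise ValueError(
            f"Refusing to load {filename!r}: only {ALLOWED_WEIGHT_SUFFIX} weights are accepted"
        )
    return filename
```

A suffix check is only a first line of defense, since a file can be misnamed; it complements, rather than replaces, revision pinning and loading through a safetensors-only code path.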

Loading safetensors in Practice

# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")

Recommendation: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Use use_safetensors=True explicitly for safety documentation purposes (it's the default in modern transformers, but being explicit shows intent).

7. HuggingFace Integration

How to Depend on huggingface_hub

huggingface_hub is lightweight (~15MB installed) and well-maintained. It should be a required dependency for any package that downloads models from the Hub.

dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
]

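The version pin >=1.5.0,<2.0 follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes. Note that transformers 4.x releases cap huggingface-hub below 1.0, so alongside this pin a transformers>=4.40 specifier will in practice resolve to a 5.x release.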

Key Features to Use

hf_hub_download() — Download a single file with caching
snapshot_download() — Download an entire repo with caching
try_to_load_from_cache() — Check if a file is already cached (no network call)
Offline mode — HF_HUB_OFFLINE=1 or local_files_only=True
Authentication — Automatic via HF_TOKEN env var or hf auth login
Filtering — allow_patterns / ignore_patterns to download only what's needed

Download Pattern for alknet-firewall

import os
from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash

def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.
    
    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"
    
    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path

Caching

huggingface_hub caching is automatic and robust:

Content-addressed: Files are stored by SHA256 hash
Symlink-based: Multiple revisions share the same blob
No redundant downloads: Already-cached files are never re-downloaded
Cache inspection: hf cache ls CLI or scan_cache_dir() Python API
Cache cleanup: hf cache prune removes unreferenced revisions

You don't need to implement your own caching layer. Just use huggingface_hub and let it handle everything.

Authentication for Gated Models

If your fine-tuned model is gated (requires license acceptance):

# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set HF_TOKEN environment variable or run: hf auth login

# Your code just works — huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)

Recommendation: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.

Environment Variables

Key environment variables your users might need:

Variable	Purpose	Default
`HF_HOME`	Root cache directory	`~/.cache/huggingface`
`HF_HUB_CACHE`	Specific cache directory for hub files	`$HF_HOME/hub`
`HF_HUB_OFFLINE`	Skip all network calls	`0`
`HF_TOKEN`	Authentication token	None
`HF_HUB_DOWNLOAD_TIMEOUT`	Download timeout in seconds	`10`
`TRANSFORMERS_CACHE`	Transformers-specific cache	Deprecated; use `HF_HUB_CACHE`
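The precedence implied by the table can be sketched as follows. This is a simplification for illustration (it ignores other variables that huggingface_hub may also consult), and the set of truthy strings accepted for `HF_HUB_OFFLINE` is an assumption. `resolve_hub_cache` and `is_offline` are hypothetical helpers.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the hub cache directory: HF_HUB_CACHE, then HF_HOME/hub, then default."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def is_offline(env: Optional[Mapping[str, str]] = None) -> bool:
    """Interpret HF_HUB_OFFLINE; the accepted truthy strings are an assumption."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "0").strip().upper() in {"1", "ON", "YES", "TRUE"}


if __name__ == "__main__":
    print(resolve_hub_cache(), is_offline())
```

In practice it is usually better to let huggingface_hub resolve these itself; a helper like this is mainly useful for diagnostics, such as printing where the model is expected to live.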

Summary of Recommendations

Dependency Strategy

[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"]  # same as torch; document CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]

Model Distribution

Runtime download via huggingface_hub.snapshot_download()
Cache in default HF cache (~/.cache/huggingface/hub/)
Pin model revision for reproducibility
Filter downloads with allow_patterns (skip .bin, .msgpack, etc.)
Support offline mode via HF_HUB_OFFLINE / local_files_only=True

Inference Backend

Primary: PyTorch + transformers (via [torch] extra)
Alternative: ONNX Runtime (via [onnx] extra) — much smaller footprint
CPU-only: Document two-step install for CPU-only torch
Don't torch.compile() by default — make it opt-in

Security

Only load safetensors format — never pickle-based .pt/.bin
Verify model provenance — pin to specific HF revisions
Don't bundle model weights — runtime download with checksums

Installation Paths (for docs)

# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"

# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"

# Pre-download model for offline use
alknet-firewall download  # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
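The `alknet-firewall download` command mentioned above could be a thin wrapper around `snapshot_download()`. The following is a sketch only: the option names (`--model-id`, `--revision`, `--cache-dir`) and the `build_parser`/`main` functions are hypothetical, and the command would still need to be registered under `[project.scripts]` in pyproject.toml.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser with a single 'download' subcommand."""
    parser = argparse.ArgumentParser(prog="alknet-firewall")
    sub = parser.add_subparsers(dest="command", required=True)
    dl = sub.add_parser("download", help="Pre-fetch model files into the local cache")
    dl.add_argument("--model-id", default="HuggingFaceTB/SmolLM2-135M")
    dl.add_argument("--revision", default="main")
    dl.add_argument("--cache-dir", default=None)
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "download":
        # Imported lazily so that --help works even if the import is slow.
        from huggingface_hub import snapshot_download

        path = snapshot_download(
            repo_id=args.model_id,
            revision=args.revision,
            cache_dir=args.cache_dir,
            allow_patterns=["*.safetensors", "*.json"],
        )
        print(path)
    return 0
```

Running the command once on a connected machine populates the cache, after which `HF_HUB_OFFLINE=1` keeps later runs from touching the network.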

References

HuggingFace Transformers setup.py — torch as optional extra pattern
HuggingFace Safetensors pyproject.toml — chained extras pattern
HuggingFace Hub setup.py — minimal core with extras
HuggingFace Hub caching docs
HuggingFace Hub download docs
HuggingFace Safetensors docs
Safetensors speed comparison — 76x faster CPU load than pickle
HuggingFace Optimum — ONNX Runtime integration
HuggingFace Optimum ONNX quickstart
ONNX Runtime — cross-platform inference engine
PyTorch installation — CPU-only install via --index-url
Transformers installation docs — CPU-only torch install pattern
