# Research: Packaging Python Libraries with PyTorch Dependencies

## Question

How to package and distribute a Python library (alknet-firewall) that depends on PyTorch/transformers for inference of a ~135M parameter model (SmolLM2-135M), scikit-learn for SVD computations, and safetensors for model weight loading, while keeping the package lean, pip-installable, and reliable.

---

## 1. PyTorch as a Dependency

### How Mature ML Packages Handle It

The three major HuggingFace packages each take a different approach:

#### `transformers`: Torch as Optional Extra

From `setup.py` (v5.x), `transformers` does **NOT** include `torch` in `install_requires`. Instead:

```python
# Hard dependencies (install_requires)
install_requires = [
    "huggingface-hub>=1.5.0,<2.0",
    "numpy>=1.17",
    "packaging>=20.0",
    "pyyaml>=5.1",
    "regex>=2025.10.22",
    "tokenizers>=0.22.0,<=0.23.0",
    "safetensors>=0.4.3",
    "tqdm>=4.60",
    "typer",
]

# Torch is an OPTIONAL extra
extras["torch"] = deps_list("torch", "accelerate")
```

Users install with `pip install "transformers[torch]"`. A plain `pip install transformers` gives you the library, but any torch-dependent code path fails at runtime.

**Key insight**: `transformers` was designed as a multi-framework library (torch/tf/jax), so making torch optional was a necessity, not just a convenience. It also uses `dummy_*.py` modules that provide placeholder classes when a framework isn't installed, which gives better error messages.

#### `safetensors`: Framework-Specific Optional Extras

From `pyproject.toml`:

```toml
[project.optional-dependencies]
numpy = ["numpy>=1.24.6"]
torch = ["safetensors[numpy]", "torch>=2.4"]
tensorflow = ["safetensors[numpy]", "tensorflow>=2.11.0"]
jax = ["safetensors[numpy]", "flax>=0.6.3", "jax>=0.3.25", "jaxlib>=0.3.25"]
mlx = ["mlx>=0.0.9"]
paddlepaddle = ["safetensors[numpy]", "paddlepaddle>=2.4.1"]
convert = ["safetensors[torch]", "huggingface_hub>=1.4"]
```

The base `safetensors` package (no extras) can load files and return raw tensor data (as numpy arrays via the `numpy` extra). Each framework extra adds the framework-specific save/load functions. The `convert` extra specifically chains to `torch`.

**Key insight**: Safetensors uses a **chained extras** pattern: `torch` depends on `numpy`, so `safetensors[torch]` pulls both. This is clean and explicit.

#### `huggingface_hub`: Minimal Core, Framework Extras

From `setup.py`:

```python
install_requires = [
    "click>=8.4.0",
    "filelock>=3.10.0",
    "fsspec>=2023.5.0",
    "hf-xet>=1.5.1,<2.0.0",  # conditional on platform
    "httpx>=0.23.0, <1",
    "packaging>=20.9",
    "pyyaml>=5.1",
    "tqdm>=4.42.1",
    "typer>=0.20.0,<0.26.0",
    "typing-extensions>=4.1.0",
]

extras["torch"] = ["torch", "safetensors[torch]"]
extras["mcp"] = ["mcp>=1.8.0"]
extras["oauth"] = ["authlib>=1.3.2", "fastapi", ...]
```

**Key insight**: `huggingface_hub` is deliberately minimal. Torch is only needed for certain features. The `hf_xet` dependency uses platform markers for conditional installation.

### Options Summary

| Approach | Used By | Pros | Cons |
|----------|---------|------|------|
| **Optional extra** (`package[torch]`) | transformers, safetensors, huggingface_hub | Users control their torch version; avoids forcing 2GB+ install | Must document clearly; code must handle missing torch gracefully |
| **Required dependency** | Few mature packages | Simpler code; guaranteed torch available | Forces 2GB+ download; version conflicts with user's torch |
| **Lazy imports + graceful error** | transformers (internal) | Good UX when torch missing; no crashes on import | More code complexity; can't type-check torch-dependent code |
| **Platform-conditional** | huggingface_hub (hf_xet) | Right dependency for right platform | Complex setup.py; torch doesn't support this well |

### Recommendation for alknet-firewall

**Use optional extras with lazy imports.** This is the dominant pattern in the HuggingFace ecosystem. Since this project specifically needs torch for inference (it's the core function), you have two sub-options:

1. **`pip install alknet-firewall`**: minimal install, downloads the model at first run, requires torch to already be present
2. **`pip install "alknet-firewall[torch]"`**: installs torch as a dependency

In your code, use lazy imports with a clear error message:

```python
def _require_torch():
    """Import torch lazily and fail with an actionable message if it is missing."""
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            "PyTorch is required for alknet-firewall inference. "
            "Install it with: pip install 'alknet-firewall[torch]' "
            "or pip install torch --index-url https://download.pytorch.org/whl/cpu"
        ) from exc
    return torch
```
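
The same idea generalizes to any optional dependency. The sketch below is one possible shape for such a helper; the name `require_optional` and its `extra` parameter are illustrative assumptions, not part of any existing API. The standard-library `json` module stands in for an optional package so the example runs anywhere.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper: `extra` is the name of the alknet-firewall extra
    that would provide the module.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install 'alknet-firewall[{extra}]'"
        ) from exc


# The import happens only when the helper is called, so importing the
# package itself never requires the optional dependency.
json_module = require_optional("json", extra="torch")  # stdlib stand-in
print(json_module.__name__)
```

Calling the helper inside the functions that need the backend, rather than at module top level, keeps `import alknet_firewall` cheap and safe on machines without torch.
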

---

## 2. Model File Distribution

### Size Reality Check: SmolLM2-135M

The SmolLM2-135M model consists of:

- `model.safetensors`: ~269MB (model weights)
- `config.json`: ~700 bytes
- `tokenizer.json`: ~2-4MB
- `tokenizer_config.json`: ~1KB
- `generation_config.json`: ~200 bytes

**Total: ~272MB+**

This is far too large to bundle in a Python package. PyPI's default limits are 100MB per file and 10GB per project, so a 269MB weights file would be rejected unless the project is granted an increase. Even if it were allowed, a 272MB wheel download is terrible UX.
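
As a quick sanity check on the weights figure, the size follows from parameter count times bytes per parameter. The sketch below assumes the nominal 135M parameter count from the model name and 16-bit storage; both are assumptions for illustration, and the small safetensors header is ignored.

```python
def weights_size_mb(num_parameters: int, bytes_per_parameter: int) -> float:
    """Approximate checkpoint size in MB (10**6 bytes), ignoring the header."""
    return num_parameters * bytes_per_parameter / 1_000_000


NOMINAL_PARAMS = 135_000_000  # assumption: nominal count taken from the model name

size_16bit = weights_size_mb(NOMINAL_PARAMS, 2)  # 16-bit weights
size_32bit = weights_size_mb(NOMINAL_PARAMS, 4)  # 32-bit weights
print(f"16-bit: ~{size_16bit:.0f} MB, 32-bit: ~{size_32bit:.0f} MB")
```

The 16-bit estimate of roughly 270MB lines up with the ~269MB file listed above; storing the same weights in 32-bit would double it.
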

### Distribution Options

| Approach | Feasibility | When to Use |
|----------|-------------|-------------|
| **Bundled in package_data** | ❌ Not feasible at 269MB | Only for files <10MB (configs, tokenizers) |
| **Runtime download via huggingface_hub** | ✅ **Recommended** | Default approach for any model >10MB |
| **Separate package for model artifacts** | ⚠️ Possible but awkward | When you need offline-first install |
| **Custom download (S3, etc.)** | ⚠️ Works but reinvents the wheel | When HF Hub isn't available |

### Recommended Approach: Runtime Download via huggingface_hub

This is exactly what `transformers` does. The pattern:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Download entire model (with caching)
model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],
    # Users can set HF_HOME or HF_HUB_CACHE to control cache location
)

# Or download individual files
safetensors_path = hf_hub_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    filename="model.safetensors",
)
```

### Caching Strategy

`huggingface_hub` handles caching automatically:

- **Default cache location**: `~/.cache/huggingface/hub/`
- **Configurable via**: `HF_HOME`, `HF_HUB_CACHE`, or `cache_dir` parameter
- **Structure**: Content-addressed storage with symlinks (blobs + snapshots)
- **Deduplication**: Same file across revisions → single blob on disk
- **No re-downloads**: Cached files are checked before download
- **Offline mode**: Set `HF_HUB_OFFLINE=1` to skip all network calls

The cache structure:

```
~/.cache/huggingface/hub/
├── models--HuggingFaceTB--SmolLM2-135M/
│   ├── blobs/       # actual files, named by hash
│   ├── refs/        # branch/tag → commit mappings
│   └── snapshots/   # symlinks to blobs, one per revision
```
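
The folder naming in the tree above can be reproduced with a few lines of string handling, which is handy in tests or diagnostics that need to predict where a snapshot will land. This is a sketch that mirrors the layout shown here; the function names are hypothetical, and real code should prefer the library's own download and cache-inspection functions over hand-built paths.

```python
from pathlib import Path


def repo_cache_folder(repo_id: str, repo_type: str = "model") -> str:
    """Build the per-repo folder name seen in the cache tree."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


def expected_snapshot_dir(cache_root: Path, repo_id: str, commit_hash: str) -> Path:
    """Where the files for one revision are expected to appear."""
    return cache_root / repo_cache_folder(repo_id) / "snapshots" / commit_hash


print(repo_cache_folder("HuggingFaceTB/SmolLM2-135M"))
```
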

### Pinning Model Versions

To ensure reproducibility, pin the model revision:

```python
from huggingface_hub import snapshot_download

# Pin to a specific commit hash for reproducibility.
# Example value only: confirm the actual commit hash on the model page.
MODEL_REVISION = "4e047e16e1e8f8a0b3b3c3a3e3d3f3a3b3c3d3e3"

model_path = snapshot_download(
    repo_id="HuggingFaceTB/SmolLM2-135M",
    revision=MODEL_REVISION,
)
```

Or pin to a tag if the model has version tags. A tag can be moved by the repo owner, so a commit hash is the stronger guarantee.
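
A small guard can enforce that a configured revision really is a commit hash rather than a branch name such as `main`. The sketch below assumes revisions are full 40-character lowercase hexadecimal Git hashes; `is_pinned_revision` is a hypothetical helper name.

```python
import re

# Full Git commit hashes are 40 lowercase hexadecimal characters.
_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_pinned_revision(revision: str) -> bool:
    """True if `revision` looks like a full commit hash, not a branch or tag."""
    return _COMMIT_HASH.fullmatch(revision) is not None


for candidate in ("main", "v1.0", "a" * 40):
    print(candidate[:12], is_pinned_revision(candidate))
```

A check like this could run in CI or at startup to flag a default revision that was left on a moving branch.
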

### Gated Model Authentication

If your model requires authentication (accepting license terms on HF Hub):

1. User sets the `HF_TOKEN` environment variable or logs in via `hf auth login` (`huggingface-cli login` in older releases)
2. `hf_hub_download()` automatically picks up the token
3. Document this requirement clearly

```python
# If the model is gated, this will fail without auth
# with a clear error message from huggingface_hub
model_path = snapshot_download(
    repo_id="YourOrg/YourGatedModel",
    token=True,  # explicitly use stored token
)
```

SmolLM2-135M is **not gated** as of this writing, but your own fine-tuned version could be.

---

## 3. Inference-Only Considerations

### CPU-Only PyTorch

**Yes, you can install torch without CUDA.** The official method:

```bash
# CPU-only torch (much smaller: ~200MB vs ~2GB+ for CUDA)
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

**Problem**: You can't express this in standard `pyproject.toml` extras. The CPU-only torch is served from a different index URL (`https://download.pytorch.org/whl/cpu`), not from PyPI. This means:

1. `pip install "alknet-firewall[torch]"` will install the default torch wheel from PyPI, which on Linux is the CUDA build (~2GB)
2. To get CPU-only torch, users must do a two-step install:
   ```bash
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   pip install alknet-firewall
   ```

**Workaround**: Document both installation paths clearly:

```markdown
## Installation

# With CUDA (default torch):
pip install "alknet-firewall[torch]"

# CPU-only (smaller, for inference without GPU):
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall
```

### torch.compile() for Faster Inference

`torch.compile()` (PyTorch 2.0+) can speed up inference significantly by JIT-compiling model graphs:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_id)
model = torch.compile(model)  # JIT compile for faster inference
```

**Caveats**:

- First run is slow (compilation overhead)
- Best for repeated inference (the compiled model is cached)
- CPU-only works but benefits are smaller than on GPU
- Adds complexity; not worth it for a ~135M model unless latency is critical

**Recommendation**: Make this optional. Don't `torch.compile()` by default; offer it as a performance tuning option.
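
One way to keep compilation opt-in is to route it through an explicit option that defaults to off. The sketch below is illustrative only: `InferenceOptions`, its `compile_model` field, and `maybe_compile` are hypothetical names, and the `compile_fn` parameter exists so the logic can be exercised without torch installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class InferenceOptions:
    """Hypothetical tuning options for the inference backend."""

    compile_model: bool = False  # off by default, as recommended above


def maybe_compile(
    model: Any,
    options: InferenceOptions,
    compile_fn: Optional[Callable[[Any], Any]] = None,
) -> Any:
    """Return the model unchanged unless compilation was explicitly requested."""
    if not options.compile_model:
        return model
    if compile_fn is None:
        import torch  # imported lazily so the default path never needs torch

        compile_fn = torch.compile
    return compile_fn(model)


# Default options leave the model untouched.
placeholder_model = object()
print(maybe_compile(placeholder_model, InferenceOptions()) is placeholder_model)
```
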

### torch.export() / TorchDynamo

`torch.export()` (PyTorch 2.1+) produces a portable model artifact:

```python
exported_model = torch.export.export(model, (input_ids,))
```

This is still evolving and primarily targets server deployment. Not practical for a pip-installable library at this time.

### ONNX Runtime as an Alternative

**This is the most compelling alternative to raw PyTorch for inference-only use cases.**

HuggingFace's `optimum` library provides seamless ONNX Runtime integration:

```python
# Instead of:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Use:
from optimum.onnxruntime import ORTModelForSequenceClassification
# export=True converts on the fly when the repo has no ONNX weights
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
```

**Benefits**:

- `onnxruntime` package is ~30-50MB vs `torch` at ~200-2000MB+
- ONNX Runtime is optimized for inference (no autograd, no training overhead)
- Often faster inference on CPU than PyTorch
- Cross-platform (CPU, GPU, mobile, edge devices)

**Drawbacks**:

- Need to export model to ONNX format first (one-time step)
- Not all model architectures support ONNX export equally
- Quantization/int8 support varies by architecture
- Adds `onnxruntime` + `optimum` as dependencies (still much smaller than torch)

**Size comparison**:

| Package | Install Size |
|---------|-------------|
| `torch` (CUDA) | ~2.5GB |
| `torch` (CPU only) | ~200MB |
| `onnxruntime` | ~30-50MB |
| `onnxruntime-gpu` | ~500MB |

**Recommendation**: Consider offering ONNX Runtime as an **alternative inference backend** via an extra:

```toml
[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40", "accelerate>=1.0"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]"]
```

For a ~135M parameter model, ONNX Runtime on CPU should provide excellent performance.
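
With two backend extras, the library needs a way to decide at runtime which one is usable. The following is a minimal sketch under stated assumptions: the backend names, the `BACKEND_MODULES` mapping, and `select_backend` are hypothetical, and detection relies only on `importlib.util.find_spec`, which checks importability without importing the heavy module.

```python
import importlib.util
from typing import Callable, Optional, Sequence

# Hypothetical mapping from backend name to the module that must be importable.
BACKEND_MODULES = {"torch": "torch", "onnx": "onnxruntime"}


def _is_installed(module_name: str) -> bool:
    """Check importability without actually importing the module."""
    return importlib.util.find_spec(module_name) is not None


def select_backend(
    preferred: Optional[str] = None,
    order: Sequence[str] = ("torch", "onnx"),
    is_installed: Callable[[str], bool] = _is_installed,
) -> str:
    """Pick the requested backend, or the first installed one in `order`."""
    candidates = [preferred] if preferred else list(order)
    for name in candidates:
        if name not in BACKEND_MODULES:
            raise ValueError(f"unknown backend: {name!r}")
        if is_installed(BACKEND_MODULES[name]):
            return name
    raise ImportError(
        "No inference backend found. Install one with: "
        "pip install 'alknet-firewall[torch]' or pip install 'alknet-firewall[onnx]'"
    )
```

The `is_installed` parameter is there so tests can simulate any combination of installed backends without touching the real environment.
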

### Using transformers Without Training Dependencies

`transformers` is already split this way. The base `pip install transformers` does NOT include torch. You need `pip install "transformers[torch]"` to get torch support.

Additional ways to keep transformers lean:

- Don't install `accelerate` unless you need multi-GPU / `device_map="auto"`
- Don't install training extras (`deepspeed`, `peft`, etc.)
- For inference only, you don't need: `scipy`, `scikit-learn` (from transformers extras), `tensorboard`, etc.

**What transformers needs for basic inference**:

- `torch` (or, in releases that still ship those backends, `tensorflow` or `flax`)
- `safetensors`
- `tokenizers`
- `huggingface-hub`
- `numpy`
- `packaging`
- `pyyaml`
- `regex`
- `tqdm`

---

## 4. sklearn + PyTorch Coexistence

### Compatibility: Generally Fine

sklearn (scikit-learn) and PyTorch are independent packages with no direct dependency on each other. They coexist without issues in the same environment.

**Potential concerns**:

1. **numpy version**: Both sklearn and torch depend on numpy. torch historically pinned numpy tightly, but recent torch versions (2.4+) are more flexible. As of 2025-2026:
   - torch>=2.4 requires `numpy>=1.17` (no upper bound in practice)
   - scikit-learn>=1.5 requires `numpy>=1.19.5`
   - These are compatible

2. **Dependency tree size**: Adding both adds ~500MB+ to install size, but there are no runtime conflicts.

3. **BLAS/LAPACK**: Both use optimized linear algebra. If using MKL-backed numpy, both benefit. No conflicts expected.

4. **Thread oversubscription**: sklearn parallelizes through joblib and through the BLAS/OpenMP thread pools underneath numpy and scipy; torch uses its own threading. If running sklearn SVD and torch inference in the same process, consider setting thread counts to avoid oversubscription:
   ```python
   import torch
   torch.set_num_threads(4)  # limit torch's intra-op threads

   from threadpoolctl import threadpool_limits  # installed with scikit-learn
   # Cap the BLAS/OpenMP pools that scikit-learn's SVD runs on
   with threadpool_limits(limits=4):
       ...  # run the scikit-learn SVD here
   ```

**Recommendation**: No special handling needed. Just include both as dependencies. Set `torch.set_num_threads()` if you notice CPU contention.
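
If contention does show up, one simple policy is to divide the available cores between the two libraries instead of letting each claim all of them. The helper below is a hypothetical sketch of that arithmetic, not an existing API; the 50/50 default split is an arbitrary assumption.

```python
import os
from typing import Optional, Tuple


def split_thread_budget(
    total: Optional[int] = None, torch_fraction: float = 0.5
) -> Tuple[int, int]:
    """Split a core budget into (torch_threads, other_threads), each at least 1."""
    if total is None:
        total = os.cpu_count() or 1
    if total < 1:
        raise ValueError("total must be at least 1")
    if not 0.0 < torch_fraction < 1.0:
        raise ValueError("torch_fraction must be strictly between 0 and 1")
    torch_threads = max(1, int(total * torch_fraction))
    other_threads = max(1, total - torch_threads)
    return torch_threads, other_threads


torch_threads, other_threads = split_thread_budget(total=8)
print(torch_threads, other_threads)
```

The first value would go to `torch.set_num_threads()`, and the second to whatever limits the linear-algebra thread pools used by scikit-learn.
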

---

## 5. Package Size Optimization

### What to Make Required vs Optional

For alknet-firewall, here's a practical breakdown:

| Component | Required? | Rationale |
|-----------|-----------|-----------|
| `huggingface_hub` | ✅ Required | Model downloading, caching |
| `safetensors` | ✅ Required | Loading model weights |
| `tokenizers` | ✅ Required | Text preprocessing |
| `numpy` | ✅ Required | Tensor operations, sklearn dependency |
| `scikit-learn` | ✅ Required | SVD computations (core feature) |
| `packaging` | ✅ Required | Version comparisons |
| `filelock` | ✅ Required | File locking for cache |
| `tqdm` | ✅ Required | Progress bars |
| `pyyaml` | ✅ Required | Config parsing |
| `torch` | ❌ Optional (extra) | Large; user may already have it |
| `transformers` | ❌ Optional (extra) | Pulls many deps; only for model loading |
| `onnxruntime` | ❌ Optional (extra) | Alternative inference backend |
| `optimum` | ❌ Optional (extra) | ONNX Runtime integration |

### Practical pyproject.toml Structure

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
# Full torch-based inference
torch = [
    "torch>=2.4",
    "transformers>=4.40",
]
# ONNX Runtime inference (lighter)
onnx = [
    "onnxruntime>=1.17",
    "optimum[onnxruntime]",
    "transformers>=4.40",
]
# Development
dev = [
    "pytest>=7",
    "ruff>=0.9",
    "mypy",
]
```

### Estimated Install Sizes

| Install Command | Download Size | Disk Size |
|----------------|---------------|-----------|
| `pip install alknet-firewall` | ~30MB | ~100MB |
| `pip install "alknet-firewall[torch]"` | ~2GB+ | ~5GB+ |
| `pip install "alknet-firewall[onnx]"` | ~100MB | ~300MB |
| + model download (first run) | ~269MB | ~269MB |

---

## 6. safetensors Format

### Why safetensors Over PyTorch Pickle

| Property | `.safetensors` | `.pt` / `.bin` (pickle) |
|----------|---------------|------------------------|
| **Security** | ✅ No arbitrary code execution | ❌ Pickle can execute arbitrary code |
| **Speed (CPU)** | ~76x faster than pickle | Baseline |
| **Speed (GPU)** | ~2x faster than pickle | Baseline |
| **Zero-copy** | ✅ Memory-mapped loading | ❌ Extra copies |
| **Lazy loading** | ✅ Load only needed tensors | ❌ Must load entire file |
| **Cross-framework** | ✅ pt, tf, jax, numpy, mlx | ❌ Framework-specific |
| **File size limit** | ✅ No practical limit | ⚠️ Practical limits exist |
| **Layout control** | ✅ Deterministic | ❌ Non-deterministic |

### Security Implications

**Pickle-based `.pt` / `.bin` files are a known security risk.** Loading a `.pt` file with `torch.load()` can execute arbitrary Python code embedded in the file unless loading is restricted with `weights_only=True` (the default since PyTorch 2.6). This is a supply chain attack vector.

`safetensors` eliminates this entirely: the format is a simple binary layout with a JSON header describing tensor metadata. No code execution is possible.

**For a security-focused product (firewall)**, this is critical. You should:

1. **Only load model weights from safetensors format**, never `.pt` or `.bin`
2. **Verify checksums** when downloading models (huggingface_hub does this automatically)
3. **Pin model revisions** to specific commit hashes
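
To make the "JSON header plus raw bytes" layout concrete, the sketch below builds a tiny safetensors-style payload in memory and reads its header back using only the standard library. It follows the published layout (an 8-byte little-endian header length, the JSON header, then the tensor bytes), but it is an illustration, not a replacement for the `safetensors` package; the function names and the size cap are assumptions.

```python
import json
import struct


def build_demo_payload() -> bytes:
    """Assemble a minimal payload holding one float32 tensor of shape [2]."""
    data = struct.pack("<2f", 1.0, 2.0)  # 8 bytes of tensor data
    header = {
        "demo.weight": {"dtype": "F32", "shape": [2], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # Layout: u64 little-endian header length, JSON header, raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def read_header(payload: bytes, max_header_bytes: int = 100_000_000) -> dict:
    """Parse only the JSON header; nothing in the file is executed."""
    if len(payload) < 8:
        raise ValueError("payload too short to contain a header length")
    (header_len,) = struct.unpack("<Q", payload[:8])
    if header_len > max_header_bytes or 8 + header_len > len(payload):
        raise ValueError("header length is implausible for this payload")
    return json.loads(payload[8 : 8 + header_len].decode("utf-8"))


header = read_header(build_demo_payload())
print(header["demo.weight"]["dtype"], header["demo.weight"]["shape"])
```

Everything a loader needs (names, dtypes, shapes, byte offsets) comes from plain JSON, which is why reading the file cannot trigger code execution the way unpickling can.
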

### Loading safetensors in Practice

```python
# Method 1: via transformers (uses safetensors automatically)
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    use_safetensors=True,  # explicit, though default now
)

# Method 2: direct loading (framework-agnostic)
from safetensors import safe_open
tensors = {}
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in f.keys():
        tensors[key] = f.get_tensor(key)

# Method 3: lazy loading (only some tensors)
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("model.embed_tokens.weight")
```

**Recommendation**: Use Method 1 (via transformers) as the primary path. It handles all the complexity of model architecture, config parsing, and weight loading. Pass `use_safetensors=True` explicitly: it is the default in modern transformers, but spelling it out documents the security intent.

---

## 7. HuggingFace Integration

### How to Depend on huggingface_hub

`huggingface_hub` is lightweight (~15MB installed) and well-maintained. It should be a **required dependency** for any package that downloads models from the Hub.

```toml
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
]
```

The version pin `>=1.5.0,<2.0` follows HuggingFace's own convention (transformers uses the same pin). Major version 2.x may have breaking changes.

### Key Features to Use

1. **`hf_hub_download()`**: Download a single file with caching
2. **`snapshot_download()`**: Download an entire repo with caching
3. **`try_to_load_from_cache()`**: Check if a file is already cached (no network call)
4. **Offline mode**: `HF_HUB_OFFLINE=1` or `local_files_only=True`
5. **Authentication**: Automatic via `HF_TOKEN` env var or `hf auth login` (`huggingface-cli login` in older releases)
6. **Filtering**: `allow_patterns` / `ignore_patterns` to download only what's needed

### Download Pattern for alknet-firewall

```python
import os

from huggingface_hub import snapshot_download

# Configuration
DEFAULT_MODEL_ID = "HuggingFaceTB/SmolLM2-135M"  # or your fine-tuned version
DEFAULT_MODEL_REVISION = "main"  # or pin a specific commit hash


def ensure_model_downloaded(
    model_id: str = DEFAULT_MODEL_ID,
    revision: str = DEFAULT_MODEL_REVISION,
    cache_dir: str | None = None,
) -> str:
    """Download model if not cached, return local path.

    Respects HF_HUB_OFFLINE for air-gapped environments.
    """
    offline = os.environ.get("HF_HUB_OFFLINE", "0") == "1"

    model_path = snapshot_download(
        repo_id=model_id,
        revision=revision,
        cache_dir=cache_dir,
        allow_patterns=[
            "*.safetensors",
            "config.json",
            "tokenizer.json",
            "tokenizer_config.json",
            "generation_config.json",
            "special_tokens_map.json",
        ],
        local_files_only=offline,
    )
    return model_path
```

### Caching

`huggingface_hub` caching is automatic and robust:

- **Content-addressed**: Files are stored by SHA256 hash
- **Symlink-based**: Multiple revisions share the same blob
- **No redundant downloads**: Already-cached files are never re-downloaded
- **Cache inspection**: `hf cache ls` CLI or `scan_cache_dir()` Python API
- **Cache cleanup**: `hf cache prune` removes unreferenced revisions

You don't need to implement your own caching layer. Just use `huggingface_hub` and let it handle everything.

### Authentication for Gated Models

If your fine-tuned model is gated (requires license acceptance):

```python
# User must:
# 1. Accept the model license on huggingface.co
# 2. Create an access token at huggingface.co/settings/tokens
# 3. Set HF_TOKEN environment variable or run: hf auth login
#    (huggingface-cli login in older releases)

# Your code just works: huggingface_hub reads the token automatically
model_path = snapshot_download(
    repo_id="YourOrg/GatedModel",
    token=True,  # explicitly use stored token
)
```

**Recommendation**: Keep the public SmolLM2-135M model ungated for the base use case. If you fine-tune and need access control, document the authentication steps clearly.

### Environment Variables

Key environment variables your users might need:

| Variable | Purpose | Default |
|----------|---------|---------|
| `HF_HOME` | Root cache directory | `~/.cache/huggingface` |
| `HF_HUB_CACHE` | Specific cache directory for hub files | `$HF_HOME/hub` |
| `HF_HUB_OFFLINE` | Skip all network calls | `0` |
| `HF_TOKEN` | Authentication token | None |
| `HF_HUB_DOWNLOAD_TIMEOUT` | Download timeout in seconds | `10` |
| `TRANSFORMERS_CACHE` | Transformers-specific cache | Deprecated; use `HF_HUB_CACHE` |
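
The precedence implied by the first two rows can be written down as a small resolver, which is useful for diagnostics such as printing where the model is expected to live. This sketch mirrors only the defaults listed in the table; it is not the library's own resolution logic, which may consult additional variables, and `resolve_hub_cache` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the hub cache directory: HF_HUB_CACHE, else $HF_HOME/hub."""
    env = os.environ if env is None else env
    explicit = env.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
```

Passing the environment in as a mapping keeps the function easy to test without mutating `os.environ`.
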

---

## Summary of Recommendations

### Dependency Strategy

```toml
[project]
name = "alknet-firewall"
requires-python = ">=3.10"
dependencies = [
    "huggingface-hub>=1.5.0,<2.0",
    "safetensors>=0.4.3",
    "tokenizers>=0.20",
    "numpy>=1.24",
    "scikit-learn>=1.3",
    "packaging>=20.0",
    "filelock>=3.10",
    "tqdm>=4.60",
    "pyyaml>=5.1",
]

[project.optional-dependencies]
torch = ["torch>=2.4", "transformers>=4.40"]
onnx = ["onnxruntime>=1.17", "optimum[onnxruntime]", "transformers>=4.40"]
cpu = ["torch>=2.4", "transformers>=4.40"]  # same as torch; document CPU install separately
dev = ["pytest>=7", "ruff>=0.9"]
```

### Model Distribution

- **Runtime download** via `huggingface_hub.snapshot_download()`
- **Cache** in default HF cache (`~/.cache/huggingface/hub/`)
- **Pin model revision** for reproducibility
- **Filter downloads** with `allow_patterns` (skip `.bin`, `.msgpack`, etc.)
- **Support offline mode** via `HF_HUB_OFFLINE` / `local_files_only=True`

### Inference Backend

- **Primary**: PyTorch + transformers (via `[torch]` extra)
- **Alternative**: ONNX Runtime (via `[onnx]` extra), with a much smaller footprint
- **CPU-only**: Document two-step install for CPU-only torch
- **Don't torch.compile() by default**: make it opt-in

### Security

- **Only load safetensors format**: never pickle-based `.pt`/`.bin`
- **Verify model provenance**: pin to specific HF revisions
- **Don't bundle model weights**: runtime download with checksums
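
The first rule can be enforced in code before any weights are opened. The guard below is a minimal sketch: the function name is hypothetical, and the suffix list goes beyond the `.pt`/`.bin` named above by also rejecting `.pth`, `.pkl`, and `.ckpt`, which is an added assumption about what counts as pickle-based.

```python
from pathlib import PurePath
from typing import Iterable, List

# .pt and .bin come from the rule above; the other suffixes are assumptions.
PICKLE_SUFFIXES = {".pt", ".pth", ".bin", ".pkl", ".ckpt"}


def check_weight_files(filenames: Iterable[str]) -> List[str]:
    """Return the .safetensors files, refusing any pickle-based weights."""
    names = list(filenames)
    rejected = [n for n in names if PurePath(n).suffix.lower() in PICKLE_SUFFIXES]
    if rejected:
        raise ValueError(f"refusing pickle-based weight files: {rejected}")
    weights = [n for n in names if PurePath(n).suffix.lower() == ".safetensors"]
    if not weights:
        raise ValueError("no .safetensors weight files found")
    return weights


print(check_weight_files(["config.json", "model.safetensors", "tokenizer.json"]))
```

Running a check like this on the downloaded snapshot directory complements `allow_patterns`, which only controls what is fetched, not what may already be sitting in a user-supplied path.
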

### Installation Paths (for docs)

```bash
# Full install (with CUDA torch)
pip install "alknet-firewall[torch]"

# CPU-only (smaller download)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install alknet-firewall

# ONNX Runtime (smallest footprint)
pip install "alknet-firewall[onnx]"

# Pre-download model for offline use
alknet-firewall download  # CLI command to pre-fetch model
# Or set HF_HUB_OFFLINE=1 after first download
```

---

## References

- [HuggingFace Transformers setup.py](https://github.com/huggingface/transformers/blob/main/setup.py): torch as optional extra pattern
- [HuggingFace Safetensors pyproject.toml](https://github.com/huggingface/safetensors/blob/main/bindings/python/pyproject.toml): chained extras pattern
- [HuggingFace Hub setup.py](https://github.com/huggingface/huggingface_hub/blob/main/setup.py): minimal core with extras
- [HuggingFace Hub caching docs](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache)
- [HuggingFace Hub download docs](https://huggingface.co/docs/huggingface_hub/en/guides/download)
- [HuggingFace Safetensors docs](https://huggingface.co/docs/safetensors/index)
- [Safetensors speed comparison](https://huggingface.co/docs/safetensors/en/speed): 76x faster CPU load than pickle
- [HuggingFace Optimum](https://github.com/huggingface/optimum): ONNX Runtime integration
- [HuggingFace Optimum ONNX quickstart](https://huggingface.co/docs/optimum-onnx/en/quickstart)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime): cross-platform inference engine
- [PyTorch installation](https://pytorch.org/get-started/locally/): CPU-only install via `--index-url`
- [Transformers installation docs](https://huggingface.co/docs/transformers/installation): CPU-only torch install pattern