---
status: draft
last_updated: 2026-06-13
---

# Model

The model component manages detector model loading, inference, and activation extraction. It is the interface between the firewall and the language model that provides behavioral signals.

## What It Is

The model component loads a small language model (default: SmolLM2-135M), runs inference on untrusted inputs, and extracts hidden state activations at configured layers. It is model-agnostic: any transformer model with accessible hidden states can serve as a detector.

## Why It Exists

The firewall needs model activations (hidden states) to detect behavioral patterns. This component encapsulates the complexity of model loading, inference, and activation extraction behind a clean interface that the codebook and firewall can consume without knowing model-specific details.

The model-agnostic design (ADR-003) means the firewall is not tied to a specific detector model. Switching from SmolLM2-135M to another ~100M model requires recomputing the SVD basis and rebuilding the codebook, but no changes to the firewall logic.

## Key Concepts

### Activation Extraction

The core operation: running the model on an input and capturing hidden state representations at specific layers.

**Phase 1 (last-token extraction)**:

```python
# input_ids: tensor of shape (1, seq_len), a single input sequence
outputs = model(input_ids, output_hidden_states=True)
# hidden_states[0] is the embedding output; hidden_states[i] is the output of layer i
activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, -1, :]  # batch item 0, last token
    for layer_idx in configured_layers
}
# Shape: (hidden_dim,) per layer, i.e. a single vector
```

**Phase 2 (per-token extraction)**: Extract hidden states at every token position to enable token-level smoothing and per-position classification (see codebook.md: Token-Level Smoothing).
```python
# input_ids: tensor of shape (1, seq_len), a single input sequence
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]  # batch item 0, all tokens
    for layer_idx in configured_layers
}
# Shape: (seq_len, hidden_dim) per layer, i.e. a sequence of vectors
```

The training pipeline uses per-token extraction (z-coordinates at every position are collected for population statistics). Phase 1 simplifies to last-token only for lower latency and simpler implementation. The codebook's classifiers are trained on per-token data from all positions, so they remain valid for both extraction modes.

Key decisions:

- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (30-layer model). Early layers (1, 2) capture safety signals per EMNLP 2024 findings. Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral patterns. Layers 3, 6, 7 are omitted to reduce dimensionality; their signals are highly correlated with the selected layers.
- **Which token**: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard choice for autoregressive (LLaMA-family) models and sufficient for Phase 1. Per-token extraction enables the full detection pipeline in Phase 2.
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim` (576 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)` in Phase 2.

### Model-Agnostic Interface

The model component exposes a generic interface that works with any transformer model:

```python
from typing import Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```

The `infer` method returns hidden states at key layers, keyed by layer index, abstracting away whether the backend is PyTorch or a future alternative inference engine.

### Lazy Loading

The model is loaded on first use or on explicit preload, not at import time.
This keeps the library import fast (~milliseconds) even when torch is installed.

```python
firewall = Firewall()        # Does NOT load model yet
firewall.preload()           # Explicit: download + load model
alarm = firewall.screen(x)   # Implicit: loads model on first call if not loaded
```

### Offline Support

The model component respects `HF_HUB_OFFLINE` and `local_files_only` flags. In air-gapped environments, models must be pre-downloaded. The library provides a CLI command for this:

```bash
python -m alknet_firewall download
```

## Interfaces

### Public API

```python
from typing import ClassVar, Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = ""  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...
```

### Constraints

1. **safetensors-only**: Model weights are loaded exclusively from safetensors format. Pickle-based `.pt`/`.bin` files are never loaded (ADR-005). This is a security requirement for a security product.
2. **Model pinning**: Model revision must be pinned for reproducibility. Default revision is a specific commit hash, not `"main"`.
3. **CPU-first**: Default device is CPU. GPU inference is supported but not required. The <10ms latency target is achievable on CPU with a ~125M-parameter model.
4. **No training**: The detector model is inference-only. No gradients are computed. No model weights are modified at runtime.
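
The first two constraints, together with the offline flag, could be enforced at load time roughly as follows. This is a minimal sketch under stated assumptions, not the library's implementation: `build_load_kwargs` and `EXAMPLE_REVISION` are hypothetical names, and the placeholder hash is not a real SmolLM2-135M commit. The keys it returns (`revision`, `use_safetensors`, `local_files_only`, `cache_dir`) are existing keyword arguments of `from_pretrained` in HuggingFace transformers.

```python
from __future__ import annotations

import os
import re

# Hypothetical placeholder, NOT a real SmolLM2-135M commit; substitute the pinned hash.
EXAMPLE_REVISION = "0" * 40

_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")
_TRUE_VALUES = {"1", "on", "yes", "true"}


def build_load_kwargs(revision: str, cache_dir: str | None = None) -> dict:
    """Assemble keyword arguments for a `from_pretrained` call (illustrative helper)."""
    # Model pinning: reject branch names such as "main"; require a full commit hash.
    if not _COMMIT_HASH.fullmatch(revision):
        raise ValueError(
            f"revision must be a 40-character commit hash, got {revision!r}"
        )
    # Offline support: honor HF_HUB_OFFLINE so air-gapped hosts never hit the network.
    offline = os.environ.get("HF_HUB_OFFLINE", "").strip().lower() in _TRUE_VALUES
    return {
        "revision": revision,
        "use_safetensors": True,  # safetensors-only: no fallback to pickle weights
        "local_files_only": offline,
        "cache_dir": cache_dir,
    }
```

Under this sketch, `load` would pass the result as `AutoModel.from_pretrained(model_id, **kwargs)`, and `infer` would run the forward pass inside `torch.no_grad()` so that no gradients are computed, in line with the no-training constraint.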
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | ~125M params, <10ms, CPU-deployable |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document:

- **OQ-01**: ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ (resolved: removed from scope; burn/cublas is a better future path)