| status | last_updated |
|---|---|
| draft | 2026-06-13 |
Model
The model component manages detector model loading, inference, and activation extraction. It is the interface between the firewall and the language model that provides behavioral signals.
What It Is
The model component loads a small language model (default: SmolLM2-135M), runs inference on untrusted inputs, and extracts hidden state activations at configured layers. It is model-agnostic — any transformer model with accessible hidden states can serve as a detector.
Why It Exists
The firewall needs model activations (hidden states) to detect behavioral patterns. This component encapsulates the complexity of model loading, inference, and activation extraction behind a clean interface that the codebook and firewall can consume without knowing model-specific details.
The model-agnostic design (ADR-003) means the firewall is not tied to a specific detector model. Switching from SmolLM2-135M to another ~100M model requires recomputing the SVD basis and rebuilding the codebook, but no changes to the firewall logic.
Key Concepts
Activation Extraction
The core operation: running the model on an input and capturing hidden state representations at specific layers.
# Conceptual
# hidden_states[0] is the embedding output, so index i is the output of layer i.
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]  # last token only
    for layer_idx in configured_layers
}
Key decisions:
- Which layers: Layers 1, 2, 4, 8 of SmolLM2-135M (a 30-layer model). Early layers (1, 2) capture safety signals per EMNLP 2024 findings. Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral patterns. Layers 3, 6, 7 are omitted to reduce dimensionality, since their signals are highly correlated with the selected layers.
- Which token: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard choice for autoregressive (LLaMA-family) models.
- Shape: Per layer, the activation for one input is a 1D vector of size hidden_dim (576 for SmolLM2-135M).
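The layer and token selection above can be illustrated without loading a model. The following is a minimal sketch, assuming the hidden states are indexed as [layer][batch][token][dim] in the same way as the tuple returned with output_hidden_states=True. The helper name extract_last_token and the toy data are hypothetical, and plain nested lists stand in for tensors.

```python
from typing import Sequence

# Hypothetical helper: plain nested lists stand in for tensors so the
# indexing logic can be read (and run) without torch installed.
HiddenStates = Sequence[Sequence[Sequence[Sequence[float]]]]  # [layer][batch][token][dim]


def extract_last_token(
    hidden_states: HiddenStates, layers: Sequence[int]
) -> dict[int, list[list[float]]]:
    """Return, per requested layer, the last token's vector for each batch item."""
    return {
        layer: [list(sequence[-1]) for sequence in hidden_states[layer]]
        for layer in layers
    }


# Toy data: 9 entries (index 0 = embedding output, 1..8 = layer outputs),
# batch of 1, 3 tokens, hidden size 4. Each value encodes layer and token.
toy = [
    [[[float(layer * 10 + token)] * 4 for token in range(3)]]
    for layer in range(9)
]

activations = extract_last_token(toy, layers=[1, 2, 4, 8])
print(sorted(activations))   # [1, 2, 4, 8]
print(activations[8][0])     # [82.0, 82.0, 82.0, 82.0]
```

Each returned vector has one entry per hidden dimension, which is the shape the codebook consumes per layer.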
Model-Agnostic Interface
The model component exposes a generic interface that works with any transformer model:
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
The infer method returns hidden states at key layers, abstracting away
whether the backend is PyTorch or a future alternative inference engine.
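One consequence of a protocol-based interface is that a test double can stand in for the real model. The sketch below is illustrative only: FakeDetectorModel and summarize are hypothetical names, and the protocol is a simplified copy in which list[float] replaces np.ndarray so the example runs with the standard library alone.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DetectorModel(Protocol):
    """Simplified copy of the protocol; list[float] replaces np.ndarray here."""

    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, list[float]]: ...


class FakeDetectorModel:
    """Hypothetical test double: deterministic vectors, no weights, no torch."""

    model_id = "fake/detector"
    hidden_dim = 4
    n_layers = 8

    def __init__(self, layers: tuple[int, ...] = (1, 2, 4, 8)) -> None:
        self._layers = layers
        self._loaded = False

    def load(self, device: str = "cpu") -> None:
        self._loaded = True  # nothing to load for the fake

    def infer(self, input_ids: list[int]) -> dict[int, list[float]]:
        if not self._loaded:
            raise RuntimeError("call load() before infer()")
        # One hidden_dim-sized vector per extraction layer.
        seed = float(sum(input_ids))
        return {layer: [seed + layer] * self.hidden_dim for layer in self._layers}


def summarize(model: DetectorModel, input_ids: list[int]) -> dict[int, int]:
    """Consumer code depends only on the protocol, not on the backend."""
    return {layer: len(vec) for layer, vec in model.infer(input_ids).items()}


fake = FakeDetectorModel()
fake.load()
print(isinstance(fake, DetectorModel))  # True
print(summarize(fake, [1, 2, 3]))       # {1: 4, 2: 4, 4: 4, 8: 4}
```

Because the fake never subclasses the protocol, the same structural check would apply to any future backend that exposes the same attributes and methods.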
Lazy Loading
The model is loaded on first use or explicit preload — not at import time. This keeps the library import fast (~milliseconds) even when torch is installed.
firewall = Firewall() # Does NOT load model yet
firewall.preload() # Explicit: download + load model
alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
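The deferred-loading behavior can be sketched as a small wrapper. This is one possible shape under stated assumptions, not the library's implementation: LazyModel, expensive_loader, and load_calls are hypothetical names, and the sketch does not address thread safety.

```python
from typing import Callable, Optional


class LazyModel:
    """Hypothetical wrapper: defers an expensive loader until first use."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._model: Optional[object] = None

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def preload(self) -> None:
        """Explicit load; a no-op if the model is already in memory."""
        if self._model is None:
            self._model = self._loader()

    def get(self) -> object:
        """Implicit load on first access, then reuse the cached object."""
        self.preload()
        return self._model


load_calls = []


def expensive_loader() -> object:
    # Stand-in for "import torch, download weights, build the model".
    load_calls.append(1)
    return object()


lazy = LazyModel(expensive_loader)
print(lazy.is_loaded)      # False: constructing the wrapper loads nothing
first = lazy.get()         # first use triggers the loader
second = lazy.get()        # later calls reuse the cached model
print(first is second, len(load_calls))  # True 1
```

Keeping the heavy import inside the loader, rather than at module top level, is what keeps the package import itself cheap.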
Offline Support
The model component respects HF_HUB_OFFLINE and local_files_only flags.
In air-gapped environments, models must be pre-downloaded. The library
provides a CLI command for this:
python -m alknet_firewall download
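How the offline flag might feed into model loading can be sketched as follows. The helper names is_offline and build_load_kwargs are hypothetical, and the set of truthy values is an assumption. The keyword names revision, cache_dir, local_files_only, and use_safetensors are existing from_pretrained() arguments in HuggingFace transformers, but the sketch only builds the dictionary and never calls the library.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings count as "true" for HF_HUB_OFFLINE.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def is_offline(env: Mapping[str, str] = os.environ) -> bool:
    """Report whether the environment requests offline operation."""
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


def build_load_kwargs(
    revision: str,
    cache_dir: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> dict[str, object]:
    """Hypothetical helper: keyword arguments intended for from_pretrained()."""
    return {
        "revision": revision,                  # pinned commit, not a moving branch
        "cache_dir": cache_dir,
        "local_files_only": is_offline(env),   # skip network access when offline
        "use_safetensors": True,               # refuse pickle-based weight files
    }


kwargs = build_load_kwargs("<pinned-commit>", env={"HF_HUB_OFFLINE": "1"})
print(kwargs["local_files_only"])  # True
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.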
Interfaces
Public API
from typing import ClassVar, Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...
Constraints
- safetensors-only: Model weights are loaded exclusively from safetensors format. Pickle-based .pt/.bin files are never loaded (ADR-005). This is a security requirement for a security product.
- Model pinning: Model revision must be pinned for reproducibility. Default revision is a specific commit hash, not "main".
- CPU-first: Default device is CPU. GPU inference is supported but not required. The <10ms latency target is achievable on CPU with a 135M-parameter model.
- No training: The detector model is inference-only. No gradients are computed. No model weights are modified at runtime.
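The pinning constraint lends itself to a simple guard at construction time. The following is a sketch under the assumption that a pinned revision means a full 40-character hexadecimal commit hash. The names require_pinned_revision and check are hypothetical, and the hash in the usage lines is a made-up placeholder, not a real SmolLM2-135M commit.

```python
import re

# A full Git commit hash: 40 hexadecimal characters.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def require_pinned_revision(revision: str) -> str:
    """Hypothetical guard: accept only a full commit hash as a revision."""
    if not _COMMIT_RE.fullmatch(revision):
        raise ValueError(
            f"revision {revision!r} is not a 40-character commit hash; "
            "branch names and tags can move and are rejected"
        )
    return revision


def check(revision: str) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        require_pinned_revision(revision)
        return True
    except ValueError:
        return False


# Placeholder hash for illustration only.
print(check("0123456789abcdef0123456789abcdef01234567"))  # True
print(check("main"))                                       # False
```

Rejecting branch names outright, rather than resolving them, keeps the loaded weights identical across runs and machines.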
Design Decisions
| ADR | Decision | Summary |
|---|---|---|
| 003 | Small model detector | ~125M params, <10ms, CPU-deployable |
| 005 | Safetensors-only | Security product must use secure formats |
| 006 | Optional PyTorch | Large dependency via extras, lazy imports |
| 007 | Runtime download | HF Hub caching, 269MB can't be bundled |
| 009 | Last-token extraction | Standard for autoregressive models |
Open Questions
Open questions are tracked in open-questions.md. Key questions affecting this document:
- OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1? (Resolved: removed from scope; burn/cublas is a better future path.)