| status | last_updated |
|---|---|
| draft | 2026-06-13 |
Model
The model component manages detector model loading, inference, and activation extraction. It is the interface between the firewall and the language model that provides behavioral signals.
What It Is
The model component loads a small language model (default: SmolLM2-135M), runs inference on untrusted inputs, and extracts hidden state activations at configured layers. It is model-agnostic — any transformer model with accessible hidden states can serve as a detector.
Why It Exists
The firewall needs model activations (hidden states) to detect behavioral patterns. This component encapsulates the complexity of model loading, inference, and activation extraction behind a clean interface that the codebook and firewall can consume without knowing model-specific details.
The model-agnostic design (ADR-003) means the firewall is not tied to a specific detector model. Switching from SmolLM2-135M to another ~100M model requires recomputing the SVD basis and rebuilding the codebook, but no changes to the firewall logic.
Key Concepts
Activation Extraction
The core operation: running the model on an input and capturing hidden state representations at specific layers.
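Phase 1 (last-token extraction):

    # `model`, `input_ids` (shape (1, seq_len)), and `configured_layers` are defined elsewhere.
    outputs = model(input_ids, output_hidden_states=True)
    activations = {
        # Batch index 0, last token position, all hidden dimensions.
        layer_idx: outputs.hidden_states[layer_idx][0, -1, :]
        for layer_idx in configured_layers
    }
    # Shape: (hidden_dim,) per layer, i.e. a single vector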
Phase 2 (per-token extraction): Extract hidden states at every token position to enable token-level smoothing and per-position classification (see codebook.md: Token-Level Smoothing).

    outputs = model(input_ids, output_hidden_states=True)
    activations = {
        # Batch index 0, every token position, all hidden dimensions.
        layer_idx: outputs.hidden_states[layer_idx][0, :, :]
        for layer_idx in configured_layers
    }
    # Shape: (seq_len, hidden_dim) per layer, i.e. a sequence of vectors
The training pipeline uses per-token extraction (z-coordinates at every position are collected for population statistics). Phase 1 simplifies to last-token only for lower latency and simpler implementation. The codebook's classifiers are trained on per-token data from all positions, so they remain valid for both extraction modes.
Key decisions:
- Which layers: Layers 1, 2, 4, 8 of SmolLM2-135M (a 30-layer model). Early layers (1, 2) capture safety signals per EMNLP 2024 findings. Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral patterns. Layers 3, 6, 7 are omitted to reduce dimensionality because their signals are highly correlated with the selected layers.
- Which token: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard choice for autoregressive (LLaMA-family) models and sufficient for Phase 1. Per-token extraction enables the full detection pipeline in Phase 2.
- Shape: Per layer, the activation is a 1D vector of size hidden_dim (576 for SmolLM2-135M) in Phase 1, or a 2D array of shape (seq_len, hidden_dim) in Phase 2.
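The relationship between the two extraction modes can be illustrated without loading a real model. The sketch below is only an illustration: `extract_activations` and `make_fake_hidden_states` are hypothetical helpers, and nested Python lists indexed as `hidden_states[layer][batch][token]` stand in for the tensors a real backend would return.

```python
# Illustrative sketch only: nested lists stand in for tensors.
# hidden_states[layer][batch][token] is a vector of length hidden_dim.

def extract_activations(hidden_states, layers, per_token=False):
    """Hypothetical helper mirroring the Phase 1 / Phase 2 indexing."""
    activations = {}
    for layer_idx in layers:
        sequence = hidden_states[layer_idx][0]  # batch index 0
        if per_token:
            # Phase 2: keep every position -> (seq_len, hidden_dim)
            activations[layer_idx] = [list(vec) for vec in sequence]
        else:
            # Phase 1: keep only the last position -> (hidden_dim,)
            activations[layer_idx] = list(sequence[-1])
    return activations


def make_fake_hidden_states(n_layers, seq_len, hidden_dim):
    """Build deterministic fake states whose values encode (layer, token, dim)."""
    return [
        [[[float(100 * layer + 10 * tok + d) for d in range(hidden_dim)]
          for tok in range(seq_len)]]       # a single batch element
        for layer in range(n_layers + 1)    # +1: index 0 is the embedding output
    ]


fake = make_fake_hidden_states(n_layers=8, seq_len=3, hidden_dim=4)
phase1 = extract_activations(fake, layers=[1, 2, 4, 8])
phase2 = extract_activations(fake, layers=[1, 2, 4, 8], per_token=True)

# The Phase 1 vector is the final row of the Phase 2 sequence.
assert phase1[4] == phase2[4][-1]
```

Because the last-token vector is one row of the per-token sequence, the two modes differ only in how many positions are passed downstream.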
Model-Agnostic Interface
The model component exposes a generic interface that works with any transformer model:

    from typing import Protocol

    import numpy as np

    class DetectorModel(Protocol):
        model_id: str
        hidden_dim: int
        n_layers: int

        def load(self, device: str = "cpu") -> None: ...
        def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
The infer method returns hidden states at key layers, abstracting away
whether the backend is PyTorch or a future alternative inference engine.
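As a rough illustration of what the protocol buys, the sketch below defines a stand-in backend and a consumer that depends only on the interface. `ZeroDetector` and `vector_sizes` are hypothetical names, and plain lists replace `np.ndarray` so that the example has no dependencies; none of this is taken from the library itself.

```python
from typing import Protocol, runtime_checkable

Vector = list[float]  # stand-in for np.ndarray so the sketch has no dependencies


@runtime_checkable
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, Vector]: ...


class ZeroDetector:
    """Hypothetical backend that returns zero vectors instead of running a model."""

    def __init__(self, hidden_dim: int = 4, n_layers: int = 8, layers=(1, 2, 4, 8)):
        self.model_id = "example/zero-detector"
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self._layers = layers

    def load(self, device: str = "cpu") -> None:
        pass  # nothing to load for the stand-in

    def infer(self, input_ids: list[int]) -> dict[int, Vector]:
        return {layer: [0.0] * self.hidden_dim for layer in self._layers}


def vector_sizes(model: DetectorModel, input_ids: list[int]) -> dict[int, int]:
    """Consumer code depends only on the protocol, not on the backend."""
    return {layer: len(vec) for layer, vec in model.infer(input_ids).items()}


detector = ZeroDetector()
assert isinstance(detector, DetectorModel)  # structural check, no inheritance needed
sizes = vector_sizes(detector, [1, 2, 3])
```

A stand-in like this can also be convenient in tests that exercise downstream logic without downloading model weights.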
Lazy Loading
The model is loaded on first use or on explicit preload, not at import time. This keeps the library import fast (on the order of milliseconds) even when torch is installed.

    firewall = Firewall()        # Does NOT load the model yet
    firewall.preload()           # Explicit: download + load model
    alarm = firewall.screen(x)   # Implicit: loads model on first call if not loaded
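One way to get this behavior is a small wrapper that defers the expensive load until it is needed. The sketch below is a minimal illustration under that assumption; `LazyModel`, `expensive_loader`, and `load_count` are hypothetical and do not describe how `Firewall` is actually implemented.

```python
class LazyModel:
    """Hypothetical wrapper: defers an expensive load until first use."""

    def __init__(self, loader):
        self._loader = loader   # callable that performs the expensive load
        self._model = None
        self.load_count = 0     # for illustration: how many loads happened

    def is_loaded(self) -> bool:
        return self._model is not None

    def preload(self) -> None:
        """Explicit load; a no-op if the model is already in memory."""
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1

    def screen(self, x):
        """Implicit load on first call, then delegate to the model."""
        self.preload()
        return self._model(x)


def expensive_loader():
    # Stand-in for downloading weights and building the model.
    return lambda x: len(x)


lazy = LazyModel(expensive_loader)
assert not lazy.is_loaded()          # constructing the wrapper does not load
result = lazy.screen("untrusted")    # first call triggers the load
lazy.preload()                       # already loaded: does nothing
```

Heavy imports such as torch can be deferred the same way by importing them inside the loader rather than at module level.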
Offline Support
The model component respects the HF_HUB_OFFLINE environment variable and the local_files_only loading flag. In air-gapped environments, models must be pre-downloaded. The library provides a CLI command for this:

    python -m alknet_firewall download
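How the two offline settings might be combined can be sketched as follows. `resolve_local_files_only` is a hypothetical helper, and both the precedence (an explicit flag wins over the environment) and the set of truthy strings are assumptions made for illustration. The resulting boolean would typically be passed as `local_files_only=` when loading the model.

```python
import os

# Assumed set of strings treated as "true" for HF_HUB_OFFLINE.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def resolve_local_files_only(explicit=None, env=None) -> bool:
    """Hypothetical helper: decide whether loading must avoid the network.

    An explicit argument wins; otherwise fall back to HF_HUB_OFFLINE.
    """
    if explicit is not None:
        return bool(explicit)
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").strip().upper() in _TRUE_VALUES


# Offline is requested through the environment.
offline = resolve_local_files_only(env={"HF_HUB_OFFLINE": "1"})
# No setting anywhere: network access is allowed.
online = resolve_local_files_only(env={})
```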
Interfaces
Public API

    from typing import ClassVar, Protocol

    import numpy as np

    class DetectorModel(Protocol):
        model_id: str
        hidden_dim: int
        n_layers: int

        def load(self, device: str = "cpu") -> None: ...
        def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

    class HFDetectorModel:
        """Default implementation using HuggingFace transformers."""

        DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

        def __init__(
            self,
            model_id: str = "HuggingFaceTB/SmolLM2-135M",
            revision: str = DEFAULT_REVISION,
            device: str = "cpu",
            cache_dir: str | None = None,
        ): ...

        def load(self, device: str | None = None) -> None: ...
        def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
        def is_loaded(self) -> bool: ...

        @property
        def extraction_layers(self) -> list[int]: ...
Constraints
- safetensors-only: Model weights are loaded exclusively from safetensors format. Pickle-based .pt/.bin files are never loaded (ADR-005). This is a security requirement for a security product.
- Model pinning: Model revision must be pinned for reproducibility. Default revision is a specific commit hash, not "main".
- CPU-first: Default device is CPU. GPU inference is supported but not required. The <10ms latency target is achievable on CPU with a model of roughly 135M parameters.
- No training: The detector model is inference-only. No gradients are computed. No model weights are modified at runtime.
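The first two constraints lend themselves to simple pre-load checks. The sketch below shows one possible shape for them; `is_pinned_revision` and `select_weight_files` are hypothetical helpers, and treating a pinned revision as a full 40-character hex commit hash is an assumption rather than something the library is known to enforce this way.

```python
import re

_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")
_PICKLE_SUFFIXES = (".pt", ".bin", ".pkl", ".pickle")


def is_pinned_revision(revision: str) -> bool:
    """True only for a full 40-character lowercase hex commit hash."""
    return _COMMIT_HASH.fullmatch(revision) is not None


def select_weight_files(filenames):
    """Keep safetensors weight files; raise if none are available."""
    safe = [f for f in filenames if f.endswith(".safetensors")]
    if not safe:
        unsafe = [f for f in filenames if f.endswith(_PICKLE_SUFFIXES)]
        raise ValueError(f"no safetensors weights found (pickle-based: {unsafe})")
    return safe


# A branch name is not a pinned revision; a commit hash is.
branch_ok = is_pinned_revision("main")
commit_ok = is_pinned_revision("a" * 40)

# Pickle-based files are ignored when safetensors weights are present.
chosen = select_weight_files(["config.json", "model.safetensors", "pytorch_model.bin"])
```

Failing loudly when only pickle-based weights are present, instead of falling back to them, keeps the safetensors-only rule from being bypassed silently.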
Design Decisions
| ADR | Decision | Summary |
|---|---|---|
| 003 | Small model detector | ~125M params, <10ms, CPU-deployable |
| 005 | Safetensors-only | Security product must use secure formats |
| 006 | Optional PyTorch | Large dependency via extras, lazy imports |
| 007 | Runtime download | HF Hub caching, 269MB can't be bundled |
| 009 | Last-token extraction | Standard for autoregressive models |
Open Questions
Open questions are tracked in open-questions.md. Key questions affecting this document:
- OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved: removed from scope; burn/cublas is a better future path)