glm-5.1 45a0e0798c docs: add copula decomposition pipeline, clarify detection data flow
The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00


status: draft
last_updated: 2026-06-13

Model

The model component manages detector model loading, inference, and activation extraction. It is the interface between the firewall and the language model that provides behavioral signals.

What It Is

The model component loads a small language model (default: SmolLM2-135M), runs inference on untrusted inputs, and extracts hidden state activations at configured layers. It is model-agnostic — any transformer model with accessible hidden states can serve as a detector.

Why It Exists

The firewall needs model activations (hidden states) to detect behavioral patterns. This component encapsulates the complexity of model loading, inference, and activation extraction behind a clean interface that the codebook and firewall can consume without knowing model-specific details.

The model-agnostic design (ADR-003) means the firewall is not tied to a specific detector model. Switching from SmolLM2-135M to another ~100M model requires recomputing the SVD basis and rebuilding the codebook, but no changes to the firewall logic.

Key Concepts

Activation Extraction

The core operation: running the model on an input and capturing hidden state representations at specific layers.

Phase 1 (last-token extraction):

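# Assumes `model`, `input_ids` (shape (1, seq_len)), and `configured_layers` are already defined.
outputs = model(input_ids, output_hidden_states=True)
activations = {
    # Batch index 0, last token position
    layer_idx: outputs.hidden_states[layer_idx][0, -1, :]
    for layer_idx in configured_layers
}
# Shape: (hidden_dim,) per layer, i.e. a single vector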

Phase 2 (per-token extraction): Extract hidden states at every token position to enable token-level smoothing and per-position classification (see codebook.md: Token-Level Smoothing).

outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
    for layer_idx in configured_layers
}
# Shape: (seq_len, hidden_dim) per layer — sequence of vectors

The training pipeline uses per-token extraction (z-coordinates at every position are collected for population statistics). Phase 1 simplifies to last-token only for lower latency and simpler implementation. The codebook's classifiers are trained on per-token data from all positions, so they remain valid for both extraction modes.

Key decisions:

  • Which layers: Layers 1, 2, 4, and 8 of SmolLM2-135M, all within the first third of its 30 layers. Early layers (1, 2) capture safety signals per EMNLP 2024 findings. Layer 4 provides mid-early context. Layer 8 provides deeper behavioral patterns. Intermediate layers such as 3, 6, and 7 are omitted to reduce dimensionality, since their signals are highly correlated with those of the selected layers.
  • Which token: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard choice for autoregressive (LLaMA-family) models and sufficient for Phase 1. Per-token extraction enables the full detection pipeline in Phase 2.
  • Shape: Per layer, the activation is a 1D vector of size hidden_dim (576 for SmolLM2-135M) in Phase 1, or a 2D array of shape (seq_len, hidden_dim) in Phase 2.
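The relationship between the two extraction modes can be illustrated without loading a model. The sketch below is only an illustration under stated assumptions: NumPy arrays stand in for the hidden-state tuple a transformer returns, the `extract` helper and the array sizes are hypothetical, and the only point is that the Phase 1 vector is the last row of the Phase 2 array for the same layer.

```python
import numpy as np

def extract(hidden_states, layers, per_token=False):
    """Select activations from a tuple of (batch, seq_len, hidden_dim) arrays.

    `hidden_states` mimics the tuple returned with output_hidden_states=True:
    index 0 is the embedding output, index i is the output of layer i.
    """
    out = {}
    for layer_idx in layers:
        seq = hidden_states[layer_idx][0]  # drop the batch axis -> (seq_len, hidden_dim)
        out[layer_idx] = seq if per_token else seq[-1]
    return out

# Stand-in data: 9 arrays (embeddings + 8 layers), batch of 1, 5 tokens, hidden_dim 16.
rng = np.random.default_rng(0)
hidden_states = tuple(rng.normal(size=(1, 5, 16)) for _ in range(9))
layers = [1, 2, 4, 8]

phase1 = extract(hidden_states, layers)                  # last-token mode
phase2 = extract(hidden_states, layers, per_token=True)  # per-token mode
```

Because Phase 1 is a slice of the Phase 2 output, a caller that already holds per-token activations could derive the last-token view without a second forward pass.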

Model-Agnostic Interface

The model component exposes a generic interface that works with any transformer model:

class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

The infer method returns hidden states at key layers, abstracting away whether the backend is PyTorch or a future alternative inference engine.
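To show what the protocol buys, here is a minimal sketch of a non-PyTorch backend that satisfies it. `RandomDetectorModel` and `layer_norms` are hypothetical names invented for this example, and the `runtime_checkable` decorator is an addition that the spec does not require; the stub returns pseudo-random vectors rather than real activations.

```python
from typing import Protocol, runtime_checkable
import numpy as np

@runtime_checkable
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

class RandomDetectorModel:
    """Hypothetical stand-in backend returning repeatable pseudo-random vectors."""

    def __init__(self, hidden_dim: int = 16, n_layers: int = 8, layers=(1, 2, 4, 8)):
        self.model_id = "random-stub"
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self._layers = list(layers)
        self._loaded = False

    def load(self, device: str = "cpu") -> None:
        self._loaded = True  # a real backend would read weights here

    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]:
        if not self._loaded:
            self.load()
        rng = np.random.default_rng(sum(input_ids))  # seed from the input for repeatability
        return {i: rng.normal(size=self.hidden_dim) for i in self._layers}

def layer_norms(model: DetectorModel, input_ids: list[int]) -> dict[int, float]:
    """Consumer code that depends only on the protocol, not on the backend."""
    return {i: float(np.linalg.norm(v)) for i, v in model.infer(input_ids).items()}

stub = RandomDetectorModel()
norms = layer_norms(stub, [101, 2023, 102])
```

A stub like this is also convenient in unit tests for the codebook and firewall, since it avoids downloading weights.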

Lazy Loading

The model is loaded on first use or explicit preload — not at import time. This keeps the library import fast (~milliseconds) even when torch is installed.

firewall = Firewall()      # Does NOT load model yet
firewall.preload()         # Explicit: download + load model
alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
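One way to get this behavior is a small holder object that calls an injected loader at most once. The sketch below is an assumption about how lazy loading could be structured, not the library's actual implementation; `LazyModelHolder` and `fake_loader` are hypothetical names.

```python
from typing import Any, Callable, Optional

class LazyModelHolder:
    """Hypothetical helper that defers an expensive load until first use."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader  # called at most once
        self._model: Optional[Any] = None

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def preload(self) -> None:
        """Explicit load; a no-op if the model is already in memory."""
        if self._model is None:
            self._model = self._loader()

    def get(self) -> Any:
        """Implicit load on first access."""
        self.preload()
        return self._model

load_calls = []

def fake_loader():
    load_calls.append(1)  # stands in for download + weight loading
    return object()

holder = LazyModelHolder(fake_loader)  # constructing the holder loads nothing
constructed_loaded = holder.is_loaded
first = holder.get()   # first use triggers the load
second = holder.get()  # later calls reuse the same object
```

In a real backend the loader would also be the place to import torch, so that importing the library itself stays cheap.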

Offline Support

The model component respects the HF_HUB_OFFLINE environment variable and the local_files_only loading option. In air-gapped environments, models must be downloaded in advance. The library provides a CLI command for this:

python -m alknet_firewall download
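As a rough sketch of how the offline setting could be translated into loader arguments, the helper below builds the keyword arguments for a `from_pretrained` call. The function names and the set of truthy strings are assumptions made for this example; `revision`, `cache_dir`, `local_files_only`, and `use_safetensors` are existing `from_pretrained` parameters in transformers.

```python
import os
from typing import Mapping, Optional

# Assumed set of values treated as "true" for HF_HUB_OFFLINE.
_TRUTHY = {"1", "ON", "YES", "TRUE"}

def offline_requested(env: Optional[Mapping[str, str]] = None) -> bool:
    """Report whether HF_HUB_OFFLINE is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").upper() in _TRUTHY

def build_load_kwargs(revision: str, cache_dir: Optional[str] = None,
                      env: Optional[Mapping[str, str]] = None) -> dict:
    """Keyword arguments intended for transformers' from_pretrained()."""
    return {
        "revision": revision,
        "cache_dir": cache_dir,
        "local_files_only": offline_requested(env),  # never touch the network when offline
        "use_safetensors": True,
    }

kwargs = build_load_kwargs("<pinned-commit>", env={"HF_HUB_OFFLINE": "1"})
# With transformers installed, the dict would be passed along as:
#   AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M", **kwargs)
```

Passing the environment as a parameter keeps the helper easy to test without mutating the process environment.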

Interfaces

Public API

class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...

Constraints

  1. safetensors-only: Model weights are loaded exclusively from the safetensors format. Pickle-based .pt/.bin files are never loaded (ADR-005), because unpickling can execute arbitrary code, which is unacceptable in a security product.
  2. Model pinning: The model revision must be pinned for reproducibility. The default revision is a specific commit hash, not "main".
  3. CPU-first: The default device is CPU. GPU inference is supported but not required. The <10ms latency target is achievable on CPU with a ~135M-parameter model.
  4. No training: The detector model is inference-only. No gradients are computed, and no model weights are modified at runtime.
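The first two constraints lend themselves to cheap pre-flight checks. The sketch below shows one possible shape for them; both function names are hypothetical, and treating a pinned revision as a full 40-character hex commit hash is an assumption about how pinning would be enforced.

```python
import re

# Assumption: a pinned revision is a full 40-character lowercase hex commit hash.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")

def is_pinned_revision(revision: str) -> bool:
    """True only for a full commit hash; branch names such as "main" fail."""
    return _COMMIT_RE.fullmatch(revision) is not None

def select_weight_files(filenames: list[str]) -> list[str]:
    """Keep only safetensors files and refuse to fall back to pickle formats."""
    safe = [f for f in filenames if f.endswith(".safetensors")]
    if not safe:
        raise ValueError("no safetensors weights found; refusing pickle-based files")
    return safe

repo_files = ["config.json", "model.safetensors", "pytorch_model.bin"]
selected = select_weight_files(repo_files)
```

Running such checks before any weights are read means a misconfigured revision or a pickle-only repository fails fast, before untrusted bytes reach a deserializer.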

Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| 003 | Small model detector | ~125M params, <10ms, CPU-deployable |
| 005 | Safetensors-only | Security product must use secure formats |
| 006 | Optional PyTorch | Large dependency via extras, lazy imports |
| 007 | Runtime download | HF Hub caching, 269MB can't be bundled |
| 009 | Last-token extraction | Standard for autoregressive models |

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

  • OQ-01: Should ONNX Runtime be a supported inference backend in Phase 1? (resolved — removed from scope; burn/cublas is a better future path)