status	last_updated
draft	2026-06-13

Model

The model component manages detector model loading, inference, and activation extraction. It is the interface between the firewall and the language model that provides behavioral signals.

What It Is

The model component loads a small language model (default: SmolLM2-135M), runs inference on untrusted inputs, and extracts hidden state activations at configured layers. It is model-agnostic — any transformer model with accessible hidden states can serve as a detector.

Why It Exists

The firewall needs model activations (hidden states) to detect behavioral patterns. This component encapsulates the complexity of model loading, inference, and activation extraction behind a clean interface that the codebook and firewall can consume without knowing model-specific details.

The model-agnostic design (ADR-003) means the firewall is not tied to a specific detector model. Switching from SmolLM2-135M to another ~100M model requires recomputing the SVD basis and rebuilding the codebook, but no changes to the firewall logic.

Key Concepts

Activation Extraction

The core operation: running the model on an input and capturing hidden state representations at specific layers.

# Conceptual
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
}

Key decisions:

Which layers: Layers 1, 2, 4, and 8 of SmolLM2-135M (a 30-layer model). Early layers (1, 2) capture safety signals per EMNLP 2024 findings. Layer 4 provides mid-early context. Layer 8 provides deeper behavioral patterns. Layers 3, 5, 6, and 7 are omitted to reduce dimensionality; their signals are highly correlated with the selected layers.
Which token: The last token's hidden state carries the model's "conclusion" about the full input sequence (ADR-009). This is the standard choice for autoregressive (LLaMA-family) models.
Shape: Per layer, the activation is a 1D vector of size hidden_dim (576 for SmolLM2-135M).
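
The last-token choice has one practical wrinkle: when inputs are padded, position -1 may hold a padding token rather than the final real token. The sketch below shows one way to select the last real token per layer. It assumes right padding and a 0/1 attention mask, uses nested lists in place of tensors to stay dependency-free, and the names last_token_index and extract_last_token are hypothetical, not part of the library.

```python
from typing import Sequence

Vector = list[float]
# One layer's hidden states for a single sequence: [seq_len][hidden_dim]
LayerStates = Sequence[Vector]


def last_token_index(attention_mask: Sequence[int]) -> int:
    """Index of the last real (non-padding) token, assuming right padding."""
    n_real = sum(attention_mask)
    if n_real == 0:
        raise ValueError("attention mask has no real tokens")
    return n_real - 1


def extract_last_token(
    hidden_states: Sequence[LayerStates],
    layers: Sequence[int],
    attention_mask: Sequence[int],
) -> dict[int, Vector]:
    """Pick the last real token's vector at each configured layer.

    hidden_states[0] is the embedding output and hidden_states[i] is the
    output of layer i, mirroring the HuggingFace convention.
    """
    pos = last_token_index(attention_mask)
    return {layer: list(hidden_states[layer][pos]) for layer in layers}


# Toy data: 3 entries (index 0 = embeddings), 4 positions, hidden_dim = 2.
# Each vector is [layer, position]. Position 3 is padding, so the last
# real token sits at position 2.
states = [
    [[float(layer), float(pos)] for pos in range(4)]
    for layer in range(3)
]
mask = [1, 1, 1, 0]
activations = extract_last_token(states, layers=[1, 2], attention_mask=mask)
```

For a single unpadded input the mask is all ones and this reduces to the [:, -1, :] slice in the conceptual snippet.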

Model-Agnostic Interface

The model component exposes a generic interface that works with any transformer model:

class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

The infer method returns a mapping from layer index to that layer's hidden state at the extraction token, abstracting away whether the backend is PyTorch or a future alternative inference engine.
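
To illustrate what the protocol buys, here is a minimal sketch of a stand-in backend that satisfies it structurally, with no inheritance. FakeDetector and activation_sizes are hypothetical names introduced for illustration, and list[float] replaces np.ndarray so the example has no dependencies.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, list[float]]: ...


class FakeDetector:
    """Hypothetical stand-in backend; returns constant vectors."""

    def __init__(self, hidden_dim: int = 4, n_layers: int = 8) -> None:
        self.model_id = "fake/detector"
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self.loaded = False

    def load(self, device: str = "cpu") -> None:
        self.loaded = True

    def infer(self, input_ids: list[int]) -> dict[int, list[float]]:
        if not self.loaded:
            raise RuntimeError("call load() before infer()")
        # One vector per extraction layer; values encode the input length.
        return {
            layer: [float(len(input_ids))] * self.hidden_dim
            for layer in (1, 2, 4, 8)
        }


def activation_sizes(model: DetectorModel, input_ids: list[int]) -> dict[int, int]:
    """Consumer code sees only the protocol, never the backend."""
    return {layer: len(vec) for layer, vec in model.infer(input_ids).items()}


detector = FakeDetector()
detector.load()
sizes = activation_sizes(detector, [101, 7, 42])
```

A fake like this is also handy in unit tests for the codebook and firewall, since it avoids downloading real weights.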

Lazy Loading

The model is loaded on first use or explicit preload — not at import time. This keeps the library import fast (~milliseconds) even when torch is installed.

firewall = Firewall()      # Does NOT load model yet
firewall.preload()         # Explicit: download + load model
alarm = firewall.screen(x) # Implicit: loads model on first call if not loaded
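
One way such lazy loading could be structured is sketched below. This is not the library's implementation: LazyModel, fake_loader, and the lock-based guard are assumptions chosen to show the pattern, with a stub loader standing in for the torch import and weight read.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Defers an expensive load until preload() or the first get()."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader          # nothing heavy happens here
        self._model: Optional[object] = None
        self._lock = threading.Lock()  # guard against concurrent first calls

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def preload(self) -> None:
        self.get()

    def get(self) -> object:
        if self._model is None:
            with self._lock:
                if self._model is None:  # re-check after acquiring the lock
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader() -> object:
    # Stand-in for importing torch and reading weights.
    load_calls.append(1)
    return {"weights": "..."}


lazy = LazyModel(fake_loader)
constructed_without_loading = not lazy.is_loaded
lazy.get()
lazy.get()  # second call reuses the cached model
```

The double check around the lock keeps the common path lock-free once the model is loaded, while ensuring two threads that race on the first call trigger only one load.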

Offline Support

The model component respects the HF_HUB_OFFLINE environment variable and the local_files_only flag. In air-gapped environments, models must be downloaded in advance. The library provides a CLI command for this:

python -m alknet_firewall download
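
As a rough sketch of how the two offline signals might be combined before calling from_pretrained, consider the helper below. The names is_offline and load_kwargs are hypothetical, and the set of accepted truthy values is assumed to match what huggingface_hub recognizes for boolean environment flags.

```python
import os
from typing import Mapping, Optional

# Assumed to match huggingface_hub's truthy values for boolean env flags.
_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


def is_offline(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when HF_HUB_OFFLINE is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get("HF_HUB_OFFLINE", "").upper() in _TRUE_VALUES


def load_kwargs(
    explicit_local_only: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> dict[str, bool]:
    """Keyword arguments to forward to from_pretrained()."""
    return {"local_files_only": explicit_local_only or is_offline(env)}


online = load_kwargs(env={})
offline = load_kwargs(env={"HF_HUB_OFFLINE": "1"})
```

Either signal alone is enough to forbid network access; neither can re-enable it once the other is set.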

Interfaces

Public API

class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...

class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...

Constraints

safetensors-only: Model weights are loaded exclusively from safetensors format. Pickle-based .pt/.bin files are never loaded (ADR-005). Unpickling can execute arbitrary code on load, which is unacceptable in a security product.
Model pinning: Model revision must be pinned for reproducibility. Default revision is a specific commit hash, not "main".
CPU-first: Default device is CPU. GPU inference is supported but not required. The <10 ms latency target is achievable on CPU with a 135M-parameter model.
No training: The detector model is inference-only. No gradients are computed. No model weights are modified at runtime.
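
The first two constraints lend themselves to simple guard checks. The sketch below is illustrative only: is_pinned_revision and select_weight_files are hypothetical helpers, the list of pickle suffixes is an assumption, and a pinned revision is taken to mean a full 40-character hex commit hash.

```python
import re
from pathlib import PurePosixPath

_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")
# Assumed set of pickle-based checkpoint suffixes to refuse.
_PICKLE_SUFFIXES = {".bin", ".pt", ".pth", ".pkl", ".ckpt"}


def is_pinned_revision(revision: str) -> bool:
    """True only for a full 40-character hex commit hash."""
    return _COMMIT_HASH.fullmatch(revision) is not None


def select_weight_files(repo_files: list[str]) -> list[str]:
    """Return the safetensors weight files; never fall back to pickle."""
    weights = [
        f for f in repo_files if PurePosixPath(f).suffix == ".safetensors"
    ]
    if not weights:
        pickled = [
            f for f in repo_files
            if PurePosixPath(f).suffix in _PICKLE_SUFFIXES
        ]
        raise ValueError(
            f"no safetensors weights found; refusing pickle files {pickled}"
        )
    return weights


files = ["config.json", "model.safetensors", "pytorch_model.bin"]
chosen = select_weight_files(files)
```

Failing loudly when only pickle weights exist, rather than falling back silently, keeps the safetensors-only rule enforceable when users point the firewall at a different model.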

Design Decisions

ADR	Decision	Summary
003	Small model detector	~125M params, <10ms, CPU-deployable
005	Safetensors-only	Security product must use secure formats
006	Optional PyTorch	Large dependency via extras, lazy imports
007	Runtime download	HF Hub caching, 269MB can't be bundled
009	Last-token extraction	Standard for autoregressive models

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

OQ-01: ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ (resolved — removed from scope; burn/cublas is a better future path)
