The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline

---
status: draft
last_updated: 2026-06-13
---

# Model

The model component manages detector model loading, inference, and activation
extraction. It is the interface between the firewall and the language model
that provides behavioral signals.

## What It Is

The model component loads a small language model (default: SmolLM2-135M),
runs inference on untrusted inputs, and extracts hidden state activations at
configured layers. It is model-agnostic: any transformer model with
accessible hidden states can serve as a detector.

## Why It Exists

The firewall needs model activations (hidden states) to detect behavioral
patterns. This component encapsulates the complexity of model loading,
inference, and activation extraction behind a clean interface that the
codebook and firewall can consume without knowing model-specific details.

The model-agnostic design (ADR-003) means the firewall is not tied to a
specific detector model. Switching from SmolLM2-135M to another ~100M model
requires recomputing the SVD basis and rebuilding the codebook, but no
changes to the firewall logic.

## Key Concepts

### Activation Extraction

The core operation: running the model on an input and capturing hidden state
representations at specific layers.

**Phase 1 (last-token extraction)**:
```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    # Select batch element 0 and the final position to get a single vector.
    layer_idx: outputs.hidden_states[layer_idx][0, -1, :]
    for layer_idx in configured_layers
}
# Shape: (hidden_dim,) per layer, a single vector
```

**Phase 2 (per-token extraction)**: Extract hidden states at every token
position to enable token-level smoothing and per-position classification
(see codebook.md: Token-Level Smoothing).
```python
outputs = model(input_ids, output_hidden_states=True)
activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
    for layer_idx in configured_layers
}
# Shape: (seq_len, hidden_dim) per layer, a sequence of vectors
```

The training pipeline uses per-token extraction (z-coordinates at every
position are collected for population statistics). Phase 1 simplifies to
last-token only for lower latency and simpler implementation. The codebook's
classifiers are trained on per-token data from all positions, so they remain
valid for both extraction modes.

Key decisions:
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
  Layer 4 provides mid-early context. Layer 8 provides mid-layer behavioral
  patterns. Layers 3, 6, 7 are omitted to reduce dimensionality because their
  signals are highly correlated with the selected layers.
- **Which token**: The last token's hidden state carries the model's
  "conclusion" about the full input sequence (ADR-009). This is the standard
  choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
  Per-token extraction enables the full detection pipeline in Phase 2.
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
  (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
  in Phase 2.

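
The two extraction modes and the layer selection above can be sketched without loading a model. This is a minimal illustration, assuming a batch size of 1 and using random NumPy arrays as a stand-in for `outputs.hidden_states`; the toy dimensions and the helper names `extract_last_token` and `extract_per_token` are hypothetical and not part of the library.

```python
import numpy as np

# Hypothetical toy dimensions, chosen only to keep the example small.
BATCH, SEQ_LEN, HIDDEN_DIM, N_LAYERS = 1, 5, 8, 12
CONFIGURED_LAYERS = [1, 2, 4, 8]

rng = np.random.default_rng(0)
# Stand-in for `outputs.hidden_states`: one (batch, seq_len, hidden_dim)
# array per entry. Hugging Face models return n_layers + 1 entries, with
# index 0 holding the embedding output.
hidden_states = tuple(
    rng.standard_normal((BATCH, SEQ_LEN, HIDDEN_DIM)) for _ in range(N_LAYERS + 1)
)


def extract_last_token(hidden_states, layers):
    """Phase 1: one vector per layer, shape (hidden_dim,)."""
    return {i: hidden_states[i][0, -1, :] for i in layers}


def extract_per_token(hidden_states, layers):
    """Phase 2: one vector per position, shape (seq_len, hidden_dim)."""
    return {i: hidden_states[i][0, :, :] for i in layers}


phase1 = extract_last_token(hidden_states, CONFIGURED_LAYERS)
phase2 = extract_per_token(hidden_states, CONFIGURED_LAYERS)
print(phase1[4].shape, phase2[4].shape)
```

In this sketch the last row of each Phase 2 array is the Phase 1 vector for the same layer, so last-token extraction is a special case of per-token extraction.
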
### Model-Agnostic Interface

The model component exposes a generic interface that works with any
transformer model:

```python
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
```

The `infer` method returns hidden states at key layers, abstracting away
whether the backend is PyTorch or a future alternative inference engine.

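
To illustrate what "model-agnostic" means in practice, the sketch below implements the protocol with a stub backend that needs no PyTorch at all. `RandomDetectorModel` and `describe` are hypothetical names invented for this example, and the `@runtime_checkable` decorator is added here only so the conformance check can be shown; the source does not say the real protocol uses it.

```python
from typing import Protocol, runtime_checkable

import numpy as np


@runtime_checkable
class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class RandomDetectorModel:
    """Hypothetical stand-in backend: returns random activations."""

    def __init__(self, hidden_dim: int = 8, n_layers: int = 12) -> None:
        self.model_id = "random-stub"
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        self._rng = None  # created in load(), mimicking deferred model loading

    def load(self, device: str = "cpu") -> None:
        self._rng = np.random.default_rng(0)

    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]:
        if self._rng is None:
            raise RuntimeError("call load() before infer()")
        # One (hidden_dim,) vector per extraction layer, as in Phase 1.
        return {
            layer: self._rng.standard_normal(self.hidden_dim)
            for layer in (1, 2, 4, 8)
        }


def describe(model: DetectorModel, input_ids: list[int]) -> dict[int, tuple[int, ...]]:
    """Consumer code depends only on the protocol, not on the backend."""
    model.load()
    return {layer: act.shape for layer, act in model.infer(input_ids).items()}


stub = RandomDetectorModel()
shapes = describe(stub, [101, 2023, 102])
print(isinstance(stub, DetectorModel), shapes)
```

Because `Protocol` uses structural typing, the stub does not inherit from `DetectorModel`; matching attributes and methods are enough. A stub like this can also be useful for testing firewall logic without downloading weights.
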
### Lazy Loading

The model is loaded on first use or on explicit preload, not at import time.
This keeps the library import fast (~milliseconds) even when torch is
installed.

```python
firewall = Firewall()        # Does NOT load model yet
firewall.preload()           # Explicit: download + load model
alarm = firewall.screen(x)   # Implicit: loads model on first call if not loaded
```

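
One common way to get this behavior is to hold a loader callable and invoke it only when the model is first needed. The sketch below shows that pattern in isolation; `LazyModelHolder` is a hypothetical helper, not the library's actual implementation, and the lambda stands in for the real loading code (which would typically import torch inside the loader so that importing the package stays cheap).

```python
class LazyModelHolder:
    """Hypothetical helper: defers an expensive load until first use."""

    def __init__(self, loader):
        self._loader = loader  # zero-argument callable that builds the model
        self._model = None
        self.load_count = 0    # tracked only to make the behavior observable

    def preload(self) -> None:
        """Explicit load; does nothing if the model is already in memory."""
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1

    def get(self):
        """Implicit load on first access, cached afterwards."""
        self.preload()
        return self._model


holder = LazyModelHolder(loader=lambda: {"name": "stub-model"})
loads_after_init = holder.load_count  # constructing the holder loads nothing
first = holder.get()                  # triggers the load
second = holder.get()                 # reuses the cached object
print(loads_after_init, holder.load_count, first is second)
```

This sketch is not thread-safe; a concurrent deployment would need a lock around the load.
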
### Offline Support

The model component respects `HF_HUB_OFFLINE` and `local_files_only` flags.
In air-gapped environments, models must be pre-downloaded. The library
provides a CLI command for this:

```bash
python -m alknet_firewall download
```

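
How the two flags might be combined is sketched below, assuming that an explicit `local_files_only` argument takes precedence over the `HF_HUB_OFFLINE` environment variable. The function name `resolve_local_files_only`, the precedence rule, and the set of truthy spellings are assumptions made for illustration, not documented behavior of this library.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings for the environment variable (case-insensitive).
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_local_files_only(
    explicit: Optional[bool] = None,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Decide whether model files may only come from the local cache.

    An explicit argument wins; otherwise fall back to HF_HUB_OFFLINE.
    """
    if explicit is not None:
        return explicit
    return env.get("HF_HUB_OFFLINE", "").strip().lower() in _TRUTHY


print(resolve_local_files_only(env={"HF_HUB_OFFLINE": "1"}))
print(resolve_local_files_only(env={}))
```

The resolved value could then be passed as the `local_files_only` argument when loading the model, so that an air-gapped host fails fast instead of attempting a download.
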
## Interfaces

### Public API

```python
from typing import ClassVar, Protocol

import numpy as np


class DetectorModel(Protocol):
    model_id: str
    hidden_dim: int
    n_layers: int

    def load(self, device: str = "cpu") -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...


class HFDetectorModel:
    """Default implementation using HuggingFace transformers."""

    DEFAULT_REVISION: ClassVar[str] = "<pinned-commit>"  # Specific SmolLM2-135M commit

    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        revision: str = DEFAULT_REVISION,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def load(self, device: str | None = None) -> None: ...
    def infer(self, input_ids: list[int]) -> dict[int, np.ndarray]: ...
    def is_loaded(self) -> bool: ...

    @property
    def extraction_layers(self) -> list[int]: ...
```

### Constraints

1. **safetensors-only**: Model weights are loaded exclusively from
   safetensors format. Pickle-based `.pt`/`.bin` files are never loaded
   (ADR-005). This is a security requirement for a security product.
2. **Model pinning**: Model revision must be pinned for reproducibility.
   Default revision is a specific commit hash, not `"main"`.
3. **CPU-first**: Default device is CPU. GPU inference is supported but not
   required. The <10ms latency target is achievable on CPU with a 125M model.
4. **No training**: The detector model is inference-only. No gradients are
   computed. No model weights are modified at runtime.

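
The first two constraints can be enforced at the point where load arguments are assembled. The sketch below is one possible approach, not the library's code: `build_load_kwargs` is a hypothetical helper, and the check assumes a pinned revision is a full 40-character hexadecimal commit hash. The keys it returns (`revision`, `use_safetensors`, `cache_dir`, `local_files_only`) are existing keyword arguments of `from_pretrained` in Hugging Face transformers.

```python
import re
from typing import Optional

# Assumption: a pinned revision is a full 40-character hex commit hash.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def build_load_kwargs(
    revision: str,
    cache_dir: Optional[str] = None,
    local_files_only: bool = False,
) -> dict:
    """Assemble keyword arguments for `from_pretrained` under the constraints."""
    if not _COMMIT_RE.fullmatch(revision):
        # Rejects branch names such as "main" and abbreviated hashes.
        raise ValueError(f"revision must be a pinned commit hash, got {revision!r}")
    return {
        "revision": revision,
        "use_safetensors": True,  # never fall back to pickle-based weights
        "cache_dir": cache_dir,
        "local_files_only": local_files_only,
    }


kwargs = build_load_kwargs("a" * 40)  # placeholder hash, not a real commit
print(kwargs["use_safetensors"], kwargs["revision"][:7])
```

For the inference-only constraint, a PyTorch backend would typically also run the forward pass inside `torch.no_grad()` so that no gradients are tracked.
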
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-small-model-detector.md) | Small model detector | ~125M params, <10ms, CPU-deployable |
| [005](decisions/005-safetensors-only.md) | Safetensors-only | Security product must use secure formats |
| [006](decisions/006-optional-pytorch.md) | Optional PyTorch | Large dependency via extras, lazy imports |
| [007](decisions/007-runtime-model-download.md) | Runtime download | HF Hub caching, 269MB can't be bundled |
| [009](decisions/009-last-token-extraction.md) | Last-token extraction | Standard for autoregressive models |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: ~~Should ONNX Runtime be a supported inference backend in Phase 1?~~ (resolved: removed from scope; burn/cublas is a better future path)