docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions
--- a/docs/architecture/codebook.md
+++ b/docs/architecture/codebook.md
@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
 distributions are calibrated for this scale, so threshold values in the
 codebook are specific to the z-coordinate range of the calibration data.
 **Training shape**: `(N, 3)` where N is the total number of token positions
 across all calibration prompts. Each token position produces its own
 z-coordinate, so the population data is a flattened collection of per-token
 z-vectors.
 **Inference shape**: `(seq_len, 3)` for a single input. Each token position
 in the input sequence produces a z-coordinate. The detection pipeline
 operates on this per-token sequence, optionally smoothing it before
 classification.
 ### SVD Basis
 Singular Value Decomposition of the activation space from a calibration dataset
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
 Inputs whose projections fall near or beyond the region boundary score
 increasingly high (SUSPICIOUS → DANGEROUS).
 ### Copula Decomposition
 Raw z-coordinates are not the detection feature space. The codebook
 decomposes z-coordinates through a copula transform that separates **scale**
 (how far from normal) from **position** (which behavioral direction):
 ```
 z → CDF → (x₀, x₁, x₂)     # Uniform marginals via CDF transform
  → S = x₀ + x₁ + x₂       # Scale: total CDF magnitude
  → x_norm = simplex(x)      # Normalize to probability simplex
  → (u, v) = barycentric(x_norm)  # Position: 2D simplex coordinates
 ```
 The three derived features `(S, u, v)` form the actual detection space:
 - **S (scale)**: How far the input's z-coordinates deviate from the
  population norm, aggregated across all three SVD dimensions. High S means
  the input is anomalous in *magnitude*.
 - **u, v (position)**: Where the input sits on the behavioral simplex —
  which *direction* the deviation points. Different behavioral patterns
  (refusal, instruction-following, self-reference) separate along different
  (u, v) axes.
 This decomposition is why the codebook can distinguish "this input activates
 the refusal direction" from "this input is just generally unusual" — the same
 S value with different (u, v) coordinates implies different behavioral
 patterns.
 The PoC's `decompose()` method implements this pipeline as a pure function.
 It is called both during codebook compilation (to compute direction
 profiles) and during inference (to transform new z-coordinates for
 classification).
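
A minimal NumPy sketch of the decomposition described above, not the PoC's `decompose()`. It assumes an empirical CDF as a stand-in for the spline CDF and an equilateral-triangle embedding for the barycentric step. The names `empirical_cdf` and `decompose_sketch` and the choice of embedding are illustrative assumptions, and the sketch returns raw `S` rather than the CDF of `S`.

```python
import numpy as np


def empirical_cdf(calib: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Per-dimension empirical CDF (assumed stand-in for the spline CDF)."""
    n = calib.shape[0]
    x = np.empty(z.shape, dtype=float)
    for d in range(z.shape[1]):
        sorted_col = np.sort(calib[:, d])
        # Rank of each z among calibration values, offset to stay inside (0, 1).
        ranks = np.searchsorted(sorted_col, z[:, d], side="right")
        x[:, d] = (ranks + 0.5) / (n + 1)
    return x


def decompose_sketch(calib: np.ndarray, z: np.ndarray) -> dict:
    """z -> CDF -> (S, u, v) for z of shape (seq_len, 3)."""
    x = empirical_cdf(calib, z)        # uniform marginals, shape (seq_len, 3)
    s = x.sum(axis=1)                  # scale: S = x0 + x1 + x2
    x_norm = x / s[:, None]            # normalize onto the probability simplex
    # Assumed embedding: triangle vertices (0, 0), (1, 0), (0.5, sqrt(3)/2).
    u = x_norm[:, 1] + 0.5 * x_norm[:, 2]
    v = (np.sqrt(3.0) / 2.0) * x_norm[:, 2]
    return {"S": s, "u": u, "v": v}
```

Under these assumptions, two inputs with the same `S` but different CDF proportions land at different `(u, v)` points, which is the separation of scale from position that the section describes.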
 ### Direction Profiles and Contrast Pairs
 The codebook doesn't just detect "anomalous" — it detects specific behavioral
 **directions**. Each direction is defined by a contrast pair of conditions:
 | Contrast Pair | Condition A | Condition B | Behavioral Direction |
 |---------------|-------------|-------------|---------------------|
 | refusal | harmful | harmless | Refusal activation |
 | instruction_vs_data | instruction | data | Instruction-following |
 | tool_call | tool_call | natural_language | Tool call patterns |
 | self_vs_other | self_ref | other_ref | Self-reference |
 | semantic_violation | violated | expected | Semantic norm violation |
 | uncertainty | uncertain | confident | Uncertainty expression |
 | injection | injection | benign_instruction | Prompt injection |
 For each contrast pair, the codebook computes a **DirectionProfile** — the
 statistical baseline (means, pooled standard deviations, Cohen's d) of the
 (S, u, v) features for both conditions. This enables:
 1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
   features of condition A vs condition B. Produces P(active | features) —
   the probability that the input exhibits the "active" behavioral pattern.
 2. **Thresholds**: Midpoints between condition means for each feature, used
   for interpretable rule-based detection as a fallback.
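
The per-feature statistics in a direction profile can be sketched as follows. This is an illustration under assumptions, not the PoC code: `feature_profile` is a hypothetical helper, and it uses the standard pooled standard deviation with unbiased per-condition variances.

```python
import numpy as np


def feature_profile(a: np.ndarray, b: np.ndarray) -> dict:
    """Baseline statistics for one feature (S, u, or v) across two conditions."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = float(np.mean(a)), float(np.mean(b))
    # Pooled variance from the unbiased (ddof=1) per-condition variances.
    pooled_var = (
        (n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)
    ) / (n_a + n_b - 2)
    std_pooled = float(np.sqrt(pooled_var))
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std_pooled,
        "cohen_d": (mean_a - mean_b) / std_pooled,  # effect size
        "threshold": 0.5 * (mean_a + mean_b),       # midpoint between means
    }
```

Calling this once each for S, u, and v on condition A and condition B samples would give the nine mean/std fields and three effect sizes listed for a `DirectionProfile`.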
 ### Token-Level Smoothing
 During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
 The detection pipeline optionally applies a rolling average (uniform kernel)
 to the decomposed (S, u, v) features before classification:
 - **window=1**: No smoothing. Each token position classified independently.
 - **window=8** (PoC default): Smooth features across 8 token positions.
  Reduces noise from individual token fluctuations while preserving
  sustained behavioral signals.
 Smoothing is an inference-time parameter — it does not affect codebook
 compilation or thresholds. The codebook is calibrated on per-token
 z-coordinates (all positions from the calibration data, flattened into
 `(N, 3)`), so the classifier weights are valid regardless of the smoothing
 window used at inference time.
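
A rolling average with a uniform kernel can be sketched as below. The edge handling (averaging over only the positions that exist near the sequence boundaries) and the roughly centered window are assumptions; the PoC may use a trailing window or different padding. `rolling_mean` is a hypothetical name.

```python
import numpy as np


def rolling_mean(x: np.ndarray, window: int = 8) -> np.ndarray:
    """Uniform-kernel rolling average over a 1D per-token feature."""
    x = np.asarray(x, dtype=float)
    window = min(window, len(x))       # keep the output the same length as x
    if window <= 1:
        return x.copy()                # window=1 means no smoothing
    kernel = np.ones(window)
    summed = np.convolve(x, kernel, mode="same")
    # Number of real positions under the kernel at each index (smaller at edges).
    counts = np.convolve(np.ones(len(x)), kernel, mode="same")
    return summed / counts
```

Applied separately to the S, u, and v arrays, a single-token spike of height 1 is reduced to at most 1/8 with `window=8`, while a signal sustained over many positions keeps its level.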
 ### Spline Distributions
 Monotonic spline distributions model the probability density along each SVD
-dimension (ADR-010). They provide:
+dimension (ADR-010). They serve two roles in the codebook:
 1. **CDF transform**: The copula decomposition requires mapping z-coordinates
   to uniform marginals via the CDF. The spline CDF provides this transform.
 2. **Scale distribution**: A separate spline distribution models the
   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
 Beyond these two roles, they provide:
 - **Smooth scoring**: Continuous score rather than hard threshold
 - **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
  adversarial training
 The spline distribution approach is adapted from the metaspline PoC
-(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+(`spline.py` — `SplineDistribution` class, ~378 lines).
 **Formal definition**: The CDF along each dimension is modeled as a monotonic
-cubic spline with 10–20 knots. Knot positions are determined by quantiles of
+cubic spline with knots (typically 10–64, depending on calibration data
-the calibration data (ensuring density of knots where data is dense). Beyond
+size). Knot positions are determined by quantiles of the calibration data
-the extreme knots, the CDF decays exponentially at a rate fitted to the tail
+(ensuring density of knots where data is dense). Beyond the extreme knots,
-data. The scoring function maps a z-coordinate to a score in [0, 1] via the
+the CDF decays exponentially at a rate fitted to the tail data. The scoring
+function maps a z-coordinate to a score in [0, 1] via the
 CDF's complement: `score = 1 - cdf(z)`.
 **Canonical implementation**: The metaspline PoC files `spline.py`
 (`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
 and `space.py` (`unfold`/`fold`) are the reference implementation for the
 codebook compilation pipeline.
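
The CDF-complement scoring in the formal definition can be approximated with a deliberately simplified stand-in: quantile-placed knots with piecewise-linear interpolation instead of a monotonic cubic spline, and clamped tails instead of fitted exponential tails. The function names and the 1%/99% outer quantile levels are assumptions for illustration and are not taken from `spline.py`.

```python
import numpy as np


def fit_quantile_cdf(calib_1d: np.ndarray, n_knots: int = 16):
    """Place knots at quantiles of the calibration data for one dimension."""
    levels = np.linspace(0.01, 0.99, n_knots)   # CDF value at each knot
    knots = np.quantile(calib_1d, levels)       # dense where the data is dense
    return knots, levels


def score(z: np.ndarray, knots: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """score = 1 - cdf(z), with the CDF clamped beyond the extreme knots."""
    cdf = np.interp(z, knots, levels)           # monotone piecewise-linear CDF
    return 1.0 - cdf
```

Because the knots follow the data's quantiles, resolution is highest where calibration samples are dense, which is the property the formal definition relies on.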
 ### Calibration Dataset
-The calibration dataset is the set of normal (non-adversarial) inputs used to
+The calibration dataset serves two purposes: establishing the population
-compute the SVD basis and fit behavioral region distributions. Requirements:
+distribution (normal behavioral baseline) and providing contrast pairs
 (labeled examples for each behavioral direction).
- **Composition**: Diverse normal inputs representative of the deployment
+**Population data**: Diverse normal inputs representative of the deployment
-  domain. No adversarial examples — the basis models *normal* behavior, and
+domain. No adversarial examples — the population models *normal* behavior,
-  anomalies are detected as deviations from it.
+and anomalies are detected as deviations from it. Each prompt is processed
 by the detector model, and z-coordinates are extracted at every token
 position. The flattened `(N, 3)` tensor of all positions forms the population.
 **Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
 instruction/data) that define each behavioral direction. Each condition
 produces a set of z-coordinates that, after copula decomposition, reveal
 where the conditions separate in (S, u, v) space.
 Requirements:
 - **Composition**: Population must cover the range of normal inputs the
  detector will see in production. Contrast pairs must be clearly distinct
  along their target behavioral direction.
 - **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
-  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
+  Practical range: 1,000–10,000 prompts for population. Each contrast
-  have diminishing returns.
+  condition needs at least 50–200 prompts.
- **Diversity**: Must cover the range of normal inputs the detector will see
+- **Diversity**: A narrow population (e.g., only short English queries) will
-  in production. A narrow calibration dataset (e.g., only short English
+  produce high false positive rates on unusual but benign inputs.
-  queries) will produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: Calibration data must be collected for each detector
  model by running that model on the inputs and extracting activations.
 The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
-automates calibration dataset processing.
+automates calibration dataset processing with `max_length=128` tokens per
 prompt.
 ### Codebook Compilation
 The codebook is compiled offline by a training pipeline that:
-1. Runs the detector model on a calibration dataset (diverse normal inputs)
+1. Runs the detector model on a calibration dataset (population + contrast
-2. Extracts hidden state activations at configured layers
+   pairs)
-3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
+2. Extracts hidden state activations at configured layers for every token
-   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
+   position (not just last-token)
-   which uses randomized approximation and may not be deterministic)
+3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
-4. Fits spline distributions along each retained dimension
+   deterministic decomposition)
-5. Computes detection thresholds
+4. Projects activations onto the top-3 SVD components → z-coordinates
-6. Serializes the codebook to a portable format (safetensors + JSON config)
+5. Fits spline distributions on each SVD dimension and the sum S
 6. Applies copula decomposition to all z-coordinates → (S, u, v) features
 7. Computes direction profiles (means, pooled std, Cohen's d) for each
   contrast pair
 8. Trains logistic classifiers on (S, u, v) for each contrast pair
 9. Computes detection thresholds (midpoints between condition means)
 10. Serializes the codebook to a portable format (safetensors + JSON config)
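
Steps 3 and 4 above can be sketched with NumPy. This is a sketch under assumptions: it uses `np.linalg.svd` in place of the `torch.linalg.svd` call named above, treats "perturbation vectors" as mean-centered activations, and the helper names `fit_basis` and `project` are hypothetical.

```python
import numpy as np


def fit_basis(acts: np.ndarray, n_dims: int = 3):
    """Exact SVD of centered activations; returns the mean and top components."""
    mean = acts.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    return mean, vt[:n_dims]            # basis shape: (n_dims, hidden_dim)


def project(acts: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project (N, hidden_dim) activations to (N, n_dims) z-coordinates."""
    return (acts - mean) @ basis.T
```

The same `project` call covers both shapes discussed earlier: an `(N, hidden_dim)` calibration matrix yields `(N, 3)`, and a single input's `(seq_len, hidden_dim)` activations yield `(seq_len, 3)`.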
 This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
 package** as package data (under `src/alknet_firewall/data/codebook/`). This
@@ -224,8 +328,10 @@ The codebook is stored as:
 codebook/
 ├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
 ├── regions.safetensors    # Region boundary parameters
 ├── classifiers.safetensors # Logistic classifier weights per direction
 ├── splines.json           # Spline knot positions and coefficients
-└── config.json            # Metadata: model_id, revision, n_dims, thresholds
+├── profiles.json          # Direction profiles (means, stds, Cohen's d)
 └── config.json            # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
 ```
 All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
 | `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
 **classifiers.safetensors**:
 | Key | Shape | Dtype | Description |
 |-----|-------|-------|-------------|
 | `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
 | `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
 | `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
 | `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
 **splines.json**:
 | Field | Type | Description |
 |-------|------|-------------|
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
 | `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
 **profiles.json**:
 | Field | Type | Description |
 |-------|------|-------------|
 | `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
 | `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |
 Each `DirectionProfile` entry contains:
 | Field | Type | Description |
 |-------|------|-------------|
 | `label` | `str` | Direction name (e.g., "refusal") |
 | `sum_mean_a` | `float` | Mean S for condition A |
 | `sum_mean_b` | `float` | Mean S for condition B |
 | `sum_std_pooled` | `float` | Pooled std of S |
 | `u_mean_a` | `float` | Mean u for condition A |
 | `u_mean_b` | `float` | Mean u for condition B |
 | `u_std_pooled` | `float` | Pooled std of u |
 | `v_mean_a` | `float` | Mean v for condition A |
 | `v_mean_b` | `float` | Mean v for condition B |
 | `v_std_pooled` | `float` | Pooled std of v |
 | `cohen_d_sum` | `float` | Effect size for S |
 | `cohen_d_u` | `float` | Effect size for u |
 | `cohen_d_v` | `float` | Effect size for v |
 ## Interfaces
 ### Internal API
@@ -262,18 +399,54 @@ class CodebookConfig:
    model_revision: str
    n_dimensions: int
    layers: list[int]
-    suspicious_threshold: float    # Serialized threshold values
+    suspicious_threshold: float
-    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)
+    dangerous_threshold: float
    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
    smoothing_window: int = 8                   # Token-level smoothing (inference only)
 class Codebook:
    def __init__(self, path: Path): ...
    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
-        """Project raw activations onto SVD basis → z-coordinates."""
+        """Project raw activations onto SVD basis → z-coordinates.
        Returns: (seq_len, 3) z-coordinates.
        """
        ...
-    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
+    def decompose(self, z_coords: np.ndarray) -> dict:
-        """Score z-coordinates against behavioral regions."""
+        """Copula decomposition: z → CDF → (S, u, v).
        Args:
            z_coords: (seq_len, 3) or (N, 3) z-coordinates
        Returns:
            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
        """
        ...
    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
        """Classify decomposed features using logistic classifiers.
        Args:
            features: Output of decompose(), with (seq_len,) arrays
            window: Smoothing window size (1 = no smoothing)
        Returns:
            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
        """
        ...
    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
               min_positions: int = 3, window: int = 8) -> DetectionResult:
        """Detection pipeline on projected z-coordinates: decompose → smooth → classify → flag.
        Args:
            z_coords: (seq_len, 3) z-coordinates for a single input
            threshold_prob: P(active) threshold for flagging a direction
            min_positions: Minimum token positions above threshold to flag
            window: Smoothing window for token-level features
        """
        ...
    @classmethod
--- a/docs/architecture/decisions/009-last-token-extraction.md
+++ b/docs/architecture/decisions/009-last-token-extraction.md
@@ -30,8 +30,14 @@ input.
 ## Decision
-Extract the last token's hidden state at each configured layer. This is
+Extract the last token's hidden state at each configured layer as the Phase 1
-standard for LLaMA-family models and provides full-sequence context.
+default. This is standard for LLaMA-family models and provides full-sequence
 context.
 Phase 2 extends this to per-token extraction (hidden states at every position)
 to enable token-level smoothing and per-position behavioral classification.
 The training pipeline already uses per-token extraction for calibration data
 collection.
 ## Consequences
@@ -40,6 +46,7 @@ standard for LLaMA-family models and provides full-sequence context.
 - Full sequence context via causal attention
 - Single vector per layer — simple to project and score
 - No padding sensitivity (unlike mean pooling with attention masks)
 - Phase 1 simplification: reduces implementation complexity and latency
 **Negative**:
 - Position-dependent — the last token's representation is influenced by its
@@ -48,6 +55,8 @@ standard for LLaMA-family models and provides full-sequence context.
  activation patterns
 - May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end
 - Phase 1 only: misses token-level behavioral signals that require per-token
  extraction (addressed in Phase 2)
 ## References
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
   "Please summarize this document: [hidden injection payload]"
 2. Tokenize
-   tokenizer.encode(input) → input_ids
+   tokenizer.encode(input) → input_ids  (shape: seq_len)
 3. Detector Model Inference
-   model(input_ids) → hidden_states at key layers
+   model(input_ids, output_hidden_states=True) → hidden_states at key layers
 4. Activation Extraction
-   Extract hidden states from configured layers (early + mid)
+   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors
 5. SVD Projection
   Project activations onto precomputed SVD basis
-   z_coords = svd_basis @ activation_vector
+   z_coords = V^T @ (activation - mean)  → (seq_len, 3) z-coordinates
-6. Codebook Comparison
+6. Copula Decomposition
-   For each SVD dimension:
+   Transform z-coordinates through CDF → simplex → barycentric:
-     - Compute distance from normal behavioral region
+   z → (x₀, x₁, x₂) via CDF  →  S = x₀+x₁+x₂ (scale)
-     - Apply spline scoring (monotonic distribution)
+                               →  (u, v) via barycentric (position on simplex)
     - Aggregate multi-dimensional signals
-7. Alarm Generation
+7. Token-Level Smoothing (optional)
-   Combine per-dimension signals → overall alarm
+   Apply rolling average to (S, u, v) features across token positions
-   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+   window=8: smooths per-token signals, reduces noise from single-token spikes
-   Include per-dimension breakdown for interpretability
+
 8. Direction Classification
   For each behavioral direction (refusal, injection, etc.):
     logistic_classifier(S, u, v) → P(active | features) per token position
 9. Aggregation
   Per direction: mean P(active), max P(active), fraction above threshold
   Flag if any direction exceeds threshold for sufficient token positions
 10. Alarm Generation
    Combine per-direction signals → overall alarm
    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
    Include per-direction breakdown for interpretability
 ```
 Note: Step 4 extracts only the last token in Phase 1. The full pipeline
 (Phase 2) extracts per-token activations, enabling the token-level smoothing
 and per-position classification in steps 7–9.
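
Steps 8 and 9 can be sketched as follows. The logistic form matches the per-feature weights and intercept stored in `classifiers.safetensors`, and the defaults `threshold_prob=0.7` and `min_positions=3` mirror the `detect()` signature in codebook.md. The function names and the use of `>=` for the threshold comparison are assumptions.

```python
import numpy as np


def p_active(s, u, v, w_s, w_u, w_v, intercept):
    """Logistic P(active | S, u, v), evaluated per token position."""
    logits = w_s * s + w_u * u + w_v * v + intercept
    return 1.0 / (1.0 + np.exp(-logits))


def aggregate(prob: np.ndarray, threshold_prob: float = 0.7,
              min_positions: int = 3) -> dict:
    """Summarize one direction's per-token probabilities and decide the flag."""
    above = prob >= threshold_prob
    n_above = int(above.sum())
    return {
        "mean_prob": float(prob.mean()),
        "max_prob": float(prob.max()),
        "frac_above": n_above / len(prob),
        "n_positions_above": n_above,
        # Flag only when enough token positions exceed the threshold.
        "flagged": n_above >= min_positions,
    }
```

Requiring a minimum number of positions above the threshold means a single high-probability token does not flag a direction on its own.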
 ## Key Concepts
 ### Behavioral Alarm
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.
 ### Score Composition
-The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
+The overall `Alarm.score` (0.0–1.0) is computed from per-direction
-using a weighted maximum:
+classification results. For each behavioral direction, the logistic
 classifier produces P(active | features) for every token position. The
 alarm score aggregates these across directions:
 ```
-score = max(w_d * signal_d for d in dimensions)
+direction_score = max(P(active) across token positions)
 score = max(w_d * direction_score_d for d in directions)
 ```
-Where `w_d` are dimension weights (default: equal, configurable in
+Where `w_d` are direction weights (default: equal, configurable in
-`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
+`Thresholds.per_dimension`). Using `max` at both levels ensures that:
-single strongly anomalous dimension can trigger an alarm even if other
+- A single strongly anomalous direction can trigger an alarm even if other
-dimensions are normal. This is critical for catching attacks that exploit
+  directions are normal
-specific behavioral patterns (e.g., refusal-suppression) while leaving other
+- A sustained behavioral signal at any token position surfaces in the alarm
-dimensions unaffected.
+
 This is critical for catching attacks that exploit specific behavioral
 patterns (e.g., refusal-suppression) while leaving other directions
 unaffected.
 The `suspicious` and `dangerous` thresholds are applied to this composite
 score to determine `Alarm.level`.
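
The two-level `max` composition can be written out as a short sketch. The weights default to 1.0 (equal weighting), as stated above. The function names, the string return values, and the inclusive `>=` comparisons are assumptions, and the thresholds in the usage note are placeholder values rather than the codebook's calibrated ones.

```python
from typing import Optional

import numpy as np


def alarm_score(probs: dict, weights: Optional[dict] = None) -> float:
    """score = max over directions of w_d * max over token positions of P(active)."""
    weights = weights or {}
    return max(
        weights.get(name, 1.0) * float(np.max(p)) for name, p in probs.items()
    )


def alarm_level(score: float, suspicious: float, dangerous: float) -> str:
    """Map the composite score to an alarm level name."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"
```

For example, with per-token probabilities of at most 0.2 for refusal and at most 0.9 for injection, the composite score is 0.9, which maps to `DANGEROUS` under placeholder thresholds of 0.5 and 0.8.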
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.
 | Level | Meaning | Action |
 |-------|---------|--------|
-| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
+| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
-| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
+| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
-| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
+| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
 ### Latency Budget
@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
 | Model inference (125M, CPU) | ~5ms |
 | Activation extraction | ~0.1ms |
 | SVD projection | ~0.1ms |
-| Codebook comparison | ~0.3ms |
+| Copula decomposition | ~0.05ms |
 | Token-level smoothing | ~0.05ms |
 | Direction classification | ~0.1ms |
 | **Total** | **~6ms** |
 ## Interfaces
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):
@dataclass
 class DimensionSignal:
-    dimension: int
+    direction: str              # Behavioral direction name (e.g., "refusal", "injection")
-    deviation: float
+    score: float                # P(active) for this direction
-    score: float
+    max_score: float            # Max P(active) across token positions
    mean_score: float           # Mean P(active) across token positions
    n_positions_above: int      # Token positions above threshold
    direction_label: str | None
@dataclass
--- a/docs/architecture/model.md
+++ b/docs/architecture/model.md
@@ -35,15 +35,34 @@ changes to the firewall logic.
 The core operation: running the model on an input and capturing hidden state
 representations at specific layers.
 **Phase 1 (last-token extraction)**:
 ```python
 # Conceptual
 outputs = model(input_ids, output_hidden_states=True)
 activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
 }
 # Shape: (batch, hidden_dim) per layer, i.e. one vector per input
 ```
 **Phase 2 (per-token extraction)**: Extract hidden states at every token
 position to enable token-level smoothing and per-position classification
 (see codebook.md: Token-Level Smoothing).
 ```python
 outputs = model(input_ids, output_hidden_states=True)
 activations = {
    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
    for layer_idx in configured_layers
 }
 # Shape: (seq_len, hidden_dim) per layer — sequence of vectors
 ```
 The training pipeline uses per-token extraction (z-coordinates at every
 position are collected for population statistics). Phase 1 simplifies to
 last-token only for lower latency and simpler implementation. The codebook's
 classifiers are trained on per-token data from all positions, so they remain
 valid for both extraction modes.
 Key decisions:
 - **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
@@ -52,9 +71,11 @@ Key decisions:
  signals are highly correlated with the selected layers.
 - **Which token**: The last token's hidden state carries the model's
  "conclusion" about the full input sequence (ADR-009). This is the standard
-  choice for autoregressive (LLaMA-family) models.
+  choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
  Per-token extraction enables the full detection pipeline in Phase 2.
 - **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
-  (768 for SmolLM2-135M).
+  (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
  in Phase 2.
 ### Model-Agnostic Interface