docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, and Token-Level Smoothing section; add classifier weights and direction profiles to the data format; update the Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to a 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note that last-token is the Phase 1 simplification and per-token is the full pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions
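
The pipeline named in the commit message (z → CDF → simplex → barycentric → (S, u, v), then optional smoothing) can be sketched as below. This is a minimal illustration under stated assumptions, not the PoC's `decompose()`: an empirical CDF stands in for the fitted spline CDF, the triangle layout used for the barycentric coordinates is an assumed convention, the raw sum S is returned rather than its CDF (`u_sum`), and the names `empirical_cdf` and `rolling_mean` are hypothetical.

```python
import numpy as np


def empirical_cdf(population: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Map each z-dimension to (0, 1) using the population's empirical CDF.

    Stand-in for the fitted spline CDF; both are monotonic per dimension.
    """
    n = population.shape[0]
    x = np.empty(z.shape, dtype=float)
    for dim in range(z.shape[1]):
        ordered = np.sort(population[:, dim])
        rank = np.searchsorted(ordered, z[:, dim], side="right")
        # The offset keeps values strictly inside (0, 1), so S is never zero.
        x[:, dim] = (rank + 0.5) / (n + 1.0)
    return x


def decompose(z: np.ndarray, population: np.ndarray) -> dict[str, np.ndarray]:
    """z -> CDF -> (S, u, v) for z of shape (seq_len, 3)."""
    x = empirical_cdf(population, z)
    s = x.sum(axis=1)            # scale: S = x0 + x1 + x2
    x_norm = x / s[:, None]      # rows now sum to 1 (probability simplex)
    # Assumed triangle layout: vertices at (0, 0), (1, 0), (0.5, sqrt(3)/2).
    u = x_norm[:, 1] + 0.5 * x_norm[:, 2]
    v = (np.sqrt(3.0) / 2.0) * x_norm[:, 2]
    return {"S": s, "u": u, "v": v}


def rolling_mean(values: np.ndarray, window: int) -> np.ndarray:
    """Uniform-kernel rolling average that preserves sequence length."""
    window = min(window, len(values))
    if window <= 1:
        return values.astype(float)
    kernel = np.ones(window)
    totals = np.convolve(values, kernel, mode="same")
    # Dividing by the per-position count avoids biasing the edges toward zero.
    counts = np.convolve(np.ones(len(values)), kernel, mode="same")
    return totals / counts


rng = np.random.default_rng(0)
population = rng.normal(size=(1000, 3))  # flattened per-token calibration data, (N, 3)
z_seq = rng.normal(size=(20, 3))         # one input, (seq_len, 3)
features = decompose(z_seq, population)
smoothed = {name: rolling_mean(vals, window=8) for name, vals in features.items()}
```

Smoothing is applied after decomposition, matching the order in the updated data flow; with `window=1` the features pass through unchanged.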
--- a/docs/architecture/codebook.md
+++ b/docs/architecture/codebook.md
@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
 distributions are calibrated for this scale, so threshold values in the
 codebook are specific to the z-coordinate range of the calibration data.

+**Training shape**: `(N, 3)` where N is the total number of token positions
+across all calibration prompts. Each token position produces its own
+z-coordinate, so the population data is a flattened collection of per-token
+z-vectors.
+
+**Inference shape**: `(seq_len, 3)` for a single input. Each token position
+in the input sequence produces a z-coordinate. The detection pipeline
+operates on this per-token sequence, optionally smoothing it before
+classification.
+
 ### SVD Basis

 Singular Value Decomposition of the activation space from a calibration dataset
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
 Inputs whose projections fall near or beyond the region boundary score
 increasingly high (SUSPICIOUS → DANGEROUS).

+### Copula Decomposition
+
+Raw z-coordinates are not the detection feature space. The codebook
+decomposes z-coordinates through a copula transform that separates **scale**
+(how far from normal) from **position** (which behavioral direction):
+
+```
+z → CDF → (x₀, x₁, x₂)     # Uniform marginals via CDF transform
+  → S = x₀ + x₁ + x₂       # Scale: total CDF magnitude
+  → x_norm = simplex(x)      # Normalize to probability simplex
+  → (u, v) = barycentric(x_norm)  # Position: 2D simplex coordinates
+```
+
+The three derived features `(S, u, v)` form the actual detection space:
+
+- **S (scale)**: How far the input's z-coordinates deviate from the
+  population norm, aggregated across all three SVD dimensions. High S means
+  the input is anomalous in *magnitude*.
+- **u, v (position)**: Where the input sits on the behavioral simplex —
+  which *direction* the deviation points. Different behavioral patterns
+  (refusal, instruction-following, self-reference) separate along different
+  (u, v) axes.
+
+This decomposition is why the codebook can distinguish "this input activates
+the refusal direction" from "this input is just generally unusual" — the same
+S value with different (u, v) coordinates implies different behavioral
+patterns.
+
+The PoC's `decompose()` method implements this pipeline as a pure function.
+It is called both during codebook compilation (to compute direction
+profiles) and during inference (to transform new z-coordinates for
+classification).
+
+### Direction Profiles and Contrast Pairs
+
+The codebook doesn't just detect "anomalous" — it detects specific behavioral
+**directions**. Each direction is defined by a contrast pair of conditions:
+
+| Contrast Pair | Condition A | Condition B | Behavioral Direction |
+|---------------|-------------|-------------|---------------------|
+| refusal | harmful | harmless | Refusal activation |
+| instruction_vs_data | instruction | data | Instruction-following |
+| tool_call | tool_call | natural_language | Tool call patterns |
+| self_vs_other | self_ref | other_ref | Self-reference |
+| semantic_violation | violated | expected | Semantic norm violation |
+| uncertainty | uncertain | confident | Uncertainty expression |
+| injection | injection | benign_instruction | Prompt injection |
+
+For each contrast pair, the codebook computes a **DirectionProfile** — the
+statistical baseline (means, pooled standard deviations, Cohen's d) of the
+(S, u, v) features for both conditions. This enables:
+
+1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
+   features of condition A vs condition B. Produces P(active | features) —
+   the probability that the input exhibits the "active" behavioral pattern.
+2. **Thresholds**: Midpoints between condition means for each feature, used
+   for interpretable rule-based detection as a fallback.
+
+### Token-Level Smoothing
+
+During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
+The detection pipeline optionally applies a rolling average (uniform kernel)
+to the decomposed (S, u, v) features before classification:
+
+- **window=1**: No smoothing. Each token position classified independently.
+- **window=8** (PoC default): Smooth features across 8 token positions.
+  Reduces noise from individual token fluctuations while preserving
+  sustained behavioral signals.
+
+Smoothing is an inference-time parameter — it does not affect codebook
+compilation or thresholds. The codebook is calibrated on per-token
+z-coordinates (all positions from the calibration data, flattened into
+`(N, 3)`), so the classifier weights are valid regardless of the smoothing
+window used at inference time.
+
 ### Spline Distributions

 Monotonic spline distributions model the probability density along each SVD
-dimension (ADR-010). They provide:
+dimension (ADR-010). They serve two roles in the codebook:

+1. **CDF transform**: The copula decomposition requires mapping z-coordinates
+   to uniform marginals via the CDF. The spline CDF provides this transform.
+2. **Scale distribution**: A separate spline distribution models the
+   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
+
+They provide:
 - **Smooth scoring**: Continuous score rather than hard threshold
 - **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
  adversarial training

 The spline distribution approach is adapted from the metaspline PoC
-(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+(`spline.py` — `SplineDistribution` class, ~378 lines).

 **Formal definition**: The CDF along each dimension is modeled as a monotonic
-cubic spline with 10–20 knots. Knot positions are determined by quantiles of
-the calibration data (ensuring density of knots where data is dense). Beyond
-the extreme knots, the CDF decays exponentially at a rate fitted to the tail
-data. The scoring function maps a z-coordinate to a score in [0, 1] via the
-CDF's complement: `score = 1 - cdf(z)`.
-
-**Canonical implementation**: The metaspline PoC files `spline.py`
-(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
-and `space.py` (`unfold`/`fold`) are the reference implementation for the
-codebook compilation pipeline.
+cubic spline with knots (typically 10–64, depending on calibration data
+size). Knot positions are determined by quantiles of the calibration data
+(ensuring density of knots where data is dense). Beyond the extreme knots,
+the CDF decays exponentially at a rate fitted to the tail data.

 ### Calibration Dataset

-The calibration dataset is the set of normal (non-adversarial) inputs used to
-compute the SVD basis and fit behavioral region distributions. Requirements:
+The calibration dataset serves two purposes: establishing the population
+distribution (normal behavioral baseline) and providing contrast pairs
+(labeled examples for each behavioral direction).

- **Composition**: Diverse normal inputs representative of the deployment
-  domain. No adversarial examples — the basis models *normal* behavior, and
-  anomalies are detected as deviations from it.
+**Population data**: Diverse normal inputs representative of the deployment
+domain. No adversarial examples — the population models *normal* behavior,
+and anomalies are detected as deviations from it. Each prompt is processed
+by the detector model, and z-coordinates are extracted at every token
+position. The flattened `(N, 3)` tensor of all positions forms the population.
+
+**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
+instruction/data) that define each behavioral direction. Each condition
+produces a set of z-coordinates that, after copula decomposition, reveal
+where the conditions separate in (S, u, v) space.
+
+Requirements:
+- **Composition**: Population must cover the range of normal inputs the
+  detector will see in production. Contrast pairs must be clearly distinct
+  along their target behavioral direction.
 - **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
-  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
-  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
-  in production. A narrow calibration dataset (e.g., only short English
-  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
+  Practical range: 1,000–10,000 prompts for population. Each contrast
+  condition needs at least 50–200 prompts.
+- **Diversity**: A narrow population (e.g., only short English queries) will
+  produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: Calibration data must be collected for each detector
  model by running that model on the inputs and extracting activations.

 The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
-automates calibration dataset processing.
+automates calibration dataset processing with `max_length=128` tokens per
+prompt.

 ### Codebook Compilation

 The codebook is compiled offline by a training pipeline that:

-1. Runs the detector model on a calibration dataset (diverse normal inputs)
-2. Extracts hidden state activations at configured layers
-3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
-   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
-   which uses randomized approximation and may not be deterministic)
-4. Fits spline distributions along each retained dimension
-5. Computes detection thresholds
-6. Serializes the codebook to a portable format (safetensors + JSON config)
+1. Runs the detector model on a calibration dataset (population + contrast
+   pairs)
+2. Extracts hidden state activations at configured layers for every token
+   position (not just last-token)
+3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
+   deterministic decomposition)
+4. Projects activations onto the top-3 SVD components → z-coordinates
+5. Fits spline distributions on each SVD dimension and the sum S
+6. Applies copula decomposition to all z-coordinates → (S, u, v) features
+7. Computes direction profiles (means, pooled std, Cohen's d) for each
+   contrast pair
+8. Trains logistic classifiers on (S, u, v) for each contrast pair
+9. Computes detection thresholds (midpoints between condition means)
+10. Serializes the codebook to a portable format (safetensors + JSON config)

 This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
 package** as package data (under `src/alknet_firewall/data/codebook/`). This
@@ -224,8 +328,10 @@ The codebook is stored as:
 codebook/
 ├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
 ├── regions.safetensors    # Region boundary parameters
+├── classifiers.safetensors # Logistic classifier weights per direction
 ├── splines.json           # Spline knot positions and coefficients
-└── config.json            # Metadata: model_id, revision, n_dims, thresholds
+├── profiles.json          # Direction profiles (means, stds, Cohen's d)
+└── config.json            # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
 ```

 All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
 | `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

+**classifiers.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
+| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
+| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
+| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
+
 **splines.json**:
 | Field | Type | Description |
 |-------|------|-------------|
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
 | `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

+**profiles.json**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
+| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |
+
+Each `DirectionProfile` entry contains:
+| Field | Type | Description |
+|-------|------|-------------|
+| `label` | `str` | Direction name (e.g., "refusal") |
+| `sum_mean_a` | `float` | Mean S for condition A |
+| `sum_mean_b` | `float` | Mean S for condition B |
+| `sum_std_pooled` | `float` | Pooled std of S |
+| `u_mean_a` | `float` | Mean u for condition A |
+| `u_mean_b` | `float` | Mean u for condition B |
+| `u_std_pooled` | `float` | Pooled std of u |
+| `v_mean_a` | `float` | Mean v for condition A |
+| `v_mean_b` | `float` | Mean v for condition B |
+| `v_std_pooled` | `float` | Pooled std of v |
+| `cohen_d_sum` | `float` | Effect size for S |
+| `cohen_d_u` | `float` | Effect size for u |
+| `cohen_d_v` | `float` | Effect size for v |
+
 ## Interfaces

 ### Internal API
@@ -262,18 +399,54 @@ class CodebookConfig:
    model_revision: str
    n_dimensions: int
    layers: list[int]
-    suspicious_threshold: float    # Serialized threshold values
-    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)
+    suspicious_threshold: float
+    dangerous_threshold: float
+    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
+    smoothing_window: int = 8                   # Token-level smoothing (inference only)

 class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
-        """Project raw activations onto SVD basis → z-coordinates."""
+        """Project raw activations onto SVD basis → z-coordinates.
+        
+        Returns: (seq_len, 3) z-coordinates.
+        """
        ...

-    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
-        """Score z-coordinates against behavioral regions."""
+    def decompose(self, z_coords: np.ndarray) -> dict:
+        """Copula decomposition: z → CDF → (S, u, v).
+        
+        Args:
+            z_coords: (seq_len, 3) or (N, 3) z-coordinates
+        
+        Returns:
+            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
+        """
+        ...
+
+    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
+        """Classify decomposed features using logistic classifiers.
+        
+        Args:
+            features: Output of decompose(), with (seq_len,) arrays
+            window: Smoothing window size (1 = no smoothing)
+        
+        Returns:
+            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
+        """
+        ...
+
+    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
+               min_positions: int = 3, window: int = 8) -> DetectionResult:
+        """Full detection pipeline: project → decompose → smooth → classify → flag.
+        
+        Args:
+            z_coords: (seq_len, 3) z-coordinates for a single input
+            threshold_prob: P(active) threshold for flagging a direction
+            min_positions: Minimum token positions above threshold to flag
+            window: Smoothing window for token-level features
+        """
        ...

    @classmethod
--- a/docs/architecture/decisions/009-last-token-extraction.md
+++ b/docs/architecture/decisions/009-last-token-extraction.md
@@ -30,8 +30,14 @@ input.

 ## Decision

-Extract the last token's hidden state at each configured layer. This is
-standard for LLaMA-family models and provides full-sequence context.
+Extract the last token's hidden state at each configured layer as the Phase 1
+default. This is standard for LLaMA-family models and provides full-sequence
+context.
+
+Phase 2 extends this to per-token extraction (hidden states at every position)
+to enable token-level smoothing and per-position behavioral classification.
+The training pipeline already uses per-token extraction for calibration data
+collection.

 ## Consequences

@@ -40,6 +46,7 @@ standard for LLaMA-family models and provides full-sequence context.
 - Full sequence context via causal attention
 - Single vector per layer — simple to project and score
 - No padding sensitivity (unlike mean pooling with attention masks)
+- Phase 1 simplification: reduces implementation complexity and latency

 **Negative**:
 - Position-dependent — the last token's representation is influenced by its
@@ -48,6 +55,8 @@ standard for LLaMA-family models and provides full-sequence context.
  activation patterns
 - May miss patterns in long inputs where the adversarial payload is in the
  middle rather than the end
+- Phase 1 only: misses token-level behavioral signals that require per-token
+  extraction (addressed in Phase 2)

 ## References

--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
   "Please summarize this document: [hidden injection payload]"

 2. Tokenize
-   tokenizer.encode(input) → input_ids
+   tokenizer.encode(input) → input_ids  (shape: seq_len)

 3. Detector Model Inference
-   model(input_ids) → hidden_states at key layers
+   model(input_ids, output_hidden_states=True) → hidden_states at key layers

 4. Activation Extraction
-   Extract hidden states from configured layers (early + mid)
+   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors

 5. SVD Projection
   Project activations onto precomputed SVD basis
-   z_coords = svd_basis @ activation_vector
+   z_coords = V^T @ (activation - mean)  → (seq_len, 3) z-coordinates

-6. Codebook Comparison
-   For each SVD dimension:
-     - Compute distance from normal behavioral region
-     - Apply spline scoring (monotonic distribution)
-     - Aggregate multi-dimensional signals
+6. Copula Decomposition
+   Transform z-coordinates through CDF → simplex → barycentric:
+   z → (x₀, x₁, x₂) via CDF  →  S = x₀+x₁+x₂ (scale)
+                               →  (u, v) via barycentric (position on simplex)

-7. Alarm Generation
-   Combine per-dimension signals → overall alarm
-   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
-   Include per-dimension breakdown for interpretability
+7. Token-Level Smoothing (optional)
+   Apply rolling average to (S, u, v) features across token positions
+   window=8: smooths per-token signals, reduces noise from single-token spikes
+
+8. Direction Classification
+   For each behavioral direction (refusal, injection, etc.):
+     logistic_classifier(S, u, v) → P(active | features) per token position
+
+9. Aggregation
+   Per direction: mean P(active), max P(active), fraction above threshold
+   Flag if any direction exceeds threshold for sufficient token positions
+
+10. Alarm Generation
+    Combine per-direction signals → overall alarm
+    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+    Include per-direction breakdown for interpretability
 ```

+Note: Step 4 extracts only the last token in Phase 1. The full pipeline
+(Phase 2) extracts per-token activations, enabling the token-level smoothing
+and per-position classification in steps 7–9.
+
 ## Key Concepts

 ### Behavioral Alarm
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.

 ### Score Composition

-The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
-using a weighted maximum:
+The overall `Alarm.score` (0.0–1.0) is computed from per-direction
+classification results. For each behavioral direction, the logistic
+classifier produces P(active | features) for every token position. The
+alarm score aggregates these across directions:

 ```
-score = max(w_d * signal_d for d in dimensions)
+direction_score = max(P(active) across token positions)
+score = max(w_d * direction_score_d for d in directions)
 ```

-Where `w_d` are dimension weights (default: equal, configurable in
-`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
-single strongly anomalous dimension can trigger an alarm even if other
-dimensions are normal. This is critical for catching attacks that exploit
-specific behavioral patterns (e.g., refusal-suppression) while leaving other
-dimensions unaffected.
+Where `w_d` are direction weights (default: equal, configurable in
+`Thresholds.per_dimension`). Using `max` at both levels ensures that:
+- A single strongly anomalous direction can trigger an alarm even if other
+  directions are normal
+- A sustained behavioral signal at any token position surfaces in the alarm
+
+This is critical for catching attacks that exploit specific behavioral
+patterns (e.g., refusal-suppression) while leaving other directions
+unaffected.

 The `suspicious` and `dangerous` thresholds are applied to this composite
 score to determine `Alarm.level`.
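
The two-level `max` in the score composition can be sketched as follows. This is an illustration only: the function names, the example probabilities, the threshold values (0.5 and 0.8), and the use of `>=` at the threshold boundaries are assumptions, not values from the codebook.

```python
import numpy as np


def compose_score(probs, weights=None):
    """score = max over directions of w_d * max over positions of P(active).

    probs maps a direction name to a (seq_len,) array of P(active).
    weights maps a direction name to w_d; missing names default to 1.0.
    """
    weights = weights or {}
    direction_scores = {name: float(np.max(p)) for name, p in probs.items()}
    score = max(
        weights.get(name, 1.0) * value
        for name, value in direction_scores.items()
    )
    return score, direction_scores


def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map the composite score to a level (hypothetical threshold values)."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"


# Hypothetical per-token probabilities for two directions.
probs = {
    "refusal": np.array([0.10, 0.20, 0.15]),
    "injection": np.array([0.30, 0.92, 0.88]),
}
score, per_direction = compose_score(probs)
level = alarm_level(score)

# Down-weighting a direction lowers its contribution to the composite score.
weighted_score, _ = compose_score(probs, weights={"injection": 0.5})
weighted_level = alarm_level(weighted_score)
```

With equal weights the `injection` direction alone determines the score, even though `refusal` stays low at every position.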
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.

 | Level | Meaning | Action |
 |-------|---------|--------|
-| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
-| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
-| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
+| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
+| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
+| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |

 ### Latency Budget

@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
 | Model inference (125M, CPU) | ~5ms |
 | Activation extraction | ~0.1ms |
 | SVD projection | ~0.1ms |
-| Codebook comparison | ~0.3ms |
+| Copula decomposition | ~0.05ms |
+| Token-level smoothing | ~0.05ms |
+| Direction classification | ~0.1ms |
 | **Total** | **~6ms** |

 ## Interfaces
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):

@dataclass
 class DimensionSignal:
-    dimension: int
-    deviation: float
-    score: float
+    direction: str              # Behavioral direction name (e.g., "refusal", "injection")
+    score: float                # P(active) for this direction
+    max_score: float            # Max P(active) across token positions
+    mean_score: float           # Mean P(active) across token positions
+    n_positions_above: int      # Token positions above threshold
    direction_label: str | None

@dataclass
--- a/docs/architecture/model.md
+++ b/docs/architecture/model.md
@@ -35,15 +35,34 @@ changes to the firewall logic.
 The core operation: running the model on an input and capturing hidden state
 representations at specific layers.

+**Phase 1 (last-token extraction)**:
 ```python
-# Conceptual
 outputs = model(input_ids, output_hidden_states=True)
 activations = {
    layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
    for layer_idx in configured_layers
 }
+# Shape: (batch, hidden_dim) per layer; one hidden_dim vector per input
 ```

+**Phase 2 (per-token extraction)**: Extract hidden states at every token
+position to enable token-level smoothing and per-position classification
+(see codebook.md: Token-Level Smoothing).
+```python
+outputs = model(input_ids, output_hidden_states=True)
+activations = {
+    layer_idx: outputs.hidden_states[layer_idx][0, :, :]
+    for layer_idx in configured_layers
+}
+# Shape: (seq_len, hidden_dim) per layer — sequence of vectors
+```
+
+The training pipeline uses per-token extraction (z-coordinates at every
+position are collected for population statistics). Phase 1 simplifies to
+last-token only for lower latency and simpler implementation. The codebook's
+classifiers are trained on per-token data from all positions, so they remain
+valid for both extraction modes.
+
 Key decisions:
 - **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
  Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
@@ -52,9 +71,11 @@ Key decisions:
  signals are highly correlated with the selected layers.
 - **Which token**: The last token's hidden state carries the model's
  "conclusion" about the full input sequence (ADR-009). This is the standard
-  choice for autoregressive (LLaMA-family) models.
+  choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
+  Per-token extraction enables the full detection pipeline in Phase 2.
 - **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
-  (768 for SmolLM2-135M).
+  (768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
+  in Phase 2.
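
The shapes produced by the two extraction modes can be checked without loading a model. The sketch below simulates `hidden_states` as a tuple of NumPy arrays shaped `(batch, seq_len, hidden_dim)`, which is the layout Hugging Face models return with `output_hidden_states=True`. The sizes and the number of entries in the tuple are toy values chosen for illustration, not the detector model's real dimensions.

```python
import numpy as np

batch, seq_len, hidden_dim = 1, 12, 16  # toy sizes for illustration
configured_layers = [1, 2, 4, 8]

# Simulated model output: one (batch, seq_len, hidden_dim) array per entry.
rng = np.random.default_rng(0)
hidden_states = tuple(
    rng.normal(size=(batch, seq_len, hidden_dim)) for _ in range(13)
)

# Phase 1: last-token extraction keeps the batch axis.
last_token = {i: hidden_states[i][:, -1, :] for i in configured_layers}

# Phase 2: per-token extraction for the first (only) input in the batch.
per_token = {i: hidden_states[i][0, :, :] for i in configured_layers}

for i in configured_layers:
    print(i, last_token[i].shape, per_token[i].shape)
```

The Phase 1 vector for an input is the final row of its Phase 2 array, so a Phase 2 pipeline can recover the Phase 1 input by taking the last position.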

 ### Model-Agnostic Interface