docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates on
per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  and Token-Level Smoothing section; add classifier weights and direction
  profiles to the data format; update Internal API with
  decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened
  per-token positions, inference is a (seq_len, 3) per-token sequence
- firewall.md: Update data flow to a 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note that last-token is the Phase 1 simplification and per-token
  is the full pipeline
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions
--- a/docs/architecture/codebook.md
+++ b/docs/architecture/codebook.md
@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
 distributions are calibrated for this scale, so threshold values in the
 codebook are specific to the z-coordinate range of the calibration data.

+**Training shape**: `(N, 3)` where N is the total number of token positions
+across all calibration prompts. Each token position produces its own
+z-coordinate, so the population data is a flattened collection of per-token
+z-vectors.
+
+**Inference shape**: `(seq_len, 3)` for a single input. Each token position
+in the input sequence produces a z-coordinate. The detection pipeline
+operates on this per-token sequence, optionally smoothing it before
+classification.
+
 ### SVD Basis

 Singular Value Decomposition of the activation space from a calibration dataset
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
 Inputs whose projections fall near or beyond the region boundary score
 increasingly high (SUSPICIOUS → DANGEROUS).

+### Copula Decomposition
+
+Raw z-coordinates are not the detection feature space. The codebook
+decomposes z-coordinates through a copula transform that separates **scale**
+(how far from normal) from **position** (which behavioral direction):
+
+```
+z → CDF → (x₀, x₁, x₂)            # Uniform marginals via CDF transform
+  → S = x₀ + x₁ + x₂              # Scale: total CDF magnitude
+  → x_norm = simplex(x)           # Normalize to probability simplex
+  → (u, v) = barycentric(x_norm)  # Position: 2D simplex coordinates
+```
+
+The three derived features `(S, u, v)` form the actual detection space:
+
+- **S (scale)**: How far the input's z-coordinates deviate from the
+  population norm, aggregated across all three SVD dimensions. High S means
+  the input is anomalous in *magnitude*.
+- **u, v (position)**: Where the input sits on the behavioral simplex —
+  which *direction* the deviation points. Different behavioral patterns
+  (refusal, instruction-following, self-reference) separate along different
+  (u, v) axes.
+
+This decomposition is why the codebook can distinguish "this input activates
+the refusal direction" from "this input is just generally unusual" — the same
+S value with different (u, v) coordinates implies different behavioral
+patterns.
+
+The PoC's `decompose()` method implements this pipeline as a pure function.
+It is called both during codebook compilation (to compute direction
+profiles) and during inference (to transform new z-coordinates for
+classification).
+
+### Direction Profiles and Contrast Pairs
+
+The codebook doesn't just detect "anomalous" — it detects specific behavioral
+**directions**. Each direction is defined by a contrast pair of conditions:
+
+| Contrast Pair | Condition A | Condition B | Behavioral Direction |
+|---------------|-------------|-------------|---------------------|
+| refusal | harmful | harmless | Refusal activation |
+| instruction_vs_data | instruction | data | Instruction-following |
+| tool_call | tool_call | natural_language | Tool call patterns |
+| self_vs_other | self_ref | other_ref | Self-reference |
+| semantic_violation | violated | expected | Semantic norm violation |
+| uncertainty | uncertain | confident | Uncertainty expression |
+| injection | injection | benign_instruction | Prompt injection |
+
+For each contrast pair, the codebook computes a **DirectionProfile** — the
+statistical baseline (means, pooled standard deviations, Cohen's d) of the
+(S, u, v) features for both conditions. This enables:
+
+1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
+   features of condition A vs condition B. Produces P(active | features) —
+   the probability that the input exhibits the "active" behavioral pattern.
+2. **Thresholds**: Midpoints between condition means for each feature, used
+   for interpretable rule-based detection as a fallback.
+
+### Token-Level Smoothing
+
+During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
+The detection pipeline optionally applies a rolling average (uniform kernel)
+to the decomposed (S, u, v) features before classification:
+
+- **window=1**: No smoothing. Each token position classified independently.
+- **window=8** (PoC default): Smooth features across 8 token positions.
+  Reduces noise from individual token fluctuations while preserving
+  sustained behavioral signals.
+
+Smoothing is an inference-time parameter — it does not affect codebook
+compilation or thresholds. The codebook is calibrated on per-token
+z-coordinates (all positions from the calibration data, flattened into
+`(N, 3)`), so the classifier weights are valid regardless of the smoothing
+window used at inference time.
+
 ### Spline Distributions

 Monotonic spline distributions model the probability density along each SVD
-dimension (ADR-010). They provide:
+dimension (ADR-010). They serve two roles in the codebook:

+1. **CDF transform**: The copula decomposition requires mapping z-coordinates
+   to uniform marginals via the CDF. The spline CDF provides this transform.
+2. **Scale distribution**: A separate spline distribution models the
+   sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
+
+They provide:
 - **Smooth scoring**: Continuous score rather than hard threshold
 - **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
  anomalous inputs
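Review note on the Copula Decomposition section added in the hunk above: the chain z → CDF → S → simplex → (u, v) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the PoC's `decompose()`. It substitutes an empirical CDF for the spline CDF, normalizes to the simplex by dividing by the row sum, and maps barycentric weights to 2D with a hypothetical equilateral-triangle layout (`TRIANGLE`); the source does not specify any of these details. It also returns the raw sum `S` rather than the `u_sum` (CDF of S) key that the spec's `decompose()` documents.

```python
import numpy as np

# Hypothetical 2D layout of the simplex corners (assumption, not from the spec).
TRIANGLE = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3.0) / 2.0]])


def empirical_cdf(population: np.ndarray):
    """Build a per-dimension empirical CDF from (N, 3) population z-coordinates.

    Stand-in for the codebook's spline CDF (assumption).
    """
    sorted_pop = np.sort(population, axis=0)
    n = sorted_pop.shape[0]

    def cdf(z: np.ndarray) -> np.ndarray:
        ranks = np.stack(
            [
                np.searchsorted(sorted_pop[:, d], z[:, d], side="right")
                for d in range(z.shape[1])
            ],
            axis=1,
        )
        # The offset keeps values strictly inside (0, 1), so S is never zero.
        return (ranks + 0.5) / (n + 1.0)

    return cdf


def decompose(z: np.ndarray, cdf) -> dict:
    """Map (n, 3) z-coordinates to scale S and simplex position (u, v)."""
    x = cdf(z)                   # uniform marginals, shape (n, 3)
    s = x.sum(axis=1)            # scale: total CDF magnitude
    x_norm = x / s[:, None]      # each row now sums to 1
    uv = x_norm @ TRIANGLE       # barycentric weights -> 2D coordinates
    return {"S": s, "u": uv[:, 0], "v": uv[:, 1]}


rng = np.random.default_rng(0)
population = rng.normal(size=(1000, 3))   # flattened (N, 3) calibration data
cdf = empirical_cdf(population)
features = decompose(rng.normal(size=(16, 3)), cdf)  # one (seq_len, 3) input
```

Two inputs with the same `S` but different `(u, v)` land at different points of the triangle, which is the separation of scale from direction that the section describes.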
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
  adversarial training

 The spline distribution approach is adapted from the metaspline PoC
-(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
+(`spline.py` — `SplineDistribution` class, ~378 lines).

 **Formal definition**: The CDF along each dimension is modeled as a monotonic
-cubic spline with 10–20 knots. Knot positions are determined by quantiles of
-the calibration data (ensuring density of knots where data is dense). Beyond
-the extreme knots, the CDF decays exponentially at a rate fitted to the tail
-data. The scoring function maps a z-coordinate to a score in [0, 1] via the
-CDF's complement: `score = 1 - cdf(z)`.
-
-**Canonical implementation**: The metaspline PoC files `spline.py`
-(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
-and `space.py` (`unfold`/`fold`) are the reference implementation for the
-codebook compilation pipeline.
+cubic spline, typically with 10–64 knots depending on calibration data
+size. Knot positions are determined by quantiles of the calibration data
+(ensuring density of knots where data is dense). Beyond the extreme knots,
+the CDF decays exponentially at a rate fitted to the tail data.

 ### Calibration Dataset

-The calibration dataset is the set of normal (non-adversarial) inputs used to
-compute the SVD basis and fit behavioral region distributions. Requirements:
+The calibration dataset serves two purposes: establishing the population
+distribution (normal behavioral baseline) and providing contrast pairs
+(labeled examples for each behavioral direction).

- **Composition**: Diverse normal inputs representative of the deployment
-  domain. No adversarial examples — the basis models *normal* behavior, and
-  anomalies are detected as deviations from it.
+**Population data**: Diverse normal inputs representative of the deployment
+domain. No adversarial examples — the population models *normal* behavior,
+and anomalies are detected as deviations from it. Each prompt is processed
+by the detector model, and z-coordinates are extracted at every token
+position. The flattened `(N, 3)` tensor of all positions forms the population.
+
+**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
+instruction/data) that define each behavioral direction. Each condition
+produces a set of z-coordinates that, after copula decomposition, reveal
+where the conditions separate in (S, u, v) space.
+
+Requirements:
+- **Composition**: Population must cover the range of normal inputs the
+  detector will see in production. Contrast pairs must be clearly distinct
+  along their target behavioral direction.
 - **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
-  Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
-  have diminishing returns.
- **Diversity**: Must cover the range of normal inputs the detector will see
-  in production. A narrow calibration dataset (e.g., only short English
-  queries) will produce high false positive rates on unusual but benign inputs.
- **Model-specific**: A calibration dataset must be collected for each detector
+  Practical range: 1,000–10,000 prompts for population. Each contrast
+  condition needs at least 50–200 prompts.
+- **Diversity**: A narrow population (e.g., only short English queries) will
+  produce high false positive rates on unusual but benign inputs.
+- **Model-specific**: Calibration data must be collected for each detector
  model by running that model on the inputs and extracting activations.

 The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
-automates calibration dataset processing.
+automates calibration dataset processing with `max_length=128` tokens per
+prompt.

 ### Codebook Compilation

 The codebook is compiled offline by a training pipeline that:

-1. Runs the detector model on a calibration dataset (diverse normal inputs)
-2. Extracts hidden state activations at configured layers
-3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
-   deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
-   which uses randomized approximation and may not be deterministic)
-4. Fits spline distributions along each retained dimension
-5. Computes detection thresholds
-6. Serializes the codebook to a portable format (safetensors + JSON config)
+1. Runs the detector model on a calibration dataset (population + contrast
+   pairs)
+2. Extracts hidden state activations at configured layers for every token
+   position (not just last-token)
+3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
+   deterministic decomposition)
+4. Projects activations onto the top-3 SVD components → z-coordinates
+5. Fits spline distributions on each SVD dimension and the sum S
+6. Applies copula decomposition to all z-coordinates → (S, u, v) features
+7. Computes direction profiles (means, pooled std, Cohen's d) for each
+   contrast pair
+8. Trains logistic classifiers on (S, u, v) for each contrast pair
+9. Computes detection thresholds (midpoints between condition means)
+10. Serializes the codebook to a portable format (safetensors + JSON config)

 This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
 package** as package data (under `src/alknet_firewall/data/codebook/`). This
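Review note on compilation steps 7 and 9 in the hunk above: the per-feature statistics of a direction profile can be sketched as follows. This is an illustration under assumptions, not the PoC code. The pooled standard deviation uses the usual sample-variance pooling formula, Cohen's d is taken as condition A minus condition B, and the function names and the nested output layout are hypothetical (the spec stores flat fields such as `sum_mean_a` and `cohen_d_u`).

```python
import numpy as np


def feature_profile(a: np.ndarray, b: np.ndarray) -> dict:
    """Statistics for one feature (S, u, or v) across two contrast conditions."""
    n_a, n_b = a.size, b.size
    mean_a, mean_b = float(a.mean()), float(b.mean())
    # Pooled variance from the two sample variances (assumed pooling formula).
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (
        n_a + n_b - 2
    )
    std_pooled = float(np.sqrt(pooled_var))
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std_pooled,
        "cohen_d": (mean_a - mean_b) / std_pooled,
        # Midpoint between condition means: the rule-based fallback threshold.
        "threshold": 0.5 * (mean_a + mean_b),
    }


def direction_profile(label: str, feats_a: dict, feats_b: dict) -> dict:
    """Profile all three features for one contrast pair."""
    profile = {"label": label}
    for name in ("S", "u", "v"):
        profile[name] = feature_profile(feats_a[name], feats_b[name])
    return profile


# Toy values chosen so the statistics are easy to check by hand.
cond_a = {k: np.array([1.0, 2.0, 3.0]) for k in ("S", "u", "v")}
cond_b = {k: np.array([3.0, 4.0, 5.0]) for k in ("S", "u", "v")}
profile = direction_profile("refusal", cond_a, cond_b)
```

With these toy values both conditions have unit sample variance, so the pooled standard deviation is 1, Cohen's d is -2, and the midpoint threshold is 3 for every feature.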
@@ -224,8 +328,10 @@ The codebook is stored as:
 codebook/
 ├── basis.safetensors      # SVD basis vectors (n_layers × n_dims × hidden_dim)
 ├── regions.safetensors    # Region boundary parameters
+├── classifiers.safetensors # Logistic classifier weights per direction
 ├── splines.json           # Spline knot positions and coefficients
-└── config.json            # Metadata: model_id, revision, n_dims, thresholds
+├── profiles.json          # Direction profiles (means, stds, Cohen's d)
+└── config.json            # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
 ```

 All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
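Review note on `classifiers.safetensors` in the layout above: at inference time the stored weights would be applied to smoothed (S, u, v) features and the result thresholded per direction. The sketch below is a guess at that flow under stated assumptions, not the PoC's `classify()` or `detect()`. The edge handling of the rolling average (averaging only over positions that exist), the feature keys `"S"`, `"u"`, `"v"`, and the extra `"flagged"` key are all hypothetical; the defaults `window=8`, `threshold_prob=0.7`, and `min_positions=3` are taken from the documented signatures.

```python
import numpy as np


def rolling_mean(x: np.ndarray, window: int) -> np.ndarray:
    """Uniform-kernel rolling average; window=1 leaves the sequence unchanged."""
    x = x.astype(float)
    window = min(window, x.size)
    if window <= 1:
        return x
    kernel = np.ones(window)
    total = np.convolve(x, kernel, mode="same")
    # Count of real positions under the kernel, so edges are not biased to zero.
    count = np.convolve(np.ones_like(x), kernel, mode="same")
    return total / count


def classify_direction(
    features: dict,
    weights: tuple,
    intercept: float,
    window: int = 8,
    threshold_prob: float = 0.7,
    min_positions: int = 3,
) -> dict:
    """Smooth (S, u, v), apply one logistic classifier, and flag the direction."""
    s = rolling_mean(features["S"], window)
    u = rolling_mean(features["u"], window)
    v = rolling_mean(features["v"], window)
    logit = weights[0] * s + weights[1] * u + weights[2] * v + intercept
    prob = 1.0 / (1.0 + np.exp(-logit))  # P(active) per token position
    return {
        "prob": prob,
        "mean_prob": float(prob.mean()),
        "max_prob": float(prob.max()),
        # Flag only when enough positions exceed the probability threshold.
        "flagged": bool((prob >= threshold_prob).sum() >= min_positions),
    }


# Constant toy features over 10 token positions.
toy = {"S": np.full(10, 2.0), "u": np.full(10, 0.5), "v": np.full(10, 0.3)}
active = classify_direction(toy, weights=(1.0, 0.0, 0.0), intercept=-1.0)
quiet = classify_direction(toy, weights=(1.0, 0.0, 0.0), intercept=-2.0)
```

In the toy case the first call has a logit of 1 at every position, so every position exceeds 0.7 and the direction is flagged; the second has a logit of 0 (probability 0.5) and is not.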
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
 | `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |

+**classifiers.safetensors**:
+| Key | Shape | Dtype | Description |
+|-----|-------|-------|-------------|
+| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
+| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
+| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
+| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
+
 **splines.json**:
 | Field | Type | Description |
 |-------|------|-------------|
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
 | `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
 | `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |

+**profiles.json**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
+| `contrast_pairs` | `list[tuple[str, str, str]]` | (cond_a, cond_b, label) tuples |
+
+Each `DirectionProfile` entry contains:
+| Field | Type | Description |
+|-------|------|-------------|
+| `label` | `str` | Direction name (e.g., "refusal") |
+| `sum_mean_a` | `float` | Mean S for condition A |
+| `sum_mean_b` | `float` | Mean S for condition B |
+| `sum_std_pooled` | `float` | Pooled std of S |
+| `u_mean_a` | `float` | Mean u for condition A |
+| `u_mean_b` | `float` | Mean u for condition B |
+| `u_std_pooled` | `float` | Pooled std of u |
+| `v_mean_a` | `float` | Mean v for condition A |
+| `v_mean_b` | `float` | Mean v for condition B |
+| `v_std_pooled` | `float` | Pooled std of v |
+| `cohen_d_sum` | `float` | Effect size for S |
+| `cohen_d_u` | `float` | Effect size for u |
+| `cohen_d_v` | `float` | Effect size for v |
+
 ## Interfaces

 ### Internal API
@@ -262,18 +399,54 @@ class CodebookConfig:
    model_revision: str
    n_dimensions: int
    layers: list[int]
-    suspicious_threshold: float    # Serialized threshold values
-    dangerous_threshold: float     # (mapped to Thresholds dataclass at runtime)
+    suspicious_threshold: float
+    dangerous_threshold: float
+    contrast_pairs: list[tuple[str, str, str]]  # (cond_a, cond_b, label)
+    smoothing_window: int = 8                   # Token-level smoothing (inference only)

 class Codebook:
    def __init__(self, path: Path): ...

    def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
-        """Project raw activations onto SVD basis → z-coordinates."""
+        """Project raw activations onto SVD basis → z-coordinates.
+        
+        Returns: (seq_len, 3) z-coordinates.
+        """
        ...

-    def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
-        """Score z-coordinates against behavioral regions."""
+    def decompose(self, z_coords: np.ndarray) -> dict:
+        """Copula decomposition: z → CDF → (S, u, v).
+        
+        Args:
+            z_coords: (seq_len, 3) or (N, 3) z-coordinates
+        
+        Returns:
+            dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
+        """
+        ...
+
+    def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
+        """Classify decomposed features using logistic classifiers.
+        
+        Args:
+            features: Output of decompose(), with (seq_len,) arrays
+            window: Smoothing window size (1 = no smoothing)
+        
+        Returns:
+            dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
+        """
+        ...
+
+    def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
+               min_positions: int = 3, window: int = 8) -> DetectionResult:
+        """Detection pipeline on projected z-coordinates: decompose → smooth → classify → flag.
+        
+        Args:
+            z_coords: (seq_len, 3) z-coordinates for a single input
+            threshold_prob: P(active) threshold for flagging a direction
+            min_positions: Minimum token positions above threshold to flag
+            window: Smoothing window for token-level features
+        """
        ...

    @classmethod