docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, and Token-Level Smoothing section; add classifier weights and direction profiles to the data format; update Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
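
For readers of this commit, the decomposition chain named above (z → CDF → simplex → barycentric → (S, u, v)) and the rolling-average smoothing can be sketched roughly as follows. This is a minimal illustration, not the PoC implementation: the standard normal CDF, the triangle vertices in `TRIANGLE`, the trailing-window convention, and the function names `normal_cdf`, `decompose`, and `smooth` are all assumptions made for this example.

```python
import math
import numpy as np

# Assumption: each z-coordinate is roughly standard normal, so the standard
# normal CDF stands in for the marginal CDF. The PoC may use a fitted or
# empirical CDF instead.
_erf = np.vectorize(math.erf)

def normal_cdf(z):
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

# Hypothetical vertices of an equilateral triangle used to embed the
# barycentric weights in 2-D.
TRIANGLE = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, math.sqrt(3.0) / 2.0]])

def decompose(z, eps=1e-12):
    """Map (seq_len, 3) z-coordinates to (seq_len, 3) rows of (S, u, v)."""
    x = normal_cdf(np.asarray(z, dtype=float))  # each x_i lies in (0, 1)
    s = x.sum(axis=1)                           # scale S = x0 + x1 + x2
    w = x / np.maximum(s, eps)[:, None]         # barycentric weights, rows sum to 1
    uv = w @ TRIANGLE                           # position inside the triangle
    return np.column_stack([s, uv])

def smooth(features, window=8):
    """Trailing rolling mean over token positions (shorter window at the start)."""
    features = np.asarray(features, dtype=float)
    zeros = np.zeros((1, features.shape[1]))
    csum = np.cumsum(np.vstack([zeros, features]), axis=0)
    idx = np.arange(1, features.shape[0] + 1)
    start = np.maximum(idx - window, 0)
    return (csum[idx] - csum[start]) / (idx - start)[:, None]
```

Under these assumptions, an all-zero z sequence maps to S = 1.5 with (u, v) at the triangle centroid, which is a convenient sanity check for any real implementation of `decompose`.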
2026-06-13 08:17:09 +00:00
parent 7d8a39a88a
commit 45a0e0798c
4 changed files with 300 additions and 72 deletions
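
The two-level `max` score composition that the firewall.md changes in this commit describe (max over token positions per direction, then weighted max over directions) might look like the sketch below. The function names `alarm_score` and `alarm_level`, the dictionary input format, and the threshold values 0.5 and 0.8 are illustrative assumptions, not values taken from the specs.

```python
import numpy as np

def alarm_score(p_active, weights=None):
    """Compose an overall score from per-token P(active) per direction.

    p_active: dict mapping direction name -> sequence of per-token P(active).
    weights:  optional dict of direction weights w_d (default: equal, 1.0).
    """
    per_direction = {}
    for name, probs in p_active.items():
        w = 1.0 if weights is None else weights.get(name, 1.0)
        # direction_score = max(P(active) across token positions)
        per_direction[name] = w * float(np.max(probs))
    # score = max(w_d * direction_score_d for d in directions)
    return max(per_direction.values()), per_direction

def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map a composite score to a level name (thresholds are hypothetical)."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"
```

For example, with `{"refusal": [0.1, 0.9, 0.2], "injection": [0.3, 0.4, 0.2]}` and equal weights, the single high refusal position drives the composite score even though the injection direction stays low, which is the behavior the `max`-based design is meant to provide.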
--- a/docs/architecture/firewall.md
+++ b/docs/architecture/firewall.md
@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
   "Please summarize this document: [hidden injection payload]"

 2. Tokenize
-   tokenizer.encode(input) → input_ids
+   tokenizer.encode(input) → input_ids  (shape: seq_len)

 3. Detector Model Inference
-   model(input_ids) → hidden_states at key layers
+   model(input_ids, output_hidden_states=True) → hidden_states at key layers

 4. Activation Extraction
-   Extract hidden states from configured layers (early + mid)
+   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors

 5. SVD Projection
   Project activations onto precomputed SVD basis
-   z_coords = svd_basis @ activation_vector
+   z_coords = V^T @ (activation - mean)  → (seq_len, 3) z-coordinates

-6. Codebook Comparison
-   For each SVD dimension:
-     - Compute distance from normal behavioral region
-     - Apply spline scoring (monotonic distribution)
-     - Aggregate multi-dimensional signals
+6. Copula Decomposition
+   Transform z-coordinates through CDF → simplex → barycentric:
+   z → (x₀, x₁, x₂) via CDF  →  S = x₀+x₁+x₂ (scale)
+                               →  (u, v) via barycentric (position on simplex)

-7. Alarm Generation
-   Combine per-dimension signals → overall alarm
-   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
-   Include per-dimension breakdown for interpretability
+7. Token-Level Smoothing (optional)
+   Apply rolling average to (S, u, v) features across token positions
+   window=8: smooths per-token signals, reduces noise from single-token spikes
+
+8. Direction Classification
+   For each behavioral direction (refusal, injection, etc.):
+     logistic_classifier(S, u, v) → P(active | features) per token position
+
+9. Aggregation
+   Per direction: mean P(active), max P(active), fraction above threshold
+   Flag if any direction exceeds threshold for sufficient token positions
+
+10. Alarm Generation
+    Combine per-direction signals → overall alarm
+    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+    Include per-direction breakdown for interpretability
 ```

+Note: Step 4 extracts only the last token in Phase 1. The full pipeline
+(Phase 2) extracts per-token activations, enabling the token-level smoothing
+and per-position classification in steps 7–9.
+
 ## Key Concepts

 ### Behavioral Alarm
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.

 ### Score Composition

-The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
-using a weighted maximum:
+The overall `Alarm.score` (0.0–1.0) is computed from per-direction
+classification results. For each behavioral direction, the logistic
+classifier produces P(active | features) for every token position. The
+alarm score aggregates these across directions:

 ```
-score = max(w_d * signal_d for d in dimensions)
+direction_score = max(P(active) across token positions)
+score = max(w_d * direction_score_d for d in directions)
 ```

-Where `w_d` are dimension weights (default: equal, configurable in
-`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
-single strongly anomalous dimension can trigger an alarm even if other
-dimensions are normal. This is critical for catching attacks that exploit
-specific behavioral patterns (e.g., refusal-suppression) while leaving other
-dimensions unaffected.
+Where `w_d` are direction weights (default: equal, configurable in
+`Thresholds.per_dimension`). Using `max` at both levels ensures that:
+- A single strongly anomalous direction can trigger an alarm even if other
+  directions are normal
+- A sustained behavioral signal at any token position surfaces in the alarm
+
+This is critical for catching attacks that exploit specific behavioral
+patterns (e.g., refusal-suppression) while leaving other directions
+unaffected.

 The `suspicious` and `dangerous` thresholds are applied to this composite
 score to determine `Alarm.level`.
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.

 | Level | Meaning | Action |
 |-------|---------|--------|
-| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
-| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
-| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
+| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
+| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
+| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |

 ### Latency Budget

@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
 | Model inference (125M, CPU) | ~5ms |
 | Activation extraction | ~0.1ms |
 | SVD projection | ~0.1ms |
-| Codebook comparison | ~0.3ms |
+| Copula decomposition | ~0.05ms |
+| Token-level smoothing | ~0.05ms |
+| Direction classification | ~0.1ms |
 | **Total** | **~6ms** |

 ## Interfaces
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):

 @dataclass
 class DimensionSignal:
-    dimension: int
-    deviation: float
-    score: float
+    direction: str              # Behavioral direction name (e.g., "refusal", "injection")
+    score: float                # P(active) for this direction
+    max_score: float            # Max P(active) across token positions
+    mean_score: float           # Mean P(active) across token positions
+    n_positions_above: int      # Token positions above threshold
     direction_label: str | None

 @dataclass