docs: add copula decomposition pipeline, clarify detection data flow

The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:

- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect
  methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification
  steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline

@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
 "Please summarize this document: [hidden injection payload]"

 2. Tokenize
-tokenizer.encode(input) → input_ids
+tokenizer.encode(input) → input_ids (shape: seq_len)

 3. Detector Model Inference
-model(input_ids) → hidden_states at key layers
+model(input_ids, output_hidden_states=True) → hidden_states at key layers

 4. Activation Extraction
-Extract hidden states from configured layers (early + mid)
+Extract last-token hidden states from configured layers (early + mid)
+hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

 5. SVD Projection
 Project activations onto precomputed SVD basis
-z_coords = svd_basis @ activation_vector
+z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates

-6. Codebook Comparison
-For each SVD dimension:
-- Compute distance from normal behavioral region
-- Apply spline scoring (monotonic distribution)
-- Aggregate multi-dimensional signals
+6. Copula Decomposition
+Transform z-coordinates through CDF → simplex → barycentric:
+z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
+→ (u, v) via barycentric (position on simplex)

-7. Alarm Generation
-Combine per-dimension signals → overall alarm
-AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
-Include per-dimension breakdown for interpretability
+7. Token-Level Smoothing (optional)
+Apply rolling average to (S, u, v) features across token positions
+window=8: smooths per-token signals, reduces noise from single-token spikes
+
+8. Direction Classification
+For each behavioral direction (refusal, injection, etc.):
+logistic_classifier(S, u, v) → P(active | features) per token position
+
+9. Aggregation
+Per direction: mean P(active), max P(active), fraction above threshold
+Flag if any direction exceeds threshold for sufficient token positions
+
+10. Alarm Generation
+Combine per-direction signals → overall alarm
+AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
+Include per-direction breakdown for interpretability
```

+Note: Step 4 extracts only the last token in Phase 1. The full pipeline
+(Phase 2) extracts per-token activations, enabling the token-level smoothing
+and per-position classification in steps 7–9.
+
 ## Key Concepts

 ### Behavioral Alarm
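
Steps 6–8 of the updated pipeline in the hunk above (copula decomposition, token-level smoothing, direction classification) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PoC implementation: the diff does not say which CDF is used, so a standard normal CDF stands in; `(u, v)` is taken to be the first two barycentric weights of `x / S`; the rolling average uses a trailing window; and the function names, classifier weights, and bias are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF (assumption: stands in for the codebook's fitted CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def decompose(z_seq):
    """Step 6: map per-token z-coordinates (seq_len x 3) to (S, u, v) tuples."""
    features = []
    for z in z_seq:
        x = [normal_cdf(zi) for zi in z]   # each x_i lies in [0, 1]
        s = sum(x)                         # scale S = x0 + x1 + x2
        if s == 0.0:                       # every CDF underflowed: fall back to the centroid
            u = v = 1.0 / 3.0
        else:
            u, v = x[0] / s, x[1] / s      # third barycentric weight is 1 - u - v
        features.append((s, u, v))
    return features


def smooth(features, window=8):
    """Step 7: trailing rolling mean over token positions (shorter window at the start)."""
    smoothed = []
    for t in range(len(features)):
        chunk = features[max(0, t - window + 1): t + 1]
        smoothed.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return smoothed


def p_active(feature, weights, bias):
    """Step 8: logistic classifier P(active | S, u, v) for one token position."""
    logit = bias + sum(w * f for w, f in zip(weights, feature))
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)                    # stable form for negative logits
    return e / (1.0 + e)


# Hypothetical inputs: three token positions and made-up classifier parameters.
z_seq = [(0.0, 0.0, 0.0), (1.5, -0.5, 0.2), (2.0, -1.0, 0.1)]
feats = smooth(decompose(z_seq), window=8)
probs = [p_active(f, weights=(1.0, 2.0, -1.0), bias=-1.5) for f in feats]
```

In Phase 1 (last-token extraction) `z_seq` would hold a single position, so the smoothing step is a no-op; the per-token sequence only becomes meaningful in Phase 2.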
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.

 ### Score Composition

-The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
-using a weighted maximum:
+The overall `Alarm.score` (0.0–1.0) is computed from per-direction
+classification results. For each behavioral direction, the logistic
+classifier produces P(active | features) for every token position. The
+alarm score aggregates these across directions:

```
-score = max(w_d * signal_d for d in dimensions)
+direction_score = max(P(active) across token positions)
+score = max(w_d * direction_score_d for d in directions)
```

-Where `w_d` are dimension weights (default: equal, configurable in
-`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
-single strongly anomalous dimension can trigger an alarm even if other
-dimensions are normal. This is critical for catching attacks that exploit
-specific behavioral patterns (e.g., refusal-suppression) while leaving other
-dimensions unaffected.
+Where `w_d` are direction weights (default: equal, configurable in
+`Thresholds.per_dimension`). Using `max` at both levels ensures that:
+- A single strongly anomalous direction can trigger an alarm even if other
+directions are normal
+- A sustained behavioral signal at any token position surfaces in the alarm
+
+This is critical for catching attacks that exploit specific behavioral
+patterns (e.g., refusal-suppression) while leaving other directions
+unaffected.

 The `suspicious` and `dangerous` thresholds are applied to this composite
 score to determine `Alarm.level`.
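
The two-level maximum and the threshold mapping described in this hunk can be written out as below. This is a sketch only: `alarm_score` and `alarm_level` are hypothetical helper names, the enum values and the 0.5 / 0.8 thresholds are placeholders, and treating the thresholds as inclusive lower bounds is an assumption. The diff states none of these.

```python
from enum import Enum


class AlarmLevel(Enum):
    # Level names come from the spec; the string values are assumed.
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def alarm_score(p_active_by_direction, weights=None):
    """score = max over directions d of w_d * max over token positions of P(active)."""
    weights = weights or {}
    return max(
        (
            weights.get(name, 1.0) * max(probs)   # default weight: equal (1.0)
            for name, probs in p_active_by_direction.items()
            if probs                              # skip directions with no positions
        ),
        default=0.0,
    )


def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map the composite score to a level (threshold values are placeholders)."""
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# Hypothetical per-token probabilities for two directions.
per_token = {"refusal": [0.1, 0.9], "injection": [0.2, 0.3]}
score = alarm_score(per_token)
level = alarm_level(score)
```

Because both levels use `max`, a single high-probability token position in one direction is enough to raise the score; the mean and count-above-threshold statistics from the aggregation step do not enter this formula.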
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.

 | Level | Meaning | Action |
 |-------|---------|--------|
-| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
-| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
-| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
+| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
+| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
+| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |

 ### Latency Budget

@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
 | Model inference (125M, CPU) | ~5ms |
 | Activation extraction | ~0.1ms |
 | SVD projection | ~0.1ms |
-| Codebook comparison | ~0.3ms |
+| Copula decomposition | ~0.05ms |
+| Token-level smoothing | ~0.05ms |
+| Direction classification | ~0.1ms |
 | **Total** | **~6ms** |

 ## Interfaces
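
The aggregation step (per direction: mean P(active), max P(active), positions above threshold) lines up with the per-direction fields this commit adds to `DimensionSignal`. A rough sketch of that mapping follows; `DirectionSummary`, `summarize`, `is_flagged`, the 0.5 threshold, and the 0.25 minimum fraction are all hypothetical stand-ins rather than names or values from the spec.

```python
from dataclasses import dataclass


@dataclass
class DirectionSummary:
    # Hypothetical record mirroring the aggregated fields of DimensionSignal.
    direction: str
    max_score: float
    mean_score: float
    n_positions_above: int


def summarize(direction, probs, threshold=0.5):
    """Aggregate per-token P(active) values for one behavioral direction."""
    if not probs:
        raise ValueError("need at least one token position")
    above = sum(1 for p in probs if p > threshold)
    return DirectionSummary(
        direction=direction,
        max_score=max(probs),
        mean_score=sum(probs) / len(probs),
        n_positions_above=above,
    )


def is_flagged(summary, seq_len, min_fraction=0.25):
    """Flag when enough token positions exceed the threshold (fraction is assumed)."""
    return summary.n_positions_above / seq_len >= min_fraction


# Hypothetical per-token probabilities for one direction.
probs = [0.2, 0.6, 0.9, 0.7]
summary = summarize("injection", probs)
flagged = is_flagged(summary, seq_len=len(probs))
```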
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):

 @dataclass
 class DimensionSignal:
-    dimension: int
-    deviation: float
-    score: float
+    direction: str  # Behavioral direction name (e.g., "refusal", "injection")
+    score: float  # P(active) for this direction
+    max_score: float  # Max P(active) across token positions
+    mean_score: float  # Mean P(active) across token positions
+    n_positions_above: int  # Token positions above threshold
     direction_label: str | None

 @dataclass