docs: add copula decomposition pipeline, clarify detection data flow
The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
@@ -55,6 +55,16 @@ z-coordinates are raw (unnormalized) projections. The codebook's spline
|
|||||||
distributions are calibrated for this scale, so threshold values in the
|
distributions are calibrated for this scale, so threshold values in the
|
||||||
codebook are specific to the z-coordinate range of the calibration data.
|
codebook are specific to the z-coordinate range of the calibration data.
|
||||||
|
|
||||||
|
**Training shape**: `(N, 3)` where N is the total number of token positions
|
||||||
|
across all calibration prompts. Each token position produces its own
|
||||||
|
z-coordinate, so the population data is a flattened collection of per-token
|
||||||
|
z-vectors.
|
||||||
|
|
||||||
|
**Inference shape**: `(seq_len, 3)` for a single input. Each token position
|
||||||
|
in the input sequence produces a z-coordinate. The detection pipeline
|
||||||
|
operates on this per-token sequence, optionally smoothing it before
|
||||||
|
classification.
|
||||||
|
|
||||||
### SVD Basis
|
### SVD Basis
|
||||||
|
|
||||||
Singular Value Decomposition of the activation space from a calibration dataset
|
Singular Value Decomposition of the activation space from a calibration dataset
|
||||||
@@ -83,11 +93,92 @@ Inputs whose projections fall within the normal region score low (CLEAR).
|
|||||||
Inputs whose projections fall near or beyond the region boundary score
|
Inputs whose projections fall near or beyond the region boundary score
|
||||||
increasingly high (SUSPICIOUS → DANGEROUS).
|
increasingly high (SUSPICIOUS → DANGEROUS).
|
||||||
|
|
||||||
|
### Copula Decomposition
|
||||||
|
|
||||||
|
Raw z-coordinates are not the detection feature space. The codebook
|
||||||
|
decomposes z-coordinates through a copula transform that separates **scale**
|
||||||
|
(how far from normal) from **position** (which behavioral direction):
|
||||||
|
|
||||||
|
```
|
||||||
|
z → CDF → (x₀, x₁, x₂) # Uniform marginals via CDF transform
|
||||||
|
→ S = x₀ + x₁ + x₂ # Scale: total CDF magnitude
|
||||||
|
→ x_norm = simplex(x) # Normalize to probability simplex
|
||||||
|
→ (u, v) = barycentric(x_norm) # Position: 2D simplex coordinates
|
||||||
|
```
|
||||||
|
|
||||||
|
The three derived features `(S, u, v)` form the actual detection space:
|
||||||
|
|
||||||
|
- **S (scale)**: How far the input's z-coordinates deviate from the
|
||||||
|
population norm, aggregated across all three SVD dimensions. High S means
|
||||||
|
the input is anomalous in *magnitude*.
|
||||||
|
- **u, v (position)**: Where the input sits on the behavioral simplex —
|
||||||
|
which *direction* the deviation points. Different behavioral patterns
|
||||||
|
(refusal, instruction-following, self-reference) separate along different
|
||||||
|
(u, v) axes.
|
||||||
|
|
||||||
|
This decomposition is why the codebook can distinguish "this input activates
|
||||||
|
the refusal direction" from "this input is just generally unusual" — the same
|
||||||
|
S value with different (u, v) coordinates implies different behavioral
|
||||||
|
patterns.
|
||||||
|
|
||||||
|
The PoC's `decompose()` method implements this pipeline as a pure function.
|
||||||
|
It is called both during codebook compilation (to compute direction
|
||||||
|
profiles) and during inference (to transform new z-coordinates for
|
||||||
|
classification).
|
||||||
|
|
||||||
|
### Direction Profiles and Contrast Pairs
|
||||||
|
|
||||||
|
The codebook doesn't just detect "anomalous" — it detects specific behavioral
|
||||||
|
**directions**. Each direction is defined by a contrast pair of conditions:
|
||||||
|
|
||||||
|
| Contrast Pair | Condition A | Condition B | Behavioral Direction |
|---------------|-------------|-------------|---------------------|
| refusal | harmful | harmless | Refusal activation |
| instruction_vs_data | instruction | data | Instruction-following |
| tool_call | tool_call | natural_language | Tool call patterns |
| self_vs_other | self_ref | other_ref | Self-reference |
| semantic_violation | violated | expected | Semantic norm violation |
| uncertainty | uncertain | confident | Uncertainty expression |
| injection | injection | benign_instruction | Prompt injection |
|
||||||
|
|
||||||
|
For each contrast pair, the codebook computes a **DirectionProfile** — the
|
||||||
|
statistical baseline (means, pooled standard deviations, Cohen's d) of the
|
||||||
|
(S, u, v) features for both conditions. This enables:
|
||||||
|
|
||||||
|
1. **DirectionClassifier**: A logistic regression trained on the (S, u, v)
|
||||||
|
features of condition A vs condition B. Produces P(active | features) —
|
||||||
|
the probability that the input exhibits the "active" behavioral pattern.
|
||||||
|
2. **Thresholds**: Midpoints between condition means for each feature, used
|
||||||
|
for interpretable rule-based detection as a fallback.
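
As a rough sketch of the per-feature statistics in a direction profile, the function below computes the condition means, a pooled standard deviation, Cohen's d, and the midpoint threshold for one feature (S, u, or v). The pooled-variance formula is the usual textbook one and is an assumption about what the PoC does; `direction_stats` is a hypothetical name.

```python
import numpy as np


def direction_stats(a: np.ndarray, b: np.ndarray) -> dict:
    """Profile one feature for condition A vs condition B (1D arrays)."""
    mean_a, mean_b = float(a.mean()), float(b.mean())
    n_a, n_b = a.size, b.size
    # Textbook pooled variance (assumed, not confirmed against the PoC).
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    std_pooled = float(np.sqrt(pooled_var))
    cohen_d = (mean_a - mean_b) / std_pooled if std_pooled > 0 else 0.0
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "std_pooled": std_pooled,
        "cohen_d": cohen_d,
        # Rule-based fallback threshold: midpoint between condition means.
        "threshold": 0.5 * (mean_a + mean_b),
    }
```

Running this once per feature and per contrast pair would yield the means, pooled standard deviations, and effect sizes that a profile stores.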
|
||||||
|
|
||||||
|
### Token-Level Smoothing
|
||||||
|
|
||||||
|
During inference, the z-coordinates form a sequence of shape `(seq_len, 3)`.
|
||||||
|
The detection pipeline optionally applies a rolling average (uniform kernel)
|
||||||
|
to the decomposed (S, u, v) features before classification:
|
||||||
|
|
||||||
|
- **window=1**: No smoothing. Each token position classified independently.
|
||||||
|
- **window=8** (PoC default): Smooth features across 8 token positions.
|
||||||
|
Reduces noise from individual token fluctuations while preserving
|
||||||
|
sustained behavioral signals.
|
||||||
|
|
||||||
|
Smoothing is an inference-time parameter — it does not affect codebook
|
||||||
|
compilation or thresholds. The codebook is calibrated on per-token
|
||||||
|
z-coordinates (all positions from the calibration data, flattened into
|
||||||
|
`(N, 3)`), so the classifier weights are valid regardless of the smoothing
|
||||||
|
window used at inference time.
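
A minimal sketch of the rolling average, assuming a centered uniform kernel and renormalization at the sequence edges. The edge handling is an assumption, since the text does not specify it, and `smooth_features` is a hypothetical name.

```python
import numpy as np


def smooth_features(features: np.ndarray, window: int = 8) -> np.ndarray:
    """Rolling mean over token positions for a (seq_len, n_features) array."""
    seq_len = features.shape[0]
    window = min(window, seq_len)  # keep the output length equal to seq_len
    if window <= 1:
        return features.copy()     # window=1: no smoothing
    kernel = np.ones(window)
    # Number of real positions under the kernel at each index, so edge
    # positions are averaged over fewer tokens instead of padded zeros.
    counts = np.convolve(np.ones(seq_len), kernel, mode="same")
    columns = [
        np.convolve(features[:, j], kernel, mode="same") / counts
        for j in range(features.shape[1])
    ]
    return np.stack(columns, axis=1)
```

Because only the features are averaged, the classifier weights and thresholds stored in the codebook are used unchanged for any window size.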
|
||||||
|
|
||||||
### Spline Distributions
|
### Spline Distributions
|
||||||
|
|
||||||
Monotonic spline distributions model the probability density along each SVD
|
Monotonic spline distributions model the probability density along each SVD
|
||||||
dimension (ADR-010). They provide:
|
dimension (ADR-010). They serve two roles in the codebook:
|
||||||
|
|
||||||
|
1. **CDF transform**: The copula decomposition requires mapping z-coordinates
|
||||||
|
to uniform marginals via the CDF. The spline CDF provides this transform.
|
||||||
|
2. **Scale distribution**: A separate spline distribution models the
|
||||||
|
sum S = x₀ + x₁ + x₂, providing the CDF transform for the scale feature.
|
||||||
|
|
||||||
|
They provide:
|
||||||
- **Smooth scoring**: Continuous score rather than hard threshold
|
- **Smooth scoring**: Continuous score rather than hard threshold
|
||||||
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
|
- **Tail sensitivity**: Exponential tail behavior captures rare-but-critical
|
||||||
anomalous inputs
|
anomalous inputs
|
||||||
@@ -97,52 +188,65 @@ dimension (ADR-010). They provide:
|
|||||||
adversarial training
|
adversarial training
|
||||||
|
|
||||||
The spline distribution approach is adapted from the metaspline PoC
|
The spline distribution approach is adapted from the metaspline PoC
|
||||||
(`spline.py`, `transform.py`, `space.py` — ~280 lines total).
|
(`spline.py` — `SplineDistribution` class, ~378 lines).
|
||||||
|
|
||||||
**Formal definition**: The CDF along each dimension is modeled as a monotonic
|
**Formal definition**: The CDF along each dimension is modeled as a monotonic
|
||||||
cubic spline with 10–20 knots. Knot positions are determined by quantiles of
|
cubic spline with knots (typically 10–64, depending on calibration data
|
||||||
the calibration data (ensuring density of knots where data is dense). Beyond
|
size). Knot positions are determined by quantiles of the calibration data
|
||||||
the extreme knots, the CDF decays exponentially at a rate fitted to the tail
|
(ensuring density of knots where data is dense). Beyond the extreme knots,
|
||||||
data. The scoring function maps a z-coordinate to a score in [0, 1] via the
|
the CDF decays exponentially at a rate fitted to the tail data.
|
||||||
CDF's complement: `score = 1 - cdf(z)`.
|
|
||||||
|
|
||||||
**Canonical implementation**: The metaspline PoC files `spline.py`
|
|
||||||
(`SplineDistribution` class), `transform.py` (`dcs_norm`, simplex transforms),
|
|
||||||
and `space.py` (`unfold`/`fold`) are the reference implementation for the
|
|
||||||
codebook compilation pipeline.
|
|
||||||
|
|
||||||
### Calibration Dataset
|
### Calibration Dataset
|
||||||
|
|
||||||
The calibration dataset is the set of normal (non-adversarial) inputs used to
|
The calibration dataset serves two purposes: establishing the population
|
||||||
compute the SVD basis and fit behavioral region distributions. Requirements:
|
distribution (normal behavioral baseline) and providing contrast pairs
|
||||||
|
(labeled examples for each behavioral direction).
|
||||||
|
|
||||||
- **Composition**: Diverse normal inputs representative of the deployment
|
**Population data**: Diverse normal inputs representative of the deployment
|
||||||
domain. No adversarial examples — the basis models *normal* behavior, and
|
domain. No adversarial examples — the population models *normal* behavior,
|
||||||
anomalies are detected as deviations from it.
|
and anomalies are detected as deviations from it. Each prompt is processed
|
||||||
|
by the detector model, and z-coordinates are extracted at every token
|
||||||
|
position. The flattened `(N, 3)` tensor of all positions forms the population.
|
||||||
|
|
||||||
|
**Contrast data**: Labeled pairs of conditions (e.g., harmful/harmless,
|
||||||
|
instruction/data) that define each behavioral direction. Each condition
|
||||||
|
produces a set of z-coordinates that, after copula decomposition, reveal
|
||||||
|
where the conditions separate in (S, u, v) space.
|
||||||
|
|
||||||
|
Requirements:
|
||||||
|
- **Composition**: Population must cover the range of normal inputs the
|
||||||
|
detector will see in production. Contrast pairs must be clearly distinct
|
||||||
|
along their target behavioral direction.
|
||||||
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
|
- **Size**: At minimum, enough inputs to produce a stable SVD decomposition.
|
||||||
Practical range: 1,000–10,000 inputs. More inputs stabilize the basis but
|
Practical range: 1,000–10,000 prompts for population. Each contrast
|
||||||
have diminishing returns.
|
condition needs at least 50–200 prompts.
|
||||||
- **Diversity**: Must cover the range of normal inputs the detector will see
|
- **Diversity**: A narrow population (e.g., only short English queries) will
|
||||||
in production. A narrow calibration dataset (e.g., only short English
|
produce high false positive rates on unusual but benign inputs.
|
||||||
queries) will produce high false positive rates on unusual but benign inputs.
|
- **Model-specific**: Calibration data must be collected for each detector
|
||||||
- **Model-specific**: A calibration dataset must be collected for each detector
|
|
||||||
model by running that model on the inputs and extracting activations.
|
model by running that model on the inputs and extracting activations.
|
||||||
|
|
||||||
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
|
The codebook compilation pipeline (`run_manifold_projection.py` in the PoC)
|
||||||
automates calibration dataset processing.
|
automates calibration dataset processing with `max_length=128` tokens per
|
||||||
|
prompt.
|
||||||
|
|
||||||
### Codebook Compilation
|
### Codebook Compilation
|
||||||
|
|
||||||
The codebook is compiled offline by a training pipeline that:
|
The codebook is compiled offline by a training pipeline that:
|
||||||
|
|
||||||
1. Runs the detector model on a calibration dataset (diverse normal inputs)
|
1. Runs the detector model on a calibration dataset (population + contrast
|
||||||
2. Extracts hidden state activations at configured layers
|
pairs)
|
||||||
3. Computes SVD on the activation matrix (`scipy.linalg.svd` for exact,
|
2. Extracts hidden state activations at configured layers for every token
|
||||||
deterministic decomposition; not `sklearn.decomposition.TruncatedSVD`
|
position (not just last-token)
|
||||||
which uses randomized approximation and may not be deterministic)
|
3. Computes SVD on the perturbation vectors (`torch.linalg.svd` for exact,
|
||||||
4. Fits spline distributions along each retained dimension
|
deterministic decomposition)
|
||||||
5. Computes detection thresholds
|
4. Projects activations onto the top-3 SVD components → z-coordinates
|
||||||
6. Serializes the codebook to a portable format (safetensors + JSON config)
|
5. Fits spline distributions on each SVD dimension and the sum S
|
||||||
|
6. Applies copula decomposition to all z-coordinates → (S, u, v) features
|
||||||
|
7. Computes direction profiles (means, pooled std, Cohen's d) for each
|
||||||
|
contrast pair
|
||||||
|
8. Trains logistic classifiers on (S, u, v) for each contrast pair
|
||||||
|
9. Computes detection thresholds (midpoints between condition means)
|
||||||
|
10. Serializes the codebook to a portable format (safetensors + JSON config)
|
||||||
|
|
||||||
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
|
This pipeline is Phase 2. In Phase 1, the codebook is **bundled with the
|
||||||
package** as package data (under `src/alknet_firewall/data/codebook/`). This
|
package** as package data (under `src/alknet_firewall/data/codebook/`). This
|
||||||
@@ -224,8 +328,10 @@ The codebook is stored as:
|
|||||||
codebook/
|
codebook/
|
||||||
├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim)
|
├── basis.safetensors # SVD basis vectors (n_layers × n_dims × hidden_dim)
|
||||||
├── regions.safetensors # Region boundary parameters
|
├── regions.safetensors # Region boundary parameters
|
||||||
|
├── classifiers.safetensors # Logistic classifier weights per direction
|
||||||
├── splines.json # Spline knot positions and coefficients
|
├── splines.json # Spline knot positions and coefficients
|
||||||
└── config.json # Metadata: model_id, revision, n_dims, thresholds
|
├── profiles.json # Direction profiles (means, stds, Cohen's d)
|
||||||
|
└── config.json # Metadata: model_id, revision, n_dims, thresholds, contrast_pairs
|
||||||
```
|
```
|
||||||
|
|
||||||
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
|
All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
|
||||||
@@ -244,6 +350,14 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
|
|||||||
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
|
| `centroids` | `(n_layers, n_dims)` | float32 | Mean projection per dimension |
|
||||||
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
|
| `scale` | `(n_layers, n_dims)` | float32 | Standard deviation per dimension |
|
||||||
|
|
||||||
|
**classifiers.safetensors**:
|
||||||
|
| Key | Shape | Dtype | Description |
|-----|-------|-------|-------------|
| `weights_sum` | `(n_directions,)` | float32 | Logistic classifier weight for S feature |
| `weights_u` | `(n_directions,)` | float32 | Logistic classifier weight for u feature |
| `weights_v` | `(n_directions,)` | float32 | Logistic classifier weight for v feature |
| `intercepts` | `(n_directions,)` | float32 | Logistic classifier intercepts |
|
||||||
|
|
||||||
**splines.json**:
|
**splines.json**:
|
||||||
| Field | Type | Description |
|
| Field | Type | Description |
|
||||||
|-------|------|-------------|
|
|-------|------|-------------|
|
||||||
@@ -251,6 +365,29 @@ All tensor data uses safetensors format (ADR-005). Configuration uses JSON.
|
|||||||
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
|
| `coefficients` | `list[list[float]]` | Spline coefficients per dimension |
|
||||||
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
|
| `tail_decay` | `list[float]` | Exponential tail decay rate per dimension |
|
||||||
|
|
||||||
|
**profiles.json**:
|
||||||
|
| Field | Type | Description |
|-------|------|-------------|
| `directions` | `list[DirectionProfile]` | Per-direction statistical profiles |
| `contrast_pairs` | `list[[str, str, str]]` | (cond_a, cond_b, label) tuples |
|
||||||
|
|
||||||
|
Each `DirectionProfile` entry contains:
|
||||||
|
| Field | Type | Description |
|-------|------|-------------|
| `label` | `str` | Direction name (e.g., "refusal") |
| `sum_mean_a` | `float` | Mean S for condition A |
| `sum_mean_b` | `float` | Mean S for condition B |
| `sum_std_pooled` | `float` | Pooled std of S |
| `u_mean_a` | `float` | Mean u for condition A |
| `u_mean_b` | `float` | Mean u for condition B |
| `u_std_pooled` | `float` | Pooled std of u |
| `v_mean_a` | `float` | Mean v for condition A |
| `v_mean_b` | `float` | Mean v for condition B |
| `v_std_pooled` | `float` | Pooled std of v |
| `cohen_d_sum` | `float` | Effect size for S |
| `cohen_d_u` | `float` | Effect size for u |
| `cohen_d_v` | `float` | Effect size for v |
|
||||||
|
|
||||||
## Interfaces
|
## Interfaces
|
||||||
|
|
||||||
### Internal API
|
### Internal API
|
||||||
@@ -262,18 +399,54 @@ class CodebookConfig:
|
|||||||
model_revision: str
|
model_revision: str
|
||||||
n_dimensions: int
|
n_dimensions: int
|
||||||
layers: list[int]
|
layers: list[int]
|
||||||
suspicious_threshold: float # Serialized threshold values
|
suspicious_threshold: float
|
||||||
dangerous_threshold: float # (mapped to Thresholds dataclass at runtime)
|
dangerous_threshold: float
|
||||||
|
contrast_pairs: list[tuple[str, str, str]] # (cond_a, cond_b, label)
|
||||||
|
smoothing_window: int = 8 # Token-level smoothing (inference only)
|
||||||
|
|
||||||
class Codebook:
|
class Codebook:
|
||||||
def __init__(self, path: Path): ...
|
def __init__(self, path: Path): ...
|
||||||
|
|
||||||
def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
|
def project(self, activations: dict[int, np.ndarray]) -> np.ndarray:
|
||||||
"""Project raw activations onto SVD basis → z-coordinates."""
|
"""Project raw activations onto SVD basis → z-coordinates.
|
||||||
|
|
||||||
|
Returns: (seq_len, 3) z-coordinates.
|
||||||
|
"""
|
||||||
...
|
...
|
||||||
|
|
||||||
def score(self, z_coords: np.ndarray) -> list[DimensionSignal]:
|
def decompose(self, z_coords: np.ndarray) -> dict:
|
||||||
"""Score z-coordinates against behavioral regions."""
|
"""Copula decomposition: z → CDF → (S, u, v).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
z_coords: (seq_len, 3) or (N, 3) z-coordinates
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict with keys 'u_sum' (CDF of S), 'u' (barycentric u), 'v' (barycentric v)
|
||||||
|
"""
|
||||||
|
...
|
||||||
|
|
||||||
|
def classify(self, features: dict, window: int = 8) -> dict[str, dict]:
|
||||||
|
"""Classify decomposed features using logistic classifiers.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
features: Output of decompose(), with (seq_len,) arrays
|
||||||
|
window: Smoothing window size (1 = no smoothing)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
dict mapping direction name to {'prob', 'mean_prob', 'max_prob'}
|
||||||
|
"""
|
||||||
|
...
|
||||||
|
|
||||||
|
def detect(self, z_coords: np.ndarray, threshold_prob: float = 0.7,
|
||||||
|
min_positions: int = 3, window: int = 8) -> DetectionResult:
|
||||||
|
"""Detection pipeline on already-projected z-coordinates: decompose → smooth → classify → flag.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
z_coords: (seq_len, 3) z-coordinates for a single input
|
||||||
|
threshold_prob: P(active) threshold for flagging a direction
|
||||||
|
min_positions: Minimum token positions above threshold to flag
|
||||||
|
window: Smoothing window for token-level features
|
||||||
|
"""
|
||||||
...
|
...
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
|
|||||||
@@ -30,8 +30,14 @@ input.
|
|||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
Extract the last token's hidden state at each configured layer. This is
|
Extract the last token's hidden state at each configured layer as the Phase 1
|
||||||
standard for LLaMA-family models and provides full-sequence context.
|
default. This is standard for LLaMA-family models and provides full-sequence
|
||||||
|
context.
|
||||||
|
|
||||||
|
Phase 2 extends this to per-token extraction (hidden states at every position)
|
||||||
|
to enable token-level smoothing and per-position behavioral classification.
|
||||||
|
The training pipeline already uses per-token extraction for calibration data
|
||||||
|
collection.
|
||||||
|
|
||||||
## Consequences
|
## Consequences
|
||||||
|
|
||||||
@@ -40,6 +46,7 @@ standard for LLaMA-family models and provides full-sequence context.
|
|||||||
- Full sequence context via causal attention
|
- Full sequence context via causal attention
|
||||||
- Single vector per layer — simple to project and score
|
- Single vector per layer — simple to project and score
|
||||||
- No padding sensitivity (unlike mean pooling with attention masks)
|
- No padding sensitivity (unlike mean pooling with attention masks)
|
||||||
|
- Phase 1 simplification: reduces implementation complexity and latency
|
||||||
|
|
||||||
**Negative**:
|
**Negative**:
|
||||||
- Position-dependent — the last token's representation is influenced by its
|
- Position-dependent — the last token's representation is influenced by its
|
||||||
@@ -48,6 +55,8 @@ standard for LLaMA-family models and provides full-sequence context.
|
|||||||
activation patterns
|
activation patterns
|
||||||
- May miss patterns in long inputs where the adversarial payload is in the
|
- May miss patterns in long inputs where the adversarial payload is in the
|
||||||
middle rather than the end
|
middle rather than the end
|
||||||
|
- Phase 1 only: misses token-level behavioral signals that require per-token
|
||||||
|
extraction (addressed in Phase 2)
|
||||||
|
|
||||||
## References
|
## References
|
||||||
|
|
||||||
|
|||||||
@@ -30,31 +30,46 @@ activation patterns regardless of their surface form (ADR-002).
|
|||||||
"Please summarize this document: [hidden injection payload]"
|
"Please summarize this document: [hidden injection payload]"
|
||||||
|
|
||||||
2. Tokenize
|
2. Tokenize
|
||||||
tokenizer.encode(input) → input_ids
|
tokenizer.encode(input) → input_ids (shape: seq_len)
|
||||||
|
|
||||||
3. Detector Model Inference
|
3. Detector Model Inference
|
||||||
model(input_ids) → hidden_states at key layers
|
model(input_ids, output_hidden_states=True) → hidden_states at key layers
|
||||||
|
|
||||||
4. Activation Extraction
|
4. Activation Extraction
|
||||||
Extract hidden states from configured layers (early + mid)
|
Extract last-token hidden states from configured layers (early + mid)
|
||||||
hidden_states[layer_idx][:, -1, :] → per-layer activation vectors
|
hidden_states[layer_idx][:, -1, :] → per-layer activation vectors
|
||||||
|
|
||||||
5. SVD Projection
|
5. SVD Projection
|
||||||
Project activations onto precomputed SVD basis
|
Project activations onto precomputed SVD basis
|
||||||
z_coords = svd_basis @ activation_vector
|
z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates
|
||||||
|
|
||||||
6. Codebook Comparison
|
6. Copula Decomposition
|
||||||
For each SVD dimension:
|
Transform z-coordinates through CDF → simplex → barycentric:
|
||||||
- Compute distance from normal behavioral region
|
z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
|
||||||
- Apply spline scoring (monotonic distribution)
|
→ (u, v) via barycentric (position on simplex)
|
||||||
- Aggregate multi-dimensional signals
|
|
||||||
|
|
||||||
7. Alarm Generation
|
7. Token-Level Smoothing (optional)
|
||||||
Combine per-dimension signals → overall alarm
|
Apply rolling average to (S, u, v) features across token positions
|
||||||
AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
|
window=8: smooths per-token signals, reduces noise from single-token spikes
|
||||||
Include per-dimension breakdown for interpretability
|
|
||||||
|
8. Direction Classification
|
||||||
|
For each behavioral direction (refusal, injection, etc.):
|
||||||
|
logistic_classifier(S, u, v) → P(active | features) per token position
|
||||||
|
|
||||||
|
9. Aggregation
|
||||||
|
Per direction: mean P(active), max P(active), fraction above threshold
|
||||||
|
Flag if any direction exceeds threshold for sufficient token positions
|
||||||
|
|
||||||
|
10. Alarm Generation
|
||||||
|
Combine per-direction signals → overall alarm
|
||||||
|
AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
|
||||||
|
Include per-direction breakdown for interpretability
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Note: Step 4 extracts only the last token in Phase 1. The full pipeline
|
||||||
|
(Phase 2) extracts per-token activations, enabling the token-level smoothing
|
||||||
|
and per-position classification in steps 7–9.
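
Steps 8 and 9 can be sketched for a single direction as below. The weight arguments mirror the per-direction entries of `classifiers.safetensors`, and the defaults mirror `detect()`; the function names and the returned dictionary layout are hypothetical, and the strict `>` comparison against the threshold is an assumption.

```python
import numpy as np


def classify_direction(s, u, v, w_sum, w_u, w_v, intercept):
    """Logistic classifier: P(active | S, u, v) at each token position."""
    logit = w_sum * s + w_u * u + w_v * v + intercept
    return 1.0 / (1.0 + np.exp(-logit))


def aggregate_direction(prob, threshold_prob=0.7, min_positions=3):
    """Summarize per-token probabilities and decide whether to flag."""
    n_above = int(np.sum(prob > threshold_prob))
    return {
        "mean_prob": float(prob.mean()),
        "max_prob": float(prob.max()),
        "n_positions_above": n_above,
        # Flag only when enough token positions exceed the threshold.
        "flagged": n_above >= min_positions,
    }
```

Requiring `min_positions` tokens above the threshold means a single noisy position does not flag a direction on its own.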
|
||||||
|
|
||||||
## Key Concepts
|
## Key Concepts
|
||||||
|
|
||||||
### Behavioral Alarm
|
### Behavioral Alarm
|
||||||
@@ -73,19 +88,25 @@ dimensions is more suspicious than one that shifts only one dimension.
|
|||||||
|
|
||||||
### Score Composition
|
### Score Composition
|
||||||
|
|
||||||
The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
|
The overall `Alarm.score` (0.0–1.0) is computed from per-direction
|
||||||
using a weighted maximum:
|
classification results. For each behavioral direction, the logistic
|
||||||
|
classifier produces P(active | features) for every token position. The
|
||||||
|
alarm score aggregates these across directions:
|
||||||
|
|
||||||
```
|
```
|
||||||
score = max(w_d * signal_d for d in dimensions)
|
direction_score = max(P(active) across token positions)
|
||||||
|
score = max(w_d * direction_score_d for d in directions)
|
||||||
```
|
```
|
||||||
|
|
||||||
Where `w_d` are dimension weights (default: equal, configurable in
|
Where `w_d` are direction weights (default: equal, configurable in
|
||||||
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
|
`Thresholds.per_dimension`). Using `max` at both levels ensures that:
|
||||||
single strongly anomalous dimension can trigger an alarm even if other
|
- A single strongly anomalous direction can trigger an alarm even if other
|
||||||
dimensions are normal. This is critical for catching attacks that exploit
|
directions are normal
|
||||||
specific behavioral patterns (e.g., refusal-suppression) while leaving other
|
- A sustained behavioral signal at any token position surfaces in the alarm
|
||||||
dimensions unaffected.
|
|
||||||
|
This is critical for catching attacks that exploit specific behavioral
|
||||||
|
patterns (e.g., refusal-suppression) while leaving other directions
|
||||||
|
unaffected.
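
The two-level `max` can be written out as a short sketch. It assumes per-direction probabilities arrive as arrays keyed by direction name and that a missing weight defaults to 1.0; `alarm_score` is a hypothetical name.

```python
from __future__ import annotations

import numpy as np


def alarm_score(
    probs: dict[str, np.ndarray],
    weights: dict[str, float] | None = None,
) -> float:
    """score = max over directions of w_d * max over tokens of P(active)."""
    weights = weights or {}
    return max(
        weights.get(name, 1.0) * float(np.max(p))  # direction_score_d
        for name, p in probs.items()
    )
```

For example, with per-token probabilities `[0.1, 0.9]` for refusal and `[0.2, 0.3]` for injection and equal weights, the refusal direction alone determines the score.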
|
||||||
|
|
||||||
The `suspicious` and `dangerous` thresholds are applied to this composite
|
The `suspicious` and `dangerous` thresholds are applied to this composite
|
||||||
score to determine `Alarm.level`.
|
score to determine `Alarm.level`.
|
||||||
@@ -94,9 +115,9 @@ score to determine `Alarm.level`.
|
|||||||
|
|
||||||
| Level | Meaning | Action |
|
| Level | Meaning | Action |
|
||||||
|-------|---------|--------|
|
|-------|---------|--------|
|
||||||
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
|
| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
|
||||||
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
|
| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
|
||||||
| `DANGEROUS` | Strong behavioral anomaly across multiple dimensions | Block input or apply strong mitigations |
|
| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
|
||||||
|
|
||||||
### Latency Budget
|
### Latency Budget
|
||||||
|
|
||||||
@@ -109,7 +130,9 @@ The firewall must complete screening in <10ms on commodity hardware
|
|||||||
| Model inference (125M, CPU) | ~5ms |
|
| Model inference (125M, CPU) | ~5ms |
|
||||||
| Activation extraction | ~0.1ms |
|
| Activation extraction | ~0.1ms |
|
||||||
| SVD projection | ~0.1ms |
|
| SVD projection | ~0.1ms |
|
||||||
| Codebook comparison | ~0.3ms |
|
| Copula decomposition | ~0.05ms |
|
||||||
|
| Token-level smoothing | ~0.05ms |
|
||||||
|
| Direction classification | ~0.1ms |
|
||||||
| **Total** | **~6ms** |
|
| **Total** | **~6ms** |
|
||||||
|
|
||||||
## Interfaces
|
## Interfaces
|
||||||
@@ -124,9 +147,11 @@ class AlarmLevel(Enum):
|
|||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class DimensionSignal:
|
class DimensionSignal:
|
||||||
dimension: int
|
direction: str # Behavioral direction name (e.g., "refusal", "injection")
|
||||||
deviation: float
|
score: float # P(active) for this direction
|
||||||
score: float
|
max_score: float # Max P(active) across token positions
|
||||||
|
mean_score: float # Mean P(active) across token positions
|
||||||
|
n_positions_above: int # Token positions above threshold
|
||||||
direction_label: str | None
|
direction_label: str | None
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
|
|||||||
@@ -35,15 +35,34 @@ changes to the firewall logic.
|
|||||||
The core operation: running the model on an input and capturing hidden state
|
The core operation: running the model on an input and capturing hidden state
|
||||||
representations at specific layers.
|
representations at specific layers.
|
||||||
|
|
||||||
|
**Phase 1 (last-token extraction)**:
|
||||||
```python
|
```python
|
||||||
# Conceptual
|
|
||||||
outputs = model(input_ids, output_hidden_states=True)
|
outputs = model(input_ids, output_hidden_states=True)
|
||||||
activations = {
|
activations = {
|
||||||
layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
|
layer_idx: outputs.hidden_states[layer_idx][:, -1, :]
|
||||||
for layer_idx in configured_layers
|
for layer_idx in configured_layers
|
||||||
}
|
}
|
||||||
|
# Shape: (batch, hidden_dim) per layer: one last-token vector per input
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Phase 2 (per-token extraction)**: Extract hidden states at every token
|
||||||
|
position to enable token-level smoothing and per-position classification
|
||||||
|
(see codebook.md: Token-Level Smoothing).
|
||||||
|
```python
|
||||||
|
outputs = model(input_ids, output_hidden_states=True)
|
||||||
|
activations = {
|
||||||
|
layer_idx: outputs.hidden_states[layer_idx][0, :, :]
|
||||||
|
for layer_idx in configured_layers
|
||||||
|
}
|
||||||
|
# Shape: (seq_len, hidden_dim) per layer — sequence of vectors
|
||||||
|
```
|
||||||
|
|
||||||
|
The training pipeline uses per-token extraction (z-coordinates at every
|
||||||
|
position are collected for population statistics). Phase 1 simplifies to
|
||||||
|
last-token only for lower latency and simpler implementation. The codebook's
|
||||||
|
classifiers are trained on per-token data from all positions, so they remain
|
||||||
|
valid for both extraction modes.
|
||||||
|
|
||||||
Key decisions:
|
Key decisions:
|
||||||
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
|
- **Which layers**: Layers 1, 2, 4, 8 of SmolLM2-135M (12-layer model).
|
||||||
Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
|
Early layers (1, 2) capture safety signals per EMNLP 2024 findings.
|
||||||
@@ -52,9 +71,11 @@ Key decisions:
|
|||||||
signals are highly correlated with the selected layers.
|
signals are highly correlated with the selected layers.
|
||||||
- **Which token**: The last token's hidden state carries the model's
|
- **Which token**: The last token's hidden state carries the model's
|
||||||
"conclusion" about the full input sequence (ADR-009). This is the standard
|
"conclusion" about the full input sequence (ADR-009). This is the standard
|
||||||
choice for autoregressive (LLaMA-family) models.
|
choice for autoregressive (LLaMA-family) models and sufficient for Phase 1.
|
||||||
|
Per-token extraction enables the full detection pipeline in Phase 2.
|
||||||
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
|
- **Shape**: Per layer, the activation is a 1D vector of size `hidden_dim`
|
||||||
(768 for SmolLM2-135M).
|
(768 for SmolLM2-135M) in Phase 1, or a 2D array `(seq_len, hidden_dim)`
|
||||||
|
in Phase 2.
|
||||||
|
|
||||||
### Model-Agnostic Interface
|
### Model-Agnostic Interface
|
||||||
|
|
||||||
|
|||||||