alknet-firewall/docs/architecture/firewall.md
glm-5.1 45a0e0798c docs: add copula decomposition pipeline, clarify detection data flow
The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00

---
status: draft
last_updated: 2026-06-13
---
# Firewall
The core firewall component: the public API for screening untrusted inputs and
producing behavioral alarms.
## What It Is
The Firewall is the primary entry point for alknet-firewall. It receives
untrusted text input, runs it through the detector model, extracts behavioral
signals from hidden state activations, and produces a structured alarm
indicating whether the input exhibits adversarial behavioral patterns.
## Why It Exists
LLM-based systems need a fast, pre-inference screening mechanism that catches
adversarial inputs *before* they reach the target model. Text-surface
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
detection catches what text hides — adversarial inputs produce anomalous
activation patterns regardless of their surface form (ADR-002).
## Data Flow
```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids (shape: seq_len)

3. Detector Model Inference
   model(input_ids, output_hidden_states=True) → hidden_states at key layers

4. Activation Extraction
   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates

6. Copula Decomposition
   Transform z-coordinates through CDF → simplex → barycentric:
   z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
                            → (u, v) via barycentric (position on simplex)

7. Token-Level Smoothing (optional)
   Apply rolling average to (S, u, v) features across token positions
   window=8: smooths per-token signals, reduces noise from single-token spikes

8. Direction Classification
   For each behavioral direction (refusal, injection, etc.):
   logistic_classifier(S, u, v) → P(active | features) per token position

9. Aggregation
   Per direction: mean P(active), max P(active), fraction above threshold
   Flag if any direction exceeds threshold for sufficient token positions

10. Alarm Generation
    Combine per-direction signals → overall alarm
    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
    Include per-direction breakdown for interpretability
```
Note: In Phase 1, step 4 extracts only the last token. The full pipeline
(Phase 2) extracts per-token activations, enabling the token-level smoothing
and per-position classification in steps 7-9.
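
Steps 6 and 7 of the data flow can be sketched in plain Python. This is a minimal illustration, not the PoC's implementation, and it rests on several assumptions that the spec does not state: the CDF is taken to be the standard normal CDF, the barycentric weights are `x_i / S`, `(u, v)` are Cartesian coordinates inside an equilateral triangle with vertices `(0, 0)`, `(1, 0)`, and `(0.5, √3/2)`, and the rolling average uses a trailing window. The function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF (assumed; the spec only says 'CDF')."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def copula_decompose(z):
    """Map one 3-d z-coordinate to (S, u, v)."""
    eps = 1e-12  # keep every x_i strictly positive so S is never zero
    x = [min(max(normal_cdf(zi), eps), 1.0 - eps) for zi in z]
    s = sum(x)                    # S = x0 + x1 + x2 (scale)
    w = [xi / s for xi in x]      # barycentric weights, sum to 1
    # Assumed triangle vertices: (0, 0), (1, 0), (0.5, sqrt(3)/2)
    u = w[1] + 0.5 * w[2]
    v = (math.sqrt(3.0) / 2.0) * w[2]
    return s, u, v


def smooth(features, window=8):
    """Trailing rolling mean over token positions (shorter at the start)."""
    out = []
    for t in range(len(features)):
        chunk = features[max(0, t - window + 1): t + 1]
        out.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return out


# Per-token z-coordinates, shape (seq_len, 3), written as a list of tuples.
z_coords = [(0.0, 0.0, 0.0), (1.5, -0.3, 0.2), (2.1, 0.4, -1.0)]
features = smooth([copula_decompose(z) for z in z_coords], window=8)
```

Under these assumptions a z-coordinate at the origin maps to `S = 1.5` and to the centroid of the triangle, so deviations in `(u, v)` describe which coordinate dominates and `S` describes overall magnitude.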
## Key Concepts
### Behavioral Alarm
Not a simple safe/unsafe binary. A behavioral alarm contains:
- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous composite score from 0.0 to 1.0
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much
This multi-signal approach reflects that safety is multi-dimensional in
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
that simultaneously shifts the refusal direction while activating role-playing
dimensions is more suspicious than one that shifts only one dimension.
### Score Composition
The overall `Alarm.score` (0.0 to 1.0) is computed from per-direction
classification results. For each behavioral direction, the logistic
classifier produces P(active | features) for every token position. The
alarm score aggregates these across directions:
```
direction_score = max(P(active) across token positions)
score = max(w_d * direction_score_d for d in directions)
```
Where `w_d` are direction weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` at both levels ensures that:
- A single strongly anomalous direction can trigger an alarm even if other
directions are normal
- A strong behavioral signal at any single token position surfaces in the alarm
This is critical for catching attacks that exploit specific behavioral
patterns (e.g., refusal-suppression) while leaving other directions
unaffected.
The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.
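
The composition rule and the threshold step can be sketched as follows. The function names, the threshold values (0.5 and 0.8), and the use of `>=` at the boundaries are assumptions made for illustration; the spec only says that `suspicious` and `dangerous` thresholds are applied to the composite score.

```python
def compose_score(per_token_probs, direction_weights=None):
    """Return (overall score, per-direction weighted scores).

    per_token_probs maps a direction name to its P(active) values,
    one per token position.
    """
    weighted = {}
    for direction, probs in per_token_probs.items():
        # Equal weights by default, as in the spec.
        w = 1.0 if direction_weights is None else direction_weights.get(direction, 1.0)
        weighted[direction] = w * max(probs)   # direction_score = max over tokens
    return max(weighted.values()), weighted    # score = max over directions


def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map the composite score to a level (threshold values are hypothetical)."""
    if score >= dangerous:
        return "DANGEROUS"
    if score >= suspicious:
        return "SUSPICIOUS"
    return "CLEAR"


probs = {
    "refusal": [0.1, 0.9, 0.2],    # one strongly anomalous position
    "injection": [0.3, 0.4],       # unremarkable
}
score, per_direction = compose_score(probs)
level = alarm_level(score)
```

Here the `refusal` direction alone drives the alarm, which is the behavior the two `max` operations are meant to guarantee.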
### Alarm Levels
| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
### Latency Budget
The firewall must complete screening in <10ms on commodity hardware
(ADR-003). This budget breaks down approximately:
| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| **Total** | **~6ms** |
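
One way to check a build against this budget is to time each stage separately and compare the sum with the 10 ms limit. The harness below is a sketch: `time_stages` and `within_budget` are hypothetical helpers, and the two stand-in stages exist only so the example runs without a model.

```python
import statistics
import time


def time_stages(stages, payload, repeats=20):
    """Run the stages in order and return the median latency of each in ms.

    stages is a list of (name, callable); each callable receives the
    previous stage's output.
    """
    samples = {name: [] for name, _ in stages}
    for _ in range(repeats):
        value = payload
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)
            samples[name].append((time.perf_counter() - start) * 1000.0)
    return {name: statistics.median(ms) for name, ms in samples.items()}


def within_budget(timings, budget_ms=10.0):
    """True if the summed per-stage latencies stay under the budget."""
    return sum(timings.values()) < budget_ms


# Stand-in stages; a real check would pass the tokenizer, model, and so on.
measured = time_stages([("tokenize", str.split), ("count", len)], "a b c", repeats=5)

# The target figures from the table above.
targets = {
    "tokenization": 0.5, "inference": 5.0, "extraction": 0.1, "svd": 0.1,
    "copula": 0.05, "smoothing": 0.05, "classification": 0.1,
}
ok = within_budget(targets)
```

Medians are used rather than means so that a single slow run, for example a cold cache on the first iteration, does not dominate the result.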
## Interfaces
### Public API
```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# DEFAULT_MODEL_REVISION and Thresholds are not defined in this excerpt.


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    direction: str          # Behavioral direction name (e.g., "refusal", "injection")
    score: float            # P(active) for this direction
    max_score: float        # Max P(active) across token positions
    mean_score: float       # Mean P(active) across token positions
    n_positions_above: int  # Token positions above threshold
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```
> `screen_batch` is Phase 2 (see overview.md scope).
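
The `input_hash` field is described as the SHA-256 of the raw input string. A minimal sketch of how it could be computed is shown below; the UTF-8 encoding and the hexadecimal digest format are assumptions, since the spec names only the hash function, and `compute_input_hash` is a hypothetical helper.

```python
import hashlib


def compute_input_hash(raw_input: str) -> str:
    """SHA-256 of the raw input, as a 64-character hex string.

    Assumes UTF-8 encoding; the input is hashed before any truncation
    or normalization so identical raw strings always deduplicate.
    """
    return hashlib.sha256(raw_input.encode("utf-8")).hexdigest()


digest = compute_input_hash("abc")
```

Logging the digest instead of the input keeps potentially hostile payloads out of log files while still allowing repeated inputs to be recognized.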
### Constraints
1. **No network calls during screening** — the model is lazily loaded on
first `screen()` call or via explicit `preload()`. Download never happens at
import time. Once loaded, screening is entirely local.
2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency** — the firewall has no access to the target
LLM's internals. It runs its own detector model.
4. **Reproducible** — Same input + same model + same codebook = same alarm.
Pin model revision and codebook version.
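
The lazy-loading behavior in constraint 1 can be illustrated with a small stand-alone pattern. `LazyDetector` and `fake_loader` are hypothetical names used only for this sketch; the real `Firewall` would load the detector model and codebook where the stand-in loader returns a trivial function.

```python
from typing import Callable, Optional


class LazyDetector:
    """Defers an expensive load until preload() or the first screen()."""

    def __init__(self, loader: Callable[[], Callable[[str], float]]) -> None:
        self._loader = loader  # nothing is loaded at construction time
        self._model: Optional[Callable[[str], float]] = None

    def preload(self) -> None:
        """Load the model now; a no-op if it is already loaded."""
        if self._model is None:
            self._model = self._loader()

    def screen(self, text: str) -> float:
        self.preload()  # first call triggers the load, later calls reuse it
        assert self._model is not None
        return self._model(text)


calls = []


def fake_loader():
    """Stand-in for a model download and load; records each invocation."""
    calls.append("load")
    return lambda text: float(len(text))


detector = LazyDetector(fake_loader)
loaded_before_use = list(calls)       # still empty: construction loads nothing
first = detector.screen("abc")
second = detector.screen("abcd")
detector.preload()                    # already loaded, so this does nothing
```

Calling `preload()` at service start-up moves the one-time load, and any network access it needs, out of the request path, so every later `screen()` call stays local.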
## Error Handling
| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | — | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |
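
All library-defined exception types (`ModelDownloadError`, `ModelNotLoadedError`,
`CodebookCorruptedError`, `CodebookMismatchError`) subclass `AlknetFirewallError`,
the base library exception. `ValueError` and `MemoryError` are the Python built-ins.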
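
The table above can be read as the following hierarchy plus an input check. This is a sketch under assumptions: `validate_input` is a hypothetical helper, the error messages are invented, and truncation is assumed to keep the first `max_len` tokens, which the spec does not specify.

```python
import warnings


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError): ...
class ModelNotLoadedError(AlknetFirewallError): ...
class CodebookCorruptedError(AlknetFirewallError): ...
class CodebookMismatchError(AlknetFirewallError): ...


def validate_input(text, token_ids, max_len):
    """Apply the input rows of the table; return the (possibly truncated) ids."""
    if text == "":
        raise ValueError("input must be a non-empty string")
    try:
        text.encode("utf-8")  # fails for lone surrogates such as "\ud800"
    except UnicodeEncodeError as exc:
        raise ValueError("input cannot be encoded to UTF-8") from exc
    if len(token_ids) > max_len:
        warnings.warn(
            f"input truncated from {len(token_ids)} to {max_len} tokens",
            UserWarning,
        )
        return token_ids[:max_len]  # assumption: keep the leading tokens
    return token_ids
```

A caller that wants to treat every library failure uniformly can catch `AlknetFirewallError`, while input problems remain ordinary `ValueError`s.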
## Design Decisions
| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-03**: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; [research complete](../research/streaming-screening-patterns/rolling-window-analysis.md))
- ~~**OQ-05**~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved — ADR-011: standalone API + thin adapters Phase 2)