Research confirmed rolling token windows as the right approach for long document screening. ADR-012 formalizes the decision: Phase 2 implements `screen_document()` with 25% overlap (512 tokens for SmolLM2-135M), max pooling aggregation, and character offset tracking. Short inputs fall through to `screen()` unchanged. This resolves the last open question.

All 6 original OQs are now resolved:

- OQ-01: ONNX removed (burn/cublas better future path)
- OQ-02: 65% codebook compression achievable
- OQ-03: Rolling token windows for Phase 2 (ADR-012)
- OQ-04: Both model-specific defaults + user-overridable
- OQ-05: Standalone API + thin adapters (ADR-011)
- OQ-06: TOML for file-based config

---
status: draft
last_updated: 2026-06-13
---

# Firewall

The core firewall component: the public API for screening untrusted inputs and
producing behavioral alarms.

## What It Is

The Firewall is the primary entry point for alknet-firewall. It receives
untrusted text input, runs it through the detector model, extracts behavioral
signals from hidden state activations, and produces a structured alarm
indicating whether the input exhibits adversarial behavioral patterns.

## Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches
adversarial inputs *before* they reach the target model. Text-surface
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
detection catches what text hides: adversarial inputs produce anomalous
activation patterns regardless of their surface form (ADR-002).

## Data Flow

```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids (shape: seq_len)

3. Detector Model Inference
   model(input_ids, output_hidden_states=True) → hidden_states at key layers

4. Activation Extraction
   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates

6. Copula Decomposition
   Transform z-coordinates through CDF → simplex → barycentric:
   z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
                             → (u, v) via barycentric (position on simplex)

7. Token-Level Smoothing (optional)
   Apply rolling average to (S, u, v) features across token positions
   window=8: smooths per-token signals, reduces noise from single-token spikes

8. Direction Classification
   For each behavioral direction (refusal, injection, etc.):
   logistic_classifier(S, u, v) → P(active | features) per token position

9. Aggregation
   Per direction: mean P(active), max P(active), fraction above threshold
   Flag if any direction exceeds threshold for sufficient token positions

10. Alarm Generation
    Combine per-direction signals → overall alarm
    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
    Include per-direction breakdown for interpretability
```

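Step 6 of the flow above can be illustrated with a minimal sketch. It rests on assumptions this document does not state: the CDF is taken to be the standard normal CDF, and `(u, v)` is taken to be the usual equilateral-triangle embedding of the barycentric weights. The names `normal_cdf` and `copula_features` are hypothetical, not part of the library.

```python
import math

SQRT3_OVER_2 = math.sqrt(3.0) / 2.0


def normal_cdf(z: float) -> float:
    """Standard normal CDF (assumed here; the codebook may use another CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def copula_features(z: tuple[float, float, float]) -> tuple[float, float, float]:
    """Map three z-coordinates to (S, u, v) for a single token position."""
    # CDF transform: each coordinate lands in [0, 1].
    x0, x1, x2 = (normal_cdf(c) for c in z)
    # Scale: S = x0 + x1 + x2.
    s = x0 + x1 + x2
    # Barycentric weights are x_i / S; guard against S underflowing to zero.
    s_safe = max(s, 1e-12)
    b1, b2 = x1 / s_safe, x2 / s_safe
    # Assumed 2D embedding with triangle vertices (0, 0), (1, 0), (0.5, sqrt(3)/2).
    u = b1 + 0.5 * b2
    v = SQRT3_OVER_2 * b2
    return s, u, v
```

With all three z-coordinates at zero, each CDF value is 0.5, so the scale is 1.5 and the position is the centroid of the triangle.
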
Note: Step 4 extracts only the last token in Phase 1. The full pipeline
(Phase 2) extracts per-token activations, enabling the token-level smoothing
and per-position classification in steps 7–9.

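The optional smoothing step (window of 8) could look like the sketch below. It assumes a trailing window that averages over however many positions are available at the start of the sequence; whether the actual pipeline uses a trailing or centered window is not specified here. `rolling_mean` is a hypothetical helper, applied separately to each of the S, u, and v series.

```python
def rolling_mean(values: list[float], window: int = 8) -> list[float]:
    """Trailing rolling average over token positions.

    Early positions average over the positions seen so far (an assumption).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed: list[float] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A single-token spike is spread across the window rather than dominating one position, which is the noise reduction the data flow describes.
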
## Key Concepts

### Behavioral Alarm

Not a simple safe/unsafe binary. A behavioral alarm contains:

- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous 0.0–1.0 composite score
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
that simultaneously shifts the refusal direction while activating role-playing
dimensions is more suspicious than one that shifts only one dimension.

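As a rough illustration of how one direction's per-token signals might be produced and summarized (steps 8 and 9 of the data flow), the sketch below applies a logistic classifier to `(S, u, v)` and aggregates the resulting probabilities. The weights, bias, threshold, and the `DirectionSummary` type are all hypothetical; trained values would come from the codebook.

```python
import math
from dataclasses import dataclass


@dataclass
class DirectionSummary:
    max_score: float        # max P(active) across token positions
    mean_score: float       # mean P(active) across token positions
    n_positions_above: int  # positions with P(active) above the threshold


def logistic(features: tuple[float, float, float],
             weights: tuple[float, float, float],
             bias: float) -> float:
    """P(active | S, u, v) from a linear logistic model (numerically stable)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def summarize(probs: list[float], threshold: float = 0.5) -> DirectionSummary:
    """Aggregate one direction's per-token probabilities."""
    if not probs:
        raise ValueError("need at least one token position")
    return DirectionSummary(
        max_score=max(probs),
        mean_score=sum(probs) / len(probs),
        n_positions_above=sum(1 for p in probs if p > threshold),
    )
```
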
### Score Composition

The overall `Alarm.score` (0.0–1.0) is computed from per-direction
classification results. For each behavioral direction, the logistic
classifier produces P(active | features) for every token position. The
alarm score aggregates these across directions:

```
direction_score_d = max(P_d(active) across token positions)
score = max(w_d * direction_score_d for d in directions)
```

Where `w_d` are direction weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` at both levels ensures that:
- A single strongly anomalous direction can trigger an alarm even if other
  directions are normal
- A strong signal at even a single token position surfaces in the alarm

This is critical for catching attacks that exploit specific behavioral
patterns (e.g., refusal-suppression) while leaving other directions
unaffected.

The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.

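The two-level `max` and the threshold mapping can be sketched as follows. The threshold values 0.5 and 0.8 are placeholders chosen for illustration, not the library's defaults, and `composite_score` and `level_for` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def composite_score(per_direction_probs: dict[str, list[float]],
                    weights: Optional[dict[str, float]] = None) -> float:
    """score = max over directions of w_d * max over positions of P(active)."""
    weights = weights or {}
    return max(
        weights.get(direction, 1.0) * max(probs)  # unlisted directions weigh 1.0
        for direction, probs in per_direction_probs.items()
    )


def level_for(score: float, suspicious: float = 0.5,
              dangerous: float = 0.8) -> AlarmLevel:
    """Map the composite score to a level (threshold values are assumptions)."""
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR
```

For example, with `{"refusal": [0.1, 0.9, 0.2], "injection": [0.3, 0.4]}` and equal weights, the single 0.9 position in the refusal direction sets the composite score, even though every other position is low.
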
### Alarm Levels

| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| `SUSPICIOUS` | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |

### Latency Budget

The firewall must complete screening in <10ms on commodity hardware
(ADR-003). This budget breaks down approximately:

| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| **Total** | **~6ms** |

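The arithmetic behind the table, plus one way to check a step against its target, is sketched below. The per-step figures are copied from the table; `median_latency_ms` is a hypothetical helper, and using the median over repeated runs is a design choice made here to damp outliers, not something this document prescribes.

```python
import statistics
import time
from typing import Callable

BUDGET_MS = 10.0  # overall limit from ADR-003

# Per-step targets from the latency table, in milliseconds.
STEP_TARGETS_MS = {
    "tokenization": 0.5,
    "model_inference": 5.0,
    "activation_extraction": 0.1,
    "svd_projection": 0.1,
    "copula_decomposition": 0.05,
    "token_smoothing": 0.05,
    "direction_classification": 0.1,
}


def planned_total_ms() -> float:
    """Sum of the per-step targets (5.9 ms, shown as ~6 ms in the table)."""
    return sum(STEP_TARGETS_MS.values())


def median_latency_ms(fn: Callable[[], object], repeats: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

The planned total leaves roughly 4 ms of headroom under the 10 ms limit.
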
## Interfaces

### Public API

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# `Thresholds` and `DEFAULT_MODEL_REVISION` are defined elsewhere in the library.


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    direction: str          # Behavioral direction name (e.g., "refusal", "injection")
    score: float            # P(active) for this direction
    max_score: float        # Max P(active) across token positions
    mean_score: float       # Mean P(active) across token positions
    n_positions_above: int  # Token positions above threshold
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```

> `screen_batch` is Phase 2 (see overview.md scope).

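The `input_hash` field is described as the SHA-256 of the raw input string. A minimal sketch follows, assuming the string is encoded as UTF-8 and the digest is stored as lowercase hex; neither detail is stated in this document.

```python
import hashlib


def input_hash(text: str) -> str:
    """SHA-256 of the raw input, as a 64-character hex string.

    UTF-8 encoding and hex output are assumptions made for this sketch.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Because the hash depends only on the input text, identical inputs can be deduplicated in logs without storing the text itself.
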
### Constraints

1. **No network calls during screening**: the model is lazily loaded on
   first `screen()` call or via explicit `preload()`. Download never happens at
   import time. Once loaded, screening is entirely local.
2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency**: the firewall has no access to the target
   LLM's internals. It runs its own detector model.
4. **Reproducible**: Same input + same model + same codebook = same alarm.
   Pin model revision and codebook version.

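The lazy-loading behavior in constraint 1 can be pictured with a small wrapper: nothing is loaded at construction, and the loader runs at most once, on the first access or on an explicit preload. `LazyResource` is a hypothetical illustration, not the library's internal mechanism, and the lock is an added assumption to keep concurrent first calls from loading twice.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defers an expensive load until first use and runs it at most once."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded = False
        self._value: Optional[T] = None

    def preload(self) -> None:
        """Force the load now, e.g. at service startup."""
        self.get()

    def get(self) -> T:
        """Return the resource, loading it on the first call."""
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock so only one thread runs the loader.
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]
```

Calling `preload()` during startup moves the download and load cost out of the first screening request.
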
## Error Handling

| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | (none) | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |

All library-defined exception types subclass `AlknetFirewallError` (base
library exception); `ValueError` and `MemoryError` are Python built-ins.

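A sketch of the exception hierarchy and the two input checks from the table is given below. The class names come from the table; the class bodies, the messages, and the `validate_input` helper are assumptions made for illustration.

```python
class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError):
    """Model download failed (e.g. network error)."""


class ModelNotLoadedError(AlknetFirewallError):
    """A previous model load attempt failed."""


class CodebookCorruptedError(AlknetFirewallError):
    """Codebook failed validation."""


class CodebookMismatchError(AlknetFirewallError):
    """Codebook was built for a different model."""


def validate_input(text: str) -> None:
    """Reject inputs that the table says raise ValueError."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        # A Python str can hold lone surrogates, which cannot be UTF-8 encoded.
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError("input cannot be encoded to UTF-8") from exc
```

Callers that want a single handler for library failures can catch `AlknetFirewallError`, while input problems surface as the built-in `ValueError`.
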
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- ~~**OQ-03**~~: ~~Should the firewall support streaming/chunked input screening?~~ (resolved, ADR-012: rolling token windows with `screen_document()` in Phase 2)
- ~~**OQ-05**~~: ~~How should the firewall integrate with existing guardrail systems?~~ (resolved, ADR-011: standalone API + thin adapters in Phase 2)
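
The rolling token windows adopted for OQ-03 might be laid out as in the sketch below, using the 25% overlap and max pooling mentioned for ADR-012. The window size is left as a parameter because this document does not fix it unambiguously; `window_spans` and `pool_max` are hypothetical names, and character offset tracking is omitted.

```python
def window_spans(n_tokens: int, window: int,
                 overlap: float = 0.25) -> list[tuple[int, int]]:
    """Half-open (start, end) token spans covering the whole input.

    Consecutive spans share roughly `overlap * window` tokens. Inputs that
    fit in one window yield a single span.
    """
    if window < 1 or not 0.0 <= overlap < 1.0:
        raise ValueError("window must be >= 1 and overlap in [0, 1)")
    stride = max(1, window - int(window * overlap))
    spans: list[tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def pool_max(window_scores: list[float]) -> float:
    """Max pooling: the document score is its highest-scoring window."""
    return max(window_scores)
```

With max pooling, a payload confined to one window still determines the document-level score, and the overlap reduces the chance that a payload is split across a window boundary.
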