The architecture specs previously described detection as a single-vector path (one activation → one z-coordinate → one alarm), but the PoC operates on per-token z-coordinate sequences with a two-stage copula decomposition. Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex → barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section, Token-Level Smoothing section, classifier weights and direction profiles to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes: training is (N, 3) flattened per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula decomposition, smoothing, and direction classification; update score composition to use direction-level P(active); update DimensionSignal dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full pipeline
| status | last_updated |
|---|---|
| draft | 2026-06-13 |
Firewall
The core firewall component: the public API for screening untrusted inputs and producing behavioral alarms.
What It Is
The Firewall is the primary entry point for alknet-firewall. It receives untrusted text input, runs it through the detector model, extracts behavioral signals from hidden state activations, and produces a structured alarm indicating whether the input exhibits adversarial behavioral patterns.
Why It Exists
LLM-based systems need a fast, pre-inference screening mechanism that catches adversarial inputs before they reach the target model. Text-surface defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal detection catches what text hides — adversarial inputs produce anomalous activation patterns regardless of their surface form (ADR-002).
Data Flow
1. Input Arrives
"Please summarize this document: [hidden injection payload]"
2. Tokenize
tokenizer.encode(input) → input_ids (shape: seq_len)
3. Detector Model Inference
model(input_ids, output_hidden_states=True) → hidden_states at key layers
4. Activation Extraction
Extract last-token hidden states from configured layers (early + mid)
hidden_states[layer_idx][:, -1, :] → per-layer activation vectors
5. SVD Projection
Project activations onto precomputed SVD basis
z_coords = V^T @ (activation - mean) → (seq_len, 3) z-coordinates
6. Copula Decomposition
Transform z-coordinates through CDF → simplex → barycentric:
z → (x₀, x₁, x₂) via CDF → S = x₀+x₁+x₂ (scale)
→ (u, v) via barycentric (position on simplex)
7. Token-Level Smoothing (optional)
Apply rolling average to (S, u, v) features across token positions
window=8: smooths per-token signals, reduces noise from single-token spikes
8. Direction Classification
For each behavioral direction (refusal, injection, etc.):
logistic_classifier(S, u, v) → P(active | features) per token position
9. Aggregation
Per direction: mean P(active), max P(active), fraction above threshold
Flag if any direction exceeds threshold for sufficient token positions
10. Alarm Generation
Combine per-direction signals → overall alarm
AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
Include per-direction breakdown for interpretability
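Steps 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the codebook's implementation: the standard normal CDF as the marginal, the equilateral-triangle vertices used for the 2-D embedding, and the names `project`, `normal_cdf`, and `copula_decompose` are all hypothetical. The actual transforms are defined in codebook.md.

```python
import math

def project(activation, mean, basis):
    """Step 5: z = V^T (activation - mean); `basis` holds the rows of V^T."""
    centered = [a - m for a, m in zip(activation, mean)]
    return [sum(b * c for b, c in zip(row, centered)) for row in basis]

def normal_cdf(z):
    """Standard normal CDF (an assumed marginal; the codebook may fit another)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumed 2-D positions of the three simplex vertices (equilateral triangle).
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]

def copula_decompose(z):
    """Step 6: map one 3-D z-coordinate to (S, u, v)."""
    x = [normal_cdf(zi) for zi in z]          # (x0, x1, x2), each in [0, 1]
    s = sum(x)                                # S: scale
    # Barycentric weights; fall back to the centroid if S underflows to 0.
    w = [xi / s for xi in x] if s > 0.0 else [1.0 / 3.0] * 3
    u = sum(wi * vx for wi, (vx, _) in zip(w, VERTICES))
    v = sum(wi * vy for wi, (_, vy) in zip(w, VERTICES))
    return s, u, v

# Usage: one activation vector through both steps.
z = project([1.0, 2.0], mean=[0.0, 1.0], basis=[[1, 0], [0, 1], [1, 1]])
S, u, v = copula_decompose(z)
```

In Phase 2 the same two functions would be applied to each token position, giving one (S, u, v) row per token.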
Note: Step 4 extracts only the last token in Phase 1. The full pipeline (Phase 2) extracts per-token activations, enabling the token-level smoothing and per-position classification in steps 7–9.
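Steps 7 to 9 can be sketched in a few lines. The trailing (causal) window, the 0.5 threshold, and the function names `rolling_mean`, `p_active`, and `aggregate` are assumptions made for illustration; the spec fixes only the window size of 8 and the aggregation statistics.

```python
import math

def rolling_mean(rows, window=8):
    """Step 7: trailing rolling average of (S, u, v) rows over token positions."""
    out = []
    for i in range(len(rows)):
        chunk = rows[max(0, i - window + 1): i + 1]
        out.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return out

def p_active(features, weights, bias):
    """Step 8: logistic classifier, P(active | S, u, v) for one position."""
    logit = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))

def aggregate(probs, threshold=0.5):
    """Step 9: per-direction summary statistics across token positions."""
    return {
        "mean": sum(probs) / len(probs),
        "max": max(probs),
        "fraction_above": sum(p > threshold for p in probs) / len(probs),
    }

# Usage with made-up features and hypothetical classifier weights.
features = [(1.2, 0.4, 0.3), (2.4, 0.6, 0.2), (2.6, 0.7, 0.2)]
smoothed = rolling_mean(features, window=2)
probs = [p_active(f, weights=(2.0, 0.0, 0.0), bias=-4.0) for f in smoothed]
summary = aggregate(probs)
```

With Phase 1 (last-token) extraction there is a single position, so smoothing is a no-op and the mean and max coincide.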
Key Concepts
Behavioral Alarm
Not a simple safe/unsafe binary. A behavioral alarm contains:
- Level: CLEAR, SUSPICIOUS, or DANGEROUS
- Score: Continuous 0.0–1.0 composite score
- Signals: Per-dimension behavioral signal strengths
- Dimensions: Which SVD directions are anomalous and by how much
This multi-signal approach reflects that safety is multi-dimensional in activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input that simultaneously shifts the refusal direction while activating role-playing dimensions is more suspicious than one that shifts only one dimension.
Score Composition
The overall Alarm.score (0.0–1.0) is computed from per-direction
classification results. For each behavioral direction, the logistic
classifier produces P(active | features) for every token position. The
alarm score aggregates these across directions:
direction_score = max(P(active) across token positions)
score = max(w_d * direction_score_d for d in directions)
Where w_d are direction weights (default: equal, configurable in
Thresholds.per_dimension). Using max at both levels ensures that:
- A single strongly anomalous direction can trigger an alarm even if other directions are normal
- A sustained behavioral signal at any token position surfaces in the alarm
This is critical for catching attacks that exploit specific behavioral patterns (e.g., refusal-suppression) while leaving other directions unaffected.
The suspicious and dangerous thresholds are applied to this composite
score to determine Alarm.level.
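The composition above reduces to two nested max operations followed by a threshold lookup. The sketch below assumes hypothetical threshold values (0.5 and 0.8) and inclusive comparisons; the real values come from Thresholds, and the function names are illustrative only.

```python
def alarm_score(per_direction_probs, weights=None):
    """score = max over directions of w_d * max over positions of P(active)."""
    weights = weights or {}  # missing directions default to weight 1.0
    return max(
        weights.get(direction, 1.0) * max(probs)
        for direction, probs in per_direction_probs.items()
    )

def alarm_level(score, suspicious=0.5, dangerous=0.8):
    """Map the composite score to a level (threshold values are assumed)."""
    if score >= dangerous:
        return "dangerous"
    if score >= suspicious:
        return "suspicious"
    return "clear"

# Usage: one direction spikes at a single position, the other stays low.
probs = {"refusal": [0.1, 0.9], "injection": [0.2, 0.3]}
score = alarm_score(probs)   # 0.9, driven entirely by "refusal"
level = alarm_level(score)   # "dangerous"
```

Because the outer operation is a max rather than a mean, the quiet "injection" direction does not dilute the "refusal" signal.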
Alarm Levels
| Level | Meaning | Action |
|---|---|---|
| CLEAR | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| SUSPICIOUS | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| DANGEROUS | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
Latency Budget
The firewall must complete screening in <10ms on commodity hardware (ADR-003). This budget breaks down approximately:
| Step | Target Latency |
|---|---|
| Tokenization | ~0.5ms |
| Model inference (135M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| Total | ~6ms |
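As a quick arithmetic check, the per-step targets sum to about 5.9 ms, which leaves roughly 4 ms of headroom under the 10 ms budget. The snippet below reproduces that sum and adds a small timing helper for comparing a measured step against its target; `timed_ms` is a hypothetical helper, not part of the library.

```python
import time

BUDGET_MS = 10.0
TARGETS_MS = {
    "tokenization": 0.5,
    "model_inference": 5.0,
    "activation_extraction": 0.1,
    "svd_projection": 0.1,
    "copula_decomposition": 0.05,
    "token_level_smoothing": 0.05,
    "direction_classification": 0.1,
}

total_ms = sum(TARGETS_MS.values())   # about 5.9
headroom_ms = BUDGET_MS - total_ms    # about 4.1

def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

# Usage: time any pipeline step and compare against its target.
result, elapsed = timed_ms(sorted, [3, 1, 2])
```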
Interfaces
Public API
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"

@dataclass
class DimensionSignal:
    direction: str  # Behavioral direction name (e.g., "refusal", "injection")
    score: float  # P(active) for this direction
    max_score: float  # Max P(active) across token positions
    mean_score: float  # Mean P(active) across token positions
    n_positions_above: int  # Token positions above threshold
    direction_label: str | None

@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float

class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
screen_batch is Phase 2 (see overview.md scope).
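The input_hash field is described as a SHA-256 of the raw input string. One plausible way to compute it is shown below; the UTF-8 encoding and the hex-digest output format are assumptions, since the spec does not state either, and the function name is illustrative.

```python
import hashlib

def compute_input_hash(raw: str) -> str:
    """SHA-256 of the raw input, hex-encoded (encoding and format assumed)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Usage: identical inputs hash identically, which supports log deduplication.
h1 = compute_input_hash("Please summarize this document")
h2 = compute_input_hash("Please summarize this document")
```

Hashing the raw string rather than the token ids keeps the value independent of the tokenizer, so log entries stay comparable across model revisions.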
Constraints
- No network calls during screening: the model is lazily loaded on first screen() call or via explicit preload(). Download never happens at import time. Once loaded, screening is entirely local.
- Synchronous API: screen() is a blocking call. Async is Phase 2.
- No target model dependency: the firewall has no access to the target LLM's internals. It runs its own detector model.
- Reproducible: Same input + same model + same codebook = same alarm. Pin model revision and codebook version.
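The lazy-loading constraint can be illustrated with a small stand-in class. This is a sketch of the pattern only: `LazyDetector` and the injected `loader` callable are hypothetical, and the real Firewall loads a detector model rather than an arbitrary callable.

```python
class LazyDetector:
    """Defers the expensive load until preload() or the first screen()."""

    def __init__(self, loader):
        self._loader = loader   # called at most once, never at construction
        self._model = None

    def preload(self) -> None:
        if self._model is None:
            self._model = self._loader()

    def screen(self, text: str):
        self.preload()          # no-op after the first successful load
        return self._model(text)

# Usage: count how often the (fake) loader runs.
load_calls = []

def fake_loader():
    load_calls.append(1)
    return lambda text: len(text)   # stand-in for the detector model

detector = LazyDetector(fake_loader)
calls_after_init = len(load_calls)  # 0: nothing loaded yet
first = detector.screen("hello")
second = detector.screen("hello")
```

The two screen() calls return the same value and trigger a single load, which mirrors both the lazy-loading and the reproducibility constraints.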
Error Handling
| Failure Mode | Exception Type | Behavior |
|---|---|---|
| Model download fails (network) | ModelDownloadError | Raised from preload() or first screen(). User must retry. |
| Model not loaded when screen() called | ModelNotLoadedError | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | CodebookCorruptedError | Raised at Firewall.__init__ if codebook fails validation. |
| Codebook-model mismatch | CodebookMismatchError | Raised if codebook's model_id doesn't match loaded model. |
| Empty input | ValueError | Raised if input is empty string. |
| Non-UTF8 input | ValueError | Raised if input cannot be encoded to UTF-8. |
| Very long input | — | Truncated to model's max sequence length with a UserWarning. |
| Insufficient memory for model | MemoryError | Propagated from PyTorch. User must reduce model size or free memory. |
All library-defined exception types subclass AlknetFirewallError (base library exception). ValueError and MemoryError are Python built-ins and are raised as such.
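A minimal sketch of the exception hierarchy and the input checks from the table follows. The class names come from the table; the helper names `validate_input` and `truncate_ids`, the error messages, and the choice to truncate at the token-id level are assumptions.

```python
import warnings

class AlknetFirewallError(Exception):
    """Base library exception."""

class ModelDownloadError(AlknetFirewallError):
    pass

class ModelNotLoadedError(AlknetFirewallError):
    pass

class CodebookCorruptedError(AlknetFirewallError):
    pass

class CodebookMismatchError(AlknetFirewallError):
    pass

def validate_input(text: str) -> None:
    """Reject empty input and strings that cannot be encoded as UTF-8."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        text.encode("utf-8")   # fails on lone surrogates such as "\ud800"
    except UnicodeEncodeError as exc:
        raise ValueError("input cannot be encoded to UTF-8") from exc

def truncate_ids(input_ids, max_len):
    """Truncate to the model's max sequence length, warning when it happens."""
    if len(input_ids) > max_len:
        warnings.warn(f"input truncated to {max_len} tokens", UserWarning)
        return input_ids[:max_len]
    return input_ids
```

A caller that wants one handler for all library failures can catch AlknetFirewallError, while input problems surface as the built-in ValueError.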
Design Decisions
| ADR | Decision | Summary |
|---|---|---|
| 002 | Behavioral signals | Detect how models react, not what text says |
| 003 | Small model detector | <10ms latency, CPU-deployable |
| 004 | SVD-based detection | Multi-dimensional, interpretable, efficient |
| 008 | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |
Open Questions
Open questions are tracked in open-questions.md. Key questions affecting this document:
- OQ-03: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; research complete)
- OQ-05: How should the firewall integrate with existing guardrail systems? (resolved by ADR-011: standalone API plus thin adapters in Phase 2)