alknet-firewall/docs/architecture/firewall.md
glm-5.1 45a0e0798c docs: add copula decomposition pipeline, clarify detection data flow
The architecture specs previously described detection as a single-vector
path (one activation → one z-coordinate → one alarm), but the PoC operates
on per-token z-coordinate sequences with a two-stage copula decomposition.

Key updates:
- codebook.md: Add Copula Decomposition section (z → CDF → simplex →
  barycentric → (S, u, v)), Direction Profiles and Contrast Pairs section,
  Token-Level Smoothing section, classifier weights and direction profiles
  to data format, updated Internal API with decompose/classify/detect methods
- codebook.md: Clarify z-coordinate shapes — training is (N, 3) flattened
  per-token positions, inference is (seq_len, 3) per-token sequence
- firewall.md: Update data flow to 10-step pipeline including copula
  decomposition, smoothing, and direction classification; update score
  composition to use direction-level P(active); update DimensionSignal
  dataclass; update latency budget with copula/smoothing/classification steps
- model.md: Add Phase 1 (last-token) vs Phase 2 (per-token) extraction modes
- ADR-009: Note last-token is Phase 1 simplification, per-token is full
  pipeline
2026-06-13 08:17:09 +00:00

status last_updated
draft 2026-06-13

Firewall

The core firewall component: the public API for screening untrusted inputs and producing behavioral alarms.

What It Is

The Firewall is the primary entry point for alknet-firewall. It receives untrusted text input, runs it through the detector model, extracts behavioral signals from hidden state activations, and produces a structured alarm indicating whether the input exhibits adversarial behavioral patterns.

Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches adversarial inputs before they reach the target model. Text-surface defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal detection catches what text hides — adversarial inputs produce anomalous activation patterns regardless of their surface form (ADR-002).

Data Flow

1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids  (shape: seq_len)

3. Detector Model Inference
   model(input_ids, output_hidden_states=True) → hidden_states at key layers

4. Activation Extraction
   Extract last-token hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :]  → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = V^T @ (activation - mean)  → (seq_len, 3) z-coordinates

6. Copula Decomposition
   Transform z-coordinates through CDF → simplex → barycentric:
   z → (x₀, x₁, x₂) via CDF  →  S = x₀+x₁+x₂ (scale)
                               →  (u, v) via barycentric (position on simplex)

7. Token-Level Smoothing (optional)
   Apply rolling average to (S, u, v) features across token positions
   window=8: smooths per-token signals, reduces noise from single-token spikes

8. Direction Classification
   For each behavioral direction (refusal, injection, etc.):
     logistic_classifier(S, u, v) → P(active | features) per token position

9. Aggregation
   Per direction: mean P(active), max P(active), fraction above threshold
   Flag if any direction exceeds threshold for sufficient token positions

10. Alarm Generation
    Combine per-direction signals → overall alarm
    AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
    Include per-direction breakdown for interpretability

Note: Step 4 extracts only the last token in Phase 1. The full pipeline (Phase 2) extracts per-token activations, enabling the token-level smoothing and per-position classification in steps 7-9.
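
Steps 6 and 7 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PoC implementation: the spec only says "CDF", so a standard normal CDF is assumed here; the embedding of the barycentric weights into (u, v) on an equilateral triangle and the trailing (rather than centered) window are also assumptions. The names `normal_cdf`, `decompose`, and `rolling_mean` are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF (assumed; the spec does not name the CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def decompose(z):
    """Map one z-coordinate triple to (S, u, v)."""
    x = [normal_cdf(c) for c in z]       # each x_i lies in [0, 1]
    s = sum(x)                           # scale S = x0 + x1 + x2
    if s == 0.0:                         # CDF underflow: fall back to the centroid
        w = [1.0 / 3.0] * 3
    else:
        w = [xi / s for xi in x]         # barycentric weights, sum to 1
    # Assumed embedding: triangle vertices (0, 0), (1, 0), (0.5, sqrt(3)/2).
    u = w[1] + 0.5 * w[2]
    v = (math.sqrt(3.0) / 2.0) * w[2]
    return s, u, v


def rolling_mean(rows, window=8):
    """Trailing rolling average over token positions, per feature column."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    for i in range(len(rows)):
        chunk = rows[max(0, i - window + 1): i + 1]
        out.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return out


# One (S, u, v) row per token position, then smoothed with window=8.
z_coords = [(0.0, 0.0, 0.0), (1.2, -0.4, 0.3), (2.5, 0.1, -1.0)]
features = rolling_mean([decompose(z) for z in z_coords], window=8)
```

A trailing window leaves the first few positions averaged over fewer than eight tokens, which avoids padding short inputs.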

Key Concepts

Behavioral Alarm

Not a simple safe/unsafe binary. A behavioral alarm contains:

  • Level: CLEAR, SUSPICIOUS, or DANGEROUS
  • Score: Continuous 0.0-1.0 composite score
  • Signals: Per-dimension behavioral signal strengths
  • Dimensions: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input that simultaneously shifts the refusal direction while activating role-playing dimensions is more suspicious than one that shifts only one dimension.

Score Composition

The overall Alarm.score (0.0-1.0) is computed from per-direction classification results. For each behavioral direction, the logistic classifier produces P(active | features) for every token position. The alarm score aggregates these across directions:

direction_score = max(P(active) across token positions)
score = max(w_d * direction_score_d for d in directions)

Where w_d are direction weights (default: equal, configurable in Thresholds.per_dimension). Using max at both levels ensures that:

  • A single strongly anomalous direction can trigger an alarm even if other directions are normal
  • A sustained behavioral signal at any token position surfaces in the alarm

This is critical for catching attacks that exploit specific behavioral patterns (e.g., refusal-suppression) while leaving other directions unaffected.

The suspicious and dangerous thresholds are applied to this composite score to determine Alarm.level.
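
The two-level max can be written out directly. The sketch below assumes a plain logistic model over (S, u, v) with per-direction weights and bias; the function names, the classifier coefficients, and the probability values in the example are illustrative, not taken from the codebook.

```python
from __future__ import annotations

import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def p_active(features, coef, bias: float) -> float:
    """P(active | S, u, v) for one token position (assumed linear logit)."""
    return sigmoid(sum(c * f for c, f in zip(coef, features)) + bias)


def alarm_score(p_by_direction: dict[str, list[float]],
                weights: dict[str, float] | None = None) -> float:
    """score = max over directions of w_d * max over positions of P(active)."""
    weighted = []
    for direction, probs in p_by_direction.items():
        direction_score = max(probs)                  # max across token positions
        w_d = 1.0 if weights is None else weights.get(direction, 1.0)
        weighted.append(w_d * direction_score)
    return max(weighted)                              # max across directions


# Illustrative per-token probabilities for two directions.
probs = {"refusal": [0.1, 0.9, 0.2], "injection": [0.3, 0.4]}
score = alarm_score(probs)  # the refusal spike at position 1 dominates
```

Because both levels use max, lowering one direction's weight never hides a strong signal in another direction.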

Alarm Levels

| Level | Meaning | Action |
| --- | --- | --- |
| CLEAR | Input exhibits normal behavioral patterns across all directions | Pass to target model |
| SUSPICIOUS | Some behavioral directions show elevated activation signals | Flag for review or apply additional checks |
| DANGEROUS | Strong behavioral anomaly in one or more directions, sustained across token positions | Block input or apply strong mitigations |
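
Mapping the composite score to a level is a two-threshold comparison. The sketch below assumes the thresholds are inclusive lower bounds and uses placeholder values of 0.5 and 0.8; the document does not state the defaults, so both numbers and the `level_for` helper are hypothetical.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


# Actions as listed in the alarm level table.
ACTIONS = {
    AlarmLevel.CLEAR: "pass to target model",
    AlarmLevel.SUSPICIOUS: "flag for review or apply additional checks",
    AlarmLevel.DANGEROUS: "block input or apply strong mitigations",
}


def level_for(score: float, suspicious: float = 0.5, dangerous: float = 0.8) -> AlarmLevel:
    """Apply the two thresholds to the composite score (placeholder defaults)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if suspicious > dangerous:
        raise ValueError("suspicious threshold must not exceed dangerous threshold")
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


level = level_for(0.62)
action = ACTIONS[level]
```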

Latency Budget

The firewall must complete screening in <10ms on commodity hardware (ADR-003). This budget breaks down approximately:

| Step | Target Latency |
| --- | --- |
| Tokenization | ~0.5ms |
| Model inference (125M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Copula decomposition | ~0.05ms |
| Token-level smoothing | ~0.05ms |
| Direction classification | ~0.1ms |
| Total | ~6ms |
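
One way to check a build against this budget is to time each stage separately. The `StageTimer` helper below is a hypothetical sketch using only the standard library; the stage bodies are stand-ins, not the real pipeline steps.

```python
import time
from contextlib import contextmanager

BUDGET_MS = 10.0  # overall screening budget from ADR-003


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage, in milliseconds."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages[name] = self.stages.get(name, 0.0) + elapsed_ms

    def total_ms(self):
        return sum(self.stages.values())

    def within_budget(self, budget_ms=BUDGET_MS):
        return self.total_ms() <= budget_ms


timer = StageTimer()
with timer.stage("tokenization"):
    tokens = "Please summarize this document".split()  # stand-in for the tokenizer
with timer.stage("direction_classification"):
    _ = [len(t) for t in tokens]                       # stand-in for the classifier

for name, ms in timer.stages.items():
    print(f"{name}: {ms:.3f} ms")
```

`time.perf_counter()` is used because it is monotonic and has the highest available resolution for short intervals.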

Interfaces

Public API

class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"

@dataclass
class DimensionSignal:
    direction: str              # Behavioral direction name (e.g., "refusal", "injection")
    score: float                # P(active) for this direction
    max_score: float            # Max P(active) across token positions
    mean_score: float           # Mean P(active) across token positions
    n_positions_above: int      # Token positions above threshold
    direction_label: str | None

@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str          # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float

class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...

screen_batch is Phase 2 (see overview.md scope).
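
To show how a caller might consume the result, the sketch below redeclares the dataclasses from the Public API so that it runs on its own, then builds an Alarm by hand and picks out the strongest direction. The helpers `hash_input` and `strongest_signal`, and all numeric values, are illustrative assumptions; in real use the Alarm would come from `Firewall.screen()`.

```python
from __future__ import annotations

import hashlib
import time
from dataclasses import dataclass
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    direction: str
    score: float
    max_score: float
    mean_score: float
    n_positions_above: int
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float
    signals: list[DimensionSignal]
    input_hash: str
    model_id: str
    timestamp: float


def hash_input(text: str) -> str:
    """SHA-256 hex digest of the raw input string, as used for input_hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def strongest_signal(alarm: Alarm) -> DimensionSignal | None:
    """Direction with the highest max P(active), or None if there are no signals."""
    return max(alarm.signals, key=lambda s: s.max_score, default=None)


text = "Please summarize this document: [hidden injection payload]"
alarm = Alarm(
    level=AlarmLevel.SUSPICIOUS,
    score=0.62,  # made-up value for illustration
    signals=[
        DimensionSignal("refusal", 0.20, 0.20, 0.11, 0, None),
        DimensionSignal("injection", 0.62, 0.62, 0.35, 3, None),
    ],
    input_hash=hash_input(text),
    model_id="HuggingFaceTB/SmolLM2-135M",
    timestamp=time.time(),
)
top = strongest_signal(alarm)
print(alarm.level.value, top.direction if top else "none")
```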

Constraints

  1. No network calls during screening — the model is lazily loaded on first screen() call or via explicit preload(). Download never happens at import time. Once loaded, screening is entirely local.
  2. Synchronous API: screen() is a blocking call. Async is Phase 2.
  3. No target model dependency — the firewall has no access to the target LLM's internals. It runs its own detector model.
  4. Reproducible — Same input + same model + same codebook = same alarm. Pin model revision and codebook version.
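
The lazy-loading behavior in constraint 1 follows a common pattern: nothing is loaded at construction, and the first call to either method triggers the load exactly once. The `LazyDetector` class and its `loader` argument below are hypothetical stand-ins for the real model loading code.

```python
class LazyDetector:
    """Defers an expensive load until preload() or the first screen() call."""

    def __init__(self, loader):
        self._loader = loader   # callable that returns the loaded model
        self._model = None      # nothing is loaded at construction time

    def preload(self):
        if self._model is None:
            self._model = self._loader()

    def screen(self, text):
        self.preload()          # no-op after the first successful load
        return self._model(text)


load_calls = []


def fake_loader():
    """Stand-in for a model download and load; records each invocation."""
    load_calls.append(1)
    return lambda text: len(text)


detector = LazyDetector(fake_loader)
calls_before = len(load_calls)   # still zero: constructing does not load
detector.screen("first")
detector.screen("second")
calls_after = len(load_calls)    # one: the model is loaded once and reused
```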

Error Handling

| Failure Mode | Exception Type | Behavior |
| --- | --- | --- |
| Model download fails (network) | ModelDownloadError | Raised from preload() or first screen(). User must retry. |
| Model not loaded when screen() called | ModelNotLoadedError | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | CodebookCorruptedError | Raised at Firewall.__init__ if codebook fails validation. |
| Codebook-model mismatch | CodebookMismatchError | Raised if codebook's model_id doesn't match loaded model. |
| Empty input | ValueError | Raised if input is empty string. |
| Non-UTF8 input | ValueError | Raised if input cannot be encoded to UTF-8. |
| Very long input | None (no exception) | Truncated to model's max sequence length with a UserWarning. |
| Insufficient memory for model | MemoryError | Propagated from PyTorch/torch. User must reduce model size or free memory. |

All library-defined exception types subclass AlknetFirewallError (base library exception). ValueError and MemoryError are the Python built-ins.
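
A minimal sketch of the exception hierarchy and the input checks follows. The exception names come from the table above; the helpers `validate_input` and `truncate_ids`, and their messages, are hypothetical.

```python
import warnings


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError):
    pass


class ModelNotLoadedError(AlknetFirewallError):
    pass


class CodebookCorruptedError(AlknetFirewallError):
    pass


class CodebookMismatchError(AlknetFirewallError):
    pass


def validate_input(text: str) -> bytes:
    """Reject empty input and input that cannot be encoded as UTF-8."""
    if text == "":
        raise ValueError("input is empty")
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as exc:   # e.g. a lone surrogate code point
        raise ValueError("input cannot be encoded as UTF-8") from exc


def truncate_ids(input_ids, max_len):
    """Truncate over-long token sequences and emit a UserWarning."""
    if len(input_ids) > max_len:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_len} tokens",
            UserWarning,
            stacklevel=2,
        )
        return input_ids[:max_len]
    return input_ids
```

Callers can catch AlknetFirewallError to handle every library-specific failure in one place, while ValueError and MemoryError keep their usual built-in meaning.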

Design Decisions

| ADR | Decision | Summary |
| --- | --- | --- |
| 002 | Behavioral signals | Detect how models react, not what text says |
| 003 | Small model detector | <10ms latency, CPU-deployable |
| 004 | SVD-based detection | Multi-dimensional, interpretable, efficient |
| 008 | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |

Open Questions

Open questions are tracked in open-questions.md. Key questions affecting this document:

  • OQ-03: Should the firewall support streaming/chunked input screening? (open — rolling window approach is promising; research complete)
  • OQ-05: How should the firewall integrate with existing guardrail systems? (resolved — ADR-011: standalone API + thin adapters Phase 2)