feat: initial architecture specification and research
Phase 0→1 setup for alknet-firewall, a behavioral signal detection library
that screens untrusted LLM inputs using small model activations.

Architecture docs (5 specs, 10 ADRs, 7 open questions):
- overview: vision, scope, dependencies, package structure
- firewall: core API, alarm protocol, score composition, error handling
- codebook: SVD basis, spline distributions, calibration, tensor format
- model: activation extraction, model-agnostic interface, lazy loading
- configuration: thresholds, model selection, detection tuning

Research reports:
- modern-python-project-setup: uv, pyproject.toml, src layout, ruff, CI
- python-ml-packaging: optional PyTorch, HF Hub download, safetensors
- llm-input-safety-landscape: threat taxonomy, defenses, academic evidence

Agent role adaptations for Python project (replaced Rust conventions).
docs/architecture/firewall.md
@@ -0,0 +1,200 @@
---
status: draft
last_updated: 2026-06-13
---

# Firewall

The core firewall component: the public API for screening untrusted inputs and
producing behavioral alarms.

## What It Is

The Firewall is the primary entry point for alknet-firewall. It receives
untrusted text input, runs it through the detector model, extracts behavioral
signals from hidden-state activations, and produces a structured alarm
indicating whether the input exhibits adversarial behavioral patterns.

## Why It Exists

LLM-based systems need a fast, pre-inference screening mechanism that catches
adversarial inputs *before* they reach the target model. Text-surface
defenses miss obfuscated, multilingual, and novel attacks. Behavioral signal
detection catches what text hides: adversarial inputs produce anomalous
activation patterns regardless of their surface form (ADR-002).

## Data Flow

```
1. Input Arrives
   "Please summarize this document: [hidden injection payload]"

2. Tokenize
   tokenizer.encode(input) → input_ids

3. Detector Model Inference
   model(input_ids) → hidden_states at key layers

4. Activation Extraction
   Extract hidden states from configured layers (early + mid)
   hidden_states[layer_idx][:, -1, :] → per-layer activation vectors

5. SVD Projection
   Project activations onto precomputed SVD basis
   z_coords = svd_basis @ activation_vector

6. Codebook Comparison
   For each SVD dimension:
     - Compute distance from normal behavioral region
     - Apply spline scoring (monotonic distribution)
     - Aggregate multi-dimensional signals

7. Alarm Generation
   Combine per-dimension signals → overall alarm
   AlarmLevel: CLEAR | SUSPICIOUS | DANGEROUS
   Include per-dimension breakdown for interpretability
```

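Steps 4 and 5 can be illustrated with a minimal NumPy sketch. It uses random arrays in place of real detector hidden states, and a random orthonormal matrix in place of the precomputed SVD basis; the helper names `last_token_activations` and `project`, the array shapes, and the layer indices are assumptions made for illustration, not the library's implementation.

```python
import numpy as np


def last_token_activations(hidden_states, layer_indices):
    """Step 4: take the final-token vector from each configured layer."""
    return [hidden_states[i][:, -1, :] for i in layer_indices]


def project(svd_basis, activation_vector):
    """Step 5: z_coords = svd_basis @ activation_vector."""
    return svd_basis @ activation_vector


rng = np.random.default_rng(0)
n_layers, batch, seq_len, hidden = 4, 1, 6, 8  # toy sizes (assumed)

# Stand-in for the per-layer hidden states returned by the detector model.
hidden_states = [rng.normal(size=(batch, seq_len, hidden)) for _ in range(n_layers)]

# Stand-in for a precomputed SVD basis: 3 orthonormal directions as rows.
q, _ = np.linalg.qr(rng.normal(size=(hidden, 3)))
svd_basis = q.T  # shape (3, hidden)

acts = last_token_activations(hidden_states, layer_indices=[1, 2])
z_coords = [project(svd_basis, a[0]) for a in acts]  # one 3-vector per layer
```

Each entry of `z_coords` is the low-dimensional coordinate vector that the codebook comparison in step 6 would then score.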
## Key Concepts

### Behavioral Alarm

A behavioral alarm is not a simple safe/unsafe binary. It contains:

- **Level**: `CLEAR`, `SUSPICIOUS`, or `DANGEROUS`
- **Score**: Continuous 0.0–1.0 composite score
- **Signals**: Per-dimension behavioral signal strengths
- **Dimensions**: Which SVD directions are anomalous and by how much

This multi-signal approach reflects that safety is multi-dimensional in
activation space (ICML 2025, Hidden Dimensions of LLM Alignment). An input
that shifts the refusal direction while also activating role-playing
dimensions is more suspicious than one that shifts only one dimension.

### Score Composition

The overall `Alarm.score` (0.0–1.0) is computed from per-dimension signals
using a weighted maximum:

```
score = max(w_d * signal_d for d in dimensions)
```

Here `w_d` are the dimension weights (default: equal, configurable in
`Thresholds.per_dimension`). Using `max` rather than `mean` ensures that a
single strongly anomalous dimension can trigger an alarm even if other
dimensions are normal. This is critical for catching attacks that exploit
specific behavioral patterns (e.g., refusal-suppression) while leaving other
dimensions unaffected.

The `suspicious` and `dangerous` thresholds are applied to this composite
score to determine `Alarm.level`.

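The weighted maximum and the threshold step can be sketched as follows. This is a minimal illustration, not the library's code: the function names `compose_score` and `level_for`, the threshold values 0.5 and 0.8, and the inclusive `>=` comparisons are all assumptions, and the sketch presumes every weight and signal lies in [0, 1] so that the score stays in that range.

```python
from enum import Enum


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


def compose_score(signals, weights=None):
    """Weighted maximum: max(w_d * signal_d) over all dimensions."""
    if not signals:
        raise ValueError("at least one dimension signal is required")
    if weights is None:
        weights = [1.0] * len(signals)  # default: equal weights
    if len(weights) != len(signals):
        raise ValueError("weights and signals must have the same length")
    return max(w * s for w, s in zip(weights, signals))


def level_for(score, suspicious=0.5, dangerous=0.8):
    """Map a composite score to a level (threshold values are illustrative)."""
    if score >= dangerous:
        return AlarmLevel.DANGEROUS
    if score >= suspicious:
        return AlarmLevel.SUSPICIOUS
    return AlarmLevel.CLEAR


# One strongly anomalous dimension among otherwise normal ones.
signals = [0.1, 0.9, 0.2]
score = compose_score(signals)
level = level_for(score)
```

With these signals the mean would be 0.4, below the illustrative `suspicious` threshold, while the maximum keeps the single anomalous dimension visible.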
### Alarm Levels

| Level | Meaning | Action |
|-------|---------|--------|
| `CLEAR` | Input exhibits normal behavioral patterns | Pass to target model |
| `SUSPICIOUS` | Some anomalous signals detected | Flag for review or apply additional checks |
| `DANGEROUS` | Strong behavioral anomaly in one or more dimensions | Block input or apply strong mitigations |

### Latency Budget

The firewall must complete screening in <10ms on commodity hardware
(ADR-003). The approximate per-step targets sum to about 6ms, leaving roughly
4ms of headroom:

| Step | Target Latency |
|------|----------------|
| Tokenization | ~0.5ms |
| Model inference (125M, CPU) | ~5ms |
| Activation extraction | ~0.1ms |
| SVD projection | ~0.1ms |
| Codebook comparison | ~0.3ms |
| **Total** | **~6ms** |

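One way to check a build against these targets is a small timing harness around each step. The sketch below is hypothetical: the step names, `time_steps`, and `over_target` are illustrative helpers, and the no-op callables stand in for the real pipeline stages.

```python
import time

# Per-step targets from the table above, in milliseconds.
TARGET_MS = {
    "tokenization": 0.5,
    "model_inference": 5.0,
    "activation_extraction": 0.1,
    "svd_projection": 0.1,
    "codebook_comparison": 0.3,
}
BUDGET_MS = 10.0  # overall screening budget


def time_steps(steps):
    """Run each (name, callable) pair once and return elapsed ms per step."""
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings[name] = (time.perf_counter() - start) * 1000.0
    return timings


def over_target(timings, targets=TARGET_MS):
    """Return the names of steps whose measured time exceeds its target."""
    return [name for name, ms in timings.items()
            if ms > targets.get(name, float("inf"))]


# Placeholder steps; a real harness would call the actual pipeline stages.
timings = time_steps([(name, lambda: None) for name in TARGET_MS])
slow_steps = over_target(timings)
within_budget = sum(timings.values()) < BUDGET_MS
```

A single run is noisy, so in practice the measurement would be repeated and summarized (for example by median) before comparing against the targets.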
## Interfaces

### Public API

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path

# DEFAULT_MODEL_REVISION and Thresholds are not defined in this snippet;
# see the configuration spec.


class AlarmLevel(Enum):
    CLEAR = "clear"
    SUSPICIOUS = "suspicious"
    DANGEROUS = "dangerous"


@dataclass
class DimensionSignal:
    dimension: int  # index of the SVD direction
    deviation: float
    score: float
    direction_label: str | None


@dataclass
class Alarm:
    level: AlarmLevel
    score: float  # composite score, 0.0–1.0
    signals: list[DimensionSignal]
    input_hash: str  # SHA-256 of raw input string (for logging/dedup)
    model_id: str
    timestamp: float


class Firewall:
    def __init__(
        self,
        model_id: str = "HuggingFaceTB/SmolLM2-135M",
        model_revision: str = DEFAULT_MODEL_REVISION,
        codebook_path: Path | None = None,
        thresholds: Thresholds | None = None,
        device: str = "cpu",
        cache_dir: str | None = None,
    ): ...

    def preload(self) -> None: ...

    def screen(self, input: str) -> Alarm: ...
```

> `screen_batch` is Phase 2 (see overview.md scope).

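The `input_hash` field could be computed with the standard library as sketched below. The helper name `compute_input_hash`, the choice of UTF-8 as the byte encoding, and the hex-digest output format are assumptions; the spec only states that the value is a SHA-256 of the raw input string.

```python
import hashlib


def compute_input_hash(raw_input: str) -> str:
    """Return the SHA-256 hex digest of the raw input (UTF-8 bytes assumed)."""
    return hashlib.sha256(raw_input.encode("utf-8")).hexdigest()


digest = compute_input_hash("Please summarize this document.")
```

Because the digest depends only on the input bytes, identical inputs map to the same value, which is what makes it usable for logging and deduplication.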
### Constraints

1. **No network calls during screening**: the only possible download happens
   when the model is loaded lazily, on the first `screen()` call or via an
   explicit `preload()`. Download never happens at import time. Once loaded,
   screening is entirely local.
2. **Synchronous API**: `screen()` is a blocking call. Async is Phase 2.
3. **No target model dependency**: the firewall has no access to the target
   LLM's internals. It runs its own detector model.
4. **Reproducible**: Same input + same model + same codebook = same alarm
   (apart from `timestamp`). Pin model revision and codebook version.

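The lazy-loading behavior in constraint 1 can be sketched with a small wrapper. `LazyDetector` and the injected `loader` callable are hypothetical names for illustration; the real `Firewall` would load the detector model and codebook instead of the toy callable used here.

```python
class LazyDetector:
    """Defers the expensive load until preload() or the first screen() call."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical callable that loads the model
        self._model = None     # nothing is loaded at construction time

    def preload(self):
        """Load the model once; later calls are no-ops."""
        if self._model is None:
            self._model = self._loader()

    def screen(self, text):
        """Ensure the model is loaded, then run it on the input."""
        self.preload()
        return self._model(text)


load_calls = []


def fake_loader():
    # Stand-in for a download plus model load; records each invocation.
    load_calls.append(1)
    return lambda text: len(text)


detector = LazyDetector(fake_loader)
calls_before = len(load_calls)       # constructing the wrapper loads nothing
detector.screen("hello")
detector.screen("world")
calls_after = len(load_calls)        # the loader ran exactly once
```

Calling `preload()` at service startup moves the one-time load cost out of the first request.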
## Error Handling

| Failure Mode | Exception Type | Behavior |
|-------------|---------------|----------|
| Model download fails (network) | `ModelDownloadError` | Raised from `preload()` or first `screen()`. User must retry. |
| Model not loaded when `screen()` called | `ModelNotLoadedError` | Raised if model loading was previously attempted and failed. |
| Corrupted codebook | `CodebookCorruptedError` | Raised at `Firewall.__init__` if codebook fails validation. |
| Codebook-model mismatch | `CodebookMismatchError` | Raised if codebook's `model_id` doesn't match loaded model. |
| Empty input | `ValueError` | Raised if input is empty string. |
| Non-UTF8 input | `ValueError` | Raised if input cannot be encoded to UTF-8. |
| Very long input | n/a | Truncated to model's max sequence length with a `UserWarning`. |
| Insufficient memory for model | `MemoryError` | Propagated from PyTorch. User must reduce model size or free memory. |

All library-defined exception types subclass `AlknetFirewallError` (base
library exception). `ValueError` and `MemoryError` are the built-in Python
exceptions.

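The exception hierarchy and the input checks from the table might look like the sketch below. The exception class names come from the table; the helpers `validate_text` and `truncate_ids`, the error messages, and the warning text are assumptions for illustration only.

```python
import warnings


class AlknetFirewallError(Exception):
    """Base class for library-defined errors."""


class ModelDownloadError(AlknetFirewallError): ...
class ModelNotLoadedError(AlknetFirewallError): ...
class CodebookCorruptedError(AlknetFirewallError): ...
class CodebookMismatchError(AlknetFirewallError): ...


def validate_text(text: str) -> None:
    """Reject empty input and input that cannot be encoded to UTF-8."""
    if text == "":
        raise ValueError("input must not be empty")
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:  # e.g. lone surrogate code points
        raise ValueError("input cannot be encoded to UTF-8") from exc


def truncate_ids(input_ids: list[int], max_len: int) -> list[int]:
    """Truncate token ids to max_len, emitting a UserWarning when cutting."""
    if len(input_ids) > max_len:
        warnings.warn(
            f"input truncated from {len(input_ids)} to {max_len} tokens",
            UserWarning,
            stacklevel=2,
        )
        return input_ids[:max_len]
    return input_ids
```

Callers can catch `AlknetFirewallError` to handle every library-specific failure in one place while letting the built-in exceptions propagate separately.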
## Design Decisions

| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-behavioral-signals.md) | Behavioral signals | Detect how models react, not what text says |
| [003](decisions/003-small-model-detector.md) | Small model detector | <10ms latency, CPU-deployable |
| [004](decisions/004-svd-based-detection.md) | SVD-based detection | Multi-dimensional, interpretable, efficient |
| [008](decisions/008-three-level-alarm.md) | Three-level alarm | CLEAR/SUSPICIOUS/DANGEROUS with continuous score |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-03**: Should the firewall support streaming/chunked input screening? (open)
- **OQ-05**: How should the firewall integrate with existing guardrail systems? (open)