alknet-firewall/docs/research/codebook-analysis/poc-architecture.md

# Research: PoC Codebook Architecture Analysis (OQ-02)

**Date**: 2026-06-13
**Status**: Complete
**Question**: What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed, and what is essential vs. exploratory/dead code?

---

## 1. PoC Architecture Overview

### 1.1 File Structure & Role

The PoC codebook lives in `firewall_codebook.py` (1,245 lines) and depends on three metaspline core modules:

```
firewall_codebook.py (1,245 lines)
├── Imports from metaspline core:
│   ├── metaspline.spline.SplineDistribution    (spline.py, 378 lines)
│   ├── metaspline.spline.ensure_strictly_increasing (spline.py)
│   ├── metaspline.space.unfold / fold           (space.py, 46 lines)
│   └── metaspline.transform.simplex             (transform.py, 78 lines)
├── External imports:
│   ├── sklearn.linear_model.LogisticRegression
│   └── sklearn.mixture.GaussianMixture (imported but unused)
└── Internal definitions (see §1.2)
```

### 1.2 Major Sections of `firewall_codebook.py`

| Lines | Component | Description |
|-------|-----------|-------------|
| 1–50 | Module docstring + imports | Theory overview, imports |
| 53–66 | `reverse_bary3d()` | Simplex → barycentric (u,v) transform |
| 69–74 | `bary_to_simplex()` | Inverse: barycentric → simplex |
| 77–112 | `DirectionProfile` dataclass | Per-contrast statistical profile |
| 114–127 | `DirectionClassifier` dataclass | Per-contrast logistic regression weights |
| 129–146 | `HistogramClassifier` dataclass | 2×2×2 codebook-state histogram classifier |
| 148–165 | `DetectionResult` dataclass | Output of `detect()` |
| 167–596 | `FirewallCodebook.__init__` + `build()` | Codebook construction (429 lines!) |
| 598–629 | `FirewallCodebook.decompose()` | z → (sum, u, v) copula transform |
| 631–669 | `FirewallCodebook.classify()` | Per-contrast logistic classification |
| 671–729 | `FirewallCodebook.classify_histogram()` | 8-state histogram classification |
| 731–860 | `FirewallCodebook.detect()` | Main detection entry point |
| 862–884 | `FirewallCodebook.detect_from_perturbations()` | Convenience: P → z → detect |
| 886–945 | `FirewallCodebook.summary()` | Human-readable summary |
| 947–1041 | `FirewallCodebook.evaluate_auc()` | AUC evaluation on held-out data |
| 1044–1118 | `build_codebook_from_precomputed()` | Load from saved .pt files |
| 1121–1245 | `__main__` block | Script-mode evaluation + duplicated data loading |

### 1.3 Dependency Map

```
                     ┌──────────────────┐
                     │ FirewallCodebook  │
                     │  (main class)     │
                     └────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──┐   ┌──────▼──────┐   ┌─────▼─────┐
    │ SplineDist  │   │ simplex()   │   │ bary3d()  │
    │ (CDF/ICDF) │   │ (transform) │   │ (local)   │
    └─────────┬──┘   └─────────────┘   └───────────┘
              │
    ┌─────────▼──────────────┐
    │ MonotonicCubicSpline   │
    │ (pchip interpolation)   │
    └────────────────────────┘
```

The `FirewallCodebook` has these hard dependencies at runtime:
1. **SplineDistribution** — CDF/ICDF transforms (population fitting + inference)
2. **simplex()** — normalize to simplex (x/sum(x))
3. **reverse_bary3d()** — project simplex to 2D barycentric coordinates
4. **torch** — tensor operations
5. **numpy** — imported at module level, but used only as the sklearn bridge during training (not on the inference path)

Training-time dependencies (not needed at inference):
- **sklearn.linear_model.LogisticRegression** — classifier training
- **sklearn.metrics.silhouette_score** — profile quality metric
- **sklearn.metrics.roc_auc_score** — evaluation metric

---

## 2. Essential vs. Exploratory vs. Dead Code Classification

### 2.1 Essential (Required for Production Codebook)

These are the core components that must be extracted into the production package:

| Component | Lines | Role | Production Mapping |
|-----------|-------|------|-------------------|
| `reverse_bary3d()` | 53–66 | z → (u,v) barycentric projection | `codebook/transforms.py` |
| `bary_to_simplex()` | 69–74 | Inverse barycentric (needed for reconstruction) | `codebook/transforms.py` |
| `SplineDistribution` | spline.py:200–261 | CDF/ICDF for copula transform | `codebook/splines.py` (adapted) |
| `MonotonicCubicSpline` | spline.py:80–197 | PCHIP interpolation engine | `codebook/splines.py` (adapted) |
| `ensure_strictly_increasing` | spline.py:43–73 | Knot sanitization | `codebook/splines.py` |
| `simplex()` | transform.py:34–36 | Normalize to unit simplex | `codebook/transforms.py` |
| `FirewallCodebook.__init__` | 182–203 | State initialization | `codebook/codebook.py` |
| `FirewallCodebook.decompose()` | 598–629 | z → (sum, u, v) copula space | `codebook/projection.py` |
| `FirewallCodebook.detect()` | 731–860 | Main detection logic | `codebook/detection.py` |
| `DetectionResult` | 148–165 | Output dataclass | `codebook/results.py` |
| `FirewallCodebook.build()` (core logic only) | 204–396 | SVD, spline fitting, profile computation | `training/compiler.py` |
| `DirectionProfile` | 77–112 | Per-direction statistical profile | `codebook/profiles.py` |
| `DirectionClassifier` | 114–127 | Per-direction linear classifier | `codebook/classifiers.py` |
| `FirewallCodebook.detect_from_perturbations()` | 862–884 | P → z convenience wrapper | `codebook/projection.py` |

**Total essential lines**: ~480 lines in `firewall_codebook.py`, plus ~178 lines of metaspline core (see §4.1), for ~658 lines overall

### 2.2 Exploratory / Research Code

These were useful for research but are **not needed** in production:

| Component | Lines | Purpose | Disposition |
|-----------|-------|---------|-------------|
| `HistogramClassifier` dataclass | 129–146 | Alternative 2×2×2 discretized classifier | Keep as optional, not MVP |
| `classify_histogram()` | 671–729 | Histogram-based classification variant | Research variant, not MVP |
| `build()` histogram classifier section | 481–596 | Training histogram classifiers | Research variant |
| `evaluate_auc()` | 947–1041 | Offline AUC evaluation | Testing/benchmarking only |
| `summary()` | 886–945 | Human-readable codebook summary | Debugging/diagnostic tool |
| `classify()` | 631–669 | Per-position probability output | Subsumed by `detect()` |
| `build_codebook_from_precomputed()` | 1044–1118 | Load from .pt files | Training pipeline I/O |
| `build()` contrast_pairs default | 268–276 | Hardcoded 7-pair contrast list | Config, not code |
| `pooled_std()` inner function | 327–331 | Statistical utility | Extract to `training/stats.py` |
| `cohen_d()` inner function | 337–340 | Effect size utility | Extract to `training/stats.py` |
| `compute_silhouette()` inner function | 365–370 | Quality metric | Training diagnostic |
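
The two statistical helpers slated for `training/stats.py` can be sketched at module level as below. This is a minimal sketch using the textbook definitions (unbiased sample variances, pooled by degrees of freedom); whether the PoC's inner functions use exactly this variance convention is an assumption, and plain Python lists stand in for the tensors the PoC works with.

```python
import math
from statistics import mean, variance


def pooled_std(a, b):
    """Pooled standard deviation of two samples (textbook definition)."""
    na, nb = len(a), len(b)
    # Weight each unbiased sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return math.sqrt(pooled_var)


def cohen_d(a, b):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    s = pooled_std(a, b)
    # Assumption: return 0.0 for degenerate (zero-variance) inputs.
    return (mean(a) - mean(b)) / s if s > 0 else 0.0
```

Keeping these as pure functions of two samples makes them testable without constructing a codebook.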

### 2.3 Dead Code

| Component | Lines | Issue |
|-----------|-------|-------|
| `sklearn.mixture.GaussianMixture` import | 44 | Imported but never used |
| `unfold()` / `fold()` from `space.py` | space.py:4–45 | Imported but never called in codebook |
| `dcs_norm()` from `transform.py` | transform.py:20–23 | Imported but never used |
| `__main__` block duplicated data loading | 1121–1245 | Lines 1203–1245 repeat the data loading of 1126–1182, differing only in formatting — copy-paste artifact |
| `bary_to_simplex()` | 69–74 | Defined but never called in the PoC codebook; §2.1 retains it only because reconstruction needs the inverse |
| `DensitySpline` class | spline.py:315–378 | Legacy alternative, not used by codebook |
| `empirical_cdf()` / `empirical_density()` / `log_bins()` / `generate_asymmetric_knots()` | spline.py:268–313 | Utility functions not used by codebook |

### 2.4 Infrastructure (Training Pipeline, Not Runtime)

| Component | Size / Lines | Purpose |
|-----------|--------------|---------|
| `run_manifold_projection.py` (entire) | 823 lines | Model loading, data collection, SVD computation, saving artifacts |
| `analyzer.py` (entire) | 560 lines | Multi-layer direction analysis, residual extraction |
| `discover_directions.py` (entire) | 401 lines | Post-hoc direction discovery from trajectory data |
| `build()` SVD computation section | lines 229–233 | Population SVD → V3 basis |

---

## 3. Training Pipeline Analysis

### 3.1 `run_manifold_projection.py` — Step by Step

The training pipeline performs these operations:

1. **Model Loading** (L79–103): Load HuggingFace model + tokenizer. Configure for GPU/CPU.

2. **Condition Catalog Construction** (L106–153): Build contrastive prompt sets for 8 behavioral contrasts (condition pairs):
   - self_ref / other_ref
   - violated / expected (semantic)
   - code_violated / code_expected
   - instruction / data
   - tool_call / natural_language
   - uncertain / confident
   - harmful / harmless
   - injection / benign_instruction

3. **Feature Extraction** (L156–213): For each condition, extract:
   - Hidden states across all layers → `residuals` (n_prompts, n_layers+1, hidden_dim)
   - ICDF perturbation vectors → `perturbations` (n_prompts, 64)
   - Last-layer hidden states → `hidden_last` (n_prompts, hidden_dim)

4. **SVD Computation** (L216–263):
   - Activation SVD: `H_all` (N, 2048) → principal components in hidden state space
   - Perturbation SVD: `P_all` (N, 64) → the **3D perturbation manifold** (this is the basis V3)

5. **Direction Vector Computation** (L393–434): Per-contrast mean-difference direction vectors at best layers.

6. **Projection Analysis** (L436–668): Extensive analysis of direction projections onto activation/perturbation subspaces. **This is research output, not needed for codebook compilation.**

7. **Save Results** (L670–755):
   - `.json`: Scalar metrics, SVD variance, separation stats
   - `.pt`: Tensors — **this is the key artifact**:
     - `perturbation_svd_Vh` → top-k right-singular vectors (the SVD basis)
     - `perturbation_mean` → population mean for centering
     - `condition_perturbations` → per-condition perturbation vectors
     - `condition_hidden_last` → last-layer hidden states per condition

### 3.2 Codebook Artifact Production

The `.pt` file from `run_manifold_projection.py` feeds directly into `build_codebook_from_precomputed()`, which:

1. Loads `.pt` file → extracts `perturbation_svd_Vh[:3]` (V3 basis) and `perturbation_mean` (P_mean)
2. Reconstructs z-coords: `z = (P - P_mean) @ V3.T`
3. Calls `FirewallCodebook.build()` which:
   - Fits SplineDistribution on each z dimension (population)
   - Fits SplineDistribution on sums (population)
   - Decomposes each condition via CDF → (sum, u, v)
   - Computes DirectionProfiles (pooled stats, Cohen's d, thresholds)
   - Trains DirectionClassifiers (logistic regression per contrast)
   - Trains HistogramClassifiers (8-state discrete classifiers)
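
Step 2 above, `z = (P - P_mean) @ V3.T`, is just centering followed by three dot products per row. A minimal sketch follows, assuming `P` is a list of rows, `P_mean` a single row, and `V3` the three basis rows; the PoC operates on torch tensors, and plain lists are used here only to keep the example dependency-free.

```python
def project(P, P_mean, V3):
    """Compute (P - P_mean) @ V3.T for list-of-rows inputs.

    P: N rows of length D; P_mean: length D; V3: 3 rows of length D.
    Returns N rows of 3 z-coordinates.
    """
    z = []
    for row in P:
        # Center the perturbation vector on the population mean.
        centered = [x - m for x, m in zip(row, P_mean)]
        # One dot product per basis vector gives one z-coordinate.
        z.append([sum(c * b for c, b in zip(centered, basis)) for basis in V3])
    return z
```

With D = 64 this reproduces the shape flow described above: `(N, 64)` perturbations in, `(N, 3)` z-coordinates out.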

**The produced codebook artifacts map to the production spec as:**

| PoC Artifact | Production Format | Notes |
|---|---|---|
| `FirewallCodebook.z_splines` (3× SplineDistribution) | `splines.json` (knot positions + coefficients) | Spline knots serialized as JSON arrays |
| `FirewallCodebook.svd_V3` (3×64 tensor) | `basis.safetensors` → `basis_vectors` | Reshaped for multi-layer format |
| `FirewallCodebook.population_mean_P` (64 tensor) | `basis.safetensors` → `mean` | Centering vector |
| `FirewallCodebook.direction_profiles` (dict) | `regions.safetensors` → centroids, scale | Per-direction statistical profiles |
| `FirewallCodebook.classifiers` (dict) | Part of `config.json` or `regions.safetensors` | Logistic weights (3 floats + intercept per direction) |
| `FirewallCodebook.sum_spline` (SplineDistribution) | `splines.json` | Sum distribution spline |
| `FirewallCodebook.population_stats` (dict) | `regions.safetensors` → centroids, scale | Population baselines |
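
To make the `splines.json` row concrete, here is one possible round-trip for spline parameters. The schema is hypothetical: the `SplineRecord` class, its field names (`name`, `knots`, `coefficients`), and the top-level `"splines"` key are illustrative assumptions, not the format defined in the production spec.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SplineRecord:
    """Hypothetical serialized form of one fitted spline."""
    name: str            # e.g. "z0", "z1", "z2", "sum" (assumed naming)
    knots: list          # knot positions as plain floats
    coefficients: list   # per-knot coefficients as plain floats


def dump_splines(records):
    """Serialize spline records to a JSON string with stable key order."""
    return json.dumps({"splines": [asdict(r) for r in records]}, sort_keys=True)


def load_splines(text):
    """Inverse of dump_splines()."""
    return [SplineRecord(**item) for item in json.loads(text)["splines"]]
```

A round-trip equality check on records like these is a cheap first regression test for the serialization layer.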

---

## 4. Core Library Assessment

### 4.1 Metaspline Core Usage

The metaspline core (`spline.py` 378 lines, `transform.py` 78 lines, `space.py` 46 lines — 502 lines total) provides:

| Module | Lines | Used by Codebook | Lines Actually Used |
|--------|-------|-------------------|---------------------|
| `spline.py` | 378 | `SplineDistribution`, `ensure_strictly_increasing` | ~175 lines (SplineDistribution + MonotonicCubicSpline + ensure_strictly_increasing) |
| `transform.py` | 78 | `simplex()` only | 3 lines |
| `space.py` | 46 | None (imported but unused) | 0 lines |

**Actual dependency: ~178 lines out of 502.** The codebook uses only `SplineDistribution` (CDF/ICDF), `MonotonicCubicSpline` (its backbone), `ensure_strictly_increasing`, and `simplex()`. The following are unused:

- `DensitySpline` class (spline.py, 60 lines) — legacy CDF-based distribution, not used
- `empirical_cdf()`, `empirical_density()`, `log_bins()`, `generate_asymmetric_knots()` (spline.py, ~45 lines) — utility functions, unused
- `unfold()` / `fold()` (space.py, 46 lines) — digit expansion/contraction, unused
- `double_cumsum()`, `double_diff()`, `dcs_norm()`, `normalize_01()`, `column_cdf_normalize()`, `toBase()`, `numSymbols()`, `ndVec()` (transform.py, ~75 lines) — unused

### 4.2 How Much Is Inline vs. Library?

The `FirewallCodebook.build()` method has **significant inline reimplementation** of statistical operations that could be cleaner:

- **Lines 229–233**: SVD computation is inline (should use the pipeline's `compute_perturbation_svd()`)
- **Lines 236–246**: Spline fitting is inline but delegates to `SplineDistribution`
- **Lines 313–324**: CDF → simplex → barycentric decomposition is inline and duplicated five times (sites enumerated below)
- **Lines 327–340**: `pooled_std()` and `cohen_d()` are inner functions, not module-level
- **Lines 365–370**: `compute_silhouette()` is an inner function with sklearn import

The core decomposition pipeline (z → CDF → simplex → barycentric → (sum, u, v)) appears **verbatim** in:
1. `build()` lines 242–250 (population)
2. `build()` lines 313–324 (per-condition, profile computation)
3. `build()` lines 445–456 (per-condition, classifier computation)
4. `build()` lines 521–532 (per-condition, histogram computation)
5. `decompose()` lines 610–628 (runtime inference)

This is the **single most compressible pattern** — a 10-line decomposition sequence repeated 5 times.
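
A single shared function for this sequence might look like the sketch below. Several details are assumptions: the CDFs are passed in as plain callables (standing in for the fitted `SplineDistribution` objects), the returned sum is the raw sum of CDF values (the PoC may transform it further through its sum spline), and `to_barycentric_2d()` uses a standard equilateral-triangle embedding that may differ from the convention in `reverse_bary3d()`.

```python
import math


def simplex(x):
    """Normalize positive values so they sum to 1 (x / sum(x))."""
    total = sum(x)
    return [v / total for v in x]


def to_barycentric_2d(p):
    """ASSUMED embedding with triangle corners (0,0), (1,0), (0.5, sqrt(3)/2).
    The PoC's reverse_bary3d() may use a different convention."""
    u = p[1] + 0.5 * p[2]
    v = (math.sqrt(3) / 2) * p[2]
    return u, v


def decompose(z, cdfs):
    """z -> CDF -> simplex -> barycentric -> (sum, u, v) for one 3D point."""
    # Map each z-coordinate through its own CDF, giving values in (0, 1).
    c = [cdf(zi) for cdf, zi in zip(cdfs, z)]
    total = sum(c)
    u, v = to_barycentric_2d(simplex(c))
    return total, u, v
```

Each of the five call sites would then reduce to one call, with the splines supplied as arguments rather than read from object state.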

---

## 5. Minimum Viable Codebook

### 5.1 Required Functions for Production

Based on the production spec (`codebook.md`), the minimum viable codebook needs:

1. **`project(activations) → z_coords`**: SVD projection (matrix multiply + centering)
2. **`decompose(z_coords) → (sum, u, v)`**: CDF → simplex → barycentric
3. **`score(z_coords) → list[DimensionSignal]`**: Per-direction scoring against profiles
4. **`detect(z_coords, threshold) → DetectionResult`**: Threshold comparison + flagging
5. **`load(path) → Codebook`**: Deserialize from safetensors + JSON
6. **SplineDistribution**: CDF evaluation for decompose

And for the **training pipeline** (not runtime):
7. **`build(population_data, direction_data) → Codebook`**: SVD, spline fitting, classifier training

### 5.2 Compression Estimate

| Source | Lines | Classification | Production Lines |
|--------|-------|----------------|------------------|
| `firewall_codebook.py` | 1,245 | Core + research + dead | ~350 |
| `spline.py` (used parts) | ~175 | Core library | ~180 |
| `transform.py` (used parts) | ~3 | Core library | ~5 |
| **Total PoC dependency** | **~1,423** | | **~535** |

**Target estimate: 400–500 lines for runtime codebook, 150–200 lines for training pipeline.**

Breakdown of production targets:

| Module | Target Lines | Contents |
|--------|-------------|----------|
| `codebook/transforms.py` | ~30 | `simplex()`, `reverse_bary3d()`, `bary_to_simplex()` |
| `codebook/splines.py` | ~180 | `MonotonicCubicSpline`, `SplineDistribution`, `ensure_strictly_increasing` |
| `codebook/profiles.py` | ~30 | `DirectionProfile` dataclass |
| `codebook/classifiers.py` | ~20 | `DirectionClassifier` dataclass |
| `codebook/results.py` | ~15 | `DetectionResult` dataclass |
| `codebook/projection.py` | ~30 | `project()` and `decompose()` |
| `codebook/detection.py` | ~50 | `detect()` with rolling window, threshold logic |
| `codebook/codebook.py` | ~40 | `Codebook` class (init, load, summary) |
| `training/compiler.py` | ~150 | `build()` — SVD, spline fitting, profile computation |
| `training/stats.py` | ~25 | `pooled_std()`, `cohen_d()`, silhouette |
| **Total** | **~570** | |

This is **46% of the PoC's 1,245 lines**, or **~33% of the ~1,745 lines** (1,245 PoC + 502 metaspline core) when the full metaspline core is counted.

### 5.3 What Gets Cut

| Lines Cut | Source | Reason |
|-----------|--------|--------|
| ~130 | `HistogramClassifier` + `classify_histogram()` + histogram training | Alternative approach, not MVP |
| ~95 | `evaluate_auc()` | Testing/benchmarking tool |
| ~60 | `summary()` | Debugging tool, not runtime |
| ~125 | `__main__` block (L1121–1245, including duplicated code) | Script-mode evaluation |
| ~40 | `classify()` method | Subsumed by `detect()` |
| ~75 | `build_codebook_from_precomputed()` (L1044–1118) | Training I/O, not runtime |
| ~324 | Unused metaspline code (DensitySpline, unfold/fold, dcs_norm, etc.; 502 − 178 lines) | Dead code |
| ~50 | Repeated decomposition sequences | DRY refactoring |

---

## 6. Proposed Decomposition

Matching the production package structure from `codebook.md`:

```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py            # Public exports
│   ├── codebook.py            # Codebook class (init, load, project, score)
│   ├── transforms.py          # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py             # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py            # DirectionProfile, population stats
│   ├── classifiers.py         # DirectionClassifier (logistic weights)
│   ├── results.py             # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py          # project(), decompose()
│   └── detection.py           # detect(), threshold comparison, rolling window
├── training/
│   ├── __init__.py
│   ├── compiler.py            # build() — SVD, spline fitting, profile comp
│   ├── stats.py               # pooled_std, cohen_d, silhouette
│   └── data_loader.py         # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json
```

### 6.1 Key Design Decisions for Extraction

1. **SplineDistribution stays in `codebook/splines.py`** — it's a general-purpose distribution class used at both training and inference time. No need for a separate package.

2. **`simplex()` moves to `codebook/transforms.py`** — it's a single pure function (3 lines), no need for the `transform.py` dependency chain.

3. **`unfold`/`fold` from `space.py` are dropped** — never used by the codebook.

4. **`DirectionProfile` and `DirectionClassifier` become separate dataclass modules** — clean separation of data from logic.

5. **`build()` moves entirely to `training/compiler.py`** — runtime codebook is read-only. This is the biggest architectural change: the codebook class should not have a `build()` classmethod.

6. **Decompose becomes a pure function** — `decompose(z, splines)` is a pure mathematical transform with no state dependencies beyond the splines. Making it a standalone function enables testing.

7. **Detection is separate from the codebook class** — `detect(z, classifiers, profiles, threshold)` is a stateless function given the codebook data. This enables swapping detection strategies without touching the codebook.
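
Decisions 5 to 7 together imply immutable data plus stateless functions. The sketch below illustrates that shape under stated assumptions: `ClassifierWeights` and its fields are hypothetical stand-ins for the serialized logistic weights (3 floats plus an intercept per direction), the input features are assumed to be the `(sum, u, v)` triple, and profile-based logic is omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierWeights:
    """Hypothetical read-only record: 3 logistic weights plus an intercept."""
    name: str
    weights: tuple  # assumed order: (w_sum, w_u, w_v)
    intercept: float


def score(features, clf):
    """Logistic probability for one (sum, u, v) feature triple."""
    logit = clf.intercept + sum(w * x for w, x in zip(clf.weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


def detect(features, classifiers, threshold=0.5):
    """Return {direction name: (probability, flagged)}; no hidden state."""
    out = {}
    for clf in classifiers:
        p = score(features, clf)
        out[clf.name] = (p, p >= threshold)
    return out
```

Because `detect()` takes everything it needs as arguments, an alternative strategy (for example the histogram variant) can be swapped in without modifying the data classes.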

---

## 7. Testing Data

### 7.1 Saved Artifacts Referenced in Code

The PoC references these saved data files:

| File | Path | Contents | Reusable for Testing |
|------|------|----------|---------------------|
| Population precomputed | `saved_data/precomputed_seed42_qwen3_0.6b.pt` | z_coords, P_mean, perturbation_svd_Vh | Yes — basis for integration tests |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_1.7b.pt` | Same for 1.7B model | Yes — multi-model test |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_4b.pt` | Same for 4B model | Yes — multi-model test |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-0.6B_manifold_projection.pt` | Full condition data + SVD | Yes — golden data for codebook compilation |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-1.7B_manifold_projection.pt` | Same for 1.7B | Yes |
| Contrast pairs | Hardcoded in `build()` L268–276 and `run_manifold_projection.py` L139–148 | 7 behavioral contrasts | Yes — test fixture definition |

### 7.2 Validation Results Referenced

The `__main__` block (L1121–1245) contains:
- AUC evaluation at window sizes [1, 4, 8, 16]
- Per-direction AUC scores for both continuous and histogram classifiers
- Per-token AUC evaluation

These results should be captured as **golden test fixtures** for the production codebook:
- Build a codebook from the 0.6B precomputed data
- Verify that AUC scores match expected ranges
- Verify that detection decisions match expected flags
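
A golden-fixture check of this kind needs only a windowed score and an AUC computation. The sketch below is an assumption-laden illustration: it treats the window as a trailing mean over per-token scores (the PoC's windowing may differ), computes AUC by direct pairwise comparison rather than calling `sklearn.metrics.roc_auc_score`, and `check_golden()` with its `lo`/`hi` bounds is a hypothetical helper.

```python
def rolling_mean(scores, window):
    """Trailing mean over the last `window` scores (shorter at the start)."""
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def auc(pos, neg):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count one half; this equals the normalized Mann-Whitney U statistic."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def check_golden(pos, neg, window, lo, hi):
    """True if the windowed AUC falls inside the expected [lo, hi] range."""
    value = auc(rolling_mean(pos, window), rolling_mean(neg, window))
    return lo <= value <= hi
```

Asserting a range rather than an exact value keeps the fixture robust to small numerical differences between the PoC and the port.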

### 7.3 Calibration Data for Testing

For unit/integration tests, we need:

1. **Synthetic z-coord population**: Small N=1000 tensor for spline fitting tests
2. **Known-contrast z-coords**: Small pairs (harmful/harmless) for direction profile tests
3. **Expected spline parameters**: Known knot positions/coefficients for regression tests
4. **Expected detection results**: For a given input, what does `detect()` return?

The PoC's `build_codebook_from_precomputed()` provides a ready-made path to generate these fixtures from the saved `.pt` files.

---

## Summary

### Key Findings

1. **The 1,245-line PoC contains ~480 lines of essential code**. Including the metaspline core dependency (~178 lines used), the total essential code is ~658 lines. With dead code and research artifacts removed, the production codebook should target **400–500 lines** for runtime + **150–200 lines** for training.

2. **The decomposition pipeline (z → CDF → simplex → bary → (sum,u,v)) is repeated 5 times** in the PoC. Extracting it into a single `decompose()` function saves ~50 lines and eliminates a bug surface.

3. **The metaspline core has ~65% unused code** when viewed from the codebook's perspective. Only `SplineDistribution`, `MonotonicCubicSpline`, `ensure_strictly_increasing`, and `simplex()` are needed — the rest (DensitySpline, unfold/fold, dcs_norm, etc.) can be dropped entirely.

4. **The histogram classifier (2×2×2 discretized approach) is an exploratory alternative**, not the primary detection mechanism. The continuous logistic classifier is superior (higher AUC) and should be the MVP approach. The histogram classifier adds ~130 lines and can be deferred.

5. **The `build()` method is the largest single function (~390 lines, L204–596; 429 counting the class header and `__init__`)** and mixes training with runtime state. It must be decomposed: training logic moves to `training/compiler.py`, runtime state becomes immutable serialized data.

6. **Saved `.pt` files from the PoC provide golden test data** — the manifold projection results for Qwen3-0.6B and 1.7B can be reused directly for integration tests.

### Recommendation

**Target: 500–600 lines total** for the production codebook (runtime + training), down from 1,245 lines in the PoC and 1,745 lines including metaspline core. Measured against the 1,745-line total, this is a **~65% reduction**.

The architecture should separate:
- **Runtime** (~400 lines): `Codebook`, transforms, splines, detection, results
- **Training** (~150 lines): compiler, stats, data loading
- **Data** (bundled): safetensors + JSON, no Python

### Next Steps

1. Create `src/alknet_firewall/codebook/` package structure
2. Extract `transforms.py` (simplex, barycentric) — trivial, ~30 lines
3. Port `splines.py` (MonotonicCubicSpline + SplineDistribution) — ~180 lines, mostly copy with cleanup
4. Implement `projection.py` (project, decompose) — thin wrappers, ~30 lines
5. Implement `detection.py` (detect with rolling window) — ~50 lines, port from PoC's detect()
6. Implement `codebook.py` (Codebook class with load) — ~40 lines
7. Extract `training/compiler.py` from `build()` — most complex extraction, ~150 lines
8. Create test fixtures from saved `.pt` data
9. Verify round-trip: build from .pt → serialize → load → detect matches PoC output