# Research: PoC Codebook Architecture Analysis (OQ-02)

**Date**: 2026-06-13
**Status**: Complete
**Question**: What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed, and what is essential vs. exploratory/dead code?

---

## 1. PoC Architecture Overview

### 1.1 File Structure & Role

The PoC codebook lives in `firewall_codebook.py` (1,245 lines) and depends on three metaspline core modules:

```
firewall_codebook.py (1,245 lines)
├── Imports from metaspline core:
│   ├── metaspline.spline.SplineDistribution (spline.py, 378 lines)
│   ├── metaspline.spline.ensure_strictly_increasing (spline.py)
│   ├── metaspline.space.unfold / fold (space.py, 46 lines)
│   └── metaspline.transform.simplex (transform.py, 78 lines)
├── External imports:
│   ├── sklearn.linear_model.LogisticRegression
│   └── sklearn.mixture.GaussianMixture (imported but unused)
└── Internal definitions (see §1.2)
```

### 1.2 Major Sections of `firewall_codebook.py`

| Lines | Component | Description |
|-------|-----------|-------------|
| 1–50 | Module docstring + imports | Theory overview, imports |
| 53–66 | `reverse_bary3d()` | Simplex → barycentric (u,v) transform |
| 69–74 | `bary_to_simplex()` | Inverse: barycentric → simplex |
| 77–112 | `DirectionProfile` dataclass | Per-contrast statistical profile |
| 114–127 | `DirectionClassifier` dataclass | Per-contrast logistic regression weights |
| 129–146 | `HistogramClassifier` dataclass | 2×2×2 codebook-state histogram classifier |
| 148–165 | `DetectionResult` dataclass | Output of `detect()` |
| 167–596 | `FirewallCodebook.__init__` + `build()` | Codebook construction (429 lines!) |
| 598–629 | `FirewallCodebook.decompose()` | z → (sum, u, v) copula transform |
| 631–669 | `FirewallCodebook.classify()` | Per-contrast logistic classification |
| 671–729 | `FirewallCodebook.classify_histogram()` | 8-state histogram classification |
| 731–860 | `FirewallCodebook.detect()` | Main detection entry point |
| 862–884 | `FirewallCodebook.detect_from_perturbations()` | Convenience: P → z → detect |
| 886–945 | `FirewallCodebook.summary()` | Human-readable summary |
| 947–1041 | `FirewallCodebook.evaluate_auc()` | AUC evaluation on held-out data |
| 1044–1118 | `build_codebook_from_precomputed()` | Load from saved .pt files |
| 1121–1245 | `__main__` block | Script-mode evaluation + duplicated data loading |

### 1.3 Dependency Map

```
                 ┌──────────────────┐
                 │ FirewallCodebook │
                 │   (main class)   │
                 └────────┬─────────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
┌─────────▼──┐     ┌──────▼──────┐  ┌─────▼─────┐
│ SplineDist │     │  simplex()  │  │ bary3d()  │
│ (CDF/ICDF) │     │ (transform) │  │  (local)  │
└─────────┬──┘     └─────────────┘  └───────────┘
          │
┌─────────▼──────────────┐
│  MonotonicCubicSpline  │
│ (pchip interpolation)  │
└────────────────────────┘
```

The `FirewallCodebook` has these hard dependencies at runtime:

1. **SplineDistribution**: CDF/ICDF transforms (population fitting + inference)
2. **simplex()**: normalize to simplex (x/sum(x))
3. **reverse_bary3d()**: project simplex to 2D barycentric coordinates
4. **torch**: tensor operations
5. **numpy**: sklearn bridge for training only

Training-time dependencies (not needed at inference):

- **sklearn.linear_model.LogisticRegression**: classifier training
- **sklearn.metrics.silhouette_score**: profile quality metric
- **sklearn.metrics.roc_auc_score**: evaluation metric

---

## 2. Essential vs. Exploratory vs. Dead Code Classification

### 2.1 Essential (Required for Production Codebook)

These are the core components that must be extracted into the production package:

| Component | Lines | Role | Production Mapping |
|-----------|-------|------|-------------------|
| `reverse_bary3d()` | 53–66 | z → (u,v) barycentric projection | `codebook/transforms.py` |
| `bary_to_simplex()` | 69–74 | Inverse barycentric (needed for reconstruction) | `codebook/transforms.py` |
| `SplineDistribution` | spline.py:200–261 | CDF/ICDF for copula transform | `codebook/splines.py` (adapted) |
| `MonotonicCubicSpline` | spline.py:80–197 | PCHIP interpolation engine | `codebook/splines.py` (adapted) |
| `ensure_strictly_increasing` | spline.py:43–73 | Knot sanitization | `codebook/splines.py` |
| `simplex()` | transform.py:34–36 | Normalize to unit simplex | `codebook/transforms.py` |
| `FirewallCodebook.__init__` | 182–203 | State initialization | `codebook/codebook.py` |
| `FirewallCodebook.decompose()` | 598–629 | z → (sum, u, v) copula space | `codebook/projection.py` |
| `FirewallCodebook.detect()` | 731–860 | Main detection logic | `codebook/detection.py` |
| `DetectionResult` | 148–165 | Output dataclass | `codebook/results.py` |
| `FirewallCodebook.build()` (core logic only) | 204–396 | SVD, spline fitting, profile computation | `training/compiler.py` |
| `DirectionProfile` | 77–112 | Per-direction statistical profile | `codebook/profiles.py` |
| `DirectionClassifier` | 114–127 | Per-direction linear classifier | `codebook/classifiers.py` |
| `FirewallCodebook.detect_from_perturbations()` | 862–884 | P → z convenience wrapper | `codebook/projection.py` |

**Total essential lines**: ~480 lines (including metaspline core)

### 2.2 Exploratory / Research Code

These were useful for research but are **not needed** in production:

| Component | Lines | Purpose | Disposition |
|-----------|-------|---------|-------------|
| `HistogramClassifier` dataclass | 129–146 | Alternative 2×2×2 discretized classifier | Keep as optional, not MVP |
| `classify_histogram()` | 671–729 | Histogram-based classification variant | Research variant, not MVP |
| `build()` histogram classifier section | 481–596 | Training histogram classifiers | Research variant |
| `evaluate_auc()` | 947–1041 | Offline AUC evaluation | Testing/benchmarking only |
| `summary()` | 886–945 | Human-readable codebook summary | Debugging/diagnostic tool |
| `classify()` | 631–669 | Per-position probability output | Subsumed by `detect()` |
| `build_codebook_from_precomputed()` | 1044–1118 | Load from .pt files | Training pipeline I/O |
| `build()` contrast_pairs default | 268–276 | Hardcoded 7-pair contrast list | Config, not code |
| `pooled_std()` inner function | 327–331 | Statistical utility | Extract to `training/stats.py` |
| `cohen_d()` inner function | 337–340 | Effect size utility | Extract to `training/stats.py` |
| `compute_silhouette()` inner function | 365–370 | Quality metric | Training diagnostic |

### 2.3 Dead Code

| Component | Lines | Issue |
|-----------|-------|-------|
| `sklearn.mixture.GaussianMixture` import | 44 | Imported but never used |
| `unfold()` / `fold()` from `space.py` | space.py:4–45 | Imported but never called in codebook |
| `dcs_norm()` from `transform.py` | transform.py:20–23 | Imported but never used |
| `__main__` block duplicated data loading | 1121–1245 | Lines 1203–1245 repeat 1126–1182 verbatim with different formatting (copy-paste artifact) |
| `bary_to_simplex()` | 69–74 | Defined but never called in codebook |
| `DensitySpline` class | spline.py:315–378 | Legacy alternative, not used by codebook |
| `empirical_cdf()` / `empirical_density()` / `log_bins()` / `generate_asymmetric_knots()` | spline.py:268–313 | Utility functions not used by codebook |

### 2.4 Infrastructure (Training Pipeline, Not Runtime)

| Component | Lines | Purpose |
|-----------|-------|---------|
| `run_manifold_projection.py` (entire) | 823 | Model loading, data collection, SVD computation, saving artifacts |
| `analyzer.py` (entire) | 560 | Multi-layer direction analysis, residual extraction |
| `discover_directions.py` (entire) | 401 | Post-hoc direction discovery from trajectory data |
| `build()` SVD computation section | 229–233 | Population SVD → V3 basis |

---

## 3. Training Pipeline Analysis

### 3.1 `run_manifold_projection.py`: Step by Step

The training pipeline performs these operations:

1. **Model Loading** (L79–103): Load HuggingFace model + tokenizer. Configure for GPU/CPU.
2. **Condition Catalog Construction** (L106–153): Build contrastive prompt sets for 8 behavioral conditions:
   - self_ref / other_ref
   - violated / expected (semantic)
   - code_violated / code_expected
   - instruction / data
   - tool_call / natural_language
   - uncertain / confident
   - harmful / harmless
   - injection / benign_instruction
3. **Feature Extraction** (L156–213): For each condition, extract:
   - Hidden states across all layers → `residuals` (n_prompts, n_layers+1, hidden_dim)
   - ICDF perturbation vectors → `perturbations` (n_prompts, 64)
   - Last-layer hidden states → `hidden_last` (n_prompts, hidden_dim)
4. **SVD Computation** (L216–263):
   - Activation SVD: `H_all` (N, 2048) → principal components in hidden state space
   - Perturbation SVD: `P_all` (N, 64) → the **3D perturbation manifold** (this is the basis V3)
5. **Direction Vector Computation** (L393–434): Per-contrast mean-difference direction vectors at best layers.
6. **Projection Analysis** (L436–668): Extensive analysis of direction projections onto activation/perturbation subspaces. **This is research output, not needed for codebook compilation.**
7.
   **Save Results** (L670–755):
   - `.json`: Scalar metrics, SVD variance, separation stats
   - `.pt`: Tensors (**this is the key artifact**):
     - `perturbation_svd_Vh` → top-k right-singular vectors (the SVD basis)
     - `perturbation_mean` → population mean for centering
     - `condition_perturbations` → per-condition perturbation vectors
     - `condition_hidden_last` → last-layer hidden states per condition

### 3.2 Codebook Artifact Production

The `.pt` file from `run_manifold_projection.py` feeds directly into `build_codebook_from_precomputed()`, which:

1. Loads `.pt` file → extracts `perturbation_svd_Vh[:3]` (V3 basis) and `perturbation_mean` (P_mean)
2. Reconstructs z-coords: `z = (P - P_mean) @ V3.T`
3. Calls `FirewallCodebook.build()` which:
   - Fits SplineDistribution on each z dimension (population)
   - Fits SplineDistribution on sums (population)
   - Decomposes each condition via CDF → (sum, u, v)
   - Computes DirectionProfiles (pooled stats, Cohen's d, thresholds)
   - Trains DirectionClassifiers (logistic regression per contrast)
   - Trains HistogramClassifiers (8-state discrete classifiers)

**The produced codebook artifacts map to the production spec as:**

| PoC Artifact | Production Format | Notes |
|---|---|---|
| `FirewallCodebook.z_splines` (3× SplineDistribution) | `splines.json` (knot positions + coefficients) | Spline knots serialized as JSON arrays |
| `FirewallCodebook.svd_V3` (3×64 tensor) | `basis.safetensors` → `basis_vectors` | Reshaped for multi-layer format |
| `FirewallCodebook.population_mean_P` (64 tensor) | `basis.safetensors` → `mean` | Centering vector |
| `FirewallCodebook.direction_profiles` (dict) | `regions.safetensors` → centroids, scale | Per-direction statistical profiles |
| `FirewallCodebook.classifiers` (dict) | Part of `config.json` or `regions.safetensors` | Logistic weights (3 floats + intercept per direction) |
| `FirewallCodebook.sum_spline` (SplineDistribution) | `splines.json` | Sum distribution spline |
| `FirewallCodebook.population_stats` (dict) | `regions.safetensors` → centroids, scale | Population baselines |

---

## 4. Core Library Assessment

### 4.1 Metaspline Core Usage

The metaspline core (`spline.py` 378 lines, `transform.py` 78 lines, `space.py` 46 lines; 502 lines total) provides:

| Module | Lines | Used by Codebook | Lines Actually Used |
|--------|-------|-------------------|---------------------|
| `spline.py` | 378 | `SplineDistribution`, `ensure_strictly_increasing` | ~175 lines (SplineDistribution + MonotonicCubicSpline + ensure_strictly_increasing) |
| `transform.py` | 78 | `simplex()` only | 3 lines |
| `space.py` | 46 | None (imported but unused) | 0 lines |

**Actual dependency: ~178 lines out of 502.** The codebook uses only `SplineDistribution` (CDF/ICDF), `MonotonicCubicSpline` (its backbone), `ensure_strictly_increasing`, and `simplex()`. The following are unused:

- `DensitySpline` class (spline.py, 60 lines): legacy CDF-based distribution, not used
- `empirical_cdf()`, `empirical_density()`, `log_bins()`, `generate_asymmetric_knots()` (spline.py, ~45 lines): utility functions, unused
- `unfold()` / `fold()` (space.py, 46 lines): digit expansion/contraction, unused
- `double_cumsum()`, `double_diff()`, `dcs_norm()`, `normalize_01()`, `column_cdf_normalize()`, `toBase()`, `numSymbols()`, `ndVec()` (transform.py, ~75 lines): unused

### 4.2 How Much Is Inline vs. Library?
The `FirewallCodebook.build()` method has **significant inline reimplementation** of statistical operations that could be cleaner:

- **Lines 229–233**: SVD computation is inline (should use the pipeline's `compute_perturbation_svd()`)
- **Lines 236–246**: Spline fitting is inline but delegates to `SplineDistribution`
- **Lines 313–324**: CDF → decompose → barycentric is duplicated 3× (in `build()`, `classify()`, `classify_histogram()`)
- **Lines 327–340**: `pooled_std()` and `cohen_d()` are inner functions, not module-level
- **Lines 365–370**: `compute_silhouette()` is an inner function with sklearn import

The core decomposition pipeline (z → CDF → simplex → barycentric → (sum, u, v)) appears **verbatim** in:

1. `build()` lines 242–250 (population)
2. `build()` lines 313–324 (per-condition, profile computation)
3. `build()` lines 445–456 (per-condition, classifier computation)
4. `build()` lines 521–532 (per-condition, histogram computation)
5. `decompose()` lines 610–628 (runtime inference)

This is the **single most compressible pattern**: a 10-line decomposition sequence repeated 5 times.

---

## 5. Minimum Viable Codebook

### 5.1 Required Functions for Production

Based on the production spec (`codebook.md`), the minimum viable codebook needs:

1. **`project(activations) → z_coords`**: SVD projection (matrix multiply + centering)
2. **`decompose(z_coords) → (sum, u, v)`**: CDF → simplex → barycentric
3. **`score(z_coords) → list[DimensionSignal]`**: Per-direction scoring against profiles
4. **`detect(z_coords, threshold) → DetectionResult`**: Threshold comparison + flagging
5. **`load(path) → Codebook`**: Deserialize from safetensors + JSON
6. **SplineDistribution**: CDF evaluation for decompose

And for the **training pipeline** (not runtime):

7.
   **`build(population_data, direction_data) → Codebook`**: SVD, spline fitting, classifier training

### 5.2 Compression Estimate

| Source | Lines | Classification | Production Lines |
|--------|-------|----------------|------------------|
| `firewall_codebook.py` | 1,245 | Core + research + dead | ~350 |
| `spline.py` (used parts) | ~178 | Core library | ~180 |
| `transform.py` (used parts) | ~3 | Core library | ~5 |
| **Total PoC dependency** | **~1,426** | | **~535** |

**Target estimate: 400–500 lines for runtime codebook, 150–200 lines for training pipeline.**

Breakdown of production targets:

| Module | Target Lines | Contents |
|--------|-------------|----------|
| `codebook/transforms.py` | ~30 | `simplex()`, `reverse_bary3d()`, `bary_to_simplex()` |
| `codebook/splines.py` | ~180 | `MonotonicCubicSpline`, `SplineDistribution`, `ensure_strictly_increasing` |
| `codebook/profiles.py` | ~30 | `DirectionProfile` dataclass |
| `codebook/classifiers.py` | ~20 | `DirectionClassifier` dataclass |
| `codebook/results.py` | ~15 | `DetectionResult` dataclass |
| `codebook/projection.py` | ~30 | `project()` and `decompose()` |
| `codebook/detection.py` | ~50 | `detect()` with rolling window, threshold logic |
| `codebook/codebook.py` | ~40 | `Codebook` class (init, load, summary) |
| `training/compiler.py` | ~150 | `build()`: SVD, spline fitting, profile computation |
| `training/stats.py` | ~25 | `pooled_std()`, `cohen_d()`, silhouette |
| **Total** | **~570** | |

This is **46% of the PoC's 1,245 lines**, or, counting the full metaspline core (502 lines) as well, **~35% of the total 1,745 lines** referenced in the overview.
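To make the `codebook/projection.py` target more concrete, the z → CDF → simplex → barycentric → (sum, u, v) sequence from §5.1 could be sketched as below. This is a minimal pure-Python illustration under stated assumptions, not the PoC's code: `make_empirical_cdf` is a hypothetical stand-in for the CDF of a fitted `SplineDistribution`, `bary_to_uv` assumes an equilateral-triangle layout that may differ from the PoC's `reverse_bary3d()`, and the sum is returned raw rather than passed through the sum spline.

```python
from bisect import bisect_right
from math import sqrt


def make_empirical_cdf(samples):
    """Build a CDF from samples (hypothetical stand-in for a fitted spline CDF)."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        # The +0.5 / +1 offsets keep the output strictly inside (0, 1).
        return (bisect_right(ordered, x) + 0.5) / (n + 1)

    return cdf


def simplex(values):
    """Normalize positive values so they sum to 1 (x / sum(x))."""
    total = sum(values)
    return [v / total for v in values]


def bary_to_uv(p):
    """Map a 3-simplex point to 2D, assuming triangle vertices
    (0, 0), (1, 0) and (0.5, sqrt(3)/2)."""
    u = p[1] + 0.5 * p[2]
    v = (sqrt(3) / 2) * p[2]
    return u, v


def decompose(z, cdfs):
    """z (three floats) -> (sum, u, v), with one CDF per z dimension."""
    c = [cdf(zi) for cdf, zi in zip(cdfs, z)]
    s = sum(c)                      # raw sum of the three CDF values
    u, v = bary_to_uv(simplex(c))   # position on the simplex, projected to 2D
    return s, u, v


# Toy population: the same five samples in each of the three dimensions.
cdfs = [make_empirical_cdf([-2.0, -1.0, 0.0, 1.0, 2.0]) for _ in range(3)]
s, u, v = decompose([0.0, 0.0, 0.0], cdfs)
print(s, u, v)
```

Because `decompose` receives the CDFs as an argument and holds no state of its own, one function of this shape could serve every call site that currently repeats the sequence inline.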
### 5.3 What Gets Cut

| Lines Cut | Source | Reason |
|-----------|--------|--------|
| ~130 | `HistogramClassifier` + `classify_histogram()` + histogram training | Alternative approach, not MVP |
| ~95 | `evaluate_auc()` | Testing/benchmarking tool |
| ~60 | `summary()` | Debugging tool, not runtime |
| ~75 | `__main__` block (including duplicated code) | Script-mode evaluation |
| ~40 | `classify()` method | Subsumed by `detect()` |
| ~30 | `build_codebook_from_precomputed()` | Training I/O, not runtime |
| ~124 | Unused metaspline code (DensitySpline, unfold/fold, dcs_norm, etc.) | Dead code |
| ~50 | Repeated decomposition sequences | DRY refactoring |

---

## 6. Proposed Decomposition

Matching the production package structure from `codebook.md`:

```
src/alknet_firewall/
├── codebook/
│   ├── __init__.py          # Public exports
│   ├── codebook.py          # Codebook class (init, load, project, score)
│   ├── transforms.py        # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py           # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py          # DirectionProfile, population stats
│   ├── classifiers.py       # DirectionClassifier (logistic weights)
│   ├── results.py           # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py        # project(), decompose()
│   └── detection.py         # detect(), threshold comparison, rolling window
├── training/
│   ├── __init__.py
│   ├── compiler.py          # build(): SVD, spline fitting, profile comp
│   ├── stats.py             # pooled_std, cohen_d, silhouette
│   └── data_loader.py       # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json
```

### 6.1 Key Design Decisions for Extraction

1. **SplineDistribution stays in `codebook/splines.py`**: it's a general-purpose distribution class used at both training and inference time. No need for a separate package.
2. **`simplex()` moves to `codebook/transforms.py`**: it's a single pure function (3 lines), no need for the `transform.py` dependency chain.
3.
   **`unfold`/`fold` from `space.py` are dropped**: never used by the codebook.
4. **`DirectionProfile` and `DirectionClassifier` become separate dataclass modules**: clean separation of data from logic.
5. **`build()` moves entirely to `training/compiler.py`**: runtime codebook is read-only. This is the biggest architectural change: the codebook class should not have a `build()` classmethod.
6. **Decompose becomes a pure function**: `decompose(z, splines)` is a pure mathematical transform with no state dependencies beyond the splines. Making it a standalone function enables testing.
7. **Detection is separate from the codebook class**: `detect(z, classifiers, profiles, threshold)` is a stateless function given the codebook data. This enables swapping detection strategies without touching the codebook.

---

## 7. Testing Data

### 7.1 Saved Artifacts Referenced in Code

The PoC references these saved data files:

| File | Path | Contents | Reusable for Testing |
|------|------|----------|---------------------|
| Population precomputed | `saved_data/precomputed_seed42_qwen3_0.6b.pt` | z_coords, P_mean, perturbation_svd_Vh | Yes: basis for integration tests |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_1.7b.pt` | Same for 1.7B model | Yes: multi-model test |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_4b.pt` | Same for 4B model | Yes: multi-model test |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-0.6B_manifold_projection.pt` | Full condition data + SVD | Yes: golden data for codebook compilation |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-1.7B_manifold_projection.pt` | Same for 1.7B | Yes |
| Contrast pairs | Hardcoded in `build()` L268–276 and `run_manifold_projection.py` L139–148 | 7 behavioral contrasts | Yes: test fixture definition |

### 7.2 Validation Results Referenced

The `__main__` block (L1121–1245) contains:

- AUC evaluation at window sizes [1, 4, 8,
  16]
- Per-direction AUC scores for both continuous and histogram classifiers
- Per-token AUC evaluation

These results should be captured as **golden test fixtures** for the production codebook:

- Build a codebook from the 0.6B precomputed data
- Verify that AUC scores match expected ranges
- Verify that detection decisions match expected flags

### 7.3 Calibration Data for Testing

For unit/integration tests, we need:

1. **Synthetic z-coord population**: Small N=1000 tensor for spline fitting tests
2. **Known-contrast z-coords**: Small pairs (harmful/harmless) for direction profile tests
3. **Expected spline parameters**: Known knot positions/coefficients for regression tests
4. **Expected detection results**: For a given input, what does `detect()` return?

The PoC's `build_codebook_from_precomputed()` provides a ready-made path to generate these fixtures from the saved `.pt` files.

---

## Summary

### Key Findings

1. **The 1,245-line PoC contains ~480 lines of essential code**. Including the metaspline core dependency (~178 lines used), the total essential code is ~658 lines. With dead code and research artifacts removed, the production codebook should target **400–500 lines** for runtime + **150–200 lines** for training.
2. **The decomposition pipeline (z → CDF → simplex → bary → (sum,u,v)) is repeated 5 times** in the PoC. Extracting it into a single `decompose()` function saves ~50 lines and eliminates a bug surface.
3. **The metaspline core has ~65% unused code** when viewed from the codebook's perspective. Only `SplineDistribution`, `MonotonicCubicSpline`, `ensure_strictly_increasing`, and `simplex()` are needed; the rest (DensitySpline, unfold/fold, dcs_norm, etc.) can be dropped entirely.
4. **The histogram classifier (2×2×2 discretized approach) is an exploratory alternative**, not the primary detection mechanism. The continuous logistic classifier is superior (higher AUC) and should be the MVP approach.
   The histogram classifier adds ~130 lines and can be deferred.
5. **The `build()` method is the largest single function (429 lines)** and mixes training with runtime state. It must be decomposed: training logic moves to `training/compiler.py`, runtime state becomes immutable serialized data.
6. **Saved `.pt` files from the PoC provide golden test data**: the manifold projection results for Qwen3-0.6B and 1.7B can be reused directly for integration tests.

### Recommendation

**Target: 500–600 lines total** for the production codebook (runtime + training), down from 1,245 lines in the PoC and 1,745 lines including metaspline core. This is a **~65% compression**.

The architecture should separate:

- **Runtime** (~400 lines): `Codebook`, transforms, splines, detection, results
- **Training** (~150 lines): compiler, stats, data loading
- **Data** (bundled): safetensors + JSON, no Python

### Next Steps

1. Create `src/alknet_firewall/codebook/` package structure
2. Extract `transforms.py` (simplex, barycentric): trivial, ~30 lines
3. Port `splines.py` (MonotonicCubicSpline + SplineDistribution): ~180 lines, mostly copy with cleanup
4. Implement `projection.py` (project, decompose): thin wrappers, ~30 lines
5. Implement `detection.py` (detect with rolling window): ~50 lines, port from PoC's detect()
6. Implement `codebook.py` (Codebook class with load): ~40 lines
7. Extract `training/compiler.py` from `build()`: most complex extraction, ~150 lines
8. Create test fixtures from saved `.pt` data
9. Verify round-trip: build from .pt → serialize → load → detect matches PoC output
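
As a starting point for step 9, a round-trip check on serialized spline parameters might look like the sketch below. The `SplineParams` record, its field names, and the JSON layout are hypothetical, since the actual `splines.json` schema is not fixed in this document; only the standard-library `json` and `dataclasses` modules are used.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SplineParams:
    """Hypothetical serialized form of one fitted spline (knots only)."""
    name: str
    knots_x: tuple
    knots_y: tuple


def dump_splines(splines):
    """Serialize a list of SplineParams to a JSON string."""
    return json.dumps({"splines": [asdict(s) for s in splines]}, sort_keys=True)


def load_splines(text):
    """Rebuild SplineParams objects from a JSON string."""
    payload = json.loads(text)
    return [
        # JSON has no tuple type, so lists are converted back explicitly.
        SplineParams(item["name"], tuple(item["knots_x"]), tuple(item["knots_y"]))
        for item in payload["splines"]
    ]


original = [
    SplineParams("z0", (-1.0, 0.0, 1.0), (0.0, 0.5, 1.0)),
    SplineParams("sum", (0.0, 3.0), (0.0, 1.0)),
]
restored = load_splines(dump_splines(original))
round_trip_ok = restored == original
print(round_trip_ok)
```

A full round-trip test would go further and compare `detect()` output before and after serialization; this sketch only covers the parameter-equality part.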