Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:
- OQ-01: Remove ONNX Runtime from scope entirely. It doesn't support activation extraction natively (optimum #972 closed as not planned) and produces bloated model exports; burn/cublas via safetensors is a better future path.
- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package Structure and Extraction from PoC sections to codebook.md based on PoC analysis of metaspline firewall_codebook.py.
- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships Firewall.screen() only, Phase 2 adds <100-line adapter packages for LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails.
- OQ-06: TOML for file-based config (standard modern Python, two-way door).
Also: research OQ-03 rolling windows from taskgraph-semantic reference code, remove onnxruntime/optimum from dependencies, move streaming screening to Phase 2, add burn/cublas as Phase 3 alternative backend.
Research: PoC Codebook Architecture Analysis (OQ-02)
Date: 2026-06-13
Status: Complete
Question: What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed, and what is essential vs. exploratory/dead code?
1. PoC Architecture Overview
1.1 File Structure & Role
The PoC codebook lives in firewall_codebook.py (1,245 lines) and depends on three metaspline core modules:
firewall_codebook.py (1,245 lines)
├── Imports from metaspline core:
│ ├── metaspline.spline.SplineDistribution (spline.py, 378 lines)
│ ├── metaspline.spline.ensure_strictly_increasing (spline.py)
│ ├── metaspline.space.unfold / fold (space.py, 46 lines)
│ └── metaspline.transform.simplex (transform.py, 78 lines)
├── External imports:
│ ├── sklearn.linear_model.LogisticRegression
│ └── sklearn.mixture.GaussianMixture (imported but unused)
└── Internal definitions (see §1.2)
1.2 Major Sections of firewall_codebook.py
| Lines | Component | Description |
|---|---|---|
| 1–50 | Module docstring + imports | Theory overview, imports |
| 53–66 | `reverse_bary3d()` | Simplex → barycentric (u,v) transform |
| 69–74 | `bary_to_simplex()` | Inverse: barycentric → simplex |
| 77–112 | `DirectionProfile` dataclass | Per-contrast statistical profile |
| 114–127 | `DirectionClassifier` dataclass | Per-contrast logistic regression weights |
| 129–146 | `HistogramClassifier` dataclass | 2×2×2 codebook-state histogram classifier |
| 148–165 | `DetectionResult` dataclass | Output of detect() |
| 167–596 | `FirewallCodebook.__init__` + `build()` | Codebook construction (429 lines!) |
| 598–629 | `FirewallCodebook.decompose()` | z → (sum, u, v) copula transform |
| 631–669 | `FirewallCodebook.classify()` | Per-contrast logistic classification |
| 671–729 | `FirewallCodebook.classify_histogram()` | 8-state histogram classification |
| 731–860 | `FirewallCodebook.detect()` | Main detection entry point |
| 862–884 | `FirewallCodebook.detect_from_perturbations()` | Convenience: P → z → detect |
| 886–945 | `FirewallCodebook.summary()` | Human-readable summary |
| 947–1041 | `FirewallCodebook.evaluate_auc()` | AUC evaluation on held-out data |
| 1044–1118 | `build_codebook_from_precomputed()` | Load from saved .pt files |
| 1121–1245 | `__main__` block | Script-mode evaluation + duplicated data loading |
1.3 Dependency Map
                     ┌──────────────────┐
                     │ FirewallCodebook │
                     │   (main class)   │
                     └────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──┐     ┌──────▼──────┐  ┌─────▼─────┐
    │ SplineDist │     │  simplex()  │  │ bary3d()  │
    │ (CDF/ICDF) │     │ (transform) │  │  (local)  │
    └─────────┬──┘     └─────────────┘  └───────────┘
              │
    ┌─────────▼──────────────┐
    │ MonotonicCubicSpline   │
    │ (pchip interpolation)  │
    └────────────────────────┘
The FirewallCodebook has these hard dependencies at runtime:
- SplineDistribution — CDF/ICDF transforms (population fitting + inference)
- simplex() — normalize to simplex (x/sum(x))
- reverse_bary3d() — project simplex to 2D barycentric coordinates
- torch — tensor operations
- numpy — sklearn bridge for training only
Training-time dependencies (not needed at inference):
- sklearn.linear_model.LogisticRegression — classifier training
- sklearn.metrics.silhouette_score — profile quality metric
- sklearn.metrics.roc_auc_score — evaluation metric
2. Essential vs. Exploratory vs. Dead Code Classification
2.1 Essential (Required for Production Codebook)
These are the core components that must be extracted into the production package:
| Component | Lines | Role | Production Mapping |
|---|---|---|---|
| `reverse_bary3d()` | 53–66 | z → (u,v) barycentric projection | codebook/transforms.py |
| `bary_to_simplex()` | 69–74 | Inverse barycentric (needed for reconstruction) | codebook/transforms.py |
| `SplineDistribution` | spline.py:200–261 | CDF/ICDF for copula transform | codebook/splines.py (adapted) |
| `MonotonicCubicSpline` | spline.py:80–197 | PCHIP interpolation engine | codebook/splines.py (adapted) |
| `ensure_strictly_increasing` | spline.py:43–73 | Knot sanitization | codebook/splines.py |
| `simplex()` | transform.py:34–36 | Normalize to unit simplex | codebook/transforms.py |
| `FirewallCodebook.__init__` | 182–203 | State initialization | codebook/codebook.py |
| `FirewallCodebook.decompose()` | 598–629 | z → (sum, u, v) copula space | codebook/projection.py |
| `FirewallCodebook.detect()` | 731–860 | Main detection logic | codebook/detection.py |
| `DetectionResult` | 148–165 | Output dataclass | codebook/results.py |
| `FirewallCodebook.build()` (core logic only) | 204–396 | SVD, spline fitting, profile computation | training/compiler.py |
| `DirectionProfile` | 77–112 | Per-direction statistical profile | codebook/profiles.py |
| `DirectionClassifier` | 114–127 | Per-direction linear classifier | codebook/classifiers.py |
| `FirewallCodebook.detect_from_perturbations()` | 862–884 | P → z convenience wrapper | codebook/projection.py |
Total essential lines: ~480 lines in firewall_codebook.py (the metaspline core rows above are counted separately in §4.1)
2.2 Exploratory / Research Code
These were useful for research but are not needed in production:
| Component | Lines | Purpose | Disposition |
|---|---|---|---|
| `HistogramClassifier` dataclass | 129–146 | Alternative 2×2×2 discretized classifier | Keep as optional, not MVP |
| `classify_histogram()` | 671–729 | Histogram-based classification variant | Research variant, not MVP |
| `build()` histogram classifier section | 481–596 | Training histogram classifiers | Research variant |
| `evaluate_auc()` | 947–1041 | Offline AUC evaluation | Testing/benchmarking only |
| `summary()` | 886–945 | Human-readable codebook summary | Debugging/diagnostic tool |
| `classify()` | 631–669 | Per-position probability output | Subsumed by detect() |
| `build_codebook_from_precomputed()` | 1044–1118 | Load from .pt files | Training pipeline I/O |
| `build()` `contrast_pairs` default | 268–276 | Hardcoded 7-pair contrast list | Config, not code |
| `pooled_std()` inner function | 327–331 | Statistical utility | Extract to training/stats.py |
| `cohen_d()` inner function | 337–340 | Effect size utility | Extract to training/stats.py |
| `compute_silhouette()` inner function | 365–370 | Quality metric | Training diagnostic |
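The two statistical helpers slated for training/stats.py are small enough to sketch in full. The version below is a minimal illustration that assumes the textbook two-sample definitions (pooled variance with unbiased per-sample variances, Cohen's d as the mean difference over the pooled standard deviation) and plain Python sequences rather than tensors. The PoC's exact formulas are not reproduced here, and the zero-variance fallback is an assumption.

```python
import math
from statistics import mean, variance
from typing import Sequence


def pooled_std(a: Sequence[float], b: Sequence[float]) -> float:
    """Pooled standard deviation of two samples (unbiased per-sample variances)."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return math.sqrt(pooled_var)


def cohen_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size: difference of means expressed in pooled-std units."""
    s = pooled_std(a, b)
    if s == 0.0:
        return 0.0  # assumption: constant samples are treated as "no effect"
    return (mean(a) - mean(b)) / s


d = cohen_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # means differ by one pooled std
```

Keeping these at module level, instead of as inner functions of build(), makes them directly unit-testable.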
2.3 Dead Code
| Component | Lines | Issue |
|---|---|---|
| `sklearn.mixture.GaussianMixture` import | 44 | Imported but never used |
| `unfold()` / `fold()` from space.py | space.py:4–45 | Imported but never called in codebook |
| `dcs_norm()` from transform.py | transform.py:20–23 | Imported but never used |
| `__main__` block duplicated data loading | 1121–1245 | Lines 1203–1245 repeat the data loading of lines 1126–1182 with different formatting (copy-paste artifact) |
| `bary_to_simplex()` | 69–74 | Defined but never called in codebook (listed in §2.1 only because reconstruction needs it) |
| `DensitySpline` class | spline.py:315–378 | Legacy alternative, not used by codebook |
| `empirical_cdf()` / `empirical_density()` / `log_bins()` / `generate_asymmetric_knots()` | spline.py:268–313 | Utility functions not used by codebook |
2.4 Infrastructure (Training Pipeline, Not Runtime)
| Component | Lines | Purpose |
|---|---|---|
| `run_manifold_projection.py` (entire) | 823 | Model loading, data collection, SVD computation, saving artifacts |
| `analyzer.py` (entire) | 560 | Multi-layer direction analysis, residual extraction |
| `discover_directions.py` (entire) | 401 | Post-hoc direction discovery from trajectory data |
| `build()` SVD computation section | 229–233 | Population SVD → V3 basis |
3. Training Pipeline Analysis
3.1 run_manifold_projection.py — Step by Step
The training pipeline performs these operations:
- Model Loading (L79–103): Load HuggingFace model + tokenizer. Configure for GPU/CPU.
- Condition Catalog Construction (L106–153): Build contrastive prompt sets for 8 behavioral contrasts (condition pairs):
  - self_ref / other_ref
  - violated / expected (semantic)
  - code_violated / code_expected
  - instruction / data
  - tool_call / natural_language
  - uncertain / confident
  - harmful / harmless
  - injection / benign_instruction
- Feature Extraction (L156–213): For each condition, extract:
  - Hidden states across all layers → `residuals` (n_prompts, n_layers+1, hidden_dim)
  - ICDF perturbation vectors → `perturbations` (n_prompts, 64)
  - Last-layer hidden states → `hidden_last` (n_prompts, hidden_dim)
- SVD Computation (L216–263):
  - Activation SVD: `H_all` (N, 2048) → principal components in hidden state space
  - Perturbation SVD: `P_all` (N, 64) → the 3D perturbation manifold (this is the basis V3)
- Direction Vector Computation (L393–434): Per-contrast mean-difference direction vectors at best layers.
- Projection Analysis (L436–668): Extensive analysis of direction projections onto activation/perturbation subspaces. This is research output, not needed for codebook compilation.
- Save Results (L670–755):
  - `.json`: Scalar metrics, SVD variance, separation stats
  - `.pt`: Tensors. This is the key artifact:
    - `perturbation_svd_Vh` → top-k right-singular vectors (the SVD basis)
    - `perturbation_mean` → population mean for centering
    - `condition_perturbations` → per-condition perturbation vectors
    - `condition_hidden_last` → last-layer hidden states per condition
3.2 Codebook Artifact Production
The .pt file from run_manifold_projection.py feeds directly into build_codebook_from_precomputed(), which:
- Loads the `.pt` file → extracts `perturbation_svd_Vh[:3]` (V3 basis) and `perturbation_mean` (P_mean)
- Reconstructs z-coords: `z = (P - P_mean) @ V3.T`
- Calls `FirewallCodebook.build()`, which:
  - Fits SplineDistribution on each z dimension (population)
  - Fits SplineDistribution on sums (population)
  - Decomposes each condition via CDF → (sum, u, v)
  - Computes DirectionProfiles (pooled stats, Cohen's d, thresholds)
  - Trains DirectionClassifiers (logistic regression per contrast)
  - Trains HistogramClassifiers (8-state discrete classifiers)
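The first two steps (take the top-3 right-singular vectors as V3, then compute z = (P - P_mean) @ V3.T) can be illustrated with NumPy. This is a sketch under stated assumptions, not the PoC code: the data is random stand-in data with the (N, 64) shape quoted above, and running the SVD on the mean-centered matrix is an assumption about how the pipeline computes it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for P_all: N perturbation vectors of width 64 (random, illustration only).
P = rng.normal(size=(200, 64))

# Population mean used for centering (the role of perturbation_mean in the .pt file).
P_mean = P.mean(axis=0)

# SVD of the centered population; the rows of Vh are right-singular vectors.
_, _, Vh = np.linalg.svd(P - P_mean, full_matrices=False)

# The top-3 right-singular vectors form the basis V3, shape (3, 64).
V3 = Vh[:3]

# Project every perturbation vector to z-coordinates, shape (N, 3).
z = (P - P_mean) @ V3.T
```

Because the rows of V3 are orthonormal, the projection is a plain matrix multiply with no further normalization.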
The produced codebook artifacts map to the production spec as:
| PoC Artifact | Production Format | Notes |
|---|---|---|
| `FirewallCodebook.z_splines` (3× SplineDistribution) | `splines.json` (knot positions + coefficients) | Spline knots serialized as JSON arrays |
| `FirewallCodebook.svd_V3` (3×64 tensor) | `basis.safetensors` → `basis_vectors` | Reshaped for multi-layer format |
| `FirewallCodebook.population_mean_P` (64 tensor) | `basis.safetensors` → `mean` | Centering vector |
| `FirewallCodebook.direction_profiles` (dict) | `regions.safetensors` → `centroids`, `scale` | Per-direction statistical profiles |
| `FirewallCodebook.classifiers` (dict) | Part of `config.json` or `regions.safetensors` | Logistic weights (3 floats + intercept per direction) |
| `FirewallCodebook.sum_spline` (SplineDistribution) | `splines.json` | Sum distribution spline |
| `FirewallCodebook.population_stats` (dict) | `regions.safetensors` → `centroids`, `scale` | Population baselines |
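To make the splines.json row concrete, here is one possible shape for the JSON round trip. The schema is hypothetical: the `SplineRecord` class, its field names, and the keying by spline name are illustrative assumptions, since the table only specifies that knot positions and coefficients are stored as JSON arrays.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SplineRecord:
    """Hypothetical on-disk form of one fitted spline."""
    name: str
    knots: List[float]
    coefficients: List[float]


def dump_splines(records: List[SplineRecord]) -> str:
    # One JSON object keyed by spline name; arrays stay plain lists of floats.
    payload = {r.name: {"knots": r.knots, "coefficients": r.coefficients} for r in records}
    return json.dumps(payload, indent=2)


def load_splines(text: str) -> List[SplineRecord]:
    raw = json.loads(text)
    return [SplineRecord(name, body["knots"], body["coefficients"]) for name, body in raw.items()]


records = [
    SplineRecord("z0", [-2.0, 0.0, 2.0], [0.0, 0.5, 1.0]),   # made-up values
    SplineRecord("sum", [0.0, 1.5, 3.0], [0.0, 0.5, 1.0]),
]
restored = load_splines(dump_splines(records))
```

A plain-JSON format keeps the spline data readable in code review, while the tensors go to safetensors.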
4. Core Library Assessment
4.1 Metaspline Core Usage
The metaspline core (spline.py 378 lines, transform.py 78 lines, space.py 46 lines — 502 lines total) provides:
| Module | Lines | Used by Codebook | Lines Actually Used |
|---|---|---|---|
| `spline.py` | 378 | `SplineDistribution`, `ensure_strictly_increasing` | ~175 lines (SplineDistribution + MonotonicCubicSpline + ensure_strictly_increasing) |
| `transform.py` | 78 | `simplex()` only | 3 lines |
| `space.py` | 46 | None (imported but unused) | 0 lines |
Actual dependency: ~178 lines out of 502. The codebook uses only SplineDistribution (CDF/ICDF), MonotonicCubicSpline (its backbone), ensure_strictly_increasing, and simplex(). The following are unused:
- `DensitySpline` class (spline.py, 60 lines): legacy CDF-based distribution, not used
- `empirical_cdf()`, `empirical_density()`, `log_bins()`, `generate_asymmetric_knots()` (spline.py, ~45 lines): utility functions, unused
- `unfold()` / `fold()` (space.py, 46 lines): digit expansion/contraction, unused
- `double_cumsum()`, `double_diff()`, `dcs_norm()`, `normalize_01()`, `column_cdf_normalize()`, `toBase()`, `numSymbols()`, `ndVec()` (transform.py, ~75 lines): unused
4.2 How Much Is Inline vs. Library?
The FirewallCodebook.build() method has significant inline reimplementation of statistical operations that could be cleaner:
- Lines 229–233: SVD computation is inline (should use the pipeline's `compute_perturbation_svd()`)
- Lines 236–246: Spline fitting is inline but delegates to `SplineDistribution`
- Lines 313–324: CDF → decompose → barycentric is duplicated 3× (in `build()`, `classify()`, `classify_histogram()`)
- Lines 327–340: `pooled_std()` and `cohen_d()` are inner functions, not module-level
- Lines 365–370: `compute_silhouette()` is an inner function with sklearn import
The core decomposition pipeline (z → CDF → simplex → barycentric → (sum, u, v)) appears verbatim in:
- `build()` lines 242–250 (population)
- `build()` lines 313–324 (per-condition, profile computation)
- `build()` lines 445–456 (per-condition, classifier computation)
- `build()` lines 521–532 (per-condition, histogram computation)
- `decompose()` lines 610–628 (runtime inference)
This is the single most compressible pattern — a 10-line decomposition sequence repeated 5 times.
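A single shared function for the z → CDF → simplex → barycentric → (sum, u, v) sequence might look like the sketch below. Several pieces are assumptions made for illustration: a standard normal CDF stands in for the fitted SplineDistribution objects, the sum is taken over the CDF values, and `bary_to_uv` uses a conventional equilateral-triangle embedding that may differ from the PoC's `reverse_bary3d()`.

```python
import math
from typing import Callable, Sequence, Tuple

Cdf = Callable[[float], float]


def normal_cdf(x: float) -> float:
    """Stand-in CDF; the PoC uses fitted SplineDistribution objects instead."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def simplex(x: Sequence[float]) -> Tuple[float, ...]:
    """Normalize positive values so they sum to 1 (x / sum(x))."""
    total = sum(x)
    return tuple(v / total for v in x)


def bary_to_uv(s: Sequence[float]) -> Tuple[float, float]:
    """Assumed projection of a 3-component simplex point onto an equilateral triangle."""
    return s[1] + 0.5 * s[2], (math.sqrt(3.0) / 2.0) * s[2]


def decompose(z: Sequence[float], cdfs: Sequence[Cdf]) -> Tuple[float, float, float]:
    """z -> per-dimension CDF -> simplex -> barycentric, returning (sum, u, v)."""
    c = [cdf(value) for cdf, value in zip(cdfs, z)]
    total = sum(c)
    u, v = bary_to_uv(simplex(c))
    return total, u, v


# The population median maps to the centroid of the triangle.
total, u, v = decompose([0.0, 0.0, 0.0], [normal_cdf] * 3)
```

With one such function, the five call sites listed above reduce to one-line calls that differ only in which z-coordinates they pass in.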
5. Minimum Viable Codebook
5.1 Required Functions for Production
Based on the production spec (codebook.md), the minimum viable codebook needs:
1. `project(activations) → z_coords`: SVD projection (matrix multiply + centering)
2. `decompose(z_coords) → (sum, u, v)`: CDF → simplex → barycentric
3. `score(z_coords) → list[DimensionSignal]`: Per-direction scoring against profiles
4. `detect(z_coords, threshold) → DetectionResult`: Threshold comparison + flagging
5. `load(path) → Codebook`: Deserialize from safetensors + JSON
6. SplineDistribution: CDF evaluation for decompose

And for the training pipeline (not runtime):

7. `build(population_data, direction_data) → Codebook`: SVD, spline fitting, classifier training
5.2 Compression Estimate
| Source | Lines | Classification | Production Lines |
|---|---|---|---|
| `firewall_codebook.py` | 1,245 | Core + research + dead | ~350 |
| `spline.py` (used parts) | ~175 | Core library | ~180 |
| `transform.py` (used parts) | ~3 | Core library | ~5 |
| Total PoC dependency | ~1,423 | | ~535 |
Target estimate: 400–500 lines for runtime codebook, 150–200 lines for training pipeline.
Breakdown of production targets:
| Module | Target Lines | Contents |
|---|---|---|
| `codebook/transforms.py` | ~30 | simplex(), reverse_bary3d(), bary_to_simplex() |
| `codebook/splines.py` | ~180 | MonotonicCubicSpline, SplineDistribution, ensure_strictly_increasing |
| `codebook/profiles.py` | ~30 | DirectionProfile dataclass |
| `codebook/classifiers.py` | ~20 | DirectionClassifier dataclass |
| `codebook/results.py` | ~15 | DetectionResult dataclass |
| `codebook/projection.py` | ~30 | project() and decompose() |
| `codebook/detection.py` | ~50 | detect() with rolling window, threshold logic |
| `codebook/codebook.py` | ~40 | Codebook class (init, load, summary) |
| `training/compiler.py` | ~150 | build(): SVD, spline fitting, profile computation |
| `training/stats.py` | ~25 | pooled_std(), cohen_d(), silhouette |
| Total | ~570 | |
This is 46% of the PoC's 1,245 lines, or about 33% of the roughly 1,745 lines (1,245 + 502) when the full metaspline core from §4.1 is counted as well.
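The ratios can be rechecked from the line counts quoted in this document. The snippet below is only that arithmetic, using the 570-line total from the module table and the three metaspline file sizes from §4.1.

```python
poc_lines = 1245
metaspline_lines = 378 + 78 + 46   # spline.py + transform.py + space.py
production_lines = 570             # total of the module targets table

share_of_poc = production_lines / poc_lines
share_of_all = production_lines / (poc_lines + metaspline_lines)

print(f"{share_of_poc:.0%} of the PoC file")            # 46%
print(f"{share_of_all:.0%} of PoC + metaspline core")   # 33%
```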
5.3 What Gets Cut
| Lines Cut | Source | Reason |
|---|---|---|
| ~130 | `HistogramClassifier` + `classify_histogram()` + histogram training | Alternative approach, not MVP |
| ~95 | `evaluate_auc()` | Testing/benchmarking tool |
| ~60 | `summary()` | Debugging tool, not runtime |
| ~75 | `__main__` block (including duplicated code) | Script-mode evaluation |
| ~40 | `classify()` method | Subsumed by detect() |
| ~30 | `build_codebook_from_precomputed()` | Training I/O, not runtime |
| ~124 | Unused metaspline code (DensitySpline, unfold/fold, dcs_norm, etc.) | Dead code |
| ~50 | Repeated decomposition sequences | DRY refactoring |
6. Proposed Decomposition
Matching the production package structure from codebook.md:
src/alknet_firewall/
├── codebook/
│ ├── __init__.py # Public exports
│ ├── codebook.py # Codebook class (init, load, project, score)
│ ├── transforms.py # simplex, reverse_bary3d, bary_to_simplex
│ ├── splines.py # MonotonicCubicSpline, SplineDistribution
│ ├── profiles.py # DirectionProfile, population stats
│ ├── classifiers.py # DirectionClassifier (logistic weights)
│ ├── results.py # DetectionResult, DimensionSignal, AlarmLevel
│ ├── projection.py # project(), decompose()
│ └── detection.py # detect(), threshold comparison, rolling window
├── training/
│ ├── __init__.py
│ ├── compiler.py # build() — SVD, spline fitting, profile comp
│ ├── stats.py # pooled_std, cohen_d, silhouette
│ └── data_loader.py # Condition catalog, prompt sets, data loading
└── data/
└── codebook/
├── basis.safetensors
├── regions.safetensors
├── splines.json
└── config.json
6.1 Key Design Decisions for Extraction
- SplineDistribution stays in `codebook/splines.py`: it's a general-purpose distribution class used at both training and inference time. No need for a separate package.
- `simplex()` moves to `codebook/transforms.py`: it's a single pure function (3 lines), no need for the `transform.py` dependency chain.
- `unfold`/`fold` from `space.py` are dropped: never used by the codebook.
- `DirectionProfile` and `DirectionClassifier` become separate dataclass modules: clean separation of data from logic.
- `build()` moves entirely to `training/compiler.py`: runtime codebook is read-only. This is the biggest architectural change: the codebook class should not have a `build()` classmethod.
- Decompose becomes a pure function: `decompose(z, splines)` is a pure mathematical transform with no state dependencies beyond the splines. Making it a standalone function enables testing.
- Detection is separate from the codebook class: `detect(z, classifiers, profiles, threshold)` is a stateless function given the codebook data. This enables swapping detection strategies without touching the codebook.
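As an illustration of the last decision, a stateless detection function over immutable data could be sketched as follows. The dataclass fields, the `features` argument (standing for an already decomposed (sum, u, v) triple), and the omission of the `profiles` argument are all simplifying assumptions; only the "3 floats + intercept per direction" layout comes from the artifact table in §3.2.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DirectionClassifier:
    """Logistic weights for one contrast: 3 floats plus an intercept."""
    weights: Tuple[float, float, float]
    intercept: float


@dataclass(frozen=True)
class DetectionResult:
    scores: Dict[str, float]   # per-direction probability (assumed field)
    flagged: Tuple[str, ...]   # directions at or above the threshold (assumed field)


def detect(features: Tuple[float, float, float],
           classifiers: Dict[str, DirectionClassifier],
           threshold: float = 0.5) -> DetectionResult:
    """Score (sum, u, v) against every direction classifier; no hidden state."""
    scores = {}
    for name, clf in classifiers.items():
        logit = clf.intercept + sum(w * x for w, x in zip(clf.weights, features))
        scores[name] = 1.0 / (1.0 + math.exp(-logit))
    flagged = tuple(sorted(n for n, p in scores.items() if p >= threshold))
    return DetectionResult(scores=scores, flagged=flagged)


classifiers = {
    "harmful": DirectionClassifier((2.0, 0.0, 0.0), -1.0),    # illustrative weights
    "injection": DirectionClassifier((0.0, -3.0, 0.0), 0.0),
}
result = detect((1.5, 0.5, 0.3), classifiers, threshold=0.5)
```

Since the function only reads its arguments, an alternative strategy (for example the histogram variant) can be swapped in by passing different data to a different function, with the codebook class untouched.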
7. Testing Data
7.1 Saved Artifacts Referenced in Code
The PoC references these saved data files:
| File | Path | Contents | Reusable for Testing |
|---|---|---|---|
| Population precomputed | `saved_data/precomputed_seed42_qwen3_0.6b.pt` | z_coords, P_mean, perturbation_svd_Vh | Yes: basis for integration tests |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_1.7b.pt` | Same for 1.7B model | Yes: multi-model test |
| Population precomputed | `saved_data/precomputed_seed42_qwen3_4b.pt` | Same for 4B model | Yes: multi-model test |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-0.6B_manifold_projection.pt` | Full condition data + SVD | Yes: golden data for codebook compilation |
| Direction geometry | `experiments/direction_geometry/results/Qwen_Qwen3-1.7B_manifold_projection.pt` | Same for 1.7B | Yes |
| Contrast pairs | Hardcoded in `build()` L268–276 and `run_manifold_projection.py` L139–148 | 7 behavioral contrasts | Yes: test fixture definition |
7.2 Validation Results Referenced
The __main__ block (L1121–1245) contains:
- AUC evaluation at window sizes [1, 4, 8, 16]
- Per-direction AUC scores for both continuous and histogram classifiers
- Per-token AUC evaluation
These results should be captured as golden test fixtures for the production codebook:
- Build a codebook from the 0.6B precomputed data
- Verify that AUC scores match expected ranges
- Verify that detection decisions match expected flags
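A golden-fixture check of this kind needs an AUC function and a recorded range. The sketch below uses a dependency-free pairwise AUC as a stand-in for `sklearn.metrics.roc_auc_score`, which the PoC uses; the helper names and the example range are hypothetical, and the labels and scores are toy values, not PoC results.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_in_range(auc: float, low: float, high: float) -> bool:
    """Fixture-style check: the AUC must fall inside a recorded range."""
    return low <= auc <= high


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)          # 3 of 4 positive/negative pairs are ordered correctly
ok = auc_in_range(auc, 0.70, 0.80)     # hypothetical recorded range
```

Checking a range instead of an exact value leaves room for small numerical differences between the PoC and the production port.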
7.3 Calibration Data for Testing
For unit/integration tests, we need:
- Synthetic z-coord population: Small N=1000 tensor for spline fitting tests
- Known-contrast z-coords: Small pairs (harmful/harmless) for direction profile tests
- Expected spline parameters: Known knot positions/coefficients for regression tests
- Expected detection results: For a given input, what does `detect()` return?
The PoC's build_codebook_from_precomputed() provides a ready-made path to generate these fixtures from the saved .pt files.
Summary
Key Findings
- The 1,245-line PoC contains ~480 lines of essential code. Including the metaspline core dependency (~178 lines used), the total essential code is ~658 lines. With dead code and research artifacts removed, the production codebook should target 400–500 lines for runtime + 150–200 lines for training.
- The decomposition pipeline (z → CDF → simplex → bary → (sum,u,v)) is repeated 5 times in the PoC. Extracting it into a single `decompose()` function saves ~50 lines and eliminates a bug surface.
- The metaspline core has ~65% unused code when viewed from the codebook's perspective. Only `SplineDistribution`, `MonotonicCubicSpline`, `ensure_strictly_increasing`, and `simplex()` are needed; the rest (DensitySpline, unfold/fold, dcs_norm, etc.) can be dropped entirely.
- The histogram classifier (2×2×2 discretized approach) is an exploratory alternative, not the primary detection mechanism. The continuous logistic classifier is superior (higher AUC) and should be the MVP approach. The histogram classifier adds ~130 lines and can be deferred.
- The `build()` method is the largest single function (429 lines) and mixes training with runtime state. It must be decomposed: training logic moves to `training/compiler.py`, runtime state becomes immutable serialized data.
- Saved `.pt` files from the PoC provide golden test data: the manifold projection results for Qwen3-0.6B and 1.7B can be reused directly for integration tests.
Recommendation
Target: 500–600 lines total for the production codebook (runtime + training), down from 1,245 lines in the PoC and 1,745 lines including metaspline core. This is a ~65% compression.
The architecture should separate:
- Runtime (~400 lines): `Codebook`, transforms, splines, detection, results
- Training (~150 lines): compiler, stats, data loading
- Data (bundled): safetensors + JSON, no Python
Next Steps
- Create `src/alknet_firewall/codebook/` package structure
- Extract `transforms.py` (simplex, barycentric): trivial, ~30 lines
- Port `splines.py` (MonotonicCubicSpline + SplineDistribution): ~180 lines, mostly copy with cleanup
- Implement `projection.py` (project, decompose): thin wrappers, ~30 lines
- Implement `detection.py` (detect with rolling window): ~50 lines, port from PoC's detect()
- Implement `codebook.py` (Codebook class with load): ~40 lines
- Extract `training/compiler.py` from `build()`: most complex extraction, ~150 lines
- Create test fixtures from saved `.pt` data
- Verify round-trip: build from .pt → serialize → load → detect matches PoC output