glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure
Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.
2026-06-13 07:27:40 +00:00


Research: PoC Codebook Architecture Analysis (OQ-02)

Date: 2026-06-13
Status: Complete
Question: What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed, and what is essential vs. exploratory/dead code?


1. PoC Architecture Overview

1.1 File Structure & Role

The PoC codebook lives in firewall_codebook.py (1,245 lines) and depends on three metaspline core modules:

firewall_codebook.py (1,245 lines)
├── Imports from metaspline core:
│   ├── metaspline.spline.SplineDistribution    (spline.py, 378 lines)
│   ├── metaspline.spline.ensure_strictly_increasing (spline.py)
│   ├── metaspline.space.unfold / fold           (space.py, 46 lines)
│   └── metaspline.transform.simplex             (transform.py, 78 lines)
├── External imports:
│   ├── sklearn.linear_model.LogisticRegression
│   └── sklearn.mixture.GaussianMixture (imported but unused)
└── Internal definitions (see §1.2)

1.2 Major Sections of firewall_codebook.py

| Lines | Component | Description |
|---|---|---|
| 1–50 | Module docstring + imports | Theory overview, imports |
| 53–75 | reverse_bary3d() | Simplex → barycentric (u,v) transform |
| 69–74 | bary_to_simplex() | Inverse: barycentric → simplex |
| 77–112 | DirectionProfile dataclass | Per-contrast statistical profile |
| 114–127 | DirectionClassifier dataclass | Per-contrast logistic regression weights |
| 129–146 | HistogramClassifier dataclass | 2×2×2 codebook-state histogram classifier |
| 148–165 | DetectionResult dataclass | Output of detect() |
| 167–596 | FirewallCodebook.__init__ + build() | Codebook construction (429 lines!) |
| 598–629 | FirewallCodebook.decompose() | z → (sum, u, v) copula transform |
| 631–669 | FirewallCodebook.classify() | Per-contrast logistic classification |
| 671–729 | FirewallCodebook.classify_histogram() | 8-state histogram classification |
| 731–860 | FirewallCodebook.detect() | Main detection entry point |
| 862–884 | FirewallCodebook.detect_from_perturbations() | Convenience: P → z → detect |
| 886–945 | FirewallCodebook.summary() | Human-readable summary |
| 947–1041 | FirewallCodebook.evaluate_auc() | AUC evaluation on held-out data |
| 1044–1118 | build_codebook_from_precomputed() | Load from saved .pt files |
| 1121–1245 | __main__ block | Script-mode evaluation + duplicated data loading |

1.3 Dependency Map

                     ┌──────────────────┐
                     │ FirewallCodebook  │
                     │  (main class)     │
                     └────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──┐   ┌──────▼──────┐   ┌─────▼─────┐
    │ SplineDist  │   │ simplex()   │   │ bary3d()  │
    │ (CDF/ICDF) │   │ (transform) │   │ (local)   │
    └─────────┬──┘   └─────────────┘   └───────────┘
              │
    ┌─────────▼──────────────┐
    │ MonotonicCubicSpline   │
    │ (pchip interpolation)   │
    └────────────────────────┘

The FirewallCodebook has these hard dependencies at runtime:

  1. SplineDistribution — CDF/ICDF transforms (population fitting + inference)
  2. simplex() — normalize to simplex (x/sum(x))
  3. reverse_bary3d() — project simplex to 2D barycentric coordinates
  4. torch — tensor operations
  5. numpy — sklearn bridge for training only

Training-time dependencies (not needed at inference):

  • sklearn.linear_model.LogisticRegression — classifier training
  • sklearn.metrics.silhouette_score — profile quality metric
  • sklearn.metrics.roc_auc_score — evaluation metric
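
To make the SplineDistribution dependency concrete, here is a minimal sketch of the two operations the codebook needs from it: knot sanitization and a CDF/ICDF pair. It is not the metaspline implementation. It swaps PCHIP for plain linear interpolation so that it runs on the standard library alone, and the class name LinearCdf, the quantile-knot placement, and the eps default are all assumptions made for illustration.

```python
from bisect import bisect_right


def ensure_strictly_increasing(values, eps=1e-9):
    """Nudge repeated or decreasing knots upward so each exceeds its predecessor."""
    out = []
    for v in values:
        v = float(v)
        if out and v <= out[-1]:
            v = out[-1] + eps
        out.append(v)
    return out


class LinearCdf:
    """Illustrative stand-in for SplineDistribution (linear, not PCHIP)."""

    def __init__(self, samples, n_knots=9):
        xs = sorted(samples)
        # Place knots at evenly spaced empirical quantiles (an assumption).
        idx = [round(i * (len(xs) - 1) / (n_knots - 1)) for i in range(n_knots)]
        self.x = ensure_strictly_increasing([xs[i] for i in idx])
        self.p = [i / (n_knots - 1) for i in range(n_knots)]

    @staticmethod
    def _interp(q, xs, ys):
        # Clamp outside the knot range, interpolate linearly inside it.
        if q <= xs[0]:
            return ys[0]
        if q >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, q) - 1
        t = (q - xs[j]) / (xs[j + 1] - xs[j])
        return ys[j] + t * (ys[j + 1] - ys[j])

    def cdf(self, x):
        return self._interp(x, self.x, self.p)

    def icdf(self, p):
        # The inverse swaps the roles of knots and probabilities.
        return self._interp(p, self.p, self.x)


dist = LinearCdf(range(101))
roundtrip = dist.icdf(dist.cdf(30.0))
```

Because the knots are strictly increasing, the same interpolation table serves both directions, which is why knot sanitization has to run before any CDF is built.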

2. Essential vs. Exploratory vs. Dead Code Classification

2.1 Essential (Required for Production Codebook)

These are the core components that must be extracted into the production package:

| Component | Lines | Role | Production Mapping |
|---|---|---|---|
| reverse_bary3d() | 53–66 | z → (u,v) barycentric projection | codebook/transforms.py |
| bary_to_simplex() | 69–74 | Inverse barycentric (needed for reconstruction) | codebook/transforms.py |
| SplineDistribution | spline.py:200–261 | CDF/ICDF for copula transform | codebook/splines.py (adapted) |
| MonotonicCubicSpline | spline.py:80–197 | PCHIP interpolation engine | codebook/splines.py (adapted) |
| ensure_strictly_increasing | spline.py:43–73 | Knot sanitization | codebook/splines.py |
| simplex() | transform.py:34–36 | Normalize to unit simplex | codebook/transforms.py |
| FirewallCodebook.__init__ | 182–203 | State initialization | codebook/codebook.py |
| FirewallCodebook.decompose() | 598–629 | z → (sum, u, v) copula space | codebook/projection.py |
| FirewallCodebook.detect() | 731–860 | Main detection logic | codebook/detection.py |
| DetectionResult | 148–165 | Output dataclass | codebook/results.py |
| FirewallCodebook.build() (core logic only) | 204–396 | SVD, spline fitting, profile computation | training/compiler.py |
| DirectionProfile | 77–112 | Per-direction statistical profile | codebook/profiles.py |
| DirectionClassifier | 114–127 | Per-direction linear classifier | codebook/classifiers.py |
| FirewallCodebook.detect_from_perturbations() | 862–884 | P → z convenience wrapper | codebook/projection.py |

Total essential lines: ~480 lines (including metaspline core)

2.2 Exploratory / Research Code

These were useful for research but are not needed in production:

| Component | Lines | Purpose | Disposition |
|---|---|---|---|
| HistogramClassifier dataclass | 129–146 | Alternative 2×2×2 discretized classifier | Keep as optional, not MVP |
| classify_histogram() | 671–729 | Histogram-based classification variant | Research variant, not MVP |
| build() histogram classifier section | 481–596 | Training histogram classifiers | Research variant |
| evaluate_auc() | 947–1041 | Offline AUC evaluation | Testing/benchmarking only |
| summary() | 886–945 | Human-readable codebook summary | Debugging/diagnostic tool |
| classify() | 631–669 | Per-position probability output | Subsumed by detect() |
| build_codebook_from_precomputed() | 1044–1118 | Load from .pt files | Training pipeline I/O |
| build() contrast_pairs default | 268–276 | Hardcoded 7-pair contrast list | Config, not code |
| pooled_std() inner function | 327–331 | Statistical utility | Extract to training/stats.py |
| cohen_d() inner function | 337–340 | Effect size utility | Extract to training/stats.py |
| compute_silhouette() inner function | 365–370 | Quality metric | Training diagnostic |
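
As an illustration of what extracting pooled_std() and cohen_d() to training/stats.py could look like, the sketch below uses the textbook definitions (pooled sample standard deviation and the standardized mean difference). The PoC's inner functions are not reproduced here, so whether they use exactly these formulas is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def pooled_std(a, b):
    """Pooled sample standard deviation of two groups (textbook form)."""
    na, nb = len(a), len(b)
    return sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))


def cohen_d(a, b):
    """Standardized mean difference; 0.0 when both groups have no spread."""
    s = pooled_std(a, b)
    return (mean(a) - mean(b)) / s if s > 0 else 0.0


harmless = [1.0, 2.0, 3.0]
harmful = [3.0, 4.0, 5.0]
effect = cohen_d(harmless, harmful)
```

Module-level functions like these can be unit-tested directly, which inner functions defined inside build() cannot.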

2.3 Dead Code

| Component | Lines | Issue |
|---|---|---|
| sklearn.mixture.GaussianMixture import | 44 | Imported but never used |
| unfold() / fold() from space.py | space.py:4–45 | Imported but never called in codebook |
| dcs_norm() from transform.py | transform.py:20–23 | Imported but never used |
| __main__ block duplicated data loading | 1121–1245 | Lines 1203–1245 repeat lines 1126–1182 with only formatting changes (copy-paste artifact) |
| bary_to_simplex() | 69–74 | Defined but never called in codebook |
| DensitySpline class | spline.py:315–378 | Legacy alternative, not used by codebook |
| empirical_cdf() / empirical_density() / log_bins() / generate_asymmetric_knots() | spline.py:268–313 | Utility functions not used by codebook |

2.4 Infrastructure (Training Pipeline, Not Runtime)

| Component | Lines | Purpose |
|---|---|---|
| run_manifold_projection.py (entire) | 823 | Model loading, data collection, SVD computation, saving artifacts |
| analyzer.py (entire) | 560 | Multi-layer direction analysis, residual extraction |
| discover_directions.py (entire) | 401 | Post-hoc direction discovery from trajectory data |
| build() SVD computation section | 229–233 | Population SVD → V3 basis |

3. Training Pipeline Analysis

3.1 run_manifold_projection.py — Step by Step

The training pipeline performs these operations:

  1. Model Loading (L79–103): Load HuggingFace model + tokenizer. Configure for GPU/CPU.

  2. Condition Catalog Construction (L106–153): Build contrastive prompt sets for 8 behavioral contrasts (condition pairs):

    • self_ref / other_ref
    • violated / expected (semantic)
    • code_violated / code_expected
    • instruction / data
    • tool_call / natural_language
    • uncertain / confident
    • harmful / harmless
    • injection / benign_instruction
  3. Feature Extraction (L156–213): For each condition, extract:

    • Hidden states across all layers → residuals (n_prompts, n_layers+1, hidden_dim)
    • ICDF perturbation vectors → perturbations (n_prompts, 64)
    • Last-layer hidden states → hidden_last (n_prompts, hidden_dim)
  4. SVD Computation (L216–263):

    • Activation SVD: H_all (N, 2048) → principal components in hidden state space
    • Perturbation SVD: P_all (N, 64) → the 3D perturbation manifold (this is the basis V3)
  5. Direction Vector Computation (L393–434): Per-contrast mean-difference direction vectors at best layers.

  6. Projection Analysis (L436–668): Extensive analysis of direction projections onto activation/perturbation subspaces. This is research output, not needed for codebook compilation.

  7. Save Results (L670–755):

    • .json: Scalar metrics, SVD variance, separation stats
    • .pt: Tensors — this is the key artifact:
      • perturbation_svd_Vh → top-k right-singular vectors (the SVD basis)
      • perturbation_mean → population mean for centering
      • condition_perturbations → per-condition perturbation vectors
      • condition_hidden_last → last-layer hidden states per condition

3.2 Codebook Artifact Production

The .pt file from run_manifold_projection.py feeds directly into build_codebook_from_precomputed(), which:

  1. Loads .pt file → extracts perturbation_svd_Vh[:3] (V3 basis) and perturbation_mean (P_mean)
  2. Reconstructs z-coords: z = (P - P_mean) @ V3.T
  3. Calls FirewallCodebook.build() which:
    • Fits SplineDistribution on each z dimension (population)
    • Fits SplineDistribution on sums (population)
    • Decomposes each condition via CDF → (sum, u, v)
    • Computes DirectionProfiles (pooled stats, Cohen's d, thresholds)
    • Trains DirectionClassifiers (logistic regression per contrast)
    • Trains HistogramClassifiers (8-state discrete classifiers)
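
The first two steps above (take the top three right-singular vectors as V3, then compute z = (P - P_mean) @ V3.T) can be sketched with NumPy as follows. The random population is synthetic and the function names fit_basis and project are hypothetical. Only the (N, 64) shape and the centering-then-projection formula come from the pipeline description.

```python
import numpy as np


def fit_basis(P, k=3):
    """Return the population mean and the top-k right-singular vectors of centered P."""
    P_mean = P.mean(axis=0)
    _, _, Vh = np.linalg.svd(P - P_mean, full_matrices=False)
    return P_mean, Vh[:k]  # Vh[:k] has shape (k, 64), matching perturbation_svd_Vh[:3]


def project(P, P_mean, V3):
    """z = (P - P_mean) @ V3.T, mapping 64-d perturbations to 3-d z-coords."""
    return (P - P_mean) @ V3.T


rng = np.random.default_rng(42)
P = rng.normal(size=(200, 64))  # synthetic stand-in for the perturbation population
P_mean, V3 = fit_basis(P)
z = project(P, P_mean, V3)
```

At inference time only project() is needed, since P_mean and V3 are loaded from the serialized basis rather than refit.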

The produced codebook artifacts map to the production spec as:

| PoC Artifact | Production Format | Notes |
|---|---|---|
| FirewallCodebook.z_splines (3× SplineDistribution) | splines.json (knot positions + coefficients) | Spline knots serialized as JSON arrays |
| FirewallCodebook.svd_V3 (3×64 tensor) | basis.safetensors → basis_vectors | Reshaped for multi-layer format |
| FirewallCodebook.population_mean_P (64 tensor) | basis.safetensors → mean | Centering vector |
| FirewallCodebook.direction_profiles (dict) | regions.safetensors → centroids, scale | Per-direction statistical profiles |
| FirewallCodebook.classifiers (dict) | Part of config.json or regions.safetensors | Logistic weights (3 floats + intercept per direction) |
| FirewallCodebook.sum_spline (SplineDistribution) | splines.json | Sum distribution spline |
| FirewallCodebook.population_stats (dict) | regions.safetensors → centroids, scale | Population baselines |

4. Core Library Assessment

4.1 Metaspline Core Usage

The metaspline core (spline.py 378 lines, transform.py 78 lines, space.py 46 lines — 502 lines total) provides:

| Module | Lines | Used by Codebook | Lines Actually Used |
|---|---|---|---|
| spline.py | 378 | SplineDistribution, ensure_strictly_increasing | ~175 lines (SplineDistribution + MonotonicCubicSpline + ensure_strictly_increasing) |
| transform.py | 78 | simplex() only | 3 lines |
| space.py | 46 | None (imported but unused) | 0 lines |

Actual dependency: ~178 lines out of 502. The codebook uses only SplineDistribution (CDF/ICDF), MonotonicCubicSpline (its backbone), ensure_strictly_increasing, and simplex(). The following are unused:

  • DensitySpline class (spline.py, 60 lines) — legacy CDF-based distribution, not used
  • empirical_cdf(), empirical_density(), log_bins(), generate_asymmetric_knots() (spline.py, ~45 lines) — utility functions, unused
  • unfold() / fold() (space.py, 46 lines) — digit expansion/contraction, unused
  • double_cumsum(), double_diff(), dcs_norm(), normalize_01(), column_cdf_normalize(), toBase(), numSymbols(), ndVec() (transform.py, ~75 lines) — unused

4.2 How Much Is Inline vs. Library?

The FirewallCodebook.build() method has significant inline reimplementation of statistical operations that could be cleaner:

  • Lines 229–233: SVD computation is inline (should use the pipeline's compute_perturbation_svd())
  • Lines 236–246: Spline fitting is inline but delegates to SplineDistribution
  • Lines 313–324: CDF → decompose → barycentric is duplicated 3× (in build(), classify(), classify_histogram())
  • Lines 327–340: pooled_std() and cohen_d() are inner functions, not module-level
  • Lines 365–370: compute_silhouette() is an inner function with sklearn import

The core decomposition pipeline (z → CDF → simplex → barycentric → (sum, u, v)) appears verbatim in:

  1. build() lines 242–250 (population)
  2. build() lines 313–324 (per-condition, profile computation)
  3. build() lines 445–456 (per-condition, classifier computation)
  4. build() lines 521–532 (per-condition, histogram computation)
  5. decompose() lines 610–628 (runtime inference)

This is the single most compressible pattern — a 10-line decomposition sequence repeated 5 times.
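
A single shared function for that sequence might look like the sketch below. It is an illustration under stated assumptions, not the PoC code: the per-dimension CDFs are passed in as plain callables, the sum is taken over the CDF values before normalization, and reverse_bary3d() uses a standard equilateral-triangle embedding because the PoC's exact barycentric formula is not shown in this analysis.

```python
from math import exp, sqrt


def simplex(x):
    """Normalize to the unit simplex: x / sum(x). Assumes sum(x) > 0."""
    total = sum(x)
    return [xi / total for xi in x]


def reverse_bary3d(w):
    """Map simplex weights to 2-D (u, v).

    Assumed embedding with triangle vertices (0, 0), (1, 0), (0.5, sqrt(3)/2).
    """
    u = w[1] + 0.5 * w[2]
    v = (sqrt(3) / 2) * w[2]
    return u, v


def decompose(z, cdfs):
    """z -> CDF -> simplex -> barycentric, returning (sum, u, v)."""
    c = [cdf(zi) for cdf, zi in zip(cdfs, z)]
    total = sum(c)
    u, v = reverse_bary3d(simplex(c))
    return total, u, v


def logistic_cdf(x):
    # Placeholder CDF for the demo; production would use fitted splines.
    return 1.0 / (1.0 + exp(-x))


total, u, v = decompose([0.0, 0.0, 0.0], [logistic_cdf] * 3)
```

With one definition, build() and the runtime path would call the same code, so a fix to the transform cannot be applied in one copy and missed in another.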


5. Minimum Viable Codebook

5.1 Required Functions for Production

Based on the production spec (codebook.md), the minimum viable codebook needs:

  1. project(activations) → z_coords: SVD projection (matrix multiply + centering)
  2. decompose(z_coords) → (sum, u, v): CDF → simplex → barycentric
  3. score(z_coords) → list[DimensionSignal]: Per-direction scoring against profiles
  4. detect(z_coords, threshold) → DetectionResult: Threshold comparison + flagging
  5. load(path) → Codebook: Deserialize from safetensors + JSON
  6. SplineDistribution: CDF evaluation for decompose

And for the training pipeline (not runtime):

  7. build(population_data, direction_data) → Codebook: SVD, spline fitting, classifier training

5.2 Compression Estimate

| Source | Lines | Classification | Production Lines |
|---|---|---|---|
| firewall_codebook.py | 1,245 | Core + research + dead | ~350 |
| spline.py (used parts) | ~178 | Core library | ~180 |
| transform.py (used parts) | ~3 | Core library | ~5 |
| Total PoC dependency | ~1,426 | | ~535 |

Target estimate: 400–500 lines for runtime codebook, 150–200 lines for training pipeline.

Breakdown of production targets:

| Module | Target Lines | Contents |
|---|---|---|
| codebook/transforms.py | ~30 | simplex(), reverse_bary3d(), bary_to_simplex() |
| codebook/splines.py | ~180 | MonotonicCubicSpline, SplineDistribution, ensure_strictly_increasing |
| codebook/profiles.py | ~30 | DirectionProfile dataclass |
| codebook/classifiers.py | ~20 | DirectionClassifier dataclass |
| codebook/results.py | ~15 | DetectionResult dataclass |
| codebook/projection.py | ~30 | project() and decompose() |
| codebook/detection.py | ~50 | detect() with rolling window, threshold logic |
| codebook/codebook.py | ~40 | Codebook class (init, load, summary) |
| training/compiler.py | ~150 | build(): SVD, spline fitting, profile computation |
| training/stats.py | ~25 | pooled_std(), cohen_d(), silhouette |
| Total | ~570 | |

This is 46% of the PoC's 1,245 lines or, counting the full metaspline core (502 lines) as well, roughly a third (~33%) of the ~1,745 lines referenced in the overview.

5.3 What Gets Cut

| Lines Cut | Source | Reason |
|---|---|---|
| ~130 | HistogramClassifier + classify_histogram() + histogram training | Alternative approach, not MVP |
| ~95 | evaluate_auc() | Testing/benchmarking tool |
| ~60 | summary() | Debugging tool, not runtime |
| ~75 | __main__ block (including duplicated code) | Script-mode evaluation |
| ~40 | classify() method | Subsumed by detect() |
| ~30 | build_codebook_from_precomputed() | Training I/O, not runtime |
| ~124 | Unused metaspline code (DensitySpline, unfold/fold, dcs_norm, etc.) | Dead code |
| ~50 | Repeated decomposition sequences | DRY refactoring |

6. Proposed Decomposition

Matching the production package structure from codebook.md:

src/alknet_firewall/
├── codebook/
│   ├── __init__.py            # Public exports
│   ├── codebook.py            # Codebook class (init, load, project, score)
│   ├── transforms.py          # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py             # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py            # DirectionProfile, population stats
│   ├── classifiers.py          # DirectionClassifier (logistic weights)
│   ├── results.py             # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py          # project(), decompose()
│   └── detection.py           # detect(), threshold comparison, rolling window
├── training/
│   ├── __init__.py
│   ├── compiler.py            # build() — SVD, spline fitting, profile comp
│   ├── stats.py               # pooled_std, cohen_d, silhouette
│   └── data_loader.py         # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json

6.1 Key Design Decisions for Extraction

  1. SplineDistribution stays in codebook/splines.py — it's a general-purpose distribution class used at both training and inference time. No need for a separate package.

  2. simplex() moves to codebook/transforms.py — it's a single pure function (3 lines), no need for the transform.py dependency chain.

  3. unfold/fold from space.py are dropped — never used by the codebook.

  4. DirectionProfile and DirectionClassifier become separate dataclass modules — clean separation of data from logic.

  5. build() moves entirely to training/compiler.py — runtime codebook is read-only. This is the biggest architectural change: the codebook class should not have a build() classmethod.

  6. Decompose becomes a pure function: decompose(z, splines) is a pure mathematical transform with no state dependencies beyond the splines. Making it a standalone function enables testing.

  7. Detection is separate from the codebook class: detect(z, classifiers, profiles, threshold) is a stateless function given the codebook data. This enables swapping detection strategies without touching the codebook.
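
Decision 7 can be illustrated with a minimal stateless scorer. The sketch assumes each direction carries three logistic weights plus an intercept, as in the artifact mapping, and that the input is an already-decomposed (sum, u, v) triple. The field names, the dict-of-classifiers layout, and the omission of the profiles argument are simplifications for illustration, not the PoC's detect().

```python
from dataclasses import dataclass
from math import exp


@dataclass(frozen=True)
class DirectionClassifier:
    """Logistic weights for one contrast direction (hypothetical field names)."""
    weights: tuple  # (w_sum, w_u, w_v)
    intercept: float


def score(features, clf):
    """Logistic probability for one direction from (sum, u, v) features."""
    logit = clf.intercept + sum(w * f for w, f in zip(clf.weights, features))
    return 1.0 / (1.0 + exp(-logit))


def detect(features, classifiers, threshold=0.5):
    """Return per-direction scores and the sorted names at or above threshold."""
    scores = {name: score(features, clf) for name, clf in classifiers.items()}
    flagged = sorted(name for name, p in scores.items() if p >= threshold)
    return scores, flagged


classifiers = {
    "injection": DirectionClassifier((0.0, 0.0, 0.0), 0.0),
    "harmful": DirectionClassifier((1.0, 0.0, 0.0), -10.0),
}
scores, flagged = detect((1.5, 0.5, 0.3), classifiers)
```

Because detect() reads only its arguments, an alternative strategy (for example the histogram variant) can be substituted by passing a different function rather than subclassing the codebook.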


7. Testing Data

7.1 Saved Artifacts Referenced in Code

The PoC references these saved data files:

| File | Path | Contents | Reusable for Testing |
|---|---|---|---|
| Population precomputed | saved_data/precomputed_seed42_qwen3_0.6b.pt | z_coords, P_mean, perturbation_svd_Vh | Yes: basis for integration tests |
| Population precomputed | saved_data/precomputed_seed42_qwen3_1.7b.pt | Same for 1.7B model | Yes: multi-model test |
| Population precomputed | saved_data/precomputed_seed42_qwen3_4b.pt | Same for 4B model | Yes: multi-model test |
| Direction geometry | experiments/direction_geometry/results/Qwen_Qwen3-0.6B_manifold_projection.pt | Full condition data + SVD | Yes: golden data for codebook compilation |
| Direction geometry | experiments/direction_geometry/results/Qwen_Qwen3-1.7B_manifold_projection.pt | Same for 1.7B | Yes |
| Contrast pairs | Hardcoded in build() L268–276 and run_manifold_projection.py L139–148 | 7 behavioral contrasts | Yes: test fixture definition |

7.2 Validation Results Referenced

The __main__ block (L1121–1245) contains:

  • AUC evaluation at window sizes [1, 4, 8, 16]
  • Per-direction AUC scores for both continuous and histogram classifiers
  • Per-token AUC evaluation

These results should be captured as golden test fixtures for the production codebook:

  • Build a codebook from the 0.6B precomputed data
  • Verify that AUC scores match expected ranges
  • Verify that detection decisions match expected flags
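
The window sizes in the evaluation above imply some aggregation of per-token scores over a trailing window. How the PoC aggregates is not stated in this analysis, so the sketch below assumes a simple trailing mean. The function name and the example scores are hypothetical.

```python
from collections import deque


def rolling_mean(scores, window):
    """Trailing mean over the last `window` scores (shorter at the start)."""
    buf = deque()
    total = 0.0
    out = []
    for s in scores:
        buf.append(s)
        total += s
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out


token_scores = [0.0, 0.0, 1.0, 1.0]
smoothed = rolling_mean(token_scores, window=2)
```

A window of 1 leaves the per-token scores unchanged, which gives golden tests a simple baseline to compare the larger windows against.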

7.3 Calibration Data for Testing

For unit/integration tests, we need:

  1. Synthetic z-coord population: Small N=1000 tensor for spline fitting tests
  2. Known-contrast z-coords: Small pairs (harmful/harmless) for direction profile tests
  3. Expected spline parameters: Known knot positions/coefficients for regression tests
  4. Expected detection results: For a given input, what does detect() return?

The PoC's build_codebook_from_precomputed() provides a ready-made path to generate these fixtures from the saved .pt files.


Summary

Key Findings

  1. The 1,245-line PoC contains ~480 lines of essential code. Including the metaspline core dependency (~178 lines used), the total essential code is ~658 lines. With dead code and research artifacts removed, the production codebook should target 400–500 lines for runtime + 150–200 lines for training.

  2. The decomposition pipeline (z → CDF → simplex → bary → (sum,u,v)) is repeated 5 times in the PoC. Extracting it into a single decompose() function saves ~50 lines and eliminates a bug surface.

  3. The metaspline core has ~65% unused code when viewed from the codebook's perspective. Only SplineDistribution, MonotonicCubicSpline, ensure_strictly_increasing, and simplex() are needed — the rest (DensitySpline, unfold/fold, dcs_norm, etc.) can be dropped entirely.

  4. The histogram classifier (2×2×2 discretized approach) is an exploratory alternative, not the primary detection mechanism. The continuous logistic classifier is superior (higher AUC) and should be the MVP approach. The histogram classifier adds ~130 lines and can be deferred.

  5. The build() method is the largest single function (429 lines) and mixes training with runtime state. It must be decomposed: training logic moves to training/compiler.py, runtime state becomes immutable serialized data.

  6. Saved .pt files from the PoC provide golden test data — the manifold projection results for Qwen3-0.6B and 1.7B can be reused directly for integration tests.

Recommendation

Target: 500–600 lines total for the production codebook (runtime + training), down from 1,245 lines in the PoC and 1,745 lines including metaspline core. This is a ~65% compression.

The architecture should separate:

  • Runtime (~400 lines): Codebook, transforms, splines, detection, results
  • Training (~150 lines): compiler, stats, data loading
  • Data (bundled): safetensors + JSON, no Python

Next Steps

  1. Create src/alknet_firewall/codebook/ package structure
  2. Extract transforms.py (simplex, barycentric) — trivial, ~30 lines
  3. Port splines.py (MonotonicCubicSpline + SplineDistribution) — ~180 lines, mostly copy with cleanup
  4. Implement projection.py (project, decompose) — thin wrappers, ~30 lines
  5. Implement detection.py (detect with rolling window) — ~50 lines, port from PoC's detect()
  6. Implement codebook.py (Codebook class with load) — ~40 lines
  7. Extract training/compiler.py from build() — most complex extraction, ~150 lines
  8. Create test fixtures from saved .pt data
  9. Verify round-trip: build from .pt → serialize → load → detect matches PoC output
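
The serialization half of the round-trip check in step 9 could start as small as the sketch below, which writes spline knots to JSON and reads them back. The schema (a "knots" and a "values" array per spline name) and the helper names are hypothetical. The production splines.json layout is defined by codebook.md, not by this example.

```python
import json
import tempfile
from pathlib import Path


def save_splines(path, splines):
    """Write {name: {"knots": [...], "values": [...]}} as JSON (hypothetical schema)."""
    Path(path).write_text(json.dumps(splines, indent=2))


def load_splines(path):
    """Read the spline table back into plain dicts and lists."""
    return json.loads(Path(path).read_text())


def max_abs_diff(a, b):
    """Largest element-wise difference, for tolerance-based comparisons."""
    return max(abs(x - y) for x, y in zip(a, b))


original = {"z0": {"knots": [-1.5, 0.0, 2.25], "values": [0.0, 0.5, 1.0]}}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "splines.json"
    save_splines(path, original)
    restored = load_splines(path)

knot_error = max_abs_diff(restored["z0"]["knots"], original["z0"]["knots"])
```

Python's json module writes floats in their shortest round-tripping form, so the knots should come back exactly. Comparing detect() output against the PoC would more likely need a tolerance, since the PoC computes in torch.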