
glm-5.1 7d8a39a88a docs: resolve 4 open questions, add research, spec codebook package structure

Research-driven resolution of OQ-01, OQ-02, OQ-05, OQ-06:

- OQ-01: Remove ONNX Runtime from scope entirely — doesn't support
  activation extraction natively (optimum #972 closed as not planned),
  bloated model exports; burn/cublas via safetensors is a better future path

- OQ-02: Codebook compresses ~65% (1,245 → 500-600 lines); add Package
  Structure and Extraction from PoC sections to codebook.md based on PoC
  analysis of metaspline firewall_codebook.py

- OQ-05: Standalone API + thin adapter pattern (ADR-011); Phase 1 ships
  Firewall.screen() only, Phase 2 adds <100-line adapter packages for
  LlamaFirewall, OpenAI Agents SDK, NeMo Guardrails

- OQ-06: TOML for file-based config — standard modern Python, two-way door

Also: research OQ-03 rolling windows from taskgraph-semantic reference code,
remove onnxruntime/optimum from dependencies, move streaming screening to
Phase 2, add burn/cublas as Phase 3 alternative backend.

2026-06-13 07:27:40 +00:00


Research: PoC Codebook Architecture Analysis (OQ-02)

Date: 2026-06-13
Status: Complete
Question: What is the minimum viable codebook? Can the 1,245-line PoC codebook be compressed, and what is essential vs. exploratory/dead code?

1. PoC Architecture Overview

1.1 File Structure & Role

The PoC codebook lives in firewall_codebook.py (1,245 lines) and depends on three metaspline core modules:

firewall_codebook.py (1,245 lines)
├── Imports from metaspline core:
│   ├── metaspline.spline.SplineDistribution    (spline.py, 378 lines)
│   ├── metaspline.spline.ensure_strictly_increasing (spline.py)
│   ├── metaspline.space.unfold / fold           (space.py, 46 lines)
│   └── metaspline.transform.simplex             (transform.py, 78 lines)
├── External imports:
│   ├── sklearn.linear_model.LogisticRegression
│   └── sklearn.mixture.GaussianMixture (imported but unused)
└── Internal definitions (see §1.2)

1.2 Major Sections of `firewall_codebook.py`

Lines	Component	Description
1–50	Module docstring + imports	Theory overview, imports
53–66	`reverse_bary3d()`	Simplex → barycentric (u,v) transform
69–74	`bary_to_simplex()`	Inverse: barycentric → simplex
77–112	`DirectionProfile` dataclass	Per-contrast statistical profile
114–127	`DirectionClassifier` dataclass	Per-contrast logistic regression weights
129–146	`HistogramClassifier` dataclass	2×2×2 codebook-state histogram classifier
148–165	`DetectionResult` dataclass	Output of `detect()`
167–596	`FirewallCodebook.__init__` + `build()`	Codebook construction (429 lines!)
598–629	`FirewallCodebook.decompose()`	z → (sum, u, v) copula transform
631–669	`FirewallCodebook.classify()`	Per-contrast logistic classification
671–729	`FirewallCodebook.classify_histogram()`	8-state histogram classification
731–860	`FirewallCodebook.detect()`	Main detection entry point
862–884	`FirewallCodebook.detect_from_perturbations()`	Convenience: P → z → detect
886–945	`FirewallCodebook.summary()`	Human-readable summary
947–1041	`FirewallCodebook.evaluate_auc()`	AUC evaluation on held-out data
1044–1118	`build_codebook_from_precomputed()`	Load from saved .pt files
1121–1245	`__main__` block	Script-mode evaluation + duplicated data loading

1.3 Dependency Map

                     ┌──────────────────┐
                     │ FirewallCodebook  │
                     │  (main class)     │
                     └────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──┐   ┌──────▼──────┐   ┌─────▼─────┐
    │ SplineDist  │   │ simplex()   │   │ bary3d()  │
    │ (CDF/ICDF) │   │ (transform) │   │ (local)   │
    └─────────┬──┘   └─────────────┘   └───────────┘
              │
    ┌─────────▼──────────────┐
    │ MonotonicCubicSpline   │
    │ (pchip interpolation)   │
    └────────────────────────┘

The FirewallCodebook has these hard dependencies at runtime:

SplineDistribution — CDF/ICDF transforms (population fitting + inference)
simplex() — normalize to simplex (x/sum(x))
reverse_bary3d() — project simplex to 2D barycentric coordinates
torch — tensor operations
numpy — sklearn bridge for training only
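
The two transform dependencies are small enough to sketch in full. The block below is a pure-Python illustration, not the PoC code: `simplex()` follows the x/sum(x) definition given above, while the triangle layout used for `reverse_bary3d()` and `bary_to_simplex()` (vertices at (0, 0), (1, 0), (0.5, √3/2)) is an assumed convention that may differ from the one in `firewall_codebook.py`.

```python
import math

SQRT3_2 = math.sqrt(3.0) / 2.0


def simplex(x):
    """Normalize positive weights so they sum to 1 (x / sum(x))."""
    total = sum(x)
    if total <= 0:
        raise ValueError("simplex() needs a positive total")
    return [xi / total for xi in x]


def reverse_bary3d(p):
    """Map a simplex point (a, b, c) to 2D coordinates (u, v).

    Assumed layout: a -> (0, 0), b -> (1, 0), c -> (0.5, sqrt(3)/2).
    """
    _a, b, c = p
    return (b + 0.5 * c, SQRT3_2 * c)


def bary_to_simplex(u, v):
    """Inverse of reverse_bary3d() under the same assumed layout."""
    c = v / SQRT3_2
    b = u - 0.5 * c
    return [1.0 - b - c, b, c]


point = simplex([2.0, 1.0, 1.0])       # weights become [0.5, 0.25, 0.25]
u, v = reverse_bary3d(point)
recovered = bary_to_simplex(u, v)      # round-trips back to `point`
```

Because the forward and inverse maps are pure functions, a round-trip check like the one above is a natural first unit test for `codebook/transforms.py`.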

Training-time dependencies (not needed at inference):

sklearn.linear_model.LogisticRegression — classifier training
sklearn.metrics.silhouette_score — profile quality metric
sklearn.metrics.roc_auc_score — evaluation metric

2. Essential vs. Exploratory vs. Dead Code Classification

2.1 Essential (Required for Production Codebook)

These are the core components that must be extracted into the production package:

Component	Lines	Role	Production Mapping
`reverse_bary3d()`	53–66	z → (u,v) barycentric projection	`codebook/transforms.py`
`bary_to_simplex()`	69–74	Inverse barycentric (needed for reconstruction)	`codebook/transforms.py`
`SplineDistribution`	spline.py:200–261	CDF/ICDF for copula transform	`codebook/splines.py` (adapted)
`MonotonicCubicSpline`	spline.py:80–197	PCHIP interpolation engine	`codebook/splines.py` (adapted)
`ensure_strictly_increasing`	spline.py:43–73	Knot sanitization	`codebook/splines.py`
`simplex()`	transform.py:34–36	Normalize to unit simplex	`codebook/transforms.py`
`FirewallCodebook.__init__`	182–203	State initialization	`codebook/codebook.py`
`FirewallCodebook.decompose()`	598–629	z → (sum, u, v) copula space	`codebook/projection.py`
`FirewallCodebook.detect()`	731–860	Main detection logic	`codebook/detection.py`
`DetectionResult`	148–165	Output dataclass	`codebook/results.py`
`FirewallCodebook.build()` (core logic only)	204–396	SVD, spline fitting, profile computation	`training/compiler.py`
`DirectionProfile`	77–112	Per-direction statistical profile	`codebook/profiles.py`
`DirectionClassifier`	114–127	Per-direction linear classifier	`codebook/classifiers.py`
`FirewallCodebook.detect_from_perturbations()`	862–884	P → z convenience wrapper	`codebook/projection.py`

Total essential lines: ~480 lines in firewall_codebook.py, plus the ~178 lines of metaspline core it uses (~658 in total)

2.2 Exploratory / Research Code

These were useful for research but are not needed in production:

Component	Lines	Purpose	Disposition
`HistogramClassifier` dataclass	129–146	Alternative 2×2×2 discretized classifier	Keep as optional, not MVP
`classify_histogram()`	671–729	Histogram-based classification variant	Research variant, not MVP
`build()` histogram classifier section	481–596	Training histogram classifiers	Research variant
`evaluate_auc()`	947–1041	Offline AUC evaluation	Testing/benchmarking only
`summary()`	886–945	Human-readable codebook summary	Debugging/diagnostic tool
`classify()`	631–669	Per-position probability output	Subsumed by `detect()`
`build_codebook_from_precomputed()`	1044–1118	Load from .pt files	Training pipeline I/O
`build()` contrast_pairs default	268–276	Hardcoded 7-pair contrast list	Config, not code
`pooled_std()` inner function	327–331	Statistical utility	Extract to `training/stats.py`
`cohen_d()` inner function	337–340	Effect size utility	Extract to `training/stats.py`
`compute_silhouette()` inner function	365–370	Quality metric	Training diagnostic
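
As an illustration of what the extracted `training/stats.py` helpers could look like, here is a minimal module-level sketch using the textbook definitions of pooled standard deviation and Cohen's d. Whether the PoC's inner functions use exactly these formulas (for example, sample versus population variance) is an assumption; the sample groups are made-up values.

```python
import math
import statistics


def pooled_std(a, b):
    """Pooled sample standard deviation of two groups (each needs >= 2 values)."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    return math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))


def cohen_d(a, b):
    """Cohen's d: difference of means in units of the pooled std."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    spread = pooled_std(a, b)
    if spread == 0.0:
        # Both groups are constant: no effect if equal, unbounded otherwise.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / spread


# Made-up 1D projections for one contrast, e.g. harmful vs. harmless.
harmful = [2.0, 3.0, 4.0]
harmless = [0.0, 1.0, 2.0]
effect = cohen_d(harmful, harmless)  # means differ by 2 pooled stds
```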

2.3 Dead Code

Component	Lines	Issue
`sklearn.mixture.GaussianMixture` import	44	Imported but never used
`unfold()` / `fold()` from `space.py`	space.py:4–45	Imported but never called in codebook
`dcs_norm()` from `transform.py`	transform.py:20–23	Imported but never used
`__main__` block duplicated data loading	1121–1245	Lines 1203–1245 repeat 1126–1182 verbatim with different formatting — copy-paste artifact
`bary_to_simplex()`	69–74	Defined but never called in codebook (kept in §2.1 only because reconstruction needs the inverse)
`DensitySpline` class	spline.py:315–378	Legacy alternative, not used by codebook
`empirical_cdf()` / `empirical_density()` / `log_bins()` / `generate_asymmetric_knots()`	spline.py:268–313	Utility functions not used by codebook

2.4 Infrastructure (Training Pipeline, Not Runtime)

Component	Lines	Purpose
`run_manifold_projection.py` (entire)	823	Model loading, data collection, SVD computation, saving artifacts
`analyzer.py` (entire)	560	Multi-layer direction analysis, residual extraction
`discover_directions.py` (entire)	401	Post-hoc direction discovery from trajectory data
`build()` SVD computation section	229–233	Population SVD → V3 basis

3. Training Pipeline Analysis

3.1 `run_manifold_projection.py` — Step by Step

The training pipeline performs these operations:

Model Loading (L79–103): Load HuggingFace model + tokenizer. Configure for GPU/CPU.
Condition Catalog Construction (L106–153): Build contrastive prompt sets for 8 behavioral condition pairs:
- self_ref / other_ref
- violated / expected (semantic)
- code_violated / code_expected
- instruction / data
- tool_call / natural_language
- uncertain / confident
- harmful / harmless
- injection / benign_instruction
Feature Extraction (L156–213): For each condition, extract:
- Hidden states across all layers → residuals (n_prompts, n_layers+1, hidden_dim)
- ICDF perturbation vectors → perturbations (n_prompts, 64)
- Last-layer hidden states → hidden_last (n_prompts, hidden_dim)
SVD Computation (L216–263):
- Activation SVD: H_all (N, 2048) → principal components in hidden state space
- Perturbation SVD: P_all (N, 64) → the 3D perturbation manifold (this is the basis V3)
Direction Vector Computation (L393–434): Per-contrast mean-difference direction vectors at best layers.
Projection Analysis (L436–668): Extensive analysis of direction projections onto activation/perturbation subspaces. This is research output, not needed for codebook compilation.
Save Results (L670–755):
- .json: Scalar metrics, SVD variance, separation stats
- .pt: Tensors — this is the key artifact:
  - perturbation_svd_Vh → top-k right-singular vectors (the SVD basis)
  - perturbation_mean → population mean for centering
  - condition_perturbations → per-condition perturbation vectors
  - condition_hidden_last → last-layer hidden states per condition

3.2 Codebook Artifact Production

The .pt file from run_manifold_projection.py feeds directly into build_codebook_from_precomputed(), which:

Loads .pt file → extracts perturbation_svd_Vh[:3] (V3 basis) and perturbation_mean (P_mean)
Reconstructs z-coords: z = (P - P_mean) @ V3.T
Calls FirewallCodebook.build() which:
- Fits SplineDistribution on each z dimension (population)
- Fits SplineDistribution on sums (population)
- Decomposes each condition via CDF → (sum, u, v)
- Computes DirectionProfiles (pooled stats, Cohen's d, thresholds)
- Trains DirectionClassifiers (logistic regression per contrast)
- Trains HistogramClassifiers (8-state discrete classifiers)
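
The z-coordinate reconstruction in the second step, z = (P - P_mean) @ V3.T, can be sketched without torch. The function below is a pure-Python illustration with toy dimensions (4 instead of 64); the name `project` follows the production spec, but the list-based signature is an assumption made for readability.

```python
def project(P, P_mean, V3):
    """Center each perturbation vector and project it onto the basis rows.

    P:      n vectors of length d
    P_mean: population mean vector of length d
    V3:     k basis vectors (rows) of length d
    Returns n vectors of length k, i.e. z = (P - P_mean) @ V3.T.
    """
    z = []
    for row in P:
        centered = [p - m for p, m in zip(row, P_mean)]
        z.append([sum(c * b for c, b in zip(centered, basis)) for basis in V3])
    return z


# Toy example: d = 4, k = 3, basis = first three coordinate axes.
V3 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
P_mean = [1.0, 1.0, 1.0, 1.0]
z = project([[2.0, 3.0, 4.0, 5.0]], P_mean, V3)  # one row: [1.0, 2.0, 3.0]
```

In the torch-based code this whole function collapses to the single expression quoted above; the loop form is only meant to make the centering and the transpose explicit.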

The produced codebook artifacts map to the production spec as:

PoC Artifact	Production Format	Notes
`FirewallCodebook.z_splines` (3× SplineDistribution)	`splines.json` (knot positions + coefficients)	Spline knots serialized as JSON arrays
`FirewallCodebook.svd_V3` (3×64 tensor)	`basis.safetensors` → `basis_vectors`	Reshaped for multi-layer format
`FirewallCodebook.population_mean_P` (64 tensor)	`basis.safetensors` → `mean`	Centering vector
`FirewallCodebook.direction_profiles` (dict)	`regions.safetensors` → centroids, scale	Per-direction statistical profiles
`FirewallCodebook.classifiers` (dict)	Part of `config.json` or `regions.safetensors`	Logistic weights (3 floats + intercept per direction)
`FirewallCodebook.sum_spline` (SplineDistribution)	`splines.json`	Sum distribution spline
`FirewallCodebook.population_stats` (dict)	`regions.safetensors` → centroids, scale	Population baselines
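
To make the `splines.json` row concrete, the sketch below round-trips spline parameters through JSON. The field names (`knots`, `coefficients`) and the top-level keys (`z_splines`, `sum_spline`) are hypothetical; the table only states that knot positions and coefficients are serialized as JSON arrays.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SplineParams:
    knots: list          # strictly increasing knot positions
    coefficients: list   # one coefficient per knot (hypothetical layout)


def dump_splines(z_splines, sum_spline):
    """Serialize the per-dimension splines and the sum spline to JSON text."""
    payload = {
        "z_splines": [asdict(s) for s in z_splines],
        "sum_spline": asdict(sum_spline),
    }
    return json.dumps(payload, indent=2)


def load_splines(text):
    """Inverse of dump_splines(): rebuild SplineParams from JSON text."""
    payload = json.loads(text)
    z_splines = [SplineParams(**s) for s in payload["z_splines"]]
    return z_splines, SplineParams(**payload["sum_spline"])


# Made-up parameters for three z dimensions plus the sum distribution.
z_in = [SplineParams([-1.0, 0.0, 1.0], [0.1, 0.5, 0.9]) for _ in range(3)]
sum_in = SplineParams([0.0, 1.5, 3.0], [0.0, 0.5, 1.0])
z_out, sum_out = load_splines(dump_splines(z_in, sum_in))
```

A round-trip like this is the small-scale version of the build → serialize → load check listed under Next Steps.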

4. Core Library Assessment

4.1 Metaspline Core Usage

The metaspline core (spline.py 378 lines, transform.py 78 lines, space.py 46 lines — 502 lines total) provides:

Module	Lines	Used by Codebook	Lines Actually Used
`spline.py`	378	`SplineDistribution`, `ensure_strictly_increasing`	~175 lines (SplineDistribution + MonotonicCubicSpline + ensure_strictly_increasing)
`transform.py`	78	`simplex()` only	3 lines
`space.py`	46	None (imported but unused)	0 lines

Actual dependency: ~178 lines out of 502. The codebook uses only SplineDistribution (CDF/ICDF), MonotonicCubicSpline (its backbone), ensure_strictly_increasing, and simplex(). The following are unused:

DensitySpline class (spline.py, 60 lines) — legacy CDF-based distribution, not used
empirical_cdf(), empirical_density(), log_bins(), generate_asymmetric_knots() (spline.py, ~45 lines) — utility functions, unused
unfold() / fold() (space.py, 46 lines) — digit expansion/contraction, unused
double_cumsum(), double_diff(), dcs_norm(), normalize_01(), column_cdf_normalize(), toBase(), numSymbols(), ndVec() (transform.py, ~75 lines) — unused

4.2 How Much Is Inline vs. Library?

The FirewallCodebook.build() method has significant inline reimplementation of statistical operations that could be cleaner:

Lines 229–233: SVD computation is inline (should use the pipeline's compute_perturbation_svd())
Lines 236–246: Spline fitting is inline but delegates to SplineDistribution
Lines 313–324: CDF → decompose → barycentric is duplicated 3× (in build(), classify(), classify_histogram())
Lines 327–340: pooled_std() and cohen_d() are inner functions, not module-level
Lines 365–370: compute_silhouette() is an inner function with sklearn import

The core decomposition pipeline (z → CDF → simplex → barycentric → (sum, u, v)) appears verbatim in:

build() lines 242–250 (population)
build() lines 313–324 (per-condition, profile computation)
build() lines 445–456 (per-condition, classifier computation)
build() lines 521–532 (per-condition, histogram computation)
decompose() lines 610–628 (runtime inference)

This is the single most compressible pattern — a 10-line decomposition sequence repeated 5 times.
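
A single shared function for that sequence might look like the sketch below. Several things here are assumptions rather than PoC behavior: `EmpiricalCDF` is a rank-based stand-in for `SplineDistribution`, "sum" is taken to be the sum of the three CDF values, and the barycentric layout puts the triangle vertices at (0, 0), (1, 0), (0.5, √3/2).

```python
import bisect
import math


class EmpiricalCDF:
    """Stand-in for SplineDistribution: rank-based CDF over a fitted sample."""

    def __init__(self, sample):
        self._sorted = sorted(sample)

    def cdf(self, x):
        # Fraction of the sample <= x, kept strictly inside (0, 1).
        rank = bisect.bisect_right(self._sorted, x)
        return (rank + 0.5) / (len(self._sorted) + 1)


def decompose(z, splines):
    """Map three z coordinates to (sum, u, v): CDF -> simplex -> barycentric."""
    c = [s.cdf(zi) for s, zi in zip(splines, z)]
    total = sum(c)                      # assumed meaning of "sum"
    _a, b, w = (ci / total for ci in c)  # simplex normalization
    u = b + 0.5 * w                     # assumed triangle layout
    v = (math.sqrt(3.0) / 2.0) * w
    return total, u, v


# One CDF per z dimension, all fitted on the same toy population.
splines = [EmpiricalCDF([0.0, 1.0, 2.0, 3.0]) for _ in range(3)]
total, u, v = decompose([1.5, 1.5, 1.5], splines)
```

With this in place, each of the five call sites reduces to one call, and the training and inference paths cannot drift apart.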

5. Minimum Viable Codebook

5.1 Required Functions for Production

Based on the production spec (codebook.md), the minimum viable codebook needs:

project(activations) → z_coords: SVD projection (matrix multiply + centering)
decompose(z_coords) → (sum, u, v): CDF → simplex → barycentric
score(z_coords) → list[DimensionSignal]: Per-direction scoring against profiles
detect(z_coords, threshold) → DetectionResult: Threshold comparison + flagging
load(path) → Codebook: Deserialize from safetensors + JSON
SplineDistribution: CDF evaluation for decompose

And for the training pipeline (not runtime):

build(population_data, direction_data) → Codebook: SVD, spline fitting, classifier training

5.2 Compression Estimate

Source	Lines	Classification	Production Lines
`firewall_codebook.py`	1,245	Core + research + dead	~350
`spline.py` (used parts)	~178	Core library	~180
`transform.py` (used parts)	~3	Core library	~5
Total PoC dependency	~1,426		~535

Target estimate: 400–500 lines for runtime codebook, 150–200 lines for training pipeline.

Breakdown of production targets:

Module	Target Lines	Contents
`codebook/transforms.py`	~30	`simplex()`, `reverse_bary3d()`, `bary_to_simplex()`
`codebook/splines.py`	~180	`MonotonicCubicSpline`, `SplineDistribution`, `ensure_strictly_increasing`
`codebook/profiles.py`	~30	`DirectionProfile` dataclass
`codebook/classifiers.py`	~20	`DirectionClassifier` dataclass
`codebook/results.py`	~15	`DetectionResult` dataclass
`codebook/projection.py`	~30	`project()` and `decompose()`
`codebook/detection.py`	~50	`detect()` with rolling window, threshold logic
`codebook/codebook.py`	~40	`Codebook` class (init, load, summary)
`training/compiler.py`	~150	`build()` — SVD, spline fitting, profile computation
`training/stats.py`	~25	`pooled_std()`, `cohen_d()`, silhouette
Total	~570

This is 46% of the PoC's 1,245 lines, or ~33% of the roughly 1,745 lines obtained by adding the full 502-line metaspline core referenced in the overview.
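
The rolling window budgeted for `codebook/detection.py` in the table above can be very small. The sketch below smooths a stream of per-token scores with a trailing mean; using the mean as the aggregation is an assumption, since this note does not say how the PoC's `detect()` combines scores within a window.

```python
from collections import deque


def rolling_mean(scores, window):
    """Trailing mean over the most recent `window` scores at each position."""
    if window < 1:
        raise ValueError("window must be >= 1")
    recent = deque(maxlen=window)  # oldest score is dropped automatically
    out = []
    for s in scores:
        recent.append(s)
        out.append(sum(recent) / len(recent))
    return out


# Made-up per-token probabilities for one direction.
smoothed = rolling_mean([0.0, 1.0, 1.0, 0.0], window=2)
```

Positions before the window has filled are averaged over the scores seen so far, so the output has the same length as the input. The window sizes evaluated in the PoC's `__main__` block are small enough that recomputing the sum at each step is not a concern.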

5.3 What Gets Cut

Lines Cut	Source	Reason
~130	`HistogramClassifier` + `classify_histogram()` + histogram training	Alternative approach, not MVP
~95	`evaluate_auc()`	Testing/benchmarking tool
~60	`summary()`	Debugging tool, not runtime
~75	`__main__` block (including duplicated code)	Script-mode evaluation
~40	`classify()` method	Subsumed by `detect()`
~30	`build_codebook_from_precomputed()`	Training I/O, not runtime
~124	Unused metaspline code (DensitySpline, unfold/fold, dcs_norm, etc.)	Dead code
~50	Repeated decomposition sequences	DRY refactoring

6. Proposed Decomposition

Matching the production package structure from codebook.md:

src/alknet_firewall/
├── codebook/
│   ├── __init__.py            # Public exports
│   ├── codebook.py            # Codebook class (init, load, project, score)
│   ├── transforms.py          # simplex, reverse_bary3d, bary_to_simplex
│   ├── splines.py             # MonotonicCubicSpline, SplineDistribution
│   ├── profiles.py            # DirectionProfile, population stats
│   ├── classifiers.py          # DirectionClassifier (logistic weights)
│   ├── results.py             # DetectionResult, DimensionSignal, AlarmLevel
│   ├── projection.py          # project(), decompose()
│   └── detection.py           # detect(), threshold comparison, rolling window
├── training/
│   ├── __init__.py
│   ├── compiler.py            # build() — SVD, spline fitting, profile comp
│   ├── stats.py               # pooled_std, cohen_d, silhouette
│   └── data_loader.py         # Condition catalog, prompt sets, data loading
└── data/
    └── codebook/
        ├── basis.safetensors
        ├── regions.safetensors
        ├── splines.json
        └── config.json

6.1 Key Design Decisions for Extraction

SplineDistribution stays in codebook/splines.py — it's a general-purpose distribution class used at both training and inference time. No need for a separate package.
simplex() moves to codebook/transforms.py — it's a single pure function (3 lines), no need for the transform.py dependency chain.
unfold/fold from space.py are dropped — never used by the codebook.
DirectionProfile and DirectionClassifier become separate dataclass modules — clean separation of data from logic.
build() moves entirely to training/compiler.py — runtime codebook is read-only. This is the biggest architectural change: the codebook class should not have a build() classmethod.
Decompose becomes a pure function — decompose(z, splines) is a pure mathematical transform with no state dependencies beyond the splines. Making it a standalone function enables testing.
Detection is separate from the codebook class — detect(z, classifiers, profiles, threshold) is a stateless function given the codebook data. This enables swapping detection strategies without touching the codebook.
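
The last two decisions can be illustrated together: immutable classifier data plus a stateless scoring function. This is a simplified sketch, not the PoC's `detect()`. It omits the profiles argument, assumes the logistic features are (sum, u, v), and uses hypothetical field names; the only detail taken from this note is that each direction carries three weights and an intercept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectionClassifier:
    weights: tuple    # three logistic weights, assumed order (sum, u, v)
    intercept: float


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def detect(features, classifiers, threshold=0.5):
    """Return {direction: probability} for directions scoring above threshold."""
    flagged = {}
    for name, clf in classifiers.items():
        logit = clf.intercept + sum(w * f for w, f in zip(clf.weights, features))
        prob = sigmoid(logit)
        if prob > threshold:
            flagged[name] = prob
    return flagged


# Made-up weights for two directions.
classifiers = {
    "harmful": DirectionClassifier((2.0, 0.0, 0.0), -1.0),
    "injection": DirectionClassifier((0.0, 0.0, 0.0), -3.0),
}
flagged = detect((1.5, 0.5, 0.29), classifiers)
```

Because `detect()` takes the classifier data as an argument and the dataclass is frozen, a different detection strategy can be swapped in without touching how the codebook is loaded.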

7. Testing Data

7.1 Saved Artifacts Referenced in Code

The PoC references these saved data files:

File	Path	Contents	Reusable for Testing
Population precomputed	`saved_data/precomputed_seed42_qwen3_0.6b.pt`	z_coords, P_mean, perturbation_svd_Vh	Yes — basis for integration tests
Population precomputed	`saved_data/precomputed_seed42_qwen3_1.7b.pt`	Same for 1.7B model	Yes — multi-model test
Population precomputed	`saved_data/precomputed_seed42_qwen3_4b.pt`	Same for 4B model	Yes — multi-model test
Direction geometry	`experiments/direction_geometry/results/Qwen_Qwen3-0.6B_manifold_projection.pt`	Full condition data + SVD	Yes — golden data for codebook compilation
Direction geometry	`experiments/direction_geometry/results/Qwen_Qwen3-1.7B_manifold_projection.pt`	Same for 1.7B	Yes
Contrast pairs	Hardcoded in `build()` L268–276 and `run_manifold_projection.py` L139–148	7 behavioral contrasts	Yes — test fixture definition

7.2 Validation Results Referenced

The __main__ block (L1121–1245) contains:

AUC evaluation at window sizes [1, 4, 8, 16]
Per-direction AUC scores for both continuous and histogram classifiers
Per-token AUC evaluation

These results should be captured as golden test fixtures for the production codebook:

Build a codebook from the 0.6B precomputed data
Verify that AUC scores match expected ranges
Verify that detection decisions match expected flags

7.3 Calibration Data for Testing

For unit/integration tests, we need:

Synthetic z-coord population: Small N=1000 tensor for spline fitting tests
Known-contrast z-coords: Small pairs (harmful/harmless) for direction profile tests
Expected spline parameters: Known knot positions/coefficients for regression tests
Expected detection results: For a given input, what does detect() return?
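
The first two fixtures can be generated deterministically with the standard library. The sketch below is one possible shape for such helpers; the function names, the Gaussian population, and the mean shift along the first coordinate are all illustrative assumptions, not properties of the PoC data.

```python
import random


def make_population(n=1000, dims=3, seed=42):
    """Synthetic z-coord population: n points of standard normal noise."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n)]


def make_contrast_pair(n=50, shift=2.0, dims=3, seed=7):
    """Two groups that differ by a mean shift along the first coordinate."""
    rng = random.Random(seed)
    base = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n)]
    shifted = [
        [rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(dims - 1)]
        for _ in range(n)
    ]
    return shifted, base


population = make_population()
harmful_like, harmless_like = make_contrast_pair()
```

Seeding a private `random.Random` instance keeps the fixtures reproducible across test runs without touching the global random state.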

The PoC's build_codebook_from_precomputed() provides a ready-made path to generate these fixtures from the saved .pt files.

Summary

Key Findings

The 1,245-line PoC contains ~480 lines of essential code. Including the metaspline core dependency (~178 lines used), the total essential code is ~658 lines. With dead code and research artifacts removed, the production codebook should target 400–500 lines for runtime + 150–200 lines for training.
The decomposition pipeline (z → CDF → simplex → bary → (sum,u,v)) is repeated 5 times in the PoC. Extracting it into a single decompose() function saves ~50 lines and eliminates a bug surface.
The metaspline core has ~65% unused code when viewed from the codebook's perspective. Only SplineDistribution, MonotonicCubicSpline, ensure_strictly_increasing, and simplex() are needed — the rest (DensitySpline, unfold/fold, dcs_norm, etc.) can be dropped entirely.
The histogram classifier (2×2×2 discretized approach) is an exploratory alternative, not the primary detection mechanism. The continuous logistic classifier is superior (higher AUC) and should be the MVP approach. The histogram classifier adds ~130 lines and can be deferred.
The build() method is the largest single function (429 lines together with __init__, per §1.2) and mixes training with runtime state. It must be decomposed: training logic moves to training/compiler.py, runtime state becomes immutable serialized data.
Saved .pt files from the PoC provide golden test data — the manifold projection results for Qwen3-0.6B and 1.7B can be reused directly for integration tests.

Recommendation

Target: 500–600 lines total for the production codebook (runtime + training), down from 1,245 lines in the PoC and 1,745 lines including metaspline core. Measured against the 1,745-line total, this is a ~65% compression; against the 1,245-line PoC file alone it is roughly 55%.

The architecture should separate:

Runtime (~400 lines): Codebook, transforms, splines, detection, results
Training (~150 lines): compiler, stats, data loading
Data (bundled): safetensors + JSON, no Python

Next Steps

Create src/alknet_firewall/codebook/ package structure
Extract transforms.py (simplex, barycentric) — trivial, ~30 lines
Port splines.py (MonotonicCubicSpline + SplineDistribution) — ~180 lines, mostly copy with cleanup
Implement projection.py (project, decompose) — thin wrappers, ~30 lines
Implement detection.py (detect with rolling window) — ~50 lines, port from PoC's detect()
Implement codebook.py (Codebook class with load) — ~40 lines
Extract training/compiler.py from build() — most complex extraction, ~150 lines
Create test fixtures from saved .pt data
Verify round-trip: build from .pt → serialize → load → detect matches PoC output
