docs(research): split alknet-tensor into alknet-runtime + alknet-compute + alknet-tensor

Extract the shared JS+wgpu substrate (verified by the alknet-desktop POCs) as alknet-runtime — the generalized QuickJS-NG + wgpu runtime that both alknet-desktop (render) and alknet-compute (tensor compute) build on. Key property driving the split: wgpu on llvmpipe is genuinely useful compute with no physical GPU (WGSL → optimized SIMD beats JS for non-trivial workloads), so wgpu is unconditional in the runtime rather than a feature flag. Reframes the original alknet-tensor architecture-summary as alknet-compute (builds on alknet-runtime + alknet-tensor) with ShaderGenerator as a trait (WGSL first impl, SPIR-V/GLSL/naga-IR later per wgpu multi-input-language support). alknet-tensor/metatensor-format.md is now clearly the pure binary format crate (no JS or wgpu dep), usable standalone by a pure-Rust model server. Layering: alknet-runtime depends on alknet-call (registry authority stays per ADR-013); alknet-compute and alknet-desktop depend on alknet-runtime; alknet-tensor is a pure-format sibling.
2026-06-30 12:44:39 +00:00
parent b71db99753
commit 303b9a58e2
4 changed files with 315 additions and 57 deletions
--- a/docs/research/alknet-compute/architecture-summary.md
+++ b/docs/research/alknet-compute/architecture-summary.md
@@ -1,20 +1,22 @@
-# alknet-tensor: Research Summary
+# alknet-compute: Tensor Compute Engine (Research Summary)
-**Status:** Early research — architecture direction established, no POCs yet. Derived from analyzing `webgpu-torch` as a reference design and the quickjs+wgpu verification from the alknet-desktop POCs.
+**Status:** Early research — architecture direction established, no POCs yet. Derived from analyzing `webgpu-torch` as a reference design. This doc was previously titled `alknet-tensor/architecture-summary.md`; the crate-decomposition session on 2026-06-30 split the original `alknet-tensor` concept into two crates: `alknet-tensor` (the pure-format metatensor binary layout, now at `docs/research/alknet-tensor/metatensor-format.md`) and `alknet-compute` (the wgpu compute engine — this doc). The compute engine builds on `alknet-runtime` (the JS+wgpu substrate, `docs/research/alknet-runtime/summary.md`) and `alknet-tensor` (the format).
-**Date:** 2026-06-20
+**Date:** 2026-06-20 (original), 2026-06-30 (reframed for crate split)
-**Scope:** Captures the architectural direction for a Rust+wgpu tensor library with autograd, using QuickJS as a thin API/composition layer and WGSL compute shaders for execution. Documents what `webgpu-torch` established as a reference, how the architecture differs from a straight port, and what unknowns remain. Separate from `alknet-desktop` but shares the same verified substrate (quickjs + wgpu + the operations protocol).
+**Scope:** Captures the architectural direction for the wgpu compute engine: buffer management, kernel codegen, autograd-via-flowgraph, distributed training over irpc. Uses `alknet-runtime` for the JS isolate, wgpu device, and ops bridge into alknet-call's registry; uses `alknet-tensor` for the binary model format. Documents what `webgpu-torch` established as a reference, how the architecture differs from a straight port, and what unknowns remain.
 ---
 ## Executive Summary
-`alknet-tensor` is a PyTorch-shaped tensor computation library built on Rust + wgpu, where the JS layer (QuickJS via rquickjs) is a thin API/composition surface and Rust owns memory, dispatch, and codegen. It is derived from the design of `webgpu-torch` (`/workspace/webgpu-torch`) — a pure-JS tensor + autograd library that runs entirely on the WebGPU compute pipeline — but is not a port of its code. webgpu-torch is the *reference design*; alknet-tensor is the *production architecture*.
+`alknet-compute` is a PyTorch-shaped tensor computation layer built on the `alknet-runtime` substrate (Rust + wgpu + QuickJS via rquickjs) and `alknet-tensor` (the binary format). It owns the tensor-shaped abstractions: `BufferId`-handle buffer manager, the `OpSpec`/`KernelSpec` op table, the `ShaderGenerator` codegen pipeline, the ~5 high-level Rust ops, autograd-via-flowgraph, and distributed training. It does not own the JS isolate, the wgpu device, or the operations-protocol bridge — those live in `alknet-runtime`. It does not own the binary format — that lives in `alknet-tensor`.
-The two completed alknet-desktop POCs (documented in `docs/research/alknet-desktop/poc-summary.md`) established the substrate this builds on:
+It is derived from the design of `webgpu-torch` (`/workspace/webgpu-torch`) — a pure-JS tensor + autograd library that runs entirely on the WebGPU compute pipeline — but is not a port of its code. webgpu-torch is the *reference design*; alknet-compute is the *production architecture*.
-1. **wgpu renders on llvmpipe (software Vulkan) with no physical GPU** — so tensor compute is testable on this OVH box right now, deployable to vast.ai GPU instances for production.
+The substrate this builds on is verified by the alknet-desktop POCs and captured in `docs/research/alknet-runtime/summary.md`:
-2. **QuickJS-NG runs the operations protocol (`@alkdev/operations` registry, call, envelopes, ACL, `buildCallHandler`)** — so every tensor op can be an `OperationSpec` on the registry, network-callable over irpc, same as any other operation.
+
-3. **`typebox-rs` already has the handlebars codegen pattern** (`/workspace/@alkimiadev/typebox-rs/src/codegen/`) — `RustGenerator` and `TypeScriptGenerator` render typed schemas to target languages; a `WgslGenerator` is the same shape, rendering `KernelSpec` → WGSL shader strings.
+1. **wgpu on llvmpipe (software Vulkan) is genuinely useful compute with no physical GPU** — WGSL compiles to optimized SIMD, beats JS for any non-trivial workload, and the same WGSL runs at full GPU speed when a GPU is present. Tensor compute is testable on this OVH box right now, deployable to vast.ai GPU instances for production. The runtime acquires the wgpu device; alknet-compute uses it.
 2. **QuickJS-NG runs the operations protocol (`@alkdev/operations` registry, call, envelopes, ACL, `buildCallHandler`)** — verified by POC-2. Every tensor op can be an `OperationSpec` on the registry, network-callable over irpc, same as any other operation. The runtime owns the ops bridge; alknet-compute registers its ops on the runtime's registry.
 3. **`typebox-rs` has the handlebars codegen pattern** (`/workspace/@alkimiadev/typebox-rs/src/codegen/`) — `RustGenerator` and `TypeScriptGenerator` render typed schemas to target languages; a `ShaderGenerator` trait with a `WgslGenerator` impl is the same shape, rendering `KernelSpec` → shader strings. The trait is parameterized by shading language (WGSL first, SPIR-V / GLSL / naga-IR later) per wgpu's multi-input-language support.
 This solves several downstream problems that weren't the original target (see §Downstream Problems Solved).
@@ -114,31 +116,43 @@ Small and entirely compute-oriented (no render passes, no swapchain, no textures
 ## The Architecture: JS as API, Rust as Execution
-The key architectural decision: **JS holds handles, Rust owns memory and dispatch.** This is the PyTorch model (Python holds handles, C++/CUDA owns memory) applied to QuickJS + wgpu.
+The key architectural decision: **JS holds handles, Rust owns memory and dispatch.** This is the PyTorch model (Python holds handles, C++/CUDA owns memory) applied to QuickJS + wgpu. Under the crate split, the JS isolate and wgpu device live in `alknet-runtime`; `alknet-compute` owns the tensor-shaped abstractions on top.
-### What lives in JS (QuickJS)
+### What lives in alknet-runtime (the substrate)
-The thin API/composition layer. No tensor data, no GPU calls.
+- **The JS isolate** (rquickjs + QuickJS-NG, the 271-module shared core bundle)
 - **The wgpu device** (acquired unconditionally; llvmpipe on CPU-only boxes, real GPU when present)
 - **The operations-protocol bridge** into alknet-call's `OperationRegistry` — tensor ops registered here become `OperationSpec`s, network-callable via `CallClient`/`from_call` (ADR-017)
 - **Primitive compute dispatch** — compile shader module, create buffer, dispatch compute pass, readback. `alknet-compute`'s high-level ops are built on these primitives.
 - **Sandbox / privilege model** — `allowFetch`/`allowFs`/`envProxy` gates
 ### What lives in alknet-compute (this crate)
 #### JS layer (thin API/composition, no tensor data, no GPU calls)
 - **Tensor** = `{id: BufferId, shape: number[], dtype: string, requiresGrad: boolean, grad: Tensor | null}` — metadata only, the data is a Rust-owned `wgpu::Buffer`
 - **Op table** — declarative `OpSpec` definitions (same schema as webgpu-torch's, possibly as TypeBox schemas for registry integration)
 - **Autograd graph** — `GradientContext`, `AutoFunction`, backward bookkeeping. Pure metadata wiring.
 - **nn module hierarchy** — `Module`, `Parameter`, `Sequential`, `Conv2d`, `Linear`, etc. Composition structure that builds the call graph.
 - **Optimizer** — param groups, state, the `step()` loop. Calls Rust-side ops.
- **Custom kernel registration** — user writes WGSL string, calls `register_kernel(name, wgsl, input_specs, output_specs)`. Rust compiles and caches.
+- **Custom kernel registration** — user writes a shader string, calls `register_kernel(name, shader, input_specs, output_specs)`. Rust compiles and caches.
 - **Operations registry integration** — each tensor op is an `OperationSpec` (verified on quickjs by POC 2). Built-in ops register at init; user ops register dynamically. All network-callable over irpc.
-### What lives in Rust
+#### Rust layer (memory, dispatch, codegen — the execution layer)
 Memory, dispatch, codegen. The execution layer.
 - **Buffer manager** — `HashMap<BufferId, wgpu::Buffer>` with manual lifetime management. Replaces webgpu-torch's `FinalizationRegistry`-driven JS buffer pool with Rust-native resource management. No GC interaction, no weak refs, deterministic destruction.
- **Kernel compiler** — `wgpu::ShaderModule` creation from WGSL strings. Built-in kernels compiled at startup (or build time via handlebars codegen); custom kernels compiled on `register_kernel` call. Pipeline cache by shader hash.
+- **Kernel compiler** — `wgpu::ShaderModule` creation from shader strings (WGSL by default; SPIR-V / GLSL / naga-IR via wgpu's input-language features). Built-in kernels compiled at startup (or build time via handlebars codegen); custom kernels compiled on `register_kernel` call. Pipeline cache by shader hash.
- **Dispatch** — bind groups, compute pass encoding, `dispatchWorkgroups`, command submission. One Rust op per dispatch shape.
+- **Dispatch** — bind groups, compute pass encoding, `dispatchWorkgroups`, command submission. One Rust op per dispatch shape. Built on `alknet-runtime`'s primitive compute dispatch.
- **WGSL codegen** — `WgslGenerator` (handlebars-rs) renders `KernelSpec` → WGSL string. Same pattern as `typebox-rs`'s `RustGenerator` / `TypeScriptGenerator`. Build-time codegen for built-in ops; runtime compilation for custom kernels.
+- **Shader codegen** — `ShaderGenerator` trait (handlebars-rs) renders `KernelSpec` → shader string. `WgslGenerator` is the first impl; `SpirvGenerator` / `GlslGenerator` / `NagaIrGenerator` are later backends per wgpu's multi-input-language support. Same pattern as `typebox-rs`'s `RustGenerator` / `TypeScriptGenerator`. Build-time codegen for built-in ops; runtime compilation for custom kernels.
 - **Readback** — `copyBufferToBuffer` to a mapped read buffer, return `ArrayBuffer` to JS. The only data-crossing op (explicit, like PyTorch's `.cpu()` / `.numpy()`).
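The buffer-manager bullet above can be sketched as follows. This is a minimal illustration, not the crate's implementation: `HostBuffer` is a hypothetical stand-in for `wgpu::Buffer` so the example compiles with std only, and the method names (`create`, `write`, `read`, `destroy`) are assumptions chosen to mirror the op surface described in this doc.

```rust
use std::collections::HashMap;

/// Opaque handle that the JS side holds instead of tensor bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct BufferId(u64);

/// Stand-in for `wgpu::Buffer` (assumption: keeps the sketch std-only).
pub struct HostBuffer {
    bytes: Vec<u8>,
}

/// Owns every buffer. Nothing is freed until `destroy` is called,
/// so destruction is deterministic and independent of any JS GC.
#[derive(Default)]
pub struct BufferManager {
    next_id: u64,
    buffers: HashMap<BufferId, HostBuffer>,
}

impl BufferManager {
    /// Allocate a buffer, copy the initial data in, and return its handle.
    pub fn create(&mut self, data: &[u8]) -> BufferId {
        let id = BufferId(self.next_id);
        self.next_id += 1;
        self.buffers.insert(id, HostBuffer { bytes: data.to_vec() });
        id
    }

    /// Overwrite an existing buffer; the new data must match its size.
    pub fn write(&mut self, id: BufferId, data: &[u8]) -> Result<(), String> {
        let buf = self
            .buffers
            .get_mut(&id)
            .ok_or_else(|| format!("unknown buffer {:?}", id))?;
        if data.len() != buf.bytes.len() {
            return Err(format!(
                "size mismatch: buffer holds {} bytes, got {}",
                buf.bytes.len(),
                data.len()
            ));
        }
        buf.bytes.copy_from_slice(data);
        Ok(())
    }

    /// Explicit readback: the only path where bytes leave the manager.
    pub fn read(&self, id: BufferId) -> Option<Vec<u8>> {
        self.buffers.get(&id).map(|b| b.bytes.clone())
    }

    /// Manual lifetime management: returns true if the buffer existed.
    pub fn destroy(&mut self, id: BufferId) -> bool {
        self.buffers.remove(&id).is_some()
    }

    /// Number of buffers currently alive.
    pub fn live_count(&self) -> usize {
        self.buffers.len()
    }
}
```

In a real build the `HostBuffer` field would presumably be a `wgpu::Buffer` and `read` would go through a mapped staging buffer; the handle-plus-map shape is the part this sketch is meant to show.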
-### The Rust op surface
+### What lives in alknet-tensor (the format crate, sibling not child)
 - **Binary layout** — schema-driven offsets, flat/struct/blob tensor kinds, mmap via `memmap2`, QUIC per-tensor stream mapping
 - **No JS or wgpu dependency** — a pure-Rust model server can use the format without `alknet-runtime`
 - **Bridge to compute** — `alknet-compute` registers the `load_model`/`stream_model` ops that read a metatensor file into wgpu buffers; the format crate itself doesn't know about wgpu
 ### The Rust op surface (alknet-compute's high-level ops, built on runtime primitives)
 Minimal — ~4-5 high-level ops, not ~20 low-level GPU API calls:
@@ -146,7 +160,7 @@ Minimal — ~4-5 high-level ops, not ~20 low-level GPU API calls:
 |----|-----------|---------|
 | `create_tensor` | `(data: ArrayBuffer, shape: number[], dtype: string) → BufferId` | Allocate a storage buffer, write initial data |
 | `dispatch_kernel` | `(name: string, inputs: BufferId[], params: object, workgroup_count: [u, v, w]) → BufferId[]` | Look up compiled kernel, bind inputs, dispatch compute pass, return output buffer IDs |
-| `register_kernel` | `(name: string, wgsl: string, input_specs: KernelInputSpec[], output_specs: KernelOutputSpec[]) → void` | Compile custom WGSL, cache by name |
+| `register_kernel` | `(name: string, shader: string, input_specs: KernelInputSpec[], output_specs: KernelOutputSpec[]) → void` | Compile custom shader (WGSL/SPIR-V/GLSL/naga-IR), cache by name |
 | `read_tensor` | `(buffer_id: BufferId) → ArrayBuffer` | Copy buffer to mapped read buffer, return data to JS |
 | `write_tensor` | `(buffer_id: BufferId, data: ArrayBuffer) → void` | Overwrite buffer contents from JS |
@@ -159,8 +173,8 @@ Build time:
  OpSpec[] (declarative, from op table)
    → opgen transform (opgen.ts logic, in Rust or JS)
    → KernelSpec[] (compute-pass descriptions)
-    → WgslGenerator (handlebars-rs) renders each KernelSpec → WGSL string
+    → ShaderGenerator::render(KernelSpec) → shader string (WGSL first)
-    → wgpu pre-compiles each WGSL → ShaderModule (cached by name)
+    → wgpu pre-compiles each shader → ShaderModule (cached by name)
 Runtime (built-in ops):
  JS calls dispatch_kernel("matmul", [a_id, b_id], params, count)
@@ -168,23 +182,23 @@ Runtime (built-in ops):
  → binds buffers, dispatches, returns output BufferId
 Runtime (custom kernels):
-  JS calls register_kernel("my_op", wgsl_string, inputs, outputs)
+  JS calls register_kernel("my_op", shader_string, inputs, outputs)
-  → Rust compiles WGSL via wgpu::ShaderModule
+  → Rust compiles shader via wgpu::ShaderModule (language per wgpu features)
  → caches pipeline by name
  → subsequent dispatch_kernel("my_op", ...) uses the cached pipeline
 ```
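The custom-kernel flow above (register, compile once, look up on dispatch) can be sketched with a small cache. This is an illustrative sketch under stated assumptions: `CompiledKernel` is a hypothetical placeholder for a compiled `wgpu::ShaderModule` plus pipeline, the "compile" step is only counted rather than performed, and `KernelCache`, `lookup`, and `compile_count` are names invented for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Placeholder for a compiled shader module + pipeline (assumption).
pub struct CompiledKernel {
    pub source_hash: u64,
}

/// Kernels are addressed by name; compiled artifacts are shared by source hash.
#[derive(Default)]
pub struct KernelCache {
    by_name: HashMap<String, u64>,
    by_hash: HashMap<u64, CompiledKernel>,
    /// How many times the (stand-in) compile step actually ran.
    pub compile_count: usize,
}

/// Hash the shader source so identical sources map to one compiled kernel.
fn hash_source(source: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish()
}

impl KernelCache {
    /// Register a kernel under `name`, compiling only if the source is new.
    pub fn register_kernel(&mut self, name: &str, shader: &str) {
        let hash = hash_source(shader);
        if !self.by_hash.contains_key(&hash) {
            // A real implementation would create the shader module here.
            self.compile_count += 1;
            self.by_hash.insert(hash, CompiledKernel { source_hash: hash });
        }
        self.by_name.insert(name.to_string(), hash);
    }

    /// What a later dispatch would call to find the cached kernel.
    pub fn lookup(&self, name: &str) -> Option<&CompiledKernel> {
        self.by_name
            .get(name)
            .and_then(|hash| self.by_hash.get(hash))
    }
}
```

Keeping the name map separate from the hash map means two names registered with the same source share one compiled artifact, which is one way to reconcile "cache by name" with "pipeline cache by shader hash".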
-The `WgslGenerator` is the natural third backend in `typebox-rs`'s codegen module:
+The `ShaderGenerator` trait (with `WgslGenerator` as the first impl) is the natural third backend in `typebox-rs`'s codegen module:
 ```
 typebox-rs/src/codegen/
-├── mod.rs          — pub use RustGenerator, TypeScriptGenerator, WgslGenerator
+├── mod.rs          — pub use RustGenerator, TypeScriptGenerator, ShaderGenerator
 ├── rust.rs         — Schema → Rust structs (existing)
 ├── typescript.rs   — Schema → TS interfaces (existing)
-└── wgsl.rs         — KernelSpec → WGSL shader (new)
+└── shader.rs       — KernelSpec → shader string (new; WgslGenerator + later backends)
 ```
-The WGSL template encodes the scaffolding from webgpu-torch's `getKernelShaderCode` (`kernel.ts:299-375`): struct declarations, `@group(0) @binding(N)` declarations, `@compute @workgroup_size` header, conditional `@builtin` inclusion. One handlebars template with `{{#each inputs}}`, `{{#each outputs}}`, `{{#if uses_global_id}}` blocks.
+The WGSL template encodes the scaffolding from webgpu-torch's `getKernelShaderCode` (`kernel.ts:299-375`): struct declarations, `@group(0) @binding(N)` declarations, `@compute @workgroup_size` header, conditional `@builtin` inclusion. One handlebars template with `{{#each inputs}}`, `{{#each outputs}}`, `{{#if uses_global_id}}` blocks. The trait abstraction means a SPIR-V or GLSL template can be added later without changing `KernelSpec` or the opgen transform — only the final render step is language-specific.
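The trait shape described here can be sketched as below. This is a simplified illustration, not the planned implementation: the `KernelSpec` and `Binding` fields are hypothetical and much smaller than webgpu-torch's real spec, and plain string formatting stands in for the handlebars template so the example has no dependencies.

```rust
/// One storage-buffer binding of a kernel (hypothetical, simplified).
pub struct Binding {
    pub name: String,
    pub elem_type: String,
    pub read_only: bool,
}

/// Language-independent description of one compute pass (hypothetical fields).
pub struct KernelSpec {
    pub name: String,
    pub bindings: Vec<Binding>,
    pub workgroup_size: [u32; 3],
    pub uses_global_id: bool,
    pub body: String,
}

/// Only the final render step depends on the shading language.
pub trait ShaderGenerator {
    fn language(&self) -> &'static str;
    fn render(&self, spec: &KernelSpec) -> String;
}

/// First backend: emits WGSL source text.
pub struct WgslGenerator;

impl ShaderGenerator for WgslGenerator {
    fn language(&self) -> &'static str {
        "wgsl"
    }

    fn render(&self, spec: &KernelSpec) -> String {
        let mut out = String::new();
        // One `@group(0) @binding(N)` declaration per buffer, in order.
        for (index, binding) in spec.bindings.iter().enumerate() {
            let access = if binding.read_only { "read" } else { "read_write" };
            out.push_str(&format!(
                "@group(0) @binding({}) var<storage, {}> {}: array<{}>;\n",
                index, access, binding.name, binding.elem_type
            ));
        }
        // Entry-point header with the workgroup size.
        let [x, y, z] = spec.workgroup_size;
        out.push_str(&format!("@compute @workgroup_size({}, {}, {})\n", x, y, z));
        // The builtin is included only when the kernel body needs it.
        let args = if spec.uses_global_id {
            "@builtin(global_invocation_id) global_id: vec3<u32>"
        } else {
            ""
        };
        out.push_str(&format!("fn {}({}) {{\n{}\n}}\n", spec.name, args, spec.body));
        out
    }
}
```

A later SPIR-V or GLSL backend would be another `impl ShaderGenerator` over the same `KernelSpec`, which is the property the paragraph above relies on.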
 ---
@@ -208,28 +222,28 @@ QuickJS-NG's 210 KiB footprint + wgpu's cross-platform backends (including llvmp
 ### 4. The compositing problem from alknet-desktop
-The alknet-desktop research doc flagged "compositing 3D + 2D onto one surface" as an open unknown. Tensor compute sidesteps it entirely — compute pipelines have no surface, no swapchain, no present loop. The compositing complexity is a *render* problem; tensor ops are pure compute. This makes alknet-tensor structurally simpler than alknet-desktop despite being a "heavier" workload.
+The alknet-desktop research doc flagged "compositing 3D + 2D onto one surface" as an open unknown. Tensor compute sidesteps it entirely — compute pipelines have no surface, no swapchain, no present loop. The compositing complexity is a *render* problem; tensor ops are pure compute. This makes alknet-compute structurally simpler than alknet-desktop despite being a "heavier" workload.
 ### 5. Cross-platform by construction, not configuration
-wgpu's "one API, many backends" design means the same WGSL shaders and the same dispatch code run on Vulkan (Linux), Metal (macOS), DX12 (Windows), and llvmpipe (anywhere). No `#ifdef CUDA`, no "Linux is second-class", no platform-specific build matrix. The op table is WGSL strings; the execution is wgpu; the platform is whatever wgpu supports. Currently: everything.
+wgpu's "one API, many backends" design means the same shaders and the same dispatch code run on Vulkan (Linux), Metal (macOS), DX12 (Windows), and llvmpipe (anywhere). No `#ifdef CUDA`, no "Linux is second-class", no platform-specific build matrix. The op table is shader strings (WGSL by default); the execution is wgpu; the platform is whatever wgpu supports, which currently covers every target listed above.
 ---
-## Relationship to alknet-desktop
+## Relationship to alknet-desktop (via alknet-runtime)
-alknet-tensor shares the verified substrate with alknet-desktop (quickjs + wgpu + the operations protocol) but is a separate concern:
+alknet-compute and alknet-desktop are sibling consumers of `alknet-runtime`. They don't depend on each other directly; both depend on the runtime for the JS isolate, wgpu device, and ops bridge. A desktop app that also does in-process ML depends on both crates (app → alknet-desktop, app → alknet-compute), and both share the one wgpu device the runtime acquires.
-| | alknet-desktop | alknet-tensor |
+| | alknet-runtime (substrate) | alknet-desktop (sibling consumer) | alknet-compute (this crate) |
-|---|---|---|
+|---|---|---|---|
-| **wgpu usage** | Render passes, surfaces, swapchain, compositing | Compute passes only — no surface, no swapchain |
+| **Owns** | JS isolate, wgpu device, ops bridge, shared JS core, sandbox, primitive compute dispatch | winit, surface/swapchain, three.js shims, Three/SDF HostConfigs, compositor, irpc-to-head | Buffer manager, op table, `ShaderGenerator`, tensor ops, autograd-via-flowgraph, `gradcheck`, distributed training |
-| **GPU op surface** | ~25-40 ops (browser globals for three.js + surface management) | ~4-5 ops (create/dispatch/register/read/write) |
+| **wgpu usage** | Device acquisition + primitive compute dispatch | Render passes, surfaces, swapchain, compositing | Compute passes only — no surface, no swapchain |
-| **JS layer** | ujsx reconciler + HostConfig (3D + 2D UI composition) | Op table + autograd graph + nn module hierarchy |
+| **GPU op surface** | Primitive: compile_shader/create_buffer/dispatch/readback | ~25-40 ops (browser globals for three.js + surface management) | ~4-5 ops (create/dispatch/register/read/write) layered on runtime primitives |
-| **Rust layer** | winit window + wgpu surface + three.js browser-env shims | wgpu buffer manager + kernel compiler + WGSL codegen |
+| **JS layer** | Shared core bundle (271 modules) | + three.js + Three/SDF HostConfigs | + flowgraph + reactive execution host + op table + autograd graph |
-| **Complexity driver** | The 3D+2D compositing and three.js shim surface | The autograd graph correctness and kernel codegen |
+| **Complexity driver** | The extraction boundary (what's truly shared) | 3D+2D compositing, three.js shim surface | Autograd graph correctness, kernel codegen, distributed training |
-| **Network model** | Desktop worker dials head, renders UI | Tensor ops callable locally or over irpc; distributed training is ops on the registry |
+| **Network model** | Ops bridge into alknet-call registry | Desktop worker dials head, renders UI (ADR-017) | Tensor ops on registry, distributed via `from_call` (ADR-017) |
-They could share a crate (same quickjs runtime, same wgpu instance — a desktop app that also does tensor compute) or be separate crates (a pure compute server with no window). The operations registry is the shared seam — both register ops on the same protocol.
+The operations registry (owned by alknet-call, bridged by alknet-runtime) is the shared seam — both consumers register their ops on the same registry, and both become network-callable via `CallClient`/`from_call`.
 ---
@@ -237,7 +251,7 @@ They could share a crate (same quickjs runtime, same wgpu instance — a desktop
 ### 1. Where does the op table live — Rust or JS?
-If built-in ops are Rust-side (specs compiled at build time via handlebars `WgslGenerator`, kernels pre-registered), JS just calls `matmul(a, b)` and Rust looks up the compiled kernel. Fast, simple, fixed op surface.
+If built-in ops are Rust-side (specs compiled at build time via handlebars `ShaderGenerator`/`WgslGenerator`, kernels pre-registered), JS just calls `matmul(a, b)` and Rust looks up the compiled kernel. Fast, simple, fixed op surface.
 If the op table stays JS-side (op specs as data in JS, sent to Rust at init to compile), it's more flexible — swap op implementations at runtime, let users inspect/override specs, let LLMs generate new ops. Adds a startup cost and more JS↔Rust traffic at init.
@@ -452,11 +466,11 @@ The `GraphologyHostConfig` becomes a Rust-backed host that builds a `petgraph::D
 4. **Custom graph validation** — flowgraph's `validateTemplate` already does cycle detection, type compatibility, topological ordering. This is graph validation that PyTorch and CUDA graphs don't have.
-### What flowgraph *doesn't* provide (stays in alknet-tensor)
+### What flowgraph *doesn't* provide (stays in alknet-compute)
 - **The tensor ops themselves** — `tensor.matmul`, `tensor.conv2d`, `tensor.relu` etc. are still Rust-side wgpu compute kernels, exposed as `OperationSpec`s on the registry. flowgraph orchestrates them; it doesn't implement them.
 - **Buffer management** — still Rust-owned `wgpu::Buffer` with `BufferId` handles in JS (the ~4-5 Rust ops from the architecture section above).
- **WGSL codegen** — still `WgslGenerator` (handlebars-rs) rendering `KernelSpec` → WGSL. flowgraph is orthogonal to kernel compilation.
+- **Shader codegen** — still `ShaderGenerator`/`WgslGenerator` (handlebars-rs) rendering `KernelSpec` → shader string. flowgraph is orthogonal to kernel compilation.
 - **`gradcheck`** — finite-difference gradient verification, still a test harness operation.
 ---
@@ -465,11 +479,11 @@ The `GraphologyHostConfig` becomes a Rust-backed host that builds a `petgraph::D
 In priority order:
-1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template against `KernelSpec`, render all ~100 ops from `op_table.ts`, diff output against `getKernelShaderCode`'s output. If they match, the Rust codegen path is proven. Half-day exercise.
+1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template (first `ShaderGenerator` impl) against `KernelSpec`, render all ~100 ops from `op_table.ts`, diff output against `getKernelShaderCode`'s output. If they match, the Rust codegen path is proven. Half-day exercise.
 2. **`ExprCode` parser assessment** — read `src/expr.ts`, determine if the parser ports to Rust cleanly. If yes, stage 2 moves to Rust entirely. If no, stage 2 stays in JS and sends `KernelSpec` to Rust at init.
-3. **End-to-end compute skeleton** — Rust crate that creates a wgpu device on llvmpipe, exposes `create_tensor` / `dispatch_kernel` / `read_tensor` to quickjs, registers `tensor.matmul` as an `OperationSpec` on the operations registry, and runs a matmul via a flowgraph `<Sequential>` template. Proves the full stack (wgpu + quickjs + operations + flowgraph + ujsx) integrates. One day.
+3. **End-to-end compute skeleton** — `alknet-compute` crate that depends on `alknet-runtime` (for the wgpu device + JS isolate + ops bridge) and `alknet-tensor` (for model loading), exposes `create_tensor` / `dispatch_kernel` / `read_tensor` to quickjs via the runtime's primitive compute dispatch, registers `tensor.matmul` as an `OperationSpec` on the runtime's registry, and runs a matmul via a flowgraph `<Sequential>` template. Proves the full stack (runtime + tensor + compute + flowgraph + ujsx) integrates. One day.
 4. **`gradcheck` test harness** — implement finite-difference gradient verification as an operation, run against a subset of the op table with a flowgraph `<Sequential>` forward template and reverse-order backward template. Proves the autograd-via-flowgraph design. Half-day.
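The finite-difference check named in item 4 can be sketched on plain `f64` slices. This is a generic central-difference sketch, not the planned operation: the function names, the `eps`/`tol` parameters, and the relative-tolerance rule are assumptions, and a real harness would presumably evaluate the forward pass through dispatched kernels rather than a closure.

```rust
/// Central-difference estimate of the gradient of `f` at `x`.
pub fn numeric_grad<F>(f: &F, x: &[f64], eps: f64) -> Vec<f64>
where
    F: Fn(&[f64]) -> f64,
{
    let mut probe = x.to_vec();
    let mut grad = Vec::with_capacity(x.len());
    for i in 0..x.len() {
        let original = probe[i];
        probe[i] = original + eps;
        let plus = f(&probe);
        probe[i] = original - eps;
        let minus = f(&probe);
        probe[i] = original; // restore before perturbing the next coordinate
        grad.push((plus - minus) / (2.0 * eps));
    }
    grad
}

/// True when the analytic gradient matches the numeric estimate
/// within `tol`, scaled by the magnitude of each analytic entry.
pub fn gradcheck<F>(f: &F, analytic: &[f64], x: &[f64], eps: f64, tol: f64) -> bool
where
    F: Fn(&[f64]) -> f64,
{
    let numeric = numeric_grad(f, x, eps);
    numeric.len() == analytic.len()
        && numeric
            .iter()
            .zip(analytic)
            .all(|(n, a)| (n - a).abs() <= tol * (1.0 + a.abs()))
}
```

For example, with `f(x) = sum(x_i^2)` the analytic gradient is `2 * x`, so `gradcheck` should accept `[2.0, -4.0, 6.0]` at `x = [1.0, -2.0, 3.0]` and reject a gradient with a wrong entry.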
@@ -479,15 +493,18 @@ In priority order:
 ## References
 - **alknet-runtime (substrate this builds on):** `docs/research/alknet-runtime/summary.md` — JS isolate, wgpu device, ops bridge, shared JS core, sandbox, primitive compute dispatch
 - **alknet-tensor (format sibling):** `docs/research/alknet-tensor/metatensor-format.md` — pure-format binary tensor layout; `alknet-compute` registers the `load_model`/`stream_model` ops that bridge the format to wgpu buffers
 - **Reference design (tensor):** `/workspace/webgpu-torch` — `src/op_spec.ts` (OpSpec schema), `src/op_table.ts` (452 lines, ~100 ops), `src/opgen.ts` (728 lines, op→kernel transform), `src/kernel.ts:299-375` (WGSL shader generation), `src/autograd.ts` (112 lines, gradient graph), `src/nn_module.ts` (467 lines, module hierarchy), `src/optim.ts` (204 lines, optimizers), `src/device_webgpu.ts` (GPU device + buffer pool with FinalizationRegistry)
 - **Compute graph layer:** `/workspace/@alkdev/flowgraph` — `src/component/` (`Operation`, `Sequential`, `Parallel`, `Conditional`, `Map` — ujsx components that build the workflow template), `src/host/graphology.ts` (`GraphologyHostConfig` — renders template to DAG, validates), `src/host/reactive.ts` (`ReactiveHostConfig` — renders template to reactive execution structure), `src/reactive/node-status.ts` (`computePreconditions`, `computeBlockedByFailure`, `registerStartEffect` — signal-driven DAG execution), `src/graph/` (construction, validation, queries — graphology API surface to port to petgraph), `src/analysis/` (type-compat, ordering, workflow — graph validation)
- **Codegen infrastructure:** `/workspace/@alkimiadev/typebox-rs/src/codegen/` — `mod.rs` (`RustGenerator`, `TypeScriptGenerator`), `rust.rs` (handlebars → Rust structs), `typescript.rs` (handlebars → TS interfaces). The `WgslGenerator` would be the third backend here.
+- **Codegen infrastructure:** `/workspace/@alkimiadev/typebox-rs/src/codegen/` — `mod.rs` (`RustGenerator`, `TypeScriptGenerator`), `rust.rs` (handlebars → Rust structs), `typescript.rs` (handlebars → TS interfaces). The `ShaderGenerator` trait (with `WgslGenerator` as first impl) would be the third backend here.
- **Verified substrate (from alknet-desktop POCs):** `/workspace/@alkdev/alknet/docs/research/alknet-desktop/poc-summary.md` — quickjs+wgpu+operations protocol all verified; llvmpipe software Vulkan confirmed as the headless backend
+- **wgpu shading-language support (multi-backend codegen):** https://docs.rs/wgpu/latest/wgpu/#shading-language-support — SPIR-V / GLSL / WGSL / naga-IR input languages; the `ShaderGenerator` trait is parameterized by these
 - **Verified substrate (from alknet-desktop POCs):** `docs/research/alknet-desktop/poc-summary.md` — quickjs+wgpu+operations protocol all verified; llvmpipe software Vulkan confirmed as the headless backend
 - **ujsx reconciler (verified on quickjs):** `/workspace/@alkdev/ujsx/src/host/{reconcile,config,fiber}.ts` — fiber-based reconciler with keyed child reconciliation, Value.Diff prop diffing, signal wiring
 - **ujsx jsx-runtime (TSX→h target):** `/workspace/@alkdev/ujsx/src/core/jsx-runtime.ts` — the runtime that a TSX transform would emit calls to
 - **typebox-rs (to be simplified with serde+jsonschema):** `/workspace/@alkimiadev/typebox-rs/` — `Cargo.toml` (handlebars v5, codegen feature), `src/schema.rs`, `src/builder.rs`
- **toolEnv (UDF sandbox precedent):** `/workspace/toolEnv/core/sandbox/` — `SandboxManager` with `allowFetch`/`allowFs` privilege flags, `@sebastianwessel/quickjs` WASM backend (alknet-tensor would use native rquickjs instead)
+- **toolEnv (UDF sandbox precedent):** `/workspace/toolEnv/core/sandbox/` — `SandboxManager` with `allowFetch`/`allowFs` privilege flags, `@sebastianwessel/quickjs` WASM backend (alknet-compute uses native rquickjs via alknet-runtime instead)
 - **Operations protocol (verified on quickjs):** `/workspace/@alkdev/operations/src/` — `registry.ts`, `call.ts`, `types.ts`, `validation.ts`, `response-envelope.ts`, `access.ts`
 - **graphology API surface (to port to petgraph):** `~15 methods` used across `flowgraph/src/host/graphology.ts`, `flowgraph/src/graph/{construction,validation,queries}.ts`, `flowgraph/src/analysis/{type-compat,ordering,workflow}.ts` — all map 1:1 to `petgraph::DiGraph` + `petgraph::algo`
- **alknet ADRs (shared with alknet-desktop):** `/workspace/@alkdev/alknet/docs/architecture/decisions/` — ADR-005 (irpc), ADR-012 (stream model), ADR-013 (Rust canonical), ADR-017 (call client contract)
+- **alknet ADRs (shared with alknet-desktop, via alknet-runtime):** `docs/architecture/decisions/` — ADR-005 (irpc), ADR-012 (stream model), ADR-013 (Rust canonical, alknet-call owns `OperationRegistry`), ADR-017 (call client + `from_call` adapter — the distributed-training mechanism)
- **wgpu clone (to be bumped to v29):** `/workspace/wgpu` (currently v24.0.5; compute API stable across versions, surface API changed around v25 but tensor compute doesn't use surfaces)
+- **wgpu clone (to be bumped to v29):** `/workspace/wgpu` (currently v24.0.5; compute API stable across versions, surface API changed around v25 but alknet-compute doesn't use surfaces — that's alknet-desktop's concern)
--- a/docs/research/alknet-desktop/poc-summary.md
+++ b/docs/research/alknet-desktop/poc-summary.md
@@ -1,9 +1,11 @@
 # alknet-desktop: POC Research Summary
 **Status:** Research complete on the three highest-leverage unknowns; further POCs planned before spec.
-**Date:** 2026-06-20
+**Date:** 2026-06-20 (original), 2026-06-30 (crate-decomposition note added)
 **Scope:** Captures what the two completed POCs proved, what unknowns they closed, what remains open, and the architectural direction they jointly establish. Source material for the eventual `alknet-desktop` crate spec.
 **Crate note (2026-06-30):** The substrate this POC verified (rquickjs isolate, wgpu device, operations-protocol bridge, shared JS core bundle, sandbox/privilege model) has been extracted as `alknet-runtime` (`docs/research/alknet-runtime/summary.md`). `alknet-desktop` is now a consumer of `alknet-runtime`, not a from-scratch implementation of the substrate. The POC findings below remain valid — they verified the substrate that the runtime now embodies. The "End-to-end skeleton" POC (§Open Unknowns #3) is now scoped as: `alknet-desktop` crate that depends on `alknet-runtime` for the JS isolate + wgpu device + ops bridge, adds the winit window + wgpu surface + three.js shims + HostConfigs on top, and registers its render ops on the runtime's registry. A sibling crate, `alknet-compute` (`docs/research/alknet-compute/architecture-summary.md`), does the same for tensor compute; both share the runtime.
 ---
 ## Executive Summary
@@ -277,7 +279,7 @@ The two POCs answered the three biggest known-unknowns (headless WebGPU renderin
 1. **three.js loader op enumeration** — run GLTFLoader in a quickjs isolate with instrumented globals; produce the concrete op list.
 2. **Compositing design probe** — render three.js to a texture + SDF layer to a texture + compositor pass onto a wgpu surface, end-to-end, on llvmpipe. Answers the compositing shape question and exercises the v29 surface-from-handle API at the same time.
-3. **End-to-end skeleton** — `alknet-desktop` crate skeleton: Cargo.toml + lib.rs that opens a winit window, creates a wgpu v29 surface, loads the ujsx reconciler + operations registry via rquickjs, renders a hardcoded `<div>` tree to the surface via a no-op HostConfig, and exposes one UDF ("render") callable from the Rust side. Proves the full stack integrates before any spec is written.
+3. **End-to-end skeleton** — `alknet-desktop` crate skeleton: depends on `alknet-runtime` for the rquickjs isolate + wgpu device + ops bridge + shared JS core bundle (the 271 modules POC 2 verified), adds winit window + wgpu v29 surface + three.js browser-env shims + a no-op HostConfig, renders a hardcoded `<div>` tree to the surface, and exposes one UDF ("render") callable from the Rust side via the runtime's ops bridge. Proves the full stack (runtime + desktop) integrates before any spec is written.
 ---
--- a/docs/research/alknet-runtime/summary.md
+++ b/docs/research/alknet-runtime/summary.md
@@ -0,0 +1,236 @@
 # alknet-runtime: JS + wgpu Substrate (Research Summary)
 **Status:** Concept design derived from the alknet-desktop and alknet-tensor POCs/research. No POCs yet for the extracted substrate itself; the substrate's components are individually verified by the two existing POCs.
 **Date:** 2026-06-30
 **Scope:** The generalized QuickJS-NG + wgpu runtime that serves as the shared substrate for `alknet-desktop` (render) and `alknet-compute` (tensor compute). Owns the JS isolate, the wgpu device, the operations-protocol bridge, and the shared JS core bundle. Consumers layer their HostConfigs and op surfaces on top. This doc captures the boundary, what moves out of the consumer crates into the runtime, the layering against alknet-call, and the open unknowns for the extracted substrate specifically.
 ---
 ## Executive Summary
 Two independent lines of work (the `alknet-desktop` POCs and the `alknet-tensor` architecture research) arrived at the same substrate: rquickjs wrapping QuickJS-NG, wgpu for device access, and the `@alkdev/operations` protocol as the JS↔Rust bridge. The desktop POC verified the reactive core + operations protocol on QuickJS-NG (271 modules load and link cleanly); the tensor architecture work established that wgpu compute on llvmpipe (software Vulkan, no physical GPU) is genuinely useful compute: WGSL compiled to optimized SIMD beats JS for any non-trivial workload, and the same WGSL runs at full GPU speed when a GPU is present.
 Rather than have both consumer crates re-implement rquickjs setup, the ops bridge, the shared JS core bundle, sandbox/privilege flags, and wgpu device acquisition, this substrate is extracted as `alknet-runtime`. Consumers keep what's genuinely theirs: the render surface (desktop), the buffer manager + kernel codegen + autograd (compute), the binary format (tensor). The runtime owns the *always-present* layer that every consumer needs.
 The key property that makes wgpu unconditional rather than a feature flag: **wgpu on llvmpipe is a better-than-JS compute layer on every box, with no GPU required.** A UDF host with no render needs still benefits from wgpu — compute-bound UDFs (string processing, hashing, image transforms, signal processing, ML ops) can register WGSL kernels and dispatch them on llvmpipe, getting SIMD-speed compute on a CPU-only box and full GPU speed when one is present. There is no deployment scenario where you'd want to *not* have wgpu available. The "UDF-only host" case isn't a no-wgpu case — it's a compute host that happens to not render.
 ---
 ## The Boundary
 ### What alknet-runtime owns
 - **rquickjs isolate lifecycle.** Creation, module resolver, `embed!` bytecode preload of the shared JS core bundle, microtask/job pumping (the POC-2 scheduling note — `queueMicrotask`-scheduled updates need explicit `ctx.run_jobs()` in one-shot probes but flush naturally in the per-frame render loop — is handled here once, not per-consumer).
 - **wgpu device + adapter acquisition.** Unconditional: every consumer gets a device. Adapter request is parameterized by intent (`DeviceIntent::ComputeOnly` for compute-only, `DeviceIntent::RenderSurface` for desktop; see Open Unknowns #2 for the exact shape) so the runtime requests the right adapter features without consumers owning acquisition logic. llvmpipe presents as a normal Vulkan ICD; no adapter-detection branching.
 - **The operations-protocol bridge.** The Rust↔JS call surface: Rust ops exposed to JS as `envProxy`, JS UDFs registered and callable from Rust, `ResponseEnvelope` plumbing, ACL enforcement hook, the bidirectional `buildCallHandler` counterpart on the Rust side. This is the seam that alknet-call's `CallClient`/`OperationRegistry` speak to.
 - **The shared JS core bundle.** The exact 271-module set verified by POC-2: `@preact/signals-core`, `@alkdev/typebox`, `@alkdev/ujsx` reconciler, `@alkdev/operations`, `@alkdev/pubsub`, `@logtape/logtape`. Shipped as one embedded `embed!` bundle. Consumers add their own modules on top (flowgraph, three.js, custom UDF code).
 - **Sandbox / privilege model.** The `allowFetch` / `allowFs` / `envProxy` shape from `toolEnv`, generalized. A UDF host with `allowFetch: false` / `allowFs: false` and only registered operations exposed is a real sandbox — native rquickjs (OS-level isolation via process boundaries) instead of the WASM-QuickJS path `toolEnv` v1 used.
 - **Primitive wgpu compute dispatch.** Compile a shader module, create buffer, dispatch compute pass, readback. These primitives are general — any UDF can use them, not just ML. This is the "wgpu compute as a better-than-JS layer everywhere" capability, exposed at the runtime layer so every consumer gets it.
 - **The cold-start budget.** 271 modules is the baseline. The runtime owns the `embed!` bundle and the load-once-at-startup cost. Consumers adding modules (flowgraph, three.js) layer on top and own their own cold-start contributions.
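 A minimal sketch of what the primitive dispatch surface could look like, assuming a trait with the four primitives named above. Every name here (`ComputeDispatch`, `ShaderHandle`, `RawBufferHandle`, `CpuMock`) is hypothetical, and `CpuMock` is an in-memory stand-in that calls no wgpu API; it only illustrates the handle-based shape:

 ```rust
 use std::collections::HashMap;

 /// Opaque handles (hypothetical names), so callers never hold raw device objects.
 #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
 pub struct ShaderHandle(u32);
 #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
 pub struct RawBufferHandle(u32);

 /// Assumed shape of the runtime's primitive compute surface.
 pub trait ComputeDispatch {
     fn compile_shader(&mut self, source: &str) -> Result<ShaderHandle, String>;
     fn create_buffer(&mut self, contents: &[u8]) -> RawBufferHandle;
     fn dispatch(&mut self, shader: ShaderHandle, buffer: RawBufferHandle, workgroups: u32) -> Result<(), String>;
     fn readback(&self, buffer: RawBufferHandle) -> Option<Vec<u8>>;
 }

 /// Stand-in backend: "runs" any kernel by incrementing each byte of the buffer.
 #[derive(Default)]
 pub struct CpuMock {
     next_id: u32,
     shaders: HashMap<ShaderHandle, String>,
     buffers: HashMap<RawBufferHandle, Vec<u8>>,
 }

 impl ComputeDispatch for CpuMock {
     fn compile_shader(&mut self, source: &str) -> Result<ShaderHandle, String> {
         // Toy validation only; a real backend would hand the source to wgpu.
         if !source.contains("@compute") {
             return Err("no @compute entry point".to_string());
         }
         self.next_id += 1;
         let id = ShaderHandle(self.next_id);
         self.shaders.insert(id, source.to_string());
         Ok(id)
     }
     fn create_buffer(&mut self, contents: &[u8]) -> RawBufferHandle {
         self.next_id += 1;
         let id = RawBufferHandle(self.next_id);
         self.buffers.insert(id, contents.to_vec());
         id
     }
     fn dispatch(&mut self, shader: ShaderHandle, buffer: RawBufferHandle, _workgroups: u32) -> Result<(), String> {
         if !self.shaders.contains_key(&shader) {
             return Err("unknown shader".to_string());
         }
         let data = self.buffers.get_mut(&buffer).ok_or("unknown buffer")?;
         for byte in data.iter_mut() {
             *byte = byte.wrapping_add(1);
         }
         Ok(())
     }
     fn readback(&self, buffer: RawBufferHandle) -> Option<Vec<u8>> {
         self.buffers.get(&buffer).cloned()
     }
 }
 ```

 A handle-based surface like this would let the same four calls be exposed to JS through `envProxy` without the isolate ever holding a device object.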
 ### What moves out of the consumer crates into the runtime
 | Concern | Was in (consumer) | Now in (runtime) |
 |---|---|---|
 | rquickjs isolate setup + `embed!` | desktop POC-2, tensor architecture | runtime |
 | wgpu device acquisition | both | runtime |
 | Operations protocol Rust bridge | both | runtime |
 | Shared JS core bundle loading | both | runtime |
 | Sandbox / privilege flags | toolEnv (WASM path) | runtime (native path) |
 | Microtask/job pumping | desktop POC-2 noted it | runtime (one place) |
 | Primitive compute dispatch (shader/buffer/dispatch) | tensor architecture's ~5 ops | runtime exposes primitives; compute layers the tensor-shaped ops on top |
 ### What stays in the consumer crates
 - **`alknet-desktop`** — winit, `Surface`/swapchain (the wgpu v29 surface-API migration is entirely here, not in runtime), three.js browser-global shims (~25-40 ops), Three `HostConfig` + SDF `HostConfig`, the 3D+2D compositor, the irpc-to-head client (ADR-017 contract). Uses runtime's device; owns the render surface. Optionally depends on `alknet-compute` when an app does in-process ML on the same device.
 - **`alknet-compute`** — `BufferId`-handle buffer manager, the `OpSpec`/`KernelSpec` op table, `ShaderGenerator` (handlebars codegen, WGSL first / SPIR-V / GLSL / naga-IR later), the tensor-shaped high-level ops (`create_tensor`/`dispatch_kernel`/`register_kernel`/`read_tensor`/`write_tensor`), autograd-via-flowgraph, `gradcheck`, distributed training over irpc. Uses runtime's primitive compute dispatch; owns the tensor abstractions and codegen.
 - **`alknet-tensor`** (the metatensor format) — pure-format, no runtime dep. Schema-driven binary layout, `compute_offsets`, mmap via `memmap2`, QUIC/BiStream per-tensor mapping, ujsx `<Struct>/<Field>/<Tensor>` authoring → TypeBox schema + OffsetMap. Can be used by a pure-Rust model server with no JS runtime at all; runtime integration (a `load_model` op) is registered by `alknet-compute`, not by the format crate.
 ---
 ## Layering Against alknet-call
 Per ADR-013, alknet-call owns the canonical `OperationSpec`, `OperationRegistry`, `CallClient`, and the adapter contract. Per ADR-017, `CallClient` opens connections and shares the dispatch loop, and the connection is symmetric after establishment: both sides can call each other. The `from_call` adapter imports remote operations into a local registry.
 The layering:
 ```
 alknet-call          (canonical op types, CallClient, adapter contract — no JS, no wgpu)
   ▲
   │ depends on
   │
 alknet-runtime       (JS isolate + wgpu device + ops bridge into alknet-call registry)
   ▲
   │ depends on
   │
 alknet-compute       (tensor ops + codegen + autograd, registers on runtime's registry)
 alknet-desktop       (render surface + three.js shims, registers on runtime's registry)
 alknet-tensor        (pure format, no runtime dep — sibling, used by compute)
 ```
 - **alknet-runtime depends on alknet-call.** The runtime's ops bridge is the Rust-side counterpart to `@alkdev/operations`'s `buildCallHandler`: it produces `HandlerRegistration` bundles that register into an `OperationRegistry` (the same registry type alknet-call owns). UDFs authored in JS and registered via `envProxy` become `OperationSpec`s on the registry, network-callable via `CallClient`. This is the "QuickJS UDF host convergence" the desktop POC flagged — made concrete by depending on alknet-call, not reimplementing the protocol.
 - **alknet-compute and alknet-desktop depend on alknet-runtime.** They get the JS isolate, the wgpu device, and the ops bridge for free. They register their op surfaces (tensor ops, render ops) on the runtime's registry, which is the same registry alknet-call dispatches through.
 - **alknet-tensor is a sibling, not a child.** The format has no JS or wgpu dependency — it's pure Rust (`memmap2`, `jsonschema`, `serde`). A model server using only the format doesn't pull the runtime. `alknet-compute` depends on both `alknet-runtime` and `alknet-tensor` and registers the `load_model`/`stream_model` ops that bridge them.
 The consequence: a node running `alknet-compute` as a `CallClient`-connected worker exposes its tensor ops as `External` operations on the registry, discovered via `services/list` + `services/schema` by any peer. The peer's `from_call` adapter imports them. Distributed training is a `flowgraph` template mixing local and `from_call`-imported remote ops — same template, same execution model, different target. The runtime is what makes a worker "an operation host that happens to do ML" rather than a special ML-specific endpoint.
 ---
 ## The wgpu Substrate: Why It's Unconditional
 The observation that drives wgpu being a hard dependency rather than a feature flag:
 - **llvmpipe compiles WGSL to optimized SIMD.** On a box with no physical GPU (this OVH server, CI runners, edge devices), wgpu runs its Vulkan backend on Mesa's llvmpipe software driver: WGSL is lowered to SPIR-V by wgpu's shader translator (naga), and the driver compiles that through LLVM to native SIMD. The result is compute that beats hand-written JS for any non-trivial workload. This was checked empirically by writing a SHA-256 shader in WGSL, running it through Deno's WebGPU (which itself uses wgpu under the hood), and measuring it against a JS implementation. The GPU-less case is not a "wgpu doesn't work" case; it's a "wgpu gives you SIMD compute for free" case.
 - **The same WGSL runs on a real GPU at full speed.** A kernel written and tested on llvmpipe deploys unchanged to a vast.ai GPU instance. No `#ifdef CUDA`, no platform-specific build matrix. wgpu's "one API, many backends" (Vulkan / Metal / DX12, with llvmpipe as a software Vulkan driver) means the deployment target is a runtime concern, not a code concern.
 - **This makes WGSL kernels a first-class UDF optimization layer.** Any UDF that's compute-bound — not just ML ops — can register a WGSL kernel via the runtime's primitive compute dispatch and get SIMD-speed compute on a CPU-only box, full GPU speed when present. The toolEnv WASM-QuickJS baseline (sandbox with no GPU access) is strictly weaker: native rquickjs + wgpu gives you the same sandbox boundary *plus* a portable compute accelerator.
 The Rust-SIMD-would-be-faster caveat is real but mostly orthogonal to the design: hand-tuned native Rust SIMD beats llvmpipe, but (a) llvmpipe-on-any-box beats JS-on-any-box, (b) the same WGSL is full-GPU-speed on a real device, and (c) the cases where you'd reach for native Rust SIMD are the cases you control at build time — which is exactly what `alknet-compute`'s build-time codegen pipeline produces (precompiled kernels for the built-in op table). You get both layers: hand-tuned Rust where it matters, portable WGSL for everything else, and UDF-authored WGSL for the long tail of compute-bound ops that aren't worth a Rust crate.
 ### Shading-language support (multi-backend codegen)
 wgpu accepts SPIR-V, GLSL, WGSL (default), and naga-IR as shader input languages (`wgpu` crate features). The codegen pipeline in `alknet-compute` (`ShaderGenerator`) should be parameterized by target language, not hardcoded to WGSL:
 - **WGSL first** — existing reference material (`webgpu-torch`'s op table), the author's familiarity, and the default wgpu feature make it the right starting point.
 - **SPIR-V / GLSL / naga-IR later** — the handlebars-rs codegen templates can be retargeted; `KernelSpec` is language-agnostic data, only the final template render is language-specific. Keeping `ShaderGenerator` as a trait with `WgslGenerator` as the first impl preserves the option without committing to it now.
 The `ShaderGenerator` trait lives in `alknet-compute` (it's a tensor-compute concern: rendering `KernelSpec` to a shader string), not in `alknet-runtime` (runtime only compiles whatever shader string the consumer hands it via the primitive dispatch surface).
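 The trait boundary described above can be sketched as follows. This is an illustration under stated assumptions: the `KernelSpec` fields are hypothetical, and plain `format!` stands in for the handlebars templates the real codegen would use. Only the trait name `ShaderGenerator` and the `WgslGenerator` first impl come from this doc:

 ```rust
 /// Language-agnostic kernel description (field names are assumptions).
 pub struct KernelSpec {
     pub name: String,
     pub workgroup_size: u32,
     /// Elementwise expression over `x`, e.g. "x * 2.0".
     pub expr: String,
 }

 /// Renders a `KernelSpec` to shader source for one target language.
 pub trait ShaderGenerator {
     /// Short tag for the target language.
     fn language(&self) -> &'static str;
     fn generate(&self, spec: &KernelSpec) -> String;
 }

 pub struct WgslGenerator;

 impl ShaderGenerator for WgslGenerator {
     fn language(&self) -> &'static str {
         "wgsl"
     }
     fn generate(&self, spec: &KernelSpec) -> String {
         // Only this final render step is language-specific.
         let lines = [
             "@group(0) @binding(0) var<storage, read_write> data: array<f32>;".to_string(),
             format!("@compute @workgroup_size({})", spec.workgroup_size),
             format!("fn {}(@builtin(global_invocation_id) gid: vec3<u32>) {{", spec.name),
             "    if (gid.x >= arrayLength(&data)) { return; }".to_string(),
             "    let x = data[gid.x];".to_string(),
             format!("    data[gid.x] = {};", spec.expr),
             "}".to_string(),
         ];
         lines.join("\n")
     }
 }
 ```

 A later SPIR-V or GLSL generator would implement the same trait against the same `KernelSpec` data, which is what keeps the option open without committing to it now.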
 ---
 ## The Shared JS Core Bundle
 The 271 modules POC-2 verified are the runtime's baseline:
 | Module | Purpose | Owner | Audit burden |
 |---|---|---|---|
 | `@preact/signals-core` | Reactive primitives | preactjs | ~1 file, ~3KB |
 | `@alkdev/typebox` + `/value` | Schema system (250 modules transitively) | alkdev | owned |
 | `@alkdev/ujsx` | Reconciler (fiber tree, Value.Diff prop diffing, signal wiring) | alkdev | owned |
 | `@alkdev/operations` | Operations protocol (registry, call, envelopes, ACL) | alkdev | owned |
 | `@alkdev/pubsub` | `Repeater` async iterators | alkdev | owned |
 | `@logtape/logtape` | Logging (with neutral `#util` subpath) | logtape | small |
 This is the **runtime's** bundle — the layer every consumer needs. Consumer-specific bundles layer on top:
 - **`alknet-compute`** adds `@alkdev/flowgraph` (the `<Operation>`/`<Sequential>`/`<Parallel>`/`<Conditional>`/`<Map>` components + `GraphologyHostConfig` + `ReactiveHostConfig`). The graphology JS dependency is the candidate for the petgraph port (a separate `alknet-compute` concern, not a runtime concern).
 - **`alknet-desktop`** adds three.js + the user-facing ujsx HostConfigs (Three for 3D, SDF for 2D UI). three.js is the largest external dep and the one with the open scoping unknown (loader op enumeration — see `alknet-desktop/poc-summary.md` §Open Unknowns).
 The runtime owns the cold-start budget for the core bundle. Consumers own their own cold-start additions. The POC-2 note about `embed!` bytecode preload as the cold-start mitigation applies to the runtime's bundle; consumers can use the same technique for their additions.
 ---
 ## The Sandbox / Privilege Model
 `toolEnv` (`/workspace/toolEnv/core/sandbox/`) proved the concept with WASM-QuickJS (`@sebastianwessel/quickjs` + `@jitl/quickjs-ng-wasmfile-release-sync`): `SandboxManager.executeScript(code, env, consoleHandler, timeout)` with `allowFetch` / `allowFs` privilege flags and an `envProxy` exposing only registered operations. The runtime generalizes this to native rquickjs:
 - **Same privilege shape.** `allowFetch: bool`, `allowFs: bool`, `envProxy: { [opName]: fn }`. A UDF host with `allowFetch: false` / `allowFs: false` and only registered operations exposed is a real sandbox — the UDF cannot reach the network or filesystem except through the ops the host chose to expose.
 - **Different isolation boundary.** WASM-QuickJS gives browser-compatible maximal isolation at slower speed; native rquickjs gives OS-level isolation (process boundary) at native speed. The operations protocol is the same in both — UDFs authored for the WASM path run unchanged on the native path. The choice is a deployment-time decision, not a design-time one.
 - **The runtime owns the privilege enforcement point.** `allowFetch`/`allowFs` become gates on whether the runtime exposes `fetch`/`fs` Rust ops to the JS isolate. The `envProxy` is the only way the UDF reaches the outside world when both are false.
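 The gating described in these bullets can be sketched as a table-building step. All names (`SandboxOptions`, `HostOp`, `build_env_proxy`, `call`) are hypothetical, the `fetch`/`fs` ops are stubs, and ops take and return strings purely for brevity:

 ```rust
 use std::collections::BTreeMap;

 /// Privilege flags mirroring the `allowFetch` / `allowFs` shape.
 #[derive(Clone, Copy, Default)]
 pub struct SandboxOptions {
     pub allow_fetch: bool,
     pub allow_fs: bool,
 }

 /// A host op callable from the isolate (string in, string out for brevity).
 pub type HostOp = fn(&str) -> Result<String, String>;

 // Stubs: a real host would perform the network / filesystem access here.
 fn fetch_op(_url: &str) -> Result<String, String> {
     Err("network stubbed in sketch".to_string())
 }
 fn fs_op(_path: &str) -> Result<String, String> {
     Err("filesystem stubbed in sketch".to_string())
 }

 /// Build the op table the isolate will see. Privileged ops are added only
 /// when their flag is set; registered ops are always exposed.
 pub fn build_env_proxy(
     opts: SandboxOptions,
     registered: &[(&'static str, HostOp)],
 ) -> BTreeMap<&'static str, HostOp> {
     let mut env = BTreeMap::new();
     if opts.allow_fetch {
         env.insert("fetch", fetch_op as HostOp);
     }
     if opts.allow_fs {
         env.insert("fs.readFile", fs_op as HostOp);
     }
     for (name, op) in registered {
         env.insert(*name, *op);
     }
     env
 }

 /// Dispatch a call from the UDF; anything not in the table fails cleanly.
 pub fn call(env: &BTreeMap<&'static str, HostOp>, name: &str, arg: &str) -> Result<String, String> {
     match env.get(name) {
         Some(op) => op(arg),
         None => Err(format!("op not exposed: {name}")),
     }
 }
 ```

 With both flags false, the table contains only the registered ops, so a UDF calling `fetch` gets an "op not exposed" error instead of reaching the network.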
 This is directly relevant to both consumers:
 - **`alknet-desktop`** is a UDF host whose operations happen to include "render this ujsx tree" and "handle this input event" — the desktop isn't special, it's an operation host with a GPU.
 - **`alknet-compute`** is a UDF host whose operations happen to include "dispatch this kernel" and "read this buffer" — same model, different op surface.
 An LLM-authored operation (model architecture, tool composition, custom kernel) runs in the same sandbox with the same privilege model regardless of which consumer hosts it.
 ---
 ## Open Unknowns (Substrate-Specific)
 These are the unknowns introduced by *extracting* the substrate. The consumer-specific unknowns (compositing, three.js loader op surface, autograd correctness, etc.) stay in their respective docs.
 ### 1. Does alknet-runtime own the `OperationRegistry`, or share one from alknet-call?
 alknet-call owns the canonical `OperationRegistry` type (ADR-013/017). The runtime's ops bridge registers into *an* `OperationRegistry` — but is it a fresh one the runtime owns, the consumer's global one, or a peer-scoped overlay (ADR-028/029)? ADR-017 §1 flagged this as a two-way door with a security dimension: sharing the global registry exposes local capabilities to a remote peer; a peer-scoped subset must filter by capability remote-safety, not just operation name. **Recommendation:** the runtime takes an `Arc<OperationRegistry>` (or a reference to one) at construction — the consumer (or alknet-call's `CallClient`) owns the registry; the runtime bridges into it. Keeps the registry authority in alknet-call where ADR-013 put it.
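 A sketch of the recommended ownership shape, assuming the registry is shared through an `Arc`. The `OperationRegistry` below is a stand-in with an invented `register`/`call` API (alknet-call's real type is not shown in this doc), and `Runtime::register_udf` is a hypothetical placeholder for the ops bridge:

 ```rust
 use std::collections::HashMap;
 use std::sync::{Arc, Mutex};

 type Handler = Box<dyn Fn(&str) -> String + Send + Sync>;

 /// Stand-in for alknet-call's registry type; API invented for the sketch.
 #[derive(Default)]
 pub struct OperationRegistry {
     ops: Mutex<HashMap<String, Handler>>,
 }

 impl OperationRegistry {
     pub fn register(&self, name: &str, handler: impl Fn(&str) -> String + Send + Sync + 'static) {
         self.ops.lock().unwrap().insert(name.to_string(), Box::new(handler));
     }
     pub fn call(&self, name: &str, input: &str) -> Option<String> {
         self.ops.lock().unwrap().get(name).map(|handler| handler(input))
     }
 }

 /// The runtime is handed a registry at construction; it never creates one.
 pub struct Runtime {
     registry: Arc<OperationRegistry>,
 }

 impl Runtime {
     pub fn new(registry: Arc<OperationRegistry>) -> Self {
         Self { registry }
     }
     /// Placeholder for the ops bridge registering a JS-authored UDF.
     pub fn register_udf(&self, name: &str) {
         let tag = name.to_string();
         self.registry.register(name, move |input| format!("{tag}({input})"));
     }
 }
 ```

 Because the consumer keeps its own `Arc` clone, an op registered through the runtime is immediately visible to whatever else dispatches through that registry; the security question of which registry to pass in stays with the caller.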
 ### 2. Device intent and the render/compute split
 The runtime acquires the wgpu device unconditionally, but the *adapter features* differ: compute-only needs no surface support, while desktop needs a surface-compatible adapter (swapchain, present). The split is a config flag on one acquisition path, not two paths. The open question is the exact `DeviceRequest` shape: does it take an enum (`DeviceIntent::ComputeOnly | DeviceIntent::RenderSurface { window_handle }`) or a builder (`DeviceRequest::new().with_render_surface(window)`)?
 **Recommendation:** enum first; builder if it grows. The v29 surface-API migration scope lives entirely in the `RenderSurface` arm, which is only constructed by `alknet-desktop`. Compute-only consumers never touch it.
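 The enum-first option might look like the sketch below. `WindowHandle` is an opaque placeholder (a real implementation would carry whatever the windowing crate provides), and `AdapterRequirements` is a hypothetical summary of what the single acquisition path would derive from the intent:

 ```rust
 /// Opaque stand-in for a platform window handle (hypothetical).
 #[derive(Clone, Copy, Debug, PartialEq, Eq)]
 pub struct WindowHandle(pub u64);

 #[derive(Clone, Copy, Debug, PartialEq, Eq)]
 pub enum DeviceIntent {
     ComputeOnly,
     RenderSurface { window_handle: WindowHandle },
 }

 /// What the one acquisition path needs to know, derived from the intent.
 #[derive(Debug, PartialEq, Eq)]
 pub struct AdapterRequirements {
     pub needs_present: bool,
     pub compatible_window: Option<WindowHandle>,
 }

 impl DeviceIntent {
     pub fn requirements(self) -> AdapterRequirements {
         match self {
             // Compute-only consumers never touch surface state.
             DeviceIntent::ComputeOnly => AdapterRequirements {
                 needs_present: false,
                 compatible_window: None,
             },
             // All surface-specific handling is confined to this arm.
             DeviceIntent::RenderSurface { window_handle } => AdapterRequirements {
                 needs_present: true,
                 compatible_window: Some(window_handle),
             },
         }
     }
 }
 ```

 Adding a third intent later forces every `match` on `DeviceIntent` to be revisited by the compiler, which is one argument for starting with the enum.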
 ### 3. Microtask scheduling as a runtime API
 POC-2 noted that `queueMicrotask`-scheduled updates didn't flush before `Module::finish()` returned in the one-shot probe. In a per-frame render loop (desktop) or per-dispatch compute loop (compute), microtasks drain naturally because the runtime is re-entered. The runtime should expose an explicit `pump_jobs()` / `drain_microtasks()` API so one-shot hosts (UDF execution that evaluates and exits) and tests can flush deterministically. The open question is whether this is a public API or an internal detail the consumer's event loop calls implicitly.
 **Recommendation:** public API. Consumers with their own event loop (desktop's rAF loop, compute's dispatch loop) call it as part of each loop iteration; one-shot hosts and tests call it explicitly. Same mechanism, two call sites.
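 The drain semantics can be modelled without a JS engine. The sketch below is a toy queue, not rquickjs: `JobQueue`, `queue_microtask`, and the `log` field are hypothetical, and the point is only that `pump_jobs()` keeps running until jobs enqueued by other jobs have also finished:

 ```rust
 use std::collections::VecDeque;

 type Job = Box<dyn FnOnce(&mut JobQueue)>;

 /// Toy model of a microtask queue.
 #[derive(Default)]
 pub struct JobQueue {
     pending: VecDeque<Job>,
     pub log: Vec<String>,
 }

 impl JobQueue {
     pub fn queue_microtask(&mut self, job: impl FnOnce(&mut JobQueue) + 'static) {
         self.pending.push_back(Box::new(job));
     }

     /// Run until the queue is empty, including jobs enqueued while draining.
     /// Returns how many jobs ran.
     pub fn pump_jobs(&mut self) -> usize {
         let mut ran = 0;
         while let Some(job) = self.pending.pop_front() {
             job(self);
             ran += 1;
         }
         ran
     }
 }
 ```

 A one-shot host would call `pump_jobs()` once after evaluation; a frame loop would call it every iteration, where it usually returns quickly because the queue is already empty.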
 ### 4. JS bundle composition
 The runtime embeds the 271-module core. Consumers add their own bundles (flowgraph, three.js). The open question is the composition mechanism: does the runtime expose a `load_bundle(bytes)` API that consumers call at init, or are consumer bundles separate `embed!` invocations in the consumer crate that share the same isolate?
 **Recommendation:** the runtime owns the isolate and the core bundle; consumers register additional module-resolver entries that point at their own embedded or file-loaded bundles. The module resolver is the composition point. A consumer that wants to ship flowgraph embeds it with its own `embed!` invocation and registers those modules with the runtime's resolver.
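 A sketch of the resolver as composition point. `ModuleResolver` and `ModuleSource` are hypothetical names, byte slices stand in for embedded bytecode, and refusing to shadow a core module is an assumption of this sketch rather than a decision recorded in this doc:

 ```rust
 use std::collections::HashMap;

 /// Which layer supplied a module's bytes.
 #[derive(Clone, Copy, Debug, PartialEq, Eq)]
 pub enum ModuleSource {
     Core(&'static [u8]),
     Consumer(&'static [u8]),
 }

 #[derive(Default)]
 pub struct ModuleResolver {
     entries: HashMap<String, ModuleSource>,
 }

 impl ModuleResolver {
     /// Runtime-owned: seed the resolver with the core bundle.
     pub fn with_core(core: &[(&str, &'static [u8])]) -> Self {
         let entries = core
             .iter()
             .map(|(name, bytes)| (name.to_string(), ModuleSource::Core(*bytes)))
             .collect();
         Self { entries }
     }

     /// Consumer-facing: add an entry, refusing to shadow a core module.
     pub fn register(&mut self, specifier: &str, bytes: &'static [u8]) -> Result<(), String> {
         if matches!(self.entries.get(specifier), Some(ModuleSource::Core(_))) {
             return Err(format!("cannot shadow core module: {specifier}"));
         }
         self.entries.insert(specifier.to_string(), ModuleSource::Consumer(bytes));
         Ok(())
     }

     pub fn resolve(&self, specifier: &str) -> Option<ModuleSource> {
         self.entries.get(specifier).copied()
     }
 }
 ```

 Under this shape `alknet-compute` would call `register` for its flowgraph modules at init, and the isolate would see one flat module namespace.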
 ### 5. Schema module placement
 The runtime needs the jsonschema + `TypeDef:*` custom keywords for the ops bridge (operation specs are JSON Schemas). `alknet-tensor` (the format) also needs jsonschema + offset computation, with no runtime dep. The open question: should the schema module (jsonschema wrapper + custom keywords) live in the runtime, live in a tiny separate `alknet-schema` crate, or be duplicated in both crates?
 **Recommendation:** tiny separate `alknet-schema` crate (jsonschema + custom keyword impls + offset computation), depended on by both `alknet-runtime` and `alknet-tensor`. Avoids the runtime pulling in tensor's offset logic, and avoids tensor pulling in the runtime's JS machinery. The split cost is one small crate; the win is the format crate stays pure-Rust.
 ### 6. WASM-QuickJS vs native-rquickjs as a runtime backend
 `toolEnv` v1 used WASM-QuickJS (browser-compatible, maximal isolation, slower). The native rquickjs path is the default for alknet-runtime (OS-level isolation via process boundaries, native speed). The open question is whether the runtime should support *both* backends behind a trait, so the same UDF code runs in a browser sandbox (WASM) or a native sandbox (rquickjs) depending on deployment.
 **Recommendation:** defer. The operations protocol is the same in both; the privilege model is the same. The runtime targets native rquickjs first; a WASM backend is a later addition if a browser-hosted UDF use case materializes. Don't generalize before there's a second consumer.
 ---
 ## Recommended Next POCs (Substrate-Specific)
 In priority order, these are POCs for the *extracted runtime* — distinct from the consumer POCs (compositing design, three.js loader enumeration, WGSL codegen, autograd-via-flowgraph) which stay in their respective docs.
 1. **Runtime skeleton** — `alknet-runtime` crate: `Cargo.toml` + `lib.rs` that constructs an rquickjs isolate, embeds the 271-module core bundle via `embed!`, acquires a wgpu device on llvmpipe with `DeviceIntent::ComputeOnly`, exposes the primitive compute dispatch ops (`compile_shader`/`create_buffer`/`dispatch`/`readback`) to JS, and runs a hardcoded WGSL matmul from JS via `envProxy`. Proves the substrate integrates end-to-end. One day.
 2. **Ops bridge probe** — extend the runtime skeleton to register a JS UDF (`test.echo`) on the `OperationRegistry`, call it from Rust via `env.invoke()`, and call a Rust op from JS via `envProxy`. Verifies the bidirectional bridge against alknet-call's registry type. Half-day on top of the skeleton.
 3. **Cold-start measurement** — instrument the runtime skeleton's startup: module load + link time, `embed!` bytecode preload vs source-parse, wgpu device acquisition time on llvmpipe. Produces the budget numbers the consumer crates need for their cold-start accounting. Half-day.
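 For the cold-start measurement in item 3, a per-phase timer is probably all the instrumentation needed. The sketch below uses only `std::time`; `ColdStartReport` and the phase names are hypothetical, and the closures passed to `measure` are where the real module load and device acquisition would go:

 ```rust
 use std::time::{Duration, Instant};

 /// Collects wall-clock time per startup phase.
 #[derive(Default)]
 pub struct ColdStartReport {
     phases: Vec<(&'static str, Duration)>,
 }

 impl ColdStartReport {
     /// Time one phase and pass its result through unchanged.
     pub fn measure<T>(&mut self, phase: &'static str, work: impl FnOnce() -> T) -> T {
         let start = Instant::now();
         let out = work();
         self.phases.push((phase, start.elapsed()));
         out
     }

     pub fn phases(&self) -> &[(&'static str, Duration)] {
         &self.phases
     }

     pub fn total(&self) -> Duration {
         self.phases.iter().map(|(_, d)| *d).sum()
     }
 }
 ```

 Wrapping the bytecode-preload path and the source-parse path in separate `measure` calls would give the side-by-side numbers the consumer crates need for their own budgets.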
 4. **`pump_jobs()` API verification** — the POC-2 microtask note, made concrete: add `pump_jobs()` to the runtime, verify a one-shot UDF execution flushes microtasks deterministically with an explicit call, verify a per-frame loop flushes implicitly. Confirms the scheduling model. Half-day.
 5. **Sandbox privilege enforcement** — construct the runtime with `allowFetch: false` / `allowFs: false`, verify a UDF that attempts `fetch()` or `fs.readFile()` fails cleanly (op not exposed), verify a UDF that calls a registered op via `envProxy` succeeds. Confirms the privilege gate. Half-day on top of the ops bridge probe.
 ---
 ## Relationship to the Consumer Crates
 | | alknet-runtime | alknet-desktop | alknet-compute | alknet-tensor |
 |---|---|---|---|---|
 | **Owns** | JS isolate, wgpu device, ops bridge, shared JS core, sandbox, primitive compute dispatch | winit, surface/swapchain, three.js shims, Three/SDF HostConfigs, compositor, irpc-to-head client | Buffer manager, op table, `ShaderGenerator`, tensor ops, autograd-via-flowgraph, `gradcheck`, distributed training | Binary format, offset computation, mmap, QUIC stream mapping, ujsx layout authoring |
 | **Depends on** | alknet-call, alknet-schema (if extracted) | alknet-runtime (+ alknet-compute if in-process ML) | alknet-runtime, alknet-tensor, alknet-schema (if extracted) | alknet-schema (if extracted; pure-format path) |
 | **wgpu usage** | Device acquisition + primitive compute dispatch | Render passes, surfaces, swapchain, compositing | Compute passes only — no surface | None (pure format) |
 | **JS layer** | Shared core bundle (271 modules) | + three.js + Three/SDF HostConfigs | + flowgraph + reactive execution host | None (pure format) |
 | **Complexity driver** | The extraction boundary itself (what's truly shared vs accidentally coupled) | 3D+2D compositing, three.js shim surface | Autograd correctness, kernel codegen, distributed training | Offset computation correctness, mmap safety, blob indirection |
 | **Network model** | Ops bridge into alknet-call registry (UDFs become network-callable) | Desktop worker dials head, renders UI (ADR-017 client contract) | Tensor ops on registry, distributed via `from_call` (ADR-017) | QUIC per-tensor streams (format property, not runtime) |
 ---
 ## What This Eliminates
 1. **Duplicate rquickjs setup.** Both consumer crates would have re-implemented isolate creation, module resolver, `embed!` bundling, microtask pumping. One implementation in the runtime.
 2. **Duplicate ops-protocol bridge.** Both consumer crates would have re-implemented the Rust↔JS call surface and `ResponseEnvelope` plumbing. One implementation in the runtime, bridging into alknet-call's registry.
 3. **Duplicate shared JS core loading.** Both consumer crates would have embedded the same 271-module bundle. One bundle in the runtime; consumers layer on top.
 4. **Duplicate sandbox/privilege enforcement.** `toolEnv`'s privilege model, generalized to native rquickjs once. Both consumers inherit it; neither re-implements the gate.
 5. **Duplicate wgpu device acquisition.** Both consumer crates would have acquired the device (with different intent flags). One acquisition path in the runtime, parameterized by intent.
 6. **The "is wgpu a feature flag" question, resolved.** wgpu is unconditional. Every consumer gets device access and primitive compute dispatch. The "UDF-only host with no GPU" case is reframed as "a compute host that happens to not render" — it still benefits from wgpu on llvmpipe.
 ---
 ## References
 - **alknet-desktop POCs (substrate verification):** `docs/research/alknet-desktop/poc-summary.md` — quickjs-reactive-probe (271 modules verified on QuickJS-NG), ui-spoke-poc (headless WebGPU + three.js + MSDF text on llvmpipe)
 - **alknet-compute architecture (consumer):** `docs/research/alknet-compute/architecture-summary.md` — tensor compute engine building on this runtime + alknet-tensor
 - **alknet-tensor format (sibling):** `docs/research/alknet-tensor/metatensor-format.md` — pure-format binary tensor layout, no runtime dep
 - **alknet-call (lower layer):** `docs/architecture/decisions/013-rust-canonical-implementation.md` (Rust canonical, adapter traits in alknet-call), `docs/architecture/decisions/017-call-protocol-client-and-adapter-contract.md` (`CallClient`, `from_call`, registry overlay model)
 - **toolEnv (sandbox precedent):** `/workspace/toolEnv/core/sandbox/` — `SandboxManager`, `SandboxEnv`, `SandboxOptions` with `allowFetch`/`allowFs` privilege flags
 - **quickjs-reactive-probe (the probe itself):** `/workspace/quickjs-reactive-probe` — `src/main.rs`, `probe.mjs`, `Cargo.toml`
 - **ui-spoke-poc (the headless WebGPU probe):** `/workspace/ui-spoke-poc` — `src/headless-canvas.ts`, `src/shims.ts`, `examples/threejs-webgpu.ts`, `examples/msdf-text.ts`
 - **wgpu shading-language support (multi-backend codegen):** https://docs.rs/wgpu/latest/wgpu/#shading-language-support — SPIR-V / GLSL / WGSL / naga-IR input languages
 - **webgpu-torch (reference for compute consumer):** `/workspace/webgpu-torch` — `src/op_spec.ts`, `src/op_table.ts`, `src/opgen.ts`, `src/kernel.ts`, `src/autograd.ts`
 - **flowgraph (compute graph layer, used by alknet-compute):** `/workspace/@alkdev/flowgraph` — `src/component/`, `src/host/{graphology,reactive}.ts`
 - **Operations protocol (verified on quickjs):** `/workspace/@alkdev/operations/src/` — `registry.ts`, `call.ts`, `types.ts`, `validation.ts`, `response-envelope.ts`, `access.ts`
--- a/docs/research/alknet-tensor/metatensor-format.md
+++ b/docs/research/alknet-tensor/metatensor-format.md
@@ -1,9 +1,11 @@
-# Metatensor: Schema-Driven Binary Tensor Format
+# alknet-tensor: Schema-Driven Binary Tensor Format (Metatensor)
 **Status:** Research / concept design. The schema layer is proven (TypeBox + jsonschema interop confirmed by API inspection); the offset computation and mmap layer are designed but not yet implemented.
-**Date:** 2026-06-20
+**Date:** 2026-06-20 (original), 2026-06-30 (reframed for crate split)
 **Scope:** A schema-driven binary tensor format extending safetensors, using TypeBox (JS) / jsonschema (Rust) as the layout specification language. Supports flat, struct, and blob tensor kinds for fixed-size, record-structured, and variable-length data. Memory-mappable, QUIC-streamable, and authorable via ujsx.
 **Crate note (2026-06-30):** This doc stays at `alknet-tensor/metatensor-format.md` and was originally paired with `alknet-tensor/architecture-summary.md`. The crate-decomposition session split the original `alknet-tensor` concept into two crates: `alknet-tensor` (this doc: the pure binary format, no JS or wgpu dependency) and `alknet-compute` (the wgpu compute engine, now at `docs/research/alknet-compute/architecture-summary.md`, which builds on `alknet-runtime` + `alknet-tensor`). A pure-Rust model server can use `alknet-tensor` for the format alone; `alknet-compute` is what bridges the format to wgpu buffers via `load_model`/`stream_model` ops registered on `alknet-runtime`'s registry.
 ---
 ## Executive Summary
@@ -417,4 +419,5 @@ Loading a tensor from mmap into a wgpu buffer: is it a copy (mmap → staging bu
 - **safetensor format reference:** `/workspace/research/typebox_research/metatensor/basic.ts` — 8-byte LE header length + JSON header + raw bytes; `TensorRef = {dtype, shape, data_offsets}`
 - **ujsx reconciler (verified on quickjs):** `/workspace/@alkdev/ujsx/src/host/{reconcile,config,fiber}.ts` — fiber-based reconciler, the typed-tree diff engine
 - **flowgraph (compute graph layer, uses ujsx):** `/workspace/@alkdev/flowgraph/` — `<Operation>`, `<Sequential>`, `<Parallel>`, `<Conditional>`, `<Map>` components; `GraphologyHostConfig` + `ReactiveHostConfig`
-- **alknet-tensor architecture (parent doc):** `/workspace/@alkdev/alknet/docs/research/alknet-tensor/architecture-summary.md` — the tensor compute architecture this format serves
+- **alknet-compute architecture (consumer of this format):** `docs/research/alknet-compute/architecture-summary.md` — the wgpu compute engine that builds on `alknet-runtime` and this format; registers the `load_model`/`stream_model` ops that bridge metatensor files to wgpu buffers
 - **alknet-runtime (substrate, sibling dependency):** `docs/research/alknet-runtime/summary.md` — the JS+wgpu substrate; this format crate does not depend on the runtime (pure-Rust path), but `alknet-compute` depends on both