glm-5.2 f11522aaa4 docs(research): extend alknet-tensor — flowgraph as compute graph layer, petgraph port

Adds a major section documenting how @alkdev/flowgraph (already npm-published,
uses ujsx) becomes the compute graph authoring and execution layer for
alknet-tensor, replacing webgpu-torch's imperative nn.Module hierarchy and
autograd recording with declarative ujsx templates and reactive DAG execution.

Key points documented:
- The ujsx tree IS the compute graph (CUDA-graphs-shaped but declarative)
- flowgraph's two HostConfigs: GraphologyHostConfig (compile/validate) and
  ReactiveHostConfig (execute with signal-driven status propagation)
- nn modules become ujsx components, autograd becomes reverse tree walk
- Conditional/Map components enable dynamic structure CUDA graphs can't express
- Network-callable compute graphs (mix local + remote ops in one template)
- TSX authoring via standard JSX→h transform (ujsx jsx-runtime as target)
- graphology → petgraph port: ~15 API methods map 1:1, removes ~5400 lines of JS
- Updated POC priorities: end-to-end skeleton now includes flowgraph integration,
  petgraph host port as a separate POC

2026-06-20 12:03:31 +00:00

alknet-tensor: Research Summary

Status: Early research: architecture direction established, no POCs yet. Derived from analyzing webgpu-torch as a reference design and the quickjs+wgpu verification from the alknet-desktop POCs.

Date: 2026-06-20

Scope: Captures the architectural direction for a Rust+wgpu tensor library with autograd, using QuickJS as a thin API/composition layer and WGSL compute shaders for execution. Documents what webgpu-torch established as a reference, how the architecture differs from a straight port, and what unknowns remain. Separate from alknet-desktop but shares the same verified substrate (quickjs + wgpu + the operations protocol).

Executive Summary

alknet-tensor is a PyTorch-shaped tensor computation library built on Rust + wgpu, where the JS layer (QuickJS via rquickjs) is a thin API/composition surface and Rust owns memory, dispatch, and codegen. It is derived from the design of webgpu-torch (/workspace/webgpu-torch) — a pure-JS tensor + autograd library that runs entirely on the WebGPU compute pipeline — but is not a port of its code. webgpu-torch is the reference design; alknet-tensor is the production architecture.

The two completed alknet-desktop POCs (documented in docs/research/alknet-desktop/poc-summary.md) established the substrate this builds on:

wgpu renders on llvmpipe (software Vulkan) with no physical GPU — so tensor compute is testable on this OVH box right now, deployable to vast.ai GPU instances for production.
QuickJS-NG runs the operations protocol (@alkdev/operations registry, call, envelopes, ACL, buildCallHandler) — so every tensor op can be an OperationSpec on the registry, network-callable over irpc, same as any other operation.
typebox-rs already has the handlebars codegen pattern (/workspace/@alkimiadev/typebox-rs/src/codegen/) — RustGenerator and TypeScriptGenerator render typed schemas to target languages; a WgslGenerator is the same shape, rendering KernelSpec → WGSL shader strings.

This solves several downstream problems that weren't the original target (see §Downstream Problems Solved).

Reference Design: webgpu-torch

Location: /workspace/webgpu-torch (v0.4.0, npm-published, zero runtime deps except @webgpu/types, @xtuc/long, cross-fetch)

Homepage: https://praeclarum.org/webgpu-torch

What it is

A PyTorch-like ML library that implements tensors, autograd, an nn module hierarchy, optimizers, and ONNX import/export — all in TypeScript, all running on WebGPU compute pipelines. No CUDA, no native bindings, no browser required (works in Deno with --unstable-webgpu).

The three-stage pipeline

webgpu-torch's op system is structured in three clean stages, each of which is relevant to the alknet-tensor architecture:

Stage 1 — OpSpec (declarative op description). (src/op_spec.ts:8-27, src/op_table.ts — 452 lines, ~100 ops)

type OpSpec = {
  name: string;
  nnName?: string;       // torch.nn name (e.g. "ReLU")
  torchName?: string;    // torch.* name
  nnOp?: boolean;        // is this an nn module?
  type: "unary" | "binary" | "reduction";
  forward: ExprCode;     // e.g. "output = abs(input)"
  backward?: ExprCode;   // e.g. "inputGrad = input == 0 ? 0 : ..."
  alpha?: boolean;       // binary ops with alpha scalar
  // reduction-specific:
  init?: ExprCode;       // e.g. "0" for sum
  combineOp?: "+" | "*" | "&&" | "||";
  reduce?: ExprCode;
};

The entire op table is declarative data — ~100 ops (abs, acos, add, matmul, conv2d, layer_norm, etc.) described as forward/backward expressions. No imperative dispatch code, no buffer management, no GPU calls. This is the schema layer.

Stage 2 — opgen.ts (op spec → kernel specs). (src/opgen.ts, 728 lines)

Transforms each OpSpec into one or more KernelSpec entries — one per dtype combination and gradient direction. A binary op like add produces 6+ kernel specs (forward for each dtype pair, plus backward variants). A KernelSpec (src/kernel.ts:34-45) is a complete compute-pass description:

type KernelSpec = {
  name: string;
  parameters: KernelParamSpec[];      // scalar params (alpha, dims, etc.)
  inputs: KernelInputSpec[];           // storage buffer bindings
  outputs: KernelOutputSpec[];         // read_write storage buffer bindings
  workgroupSize: [ExprCode, ExprCode, ExprCode];
  workgroupCount: [ExprCode, ExprCode, ExprCode];
  workgroupVariables?: KernelInputSpec[];
  shader: string;                      // the WGSL body (without scaffolding)
};

This stage is pure computation — array manipulation and expression compilation (ExprCode → compiled shader fragment). No GPU calls, no side effects. It runs fine in JS but could also run in Rust.
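The fan-out from one op spec to several kernel specs can be sketched as below. This is a minimal illustration, not opgen.ts itself: UnaryOpSpec, KernelDescriptor, expandUnaryOp, the kernel naming scheme, and the "neg" expressions are all assumptions made for the example.

```typescript
type ExprCode = string;

// Hypothetical, trimmed-down op spec: only the fields this sketch needs.
interface UnaryOpSpec {
  name: string;
  forward: ExprCode;
  backward?: ExprCode;
}

// Hypothetical kernel descriptor: one entry per (dtype, direction) pair.
interface KernelDescriptor {
  name: string;
  dtype: string;
  direction: "forward" | "backward";
  body: ExprCode;
}

// Pure data transformation: no GPU calls, no side effects.
function expandUnaryOp(spec: UnaryOpSpec, dtypes: readonly string[]): KernelDescriptor[] {
  const kernels: KernelDescriptor[] = [];
  for (const dtype of dtypes) {
    kernels.push({ name: `${spec.name}_${dtype}`, dtype, direction: "forward", body: spec.forward });
    if (spec.backward !== undefined) {
      kernels.push({ name: `${spec.name}_${dtype}_grad`, dtype, direction: "backward", body: spec.backward });
    }
  }
  return kernels;
}

// Illustrative op; the expressions are written for this example only.
const negSpec: UnaryOpSpec = {
  name: "neg",
  forward: "output = -input",
  backward: "inputGrad = -outputGrad",
};
const negKernels = expandUnaryOp(negSpec, ["f32", "i32"]);
```

Because the function is a pure mapping over data, the same logic could run in JS at init or in Rust at build time, which is the choice discussed under Open Unknowns.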

Stage 3 — getKernelShaderCode (kernel spec → final WGSL). (src/kernel.ts:299-375, ~70 lines)

Turns a KernelSpec into a complete WGSL shader by string-concatenating:

struct ${name}Parameters { ... } — parameter struct
@group(0) @binding(N) var<storage, read> input: ... — input bindings
@group(0) @binding(N) var<storage, read_write> output: ... — output bindings
@compute @workgroup_size(x, y, z) — compute entry point header
@builtin(global_invocation_id) global_id: vec3u — conditionally included if the shader references global_id
The shader body from spec.shader

This is template rendering — loops over inputs/outputs/parameters, conditional @builtin inclusion. It is exactly what handlebars does, and exactly the pattern typebox-rs codegen already uses.
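A minimal sketch of that concatenation, assuming a simplified kernel spec with numeric workgroup sizes. SimpleKernelSpec, renderWgsl, the binding order, and the "abs" kernel are hypothetical and only illustrate the scaffolding steps listed above; the real getKernelShaderCode handles more cases.

```typescript
// Hypothetical, simplified shapes: the real KernelSpec carries more fields.
interface BufferBinding {
  name: string;
  wgslType: string;
}

interface SimpleKernelSpec {
  name: string;
  parameters: BufferBinding[];
  inputs: BufferBinding[];
  outputs: BufferBinding[];
  workgroupSize: [number, number, number];
  shader: string; // WGSL body without scaffolding
}

function renderWgsl(spec: SimpleKernelSpec): string {
  const lines: string[] = [];
  // 1. Parameter struct.
  lines.push(`struct ${spec.name}Parameters {`);
  for (const p of spec.parameters) lines.push(`  ${p.name}: ${p.wgslType},`);
  lines.push(`}`);
  // 2. Bindings: inputs read-only, outputs read_write, parameters last (assumed order).
  let binding = 0;
  for (const i of spec.inputs) {
    lines.push(`@group(0) @binding(${binding++}) var<storage, read> ${i.name}: array<${i.wgslType}>;`);
  }
  for (const o of spec.outputs) {
    lines.push(`@group(0) @binding(${binding++}) var<storage, read_write> ${o.name}: array<${o.wgslType}>;`);
  }
  lines.push(`@group(0) @binding(${binding++}) var<storage, read> parameters: ${spec.name}Parameters;`);
  // 3. Entry point; the builtin is only declared when the body references it.
  const [x, y, z] = spec.workgroupSize;
  const args = spec.shader.includes("global_id")
    ? "@builtin(global_invocation_id) global_id: vec3u"
    : "";
  lines.push(`@compute @workgroup_size(${x}, ${y}, ${z})`);
  lines.push(`fn main(${args}) {`);
  lines.push(spec.shader);
  lines.push(`}`);
  return lines.join("\n");
}

const absKernel: SimpleKernelSpec = {
  name: "abs",
  parameters: [{ name: "size", wgslType: "u32" }],
  inputs: [{ name: "input", wgslType: "f32" }],
  outputs: [{ name: "output", wgslType: "f32" }],
  workgroupSize: [64, 1, 1],
  shader:
    "  if (global_id.x >= parameters.size) { return; }\n" +
    "  output[global_id.x] = abs(input[global_id.x]);",
};
const absWgsl = renderWgsl(absKernel);
```

Each loop and conditional here corresponds to one handlebars block ({{#each}} or {{#if}}), which is why a template engine is a natural fit for this stage.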

The autograd system

src/autograd.ts (112 lines) — GradientContext, AutoFunction, backward dispatch. The autograd graph is pure bookkeeping: which op produced which tensor, what's the backward function, which tensors to save for backward. No heavy compute — just metadata wiring. backward() calls back into the kernel dispatch to run the backward shaders.

This stays in JS in alknet-tensor. It's the composition layer: users write loss.backward() and the graph traversal calls Rust-side backward kernels. The graph itself is lightweight (tensor handles + op references, no data).

The nn module hierarchy

src/nn_module.ts (467 lines) — Module base class with _children tree, Parameter (tensor with requiresGrad), StateDict for serialization. src/nn_basic.ts, nn_2d.ts, nn_norm.ts, nn_diffusers.ts, nn_applications.ts implement Conv2d, BatchNorm, Linear, attention, etc.

This is composition structure — it builds the call graph, not the compute. Stays in JS.

The optimizer

src/optim.ts (204 lines) — Optimizer base class, param groups, state tracking. Stays in JS (it's a loop over parameters calling Rust-side ops).

The GPU API surface it uses

Small and entirely compute-oriented (no render passes, no swapchain, no textures-as-render-targets):

createBuffer, createShaderModule, createComputePipeline, createBindGroup, beginComputePass, dispatchWorkgroups, copyBufferToBuffer, mapAsync, writeBuffer.

~10 distinct GPU API calls, all on the compute side. This is the easier half of wgpu to expose from Rust — no surface management, no present loop, no window handles. Tensor compute is structurally simpler than the UI rendering case.

The Architecture: JS as API, Rust as Execution

The key architectural decision: JS holds handles, Rust owns memory and dispatch. This is the PyTorch model (Python holds handles, C++/CUDA owns memory) applied to QuickJS + wgpu.

What lives in JS (QuickJS)

The thin API/composition layer. No tensor data, no GPU calls.

Tensor = {id: BufferId, shape: number[], dtype: string, requiresGrad: boolean, grad: Tensor | null} — metadata only, the data is a Rust-owned wgpu::Buffer
Op table — declarative OpSpec definitions (same schema as webgpu-torch's, possibly as TypeBox schemas for registry integration)
Autograd graph — GradientContext, AutoFunction, backward bookkeeping. Pure metadata wiring.
nn module hierarchy — Module, Parameter, Sequential, Conv2d, Linear, etc. Composition structure that builds the call graph.
Optimizer — param groups, state, the step() loop. Calls Rust-side ops.
Custom kernel registration — user writes WGSL string, calls register_kernel(name, wgsl, input_specs, output_specs). Rust compiles and caches.
Operations registry integration — each tensor op is an OperationSpec (verified on quickjs by POC 2). Built-in ops register at init; user ops register dynamically. All network-callable over irpc.

What lives in Rust

Memory, dispatch, codegen. The execution layer.

Buffer manager — HashMap<BufferId, wgpu::Buffer> with manual lifetime management. Replaces webgpu-torch's FinalizationRegistry-driven JS buffer pool with Rust-native resource management. No GC interaction, no weak refs, deterministic destruction.
Kernel compiler — wgpu::ShaderModule creation from WGSL strings. Built-in kernels compiled at startup (or build time via handlebars codegen); custom kernels compiled on register_kernel call. Pipeline cache by shader hash.
Dispatch — bind groups, compute pass encoding, dispatchWorkgroups, command submission. One Rust op per dispatch shape.
WGSL codegen — WgslGenerator (handlebars-rs) renders KernelSpec → WGSL string. Same pattern as typebox-rs's RustGenerator / TypeScriptGenerator. Build-time codegen for built-in ops; runtime compilation for custom kernels.
Readback — copyBufferToBuffer to a mapped read buffer, return ArrayBuffer to JS. The only data-crossing op (explicit, like PyTorch's .cpu() / .numpy()).

The Rust op surface

Minimal — ~4-5 high-level ops, not ~20 low-level GPU API calls:

Op	Signature	Purpose
`create_tensor`	`(data: ArrayBuffer, shape: number[], dtype: string) → BufferId`	Allocate a storage buffer, write initial data
`dispatch_kernel`	`(name: string, inputs: BufferId[], params: object, workgroup_count: [number, number, number]) → BufferId[]`	Look up compiled kernel, bind inputs, dispatch compute pass, return output buffer IDs
`register_kernel`	`(name: string, wgsl: string, input_specs: KernelInputSpec[], output_specs: KernelOutputSpec[]) → void`	Compile custom WGSL, cache by name
`read_tensor`	`(buffer_id: BufferId) → ArrayBuffer`	Copy buffer to mapped read buffer, return data to JS
`write_tensor`	`(buffer_id: BufferId, data: ArrayBuffer) → void`	Overwrite buffer contents from JS

The data-crossing boundary is read_tensor / write_tensor only. A matmul on 4096×4096 f32 tensors is one dispatch_kernel call that passes two input BufferIds and returns one output BufferId; the 64 MiB of floats in each tensor (4096 × 4096 × 4 bytes) never touch JS.
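The handle-only contract can be illustrated with an in-memory stand-in for the Rust side. This is a sketch under assumptions: MockBackend and its camelCase methods are hypothetical, typed arrays replace wgpu buffers, and a toy "add" kernel replaces real dispatch. The point is only that the caller exchanges ids until it asks for a readback.

```typescript
type BufferId = number;

// In-memory stand-in for the Rust/wgpu side. All names here are hypothetical.
class MockBackend {
  private buffers = new Map<BufferId, Float32Array>();
  private nextId: BufferId = 1;

  createTensor(data: Float32Array): BufferId {
    const id = this.nextId++;
    this.buffers.set(id, data.slice()); // data is copied once on the way in
    return id;
  }

  // Only ids cross this boundary; element data stays inside the backend.
  dispatchKernel(name: string, inputs: BufferId[]): BufferId[] {
    if (name !== "add") throw new Error(`unknown kernel: ${name}`);
    const [lhsId, rhsId] = inputs;
    if (lhsId === undefined || rhsId === undefined) throw new Error("add expects two inputs");
    const lhs = this.lookup(lhsId);
    const rhs = this.lookup(rhsId);
    if (lhs.length !== rhs.length) throw new Error("shape mismatch");
    const out = lhs.map((v, i) => v + (rhs[i] ?? 0));
    const outId = this.nextId++;
    this.buffers.set(outId, out);
    return [outId];
  }

  readTensor(id: BufferId): Float32Array {
    return this.lookup(id).slice(); // explicit readback copy
  }

  private lookup(id: BufferId): Float32Array {
    const buf = this.buffers.get(id);
    if (buf === undefined) throw new Error(`unknown buffer ${id}`);
    return buf;
  }
}

const backend = new MockBackend();
const aId = backend.createTensor(new Float32Array([1, 2, 3]));
const bId = backend.createTensor(new Float32Array([4, 5, 6]));
const [sumId] = backend.dispatchKernel("add", [aId, bId]);
const sum = sumId === undefined ? new Float32Array(0) : backend.readTensor(sumId);

// Size of one 4096 x 4096 f32 tensor that a real dispatch would keep on the GPU.
const bytesPerTensor = 4096 * 4096 * 4;
```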

The codegen pipeline

Build time:
  OpSpec[] (declarative, from op table)
    → opgen transform (opgen.ts logic, in Rust or JS)
    → KernelSpec[] (compute-pass descriptions)
    → WgslGenerator (handlebars-rs) renders each KernelSpec → WGSL string
    → wgpu pre-compiles each WGSL → ShaderModule (cached by name)

Runtime (built-in ops):
  JS calls dispatch_kernel("matmul", [a_id, b_id], params, count)
  → Rust looks up cached pipeline for "matmul"
  → binds buffers, dispatches, returns output BufferId

Runtime (custom kernels):
  JS calls register_kernel("my_op", wgsl_string, inputs, outputs)
  → Rust compiles WGSL via wgpu::ShaderModule
  → caches pipeline by name
  → subsequent dispatch_kernel("my_op", ...) uses the cached pipeline

The WgslGenerator is the natural third backend in typebox-rs's codegen module:

typebox-rs/src/codegen/
├── mod.rs          — pub use RustGenerator, TypeScriptGenerator, WgslGenerator
├── rust.rs         — Schema → Rust structs (existing)
├── typescript.rs   — Schema → TS interfaces (existing)
└── wgsl.rs         — KernelSpec → WGSL shader (new)

The WGSL template encodes the scaffolding from webgpu-torch's getKernelShaderCode (kernel.ts:299-375): struct declarations, @group(0) @binding(N) declarations, @compute @workgroup_size header, conditional @builtin inclusion. One handlebars template with {{#each inputs}}, {{#each outputs}}, {{#if uses_global_id}} blocks.

Downstream Problems Solved

This wasn't the original target, but the tensor architecture solves several planned problems as a side effect:

1. Distributed compute over irpc

Every tensor op is an OperationSpec on the registry (verified protocol-compatible on quickjs by POC 2). A matmul called locally dispatches on the local GPU. The same matmul called over irpc dispatches on a peer's GPU. This is the "vast.ai instance" deployment story with a concrete protocol backing it — no separate RPC layer needed, the operations registry is the RPC layer.

Distributed training follows: gradient ops, optimizer steps, and parameter sync are all operations, callable locally or remotely, with ACL enforcement on who can touch which model weights. Gradient sync across nodes is read_tensor + irpc write_tensor to the remote buffer.

2. LLM-authored model code (toolEnv pattern)

An agent emits JS that constructs an nn.Sequential and registers it as an operation, with allowFetch: false / allowFs: false sandboxing (the toolEnv privilege model from /workspace/toolEnv/core/sandbox/). The JS runs in a quickjs isolate, the compute runs in Rust/wgpu, the agent never touches the GPU directly. "MCP with scripting capabilities" extended to model authoring — an LLM composes a model architecture from declarative nn modules, the heavy ops execute on GPU.

3. Edge/embedded tensor compute

QuickJS-NG's 210 KiB footprint + wgpu's cross-platform backends (including llvmpipe software fallback) means tensor compute works where PyTorch can't fit — no Python runtime, no CUDA dependency, no large native binaries. The same JS model code runs on a server GPU (Vulkan/Metal/DX12), a laptop (same), or a headless box (llvmpipe, slower but functional).

4. The compositing problem from alknet-desktop

The alknet-desktop research doc flagged "compositing 3D + 2D onto one surface" as an open unknown. Tensor compute sidesteps it entirely — compute pipelines have no surface, no swapchain, no present loop. The compositing complexity is a render problem; tensor ops are pure compute. This makes alknet-tensor structurally simpler than alknet-desktop despite being a "heavier" workload.

5. Cross-platform by construction, not configuration

wgpu's "one API, many backends" design means the same WGSL shaders and the same dispatch code run on Vulkan (Linux), Metal (macOS), DX12 (Windows), and software Vulkan via llvmpipe (any host without a GPU). No #ifdef CUDA, no "Linux is second-class", no platform-specific build matrix. The op table is WGSL strings; the execution is wgpu; the platform is whatever wgpu supports, which currently covers every target listed here.

Relationship to alknet-desktop

alknet-tensor shares the verified substrate with alknet-desktop (quickjs + wgpu + the operations protocol) but is a separate concern:

	alknet-desktop	alknet-tensor
wgpu usage	Render passes, surfaces, swapchain, compositing	Compute passes only — no surface, no swapchain
GPU op surface	~25-40 ops (browser globals for three.js + surface management)	~4-5 ops (create/dispatch/register/read/write)
JS layer	ujsx reconciler + HostConfig (3D + 2D UI composition)	Op table + autograd graph + nn module hierarchy
Rust layer	winit window + wgpu surface + three.js browser-env shims	wgpu buffer manager + kernel compiler + WGSL codegen
Complexity driver	The 3D+2D compositing and three.js shim surface	The autograd graph correctness and kernel codegen
Network model	Desktop worker dials head, renders UI	Tensor ops callable locally or over irpc; distributed training is ops on the registry

They could share a crate (same quickjs runtime, same wgpu instance — a desktop app that also does tensor compute) or be separate crates (a pure compute server with no window). The operations registry is the shared seam — both register ops on the same protocol.

Open Unknowns

1. Where does the op table live — Rust or JS?

If built-in ops are Rust-side (specs compiled at build time via handlebars WgslGenerator, kernels pre-registered), JS just calls matmul(a, b) and Rust looks up the compiled kernel. Fast, simple, fixed op surface.

If the op table stays JS-side (op specs as data in JS, sent to Rust at init to compile), it's more flexible — swap op implementations at runtime, let users inspect/override specs, let LLMs generate new ops. Adds a startup cost and more JS↔Rust traffic at init.

Recommendation: Rust-side for built-ins (build-time codegen, pre-compiled), JS-side register_kernel for custom/user-defined ops. Gets both perf and flexibility. The OperationSpec wrapper on the registry is what makes them network-callable regardless of where the kernel was compiled.

2. Does `opgen.ts`'s `ExprCode` parser/compiler port cleanly to Rust?

The ExprCode system (src/expr.ts) parses forward/backward expressions like "output = abs(input)" and compiles them to shader fragments. This is the one non-trivial JS piece in stage 2. If it ports to Rust (via nom or pest or hand-rolled), stage 2 moves entirely to Rust and the op table becomes pure data that never touches JS. If it doesn't port cleanly, stage 2 stays in JS and sends KernelSpec to Rust at init.

Probeable: read src/expr.ts, assess the parser complexity. If it's regex + string substitution (likely, given the WGSL target), the Rust port is mechanical. If it's a recursive-descent parser with non-trivial precedence handling, more work.

3. Autograd graph correctness

webgpu-torch's autograd (src/autograd.ts, 112 lines) is compact but subtle — GradientContext, saveForBackward, needsInputGradient, the backward dispatch. Porting the design to JS-on-quickjs is straightforward (it's pure bookkeeping), but verifying gradient correctness across the op table requires a test harness. PyTorch's torch.autograd.gradcheck (numerical gradient verification) is the reference approach — finite-difference against analytical gradients.

Probeable: implement gradcheck as an operation on the registry, run it against a subset of the op table (abs, add, matmul, conv2d) to verify the backward expressions are correct. This is a test problem, not an architecture problem.
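The finite-difference comparison behind gradcheck is small enough to sketch on plain number arrays. This assumes scalar-valued functions and central differences; the function names, the step size, and the tolerance are illustrative choices, not taken from PyTorch or webgpu-torch.

```typescript
type ScalarFn = (x: number[]) => number;
type GradFn = (x: number[]) => number[];

// Central difference: df/dx_i is approximated by (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
function numericalGradient(f: ScalarFn, x: number[], eps = 1e-4): number[] {
  return x.map((xi, i) => {
    const plus = x.slice();
    const minus = x.slice();
    plus[i] = xi + eps;
    minus[i] = xi - eps;
    return (f(plus) - f(minus)) / (2 * eps);
  });
}

// True when every analytical component is within `tol` of the numerical one.
function gradcheck(f: ScalarFn, analytical: GradFn, x: number[], tol = 1e-3): boolean {
  const numeric = numericalGradient(f, x);
  const exact = analytical(x);
  if (exact.length !== numeric.length) return false;
  return numeric.every((n, i) => Math.abs(n - (exact[i] ?? NaN)) <= tol);
}

// Example: f(x) = sum of squares, whose gradient is 2x.
const sumOfSquares: ScalarFn = (x) => x.reduce((acc, v) => acc + v * v, 0);
const correctGrad: GradFn = (x) => x.map((v) => 2 * v);
const wrongGrad: GradFn = (x) => x.map((v) => v);

const passes = gradcheck(sumOfSquares, correctGrad, [1, -2, 3]);
const fails = gradcheck(sumOfSquares, wrongGrad, [1, -2, 3]);
```

In the tensor setting, f would run forward kernels and read the result back, so a harness like this is slow by design and belongs in tests only.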

4. Buffer management strategy

webgpu-torch uses a FinalizationRegistry-driven buffer pool in JS (src/device_webgpu.ts:13-50): when a JS tensor is GC'd, the underlying GPUBuffer returns to the pool. Under alknet-tensor, Rust owns the buffers, so the pool is a Rust HashMap with explicit drop_buffer(id) or reference counting. The question is the lifecycle model: explicit tensor.dispose() (manual, as in TensorFlow.js), RAII via Rust's Drop (automatic when the JS handle is GC'd and Rust is notified), or a pool with eviction.

Recommendation: explicit dispose() for now (simplest; PyTorch has no direct equivalent because it frees tensors by reference counting, and .detach() only cuts a tensor out of the autograd graph), with a Rust-side leak detector that warns if buffers aren't disposed. RAII-via-GC-notification is a later optimization.
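The explicit-dispose model plus leak detection amounts to a registry of live buffers. The sketch below models it in TypeScript for illustration only; in the proposed design this bookkeeping would live in Rust, and BufferRegistry with its methods is a hypothetical name.

```typescript
type BufferId = number;

interface LiveBuffer {
  bytes: number;
  label: string; // e.g. the op that allocated it, for leak reports
}

// Hypothetical model of the Rust-side registry.
class BufferRegistry {
  private live = new Map<BufferId, LiveBuffer>();
  private nextId: BufferId = 1;

  allocate(bytes: number, label: string): BufferId {
    const id = this.nextId++;
    this.live.set(id, { bytes, label });
    return id;
  }

  // Explicit release; a second dispose of the same id is reported as an error.
  dispose(id: BufferId): void {
    if (!this.live.delete(id)) {
      throw new Error(`buffer ${id} is unknown or already disposed`);
    }
  }

  // Leak report: everything still registered when the caller considers work done.
  leaks(): string[] {
    return Array.from(this.live.values()).map((b) => `${b.label} (${b.bytes} bytes)`);
  }
}

const registry = new BufferRegistry();
const weights = registry.allocate(1024, "conv1.weight");
const scratch = registry.allocate(4096, "matmul.scratch");
registry.dispose(scratch);
const leakReport = registry.leaks();
```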

5. Multi-GPU and multi-queue

wgpu supports multiple adapters and queues. For distributed training across GPUs on one machine (or across machines via irpc), the dispatch needs to target a specific queue/adapter. The BufferId likely needs to be (AdapterId, BufferId) or the dispatch op takes an optional device parameter. Not a blocker for v1 (single-GPU), but the op signatures should be designed to accept it.

6. typebox-rs simplification (serde + jsonschema)

typebox-rs should be rewritten to use serde + jsonschema instead of the hand-rolled schema system. This simplifies the schema layer and makes KernelSpec / OpSpec directly serde-serializable (for irpc transport, for config files, for LLM-generated op specs). The codegen layer (handlebars-rs + templates) stays; only the input schema type changes. This is a prerequisite for clean KernelSpec serialization over the wire.

Recommended Next POCs

In priority order:

1. WGSL codegen probe: write the WgslGenerator handlebars template against KernelSpec, render all ~100 ops from op_table.ts, diff output against getKernelShaderCode's output. If they match, the Rust codegen path is proven. Half-day exercise.
2. ExprCode parser assessment: read src/expr.ts, determine if the parser ports to Rust cleanly. If yes, stage 2 moves to Rust entirely. If no, stage 2 stays in JS and sends KernelSpec to Rust at init.
3. End-to-end compute skeleton: Rust crate that creates a wgpu device on llvmpipe, exposes create_tensor / dispatch_kernel / read_tensor to quickjs, and runs a hardcoded matmul. Proves the ~4-op Rust surface is sufficient and the buffer management works. One day.
4. gradcheck test harness: implement finite-difference gradient verification as an operation, run against a subset of the op table. Proves the autograd design is correct before porting the full graph. Half-day.

Compute Graphs: flowgraph + ujsx as the Execution Layer

Location: /workspace/@alkdev/flowgraph (npm-published, uses ujsx)

Relevance: Replaces webgpu-torch's imperative autograd + nn module hierarchy with a declarative, reactive, graph-validated compute graph authoring and execution system. This is the CUDA-graphs-shaped layer, and it's already built.

The insight

webgpu-torch's nn.Module hierarchy is an imperative call-graph: you write forward(x) that chains op calls, and autograd records the graph as a side effect. flowgraph inverts this — you write the graph declaratively as a ujsx tree, the graph is validated before execution, and reactive signals drive the execution. The ujsx tree is the compute graph, and the existing @alkdev/flowgraph library already implements this for the operations protocol that alknet-tensor uses.

What flowgraph provides

flowgraph sits between @alkdev/operations (what can be called) and execution. It defines three graphs:

Operation Graph — static graph built from OperationSpecs at startup. Nodes are operations, edges are type-compatibility relationships. Enables cycle detection, topological ordering, validation.
Call Graph — dynamic graph built from call protocol events at runtime. Nodes are call invocations with status/timestamps, edges are parent-child. Enables abort cascading and observability.
Workflow Template — declarative ujsx tree defining a reusable workflow structure. A validated path through the operation graph, instantiated as a call graph at runtime.

The graph is the specification. The template is the authoring surface. The call graph is the execution record.
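Cycle detection and topological ordering on such a graph can be sketched with Kahn's algorithm. This is a generic illustration on string ids, not flowgraph's implementation (which delegates to graphology-dag); the function name and the data-flow edges in the example are assumptions.

```typescript
type Edge = [string, string]; // [from, to]

// Kahn's algorithm: returns a topological order, or null when the graph has a cycle.
function topologicalOrder(nodes: string[], edges: Edge[]): string[] | null {
  const inDegree = new Map<string, number>();
  const successors = new Map<string, string[]>();
  for (const n of nodes) {
    inDegree.set(n, 0);
    successors.set(n, []);
  }
  for (const [from, to] of edges) {
    successors.get(from)?.push(to);
    inDegree.set(to, (inDegree.get(to) ?? 0) + 1);
  }
  // Start from nodes with no incoming edges.
  const ready = nodes.filter((n) => inDegree.get(n) === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const n = ready.shift() as string;
    order.push(n);
    for (const s of successors.get(n) ?? []) {
      const remaining = (inDegree.get(s) ?? 0) - 1;
      inDegree.set(s, remaining);
      if (remaining === 0) ready.push(s);
    }
  }
  // Any node left unvisited sits on a cycle.
  return order.length === nodes.length ? order : null;
}

const ops = ["tensor.relu", "tensor.conv2d", "tensor.matmul"];
const flow: Edge[] = [
  ["tensor.conv2d", "tensor.relu"],
  ["tensor.relu", "tensor.matmul"],
];
const validOrder = topologicalOrder(ops, flow);
const cyclicOrder = topologicalOrder(ops, [...flow, ["tensor.matmul", "tensor.conv2d"]]);
```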

The workflow components (/workspace/@alkdev/flowgraph/src/component/):

<Operation name="tensor.matmul" input={...} /> — a single op call, like a kernel launch
<Sequential> — ordered execution, outputs flow to inputs (CUDA stream ordering)
<Parallel maxConcurrency={n}> — concurrent execution (multiple CUDA streams)
<Conditional test={(results) => ...}> — data-dependent branching (no CUDA-graph equivalent — strictly more powerful)
<Map over={items} as="item"> — fan-out over a collection (batched dispatch)

The two host configs

flowgraph ships two HostConfig implementations (/workspace/@alkdev/flowgraph/src/host/):

GraphologyHostConfig (graphology.ts): renders the ujsx tree into a DAG and validates it against the operation graph (cycle detection via hasCycle, type-compatibility edges, topological sort). This is the compile step, comparable to CUDA stream capture (cudaStreamBeginCapture / cudaStreamEndCapture) building a graph from recorded ops, but declarative and validated before execution.

ReactiveHostConfig (reactive.ts): renders the ujsx tree into a reactive execution structure where node statuses (idle → waiting → ready → running → completed/failed/aborted) are @preact/signals-core signals. computePreconditions checks all predecessors completed, computeBlockedByFailure propagates abort cascades, registerStartEffect reactively transitions idle→ready when preconditions are met (/workspace/@alkdev/flowgraph/src/reactive/node-status.ts). This is the execute step, comparable to cudaGraphLaunch but with dynamic status propagation.
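The status rules described for the reactive host can be approximated without signals, as a fixed-point loop over plain objects. This is a simplified sketch, not flowgraph's code: FlowNode, preconditionsMet, blockedByFailure, and propagate are hypothetical names, and a polling loop stands in for signal-driven effects.

```typescript
type Status = "idle" | "waiting" | "ready" | "running" | "completed" | "failed" | "aborted";

interface FlowNode {
  id: string;
  status: Status;
  predecessors: string[];
}
type NodeMap = Map<string, FlowNode>;

// A node may start once every predecessor has completed.
function preconditionsMet(node: FlowNode, nodes: NodeMap): boolean {
  return node.predecessors.every((p) => nodes.get(p)?.status === "completed");
}

// A node is blocked as soon as any predecessor failed or was aborted.
function blockedByFailure(node: FlowNode, nodes: NodeMap): boolean {
  return node.predecessors.some((p) => {
    const s = nodes.get(p)?.status;
    return s === "failed" || s === "aborted";
  });
}

// Re-evaluate pending nodes until nothing changes, so aborts cascade downstream.
function propagate(nodes: NodeMap): void {
  let changed = true;
  while (changed) {
    changed = false;
    for (const node of Array.from(nodes.values())) {
      if (node.status !== "idle" && node.status !== "waiting") continue;
      if (blockedByFailure(node, nodes)) {
        node.status = "aborted";
        changed = true;
      } else if (preconditionsMet(node, nodes)) {
        node.status = "ready";
        changed = true;
      }
    }
  }
}

const flowNodes: NodeMap = new Map();
const addNode = (id: string, status: Status, predecessors: string[]): void => {
  flowNodes.set(id, { id, status, predecessors });
};
addNode("load", "completed", []);
addNode("matmul", "idle", ["load"]);
addNode("relu", "idle", ["matmul"]);
addNode("store", "idle", ["relu"]);
propagate(flowNodes);
```

With signals, each of these checks becomes a computed value and the loop disappears: a status change on one node re-evaluates only its dependents.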

Both run on the same ujsx reconciler + signals-core that POC 2 verified on QuickJS-NG.

How this changes alknet-tensor

The nn module hierarchy becomes flowgraph templates. You don't port webgpu-torch's nn_module.ts Module class — you replace it with ujsx components:

// Instead of webgpu-torch's imperative Module:
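class ConvNet extends Module {
  constructor() {
    super(); // a derived class must call super() before using `this`
    this.conv1 = new Conv2d(1, 20, 5);
    this.conv2 = new Conv2d(20, 20, 5);
  }
  // Class instances are not callable, so each layer is run through forward()
  forward(x) { return this.conv2.forward(this.conv1.forward(x).relu()).relu(); }
}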

// alknet-tensor's declarative template:
const ConvNet = () => (
  <Sequential>
    <Operation name="tensor.conv2d" input={{ weight: w1, stride: 1 }} />
    <Operation name="tensor.relu" />
    <Operation name="tensor.conv2d" input={{ weight: w2, stride: 1 }} />
    <Operation name="tensor.relu" />
  </Sequential>
);

The autograd graph is the ujsx tree. Each <Operation> node knows its backward kernel (from the OpSpec's backward expression). backward() walks the tree in reverse, dispatching backward kernels via the same flowgraph execution model. The GradientContext and saveForBackward bookkeeping from webgpu-torch's autograd (src/autograd.ts) becomes per-node state in the reactive host. The graph is declarative and inspectable before execution, not constructed as a side effect of running the forward pass — strictly cleaner than PyTorch's imperative autograd.
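For a purely sequential template, the reverse walk is a short recursion. The node shapes and the backwardOrder helper below are hypothetical, and the sketch covers only operation and sequential nodes; Parallel, Conditional, and Map would need their own rules.

```typescript
// Hypothetical, minimal node shapes standing in for rendered ujsx elements.
interface OpNode {
  kind: "operation";
  id: string;
  name: string;
}
interface SeqNode {
  kind: "sequential";
  children: GraphNode[];
}
type GraphNode = OpNode | SeqNode;

const op = (id: string, name: string): OpNode => ({ kind: "operation", id, name });
const seq = (...children: GraphNode[]): SeqNode => ({ kind: "sequential", children });

// Returns operation ids in the order their backward kernels would be dispatched:
// last forward op first, recursing into nested sequences.
function backwardOrder(node: GraphNode): string[] {
  if (node.kind === "operation") return [node.id];
  const order: string[] = [];
  for (let i = node.children.length - 1; i >= 0; i--) {
    const child = node.children[i];
    if (child !== undefined) order.push(...backwardOrder(child));
  }
  return order;
}

const convNet = seq(
  op("conv1", "tensor.conv2d"),
  op("relu1", "tensor.relu"),
  seq(op("conv2", "tensor.conv2d"), op("relu2", "tensor.relu")),
);
const backwardIds = backwardOrder(convNet);
```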

Training loops are nested templates. Composability is free because workflows are ujsx trees:

const TrainingStep = ({ batch, labels }) => (
  <Sequential>
    <Operation name="model.forward" input={{ x: batch }} />
    <Operation name="loss.crossEntropy" input={{ predictions: "$.output", labels }} />
    <Operation name="model.backward" />  {/* walks the forward graph in reverse */}
    <Operation name="optim.step" input={{ params: "$.model.params", grads: "$.grads" }} />
  </Sequential>
);

const Epoch = ({ dataset }) => (
  <Map over={dataset} as="batch">
    <TrainingStep batch={batch.x} labels={batch.y} />
  </Map>
);

CUDA-graphs-like capture and replay, but better:

# PyTorch CUDA graph (Python):
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = model(input)
g.replay()  # re-run the captured graph

// alknet-tensor with flowgraph + ujsx:
const model = <Sequential>...</Sequential>;
// The ujsx tree IS the captured graph — declarative, not imperative capture.
// Replay = render(model) against the ReactiveHostConfig.
// The reconciler diffs the tree; only changed props re-dispatch.
// Conditional/Map allow dynamic structure that a statically captured CUDA graph can't express.

Network-callable compute graphs

Since operations are OperationSpecs on the registry, a workflow template can mix local and remote ops:

const Distributed = () => (
  <Parallel>
    <Operation name="tensor.matmul" input={{ a, b }} />       {/* local GPU */}
    <Operation name="remote.gpu1.matmul" input={{ a, b }} />   {/* peer GPU via irpc */}
    <Operation name="remote.gpu2.matmul" input={{ a, b }} />   {/* another peer */}
  </Parallel>
);

Same template, same execution model, different target. The Parallel host dispatches all three concurrently; the reactive status system tracks which completed; the results are collected. Distributed training is a workflow template, not a separate system.

TSX authoring

flowgraph's components (Operation, Sequential, Parallel, Conditional, Map) are UComponent functions that return {type, props, children} — the exact ujsx element shape. Authoring in TSX is sugar for h() calls:

<Sequential><Operation name="tensor.relu" /></Sequential>
// is sugar for:
h(Sequential, {}, h(Operation, { name: "tensor.relu" }))

The TSX→h transform is a build step. Rust tooling such as swc (swc_ecma_parser) or oxc can parse TSX and apply the standard JSX→h transform, with ujsx's jsx-runtime.ts (/workspace/@alkdev/ujsx/src/core/jsx-runtime.ts) as the target. The runtime sees UElement trees either way; TSX is authoring ergonomics, not a runtime concern.
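What the transform emits can be shown with a stand-in h function. This is a sketch of the element shape only: the h, UElement, Operation, and Sequential definitions below are hypothetical and written for the example, so the real ujsx and flowgraph signatures may differ.

```typescript
type Props = Record<string, unknown>;

interface UElement {
  type: string | Component;
  props: Props;
  children: UElement[];
}
type Component = (props: Props & { children: UElement[] }) => UElement;

// Stand-in element factory: builds the {type, props, children} shape, nothing more.
function h(type: string | Component, props: Props | null, ...children: UElement[]): UElement {
  return { type, props: props ?? {}, children };
}

// Hypothetical components; the real ones live in flowgraph.
const Operation: Component = (props) => ({ type: "operation", props, children: [] });
const Sequential: Component = (props) => ({ type: "sequential", props: {}, children: props.children });

// Equivalent of: <Sequential><Operation name="tensor.relu" /></Sequential>
const tree = h(Sequential, {}, h(Operation, { name: "tensor.relu" }));
```

The factory never calls the components; it records them as the element type and leaves invocation to the reconciler, which is why the same tree can be handed to either host config.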

Graph ops in Rust (petgraph), not JS (graphology)

flowgraph currently uses graphology + graphology-dag (~5400 lines of JS). The actual API surface flowgraph touches is small — ~15 distinct methods:

graphology / graphology-dag API	petgraph equivalent
`new DirectedGraph()`	`DiGraph::new()`
`.addNode(id, attrs)`	`graph.add_node(attrs)` → returns `NodeIndex`
`.addEdgeWithKey(key, source, target, attrs)`	`graph.add_edge(source, target, attrs)` → returns `EdgeIndex`
`.dropEdge(source, target)`	`graph.remove_edge(edge_idx)`
`.hasNode(id)`	`graph.node_weight(idx).is_some()` (`contains_node` exists on `StableGraph` and `GraphMap`, not on `Graph`)
`.hasEdge(source, target)` / `.hasDirectedEdge(...)`	`graph.find_edge(n1, n2).is_some()`
`.nodes()` / `.edges()`	`graph.node_indices()` / `graph.edge_indices()`
`.order` / `.size` (properties)	`graph.node_count()` / `graph.edge_count()`
`.inDegree(id)` / `.outDegree(id)`	`graph.neighbors_directed(idx, Incoming/Outgoing).count()`
`.forEachNode(cb)` / `.forEachEdge(cb)`	`graph.node_indices().for_each(...)`
`hasCycle(graph)`	`petgraph::algo::is_cyclic_directed(graph)`
`topologicalSort(graph)`	`petgraph::algo::toposort(&graph, None)`
`willCreateCycle(graph, source, target)`	add edge, check `is_cyclic_directed`, rollback; or check whether a path exists from target to source (`petgraph::algo::has_path_connecting`)
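The path-based variant of willCreateCycle in the last row is worth spelling out, since it avoids mutating the graph. The sketch below uses a plain adjacency map keyed by string id; hasPath and this willCreateCycle are illustrative stand-ins, not graphology-dag's or petgraph's implementation.

```typescript
type Adjacency = Map<string, string[]>;

// Iterative depth-first search: is `to` reachable from `from`?
function hasPath(adj: Adjacency, from: string, to: string): boolean {
  const stack: string[] = [from];
  const seen = new Set<string>();
  while (stack.length > 0) {
    const n = stack.pop() as string;
    if (n === to) return true;
    if (seen.has(n)) continue;
    seen.add(n);
    for (const next of adj.get(n) ?? []) stack.push(next);
  }
  return false;
}

// Adding source -> target closes a cycle exactly when target already reaches source.
function willCreateCycle(adj: Adjacency, source: string, target: string): boolean {
  return hasPath(adj, target, source);
}

const dag: Adjacency = new Map();
dag.set("a", ["b"]);
dag.set("b", ["c"]);
dag.set("c", []);

const closesLoop = willCreateCycle(dag, "c", "a"); // a already reaches c
const staysAcyclic = willCreateCycle(dag, "a", "c"); // c does not reach a
```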

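Every graphology operation flowgraph uses maps to one or two petgraph calls. The one structural difference is that graphology addresses nodes by string id while petgraph hands back NodeIndex handles, so the Rust host also keeps an id → NodeIndex lookup table. Porting the graph layer to Rust: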

Removes ~5400 lines of JS from the runtime (graphology + graphology-dag), shrinking the quickjs module load surface
Makes graph operations native-speed (petgraph is already in the alknet dependency tree as a standard Rust crate)
Enables graph validation to happen in Rust before the template is handed to the JS reactive host
Keeps the ujsx tree authoring + reactive execution in JS (where the reconciler + signals-core handle the dynamic status propagation)

The GraphologyHostConfig becomes a Rust-backed host that builds a petgraph::DiGraph instead of a graphology DirectedGraph, exposing the graph to JS only for inspection (not manipulation). The ReactiveHostConfig stays in JS — it's signals and status propagation, which is what quickjs is good at.

What this eliminates from the architecture

nn_module.ts port — replaced by flowgraph ujsx components. No Module base class, no Parameter wrapper, no StateDict serialization — those become flowgraph template inspection and registry queries.
Imperative autograd recording — replaced by declarative graph. The backward pass walks the ujsx tree, not a recorded tape. The graph is known before execution, not reconstructed after.
graphology JS dependency — replaced by petgraph in Rust. ~5400 lines of JS removed from the runtime.
Custom graph validation — flowgraph's validateTemplate already does cycle detection, type compatibility, topological ordering. This is graph validation that PyTorch and CUDA graphs don't have.

What flowgraph doesn't provide (stays in alknet-tensor)

The tensor ops themselves — tensor.matmul, tensor.conv2d, tensor.relu etc. are still Rust-side wgpu compute kernels, exposed as OperationSpecs on the registry. flowgraph orchestrates them; it doesn't implement them.
Buffer management — still Rust-owned wgpu::Buffer with BufferId handles in JS (the ~4-5 Rust ops from the architecture section above).
WGSL codegen — still WgslGenerator (handlebars-rs) rendering KernelSpec → WGSL. flowgraph is orthogonal to kernel compilation.
gradcheck — finite-difference gradient verification, still a test harness operation.

Updated Recommended Next POCs