docs(compute): add wonnx + handlebars/wgpu reference implementations

Document the two codebases that inform the ShaderGenerator's op table and the wgpu+handlebars+remote-GPU patterns:

- wonnx (MIT/Apache-2.0, archived): comprehensive ONNX op set in Tera-templated WGSL at wonnx/templates/ (arithmetic, activation, gemm, conv, batchnorm, softmax, etc.). Port the shader implementations, swap Tera for handlebars. compiler.rs's add_raw_template + include_str! pattern maps 1:1 to handlebars-rs register_template_string.
- Handlebars + wgpu + remote-GPU patterns (private reference, patterns reusable): validates the handlebars-rs side and the vast.ai deployment shape. Patterns carried over: {{> partial}} includes for shared fragments, inline-able constant tables via switch statements (SHA-256 k-values, universal across wgpu versions), default-valued template parameters, wgpu-on-remote-GPU sync. sha256 as a base shader demonstrating non-ML compute on the same dispatch surface.

Updated the WGSL codegen probe POC to reference wonnx's op set as the porting source.
2026-06-30 13:05:54 +00:00
parent 303b9a58e2
commit 3edc42e3b4
1 changed files with 25 additions and 1 deletions
--- a/docs/research/alknet-compute/architecture-summary.md
+++ b/docs/research/alknet-compute/architecture-summary.md
@@ -200,6 +200,28 @@ typebox-rs/src/codegen/

 The WGSL template encodes the scaffolding from webgpu-torch's `getKernelShaderCode` (`kernel.ts:299-375`): struct declarations, `@group(0) @binding(N)` declarations, `@compute @workgroup_size` header, conditional `@builtin` inclusion. One handlebars template with `{{#each inputs}}`, `{{#each outputs}}`, `{{#if uses_global_id}}` blocks. The trait abstraction means a SPIR-V or GLSL template can be added later without changing `KernelSpec` or the opgen transform — only the final render step is language-specific.

+### Reference implementations for the op table
+
+Two existing codebases inform the shader templates and the handlebars + wgpu + remote-GPU patterns. wonnx is freely usable (MIT/Apache-2.0); the second is a private project, so only its patterns carry over:
+
+**wonnx** (`/workspace/wonnx`, archived, MIT/Apache-2.0) — an ONNX-runtime-on-wgpu project whose `wonnx/templates/` directory is a comprehensive, already-implemented ONNX op set in shader-template form. The templates use Tera (`{% %}`/`{{ }}`), not handlebars, but the shader *implementations* are the asset — porting them to handlebars is mechanical. The op coverage spans what alknet-compute's built-in table needs:
+
+- `templates/endomorphism/` — `arithmetic.wgsl` (add/sub/mul/pow/PRelu, binary + scalar-broadcast), `activation.wgsl` (ReLU/Sigmoid/Tanh/etc. via `snippets/activation_scalar.wgsl` + `activation_vec.wgsl`), `batchnormalization.wgsl`, `softmax.wgsl`, `cast.wgsl`, `map.wgsl`, `gather.wgsl`, `onehot.wgsl`, `broadcast.wgsl`
+- `templates/matrix/` — `gemm.wgsl` + `gemm_1.wgsl` (matmul + matvec), `concat.wgsl`, `transpose.wgsl`, `split.wgsl`, `pad.wgsl`, `resize.wgsl`
+- `templates/pool/` — `conv.wgsl` (+ `conv_kernel_1.wgsl` / `conv_kernel_3.wgsl` specializations), `aggregate.wgsl`, `reduce.wgsl`
+- `templates/structs.wgsl` — the `Scalar`/`Vec4`/`Array` aliases, the `scalar_type_is_float` conditional for matrix types. Exactly the scaffolding pattern the `ShaderGenerator` needs to emit per-dtype.
+
+The `structs.wgsl` partial pattern (aliased types per scalar_type, conditional float-only matrix structs) is the template-includes shape that handlebars `{{> partial}}` mirrors directly. The `endomorphism/arithmetic.wgsl` shape (conditional binary-vs-scalar-broadcast via `{% if i_lens | length == 2 %}`, `{{ op_type }}` operator substitution) maps closely to handlebars `{{#if}}`/`{{op_type}}`; the one adjustment is that `{{#if}}` tests a value rather than an inline expression, so the length comparison moves into a helper subexpression or a precomputed boolean in the render context. wonnx's `compiler.rs` uses `tera::Tera` with `add_raw_template` + `include_str!` (compile-time embed); the handlebars-rs equivalent is `handlebars::Handlebars::new()` with `register_template_string` + `include_str!`. Same pattern, different crate.
+
+**Handlebars + wgpu + remote-GPU patterns** — a separate, production-deployed wgpu project (private, patterns reusable) validates the handlebars-rs side of the stack and the deployment shape. The patterns worth carrying into alknet-compute, independent of that project's application:
+
+- **`{{> partial}}` includes for shared shader fragments.** `sha256_header`, `sha256`, `rng` are partials included into the top-level template — exactly the `structs.wgsl`-as-partial pattern from wonnx, in handlebars. alknet-compute's `ShaderGenerator` should support partials for shared WGSL fragments (struct declarations, common helpers, activation snippets).
+- **Inline-able constant tables via switch statements.** SHA-256's round constants are the canonical example: the 64 k-values are inlined as a `fn k_value(i: u32) -> u32 { switch(i) { case 0u: { return 0x428a2f98u; } ... } }` block, generated by handlebars from a data table. This is the universally-compatible approach — `const` array initializers have had backend-specific support gaps across wgpu versions, but a switch-statement function compiles everywhere. The pattern generalizes to any shader that needs compile-time constant tables (S-boxes, lookup tables, weight preprocessor tables). The `ShaderGenerator` should expose a `{{#each constants}}` block that emits either a `const` array (on backends/wgpu versions that support it) or a switch-statement fallback (universal), chosen by template — not hand-written per shader.
+- **Default-valued template parameters.** `{{#if workgroup_size}}{{workgroup_size}}{{else}}256{{/if}}` — shaders should have sane defaults so a dispatch without explicit workgroup size compiles. The `KernelSpec` schema should mark which parameters are required vs defaultable, and the handlebars template renders defaults for the latter.
+- **wgpu + remote-GPU (vast.ai) sync.** The project dispatches wgpu compute to remote GPU instances over SSH/sync — the same deployment shape alknet-compute's distributed-training-over-irpc targets, but at a lower layer (direct wgpu on a remote box, not ops over the registry). The lesson: wgpu-on-remote-GPU works, the sync model is straightforward, and alknet-compute's `from_call`-imported remote `tensor.matmul` ops are the registry-layer version of the same deployment topology.
+
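The constant-table and default-parameter patterns in the list above can be sketched without any template engine, just to show the WGSL text the templates are expected to produce. This is a minimal illustration in plain Rust, not the reference project's code: `render_switch_table` and `render_workgroup_attr` are hypothetical names, the `default` arm's return value of `0u` is an arbitrary choice, and only the first four of the 64 SHA-256 round constants are included.

```rust
/// Render a WGSL lookup function as a switch statement (the universal fallback).
/// Hypothetical helper: the name and signature are illustrative only.
fn render_switch_table(fn_name: &str, table: &[u32]) -> String {
    let mut out = format!("fn {fn_name}(i: u32) -> u32 {{\n    switch (i) {{\n");
    for (i, value) in table.iter().enumerate() {
        // One case per table entry, with the value emitted as a u32 hex literal.
        out.push_str(&format!(
            "        case {i}u: {{ return 0x{value:08x}u; }}\n"
        ));
    }
    // WGSL requires a default clause in every switch statement.
    out.push_str("        default: { return 0u; }\n    }\n}\n");
    out
}

/// Render the entry-point attributes, falling back to 256 when no size is given.
/// Mirrors `{{#if workgroup_size}}{{workgroup_size}}{{else}}256{{/if}}`.
fn render_workgroup_attr(workgroup_size: Option<u32>) -> String {
    format!("@compute @workgroup_size({})", workgroup_size.unwrap_or(256))
}

fn main() {
    // First four SHA-256 round constants only; the full table has 64 entries.
    let k = [0x428a2f98u32, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5];
    print!("{}", render_switch_table("k_value", &k));
    println!("{}", render_workgroup_attr(None));
}
```

In the handlebars version the loop would presumably become an `{{#each constants}}` block and the `Option` would become the `{{#if}}`/`{{else}}` default, but the rendered WGSL should have the same shape.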
+**sha256 as a base shader.** The SHA-256 kernel (constant-inlining via switch, the `rotr`/`ch`/`maj`/`sig0`/`sig1` helper functions, the `{{> sha256}}` partial pattern) is a useful non-ML base shader for the `ShaderGenerator`'s test corpus and for UDF-authored compute. It demonstrates: (a) the constant-table-inlining gotcha and its template-driven solution, (b) partial includes, (c) conditional workgroup sizes, (d) that non-tensor compute (hashing, cryptography, any bit-twiddling workload) is a first-class use of the same `ShaderGenerator` + `dispatch_kernel` surface. The `alknet-runtime` substrate makes this available to *all* UDFs, not just ML ops — a UDF that needs SHA-256 registers it via `register_kernel` and dispatches on llvmpipe (CPU-only) or a real GPU.
+
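If the SHA-256 kernel joins the test corpus, a CPU-side reference for its helper functions gives the shader output something to be checked against. The sketch below follows the FIPS 180-4 definitions. It assumes that `sig0`/`sig1` refer to the message-schedule (lowercase sigma) functions; the reference shader may use those names for the compression-round functions instead, in which case the rotation amounts differ.

```rust
// CPU reference for the SHA-256 helper functions (FIPS 180-4, section 4.1.2).
// Assumption: `sig0`/`sig1` are the message-schedule sigma functions.

/// Rotate a 32-bit word right by `n` bits.
fn rotr(x: u32, n: u32) -> u32 {
    x.rotate_right(n)
}

/// Choose: for each bit, take `y` where `x` is 1 and `z` where `x` is 0.
fn ch(x: u32, y: u32, z: u32) -> u32 {
    (x & y) ^ (!x & z)
}

/// Majority: each output bit is the majority vote of the three input bits.
fn maj(x: u32, y: u32, z: u32) -> u32 {
    (x & y) ^ (x & z) ^ (y & z)
}

/// Message-schedule sigma0: ROTR 7, ROTR 18, SHR 3.
fn sig0(x: u32) -> u32 {
    rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
}

/// Message-schedule sigma1: ROTR 17, ROTR 19, SHR 10.
fn sig1(x: u32) -> u32 {
    rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
}

fn main() {
    let (a, b) = (0xdead_beefu32, 0x0123_4567u32);
    assert_eq!(rotr(1, 1), 0x8000_0000);
    assert_eq!(ch(u32::MAX, a, b), a); // all-ones selector picks y
    assert_eq!(ch(0, a, b), b); // all-zeros selector picks z
    assert_eq!(maj(a, a, b), a); // two equal inputs win the vote
    println!("sig0(1) = {:#010x}, sig1(1) = {:#010x}", sig0(1), sig1(1));
}
```

A test harness could run the same inputs through the dispatched shader and compare word for word.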
 ---

 ## Downstream Problems Solved
@@ -479,7 +501,7 @@ The `GraphologyHostConfig` becomes a Rust-backed host that builds a `petgraph::D

 In priority order:

-1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template (first `ShaderGenerator` impl) against `KernelSpec`, render all ~100 ops from `op_table.ts`, diff output against `getKernelShaderCode`'s output. If they match, the Rust codegen path is proven. Half-day exercise.
+1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template (first `ShaderGenerator` impl) against `KernelSpec`, render a representative subset of ops (matmul, conv, activation, arithmetic, all ported from wonnx's `templates/{matrix,pool,endomorphism}/*.wgsl`), diff output against wonnx's rendered shaders. If they match, the Rust codegen path is proven. Half-day exercise.

 2. **`ExprCode` parser assessment** — read `src/expr.ts`, determine if the parser ports to Rust cleanly. If yes, stage 2 moves to Rust entirely. If no, stage 2 stays in JS and sends `KernelSpec` to Rust at init.

@@ -496,6 +518,8 @@ In priority order:
 - **alknet-runtime (substrate this builds on):** `docs/research/alknet-runtime/summary.md` — JS isolate, wgpu device, ops bridge, shared JS core, sandbox, primitive compute dispatch
 - **alknet-tensor (format sibling):** `docs/research/alknet-tensor/metatensor-format.md` — pure-format binary tensor layout; `alknet-compute` registers the `load_model`/`stream_model` ops that bridge the format to wgpu buffers
 - **Reference design (tensor):** `/workspace/webgpu-torch` — `src/op_spec.ts` (OpSpec schema), `src/op_table.ts` (452 lines, ~100 ops), `src/opgen.ts` (728 lines, op→kernel transform), `src/kernel.ts:299-375` (WGSL shader generation), `src/autograd.ts` (112 lines, gradient graph), `src/nn_module.ts` (467 lines, module hierarchy), `src/optim.ts` (204 lines, optimizers), `src/device_webgpu.ts` (GPU device + buffer pool with FinalizationRegistry)
+- **wonnx (ONNX op set reference, MIT/Apache-2.0, archived):** `/workspace/wonnx` — `wonnx/templates/` (the shader implementations for ~25 ONNX ops: `endomorphism/{arithmetic,activation,batchnormalization,softmax,cast,map,gather,onehot,broadcast}.wgsl`, `matrix/{gemm,gemm_1,concat,transpose,split,pad,resize}.wgsl`, `pool/{conv,conv_kernel_1,conv_kernel_3,aggregate,reduce}.wgsl`, `snippets/{activation_scalar,activation_vec}.wgsl`, `structs.wgsl`), `wonnx/src/compiler.rs` (Tera-based template loading via `add_raw_template` + `include_str!` — the handlebars-rs equivalent is `register_template_string` + `include_str!`). Port the shaders, swap the template engine.
+- **Handlebars + wgpu + remote-GPU patterns (private reference):** `/workspace/@alkminer/reference/alkminer-v1` — `shaders/templates/header_hash.hbs` (top-level template shape, `{{> partial}}` includes, default-valued `{{#if workgroup_size}}`), `shaders/partials/sha256.hbs` (the inline-constant-table-via-switch pattern for the 64 SHA-256 k-values, the `{{#if max_size_words}}` default pattern), `Cargo.toml` (`handlebars = { version = "6", features = ["dir_source"] }`, `wgpu = "=24.0.5"`). Patterns carry over; the application does not.
 - **Compute graph layer:** `/workspace/@alkdev/flowgraph` — `src/component/` (`Operation`, `Sequential`, `Parallel`, `Conditional`, `Map` — ujsx components that build the workflow template), `src/host/graphology.ts` (`GraphologyHostConfig` — renders template to DAG, validates), `src/host/reactive.ts` (`ReactiveHostConfig` — renders template to reactive execution structure), `src/reactive/node-status.ts` (`computePreconditions`, `computeBlockedByFailure`, `registerStartEffect` — signal-driven DAG execution), `src/graph/` (construction, validation, queries — graphology API surface to port to petgraph), `src/analysis/` (type-compat, ordering, workflow — graph validation)
 - **Codegen infrastructure:** `/workspace/@alkimiadev/typebox-rs/src/codegen/` — `mod.rs` (`RustGenerator`, `TypeScriptGenerator`), `rust.rs` (handlebars → Rust structs), `typescript.rs` (handlebars → TS interfaces). The `ShaderGenerator` trait (with `WgslGenerator` as first impl) would be the third backend here.
 - **wgpu shading-language support (multi-backend codegen):** https://docs.rs/wgpu/latest/wgpu/#shading-language-support — SPIR-V / GLSL / WGSL / naga-IR input languages; the `ShaderGenerator` trait is parameterized by these