diff --git a/docs/research/alknet-compute/architecture-summary.md b/docs/research/alknet-compute/architecture-summary.md index 239b65a..5c0e91c 100644 --- a/docs/research/alknet-compute/architecture-summary.md +++ b/docs/research/alknet-compute/architecture-summary.md @@ -200,6 +200,28 @@ typebox-rs/src/codegen/ The WGSL template encodes the scaffolding from webgpu-torch's `getKernelShaderCode` (`kernel.ts:299-375`): struct declarations, `@group(0) @binding(N)` declarations, `@compute @workgroup_size` header, conditional `@builtin` inclusion. One handlebars template with `{{#each inputs}}`, `{{#each outputs}}`, `{{#if uses_global_id}}` blocks. The trait abstraction means a SPIR-V or GLSL template can be added later without changing `KernelSpec` or the opgen transform — only the final render step is language-specific. +### Reference implementations for the op table + +Two existing codebases inform the shader templates and the wgpu+handlebars + remote-GPU patterns; both are freely usable (MIT/Apache-2.0): + +**wonnx** (`/workspace/wonnx`, archived, MIT/Apache-2.0) — an ONNX-runtime-on-wgpu project whose `wonnx/templates/` directory is a comprehensive, already-implemented ONNX op set in shader-template form. The templates use Tera (`{% %}`/`{{ }}`), not handlebars, but the shader *implementations* are the asset — porting them to handlebars is mechanical. The op coverage spans what alknet-compute's built-in table needs: + +- `templates/endomorphism/` — `arithmetic.wgsl` (add/sub/mul/pow/PRelu, binary + scalar-broadcast), `activation.wgsl` (ReLU/Sigmoid/Tanh/etc. via `snippets/activation_scalar.wgsl` + `activation_vec.wgsl`), `batchnormalization.wgsl`, `softmax.wgsl`, `cast.wgsl`, `map.wgsl`, `gather.wgsl`, `onehot.wgsl`, `broadcast.wgsl` +- `templates/matrix/` — `gemm.wgsl` + `gemm_1.wgsl` (matmul + matvec), `concat.wgsl`, `transpose.wgsl`, `split.wgsl`, `pad.wgsl`, `resize.wgsl` +- `templates/pool/` — `conv.wgsl` (+ `conv_kernel_1.wgsl` / `conv_kernel_3.wgsl` specializations), `aggregate.wgsl`, `reduce.wgsl` +- `templates/structs.wgsl` — the `Scalar`/`Vec4`/`Array` aliases, the `scalar_type_is_float` conditional for matrix types. Exactly the scaffolding pattern the `ShaderGenerator` needs to emit per-dtype. + +The `structs.wgsl` partial pattern (aliased types per scalar_type, conditional float-only matrix structs) is the template-includes shape that handlebars `{{> partial}}` mirrors directly. The `endomorphism/arithmetic.wgsl` shape — conditional binary-vs-scalar-broadcast via `{% if i_lens | length == 2 %}`, `{{ op_type }}` operator substitution — maps 1:1 to handlebars `{{#if}}`/`{{op_type}}`. wonnx's `compiler.rs` uses `tera::Tera` with `add_raw_template` + `include_str!` (compile-time embed); the handlebars-rs equivalent is `handlebars::Handlebars::new()` with `register_template_string` + `include_str!`. Same pattern, different crate. + +**Handlebars + wgpu + remote-GPU patterns** — a separate, production-deployed wgpu project (private, patterns reusable) validates the handlebars-rs side of the stack and the deployment shape. The patterns worth carrying into alknet-compute, independent of that project's application: + +- **`{{> partial}}` includes for shared shader fragments.** `sha256_header`, `sha256`, `rng` are partials included into the top-level template — exactly the `structs.wgsl`-as-partial pattern from wonnx, in handlebars. alknet-compute's `ShaderGenerator` should support partials for shared WGSL fragments (struct declarations, common helpers, activation snippets). +- **Inline-able constant tables via switch statements.** SHA-256's round constants are the canonical example: the 64 k-values are inlined as a `fn k_value(i: u32) -> u32 { switch(i) { case 0u: { return 0x428a2f98u; } ... } }` block, generated by handlebars from a data table. This is the universally-compatible approach — `const` array initializers have had backend-specific support gaps across wgpu versions, but a switch-statement function compiles everywhere. The pattern generalizes to any shader that needs compile-time constant tables (S-boxes, lookup tables, weight preprocessor tables). The `ShaderGenerator` should expose a `{{#each constants}}` block that emits either a `const` array (on backends/wgpu versions that support it) or a switch-statement fallback (universal), chosen by template — not hand-written per shader. +- **Default-valued template parameters.** `{{#if workgroup_size}}{{workgroup_size}}{{else}}256{{/if}}` — shaders should have sane defaults so a dispatch without explicit workgroup size compiles. The `KernelSpec` schema should mark which parameters are required vs defaultable, and the handlebars template renders defaults for the latter. +- **wgpu + remote-GPU (vast.ai) sync.** The project dispatches wgpu compute to remote GPU instances over SSH/sync — the same deployment shape alknet-compute's distributed-training-over-irpc targets, but at a lower layer (direct wgpu on a remote box, not ops over the registry). The lesson: wgpu-on-remote-GPU works, the sync model is straightforward, and alknet-compute's `from_call`-imported remote `tensor.matmul` ops are the registry-layer version of the same deployment topology. + +**sha256 as a base shader.** The SHA-256 kernel (constant-inlining via switch, the `rotr`/`ch`/`maj`/`sig0`/`sig1` helper functions, the `{{> sha256}}` partial pattern) is a useful non-ML base shader for the `ShaderGenerator`'s test corpus and for UDF-authored compute. It demonstrates: (a) the constant-table-inlining gotcha and its template-driven solution, (b) partial includes, (c) conditional workgroup sizes, (d) that non-tensor compute (hashing, cryptography, any bit-twiddling workload) is a first-class use of the same `ShaderGenerator` + `dispatch_kernel` surface. The `alknet-runtime` substrate makes this available to *all* UDFs, not just ML ops — a UDF that needs SHA-256 registers it via `register_kernel` and dispatches on llvmpipe (CPU-only) or a real GPU. + --- ## Downstream Problems Solved @@ -479,7 +501,7 @@ The `GraphologyHostConfig` becomes a Rust-backed host that builds a `petgraph::D In priority order: -1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template (first `ShaderGenerator` impl) against `KernelSpec`, render all ~100 ops from `op_table.ts`, diff output against `getKernelShaderCode`'s output. If they match, the Rust codegen path is proven. Half-day exercise. +1. **WGSL codegen probe** — write the `WgslGenerator` handlebars template (first `ShaderGenerator` impl) against `KernelSpec`, render a representative subset of ops (matmul, conv, activation, arithmetic — ported from wonnx's `templates/{matrix/pool/endomorphism}/*.wgsl`), diff output against wonnx's rendered shaders. If they match, the Rust codegen path is proven. Half-day exercise. 2. **`ExprCode` parser assessment** — read `src/expr.ts`, determine if the parser ports to Rust cleanly. If yes, stage 2 moves to Rust entirely. If no, stage 2 stays in JS and sends `KernelSpec` to Rust at init. @@ -496,6 +518,8 @@ In priority order: - **alknet-runtime (substrate this builds on):** `docs/research/alknet-runtime/summary.md` — JS isolate, wgpu device, ops bridge, shared JS core, sandbox, primitive compute dispatch - **alknet-tensor (format sibling):** `docs/research/alknet-tensor/metatensor-format.md` — pure-format binary tensor layout; `alknet-compute` registers the `load_model`/`stream_model` ops that bridge the format to wgpu buffers - **Reference design (tensor):** `/workspace/webgpu-torch` — `src/op_spec.ts` (OpSpec schema), `src/op_table.ts` (452 lines, ~100 ops), `src/opgen.ts` (728 lines, op→kernel transform), `src/kernel.ts:299-375` (WGSL shader generation), `src/autograd.ts` (112 lines, gradient graph), `src/nn_module.ts` (467 lines, module hierarchy), `src/optim.ts` (204 lines, optimizers), `src/device_webgpu.ts` (GPU device + buffer pool with FinalizationRegistry) +- **wonnx (ONNX op set reference, MIT/Apache-2.0, archived):** `/workspace/wonnx` — `wonnx/templates/` (the shader implementations for ~25 ONNX ops: `endomorphism/{arithmetic,activation,batchnormalization,softmax,cast,map,gather,onehot,broadcast}.wgsl`, `matrix/{gemm,gemm_1,concat,transpose,split,pad,resize}.wgsl`, `pool/{conv,conv_kernel_1,conv_kernel_3,aggregate,reduce}.wgsl`, `snippets/{activation_scalar,activation_vec}.wgsl`, `structs.wgsl`), `wonnx/src/compiler.rs` (Tera-based template loading via `add_raw_template` + `include_str!` — the handlebars-rs equivalent is `register_template_string` + `include_str!`). Port the shaders, swap the template engine. +- **Handlebars + wgpu + remote-GPU patterns (private reference):** `/workspace/@alkminer/reference/alkminer-v1` — `shaders/templates/header_hash.hbs` (top-level template shape, `{{> partial}}` includes, default-valued `{{#if workgroup_size}}`), `shaders/partials/sha256.hbs` (the inline-constant-table-via-switch pattern for the 64 SHA-256 k-values, the `{{#if max_size_words}}` default pattern), `Cargo.toml` (`handlebars = { version = "6", features = ["dir_source"] }`, `wgpu = "=24.0.5"`). Patterns carry over; the application does not. - **Compute graph layer:** `/workspace/@alkdev/flowgraph` — `src/component/` (`Operation`, `Sequential`, `Parallel`, `Conditional`, `Map` — ujsx components that build the workflow template), `src/host/graphology.ts` (`GraphologyHostConfig` — renders template to DAG, validates), `src/host/reactive.ts` (`ReactiveHostConfig` — renders template to reactive execution structure), `src/reactive/node-status.ts` (`computePreconditions`, `computeBlockedByFailure`, `registerStartEffect` — signal-driven DAG execution), `src/graph/` (construction, validation, queries — graphology API surface to port to petgraph), `src/analysis/` (type-compat, ordering, workflow — graph validation) - **Codegen infrastructure:** `/workspace/@alkimiadev/typebox-rs/src/codegen/` — `mod.rs` (`RustGenerator`, `TypeScriptGenerator`), `rust.rs` (handlebars → Rust structs), `typescript.rs` (handlebars → TS interfaces). The `ShaderGenerator` trait (with `WgslGenerator` as first impl) would be the third backend here. - **wgpu shading-language support (multi-backend codegen):** https://docs.rs/wgpu/latest/wgpu/#shading-language-support — SPIR-V / GLSL / WGSL / naga-IR input languages; the `ShaderGenerator` trait is parameterized by these