glm-5.2 b7b5337586 docs(research): add metatensor format — schema-driven binary tensor layout

Documents the metatensor format: a binary data format where a TypeBox/jsonschema
schema describes the layout of binary data at schema-computed offsets. Extends
safetensors (fixed TensorRef schema) to arbitrary schemas, enabling struct tensors
(records), blob tensors (variable-length via indirection), and nested layouts.

Key points:
- TypeBox schemas render to standard JSON Schema; the jsonschema Rust crate
  validates them with zero translation. Custom typedef.ts kinds (TFloat32,
  TInt32, TStruct) map to jsonschema custom keywords via with_keyword().
- This eliminates typebox-rs as a schema engine — replaced by jsonschema +
  a small offset-computation module + ~50 lines of custom keyword impls.
- Three tensor kinds: flat (safetensor today), struct (record of typed fields),
  blob (struct tensor as index + flat tensor as data store, for variable-length)
- Memory-mappable: parse header, compute offsets, mmap data, typed views per
  schema. No copy, no deserialization.
- QUIC-streamable: header is one small JSON message, each tensor is a separate
  stream. Lazy loading, parallel transfer, incremental compute.
- ujsx-authorable: <Tensor>, <Struct>, <Field> as layout components, same
  reconciler that diffs UI trees diffs model schemas. Model versioning is
  tree diffing.
- Category-theory foundation: ujsx as universal typed-tree IR, HostConfig as
  interpreter. <Tensor> is no stranger than <div>.

2026-06-20 14:09:04 +00:00

Metatensor: Schema-Driven Binary Tensor Format

Status: Research / concept design. The schema layer is proven (TypeBox + jsonschema interop confirmed by API inspection); the offset computation and mmap layer are designed but not yet implemented.

Date: 2026-06-20

Scope: A schema-driven binary tensor format extending safetensors, using TypeBox (JS) / jsonschema (Rust) as the layout specification language. Supports flat, struct, and blob tensor kinds for fixed-size, record-structured, and variable-length data. Memory-mappable, QUIC-streamable, and authorable via ujsx.

Executive Summary

Metatensor is a binary data format where a JSON Schema (authored in TypeBox, validated by either TypeBox in JS or the jsonschema crate in Rust) describes the layout of binary data, and the binary data follows at schema-computed offsets. It extends the safetensor format — which is fixed-schema (every tensor is {dtype, shape, data_offsets}) — to arbitrary schemas, enabling struct tensors (records with typed fields), blob tensors (variable-length data via indirection), and nested layouts.

The format is: header_length (8 bytes) + JSON header (schema + offsets) + raw binary data (row-major, at offsets the header describes).
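The framing above can be sketched with the standard library alone. This is a minimal illustration, assuming the length prefix is a little-endian u64 (as in safetensors) and that the data region runs to the end of the file; `encode_file` and `split_file` are hypothetical names, not part of any existing crate.

```rust
/// Build a file image: 8-byte little-endian header length,
/// then the JSON header bytes, then the raw data region.
fn encode_file(header_json: &str, data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + header_json.len() + data.len());
    out.extend_from_slice(&(header_json.len() as u64).to_le_bytes());
    out.extend_from_slice(header_json.as_bytes());
    out.extend_from_slice(data);
    out
}

/// Split a file image into (header JSON text, data region).
/// Returns None if the image is truncated or the header is not valid UTF-8.
fn split_file(file: &[u8]) -> Option<(&str, &[u8])> {
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(file.get(..8)?);
    let header_len = u64::from_le_bytes(prefix);
    // Reject lengths that cannot possibly fit before converting to usize.
    if header_len > file.len() as u64 {
        return None;
    }
    let data_start = 8usize.checked_add(header_len as usize)?;
    let header = std::str::from_utf8(file.get(8..data_start)?).ok()?;
    Some((header, &file[data_start..]))
}
```

The header text returned by `split_file` would then go to a JSON parser and a schema validator; the data slice is what the offsets in the header index into.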

The key insight: the schema is the format. A TypeBox schema like Type.Object({conv1_weight: TensorRef, conv1_bias: TensorRef, ...}) is both:

The validation spec (TypeBox's Value.Check in JS, jsonschema's validator.is_valid in Rust)
The layout spec (field offsets computed from the schema's type sizes, the same way typebox-rs computed struct layouts)

No separate format definition, no separate parser, no separate validator. One schema, three uses: validate, compute offsets, access data.

The Schema Layer: TypeBox ↔ jsonschema

Why this works

TypeBox modules render to standard JSON Schema under $defs (the Type.Module wrapper is just $defs). A TypeBox schema like:

const TensorRef = Type.Object({
  dtype: Type.Union([Type.Literal("F32"), Type.Literal("I16"), ...]),
  shape: Type.Array(Type.Number()),
  data_offsets: Type.Tuple([Type.Number(), Type.Number()])
});

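serializes to JSON roughly as follows (TypeBox emits the literal union as an anyOf of const entries, which accepts the same values as the enum shorthand shown here, and $schema is added by hand):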

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "dtype": { "type": "string", "enum": ["F32", "I16", ...] },
    "shape": { "type": "array", "items": { "type": "number" } },
    "data_offsets": { "type": "array", "minItems": 2, "maxItems": 2, ... }
  },
  "required": ["dtype", "shape", "data_offsets"]
}

That JSON feeds directly into jsonschema::validator_for(&serde_json::from_str(...)) on the Rust side. Zero translation. The same schema validates in both ecosystems.

Custom type kinds (typedef.ts port)

The typedef.ts file at /workspace/@alkdev/typebox/example/typedef/typedef.ts (619 lines) defines custom TypeBox schema kinds that carry binary layout semantics:

TFloat32 — [Types.Kind]: 'TypeDef:Float32', static: number, 4-byte float
TInt32 — [Types.Kind]: 'TypeDef:Int32', static: number, 4-byte int
TStruct — [Types.Kind]: 'TypeDef:Struct', fields with byte offsets, record layout
TUnion — tagged union of structs (discriminator + mapping)
TEnum — string enum

These are registered in TypeBox via TypeRegistry.Set<TFloat32>('TypeDef:Float32', (schema, value) => ValueCheck.Check(schema, value)) — a custom validator that runs when Value.Check encounters the TypeDef:Float32 kind.

The Rust jsonschema crate supports the same pattern via custom keywords:

use jsonschema::{paths::Location, Keyword, ValidationError};
use serde_json::{Map, Value};

struct Float32Validator;

impl Keyword for Float32Validator {
    fn validate<'i>(&self, instance: &'i Value) -> Result<(), ValidationError<'i>> {
        // Accepts any JSON number; a stricter check would also reject
        // values outside the finite f32 range
        match instance {
            Value::Number(n) if n.as_f64().is_some() => Ok(()),
            _ => Err(ValidationError::custom("expected f32")),
        }
    }
    fn is_valid(&self, instance: &Value) -> bool {
        instance.is_number()
    }
}

fn float32_factory(
    _parent: &Map<String, Value>,
    _value: &Value,
    _path: Location,
) -> Result<Box<dyn Keyword>, ValidationError<'static>> {
    Ok(Box::new(Float32Validator))
}

let validator = jsonschema::options()
    .with_keyword("TypeDef:Float32", float32_factory)
    .with_keyword("TypeDef:Int32", int32_factory)
    .with_keyword("TypeDef:Struct", struct_factory)
    .build(&schema)?;

Every TypeRegistry.Set<...>('TypeDef:...', validator) in typedef.ts maps to a with_keyword("TypeDef:...", factory) in Rust. Same semantics, different language, same JSON Schema wire format.

The jsonschema crate also supports custom format validators via with_format("f32", |v| ...). Format checks apply to string instances, so they suit string-encoded annotations rather than the numeric kinds themselves, which the custom keywords cover. Compiled validators are reusable (validator_for(&schema)? then call validator.is_valid(&instance) repeatedly), which gives the same usage pattern as TypeBox's Value.Check.

What this eliminates

typebox-rs becomes unnecessary as a schema engine. The /workspace/@alkimiadev/typebox-rs/ crate (builder, schema, registry, validate, value) is replaced by:

jsonschema crate — validation, custom keywords, custom formats, compiled validators
A small offset-computation module — walks a schema, computes byte offsets for each field based on type sizes (the one piece of real logic beyond jsonschema)
Custom keyword implementations — the TypeDef:Float32 / TypeDef:Int32 / TypeDef:Struct validators (small functions, ~10 lines each)

No hand-rolled schema type system, no builder, no registry. The codebase drops from "a port of TypeBox" to "jsonschema + an offset map + ~50 lines of custom keyword implementations."

The Three Tensor Kinds

Flat tensor (fixed-size, homogeneous)

The safetensor format as it exists today. A single dtype, a shape, and a contiguous byte range.

const FlatTensor = Type.Object({
  dtype: Dtypes,                    // "F32", "I16", etc.
  shape: Type.Array(Type.Number()),  // [20, 1, 5, 5]
  data_offsets: Type.Tuple([Type.Number(), Type.Number()])  // [start, end] byte offsets
});

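Size: dtype_size(dtype) * product(shape) bytes, which must equal data_offsets[1] - data_offsets[0]. For F32 with shape [20, 1, 5, 5] that is 4 * 500 = 2000 bytes. Direct mmap: one region, typed view, done. This is the base case.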
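A small sketch of that size rule, useful as a consistency check when reading a header. The dtype table covers the common safetensors names; `dtype_size`, `tensor_byte_size`, and `offsets_match` are illustrative helpers, not an existing API.

```rust
/// Byte width of one element for common safetensors dtype names.
fn dtype_size(dtype: &str) -> Option<usize> {
    match dtype {
        "BOOL" | "U8" | "I8" => Some(1),
        "F16" | "BF16" | "I16" | "U16" => Some(2),
        "F32" | "I32" | "U32" => Some(4),
        "F64" | "I64" | "U64" => Some(8),
        _ => None,
    }
}

/// dtype_size(dtype) * product(shape), with overflow reported as None.
fn tensor_byte_size(dtype: &str, shape: &[usize]) -> Option<usize> {
    shape
        .iter()
        .try_fold(dtype_size(dtype)?, |acc, &dim| acc.checked_mul(dim))
}

/// True when the declared byte range is exactly as long as dtype and shape require.
fn offsets_match(dtype: &str, shape: &[usize], data_offsets: (usize, usize)) -> bool {
    let (start, end) = data_offsets;
    end >= start && tensor_byte_size(dtype, shape) == Some(end - start)
}
```

Running this check on every TensorRef before creating typed views catches headers whose shape and byte range disagree.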

Struct tensor (fixed-size, heterogeneous)

A record of typed fields, each a flat tensor or another struct. One binary region, multiple typed views.

const Layer = Type.Object({
  weight: TensorRef,   // {dtype: "F32", shape: [20, 1, 5, 5], data_offsets: [0, 2000]}
  bias: TensorRef,     // {dtype: "F32", shape: [20], data_offsets: [2000, 2080]}
  running_mean: TensorRef,  // {dtype: "F32", shape: [20], data_offsets: [2080, 2160]}
  running_var: TensorRef,   // {dtype: "F32", shape: [20], data_offsets: [2160, 2240]}
});

const Model = Type.Object({
  conv1: Layer,
  conv2: Layer,
}, { additionalProperties: false });

The schema describes the structure; the offsets in each TensorRef point into the same binary region. Loading a model = mmap the file, construct typed views per the schema's field offsets. No copy, no deserialization — the schema is the access pattern.

For a model file, the whole file is one struct. For a dataset, the file is an array of structs: each row is one record with the same field layout.

Blob tensor (variable-length)

The hard case: strings, ragged arrays, arbitrary blobs. Solved via indirection — a struct tensor holds metadata + offsets, a flat tensor holds the actual variable-length bytes.

const BlobEntry = Type.Object({
  offset: Type.Number(),   // byte offset into the blob flat tensor
  length: Type.Number(),   // byte length of this blob
});

const TextDataset = Type.Object({
  // The index: one BlobEntry per string, stored as a struct tensor
  entries: Type.Array(BlobEntry),
  // The data: raw bytes, stored as a flat tensor (dtype: "U8")
  blob_data: Type.Object({
    dtype: Type.Literal("U8"),
    shape: Type.Array(Type.Number()),  // [total_blob_bytes]
    data_offsets: Type.Tuple([Type.Number(), Type.Number()])
  })
});

To read string i: look up entries[i] → get (offset, length) → slice blob_data[offset..offset+length] → decode. Two lookups (index, then data), O(1) random access. Long strings, ragged arrays, JSON blobs, and images are all handled by the same indirection pattern.

The blob tensor is a combination of a struct tensor (the index) and a flat tensor (the blob store). The struct tensor's fields are metadata + offsets into the flat tensor. This is the same pattern as Apache Arrow's variable-length types (strings are stored as an offsets array + a byte buffer), but described in TypeBox schema instead of Arrow's IPC format.
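The lookup can be sketched over two byte slices. The binary encoding of a BlobEntry is not fixed by the design above, so this assumes each entry is two little-endian u64 values (offset, length); `blob_at` and `build_blobs` are hypothetical helpers.

```rust
/// Assumption: each BlobEntry is stored as two little-endian u64 values
/// (offset, length), 16 bytes per entry.
const ENTRY_SIZE: usize = 16;

fn read_u64_le(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(buf)
}

/// Fetch blob `i`: read its entry from the index, then slice the blob store.
/// Returns None for an out-of-range index or an entry pointing outside the store.
fn blob_at<'a>(index: &[u8], blob_data: &'a [u8], i: usize) -> Option<&'a [u8]> {
    let entry_start = i.checked_mul(ENTRY_SIZE)?;
    let entry = index.get(entry_start..entry_start.checked_add(ENTRY_SIZE)?)?;
    let offset = read_u64_le(&entry[..8]) as usize;
    let length = read_u64_le(&entry[8..]) as usize;
    blob_data.get(offset..offset.checked_add(length)?)
}

/// Pack items into (index bytes, blob store bytes) using the same encoding.
fn build_blobs(items: &[&[u8]]) -> (Vec<u8>, Vec<u8>) {
    let (mut index, mut data) = (Vec::new(), Vec::new());
    for item in items {
        index.extend_from_slice(&(data.len() as u64).to_le_bytes());
        index.extend_from_slice(&(item.len() as u64).to_le_bytes());
        data.extend_from_slice(item);
    }
    (index, data)
}
```

Both slices would come straight from the mapped file, so a lookup touches one 16-byte entry plus the bytes of the blob itself.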

Offset Computation

The one piece of real logic beyond jsonschema. Given a schema, compute the byte offset and size of each field. This is the same computation the old typebox-rs implementation did for struct layouts.

struct OffsetMap {
    fields: Vec<(String, ByteRange)>,  // field name → byte range
    total_size: usize,
}

struct ByteRange {
    start: usize,
    end: usize,
}

fn compute_offsets(schema: &serde_json::Value) -> OffsetMap {
    // Walk the schema:
    // - TypeDef:Float32 → 4 bytes
    // - TypeDef:Int32 → 4 bytes
    // - TypeDef:Array of T → element_size * len
    // - TypeDef:Struct → recurse, sum field sizes, align
    // - TensorRef → use data_offsets from the header
    // - Array of struct → element_size * count
    // ...
    todo!("designed but not yet implemented")
}

The offset computation is schema-driven: the schema's type kinds determine byte sizes, the struct's field order determines offsets, alignment rules (from the schema's annotations or defaults) determine padding. The output is an OffsetMap — a flat table of (field_path, byte_range) pairs.

For flat tensors, the offset is already in the data_offsets field of the TensorRef — no computation needed. For struct tensors, the offsets are computed from the field types. For blob tensors, the offsets are stored in the struct tensor's fields (the offset and length of each BlobEntry), pointing into the flat blob tensor.
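A minimal sketch of the struct case, assuming C-like natural alignment and a hand-written layout tree in place of the JSON schema walk. The `Layout` enum is a stand-in for the TypeDef kinds; `Float64` is included only to make padding visible, and the signature differs from the `compute_offsets` stub above because it takes the tree rather than a `serde_json::Value`.

```rust
/// A minimal layout tree, standing in for the TypeDef schema kinds.
enum Layout {
    Float32,
    Int32,
    Float64,
    Array(Box<Layout>, usize),     // element layout, fixed length
    Struct(Vec<(String, Layout)>), // fields in declaration order
}

/// Round `offset` up to the next multiple of `align` (align > 0).
fn align_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) / align * align
}

impl Layout {
    /// (size, alignment) in bytes. Assumption: natural alignment, C-like rules.
    fn size_align(&self) -> (usize, usize) {
        match self {
            Layout::Float32 | Layout::Int32 => (4, 4),
            Layout::Float64 => (8, 8),
            Layout::Array(elem, len) => {
                let (size, align) = elem.size_align();
                (align_up(size, align) * len, align)
            }
            Layout::Struct(fields) => {
                let (mut end, mut max_align) = (0, 1);
                for (_, field) in fields {
                    let (size, align) = field.size_align();
                    end = align_up(end, align) + size;
                    max_align = max_align.max(align);
                }
                (align_up(end, max_align), max_align)
            }
        }
    }
}

/// Flatten into (dotted field path, start, end) rows, starting at `base`.
/// Scalars and arrays are leaves; structs recurse.
fn compute_offsets(layout: &Layout, prefix: &str, base: usize, out: &mut Vec<(String, usize, usize)>) {
    match layout {
        Layout::Struct(fields) => {
            let mut cursor = base;
            for (name, field) in fields {
                let (size, align) = field.size_align();
                cursor = align_up(cursor, align);
                let path = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{}.{}", prefix, name)
                };
                compute_offsets(field, &path, cursor, out);
                cursor += size;
            }
        }
        leaf => {
            let (size, _) = leaf.size_align();
            out.push((prefix.to_string(), base, base + size));
        }
    }
}
```

For a struct of Int32, Float64, and a nested three-element Float32 array, the Float64 field is pushed from byte 4 to byte 8, which is the padding the alignment rules introduce.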

Memory Mapping

The payoff of schema-driven offsets: direct mmap with typed views, no deserialization.

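use std::{fs::File, io, path::Path};

use memmap2::Mmap;

struct MetatensorFile {
    header: serde_json::Value,     // parsed JSON header (schema + tensor refs)
    schema: serde_json::Value,     // the TypeBox/jsonschema schema
    offset_map: OffsetMap,         // computed from schema
    data_start: usize,             // file offset where the data region begins
    mmap: Mmap,                    // the whole file, mapped read-only
}

impl MetatensorFile {
    fn open(path: &Path) -> io::Result<Self> {
        let invalid = |msg: &str| io::Error::new(io::ErrorKind::InvalidData, msg.to_owned());

        let file = File::open(path)?;
        // SAFETY: sound only while nothing truncates or rewrites the file.
        let mmap = unsafe { Mmap::map(&file)? };

        // Read 8-byte little-endian header length prefix
        let prefix = mmap.get(0..8).ok_or_else(|| invalid("missing header length prefix"))?;
        let header_len = u64::from_le_bytes(prefix.try_into().unwrap()) as usize;

        // The data region starts after the header
        let data_start = 8usize
            .checked_add(header_len)
            .filter(|&end| end <= mmap.len())
            .ok_or_else(|| invalid("header length exceeds file size"))?;

        // Parse JSON header (schema + tensor refs with offsets)
        let header: serde_json::Value = serde_json::from_slice(&mmap[8..data_start])?;

        // Assumption: the layout schema sits under a "schema" key in the header;
        // the design does not fix the key name yet.
        let schema = header.get("schema").cloned().unwrap_or(serde_json::Value::Null);

        // Compute offsets from schema
        let offset_map = compute_offsets(&header);

        Ok(Self { header, schema, offset_map, data_start, mmap })
    }

    fn tensor(&self, field_path: &str) -> Option<&[u8]> {
        let (_, range) = self.offset_map.fields.iter().find(|(name, _)| name == field_path)?;
        // Ranges are relative to the data region, not to the start of the file
        self.mmap[self.data_start..].get(range.start..range.end)
    }

    fn tensor_as_f32(&self, field_path: &str) -> Option<&[f32]> {
        let bytes = self.tensor(field_path)?;
        // SAFETY: every bit pattern is a valid f32. align_to splits off any
        // misaligned head or tail, which are rejected instead of reinterpreted.
        let (head, floats, tail) = unsafe { bytes.align_to::<f32>() };
        (head.is_empty() && tail.is_empty()).then_some(floats)
    }
}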

No copy. No deserialization. The schema is the access pattern. For a 10GB model file, opening is O(header size) — parse the JSON header, compute offsets, mmap the data. Accessing any tensor is O(1) — the OS page-faults the relevant pages on demand.

QUIC Stream Mapping

The format maps naturally to QUIC streams (and therefore to alknet's BiStream):

Header stream — the JSON header (schema + tensor refs with offsets). Small, sent first. The receiver knows the entire layout before any data arrives.
Per-tensor data streams — each tensor's byte range is a separate QUIC stream. The receiver can request specific tensors on demand (lazy loading) or receive all in parallel. Stream priority can be set per-tensor (load the first layer's weights before the last layer's).

Stream 0: header (JSON: schema + offset table)
Stream 1: conv1.weight bytes [0..2000]
Stream 2: conv1.bias bytes [2000..2080]
Stream 3: conv2.weight bytes [2080..4080]
Stream 4: conv2.bias bytes [4080..4160]
...

The offset table in the header tells the receiver which stream carries which tensor's bytes. A model can therefore be partially loaded (just the layers needed for inference), streamed incrementally (start computing on layer 1 while layer 2 is still transferring), or broadcast to multiple peers (each peer subscribes to the streams it needs).

Over irpc/alknet-call, this becomes: load_model is an operation whose output is a struct of BufferIds; the operation's implementation opens QUIC streams per the offset table and writes each tensor's bytes into a wgpu buffer. The caller gets back tensor handles and starts computing.
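The stream assignment and a lazy-load selection can be sketched as plain data transforms over the offset table, with no networking involved. The stream numbers here are logical indices for illustration, not QUIC stream IDs, and `TensorRange`, `plan_streams`, and `streams_for` are hypothetical names.

```rust
/// One row of the header's offset table: tensor path and byte range in the data region.
#[derive(Debug, PartialEq)]
struct TensorRange {
    path: String,
    start: usize,
    end: usize,
}

/// Assign logical streams: 0 is reserved for the header,
/// tensors get 1, 2, ... in table order.
fn plan_streams(table: &[TensorRange]) -> Vec<(usize, &TensorRange)> {
    table.iter().enumerate().map(|(i, t)| (i + 1, t)).collect()
}

/// Lazy loading: only the streams whose tensor path starts with `prefix`.
fn streams_for<'a>(table: &'a [TensorRange], prefix: &str) -> Vec<(usize, &'a TensorRange)> {
    plan_streams(table)
        .into_iter()
        .filter(|(_, t)| t.path.starts_with(prefix))
        .collect()
}
```

With the four-tensor table from the listing above, asking for the prefix "conv2." selects streams 3 and 4, so a receiver that only needs the second layer transfers 2080 bytes instead of 4160.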

ujsx Authoring

The schema is a TypeBox module, and TypeBox modules are ujsx-authorable via the TsToModule codegen (/workspace/research/typebox_research/codegen/ts-to-module.ts) and the ujsx-as-AST pattern (/workspace/research/typebox_research/ujsx/*.gen.ts).

Just as <Sequential> and <Operation> are ujsx components for compute graphs, <Tensor> and <Struct> and <Field> are ujsx components for data layouts:

import { h } from "@alkdev/ujsx";
import { Tensor, Struct, Field, ArrayOf } from "./layout-components";

const ModelLayout = (
  <Struct name="ConvNet">
    <Field name="conv1">
      <Struct>
        <Field name="weight"><Tensor dtype="f32" shape={[20, 1, 5, 5]} /></Field>
        <Field name="bias"><Tensor dtype="f32" shape={[20]} /></Field>
        <Field name="running_mean"><Tensor dtype="f32" shape={[20]} /></Field>
        <Field name="running_var"><Tensor dtype="f32" shape={[20]} /></Field>
      </Struct>
    </Field>
    <Field name="conv2">
      <Struct>
        <Field name="weight"><Tensor dtype="f32" shape={[20, 20, 5, 5]} /></Field>
        <Field name="bias"><Tensor dtype="f32" shape={[20]} /></Field>
      </Struct>
    </Field>
  </Struct>
);

// render(ModelLayout) against a LayoutHostConfig → produces a TypeBox schema
// + computed offsets → can serialize/deserialize/mmap/stream the model

The LayoutHostConfig's createInstance("Struct", props) builds a Type.Object(...); createInstance("Tensor", {dtype, shape}) builds a TensorRef; appendChild nests fields into the struct. The output is a TypeBox schema + offset table — the same data structure metatensor uses, authored declaratively as JSX.

The reconciler diffs layout changes the same way it diffs UI trees: add a layer (new <Field> child), change a shape (changed prop on <Tensor>), remove a parameter (removed <Field>). The diff is a schema delta, which is a model delta, which is a version bump. Model versioning is tree diffing.
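To make the three kinds of change concrete, here is a deliberately simplified diff over flattened layouts. It compares path-keyed maps rather than walking fibers, so it is not the ujsx reconciler, only an illustration of the delta it would report; `FlatLayout`, `Change`, and `diff_layouts` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Flattened layout: dotted field path -> (dtype, shape).
type FlatLayout = BTreeMap<String, (String, Vec<usize>)>;

#[derive(Debug, PartialEq)]
enum Change {
    Added(String),
    Removed(String),
    Changed(String), // dtype or shape differs
}

/// Report removed and changed fields (in path order), then added fields.
fn diff_layouts(old: &FlatLayout, new: &FlatLayout) -> Vec<Change> {
    let mut changes = Vec::new();
    for (path, spec) in old {
        match new.get(path) {
            None => changes.push(Change::Removed(path.clone())),
            Some(new_spec) if new_spec != spec => changes.push(Change::Changed(path.clone())),
            Some(_) => {}
        }
    }
    for path in new.keys() {
        if !old.contains_key(path) {
            changes.push(Change::Added(path.clone()));
        }
    }
    changes
}
```

An empty result means two layouts are byte-compatible at the field level; any `Changed` or `Removed` entry marks a version that old readers cannot load unmodified.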

The category-theory foundation

The ujsx research docs mention the category-theory foundations of ujsx as a universal typed-tree IR. The application to binary layouts is the same foundation:

A ujsx element is a typed tree node. <Tensor dtype="f32" shape={[20]} /> is a node of type Tensor with props {dtype, shape}.
A component is a typed tree constructor. <Struct> constructs a Type.Object(...).
The reconciler is a typed tree diff. Changing shape={[20, 1, 5, 5]} to shape={[40, 1, 5, 5]} is a prop diff that produces a schema delta.
The HostConfig is the interpreter. LayoutHostConfig interprets the tree as a schema + offset table; GraphologyHostConfig interprets it as a DAG; ThreeHostConfig interprets it as a three.js scene graph.

<Tensor> is no stranger than <div>. Both are typed tree nodes. The HostConfig decides what they mean. This is why ujsx works as a universal IR — the tree structure is generic, the interpretation is pluggable.

Relationship to Existing Formats

| Format | Schema | Binary layout | Variable-length | Memory-mappable | Stream-friendly | Safe (no code exec) |
| --- | --- | --- | --- | --- | --- | --- |
| PyTorch `.pt` | pickle (unsafe) | pickle | yes | no | no | no (arbitrary code) |
| safetensors | fixed (TensorRef) | flat only | no | yes | partial | yes |
| Apache Arrow IPC | Arrow schema | columnar | yes (offsets) | yes | yes | yes |
| metatensor | TypeBox/jsonschema (arbitrary) | flat + struct + blob | yes (blob indirection) | yes | yes (QUIC per-tensor) | yes |

metatensor sits between safetensors (simpler, fixed schema) and Arrow (more complex, columnar). The distinguishing property: the schema is TypeBox/jsonschema, so it's the same schema system used everywhere else in the alkdev ecosystem — operations, ujsx, flowgraph, the call protocol. Arrow has its own schema format; metatensor uses yours.

What This Enables

Model serialization

A model is a struct tensor. Save = serialize schema (TypeBox → JSON) + write tensor bytes at computed offsets. Load = parse header, compute offsets, mmap. No pickle, no code execution, safe to load untrusted models.

Dataset storage

A dataset is an array of struct tensors. Each record is a struct (fields: input tensor, label tensor, metadata blob). The blob tensor handles variable-length text fields. Memory-map the whole dataset, access records by index, stream over QUIC per-record on demand.
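Record access by index reduces to stride arithmetic once the per-record layout is known. A minimal sketch, assuming fixed-size records laid out back to back; `RecordLayout` and `field_range` are hypothetical names, and the example record (a four-float input, an Int32 label, a 16-byte blob entry) is made up for illustration.

```rust
/// Per-record layout: total stride plus (field name, offset in record, size).
struct RecordLayout {
    stride: usize,
    fields: Vec<(&'static str, usize, usize)>,
}

/// Byte range of `field` in record `i`, for records starting at `base`.
/// Returns None for an unknown field or on arithmetic overflow.
fn field_range(layout: &RecordLayout, base: usize, i: usize, field: &str) -> Option<(usize, usize)> {
    let (_, offset, size) = layout.fields.iter().find(|(name, _, _)| *name == field)?;
    let start = base
        .checked_add(i.checked_mul(layout.stride)?)?
        .checked_add(*offset)?;
    Some((start, start.checked_add(*size)?))
}
```

Because the range depends only on `i` and constants from the schema, a reader can jump to any record without scanning the ones before it.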

Network transport

The header (schema + offsets) is one small JSON message. The tensor bytes are QUIC streams. A receiver can:

Load the schema, validate it, know the full layout before any data arrives
Request specific tensors (lazy load just the layers needed)
Receive in parallel (multiple QUIC streams concurrently)
Start computing on early tensors while later ones are still transferring

Over irpc: load_model is an operation that opens the file, mmaps it, returns a struct of BufferIds (one per tensor, written to wgpu buffers from the mmap'd bytes). Or: stream_model is an operation that sends the header then opens QUIC streams per tensor, writing to the receiver's wgpu buffers as each stream arrives.

Schema evolution

The schema is a TypeBox module. TypeBox has Value.Diff, Value.Patch, Value.Cast, and Value.Convert, so the building blocks for schema evolution are built in. Adding a field to a model is a schema delta; old files still load (the new field is optional or has a default). Changing a field type is a migration with a converter function. Model versioning is schema versioning, which is tree diffing, which is what the ujsx reconciler already does.

Open Questions

1. Alignment and padding rules

Different platforms have different alignment requirements for GPU upload (wgpu may want 256-byte alignment for some buffer types). The offset computation needs to respect alignment, which may be schema-annotated (Type.Number({ align: 256 })) or globally configured. Need to decide: is alignment a schema property, a format property, or a runtime property?

Recommendation: schema-annotated with a sensible default (4-byte for floats, 16-byte for structs). The offset computation respects the annotation; wgpu buffer creation rounds up as needed.

2. Endianness

safetensors is little-endian. metatensor should be too (wgpu buffers are little-endian on all supported backends). But the header should probably declare it ({ endian: "little" }) for completeness. Big-endian hosts (none currently supported by wgpu) would need a byte-swap on load.
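A sketch of the fallback path for hosts where the zero-copy view is not valid. It decodes explicitly from little-endian bytes, so it gives the same result on any host at the cost of a copy; the "little" / "big" strings mirror the proposed header field, and `needs_byte_swap` and `decode_f32_le` are hypothetical names.

```rust
/// Whether data declared with the given endianness must be swapped on this host.
/// Returns None for an unrecognized declaration.
fn needs_byte_swap(declared: &str) -> Option<bool> {
    match declared {
        "little" => Some(cfg!(target_endian = "big")),
        "big" => Some(cfg!(target_endian = "little")),
        _ => None,
    }
}

/// Decode little-endian f32 values from raw tensor bytes.
/// Returns None if the byte length is not a multiple of 4.
fn decode_f32_le(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

A loader would take the mmap'd typed view when `needs_byte_swap` returns `Some(false)` and fall back to `decode_f32_le` otherwise.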

3. Compression

safetensors is uncompressed. For large models, compression matters. Options: compress the whole data region (zstd), compress per-tensor (each QUIC stream compressed independently), or leave compression to the transport layer. QUIC itself does not compress stream payloads, and HTTP/3's QPACK only compresses headers, so the transport option means adding a compression layer in irpc. Need to decide where compression lives.

Recommendation: transport-layer compression (implemented in irpc, since QUIC does not provide it), format is uncompressed on disk. Keeps the format simple; compression is a deployment concern.

4. The blob tensor's two-region layout

The blob tensor requires two binary regions (index struct + data flat tensor). This means the metatensor format needs to support multiple data regions, not one contiguous blob. Options: the header lists multiple data regions with their own offset ranges, or the blob tensor's data_offsets are absolute (pointing into the single data region, with the index at the start and the blob data after). Need to decide the region model.

Recommendation: single data region, absolute offsets. The index struct lives at the start of the data region; the blob data follows. All offsets are absolute from the start of the data region. Simple, one mmap, the offset map handles indirection.

5. Integration with wgpu buffers

Loading a tensor from mmap into a wgpu buffer: is it a copy (mmap → staging buffer → GPU buffer) or a direct map (if wgpu supports host-visible buffers)? Depends on the GPU backend — discrete GPUs need a copy; integrated/llvmpipe might map directly. The load_model operation needs to handle both paths.

References

TypeBox (JS schema): /workspace/@alkdev/typebox/ — src/type/ (builders), src/value/ (Value.Check, Value.Diff, Value.Clone, Value.Equal — verified on QuickJS-NG by POC 2)
typedef.ts (binary layout types): /workspace/@alkdev/typebox/example/typedef/typedef.ts (619 lines) — TFloat32, TInt32, TStruct, TUnion, TEnum with TypeRegistry.Set<...> custom validators
jsonschema (Rust validation): /workspace/jsonschema — validator_for(&schema)?, with_keyword("TypeDef:...", factory) for custom kinds, with_format("...", validator) for custom formats, Draft 2020-12/2019-09/7/6/4 support
typebox-rs (to be replaced by jsonschema + offset map): /workspace/@alkimiadev/typebox-rs/ — src/codegen/ (handlebars codegen: RustGenerator, TypeScriptGenerator — WgslGenerator would be the third backend here)
metatensor concept: /workspace/research/typebox_research/metatensor/basic.ts (78 lines) — Dtypes, TensorRef, TensorMap TypeBox schemas, safetensor header parsing, Value.Check(TensorMap, header) validation
ujsx as AST: /workspace/research/typebox_research/ujsx/ — ujsx.ts (UJSX TypeBox module), unist.gen.ts (unist AST → TypeBox module), mdast.gen.ts (mdast AST → TypeBox module), jpath.gen.ts (JSONPath AST → TypeBox module)
TsToModule codegen: /workspace/research/typebox_research/codegen/ts-to-module.ts — generates TypeBox modules from TypeScript type definitions via the TypeScript compiler API
safetensor format reference: /workspace/research/typebox_research/metatensor/basic.ts — 8-byte LE header length + JSON header + raw bytes; TensorRef = {dtype, shape, data_offsets}
ujsx reconciler (verified on quickjs): /workspace/@alkdev/ujsx/src/host/{reconcile,config,fiber}.ts — fiber-based reconciler, the typed-tree diff engine
flowgraph (compute graph layer, uses ujsx): /workspace/@alkdev/flowgraph/ — <Operation>, <Sequential>, <Parallel>, <Conditional>, <Map> components; GraphologyHostConfig + ReactiveHostConfig
alknet-tensor architecture (parent doc): /workspace/@alkdev/alknet/docs/research/alknet-tensor/architecture-summary.md — the tensor compute architecture this format serves
