diff --git a/docs/research/category-theory-ujsx.md b/docs/research/category-theory-ujsx.md new file mode 100644 index 0000000..fd8f451 --- /dev/null +++ b/docs/research/category-theory-ujsx.md @@ -0,0 +1,348 @@ +# Category Theory and UJSX: Functors, Adjunctions, and Round-Trip Fidelity + +**Date**: 2026-05-09 +**Topic**: Formal foundations for the UJSX transform system, connecting categorical concepts to practical implementation + +--- + +## 1. The Big Picture + +UJSX is a system for transforming trees between representations. It has: +- A universal tree model (`UElement` / `URoot` / `UPrimitive`) +- A `TransformRegistry` with bidirectional rules and a `direction` field +- A `HostConfig` reconciler pattern that renders trees into target formats +- TypeBox schemas that define the shape of valid trees at runtime + +This document connects these practical pieces to their categorical foundations, and explains why those connections matter for correctness. + +--- + +## 2. Categories We Work With + +A **category** consists of objects (types) and morphisms (structure-preserving maps between types) with an identity morphism and composable morphisms. + +### 2.1 The TypeBox Schema Category + +Objects: TypeBox schema types (`TObject`, `TString`, `TUnion`, `TModule`, etc.) +Morphisms: Schema transformations — `Type.Optional(T)`, `Type.Ref("X")`, `Type.Composite([A, B])`, `Module.Import("X")` +Composition: Nesting and combining schemas +Identity: `Type.Ref("Self")` within a `Type.Module` + +This is approximately a **cartesian closed category**: +- **Products** — `Type.Composite([A, B])` (intersection types) +- **Coproducts** — `Type.Union([A, B])` (union types) +- **Exponentials** — `Type.Function([...], R)` (function types as first-class objects) +- **Terminal object** — `Type.Void()` (every type has a unique map to it) + +### 2.2 The Tree Category + +Trees (ASTs, IRs, DOMs) form categories where: +- **Objects** are tree species: UJSX elements, mdast nodes, hast nodes, TypeBox schemas-as-trees, safetensor headers +- **Morphisms** are transforms: `ujsx → mdast`, `mdast → ujsx`, `ujsx → typebox`, `typebox → ujsx`, etc. +- **Composition** is pipeline composition: first transform A→B, then B→C + +A UJSX tree (or mdast tree, or hast tree, or TypeBox Module-as-tree) is itself an object in a tree category. The `TransformRegistry.transform()` function with `next` continuation is a **recursive functor** — it maps each subtree while preserving parent-child structure. + +### 2.3 The Host Category + +Each `HostConfig` defines a rendering functor from the UJSX tree category to a target category: + +``` +HostConfig> +``` + +- Source category: UJSX tree category +- Target category: string instances (DOM nodes, markdown strings, TypeBox schema objects, binary buffers) +- Functor: `createHostRoot(config).render(tree)` maps UJSX elements to target instances + +--- + +## 3. Functors and the TransformRegistry + +### 3.1 Transforms Are Functors + +A **functor** `F: C → D` maps: +- Every object `A` in `C` to an object `F(A)` in `D` +- Every morphism `f: A → B` in `C` to a morphism `F(f): F(A) → F(B)` in `D` +- Such that `F(id_A) = id_{F(A)}` and `F(g ∘ f) = F(g) ∘ F(f)` + +In UJSX terms: + +| Categorical Concept | UJSX Implementation | +|---------------------|---------------------| +| Object in C | A UJSX tree conforming to some schema | +| Object in D | An mdast tree / hast tree / TypeBox schema | +| Functor F | A set of `TransformRule`s with `direction: "ujsx→mdast"` | +| `F(g ∘ f)` | The `next` continuation in `transform(node, ctx, next)` | +| `F(id_A)` | Identity transform (pass-through rule) | + +The `next` parameter in `TransformRule.transform` is the key: it ensures that when you transform a subtree, you first transform the current node, then recursively transform children via `next`. This is the functoriality condition — the structure is preserved. + +### 3.2 Direction Selects the Functor + +The `Direction` type: + +```typescript +type Direction = "ujsx→mdast" | "mdast→ujsx" | "ujsx→jpath" | "jpath→ujsx" | "ujsx→hast" | "hast→ujsx" +``` + +Each direction is a different functor. The pair `"ujsx→mdast"` and `"mdast→ujsx"` are potentially **adjoint functors**. + +### 3.3 Adjunctions and Round-Trip Fidelity + +Two functors `F: C → D` and `G: D → C` form an **adjunction** if there exists a natural bijection: + +``` +Hom_D(F(A), B) ≅ Hom_C(A, G(B)) +``` + +In practice, this means: for every UJSX tree `a` and mdast tree `b`, there's a one-to-one correspondence between maps `F(a) → b` (mdast transforms) and maps `a → G(b)` (UJSX transforms). + +The **round-trip property** is a stronger, more practical notion: for all `a` in C, + +``` +G(F(a)) ≅ a (unit of the adjunction) +``` + +and for all `b` in D, + +``` +F(G(b)) ≅ b (counit of the adjunction) +``` + +This is what we actually want to verify: if I transform a UJSX tree to mdast and back, I get something equivalent to what I started with. + +--- + +## 4. The Impedance Mismatch: ASG vs AST + +### 4.1 The Problem + +Not all transforms satisfy round-trip fidelity. The key failure mode is when we have an **Abstract Semantic Graph** (ASG) on one side and an **Abstract Syntax Tree** (AST) on the other. + +An **AST** represents syntax — the textual structure of a format. A mdast tree is an AST. It has a `type` discriminant and positional information. + +An **ASG** represents semantics — meaning that may not round-trip through syntax. A UJSX `` component is an ASG node. It represents the *idea* of a table, which can be rendered as: +- A markdown table (`| col1 | col2 |`) +- An HTML `
` element +- A grid of key-value pairs in plain text +- A data structure (headers + rows) in JSON + +But the markdown table syntax cannot express all table semantics (no colspans in GFM, no per-cell formatting), so the round trip `Table → mdast → Table` may lose information. + +### 4.2 Categorical Formulation + +In category theory terms, this is a **reflection** rather than an equivalence: + +- The functor `F: ASG → AST` (render abstract semantics to concrete syntax) is **surjective on objects** (every ASG node maps *somewhere* in AST-land) +- The functor `G: AST → ASG` (parse concrete syntax back to abstract semantics) is **not injective on objects** (some ASG constructs have no AST representation, and some map to "lossy" approximations) + +The adjunction unit `η: id → G ∘ F` is not an isomorphism — `G(F(a))` may be a *simplification* of `a`. This is expected and correct. + +### 4.3 Practical Examples + +| ASG Node | AST Representation | Round-Trip? | Loss? | +|----------|-------------------|-------------|-------| +| `
` | mdast `table` node | Yes | Column alignment only | +| `
` | mdast `table` node | Partial | colspan lost (GFM limit) | +| `` | mdast `blockquote` with marker | Partial | Structural equivalence, semantic loss | +| `ax² + bx + c` | Plain text `ax² + bx + c` | No | Layout/formatting lost | +| `` | mdast `code` with `lang="ts"` | Yes | None | +| `` | mdast `heading` with `depth: 2` | Yes | None | + +### 4.4 Implications for Transform Design + +For ASG→AST transforms, we should: +1. **Acknowledge loss explicitly** — transform rules can declare their fidelity level (exact, structural, lossy) +2. **Preserve semantics through metadata** — `UElement.metadata` can carry information that the target format can't express, enabling partial recovery on the return trip +3. **Not attempt impossible round trips** — instead, design the return transform `G: AST → ASG` to produce the *best possible* ASG node, not a faithful reproduction + +--- + +## 5. Codegen and Functor Construction + +### 5.1 From TypeBox Modules to Transform Rules + +The `ts-to-module.ts` codegen produces `Type.Module({...})` definitions from TypeScript sources. We have modules for Unist, MDAST, and JPATH. Each module defines an object in the TypeBox Schema Category. + +Given two modules (say `UJSX` and `Mdast`), a functor between them can be partially constructed by analyzing the structural correspondence: + +- `UJSX.UniversalElement.type` is a `String` or `Function` → matches `Mdast` nodes with `type` discriminants +- `UJSX.UniversalProps.children` is `Array(UniversalNode)` → matches `Mdast.Parent.children` +- `UJSX.UniversalElement.props` → matches `Mdast` node-specific properties + +The codegen can produce **skeleton transform rules** — rules where `match` predicates are derived from schema intersection, and `transform` functions are stubs waiting for implementation: + +```typescript +// Auto-generated from schema correspondence analysis +registry.register({ + name: "heading-ujsx-to-mdast", + direction: "ujsx→mdast", + match: (node) => isUElement(node) && node.type === "heading", // derived from UJSX schema + transform: (node, ctx, next) => { + // TODO: implement heading-specific transform + // node.props.level maps to mdast heading depth + // node.children maps to mdast heading children + throw new Error("Not implemented: heading transform"); + }, + priority: 1, +}); +``` + +### 5.2 From Type Definitions to Host Config Element Types + +Each TypeBox `Type.XXX()` call in a module definition becomes a potential UJSX element type in a host config: + +- `Type.Object({...})` → `` element +- `Type.String()` → `` element +- `Type.Ref("X")` → `` element +- `Type.Literal("heading")` → `` element +- `Type.Array(Type.Ref("Node"))` → `` + +This is a functor from the TypeBox Schema Category to the UJSX Element Category — every TypeBox constructor maps to a UJSX element, and TypeBox composition maps to UJSX tree composition. + +### 5.3 The Proxy Insight: Call Trees and Definition Trees + +The proxy experiment in `/research/typebox_research/ujsx/proxy.ts` traced TypeBox's internal call tree. The relationship: + +``` +Call tree (imperative): Definition tree (declarative): +Type.Object({ + name: Type.String(), + ... ... +}) +``` + +These are the same information, just in different traversal orders. The **call tree** is a depth-first execution trace; the **definition tree** is the resulting data structure. UJSX's declarative form unifies them — the JSX element IS the definition tree, and rendering it through a host produces the call tree. + +--- + +## 6. Proof as Program: Verification via Generation + +### 6.1 The Curry-Howard-Lambek Correspondence + +The Curry-Howard-Lambek correspondence establishes: + +| Logic | Programming | Category Theory | +|-------|------------|-----------------| +| Propositions | Types | Objects | +| Proofs | Programs | Morphisms | +| Implication A → B | Function type A → B | Exponential object B^A | +| Conjunction A ∧ B | Product type A × B | Product object | +| Disjunction A ∨ B | Union type A \| B | Coproduct object | + +Under this correspondence: a TypeBox schema is a **proposition** (a specification of what valid data looks like). A value that passes `Value.Check(schema, data)` is a **proof** (evidence that the proposition holds for this data). A transform `f: SchemaA → SchemaB` is a **proof that if A-structured data exists, then B-structured data can be derived from it**. + +### 6.2 Faker as Proof Generator + +`typebox-schema-faker` generates random inhabitants of TypeBox schemas. In logical terms, it generates **witnesses** — concrete evidence that a type (proposition) is inhabited. This serves as: + +1. **Model checking** — generate many instances of UJSX trees, transform them, validate the output against the target schema. If all generated instances pass, you have statistical confidence in the transform's correctness. + +2. **Round-trip verification** — for adjoint functor pairs: + ``` + for (const x of generate(UJSXSchema)) { + const y = F(x) // ujsx → mdast + const x2 = G(y) // mdast → ujsx + assert(equivalent(x, x2, fidelityLevel)) + } + ``` + This tests the adjunction laws for generated inhabitants. + +3. **Schema coverage** — faker can be guided to hit edge cases (null children, empty arrays, deeply nested structures) that hand-written tests miss. + +### 6.3 Compositional Verification + +If transform `f: A → B` is verified and transform `g: B → C` is verified, then `g ∘ f: A → C` should also be correct by composition. The `composedOf` field from the graphology pattern ([`category-theory-graph.md`](./category-theory-graph.md)) can track this: + +```typescript +interface TransformRule { + name: string; + match: (node: T) => boolean; + transform: (node: T, ctx: TransformContext, next: TransformFn) => U; + priority?: number; + composedOf?: [string, string]; // provenance: which rules composed to produce this one +} +``` + +This enables provenance tracking — every composed transform knows its derivation, enabling targeted re-verification when a base rule changes. + +--- + +## 7. The TypeDef Bridge: Binary Layout as a Category + +### 7.1 Standard TypeBox vs TypeDef + +Standard TypeBox types represent **JSON values** — data serialized as JSON, validated as JSON Schema. The TypeDef system adds **binary layout semantics**: + +| Standard TypeBox | TypeDef | Difference | +|-----------------|---------|-----------| +| `Type.Number()` | `TF.Float32()`, `TF.Int32()`, etc. | Fixed-width, specific encoding | +| `Type.String()` | `TF.String()` | Same, but in struct context has known offset | +| `Type.Object({...})` | `TF.Struct({...})` | Ordered fields, computable offsets | +| `Type.Union([...], 'type')` | `TF.Union([...], 'type')` | Tagged union with discriminant | + +The `TypeDef:Kind` markers (`TypeDef:Float32`, `TypeDef:Struct`, etc.) act as **discriminants in a functor** — they map from the abstract number type to the concrete binary representation. + +### 7.2 Offset Computation as a Functor + +Given a `TF.Struct({ name: TF.String(), embedding: TF.Array(TF.Float32()) })`: + +1. The **TypeBox functor** maps this to a JSON Schema that validates JSON objects with `name` (string) and `embedding` (number array). +2. The **Binary layout functor** maps this to a byte offset table: `name` at offset 0 (variable, preceded by length), `embedding` at next aligned offset, with shape `[768]` → `768 × 4 = 3072` bytes. + +These are **two different functors from the same source category** (TypeDef schemas) to two different target categories (JSON validation and binary layout). They're naturally isomorphic in the sense that every TypeDef schema has both a JSON interpretation and a binary interpretation. + +### 7.3 SafeTensor as a Container Category + +The SafeTensor format is: `header_length (8 bytes) + header (JSON) + data (raw bytes)` + +This is a **coproduct** (sum type): `SafeTensor = Header × Data`. The header contains `TensorRef` entries with `data_offsets: [start, end]` — these are morphisms from the header category to positions in the data category. + +The metatensor concept adds `TF.Struct` entries alongside raw tensors, enabling a third functor: from TypeDef schemas to byte ranges within the data section. + +--- + +## 8. Summary: The Categorical Structure of UJSX + +``` + TypeBox Schema Category + │ + ┌──────┼──────┐ + │ │ │ + F₁ │ F₂ │ F₃ │ + ┌───┘ │ └───┐ + ▼ ▼ ▼ + UJSX Tree MDAST Tree Binary Layout + Category Category Category + │ ▲ + │ G₂ │ + └──────────┘ + │ F₂⁻¹ │ + │ │ + (adjoint pair: F₂ and G₂) +``` + +- **F₁** — TypeBox → UJSX host (renders type definitions as JSX elements) +- **F₂** — UJSX → MDAST transform (renders abstract components as markdown AST) +- **G₂** — MDAST → UJSX transform (parses markdown AST back to abstract components) +- **F₃** — TypeDef → Binary Layout (computes offsets and validates binary data) + +**F₂ and G₂ should form an adjunction** — verified by round-trip property testing with `typebox-schema-faker`. + +**F₁ is a faithful embedding** — TypeBox schemas embed into UJSX without information loss. + +**F₃ is a projection** — TypeDef schemas project onto binary layouts, but not all binary layouts correspond to valid TypeDef schemas (lossy in the other direction). + +**The ASG/AST impedance mismatch is a reflection, not an equivalence** — some ASG nodes have no faithful AST representation, and this is expected and correct. + +--- + +## 9. Practical Next Steps + +1. **Fork `typebox-schema-faker`** into the alkdev ecosystem for model checking +2. **Add `composedOf` provenance tracking** to `TransformRule` +3. **Add `fidelity` levels** to transform rules (`"exact"` | `"structural"` | `"lossy"`) +4. **Implement the TypeBox host** — prove F₁ works before building F₂/F₃ +5. **Use `ts-to-module.ts` codegen** to produce module definitions, then derive skeleton transform rules from schema correspondence analysis +6. **Design the TypeDef UJSX elements** (``, ``, etc.) as a second host that computes binary offsets \ No newline at end of file