Files
ujsx/docs/research/category-theory-ujsx.md

348 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Category Theory and UJSX: Functors, Adjunctions, and Round-Trip Fidelity
**Date**: 2026-05-09
**Topic**: Formal foundations for the UJSX transform system, connecting categorical concepts to practical implementation
---
## 1. The Big Picture
UJSX is a system for transforming trees between representations. It has:
- A universal tree model (`UElement` / `URoot` / `UPrimitive`)
- A `TransformRegistry` with bidirectional rules and a `direction` field
- A `HostConfig` reconciler pattern that renders trees into target formats
- TypeBox schemas that define the shape of valid trees at runtime
This document connects these practical pieces to their categorical foundations, and explains why those connections matter for correctness.
---
## 2. Categories We Work With
A **category** consists of objects (types) and morphisms (structure-preserving maps between types) with an identity morphism and composable morphisms.
### 2.1 The TypeBox Schema Category
Objects: TypeBox schema types (`TObject`, `TString`, `TUnion`, `TModule`, etc.)
Morphisms: Schema transformations — `Type.Optional(T)`, `Type.Ref("X")`, `Type.Composite([A, B])`, `Module.Import("X")`
Composition: Nesting and combining schemas
Identity: `Type.Ref("Self")` within a `Type.Module`
This is approximately a **cartesian closed category**:
- **Products** — `Type.Composite([A, B])` (intersection types)
- **Coproducts** — `Type.Union([A, B])` (union types)
- **Exponentials** — `Type.Function([...], R)` (function types as first-class objects)
- **Terminal object** — `Type.Void()` (every type has a unique map to it)
### 2.2 The Tree Category
Trees (ASTs, IRs, DOMs) form categories where:
- **Objects** are tree species: UJSX elements, mdast nodes, hast nodes, TypeBox schemas-as-trees, safetensor headers
- **Morphisms** are transforms: `ujsx → mdast`, `mdast → ujsx`, `ujsx → typebox`, `typebox → ujsx`, etc.
- **Composition** is pipeline composition: first transform A→B, then B→C
A UJSX tree (or mdast tree, or hast tree, or TypeBox Module-as-tree) is itself an object in a tree category. The `TransformRegistry.transform()` function with `next` continuation is a **recursive functor** — it maps each subtree while preserving parent-child structure.
### 2.3 The Host Category
Each `HostConfig` defines a rendering functor from the UJSX tree category to a target category:
```
HostConfig<string, string, Record<string, unknown>>
```
- Source category: UJSX tree category
- Target category: string instances (DOM nodes, markdown strings, TypeBox schema objects, binary buffers)
- Functor: `createHostRoot(config).render(tree)` maps UJSX elements to target instances
---
## 3. Functors and the TransformRegistry
### 3.1 Transforms Are Functors
A **functor** `F: C → D` maps:
- Every object `A` in `C` to an object `F(A)` in `D`
- Every morphism `f: A → B` in `C` to a morphism `F(f): F(A) → F(B)` in `D`
- Such that `F(id_A) = id_{F(A)}` and `F(g ∘ f) = F(g) ∘ F(f)`
In UJSX terms:
| Categorical Concept | UJSX Implementation |
|---------------------|---------------------|
| Object in C | A UJSX tree conforming to some schema |
| Object in D | An mdast tree / hast tree / TypeBox schema |
| Functor F | A set of `TransformRule`s with `direction: "ujsx→mdast"` |
| `F(g ∘ f)` | The `next` continuation in `transform(node, ctx, next)` |
| `F(id_A)` | Identity transform (pass-through rule) |
The `next` parameter in `TransformRule.transform` is the key: it ensures that when you transform a subtree, you first transform the current node, then recursively transform children via `next`. This is the functoriality condition — the structure is preserved.
### 3.2 Direction Selects the Functor
The `Direction` type:
```typescript
type Direction = "ujsx→mdast" | "mdast→ujsx" | "ujsx→jpath" | "jpath→ujsx" | "ujsx→hast" | "hast→ujsx"
```
Each direction is a different functor. The pair `"ujsx→mdast"` and `"mdast→ujsx"` are potentially **adjoint functors**.
### 3.3 Adjunctions and Round-Trip Fidelity
Two functors `F: C → D` and `G: D → C` form an **adjunction** if there exists a natural bijection:
```
Hom_D(F(A), B) ≅ Hom_C(A, G(B))
```
In practice, this means: for every UJSX tree `a` and mdast tree `b`, there's a one-to-one correspondence between maps `F(a) → b` (mdast transforms) and maps `a → G(b)` (UJSX transforms).
The **round-trip property** is a stronger, more practical notion: for all `a` in C,
```
G(F(a)) ≅ a (unit of the adjunction)
```
and for all `b` in D,
```
F(G(b)) ≅ b (counit of the adjunction)
```
This is what we actually want to verify: if I transform a UJSX tree to mdast and back, I get something equivalent to what I started with.
---
## 4. The Impedance Mismatch: ASG vs AST
### 4.1 The Problem
Not all transforms satisfy round-trip fidelity. The key failure mode is when we have an **Abstract Semantic Graph** (ASG) on one side and an **Abstract Syntax Tree** (AST) on the other.
An **AST** represents syntax — the textual structure of a format. A mdast tree is an AST. It has a `type` discriminant and positional information.
An **ASG** represents semantics — meaning that may not round-trip through syntax. A UJSX `<Table>` component is an ASG node. It represents the *idea* of a table, which can be rendered as:
- A markdown table (`| col1 | col2 |`)
- An HTML `<table>` element
- A grid of key-value pairs in plain text
- A data structure (headers + rows) in JSON
But the markdown table syntax cannot express all table semantics (no colspans in GFM, no per-cell formatting), so the round trip `Table → mdast → Table` may lose information.
### 4.2 Categorical Formulation
In category theory terms, this is a **reflection** rather than an equivalence:
- The functor `F: ASG → AST` (render abstract semantics to concrete syntax) is **surjective on objects** (every ASG node maps *somewhere* in AST-land)
- The functor `G: AST → ASG` (parse concrete syntax back to abstract semantics) is **not injective on objects** (some ASG constructs have no AST representation, and some map to "lossy" approximations)
The adjunction unit `η: id → G ∘ F` is not an isomorphism — `G(F(a))` may be a *simplification* of `a`. This is expected and correct.
### 4.3 Practical Examples
| ASG Node | AST Representation | Round-Trip? | Loss? |
|----------|-------------------|-------------|-------|
| `<Table cols={["Name","Age"]}>` | mdast `table` node | Yes | Column alignment only |
| `<Table cols={["Name","Age"]} colspan={{1:3}}>` | mdast `table` node | Partial | colspan lost (GFM limit) |
| `<Callout type="warning">` | mdast `blockquote` with marker | Partial | Structural equivalence, semantic loss |
| `<Formula>ax² + bx + c</Formula>` | Plain text `ax² + bx + c` | No | Layout/formatting lost |
| `<Code language="ts">` | mdast `code` with `lang="ts"` | Yes | None |
| `<Heading level={2}>` | mdast `heading` with `depth: 2` | Yes | None |
### 4.4 Implications for Transform Design
For ASG→AST transforms, we should:
1. **Acknowledge loss explicitly** — transform rules can declare their fidelity level (exact, structural, lossy)
2. **Preserve semantics through metadata**`UElement.metadata` can carry information that the target format can't express, enabling partial recovery on the return trip
3. **Not attempt impossible round trips** — instead, design the return transform `G: AST → ASG` to produce the *best possible* ASG node, not a faithful reproduction
---
## 5. Codegen and Functor Construction
### 5.1 From TypeBox Modules to Transform Rules
The `ts-to-module.ts` codegen produces `Type.Module({...})` definitions from TypeScript sources. We have modules for Unist, MDAST, and JPATH. Each module defines an object in the TypeBox Schema Category.
Given two modules (say `UJSX` and `Mdast`), a functor between them can be partially constructed by analyzing the structural correspondence:
- `UJSX.UniversalElement.type` is a `String` or `Function` → matches `Mdast` nodes with `type` discriminants
- `UJSX.UniversalProps.children` is `Array(UniversalNode)` → matches `Mdast.Parent.children`
- `UJSX.UniversalElement.props` → matches `Mdast` node-specific properties
The codegen can produce **skeleton transform rules** — rules where `match` predicates are derived from schema intersection, and `transform` functions are stubs waiting for implementation:
```typescript
// Auto-generated from schema correspondence analysis
registry.register({
name: "heading-ujsx-to-mdast",
direction: "ujsx→mdast",
match: (node) => isUElement(node) && node.type === "heading", // derived from UJSX schema
transform: (node, ctx, next) => {
// TODO: implement heading-specific transform
// node.props.level maps to mdast heading depth
// node.children maps to mdast heading children
throw new Error("Not implemented: heading transform");
},
priority: 1,
});
```
### 5.2 From Type Definitions to Host Config Element Types
Each TypeBox `Type.XXX()` call in a module definition becomes a potential UJSX element type in a host config:
- `Type.Object({...})``<Object>` element
- `Type.String()``<String>` element
- `Type.Ref("X")``<Ref name="X" />` element
- `Type.Literal("heading")``<Literal value="heading" />` element
- `Type.Array(Type.Ref("Node"))``<Array><Ref name="Node" /></Array>`
This is a functor from the TypeBox Schema Category to the UJSX Element Category — every TypeBox constructor maps to a UJSX element, and TypeBox composition maps to UJSX tree composition.
### 5.3 The Proxy Insight: Call Trees and Definition Trees
The proxy experiment in `/research/typebox_research/ujsx/proxy.ts` traced TypeBox's internal call tree. The relationship:
```
Call tree (imperative): Definition tree (declarative):
Type.Object({ <Object name="MyType">
name: Type.String(), <property name="name"><String /></property>
... ...
}) </Object>
```
These are the same information, just in different traversal orders. The **call tree** is a depth-first execution trace; the **definition tree** is the resulting data structure. UJSX's declarative form unifies them — the JSX element IS the definition tree, and rendering it through a host produces the call tree.
---
## 6. Proof as Program: Verification via Generation
### 6.1 The Curry-Howard-Lambek Correspondence
The Curry-Howard-Lambek correspondence establishes:
| Logic | Programming | Category Theory |
|-------|------------|-----------------|
| Propositions | Types | Objects |
| Proofs | Programs | Morphisms |
| Implication A → B | Function type A → B | Exponential object B^A |
| Conjunction A ∧ B | Product type A × B | Product object |
| Disjunction A B | Union type A \| B | Coproduct object |
Under this correspondence: a TypeBox schema is a **proposition** (a specification of what valid data looks like). A value that passes `Value.Check(schema, data)` is a **proof** (evidence that the proposition holds for this data). A transform `f: SchemaA → SchemaB` is a **proof that if A-structured data exists, then B-structured data can be derived from it**.
### 6.2 Faker as Proof Generator
`typebox-schema-faker` generates random inhabitants of TypeBox schemas. In logical terms, it generates **witnesses** — concrete evidence that a type (proposition) is inhabited. This serves as:
1. **Model checking** — generate many instances of UJSX trees, transform them, validate the output against the target schema. If all generated instances pass, you have statistical confidence in the transform's correctness.
2. **Round-trip verification** — for adjoint functor pairs:
```
for (const x of generate(UJSXSchema)) {
const y = F(x) // ujsx → mdast
const x2 = G(y) // mdast → ujsx
assert(equivalent(x, x2, fidelityLevel))
}
```
This tests the adjunction laws for generated inhabitants.
3. **Schema coverage** — faker can be guided to hit edge cases (null children, empty arrays, deeply nested structures) that hand-written tests miss.
### 6.3 Compositional Verification
If transform `f: A → B` is verified and transform `g: B → C` is verified, then `g ∘ f: A → C` should also be correct by composition. The `composedOf` field from the graphology pattern ([`category-theory-graph.md`](./category-theory-graph.md)) can track this:
```typescript
interface TransformRule<T, U, A> {
name: string;
match: (node: T) => boolean;
transform: (node: T, ctx: TransformContext<A>, next: TransformFn<T, U, A>) => U;
priority?: number;
composedOf?: [string, string]; // provenance: which rules composed to produce this one
}
```
This enables provenance tracking — every composed transform knows its derivation, enabling targeted re-verification when a base rule changes.
---
## 7. The TypeDef Bridge: Binary Layout as a Category
### 7.1 Standard TypeBox vs TypeDef
Standard TypeBox types represent **JSON values** — data serialized as JSON, validated as JSON Schema. The TypeDef system adds **binary layout semantics**:
| Standard TypeBox | TypeDef | Difference |
|-----------------|---------|-----------|
| `Type.Number()` | `TF.Float32()`, `TF.Int32()`, etc. | Fixed-width, specific encoding |
| `Type.String()` | `TF.String()` | Same, but in struct context has known offset |
| `Type.Object({...})` | `TF.Struct({...})` | Ordered fields, computable offsets |
| `Type.Union([...], 'type')` | `TF.Union([...], 'type')` | Tagged union with discriminant |
The `TypeDef:Kind` markers (`TypeDef:Float32`, `TypeDef:Struct`, etc.) act as **discriminants in a functor** — they map from the abstract number type to the concrete binary representation.
### 7.2 Offset Computation as a Functor
Given a `TF.Struct({ name: TF.String(), embedding: TF.Array(TF.Float32()) })`:
1. The **TypeBox functor** maps this to a JSON Schema that validates JSON objects with `name` (string) and `embedding` (number array).
2. The **Binary layout functor** maps this to a byte offset table: `name` at offset 0 (variable, preceded by length), `embedding` at next aligned offset, with shape `[768]` → `768 × 4 = 3072` bytes.
These are **two different functors from the same source category** (TypeDef schemas) to two different target categories (JSON validation and binary layout). They're naturally isomorphic in the sense that every TypeDef schema has both a JSON interpretation and a binary interpretation.
### 7.3 SafeTensor as a Container Category
The SafeTensor format is: `header_length (8 bytes) + header (JSON) + data (raw bytes)`
This is a **coproduct** (sum type): `SafeTensor = Header × Data`. The header contains `TensorRef` entries with `data_offsets: [start, end]` — these are morphisms from the header category to positions in the data category.
The metatensor concept adds `TF.Struct` entries alongside raw tensors, enabling a third functor: from TypeDef schemas to byte ranges within the data section.
---
## 8. Summary: The Categorical Structure of UJSX
```
TypeBox Schema Category
┌──────┼──────┐
│ │ │
F₁ │ F₂ │ F₃ │
┌───┘ │ └───┐
▼ ▼ ▼
UJSX Tree MDAST Tree Binary Layout
Category Category Category
│ ▲
│ G₂ │
└──────────┘
│ F₂⁻¹ │
│ │
(adjoint pair: F₂ and G₂)
```
- **F₁** — TypeBox → UJSX host (renders type definitions as JSX elements)
- **F₂** — UJSX → MDAST transform (renders abstract components as markdown AST)
- **G₂** — MDAST → UJSX transform (parses markdown AST back to abstract components)
- **F₃** — TypeDef → Binary Layout (computes offsets and validates binary data)
**F₂ and G₂ should form an adjunction** — verified by round-trip property testing with `typebox-schema-faker`.
**F₁ is a faithful embedding** — TypeBox schemas embed into UJSX without information loss.
**F₃ is a projection** — TypeDef schemas project onto binary layouts, but not all binary layouts correspond to valid TypeDef schemas (lossy in the other direction).
**The ASG/AST impedance mismatch is a reflection, not an equivalence** — some ASG nodes have no faithful AST representation, and this is expected and correct.
---
## 9. Practical Next Steps
1. **Fork `typebox-schema-faker`** into the alkdev ecosystem for model checking
2. **Add `composedOf` provenance tracking** to `TransformRule`
3. **Add `fidelity` levels** to transform rules (`"exact"` | `"structural"` | `"lossy"`)
4. **Implement the TypeBox host** — prove F₁ works before building F₂/F₃
5. **Use `ts-to-module.ts` codegen** to produce module definitions, then derive skeleton transform rules from schema correspondence analysis
6. **Design the TypeDef UJSX elements** (`<Struct>`, `<Float32>`, etc.) as a second host that computes binary offsets