add category theory research doc connecting UJSX to functors, adjunctions, and round-trip fidelity

This commit is contained in:
2026-05-09 11:40:51 +00:00
parent 3eb1f1d896
commit ba74afd0b3

View File

@@ -0,0 +1,348 @@
# Category Theory and UJSX: Functors, Adjunctions, and Round-Trip Fidelity
**Date**: 2026-05-09
**Topic**: Formal foundations for the UJSX transform system, connecting categorical concepts to practical implementation
---
## 1. The Big Picture
UJSX is a system for transforming trees between representations. It has:
- A universal tree model (`UElement` / `URoot` / `UPrimitive`)
- A `TransformRegistry` with bidirectional rules and a `direction` field
- A `HostConfig` reconciler pattern that renders trees into target formats
- TypeBox schemas that define the shape of valid trees at runtime
This document connects these practical pieces to their categorical foundations, and explains why those connections matter for correctness.
---
## 2. Categories We Work With
A **category** consists of objects (types) and morphisms (structure-preserving maps between types) with an identity morphism and composable morphisms.
### 2.1 The TypeBox Schema Category
Objects: TypeBox schema types (`TObject`, `TString`, `TUnion`, `TModule`, etc.)
Morphisms: Schema transformations — `Type.Optional(T)`, `Type.Ref("X")`, `Type.Composite([A, B])`, `Module.Import("X")`
Composition: Nesting and combining schemas
Identity: `Type.Ref("Self")` within a `Type.Module`
This is approximately a **cartesian closed category**:
- **Products** — `Type.Composite([A, B])` (intersection types)
- **Coproducts** — `Type.Union([A, B])` (union types)
- **Exponentials** — `Type.Function([...], R)` (function types as first-class objects)
- **Terminal object** — `Type.Void()` (every type has a unique map to it)
### 2.2 The Tree Category
Trees (ASTs, IRs, DOMs) form categories where:
- **Objects** are tree species: UJSX elements, mdast nodes, hast nodes, TypeBox schemas-as-trees, safetensor headers
- **Morphisms** are transforms: `ujsx → mdast`, `mdast → ujsx`, `ujsx → typebox`, `typebox → ujsx`, etc.
- **Composition** is pipeline composition: first transform A→B, then B→C
A UJSX tree (or mdast tree, or hast tree, or TypeBox Module-as-tree) is itself an object in a tree category. The `TransformRegistry.transform()` function with `next` continuation is a **recursive functor** — it maps each subtree while preserving parent-child structure.
### 2.3 The Host Category
Each `HostConfig` defines a rendering functor from the UJSX tree category to a target category:
```
HostConfig<string, string, Record<string, unknown>>
```
- Source category: UJSX tree category
- Target category: string instances (DOM nodes, markdown strings, TypeBox schema objects, binary buffers)
- Functor: `createHostRoot(config).render(tree)` maps UJSX elements to target instances
---
## 3. Functors and the TransformRegistry
### 3.1 Transforms Are Functors
A **functor** `F: C → D` maps:
- Every object `A` in `C` to an object `F(A)` in `D`
- Every morphism `f: A → B` in `C` to a morphism `F(f): F(A) → F(B)` in `D`
- Such that `F(id_A) = id_{F(A)}` and `F(g ∘ f) = F(g) ∘ F(f)`
In UJSX terms:
| Categorical Concept | UJSX Implementation |
|---------------------|---------------------|
| Object in C | A UJSX tree conforming to some schema |
| Object in D | An mdast tree / hast tree / TypeBox schema |
| Functor F | A set of `TransformRule`s with `direction: "ujsx→mdast"` |
| `F(g ∘ f)` | The `next` continuation in `transform(node, ctx, next)` |
| `F(id_A)` | Identity transform (pass-through rule) |
The `next` parameter in `TransformRule.transform` is the key: it ensures that when you transform a subtree, you first transform the current node, then recursively transform children via `next`. This is the functoriality condition — the structure is preserved.
### 3.2 Direction Selects the Functor
The `Direction` type:
```typescript
type Direction = "ujsx→mdast" | "mdast→ujsx" | "ujsx→jpath" | "jpath→ujsx" | "ujsx→hast" | "hast→ujsx"
```
Each direction is a different functor. The pair `"ujsx→mdast"` and `"mdast→ujsx"` are potentially **adjoint functors**.
### 3.3 Adjunctions and Round-Trip Fidelity
Two functors `F: C → D` and `G: D → C` form an **adjunction** if there exists a natural bijection:
```
Hom_D(F(A), B) ≅ Hom_C(A, G(B))
```
In practice, this means: for every UJSX tree `a` and mdast tree `b`, there's a one-to-one correspondence between maps `F(a) → b` (mdast transforms) and maps `a → G(b)` (UJSX transforms).
The **round-trip property** is a stronger, more practical notion: for all `a` in C,
```
G(F(a)) ≅ a (unit of the adjunction)
```
and for all `b` in D,
```
F(G(b)) ≅ b (counit of the adjunction)
```
This is what we actually want to verify: if I transform a UJSX tree to mdast and back, I get something equivalent to what I started with.
---
## 4. The Impedance Mismatch: ASG vs AST
### 4.1 The Problem
Not all transforms satisfy round-trip fidelity. The key failure mode is when we have an **Abstract Semantic Graph** (ASG) on one side and an **Abstract Syntax Tree** (AST) on the other.
An **AST** represents syntax — the textual structure of a format. A mdast tree is an AST. It has a `type` discriminant and positional information.
An **ASG** represents semantics — meaning that may not round-trip through syntax. A UJSX `<Table>` component is an ASG node. It represents the *idea* of a table, which can be rendered as:
- A markdown table (`| col1 | col2 |`)
- An HTML `<table>` element
- A grid of key-value pairs in plain text
- A data structure (headers + rows) in JSON
But the markdown table syntax cannot express all table semantics (no colspans in GFM, no per-cell formatting), so the round trip `Table → mdast → Table` may lose information.
### 4.2 Categorical Formulation
In category theory terms, this is a **reflection** rather than an equivalence:
- The functor `F: ASG → AST` (render abstract semantics to concrete syntax) is **surjective on objects** (every ASG node maps *somewhere* in AST-land)
- The functor `G: AST → ASG` (parse concrete syntax back to abstract semantics) is **not injective on objects** (some ASG constructs have no AST representation, and some map to "lossy" approximations)
The adjunction unit `η: id → G ∘ F` is not an isomorphism — `G(F(a))` may be a *simplification* of `a`. This is expected and correct.
### 4.3 Practical Examples
| ASG Node | AST Representation | Round-Trip? | Loss? |
|----------|-------------------|-------------|-------|
| `<Table cols={["Name","Age"]}>` | mdast `table` node | Yes | Column alignment only |
| `<Table cols={["Name","Age"]} colspan={{1:3}}>` | mdast `table` node | Partial | colspan lost (GFM limit) |
| `<Callout type="warning">` | mdast `blockquote` with marker | Partial | Structural equivalence, semantic loss |
| `<Formula>ax² + bx + c</Formula>` | Plain text `ax² + bx + c` | No | Layout/formatting lost |
| `<Code language="ts">` | mdast `code` with `lang="ts"` | Yes | None |
| `<Heading level={2}>` | mdast `heading` with `depth: 2` | Yes | None |
### 4.4 Implications for Transform Design
For ASG→AST transforms, we should:
1. **Acknowledge loss explicitly** — transform rules can declare their fidelity level (exact, structural, lossy)
2. **Preserve semantics through metadata**`UElement.metadata` can carry information that the target format can't express, enabling partial recovery on the return trip
3. **Not attempt impossible round trips** — instead, design the return transform `G: AST → ASG` to produce the *best possible* ASG node, not a faithful reproduction
---
## 5. Codegen and Functor Construction
### 5.1 From TypeBox Modules to Transform Rules
The `ts-to-module.ts` codegen produces `Type.Module({...})` definitions from TypeScript sources. We have modules for Unist, MDAST, and JPATH. Each module defines an object in the TypeBox Schema Category.
Given two modules (say `UJSX` and `Mdast`), a functor between them can be partially constructed by analyzing the structural correspondence:
- `UJSX.UniversalElement.type` is a `String` or `Function` → matches `Mdast` nodes with `type` discriminants
- `UJSX.UniversalProps.children` is `Array(UniversalNode)` → matches `Mdast.Parent.children`
- `UJSX.UniversalElement.props` → matches `Mdast` node-specific properties
The codegen can produce **skeleton transform rules** — rules where `match` predicates are derived from schema intersection, and `transform` functions are stubs waiting for implementation:
```typescript
// Auto-generated from schema correspondence analysis
registry.register({
name: "heading-ujsx-to-mdast",
direction: "ujsx→mdast",
match: (node) => isUElement(node) && node.type === "heading", // derived from UJSX schema
transform: (node, ctx, next) => {
// TODO: implement heading-specific transform
// node.props.level maps to mdast heading depth
// node.children maps to mdast heading children
throw new Error("Not implemented: heading transform");
},
priority: 1,
});
```
### 5.2 From Type Definitions to Host Config Element Types
Each TypeBox `Type.XXX()` call in a module definition becomes a potential UJSX element type in a host config:
- `Type.Object({...})``<Object>` element
- `Type.String()``<String>` element
- `Type.Ref("X")``<Ref name="X" />` element
- `Type.Literal("heading")``<Literal value="heading" />` element
- `Type.Array(Type.Ref("Node"))``<Array><Ref name="Node" /></Array>`
This is a functor from the TypeBox Schema Category to the UJSX Element Category — every TypeBox constructor maps to a UJSX element, and TypeBox composition maps to UJSX tree composition.
### 5.3 The Proxy Insight: Call Trees and Definition Trees
The proxy experiment in `/research/typebox_research/ujsx/proxy.ts` traced TypeBox's internal call tree. The relationship:
```
Call tree (imperative): Definition tree (declarative):
Type.Object({ <Object name="MyType">
name: Type.String(), <property name="name"><String /></property>
... ...
}) </Object>
```
These are the same information, just in different traversal orders. The **call tree** is a depth-first execution trace; the **definition tree** is the resulting data structure. UJSX's declarative form unifies them — the JSX element IS the definition tree, and rendering it through a host produces the call tree.
---
## 6. Proof as Program: Verification via Generation
### 6.1 The Curry-Howard-Lambek Correspondence
The Curry-Howard-Lambek correspondence establishes:
| Logic | Programming | Category Theory |
|-------|------------|-----------------|
| Propositions | Types | Objects |
| Proofs | Programs | Morphisms |
| Implication A → B | Function type A → B | Exponential object B^A |
| Conjunction A ∧ B | Product type A × B | Product object |
| Disjunction A B | Union type A \| B | Coproduct object |
Under this correspondence: a TypeBox schema is a **proposition** (a specification of what valid data looks like). A value that passes `Value.Check(schema, data)` is a **proof** (evidence that the proposition holds for this data). A transform `f: SchemaA → SchemaB` is a **proof that if A-structured data exists, then B-structured data can be derived from it**.
### 6.2 Faker as Proof Generator
`typebox-schema-faker` generates random inhabitants of TypeBox schemas. In logical terms, it generates **witnesses** — concrete evidence that a type (proposition) is inhabited. This serves as:
1. **Model checking** — generate many instances of UJSX trees, transform them, validate the output against the target schema. If all generated instances pass, you have statistical confidence in the transform's correctness.
2. **Round-trip verification** — for adjoint functor pairs:
```
for (const x of generate(UJSXSchema)) {
const y = F(x) // ujsx → mdast
const x2 = G(y) // mdast → ujsx
assert(equivalent(x, x2, fidelityLevel))
}
```
This tests the adjunction laws for generated inhabitants.
3. **Schema coverage** — faker can be guided to hit edge cases (null children, empty arrays, deeply nested structures) that hand-written tests miss.
### 6.3 Compositional Verification
If transform `f: A → B` is verified and transform `g: B → C` is verified, then `g ∘ f: A → C` should also be correct by composition. The `composedOf` field from the graphology pattern ([`category-theory-graph.md`](./category-theory-graph.md)) can track this:
```typescript
interface TransformRule<T, U, A> {
name: string;
match: (node: T) => boolean;
transform: (node: T, ctx: TransformContext<A>, next: TransformFn<T, U, A>) => U;
priority?: number;
composedOf?: [string, string]; // provenance: which rules composed to produce this one
}
```
This enables provenance tracking — every composed transform knows its derivation, enabling targeted re-verification when a base rule changes.
---
## 7. The TypeDef Bridge: Binary Layout as a Category
### 7.1 Standard TypeBox vs TypeDef
Standard TypeBox types represent **JSON values** — data serialized as JSON, validated as JSON Schema. The TypeDef system adds **binary layout semantics**:
| Standard TypeBox | TypeDef | Difference |
|-----------------|---------|-----------|
| `Type.Number()` | `TF.Float32()`, `TF.Int32()`, etc. | Fixed-width, specific encoding |
| `Type.String()` | `TF.String()` | Same, but in struct context has known offset |
| `Type.Object({...})` | `TF.Struct({...})` | Ordered fields, computable offsets |
| `Type.Union([...], 'type')` | `TF.Union([...], 'type')` | Tagged union with discriminant |
The `TypeDef:Kind` markers (`TypeDef:Float32`, `TypeDef:Struct`, etc.) act as **discriminants in a functor** — they map from the abstract number type to the concrete binary representation.
### 7.2 Offset Computation as a Functor
Given a `TF.Struct({ name: TF.String(), embedding: TF.Array(TF.Float32()) })`:
1. The **TypeBox functor** maps this to a JSON Schema that validates JSON objects with `name` (string) and `embedding` (number array).
2. The **Binary layout functor** maps this to a byte offset table: `name` at offset 0 (variable, preceded by length), `embedding` at next aligned offset, with shape `[768]` → `768 × 4 = 3072` bytes.
These are **two different functors from the same source category** (TypeDef schemas) to two different target categories (JSON validation and binary layout). They're naturally isomorphic in the sense that every TypeDef schema has both a JSON interpretation and a binary interpretation.
### 7.3 SafeTensor as a Container Category
The SafeTensor format is: `header_length (8 bytes) + header (JSON) + data (raw bytes)`
This is a **coproduct** (sum type): `SafeTensor = Header × Data`. The header contains `TensorRef` entries with `data_offsets: [start, end]` — these are morphisms from the header category to positions in the data category.
The metatensor concept adds `TF.Struct` entries alongside raw tensors, enabling a third functor: from TypeDef schemas to byte ranges within the data section.
---
## 8. Summary: The Categorical Structure of UJSX
```
TypeBox Schema Category
┌──────┼──────┐
│ │ │
F₁ │ F₂ │ F₃ │
┌───┘ │ └───┐
▼ ▼ ▼
UJSX Tree MDAST Tree Binary Layout
Category Category Category
│ ▲
│ G₂ │
└──────────┘
│ F₂⁻¹ │
│ │
(adjoint pair: F₂ and G₂)
```
- **F₁** — TypeBox → UJSX host (renders type definitions as JSX elements)
- **F₂** — UJSX → MDAST transform (renders abstract components as markdown AST)
- **G₂** — MDAST → UJSX transform (parses markdown AST back to abstract components)
- **F₃** — TypeDef → Binary Layout (computes offsets and validates binary data)
**F₂ and G₂ should form an adjunction** — verified by round-trip property testing with `typebox-schema-faker`.
**F₁ is a faithful embedding** — TypeBox schemas embed into UJSX without information loss.
**F₃ is a projection** — TypeDef schemas project onto binary layouts, but not all binary layouts correspond to valid TypeDef schemas (lossy in the other direction).
**The ASG/AST impedance mismatch is a reflection, not an equivalence** — some ASG nodes have no faithful AST representation, and this is expected and correct.
---
## 9. Practical Next Steps
1. **Fork `typebox-schema-faker`** into the alkdev ecosystem for model checking
2. **Add `composedOf` provenance tracking** to `TransformRule`
3. **Add `fidelity` levels** to transform rules (`"exact"` | `"structural"` | `"lossy"`)
4. **Implement the TypeBox host** — prove F₁ works before building F₂/F₃
5. **Use `ts-to-module.ts` codegen** to produce module definitions, then derive skeleton transform rules from schema correspondence analysis
6. **Design the TypeDef UJSX elements** (`<Struct>`, `<Float32>`, etc.) as a second host that computes binary offsets