# Unist Ecosystem Research: JSX → Markdown Pipeline for LLM Consumption
**Date**: 2026-04-28
**Topic**: Feasibility of JSX components → hast → mdast → markdown pipeline using the Unist/syntax-tree ecosystem
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [unist: The Universal Foundation](#2-unist-the-universal-foundation)
3. [hast: Hypertext Abstract Syntax Tree](#3-hast-hypertext-abstract-syntax-tree)
4. [mdast: Markdown Abstract Syntax Tree](#4-mdast-markdown-abstract-syntax-tree)
5. [hast-util-to-mdast: The Key Transform](#5-hast-util-to-mdast-the-key-transform)
6. [mdast-util-to-markdown: Serialization to Markdown](#6-mdast-util-to-markdown-serialization-to-markdown)
7. [remark/rehype Ecosystem](#7-remarkrehype-ecosystem)
8. [unist-util-visit and Related Utilities](#8-unist-util-visit-and-related-utilities)
9. [TypeScript Type Definitions](#9-typescript-type-definitions)
10. [Pipeline Feasibility Assessment](#10-pipeline-feasibility-assessment)
11. [Alternative Approaches](#11-alternative-approaches)
12. [Recommended Architecture](#12-recommended-architecture)
13. [Appendix: Element-to-Markdown Mapping Table](#13-appendix-element-to-markdown-mapping-table)
---
## 1. Executive Summary
The JSX → hast → mdast → markdown pipeline is **feasible and well-supported** by mature, well-typed libraries in the unist/syntax-tree ecosystem. The core transformation chain is:
```
JSX Component Tree  →  hast (HTML AST)      →  mdast (Markdown AST)  →  markdown string
        │                    │                        │                       │
React rendering        hast-util-from-html     hast-util-to-mdast      mdast-util-to-markdown
or react-dom/server    or manual hast          (v10.1.2)               (v2.1.2)
rendering              construction
```
**Key finding**: The hardest step is not hast→mdast→markdown (which is solved by existing, mature libraries), but rather **JSX → hast** and handling **custom components** that have no direct HTML/markdown equivalent. The ecosystem provides excellent tooling for standard HTML elements but requires a custom strategy for framework-specific components.
**Verdict**: Use the existing unist ecosystem libraries for the hast→mdast→markdown steps. Build a custom JSX→hast adapter layer that handles React component rendering and custom element mapping.
---
## 2. unist: The Universal Foundation
**Repository**: https://github.com/syntax-tree/unist
**Current version**: 3.0.0
**License**: CC-BY-4.0
unist is the abstract base specification that hast, mdast, xast, and nlcst all implement. It defines the minimal node interface that all syntax tree nodes share.
### Core Node Interface
```typescript
// TypeScript rendering of the spec's IDL definitions
interface Node {
  type: string          // Non-empty string identifying the node variant
  data?: Data           // Ecosystem-specific metadata
  position?: Position   // Source location info
}
interface Data {}       // Open-ended; ecosystems add their own fields
interface Parent extends Node {
  children: Node[]      // Child nodes
}
interface Literal extends Node {
  value: any            // Node's value
}
interface Position {
  start: Point
  end: Point
}
interface Point {
  line: number          // 1-indexed
  column: number        // 1-indexed
  offset?: number       // 0-indexed
}
```
### Design Principles
- All values must be JSON-serializable (no functions, undefined, symbols)
- Trees can survive `JSON.parse(JSON.stringify(tree))` roundtrips
- `data` field is reserved for ecosystem use; specifications never define fields on it
- `position` must be absent on generated nodes
### Why This Matters for Our Pipeline
The JSON-serializability constraint means the AST is inherently portable and can be passed between contexts (server/client, different frameworks). The `data` field provides an escape hatch for custom metadata that custom component handlers can use.
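As a minimal sketch of that portability, the snippet below round-trips a small tree through JSON and stores a custom hint under `data`. The `UNode`, `ULiteral`, and `UParent` interfaces are local stand-ins for the shapes above (they are not imported from `@types/unist`), and the `componentName` key is a hypothetical example of custom metadata.

```typescript
// Local stand-ins for the unist shapes described above (assumed names).
interface UNode {
  type: string
  data?: Record<string, unknown>
}
interface ULiteral extends UNode {
  value: string
}
interface UParent extends UNode {
  children: Array<ULiteral | UParent>
}

// `componentName` is a hypothetical metadata key, not part of any spec.
const original: UParent = {
  type: 'root',
  children: [
    { type: 'text', value: 'Alice', data: { componentName: 'DataField' } }
  ]
}

// Simulate crossing a server/client boundary.
const wire: string = JSON.stringify(original)
const restored = JSON.parse(wire) as UParent

// The tree survives the round trip unchanged, including `data`.
const survivedRoundTrip: boolean = JSON.stringify(restored) === wire
```

Because nothing in the tree is a function or class instance, a handler running in another process can read `data.componentName` exactly as it was written.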
---
## 3. hast: Hypertext Abstract Syntax Tree
**Repository**: https://github.com/syntax-tree/hast
**Spec version**: 2.4.0
**Type definitions**: `@types/hast`
**Stars**: 892
hast represents HTML (and embedded SVG/MathML) as an abstract syntax tree. It extends unist.
### Node Types
| Node Type | Extends | Description | Key Fields |
|-----------|---------|-------------|------------|
| **`Root`** | Parent | Document root | `children` |
| **`Element`** | Parent | HTML element | `tagName`, `properties`, `children`, `content?` |
| **`Text`** | Literal | Text content | `value` |
| **`Comment`** | Literal | HTML comment | `value` |
| **`Doctype`** | Node | Document type declaration | (none beyond unist Node) |
### Element Interface (the workhorse)
```typescript
interface Element extends Parent {
  type: 'element'
  tagName: string         // e.g., 'div', 'span', 'custom-card'
  properties: Properties  // HTML attributes mapped to DOM properties
  content?: Root          // Only for <template> elements
  children: Array<Comment | Element | Text>
}
```
### Properties System
hast uses DOM-style property names, not HTML attribute names:
| HTML Attribute | hast Property |
|----------------|---------------|
| `class` | `className` (array: `['foo', 'bar']`) |
| `for` | `htmlFor` |
| `data-*` | `data*` (camelCase) |
| `aria-*` | `aria*` (camelCase) |
| `tabindex` | `tabIndex` |
| `colspan` | `colSpan` |
Property values:
- Boolean attributes: `true`/`false`
- Numeric attributes: `number`
- Space-separated: `string[]` (e.g., `className: ['foo', 'bar']`)
- All other: `string`
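The naming rule can be sketched with a small lookup. This is an illustration only: `toPropertyName` is a hypothetical helper that covers the explicit renames from the table plus `data-*` attributes, and it deliberately leaves `aria-*` and every other attribute untouched. Real tooling relies on a full attribute database rather than a four-entry map.

```typescript
// Explicit renames taken from the table above.
const RENAMES = new Map<string, string>([
  ['class', 'className'],
  ['for', 'htmlFor'],
  ['tabindex', 'tabIndex'],
  ['colspan', 'colSpan']
])

// Hypothetical helper: handles only the renames above and `data-*`.
function toPropertyName(attribute: string): string {
  const name = attribute.toLowerCase()
  const renamed = RENAMES.get(name)
  if (renamed !== undefined) return renamed
  if (name.startsWith('data-') && name.length > 5) {
    // `data-md-type` -> `mdType` -> `dataMdType`
    const rest = name
      .slice(5)
      .replace(/-([a-z])/g, (_match: string, letter: string) => letter.toUpperCase())
    return 'data' + rest.charAt(0).toUpperCase() + rest.slice(1)
  }
  // Anything else (including `aria-*`) is returned unchanged in this sketch.
  return name
}
```

The `data-*` branch matters later in this document: an attribute such as `data-md-type` is read back in a handler as `properties.dataMdType`.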
### Example: hast tree for HTML
HTML:
```html
<a href="https://alpha.com" class="bravo" download>Link</a>
```
hast:
```json
{
  "type": "element",
  "tagName": "a",
  "properties": {
    "href": "https://alpha.com",
    "className": ["bravo"],
    "download": true
  },
  "children": [{"type": "text", "value": "Link"}]
}
```
### Key Utilities for hast Construction
- **`hastscript`** (v9.0.1) — `h()` function to create hast trees, like React's `createElement`. Supports JSX via automatic runtime (`@jsxImportSource hastscript`).
- **`hast-util-from-html`** — Parse HTML string to hast
- **`hast-util-from-dom`** — Convert browser DOM nodes to hast
- **`hast-util-to-html`** — Serialize hast to HTML string
- **`hast-util-to-jsx-runtime`** (v2.3.6) — Convert hast to React/Preact/Solid/Svelte/Vue (the *reverse* direction of what we need)
- **`hast-util-select`** — CSS selector queries on hast trees (`querySelector`, etc.)
### hastscript JSX Support
This is significant: hastscript supports using JSX syntax to directly create hast trees:
```jsx
/** @jsxImportSource hastscript */
const tree = (
  <div class="foo" id="some-id">
    <span>some text</span>
    <input type="text" value="foo" />
  </div>
)
```
This produces a **hast tree**, not a React element. This is a potential alternative entry point for our pipeline.
---
## 4. mdast: Markdown Abstract Syntax Tree
**Repository**: https://github.com/syntax-tree/mdast
**Spec version**: 5.0.0
**Type definitions**: `@types/mdast`
**Stars**: 1.4k
mdast represents markdown as an abstract syntax tree. It extends unist.
### Core Node Types (CommonMark)
| Node Type | Category | Description | Key Fields |
|-----------|----------|-------------|------------|
| **`Root`** | — | Document root | `children` |
| **`Paragraph`** | Content | Text paragraph | `children: [PhrasingContent]` |
| **`Heading`** | Flow | Section heading | `depth: 1-6`, `children: [PhrasingContent]` |
| **`Blockquote`** | Flow | Quoted section | `children: [FlowContent]` |
| **`List`** | Flow | Ordered/unordered list | `ordered`, `start`, `spread`, `children: [ListItem]` |
| **`ListItem`** | ListContent | List item | `spread`, `checked?`, `children: [FlowContent]` |
| **`Code`** | Flow | Fenced/indented code block | `value`, `lang?`, `meta?` |
| **`ThematicBreak`** | Flow | Horizontal rule `---` | (none) |
| **`Html`** | Flow/Phrasing | Raw HTML in markdown | `value` |
| **`Definition`** | Content | Link/image reference def | `identifier`, `label`, `url`, `title` |
| **`Text`** | Phrasing | Plain text | `value` |
| **`Emphasis`** | Phrasing | Italic `*text*` | `children: [PhrasingContent]` |
| **`Strong`** | Phrasing | Bold `**text**` | `children: [PhrasingContent]` |
| **`InlineCode`** | Phrasing | Inline code `` `code` `` | `value` |
| **`Break`** | Phrasing | Hard line break | (none) |
| **`Link`** | Phrasing | Hyperlink | `url`, `title?`, `children: [PhrasingContent]` |
| **`LinkReference`** | Phrasing | Link by reference | `identifier`, `label`, `referenceType` |
| **`Image`** | Phrasing | Image | `url`, `title?`, `alt?` |
| **`ImageReference`** | Phrasing | Image by reference | `identifier`, `label`, `referenceType`, `alt?` |
### GFM Extension Nodes
| Node Type | Description | Key Fields |
|-----------|-------------|------------|
| **`Delete`** | Strikethrough `~~text~~` | `children: [PhrasingContent]` |
| **`Table`** | Table | `align?: [alignType]`, `children: [TableRow]` |
| **`TableRow`** | Table row | `children: [TableCell]` |
| **`TableCell`** | Table cell | `children: [PhrasingContent]` |
| **`FootnoteDefinition`** | Footnote def | `identifier`, `label`, `children: [FlowContent]` |
| **`FootnoteReference`** | Footnote ref | `identifier`, `label` |
### Content Model Hierarchy
```
MdastContent    = FlowContent | ListContent | PhrasingContent
Content         = Definition | Paragraph
FlowContent     = Blockquote | Code | Heading | Html | List | ThematicBreak | Content
ListContent     = ListItem
PhrasingContent = Break | Emphasis | Html | Image | ImageReference | InlineCode
                | Link | LinkReference | Strong | Text
+ GFM: Delete | FootnoteReference (phrasing); Table | FootnoteDefinition (flow)
```
This hierarchy is critical: hast-util-to-mdast must map HTML's content model (which doesn't have this distinction) into mdast's strict flow/phrasing content model.
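To make the flow/phrasing split concrete, here is a minimal sketch of the wrapping step: runs of adjacent phrasing nodes are grouped into paragraphs so the result contains flow content only. The `Phrasing` and `Flow` types are simplified local stand-ins, and `wrapLoosePhrasing` is a hypothetical name. This is not how `hast-util-to-mdast` is implemented; it also handles whitespace-only text and many more node types.

```typescript
// Simplified local stand-ins for a few mdast node shapes (assumed).
type Phrasing =
  | { type: 'text'; value: string }
  | { type: 'strong'; children: Phrasing[] }
type Flow =
  | { type: 'paragraph'; children: Phrasing[] }
  | { type: 'heading'; depth: 1 | 2 | 3 | 4 | 5 | 6; children: Phrasing[] }

function isPhrasing(node: Flow | Phrasing): node is Phrasing {
  return node.type === 'text' || node.type === 'strong'
}

// Wrap each run of adjacent phrasing nodes in a paragraph.
function wrapLoosePhrasing(nodes: Array<Flow | Phrasing>): Flow[] {
  const result: Flow[] = []
  let run: Phrasing[] = []
  const flush = (): void => {
    if (run.length > 0) {
      result.push({ type: 'paragraph', children: run })
      run = []
    }
  }
  for (const node of nodes) {
    if (isPhrasing(node)) {
      run.push(node)
    } else {
      flush()
      result.push(node)
    }
  }
  flush()
  return result
}

// Roughly the shape of <div>Text <h2>Heading</h2> More text</div>.
const flow = wrapLoosePhrasing([
  { type: 'text', value: 'Text' },
  { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Heading' }] },
  { type: 'text', value: 'More text' }
])
```

The loose text on either side of the heading ends up in its own paragraph, which is the behavior a serializer needs in order to emit blank-line-separated blocks.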
### Mixin Types
```typescript
interface Resource {
  url: string
  title?: string
}
interface Alternative {
  alt?: string
}
interface Association {
  identifier: string
  label?: string
}
interface Reference {
  referenceType: 'shortcut' | 'collapsed' | 'full'
}
```
---
## 5. hast-util-to-mdast: The Key Transform
**Repository**: https://github.com/syntax-tree/hast-util-to-mdast
**Current version**: 10.1.2
**License**: MIT
**Stars**: 43
This is the critical library in the pipeline — it converts hast (HTML AST) into mdast (Markdown AST).
### API
```typescript
import { toMdast } from 'hast-util-to-mdast'
const mdastTree = toMdast(hastTree, options?)
```
### Options
```typescript
interface Options {
  newlines?: boolean      // Keep line endings when collapsing whitespace (default: false)
  checked?: string        // Value for checked checkbox (default: '[x]')
  unchecked?: string      // Value for unchecked checkbox (default: '[ ]')
  quotes?: string[]       // Quote characters for <q> nesting (default: ['"'])
  document?: boolean      // Whether tree is a complete document (default: auto-detect)
  handlers?: Record<string, Handle>          // Custom element handlers
  nodeHandlers?: Record<string, NodeHandle>  // Custom node type handlers
}
```
### Custom Handlers
This is the **extensibility mechanism** most relevant to our use case:
```typescript
type Handle = (
  state: State,
  element: Element,
  parent: HastParent
) => Array<MdastNode> | MdastNode | undefined
type NodeHandle = (
  state: State,
  node: any,
  parent: HastParent
) => Array<MdastNode> | MdastNode | undefined
```
The `handlers` option maps HTML tag names to custom conversion functions. Custom handlers are **merged** into the defaults, so you can override specific tags without reimplementing everything.
The `nodeHandlers` option maps hast node types (like `'text'`, `'comment'`) to custom handlers.
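The merge semantics can be illustrated with a toy dispatcher. Everything below is a simplified stand-in: the `Toy*` types, `toyDefaults`, `mergeHandlers`, and `convert` are hypothetical names, the handlers return strings rather than mdast nodes, and `user-badge` is an invented custom tag. The only point is that custom entries override defaults by tag name while the remaining defaults stay available.

```typescript
// Simplified local shapes; real handlers receive hast nodes and return mdast.
interface ToyText { type: 'text'; value: string }
interface ToyElement {
  type: 'element'
  tagName: string
  children: Array<ToyElement | ToyText>
}
type ToyHandle = (element: ToyElement) => string

function textOf(element: ToyElement): string {
  return element.children
    .map((child) => (child.type === 'text' ? child.value : textOf(child)))
    .join('')
}

const toyDefaults: Record<string, ToyHandle> = {
  strong: (element) => `**${textOf(element)}**`,
  em: (element) => `*${textOf(element)}*`
}

// Custom entries are spread last, so they win on key collisions.
function mergeHandlers(custom: Record<string, ToyHandle>): Record<string, ToyHandle> {
  return { ...toyDefaults, ...custom }
}

// Dispatch on tag name; unknown tags fall back to their text content.
function convert(handlers: Record<string, ToyHandle>, element: ToyElement): string {
  const handle = handlers[element.tagName]
  return handle ? handle(element) : textOf(element)
}

const merged = mergeHandlers({
  em: (element) => `_${textOf(element)}_`,          // override a default
  'user-badge': (element) => `@${textOf(element)}`  // hypothetical custom tag
})
```

With this shape, supporting a new component means adding one entry, and removing the entry restores the default behavior.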
### State Object
Passed to all handlers:
```typescript
interface State {
  patch: (from: HastNode, to: MdastNode) => undefined       // Copy positional info
  one: (node: HastNode, parent?: HastParent) => MdastNode   // Transform single node
  all: (parent: HastParent) => Array<MdastContent>          // Transform children
  toFlow: (nodes: Array<MdastContent>) => Array<MdastFlowContent>  // Promote to flow content
  resolve: (url: string | null | undefined) => string       // Resolve URLs
  options: Options
  elementById: Map<string, Element>
  handlers: Record<string, Handle>
  nodeHandlers: Record<string, NodeHandle>
  inTable: boolean    // Whether we're inside a table
  qNesting: number    // <q> nesting depth
}
```
### How It Handles Different Element Categories
#### Inline Elements (Phrasing Content)
| HTML Element | mdast Node | Notes |
|-------------|------------|-------|
| `<strong>`, `<b>` | `strong` | Children processed recursively |
| `<em>`, `<i>` | `emphasis` | Children processed recursively |
| `<code>` | `inlineCode` | Value extracted from text child |
| `<a href="...">` | `link` | `url` from `href`, `title` from `title` attr |
| `<br>` | `break` | Hard line break |
| `<del>`, `<s>`, `<strike>` | `delete` (GFM) | Strikethrough |
| `<q>` | `text` with quotes | Uses `quotes` option for nesting |
| `<img>` | `image` | `url` from `src`, `alt` from `alt` attr |
| `<sub>`, `<sup>`, `<mark>`, etc. | Text content only | Non-semantic in markdown; children extracted |
| `<input type="checkbox">` | `text` | Uses `checked`/`unchecked` options |
#### Block Elements (Flow Content)
| HTML Element | mdast Node | Notes |
|-------------|------------|-------|
| `<h1>` to `<h6>` | `heading` (depth 1-6) | |
| `<p>` | `paragraph` | |
| `<blockquote>` | `blockquote` | |
| `<ul>` | `list` (ordered: false) | |
| `<ol>` | `list` (ordered: true, start) | |
| `<li>` | `listItem` | GFM: `checked` for task lists |
| `<pre><code>` | `code` | `lang` from class (`language-js`), `meta` from data attributes |
| `<table>` | `table` (GFM) | With `align` from `align` attribute |
| `<tr>` | `tableRow` | |
| `<td>`/`<th>` | `tableCell` | |
| `<hr>` | `thematicBreak` | |
| `<dl>`, `<dt>`, `<dd>` | Paragraphs | No markdown equivalent; downgraded |
#### Special Behaviors
- **`<template>`**: Content is processed from the `content` field
- **`<noscript>`**: Children processed as if scripting is disabled
- **`<svg>`**, **`<math>`**: **Ignored** by default (no markdown equivalent)
- **`<video>`**, **`<audio>`**, **`<iframe>`**: Converted to **links** to the source
- **`<form>`** elements: Processed for their text content
- **Implicit paragraphs**: The algorithm correctly handles HTML's implicit paragraph model (e.g., text + heading inside a container gets proper paragraph wrapping)
#### `data-mdast` Attribute
Elements with `data-mdast="ignore"` are excluded from output:
```html
<p><strong>Important</strong> and <em data-mdast="ignore">ignored</em>.</p>
```
→ `**Important** and .`
### Algorithm
The algorithm is described as "very powerful" and handles all HTML elements including ancient and obscure ones. It is particularly good at:
1. **Implicit/explicit paragraph handling**: Correctly wraps loose text in paragraphs when adjacent to block elements
2. **Whitespace collapsing**: Collapses inter-element whitespace to single spaces (configurable with `newlines`)
3. **Content model enforcement**: Ensures phrasing content doesn't end up in flow contexts (auto-wraps in paragraphs)
4. **GFM output**: Tables produce GFM `table` nodes; `<del>`/`<s>`/`<strike>` produce `delete` nodes
### Custom Handler Example: Preserving SVG as Raw HTML
```typescript
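import { toHtml } from 'hast-util-to-html'
import { toMdast } from 'hast-util-to-mdast'

// `hast` is the tree to convert; each <svg> is serialized and kept as raw HTML
const mdast = toMdast(hast, {
  handlers: {
    svg(state, node) {
      const result = { type: 'html', value: toHtml(node, { space: 'svg' }) }
      state.patch(node, result)
      return result
    }
  }
})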
```
This pattern — converting an unhandled element to an mdast `html` node — is the standard escape hatch for elements that don't map cleanly to markdown.
---
## 6. mdast-util-to-markdown: Serialization to Markdown
**Repository**: https://github.com/syntax-tree/mdast-util-to-markdown
**Current version**: 2.1.2
**License**: MIT
**Stars**: 139
This library serializes an mdast tree back to a markdown string.
### API
```typescript
import { toMarkdown } from 'mdast-util-to-markdown'
const markdown = toMarkdown(mdastTree, options?)
```
### Key Options
```typescript
interface Options {
  // List formatting
  bullet?: '*' | '+' | '-'                  // Unordered list marker (default: '*')
  bulletOther?: '*' | '+' | '-'             // Fallback list marker (default: '-')
  bulletOrdered?: '.' | ')'                 // Ordered list marker (default: '.')
  listItemIndent?: 'mixed' | 'one' | 'tab'  // List item indentation (default: 'one')
  incrementListMarker?: boolean             // Increment ordered list numbers (default: true)
  // Heading formatting
  closeAtx?: boolean           // Close ATX headings with trailing #s (default: false)
  setext?: boolean             // Use setext headings when possible (default: false)
  // Emphasis/strong markers
  emphasis?: '*' | '_'         // Emphasis marker (default: '*')
  strong?: '*' | '_'           // Strong marker (default: '*')
  // Code blocks
  fence?: '`' | '~'            // Fenced code marker (default: '`')
  fences?: boolean             // Always use fenced code (default: true)
  // Links
  resourceLink?: boolean       // Always use resource links (default: false)
  quote?: '"' | "'"            // Title quote character (default: '"')
  // Thematic breaks
  rule?: '*' | '-' | '_'       // Thematic break marker (default: '*')
  ruleRepetition?: number      // Number of markers (default: 3, min: 3)
  ruleSpaces?: boolean         // Spaces between markers (default: false)
  // Definitions
  tightDefinitions?: boolean   // No blank lines between definitions (default: false)
  // Extensibility
  handlers?: Handlers          // Custom node type handlers
  join?: Array<Join>           // Custom block-joining behavior
  unsafe?: Array<Unsafe>       // Characters that need escaping in contexts
  extensions?: Array<Options>  // Extension options (e.g., GFM)
}
```
### GFM Support
GFM output is achieved by using the `mdast-util-gfm` extension:
```typescript
import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'
const markdown = toMarkdown(tree, {
  extensions: [gfmToMarkdown()]
})
```
This adds support for:
- Tables (`| col1 | col2 |`)
- Strikethrough (`~~text~~`)
- Task lists (`- [x] item`)
- Autolink literals
- Footnotes
### Safety / Escaping
The library carefully escapes characters that would be interpreted as markdown syntax:
```typescript
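// Characters that would otherwise be read as markdown syntax are escaped.
// Input mdast:  text node { type: 'text', value: '- a\nb !' } followed by a link [c](d)
// Output:       \- a\nb \![c](d)
// The leading "-" would start a list item; the "!" before "[" would start an image.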
```
This is handled via the `Unsafe` type system, which specifies which characters are dangerous in which constructs.
### Custom Handlers
```typescript
type Handle = (node, parent, state, info) => string
type Handlers = Record<Node['type'], Handle>
```
Custom handlers can be provided for any node type, including custom/extension node types.
---
## 7. remark/rehype Ecosystem
### Architecture
```
unified (core processor)
├── remark (markdown) ─── mdast ─── remarkParse / remarkStringify
├── rehype (HTML) ────── hast ───── rehypeParse / rehypeStringify
└── retext (NLP) ────── nlcst ───── retextEnglish / retextContent
```
### Cross-Ecosystem Plugins
| Plugin | Direction | Description |
|--------|-----------|-------------|
| `remark-rehype` | mdast → hast | Markdown to HTML (the common direction) |
| `rehype-remark` | hast → mdast | HTML to Markdown (our direction!) |
| `remark-retext` | mdast → nlcst | Markdown to NLP |
| `rehype-retext` | hast → nlcst | HTML to NLP |
### rehype-remark (v10.0.1)
**Repository**: https://github.com/rehypejs/rehype-remark
**Stars**: 99
This is the higher-level wrapper around `hast-util-to-mdast`. It operates as a unified plugin:
```typescript
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'
const file = await unified()
  .use(rehypeParse)      // HTML → hast
  .use(rehypeRemark)     // hast → mdast (uses hast-util-to-mdast internally)
  .use(remarkStringify)  // mdast → markdown (uses mdast-util-to-markdown internally)
  .process(htmlString)
console.log(String(file))
```
**Recommendation**: For our JSX→markdown pipeline, we should use the lower-level utilities (`hast-util-to-mdast` + `mdast-util-to-markdown`) directly rather than the unified processor chain. This avoids the overhead of the unified pipeline and gives us more control.
However, `rehype-remark` plus remark plugins could be useful if we want to add post-processing transformations (e.g., `remark-gfm` for explicit GFM support).
### Available remark Plugins (for post-processing)
- `remark-gfm` — GFM syntax support
- `remark-frontmatter` — YAML/TOML frontmatter
- `remark-mdx` — MDX syntax support
- `remark-lint` — Markdown linting
- `remark-toc` — Table of contents generation
- `remark-comment-config` — Configure remark from HTML comments
- 150+ total plugins
---
## 8. unist-util-visit and Related Utilities
**Repository**: https://github.com/syntax-tree/unist-util-visit
**Current version**: 5.1.0
**License**: MIT
### Core Traversal: unist-util-visit
```typescript
import { visit, CONTINUE, EXIT, SKIP } from 'unist-util-visit'
visit(tree, 'heading', (node, index, parent) => {
  // Return CONTINUE (default), EXIT, SKIP, or a new index
  if (node.depth === 1) return EXIT
  if (node.depth === 2) return SKIP  // Don't visit children
  return [SKIP, 5]                   // Skip children, continue from index 5
})
```
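The control-flow idea behind those return values can be shown with a hand-rolled walker. This is a simplified sketch, not the code of `unist-util-visit`: `WalkNode`, `walk`, `STOP`, and `SKIP_CHILDREN` are hypothetical names, and the index-returning and tuple forms are not implemented.

```typescript
// Local stand-in for a unist-style node (assumed shape).
interface WalkNode {
  type: string
  depth?: number
  children?: WalkNode[]
}

const STOP = 'stop' as const
const SKIP_CHILDREN = 'skip' as const
type Action = typeof STOP | typeof SKIP_CHILDREN | undefined

// Depth-first, pre-order walk. Returns false once a visitor asked to stop.
function walk(
  node: WalkNode,
  visitor: (node: WalkNode, parent: WalkNode | undefined) => Action,
  parent?: WalkNode
): boolean {
  const action = visitor(node, parent)
  if (action === STOP) return false
  if (action !== SKIP_CHILDREN) {
    for (const child of node.children ?? []) {
      if (!walk(child, visitor, node)) return false
    }
  }
  return true
}

const doc: WalkNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 2, children: [{ type: 'text' }] },
    { type: 'paragraph', children: [{ type: 'text' }] },
    { type: 'heading', depth: 1, children: [{ type: 'text' }] },
    { type: 'paragraph', children: [{ type: 'text' }] }
  ]
}

const seen: string[] = []
walk(doc, (node): Action => {
  seen.push(node.type)
  if (node.type === 'heading' && node.depth === 1) return STOP
  if (node.type === 'heading') return SKIP_CHILDREN
  return undefined
})
```

Here the depth-2 heading is visited but its text child is skipped, and the walk ends at the depth-1 heading, so the final paragraph is never reached.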
### Complete Utility Landscape
| Utility | Purpose | Stars |
|---------|---------|-------|
| `unist-util-visit` | Walk tree depth-first | 346 |
| `unist-util-visit-parents` | Walk with parent stack | — |
| `unist-util-is` | Check if node matches test | — |
| `unist-util-filter` | Create filtered tree | — |
| `unist-util-map` | Create mapped tree | — |
| `unist-util-remove` | Remove nodes from tree | — |
| `unist-util-select` | CSS-like selectors on trees | — |
| `unist-util-find-after` | Find node after another | — |
| `unist-util-find-and-replace` | Find/replace text in tree | — |
| `unist-builder` | Create trees programmatically | — |
### hast-specific Utilities
| Utility | Purpose |
|---------|---------|
| `hast-util-is-element` | Check if node is (a specific) element |
| `hast-util-select` | querySelector/querySelectorAll on hast |
| `hast-util-find-and-replace` | Text find/replace in hast |
| `hast-util-classnames` | Merge class names |
| `hast-util-to-string` | Get textContent |
| `hast-util-to-text` | Get innerText |
| `hast-util-phrasing` | Check if node is phrasing content |
| `hast-util-heading` | Check if node is heading content |
| `hast-util-embedded` | Check if node is embedded content |
| `hast-util-sanitize` | Sanitize tree (XSS prevention) |
### mdast-specific Utilities
| Utility | Purpose |
|---------|---------|
| `mdast-util-to-string` | Get plain text content |
| `mdast-util-definitions` | Find definition nodes |
| `mdast-util-heading-range` | Use headings as ranges |
| `mdast-util-toc` | Generate TOC |
| `mdast-util-phrasing` | Check if node is phrasing content |
| `mdast-util-gfm` | GFM parse/serialize |
| `mdast-util-gfm-table` | GFM tables specifically |
| `mdast-util-directive` | Generic directives |
---
## 9. TypeScript Type Definitions
### Availability
| Package | Types | Source |
|---------|-------|--------|
| `@types/unist` | Types-only package | DefinitelyTyped |
| `@types/hast` | Types-only package | DefinitelyTyped |
| `@types/mdast` | Types-only package | DefinitelyTyped |
| `hast-util-to-mdast` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `mdast-util-to-markdown` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `unist-util-visit` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `hastscript` | **Shipped with package** | Included TypeScript types |
| `hast-util-to-jsx-runtime` | **Shipped with package** | Included TypeScript types |
| `rehype-remark` | **Shipped with package** | Included TypeScript types |
| `unified` | **Shipped with package** | Included TypeScript types |
### Type Quality
**Excellent**. Packages in the syntax-tree and unified ecosystems are typically written in JavaScript with JSDoc type annotations and ship generated `.d.ts` files, while the node types themselves live in the `@types/*` packages. The type definitions are comprehensive and well-maintained. Key observations:
1. **Hast types are well-defined**: `Element`, `Text`, `Comment`, `Root`, `Properties`, `PropertyValue` are all properly typed
2. **Mdast types are well-defined**: All node types with their specific fields, plus content model types like `FlowContent`, `PhrasingContent`
3. **Handler types are exported**: `Handle`, `NodeHandle`, `Options`, `State` from hast-util-to-mdast
4. **Serializer types are exported**: `Handle`, `Handlers`, `Options`, `Unsafe`, `Join` from mdast-util-to-markdown
5. **Generic node types**: The TypeScript types support discriminated unions on `node.type`
### Type Usage Example
```typescript
import type { Element, Root, Text } from 'hast'
import type { Root as MdastRoot, Heading, Code } from 'mdast'
import type { Handle, Options as ToMdastOptions } from 'hast-util-to-mdast'
import type { Options as ToMarkdownOptions } from 'mdast-util-to-markdown'
```
---
## 10. Pipeline Feasibility Assessment
### The Full Pipeline
```
┌─────────────────┐     ┌───────────┐     ┌───────────┐     ┌───────────┐
│ JSX Components  │ ──▶ │   hast    │ ──▶ │   mdast   │ ──▶ │ markdown  │
│ (React tree)    │     │ (HTML AST)│     │ (MD AST)  │     │ (string)  │
└─────────────────┘     └───────────┘     └───────────┘     └───────────┘
        │                     │                 │                 │
     STEP 1:               STEP 2:           STEP 3:           STEP 4:
     JSX → hast            hast → mdast      mdast → md        md string
                           (the solved       (well-solved)
                            problem)
```
### Step-by-Step Assessment
#### Step 1: JSX → hast (THE HARDEST STEP)
**Challenge**: React components are not HTML. They have:
- Custom component elements (`<Card>`, `<UserAvatar>`) that don't map to HTML tags
- Props that aren't HTML attributes
- Rendering logic (conditionals, loops, state)
- Event handlers that are meaningless in markdown
**Approaches**:
**Approach A: Render to HTML string, then parse to hast**
```typescript
import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'
const html = renderToStaticMarkup(<MyComponent />)
const hast = fromHtml(html, { fragment: true })
```
- Pros: Complete rendering support, handles all React features
- Cons: Loss of custom element information, all components flatten to HTML
**Approach B: Use React's internal fiber tree to build hast directly**
- Pros: More control, can preserve custom element boundaries
- Cons: Relies on React internals, fragile
**Approach C: Use hastscript JSX runtime**
```jsx
/** @jsxImportSource hastscript */
const tree = <div class="card"><h2>Title</h2><p>Body</p></div>
```
- Pros: Direct hast construction, type-safe
- Cons: Can't use React components (no rendering), only raw elements
**Recommendation**: **Approach A** for the base pipeline, with **custom component registry** for handling non-HTML elements (see below).
#### Step 2: hast → mdast (SOLVED by hast-util-to-mdast)
- This step is fully solved by `hast-util-to-mdast` v10.1.2
- Handles all standard HTML elements
- Custom handlers for non-standard elements
- Proper paragraph wrapping, whitespace handling, content model enforcement
- GFM support (tables, strikethrough, task lists)
**Edge cases to watch**:
- Nested inline elements: `<strong><em>bold italic</em></strong>` → `***bold italic***`
- Mixed content: `<div>Text <h2>Heading</h2> More text</div>` → Proper paragraph wrapping
- Deeply nested lists
- Tables with merged cells (not supported in GFM)
- `<br>` inside `<p>`: Produces `break` nodes in phrasing content
#### Step 3: mdast → markdown (SOLVED by mdast-util-to-markdown)
- Fully solved by `mdast-util-to-markdown` v2.1.2
- Proper character escaping
- GFM output with `mdast-util-gfm` extension
- Configurable markers, indentation, heading styles
### Edge Cases and Failure Modes
#### Custom Components (The #1 Challenge)
A React component like `<DataField label="Name" value="Alice" />` renders to HTML like `<div class="data-field"><span class="label">Name</span><span class="value">Alice</span></div>`. But the markdown output would just be "Name Alice" without the structure.
**Solution**: Register custom handlers that use the component's semantic intent:
```typescript
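// Option 1: Use CSS class-based detection
// `handlers` is keyed by tag name, so match the rendered <div> and test its classes.
// `findTextByClass` is a hypothetical helper (not part of any library).
// Assumes: import { defaultHandlers } from 'hast-util-to-mdast'
handlers: {
  div: (state, node, parent) => {
    const classes = node.properties.className
    if (Array.isArray(classes) && classes.includes('data-field')) {
      const label = findTextByClass(node, 'label')
      const value = findTextByClass(node, 'value')
      return { type: 'paragraph', children: [
        { type: 'strong', children: [{ type: 'text', value: label }] },
        { type: 'text', value: `: ${value}` }
      ]}
    }
    // Fall back to the default conversion for every other <div>
    return defaultHandlers.div(state, node, parent)
  }
}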
```
```typescript
// Option 2: Use data attributes for semantic hints
// Render with: <div data-md-type="field" data-md-label="Name">Alice</div>
// Assumes: import { defaultHandlers } from 'hast-util-to-mdast'
handlers: {
  div: (state, node, parent) => {
    const mdType = node.properties.dataMdType
    if (mdType === 'field') {
      // Custom conversion: return an mdast node here
    }
    return defaultHandlers.div(state, node, parent)
  }
}
```
```typescript
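// Option 3: Pre-transform the hast tree before passing to toMdast
// Walk the hast tree and rewrite custom structures into standard elements
// that the default handlers already convert well
// Assumes: import { visit } from 'unist-util-visit'
visit(hast, 'element', (node) => {
  const classes = node.properties.className
  if (Array.isArray(classes) && classes.includes('callout')) {
    // Transform the element's structure to something that maps cleanly
  }
})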
```
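Option 1 calls a `findTextByClass` helper that no library provides. The sketch below shows one way it could be written, together with the `DataField` conversion it supports. The `HText` and `HElement` interfaces are simplified local stand-ins for the hast types, and `textContent` and `dataFieldToParagraph` are hypothetical names.

```typescript
// Simplified local shapes (assumed); real code would import from `hast`.
interface HText { type: 'text'; value: string }
interface HElement {
  type: 'element'
  tagName: string
  properties: { className?: string[] }
  children: Array<HElement | HText>
}

// Concatenate all descendant text of a node.
function textContent(node: HElement | HText): string {
  if (node.type === 'text') return node.value
  return node.children.map((child) => textContent(child)).join('')
}

// Depth-first search for the first descendant element carrying `className`.
function findTextByClass(root: HElement, className: string): string | undefined {
  for (const child of root.children) {
    if (child.type !== 'element') continue
    if (child.properties.className?.includes(className)) return textContent(child)
    const nested = findTextByClass(child, className)
    if (nested !== undefined) return nested
  }
  return undefined
}

// Convert the rendered DataField markup into a `**label**: value` paragraph.
function dataFieldToParagraph(node: HElement) {
  const label = findTextByClass(node, 'label') ?? ''
  const value = findTextByClass(node, 'value') ?? ''
  return {
    type: 'paragraph' as const,
    children: [
      { type: 'strong' as const, children: [{ type: 'text' as const, value: label }] },
      { type: 'text' as const, value: `: ${value}` }
    ]
  }
}

// The markup that <DataField label="Name" value="Alice" /> renders to.
const rendered: HElement = {
  type: 'element',
  tagName: 'div',
  properties: { className: ['data-field'] },
  children: [
    { type: 'element', tagName: 'span', properties: { className: ['label'] }, children: [{ type: 'text', value: 'Name' }] },
    { type: 'element', tagName: 'span', properties: { className: ['value'] }, children: [{ type: 'text', value: 'Alice' }] }
  ]
}
const paragraph = dataFieldToParagraph(rendered)
```

Serialized, a paragraph of this shape would read as `**Name**: Alice`, which keeps the label/value relationship that plain text extraction loses.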
#### SVG/Math Content
- **Default behavior**: Ignored (content lost)
- **Workaround 1**: Convert to mdast `html` node (preserves as raw HTML in markdown)
- **Workaround 2**: Render to an image and use mdast `image` node
- **For LLM consumption**: Image alt text is likely more useful than raw SVG markup
#### Forms and Interactive Elements
- `<input>`, `<select>`, `<textarea>`: Processed for their text content
- Checkboxes become `[x]`/`[ ]` in GFM task lists
- Other form elements: Downgraded to text content
#### CSS-Dependent Layout
- Tables rendered via CSS grid/flexbox (not `<table>` elements) won't produce GFM tables
- Tab components rendered as `<div>` stacks won't produce meaningful markdown
- **Workaround**: Use semantic HTML (`<table>`, `<details>`, etc.) in the component's render output when markdown output is needed
#### Content That Has No Markdown Equivalent
| HTML Construct | Default Behavior | Better Alternative |
|---------------|-----------------|-------------------|
| `<details>/<summary>` | Text content only | Use mdast `html` node or custom directive |
| `<dialog>` | Ignored | Pre-process to extract content |
| `<meter>`, `<progress>` | Text content | Convert to descriptive text |
| `<ruby>`, `<rt>` | Text content | Custom handler for pronunciation annotation |
| `<iframe>` | Link to src | Custom handler for embed description |
| `<video>`, `<audio>` | Link to src | Custom handler for media description |
#### Whitespace and Formatting
- Multiple spaces in HTML collapse to single spaces by default
- `<pre>` whitespace is preserved in `code` nodes
- `newlines: true` option preserves line breaks during whitespace collapsing
- Indentation in `<pre>` blocks may need careful handling
#### Roundtrip Fidelity
Not all markdown constructs survive HTML roundtripping:
- Reference-style links (`[text][id]`) become direct links `[text](url)`
- Setext headings become ATX headings (configurable)
- Tight vs. loose lists may change
- Multiple markdown syntaxes collapse to one (e.g., both `*` and `_` for emphasis become `*`)
**For LLM consumption, this is acceptable** — the goal is readable markdown, not perfect roundtripping.
---
## 11. Alternative Approaches
### Existing Libraries: React/JSX → Markdown
There is **no mature, well-maintained library** that directly converts React component trees to markdown. The approaches that exist are simpler or serve different purposes:
#### 1. react-markdown + remark (the reverse direction)
**React Markdown** renders markdown as React components. This is the **opposite** of what we need. Not relevant.
#### 2. html-to-markdown (turndown)
**Repository**: https://github.com/mixmark-io/turndown
A widely-used HTML-to-markdown converter (not based on unist). It works on HTML strings, not ASTs.
- **Pros**: Simple API, well-tested, many plugins, configurable rules
- **Cons**: Not AST-based, no TypeScript AST types, no GFM table support out-of-box, less extensible than unist ecosystem
- **Comparison**: `hast-util-to-mdast` is more principled (proper AST, content model enforcement, better extensibility)
#### 3. react-to-markdown (hypothetical)
No prominent library exists under this name or concept on npm. Search terms "react to markdown", "jsx to mdast", "component to markdown" yield no direct results.
#### 4. mdast-util-from-adf
Converts Atlassian Document Format to mdast. Demonstrates the pattern of converting external formats to mdast, but for a different source format.
#### 5. hast-util-to-portable-text
Converts hast to Sanity's Portable Text format. Another example of hast → alternative format, but not markdown.
#### 6. Custom JSX → mdast builders
One could build React components that directly produce mdast nodes using `mdast-builder` or manual construction:
```typescript
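import type { Heading, PhrasingContent } from 'mdast'
// A "component" that returns an mdast node instead of rendering HTML
function MarkdownHeading({ depth, children }: { depth: Heading['depth']; children: PhrasingContent[] }): Heading {
  return { type: 'heading', depth, children }
}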
```
- **Pros**: Direct control, no intermediate HTML
- **Cons**: Can't reuse existing React component ecosystem, dual rendering needed
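The builder idea can be sketched without any dependencies. The node shapes below are simplified local stand-ins for the mdast types (a real project would import them from `@types/mdast`), and `serializeSketch` is a deliberately tiny, hypothetical serializer that covers only headings and paragraphs and does no escaping. In practice `mdast-util-to-markdown` would do that job.

```typescript
// Simplified stand-ins for mdast node types (assumption: real code would
// import Heading, Paragraph, and Text from 'mdast').
type MdText = { type: 'text'; value: string }
type MdHeading = { type: 'heading'; depth: 1 | 2 | 3 | 4 | 5 | 6; children: MdText[] }
type MdParagraph = { type: 'paragraph'; children: MdText[] }
type MdRoot = { type: 'root'; children: Array<MdHeading | MdParagraph> }

// Builder functions play the role of "components" that emit mdast directly.
const mdText = (value: string): MdText => ({ type: 'text', value })
const mdHeading = (depth: MdHeading['depth'], ...children: MdText[]): MdHeading =>
  ({ type: 'heading', depth, children })
const mdParagraph = (...children: MdText[]): MdParagraph =>
  ({ type: 'paragraph', children })
const mdRoot = (...children: MdRoot['children']): MdRoot =>
  ({ type: 'root', children })

// Toy serializer: blocks are separated by a blank line, headings get "#" marks.
function serializeSketch(tree: MdRoot): string {
  const blocks = tree.children.map((node) => {
    const inline = node.children.map((child) => child.value).join('')
    return node.type === 'heading' ? `${'#'.repeat(node.depth)} ${inline}` : inline
  })
  return blocks.join('\n\n') + '\n'
}

const builderDoc = mdRoot(
  mdHeading(2, mdText('Install')),
  mdParagraph(mdText('Run the installer.'))
)
const builderOutput = serializeSketch(builderDoc)
```

The sketch makes the trade-off visible: the tree is fully under the author's control, but nothing here can render to a browser, so every component would need a second, visual implementation.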
#### 7. React Server Components + hast
If using RSC, the output could be intercepted and converted to hast before HTML serialization. This is speculative and would require custom infrastructure.
### Comparison Table
| Approach | Maturity | Type Safety | Extensibility | Custom Components | GFM Support |
|----------|----------|-------------|---------------|-------------------|-------------|
| hast-util-to-mdast | **High** (v10) | Excellent | Excellent (handlers) | Needs custom layer | Built-in |
| turndown | High | Poor | Good (rules) | Needs custom rules | Plugin |
| Custom JSX→mdast | Low | Manual | Full control | Full control | Manual |
| unified/rehype-remark | High | Excellent | Excellent (plugins) | Needs custom layer | Via remark-gfm |
**Recommended**: Use `hast-util-to-mdast` + `mdast-util-to-markdown` with a custom JSX→hast adapter.
---
## 12. Recommended Architecture
### Overview
```
┌────────────────────────────────────────────────────────────────────┐
│ JSX → Markdown Pipeline │
│ │
│ ┌──────────────┐ ┌───────────────┐ ┌─────────┐ ┌───────────┐ │
│ │ JSX Renderer │──▶│ hast Builder │──▶│ hast → │──▶│ mdast → │ │
│ │ (React SSR) │ │ (fromHtml + │ │ mdast │ │ markdown │ │
│ │ │ │ pre-process) │ │ │ │ (GFM) │ │
│ └──────────────┘ └───────────────┘ └─────────┘ └───────────┘ │
│ │ │ │ │ │
│ ┌──────────────┐ ┌───────────────┐ │ │ │
│ │ Component │ │ Custom │ │ │ │
│ │ Registry │ │ Handlers │ │ │ │
│ │ (markdown │ │ (data-md- │ │ │ │
│ │ hints) │ │ attributes) │ │ │ │
│ └──────────────┘ └───────────────┘ │ │ │
│ │ │ │
│ ┌────────┘ │ │
│ │ │ │
│ ┌──────────────┐ ┌──────────┘ │
│ │ Post-process │ │ Output │
│ │ (unist-util) │ │ (markdown string) │
│ └──────────────┘ │
└────────────────────────────────────────────────────────────────────┘
```
### Implementation Design
#### Phase 1: Core Pipeline
```typescript
// renderToMarkdown.ts
import type { ReactElement } from 'react'
import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'
import { toMdast } from 'hast-util-to-mdast'
import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'
import type { Root as HastRoot } from 'hast'
import type { Nodes as MdastNodes } from 'mdast'
import type { Handle } from 'hast-util-to-mdast'
import type { Options as MarkdownOptions } from 'mdast-util-to-markdown'
// Note: defaultCustomHandlers is the handler registry defined in Phase 2.
interface RenderOptions {
  /** Custom hast → mdast handlers */
  handlers?: Record<string, Handle>
  /** Markdown serialization options */
  markdownOptions?: MarkdownOptions
  /** Whether to produce GFM output (default: true) */
  gfm?: boolean
  /** Pre-processing hook for the hast tree */
  preProcess?: (hast: HastRoot) => HastRoot
  /** Post-processing hook for the mdast tree */
  postProcess?: (mdast: MdastNodes) => MdastNodes
}
function renderToMarkdown(element: ReactElement, options?: RenderOptions): string {
  // Step 1: Render JSX to HTML
  const html = renderToStaticMarkup(element)
  // Step 2: Parse HTML to hast
  let hast = fromHtml(html, { fragment: true })
  // Step 3: Pre-process hast (optional: semantic annotation, cleanup)
  if (options?.preProcess) {
    hast = options.preProcess(hast)
  }
  // Step 4: Convert hast to mdast (`let`, because Step 5 may reassign it)
  let mdast: MdastNodes = toMdast(hast, {
    handlers: {
      ...defaultCustomHandlers,
      ...options?.handlers,
    },
    document: false,
  })
  // Step 5: Post-process mdast (optional: custom transforms)
  if (options?.postProcess) {
    mdast = options.postProcess(mdast)
  }
  // Step 6: Serialize mdast to markdown, keeping any caller-supplied extensions
  const extensions = [
    ...(options?.gfm !== false ? [gfmToMarkdown()] : []),
    ...(options?.markdownOptions?.extensions ?? []),
  ]
  return toMarkdown(mdast, {
    ...options?.markdownOptions,
    extensions,
  })
}
```
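As an illustration of what a `postProcess` hook might do, the sketch below shifts heading depths so that rendered output can be nested under an outer document. It is dependency-free: `SketchNode` is a simplified stand-in for the mdast node types, and `demoteHeadings` is a hypothetical helper, with a plain recursive walk standing in for `unist-util-visit`.

```typescript
// Minimal stand-in for mdast nodes (assumption: real code would use @types/mdast).
interface SketchNode {
  type: string
  depth?: number
  value?: string
  children?: SketchNode[]
}

// Hypothetical hook: shift every heading down by `offset` levels, capped at 6.
// Returns a new tree and leaves the input untouched.
function demoteHeadings(tree: SketchNode, offset: number): SketchNode {
  const walk = (node: SketchNode): SketchNode => ({
    ...node,
    ...(node.type === 'heading' && node.depth !== undefined
      ? { depth: Math.min(6, node.depth + offset) }
      : {}),
    ...(node.children ? { children: node.children.map(walk) } : {}),
  })
  return walk(tree)
}

const sampleTree: SketchNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 1, children: [{ type: 'text', value: 'Title' }] },
    { type: 'heading', depth: 5, children: [{ type: 'text', value: 'Deep' }] },
  ],
}
const demotedTree = demoteHeadings(sampleTree, 2)
```

With the real types, a hook of this shape could be passed as `postProcess: (tree) => demoteHeadings(tree, 2)`.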
#### Phase 2: Component Registry / Semantic Annotations
The key innovation is a **data attribute convention** that components use to hint markdown semantics:
```tsx
// Component renders with markdown hints
function InfoCallout({ title, children }) {
return (
<div data-md="callout" data-md-callout-type="info">
<strong data-md="callout-title">{title}</strong>
<div data-md="callout-content">{children}</div>
</div>
)
}
```
Handler:
```typescript
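import { toString } from 'hast-util-to-string'
import type { Element } from 'hast'
import type { Handle } from 'hast-util-to-mdast'
// Helper for this convention: first direct child element with a matching data-md value
function findChildByDataMd(node: Element, value: string): Element | undefined {
  return node.children.find(
    (child): child is Element => child.type === 'element' && child.properties.dataMd === value
  )
}
const calloutHandler: Handle = (state, node) => {
  // Property values are not guaranteed to be strings, so coerce before upper-casing
  const calloutType = String(node.properties.dataMdCalloutType ?? 'note').toUpperCase()
  const title = findChildByDataMd(node, 'callout-title')
  const content = findChildByDataMd(node, 'callout-content')
  const label = title ? `${calloutType}: ${toString(title)}` : calloutType
  return {
    type: 'blockquote',
    children: [
      {
        type: 'paragraph',
        children: [{ type: 'strong', children: [{ type: 'text', value: label }] }],
      },
      // Content may contain block nodes, so it follows the label paragraph
      // instead of being nested inside it; toFlow wraps stray inline nodes
      ...(content ? state.toFlow(state.all(content)) : []),
    ],
  }
}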
```
Registry:
```typescript
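import { defaultHandlers } from 'hast-util-to-mdast'
const defaultCustomHandlers: Record<string, Handle> = {
  // Dispatch on the data-md attribute of each <div>
  div: (state, node, parent) => {
    const mdType = node.properties.dataMd
    switch (mdType) {
      case 'callout': return calloutHandler(state, node, parent)
      case 'admonition': return admonitionHandler(state, node, parent) // defined like calloutHandler
      // ... extensible
      // Returning undefined would drop the element and its children,
      // so delegate to the built-in <div> handler explicitly
      default: return defaultHandlers.div(state, node, parent)
    }
  }
}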
```
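If the number of `data-md` values grows, the `switch` could be swapped for a lookup table so that components register their own handlers. The following is a dependency-free sketch of that idea only: `SketchElement` and `SketchMdast` are simplified stand-ins for the hast and mdast types, and `registerHint` and `resolveHint` are hypothetical names, not part of any library.

```typescript
// Simplified stand-ins for hast elements and mdast nodes (assumptions).
interface SketchElement {
  tagName: string
  properties: Record<string, string | undefined>
  children: Array<SketchElement | string>
}
type SketchMdast = { type: string; value?: string; children?: SketchMdast[] }
type HintHandler = (element: SketchElement) => SketchMdast

// Hypothetical registry: maps a data-md value to the handler that owns it.
const hintRegistry = new Map<string, HintHandler>()

function registerHint(name: string, handler: HintHandler): void {
  hintRegistry.set(name, handler)
}

// Returns the converted node, or undefined when no hint applies; the caller
// is then expected to fall back to the default conversion.
function resolveHint(element: SketchElement): SketchMdast | undefined {
  const hint = element.properties.dataMd
  const handler = hint === undefined ? undefined : hintRegistry.get(hint)
  return handler ? handler(element) : undefined
}

// Example registration: a callout becomes a blockquote holding its direct text.
registerHint('callout', (element) => ({
  type: 'blockquote',
  children: [
    {
      type: 'text',
      value: element.children.filter((child) => typeof child === 'string').join(''),
    },
  ],
}))
```

A `div` handler built on this would call `resolveHint` first and delegate to the default handler when it returns `undefined`.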
#### Phase 3: Advanced - Dual-Mode Components
For components that need to render differently in "markdown mode" vs "visual mode":
```tsx
interface MarkdownAwareProps {
  /** When defined, used for markdown output instead of the visual render.
   *  Must be a string, because it travels as an HTML attribute. */
  markdownContent?: string
}
// Illustrative props for the visual table
interface DataTableProps {
  data: Array<Record<string, unknown>>
}
function DataTable({ data, markdownContent }: MarkdownAwareProps & DataTableProps) {
  // Visual mode: rich table with sorting, pagination
  // Markdown mode: simple GFM table, carried in data-md-raw
  return (
    <div className="data-table" data-md="table" data-md-raw={markdownContent}>
      {/* ... rich visual table ... */}
    </div>
  )
}
```
#### Phase 4: Pre-Render Hooks
For components that need full control over their markdown output:
```typescript
import { visit } from 'unist-util-visit'
import type { Root as HastRoot } from 'hast'
import type { Nodes as MdastNode } from 'mdast'
interface MarkdownRenderer {
  /** If returned, used directly as markdown instead of converting the hast */
  toMarkdown?(): string | MdastNode
}
// In the pre-process step:
function preProcess(hast: HastRoot): HastRoot {
  visit(hast, 'element', (node) => {
    // Check if this element's component provided a markdown hint
    const markdownDirect = node.properties.dataMdRaw
    if (markdownDirect) {
      // Replace the element with a node carrying the markdown.
      // A plain text node is not enough: its value would be escaped during
      // serialization, so the markdown has to be parsed separately into
      // mdast or passed through as raw output.
    }
  })
  return hast
}
```
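The replacement step left open above could look roughly like the following dependency-free sketch. The node types are simplified stand-ins for hast, `rawMarkdown` is a hypothetical marker node type invented here, and a matching custom handler would still be needed later in the pipeline to emit its `value` without escaping.

```typescript
// Simplified stand-ins for hast nodes (assumption: real code would use @types/hast).
interface HookElement {
  type: 'element'
  tagName: string
  properties: Record<string, string | undefined>
  children: HookNode[]
}
interface HookText { type: 'text'; value: string }
// Hypothetical marker node that a later handler would pass through untouched.
interface HookRaw { type: 'rawMarkdown'; value: string }
type HookNode = HookElement | HookText | HookRaw

// Replace every element that carries data-md-raw with a marker node,
// discarding its visual children; recurse into everything else.
function replaceRawMarkdown(nodes: HookNode[]): HookNode[] {
  return nodes.map((node): HookNode => {
    if (node.type !== 'element') return node
    const raw = node.properties.dataMdRaw
    if (raw !== undefined && raw !== '') return { type: 'rawMarkdown', value: raw }
    return { ...node, children: replaceRawMarkdown(node.children) }
  })
}

const hookInput: HookNode[] = [
  {
    type: 'element',
    tagName: 'section',
    properties: {},
    children: [
      {
        type: 'element',
        tagName: 'div',
        properties: { dataMd: 'table', dataMdRaw: '| a |\n| - |\n| 1 |' },
        children: [{ type: 'text', value: 'rich visual table' }],
      },
    ],
  },
]
const hookOutput = replaceRawMarkdown(hookInput)
```

Returning new arrays rather than mutating in place sidesteps the index bookkeeping that replacing nodes during a `visit` traversal would require.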
### Package Dependencies
```json
{
"dependencies": {
"hast-util-from-html": "^2.0.0",
"hast-util-to-mdast": "^10.0.0",
"mdast-util-to-markdown": "^2.0.0",
"mdast-util-gfm": "^3.0.0",
"unist-util-visit": "^5.0.0",
"hast-util-select": "^6.0.0",
"hast-util-to-string": "^3.0.0"
},
"devDependencies": {
"@types/hast": "^3.0.0",
"@types/mdast": "^4.0.0",
"@types/unist": "^3.0.0",
"react": "^18.0.0",
"react-dom": "^18.0.0"
}
}
```
Estimated total bundle size: ~50-80KB (tree-shaken, minified)
---
## 13. Appendix: Element-to-Markdown Mapping Table
### Complete Mapping for Standard HTML Elements
| HTML Element | mdast Output | Markdown | Notes |
|-------------|-------------|----------|-------|
| `<h1>` | `heading` (depth: 1) | `# ` | |
| `<h2>` | `heading` (depth: 2) | `## ` | |
| `<h3>` | `heading` (depth: 3) | `### ` | |
| `<h4>` | `heading` (depth: 4) | `#### ` | |
| `<h5>` | `heading` (depth: 5) | `##### ` | |
| `<h6>` | `heading` (depth: 6) | `###### ` | |
| `<p>` | `paragraph` | (blank line delimited) | |
| `<strong>`, `<b>` | `strong` | `**text**` | |
| `<em>`, `<i>` | `emphasis` | `*text*` | |
| `<del>`, `<s>`, `<strike>` | `delete` (GFM) | `~~text~~` | |
| `<code>` | `inlineCode` | `` `code` `` | |
| `<pre><code>` | `code` | ```` ```\ncode\n``` ```` | `lang` from `class="language-xxx"` |
| `<a href>` | `link` | `[text](url "title")` | |
| `<img>` | `image` | `![alt](src "title")` | |
| `<ul>` | `list` (ordered: false) | `- item` | |
| `<ol>` | `list` (ordered: true) | `1. item` | `start` attribute respected |
| `<li>` | `listItem` | (list item) | GFM: `checked` for task lists |
| `<blockquote>` | `blockquote` | `> text` | |
| `<hr>` | `thematicBreak` | `---` | |
| `<table>` | `table` (GFM) | `\| col \| col \|` | |
| `<tr>` | `tableRow` | | |
| `<th>`, `<td>` | `tableCell` | | `align` from `align` attr |
| `<br>` | `break` | (hard break) | Two spaces + newline or `\` |
| `<q>` | `text` with quotes | ("quoted") | Uses `quotes` option for nesting |
| `<input type="checkbox">` | `text` | `[x]` or `[ ]` | GFM task list item support |
| `<abbr>` | `text` | (just text) | Title lost |
| `<sup>` | `text` | (just text) | Superscript lost |
| `<sub>` | `text` | (just text) | Subscript lost |
| `<mark>` | `text` | (just text) | Highlight lost |
| `<small>` | `text` | (just text) | Small text lost |
| `<details>` | Children only | (content extracted) | No markdown equivalent |
| `<summary>` | `text` | (just text) | Collapsed with details |
| `<dl>`, `<dt>`, `<dd>` | `list` | `- term` / `- description` items | No definition list in markdown; terms and descriptions become list items |
| `<figure>` | Children only | (content extracted) | |
| `<figcaption>` | paragraph/text | (text) | |
| `<video>`, `<audio>` | `link` | `[src](url)` | Downgraded to link |
| `<iframe>` | `link` | `[src](url)` | Downgraded to link |
| `<svg>` | (ignored) | (nothing) | Use custom handler to preserve |
| `<math>` | (ignored) | (nothing) | Use custom handler for LaTeX output |
| `<script>` | (ignored) | (nothing) | |
| `<style>` | (ignored) | (nothing) | |
| `<div>`, `<section>`, `<article>` | Children extracted | (children only) | Container elements unwrapped |
| `<span>`, `<time>` | Children extracted | (children only) | Inline containers unwrapped |
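Two rows of the table are simple enough to express as code: the heading depth comes from the tag name, and the code block language comes from a `language-` class. The helpers below are hypothetical illustrations of those two rules, not library code.

```typescript
// Hypothetical helpers illustrating two rows of the mapping table.

// "h1".."h6" map to heading depth 1..6; any other tag is not a heading.
function headingDepth(tagName: string): number | undefined {
  const match = /^h([1-6])$/.exec(tagName)
  return match ? Number(match[1]) : undefined
}

// The first class with a "language-" prefix supplies the code block's `lang`.
function codeLanguage(classNames: string[]): string | undefined {
  const prefix = 'language-'
  const hit = classNames.find((name) => name.startsWith(prefix))
  return hit ? hit.slice(prefix.length) : undefined
}

const depthOfH3 = headingDepth('h3')
const depthOfDiv = headingDepth('div')
const langOfBlock = codeLanguage(['hljs', 'language-ts'])
const langOfPlain = codeLanguage(['hljs'])
```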
---
## Summary of Key Findings
1. **The hast→mdast→markdown chain is production-ready**: `hast-util-to-mdast` (v10.1.2) and `mdast-util-to-markdown` (v2.1.2) are mature, well-tested, well-typed libraries that handle the full HTML spec and produce clean GFM markdown.
2. **The gap is JSX→hast**: There is no off-the-shelf solution for converting React component trees directly to hast. The recommended approach is React server-side rendering (to HTML) followed by `hast-util-from-html` parsing.
3. **Custom components need a strategy**: Standard HTML elements map well, but custom React components lose their semantics through HTML rendering. A `data-md` attribute convention + custom handler registry is the recommended solution.
4. **TypeScript support is excellent**: All packages in the ecosystem ship with proper TypeScript types. The discriminated union on `node.type` makes pattern matching clean and type-safe.
5. **GFM is fully supported**: Tables, strikethrough, task lists, and autolink literals are all handled.
6. **The unist utility ecosystem is rich**: `unist-util-visit`, `hast-util-select`, and dozens of other utilities provide all the building blocks needed for pre/post-processing.
7. **For LLM consumption specifically**: The markdown output quality from this pipeline is already excellent for standard HTML. The main investment should go into the component annotation system and custom handlers for domain-specific components.
---
*Research completed 2026-04-28. All version numbers and repository states reflect the latest available at time of research.*