# Unist Ecosystem Research: JSX → Markdown Pipeline for LLM Consumption

**Date**: 2026-04-28
**Topic**: Feasibility of JSX components → hast → mdast → markdown pipeline using the Unist/syntax-tree ecosystem

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [unist: The Universal Foundation](#2-unist-the-universal-foundation)
3. [hast: Hypertext Abstract Syntax Tree](#3-hast-hypertext-abstract-syntax-tree)
4. [mdast: Markdown Abstract Syntax Tree](#4-mdast-markdown-abstract-syntax-tree)
5. [hast-util-to-mdast: The Key Transform](#5-hast-util-to-mdast-the-key-transform)
6. [mdast-util-to-markdown: Serialization to Markdown](#6-mdast-util-to-markdown-serialization-to-markdown)
7. [remark/rehype Ecosystem](#7-remarkrehype-ecosystem)
8. [unist-util-visit and Related Utilities](#8-unist-util-visit-and-related-utilities)
9. [TypeScript Type Definitions](#9-typescript-type-definitions)
10. [Pipeline Feasibility Assessment](#10-pipeline-feasibility-assessment)
11. [Alternative Approaches](#11-alternative-approaches)
12. [Recommended Architecture](#12-recommended-architecture)
13. [Appendix: Element-to-Markdown Mapping Table](#13-appendix-element-to-markdown-mapping-table)

---

## 1. Executive Summary

The JSX → hast → mdast → markdown pipeline is **feasible and well-supported** by mature, well-typed libraries in the unist/syntax-tree ecosystem. The core transformation chain is:

```
JSX Component Tree  →  hast (HTML AST)      →  mdast (Markdown AST)  →  markdown string
        │                     │                        │                       │
React rendering        hast-util-from-html     hast-util-to-mdast     mdast-util-to-markdown
or react-dom/          or manual hast          (v10.1.2)              (v2.1.2)
server rendering       construction
```

**Key finding**: The hardest step is not hast→mdast→markdown (which is solved by existing, mature libraries), but rather **JSX → hast** and handling **custom components** that have no direct HTML/markdown equivalent. The ecosystem provides excellent tooling for standard HTML elements but requires a custom strategy for framework-specific components.

**Verdict**: Use the existing unist ecosystem libraries for the hast→mdast→markdown steps. Build a custom JSX→hast adapter layer that handles React component rendering and custom element mapping.

---

## 2. unist: The Universal Foundation

**Repository**: https://github.com/syntax-tree/unist
**Current version**: 3.0.0
**License**: CC-BY-4.0

unist is the abstract base specification that hast, mdast, xast, and nlcst all implement. It defines the minimal node interface that all syntax tree nodes share.

### Core Node Interface

```typescript
interface Data {}      // Intentionally empty; the ecosystem extends it

interface Node {
  type: string         // Non-empty string identifying the node variant
  data?: Data          // Ecosystem-specific metadata
  position?: Position  // Source location info
}

interface Parent extends Node {
  children: Node[]     // Child nodes
}

interface Literal extends Node {
  value: any           // Node's value
}

interface Position {
  start: Point
  end: Point
}

interface Point {
  line: number         // 1-indexed
  column: number       // 1-indexed
  offset?: number      // 0-indexed
}
```

### Design Principles

- All values must be JSON-serializable (no functions, undefined, symbols)
- Trees can survive `JSON.parse(JSON.stringify(tree))` roundtrips
- `data` field is reserved for ecosystem use; specifications never define fields on it
- `position` must be absent on generated nodes

### Why This Matters for Our Pipeline

The JSON-serializability constraint means the AST is inherently portable and can be passed between contexts (server/client, different frameworks). The `data` field provides an escape hatch for custom metadata that custom component handlers can use.

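To make the serializability point concrete, here is a minimal sketch of a tree surviving a JSON round-trip. The `UnistNode` interface is a simplified local stand-in written for this example (it is not imported from `@types/unist`), and `data.component` is a hypothetical piece of metadata of the kind a custom component handler might attach.

```typescript
// Simplified local stand-in for the unist interfaces (assumption: not the @types/unist definitions).
interface UnistNode {
  type: string
  data?: Record<string, unknown>
  children?: UnistNode[]
  value?: string
}

// A small tree; `data.component` is a hypothetical custom field.
const tree: UnistNode = {
  type: 'root',
  data: { component: 'Card' },
  children: [{ type: 'text', value: 'Hello' }]
}

// Serialize and parse again, as would happen when crossing a server/client boundary.
const roundTripped: UnistNode = JSON.parse(JSON.stringify(tree))

// True when nothing was lost: both trees serialize to the same string.
const survived = JSON.stringify(roundTripped) === JSON.stringify(tree)
```

Anything that is not JSON-serializable (a function stored in `data`, for instance) would be dropped by the round-trip, which is why handlers should only store plain values there.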

---

## 3. hast: Hypertext Abstract Syntax Tree

**Repository**: https://github.com/syntax-tree/hast
**Spec version**: 2.4.0
**Type definitions**: `@types/hast`
**Stars**: 892

hast represents HTML (and embedded SVG/MathML) as an abstract syntax tree. It extends unist.

### Node Types

| Node Type | Extends | Description | Key Fields |
|-----------|---------|-------------|------------|
| **`Root`** | Parent | Document root | `children` |
| **`Element`** | Parent | HTML element | `tagName`, `properties`, `children`, `content?` |
| **`Text`** | Literal | Text content | `value` |
| **`Comment`** | Literal | HTML comment | `value` |
| **`Doctype`** | Node | Document type declaration | (none beyond unist Node) |

### Element Interface (the workhorse)

```typescript
interface Element extends Parent {
  type: 'element'
  tagName: string                           // e.g., 'div', 'span', 'custom-card'
  properties: Properties                    // HTML attributes mapped to DOM properties
  content?: Root                            // Only for <template> elements
  children: Array<Comment | Element | Text>
}
```

### Properties System

hast uses DOM-style property names, not HTML attribute names:

| HTML Attribute | hast Property |
|----------------|---------------|
| `class` | `className` (array: `['foo', 'bar']`) |
| `for` | `htmlFor` |
| `data-*` | `data*` (camelCase) |
| `aria-*` | `aria*` (camelCase) |
| `tabindex` | `tabIndex` |
| `colspan` | `colSpan` |

Property values:
- Boolean attributes: `true`/`false`
- Numeric attributes: `number`
- Space-separated: `string[]` (e.g., `className: ['foo', 'bar']`)
- All other: `string`

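The mapping rules above can be sketched as a small function. This is an illustration only: `attributesToProperties` and its lookup tables are hypothetical and cover just a handful of attributes, whereas real parsers such as `hast-util-from-html` apply the complete rules for you.

```typescript
type PropertyValue = string | number | boolean | string[]

// Attributes whose property name is not a plain camelCase of the attribute name.
const renamed: Partial<Record<string, string>> = {
  class: 'className',
  for: 'htmlFor',
  tabindex: 'tabIndex',
  colspan: 'colSpan'
}

// Small illustrative subsets, not the full HTML lists.
const spaceSeparated = new Set(['class'])
const booleans = new Set(['download', 'disabled', 'checked'])
const numeric = new Set(['tabindex', 'colspan'])

// Hypothetical helper: convert HTML attributes to hast-style properties.
function attributesToProperties(attributes: Record<string, string>): Record<string, PropertyValue> {
  const properties: Record<string, PropertyValue> = {}
  for (const [name, value] of Object.entries(attributes)) {
    // data-foo-bar becomes dataFooBar, aria-label becomes ariaLabel
    const key = renamed[name] ?? name.replace(/-([a-z])/g, (_, letter: string) => letter.toUpperCase())
    if (spaceSeparated.has(name)) properties[key] = value.split(/\s+/).filter(Boolean)
    else if (booleans.has(name)) properties[key] = true
    else if (numeric.has(name)) properties[key] = Number(value)
    else properties[key] = value
  }
  return properties
}

const properties = attributesToProperties({
  href: 'https://alpha.com',
  class: 'bravo',
  download: '',
  'data-foo-bar': 'baz'
})
```

The practical consequence for handlers is that lookups use the property name (`node.properties.className`, `node.properties.dataFooBar`), never the attribute name.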

### Example: hast tree for HTML

HTML:
```html
<a href="https://alpha.com" class="bravo" download>Link</a>
```

hast:
```json
{
  "type": "element",
  "tagName": "a",
  "properties": {
    "href": "https://alpha.com",
    "className": ["bravo"],
    "download": true
  },
  "children": [{"type": "text", "value": "Link"}]
}
```

### Key Utilities for hast Construction

- **`hastscript`** (v9.0.1): `h()` function to create hast trees, like React's `createElement`. Supports JSX via the automatic runtime (`@jsxImportSource hastscript`).
- **`hast-util-from-html`**: Parse an HTML string to hast
- **`hast-util-from-dom`**: Convert browser DOM nodes to hast
- **`hast-util-to-html`**: Serialize hast to an HTML string
- **`hast-util-to-jsx-runtime`** (v2.3.6): Convert hast to React/Preact/Solid/Svelte/Vue (the *reverse* direction of what we need)
- **`hast-util-select`**: CSS selector queries on hast trees (`querySelector`, etc.)

### hastscript JSX Support

Notably, hastscript lets you write JSX syntax that creates hast trees directly:

```jsx
/** @jsxImportSource hastscript */
const tree = (
  <div class="foo" id="some-id">
    <span>some text</span>
    <input type="text" value="foo" />
  </div>
)
```

This produces a **hast tree**, not a React element. This is a potential alternative entry point for our pipeline.

---

## 4. mdast: Markdown Abstract Syntax Tree

**Repository**: https://github.com/syntax-tree/mdast
**Spec version**: 5.0.0
**Type definitions**: `@types/mdast`
**Stars**: 1.4k

mdast represents markdown as an abstract syntax tree. It extends unist.

### Core Node Types (CommonMark)

| Node Type | Category | Description | Key Fields |
|-----------|----------|-------------|------------|
| **`Root`** | n/a | Document root | `children` |
| **`Paragraph`** | Content | Text paragraph | `children: [PhrasingContent]` |
| **`Heading`** | Flow | Section heading | `depth: 1-6`, `children: [PhrasingContent]` |
| **`Blockquote`** | Flow | Quoted section | `children: [FlowContent]` |
| **`List`** | Flow | Ordered/unordered list | `ordered`, `start`, `spread`, `children: [ListItem]` |
| **`ListItem`** | ListContent | List item | `spread`, `checked?`, `children: [FlowContent]` |
| **`Code`** | Flow | Fenced/indented code block | `value`, `lang?`, `meta?` |
| **`ThematicBreak`** | Flow | Horizontal rule `---` | (none) |
| **`Html`** | Flow/Phrasing | Raw HTML in markdown | `value` |
| **`Definition`** | Content | Link/image reference def | `identifier`, `label`, `url`, `title` |
| **`Text`** | Phrasing | Plain text | `value` |
| **`Emphasis`** | Phrasing | Italic `*text*` | `children: [PhrasingContent]` |
| **`Strong`** | Phrasing | Bold `**text**` | `children: [PhrasingContent]` |
| **`InlineCode`** | Phrasing | Inline code `` `code` `` | `value` |
| **`Break`** | Phrasing | Hard line break | (none) |
| **`Link`** | Phrasing | Hyperlink | `url`, `title?`, `children: [PhrasingContent]` |
| **`LinkReference`** | Phrasing | Link by reference | `identifier`, `label`, `referenceType` |
| **`Image`** | Phrasing | Image | `url`, `title?`, `alt?` |
| **`ImageReference`** | Phrasing | Image by reference | `identifier`, `label`, `referenceType`, `alt?` |

### GFM Extension Nodes

| Node Type | Description | Key Fields |
|-----------|-------------|------------|
| **`Delete`** | Strikethrough `~~text~~` | `children: [PhrasingContent]` |
| **`Table`** | Table | `align?: [alignType]`, `children: [TableRow]` |
| **`TableRow`** | Table row | `children: [TableCell]` |
| **`TableCell`** | Table cell | `children: [PhrasingContent]` |
| **`FootnoteDefinition`** | Footnote def | `identifier`, `label`, `children: [FlowContent]` |
| **`FootnoteReference`** | Footnote ref | `identifier`, `label` |

### Content Model Hierarchy

```
MdastContent = FlowContent | ListContent | PhrasingContent

FlowContent = Blockquote | Code | Heading | Html | List | ThematicBreak | Paragraph
ListContent = ListItem
PhrasingContent = Break | Emphasis | Html | Image | ImageReference | InlineCode
                | Link | LinkReference | Strong | Text
                + GFM: Delete | FootnoteReference
```

This hierarchy is critical: hast-util-to-mdast must map HTML's content model (which lets block and inline content mix freely inside most containers) into mdast's strict flow/phrasing content model.

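A rough sketch of what that mapping involves: whenever loose phrasing nodes sit next to flow nodes, each run of phrasing has to be wrapped in a paragraph. The types and the `wrapPhrasing` helper below are simplified inventions for this example; the real library does considerably more (whitespace handling, more node types), so treat this as the idea rather than its implementation.

```typescript
// Minimal local node types (assumption: simplified, not the @types/mdast definitions).
type Phrasing = { type: 'text'; value: string } | { type: 'strong'; children: Phrasing[] }
type Flow =
  | { type: 'paragraph'; children: Phrasing[] }
  | { type: 'heading'; depth: number; children: Phrasing[] }
type Content = Phrasing | Flow

function isPhrasing(node: Content): node is Phrasing {
  return node.type === 'text' || node.type === 'strong'
}

// Hypothetical helper: group each run of adjacent phrasing nodes into one paragraph.
function wrapPhrasing(nodes: Content[]): Flow[] {
  const result: Flow[] = []
  let run: Phrasing[] = []
  const flush = () => {
    if (run.length > 0) result.push({ type: 'paragraph', children: run })
    run = []
  }
  for (const node of nodes) {
    if (isPhrasing(node)) {
      run.push(node)
    } else {
      flush() // close the pending paragraph before the flow node
      result.push(node)
    }
  }
  flush()
  return result
}

// Roughly the converted children of `<div>Text <h2>Heading</h2> More text</div>`.
const flow = wrapPhrasing([
  { type: 'text', value: 'Text' },
  { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Heading' }] },
  { type: 'text', value: 'More text' }
])
```

Here `flow` holds a paragraph, the heading, and a second paragraph, which is the shape a markdown serializer needs.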

### Mixin Types

```typescript
interface Resource {
  url: string
  title?: string
}

interface Alternative {
  alt?: string
}

interface Association {
  identifier: string
  label?: string
}

interface Reference {
  referenceType: 'shortcut' | 'collapsed' | 'full'
}
```

---

## 5. hast-util-to-mdast: The Key Transform

**Repository**: https://github.com/syntax-tree/hast-util-to-mdast
**Current version**: 10.1.2
**License**: MIT
**Stars**: 43

This is the critical library in the pipeline: it converts hast (HTML AST) into mdast (Markdown AST).

### API

```typescript
import { toMdast } from 'hast-util-to-mdast'

const mdastTree = toMdast(hastTree, options) // `options` is optional
```

### Options

```typescript
interface Options {
  newlines?: boolean                        // Keep line endings when collapsing whitespace (default: false)
  checked?: string                          // Value for checked checkbox (default: '[x]')
  unchecked?: string                        // Value for unchecked checkbox (default: '[ ]')
  quotes?: string[]                         // Quote characters for <q> nesting (default: ['"'])
  document?: boolean                        // Whether tree is a complete document (default: auto-detect)
  handlers?: Record<string, Handle>         // Custom element handlers
  nodeHandlers?: Record<string, NodeHandle> // Custom node type handlers
}
```

### Custom Handlers

This is the **extensibility mechanism** most relevant to our use case:

```typescript
type Handle = (
  state: State,
  element: Element,
  parent: HastParent | undefined
) => Array<MdastNode> | MdastNode | undefined

type NodeHandle = (
  state: State,
  node: any,
  parent: HastParent | undefined
) => Array<MdastNode> | MdastNode | undefined
```

The `handlers` option maps HTML tag names to custom conversion functions. Custom handlers are **merged** into the defaults, so you can override specific tags without reimplementing everything.

The `nodeHandlers` option maps hast node types (like `'text'`, `'comment'`) to custom handlers.

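The merge-then-dispatch behavior can be illustrated with a toy converter. Everything here is a simplified stand-in: `toMdastLite`, `MiniHandle`, and `miniDefaults` are hypothetical names, and the handler signature is deliberately reduced (it receives a children-converter instead of the library's `state` object). It shows only how overriding one tag leaves the other defaults in place.

```typescript
// Simplified local types (assumption: far smaller than the real hast/mdast definitions).
type HastNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; children: HastNode[] }
type HastElement = Extract<HastNode, { type: 'element' }>
type MdNode = { type: string; value?: string; children?: MdNode[] }
type MiniHandle = (all: (parent: HastElement) => MdNode[], element: HastElement) => MdNode | MdNode[]

// Stand-in defaults; the real library ships handlers for every HTML element.
const miniDefaults: Record<string, MiniHandle> = {
  p: (all, el) => ({ type: 'paragraph', children: all(el) }),
  strong: (all, el) => ({ type: 'strong', children: all(el) })
}

function toMdastLite(tree: HastElement, custom: Record<string, MiniHandle> = {}): MdNode[] {
  // Custom handlers are merged over the defaults, so only the listed tags change.
  const handlers: Partial<Record<string, MiniHandle>> = { ...miniDefaults, ...custom }
  const all = (parent: HastElement): MdNode[] => {
    const result: MdNode[] = []
    for (const child of parent.children) {
      if (child.type === 'text') {
        result.push({ type: 'text', value: child.value })
        continue
      }
      const handle = handlers[child.tagName]
      // Tags without a handler fall back to passing their children through.
      const converted = handle ? handle(all, child) : all(child)
      result.push(...(Array.isArray(converted) ? converted : [converted]))
    }
    return result
  }
  return all(tree)
}

// A custom element handled by one extra handler; `strong` still uses the default.
const callout: MiniHandle = (all, el) => ({ type: 'blockquote', children: all(el) })
const converted = toMdastLite(
  {
    type: 'element',
    tagName: 'div',
    children: [
      { type: 'element', tagName: 'my-callout', children: [{ type: 'text', value: 'Note' }] },
      { type: 'element', tagName: 'strong', children: [{ type: 'text', value: 'kept' }] }
    ]
  },
  { 'my-callout': callout }
)
```

With the real library the same idea is expressed as `toMdast(tree, { handlers: { 'my-callout': ... } })`, using the `Handle` signature shown above.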

### State Object

Passed to all handlers:

```typescript
interface State {
  patch: (from: HastNode, to: MdastNode) => undefined               // Copy positional info
  one: (node: HastNode, parent?: HastParent) => MdastNode           // Transform single node
  all: (parent: HastParent) => Array<MdastContent>                  // Transform children
  toFlow: (nodes: Array<MdastContent>) => Array<MdastFlowContent>   // Promote to flow content
  resolve: (url: string | null | undefined) => string               // Resolve URLs
  options: Options
  elementById: Map<string, Element>
  handlers: Record<string, Handle>
  nodeHandlers: Record<string, NodeHandle>
  inTable: boolean                                                  // Whether we're inside a table
  qNesting: number                                                  // <q> nesting depth
}
```

### How It Handles Different Element Categories

#### Inline Elements (Phrasing Content)

| HTML Element | mdast Node | Notes |
|-------------|------------|-------|
| `<strong>`, `<b>` | `strong` | Children processed recursively |
| `<em>`, `<i>` | `emphasis` | Children processed recursively |
| `<code>` | `inlineCode` | Value extracted from text child |
| `<a href="...">` | `link` | `url` from `href`, `title` from `title` attr |
| `<br>` | `break` | Hard line break |
| `<del>`, `<s>`, `<strike>` | `delete` (GFM) | Strikethrough |
| `<q>` | `text` with quotes | Uses `quotes` option for nesting |
| `<img>` | `image` | `url` from `src`, `alt` from `alt` attr |
| `<sub>`, `<sup>`, `<mark>`, etc. | Text content only | Non-semantic in markdown; children extracted |
| `<input type="checkbox">` | `text` | Uses `checked`/`unchecked` options |

#### Block Elements (Flow Content)

| HTML Element | mdast Node | Notes |
|-------------|------------|-------|
| `<h1>`–`<h6>` | `heading` (depth 1-6) | |
| `<p>` | `paragraph` | |
| `<blockquote>` | `blockquote` | |
| `<ul>` | `list` (ordered: false) | |
| `<ol>` | `list` (ordered: true, start) | |
| `<li>` | `listItem` | GFM: `checked` for task lists |
| `<pre><code>` | `code` | `lang` from a `language-*` class (e.g. `language-js`); the default handler does not populate `meta` |
| `<table>` | `table` (GFM) | With `align` from `align` attribute |
| `<tr>` | `tableRow` | |
| `<td>`/`<th>` | `tableCell` | |
| `<hr>` | `thematicBreak` | |
| `<dl>`, `<dt>`, `<dd>` | `list` | No markdown equivalent; downgraded to list items |

#### Special Behaviors

- **`<template>`**: Content is processed from the `content` field
- **`<noscript>`**: Children processed as if scripting is disabled
- **`<svg>`**, **`<math>`**: **Ignored** by default (no markdown equivalent)
- **`<video>`**, **`<audio>`**, **`<iframe>`**: Converted to **links** to the source
- **`<form>`** elements: Processed for their text content
- **Implicit paragraphs**: The algorithm correctly handles HTML's implicit paragraph model (e.g., text + heading inside a container gets proper paragraph wrapping)

#### `data-mdast` Attribute

Elements with `data-mdast="ignore"` are excluded from output:

```html
<p><strong>Important</strong> and <em data-mdast="ignore">ignored</em>.</p>
```
→ `**Important** and .`

### Algorithm

The project README describes the algorithm as "very powerful"; it handles all HTML elements, including ancient and obscure ones. It is particularly good at:

1. **Implicit/explicit paragraph handling**: Correctly wraps loose text in paragraphs when adjacent to block elements
2. **Whitespace collapsing**: Collapses inter-element whitespace to single spaces (configurable with `newlines`)
3. **Content model enforcement**: Ensures phrasing content doesn't end up in flow contexts (auto-wraps in paragraphs)
4. **GFM output**: Tables produce GFM `table` nodes; `<del>`/`<s>`/`<strike>` produce `delete` nodes

### Custom Handler Example: Preserving SVG as Raw HTML

```typescript
import type { Html } from 'mdast'
import { toHtml } from 'hast-util-to-html'
import { toMdast } from 'hast-util-to-mdast'

// `hast` is an existing hast tree, e.g. produced by hast-util-from-html
const mdast = toMdast(hast, {
  handlers: {
    svg(state, node) {
      // Serialize the whole <svg> subtree and keep it as a raw HTML node
      const result: Html = { type: 'html', value: toHtml(node, { space: 'svg' }) }
      state.patch(node, result)
      return result
    }
  }
})
```

This pattern, converting an unhandled element to an mdast `html` node, is the standard escape hatch for elements that don't map cleanly to markdown.

---

## 6. mdast-util-to-markdown: Serialization to Markdown

**Repository**: https://github.com/syntax-tree/mdast-util-to-markdown
**Current version**: 2.1.2
**License**: MIT
**Stars**: 139

This library serializes an mdast tree to a markdown string.

### API

```typescript
import { toMarkdown } from 'mdast-util-to-markdown'

const markdown = toMarkdown(mdastTree, options) // `options` is optional
```

### Key Options

```typescript
interface Options {
  // List formatting
  bullet?: '*' | '+' | '-'                  // Unordered list marker (default: '*')
  bulletOther?: '*' | '+' | '-'             // Fallback list marker (default: '-')
  bulletOrdered?: '.' | ')'                 // Ordered list marker (default: '.')
  listItemIndent?: 'mixed' | 'one' | 'tab'  // List item indentation (default: 'one')
  incrementListMarker?: boolean             // Increment ordered list numbers (default: true)

  // Heading formatting
  closeAtx?: boolean                        // Close ATX headings with trailing #s (default: false)
  setext?: boolean                          // Use setext headings when possible (default: false)

  // Emphasis/strong markers
  emphasis?: '*' | '_'                      // Emphasis marker (default: '*')
  strong?: '*' | '_'                        // Strong marker (default: '*')

  // Code blocks
  fence?: '`' | '~'                         // Fenced code marker (default: '`')
  fences?: boolean                          // Always use fenced code (default: true)

  // Links
  resourceLink?: boolean                    // Always use resource links (default: false)
  quote?: '"' | "'"                         // Title quote character (default: '"')

  // Thematic breaks
  rule?: '*' | '-' | '_'                    // Thematic break marker (default: '*')
  ruleRepetition?: number                   // Number of markers (default: 3, min: 3)
  ruleSpaces?: boolean                      // Spaces between markers (default: false)

  // Definitions
  tightDefinitions?: boolean                // No blank lines between definitions (default: false)

  // Extensibility
  handlers?: Handlers                       // Custom node type handlers
  join?: Array<Join>                        // Custom block-joining behavior
  unsafe?: Array<Unsafe>                    // Characters that need escaping in contexts
  extensions?: Array<Options>               // Extension options (e.g., GFM)
}
```

### GFM Support

GFM output is achieved by using the `mdast-util-gfm` extension:

```typescript
import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'

const markdown = toMarkdown(tree, {
  extensions: [gfmToMarkdown()]
})
```

This adds support for:
- Tables (`| col1 | col2 |`)
- Strikethrough (`~~text~~`)
- Task lists (`- [x] item`)
- Autolink literals
- Footnotes

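### Safety / Escaping

The library carefully escapes characters that would be interpreted as markdown syntax:

```typescript
// Characters that would be read as markdown syntax are escaped.
// Input (children of one paragraph):
//   { type: 'text', value: '- a\nb !' },
//   { type: 'link', url: 'example.com', children: [{ type: 'text', value: 'd' }] }
// Output:
//   \- a
//   b \![d](example.com)
```

The `-` is escaped because it would otherwise start a list item, and the `!` because it would otherwise combine with the following link into an image. This is handled via the `Unsafe` type system, which specifies which characters are dangerous in which constructs.
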
### Custom Handlers

```typescript
type Handle = (node, parent, state, info) => string

type Handlers = Record<Node['type'], Handle>
```

Custom handlers can be provided for any node type, including custom/extension node types.

---

## 7. remark/rehype Ecosystem

### Architecture

```
unified (core processor)
├── remark (markdown) ─── mdast ─── remarkParse / remarkStringify
├── rehype (HTML) ────── hast ───── rehypeParse / rehypeStringify
└── retext (NLP) ────── nlcst ───── retextEnglish / retextStringify
```

### Cross-Ecosystem Plugins

| Plugin | Direction | Description |
|--------|-----------|-------------|
| `remark-rehype` | mdast → hast | Markdown to HTML (the common direction) |
| `rehype-remark` | hast → mdast | HTML to Markdown (our direction!) |
| `remark-retext` | mdast → nlcst | Markdown to NLP |
| `rehype-retext` | hast → nlcst | HTML to NLP |

### rehype-remark (v10.0.1)

**Repository**: https://github.com/rehypejs/rehype-remark
**Stars**: 99

This is the higher-level wrapper around `hast-util-to-mdast`. It operates as a unified plugin:

```typescript
import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'

const file = await unified()
  .use(rehypeParse)       // HTML → hast
  .use(rehypeRemark)      // hast → mdast (uses hast-util-to-mdast internally)
  .use(remarkStringify)   // mdast → markdown (uses mdast-util-to-markdown internally)
  .process(htmlString)

console.log(String(file))
```

**Recommendation**: For our JSX→markdown pipeline, we should use the lower-level utilities (`hast-util-to-mdast` + `mdast-util-to-markdown`) directly rather than the unified processor chain. This avoids the overhead of the unified pipeline and gives us more control.

However, `rehype-remark` plus remark plugins could be useful if we want to add post-processing transformations (e.g., `remark-gfm` for explicit GFM support).

### Available remark Plugins (for post-processing)

- `remark-gfm`: GFM syntax support
- `remark-frontmatter`: YAML/TOML frontmatter
- `remark-mdx`: MDX syntax support
- `remark-lint`: Markdown linting
- `remark-toc`: Table of contents generation
- `remark-comment-config`: Configure remark from HTML comments
- 150+ total plugins

---

## 8. unist-util-visit and Related Utilities

**Repository**: https://github.com/syntax-tree/unist-util-visit
**Current version**: 5.1.0
**License**: MIT

### Core Traversal: unist-util-visit

```typescript
import { visit, CONTINUE, EXIT, SKIP } from 'unist-util-visit'

visit(tree, 'heading', (node, index, parent) => {
  // Return CONTINUE (default), EXIT, SKIP, or a new index
  if (node.depth === 1) return EXIT
  if (node.depth === 2) return SKIP  // Don't visit children
  return [SKIP, 5]                   // Skip children, continue from index 5
})
```

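For readers unfamiliar with the traversal model, the core idea can be sketched without the library. `visitLite`, `EXIT_WALK`, and `SKIP_CHILDREN` below are hypothetical names for this example; it is a minimal depth-first walk under simplified assumptions, not the `unist-util-visit` implementation (no index returns, no tuple actions, string tests only).

```typescript
// Local stand-in node type (assumption: simplified for this sketch).
interface TreeNode {
  type: string
  depth?: number
  children?: TreeNode[]
}

const EXIT_WALK = false        // stop the whole traversal
const SKIP_CHILDREN = 'skip'   // do not descend into the current node

type Visitor = (
  node: TreeNode,
  index: number | undefined,
  parent: TreeNode | undefined
) => false | 'skip' | void

// Depth-first, pre-order walk that calls `visitor` on nodes whose `type` equals `test`.
function visitLite(tree: TreeNode, test: string, visitor: Visitor): void {
  // Returns false when the traversal should stop entirely.
  const walk = (node: TreeNode, index: number | undefined, parent: TreeNode | undefined): boolean => {
    const action = node.type === test ? visitor(node, index, parent) : undefined
    if (action === EXIT_WALK) return false
    if (action !== SKIP_CHILDREN && node.children) {
      for (let i = 0; i < node.children.length; i++) {
        if (!walk(node.children[i], i, node)) return false
      }
    }
    return true
  }
  walk(tree, undefined, undefined)
}

const doc: TreeNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 2, children: [{ type: 'text' }] },
    { type: 'heading', depth: 1 },
    { type: 'heading', depth: 3 }
  ]
}

// Record the depth of each visited heading; stop at the first h1.
const seen: number[] = []
visitLite(doc, 'heading', (node) => {
  seen.push(node.depth ?? 0)
  if (node.depth === 1) return EXIT_WALK
  if (node.depth === 2) return SKIP_CHILDREN
})
```

The third heading is never visited because the visitor exits at the `depth: 1` heading, which mirrors how `EXIT` is used in the example above.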

### Complete Utility Landscape

| Utility | Purpose | Stars |
|---------|---------|-------|
| `unist-util-visit` | Walk tree depth-first | 346 |
| `unist-util-visit-parents` | Walk with parent stack | n/a |
| `unist-util-is` | Check if node matches test | n/a |
| `unist-util-filter` | Create filtered tree | n/a |
| `unist-util-map` | Create mapped tree | n/a |
| `unist-util-remove` | Remove nodes from tree | n/a |
| `unist-util-select` | CSS-like selectors on trees | n/a |
| `unist-util-find-after` | Find node after another | n/a |
| `unist-util-find-and-replace` | Find/replace text in tree | n/a |
| `unist-builder` | Create trees programmatically | n/a |

### hast-specific Utilities

| Utility | Purpose |
|---------|---------|
| `hast-util-is-element` | Check if node is (a specific) element |
| `hast-util-select` | querySelector/querySelectorAll on hast |
| `hast-util-find-and-replace` | Text find/replace in hast |
| `hast-util-classnames` | Merge class names |
| `hast-util-to-string` | Get textContent |
| `hast-util-to-text` | Get innerText |
| `hast-util-phrasing` | Check if node is phrasing content |
| `hast-util-heading` | Check if node is heading content |
| `hast-util-embedded` | Check if node is embedded content |
| `hast-util-sanitize` | Sanitize tree (XSS prevention) |

### mdast-specific Utilities

| Utility | Purpose |
|---------|---------|
| `mdast-util-to-string` | Get plain text content |
| `mdast-util-definitions` | Find definition nodes |
| `mdast-util-heading-range` | Use headings as ranges |
| `mdast-util-toc` | Generate TOC |
| `mdast-util-phrasing` | Check if node is phrasing content |
| `mdast-util-gfm` | GFM parse/serialize |
| `mdast-util-gfm-table` | GFM tables specifically |
| `mdast-util-directive` | Generic directives |

---

## 9. TypeScript Type Definitions

### Availability

| Package | Types | Source |
|---------|-------|--------|
| `@types/unist` | Separate `@types` package | DefinitelyTyped |
| `@types/hast` | Separate `@types` package | DefinitelyTyped |
| `@types/mdast` | Separate `@types` package | DefinitelyTyped |
| `hast-util-to-mdast` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `mdast-util-to-markdown` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `unist-util-visit` | **Shipped with package** | Included TypeScript types + `index.d.ts` |
| `hastscript` | **Shipped with package** | Included TypeScript types |
| `hast-util-to-jsx-runtime` | **Shipped with package** | Included TypeScript types |
| `rehype-remark` | **Shipped with package** | Included TypeScript types |
| `unified` | **Shipped with package** | Included TypeScript types |

### Type Quality

**Excellent**. Packages in the syntax-tree and unified ecosystems are written in JavaScript with JSDoc type annotations and ship `.d.ts` declaration files alongside the code. The type definitions are comprehensive and well-maintained. Key observations:

1. **Hast types are well-defined**: `Element`, `Text`, `Comment`, `Root`, `Properties`, `PropertyValue` are all properly typed
2. **Mdast types are well-defined**: All node types with their specific fields, plus content model types like `FlowContent`, `PhrasingContent`
3. **Handler types are exported**: `Handle`, `NodeHandle`, `Options`, `State` from hast-util-to-mdast
4. **Serializer types are exported**: `Handle`, `Handlers`, `Options`, `Unsafe`, `Join` from mdast-util-to-markdown
5. **Generic node types**: The TypeScript types support discriminated unions on `node.type`

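The last observation is worth a short illustration, since it is what makes handler code pleasant to write. The interfaces below are trimmed-down local copies of a few mdast node shapes (an assumption made to keep the example self-contained; the real definitions live in `@types/mdast`), and `describeNode` is a hypothetical helper.

```typescript
// Trimmed-down local copies of a few mdast node shapes.
interface TextNode { type: 'text'; value: string }
interface HeadingNode { type: 'heading'; depth: 1 | 2 | 3 | 4 | 5 | 6; children: TextNode[] }
interface CodeNode { type: 'code'; lang?: string | null; value: string }
type AnyNode = TextNode | HeadingNode | CodeNode

// Checking `node.type` narrows the union, so each branch only sees its own fields.
function describeNode(node: AnyNode): string {
  switch (node.type) {
    case 'heading':
      return `h${node.depth}: ${node.children.map((child) => child.value).join('')}`
    case 'code':
      return `code (${node.lang ?? 'plain'}), ${node.value.length} chars`
    case 'text':
      return node.value
  }
}

const summary = describeNode({
  type: 'heading',
  depth: 2,
  children: [{ type: 'text', value: 'API' }]
})
```

Accessing `node.depth` in the `'code'` branch would be a compile-time error, which catches a whole class of handler mistakes before the pipeline runs.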

### Type Usage Example

```typescript
import type { Element, Root, Text } from 'hast'
import type { Root as MdastRoot, Heading, Code } from 'mdast'
import type { Handle, Options as ToMdastOptions } from 'hast-util-to-mdast'
import type { Options as ToMarkdownOptions } from 'mdast-util-to-markdown'
```

---

## 10. Pipeline Feasibility Assessment

### The Full Pipeline

```
┌─────────────────┐     ┌───────────┐     ┌──────────┐     ┌──────────┐
│ JSX Components  │ ──▶ │   hast    │ ──▶ │  mdast   │ ──▶ │ markdown │
│  (React tree)   │     │ (HTML AST)│     │ (MD AST) │     │ (string) │
└─────────────────┘     └───────────┘     └──────────┘     └──────────┘
         │                    │                 │                │
      STEP 1:              STEP 2:           STEP 3:          STEP 4:
     JSX → hast          hast → mdast       mdast → md       md string
                         (the solved       (well-solved)
                          problem)
```

### Step-by-Step Assessment

#### Step 1: JSX → hast (THE HARDEST STEP)

**Challenge**: React components are not HTML. They have:
- Custom component elements (`<Card>`, `<UserAvatar>`) that don't map to HTML tags
- Props that aren't HTML attributes
- Rendering logic (conditionals, loops, state)
- Event handlers that are meaningless in markdown

**Approaches**:

**Approach A: Render to HTML string, then parse to hast**
```typescript
import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'

const html = renderToStaticMarkup(<MyComponent />)
const hast = fromHtml(html, { fragment: true })
```
- Pros: Complete rendering support, handles all React features
- Cons: Loss of custom element information, all components flatten to HTML

**Approach B: Use React's internal fiber tree to build hast directly**
- Pros: More control, can preserve custom element boundaries
- Cons: Relies on React internals, fragile

**Approach C: Use hastscript JSX runtime**
```jsx
/** @jsxImportSource hastscript */
const tree = <div class="card"><h2>Title</h2><p>Body</p></div>
```
- Pros: Direct hast construction, type-safe
- Cons: Can't use React components (no rendering), only raw elements

**Recommendation**: **Approach A** for the base pipeline, with **custom component registry** for handling non-HTML elements (see below).

#### Step 2: hast → mdast (SOLVED by hast-util-to-mdast)

- This step is fully solved by `hast-util-to-mdast` v10.1.2
- Handles all standard HTML elements
- Custom handlers for non-standard elements
- Proper paragraph wrapping, whitespace handling, content model enforcement
- GFM support (tables, strikethrough, task lists)

**Edge cases to watch**:
- Nested inline elements: `<strong><em>bold italic</em></strong>` → `***bold italic***`
- Mixed content: `<div>Text <h2>Heading</h2> More text</div>` → Proper paragraph wrapping
- Deeply nested lists
- Tables with merged cells (not supported in GFM)
- `<br>` inside `<p>`: Produces `break` nodes in phrasing content

#### Step 3: mdast → markdown (SOLVED by mdast-util-to-markdown)

- Fully solved by `mdast-util-to-markdown` v2.1.2
- Proper character escaping
- GFM output with `mdast-util-gfm` extension
- Configurable markers, indentation, heading styles

### Edge Cases and Failure Modes

#### Custom Components (The #1 Challenge)
|
||
|
||
A React component like `<DataField label="Name" value="Alice" />` renders to HTML like `<div class="data-field"><span class="label">Name</span><span class="value">Alice</span></div>`. But the markdown output would just be "Name Alice" without the structure.
|
||
|
||
**Solution**: Register custom handlers that use the component's semantic intent:
|
||
|
||
```typescript
|
||
// Option 1: Use CSS class-based detection
|
||
handlers: {
|
||
// Custom class→markdown mapping
|
||
'data-field': (state, node) => {
|
||
if (node.properties.className?.includes('data-field')) {
|
||
const label = findTextByClass(node, 'label')
|
||
const value = findTextByClass(node, 'value')
|
||
return { type: 'paragraph', children: [
|
||
{ type: 'strong', children: [{ type: 'text', value: label }] },
|
||
{ type: 'text', value: `: ${value}` }
|
||
]}
|
||
}
|
||
}
|
||
}
|
||
```
|
||
|
||

```typescript
// Option 2: Use data attributes for semantic hints
// Render with: <div data-md-type="field" data-md-label="Name">Alice</div>
handlers: {
  // `parent` must be declared here so it can be forwarded to the default handler
  div: (state, node, parent) => {
    const mdType = node.properties.dataMdType
    if (mdType === 'field') {
      // Custom conversion
    }
    return defaultHandlers.div(state, node, parent)
  }
}
```

```typescript
// Option 3: Pre-transform the hast tree before passing to toMdast
// Walk the hast tree and rewrite custom structures into semantic HTML
// elements that the default handlers already map cleanly
import { visit } from 'unist-util-visit'

visit(hast, 'element', (node) => {
  const classes = node.properties.className
  if (Array.isArray(classes) && classes.includes('callout')) {
    // Transform the element's structure to something that maps cleanly
  }
})
```
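
Option 1 relies on a `findTextByClass` helper that the report never defines. A minimal sketch of what it could look like follows. The helper name comes from the snippet above, but its behaviour (depth-first search, first match wins, empty string when nothing matches) is an assumption, and the node types are declared locally rather than imported from `@types/hast` so the example stands alone.

```typescript
// Minimal hast-like shapes, declared locally so the sketch has no dependencies.
type HastText = { type: 'text'; value: string }
type HastElement = {
  type: 'element'
  tagName: string
  properties: { className?: string[] }
  children: HastNode[]
}
type HastNode = HastText | HastElement

// Concatenate all text beneath a node.
function textOf(node: HastNode): string {
  return node.type === 'text' ? node.value : node.children.map(textOf).join('')
}

// Hypothetical helper: text of the first descendant carrying the given class.
function findTextByClass(node: HastElement, className: string): string {
  for (const child of node.children) {
    if (child.type !== 'element') continue
    if (child.properties.className?.includes(className)) return textOf(child).trim()
    const nested = findTextByClass(child, className)
    if (nested !== '') return nested
  }
  return ''
}

// The rendered <DataField label="Name" value="Alice" /> from the example above.
const dataField: HastElement = {
  type: 'element',
  tagName: 'div',
  properties: { className: ['data-field'] },
  children: [
    { type: 'element', tagName: 'span', properties: { className: ['label'] }, children: [{ type: 'text', value: 'Name' }] },
    { type: 'element', tagName: 'span', properties: { className: ['value'] }, children: [{ type: 'text', value: 'Alice' }] },
  ],
}

const fieldLabel = findTextByClass(dataField, 'label') // 'Name'
const fieldValue = findTextByClass(dataField, 'value') // 'Alice'
```

In a real handler, `hast-util-select` and `hast-util-to-string` (both listed in the dependencies) would likely replace the hand-written traversal.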

#### SVG/Math Content

- **Default behavior**: Ignored (content lost)
- **Workaround 1**: Convert to mdast `html` node (preserves as raw HTML in markdown)
- **Workaround 2**: Render to an image and use mdast `image` node
- **For LLM consumption**: Image alt text is likely more useful than raw SVG markup
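
For the LLM-oriented case, one option is to replace an `<svg>` with a short textual description. The sketch below assumes the SVG carries an `aria-label` (exposed as `ariaLabel` in hast properties) or a `<title>` child. The `describeSvg` name and the `[Figure: …]` output format are illustrative choices, not something the ecosystem prescribes, and the node types are declared locally.

```typescript
// Local stand-ins for hast nodes; a real handler would use the types from 'hast'.
type SvgNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; properties: Record<string, unknown>; children: SvgNode[] }
type SvgElement = Extract<SvgNode, { type: 'element' }>

function svgText(node: SvgNode): string {
  return node.type === 'text' ? node.value : node.children.map(svgText).join('')
}

// Hypothetical helper: produce an mdast-style text node describing the graphic.
function describeSvg(svg: SvgElement): { type: 'text'; value: string } {
  const label = svg.properties.ariaLabel
  const title = svg.children.find(
    (child): child is SvgElement => child.type === 'element' && child.tagName === 'title'
  )
  let description = ''
  if (typeof label === 'string' && label.trim() !== '') {
    description = label.trim() // prefer the explicit accessibility label
  } else if (title) {
    description = svgText(title).trim() // fall back to the <title> child
  }
  return { type: 'text', value: description ? `[Figure: ${description}]` : '[Figure]' }
}

const chart: SvgElement = {
  type: 'element',
  tagName: 'svg',
  properties: {},
  children: [
    { type: 'element', tagName: 'title', properties: {}, children: [{ type: 'text', value: 'Monthly revenue' }] },
    { type: 'element', tagName: 'path', properties: { d: 'M0 0' }, children: [] },
  ],
}

const figureNode = describeSvg(chart) // { type: 'text', value: '[Figure: Monthly revenue]' }
```

Registered under the `svg` key of `handlers`, a function of this shape would override the default "ignore" behaviour.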

#### Forms and Interactive Elements

- `<input>`, `<select>`, `<textarea>`: Processed for their text content
- Checkboxes become `[x]`/`[ ]` in GFM task lists
- Other form elements: Downgraded to text content

#### CSS-Dependent Layout

- Tables rendered via CSS grid/flexbox (not `<table>` elements) won't produce GFM tables
- Tab components rendered as `<div>` stacks won't produce meaningful markdown
- **Workaround**: Use semantic HTML (`<table>`, `<details>`, etc.) in the component's render output when markdown output is needed

#### Content That Has No Markdown Equivalent

| HTML Construct | Default Behavior | Better Alternative |
|---------------|-----------------|-------------------|
| `<details>/<summary>` | Text content only | Use mdast `html` node or custom directive |
| `<dialog>` | Ignored | Pre-process to extract content |
| `<meter>`, `<progress>` | Text content | Convert to descriptive text |
| `<ruby>`, `<rt>` | Text content | Custom handler for pronunciation annotation |
| `<iframe>` | Link to src | Custom handler for embed description |
| `<video>`, `<audio>` | Link to src | Custom handler for media description |
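
As an illustration of the "Better Alternative" column, the sketch below turns a `<details>` element into a bold summary line followed by the body text. It is only one possible convention: the `detailsToMdast` name is hypothetical, inline formatting inside the body is flattened to plain text for brevity, and the node shapes are local stand-ins for the `hast` and `mdast` types.

```typescript
// Local stand-ins for hast input and mdast output nodes.
type DNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; children: DNode[] }
type DElement = Extract<DNode, { type: 'element' }>
type MdInline = { type: 'text'; value: string } | { type: 'strong'; children: MdInline[] }
type MdParagraph = { type: 'paragraph'; children: MdInline[] }

function plainText(node: DNode): string {
  return node.type === 'text' ? node.value : node.children.map(plainText).join('')
}

// Hypothetical conversion: "**Summary**" paragraph, then the collapsed content.
function detailsToMdast(details: DElement): MdParagraph[] {
  const isSummary = (n: DNode): n is DElement => n.type === 'element' && n.tagName === 'summary'
  const summary = details.children.find(isSummary)
  const body = details.children.filter((n) => !isSummary(n)).map(plainText).join('').trim()

  const out: MdParagraph[] = []
  if (summary) {
    out.push({
      type: 'paragraph',
      children: [{ type: 'strong', children: [{ type: 'text', value: plainText(summary).trim() }] }],
    })
  }
  if (body !== '') {
    out.push({ type: 'paragraph', children: [{ type: 'text', value: body }] })
  }
  return out
}

const faq: DElement = {
  type: 'element',
  tagName: 'details',
  children: [
    { type: 'element', tagName: 'summary', children: [{ type: 'text', value: 'How do I reset?' }] },
    { type: 'element', tagName: 'p', children: [{ type: 'text', value: 'Hold the button for five seconds.' }] },
  ],
}

const faqNodes = detailsToMdast(faq)
```

Serialized, the result would read as `**How do I reset?**` followed by a paragraph with the answer, which keeps the question/answer pairing visible to an LLM.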

#### Whitespace and Formatting

- Multiple spaces in HTML collapse to single spaces by default
- `<pre>` whitespace is preserved in `code` nodes
- `newlines: true` option preserves line breaks during whitespace collapsing
- Indentation in `<pre>` blocks may need careful handling

#### Roundtrip Fidelity

Not all markdown constructs survive HTML roundtripping:

- Reference-style links (`[text][id]`) become direct links `[text](url)`
- Setext headings become ATX headings (configurable)
- Tight vs. loose lists may change
- Multiple markdown syntaxes collapse to one (e.g., both `*` and `_` for emphasis become `*`)

**For LLM consumption, this is acceptable**: the goal is readable markdown, not perfect roundtripping.

---

## 11. Alternative Approaches

### Existing Libraries: React/JSX → Markdown

There is **no mature, well-maintained library** that directly converts React component trees to markdown. The approaches that exist are simpler or serve different purposes:

#### 1. react-markdown + remark (the reverse direction)

**React Markdown** renders markdown as React components. This is the **opposite** of what we need. Not relevant.

#### 2. html-to-markdown (turndown)

**Repository**: https://github.com/mixmark-io/turndown

A widely-used HTML-to-markdown converter (not based on unist). It works on HTML strings, not ASTs.

- **Pros**: Simple API, well-tested, many plugins, configurable rules
- **Cons**: Not AST-based, no TypeScript AST types, no GFM table support out-of-box, less extensible than unist ecosystem
- **Comparison**: `hast-util-to-mdast` is more principled (proper AST, content model enforcement, better extensibility)

#### 3. react-to-markdown (hypothetical)

No prominent library exists under this name or concept on npm. Search terms "react to markdown", "jsx to mdast", "component to markdown" yield no direct results.

#### 4. mdast-util-from-adf

Converts Atlassian Document Format to mdast. Demonstrates the pattern of converting external formats to mdast, but for a different source format.

#### 5. hast-util-to-portable-text

Converts hast to Sanity's Portable Text format. Another example of hast → alternative format, but not markdown.

#### 6. Custom JSX → mdast builders

One could build React components that directly produce mdast nodes using `mdast-builder` or manual construction:

```typescript
import type { Heading, PhrasingContent } from 'mdast'

// A "component" that returns an mdast node instead of a React element
function MarkdownHeading({ depth, children }: { depth: Heading['depth']; children: PhrasingContent[] }): Heading {
  return { type: 'heading', depth, children }
}
```

- **Pros**: Direct control, no intermediate HTML
- **Cons**: Can't reuse existing React component ecosystem, dual rendering needed

#### 7. React Server Components + hast

If using RSC, the output could be intercepted and converted to hast before HTML serialization. This is speculative and would require custom infrastructure.

### Comparison Table

| Approach | Maturity | Type Safety | Extensibility | Custom Components | GFM Support |
|----------|----------|-------------|---------------|-------------------|-------------|
| hast-util-to-mdast | **High** (v10) | Excellent | Excellent (handlers) | Needs custom layer | Built-in |
| turndown | High | Poor | Good (rules) | Needs custom rules | Plugin |
| Custom JSX→mdast | Low | Manual | Full control | Full control | Manual |
| unified/rehype-remark | High | Excellent | Excellent (plugins) | Needs custom layer | Via remark-gfm |

**Recommended**: Use `hast-util-to-mdast` + `mdast-util-to-markdown` with a custom JSX→hast adapter.

---

## 12. Recommended Architecture

### Overview

```
┌────────────────────────────────────────────────────────────────────┐
│                      JSX → Markdown Pipeline                       │
│                                                                    │
│ ┌──────────────┐   ┌───────────────┐   ┌─────────┐   ┌───────────┐ │
│ │ JSX Renderer │──▶│ hast Builder  │──▶│ hast →  │──▶│ mdast →   │ │
│ │ (React SSR)  │   │ (fromHtml +   │   │ mdast   │   │ markdown  │ │
│ │              │   │  pre-process) │   │         │   │ (GFM)     │ │
│ └──────┬───────┘   └───────┬───────┘   └────┬────┘   └─────┬─────┘ │
│ ┌──────┴───────┐   ┌───────┴───────┐        │              │       │
│ │ Component    │   │ Custom        │        │              │       │
│ │ Registry     │   │ Handlers      │        │              │       │
│ │ (markdown    │   │ (data-md-     │        │              │       │
│ │  hints)      │   │  attributes)  │        │              │       │
│ └──────────────┘   └───────────────┘        │              │       │
│                                     ┌───────┴──────┐       │       │
│                                     │ Post-process │       │       │
│                                     │ (unist-util) │       │       │
│                                     └──────────────┘       ▼       │
│                                           Output (markdown string) │
└────────────────────────────────────────────────────────────────────┘
```

### Implementation Design

#### Phase 1: Core Pipeline

```typescript
// renderToMarkdown.ts

import type { ReactElement } from 'react'
import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'
import { toMdast } from 'hast-util-to-mdast'
import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'
import type { Root as HastRoot } from 'hast'
import type { Root as MdastRoot } from 'mdast'
import type { Handle } from 'hast-util-to-mdast'
import type { Options as MarkdownOptions } from 'mdast-util-to-markdown'
// Handler registry defined in Phase 2 (the module path is illustrative)
import { defaultCustomHandlers } from './handlers'

interface RenderOptions {
  /** Custom hast → mdast handlers */
  handlers?: Record<string, Handle>
  /** Markdown serialization options */
  markdownOptions?: MarkdownOptions
  /** Whether to produce GFM output (default: true) */
  gfm?: boolean
  /** Pre-processing hook for the hast tree */
  preProcess?: (hast: HastRoot) => HastRoot
  /** Post-processing hook for the mdast tree */
  postProcess?: (mdast: MdastRoot) => MdastRoot
}

function renderToMarkdown(element: ReactElement, options?: RenderOptions): string {
  // Step 1: Render JSX to HTML
  const html = renderToStaticMarkup(element)

  // Step 2: Parse HTML to hast
  let hast = fromHtml(html, { fragment: true })

  // Step 3: Pre-process hast (optional: semantic annotation, cleanup)
  if (options?.preProcess) {
    hast = options.preProcess(hast)
  }

  // Step 4: Convert hast to mdast
  // `let`, because the post-process hook below may replace the tree.
  // A hast root converts to an mdast root, hence the cast.
  let mdast = toMdast(hast, {
    handlers: {
      ...defaultCustomHandlers,
      ...options?.handlers,
    },
    document: false,
  }) as MdastRoot

  // Step 5: Post-process mdast (optional: custom transforms)
  if (options?.postProcess) {
    mdast = options.postProcess(mdast)
  }

  // Step 6: Serialize mdast to markdown
  const extensions = options?.gfm !== false ? [gfmToMarkdown()] : []
  const markdown = toMarkdown(mdast, {
    ...options?.markdownOptions,
    extensions,
  })

  return markdown
}
```

#### Phase 2: Component Registry / Semantic Annotations

The key innovation is a **data attribute convention** that components use to hint markdown semantics:

```tsx
// Component renders with markdown hints
function InfoCallout({ title, children }: { title: string; children: React.ReactNode }) {
  return (
    <div data-md="callout" data-md-callout-type="info">
      <strong data-md="callout-title">{title}</strong>
      <div data-md="callout-content">{children}</div>
    </div>
  )
}
```

Handler:

```typescript
const calloutHandler: Handle = (state, node) => {
  // Property values are not guaranteed to be strings, so normalize first
  const calloutType = String(node.properties.dataMdCalloutType || 'note')
  // findChildByDataMd is a project-specific helper, not part of hast-util-to-mdast
  // The title is looked up here but not yet emitted in the output below
  const title = findChildByDataMd(node, 'callout-title')
  const content = findChildByDataMd(node, 'callout-content')

  return {
    type: 'blockquote',
    children: [
      {
        type: 'paragraph',
        children: [
          { type: 'strong', children: [{ type: 'text', value: calloutType.toUpperCase() }] },
          { type: 'text', value: ': ' },
          // Guard against a callout rendered without a content wrapper
          ...(content ? state.all(content) : []),
        ]
      }
    ]
  }
}
```

Registry:

```typescript
import { defaultHandlers } from 'hast-util-to-mdast'

const defaultCustomHandlers: Record<string, Handle> = {
  // Map data-md attributes to handlers
  div: (state, node, parent) => {
    const mdType = node.properties.dataMd
    switch (mdType) {
      case 'callout': return calloutHandler(state, node, parent)
      case 'admonition': return admonitionHandler(state, node, parent)
      // ... extensible
      // Returning undefined would drop the node, so delegate explicitly
      default: return defaultHandlers.div(state, node, parent)
    }
  }
}
```
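
The callout handler depends on a `findChildByDataMd` lookup that is left undefined above. A minimal sketch, assuming it should return the first descendant element whose `data-md` attribute (exposed as `dataMd` in hast properties) equals the requested role. The search order and the local node types are assumptions made for illustration.

```typescript
// Local stand-ins for hast nodes; only the dataMd property matters here.
type DmNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; properties: { dataMd?: string }; children: DmNode[] }
type DmElement = Extract<DmNode, { type: 'element' }>

// Hypothetical helper: depth-first search for an element tagged with a role.
function findChildByDataMd(node: DmElement, role: string): DmElement | undefined {
  for (const child of node.children) {
    if (child.type !== 'element') continue
    if (child.properties.dataMd === role) return child
    const nested = findChildByDataMd(child, role)
    if (nested) return nested
  }
  return undefined
}

// hast shape produced by the InfoCallout component shown earlier.
const callout: DmElement = {
  type: 'element',
  tagName: 'div',
  properties: { dataMd: 'callout' },
  children: [
    {
      type: 'element',
      tagName: 'strong',
      properties: { dataMd: 'callout-title' },
      children: [{ type: 'text', value: 'Heads up' }],
    },
    {
      type: 'element',
      tagName: 'div',
      properties: { dataMd: 'callout-content' },
      children: [{ type: 'text', value: 'Back up your data first.' }],
    },
  ],
}

const titleElement = findChildByDataMd(callout, 'callout-title')
const contentElement = findChildByDataMd(callout, 'callout-content')
```

Returning `undefined` for a missing role lets the handler decide how to degrade, for example by emitting the callout without a body.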

#### Phase 3: Advanced - Dual-Mode Components

For components that need to render differently in "markdown mode" vs "visual mode":

```tsx
interface MarkdownAwareProps {
  /** When defined, used for markdown output instead of the visual render */
  markdownContent?: string | React.ReactNode
}

function DataTable({ data, markdownContent }: MarkdownAwareProps & DataTableProps) {
  // Visual mode: rich table with sorting, pagination
  // Markdown mode: simple GFM table
  // Only a string can travel through an HTML attribute; other nodes are skipped
  const raw = typeof markdownContent === 'string' ? markdownContent : undefined
  return (
    <div className="data-table" data-md="table" data-md-raw={raw}>
      {/* ... rich visual table ... */}
    </div>
  )
}
```
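
Instead of shipping pre-written markdown in an attribute, the markdown-mode table could be built as an mdast `table` node straight from the row data. The sketch below assumes rows are plain records sharing the keys of the first row. `rowsToMdastTable` is a hypothetical name, and the node shapes mirror the mdast `table` / `tableRow` / `tableCell` structure but are declared locally.

```typescript
// Local mirrors of the mdast table node shapes.
type Cell = { type: 'tableCell'; children: Array<{ type: 'text'; value: string }> }
type Row = { type: 'tableRow'; children: Cell[] }
type Table = { type: 'table'; align: Array<'left' | 'right' | 'center' | null>; children: Row[] }

function cell(value: unknown): Cell {
  const text = value === undefined || value === null ? '' : String(value)
  return { type: 'tableCell', children: [{ type: 'text', value: text }] }
}

// Hypothetical builder: the first row becomes the header, as mdast expects.
function rowsToMdastTable(rows: Array<Record<string, unknown>>): Table | undefined {
  const first = rows[0]
  if (first === undefined) return undefined // nothing to render

  const columns = Object.keys(first)
  const header: Row = { type: 'tableRow', children: columns.map((name) => cell(name)) }
  const body: Row[] = rows.map((row) => ({
    type: 'tableRow',
    children: columns.map((name) => cell(row[name])),
  }))

  return { type: 'table', align: columns.map(() => null), children: [header, ...body] }
}

const usersTable = rowsToMdastTable([
  { name: 'Alice', role: 'admin' },
  { name: 'Bob', role: 'viewer' },
])
```

Passing a node of this shape to `toMarkdown` with the `gfmToMarkdown()` extension should yield a pipe table, which sidesteps re-parsing markdown stored in `data-md-raw`.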

#### Phase 4: Pre-Render Hooks

For components that need full control over their markdown output:

```typescript
import { visit } from 'unist-util-visit'
import type { Root as HastRoot } from 'hast'
import type { Nodes as MdastNode } from 'mdast'

interface MarkdownRenderer {
  /** If returned, used directly as markdown instead of converting the hast */
  toMarkdown?(): string | MdastNode
}

// In the pre-process step:
function preProcess(hast: HastRoot): HastRoot {
  visit(hast, 'element', (node) => {
    // Check if this element's component provided a markdown hint
    const markdownDirect = node.properties.dataMdRaw
    if (markdownDirect) {
      // Replace the element with a text node containing the markdown
      // (this will be parsed separately or stored as raw markdown)
    }
  })
  return hast
}
```
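
One way to fill in the "stored as raw markdown" branch is a placeholder scheme: swap each annotated element for an alphanumeric token, run the normal pipeline, then substitute the stored markdown back into the serialized string. This is an assumed design, not something the unist tooling provides. The token format, function names, and local node types below are all illustrative, and the traversal is hand-written so the sketch runs without `unist-util-visit`.

```typescript
// Local stand-ins for hast nodes; only dataMdRaw matters for this sketch.
type PNode =
  | { type: 'text'; value: string }
  | { type: 'element'; tagName: string; properties: { dataMdRaw?: string }; children: PNode[] }

// Replace annotated elements with tokens and remember their raw markdown.
function extractRawMarkdown(children: PNode[], store: Map<string, string>): void {
  for (let i = 0; i < children.length; i++) {
    const child = children[i]
    if (child === undefined || child.type !== 'element') continue
    const raw = child.properties.dataMdRaw
    if (typeof raw === 'string' && raw !== '') {
      // Purely alphanumeric, so markdown escaping should leave it untouched
      const token = `MDRAW${store.size}TOKEN`
      store.set(token, raw)
      children[i] = { type: 'text', value: token }
    } else {
      extractRawMarkdown(child.children, store)
    }
  }
}

// After serialization, put the stored markdown back in place of each token.
function restoreRawMarkdown(markdown: string, store: Map<string, string>): string {
  let out = markdown
  store.forEach((raw, token) => {
    out = out.split(token).join(raw)
  })
  return out
}

const tree: PNode[] = [
  { type: 'element', tagName: 'p', properties: {}, children: [{ type: 'text', value: 'Intro' }] },
  { type: 'element', tagName: 'div', properties: { dataMdRaw: '| a |\n| - |\n| 1 |' }, children: [] },
]
const rawStore = new Map<string, string>()
extractRawMarkdown(tree, rawStore)

// Stand-in for the string the real pipeline would produce from `tree`.
const restored = restoreRawMarkdown('Intro\n\nMDRAW0TOKEN\n', rawStore)
```

The trade-off is that the substituted markdown bypasses mdast entirely, so post-processing hooks never see it.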

### Package Dependencies

```json
{
  "dependencies": {
    "hast-util-from-html": "^2.0.0",
    "hast-util-to-mdast": "^10.0.0",
    "mdast-util-to-markdown": "^2.0.0",
    "mdast-util-gfm": "^3.0.0",
    "unist-util-visit": "^5.0.0",
    "hast-util-select": "^6.0.0",
    "hast-util-to-string": "^3.0.0"
  },
  "devDependencies": {
    "@types/hast": "^3.0.0",
    "@types/mdast": "^4.0.0",
    "@types/unist": "^3.0.0",
    "react": "^18.0.0",
    "react-dom": "^18.0.0"
  }
}
```

Estimated total bundle size: ~50-80KB (tree-shaken, minified)

---

## 13. Appendix: Element-to-Markdown Mapping Table

### Complete Mapping for Standard HTML Elements

| HTML Element | mdast Output | Markdown | Notes |
|-------------|-------------|----------|-------|
| `<h1>` | `heading` (depth: 1) | `# ` | |
| `<h2>` | `heading` (depth: 2) | `## ` | |
| `<h3>` | `heading` (depth: 3) | `### ` | |
| `<h4>` | `heading` (depth: 4) | `#### ` | |
| `<h5>` | `heading` (depth: 5) | `##### ` | |
| `<h6>` | `heading` (depth: 6) | `###### ` | |
| `<p>` | `paragraph` | (blank line delimited) | |
| `<strong>`, `<b>` | `strong` | `**text**` | |
| `<em>`, `<i>` | `emphasis` | `*text*` | |
| `<del>`, `<s>`, `<strike>` | `delete` (GFM) | `~~text~~` | |
| `<code>` | `inlineCode` | `` `code` `` | |
| `<pre><code>` | `code` | `` ``` `` fenced block | `lang` from `class="language-xxx"` |
| `<a href>` | `link` | `[text](url "title")` | |
| `<img>` | `image` | `![alt](src "title")` | |
| `<ul>` | `list` (ordered: false) | `- item` | |
| `<ol>` | `list` (ordered: true) | `1. item` | `start` attribute respected |
| `<li>` | `listItem` | (list item) | GFM: `checked` for task lists |
| `<blockquote>` | `blockquote` | `> text` | |
| `<hr>` | `thematicBreak` | `---` | |
| `<table>` | `table` (GFM) | `\| col \| col \|` | |
| `<tr>` | `tableRow` | | |
| `<th>`, `<td>` | `tableCell` | | `align` from `align` attr |
| `<br>` | `break` | (hard break) | Two spaces + newline or `\` |
| `<q>` | `text` with quotes | ("quoted") | Uses `quotes` option for nesting |
| `<input type="checkbox">` | `text` | `[x]` or `[ ]` | GFM task list item support |
| `<abbr>` | `text` | (just text) | Title lost |
| `<sup>` | `text` | (just text) | Superscript lost |
| `<sub>` | `text` | (just text) | Subscript lost |
| `<mark>` | `text` | (just text) | Highlight lost |
| `<small>` | `text` | (just text) | Small text lost |
| `<details>` | Children only | (content extracted) | No markdown equivalent |
| `<summary>` | `text` | (just text) | Collapsed with details |
| `<dl>`, `<dt>`, `<dd>` | paragraphs | (text only) | No definition list in markdown |
| `<figure>` | Children only | (content extracted) | |
| `<figcaption>` | paragraph/text | (text) | |
| `<video>`, `<audio>` | `link` | `[src](url)` | Downgraded to link |
| `<iframe>` | `link` | `[src](url)` | Downgraded to link |
| `<svg>` | (ignored) | (nothing) | Use custom handler to preserve |
| `<math>` | (ignored) | (nothing) | Use custom handler for LaTeX output |
| `<script>` | (ignored) | (nothing) | |
| `<style>` | (ignored) | (nothing) | |
| `<div>`, `<section>`, `<article>` | Children extracted | (children only) | Container elements unwrapped |
| `<span>`, `<time>` | Children extracted | (children only) | Inline containers unwrapped |
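
To make the "mdast Output → Markdown" columns concrete, the toy serializer below maps a handful of the node types from the table to their markdown form. It is a deliberately simplified illustration and not how `mdast-util-to-markdown` is implemented: there is no escaping, no options, and only a few node types are covered. The node shapes are declared locally.

```typescript
// A small subset of mdast, declared locally for the illustration.
type Md =
  | { type: 'text'; value: string }
  | { type: 'inlineCode'; value: string }
  | { type: 'strong' | 'emphasis' | 'paragraph'; children: Md[] }
  | { type: 'heading'; depth: 1 | 2 | 3 | 4 | 5 | 6; children: Md[] }
  | { type: 'link'; url: string; children: Md[] }

function inline(children: Md[]): string {
  return children.map(serialize).join('')
}

// Toy serializer: no escaping, so unsafe for arbitrary text.
function serialize(node: Md): string {
  switch (node.type) {
    case 'text':
      return node.value
    case 'inlineCode':
      return '`' + node.value + '`'
    case 'strong':
      return `**${inline(node.children)}**`
    case 'emphasis':
      return `*${inline(node.children)}*`
    case 'paragraph':
      return inline(node.children)
    case 'heading':
      return `${'#'.repeat(node.depth)} ${inline(node.children)}`
    case 'link':
      return `[${inline(node.children)}](${node.url})`
  }
}

// <strong><em>bold italic</em></strong> from the edge-case list
const nested = serialize({
  type: 'strong',
  children: [{ type: 'emphasis', children: [{ type: 'text', value: 'bold italic' }] }],
}) // '***bold italic***'

const heading = serialize({
  type: 'heading',
  depth: 2,
  children: [{ type: 'text', value: 'Title' }],
}) // '## Title'
```

The real serializer additionally handles escaping, block spacing, and the configurable markers noted in the Step 3 summary.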

---

## Summary of Key Findings

1. **The hast→mdast→markdown chain is production-ready**: `hast-util-to-mdast` (v10.1.2) and `mdast-util-to-markdown` (v2.1.2) are mature, well-tested, well-typed libraries that cover the standard HTML elements (SVG and MathML are ignored by default) and produce clean GFM markdown.

2. **The gap is JSX→hast**: There is no off-the-shelf solution for converting React component trees directly to hast. The recommended approach is React server-side rendering (to HTML) followed by `hast-util-from-html` parsing.

3. **Custom components need a strategy**: Standard HTML elements map well, but custom React components lose their semantics through HTML rendering. A `data-md` attribute convention + custom handler registry is the recommended solution.

4. **TypeScript support is excellent**: All packages in the ecosystem ship with proper TypeScript types. The discriminated union on `node.type` makes pattern matching clean and type-safe.

5. **GFM is fully supported**: Tables, strikethrough, task lists, and autolink literals are all handled.

6. **The unist utility ecosystem is rich**: `unist-util-visit`, `hast-util-select`, and dozens of other utilities provide all the building blocks needed for pre/post-processing.

7. **For LLM consumption specifically**: The markdown output quality from this pipeline is already excellent for standard HTML. The main investment should go into the component annotation system and custom handlers for domain-specific components.

---

*Research completed 2026-04-28. All version numbers and repository states reflect the latest available at time of research.*