Unist Ecosystem Research: JSX → Markdown Pipeline for LLM Consumption

Date: 2026-04-28
Topic: Feasibility of JSX components → hast → mdast → markdown pipeline using the Unist/syntax-tree ecosystem


Table of Contents

  1. Executive Summary
  2. unist: The Universal Foundation
  3. hast: Hypertext Abstract Syntax Tree
  4. mdast: Markdown Abstract Syntax Tree
  5. hast-util-to-mdast: The Key Transform
  6. mdast-util-to-markdown: Serialization to Markdown
  7. remark/rehype Ecosystem
  8. unist-util-visit and Related Utilities
  9. TypeScript Type Definitions
  10. Pipeline Feasibility Assessment
  11. Alternative Approaches
  12. Recommended Architecture
  13. Appendix: Element-to-Markdown Mapping Table

1. Executive Summary

The JSX → hast → mdast → markdown pipeline is feasible and well-supported by mature, well-typed libraries in the unist/syntax-tree ecosystem. The core transformation chain is:

JSX Component Tree  →  hast (HTML AST)  →  mdast (Markdown AST)  →  markdown string
         │                   │                     │                      │
  React rendering     hast-util-from-html    hast-util-to-mdast    mdast-util-to-markdown
  or react-dom/        or manual hast          (v10.1.2)             (v2.1.2)
  server rendering     construction

Key finding: The hardest step is not hast→mdast→markdown (which is solved by existing, mature libraries), but rather JSX → hast and handling custom components that have no direct HTML/markdown equivalent. The ecosystem provides excellent tooling for standard HTML elements but requires a custom strategy for framework-specific components.

Verdict: Use the existing unist ecosystem libraries for the hast→mdast→markdown steps. Build a custom JSX→hast adapter layer that handles React component rendering and custom element mapping.


2. unist: The Universal Foundation

Repository: https://github.com/syntax-tree/unist
Current version: 3.0.0
License: CC-BY-4.0

unist is the abstract base specification that hast, mdast, xast, and nlcst all implement. It defines the minimal node interface that all syntax tree nodes share.

Core Node Interface

interface Node {
  type: string          // Non-empty string identifying the node variant
  data?: Data           // Ecosystem-specific metadata
  position?: Position   // Source location info
}

interface Parent extends Node {
  children: Node[]      // Child nodes
}

interface Literal extends Node {
  value: any            // Node's value
}

interface Position {
  start: Point
  end: Point
}

interface Point {
  line: number     // 1-indexed
  column: number   // 1-indexed
  offset?: number  // 0-indexed
}

Design Principles

  • All values must be JSON-serializable (no functions, undefined, symbols)
  • Trees can survive JSON.parse(JSON.stringify(tree)) roundtrips
  • data field is reserved for ecosystem use; specifications never define fields on it
  • position must be absent on generated nodes

Why This Matters for Our Pipeline

The JSON-serializability constraint means the AST is inherently portable and can be passed between contexts (server/client, different frameworks). The data field provides an escape hatch for custom metadata that custom component handlers can use.
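
As a rough illustration of the point above, the following sketch builds a small generated tree, stores custom metadata under data, and checks that it survives a JSON roundtrip. The types are declared locally so nothing needs to be installed, and componentName is a hypothetical field chosen for this example, not something any specification defines.

```typescript
// Minimal unist-style types, declared locally so the sketch has no dependencies.
interface UnistPoint { line: number; column: number; offset?: number }
interface UnistPosition { start: UnistPoint; end: UnistPoint }
interface UnistNode {
  type: string
  data?: Record<string, unknown>
  position?: UnistPosition
  value?: unknown
  children?: UnistNode[]
}

// A generated tree (no `position`), with custom metadata stored under `data`.
// `componentName` is a hypothetical field, not part of any specification.
const original: UnistNode = {
  type: 'root',
  children: [
    {
      type: 'element',
      data: { componentName: 'DataField' },
      children: [{ type: 'text', value: 'Alice' }]
    }
  ]
}

// Every value is JSON-serializable, so the roundtrip yields an equal tree.
const roundTripped: UnistNode = JSON.parse(JSON.stringify(original))

console.log(JSON.stringify(roundTripped) === JSON.stringify(original)) // true
console.log(roundTripped.children?.[0]?.data?.componentName) // DataField
```

A function or a Symbol stored under data would be dropped by the same roundtrip, which is why handler logic should live outside the tree.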


3. hast: Hypertext Abstract Syntax Tree

Repository: https://github.com/syntax-tree/hast
Spec version: 2.4.0
Type definitions: @types/hast
Stars: 892

hast represents HTML (and embedded SVG/MathML) as an abstract syntax tree. It extends unist.

Node Types

| Node Type | Extends | Description | Key Fields |
| --- | --- | --- | --- |
| Root | Parent | Document root | children |
| Element | Parent | HTML element | tagName, properties, children, content? |
| Text | Literal | Text content | value |
| Comment | Literal | HTML comment | value |
| Doctype | Node | Document type declaration | (none beyond unist Node) |

Element Interface (the workhorse)

interface Element extends Parent {
  type: 'element'
  tagName: string           // e.g., 'div', 'span', 'custom-card'
  properties: Properties    // HTML attributes mapped to DOM properties
  content?: Root            // Only for <template> elements
  children: Array<Comment | Element | Text>
}

Properties System

hast uses DOM-style property names, not HTML attribute names:

| HTML Attribute | hast Property |
| --- | --- |
| class | className (array: ['foo', 'bar']) |
| for | htmlFor |
| data-* | data* (camelCase) |
| aria-* | aria* (camelCase) |
| tabindex | tabIndex |
| colspan | colSpan |

Property values:

  • Boolean attributes: true/false
  • Numeric attributes: number
  • Space-separated: string[] (e.g., className: ['foo', 'bar'])
  • All other: string
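
To make the naming rules concrete, here is a minimal sketch that converts the attribute names from the table into hast property names. attributeToProperty and specialCases are hypothetical names invented for this example; the sketch only covers the rows listed above, whereas real tooling relies on a complete lookup table (some names have irregular casing that a generic camelCase rule would get wrong).

```typescript
// Hypothetical lookup for the irregular names listed in the table above.
const specialCases: Record<string, string> = {
  class: 'className',
  for: 'htmlFor',
  tabindex: 'tabIndex',
  colspan: 'colSpan'
}

// Hypothetical helper: HTML attribute name -> hast property name.
// Only a sketch; it does not know every attribute HTML defines.
function attributeToProperty(attribute: string): string {
  const lower = attribute.toLowerCase()
  const special = specialCases[lower]
  if (special !== undefined) return special
  // data-* and aria-* become camelCase: data-md-type -> dataMdType
  if (lower.startsWith('data-') || lower.startsWith('aria-')) {
    return lower.replace(/-([a-z0-9])/g, (_match, char: string) => char.toUpperCase())
  }
  return lower
}

console.log(attributeToProperty('class'))        // className
console.log(attributeToProperty('data-md-type')) // dataMdType
console.log(attributeToProperty('aria-label'))   // ariaLabel
```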

Example: hast tree for HTML

HTML:

<a href="https://alpha.com" class="bravo" download>Link</a>

hast:

{
  "type": "element",
  "tagName": "a",
  "properties": {
    "href": "https://alpha.com",
    "className": ["bravo"],
    "download": true
  },
  "children": [{"type": "text", "value": "Link"}]
}

Key Utilities for hast Construction

  • hastscript (v9.0.1) — h() function to create hast trees, like React's createElement. Supports JSX via automatic runtime (@jsxImportSource hastscript).
  • hast-util-from-html — Parse HTML string to hast
  • hast-util-from-dom — Convert browser DOM nodes to hast
  • hast-util-to-html — Serialize hast to HTML string
  • hast-util-to-jsx-runtime (v2.3.6) — Convert hast to React/Preact/Solid/Svelte/Vue (the reverse direction of what we need)
  • hast-util-select — CSS selector queries on hast trees (querySelector, etc.)

hastscript JSX Support

This is significant: hastscript supports using JSX syntax to directly create hast trees:

/** @jsxImportSource hastscript */
const tree = (
  <div class="foo" id="some-id">
    <span>some text</span>
    <input type="text" value="foo" />
  </div>
)

This produces a hast tree, not a React element. This is a potential alternative entry point for our pipeline.


4. mdast: Markdown Abstract Syntax Tree

Repository: https://github.com/syntax-tree/mdast
Spec version: 5.0.0
Type definitions: @types/mdast
Stars: 1.4k

mdast represents markdown as an abstract syntax tree. It extends unist.

Core Node Types (CommonMark)

| Node Type | Category | Description | Key Fields |
| --- | --- | --- | --- |
| Root | | Document root | children |
| Paragraph | Content | Text paragraph | children: [PhrasingContent] |
| Heading | Flow | Section heading | depth: 1-6, children: [PhrasingContent] |
| Blockquote | Flow | Quoted section | children: [FlowContent] |
| List | Flow | Ordered/unordered list | ordered, start, spread, children: [ListItem] |
| ListItem | ListContent | List item | spread, checked?, children: [FlowContent] |
| Code | Flow | Fenced/indented code block | value, lang?, meta? |
| ThematicBreak | Flow | Horizontal rule --- | (none) |
| Html | Flow/Phrasing | Raw HTML in markdown | value |
| Definition | Content | Link/image reference def | identifier, label, url, title |
| Text | Phrasing | Plain text | value |
| Emphasis | Phrasing | Italic *text* | children: [PhrasingContent] |
| Strong | Phrasing | Bold **text** | children: [PhrasingContent] |
| InlineCode | Phrasing | Inline code `code` | value |
| Break | Phrasing | Hard line break | (none) |
| Link | Phrasing | Hyperlink | url, title?, children: [PhrasingContent] |
| LinkReference | Phrasing | Link by reference | identifier, label, referenceType |
| Image | Phrasing | Image | url, title?, alt? |
| ImageReference | Phrasing | Image by reference | identifier, label, referenceType, alt? |

GFM Extension Nodes

| Node Type | Description | Key Fields |
| --- | --- | --- |
| Delete | Strikethrough ~~text~~ | children: [PhrasingContent] |
| Table | Table | align?: [alignType], children: [TableRow] |
| TableRow | Table row | children: [TableCell] |
| TableCell | Table cell | children: [PhrasingContent] |
| FootnoteDefinition | Footnote def | identifier, label, children: [FlowContent] |
| FootnoteReference | Footnote ref | identifier, label |

Content Model Hierarchy

MdastContent = FlowContent | ListContent | PhrasingContent

FlowContent    = Blockquote | Code | Heading | Html | List | ThematicBreak | Paragraph
ListContent    = ListItem
PhrasingContent = Break | Emphasis | Html | Image | ImageReference | InlineCode
                 | Link | LinkReference | Strong | Text
                  + GFM: Delete | FootnoteReference

This hierarchy is critical: hast-util-to-mdast must map HTML's content model (which doesn't have this distinction) into mdast's strict flow/phrasing content model.
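
The flow/phrasing distinction can be sketched with a couple of simplified types. The code below is not how hast-util-to-mdast is implemented; it is a minimal illustration, under the assumption of a two-node phrasing model, of what "promoting" loose phrasing content into paragraphs means. MiniPhrasing, MiniFlow, and wrapLoosePhrasing are names made up for this example.

```typescript
// Simplified stand-ins for the mdast types; the real ones live in @types/mdast.
type MiniPhrasing =
  | { type: 'text'; value: string }
  | { type: 'strong'; children: MiniPhrasing[] }
type MiniFlow =
  | { type: 'paragraph'; children: MiniPhrasing[] }
  | { type: 'heading'; depth: 1 | 2 | 3 | 4 | 5 | 6; children: MiniPhrasing[] }

function isPhrasing(node: MiniPhrasing | MiniFlow): node is MiniPhrasing {
  return node.type === 'text' || node.type === 'strong'
}

// Group consecutive phrasing nodes into paragraphs so the result is pure flow content.
function wrapLoosePhrasing(nodes: Array<MiniPhrasing | MiniFlow>): MiniFlow[] {
  const result: MiniFlow[] = []
  let run: MiniPhrasing[] = []
  const flush = (): void => {
    if (run.length > 0) result.push({ type: 'paragraph', children: run })
    run = []
  }
  for (const node of nodes) {
    if (isPhrasing(node)) {
      run.push(node)
    } else {
      flush()
      result.push(node)
    }
  }
  flush()
  return result
}

// Roughly what <div>Text <h2>Heading</h2> More text</div> needs to become.
const flow = wrapLoosePhrasing([
  { type: 'text', value: 'Text ' },
  { type: 'heading', depth: 2, children: [{ type: 'text', value: 'Heading' }] },
  { type: 'text', value: ' More text' }
])
console.log(flow.map((node) => node.type)) // [ 'paragraph', 'heading', 'paragraph' ]
```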

Mixin Types

interface Resource {
  url: string
  title?: string
}

interface Alternative {
  alt?: string
}

interface Association {
  identifier: string
  label?: string
}

interface Reference {
  referenceType: 'shortcut' | 'collapsed' | 'full'
}

5. hast-util-to-mdast: The Key Transform

Repository: https://github.com/syntax-tree/hast-util-to-mdast
Current version: 10.1.2
License: MIT
Stars: 43

This is the critical library in the pipeline — it converts hast (HTML AST) into mdast (Markdown AST).

API

import { toMdast } from 'hast-util-to-mdast'

const mdastTree = toMdast(hastTree, options?)

Options

interface Options {
  newlines?: boolean       // Keep line endings when collapsing whitespace (default: false)
  checked?: string         // Value for checked checkbox (default: '[x]')
  unchecked?: string       // Value for unchecked checkbox (default: '[ ]')
  quotes?: string[]        // Quote characters for <q> nesting (default: ['"'])
  document?: boolean       // Whether tree is a complete document (default: auto-detect)
  handlers?: Record<string, Handle>      // Custom element handlers
  nodeHandlers?: Record<string, NodeHandle>  // Custom node type handlers
}

Custom Handlers

This is the extensibility mechanism most relevant to our use case:

type Handle = (
  state: State,
  element: Element,
  parent: HastParent
) => Array<MdastNode> | MdastNode | undefined

type NodeHandle = (
  state: State,
  node: any,
  parent: HastParent
) => Array<MdastNode> | MdastNode | undefined

The handlers option maps HTML tag names to custom conversion functions. Custom handlers are merged into the defaults, so you can override specific tags without reimplementing everything.

The nodeHandlers option maps hast node types (like 'text', 'comment') to custom handlers.
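
As a sketch of what such a handler might look like, the example below converts a hypothetical <user-avatar name="..."> custom element into a text node. The types are structural stand-ins declared locally (SketchElement, SketchText), and the handler ignores the state and parent arguments that the real Handle type also receives.

```typescript
// Structural stand-ins for the hast/mdast types, so the sketch runs without dependencies.
interface SketchElement {
  type: 'element'
  tagName: string
  properties: Record<string, unknown>
  children: unknown[]
}
interface SketchText { type: 'text'; value: string }

// Handler for a hypothetical <user-avatar name="..."> custom element.
// `_state` stands in for the State object, which this sketch does not use.
function userAvatar(_state: unknown, element: SketchElement): SketchText | undefined {
  const name = element.properties.name
  // Returning undefined yields no mdast node for this element.
  return typeof name === 'string' ? { type: 'text', value: `@${name}` } : undefined
}

// Keyed by tag name, which is the shape the `handlers` option expects.
const customHandlers = { 'user-avatar': userAvatar }

console.log(
  customHandlers['user-avatar'](undefined, {
    type: 'element',
    tagName: 'user-avatar',
    properties: { name: 'Alice' },
    children: []
  })
) // { type: 'text', value: '@Alice' }
```

With the real library, an object of this shape would be passed as toMdast(tree, { handlers: customHandlers }), leaving every other tag to the defaults.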

State Object

Passed to all handlers:

interface State {
  patch: (from: HastNode, to: MdastNode) => undefined    // Copy positional info
  one: (node: HastNode, parent?: HastParent) => MdastNode  // Transform single node
  all: (parent: HastParent) => Array<MdastContent>         // Transform children
  toFlow: (nodes: Array<MdastContent>) => Array<MdastFlowContent>  // Promote to flow content
  resolve: (url: string | null | undefined) => string       // Resolve URLs
  options: Options
  elementById: Map<string, Element>
  handlers: Record<string, Handle>
  nodeHandlers: Record<string, NodeHandle>
  inTable: boolean           // Whether we're inside a table
  qNesting: number           // <q> nesting depth
}

How It Handles Different Element Categories

Inline Elements (Phrasing Content)

| HTML Element | mdast Node | Notes |
| --- | --- | --- |
| <strong>, <b> | strong | Children processed recursively |
| <em>, <i> | emphasis | Children processed recursively |
| <code> | inlineCode | Value extracted from text child |
| <a href="..."> | link | url from href, title from title attr |
| <br> | break | Hard line break |
| <del>, <s>, <strike> | delete (GFM) | Strikethrough |
| <q> | text with quotes | Uses quotes option for nesting |
| <img> | image | url from src, alt from alt attr |
| <sub>, <sup>, <mark>, etc. | Text content only | Non-semantic in markdown; children extracted |
| <input type="checkbox"> | text | Uses checked/unchecked options |

Block Elements (Flow Content)

| HTML Element | mdast Node | Notes |
| --- | --- | --- |
| <h1> through <h6> | heading (depth 1-6) | |
| <p> | paragraph | |
| <blockquote> | blockquote | |
| <ul> | list (ordered: false) | |
| <ol> | list (ordered: true, start) | |
| <li> | listItem | GFM: checked for task lists |
| <pre><code> | code | lang from class (language-js); meta is left empty by the default handler |
| <table> | table (GFM) | With align from align attribute |
| <tr> | tableRow | |
| <td>/<th> | tableCell | |
| <hr> | thematicBreak | |
| <dl>, <dt>, <dd> | Paragraphs | No markdown equivalent; downgraded |

Special Behaviors

  • <template>: Content is processed from the content field
  • <noscript>: Children processed as if scripting is disabled
  • <svg>, <math>: Ignored by default (no markdown equivalent)
  • <video>, <audio>, <iframe>: Converted to links to the source
  • <form> elements: Processed for their text content
  • Implicit paragraphs: The algorithm correctly handles HTML's implicit paragraph model (e.g., text + heading inside a container gets proper paragraph wrapping)

data-mdast Attribute

Elements with data-mdast="ignore" are excluded from output:

<p><strong>Important</strong> and <em data-mdast="ignore">ignored</em>.</p>

**Important** and .

Algorithm

The algorithm is described as "very powerful" and handles all HTML elements including ancient and obscure ones. It is particularly good at:

  1. Implicit/explicit paragraph handling: Correctly wraps loose text in paragraphs when adjacent to block elements
  2. Whitespace collapsing: Collapses inter-element whitespace to single spaces (configurable with newlines)
  3. Content model enforcement: Ensures phrasing content doesn't end up in flow contexts (auto-wraps in paragraphs)
  4. GFM output: Tables produce GFM table nodes; <del>/<s>/<strike> produce delete nodes
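
The whitespace rule in item 2 can be approximated in a few lines. This is a simplified illustration of the idea, not the library's code: collapseWhitespace is a hypothetical helper, and the assumption that a run containing a line ending collapses to a single "\n" when newlines are kept is this sketch's interpretation of the option.

```typescript
// Hypothetical helper approximating HTML-style whitespace collapsing.
// With keepNewlines, a run that contains a line ending collapses to '\n'
// instead of a space (an assumption of this sketch).
function collapseWhitespace(value: string, keepNewlines = false): string {
  return value.replace(/[ \t\r\n\f]+/g, (run) =>
    keepNewlines && /[\r\n]/.test(run) ? '\n' : ' '
  )
}

console.log(JSON.stringify(collapseWhitespace('a  \n\t b')))       // "a b"
console.log(JSON.stringify(collapseWhitespace('a  \n\t b', true))) // "a\nb"
```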

Custom Handler Example: Preserving SVG as Raw HTML

import { toHtml } from 'hast-util-to-html'
import { toMdast } from 'hast-util-to-mdast'

const mdast = toMdast(hast, {
  handlers: {
    svg(state, node) {
      const result = { type: 'html', value: toHtml(node, { space: 'svg' }) }
      state.patch(node, result)
      return result
    }
  }
})

This pattern — converting an unhandled element to an mdast html node — is the standard escape hatch for elements that don't map cleanly to markdown.


6. mdast-util-to-markdown: Serialization to Markdown

Repository: https://github.com/syntax-tree/mdast-util-to-markdown
Current version: 2.1.2
License: MIT
Stars: 139

This library serializes an mdast tree back to a markdown string.

API

import { toMarkdown } from 'mdast-util-to-markdown'

const markdown = toMarkdown(mdastTree, options?)

Key Options

interface Options {
  // List formatting
  bullet?: '*' | '+' | '-'           // Unordered list marker (default: '*')
  bulletOther?: '*' | '+' | '-'     // Fallback list marker (default: '-')
  bulletOrdered?: '.' | ')'          // Ordered list marker (default: '.')
  listItemIndent?: 'mixed' | 'one' | 'tab'  // List item indentation (default: 'one')
  incrementListMarker?: boolean      // Increment ordered list numbers (default: true)

  // Heading formatting
  closeAtx?: boolean                 // Close ATX headings with trailing #s (default: false)
  setext?: boolean                   // Use setext headings when possible (default: false)

  // Emphasis/strong markers
  emphasis?: '*' | '_'               // Emphasis marker (default: '*')
  strong?: '*' | '_'                 // Strong marker (default: '*')

  // Code blocks
  fence?: '`' | '~'                  // Fenced code marker (default: '`')
  fences?: boolean                   // Always use fenced code (default: true)

  // Links
  resourceLink?: boolean             // Always use resource links (default: false)
  quote?: '"' | "'"                  // Title quote character (default: '"')

  // Thematic breaks
  rule?: '*' | '-' | '_'            // Thematic break marker (default: '*')
  ruleRepetition?: number            // Number of markers (default: 3, min: 3)
  ruleSpaces?: boolean              // Spaces between markers (default: false)

  // Definitions
  tightDefinitions?: boolean         // No blank lines between definitions (default: false)

  // Extensibility
  handlers?: Handlers               // Custom node type handlers
  join?: Array<Join>                  // Custom block-joining behavior
  unsafe?: Array<Unsafe>             // Characters that need escaping in contexts
  extensions?: Array<Options>        // Extension options (e.g., GFM)
}

GFM Support

GFM output is achieved by using the mdast-util-gfm extension:

import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'

const markdown = toMarkdown(tree, {
  extensions: [gfmToMarkdown()]
})

This adds support for:

  • Tables (| col1 | col2 |)
  • Strikethrough (~~text~~)
  • Task lists (- [x] item)
  • Autolink literals
  • Footnotes

Safety / Escaping

The library carefully escapes characters that would be interpreted as markdown syntax:

// A leading "-" would start a list item, so it is escaped.
// "!" is left alone here: it is only unsafe directly before "[".
Input mdast:  { type: 'text', value: '- a\nb !' }
Output:        \- a\nb !

This is handled via the Unsafe type system, which specifies which characters are dangerous in which constructs.

Custom Handlers

type Handle = (node, parent, state, info) => string

type Handlers = Record<Node['type'], Handle>

Custom handlers can be provided for any node type, including custom/extension node types.
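
For instance, a serializer for a custom node type could look like the sketch below. The callout node type and its kind/text fields are hypothetical, and the function uses only the node argument, whereas the real Handle signature also receives parent, state, and info. A production handler would also need to escape the text it emits; this sketch does not.

```typescript
// Hypothetical custom node type; not part of mdast or any extension.
interface CalloutNode { type: 'callout'; kind: string; text: string }

// Serialize a callout as a blockquote with a bold label on the first line.
// Note: no escaping is applied here, so `text` is assumed to be safe markdown.
function callout(node: CalloutNode): string {
  const lines = node.text.split('\n')
  const first = `> **${node.kind}:** ${lines[0]}`
  const rest = lines.slice(1).map((line) => `> ${line}`)
  return [first, ...rest].join('\n')
}

console.log(callout({ type: 'callout', kind: 'Note', text: 'Check the docs\nbefore upgrading' }))
// > **Note:** Check the docs
// > before upgrading
```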


7. remark/rehype Ecosystem

Architecture

unified (core processor)
  ├── remark (markdown) ─── mdast ─── remarkParse / remarkStringify
  ├── rehype (HTML) ────── hast ───── rehypeParse / rehypeStringify
  └── retext (NLP) ────── nlcst ───── retextEnglish / retextContent

Cross-Ecosystem Plugins

| Plugin | Direction | Description |
| --- | --- | --- |
| remark-rehype | mdast → hast | Markdown to HTML (the common direction) |
| rehype-remark | hast → mdast | HTML to Markdown (our direction!) |
| remark-retext | mdast → nlcst | Markdown to NLP |
| rehype-retext | hast → nlcst | HTML to NLP |

rehype-remark (v10.0.1)

Repository: https://github.com/rehypejs/rehype-remark
Stars: 99

This is the higher-level wrapper around hast-util-to-mdast. It operates as a unified plugin:

import { unified } from 'unified'
import rehypeParse from 'rehype-parse'
import rehypeRemark from 'rehype-remark'
import remarkStringify from 'remark-stringify'

const file = await unified()
  .use(rehypeParse)           // HTML → hast
  .use(rehypeRemark)          // hast → mdast  (uses hast-util-to-mdast internally)
  .use(remarkStringify)       // mdast → markdown (uses mdast-util-to-markdown internally)
  .process(htmlString)

console.log(String(file))

Recommendation: For our JSX→markdown pipeline, we should use the lower-level utilities (hast-util-to-mdast + mdast-util-to-markdown) directly rather than the unified processor chain. This avoids the overhead of the unified pipeline and gives us more control.

However, rehype-remark plus remark plugins could be useful if we want to add post-processing transformations (e.g., remark-gfm for explicit GFM support).

Available remark Plugins (for post-processing)

  • remark-gfm — GFM syntax support
  • remark-frontmatter — YAML/TOML frontmatter
  • remark-mdx — MDX syntax support
  • remark-lint — Markdown linting
  • remark-toc — Table of contents generation
  • remark-comment-config — Configure remark from HTML comments
  • 150+ total plugins


8. unist-util-visit and Related Utilities

Repository: https://github.com/syntax-tree/unist-util-visit
Current version: 5.1.0
License: MIT

Core Traversal: unist-util-visit

import { visit, CONTINUE, EXIT, SKIP } from 'unist-util-visit'

visit(tree, 'heading', (node, index, parent) => {
  // A visitor may return CONTINUE (the default), EXIT, SKIP, a new index,
  // or an [action, index] tuple such as [SKIP, index + 1]
  if (node.depth === 1) return EXIT  // Stop the whole traversal
  if (node.depth === 2) return SKIP  // Don't visit this node's children
  return CONTINUE                    // Keep walking as usual
})
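
To show what the skip and exit semantics mean in practice, here is a dependency-free sketch of a depth-first walk. It is a simplified stand-in, not the library's implementation: walk, WalkNode, and the string actions are names invented for this example, and index-based continuation is left out.

```typescript
interface WalkNode { type: string; children?: WalkNode[] }
type WalkAction = 'continue' | 'skip' | 'exit'
type WalkVisitor = (node: WalkNode, parent: WalkNode | undefined) => WalkAction

// Depth-first, pre-order walk. Returns false once a visitor asked to exit.
function walk(node: WalkNode, visitor: WalkVisitor, parent?: WalkNode): boolean {
  const action = visitor(node, parent)
  if (action === 'exit') return false
  if (action !== 'skip') {
    for (const child of node.children ?? []) {
      if (!walk(child, visitor, node)) return false
    }
  }
  return true
}

const seen: string[] = []
walk(
  {
    type: 'root',
    children: [
      { type: 'blockquote', children: [{ type: 'text' }] },
      { type: 'heading', children: [{ type: 'text' }] },
      { type: 'paragraph' }
    ]
  },
  (node) => {
    seen.push(node.type)
    if (node.type === 'blockquote') return 'skip' // children are not visited
    if (node.type === 'heading') return 'exit'    // nothing after this is visited
    return 'continue'
  }
)
console.log(seen) // [ 'root', 'blockquote', 'heading' ]
```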

Complete Utility Landscape

| Utility | Purpose | Stars |
| --- | --- | --- |
| unist-util-visit | Walk tree depth-first | 346 |
| unist-util-visit-parents | Walk with parent stack | |
| unist-util-is | Check if node matches test | |
| unist-util-filter | Create filtered tree | |
| unist-util-map | Create mapped tree | |
| unist-util-remove | Remove nodes from tree | |
| unist-util-select | CSS-like selectors on trees | |
| unist-util-find-after | Find node after another | |
| unist-util-find-and-replace | Find/replace text in tree | |
| unist-builder | Create trees programmatically | |

hast-specific Utilities

| Utility | Purpose |
| --- | --- |
| hast-util-is-element | Check if node is (a specific) element |
| hast-util-select | querySelector/querySelectorAll on hast |
| hast-util-find-and-replace | Text find/replace in hast |
| hast-util-classnames | Merge class names |
| hast-util-to-string | Get textContent |
| hast-util-to-text | Get innerText |
| hast-util-phrasing | Check if node is phrasing content |
| hast-util-heading | Check if node is heading content |
| hast-util-embedded | Check if node is embedded content |
| hast-util-sanitize | Sanitize tree (XSS prevention) |

mdast-specific Utilities

| Utility | Purpose |
| --- | --- |
| mdast-util-to-string | Get plain text content |
| mdast-util-definitions | Find definition nodes |
| mdast-util-heading-range | Use headings as ranges |
| mdast-util-toc | Generate TOC |
| mdast-util-phrasing | Check if node is phrasing content |
| mdast-util-gfm | GFM parse/serialize |
| mdast-util-gfm-table | GFM tables specifically |
| mdast-util-directive | Generic directives |

9. TypeScript Type Definitions

Availability

| Package | Types | Source |
| --- | --- | --- |
| @types/unist | Built-in | DefinitelyTyped |
| @types/hast | Built-in | DefinitelyTyped |
| @types/mdast | Built-in | DefinitelyTyped |
| hast-util-to-mdast | Shipped with package | Included TypeScript types + index.d.ts |
| mdast-util-to-markdown | Shipped with package | Included TypeScript types + index.d.ts |
| unist-util-visit | Shipped with package | Included TypeScript types + index.d.ts |
| hastscript | Shipped with package | Included TypeScript types |
| hast-util-to-jsx-runtime | Shipped with package | Included TypeScript types |
| rehype-remark | Shipped with package | Included TypeScript types |
| unified | Shipped with package | Included TypeScript types |

Type Quality

Excellent. The packages in the syntax-tree and unified ecosystems are, for the most part, written in JavaScript with JSDoc type annotations and ship .d.ts declaration files, so they are fully typed for TypeScript consumers. The type definitions are comprehensive and well-maintained. Key observations:

  1. Hast types are well-defined: Element, Text, Comment, Root, Properties, PropertyValue are all properly typed
  2. Mdast types are well-defined: All node types with their specific fields, plus content model types like FlowContent, PhrasingContent
  3. Handler types are exported: Handle, NodeHandle, Options, State from hast-util-to-mdast
  4. Serializer types are exported: Handle, Handlers, Options, Unsafe, Join from mdast-util-to-markdown
  5. Generic node types: The TypeScript types support discriminated unions on node.type

Type Usage Example

import type { Element, Root, Text } from 'hast'
import type { Root as MdastRoot, Heading, Code } from 'mdast'
import type { Handle, Options as ToMdastOptions } from 'hast-util-to-mdast'
import type { Options as ToMarkdownOptions } from 'mdast-util-to-markdown'

10. Pipeline Feasibility Assessment

The Full Pipeline

┌─────────────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  JSX Components  │ ──▶ │   hast    │ ──▶ │   mdast  │ ──▶ │ markdown │
│  (React tree)    │     │ (HTML AST)│     │ (MD AST) │     │ (string) │
└─────────────────┘     └──────────┘     └──────────┘     └──────────┘
         │                    │                │                │
    STEP 1:          STEP 2:          STEP 3:          STEP 4:
    JSX → hast       hast → mdast     mdast → md       md string
                      (the solved      (well-solved)
                       problem)

Step-by-Step Assessment

Step 1: JSX → hast (THE HARDEST STEP)

Challenge: React components are not HTML. They have:

  • Custom component elements (<Card>, <UserAvatar>) that don't map to HTML tags
  • Props that aren't HTML attributes
  • Rendering logic (conditionals, loops, state)
  • Event handlers that are meaningless in markdown

Approaches:

Approach A: Render to HTML string, then parse to hast

import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'

const html = renderToStaticMarkup(<MyComponent />)
const hast = fromHtml(html, { fragment: true })

  • Pros: Complete rendering support, handles all React features
  • Cons: Loss of custom element information, all components flatten to HTML

Approach B: Use React's internal fiber tree to build hast directly

  • Pros: More control, can preserve custom element boundaries
  • Cons: Relies on React internals, fragile

Approach C: Use hastscript JSX runtime

/** @jsxImportSource hastscript */
const tree = <div class="card"><h2>Title</h2><p>Body</p></div>

  • Pros: Direct hast construction, type-safe
  • Cons: Can't use React components (no rendering), only raw elements

Recommendation: Approach A for the base pipeline, with custom component registry for handling non-HTML elements (see below).

Step 2: hast → mdast (SOLVED by hast-util-to-mdast)

  • This step is fully solved by hast-util-to-mdast v10.1.2
  • Handles all standard HTML elements
  • Custom handlers for non-standard elements
  • Proper paragraph wrapping, whitespace handling, content model enforcement
  • GFM support (tables, strikethrough, task lists)

Edge cases to watch:

  • Nested inline elements: <strong><em>bold italic</em></strong> → ***bold italic***
  • Mixed content: <div>Text <h2>Heading</h2> More text</div> → Proper paragraph wrapping
  • Deeply nested lists
  • Tables with merged cells (not supported in GFM)
  • <br> inside <p>: Produces break nodes in phrasing content

Step 3: mdast → markdown (SOLVED by mdast-util-to-markdown)

  • Fully solved by mdast-util-to-markdown v2.1.2
  • Proper character escaping
  • GFM output with mdast-util-gfm extension
  • Configurable markers, indentation, heading styles

Edge Cases and Failure Modes

Custom Components (The #1 Challenge)

A React component like <DataField label="Name" value="Alice" /> renders to HTML like <div class="data-field"><span class="label">Name</span><span class="value">Alice</span></div>. With the default handlers, the markdown output would be just the concatenated text ("NameAlice", since there is no whitespace between the two spans), with the label/value structure lost.

Solution: Register custom handlers that use the component's semantic intent:

// Option 1: Use CSS class-based detection
// Handlers are keyed by tag name, so match on `div` and check the class inside.
// `findTextByClass` is a hypothetical helper that returns the text of the
// first descendant carrying the given class.
handlers: {
  div: (state, node, parent) => {
    const classes = node.properties.className
    if (Array.isArray(classes) && classes.includes('data-field')) {
      const label = findTextByClass(node, 'label')
      const value = findTextByClass(node, 'value')
      return { type: 'paragraph', children: [
        { type: 'strong', children: [{ type: 'text', value: label }] },
        { type: 'text', value: `: ${value}` }
      ]}
    }
    // Everything else keeps the default <div> behavior
    return defaultHandlers.div(state, node, parent)
  }
}

// Option 2: Use data attributes for semantic hints
// Render with: <div data-md-type="field" data-md-label="Name">Alice</div>
// (`defaultHandlers` is exported by hast-util-to-mdast)
handlers: {
  div: (state, node, parent) => {
    const mdType = node.properties.dataMdType
    if (mdType === 'field') {
      // Custom conversion
    }
    return defaultHandlers.div(state, node, parent)
  }
}

// Option 3: Pre-transform the hast tree before passing to toMdast
// Walk the hast tree and replace custom structures with standard HTML
// elements that the default handlers already map cleanly
visit(hast, 'element', (node) => {
  const classes = node.properties.className
  if (Array.isArray(classes) && classes.includes('callout')) {
    // Transform the element's structure to something that maps cleanly
  }
})
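
Option 1 relies on a findTextByClass helper that the snippet does not define. One possible shape for it is sketched below, using minimal local types (FieldElement, FieldText) rather than the real hast types. In practice, hast-util-select and hast-util-to-string from the utility tables above cover the same ground; the hand-rolled version is only meant to show what the helper has to do.

```typescript
// Minimal local stand-ins for hast Text and Element.
interface FieldText { type: 'text'; value: string }
interface FieldElement {
  type: 'element'
  tagName: string
  properties: { className?: string[] }
  children: Array<FieldElement | FieldText>
}

// Concatenate all descendant text, similar in spirit to textContent.
function textOf(node: FieldElement | FieldText): string {
  return node.type === 'text' ? node.value : node.children.map(textOf).join('')
}

// Depth-first search for the first descendant element carrying `className`.
function findTextByClass(node: FieldElement, className: string): string | undefined {
  for (const child of node.children) {
    if (child.type !== 'element') continue
    if (child.properties.className?.includes(className)) return textOf(child)
    const nested = findTextByClass(child, className)
    if (nested !== undefined) return nested
  }
  return undefined
}

// The <DataField> markup from the example above, as a hast-like tree.
const field: FieldElement = {
  type: 'element',
  tagName: 'div',
  properties: { className: ['data-field'] },
  children: [
    { type: 'element', tagName: 'span', properties: { className: ['label'] }, children: [{ type: 'text', value: 'Name' }] },
    { type: 'element', tagName: 'span', properties: { className: ['value'] }, children: [{ type: 'text', value: 'Alice' }] }
  ]
}

console.log(findTextByClass(field, 'label')) // Name
console.log(findTextByClass(field, 'value')) // Alice
```

Because the helper can return undefined, a handler built on it should decide what to emit when a label or value is missing rather than interpolating the value blindly.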

SVG/Math Content

  • Default behavior: Ignored (content lost)
  • Workaround 1: Convert to mdast html node (preserves as raw HTML in markdown)
  • Workaround 2: Render to an image and use mdast image node
  • For LLM consumption: Image alt text is likely more useful than raw SVG markup

Forms and Interactive Elements

  • <input>, <select>, <textarea>: Processed for their text content
  • Checkboxes become [x]/[ ] in GFM task lists
  • Other form elements: Downgraded to text content

CSS-Dependent Layout

  • Tables rendered via CSS grid/flexbox (not <table> elements) won't produce GFM tables
  • Tab components rendered as <div> stacks won't produce meaningful markdown
  • Workaround: Use semantic HTML (<table>, <details>, etc.) in the component's render output when markdown output is needed

Content That Has No Markdown Equivalent

| HTML Construct | Default Behavior | Better Alternative |
| --- | --- | --- |
| <details>/<summary> | Text content only | Use mdast html node or custom directive |
| <dialog> | Ignored | Pre-process to extract content |
| <meter>, <progress> | Text content | Convert to descriptive text |
| <ruby>, <rt> | Text content | Custom handler for pronunciation annotation |
| <iframe> | Link to src | Custom handler for embed description |
| <video>, <audio> | Link to src | Custom handler for media description |

Whitespace and Formatting

  • Multiple spaces in HTML collapse to single spaces by default
  • <pre> whitespace is preserved in code nodes
  • newlines: true option preserves line breaks during whitespace collapsing
  • Indentation in <pre> blocks may need careful handling

Roundtrip Fidelity

Not all markdown constructs survive HTML roundtripping:

  • Reference-style links ([text][id]) become direct links [text](url)
  • Setext headings become ATX headings (configurable)
  • Tight vs. loose lists may change
  • Multiple markdown syntaxes collapse to one (e.g., both * and _ for emphasis become *)

For LLM consumption, this is acceptable — the goal is readable markdown, not perfect roundtripping.


11. Alternative Approaches

Existing Libraries: React/JSX → Markdown

There is no mature, well-maintained library that directly converts React component trees to markdown. The approaches that exist are simpler or serve different purposes:

1. react-markdown + remark (the reverse direction)

React Markdown renders markdown as React components. This is the opposite of what we need. Not relevant.

2. html-to-markdown (turndown)

Repository: https://github.com/mixmark-io/turndown
A widely-used HTML-to-markdown converter (not based on unist). It works on HTML strings, not ASTs.

  • Pros: Simple API, well-tested, many plugins, configurable rules
  • Cons: Not AST-based, no TypeScript AST types, no GFM table support out-of-box, less extensible than unist ecosystem
  • Comparison: hast-util-to-mdast is more principled (proper AST, content model enforcement, better extensibility)

3. react-to-markdown (hypothetical)

No prominent library exists under this name or concept on npm. Search terms "react to markdown", "jsx to mdast", "component to markdown" yield no direct results.

4. mdast-util-from-adf

Converts Atlassian Document Format to mdast. Demonstrates the pattern of converting external formats to mdast, but for a different source format.

5. hast-util-to-portable-text

Converts hast to Sanity's Portable Text format. Another example of hast → alternative format, but not markdown.

6. Custom JSX → mdast builders

One could build React components that directly produce mdast nodes using mdast-builder or manual construction:

function MarkdownHeading({ depth, children }) {
  return { type: 'heading', depth, children }
}

  • Pros: Direct control, no intermediate HTML
  • Cons: Can't reuse existing React component ecosystem, dual rendering needed

7. React Server Components + hast

If using RSC, the output could be intercepted and converted to hast before HTML serialization. This is speculative and would require custom infrastructure.

Comparison Table

| Approach | Maturity | Type Safety | Extensibility | Custom Components | GFM Support |
| --- | --- | --- | --- | --- | --- |
| hast-util-to-mdast | High (v10) | Excellent | Excellent (handlers) | Needs custom layer | Built-in |
| turndown | High | Poor | Good (rules) | Needs custom rules | Plugin |
| Custom JSX→mdast | Low | Manual | Full control | Full control | Manual |
| unified/rehype-remark | High | Excellent | Excellent (plugins) | Needs custom layer | Via remark-gfm |

Recommended: Use hast-util-to-mdast + mdast-util-to-markdown with custom JSX→hast adapter.


12. Recommended Architecture

Overview

┌────────────────────────────────────────────────────────────────────┐
│                        JSX → Markdown Pipeline                       │
│                                                                      │
│  ┌──────────────┐   ┌───────────────┐   ┌─────────┐   ┌───────────┐ │
│  │ JSX Renderer │──▶│ hast Builder  │──▶│ hast →  │──▶│ mdast →  │ │
│  │ (React SSR)  │   │ (fromHtml +   │   │ mdast   │   │ markdown │ │
│  │              │   │  pre-process) │   │         │   │ (GFM)    │ │
│  └──────────────┘   └───────────────┘   └─────────┘   └───────────┘ │
│         │                   │                │              │       │
│  ┌──────────────┐   ┌───────────────┐        │              │       │
│  │ Component    │   │ Custom        │        │              │       │
│  │ Registry     │   │ Handlers      │        │              │       │
│  │ (markdown    │   │ (data-md-     │        │              │       │
│  │  hints)      │   │  attributes)  │        │              │       │
│  └──────────────┘   └───────────────┘        │              │       │
│                                              │              │       │
│                                     ┌────────┘              │       │
│                                     │                       │       │
│                              ┌──────────────┐   ┌──────────┘       │
│                              │ Post-process  │   │ Output            │
│                              │ (unist-util)  │   │ (markdown string) │
│                              └──────────────┘                      │
└────────────────────────────────────────────────────────────────────┘

Implementation Design

Phase 1: Core Pipeline

// renderToMarkdown.ts

import { renderToStaticMarkup } from 'react-dom/server'
import { fromHtml } from 'hast-util-from-html'
import { toMdast } from 'hast-util-to-mdast'
import { toMarkdown } from 'mdast-util-to-markdown'
import { gfmToMarkdown } from 'mdast-util-gfm'
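import type { Element, Root as HastRoot } from 'hast'
import type { Root as MdastRoot } from 'mdast'
import type { Handle } from 'hast-util-to-mdast'
import type { Options as MarkdownOptions } from 'mdast-util-to-markdown'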

interface RenderOptions {
  /** Custom hast → mdast handlers */
  handlers?: Record<string, Handle>
  /** Markdown serialization options */
  markdownOptions?: MarkdownOptions
  /** Whether to produce GFM output (default: true) */
  gfm?: boolean
  /** Pre-processing hook for the hast tree */
  preProcess?: (hast: HastRoot) => HastRoot
  /** Post-processing hook for the mdast tree */
  postProcess?: (mdast: MdastRoot) => MdastRoot
}

function renderToMarkdown(element: React.ReactElement, options?: RenderOptions): string {
  // Step 1: Render JSX to HTML
  const html = renderToStaticMarkup(element)
  
  // Step 2: Parse HTML to hast
  let hast = fromHtml(html, { fragment: true })
  
  // Step 3: Pre-process hast (optional: semantic annotation, cleanup)
  if (options?.preProcess) {
    hast = options.preProcess(hast)
  }
  
  // Step 4: Convert hast to mdast
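  // Declared with let because the post-process hook may replace the tree.
  // toMdast is typed as returning any mdast node; a hast root yields an mdast root.
  let mdast = toMdast(hast, {
    handlers: {
      ...defaultCustomHandlers,
      ...options?.handlers,
    },
    document: false,
  }) as MdastRoot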
  
  // Step 5: Post-process mdast (optional: custom transforms)
  if (options?.postProcess) {
    mdast = options.postProcess(mdast)
  }
  
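  // Step 6: Serialize mdast to markdown
  // Keep caller-supplied extensions instead of overwriting them
  const extensions = [
    ...(options?.gfm !== false ? [gfmToMarkdown()] : []),
    ...(options?.markdownOptions?.extensions ?? []),
  ]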
  const markdown = toMarkdown(mdast, {
    ...options?.markdownOptions,
    extensions,
  })
  
  return markdown
}

Phase 2: Component Registry / Semantic Annotations

The key innovation is a data attribute convention that components use to hint markdown semantics:

// Component renders with markdown hints
function InfoCallout({ title, children }) {
  return (
    <div data-md="callout" data-md-callout-type="info">
      <strong data-md="callout-title">{title}</strong>
      <div data-md="callout-content">{children}</div>
    </div>
  )
}

Handler:

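// toString is imported from 'hast-util-to-string'; findChildByDataMd is a
// project helper that returns the first child element with a matching data-md
const calloutHandler: Handle = (state, node) => {
  // Property values are not guaranteed to be strings, so normalize first
  const calloutType = String(node.properties.dataMdCalloutType ?? 'note')
  const title = findChildByDataMd(node, 'callout-title')
  const content = findChildByDataMd(node, 'callout-content')

  return {
    type: 'blockquote',
    children: [
      {
        type: 'paragraph',
        children: [
          { type: 'strong', children: [{ type: 'text', value: calloutType.toUpperCase() }] },
          { type: 'text', value: title ? ': ' + toString(title) : '' },
        ]
      },
      // Content may contain block nodes, which cannot nest inside a paragraph
      ...(content ? state.toFlow(state.all(content)) : []),
    ]
  }
}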
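
The handler relies on a `findChildByDataMd` helper that is not part of any library. One possible shape for it is sketched below, using simplified local stand-ins (`HText`, `HElement`) for the hast types so the example runs on its own; in the real pipeline the parameter would be hast's `Element`.

```typescript
// Minimal local stand-ins for hast's Text and Element (assumption; the real
// types live in @types/hast).
type HText = { type: 'text'; value: string }
type HElement = {
  type: 'element'
  tagName: string
  properties: Record<string, unknown>
  children: Array<HElement | HText>
}

// Return the first direct child element whose data-md attribute equals `name`.
// In hast, the data-md attribute is exposed as the camelCased property dataMd.
function findChildByDataMd(parent: HElement, name: string): HElement | undefined {
  for (const child of parent.children) {
    if (child.type === 'element' && child.properties.dataMd === name) {
      return child
    }
  }
  return undefined
}

// A hand-built tree shaped like the parsed output of the InfoCallout component.
const sampleCallout: HElement = {
  type: 'element',
  tagName: 'div',
  properties: { dataMd: 'callout', dataMdCalloutType: 'info' },
  children: [
    { type: 'text', value: '\n' },
    {
      type: 'element',
      tagName: 'strong',
      properties: { dataMd: 'callout-title' },
      children: [{ type: 'text', value: 'Heads up' }],
    },
    { type: 'element', tagName: 'div', properties: { dataMd: 'callout-content' }, children: [] },
  ],
}

console.log(findChildByDataMd(sampleCallout, 'callout-title')?.tagName)
```

The search is deliberately limited to direct children so that a nested callout cannot be mistaken for the outer callout's title or content.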

Registry:

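// defaultHandlers is exported by 'hast-util-to-mdast'
const defaultCustomHandlers: Record<string, Handle> = {
  // Dispatch on the data-md attribute of each <div>
  div: (state, node, parent) => {
    const mdType = node.properties.dataMd
    switch (mdType) {
      case 'callout': return calloutHandler(state, node, parent)
      case 'admonition': return admonitionHandler(state, node, parent)
      // ... extensible
      // Returning undefined would drop the node entirely, so unannotated
      // divs are delegated to the library's default handler
      default: return defaultHandlers.div(state, node, parent)
    }
  }
}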
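
As the number of annotated components grows, the `switch` can be swapped for a lookup table so that new hints are registered without editing the dispatcher. The sketch below shows that pattern in isolation; `MdNode`, `AnnotatedElement`, `registerHint`, and `resolveHint` are hypothetical names, and the types are simplified stand-ins rather than the hast-util-to-mdast `Handle` signature.

```typescript
// Simplified stand-ins (assumption): a generic mdast-like node and the subset
// of a hast element that the dispatcher needs.
type MdNode = { type: string; [key: string]: unknown }
type AnnotatedElement = { tagName: string; properties: Record<string, unknown> }
type HintHandler = (node: AnnotatedElement) => MdNode

// Hypothetical registry keyed by the value of the data-md attribute.
const hintHandlers = new Map<string, HintHandler>()

function registerHint(name: string, handler: HintHandler): void {
  hintHandlers.set(name, handler)
}

// Look up the handler for an element. A result of undefined means the element
// carries no known hint and the caller should use its fallback conversion.
function resolveHint(node: AnnotatedElement): HintHandler | undefined {
  const hint = node.properties.dataMd
  return typeof hint === 'string' ? hintHandlers.get(hint) : undefined
}

registerHint('callout', () => ({ type: 'blockquote', children: [] }))

const annotated: AnnotatedElement = { tagName: 'div', properties: { dataMd: 'callout' } }
const plain: AnnotatedElement = { tagName: 'div', properties: {} }
console.log(resolveHint(annotated) !== undefined, resolveHint(plain) !== undefined)
```

Keeping "no hint" as an explicit `undefined` from the lookup, rather than from the handler itself, leaves the fallback decision with the caller.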

Phase 3: Advanced - Dual-Mode Components

For components that need to render differently in "markdown mode" vs "visual mode":

interface MarkdownAwareProps {
  /**
   * When defined, used for markdown output instead of the visual render.
   * Must be a string: it travels in a data attribute, which cannot hold React nodes.
   */
  markdownContent?: string
}

function DataTable({ data, markdownContent }: MarkdownAwareProps & DataTableProps) {
  // Visual mode: rich table with sorting, pagination
  // Markdown mode: simple GFM table
  return (
    <div className="data-table" data-md="table" data-md-raw={markdownContent}>
      {/* ... rich visual table ... */}
    </div>
  )
}
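
The `markdownContent` string for a table component has to come from somewhere. A minimal sketch of one way to derive it from row data follows; `toGfmTable` is a hypothetical helper, and it assumes every row uses the same keys as the first row.

```typescript
// Hypothetical helper: serialize uniform row objects to a GFM table string.
function toGfmTable(rows: Array<Record<string, string | number>>): string {
  const first = rows[0]
  if (first === undefined) return ''
  const columns = Object.keys(first)
  // A literal pipe would end the cell early, so escape it; a cell cannot span
  // lines, so newlines are flattened to spaces.
  const cell = (value: string | number): string =>
    String(value).replace(/\|/g, '\\|').replace(/\r?\n/g, ' ')
  const line = (cells: string[]): string => `| ${cells.join(' | ')} |`
  return [
    line(columns.map(cell)),
    line(columns.map(() => '---')),
    ...rows.map((row) => line(columns.map((key) => cell(row[key] ?? '')))),
  ].join('\n')
}

const markdownContent = toGfmTable([
  { name: 'alpha', count: 1 },
  { name: 'a|b', count: 2 },
])
console.log(markdownContent)
```

Generating the string from the same `data` prop that feeds the visual table keeps the two renderings from drifting apart.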

Phase 4: Pre-Render Hooks

For components that need full control over their markdown output:

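import { visit } from 'unist-util-visit'
import type { Nodes as MdastNode } from 'mdast'

interface MarkdownRenderer {
  /** If returned, used directly as markdown instead of converting the hast */
  toMarkdown?(): string | MdastNode
}

// In the pre-process step:
function preProcess(hast: HastRoot): HastRoot {
  visit(hast, 'element', (node) => {
    // Check if this element's component provided a markdown hint
    const markdownDirect = node.properties.dataMdRaw
    if (typeof markdownDirect === 'string' && markdownDirect !== '') {
      // Replace the element with a node that carries the raw markdown.
      // A plain text node is not enough on its own: its markdown syntax
      // would be escaped during serialization, so the raw string has to be
      // parsed separately or emitted through a dedicated handler.
    }
  })
  return hast
}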
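
One way the stubbed replacement step could work is sketched below, under stated assumptions: each element carrying `data-md-raw` is collapsed into a childless placeholder element so that its visual subtree is never converted. The `md-raw` tag name and the `collapseRawHints` function are hypothetical, the node types are simplified stand-ins for `@types/hast`, and a plain recursive walk is used in place of `unist-util-visit` to keep the example self-contained.

```typescript
// Minimal local stand-ins for hast nodes (assumption; see @types/hast).
type RawText = { type: 'text'; value: string }
type RawElement = {
  type: 'element'
  tagName: string
  properties: Record<string, unknown>
  children: Array<RawElement | RawText>
}
type RawRoot = { type: 'root'; children: Array<RawElement | RawText> }

// Replace every element that carries data-md-raw with a childless <md-raw>
// placeholder (hypothetical tag name). The tree is modified in place.
function collapseRawHints(parent: RawRoot | RawElement): void {
  parent.children = parent.children.map((child): RawElement | RawText => {
    if (child.type !== 'element') return child
    const raw = child.properties.dataMdRaw
    if (typeof raw === 'string' && raw !== '') {
      return { type: 'element', tagName: 'md-raw', properties: { dataMdRaw: raw }, children: [] }
    }
    // No hint here: keep the element and continue into its children.
    collapseRawHints(child)
    return child
  })
}

const page: RawRoot = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'div',
      properties: { dataMd: 'table', dataMdRaw: '| a |\n| --- |\n| 1 |' },
      children: [{ type: 'element', tagName: 'table', properties: {}, children: [] }],
    },
    { type: 'element', tagName: 'p', properties: {}, children: [{ type: 'text', value: 'kept' }] },
  ],
}
collapseRawHints(page)
console.log(JSON.stringify(page, null, 2))
```

A handler registered for the placeholder tag could then decide how to emit the stored string, for example by parsing it into mdast nodes before serialization.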

Package Dependencies

{
  "dependencies": {
    "hast-util-from-html": "^2.0.0",
    "hast-util-to-mdast": "^10.0.0",
    "mdast-util-to-markdown": "^2.0.0",
    "mdast-util-gfm": "^3.0.0",
    "unist-util-visit": "^5.0.0",
    "hast-util-select": "^6.0.0",
    "hast-util-to-string": "^3.0.0"
  },
  "devDependencies": {
    "@types/hast": "^3.0.0",
    "@types/mdast": "^4.0.0",
    "@types/unist": "^3.0.0",
    "react": "^18.0.0",
    "react-dom": "^18.0.0"
  }
}

Estimated total bundle size: ~50-80KB (tree-shaken, minified)


13. Appendix: Element-to-Markdown Mapping Table

Complete Mapping for Standard HTML Elements

| HTML Element | mdast Output | Markdown | Notes |
| --- | --- | --- | --- |
| `<h1>` | heading (depth: 1) | `#` | |
| `<h2>` | heading (depth: 2) | `##` | |
| `<h3>` | heading (depth: 3) | `###` | |
| `<h4>` | heading (depth: 4) | `####` | |
| `<h5>` | heading (depth: 5) | `#####` | |
| `<h6>` | heading (depth: 6) | `######` | |
| `<p>` | paragraph | (blank line delimited) | |
| `<strong>`, `<b>` | strong | `**text**` | |
| `<em>`, `<i>` | emphasis | `*text*` | |
| `<del>`, `<s>`, `<strike>` | delete (GFM) | `~~text~~` | |
| `<code>` | inlineCode | `` `code` `` | |
| `<pre><code>` | code | fenced code block | lang from `class="language-xxx"` |
| `<a href>` | link | `[text](url "title")` | |
| `<img>` | image | `![alt](src "title")` | |
| `<ul>` | list (ordered: false) | `- item` | Requires `bullet: '-'`; the serializer's default bullet is `*` |
| `<ol>` | list (ordered: true) | `1. item` | start attribute respected |
| `<li>` | listItem | (list item) | GFM: checked for task lists |
| `<blockquote>` | blockquote | `> text` | |
| `<hr>` | thematicBreak | `---` | Requires `rule: '-'`; the serializer's default is `***` |
| `<table>` | table (GFM) | `\| col \| col \|` | |
| `<tr>` | tableRow | | |
| `<th>`, `<td>` | tableCell | | align from align attr |
| `<br>` | break | (hard break) | Two spaces + newline, or `\` + newline |
| `<q>` | text with quotes | ("quoted") | Uses quotes option for nesting |
| `<input type="checkbox">` | text | `[x]` or `[ ]` | GFM task list item support |
| `<abbr>` | text | (just text) | Title lost |
| `<sup>` | text | (just text) | Superscript lost |
| `<sub>` | text | (just text) | Subscript lost |
| `<mark>` | text | (just text) | Highlight lost |
| `<small>` | text | (just text) | Small text lost |
| `<details>` | Children only | (content extracted) | No markdown equivalent |
| `<summary>` | text | (just text) | Collapsed with details |
| `<dl>`, `<dt>`, `<dd>` | paragraphs | (text only) | No definition list in markdown |
| `<figure>` | Children only | (content extracted) | |
| `<figcaption>` | paragraph/text | (text) | |
| `<video>`, `<audio>` | link | `[src](url)` | Downgraded to link |
| `<iframe>` | link | `[src](url)` | Downgraded to link |
| `<svg>` | (ignored) | (nothing) | Use custom handler to preserve |
| `<math>` | (ignored) | (nothing) | Use custom handler for LaTeX output |
| `<script>` | (ignored) | (nothing) | |
| `<style>` | (ignored) | (nothing) | |
| `<div>`, `<section>`, `<article>` | Children extracted | (children only) | Container elements unwrapped |
| `<span>`, `<time>` | Children extracted | (children only) | Inline containers unwrapped |
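
Several rows above lose inline semantics (`<sup>`, `<sub>`, `<mark>`, `<small>`). If that information matters to the consumer, one option is a custom handler that emits an mdast `html` node, which is serialized verbatim. The sketch below shows only the node construction; `preserveAsHtml` and `escapeHtml` are hypothetical helpers, `HtmlNode` is a simplified stand-in for the mdast type, and wiring the result into a `handlers` entry is left out.

```typescript
// Simplified stand-in for mdast's Html node (assumption; see @types/mdast).
type HtmlNode = { type: 'html'; value: string }

// Escape the characters that would otherwise be read as markup.
const escapeHtml = (value: string): string =>
  value.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')

// Hypothetical helper: wrap an element's text content in its original tag so
// the inline semantics survive as literal HTML in the markdown output.
function preserveAsHtml(tagName: 'sup' | 'sub' | 'mark' | 'small', textContent: string): HtmlNode {
  return { type: 'html', value: `<${tagName}>${escapeHtml(textContent)}</${tagName}>` }
}

console.log(preserveAsHtml('sup', '2').value)
```

Whether inline HTML is preferable to plain text depends on the consumer; for LLM input it trades some noise for retained meaning.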

Summary of Key Findings

  1. The hast→mdast→markdown chain is production-ready: hast-util-to-mdast (v10.1.2) and mdast-util-to-markdown (v2.1.2) are mature, well-tested, well-typed libraries that cover the standard HTML elements and produce clean GFM markdown. Elements with no markdown equivalent are unwrapped or dropped, as the mapping table shows.

  2. The gap is JSX→hast: There is no off-the-shelf solution for converting React component trees directly to hast. The recommended approach is React server-side rendering (to HTML) followed by hast-util-from-html parsing.

  3. Custom components need a strategy: Standard HTML elements map well, but custom React components lose their semantics through HTML rendering. A data-md attribute convention + custom handler registry is the recommended solution.

  4. TypeScript support is excellent: All packages in the ecosystem ship with proper TypeScript types. The discriminated union on node.type makes pattern matching clean and type-safe.

  5. GFM is fully supported: Tables, strikethrough, task lists, and autolink literals are all handled once the mdast-util-gfm serialization extension is passed to mdast-util-to-markdown.

  6. The unist utility ecosystem is rich: unist-util-visit, hast-util-select, and dozens of other utilities provide all the building blocks needed for pre/post-processing.

  7. For LLM consumption specifically: The markdown output quality from this pipeline is already excellent for standard HTML. The main investment should go into the component annotation system and custom handlers for domain-specific components.


Research completed 2026-04-28. All version numbers and repository states reflect the latest available at time of research.