DOC-3470: Improve index.md quality with DOM preprocessing pipeline #4125

Open
kemister85 wants to merge 1 commit into main from hotfix/main/DOC-3470

@kemister85
Contributor

Summary

  • Adds a DOM preprocessing pipeline to generate-markdown.mjs that cleans Antora HTML before dom-to-semantic-markdown conversion, fixing broken tables, about:blank anchors, admonition formatting, live demo noise, and leaked <style>/<script> elements.
  • Refactors the script to use const arrow functions with a composable transform array, following tinymce/tinymce-premium coding conventions.
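As a rough illustration of the "composable transform array" idea, the sketch below chains const arrow functions with `reduce`. It is not the code from generate-markdown.mjs: the names `stripStyleBlocks`, `collapseBlankLines`, and `runTransforms` are hypothetical, and plain strings stand in for the parsed DOM so the example runs without any dependencies.

```javascript
// Hypothetical stand-in transforms. The real ones receive a parsed DOM;
// strings are used here only to keep the sketch dependency-free.
const stripStyleBlocks = (html) => html.replace(/<style[\s\S]*?<\/style>/gi, '');
const collapseBlankLines = (html) => html.replace(/\n{3,}/g, '\n\n');

// Order matters: each step receives the previous step's output.
const transforms = [stripStyleBlocks, collapseBlankLines];

// Fold the input through every transform in array order.
const runTransforms = (input, steps = transforms) =>
  steps.reduce((current, step) => step(current), input);

const cleaned = runTransforms(
  '<h1>Title</h1>\n\n<style>h1 { color: red }</style>\n\n<p>Body</p>'
);
console.log(JSON.stringify(cleaned));
```

With this shape, adding or reordering a step is a one-line change to the `transforms` array.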

Changes

| Transform | Before | After |
| --- | --- | --- |
| stripNonContent | `<style>`, `<script>`, signup promos leaked into markdown | Stripped before conversion |
| rewriteAdmonitions | Broken table rows: `\| \| text \|` | Clean blockquotes: `> **Note:** text` |
| rewriteLiveDemos | Tab list noise + duplicated raw HTML | Single JS code block with Example label |
| stripHeadingAnchors | `about:blank#section-name` in all headings | Clean `## Section Name` |
| rewriteCardTables | Malformed markdown tables from layout grids | Bulleted list: `**[Link](url)** — description` |
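To make two of the "After" shapes concrete, here is a minimal sketch of the target output formats. Both helpers are hypothetical and work on strings; the actual transforms in this PR operate on DOM nodes before conversion, so treat this as a picture of the intended result rather than the implementation.

```javascript
// Hypothetical helper: render an admonition as a single blockquote line.
// Whitespace is collapsed so multi-line source text stays on one line.
const formatAdmonition = (label, text) =>
  `> **${label}:** ${text.trim().replace(/\s+/g, ' ')}`;

// Hypothetical helper: turn "about:blank#id" into "#id" and leave any
// other href untouched.
const fixBlankAnchor = (href) => href.replace(/^about:blank(?=#)/, '');

const note = formatAdmonition('Note', '  Save your\n  work first. ');
const anchor = fixBlankAnchor('about:blank#section-name');
const external = fixBlankAnchor('https://example.com/#top');

console.log(note);
console.log(anchor);
console.log(external);
```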

Test plan

  • Ran against full build (1,442 pages converted, 16 skipped)
  • Verified byte-identical manifest output between old and new implementations
  • Verified 30 representative pages across content types (plugins, config options, migration, release notes, API docs, getting started)
  • Global sweep: zero about:blank, <style>, signup-promo, kapa-widget, or Edit on CodePen leaks across all generated .md files
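The global sweep could be scripted along these lines. This is a sketch under assumptions: the marker list is copied from the bullet above, while `findLeaks` and the in-memory `sample` map are hypothetical stand-ins for reading the generated .md files from disk.

```javascript
// Marker strings taken from the test plan's sweep list.
const LEAK_MARKERS = ['about:blank', '<style', 'signup-promo', 'kapa-widget', 'Edit on CodePen'];

// Return one { file, marker } record for each marker found in a file.
const findLeaks = (filesByPath) =>
  Object.entries(filesByPath).flatMap(([file, markdown]) =>
    LEAK_MARKERS
      .filter((marker) => markdown.includes(marker))
      .map((marker) => ({ file, marker })));

// Hypothetical sample input: path -> generated markdown.
const sample = {
  'good/index.md': '## Section Name\n\n> **Note:** text\n',
  'bad/index.md': '## [Section](about:blank#section-name)\n',
};

const leaks = findLeaks(sample);
console.log(leaks);
```

An empty result across every generated file corresponds to the "zero leaks" claim in the test plan.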

Add preprocessing transforms to generate-markdown.mjs that clean
the Antora HTML before dom-to-semantic-markdown conversion:

- Strip <style>, <script>, and signup promo elements
- Convert admonition blocks to blockquote format
- Extract live demo JS code, drop scaffold noise
- Remove Antora heading anchor wrappers
- Convert card-layout tables to bulleted lists
- Fix about:blank# anchor references

Refactored to const arrow functions with a composable transform
pipeline, following tinymce/tinymce-premium conventions.

Verified: byte-identical manifest output between the old and new
implementations across all 1,442 converted pages; zero
about:blank, <style>, signup-promo, or kapa-widget leaks.
@kemister85 kemister85 requested a review from a team as a code owner April 27, 2026 13:23
@kemister85 kemister85 requested review from a team, TheSpyder, hollyjwaits, kimwoodfield, ltrouton, metricjs, soritaheng and spocke and removed request for a team April 27, 2026 13:23
@kemister85
Contributor Author

@TheSpyder

I've been looking at the quality of our generated index.md files from the V1 release. These are the Markdown endpoints that LLMs and AI coding tools consume when fetching TinyMCE documentation. The dom-to-semantic-markdown library does a solid job with straightforward content, but several Antora-specific HTML patterns were producing degraded output:

  • Admonition blocks (Note/Warning/Tip) rendered as broken table rows instead of blockquotes
  • Live demo sections produced noisy tab lists and duplicated raw HTML textarea content
  • Heading anchors resolved to about:blank#section-name instead of #section-name
  • Card layout tables (like on Getting Started) converted to malformed markdown
  • <style>, <script>, and signup promo blocks leaked through into the markdown

The fix adds a DOM preprocessing pipeline that runs before the dom-to-semantic-markdown (d2m) conversion. Each transform is a small, named function in a composable array, so adding or reordering transforms is straightforward. The script was also refactored to follow our established conventions (const arrow functions, single-purpose functions, verb-first naming).
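For anyone who wants a feel for what a "small, named" transform looks like, here is a dependency-free sketch in the spirit of stripNonContent. It is not the code in this PR: the node shape (`tag`, `className`, `children`, `text`) is a hypothetical stand-in for real DOM elements, and treating `signup-promo` as a class name is an assumption.

```javascript
// Hypothetical node shape: { tag, className, children, text }.
// The real script works on DOM elements; plain objects keep this runnable.
const NON_CONTENT_TAGS = new Set(['style', 'script']);
const NON_CONTENT_CLASSES = new Set(['signup-promo']); // assumed class name

// A node is non-content if its tag or any of its classes is on a block list.
const isNonContent = (node) =>
  NON_CONTENT_TAGS.has(node.tag) ||
  (node.className ?? '').split(/\s+/).some((name) => NON_CONTENT_CLASSES.has(name));

// Return a copy of the tree with non-content nodes removed at every depth.
const stripNonContent = (node) => ({
  ...node,
  children: (node.children ?? [])
    .filter((child) => !isNonContent(child))
    .map(stripNonContent),
});

const page = {
  tag: 'article',
  children: [
    { tag: 'style', text: '.doc { color: red }' },
    { tag: 'p', text: 'Real content', children: [{ tag: 'script', text: 'track()' }] },
    { tag: 'div', className: 'signup-promo card', text: 'Sign up' },
  ],
};

const cleanedPage = stripNonContent(page);
console.log(JSON.stringify(cleanedPage, null, 2));
```

Returning a new tree instead of mutating the input keeps each step easy to test in isolation, although an in-place DOM version would be equally valid.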

Tested against all 1,442 pages with zero regressions. Happy to walk through the changes if anyone has questions.
