Convert WG21 committee papers from PDF or HTML to clean Markdown.
tomd is purpose-built for C++ standards committee paper conversion. It understands WG21 metadata fields (document number, date, reply-to, audience), detects structural elements (headings, lists, tables, code blocks, wording sections), and produces Markdown that looks like a human wrote it, suitable for version control, pull request diffs, and plain-text review workflows.
From this directory:
pip install -e .
Requires Python 3.12 or newer. Runtime dependencies (pymupdf~=1.27,
beautifulsoup4~=4.14) are declared in pyproject.toml and installed
automatically.
tomd paper.pdf # -> paper.md (+ paper.prompts.md if uncertain)
tomd paper.html # -> paper.md
tomd *.pdf *.html --outdir out/ # batch mode
tomd -v paper.pdf # verbose logging
tomd -o out.md paper.pdf # explicit output path (single-file only)
Also runnable as python -m tomd.main ....
- paper.md is always produced. It contains YAML front matter (title, document number, date, audience, reply-to) followed by the paper body rendered as Markdown.
- paper.prompts.md is produced only when the converter found uncertain regions. It pairs each uncertain span with both extraction paths (MuPDF and spatial) plus surrounding context, formatted for manual LLM reconciliation. If no uncertain regions exist, no prompts file is written (and any stale one at the output path is removed).
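Because a prompts file exists only when something was uncertain, its presence can serve as a quick "needs review" signal after a batch run. The sketch below is not part of tomd; the helper names `prompts_path_for` and `needs_review` are hypothetical, and it assumes the prompts file always sits next to the Markdown output with the `.prompts.md` suffix shown above.

```python
from pathlib import Path


def prompts_path_for(md_path: Path) -> Path:
    """Map paper.md to its sibling paper.prompts.md (assumed naming rule)."""
    return md_path.with_name(md_path.stem + ".prompts.md")


def needs_review(md_path: Path) -> bool:
    """True when a prompts file exists next to the converted Markdown."""
    return prompts_path_for(md_path).is_file()


if __name__ == "__main__":
    # List converted papers in out/ that still have uncertain regions.
    for md in sorted(Path("out").glob("*.md")):
        if md.name.endswith(".prompts.md"):
            continue  # skip the prompts files themselves
        if needs_review(md):
            print(f"{md}: uncertain regions, see {prompts_path_for(md).name}")
```

Since tomd removes a stale prompts file when a rerun finds no uncertain regions, this check should stay accurate across repeated conversions.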
tomd uses dual-extraction with confidence scoring. When the MuPDF and spatial paths disagree on a page, the region is emitted in the output marked with an HTML comment:
<!-- tomd:uncertain:L120-L145 -->
The accompanying .prompts.md file contains ready-to-feed LLM prompts for
each marker. You resolve uncertain regions manually; the LLM fixes
structure, never content.
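To see which regions still need attention, the markers can be pulled out of a converted file with a regular expression. This is a minimal sketch, not tomd's own code: the pattern is inferred from the single example marker above, and the function name `find_uncertain_ranges` is hypothetical. It returns the two numbers each marker carries without interpreting what they refer to.

```python
import re

# Assumed marker shape, inferred from the one example in this README:
#   <!-- tomd:uncertain:L<start>-L<end> -->
MARKER_RE = re.compile(r"<!--\s*tomd:uncertain:L(\d+)-L(\d+)\s*-->")


def find_uncertain_ranges(markdown: str) -> list[tuple[int, int]]:
    """Return the (start, end) pair carried by each uncertain marker."""
    return [(int(start), int(end)) for start, end in MARKER_RE.findall(markdown)]


sample = "intro\n<!-- tomd:uncertain:L120-L145 -->\nsome text\n"
print(find_uncertain_ranges(sample))  # [(120, 145)]
```

An empty result means the file contains no markers matching this assumed shape.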
- No OCR. Scanned or image-only PDFs are not supported.
- No vision fallback. Papers that rely on non-extractable layout (complex equations, diagrams) will not convert cleanly.
- HTML generator coverage. Four generators are detected directly: mpark/wg21, Bikeshed, HackMD, and hand-written. Other sources fall back to a generic extractor that may miss metadata fields.
- LLM auto-resolution is deferred to v2. The .prompts.md file is produced; feeding it to an LLM and applying the result is manual in this release.
Design and architecture documentation lives alongside the code:
- CLAUDE.md - architecture rules and invariants (contributors and AI agents).
- lib/pdf/ARCHITECTURE.md - PDF converter pipeline and the techniques it uses.
- lib/html/ARCHITECTURE.md - HTML converter pipeline.
Read these in order if you are modifying tomd.
Install test extras and run the suite:
pip install -e .[test]
pytest tests/
Boost Software License 1.0. See LICENSE.