tomd

Convert WG21 committee papers from PDF or HTML to clean Markdown.

tomd is purpose-built for C++ standards committee paper conversion. It understands WG21 metadata fields (document number, date, reply-to, audience), detects structural elements (headings, lists, tables, code blocks, wording sections), and produces Markdown that looks like a human wrote it, suitable for version control, pull request diffs, and plain-text review workflows.

Install

From this directory:

pip install -e .

Requires Python 3.12 or newer. Runtime dependencies (pymupdf~=1.27, beautifulsoup4~=4.14) are declared in pyproject.toml and installed automatically.

Usage

tomd paper.pdf                  # -> paper.md (+ paper.prompts.md if uncertain)
tomd paper.html                 # -> paper.md
tomd *.pdf *.html --outdir out/ # batch mode
tomd -v paper.pdf               # verbose logging
tomd -o out.md paper.pdf        # explicit output path (single-file only)

Also runnable as python -m tomd.main ....
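When scripting around the CLI, the naming convention shown in the usage comments can be mirrored in Python. This is a minimal sketch, not part of tomd: `expected_outputs` is a hypothetical helper, the file name `p1234r0.pdf` is only a sample, and the assumption that `--outdir` keeps the input's stem is not confirmed by this README.

```python
from pathlib import Path


def expected_outputs(source, outdir=None):
    """Return (markdown_path, prompts_path) for a tomd input file.

    Hypothetical helper: mirrors the naming in the usage comments
    (paper.pdf -> paper.md, plus paper.prompts.md if uncertain).
    Assumes --outdir keeps the input's stem, which is an assumption.
    """
    src = Path(source)
    # Without an output directory, outputs sit next to the input file.
    base = Path(outdir) if outdir is not None else src.parent
    return base / f"{src.stem}.md", base / f"{src.stem}.prompts.md"


md_path, prompts_path = expected_outputs("papers/p1234r0.pdf", outdir="out")
print(md_path)
print(prompts_path)
```

Because the prompts file is only written when uncertain regions exist, checking `prompts_path.exists()` after a run is one way to decide whether a paper needs manual review.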

Output

  • paper.md is always produced. It contains YAML front matter (title, document number, date, audience, reply-to) followed by the paper body rendered as Markdown.
  • paper.prompts.md is produced only when the converter found uncertain regions. It pairs each uncertain span with both extraction paths (MuPDF and spatial) plus surrounding context, formatted for manual LLM reconciliation. If no uncertain regions exist, no prompts file is written (and any stale one at the output path is removed).
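Downstream tools often need to separate the front matter from the paper body. The following is a rough sketch of that split, assuming the conventional layout in which the front matter is delimited by lines containing only `---`; `split_front_matter` is a hypothetical helper, and the `title` key in the sample is an assumed spelling, not taken from tomd's output.

```python
def split_front_matter(markdown_text):
    """Split a Markdown document into (front_matter, body).

    Assumes the conventional layout: the first line is '---' and the
    front matter ends at the next line that is exactly '---'.
    Returns ('', markdown_text) when no front matter is found.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown_text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    # Opening delimiter without a closing one: treat as no front matter.
    return "", markdown_text


# Sample input; the key name is an assumption for illustration only.
sample = "---\ntitle: Example paper\n---\n\n# Introduction\n"
front, body = split_front_matter(sample)
print(front)
print(body.strip())
```

The front matter string can then be handed to any YAML parser; the sketch deliberately avoids parsing it so that it stays free of third-party dependencies.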

Uncertain regions

tomd uses dual extraction with confidence scoring. When the MuPDF and spatial paths disagree on a page, the affected region is still emitted in the output, marked with an HTML comment:

<!-- tomd:uncertain:L120-L145 -->

The accompanying .prompts.md file contains ready-to-feed LLM prompts for each marker. You resolve uncertain regions manually; the LLM fixes structure, never content.
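To see how many regions still need attention, the markers can be located with a regular expression. This is a minimal sketch based only on the marker format shown above; `find_uncertain_markers` is a hypothetical helper, not a tomd API, and it makes no claim about what the two `L` numbers inside the marker refer to.

```python
import re

# Matches markers of the form shown above: <!-- tomd:uncertain:L120-L145 -->
MARKER_RE = re.compile(r"<!--\s*tomd:uncertain:L(\d+)-L(\d+)\s*-->")


def find_uncertain_markers(markdown_text):
    """Return a list of (output_line, start, end) for each marker.

    output_line is the 1-based line of the Markdown where the marker sits;
    start and end are the numbers carried inside the marker itself.
    """
    found = []
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        for m in MARKER_RE.finditer(line):
            found.append((lineno, int(m.group(1)), int(m.group(2))))
    return found


sample = "# Title\n\n<!-- tomd:uncertain:L120-L145 -->\nSome text\n"
markers = find_uncertain_markers(sample)
print(markers)  # [(3, 120, 145)]
```

An empty result means the converted file carries no uncertain markers.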

Limitations

  • No OCR. Scanned or image-only PDFs are not supported.
  • No vision fallback. Papers that rely on non-extractable layout (complex equations, diagrams) will not convert cleanly.
  • HTML generator coverage. Four generators are detected directly: mpark/wg21, Bikeshed, HackMD, and hand-written. Other sources fall back to a generic extractor that may miss metadata fields.
  • LLM auto-resolution is deferred to v2. The .prompts.md file is produced; feeding it to an LLM and applying the result is manual in this release.

Design

Design and architecture documentation lives alongside the code. Read those documents in order if you are modifying tomd.

Development

Install test extras and run the suite:

pip install -e ".[test]"
pytest tests/

License

Boost Software License 1.0. See LICENSE.
