Fix PDF structure classification bugs across batch by gregjkal · Pull Request #9 · cppalliance/tomd

gregjkal · 2026-04-17T20:34:29Z

Summary

Five targeted fixes to the PDF extraction and classification pipeline, discovered while running tomd in batch mode against a 52-paper WG21 corpus. Each commit addresses a distinct failure mode; the commits are independent and can be reviewed/landed individually if desired.

Closes #6.
Partially addresses #7: word-level backticks are eliminated, but the multi-thousand-character line concatenation on p3977r0 persists; its root cause is the two-column layout.

Commits

43488fb Fix header stripping for spatial path on multi-column headers
Edge items were collected from the MuPDF path only, so three-column running headers (left/center/right) produced patterns that only matched MuPDF's split form, leaving spatial's merged form unstripped. Union edge items from both paths.
59dc645 Strip repeating headers at span granularity
When spatial merges left/center/right columns into one line (one span per column), whole-line text matches no pattern but each column-span does. Added per-span stripping so matched columns drop and non-header spans (e.g. a one-off appendix title) remain. Also indexes repeating patterns by quantized y for fast lookup.
c77fc45 Fix body-size detection on code-heavy papers
Code-dense papers biased body size toward the smaller monospace font, pushing body prose into the heading font-rank range. Three coordinated changes: camelCase splitter handles letter/digit boundaries (so LMMono9 classifies correctly), propagate_monospace requires majority-mono chars per font (prevents short-span false positives contaminating prose fonts like Lato-Light), and _detect_body_size excludes monospace by default with a fallback for wording-heavy papers.
8fda9bd Reject heading classification for prose-length first lines
Numbered paragraphs like 1 A fiber is a single flow of control... matched SECTION_NUM_RE and cascaded to deep heading levels. Reject heading classification when the first line exceeds 12 words at LOW confidence (number-only, no font confirmation). Real section titles with MEDIUM/HIGH confidence are preserved even when long.
e61a787 Treat same-styled consecutive headings as siblings in nesting validation
Includes two changes (should have been two commits):
- Repetition demotion: LOW-confidence numbered headings whose section_num appears ≥3 times (paragraph-number resets) get demoted to paragraphs. Threshold of 3 preserves TOC/body pairs and single section/paragraph collisions.
- Sibling consolidation in _validate_nesting: runs of headings with matching font sizes are assigned the previous clamped level instead of prev + 1, preventing cascades like ### P21, #### P20, ##### P19.
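
The repetition demotion in e61a787 can be sketched roughly as below. This is a minimal illustration, not the tomd implementation: the `Block` type, its field names, and `demote_repeated` are hypothetical, and counting occurrences across all headings (rather than only LOW-confidence ones) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a classified block."""
    kind: str                   # "heading" or "paragraph"
    section_num: Optional[str]  # e.g. "1", "9.2", or None
    confidence: str             # "LOW", "MEDIUM", or "HIGH"
    text: str


def demote_repeated(blocks: List[Block], threshold: int = 3) -> List[Block]:
    """Demote LOW-confidence numbered headings whose number repeats."""
    # Count how often each section number is used by a heading.
    counts = Counter(
        b.section_num for b in blocks
        if b.kind == "heading" and b.section_num
    )
    result = []
    for b in blocks:
        repeated = b.section_num is not None and counts[b.section_num] >= threshold
        if b.kind == "heading" and b.confidence == "LOW" and repeated:
            # Paragraph-number resets: the same number keeps reappearing.
            result.append(replace(b, kind="paragraph"))
        else:
            result.append(b)
    return result
```

With a threshold of 3, a number that appears only twice (for example once in the TOC and once in the body) is left alone.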

Batch impact

Measured against a 52-paper WG21 corpus:

Two papers that had silent structural bugs now render correctly (p3844r3, p3844r4: bullets in changelog/revision sections no longer classified as deep headings).
One paper that produced confidently-wrong output now emits a .prompts.md for human/LLM review (p3844r4; an improvement per the project's "honest output" contract).
The remaining .prompts.md papers were already uncertain before; their output is cleaner.
Bullet-depth-6+ headings: p3844r3 86→0, p3844r4 7→0.
Numbered-paragraph-as-h2: p4003r0 100→29, p3874r1 36→11, p0876r22 50→1.
- Needs more work, but the path is not clear
No regressions observed.

Known issues not addressed in this PR

Observed during this work and not touched; may be worth filing or linking to existing issues:

#7 (partial): Definition/example prose wrapped in word-level backticks and concatenated into multi-thousand-character lines. The two-column layout concatenation on p3977r0 persists; the backticking portion is fixed.
#8: Bibliography sections flattened from vertical bulleted list to horizontal paragraph. Seen on p3984r0 (inline • not split).
TOC dot-leader lines concatenated into a single paragraph (p0876r22).
Compact tables (e.g. 5×2 vote tallies) not detected (p0876r22).
Spec-clause labels with unique numbers surviving as headings (## 6 *Postconditions:*, ## 8 Effects:).
Numbered footnote references with hyperlinks classified as headings (p3874r1).
Spatial path reading-order mangling with mixed monospace + prose (p3844r4, p3978r2).
Ligature glyph reordering in the spatial path (Specification → Specifci ation).
Bare-number stub headings where number and title get extracted separately (n5036).

Edge items were collected from the MuPDF path only, so three-column running headers (left/center/right) produced exact-match patterns only for MuPDF's split form. The spatial path merges those into a single line on shared y-coordinate and was left unstripped, polluting the dual-path comparison and flagging regions as uncertain. Union edge items from both paths before detecting repeating patterns.
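
As a rough sketch of the union step: the function name, the per-page set representation, and the `min_pages` threshold below are all assumptions made for illustration, not tomd's actual API.

```python
from collections import Counter
from typing import List, Set


def detect_repeating_union(
    mupdf_pages: List[Set[str]],
    spatial_pages: List[Set[str]],
    min_pages: int = 3,  # hypothetical repetition threshold
) -> Set[str]:
    """Return edge-line texts that repeat on at least min_pages pages.

    Each input is a list with one set of edge-line texts per page,
    as produced by one extraction path.
    """
    counts: Counter = Counter()
    for mupdf_items, spatial_items in zip(mupdf_pages, spatial_pages):
        # Union first, so both the split and the merged form are counted.
        for text in mupdf_items | spatial_items:
            counts[text] += 1
    return {text for text, n in counts.items() if n >= min_pages}
```

With split items such as `"P0876R22"` and `"Fibers"` from one path and a merged line `"P0876R22 Fibers"` from the other, both forms end up in the pattern set.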

When the spatial path merges left/center/right header columns into a single line (one span per column), the whole-line text matches no pattern but each column-span does. Stripping per-span drops the matched columns and preserves any non-header span that shares the header y-coordinate (e.g. a one-off appendix title on the bibliography page). Also indexes repeating patterns by quantized y so the common no-pattern-near-this-line case short-circuits without scanning, and factors out the y-quantize expression into a shared helper.
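
A minimal sketch of per-span stripping with a quantized-y index follows. The bucket size `Y_QUANTUM`, the `(text, y)` pattern representation, and all function names are hypothetical; a real implementation may also need to check neighboring buckets.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Y_QUANTUM = 2.0  # hypothetical bucket height in points


def quantize_y(y: float, quantum: float = Y_QUANTUM) -> int:
    """Shared helper: map a y-coordinate to an integer bucket."""
    return int(round(y / quantum))


def build_pattern_index(patterns: Iterable[Tuple[str, float]]) -> Dict[int, Set[str]]:
    """Index repeating-pattern texts by their quantized y."""
    index: Dict[int, Set[str]] = defaultdict(set)
    for text, y in patterns:
        index[quantize_y(y)].add(text)
    return index


def strip_header_spans(line_y: float, spans: List[str],
                       index: Dict[int, Set[str]]) -> List[str]:
    """Drop spans that match a repeating pattern near this y; keep the rest."""
    candidates = index.get(quantize_y(line_y))
    if not candidates:
        # Common case: no pattern near this line, so nothing is scanned.
        return spans
    return [s for s in spans if s.strip() not in candidates]
```

A merged header line with spans `["P0876R22", "Appendix A", "Fibers"]` would keep only `"Appendix A"` when the other two spans are known patterns at that y.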

Code-dense papers (especially wording papers) biased the detected body size toward the smaller monospace font, which then pushed body prose into the font-rank heading range. The nesting validator ratcheted consecutive "headings" upward one level at a time, producing depth-6/7/8 cascades where bullet lines should have been list items. Three coordinated changes:
- _CAMEL_SPLIT_RE now splits on letter/digit boundaries, so family names with embedded sizes (LMMono9, LMMono10, LMMono8) are recognized by the name-only monospace check. Previously only LMMonoLt10 split cleanly because its modifier separated the digits.
- propagate_monospace now requires a majority of a font's spatial characters to be classified monospace before propagating. Short spans of digits or thin chars could false-positive the per-glyph metric, contaminating proportional fonts (e.g. Lato-Light) when any single short span passed the signal.
- _detect_body_size excludes monospace spans by default. For wording papers with almost no prose at all, it falls back to the overall mode so body size isn't pinned to a tiny outlier (e.g. superscript page numbers at 7pt).
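
The three changes can be approximated as below. This is a sketch under stated assumptions, not the code in the PR: the regex, the "token ends with mono" name check, the `(size, n_chars, is_mono)` span tuples, and the `min_prose_fraction` cutoff are all invented for illustration.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical splitter: lower->upper boundaries plus letter/digit boundaries.
CAMEL_SPLIT_RE = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)


def split_font_name(name: str) -> List[str]:
    return [tok for tok in CAMEL_SPLIT_RE.split(name) if tok]


def name_looks_monospace(name: str) -> bool:
    # Assumed name-only check: some token ends with "mono".
    return any(tok.lower().endswith("mono") for tok in split_font_name(name))


def is_majority_mono(mono_chars: int, total_chars: int) -> bool:
    """Propagate monospace to a font only if most of its characters are mono."""
    return total_chars > 0 and mono_chars * 2 > total_chars


def detect_body_size(spans: List[Tuple[float, int, bool]],
                     min_prose_fraction: float = 0.05) -> Optional[float]:
    """Character-weighted mode of font size, preferring non-monospace spans."""
    all_counts: Counter = Counter()
    prose_counts: Counter = Counter()
    for size, n_chars, is_mono in spans:
        key = round(size, 1)
        all_counts[key] += n_chars
        if not is_mono:
            prose_counts[key] += n_chars
    total = sum(all_counts.values())
    if total == 0:
        return None
    prose_total = sum(prose_counts.values())
    if prose_total > 0 and prose_total >= min_prose_fraction * total:
        return prose_counts.most_common(1)[0][0]
    # Almost no prose: fall back to the overall mode.
    return all_counts.most_common(1)[0][0]
```

Without the letter/digit alternatives in the regex, `LMMono9` would stay a single token and fail the assumed suffix check.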

Numbered paragraph clauses in wording sections ("1 A fiber is a single flow of control...") and numeric arithmetic results that land on a continuation line ("4 on IEEE-754 double implementations...") both match SECTION_NUM_RE and got classified as H2 headings. The nesting validator then let them cascade downward. Reject the heading classification when the first physical line exceeds _HEADING_MAX_WORDS (12) AND the confidence is LOW (number alone, no font-size or bold confirmation). Real section titles with number + larger-font signal at MEDIUM/HIGH confidence are preserved even when long (e.g. "9.2 Rationale: Why Lane Count (L) is the Sole Coordinate"). Across the batch: 201 -> 76 long misclassified headings in the four most affected papers.
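
The rejection rule might look roughly like this. The regex here is a simplified stand-in for SECTION_NUM_RE, the `Confidence` enum and `accept_numbered_heading` are hypothetical names, and counting the leading number as a word is an assumption.

```python
import re
from enum import Enum


class Confidence(Enum):
    LOW = 1     # number alone, no font-size or bold confirmation
    MEDIUM = 2
    HIGH = 3


# Simplified stand-in for SECTION_NUM_RE; the real pattern may differ.
SECTION_NUM_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S")
HEADING_MAX_WORDS = 12


def accept_numbered_heading(first_line: str, confidence: Confidence) -> bool:
    """Return True if a numbered first line should stay a heading."""
    if not SECTION_NUM_RE.match(first_line):
        return False
    too_long = len(first_line.split()) > HEADING_MAX_WORDS
    if too_long and confidence is Confidence.LOW:
        # Prose-length line with only a number as evidence: reject.
        return False
    return True
```

Both conditions must hold for rejection, so a long title confirmed by font size survives and a short number-only title is still accepted.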

When multiple consecutive headings share an initial styling (e.g., a run of "Changes since P0876R21", "Changes since P0876R20", ... entries), the existing clamp rule would assign each prev_clamped + 1, cascading to ever-deeper levels: level 3, 4, 5, 6, 7, 8, 8... even though they are clearly siblings at the same logical level. Track the previous heading's font size in _validate_nesting. When the current heading's font size matches closely, it's a sibling and gets the previous clamped level, not prev_clamped + 1. The tolerance is 0.1pt, which is wider than extraction noise for the same visual style but narrower than the real gap between heading tiers in typical designs.
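
A minimal sketch of the sibling rule, assuming headings arrive as `(raw_level, font_size)` pairs. The function name and signature are hypothetical and only mirror the clamp described above.

```python
from typing import List, Optional, Tuple

FONT_SIZE_TOLERANCE = 0.1  # points, as stated in the commit message


def clamp_heading_levels(headings: List[Tuple[int, float]],
                         tol: float = FONT_SIZE_TOLERANCE) -> List[int]:
    """Clamp heading levels so depth grows by at most one per step,
    treating same-sized consecutive headings as siblings."""
    clamped_levels: List[int] = []
    prev_clamped = 0
    prev_size: Optional[float] = None
    for raw_level, size in headings:
        if prev_size is not None and abs(size - prev_size) <= tol:
            # Same visual style as the previous heading: sibling.
            clamped = prev_clamped
        else:
            # Existing rule: never jump more than one level deeper.
            clamped = min(raw_level, prev_clamped + 1)
        clamped_levels.append(clamped)
        prev_clamped, prev_size = clamped, size
    return clamped_levels
```

A run of same-sized entries with a deep raw level then settles at one level instead of descending one step per entry.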

Covers body-size prose preference, heading prose-length rejection, repeated-number demotion, sibling nesting clamp, monospace majority propagation, and per-span header stripping. Also documents review findings (sibling false-flatten risk, tight font tolerance, leftover LOW confidence on demoted paragraphs) in issues/pr9-review.md. Fixes stale y-bucket assertions in three existing detect_repeating tests that were broken by the bbox-center refactor in 59dc645.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


gregjkal added 5 commits April 17, 2026 10:35

gregjkal force-pushed the fix/pdf-structure-classification branch from 9e1a7da to 1e1383d Compare April 17, 2026 21:10

vinniefalco merged commit 1e1383d into master Apr 17, 2026

gregjkal deleted the fix/pdf-structure-classification branch April 17, 2026 23:43

vinniefalco merged 6 commits into master from fix/pdf-structure-classification
