Fix PDF structure classification bugs across batch #9

Merged
vinniefalco merged 6 commits into master from fix/pdf-structure-classification
Apr 17, 2026
Collaborator

@gregjkal gregjkal commented Apr 17, 2026

Summary

Five targeted fixes to the PDF extraction and classification pipeline, discovered while running tomd in batch mode against a 52-paper WG21 corpus. Each commit addresses a distinct failure mode; the commits are independent and can be reviewed/landed individually if desired.

Closes #6.
Partially addresses #7 (word-level backticks eliminated; multi-thousand-character line concatenation on p3977r0 persists, root cause is two-column layout).

Commits

  1. 43488fb Fix header stripping for spatial path on multi-column headers
    Edge items were collected from the MuPDF path only, so three-column running headers (left/center/right) produced patterns that only matched MuPDF's split form, leaving spatial's merged form unstripped. Union edge items from both paths.

  2. 59dc645 Strip repeating headers at span granularity
    When spatial merges left/center/right columns into one line (one span per column), whole-line text matches no pattern but each column-span does. Added per-span stripping so matched columns drop and non-header spans (e.g. a one-off appendix title) remain. Also indexes repeating patterns by quantized y for fast lookup.

  3. c77fc45 Fix body-size detection on code-heavy papers
    Code-dense papers biased body size toward the smaller monospace font, pushing body prose into the heading font-rank range. Three coordinated changes: camelCase splitter handles letter/digit boundaries (so LMMono9 classifies correctly), propagate_monospace requires majority-mono chars per font (prevents short-span false positives contaminating prose fonts like Lato-Light), and _detect_body_size excludes monospace by default with a fallback for wording-heavy papers.

  4. 8fda9bd Reject heading classification for prose-length first lines
    Numbered paragraphs like "1 A fiber is a single flow of control..." matched SECTION_NUM_RE and cascaded to deep heading levels. Reject heading classification when the first line exceeds 12 words at LOW confidence (number-only, no font confirmation). Real section titles with MEDIUM/HIGH confidence are preserved even when long.

  5. e61a787 Treat same-styled consecutive headings as siblings in nesting validation
    Includes two changes (should have been two commits):

    • Repetition demotion: LOW-confidence numbered headings whose section_num appears ≥3 times (paragraph-number resets) get demoted to paragraphs. Threshold of 3 preserves TOC/body pairs and single section/paragraph collisions.
    • Sibling consolidation in _validate_nesting: runs of headings with matching font sizes are assigned the previous clamped level instead of prev + 1, preventing cascades like ### P21, #### P20, ##### P19.
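
The repetition-demotion rule in commit 5 can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the block shape (dicts with `kind`, `section_num`, `confidence`) and the function name `demote_repeated` are hypothetical, and counting occurrences across all headings (not only LOW ones) is an assumption.

```python
from collections import Counter

def demote_repeated(blocks, threshold=3):
    """Demote LOW-confidence headings whose section number recurs.

    blocks: list of dicts with keys "kind", "section_num", "confidence"
    (a hypothetical shape chosen for this sketch).
    """
    # Count how often each section number shows up among headings.
    counts = Counter(b["section_num"] for b in blocks if b["kind"] == "heading")
    result = []
    for b in blocks:
        repeated = b["kind"] == "heading" and counts[b["section_num"]] >= threshold
        if repeated and b["confidence"] == "LOW":
            b = {**b, "kind": "paragraph"}  # copy, do not mutate the input
        result.append(b)
    return result

def _h(num, conf):
    return {"kind": "heading", "section_num": num, "confidence": conf}

# "1" appears three times (paragraph-number resets); "2" is a TOC/body pair.
blocks = [_h("1", "LOW"), _h("2", "HIGH"), _h("1", "LOW"), _h("2", "LOW"), _h("1", "LOW")]
kinds = [b["kind"] for b in demote_repeated(blocks)]
```

With a threshold of 3, the pair sharing section number "2" survives as headings, while the three "1" entries become paragraphs.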

Batch impact

Measured against a 52-paper WG21 corpus:

  • Two papers that had silent structural bugs now render correctly (p3844r3, p3844r4: bullets in changelog/revision sections no longer classified as deep headings).
  • One paper that produced confidently-wrong output now emits a .prompts.md for human/LLM review (p3844r4; an improvement per the project's "honest output" contract).
  • The remaining .prompts.md papers were already uncertain before; their output is cleaner.
  • Bullet-depth-6+ headings: p3844r3 86→0, p3844r4 7→0.
  • Numbered-paragraph-as-h2: p4003r0 100→29, p3874r1 36→11, p0876r22 50→1.
    • These papers still need more work, but the path forward is not yet clear.
  • No regressions observed.

Known issues not addressed in this PR

Observed during this work and not touched; may be worth filing or linking to existing issues:

Edge items were collected from the MuPDF path only, so three-column
running headers (left/center/right) produced exact-match patterns only
for MuPDF's split form. The spatial path merges those into a single
line on shared y-coordinate and was left unstripped, polluting the
dual-path comparison and flagging regions as uncertain.

Union edge items from both paths before detecting repeating patterns.
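
A rough sketch of why the union matters, assuming edge items are `(page, y_bucket, text)` tuples and a pattern counts as repeating once it appears on a minimum number of pages. The tuple shape, the `min_pages` parameter, and the header strings are illustrative assumptions, not the project's actual data model.

```python
from collections import defaultdict

def detect_repeating(edge_items, min_pages=3):
    """Return the (y_bucket, text) pairs that recur on at least min_pages pages."""
    pages_seen = defaultdict(set)
    for page, y_bucket, text in edge_items:
        pages_seen[(y_bucket, text.strip())].add(page)
    return {key for key, pages in pages_seen.items() if len(pages) >= min_pages}

pages = range(1, 5)
# MuPDF path: the three header columns arrive as separate lines.
mupdf_items = [(p, 3, t) for p in pages for t in ("DocNum", "Title", "Date")]
# Spatial path: the same columns arrive merged into one line.
spatial_items = [(p, 3, "DocNum Title Date") for p in pages]

mupdf_only = detect_repeating(mupdf_items)
both_paths = detect_repeating(mupdf_items + spatial_items)
```

Patterns built from `mupdf_items` alone never contain the merged form, so the spatial line would survive stripping; the union covers both forms.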

When the spatial path merges left/center/right header columns into a
single line (one span per column), the whole-line text matches no
pattern but each column-span does. Stripping per-span drops the
matched columns and preserves any non-header span that shares the
header y-coordinate (e.g. a one-off appendix title on the bibliography
page).

Also indexes repeating patterns by quantized y so the common
no-pattern-near-this-line case short-circuits without scanning, and
factors out the y-quantize expression into a shared helper.
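
Per-span stripping with a quantized-y index might look like the sketch below. The bucket height `Y_QUANTUM`, the helper names, and the plain-string span representation are assumptions made for illustration only.

```python
from collections import defaultdict

Y_QUANTUM = 2.0  # assumed bucket height in points

def quantize_y(y):
    """Shared helper: map a y-coordinate to its integer bucket."""
    return int(y // Y_QUANTUM)

def index_patterns(patterns):
    """Group repeating-pattern texts by quantized y for O(1) lookup."""
    index = defaultdict(set)
    for y, text in patterns:
        index[quantize_y(y)].add(text)
    return index

def strip_line(spans, y, index):
    """Drop the spans that match a repeating pattern at this y; keep the rest."""
    candidates = index.get(quantize_y(y))
    if not candidates:
        return spans  # common case: no pattern near this line, no scan
    return [s for s in spans if s.strip() not in candidates]

index = index_patterns([(30.0, "DocNum"), (30.0, "Title"), (30.0, "Date")])
header_line = strip_line(["DocNum", "Appendix A", "Date"], 30.4, index)
body_line = strip_line(["Some body text"], 400.0, index)
```

Here the one-off "Appendix A" span shares the header's y-bucket but matches no pattern, so it is the only span that remains.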

Code-dense papers (especially wording papers) biased the detected body
size toward the smaller monospace font, which then pushed body prose
into the font-rank heading range. The nesting validator ratcheted
consecutive "headings" upward one level at a time, producing
depth-6/7/8 cascades where bullet lines should have been list items.

Three coordinated changes:

- _CAMEL_SPLIT_RE now splits on letter/digit boundaries, so family
  names with embedded sizes (LMMono9, LMMono10, LMMono8) are recognized
  by the name-only monospace check. Previously only LMMonoLt10 split
  cleanly because its modifier separated the digits.

- propagate_monospace now requires a majority of a font's spatial
  characters to be classified monospace before propagating. Short
  spans of digits or thin chars could false-positive the per-glyph
  metric, contaminating proportional fonts (e.g. Lato-Light) when any
  single short span passed the signal.

- _detect_body_size excludes monospace spans by default. For wording
  papers with almost no prose at all, falls back to the overall mode
  so body size isn't pinned to a tiny outlier (e.g. superscript page
  numbers at 7pt).
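
The three changes can be sketched together as below. Everything here is a hedged approximation: the regex, the "token ends in mono" name check, the span tuple shapes, and the `min_prose_chars` fallback cutoff are assumptions for illustration and may differ from the actual implementation.

```python
import re
from collections import Counter

# Assumed splitter: lower->upper boundaries plus letter<->digit boundaries.
CAMEL_SPLIT_RE = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)

def is_mono_name(font_name):
    """Name-only check: some token of the family name ends in 'mono'."""
    tokens = [t.lower() for t in CAMEL_SPLIT_RE.split(font_name) if t]
    return any(t.endswith("mono") for t in tokens)

def propagate_monospace(spans, threshold=0.5):
    """spans: (font, text, is_mono_by_metric). Return the fonts for which a
    majority of characters were classified monospace by the glyph metric."""
    mono, total = Counter(), Counter()
    for font, text, is_mono in spans:
        total[font] += len(text)
        if is_mono:
            mono[font] += len(text)
    return {f for f, n in total.items() if n and mono[f] / n > threshold}

def detect_body_size(spans, mono_fonts, min_prose_chars=200):
    """spans: (font, size, text). Character-weighted mode of the font size,
    excluding monospace fonts unless almost no prose remains."""
    prose, overall = Counter(), Counter()
    for font, size, text in spans:
        overall[size] += len(text)
        if font not in mono_fonts:
            prose[size] += len(text)
    chosen = prose if sum(prose.values()) >= min_prose_chars else overall
    return chosen.most_common(1)[0][0]

# One short digit span false-positives the metric, but Lato-Light stays prose.
metric_spans = [("Lato-Light", "x" * 500, False), ("Lato-Light", "11", True),
                ("LMMono9", "y" * 300, True)]
mono_fonts = propagate_monospace(metric_spans)

code_heavy = [("Lato-Light", 11.0, "x" * 500), ("LMMono9", 9.0, "y" * 2000)]
wording_only = [("LMMono9", 9.0, "y" * 2000), ("Lato-Light", 7.0, "12")]
```

In the `wording_only` case the prose counter holds only two characters at 7pt, so the fallback to the overall mode keeps the body size at 9pt instead of the superscript outlier.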

Numbered paragraph clauses in wording sections ("1 A fiber is a single
flow of control...") and numeric arithmetic results that land on a
continuation line ("4 on IEEE-754 double implementations...") both
match SECTION_NUM_RE and got classified as H2 headings. The nesting
validator then let them cascade downward.

Reject the heading classification when the first physical line exceeds
_HEADING_MAX_WORDS (12) AND the confidence is LOW (number alone, no
font-size or bold confirmation). Real section titles with number +
larger-font signal at MEDIUM/HIGH confidence are preserved even when
long (e.g. "9.2 Rationale: Why Lane Count (L) is the Sole Coordinate").

Across the batch: 201 -> 76 long misclassified headings in the four
most affected papers.
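
A minimal sketch of the rejection rule, assuming a simple dotted-number pattern for `SECTION_NUM_RE` and a three-level confidence enum. Both are stand-ins for the project's real definitions, and counting the leading number as a word is an assumption.

```python
import re
from enum import Enum

class Confidence(Enum):
    LOW = 1      # section number alone
    MEDIUM = 2   # number plus one font signal
    HIGH = 3     # number plus several font signals

SECTION_NUM_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S")  # assumed pattern
HEADING_MAX_WORDS = 12

def is_heading(first_line, confidence):
    """Accept a numbered first line as a heading unless it is prose-length
    AND has nothing but the number to support it."""
    if not SECTION_NUM_RE.match(first_line):
        return False
    too_long = len(first_line.split()) > HEADING_MAX_WORDS
    return not (too_long and confidence is Confidence.LOW)

# Illustrative 14-word line that happens to start with a number.
long_line = "1 This clause describes the behavior of an object when it is moved from"
```

The same long line is rejected at LOW confidence but kept when a font signal raises it to MEDIUM, which mirrors the two conditions being combined with AND.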

When multiple consecutive headings share an initial styling (e.g., a
run of "Changes since P0876R21", "Changes since P0876R20", ... entries),
the existing clamp rule would assign each prev_clamped + 1, cascading
to ever-deeper levels: level 3, 4, 5, 6, 7, 8, 8... even though they
are clearly siblings at the same logical level.

Track the previous heading's font size in _validate_nesting. When the
current heading's font size matches closely, it's a sibling and gets
the previous clamped level, not prev_clamped + 1.

The tolerance is 0.1pt, which is wider than extraction noise for the
same visual style but narrower than the real gap between heading
tiers in typical designs.
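
The sibling rule can be illustrated with the sketch below. The `(level, font_size)` tuple shape, the `siblings` toggle, and the `min(level, prev + 1)` form of the pre-existing clamp are assumptions used to contrast old and new behavior, not the actual `_validate_nesting` code.

```python
FONT_TOLERANCE = 0.1  # points, as stated in the commit message

def validate_nesting(headings, siblings=True):
    """headings: list of (raw_level, font_size). Returns the clamped levels."""
    clamped_levels = []
    prev_level = None
    prev_size = None
    for level, size in headings:
        if prev_level is None:
            clamped = level
        elif siblings and abs(size - prev_size) <= FONT_TOLERANCE:
            clamped = prev_level  # same styling: treat as a sibling
        else:
            clamped = min(level, prev_level + 1)  # never skip a level
        clamped_levels.append(clamped)
        prev_level, prev_size = clamped, size
    return clamped_levels

# One h2 followed by a run of same-styled entries classified too deep.
run = [(2, 14.0), (8, 11.0), (8, 11.05), (8, 11.0)]
old_levels = validate_nesting(run, siblings=False)
new_levels = validate_nesting(run, siblings=True)
```

Without the sibling check the run ratchets one level deeper per entry; with it, the entries stay at the level assigned to the first of them. A genuine child that shares its parent's font size would also be flattened by this rule.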

gregjkal added a commit that referenced this pull request on Apr 17, 2026:
Covers body-size prose preference, heading prose-length rejection,
repeated-number demotion, sibling nesting clamp, monospace majority
propagation, and per-span header stripping. Also documents review
findings (sibling false-flatten risk, tight font tolerance, leftover
LOW confidence on demoted paragraphs) in issues/pr9-review.md.

Fixes stale y-bucket assertions in three existing detect_repeating
tests that were broken by the bbox-center refactor in 59dc645.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gregjkal force-pushed the fix/pdf-structure-classification branch from 9e1a7da to 1e1383d on April 17, 2026 21:10
vinniefalco merged commit 1e1383d into master on Apr 17, 2026
gregjkal deleted the fix/pdf-structure-classification branch on April 17, 2026 23:43

Development

Successfully merging this pull request may close these issues.

Bullets in changelog/revision sections classified as headings to depth 7-8
