Fix PDF structure classification bugs across batch #9
Merged
vinniefalco merged 6 commits into master on Apr 17, 2026
Edge items were collected from the MuPDF path only, so three-column running headers (left/center/right) produced exact-match patterns only for MuPDF's split form. The spatial path merges those into a single line on shared y-coordinate and was left unstripped, polluting the dual-path comparison and flagging regions as uncertain. Union edge items from both paths before detecting repeating patterns.
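A minimal sketch of the union step described above, not the project's actual code. The function names, the per-page list-of-strings data shape, and the `min_pages` threshold are all assumptions made for illustration.

```python
from collections import Counter


def detect_repeating_sketch(edge_items_by_page, min_pages=3):
    """Return edge texts that appear on at least `min_pages` pages.

    Hypothetical stand-in for the real repeating-pattern detector.
    """
    counts = Counter()
    for items in edge_items_by_page:
        counts.update(set(items))  # count each text at most once per page
    return {text for text, n in counts.items() if n >= min_pages}


def union_edge_items(mupdf_pages, spatial_pages):
    """Merge per-page edge items from both paths, dropping duplicates."""
    merged = []
    for mupdf_items, spatial_items in zip(mupdf_pages, spatial_pages):
        # dict.fromkeys keeps first-seen order while removing duplicates
        merged.append(list(dict.fromkeys([*mupdf_items, *spatial_items])))
    return merged


# Illustrative data: MuPDF yields three column fragments per page, while
# the spatial path yields one merged line at the same y-coordinate.
mupdf = [["P0000R0", "Example Title", "2026-04-17"]] * 3
spatial = [["P0000R0 Example Title 2026-04-17"]] * 3

mupdf_only = detect_repeating_sketch(mupdf)
patterns = detect_repeating_sketch(union_edge_items(mupdf, spatial))
```

With MuPDF items alone, the merged spatial line never becomes a pattern; after the union, both the split and merged forms are available for stripping.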
When the spatial path merges left/center/right header columns into a single line (one span per column), the whole-line text matches no pattern but each column-span does. Stripping per-span drops the matched columns and preserves any non-header span that shares the header y-coordinate (e.g. a one-off appendix title on the bibliography page). Also indexes repeating patterns by quantized y so the common no-pattern-near-this-line case short-circuits without scanning, and factors out the y-quantize expression into a shared helper.
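The per-span stripping and y-indexed lookup could look roughly like the sketch below. `Span`, `Y_QUANTUM`, `quantize_y`, `index_patterns`, and `strip_line` are hypothetical names, and the 2pt bucket height is an assumed value rather than the one the project uses.

```python
from collections import defaultdict
from dataclasses import dataclass

Y_QUANTUM = 2.0  # assumed bucket height in points


def quantize_y(y):
    """Shared helper: map a y-coordinate to an integer bucket."""
    return round(y / Y_QUANTUM)


@dataclass
class Span:
    text: str
    y: float


def index_patterns(patterns):
    """Build {y_bucket: set of pattern texts} from (text, y) pairs."""
    index = defaultdict(set)
    for text, y in patterns:
        index[quantize_y(y)].add(text)
    return dict(index)


def strip_line(spans, index):
    """Drop spans that match a repeating pattern at this line's y."""
    if not spans:
        return spans
    candidates = index.get(quantize_y(spans[0].y))
    if not candidates:
        return spans  # no pattern near this line: skip the scan entirely
    return [s for s in spans if s.text.strip() not in candidates]


index = index_patterns([
    ("P0000R0", 36.0), ("Example Title", 36.0), ("2026-04-17", 36.0),
])
header_line = [Span("P0000R0", 36.2), Span("Appendix A", 36.1),
               Span("2026-04-17", 36.0)]
body_line = [Span("Ordinary body text", 400.0)]
kept = strip_line(header_line, index)
```

Here the one-off "Appendix A" span survives because only the spans that match a pattern are dropped. This sketch looks up a single bucket; a real implementation might also check neighbouring buckets to tolerate coordinates that fall on a bucket boundary.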
Code-dense papers (especially wording papers) biased the detected body size toward the smaller monospace font, which then pushed body prose into the font-rank heading range. The nesting validator ratcheted consecutive "headings" upward one level at a time, producing depth-6/7/8 cascades where bullet lines should have been list items. Three coordinated changes:
- _CAMEL_SPLIT_RE now splits on letter/digit boundaries, so family names with embedded sizes (LMMono9, LMMono10, LMMono8) are recognized by the name-only monospace check. Previously only LMMonoLt10 split cleanly because its modifier separated the digits.
- propagate_monospace now requires a majority of a font's spatial characters to be classified monospace before propagating. Short spans of digits or thin chars could false-positive the per-glyph metric, contaminating proportional fonts (e.g. Lato-Light) when any single short span passed the signal.
- _detect_body_size excludes monospace spans by default. For wording papers with almost no prose at all, falls back to the overall mode so body size isn't pinned to a tiny outlier (e.g. superscript page numbers at 7pt).
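The three changes can be illustrated with the sketch below. It is written under assumptions: the regex is a guess at what a letter/digit-aware splitter looks like, not the real `_CAMEL_SPLIT_RE`; the keyword check, the `(size, chars, is_mono)` tuple shape, and the `min_prose_share` threshold are all hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: split at lower->upper, letter->digit and digit->letter
# boundaries, plus common separators.
CAMEL_SPLIT_RE = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])|[-_\s]+"
)


def name_looks_monospace(font_name):
    """Name-only check: does any token of the family name end in 'mono'?"""
    tokens = [t.lower() for t in CAMEL_SPLIT_RE.split(font_name) if t]
    return any(t.endswith("mono") or t == "courier" for t in tokens)


def font_is_monospace(span_votes):
    """Majority rule over (char_count, looks_monospace) pairs for one font."""
    votes = list(span_votes)
    total = sum(n for n, _ in votes)
    mono = sum(n for n, is_mono in votes if is_mono)
    return total > 0 and 2 * mono > total


def detect_body_size(spans, min_prose_share=0.1):
    """Character-weighted mode of font sizes, preferring non-monospace text.

    spans: iterable of (font_size, char_count, is_monospace).
    Falls back to the overall mode when prose is nearly absent.
    """
    overall, prose = Counter(), Counter()
    for size, chars, is_mono in spans:
        key = round(size, 1)
        overall[key] += chars
        if not is_mono:
            prose[key] += chars
    if not overall:
        return None
    if sum(prose.values()) >= min_prose_share * sum(overall.values()):
        return prose.most_common(1)[0][0]
    return overall.most_common(1)[0][0]


code_heavy = [(9.0, 6000, True), (10.0, 4000, False), (14.0, 200, False)]
wording_only = [(9.0, 9900, True), (7.0, 50, False)]
```

In the `code_heavy` sample the 9pt monospace text outweighs the prose, yet the body size resolves to the 10pt prose font. In `wording_only`, the only non-monospace text is a 7pt outlier, so the fallback to the overall mode applies.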
Numbered paragraph clauses in wording sections ("1 A fiber is a single
flow of control...") and numeric arithmetic results that land on a
continuation line ("4 on IEEE-754 double implementations...") both
match SECTION_NUM_RE and got classified as H2 headings. The nesting
validator then let them cascade downward.
Reject the heading classification when the first physical line exceeds
_HEADING_MAX_WORDS (12) AND the confidence is LOW (number alone, no
font-size or bold confirmation). Real section titles with number +
larger-font signal at MEDIUM/HIGH confidence are preserved even when
long (e.g. "9.2 Rationale: Why Lane Count (L) is the Sole Coordinate").
Across the batch: 201 -> 76 long misclassified headings in the four
most affected papers.
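A hedged sketch of the rejection rule follows. The `Confidence` enum, the simplified `SECTION_NUM_RE`, and the choice to count the leading number as a word are assumptions; only the 12-word limit and the LOW-confidence condition come from the commit message.

```python
import re
from enum import IntEnum


class Confidence(IntEnum):
    LOW = 1     # section number alone
    MEDIUM = 2  # number plus one confirming signal (assumed meaning)
    HIGH = 3


HEADING_MAX_WORDS = 12
# Simplified stand-in for the real SECTION_NUM_RE.
SECTION_NUM_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S")


def accept_heading(first_line, confidence):
    """Return True if a numbered line should keep its heading classification."""
    if not SECTION_NUM_RE.match(first_line):
        return False
    too_long = len(first_line.split()) > HEADING_MAX_WORDS
    # Reject only when BOTH conditions hold: prose-length and number-only.
    return not (too_long and confidence == Confidence.LOW)


long_line = ("1 This clause is ordinary wording prose that happens "
             "to begin with a paragraph number")
```

The made-up `long_line` has 15 whitespace-separated tokens, so it is rejected at LOW confidence but kept at MEDIUM or HIGH, where a font-size or bold signal backs up the number.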
When multiple consecutive headings share an initial styling (e.g., a run of "Changes since P0876R21", "Changes since P0876R20", ... entries), the existing clamp rule would assign each prev_clamped + 1, cascading to ever-deeper levels: level 3, 4, 5, 6, 7, 8, 8... even though they are clearly siblings at the same logical level. Track the previous heading's font size in _validate_nesting. When the current heading's font size matches closely, it's a sibling and gets the previous clamped level, not prev_clamped + 1. The tolerance is 0.1pt, which is wider than extraction noise for the same visual style but narrower than the real gap between heading tiers in typical designs.
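The sibling rule can be sketched as follows. The function below is an illustration, not the real `_validate_nesting`: the `(requested_level, font_size)` input shape and the `sibling_rule` switch are assumptions added so the old and new behaviour can be compared side by side.

```python
FONT_SIZE_TOLERANCE = 0.1  # points, as stated in the commit message


def validate_nesting_sketch(headings, sibling_rule=True):
    """Clamp heading levels so no heading jumps more than one level deeper.

    headings: list of (requested_level, font_size) pairs.
    With sibling_rule, a heading whose font size matches the previous
    heading's keeps the previous clamped level instead of going one deeper.
    """
    result = []
    prev_level = None
    prev_size = None
    for level, size in headings:
        if prev_level is None:
            clamped = level
        elif sibling_rule and abs(size - prev_size) <= FONT_SIZE_TOLERANCE:
            clamped = min(level, prev_level)      # same style: sibling
        else:
            clamped = min(level, prev_level + 1)  # original clamp rule
        result.append(clamped)
        prev_level, prev_size = clamped, size
    return result


# One 14pt section heading followed by a run of identically styled entries
# that all request a deep level.
run = [(2, 14.0), (8, 11.0), (8, 11.0), (8, 11.05), (8, 11.0)]
old = validate_nesting_sketch(run, sibling_rule=False)
new = validate_nesting_sketch(run)
```

Without the sibling rule the run cascades one level deeper per entry; with it, every entry after the first stays at level 3.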
gregjkal added a commit that referenced this pull request on Apr 17, 2026
Covers body-size prose preference, heading prose-length rejection, repeated-number demotion, sibling nesting clamp, monospace majority propagation, and per-span header stripping. Also documents review findings (sibling false-flatten risk, tight font tolerance, leftover LOW confidence on demoted paragraphs) in issues/pr9-review.md. Fixes stale y-bucket assertions in three existing detect_repeating tests that were broken by the bbox-center refactor in 59dc645.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Five targeted fixes to the PDF extraction and classification pipeline, discovered while running `tomd` in batch mode against a 52-paper WG21 corpus. Each commit addresses a distinct failure mode; the commits are independent and can be reviewed/landed individually if desired.

Closes #6.
Partially addresses #7 (word-level backticks eliminated; multi-thousand-character line concatenation on p3977r0 persists, and its root cause is the two-column layout).
Commits
- 43488fb: Fix header stripping for spatial path on multi-column headers. Edge items were collected from the MuPDF path only, so three-column running headers (left/center/right) produced patterns that only matched MuPDF's split form, leaving spatial's merged form unstripped. Union edge items from both paths.
- 59dc645: Strip repeating headers at span granularity. When spatial merges left/center/right columns into one line (one span per column), whole-line text matches no pattern but each column-span does. Added per-span stripping so matched columns drop and non-header spans (e.g. a one-off appendix title) remain. Also indexes repeating patterns by quantized y for fast lookup.
- c77fc45: Fix body-size detection on code-heavy papers. Code-dense papers biased body size toward the smaller monospace font, pushing body prose into the heading font-rank range. Three coordinated changes: the camelCase splitter handles letter/digit boundaries (so `LMMono9` classifies correctly), `propagate_monospace` requires majority-mono chars per font (prevents short-span false positives contaminating prose fonts like Lato-Light), and `_detect_body_size` excludes monospace by default with a fallback for wording-heavy papers.
- 8fda9bd: Reject heading classification for prose-length first lines. Numbered paragraphs like `1 A fiber is a single flow of control...` matched `SECTION_NUM_RE` and cascaded to deep heading levels. Reject heading classification when the first line exceeds 12 words at LOW confidence (number-only, no font confirmation). Real section titles with MEDIUM/HIGH confidence are preserved even when long.
- e61a787: Treat same-styled consecutive headings as siblings in nesting validation. Includes two changes (should have been two commits):
  - Headings whose `section_num` appears ≥3 times (paragraph-number resets) get demoted to paragraphs. The threshold of 3 preserves TOC/body pairs and single section/paragraph collisions.
  - `_validate_nesting`: runs of headings with matching font sizes are assigned the previous clamped level instead of `prev + 1`, preventing cascades like `### P21`, `#### P20`, `##### P19`.

Batch impact
Measured against a 52-paper WG21 corpus:
- … `.prompts.md` for human/LLM review (p3844r4; an improvement per the project's "honest output" contract).
- … `.prompts.md` papers were already uncertain before; their output is cleaner.

Known issues not addressed in this PR
Observed during this work and not touched; may be worth filing or linking to existing issues:
- … `•` not split).
- … `## 6 *Postconditions:*`, `## 8 Effects:`).
- … `Specification` → `Specifci ation`).