111 changes: 111 additions & 0 deletions issues/pr9-review.md
# PR #9 Review Findings

Code-quality notes from the review of PR #9 (`fix/pdf-structure-classification`). None were blockers; tracked here so they surface when the surrounding code is touched next.

Each entry names the concern, gives a concrete example, identifies the root cause, and suggests a fix direction (matching the format of `known-issues.md`).

---

## Sibling consolidation may demote legitimate subsection nesting

**Symptom:** `_validate_nesting` now treats consecutive headings that share a font size as siblings and clamps the second to `prev_level`. On papers that express heading depth through section numbering rather than typography (a common pattern for Markdown-origin PDFs), this flattens real nested subsections.

**Example sequence:**

```
## 2 Motivation (font_size = 14, level = 2)
### 2.1 Background (font_size = 12, level = 3)
#### 2.1.1 History (font_size = 12, level = 4)
```

With `_SIBLING_FONT_TOL = 0.1`, `History` is compared to `Background`: same font size, so it is reclassified as a sibling (`level = 3`). The numbering depth (3 dotted parts → level 4) is ignored. Output becomes:

```
## 2 Motivation
### 2.1 Background
### 2.1.1 History <-- should be ####
```

**Papers likely affected:** any WG21 paper where subsection headings at multiple depths share one font size. Not yet observed in the 52-paper batch as a regression, but the probability rises for papers that pass LOW-confidence heading classification (where section number is the only evidence).

**Root cause:** `lib/pdf/structure.py:_validate_nesting` (lines 912-921). The sibling check fires whenever `abs(sec.font_size - prev_font_size) <= _SIBLING_FONT_TOL` AND `sec.heading_level > prev_level`. It has no awareness of whether the section number supports the deeper level.

**Fix direction:** Let the section-number signal veto sibling clamping. When the incoming heading has a `has_number` match AND its `number_level == sec.heading_level == prev_level + 1`, allow the one-level descent through unchanged. Only clamp when the level would skip (`> prev_level + 1`) or when no numbering is present. In CLAUDE.md terms: dotted-decimal section numbering is higher-reliability than font-size agreement; font agreement should not override numbering agreement.
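A minimal sketch of that veto, pulled out as a pure function for clarity. `has_number` and `number_level` are stand-in names for whatever numbering evidence `_validate_nesting` actually has access to; the real attributes in `lib/pdf/structure.py` may differ:

```python
_SIBLING_FONT_TOL = 0.1

def clamp_level(sec_level: int, prev_level: int, font_delta: float,
                has_number: bool = False, number_level: int = 0) -> int:
    """Return the heading level to assign, given the previous heading.

    Hypothetical extraction of the sibling-clamp decision; `has_number`
    and `number_level` are assumed names, not the real Section fields.
    """
    same_font = abs(font_delta) <= _SIBLING_FONT_TOL
    if same_font and sec_level > prev_level:
        # Dotted-decimal numbering that supports exactly one level of
        # descent vetoes the clamp: numbering outranks font agreement.
        if has_number and number_level == sec_level == prev_level + 1:
            return sec_level
        # Level skips, or no numbering evidence: clamp as before.
        return prev_level
    return sec_level
```

On the example above, `History` (level 4 after level 3, same font, numbering depth 3) would keep its level, while an unnumbered same-font heading that tried to jump two levels would still be clamped.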

---

## `_SIBLING_FONT_TOL = 0.1` is very tight for real-world font sizes

**Symptom:** PDFs sometimes specify fractionally different font sizes for the same visual heading tier (e.g. `11.7` and `12.0`, or `13.8` and `14.0`). A tolerance of `0.1` treats any pair differing by more than a tenth of a point as different tiers and skips sibling consolidation, reintroducing the cascade this fix was meant to prevent.

**Example:**

```
sec_a.font_size = 11.95
sec_b.font_size = 12.00
abs(12.00 - 11.95) = 0.05 # within tol, sibling
```

```
sec_a.font_size = 11.70
sec_b.font_size = 12.00
abs(12.00 - 11.70) = 0.30 # exceeds tol, NOT sibling
```

Fractions in the 0.1–0.5 range are most common in LaTeX-rendered PDFs and in scanned-then-OCR'd documents. Papers in the 52-doc batch did not happen to hit this, but the test coverage does not guard against it either.

**Root cause:** `_SIBLING_FONT_TOL = 0.1` in `lib/pdf/structure.py` is an absolute point-size tolerance with no relation to the heading size itself. At 12pt body text the tolerance is ~0.8% of the size; at 8pt captions it is ~1.3%. The constant was chosen without a documented rationale.

**Fix direction:** Either (a) widen the absolute tolerance to ~0.5 (covers typical fractional PDF size variance while still discriminating between real 10pt / 11pt / 12pt tiers), or (b) express it as a ratio: `abs(a - b) <= max(0.5, 0.02 * max(a, b))`. The ratio form is more defensible across font-size ranges.
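The ratio form from option (b) can be sketched as a small predicate. The thresholds are the ones proposed above, not values tuned against the 52-paper batch:

```python
def same_heading_tier(a: float, b: float) -> bool:
    """True when two font sizes belong to the same visual heading tier.

    The 0.5pt absolute floor absorbs fractional PDF size variance;
    the 2% ratio term takes over for large display headings, where a
    fixed 0.5pt would be proportionally too strict.
    """
    return abs(a - b) <= max(0.5, 0.02 * max(a, b))
```

Note that for sizes up to 25pt the `max` always resolves to the 0.5 floor, so the ratio term only matters for title-sized headings; if that range never occurs in practice, option (a) alone is equivalent.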

---

## Local import of `Counter` inside `propagate_monospace`

**Symptom:** `lib/pdf/mono.py:195` adds `from collections import Counter` inside the body of `propagate_monospace`. Other modules in the project (e.g. `types.py`, `structure.py`, `cleanup.py`) import `Counter` at module scope. The inline import breaks the project's import-location consistency.

**Impact:** Cosmetic only. No measurable performance difference for a one-shot call per document, and no correctness risk.

**Root cause:** The import was likely added at the point of first use while iterating on the fix and never hoisted.

**Fix direction:** Move the import to the top of `mono.py` alongside `math` and `re`. Zero runtime impact, consistent with the rest of the codebase. Trivial to fold into the next mono.py edit.

---

## Commit 5 bundles two unrelated changes

**Symptom:** Commit `e61a787` ("Treat same-styled consecutive headings as siblings in nesting validation") contains two logically independent changes:

1. **Repetition demotion:** `_demote_repeated_low_confidence_numbers`, a new phase that demotes LOW-confidence headings whose `section_num` repeats ≥3 times.
2. **Sibling consolidation:** a new branch inside `_validate_nesting` that clamps runs of same-font-size headings to `prev_level`.

Either change is independently useful and could be reverted without the other. Bundling them hurts bisectability: if a future regression traces to one of the two rules, `git bisect` cannot separate them.

**Acknowledgement:** The PR description already flags this ("Includes two changes (should have been two commits)").

**Fix direction:** Split before merge via `git rebase -i e61a787` and `git reset HEAD~` + two `git commit -p` rounds. Optional — the description captures the distinction in prose, and the functions themselves are separately named and located.

---

## `_demote_repeated_low_confidence_numbers` leaves `confidence` at LOW on demoted paragraphs

**Symptom:** When a section is demoted from HEADING → PARAGRAPH, only `kind` and `heading_level` are updated. `confidence` remains at `Confidence.LOW`, the value it carried as a heading. Most PARAGRAPH sections elsewhere in the pipeline carry `Confidence.HIGH` (constructed by `_make_paragraph_section`).

**Example:**

```python
sec.kind == SectionKind.HEADING
sec.confidence == Confidence.LOW
sec.heading_level == 2

# after _demote_repeated_low_confidence_numbers:
sec.kind == SectionKind.PARAGRAPH
sec.confidence == Confidence.LOW # stays LOW
sec.heading_level == 0
```

**Impact:** None currently observable. A grep for `sec.confidence ==` across `lib/` shows only heading-path consumers inspect confidence (the `HIGH → MEDIUM` downgrade in `_validate_nesting`), and `compare_extractions` has precedent for assigning `Confidence.LOW` to paragraphs (line 182). The emitter ignores `confidence` for PARAGRAPH sections.

**Root cause:** `lib/pdf/structure.py:_demote_repeated_low_confidence_numbers` only resets the two fields it knows the downstream classifier cares about. Not a bug today, but a future consumer that trusts `confidence` on PARAGRAPH sections would be surprised.

**Fix direction:** Either (a) reset `sec.confidence = Confidence.HIGH` on demotion to match the provenance of other paragraphs, (b) leave it as LOW and add a short comment stating that confidence is not meaningful for non-HEADING sections, or (c) drop the field entirely from `Section` for non-HEADING kinds. Option (a) is the least risky and aligns with the "most paragraphs are HIGH" default.
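Option (a) is a one-line addition at the demotion site. Sketched below with minimal stand-in enums mirroring the names used in this review (`SectionKind`, `Confidence`), not the real definitions in `lib/pdf/types.py`:

```python
from dataclasses import dataclass
from enum import Enum, auto

class SectionKind(Enum):      # stand-in for the real enum
    HEADING = auto()
    PARAGRAPH = auto()

class Confidence(Enum):       # stand-in for the real enum
    LOW = auto()
    MEDIUM = auto()
    HIGH = auto()

@dataclass
class Section:                # stand-in carrying only the relevant fields
    kind: SectionKind
    confidence: Confidence
    heading_level: int

def demote_to_paragraph(sec: Section) -> None:
    """Demote a heading, resetting confidence to match other paragraphs."""
    sec.kind = SectionKind.PARAGRAPH
    sec.heading_level = 0
    # Option (a): paragraphs built by _make_paragraph_section carry HIGH,
    # so demoted paragraphs should not silently keep their heading LOW.
    sec.confidence = Confidence.HIGH
```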
10 changes: 9 additions & 1 deletion lib/pdf/__init__.py
```diff
@@ -116,7 +116,15 @@ def convert_pdf(path: Path) -> tuple[str, str | None]:
         mupdf_blocks = extract_mupdf(page, pg_num)
         spatial_blocks = extract_spatial(page, pg_num)
 
-        edge_items = get_edge_items(mupdf_blocks, pg_num, page_height)
+        # Collect edges from both paths: MuPDF splits left/center/right
+        # headers into separate lines on wide horizontal gaps; spatial
+        # merges them into one line. Patterns from both forms are needed
+        # so strip_repeating (which does exact text match) can match
+        # against either path's segmentation.
+        edge_items = (
+            get_edge_items(mupdf_blocks, pg_num, page_height)
+            + get_edge_items(spatial_blocks, pg_num, page_height)
+        )
         all_edge_items.append(edge_items)
 
         links = collect_links(page)
```
86 changes: 62 additions & 24 deletions lib/pdf/cleanup.py
```diff
@@ -17,6 +17,12 @@
 _log = logging.getLogger(__name__)
 
 
+def _y_bucket(bbox: tuple[float, float, float, float]) -> float:
+    """Quantize a bbox's vertical center to the Y_TOLERANCE grid."""
+    y_center = (bbox[1] + bbox[3]) / 2.0
+    return round(y_center / Y_TOLERANCE) * Y_TOLERANCE
+
+
 def get_edge_items(blocks: list[Block], page_num: int,
                    page_height: float) -> list[PageEdgeItem]:
     """Get the top N and bottom N text items from a page by y-coordinate.
```
Expand Down Expand Up @@ -72,8 +78,7 @@ def detect_repeating(all_edge_items: list[list[PageEdgeItem]],

for page_items in all_edge_items:
for item in page_items:
y_key = round(item.y / Y_TOLERANCE) * Y_TOLERANCE
y_buckets[y_key].append(item)
y_buckets[_y_bucket(item.bbox)].append(item)

repeating = set()
for y_key, items in y_buckets.items():
Expand Down Expand Up @@ -107,10 +112,36 @@ def detect_repeating(all_edge_items: list[list[PageEdgeItem]],

def strip_repeating(blocks: list[Block], repeating: set[tuple[float, str]],
) -> list[Block]:
"""Remove lines matching repeating header/footer patterns from blocks."""
"""Remove lines (or individual spans) matching repeating header/footer
patterns from blocks.

Spatial-path lines merge left/center/right header columns (one span
per column). The whole-line text then won't match any single pattern,
but each column-span will; per-span stripping handles that while
preserving non-header content that shares the header y-coordinate
(e.g. a one-off appendix title).
"""
if not repeating:
return blocks

patterns_by_y: dict[float, list[str]] = defaultdict(list)
for ry, rp in repeating:
patterns_by_y[ry].append(rp)

def _patterns_near(y_key: float) -> list[str]:
return (patterns_by_y.get(y_key - Y_TOLERANCE, [])
+ patterns_by_y.get(y_key, [])
+ patterns_by_y.get(y_key + Y_TOLERANCE, []))

def _matches(text: str, rpattern: str) -> bool:
if rpattern == text:
return True
if rpattern == "__PAGE_NUM__" and PAGE_NUM_RE.match(text):
return True
if rpattern == "__DOC_NUM__" and DOC_NUM_RE.search(text):
return True
return False

result = []
for block in blocks:
kept_lines = []
```diff
@@ -119,29 +150,36 @@ def strip_repeating(blocks: list[Block], repeating: set[tuple[float, str]],
             if not text:
                 kept_lines.append(line)
                 continue
-            y_center = (line.bbox[1] + line.bbox[3]) / 2.0
-            y_key = round(y_center / Y_TOLERANCE) * Y_TOLERANCE
-            stripped = False
-            for ry, rpattern in repeating:
-                if abs(y_key - ry) > Y_TOLERANCE:
-                    continue
-                if rpattern == text:
-                    stripped = True
-                    break
-                if rpattern == "__PAGE_NUM__" and PAGE_NUM_RE.match(text):
-                    stripped = True
-                    break
-                if rpattern == "__DOC_NUM__" and DOC_NUM_RE.search(text):
-                    stripped = True
-                    break
-            if not stripped:
+
+            line_patterns = _patterns_near(_y_bucket(line.bbox))
+            if not line_patterns:
                 kept_lines.append(line)
+                continue
+
+            if any(_matches(text, rp) for rp in line_patterns):
+                continue
+
+            kept_spans = []
+            stripped_any = False
+            for span in line.spans:
+                span_text = span.text.strip()
+                if not span_text:
+                    kept_spans.append(span)
+                    continue
+                span_patterns = _patterns_near(_y_bucket(span.bbox))
+                if any(_matches(span_text, rp) for rp in span_patterns):
+                    stripped_any = True
+                    continue
+                kept_spans.append(span)
+
+            if not any(sp.text.strip() for sp in kept_spans):
+                continue
+
+            kept_lines.append(
+                replace(line, spans=kept_spans) if stripped_any else line)
 
         if kept_lines:
-            result.append(Block(
-                lines=kept_lines,
-                bbox=block.bbox,
-                page_num=block.page_num,
-            ))
+            result.append(replace(block, lines=kept_lines))
     return result
```


44 changes: 33 additions & 11 deletions lib/pdf/mono.py
```diff
@@ -33,7 +33,12 @@
 
 _MONO_KEYWORDS = frozenset({"mono", "courier", "code", "consolas", "menlo"})
 
-_CAMEL_SPLIT_RE = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")
+_CAMEL_SPLIT_RE = re.compile(
+    r"(?<=[a-z])(?=[A-Z])"
+    r"|(?<=[A-Z])(?=[A-Z][a-z])"
+    r"|(?<=[A-Za-z])(?=\d)"
+    r"|(?<=\d)(?=[A-Za-z])"
+)
 
 _GLYPH_CV_THRESHOLD = 0.15
 _FAT_THIN_REJECT_RATIO = 1.3
```
Expand Down Expand Up @@ -170,24 +175,41 @@ def classify_monospace(
return False


_PROPAGATE_MONO_MAJORITY = 0.5


def propagate_monospace(mupdf_blocks: list[Block], spatial_blocks: list[Block],
dominant_font: str) -> None:
"""Apply spatial path's glyph-width monospace decisions to MuPDF spans.

Collects fonts classified as monospace by the spatial path (which has
per-character data), filters out the dominant body font (the single
most common font by character count), and sets monospace=True on
matching MuPDF spans. The dominant_font must be pre-lowercased.
Only propagates fonts whose spatial spans are mostly classified
monospace (by character count). Short spans of digits or thin
characters can false-positive the per-glyph signal, so requiring a
majority keeps proportional fonts (e.g. a regular text font with a
handful of "1" or "3.1" spans) out of the mono set.

The dominant_font is still discarded unless it passes a
name-based mono check, to avoid body text leaking into code blocks
when metrics happen to agree across a majority.
"""
mono_fonts: set[str] = set()
from collections import Counter
mono_chars: Counter[str] = Counter()
total_chars: Counter[str] = Counter()
for b in spatial_blocks:
for ln in b.lines:
for s in ln.spans:
if s.monospace and s.text.strip():
mono_fonts.add(s.font_name.lower())
# Only discard the dominant font if it is not itself a known monospace
# family. On code-heavy papers the code font may dominate by character
# count, and discarding it would suppress all code block detection.
if not s.text.strip():
continue
key = s.font_name.lower()
total_chars[key] += len(s.text)
if s.monospace:
mono_chars[key] += len(s.text)

mono_fonts = {
f for f, total in total_chars.items()
if total > 0 and mono_chars[f] / total >= _PROPAGATE_MONO_MAJORITY
}

if dominant_font and not classify_monospace(dominant_font):
mono_fonts.discard(dominant_font)
if not mono_fonts:
Expand Down