111 changes: 111 additions & 0 deletions issues/pr9-review.md
# PR #9 Review Findings

Code-quality notes from the review of PR #9 (`fix/pdf-structure-classification`). None were blockers; tracked here so they surface when the surrounding code is touched next.

Each entry names the concern, gives a concrete example, identifies the root cause, and suggests a fix direction (matching the format of `known-issues.md`).

---

## Sibling consolidation may demote legitimate subsection nesting

**Symptom:** `_validate_nesting` now treats consecutive headings that share a font size as siblings and clamps the second to `prev_level`. On papers that express heading depth through section numbering rather than typography (a common pattern for Markdown-origin PDFs), this flattens real nested subsections.

**Example sequence:**

```
## 2 Motivation (font_size = 14, level = 2)
### 2.1 Background (font_size = 12, level = 3)
#### 2.1.1 History (font_size = 12, level = 4)
```

With `_SIBLING_FONT_TOL = 0.1`, `History` is compared to `Background`: same font size, so it is reclassified as a sibling (`level = 3`). The numbering depth (3 dotted parts → level 4) is ignored. Output becomes:

```
## 2 Motivation
### 2.1 Background
### 2.1.1 History <-- should be ####
```

**Papers likely affected:** any WG21 paper where subsection headings at multiple depths share one font size. Not yet observed in the 52-paper batch as a regression, but the probability rises for papers that pass LOW-confidence heading classification (where section number is the only evidence).

**Root cause:** `lib/pdf/structure.py:_validate_nesting` (lines 912-921). The sibling check fires whenever `abs(sec.font_size - prev_font_size) <= _SIBLING_FONT_TOL` AND `sec.heading_level > prev_level`. It has no awareness of whether the section number supports the deeper level.

**Fix direction:** Let the section-number signal veto sibling clamping. When the incoming heading has a `has_number` match AND its `number_level == sec.heading_level == prev_level + 1`, allow the one-level descent through unchanged. Only clamp when the level would skip (`> prev_level + 1`) or when no numbering is present. In CLAUDE.md terms: dotted-decimal section numbering is higher-reliability than font-size agreement; font agreement should not override numbering agreement.
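A minimal sketch of that veto, pulled out as a pure function for clarity. `has_number` and `number_level` are stand-in names for whatever numbering evidence `_validate_nesting` actually has access to; the real attributes in `lib/pdf/structure.py` may differ:

```python
_SIBLING_FONT_TOL = 0.1

def clamp_level(sec_level: int, prev_level: int, font_delta: float,
                has_number: bool = False, number_level: int = 0) -> int:
    """Return the heading level to assign, given the previous heading.

    Hypothetical extraction of the sibling-clamp decision; `has_number`
    and `number_level` are assumed names, not the real Section fields.
    """
    same_font = abs(font_delta) <= _SIBLING_FONT_TOL
    if same_font and sec_level > prev_level:
        # Dotted-decimal numbering that supports exactly one level of
        # descent vetoes the clamp: numbering outranks font agreement.
        if has_number and number_level == sec_level == prev_level + 1:
            return sec_level
        # Level skips, or no numbering evidence: clamp as before.
        return prev_level
    return sec_level
```

On the example above, `History` (level 4 after level 3, same font, numbering depth 3) would keep its level, while an unnumbered same-font heading that tried to jump two levels would still be clamped.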

---

## `_SIBLING_FONT_TOL = 0.1` is very tight for real-world font sizes

**Symptom:** PDFs sometimes specify fractionally different font sizes for the same visual heading tier (e.g. `11.7` and `12.0`, or `13.8` and `14.0`). A tolerance of `0.1` treats any pair differing by more than a tenth of a point as different tiers and skips sibling consolidation, reintroducing the cascade this fix was meant to prevent.

**Example:**

```
sec_a.font_size = 11.95
sec_b.font_size = 12.00
abs(12.00 - 11.95) = 0.05 # within tol, sibling
```

```
sec_a.font_size = 11.70
sec_b.font_size = 12.00
abs(12.00 - 11.70) = 0.30 # exceeds tol, NOT sibling
```

Fractions in the 0.1–0.5 range are most common in LaTeX-rendered PDFs and in scanned-then-OCR'd documents. Papers in the 52-doc batch did not happen to hit this, but the test coverage does not guard against it either.

**Root cause:** `_SIBLING_FONT_TOL = 0.1` in `lib/pdf/structure.py` is an absolute point-size tolerance with no relation to the heading size itself. At 12pt body text the tolerance is ~0.8% of the size; at 8pt captions it is ~1.3%. The constant was chosen without a documented rationale.

**Fix direction:** Either (a) widen the absolute tolerance to ~0.5 (covers typical fractional PDF size variance while still discriminating between real 10pt / 11pt / 12pt tiers), or (b) express it as a ratio: `abs(a - b) <= max(0.5, 0.02 * max(a, b))`. The ratio form is more defensible across font-size ranges.
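The ratio form from option (b) can be sketched as a small predicate. The thresholds are the ones proposed above, not values tuned against the 52-paper batch:

```python
def same_heading_tier(a: float, b: float) -> bool:
    """True when two font sizes belong to the same visual heading tier.

    The 0.5pt absolute floor absorbs fractional PDF size variance;
    the 2% ratio term takes over for large display headings, where a
    fixed 0.5pt would be proportionally too strict.
    """
    return abs(a - b) <= max(0.5, 0.02 * max(a, b))
```

Note that for sizes up to 25pt the `max` always resolves to the 0.5 floor, so the ratio term only matters for title-sized headings; if that range never occurs in practice, option (a) alone is equivalent.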

---

## Local import of `Counter` inside `propagate_monospace`

**Symptom:** `lib/pdf/mono.py:195` adds `from collections import Counter` inside the body of `propagate_monospace`. Other modules in the project (e.g. `types.py`, `structure.py`, `cleanup.py`) import `Counter` at module scope. The inline import breaks the project's import-location consistency.

**Impact:** Cosmetic only. No measurable performance difference for a one-shot call per document, and no correctness risk.

**Root cause:** The import was likely added at the point of first use while iterating on the fix and never hoisted.

**Fix direction:** Move the import to the top of `mono.py` alongside `math` and `re`. Zero runtime impact, consistent with the rest of the codebase. Trivial to fold into the next mono.py edit.

---

## Commit 5 bundles two unrelated changes

**Symptom:** Commit `e61a787` ("Treat same-styled consecutive headings as siblings in nesting validation") contains two logically independent changes:

1. **Repetition demotion:** `_demote_repeated_low_confidence_numbers`, a new phase that demotes LOW-confidence headings whose `section_num` repeats ≥3 times.
2. **Sibling consolidation:** a new branch inside `_validate_nesting` that clamps runs of same-font-size headings to `prev_level`.

Either change is independently useful and could be reverted without the other. Bundling them hurts bisectability: if a future regression traces to one of the two rules, `git bisect` cannot separate them.

**Acknowledgement:** The PR description already flags this ("Includes two changes (should have been two commits)").

**Fix direction:** Split before merge via `git rebase -i e61a787` and `git reset HEAD~` + two `git commit -p` rounds. Optional — the description captures the distinction in prose, and the functions themselves are separately named and located.

---

## `_demote_repeated_low_confidence_numbers` leaves `confidence` at LOW on demoted paragraphs

**Symptom:** When a section is demoted from HEADING → PARAGRAPH, only `kind` and `heading_level` are updated. `confidence` remains at `Confidence.LOW`, the value it carried as a heading. Most PARAGRAPH sections elsewhere in the pipeline carry `Confidence.HIGH` (constructed by `_make_paragraph_section`).

**Example:**

```python
sec.kind == SectionKind.HEADING
sec.confidence == Confidence.LOW
sec.heading_level == 2

# after _demote_repeated_low_confidence_numbers:
sec.kind == SectionKind.PARAGRAPH
sec.confidence == Confidence.LOW # stays LOW
sec.heading_level == 0
```

**Impact:** None currently observable. A grep for `sec.confidence ==` across `lib/` shows only heading-path consumers inspect confidence (the `HIGH → MEDIUM` downgrade in `_validate_nesting`), and `compare_extractions` has precedent for assigning `Confidence.LOW` to paragraphs (line 182). The emitter ignores `confidence` for PARAGRAPH sections.

**Root cause:** `lib/pdf/structure.py:_demote_repeated_low_confidence_numbers` only resets the two fields it knows the downstream classifier cares about. Not a bug today, but a future consumer that trusts `confidence` on PARAGRAPH sections would be surprised.

**Fix direction:** Either (a) reset `sec.confidence = Confidence.HIGH` on demotion to match the provenance of other paragraphs, (b) leave it as LOW and add a short comment stating that confidence is not meaningful for non-HEADING sections, or (c) drop the field entirely from `Section` for non-HEADING kinds. Option (a) is the least risky and aligns with the "most paragraphs are HIGH" default.
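Option (a) is a one-line addition at the demotion site. Sketched below with minimal stand-in enums mirroring the names used in this review (`SectionKind`, `Confidence`), not the real definitions in `lib/pdf/types.py`:

```python
from dataclasses import dataclass
from enum import Enum, auto

class SectionKind(Enum):      # stand-in for the real enum
    HEADING = auto()
    PARAGRAPH = auto()

class Confidence(Enum):       # stand-in for the real enum
    LOW = auto()
    MEDIUM = auto()
    HIGH = auto()

@dataclass
class Section:                # stand-in carrying only the relevant fields
    kind: SectionKind
    confidence: Confidence
    heading_level: int

def demote_to_paragraph(sec: Section) -> None:
    """Demote a heading, resetting confidence to match other paragraphs."""
    sec.kind = SectionKind.PARAGRAPH
    sec.heading_level = 0
    # Option (a): paragraphs built by _make_paragraph_section carry HIGH,
    # so demoted paragraphs should not silently keep their heading LOW.
    sec.confidence = Confidence.HIGH
```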
10 changes: 9 additions & 1 deletion lib/pdf/__init__.py
```diff
@@ -116,7 +116,15 @@ def convert_pdf(path: Path) -> tuple[str, str | None]:
         mupdf_blocks = extract_mupdf(page, pg_num)
         spatial_blocks = extract_spatial(page, pg_num)
 
-        edge_items = get_edge_items(mupdf_blocks, pg_num, page_height)
+        # Collect edges from both paths: MuPDF splits left/center/right
+        # headers into separate lines on wide horizontal gaps; spatial
+        # merges them into one line. Patterns from both forms are needed
+        # so strip_repeating (which does exact text match) can match
+        # against either path's segmentation.
+        edge_items = (
+            get_edge_items(mupdf_blocks, pg_num, page_height)
+            + get_edge_items(spatial_blocks, pg_num, page_height)
+        )
         all_edge_items.append(edge_items)
 
         links = collect_links(page)
```
86 changes: 62 additions & 24 deletions lib/pdf/cleanup.py
```diff
@@ -17,6 +17,12 @@
 _log = logging.getLogger(__name__)
 
 
+def _y_bucket(bbox: tuple[float, float, float, float]) -> float:
+    """Quantize a bbox's vertical center to the Y_TOLERANCE grid."""
+    y_center = (bbox[1] + bbox[3]) / 2.0
+    return round(y_center / Y_TOLERANCE) * Y_TOLERANCE
+
+
 def get_edge_items(blocks: list[Block], page_num: int,
                    page_height: float) -> list[PageEdgeItem]:
     """Get the top N and bottom N text items from a page by y-coordinate.
```
Expand Down Expand Up @@ -72,8 +78,7 @@ def detect_repeating(all_edge_items: list[list[PageEdgeItem]],

for page_items in all_edge_items:
for item in page_items:
y_key = round(item.y / Y_TOLERANCE) * Y_TOLERANCE
y_buckets[y_key].append(item)
y_buckets[_y_bucket(item.bbox)].append(item)

repeating = set()
for y_key, items in y_buckets.items():
Expand Down Expand Up @@ -107,10 +112,36 @@ def detect_repeating(all_edge_items: list[list[PageEdgeItem]],

def strip_repeating(blocks: list[Block], repeating: set[tuple[float, str]],
) -> list[Block]:
"""Remove lines matching repeating header/footer patterns from blocks."""
"""Remove lines (or individual spans) matching repeating header/footer
patterns from blocks.

Spatial-path lines merge left/center/right header columns (one span
per column). The whole-line text then won't match any single pattern,
but each column-span will; per-span stripping handles that while
preserving non-header content that shares the header y-coordinate
(e.g. a one-off appendix title).
"""
if not repeating:
return blocks

patterns_by_y: dict[float, list[str]] = defaultdict(list)
for ry, rp in repeating:
patterns_by_y[ry].append(rp)

def _patterns_near(y_key: float) -> list[str]:
return (patterns_by_y.get(y_key - Y_TOLERANCE, [])
+ patterns_by_y.get(y_key, [])
+ patterns_by_y.get(y_key + Y_TOLERANCE, []))

def _matches(text: str, rpattern: str) -> bool:
if rpattern == text:
return True
if rpattern == "__PAGE_NUM__" and PAGE_NUM_RE.match(text):
return True
if rpattern == "__DOC_NUM__" and DOC_NUM_RE.search(text):
return True
return False

result = []
for block in blocks:
kept_lines = []
```diff
@@ -119,29 +150,36 @@ def strip_repeating(blocks: list[Block], repeating: set[tuple[float, str]],
             if not text:
                 kept_lines.append(line)
                 continue
-            y_center = (line.bbox[1] + line.bbox[3]) / 2.0
-            y_key = round(y_center / Y_TOLERANCE) * Y_TOLERANCE
-            stripped = False
-            for ry, rpattern in repeating:
-                if abs(y_key - ry) > Y_TOLERANCE:
-                    continue
-                if rpattern == text:
-                    stripped = True
-                    break
-                if rpattern == "__PAGE_NUM__" and PAGE_NUM_RE.match(text):
-                    stripped = True
-                    break
-                if rpattern == "__DOC_NUM__" and DOC_NUM_RE.search(text):
-                    stripped = True
-                    break
-            if not stripped:
+
+            line_patterns = _patterns_near(_y_bucket(line.bbox))
+            if not line_patterns:
                 kept_lines.append(line)
+                continue
+
+            if any(_matches(text, rp) for rp in line_patterns):
+                continue
+
+            kept_spans = []
+            stripped_any = False
+            for span in line.spans:
+                span_text = span.text.strip()
+                if not span_text:
+                    kept_spans.append(span)
+                    continue
+                span_patterns = _patterns_near(_y_bucket(span.bbox))
+                if any(_matches(span_text, rp) for rp in span_patterns):
+                    stripped_any = True
+                    continue
+                kept_spans.append(span)
+
+            if not any(sp.text.strip() for sp in kept_spans):
+                continue
+
+            kept_lines.append(
+                replace(line, spans=kept_spans) if stripped_any else line)
 
         if kept_lines:
-            result.append(Block(
-                lines=kept_lines,
-                bbox=block.bbox,
-                page_num=block.page_num,
-            ))
+            result.append(replace(block, lines=kept_lines))
     return result
```


44 changes: 33 additions & 11 deletions lib/pdf/mono.py
```diff
@@ -33,7 +33,12 @@
 
 _MONO_KEYWORDS = frozenset({"mono", "courier", "code", "consolas", "menlo"})
 
-_CAMEL_SPLIT_RE = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")
+_CAMEL_SPLIT_RE = re.compile(
+    r"(?<=[a-z])(?=[A-Z])"
+    r"|(?<=[A-Z])(?=[A-Z][a-z])"
+    r"|(?<=[A-Za-z])(?=\d)"
+    r"|(?<=\d)(?=[A-Za-z])"
+)
 
 _GLYPH_CV_THRESHOLD = 0.15
 _FAT_THIN_REJECT_RATIO = 1.3
```
Expand Down Expand Up @@ -170,24 +175,41 @@ def classify_monospace(
return False


_PROPAGATE_MONO_MAJORITY = 0.5


def propagate_monospace(mupdf_blocks: list[Block], spatial_blocks: list[Block],
dominant_font: str) -> None:
"""Apply spatial path's glyph-width monospace decisions to MuPDF spans.

Collects fonts classified as monospace by the spatial path (which has
per-character data), filters out the dominant body font (the single
most common font by character count), and sets monospace=True on
matching MuPDF spans. The dominant_font must be pre-lowercased.
Only propagates fonts whose spatial spans are mostly classified
monospace (by character count). Short spans of digits or thin
characters can false-positive the per-glyph signal, so requiring a
majority keeps proportional fonts (e.g. a regular text font with a
handful of "1" or "3.1" spans) out of the mono set.

The dominant_font is still discarded unless it passes a
name-based mono check, to avoid body text leaking into code blocks
when metrics happen to agree across a majority.
"""
mono_fonts: set[str] = set()
from collections import Counter
mono_chars: Counter[str] = Counter()
total_chars: Counter[str] = Counter()
for b in spatial_blocks:
for ln in b.lines:
for s in ln.spans:
if s.monospace and s.text.strip():
mono_fonts.add(s.font_name.lower())
# Only discard the dominant font if it is not itself a known monospace
# family. On code-heavy papers the code font may dominate by character
# count, and discarding it would suppress all code block detection.
if not s.text.strip():
continue
key = s.font_name.lower()
total_chars[key] += len(s.text)
if s.monospace:
mono_chars[key] += len(s.text)

mono_fonts = {
f for f, total in total_chars.items()
if total > 0 and mono_chars[f] / total >= _PROPAGATE_MONO_MAJORITY
}

if dominant_font and not classify_monospace(dominant_font):
mono_fonts.discard(dominant_font)
if not mono_fonts:
Expand Down