Definition/example prose wrapped in word-level backticks and concatenated into multi-thousand-character lines #7

@sentientsergio

Reproducer

Tested against tomd master at commit ad567e3.

Reproduce

tomd p3977r0.pdf --outdir out/
awk '{print length, NR}' out/p3977r0.md | sort -rn | head -5

The longest line in the output is several thousand characters and contains the entire region from Definition 5.1 through the end of section II concatenated.

Symptom

Numbered definitions and examples that should be separate paragraphs are concatenated, with every word in body text wrapped in single backticks. Excerpt:

**Definition** **5.5.** `A` `contract` `is` **disconnecting** `if` `and` `only` `if` `neither` `the` `primary` `nor` `secondary` `domains` `are` `empty,` `and` `for` `at` `least` `one` `element` `of` `the` `secondary` `domain` `1.` `A` `call` `is` `made` `that` `attempts` `to` `end` `the` `program;` `or` `2.` `Program` `execution` `continues` `indefinitely` `without` `return` `control` `to` `the` `caller` **Example** **5.5.a.** `A` `version` `of` `float` `sqrt(float)` `which,` `for` `negative` `numbers,` `is` `specified` `to` `call` `std::abort` `has` `a` `disconnecting` `contract.` ...

The pattern is: bold for the keyword (**Definition**, **Example**) and identifier/Latin label, then every word of the body wrapped in single backticks, with successive definitions and examples joined onto the same line.
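The pattern can also be detected mechanically, which may help when checking other papers for the same symptom. The following is a minimal sketch, not part of tomd: the function names, the 20-token minimum, and the 0.8 threshold are all assumptions chosen for illustration. It flags lines in which most whitespace-separated tokens are single words wrapped in backticks.

```python
import re

# A token counts as a "word-level code span" when it is a single
# whitespace-free run wrapped in one pair of backticks, e.g. `contract`.
WORD_SPAN = re.compile(r"^`[^`\s]+`$")


def backtick_word_ratio(line: str) -> float:
    """Fraction of whitespace-separated tokens that are single-word code spans."""
    tokens = line.split()
    if not tokens:
        return 0.0
    wrapped = sum(1 for tok in tokens if WORD_SPAN.match(tok))
    return wrapped / len(tokens)


def suspicious_lines(text: str, min_tokens: int = 20, threshold: float = 0.8):
    """Yield (line_number, ratio, length) for lines that look like wrapped prose.

    min_tokens and threshold are hypothetical defaults, not tuned values.
    """
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line.split()) < min_tokens:
            continue
        ratio = backtick_word_ratio(line)
        if ratio >= threshold:
            yield number, ratio, len(line)
```

Running `suspicious_lines` over the contents of out/p3977r0.md should surface the same long lines as the awk one-liner, along with the share of tokens on each line that are wrapped.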

Expected

Prose remains prose. Numbered definitions and examples are separate paragraphs/blocks. Words are not wrapped in inline code spans. docling on the same PDF produces clean prose paragraphs with proper line breaks between definitions.
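The "separate paragraphs/blocks" expectation can be phrased as a check: no Definition or Example marker should appear partway through a line. The sketch below assumes the markers keep the bold form seen in the excerpt (`**Definition** **5.5.**`); that format, the regular expression, and the function name are illustrative assumptions, and correct output may well format the labels differently.

```python
import re

# Matches the bold keyword plus bold label seen in the excerpt,
# e.g. "**Definition** **5.5.**" or "**Example** **5.5.a.**".
MARKER = re.compile(r"\*\*(Definition|Example)\*\*\s+\*\*([0-9A-Za-z.]+)\*\*")


def misplaced_markers(text: str):
    """Return (line_number, keyword, label) for markers that do not start their line."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in MARKER.finditer(line):
            # Anything other than whitespace before the marker means it was
            # joined onto a preceding definition or example.
            if line[:match.start()].strip():
                found.append((number, match.group(1), match.group(2)))
    return found
```

An empty result would mean every marker begins its own line. On the excerpt above, the check would report Example 5.5.a as joined onto the line that starts with Definition 5.5.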

Impact

  • Discovery treats the region as code, not prose. Word-level grammar and spelling defects inside become invisible to LLM scanning.
  • In paperlint pipeline-in-the-loop runs, three real grammar/spelling findings docling identified in this region were missed in the tomd-pipeline run, including:
    • "Program execution continues indefinitely without return control to the caller" (missing word "returning")
    • "the adverb reasonably is used where the adjective reasonable is required"
    • "the C++ standard macro FE_INVALID is written as FE INVALID"

Uncertainty signal

p3977r0.prompts.md is written for this paper (54KB), but it covers reconciliation regions starting from page 0; it does not single out the definitions/examples region as the problem area.

Hypothesis on root cause

This looks like one of three symptoms of the same classifier-confidence bug; see the two companion issues filed alongside this one (bullets becoming deep headings, and bibliography lists flattened). All three involve over-aggressive structural classification of ambiguous content.
