Description of the bug
With the layout feature enabled (import pymupdf.layout + page.get_layout()), Page.find_tables() can return a Table whose .cells is an empty list. Reading the public Table.bbox property then computes min(map(itemgetter(0), c)) over the empty cell list and raises ValueError: min() iterable argument is empty.
Because Table.bbox is part of the public API — and pymupdf4llm's to_markdown dereferences t.bbox for every detected table — a single zero-cell "phantom" table aborts the whole run on otherwise-valid PDFs (typically image-heavy / scanned slides that go through the layout path).
How to reproduce the bug
Minimal reproducible example (in-memory, no files). Requires the layout package: pip install pymupdf pymupdf-layout.
import pymupdf
import pymupdf.layout  # enables layout-aware table detection

# Eight short text fragments scattered like an OCR'd slide. The layout model
# reads the region as a table, but the grid finder extracts no cells from it.
PLACEMENTS = [
    (84, 620, "Cost", 10), (214, 280, "Net", 12),
    (88, 505, "12%", 9), (213, 378, "Margin", 11),
    (130, 245, "Margin", 10), (373, 156, "South", 8),
    (67, 222, "North", 11), (140, 475, "3.4", 11),
]

doc = pymupdf.open()
page = doc.new_page()  # default A4
for x, y, text, size in PLACEMENTS:
    page.insert_text((x, y), text, fontsize=size)

page.get_layout()
tables = page.find_tables()
print("tables found:", len(tables.tables))
for t in tables.tables:
    print("cells:", len(t.cells))
    print("bbox:", t.bbox)  # <-- raises ValueError for the zero-cell table
Actual output / traceback (pymupdf 1.27.2.3, pymupdf-layout 1.27.2.3):
tables found: 1
cells: 0
Traceback (most recent call last):
  ...
  File ".../pymupdf/table.py", line 1534, in bbox
    min(map(itemgetter(0), c)),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: min() iterable argument is empty
Cause. Table.bbox in pymupdf/table.py reduces over the cell list without guarding for an empty one:
@property
def bbox(self):
    c = self.cells
    return (
        min(map(itemgetter(0), c)),  # ValueError when c == []
        min(map(itemgetter(1), c)),
        max(map(itemgetter(2), c)),
        max(map(itemgetter(3), c)),
    )
find_tables() can append Table(page, cells=[]) when the layout model tags a region as a table but the grid finder extracts no cells from it. bbox does not handle that case.
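The failure can be shown without PyMuPDF installed. The sketch below lifts the reduction from Table.bbox into a stand-alone function; reduce_bbox is a hypothetical name used only for illustration, not a PyMuPDF API, and the cell tuples are made-up (x0, y0, x1, y1) values.

```python
from operator import itemgetter


def reduce_bbox(cells):
    """Same reduction as Table.bbox, as a free function (hypothetical helper)."""
    return (
        min(map(itemgetter(0), cells)),
        min(map(itemgetter(1), cells)),
        max(map(itemgetter(2), cells)),
        max(map(itemgetter(3), cells)),
    )


# Two adjacent cells: the reduction yields the rectangle enclosing both.
two_cells = [(10, 20, 50, 40), (50, 20, 90, 40)]
union = reduce_bbox(two_cells)

# An empty cell list: min() has nothing to reduce and raises ValueError.
try:
    reduce_bbox([])
    empty_error = None
except ValueError as exc:
    empty_error = exc
```

The exact wording of the ValueError message differs between Python versions, so code that wants to detect this case should catch the exception type rather than match the text.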
Expected behaviour. Either find_tables() should not emit zero-cell tables, or Table.bbox should return a degenerate/empty rect (e.g. (0, 0, 0, 0)) for a cell-less table instead of raising — so that iterating detected tables and reading .bbox (as pymupdf4llm does) is safe.
Potential fixes, in increasing order of "correctness":
- Guard Table.bbox: return a degenerate rect when there are no cells, so the property never reduces over an empty sequence:
@property
def bbox(self):
    c = self.cells
    if not c:
        return (0.0, 0.0, 0.0, 0.0)
    return (
        min(map(itemgetter(0), c)),
        min(map(itemgetter(1), c)),
        max(map(itemgetter(2), c)),
        max(map(itemgetter(3), c)),
    )
Smallest change; the zero-cell table still exists but its bbox is safe.
- Don't emit zero-cell tables from find_tables(): skip appending Table(page, cells=[]) when the grid finder extracts no cells from a layout-tagged region. A table with no cells carries no information, and this avoids having to special-case every cell-reducing accessor downstream. This looks like the cleaner fix.
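Until either fix lands, code that iterates tables itself can drop the cell-less ones before touching .bbox. This is a minimal sketch that assumes only that each table object exposes a .cells list, as in the report above. nonempty_tables and FakeTable are hypothetical names for illustration; FakeTable just mimics the unguarded property so the example runs without PyMuPDF.

```python
from operator import itemgetter


def nonempty_tables(tables):
    """Keep only tables that have at least one cell (hypothetical helper)."""
    return [t for t in tables if t.cells]


class FakeTable:
    """Stand-in for a detected table, with the same unguarded bbox reduction."""

    def __init__(self, cells):
        self.cells = cells

    @property
    def bbox(self):
        c = self.cells
        return (
            min(map(itemgetter(0), c)),
            min(map(itemgetter(1), c)),
            max(map(itemgetter(2), c)),
            max(map(itemgetter(3), c)),
        )


# One real-looking table and one zero-cell "phantom" table.
found = [FakeTable([(10, 20, 50, 40)]), FakeTable([])]

# Reading bbox is safe once the phantom table has been filtered out.
boxes = [t.bbox for t in nonempty_tables(found)]
```

With a real page, the same filter would be applied to page.find_tables().tables. It does not help callers such as pymupdf4llm that iterate the tables internally, which is why a fix inside PyMuPDF is still needed.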
Happy to open a PR for whichever approach you prefer.
Also reproduces on PyMuPDF 1.27.2 (with pymupdf-layout 1.27.2), in addition to the latest 1.27.2.3 selected below.
Related issues — prior reports of the same crash, all closed without a fix for this code path:
PyMuPDF version
1.27.2.3
Operating system
Linux
Python version
3.12