Recover PDF text truncated by inline images using PyMuPDF fallback #2092
Muhtasim-Munif-Fahim wants to merge 1 commit
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an optional PyMuPDF-based text extraction fallback to improve PDF conversion when primary extractors (pdfplumber/pdfminer) appear to return truncated text, particularly around inline images.
Changes:
- Introduces an optional PyMuPDF extraction path and heuristics to prefer it when primary output looks truncated.
- Adds a regression test to validate choosing PyMuPDF when primary extraction misses post-image text.
- Documents and packages PyMuPDF as an optional extra (and includes it in the `all` extra).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/markitdown/tests/test_module_misc.py | Adds a test that simulates truncation and asserts the PyMuPDF path is preferred. |
| packages/markitdown/src/markitdown/converters/_pdf_converter.py | Adds PyMuPDF extraction helper and selection heuristics based on images + output length. |
| packages/markitdown/pyproject.toml | Adds pymupdf optional dependency (extra + included in all). |
| packages/markitdown/README.md | Documents how to install the optional PyMuPDF extra. |
```python
monkeypatch.setattr(pdf_converter_module.pdfplumber, "open", lambda _: _FakePdf())
monkeypatch.setattr(
    pdf_converter_module.pdfminer.high_level,
    "extract_text",
    lambda _: "BEFORE_IMAGE: this text should be extracted",
)
monkeypatch.setattr(
    pdf_converter_module.fitz,
    "open",
    lambda *args, **kwargs: _FakePyMuPdfDoc(),
)
```
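The stub classes referenced in the test (`_FakePdf`, `_FakePyMuPdfDoc`) are not shown in this excerpt. As a rough illustration only, a PyMuPDF stand-in might look like the sketch below. It assumes the converter only opens the document as a context manager, iterates its pages, and calls `get_text()` on each. The class names `FakePage` and `FakePyMuPdfDoc` and the `AFTER_IMAGE` string are hypothetical, not taken from the PR.

```python
class FakePage:
    """Hypothetical page stub that returns canned text."""

    def __init__(self, text):
        self._text = text

    def get_text(self, *args, **kwargs):
        # Accept and ignore any extraction options the caller passes.
        return self._text


class FakePyMuPdfDoc:
    """Hypothetical document stub: a context manager that iterates over pages."""

    def __init__(self, page_texts=None):
        if page_texts is None:
            # Assumed fixture text: includes the part the primary extractor would miss.
            page_texts = [
                "BEFORE_IMAGE: this text should be extracted\n"
                "AFTER_IMAGE: this text should also be extracted"
            ]
        self._pages = [FakePage(text) for text in page_texts]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Returning False lets exceptions raised inside the `with` body propagate.
        return False

    def __iter__(self):
        return iter(self._pages)
```

A stub of this shape lets the test assert that text missing from the primary extractor's output appears in the final result, without requiring PyMuPDF or a real PDF fixture.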
```python
if fitz is not None and has_images and markdown and len(markdown) < 2048:
    try:
        pymupdf_markdown = _extract_with_pymupdf(pdf_bytes)
    except Exception:
        pymupdf_markdown = None
    else:
        if pymupdf_markdown is not None:
            primary_length = len(markdown.strip())
            pymupdf_length = len(pymupdf_markdown.strip())
            if pymupdf_length > primary_length and (
                primary_length == 0
                or pymupdf_length >= primary_length * 1.5
                or pymupdf_length - primary_length >= 200
            ):
```
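The selection rule in this hunk can be read as a pure function of the two candidate strings. The following is a minimal sketch rather than the PR's code: the function name `prefer_fallback` and its keyword parameters are hypothetical, while the default thresholds (2048 characters, a 1.5x ratio, a 200-character gain) are copied from the snippet. The `fitz is not None` and `has_images` guards are left to the caller.

```python
from typing import Optional


def prefer_fallback(
    primary: str,
    fallback: Optional[str],
    *,
    max_primary_chars: int = 2048,
    min_ratio: float = 1.5,
    min_gain_chars: int = 200,
) -> bool:
    """Return True when the fallback text looks substantially more complete."""
    # Mirror the outer guard: only second-guess a non-empty, short primary result.
    if not primary or len(primary) >= max_primary_chars:
        return False
    if fallback is None:
        return False

    primary_length = len(primary.strip())
    fallback_length = len(fallback.strip())

    # The fallback must be strictly longer before any threshold is considered.
    if fallback_length <= primary_length:
        return False

    return (
        primary_length == 0
        or fallback_length >= primary_length * min_ratio
        or fallback_length - primary_length >= min_gain_chars
    )
```

Under these thresholds, a 100-character primary result is replaced by a 150-character fallback (ratio rule) but not by a 120-character one, and a 1000-character primary is replaced by a 1200-character fallback (absolute-gain rule). A completely empty primary string never triggers the fallback, because the outer guard requires `markdown` to be truthy; a whitespace-only primary does.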
```python
pdf_bytes.seek(0)
chunks: list[str] = []
with fitz.open(stream=pdf_bytes.read(), filetype="pdf") as doc:
```
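The hunk above cuts off at the `with` statement, so the rest of the helper is not visible here. One plausible shape for such a helper is sketched below, purely as an assumption: the function name `extract_with_pymupdf`, the guarded import, and the choice to join non-empty pages with blank lines are illustrative and may differ from what the PR actually does. Only `fitz.open(stream=..., filetype="pdf")` is taken from the snippet; `page.get_text()` is PyMuPDF's standard plain-text extraction call.

```python
from typing import BinaryIO, Optional

try:
    import fitz  # PyMuPDF, treated here as an optional dependency
except ImportError:
    fitz = None


def extract_with_pymupdf(pdf_stream: BinaryIO) -> Optional[str]:
    """Return the text of all pages, or None when PyMuPDF is not installed."""
    if fitz is None:
        return None

    # Rewind, since an earlier extractor may have consumed the stream.
    pdf_stream.seek(0)
    chunks: list[str] = []
    with fitz.open(stream=pdf_stream.read(), filetype="pdf") as doc:
        for page in doc:
            text = page.get_text().strip()
            if text:
                chunks.append(text)

    # Assumed convention: separate pages with a blank line.
    return "\n\n".join(chunks)
```

Returning `None` when the import fails keeps the dependency optional: callers can treat a missing PyMuPDF the same way as a failed extraction and keep the primary output.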
Fixes #1870.
Some PDFs with inline images cause pdfplumber/pdfminer to silently drop text that appears after the image. When the primary extraction path returns a much shorter body on image-bearing pages, this change optionally retries with PyMuPDF and prefers that output if it is substantially longer.
Also documents the optional pymupdf extra and adds a regression test that simulates the truncation path.