Add pypdfium2 as optional PDF parser#270

Open: rejojer wants to merge 9 commits into main from add-pypdfium2-parser

rejojer (Member) commented May 11, 2026

Summary

  • Adds opt-in pdf_parser="pypdfium2" alongside the existing PyPDF2 default
  • Default behavior is unchanged — existing users see no diff; pypdfium2 is lazy-imported only when selected
  • pypdfium2 yields cleaner text (no broken words such as "Reser ve", correct Unicode like /) and is roughly 3-5x faster on the corpus in examples/documents/

Changes

  • utils.py: new read_pdf_pages(doc, pdf_parser) helper with PyPDF2 / pypdfium2 / PyMuPDF branches; get_page_tokens is now a thin wrapper over it
  • client.py and retrieve.py: route their direct PyPDF2.PdfReader calls through read_pdf_pages so the parser choice flows end-to-end (parser is recorded on the document so cache-miss reads stay consistent)
  • page_index() and PageIndexClient.__init__() accept a pdf_parser parameter (default None, which falls back to config.yaml)
  • config.yaml: pdf_parser: "PyPDF2" default
  • run_pageindex.py: --pdf-parser CLI arg
  • requirements.txt: pypdfium2 noted as optional (no new hard dependency)
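
As a rough illustration of the dispatch described for read_pdf_pages, the sketch below shows one way the three branches could be laid out. It is not the code from this PR: the function name read_pdf_pages_sketch, the normalize_newlines helper, and the return shape (one string per page) are assumptions made for the example, and every backend is imported lazily here so the module loads even when only one parser is installed.

```python
from io import BytesIO


def normalize_newlines(text):
    """Convert Windows-style line endings to plain newlines."""
    return text.replace("\r\n", "\n")


def read_pdf_pages_sketch(doc, pdf_parser="PyPDF2"):
    """Return one text string per page (this return shape is an assumption).

    `doc` may be a file path or a BytesIO object.
    """
    if pdf_parser == "PyPDF2":
        import PyPDF2  # imported lazily, like the other backends in this sketch

        reader = PyPDF2.PdfReader(doc)
        return [page.extract_text() for page in reader.pages]

    if pdf_parser == "pypdfium2":
        import pypdfium2 as pdfium  # optional dependency, only needed here

        pdf = pdfium.PdfDocument(doc)
        try:
            texts = []
            for index in range(len(pdf)):
                textpage = pdf[index].get_textpage()
                texts.append(normalize_newlines(textpage.get_text_range()))
            return texts
        finally:
            pdf.close()

    if pdf_parser == "PyMuPDF":
        import fitz  # PyMuPDF's import name

        if isinstance(doc, BytesIO):
            # In-memory PDFs are passed as a byte stream with an explicit type.
            pdf = fitz.open(stream=doc.getvalue(), filetype="pdf")
        else:
            pdf = fitz.open(doc)
        with pdf:
            return [page.get_text() for page in pdf]

    raise ValueError(f"Unknown pdf_parser: {pdf_parser!r}")
```

Keeping the imports inside each branch is what lets pypdfium2 stay out of the hard requirements: a user who never selects it never triggers the import.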

Usage

# default unchanged
client = PageIndexClient()

# opt-in pypdfium2 (after `pip install pypdfium2`)
client = PageIndexClient(pdf_parser="pypdfium2")

# command line
python run_pageindex.py --pdf_path doc.pdf --pdf-parser pypdfium2
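
The fallback behind these calls (an explicit argument wins, otherwise the configured default applies) can be sketched as follows. The resolve_pdf_parser function and the CONFIG_DEFAULTS dict are hypothetical stand-ins for the example; the dict takes the place of the values loaded from config.yaml.

```python
# Stand-in for the values loaded from config.yaml (assumption for this sketch).
CONFIG_DEFAULTS = {"pdf_parser": "PyPDF2"}


def resolve_pdf_parser(pdf_parser=None, config=CONFIG_DEFAULTS):
    """Return the explicit parser name, or the configured default if None."""
    if pdf_parser is not None:
        return pdf_parser
    return config["pdf_parser"]


print(resolve_pdf_parser())             # PyPDF2
print(resolve_pdf_parser("pypdfium2"))  # pypdfium2
```

Using None as the sentinel keeps "not specified" distinct from any real parser name, so a caller who passes nothing always gets whatever the config says.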

Test plan

  • ConfigLoader().load().pdf_parser == "PyPDF2" (default preserved)
  • read_pdf_pages(pdf) output is byte-identical to PyPDF2.PdfReader(pdf).pages[*].extract_text()
  • get_page_tokens(pdf) with no parser arg produces identical output to pre-PR behavior
  • pdf_parser="pypdfium2" extracts pages, normalizes \r\n to \n, accepts both file paths and BytesIO
  • PyMuPDF branch still works
  • Unknown parser name raises clear ValueError
  • Missing pypdfium2 raises ImportError with pip install pypdfium2 hint
  • page_index() and PageIndexClient.__init__() signatures include pdf_parser
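
The missing-dependency check in the plan above can be exercised without uninstalling anything. The sketch below is one possible approach, not the PR's test code: import_pypdfium2 and missing_dependency_message are hypothetical helpers, and the error wording is invented for the example. It relies on standard Python behavior: a None entry in sys.modules makes the corresponding import raise ImportError.

```python
import sys

_ABSENT = object()  # sentinel: pypdfium2 was not in sys.modules beforehand


def import_pypdfium2():
    """Import pypdfium2, re-raising with an install hint if it is missing."""
    try:
        import pypdfium2
    except ImportError as exc:
        raise ImportError(
            "pdf_parser='pypdfium2' requires the optional package; "
            "install it with: pip install pypdfium2"
        ) from exc
    return pypdfium2


def missing_dependency_message():
    """Simulate an environment without pypdfium2 and return the error text."""
    saved = sys.modules.get("pypdfium2", _ABSENT)
    sys.modules["pypdfium2"] = None  # a None entry makes the import fail
    try:
        import_pypdfium2()
    except ImportError as exc:
        return str(exc)
    finally:
        # Restore the original import state so later imports are unaffected.
        if saved is _ABSENT:
            del sys.modules["pypdfium2"]
        else:
            sys.modules["pypdfium2"] = saved
    return None
```

Because the original sys.modules entry is restored in the finally block, the check behaves the same whether or not pypdfium2 is actually installed.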
