Add pypdfium2 as optional PDF parser by rejojer · Pull Request #270 · VectifyAI/PageIndex

rejojer · 2026-05-11T08:04:36Z

Summary

Adds opt-in pdf_parser="pypdfium2" alongside the existing PyPDF2 default
Default behavior is unchanged — existing users see no diff; pypdfium2 is lazy-imported only when selected
pypdfium2 yields cleaner text (no broken words such as "Reser ve", and correct Unicode such as †/‡) and is roughly 3-5x faster on the corpus in examples/documents/

Changes

utils.py: new read_pdf_pages(doc, pdf_parser) helper with PyPDF2 / pypdfium2 / PyMuPDF branches; get_page_tokens now a thin wrapper over it
client.py and retrieve.py: route their direct PyPDF2.PdfReader calls through read_pdf_pages so the parser choice flows end-to-end (parser is recorded on the document so cache-miss reads stay consistent)
page_index() and PageIndexClient.__init__() accept pdf_parser parameter (default None, falls back to config.yaml)
config.yaml: pdf_parser: "PyPDF2" default
run_pageindex.py: --pdf-parser CLI arg
requirements.txt: pypdfium2 noted as optional (no new hard dependency)

Usage

# default unchanged
client = PageIndexClient()

# opt-in pypdfium2 (after `pip install pypdfium2`)
client = PageIndexClient(pdf_parser="pypdfium2")

python run_pageindex.py --pdf_path doc.pdf --pdf-parser pypdfium2
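For the command-line form above, the flag could be wired with `argparse` along these lines. This is a sketch under assumptions: `build_arg_parser`, `SUPPORTED_PARSERS`, and the help strings are hypothetical, and the real `run_pageindex.py` defines more options than shown here.

```python
import argparse

# Assumed set of accepted names, based on the parsers mentioned in this PR.
SUPPORTED_PARSERS = ("PyPDF2", "pypdfium2", "PyMuPDF")


def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the --pdf-parser option")
    parser.add_argument("--pdf_path", type=str, help="Path to the PDF file")
    parser.add_argument(
        "--pdf-parser",
        choices=SUPPORTED_PARSERS,
        default=None,  # None means "use the project default"
        help="PDF text extraction backend",
    )
    return parser


# argparse maps the dashed flag --pdf-parser to the attribute args.pdf_parser.
args = build_arg_parser().parse_args(
    ["--pdf_path", "doc.pdf", "--pdf-parser", "pypdfium2"]
)
print(args.pdf_path, args.pdf_parser)
```

Restricting the flag with `choices` rejects unknown parser names at the command line, before any PDF is opened.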

Test plan

ConfigLoader().load().pdf_parser == "PyPDF2" (default preserved)
read_pdf_pages(pdf) byte-identical to PyPDF2.PdfReader(pdf).pages[*].extract_text()
get_page_tokens(pdf) with no parser arg produces identical output to pre-PR behavior
pdf_parser="pypdfium2" extracts pages, normalizes \r\n to \n, accepts both file paths and BytesIO
PyMuPDF branch still works
Unknown parser name raises clear ValueError
Missing pypdfium2 raises ImportError with pip install pypdfium2 hint
page_index() and PageIndexClient.__init__() signatures include pdf_parser
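The "missing pypdfium2" item in the plan above can be checked without uninstalling anything. The sketch below simulates a missing package by placing `None` in `sys.modules`, which makes Python's import machinery raise `ModuleNotFoundError` (a subclass of `ImportError`). `load_pypdfium2` and `missing_pypdfium2_message` are hypothetical stand-ins, not functions from the PR, and the hint wording is an assumption.

```python
import importlib
import sys
from typing import Optional
from unittest import mock


def load_pypdfium2():
    """Hypothetical lazy loader that adds an install hint on failure."""
    try:
        return importlib.import_module("pypdfium2")
    except ImportError as exc:
        raise ImportError(
            "pypdfium2 is not installed; run: pip install pypdfium2"
        ) from exc


def missing_pypdfium2_message() -> Optional[str]:
    """Return the error text raised when pypdfium2 cannot be imported."""
    # A None entry in sys.modules makes the import fail, whether or not the
    # package is really installed. patch.dict restores sys.modules afterwards.
    with mock.patch.dict(sys.modules, {"pypdfium2": None}):
        try:
            load_pypdfium2()
        except ImportError as exc:
            return str(exc)
    return None


message = missing_pypdfium2_message()
print(message)
```

The same pattern would apply to the real `read_pdf_pages` call path, assuming it performs its pypdfium2 import at call time as the summary states.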

Default behavior unchanged. Users can opt in via pdf_parser="pypdfium2" for cleaner text extraction (no broken words, correct Unicode) and 3-5x faster parsing. PyPDF2 remains the only required dependency; pypdfium2 is lazy-imported.

rejojer added 8 commits May 11, 2026 16:04

Add pypdfium2 as optional PDF parser

9539fe7

Keep pdf_parser default in code, not config.yaml

3b2ddef

Drop unnecessary docstring

de58581

Take pdf_parser out of ConfigLoader, use plain function arg

1629ef4

Centralize default parser as DEFAULT_PDF_PARSER constant

ec1aaca

Move pdf_parser off doc dict, pass via call args

108cb28

Make PageIndexClient parser-agnostic, pdf_parser per index() call

63e11ef

Replace pdf_parser plumbing with mutable DEFAULT_PDF_PARSER global

4dec4d6

github-code-quality Bot found potential problems May 11, 2026

Comment thread run_pageindex.py Fixed

Use single import style for pageindex.utils in run_pageindex

7b15dea

rejojer wants to merge 9 commits into main from add-pypdfium2-parser
