Add Python PDF RAG ingestion pipeline example by jdrhyne · Pull Request #192 · PSPDFKit/awesome-nutrient

jdrhyne · 2026-04-27T01:03:18Z

Summary

Companion code for the tutorial Build a PDF ingestion pipeline for AI apps in Python (currently in flight in PSPDFKit/nutrient-website#3873).

End-to-end ingestion pipeline:

PDF → Markdown (DWS Processor API) → heading-aware chunks → OpenAI embeddings → Chroma → Claude answer with cited sources

Use case: developers shipping RAG, agents, or document-aware LLM apps who want a working starting point on Nutrient's cloud Markdown extraction.
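
A minimal sketch of how the stages above could be composed, assuming each stage is passed in as a plain callable. The names `Chunk` and `ingest` and the stage signatures are hypothetical illustrations, not the actual layout of `run.py`; in the real example the callables would wrap nutrient-dws, the OpenAI embeddings client, and Chroma.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Chunk:
    source: str   # path of the PDF the chunk came from
    heading: str  # nearest heading, kept so answers can cite it
    text: str


def ingest(
    pdf_paths: Iterable[str],
    extract: Callable[[str], str],                       # PDF path -> Markdown
    chunk: Callable[[str, str], List[Chunk]],            # (source, Markdown) -> chunks
    embed: Callable[[List[str]], List[List[float]]],     # texts -> vectors
    store: Callable[[List[Chunk], List[List[float]]], None],
) -> int:
    """Run every PDF through extract -> chunk -> embed -> store.

    Returns the number of chunks stored. Stages are injected so the
    orchestration can be exercised without any live API calls.
    """
    total = 0
    for path in pdf_paths:
        markdown = extract(path)
        chunks = chunk(path, markdown)
        if not chunks:
            continue  # nothing extractable, skip the embedding call
        vectors = embed([c.text for c in chunks])
        store(chunks, vectors)
        total += len(chunks)
    return total
```

Injecting the stages keeps the orchestration testable with stubs, which is in the same spirit as the chunker tests that make no live API calls.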

What's in this PR

New folder pdf-rag-ingestion-pipeline-python/ containing:

README.md — 5-minute quickstart, decision matrix for the three Nutrient extraction paths, production checklist
pyproject.toml — Python 3.10+, nutrient-dws, chromadb, openai, anthropic
Makefile — install, ingest, ask, demo, lint, test
.env.example
ingestion/ — extract.py (PDF → Markdown via nutrient-dws), chunk.py (heading-aware splitter), embed.py (OpenAI), store.py (Chroma)
retrieval/ask.py — top-k retrieval + Claude answer with cited sources
run.py — end-to-end CLI
tests/test_chunk.py — pytest unit tests for the chunker (no live API calls)
LICENSE — modified BSD, matching the existing dws/ example
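
For readers unfamiliar with heading-aware splitting, here is a rough sketch of the idea using only the standard library. It is not the code in `ingestion/chunk.py`; `Section` and `split_by_headings` are hypothetical names, and the sketch assumes ATX-style (`#`) Markdown headings and ignores any size limit on chunks.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Section:
    heading_path: List[str]  # e.g. ["Install", "Linux"]
    text: str


def split_by_headings(markdown: str) -> List[Section]:
    """Split Markdown into sections, tracking the chain of parent headings."""
    sections: List[Section] = []
    path: List[Tuple[int, str]] = []  # (level, title) of the open headings
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append(Section([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # drop sibling or deeper headings
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return sections
```

Keeping the heading path on each section is one way to give the retrieval step something human-readable to cite alongside the source file.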

Placement

Adding at top level (pdf-rag-ingestion-pipeline-python/) to avoid invasive restructuring of the existing dws/ folder, which is itself a single Vite example. Happy to move under dws/ (or to a future dws-examples/ umbrella) if you want to refactor that area in a follow-up — naming is easy to change.

Test plan

cd pdf-rag-ingestion-pipeline-python && python -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]' succeeds on macOS and Linux
pytest -q passes (chunker tests use no live API calls)
ruff check . and ruff format --check . clean
cp .env.example .env, add real NUTRIENT_API_KEY / OPENAI_API_KEY / ANTHROPIC_API_KEY, drop a sample PDF in pdfs/, run python run.py and python -m retrieval.ask "..." end-to-end
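
Since the end-to-end step depends on three keys being set, a small preflight check can fail fast before any API call is made. This is a sketch only: `missing_keys` is a hypothetical helper, and it inspects the process environment rather than parsing `.env` itself, so it assumes the keys have already been loaded or exported.

```python
import os
from typing import List, Mapping, Optional

# Key names as listed in the test plan above.
REQUIRED_KEYS = ("NUTRIENT_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or blank, in declaration order."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required API keys are set.")
```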

Companion code for the tutorial "Build a PDF ingestion pipeline for AI apps in Python" on nutrient.io. Walks through PDF -> Markdown via the DWS Processor API -> heading-aware chunking -> OpenAI embeddings -> Chroma -> Claude answer with cited sources. Modified BSD license, matching the rest of the repo. Includes a Makefile, .env.example, pytest unit tests for the chunker, and a 5-minute quickstart.
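
To illustrate the last two stages (top-k retrieval, then an answer with cited sources), here is a dependency-free sketch. It uses in-memory cosine similarity as a stand-in for the Chroma query and only builds the prompt text; it does not call Claude. `cosine`, `top_k`, `build_prompt`, and the prompt wording are all hypothetical and are not taken from `retrieval/ask.py`.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[str, str, Sequence[float]]  # (source, text, embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], records: List[Record], k: int = 3) -> List[Record]:
    """Return the k records most similar to the query vector."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r[2]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: List[Record]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (source, text, _) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered context below and cite sources as [n].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Numbering the retrieved chunks in the prompt is one common way to make citations checkable, because each `[n]` in the answer maps back to a specific source file.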


Copilot AI review requested due to automatic review settings April 27, 2026 01:03

jdrhyne requested review from a team as code owners April 27, 2026 01:03

jdrhyne requested review from ritz078 and sc0 April 27, 2026 01:03


jdrhyne marked this pull request as draft April 27, 2026 13:01

jdrhyne wants to merge 1 commit into master from examples/dws-pdf-rag-ingestion-python
