Add Python PDF RAG ingestion pipeline example #192

Draft
jdrhyne wants to merge 1 commit into master from examples/dws-pdf-rag-ingestion-python

@jdrhyne jdrhyne commented Apr 27, 2026

Summary

Companion code for the tutorial "Build a PDF ingestion pipeline for AI apps in Python" (currently in flight in PSPDFKit/nutrient-website#3873).

End-to-end ingestion pipeline:

PDF → Markdown (DWS Processor API) → heading-aware chunks → OpenAI embeddings → Chroma → Claude answer with cited sources

Use case: developers shipping RAG, agents, or document-aware LLM apps who want a working starting point on Nutrient's cloud Markdown extraction.
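
To make the "heading-aware chunks" stage concrete, here is a minimal sketch of one way such a splitter could work. It is not the code in ingestion/chunk.py: the names `Chunk` and `chunk_markdown`, the `max_chars` limit, and the paragraph-based fallback are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" .. "###### Title"). This sketch does not
# skip "#" lines inside fenced code blocks; a real splitter probably should.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: tuple[str, ...]  # e.g. ("Guide", "Install")
    text: str


def chunk_markdown(markdown: str, max_chars: int = 1200) -> list[Chunk]:
    """Split Markdown at headings; split oversized sections at paragraph breaks."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, title)
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        titles = tuple(title for _, title in path)
        current = ""
        for para in re.split(r"\n\s*\n", body):
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(Chunk(titles, current))
                current = para
            else:
                current = candidate
        chunks.append(Chunk(titles, current))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # emit the section that just ended, under the old path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling and deeper headings
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Keeping the heading path on each chunk means it can be stored as metadata next to the embedding and shown later as the cited source. A single paragraph longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence.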

What's in this PR

New folder pdf-rag-ingestion-pipeline-python/ containing:

  • README.md — 5-minute quickstart, decision matrix for the three Nutrient extraction paths, production checklist
  • pyproject.toml — Python 3.10+, nutrient-dws, chromadb, openai, anthropic
  • Makefile with install, ingest, ask, demo, lint, and test targets
  • .env.example
  • ingestion/extract.py (PDF → Markdown via nutrient-dws), chunk.py (heading-aware splitter), embed.py (OpenAI), store.py (Chroma)
  • retrieval/ask.py — top-k retrieval + Claude answer with cited sources
  • run.py — end-to-end CLI
  • tests/test_chunk.py — pytest unit tests for the chunker (no live API calls)
  • LICENSE — modified BSD, matching the existing dws/ example
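
As a rough illustration of the retrieval step listed above (top-k retrieval, then an answer with cited sources), the sketch below ranks stored chunks by cosine similarity and builds a prompt with numbered sources. It is a stand-in under stated assumptions, not the contents of retrieval/ask.py: brute-force similarity over an in-memory list replaces the Chroma query, no LLM is called, and the names `cosine`, `top_k`, `build_prompt`, and the record keys are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], records: list[dict], k: int = 3) -> list[dict]:
    """Return the k records most similar to the query.

    Each record is assumed to be a dict with 'embedding', 'text', and 'source'.
    """
    ranked = sorted(
        records,
        key=lambda record: cosine(query_vec, record["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    sources = "\n\n".join(
        f"[{i}] ({hit['source']})\n{hit['text']}"
        for i, hit in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite sources as [n] after each claim.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

Numbering the sources in the prompt is one simple way to get citations that map back to specific chunks; the example itself may format sources differently.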

Placement

Adding at top level (pdf-rag-ingestion-pipeline-python/) to avoid invasive restructuring of the existing dws/ folder, which is itself a single Vite example. Happy to move under dws/ (or to a future dws-examples/ umbrella) if you want to refactor that area in a follow-up — naming is easy to change.

Test plan

  • cd pdf-rag-ingestion-pipeline-python && python -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]' succeeds on macOS and Linux
  • pytest -q passes (chunker tests use no live API calls)
  • ruff check . and ruff format --check . clean
  • cp .env.example .env, add real NUTRIENT_API_KEY / OPENAI_API_KEY / ANTHROPIC_API_KEY, drop a sample PDF in pdfs/, run python run.py and python -m retrieval.ask "..." end-to-end

Companion code for the tutorial "Build a PDF ingestion pipeline for AI
apps in Python" on nutrient.io. Walks through PDF -> Markdown via the
DWS Processor API -> heading-aware chunking -> OpenAI embeddings ->
Chroma -> Claude answer with cited sources.

Modified BSD license matching the rest of the repo.
Includes Makefile, .env.example, pytest unit tests for the chunker,
and a 5-minute quickstart.
Copilot AI review requested due to automatic review settings April 27, 2026 01:03
@jdrhyne jdrhyne requested review from a team as code owners April 27, 2026 01:03
@jdrhyne jdrhyne requested review from ritz078 and sc0 April 27, 2026 01:03

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@jdrhyne jdrhyne marked this pull request as draft April 27, 2026 13:01