Add Python PDF RAG ingestion pipeline example #192
Draft
Companion code for the tutorial "Build a PDF ingestion pipeline for AI apps in Python" on nutrient.io. Walks through PDF -> Markdown via the DWS Processor API -> heading-aware chunking -> OpenAI embeddings -> Chroma -> Claude answer with cited sources. Apache-style modified BSD license matching the rest of the repo. Includes Makefile, .env.example, pytest unit tests for the chunker, and a 5-minute quickstart.
Summary
Companion code for the tutorial "Build a PDF ingestion pipeline for AI apps in Python" (currently in flight in PSPDFKit/nutrient-website#3873).
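End-to-end ingestion pipeline: PDF → Markdown via the DWS Processor API → heading-aware chunking → OpenAI embeddings → Chroma → Claude answer with cited sources.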
Use case: developers shipping RAG, agents, or document-aware LLM apps who want a working starting point on Nutrient's cloud Markdown extraction.
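For reviewers who want a feel for the heading-aware chunking step before opening the diff, here is a minimal sketch of the idea. It is not the code in `ingestion/chunk.py`: the names `Chunk`, `chunk_markdown`, and `max_chars` are hypothetical, and the sketch assumes ATX-style (`#`) headings and does not special-case `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass

# Matches ATX headings such as "## Install" (assumption: no Setext headings).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: tuple[str, ...]  # e.g. ("Guide", "Install")
    text: str


def chunk_markdown(markdown: str, max_chars: int = 1200) -> list[Chunk]:
    """Split Markdown at headings, keeping the heading trail with each chunk."""
    chunks: list[Chunk] = []
    path: list[str] = []    # heading trail for the section being read
    buffer: list[str] = []  # body lines of the section being read

    def flush() -> None:
        body = "\n".join(buffer).strip()
        buffer.clear()
        if not body:
            return
        # Oversized sections are split on paragraph boundaries.
        piece = ""
        for para in re.split(r"\n\s*\n", body):
            if piece and len(piece) + len(para) + 2 > max_chars:
                chunks.append(Chunk(tuple(path), piece))
                piece = para
            else:
                piece = f"{piece}\n\n{para}" if piece else para
        chunks.append(Chunk(tuple(path), piece))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under its own heading trail
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2))
        else:
            buffer.append(line)
    flush()
    return chunks
```

Carrying the heading trail on every chunk is one way to give the answer step something to cite; a single paragraph longer than `max_chars` is left whole in this sketch.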
What's in this PR
New folder `pdf-rag-ingestion-pipeline-python/` containing:
- `README.md`: 5-minute quickstart, decision matrix for the three Nutrient extraction paths, production checklist
- `pyproject.toml`: Python 3.10+, `nutrient-dws`, `chromadb`, `openai`, `anthropic`
- `Makefile`: `install`, `ingest`, `ask`, `demo`, `lint`, `test`
- `.env.example`
- `ingestion/`: `extract.py` (PDF → Markdown via `nutrient-dws`), `chunk.py` (heading-aware splitter), `embed.py` (OpenAI), `store.py` (Chroma)
- `retrieval/ask.py`: top-k retrieval + Claude answer with cited sources
- `run.py`: end-to-end CLI
- `tests/test_chunk.py`: pytest unit tests for the chunker (no live API calls)
- `LICENSE`: modified BSD, matching the existing `dws/` example

Placement
Adding at top level (`pdf-rag-ingestion-pipeline-python/`) to avoid invasive restructuring of the existing `dws/` folder, which is itself a single Vite example. Happy to move under `dws/` (or to a future `dws-examples/` umbrella) if you want to refactor that area in a follow-up; naming is easy to change.

Test plan
- `cd pdf-rag-ingestion-pipeline-python && python -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'` succeeds on macOS and Linux
- `pytest -q` passes (chunker tests use no live API calls)
- `ruff check .` and `ruff format --check .` are clean
- `cp .env.example .env`, add real `NUTRIENT_API_KEY` / `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`, drop a sample PDF in `pdfs/`, run `python run.py` and `python -m retrieval.ask "..."` end-to-end
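
The end-to-end item needs live API keys, but the ranking and citation logic behind top-k retrieval can be reasoned about offline. The sketch below is illustrative only and is not the code in `retrieval/ask.py`: plain-Python cosine similarity stands in for the Chroma query, hand-written vectors stand in for OpenAI embeddings, and the names `cosine`, `top_k`, and `build_prompt` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, records, k=3):
    """Rank (source, text, vector) records by similarity to the query vector."""
    scored = [(cosine(query_vec, vec), source, text) for source, text, vec in records]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(question: str, hits) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({source}) {text}"
        for i, (_score, source, text) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered sources and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Toy records: (source label, chunk text, stand-in embedding).
records = [
    ("a.pdf#Install", "Run pip install.", [1.0, 0.0]),
    ("a.pdf#Usage", "Call run().", [0.0, 1.0]),
    ("b.pdf#FAQ", "Mixed notes.", [0.7, 0.7]),
]
hits = top_k([1.0, 0.1], records, k=2)
prompt = build_prompt("How do I install it?", hits)
```

In the real pipeline the vector store performs the nearest-neighbour search; the point here is only that each retrieved chunk keeps a source label that the prompt can expose for citation.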