From 557d7216e013dd52b25f393ff8d54319295be046 Mon Sep 17 00:00:00 2001 From: Jamie Lemon Date: Mon, 20 Apr 2026 13:28:13 +0100 Subject: [PATCH 1/2] Updates README with more comprehensive information. --- README.md | 759 ++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 734 insertions(+), 25 deletions(-) diff --git a/README.md b/README.md index 4bc792175..6743c21ce 100644 --- a/README.md +++ b/README.md @@ -1,60 +1,769 @@ +

+ + PyMuPDF + +

+ # PyMuPDF -**PyMuPDF** is a high performance **Python** library for data extraction, analysis, conversion & manipulation of [PDF (and other) documents](https://pymupdf.readthedocs.io/en/latest/the-basics.html#supported-file-types). +

+ pymupdf%2FPyMuPDF | Trendshift +

+ +[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://pymupdf.readthedocs.io) +[![PyPI Version](https://img.shields.io/pypi/v/pymupdf?color=blue&label=PyPI)](https://pypi.org/project/PyMuPDF/) +[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pymupdf)](https://pypi.org/project/pymupdf/) +[![License AGPL](https://img.shields.io/github/license/pymupdf/pymupdf)](https://github.com/pymupdf/PyMuPDF/blob/master/COPYING) +[![PyPI Downloads](https://static.pepy.tech/badge/pymupdf/month)](https://pepy.tech/projects/pymupdf) +[![Github Stars](https://img.shields.io/github/stars/pymupdf/PyMuPDF?style=social)](https://github.com/pymupdf/PyMuPDF/stargazers) +[![Discord](https://img.shields.io/discord/770681584617652264?color=6A7EC2&logo=discord&logoColor=ffffff)](https://pymupdf.io/discord/artifex/) +[![Forum](https://img.shields.io/badge/Forum-ff6600?logo=python&logoColor=ffffff)](https://forum.mupdf.com/c/general/4) +[![Twitter](https://img.shields.io/twitter/follow/pymupdf4llm)](https://x.com/pymupdf4llm) +[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97_Hugging_Face-007ec6)](https://huggingface.co/artifex-software) +[![Demo](https://img.shields.io/badge/PyMuPDF4LLM-live?badge&label=DEMO&logo=python&logoColor=ffffff)](https://demo.pymupdf.io) + +**The PDF engine behind over 50 million monthly downloads, powering AI pipelines worldwide.** + +**PyMuPDF is a high-performance Python library for data extraction, analysis, conversion, rendering and manipulation of PDF (and other) documents.** Built on top of MuPDF — a lightweight, fast C engine — PyMuPDF gives you precise, low-level control over documents alongside high-level convenience APIs. No mandatory external dependencies. + +[![Star on GitHub](https://img.shields.io/github/stars/pymupdf/PyMuPDF.svg?style=for-the-badge&label=Star&logo=github)](https://github.com/pymupdf/PyMuPDF/) + +--- + +## Why PyMuPDF? 
+ +- **Fast** — powered by [MuPDF](https://mupdf.com/) , a best-in-class C rendering engine +- **Accurate** — pixel-perfect text extraction with font, color, and position metadata +- **Versatile** — read, write, annotate, redact, merge, split, and convert documents +- **LLM-ready** — native Markdown output via [PyMuPDF4LLM](https://pypi.org/project/pymupdf4llm/) for RAG and AI pipelines +- **No mandatory dependencies** — `pip install pymupdf` and you're done + +--- + +## Installation + +```bash +pip install pymupdf +``` + +Wheels are available for **Windows**, **macOS**, and **Linux** on Python 3.10–3.14. If no pre-built wheel exists for your platform, pip will compile from source (requires a C/C++ toolchain). + +### Optional extras + +| Package | Purpose | +|---|---| +| `pymupdf-fonts` | Extended font collection for text output | +| `pymupdf4llm` | LLM/RAG-optimised Markdown and JSON extraction | +| `pymupdfpro` | Adds Office document support | +| `tesseract-ocr` | OCR for scanned pages and images (separate install) | + +```bash +# More fonts +pip install pymupdf-fonts + +# LLM-ready extraction +pip install pymupdf4llm + +# Office support +pip install pymupdfpro + +# OCR (Tesseract must be installed separately) +# macOS +brew install tesseract + +# Ubuntu / Debian +sudo apt install tesseract-ocr +``` + +--- + +## Supported File Formats + +### Input + +| Category | Formats | +|---|---| +| PDF & derivatives | PDF, XPS, EPUB, CBZ, MOBI, FB2, SVG, TXT | +| Images | PNG, JPEG, BMP, TIFF, GIF, and more | +| Microsoft Office *(Pro)* | DOC, DOCX, XLS, XLSX, PPT, PPTX | +| Korean Office *(Pro)* | HWP, HWPX | + +### Output + +| Format | Notes | +|---|---| +| PDF | Full fidelity conversion from Office formats | +| SVG | Vector page rendering | +| Image (PNG, JPEG, …) | Page rasterisation at any DPI | +| Markdown | Structure-aware, LLM-ready | +| JSON | Bounding boxes, layout data, per-element detail | +| Plain text | Fast, lightweight extraction | + +--- + + +## Quick start 
+ +### Extract text + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +for page in doc: + print(page.get_text()) +``` + +### Extract text with layout metadata + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +page = doc[0] + +blocks = page.get_text("dict")["blocks"] +for block in blocks: + if block["type"] == 0: # text block + for line in block["lines"]: + for span in line["spans"]: + print(f"{span['text']!r} font={span['font']} size={span['size']:.1f}") +``` + +### Extract tables + +```python +import pymupdf + +doc = pymupdf.open("spreadsheet.pdf") +page = doc[0] + +tables = page.find_tables() +for table in tables: + print(table.to_markdown()) + + # or get as Pandas DataFrame + df = table.to_pandas() +``` + +### Render a page to an image + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +page = doc[0] + +pixmap = page.get_pixmap(dpi=150) +pixmap.save("page_0.png") +``` + +### OCR a scanned document + +```python +import pymupdf + +doc = pymupdf.open("scanned.pdf") +page = doc[0] + +# Requires Tesseract installed and on PATH +text = page.get_textpage_ocr(language="eng").extractText() +print(text) +``` + +### Convert to Markdown for LLMs + +```python +import pymupdf4llm + +md = pymupdf4llm.to_markdown("report.pdf") +# Pass directly to your LLM or vector store +print(md) +``` + +### Annotate and redact + +```python +import pymupdf + +doc = pymupdf.open("contract.pdf") +page = doc[0] -# Community -Join us on **Discord** here: [#pymupdf](https://discord.gg/TSpYGBW4eq) +# Add a highlight annotation +rect = pymupdf.Rect(72, 100, 400, 120) +page.add_highlight_annot(rect) +# Add a redaction and apply it +page.add_redact_annot(rect) +page.apply_redactions() -# Installation +doc.save("contract_redacted.pdf") +``` -**PyMuPDF** requires **Python 3.10 or later**, install using **pip** with: +### Merge PDFs -`pip install PyMuPDF` +```python +import pymupdf -There are **no mandatory** external dependencies. 
However, some [optional features](#pymupdf-optional-features) become available only if additional packages are installed. +merger = pymupdf.open() +for path in ["part1.pdf", "part2.pdf", "part3.pdf"]: + merger.insert_pdf(pymupdf.open(path)) -You can also try without installing by visiting [PyMuPDF.io](https://pymupdf.io/#examples). +merger.save("merged.pdf") +``` +### Convert an Office document to PDF + +```python +import pymupdf.pro -# Usage +pymupdf.pro.unlock("YOUR-LICENSE-KEY") -Basic usage is as follows: +doc = pymupdf.open("presentation.pptx") +pdf_bytes = doc.convert_to_pdf() + +with open("output.pdf", "wb") as f: + f.write(pdf_bytes) +``` + +### Extract LLM-ready Markdown from a Word document ```python -import pymupdf # imports the pymupdf library -doc = pymupdf.open("example.pdf") # open a document -for page in doc: # iterate the document pages - text = page.get_text() # get plain text encoded as UTF-8 +import pymupdf4llm +import pymupdf.pro + +pymupdf.pro.unlock("YOUR-LICENSE-KEY") +md = pymupdf4llm.to_markdown("document.docx") +print(md) ``` +--- + +## Features + +### Core capabilities + +| Feature | Description | +|---|---| +| **Text extraction** | Plain text, rich dict (font, size, color, bbox), HTML, XML, raw blocks | +| **Table detection** | `find_tables()` — locate, extract, and export tables as Markdown or structured data | +| **Image extraction** | Extract embedded images and render any page to a high-resolution `Pixmap` | +| **Rendering** | Render PDF pages to images or `Pixmap` data for use in UI or other workflows | +| **OCR** | Tesseract integration — full-page or partial OCR, configurable language | +| **Annotations** | Read and write highlights, underlines, squiggly lines, sticky notes, free text, ink, stamps | +| **Redaction** | Add and permanently apply redaction annotations | +| **Forms** | Read and fill PDF AcroForm fields | +| **PDF editing** | Insert, delete, and reorder pages; set metadata; merge and split documents | +| **Drawing** | 
Draw lines, curves, rectangles, and circles; insert HTML boxes | +| **Encryption** | Open password-protected PDFs; save with RC4 or AES encryption | +| **Links** | Extract hyperlinks, internal cross-references, and URI targets | +| **Bookmarks** | Read and write the outline / table of contents tree | +| **Metadata** | Title, author, creation date, producer, subject, and custom entries | +| **Color spaces** | RGB, CMYK, greyscale; color space conversion | + +### LLM & AI output (via PyMuPDF4LLM) + +| Output | API | +|---|---| +| Markdown | `pymupdf4llm.to_markdown(path)` | +| JSON | `pymupdf4llm.to_json(path)` | +| Plain text | `pymupdf4llm.to_text(path)` | + +Supports multi-column layouts, natural reading order and page chunking. + + +[![Demo](https://img.shields.io/badge/Pymupdf4llm-live?style=for-the-badge&label=DEMO&logo=python&logoColor=ffffff)](https://demo.pymupdf.io) + +--- + +## Supported Python versions + +Python **3.10 – 3.14** (as of v1.27.x). Wheels ship for: + +- `manylinux` x86\_64 and aarch64 +- `musllinux` x86\_64 +- macOS x86\_64 and arm64 +- Windows x86 and x86\_64 + +--- + +## Performance + +PyMuPDF is built on MuPDF — one of the fastest PDF rendering engines available. Typical benchmarks against pure-Python PDF libraries show **10–50× speed improvements** for text extraction and **100× or more** for page rendering, with a minimal memory footprint. + +For AI workloads, PyMuPDF4LLM processes documents **without a GPU**, cutting infrastructure costs significantly compared to vision-based LLM approaches. + +--- + +## Recipes + +
+Extract all images from a PDF + +```python +import pymupdf +from pathlib import Path + +doc = pymupdf.open("document.pdf") +out = Path("images") +out.mkdir(exist_ok=True) + +for page_index, page in enumerate(doc): + for img_index, img in enumerate(page.get_images()): + xref = img[0] + pix = pymupdf.Pixmap(doc, xref) + if pix.n - pix.alpha > 3: # CMYK: convert to RGB before saving as PNG + pix = pymupdf.Pixmap(pymupdf.csRGB, pix) + pix.save(out / f"page{page_index}_img{img_index}.png") +``` +
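A possible variation on the recipe above, sketched under the assumption that you want each image in its original embedded format rather than re-encoded as PNG. It relies on `Document.extract_image()`, which returns a dictionary containing the raw `image` bytes and a file extension `ext`. The helper names `image_filename` and `dump_raw_images` are illustrative only and not part of PyMuPDF.

```python
from pathlib import Path


def image_filename(page_index: int, img_index: int, ext: str) -> str:
    """Build an output name such as 'page0_img2.jpeg' (hypothetical naming scheme)."""
    return f"page{page_index}_img{img_index}.{ext}"


def dump_raw_images(pdf_path: str, out_dir: str = "images") -> list[Path]:
    """Write every embedded image in its original format and return the paths."""
    import pymupdf  # imported lazily so image_filename() works without PyMuPDF

    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    with pymupdf.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for img_index, img in enumerate(page.get_images()):
                info = doc.extract_image(img[0])  # img[0] is the image xref
                target = out / image_filename(page_index, img_index, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

Calling `dump_raw_images("document.pdf")` skips the `Pixmap` round trip, so no color space conversion is needed; the trade-off is that the output files may come in mixed formats.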
+ +
+Search for text across a document + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +needle = "confidential" + +for page in doc: + hits = page.search_for(needle) + if hits: + print(f"Page {page.number}: {len(hits)} occurrence(s)") + for rect in hits: + page.add_highlight_annot(rect) + +doc.save("highlighted.pdf") +``` +
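If you only need a report of where a term occurs, without modifying the file, the same `search_for()` call can feed a small summary. The sketch below is one way to do it; `count_hits` is a hypothetical helper, and it assumes only that each page object exposes `number` and `search_for()`, as PyMuPDF pages do.

```python
def count_hits(pages, needle):
    """Map page number -> number of matches, skipping pages without hits."""
    counts = {}
    for page in pages:
        hits = page.search_for(needle)
        if hits:
            counts[page.number] = len(hits)
    return counts
```

With PyMuPDF installed, `count_hits(pymupdf.open("document.pdf"), "confidential")` would return a dictionary keyed by zero-based page number.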
+ +
+Split a PDF into individual pages + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +for i, page in enumerate(doc): + out = pymupdf.open() + out.insert_pdf(doc, from_page=i, to_page=i) + out.save(f"page_{i + 1}.pdf") +``` +
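The same `insert_pdf()` call can split a document into fixed-size chunks instead of single pages. This is a minimal sketch: `page_ranges` and `split_pdf` are hypothetical helpers, and the output naming (`part_1.pdf`, `part_2.pdf`, and so on) is an assumption made for illustration.

```python
def page_ranges(page_count, chunk_size):
    """Return inclusive (first, last) zero-based page pairs covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Save consecutive chunks of `chunk_size` pages as separate PDF files."""
    import pymupdf  # imported lazily so page_ranges() works without PyMuPDF

    with pymupdf.open(path) as doc:
        ranges = page_ranges(doc.page_count, chunk_size)
        for n, (first, last) in enumerate(ranges, start=1):
            out = pymupdf.open()  # new empty PDF
            out.insert_pdf(doc, from_page=first, to_page=last)
            out.save(f"{prefix}_{n}.pdf")
            out.close()
```

Keeping the range arithmetic in its own function makes the off-by-one behaviour easy to check: `from_page` and `to_page` are both inclusive, so the last chunk may be shorter than the others.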
+ +
+Insert a watermark on every page + +```python +import pymupdf + +doc = pymupdf.open("document.pdf") +for page in doc: + page.insert_text( + point=pymupdf.Point(72, page.rect.height / 2), + text="DRAFT", + fontsize=72, + color=(0.8, 0.8, 0.8), + morph=(pymupdf.Point(72, page.rect.height / 2), pymupdf.Matrix(45)), # rotate= accepts only multiples of 90 + ) + +doc.save("watermarked.pdf") +``` +
+ +--- + +## Office Document Processing + +PyMuPDF can be extended with PyMuPDF Pro. This adds a conversion layer that handles Microsoft and Korean Office formats natively — no Office installation, no COM interop, no LibreOffice subprocess. + +Once unlocked, `pymupdf.open()` accepts Office files exactly like PDFs: + +```python +import pymupdf.pro +pymupdf.pro.unlock("YOUR-LICENSE-KEY") + +# Works identically regardless of format +for fmt in ["contract.docx", "data.xlsx", "deck.pptx", "report.hwpx"]: + doc = pymupdf.open(fmt) + for page in doc: + print(page.get_text()) +``` + +[Get a trial license key for PyMuPDF Pro](https://pymupdf.pro/try-pro) + +**What you can do with Office documents:** + +- Extract text and images page-by-page +- Convert to PDF with `doc.convert_to_pdf()` +- Rasterise pages to PNG/JPEG for visual inspection +- Feed directly into PyMuPDF4LLM for AI-ready output + + + +### Restrictions Without a License Key + +When `pymupdf.pro.unlock()` is called **without** a key, the following restrictions apply: + +| Restriction | Detail | +|---|---| +| Page limit | Only the **first 3 pages** of any document are accessible | +| Time limit | Evaluation period — functionality expires after a set duration | + +All other Pro features work normally within these constraints, making it straightforward to prototype before purchasing a license. + + +--- + + + +## Frequently Asked Questions + +### Can I use PyMuPDF, PyMuPDF4LLM and PyMuPDF Pro without sending data to the cloud? + +Yes, absolutely — and this is one of PyMuPDF's most significant advantages. + +PyMuPDF runs entirely locally. It is a native Python library built on top of the MuPDF C engine. When you call `pymupdf.open()`, `page.get_text()`, `page.find_tables()`, or any other method, everything executes in-process on your own machine. No data is transmitted anywhere. 
+ + +There are no telemetry calls, no licence validation callbacks, no cloud dependencies of any kind in the open-source AGPL build or the commercial build. Once the package is installed, it works fully air-gapped. + +This makes PyMuPDF well-suited for: + +- Regulated industries — healthcare (HIPAA), finance, legal, government, where documents cannot leave a controlled environment +- On-premise deployments — servers with no outbound internet access +- Air-gapped systems — classified or sensitive environments +- Self-hosted RAG pipelines — processing confidential documents locally before feeding an on-premise LLM +- Saving on token costs for document pre-processing before sending data to your LLM + +The only thing you need an internet connection for is the initial `pip install`. After that, the package and all its capabilities are entirely self-contained. + + +### Should I `import pymupdf` or `import fitz`? + +Use `import pymupdf`. The `fitz` name is a legacy alias that still works as of v1.24.0+, but `import pymupdf` is the recommended and future-proof approach. The two are interchangeable in existing code: + +```python +import pymupdf # recommended +# import fitz # legacy alias — still works but avoid for new code +``` + +### Does PyMuPDF work with Korean, Japanese, or Chinese documents? + +Yes — PyMuPDF has solid CJK support. + +### How do I extract Markdown from a PDF for an LLM? + +Let PyMuPDF4LLM do everything (recommended for RAG). + +PyMuPDF4LLM is a high-level wrapper around PyMuPDF that outputs text and table content together as a single Markdown-formatted string covering all document pages: tables are detected, converted to GitHub-compatible Markdown, and interleaved with surrounding text in the correct reading order. This is the best starting point for feeding an LLM or building a RAG pipeline. + +```python +import pymupdf4llm + +md = pymupdf4llm.to_markdown("report.pdf") +print(md) +# Tables appear as Markdown | col1 | col2 | ... 
inline with the text +``` + + +### Text extraction returns garbled characters or empty output. Why? + +This usually means the PDF uses custom font encodings without a proper character map (CMAP). The font's glyphs are present but cannot be mapped back to Unicode. In these cases: + +- Use OCR as a fallback (`page.get_textpage_ocr()`) +- Consider that scanned PDFs will always need OCR — text extraction on scans returns nothing + + + +### How do I extract text from a specific area of a page? + +Pass a `clip` rectangle to `get_text()`: + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") +page = doc[0] + +# Define the area you want (x0, y0, x1, y1) in points +clip = pymupdf.Rect(50, 100, 400, 300) +text = page.get_text("text", clip=clip) +``` + + + +### How do I search for text and find its location on the page? + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") +page = doc[0] + +# Returns a list of Rect objects surrounding each match +locations = page.search_for("invoice number") +for rect in locations: + print(rect) # e.g. Rect(72.0, 120.5, 210.0, 134.0) +``` + + + +### `get_images` shows no images but I can clearly see charts in the PDF. Why? + +Charts and diagrams created by tools like matplotlib, Excel, or R are typically rendered as vector graphics (PDF drawing commands), not raster images. `get_images ` only lists embedded raster image objects and will not detect vector graphics. To capture these, rasterise the entire page with `page.get_pixmap()`. + + + +### How does OCR work in PyMuPDF? Does it require a separate Tesseract installation? + +PyMuPDF uses Tesseract for OCR, but Tesseract's C++ code is compiled directly into MuPDF — it is not called as an external subprocess. The only external requirement is the **Tesseract language data files** (`tessdata`). Over 100 languages are supported. There is no Python-level pytesseract dependency. 
+ +```python +import pymupdf + +doc = pymupdf.open("scanned.pdf") +page = doc[0] + +# Get a text page using OCR +tp = page.get_textpage_ocr(language="eng") +text = page.get_text(textpage=tp) +print(text) +``` + + +### How do I run OCR on a standalone image file (not a PDF)? + +```python +import pymupdf + +pix = pymupdf.Pixmap("image.png") +if pix.alpha: + pix = pymupdf.Pixmap(pix, 0) # remove alpha channel — required for OCR + +# Wrap in a 1-page PDF and OCR it +doc = pymupdf.open() +page = doc.new_page(width=pix.width, height=pix.height) +page.insert_image(page.rect, pixmap=pix) +tp = page.get_textpage_ocr() +text = page.get_text(textpage=tp) +``` + + +### How do I highlight text in a PDF? + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") +page = doc[0] + +# Use quads=True for accurate highlights on non-horizontal text +quads = page.search_for("important term", quads=True) +page.add_highlight_annot(quads) + +doc.save("highlighted.pdf") +``` + +PyMuPDF supports all standard PDF text markers: highlight, underline, strikeout, and squiggly. + + + +### How do I permanently redact (remove) content from a PDF? + +Redaction is a deliberate two-step process so you can review before committing: + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") +page = doc[0] + +# Step 1: Mark the area(s) to redact +rect = page.search_for("confidential")[0] +page.add_redact_annot(rect, fill=(1, 1, 1)) # white fill + +# Step 2: Apply — permanently removes the underlying content +page.apply_redactions() + +doc.save("redacted.pdf") +``` + +After `apply_redactions()`, the original content is gone. It cannot be recovered from the saved file. + + + + + +### How do I read form field values from a PDF? + +```python +import pymupdf + +doc = pymupdf.open("form.pdf") +page = doc[0] + +for field in page.widgets(): + print(f"{field.field_name}: {field.field_value}") +``` + + + +### How do I fill in a PDF form programmatically? 
+ +```python +import pymupdf + +doc = pymupdf.open("form.pdf") +page = doc[0] + +for field in page.widgets(): + if field.field_name == "First Name": + field.field_value = "Ada" + field.update() + +doc.save("filled_form.pdf") +``` + + + +### Can I use multithreading with PyMuPDF? + +No. PyMuPDF does not support multithreaded use, even with Python's newer free-threading mode. The underlying MuPDF library only provides partial thread safety, and a fully thread-safe PyMuPDF implementation would still impose a single-threaded overhead — negating the benefit. + +**Use multiprocessing instead.** Each process opens the file independently and works on its own page range: + +```python +from multiprocessing import Pool +import pymupdf + +def process_pages(args): + path, start, end = args + doc = pymupdf.open(path) # each process opens its own handle + results = [] + for i in range(start, end): + results.append(doc[i].get_text()) + return results + +if __name__ == "__main__": # guard needed where worker processes are spawned (Windows, macOS) + # ... one (path, start, end) tuple per page range + chunks = [("input.pdf", 0, 25), ("input.pdf", 25, 50)] + with Pool(4) as pool: + all_results = pool.map(process_pages, chunks) +``` + + + +### How can I speed up repeated text extraction on the same page? + +Reuse a `TextPage` object. Creating a `TextPage` is the expensive part — once created, switching between extraction formats is cheap: + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") +page = doc[0] +tp = page.get_textpage() # create once + +text = page.get_text("text", textpage=tp) +words = page.get_text("words", textpage=tp) +data = page.get_text("dict", textpage=tp) +``` + +This can reduce execution time by 50–95% for repeated extractions on the same page. + + + + +### How do I read and write PDF metadata? 
+ +```python +import pymupdf + +doc = pymupdf.open("input.pdf") + +# Read +print(doc.metadata) +# {'title': '...', 'author': '...', 'subject': '...', 'keywords': '...', ...} + +# Write +doc.set_metadata({ + "title": "Annual Report 2025", + "author": "Finance Team", + "keywords": "annual, finance, 2025" +}) +doc.save("output.pdf") +``` + + +### How do I read or set the table of contents / bookmarks? + +```python +import pymupdf + +doc = pymupdf.open("input.pdf") + +# Read — returns a list of [level, title, page_number] entries +toc = doc.get_toc() +for level, title, page in toc: + print(" " * level, title, "→ page", page) + +# Write +new_toc = [ + [1, "Introduction", 1], + [1, "Methods", 5], + [2, "Data sources", 6], +] +doc.set_toc(new_toc) +doc.save("output.pdf") +``` + + + +--- -# Documentation +## Documentation -Full documentation can be found on [pymupdf.readthedocs.io](https://pymupdf.readthedocs.io). +Full installation guide, API reference, cookbook, and tutorial at **[pymupdf.readthedocs.io](https://pymupdf.readthedocs.io)**. +- [Installation guide](https://pymupdf.readthedocs.io/en/latest/installation.html) +- [API reference](https://pymupdf.readthedocs.io/en/latest/classes.html) +- [Cookbook](https://pymupdf.readthedocs.io/en/latest/the-basics.html) +- [Tutorial](https://pymupdf.readthedocs.io/en/latest/tutorial.html) +- [Changelog](https://pymupdf.readthedocs.io/en/latest/changes.html) +- [PyMuPDF4LLM docs](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) +- [PyMuPDF Pro docs](https://pymupdf.readthedocs.io/en/latest/pymupdf-pro/index.html) +--- -# Optional Features -* [fontTools](https://pypi.org/project/fonttools/) for creating font subsets. -* [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) contains some nice fonts for your text output. -* [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition in images and document pages. 
+## Related projects +| Project | Description | +|---|---| +| [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm) | TLLM/RAG-optimised Markdown and JSON extraction | +| [PyMuPDF Pro](https://pymupdf.io/pro) | Adds Office and HWP document support | +| [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) | Extended font collection for PyMuPDF text output | +--- -# About +## Licensing -**PyMuPDF** adds **Python** bindings and abstractions to [MuPDF](https://mupdf.com/), a lightweight **PDF**, **XPS**, and **eBook** viewer, renderer, and toolkit. Both **PyMuPDF** and **MuPDF** are maintained and developed by [Artifex Software, Inc](https://artifex.com). +PyMuPDF and MuPDF are maintained by [Artifex Software, Inc.](https://artifex.com) -**PyMuPDF** was originally written by [Jorj X. McKie](mailto:jorj.x.mckie@outlook.de). +- **Open source** — [GNU AGPL v3](https://www.gnu.org/licenses/agpl-3.0.html). Free for open-source projects. +- **Commercial** — separate commercial licences available from [Artifex](https://artifex.com/licensing) for proprietary applications. +--- -# License and Copyright +## Contributing -**PyMuPDF** is available under [open-source AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements. If you determine you cannot meet the requirements of the **AGPL**, please contact [Artifex](https://artifex.com/contact/pymupdf-inquiry.php) for more information regarding a commercial license. +Contributions are welcome. Please open an issue before submitting large pull requests. +- [Issue tracker](https://github.com/pymupdf/PyMuPDF/issues) +- [Discord community](https://pymupdf.pro/discord/artifex/) +## ⭐ Support this project +If you find this useful, please consider giving it a star — it helps others discover it! 
+[![Star on GitHub](https://img.shields.io/github/stars/pymupdf/PyMuPDF.svg?style=for-the-badge&label=Star&logo=github)](https://github.com/pymupdf/PyMuPDF/) From 5af4954d74b272e7dbd9a434234219a6d4092031 Mon Sep 17 00:00:00 2001 From: Jamie Lemon Date: Mon, 20 Apr 2026 16:59:43 +0100 Subject: [PATCH 2/2] README updates following Copilot suggestions. --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 6743c21ce..a0b98a087 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ ## Why PyMuPDF? -- **Fast** — powered by [MuPDF](https://mupdf.com/) , a best-in-class C rendering engine +- **Fast** — powered by [MuPDF](https://mupdf.com/), a best-in-class C rendering engine - **Accurate** — pixel-perfect text extraction with font, color, and position metadata - **Versatile** — read, write, annotate, redact, merge, split, and convert documents - **LLM-ready** — native Markdown output via [PyMuPDF4LLM](https://pypi.org/project/pymupdf4llm/) for RAG and AI pipelines @@ -513,13 +513,13 @@ for rect in locations: ### `get_images` shows no images but I can clearly see charts in the PDF. Why? -Charts and diagrams created by tools like matplotlib, Excel, or R are typically rendered as vector graphics (PDF drawing commands), not raster images. `get_images ` only lists embedded raster image objects and will not detect vector graphics. To capture these, rasterise the entire page with `page.get_pixmap()`. +Charts and diagrams created by tools like matplotlib, Excel, or R are typically rendered as vector graphics (PDF drawing commands), not raster images. `get_images` only lists embedded raster image objects and will not detect vector graphics. To capture these, rasterise the entire page with `page.get_pixmap()`. ### How does OCR work in PyMuPDF? Does it require a separate Tesseract installation? 
-PyMuPDF uses Tesseract for OCR, but Tesseract's C++ code is compiled directly into MuPDF — it is not called as an external subprocess. The only external requirement is the **Tesseract language data files** (`tessdata`). Over 100 languages are supported. There is no Python-level pytesseract dependency. +PyMuPDF uses MuPDF's built-in Tesseract-based OCR support, so there is no Python-level `pytesseract` dependency. However, PyMuPDF still needs access to the **Tesseract language data files** (`tessdata`), and automatic tessdata discovery may invoke the `tesseract` executable (for example, to list available languages) if you do not explicitly provide a tessdata path. In practice, the recommended setup is to either install Tesseract so discovery works automatically, or configure the tessdata location yourself via the `tessdata` parameter or the `TESSDATA_PREFIX` environment variable. Over 100 languages are supported. ```python import pymupdf @@ -740,7 +740,7 @@ Full installation guide, API reference, cookbook, and tutorial at **[pymupdf.rea | Project | Description | |---|---| -| [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm) | TLLM/RAG-optimised Markdown and JSON extraction | +| [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm) | LLM/RAG-optimised Markdown and JSON extraction | | [PyMuPDF Pro](https://pymupdf.io/pro) | Adds Office and HWP document support | | [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) | Extended font collection for PyMuPDF text output |