4 changes: 4 additions & 0 deletions docs/introduction.md
@@ -44,6 +44,10 @@ Once installed, `parxy` provides the following commands:
| `parxy docker` | Generate a Docker Compose configuration for self-hosted services |
| `parxy pdf:merge` | Merge multiple PDF files into one, with support for selecting specific page ranges |
| `parxy pdf:split` | Split a PDF file into individual pages |
| `parxy pdf:outline` | Print or export a PDF's outline (bookmarks / table of contents) |
| `parxy pdf:tags` | Extract the tag (structure) tree of a tagged, accessible PDF |
| `parxy pdf:tags-check` | Check whether a PDF is a tagged (accessible) PDF |
| `parxy pdf:xmp` | Read and extract the XMP metadata of a PDF |

```bash
# Parse a PDF to markdown
```
150 changes: 150 additions & 0 deletions docs/reference/cli.md
@@ -223,6 +223,28 @@ parxy pdf:merge [OPTIONS] INPUTS...
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Output file path for the merged PDF. If not specified, you will be prompted. |

## `parxy pdf:outline`

Print or export the outline (bookmarks / table of contents) of a PDF

```
parxy pdf:outline [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | PDF file to inspect |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Write the outline as JSON to this file instead of printing a tree. |
| `--json` | - | `flag` | `false` | Print the outline as JSON to stdout. |
| `--flat` | - | `flag` | `false` | Print a flat, indented list instead of a tree. |

## `parxy pdf:split`

Split a PDF file into individual pages
@@ -245,6 +267,134 @@ parxy pdf:split [OPTIONS] INPUT_FILE
| `--prefix` | `-p` | `text` | - | Prefix for output filenames. If not specified, uses the input filename. |
| `--pages` | - | `text` | - | Page range to extract (1-based). Examples: "1" (single page), "1:3" (pages 1-3), ":3" (up to page 3), "3:" (from page 3). If not specified, all pages are extracted. |
| `--combine` | - | `flag` | `false` | Combine extracted pages into a single PDF instead of one file per page. |
| `--every` | `-e` | `integer` | - | Split into chunks of N pages each. Cannot be used with --combine. |
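
The `--pages` formats and the `--every` chunking described above can be sketched in Python. This is a minimal illustration, not Parxy's actual implementation: the function names `parse_page_range` and `chunk_every` are hypothetical, and both the out-of-range error and the shorter final chunk are assumptions about behavior the table does not spell out.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a 1-based range such as "1", "1:3", ":3" or "3:" into page numbers."""
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
        # An empty side means "from the first page" or "to the last page".
        start = int(start_text) if start_text else 1
        end = int(end_text) if end_text else page_count
    else:
        start = end = int(spec)
    # Assumption: ranges outside the document are rejected.
    if not 1 <= start <= end <= page_count:
        raise ValueError(f"page range {spec!r} is outside 1..{page_count}")
    return list(range(start, end + 1))


def chunk_every(pages: list[int], size: int) -> list[list[int]]:
    """Group pages into consecutive chunks of at most `size` pages."""
    if size < 1:
        raise ValueError("chunk size must be at least 1")
    # Assumption: the last chunk may hold fewer than `size` pages.
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```

For a five-page document, `parse_page_range("3:", 5)` yields pages 3 to 5, and `chunk_every([1, 2, 3, 4, 5], 2)` yields three chunks, the last holding a single page.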

## `parxy pdf:split-by-text`

Split a PDF into chunks whenever a page matches a text condition

```
parxy pdf:split-by-text [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | PDF file to split |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--text` | `-t` | `text` | - | Text to match. Can be repeated for multiple patterns (OR logic). |
| `--mode` | `-m` | `text` | `contains` | Matching mode: "contains" (default) or "starts-with". |
| `--ignore-case` | `-i` | `flag` | `false` | Case-insensitive matching. |
| `--regex` | - | `flag` | `false` | Treat --text values as regular expressions. |
| `--discard-preamble` | - | `flag` | `false` | Discard pages that appear before the first matching page. |
| `--output` | `-o` | `text` | - | Output directory for chunk files (default: {stem}_split next to input). |
| `--prefix` | `-p` | `text` | - | Prefix for output filenames. Defaults to the input filename stem. |
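
How the options above might interact can be sketched in Python. This is an illustration under stated assumptions, not Parxy's code: the names `page_matches` and `split_by_text` are hypothetical, each matching page is assumed to open a new chunk, and `starts-with` is assumed to ignore leading whitespace on the page.

```python
import re


def page_matches(text, patterns, mode="contains", ignore_case=False, regex=False):
    """Return True if any pattern matches the page text (OR logic)."""
    flags = re.IGNORECASE if ignore_case else 0
    for pattern in patterns:
        # Plain-text patterns are escaped so they match literally.
        source = pattern if regex else re.escape(pattern)
        if mode == "starts-with":
            # Assumption: leading whitespace on the page is ignored.
            found = re.match(source, text.lstrip(), flags)
        else:
            found = re.search(source, text, flags)
        if found:
            return True
    return False


def split_by_text(page_texts, patterns, discard_preamble=False, **options):
    """Group 1-based page numbers into chunks; a matching page opens a new chunk."""
    chunks = []
    current = []
    seen_match = False
    for number, text in enumerate(page_texts, start=1):
        if page_matches(text, patterns, **options):
            if current:
                chunks.append(current)
            current = [number]
            seen_match = True
        elif seen_match or not discard_preamble:
            # Pages before the first match are kept unless the preamble is discarded.
            current.append(number)
    if current:
        chunks.append(current)
    return chunks
```

With page texts `["Cover", "Chapter 1", "more", "chapter 2", "end"]` and the pattern `"Chapter"`, a case-insensitive split groups the pages as `[[1], [2, 3], [4, 5]]`; adding `discard_preamble=True` drops the first chunk.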

## `parxy pdf:tag-skeleton`

Copy a tagged PDF keeping its tags but removing visible content

```
parxy pdf:tag-skeleton [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Tagged PDF file to strip |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Output path for the tags-only PDF (default: {stem}_tags.pdf next to input). |

## `parxy pdf:tag-template`

Create an empty tagged PDF skeleton for accessibility work

```
parxy pdf:tag-template [OPTIONS]
```

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Output file path for the template PDF. If not specified, you will be prompted. |
| `--pages` | - | `integer` | `1` | Number of blank pages to create (default: 1). |
| `--lang` | - | `text` | `en-US` | Document language tag set on the catalog (default: en-US). |
| `--title` | - | `text` | - | Optional document title stored in the PDF metadata. |

## `parxy pdf:tags`

Extract the tag (structure) tree of a tagged PDF

```
parxy pdf:tags [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | PDF file to inspect |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Write the extracted tags as JSON to this file instead of printing a tree. |
| `--json` | - | `flag` | `false` | Print the extracted tags as JSON to stdout. |
| `--text` | - | `flag` | `false` | Include the text content of each element. Rebuilds the tree per page; accessibility attributes (alt text, page refs) are not shown in this mode. |

## `parxy pdf:tags-check`

Check whether a PDF is a tagged (accessible) PDF

```
parxy pdf:tags-check [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | PDF file to inspect |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--json` | - | `flag` | `false` | Output the detection result as JSON. |

## `parxy pdf:xmp`

Read and extract the XMP metadata of a PDF

```
parxy pdf:xmp [OPTIONS] INPUT_FILE
```

**Arguments:**

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | PDF file to inspect |

**Options:**

| Option | Short | Type | Default | Description |
|--------|-------|------|---------|-------------|
| `--output` | `-o` | `text` | - | Write the metadata to this file. A .xml extension writes the raw XMP packet; any other extension writes parsed JSON. |
| `--json` | - | `flag` | `false` | Print the parsed metadata as JSON to stdout. |
| `--raw` | - | `flag` | `false` | Print the raw XMP XML packet to stdout. |
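
The `--output` extension rule can be expressed as a small Python sketch. The function name `xmp_output_format` and the returned labels are hypothetical, and the case-insensitive comparison is an assumption; the table only states that a `.xml` extension selects the raw packet.

```python
from pathlib import Path


def xmp_output_format(output_path: str) -> str:
    """Pick the export format from the output file's extension."""
    # Assumption: ".XML" is treated the same as ".xml".
    if Path(output_path).suffix.lower() == ".xml":
        return "raw-xml"
    return "json"
```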

## `parxy tui`

88 changes: 88 additions & 0 deletions docs/tutorials/using_cli.md
@@ -14,6 +14,9 @@ The Parxy CLI lets you:
| `parxy markdown` | Convert documents to Markdown files, with support for multiple drivers and folder processing |
| `parxy pdf:merge`| Merge multiple PDF files into one, with support for page ranges |
| `parxy pdf:split`| Split a PDF into individual pages, with optional page range and single-file extraction |
| `parxy pdf:outline`| Print or export a PDF's outline (bookmarks / table of contents) |
| `parxy pdf:tags` | Inspect and extract the tag (structure) tree of a tagged, accessible PDF |
| `parxy pdf:xmp` | Read and extract XMP metadata from a PDF |
| `parxy drivers` | List available document processing drivers |
| `parxy env` | Generate a default `.env` configuration file |
| `parxy docker` | Create a Docker Compose configuration for running Parxy-related services |
@@ -303,6 +306,88 @@ Page range formats (1-based): `3` · `2:5` · `:5` · `3:`
For more detailed examples and use cases, see the [Merge and split PDFs](../howto/merge_and_split_pdfs.md) guide.


## Inspecting PDFs

Beyond text extraction, Parxy can inspect a PDF's structure and metadata: its outline (bookmarks), its accessibility tag tree, and its XMP metadata. Each command prints a human-readable view by default and can emit JSON with `--json` (to stdout) or `--output` (to a file). The one exception is `pdf:xmp`, where an `--output` path ending in `.xml` writes the raw XMP packet instead of JSON.

### Outline (bookmarks)

The `pdf:outline` command prints the table of contents as a tree:

```bash
parxy pdf:outline document.pdf
```

Use `--flat` for an indented list instead of a tree, or export the structure:

```bash
# Flat listing
parxy pdf:outline document.pdf --flat

# Export as JSON (flat entries + nested tree)
parxy pdf:outline document.pdf -o outline.json
```

The command exits with code `2` when the PDF has no bookmarks, so scripts can detect a missing outline without parsing the output.

### Tags (accessibility structure)

A *tagged* PDF carries a logical structure tree (`/StructTreeRoot`) that makes it accessible. Start by checking whether a PDF is tagged:

```bash
parxy pdf:tags-check document.pdf
```

This reports whether the content is marked, whether a structure tree is present, the document language, and the number of structure elements. It exits with `0` for a tagged PDF and `2` otherwise.
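
A script can branch on these exit codes. The Python sketch below wraps the command with `subprocess.run`; the helper name `is_tagged` and its injectable `run` parameter are hypothetical conveniences, and treating any exit code other than `0` or `2` as a failure is an assumption. The same pattern should apply to `pdf:outline`, which exits with `2` when there are no bookmarks.

```python
import subprocess


def is_tagged(pdf_path, run=subprocess.run):
    """Return True for a tagged PDF and False otherwise, based on the exit code."""
    result = run(["parxy", "pdf:tags-check", pdf_path], capture_output=True, text=True)
    if result.returncode == 0:
        return True
    if result.returncode == 2:
        return False
    # Assumption: any other exit code signals a failure rather than a verdict.
    raise RuntimeError(f"pdf:tags-check failed with exit code {result.returncode}")
```

Passing `run` explicitly makes the helper easy to exercise in tests without invoking the real CLI.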

Extract the tag tree itself with `pdf:tags`:

```bash
# Print the structure tree (with page references and alt text)
parxy pdf:tags document.pdf

# Include the visible text of each element (rebuilt per page)
parxy pdf:tags document.pdf --text

# Export the full nested structure as JSON
parxy pdf:tags document.pdf -o tags.json
```

The default view walks the document-wide structure tree and shows accessibility attributes (alt text, titles, page references) but not body text, which lives in the page content streams. The `--text` view reconstructs the structure per page including each element's visible text, but without the accessibility attributes.

Two companion commands help with accessibility work:

```bash
# Copy a tagged PDF keeping its tags but removing visible content
parxy pdf:tag-skeleton document.pdf -o tags-only.pdf

# Create an empty tagged PDF skeleton from scratch
parxy pdf:tag-template -o template.pdf --pages 3 --lang en-US
```

### XMP metadata

The `pdf:xmp` command reads the XMP metadata packet (an RDF/XML block holding properties such as `dc:title`, `dc:creator`, and `pdf:Producer`) and prints the parsed properties alongside the classic `/Info` dictionary:

```bash
parxy pdf:xmp document.pdf
```
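
To make the packet structure concrete, here is a minimal Python sketch that reads those three properties from an XMP packet with the standard library. It illustrates the RDF/XML layout only and is not how Parxy parses metadata: `read_xmp_properties` is a hypothetical helper, and `SAMPLE_PACKET` is made-up data that omits the `<?xpacket?>` wrapper found in real files.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs used by XMP packets.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Illustrative packet; the values are invented for this example.
SAMPLE_PACKET = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
      <pdf:Producer>Example Producer 1.0</pdf:Producer>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp_properties(packet: str) -> dict:
    """Collect dc:title, dc:creator and pdf:Producer from an XMP packet."""
    root = ET.fromstring(packet)
    props = {}
    for description in root.iter(f"{{{NS['rdf']}}}Description"):
        # dc:title is a language alternative (rdf:Alt); take the first entry.
        title = description.find("dc:title/rdf:Alt/rdf:li", NS)
        if title is not None:
            props["dc:title"] = title.text
        # dc:creator is an ordered list (rdf:Seq) of names.
        creators = description.findall("dc:creator/rdf:Seq/rdf:li", NS)
        if creators:
            props["dc:creator"] = [creator.text for creator in creators]
        producer = description.find("pdf:Producer", NS)
        if producer is not None:
            props["pdf:Producer"] = producer.text
    return props
```

Real packets may also store simple properties such as `pdf:Producer` as attributes on `rdf:Description`, which this sketch does not handle.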

You can view the original packet or export the metadata:

```bash
# Print the raw XMP XML packet
parxy pdf:xmp document.pdf --raw

# Print parsed metadata as JSON to stdout
parxy pdf:xmp document.pdf --json

# Save the raw XMP packet (a .xml path writes the raw packet,
# any other extension writes parsed JSON)
parxy pdf:xmp document.pdf -o metadata.xml
```


## Managing Drivers

To view the list of supported document parsing drivers:
@@ -368,6 +453,9 @@ With the CLI, you can use Parxy as a **standalone document parsing tool** — id
| `parxy markdown` | Generate Markdown files; accepts JSON results and supports `--page-separators` |
| `parxy pdf:merge`| Merge multiple PDF files with page range support |
| `parxy pdf:split`| Split PDF into individual pages; supports `--pages` and `--combine` |
| `parxy pdf:outline`| Print or export a PDF's outline (bookmarks) |
| `parxy pdf:tags` | Inspect and extract a tagged PDF's structure tree; supports `--text` |
| `parxy pdf:xmp` | Read and extract XMP metadata; supports `--raw` and JSON export |
| `parxy drivers` | List supported drivers |
| `parxy env` | Create default configuration file |
| `parxy docker` | Generate Docker Compose setup |