OpenPecha/line-alignment

BDRC OCR-Benchmark Line Alignment

Pipeline for downloading BDRC manuscript images, aligning them with transcriptions, and creating annotation batches for the BDRC annotation tool.

Overview

The workflow has three stages:

  1. Download & Generate JSON — Fetch images and parquet alignment data from S3, produce per-volume JSON manifests.
  2. Upload Images — Push local images to a public S3 bucket so annotators can access them.
  3. Create Batches — POST the JSON manifests to the annotation tool API to create task batches.

Prerequisites

  • Python 3.13+
  • AWS credentials configured via named profiles (e.g. karma-bdrc, webuddhist)

Install dependencies:

pip install -r requirements.txt

Directory Structure

line-alignment/
├── batch_convert.py       # Stage 1: download images + generate JSON
├── upload_images.py       # Stage 2: upload images to public S3
├── create_batch.py        # Stage 3: create annotation batches via API
├── requirements.txt
├── images/                # Downloaded images (images/{catalog}/{vol_id}/)
└── json/                  # Generated JSON manifests (json/{catalog}/{vol_id}.json)

Usage

1. Download Images & Generate JSON

Reads a catalog CSV from S3, downloads BDRC volume images, loads parquet alignment files, and writes per-volume JSON manifests.

python3 batch_convert.py <s3_csv_uri> [--profile AWS_PROFILE] [--image-base-url URL] [--skip-download]

Examples:

# Full run — download images and generate JSON
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc

# Skip image download, only regenerate JSON from existing parquets
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc --skip-download

What it does:

  • Parses the catalog CSV to separate BDRC image rows from transcript rows.
  • Downloads volume images to images/{catalog}/{vol_id}/ (concurrent, skips existing files).
  • Loads parquet alignment files to pair each image with its transcription.
  • Falls back to image-only JSON when no parquet is available.
  • Writes JSON to json/{catalog}/{vol_id}.json.
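The image-to-transcript pairing can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: `build_records` and the `alignments` filename-to-text mapping (as loaded from a parquet file) are illustrative names.

```python
def build_records(image_names, base_url, alignments=None):
    """Build per-image task records; fall back to image-only when no alignment exists."""
    records = []
    for name in sorted(image_names):
        record = {
            "name": name,
            "url": f"{base_url}/{name}",
            "orientation": "landscape",
        }
        # Attach a transcript only when alignment data covers this image.
        if alignments and name in alignments:
            record["transcript"] = alignments[name]
        records.append(record)
    return records
```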

2. Upload Images to Public S3

Uploads local images from images/ to s3://bec.data/OCR-Benchmark/{catalog}-{vol_id}/.

python3 upload_images.py [--profile PROFILE] [--images-dir DIR] [--catalog CATALOG] [--dry-run] [--no-skip-existing]

Examples:

# Dry run — see what would be uploaded
python3 upload_images.py --dry-run

# Upload everything
python3 upload_images.py

# Upload only a specific catalog
python3 upload_images.py --catalog Kurt

# Force re-upload even if files already exist in S3
python3 upload_images.py --no-skip-existing

Flag                 Default       Description
--profile            webuddhist    AWS profile name
--images-dir         images        Root directory containing local images
--catalog            all catalogs  Limit to a specific catalog
--dry-run                          Print plan without uploading
--no-skip-existing                 Re-upload files that already exist in S3
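The skip-existing behaviour can be sketched as a pure planning step. `plan_uploads` is a hypothetical helper; in the real script the set of existing keys would come from listing the S3 bucket (e.g. with boto3's paginated object listing).

```python
def plan_uploads(local_files, existing_keys, key_prefix, skip_existing=True):
    """Return (s3_key, local_path) pairs to upload, skipping keys already in S3."""
    plan = []
    for path in sorted(local_files):
        # Derive the destination key from the file name under the given prefix.
        key = f"{key_prefix}/{path.rsplit('/', 1)[-1]}"
        if skip_existing and key in existing_keys:
            continue  # already uploaded; mirrors the default --skip-existing behaviour
        plan.append((key, path))
    return plan
```

With `skip_existing=False` (the `--no-skip-existing` flag), every local file is re-uploaded regardless of what the bucket already holds.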

3. Create Annotation Batches

Reads JSON manifests and POSTs them to the OpenPecha annotation tool API to create task batches.

python3 create_batch.py --catalog CATALOG --group-id GROUP_ID [--vol-id VOL_ID] [--api-url URL] [--dry-run]

Examples:

# Create all batches for the Kurt catalog
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax

# Create a single volume batch
python3 create_batch.py --catalog Kurt --vol-id I3PD874 --group-id 92rIrWB67rLqGtwovMFax

# Preview without sending (dry run)
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax --dry-run

Flag         Required  Description
--catalog    yes       Catalog name (directory under json/)
--group-id   yes       Group ID for the annotation tool
--vol-id     no        Process only a specific volume (e.g. I3PD874)
--api-url    no        Override the default API endpoint
--dry-run    no        Print payload info without sending requests
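A minimal sketch of the batch-creation call, assuming one JSON POST per volume. The endpoint URL and payload field names below are placeholders, not the annotation tool's actual API.

```python
import json
import urllib.request

API_URL = "https://example.com/api/batches"  # placeholder; override with --api-url

def create_batch(group_id, batch_name, tasks, api_url=API_URL, dry_run=False):
    """POST a task batch; with dry_run, report the payload instead of sending it."""
    payload = {"group_id": group_id, "name": batch_name, "tasks": tasks}
    if dry_run:
        return f"would POST {len(tasks)} tasks to {api_url}"
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```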

JSON Manifest Format

Each volume produces a JSON array of task records:

[
  {
    "name": "I1CZ39610001.jpg",
    "url": "https://s3.us-east-1.amazonaws.com/bec.data/OCR-Benchmark/Kurt-I1CZ3961/I1CZ39610001.jpg",
    "orientation": "landscape",
    "transcript": "optional transcription text"
  }
]

The transcript field is only present when parquet alignment data is available.
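A small validator for this format might look like the following (illustrative only; the annotation tool's actual requirements may differ):

```python
import json

REQUIRED = ("name", "url", "orientation")

def validate_manifest(text):
    """Parse a manifest JSON array and report records missing required fields."""
    records = json.loads(text)
    errors = []
    for i, rec in enumerate(records):
        for field in REQUIRED:
            if field not in rec:
                errors.append(f"record {i}: missing '{field}'")
    return errors  # empty list means the manifest is well-formed
```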
