Pipeline for downloading BDRC manuscript images, aligning them with transcriptions, and creating annotation batches for the BDRC annotation tool.
The workflow has three stages:
- Download & Generate JSON — Fetch images and parquet alignment data from S3, produce per-volume JSON manifests.
- Upload Images — Push local images to a public S3 bucket so annotators can access them.
- Create Batches — POST the JSON manifests to the annotation tool API to create task batches.
Requirements:

- Python 3.13+
- AWS credentials configured via named profiles (e.g. `karma-bdrc`, `webuddhist`)
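Named profiles live in `~/.aws/credentials`; a minimal sketch with placeholder values (not real keys):

```ini
[karma-bdrc]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>

[webuddhist]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>
```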
Install dependencies:

```bash
pip install -r requirements.txt
```

Project layout:

```
line-alignment/
├── batch_convert.py      # Stage 1: download images + generate JSON
├── upload_images.py      # Stage 2: upload images to public S3
├── create_batch.py       # Stage 3: create annotation batches via API
├── requirements.txt
├── images/               # Downloaded images (images/{catalog}/{vol_id}/)
└── json/                 # Generated JSON manifests (json/{catalog}/{vol_id}.json)
```
`batch_convert.py` (Stage 1) reads a catalog CSV from S3, downloads BDRC volume images, loads parquet alignment files, and writes per-volume JSON manifests.
Usage:

```bash
python3 batch_convert.py <s3_csv_uri> [--profile AWS_PROFILE] [--image-base-url URL] [--skip-download]
```

Examples:
```bash
# Full run — download images and generate JSON
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc

# Skip image download, only regenerate JSON from existing parquets
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc --skip-download
```

What it does:
- Parses the catalog CSV to separate BDRC image rows from transcript rows.
- Downloads volume images to `images/{catalog}/{vol_id}/` (concurrent, skips existing files).
- Loads parquet alignment files to pair each image with its transcription (see the sketch after this list).
- Falls back to image-only JSON when no parquet is available.
- Writes JSON to `json/{catalog}/{vol_id}.json`.
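The image-transcript pairing can be pictured as below. This is a minimal sketch, not the script's actual code: the parquet column names (`image_name`, `text`), the fixed `landscape` orientation, and the helper names are assumptions for illustration.

```python
import json
from pathlib import Path

import pandas as pd


def build_manifest(parquet_path: Path, image_dir: Path, base_url: str) -> list[dict]:
    """Pair each downloaded image with its transcription from a parquet alignment file."""
    df = pd.read_parquet(parquet_path)
    # Assumed columns: one image filename and its transcription per row.
    transcripts = dict(zip(df["image_name"], df["text"]))
    records = []
    for image in sorted(image_dir.glob("*.jpg")):
        record = {
            "name": image.name,
            "url": f"{base_url}/{image.name}",
            "orientation": "landscape",  # fixed here for brevity
        }
        if image.name in transcripts:
            record["transcript"] = transcripts[image.name]
        records.append(record)  # image-only record when no transcript exists
    return records


def write_manifest(records: list[dict], out_path: Path) -> None:
    """Write the per-volume manifest, e.g. json/{catalog}/{vol_id}.json."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
```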
`upload_images.py` (Stage 2) uploads local images from `images/` to `s3://bec.data/OCR-Benchmark/{catalog}-{vol_id}/`; a sketch of the skip-existing upload logic follows the flag table below.
Usage:

```bash
python3 upload_images.py [--profile PROFILE] [--catalog CATALOG] [--dry-run] [--no-skip-existing]
```

Examples:
```bash
# Dry run — see what would be uploaded
python3 upload_images.py --dry-run

# Upload everything
python3 upload_images.py

# Upload only a specific catalog
python3 upload_images.py --catalog Kurt

# Force re-upload even if files already exist in S3
python3 upload_images.py --no-skip-existing
```

| Flag | Default | Description |
|---|---|---|
| `--profile` | `webuddhist` | AWS profile name |
| `--images-dir` | `images` | Root directory containing local images |
| `--catalog` | all catalogs | Limit to a specific catalog |
| `--dry-run` | off | Print plan without uploading |
| `--no-skip-existing` | off | Re-upload files that already exist in S3 |
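The per-file skip check can be pictured with boto3's `head_object` and `upload_file` calls. A minimal sketch under assumed function and variable names; only the bucket and key layout come from the section above:

```python
from pathlib import Path

import boto3
from botocore.exceptions import ClientError


def upload_volume(images_root: Path, catalog: str, vol_id: str,
                  profile: str = "webuddhist", skip_existing: bool = True) -> None:
    """Upload one volume's images to s3://bec.data/OCR-Benchmark/{catalog}-{vol_id}/."""
    s3 = boto3.Session(profile_name=profile).client("s3")
    bucket = "bec.data"
    prefix = f"OCR-Benchmark/{catalog}-{vol_id}"
    for image in sorted((images_root / catalog / vol_id).glob("*.jpg")):
        key = f"{prefix}/{image.name}"
        if skip_existing:
            try:
                s3.head_object(Bucket=bucket, Key=key)
                continue  # already uploaded; --no-skip-existing would disable this
            except ClientError:
                pass  # not found in S3, fall through and upload
        s3.upload_file(str(image), bucket, key)
```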
`create_batch.py` (Stage 3) reads JSON manifests and POSTs them to the OpenPecha annotation tool API to create task batches.
Usage:

```bash
python3 create_batch.py --catalog CATALOG --group-id GROUP_ID [--vol-id VOL_ID] [--dry-run]
```

Examples:
```bash
# Create all batches for the Kurt catalog
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax

# Create a single volume batch
python3 create_batch.py --catalog Kurt --vol-id I3PD874 --group-id 92rIrWB67rLqGtwovMFax

# Preview without sending (dry run)
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax --dry-run
```

| Flag | Required | Description |
|---|---|---|
| `--catalog` | yes | Catalog name (directory under `json/`) |
| `--group-id` | yes | Group ID for the annotation tool |
| `--vol-id` | no | Process only a specific volume (e.g. `I3PD874`) |
| `--api-url` | no | Override the default API endpoint |
| `--dry-run` | no | Print payload info without sending requests |
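What the POST might look like with the `requests` library. A minimal sketch: the payload field names (`group_id`, `name`, `tasks`) and the error handling are assumptions, not the annotation tool's documented contract:

```python
import json
from pathlib import Path

import requests


def create_batch(manifest_path: Path, group_id: str, api_url: str) -> None:
    """POST one volume's task records to the annotation tool API."""
    tasks = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Assumed payload shape; the real API may expect different field names.
    payload = {"group_id": group_id, "name": manifest_path.stem, "tasks": tasks}
    resp = requests.post(api_url, json=payload, timeout=60)
    resp.raise_for_status()  # surface non-2xx responses immediately
```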
Each volume produces a JSON array of task records:
```json
[
  {
    "name": "I1CZ39610001.jpg",
    "url": "https://s3.us-east-1.amazonaws.com/bec.data/OCR-Benchmark/Kurt-I1CZ3961/I1CZ39610001.jpg",
    "orientation": "landscape",
    "transcript": "optional transcription text"
  }
]
```

The `transcript` field is only present when parquet alignment data is available.
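Because `transcript` may be absent, downstream code should read it defensively; a small illustrative sketch:

```python
import json
from pathlib import Path

records = json.loads(Path("json/Kurt/I1CZ3961.json").read_text(encoding="utf-8"))
for rec in records:
    # .get() returns None for image-only records without alignment data
    text = rec.get("transcript")
    print(rec["name"], text[:40] if text else "(no transcript)")
```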