OpenPecha/line-alignment

BDRC OCR-Benchmark Line Alignment

Pipeline for downloading BDRC manuscript images, aligning them with transcriptions, and creating annotation batches for the BDRC annotation tool.

Overview

The workflow has three stages:

  1. Download & Generate JSON — Fetch images and parquet alignment data from S3, produce per-volume JSON manifests.
  2. Upload Images — Push local images to a public S3 bucket so annotators can access them.
  3. Create Batches — POST the JSON manifests to the annotation tool API to create task batches.

Prerequisites

  • Python 3.13+
  • AWS credentials configured via named profiles (e.g. karma-bdrc, webuddhist)

Install dependencies:

pip install -r requirements.txt

Directory Structure

line-alignment/
├── batch_convert.py       # Stage 1: download images + generate JSON
├── upload_images.py       # Stage 2: upload images to public S3
├── create_batch.py        # Stage 3: create annotation batches via API
├── requirements.txt
├── images/                # Downloaded images (images/{catalog}/{vol_id}/)
└── json/                  # Generated JSON manifests (json/{catalog}/{vol_id}.json)

Usage

1. Download Images & Generate JSON

Reads a catalog CSV from S3, downloads BDRC volume images, loads parquet alignment files, and writes per-volume JSON manifests.

python3 batch_convert.py <s3_csv_uri> [--profile AWS_PROFILE] [--image-base-url URL] [--skip-download]

Examples:

# Full run — download images and generate JSON
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc

# Skip image download, only regenerate JSON from existing parquets
python3 batch_convert.py \
    s3://bec.bdrc.io/ocr_benchmark/alignments/202604/Kurt/catalog_volumes.csv \
    --profile karma-bdrc --skip-download

What it does:

  • Parses the catalog CSV to separate BDRC image rows from transcript rows.
  • Downloads volume images to images/{catalog}/{vol_id}/ (concurrent, skips existing files).
  • Loads parquet alignment files to pair each image with its transcription.
  • Falls back to image-only JSON when no parquet is available.
  • Writes JSON to json/{catalog}/{vol_id}.json.
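The image-to-transcript pairing can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: `build_records` and the `alignments` filename-to-text mapping (as loaded from a parquet file) are illustrative names.

```python
def build_records(image_names, base_url, alignments=None):
    """Build per-image task records; fall back to image-only when no alignment exists."""
    records = []
    for name in sorted(image_names):
        record = {
            "name": name,
            "url": f"{base_url}/{name}",
            "orientation": "landscape",
        }
        # Attach a transcript only when alignment data covers this image.
        if alignments and name in alignments:
            record["transcript"] = alignments[name]
        records.append(record)
    return records
```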

2. Upload Images to Public S3

Uploads local images from images/ to s3://bec.data/OCR-Benchmark/{catalog}-{vol_id}/.

python3 upload_images.py [--profile PROFILE] [--images-dir DIR] [--catalog CATALOG] [--dry-run] [--no-skip-existing]

Examples:

# Dry run — see what would be uploaded
python3 upload_images.py --dry-run

# Upload everything
python3 upload_images.py

# Upload only a specific catalog
python3 upload_images.py --catalog Kurt

# Force re-upload even if files already exist in S3
python3 upload_images.py --no-skip-existing

Flag                 Default       Description
--profile            webuddhist    AWS profile name
--images-dir         images        Root directory containing local images
--catalog            all catalogs  Limit to a specific catalog
--dry-run                          Print plan without uploading
--no-skip-existing                 Re-upload files that already exist in S3
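The skip-existing behaviour can be sketched as a pure planning step. `plan_uploads` is a hypothetical helper; in the real script the set of existing keys would come from listing the S3 bucket (e.g. with boto3's paginated object listing).

```python
def plan_uploads(local_files, existing_keys, key_prefix, skip_existing=True):
    """Return (s3_key, local_path) pairs to upload, skipping keys already in S3."""
    plan = []
    for path in sorted(local_files):
        # Derive the destination key from the file name under the given prefix.
        key = f"{key_prefix}/{path.rsplit('/', 1)[-1]}"
        if skip_existing and key in existing_keys:
            continue  # already uploaded; mirrors the default --skip-existing behaviour
        plan.append((key, path))
    return plan
```

With `skip_existing=False` (the `--no-skip-existing` flag), every local file is re-uploaded regardless of what the bucket already holds.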

3. Create Annotation Batches

Reads JSON manifests and POSTs them to the OpenPecha annotation tool API to create task batches.

python3 create_batch.py --catalog CATALOG --group-id GROUP_ID [--vol-id VOL_ID] [--api-url URL] [--dry-run]

Examples:

# Create all batches for the Kurt catalog
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax

# Create a single volume batch
python3 create_batch.py --catalog Kurt --vol-id I3PD874 --group-id 92rIrWB67rLqGtwovMFax

# Preview without sending (dry run)
python3 create_batch.py --catalog Kurt --group-id 92rIrWB67rLqGtwovMFax --dry-run

Flag         Required  Description
--catalog    yes       Catalog name (directory under json/)
--group-id   yes       Group ID for the annotation tool
--vol-id     no        Process only a specific volume (e.g. I3PD874)
--api-url    no        Override the default API endpoint
--dry-run    no        Print payload info without sending requests
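A minimal sketch of the batch-creation call, assuming one JSON POST per volume. The endpoint URL and payload field names below are placeholders, not the annotation tool's actual API.

```python
import json
import urllib.request

API_URL = "https://example.com/api/batches"  # placeholder; override with --api-url

def create_batch(group_id, batch_name, tasks, api_url=API_URL, dry_run=False):
    """POST a task batch; with dry_run, report the payload instead of sending it."""
    payload = {"group_id": group_id, "name": batch_name, "tasks": tasks}
    if dry_run:
        return f"would POST {len(tasks)} tasks to {api_url}"
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```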

JSON Manifest Format

Each volume produces a JSON array of task records:

[
  {
    "name": "I1CZ39610001.jpg",
    "url": "https://s3.us-east-1.amazonaws.com/bec.data/OCR-Benchmark/Kurt-I1CZ3961/I1CZ39610001.jpg",
    "orientation": "landscape",
    "transcript": "optional transcription text"
  }
]

The transcript field is only present when parquet alignment data is available.
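A small validator for this format might look like the following (illustrative only; the annotation tool's actual requirements may differ):

```python
import json

REQUIRED = ("name", "url", "orientation")

def validate_manifest(text):
    """Parse a manifest JSON array and report records missing required fields."""
    records = json.loads(text)
    errors = []
    for i, rec in enumerate(records):
        for field in REQUIRED:
            if field not in rec:
                errors.append(f"record {i}: missing '{field}'")
    return errors  # empty list means the manifest is well-formed
```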
