ABDataBench is a benchmark for evaluating LLM-based extraction of structured antibody data from scientific literature. It pairs a curated 32-document dataset (papers + patents + supplements) with a record-centered evaluation framework covering 22 fields across binding kinetics, sequences, and biological metadata.
ABCurator is the accompanying multi-agent extraction system that achieves state-of-the-art performance on this benchmark.
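For a rough sense of what a single record looks like, the sketch below shows a hypothetical extracted antibody record; the field names, values, and tier groupings are illustrative placeholders, not the benchmark's actual 22-field schema.

```python
# Hypothetical antibody record for illustration only; the real field names and
# tier assignments are defined by the ABDataBench schema, not reproduced here.
example_record = {
    # core-style identification and binding fields
    "antibody_name": "Ab-01",
    "antigen": "PD-1",
    "kd": "2.3 nM",                                # binding kinetics
    # standard-style sequence fields
    "heavy_chain_sequence": "EVQLVESGGGLVQPGGSLRLSCAAS...",
    "light_chain_sequence": "DIQMTQSPSSLSASVGDRVTITC...",
    # auxiliary-style biological metadata
    "host_species": "human",
    "assay": "SPR",
    "source_document": "example_paper.pdf",
}
```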
The evaluation pipeline uses a tiered 22-field schema (Core / Standard / Auxiliary), record-level Hungarian matching, and LLM-as-Judge scoring to produce a final benchmark score.
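As a rough sketch of the record-level matching step, the snippet below pairs predicted and gold records with the Hungarian algorithm from SciPy; the similarity function is a toy stand-in, and the actual evaluator additionally scores matched fields with an LLM judge.

```python
# Sketch of record-level Hungarian matching; the similarity function below is a
# placeholder for the evaluator's real field comparison and LLM-as-Judge scoring.
import numpy as np
from scipy.optimize import linear_sum_assignment

def record_similarity(pred: dict, gold: dict) -> float:
    """Fraction of non-empty gold fields reproduced exactly by the prediction."""
    keys = [k for k, v in gold.items() if v is not None]
    if not keys:
        return 0.0
    return sum(pred.get(k) == gold[k] for k in keys) / len(keys)

def match_records(preds: list[dict], golds: list[dict]) -> list[tuple[int, int]]:
    """Return (pred_index, gold_index) pairs chosen by the Hungarian algorithm."""
    if not preds or not golds:
        return []
    cost = np.array([[-record_similarity(p, g) for g in golds] for p in preds])
    rows, cols = linear_sum_assignment(cost)
    # Keep only pairs with non-zero similarity.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if -cost[r, c] > 0]
```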
ABCurator uses a four-stage pipeline: Document Conditioning → Skeleton Construction → Parallel Enrichment → Validation, with up to 3 retry rounds driven by an Agent–Scientist Co-evolution loop.
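The retry behavior can be pictured as a bounded validate-and-revise loop; the sketch below uses placeholder callables rather than the actual ABCurator agents.

```python
# Bounded validation/retry loop as described above; `validate` and `revise`
# stand in for the actual validation and co-evolution agents.
from typing import Callable

MAX_RETRY_ROUNDS = 3

def finalize_with_retries(records: list[dict],
                          validate: Callable[[list[dict]], list[str]],
                          revise: Callable[[list[dict], list[str]], list[dict]]) -> list[dict]:
    for _ in range(MAX_RETRY_ROUNDS):
        issues = validate(records)
        if not issues:
            break
        records = revise(records, issues)
    return records
```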
| Model | Ab. Prec. ↑ | Ab. Rec. ↑ | Seq. Hit ↑ | KD Hit ↑ | Score ↑ |
|---|---|---|---|---|---|
| **Proprietary Models** | | | | | |
| Claude-4.7-Opus | 38.1 | 96.9 | 91.5 | 73.8 | 84.0 |
| Claude-4.6-Sonnet | 31.8 | 95.6 | 84.3 | 67.7 | 77.9 |
| Gemini-3.1-Pro | 28.0 | 100.0 | 86.3 | 46.2 | 78.4 |
| GPT-5.5 | 24.0 | 100.0 | 94.8 | 67.7 | 83.0 |
| **Open-Source Models** | | | | | |
| Qwen3.5-Plus | 26.4 | 100.0 | 93.5 | 75.4 | 81.7 |
| DeepSeek-V4-Pro | 34.6 | 99.4 | 95.4 | 72.3 | 80.2 |
| GLM-5.1 | 27.7 | 97.5 | 86.9 | 70.8 | 77.1 |
| MiniMax-M2.7 | 33.8 | 96.2 | 85.6 | 69.2 | 76.4 |
- Ab. Prec. / Ab. Rec.: Antibody-record precision and recall
- Seq. Hit: Exact-or-partial hit rate for sequence fields
- KD Hit: Exact-or-partial hit rate for binding affinity fields
- Score: Final ABDataBench score
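Given the one-to-one record matching, antibody-record precision and recall reduce to simple counts, as in the sketch below; how the per-field hit rates and the final score are aggregated is defined by the evaluator and not reproduced here.

```python
def record_precision_recall(num_matched: int, num_predicted: int, num_gold: int) -> tuple[float, float]:
    """Antibody-record precision and recall from a one-to-one record matching."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    return precision, recall
```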
```
agent/                    Multi-agent extraction system (ABCurator)
agent/prompts/            Versioned prompt assets
agent/skills/             Skill metadata loaded by the agents
dataset/                  Default OCR benchmark dataset (32 documents)
benchmark/                Ground truth, evaluator, and visualization tools
figures/                  Figures for documentation
ocr/                      Optional OCR helper scripts
frontend/                 Optional React frontend for annotation/review
backend/                  Optional FastAPI backend for annotation/review
scripts/run_pipeline.py   End-to-end extraction, evaluation, and dashboard
```
```bash
git clone https://github.com/GAIR-NLP/ABDataBench.git
cd ABDataBench
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create a local environment file:

```bash
cp .env.example .env
```

Set at least these values:

```
LLM_API_BASE=https://your-api-endpoint
LLM_API_KEY=your_api_key
LLM_MODEL=your_model_name
BENCHMARK_API_KEY=your_api_key
```

Activate the environment and run the end-to-end pipeline (extraction, evaluation, and dashboard):

```bash
source .venv/bin/activate
set -a; source .env; set +a
python scripts/run_pipeline.py \
--output-root runs \
--papers-per-worker 4 \
--llm-concurrency 8 \
--paper-concurrency 5 \
--trace \
--serve \
--host 0.0.0.0 \
  --port 8000
```

Run extraction only, skipping evaluation:

```bash
python scripts/run_pipeline.py --skip-eval --output-root runs
```

Run the benchmark evaluation standalone:

```bash
cd benchmark
python run_eval.py \
--gt ground_truth/ground_truth.json \
--pred ../runs/dev/agent/benchmark_predictions.json \
  --output ../runs/dev/benchmark
```

Render the interactive dashboard from the evaluation result:

```bash
python benchmark/scripts/visualize_eval.py \
runs/dev/benchmark/eval_result_latest.json \
  --output runs/dev/benchmark/eval_dashboard.html
```

Each run writes its outputs under the output root:

```
runs/<run_name>/
├── agent/benchmark_predictions.json # Predictions
├── benchmark/eval_result_latest.json # Evaluation results
├── benchmark/eval_report_latest.md # Evaluation report
└── benchmark/eval_dashboard.html # Interactive dashboard
```
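To inspect the machine-readable result programmatically (without assuming anything about its internal schema), a quick peek might look like:

```python
import json

# Path from the default run above; substitute your own run name as needed.
with open("runs/dev/benchmark/eval_result_latest.json") as f:
    result = json.load(f)
print(list(result.keys()))  # start by inspecting the top-level structure
```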
See .env.example for the full configuration surface. Key settings:
| Variable | Description |
|---|---|
| `LLM_API_BASE`, `LLM_API_KEY`, `LLM_MODEL` | Text extraction model |
| `LLM_REVIEW_MODEL` | Reviewer model (defaults to `LLM_MODEL`) |
| `VLM_API_BASE`, `VLM_API_KEY`, `VLM_MODEL` | Image extraction model |
| `BENCHMARK_API_KEY`, `BENCHMARK_MODEL` | Benchmark judge model |
| `NCBI_EMAIL`, `NCBI_API_KEY` | Optional NCBI/PDB lookup |
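As a sketch of how these settings could be consumed, the snippet below builds a client from the `LLM_*` variables, assuming the endpoint is OpenAI-compatible; the repo's actual client wiring may differ.

```python
# Assumes LLM_API_BASE points at an OpenAI-compatible endpoint; this is an
# illustration, not ABCurator's actual client setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_API_BASE"],
    api_key=os.environ["LLM_API_KEY"],
)
model = os.environ["LLM_MODEL"]
review_model = os.getenv("LLM_REVIEW_MODEL", model)  # reviewer falls back to LLM_MODEL
```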
Apache 2.0. See LICENSE.


