📚 ReviseBench

Can AI Revise Research Papers with Human Review Feedback? An Empirical Study and Benchmark [ACL 2026 Findings]

Zihan Luo, Hong Huang, Jianxun Lian, Yu Chang, Xing Xie, Hai Jin

⚙️ Environment Setup

1.1 Install required tools and packages

Install and configure the openhands CLI (must be available in PATH).
Requires Python 3.12+ and uv 0.11.6 or newer.

uv tool install openhands --python 3.12

Install essential Python dependencies for this repository:

pip install -r requirements.txt

1.2 Configure Evaluation API keys

Set the Gemini API key used for automated evaluation:

export GEMINI_API_KEY=YOUR_GEMINI_KEY

⚠️ Note: Ensure the current environment can access Google GenAI (Gemini) services.
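A quick pre-flight check can catch a missing key before a long run starts. The sketch below is only an illustration: `require_env` is a hypothetical helper, not part of this repository, and it checks that the variable is non-empty, not that the key is valid or that Gemini is reachable.

```python
import os


def require_env(name, env=None):
    """Return the stripped value of an environment variable.

    Raises RuntimeError with a readable message when the variable is
    unset or blank. `env` defaults to os.environ and exists only to make
    the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation")
    return value
```

Calling `require_env("GEMINI_API_KEY")` at the top of a wrapper script fails early with a clear error instead of partway through evaluation.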

1.3 Set execution model (OpenHands CLI)

Before execution, first set your model in OpenHands by launching the CLI:

openhands

OpenHands CLI will store configuration under ~/.openhands/ (created on first run):

agent_settings.json: persisted agent settings (including LLM_API_KEY, LLM_MODEL, and LLM_BASE_URL)
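To confirm which model is configured, the settings file can be inspected from Python. This is a minimal sketch under stated assumptions: the JSON layout of agent_settings.json is not documented here, so the code searches the whole document for a key named `LLM_MODEL` or `model` (both key names are guesses). `find_key` and `read_model_name` are hypothetical helpers, not part of this repository or of OpenHands.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Depth-first search for the first non-None value stored under `key`."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def read_model_name(settings_path):
    """Return the configured model string, or None if it cannot be found."""
    settings_path = Path(settings_path)
    if not settings_path.is_file():
        return None
    data = json.loads(settings_path.read_text(encoding="utf-8"))
    # Assumption: the model is stored under one of these key names.
    for key in ("LLM_MODEL", "model"):
        value = find_key(data, key)
        if isinstance(value, str) and value:
            return value
    return None
```

For example, `read_model_name(Path.home() / ".openhands" / "agent_settings.json")` returns `None` when the file has not been created yet.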

⚠️ Note: For complete documentation on installing and using the OpenHands CLI, see:
https://docs.openhands.dev/openhands/usage/cli/installation

🚀 How to Run

2.0 Notebook demo (quick walkthrough)

For a concise, executable walkthrough of environment setup, execution, and evaluation commands, please refer to:

demo.ipynb (located at the repository root)

2.1 Execution only

If you want to run the revision pipeline first, inspect generated papers/logs, or prepare evaluation inputs without calling the judge model yet, use execution.sh.

bash execution.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [model_submission_name]

Key arguments:

max_papers: maximum number of papers to process; 0 means all (default 0)
start_index: 1-based index of the first paper to process (default 1)
idle_timeout_seconds: stop a paper if its output does not grow for this many seconds (default 900)
workspace_root: path injected as ROOT_PATH into the prompt (default: repo root)
source_paper_dir: directory containing the source papers (default: ./paper)
model_submission_name: filename of the model PDF in the evaluation input (without .pdf); inferred from ~/.openhands/agent_settings.json if omitted

⚠️ Note: If you want to test only a small subset before scaling up, set max_papers and start_index (for example, start with max_papers=1).
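The idle_timeout_seconds behavior described above (stop a run whose output stops growing) can be pictured as a small watchdog loop. The sketch below is not the logic inside execution.sh, which is not shown here. It is one plausible way to implement the idea, and `run_with_idle_timeout`, `log_path`, and `poll_interval` are hypothetical names.

```python
import subprocess
import time
from pathlib import Path


def run_with_idle_timeout(cmd, log_path, idle_timeout=900.0, poll_interval=1.0):
    """Run `cmd`, redirecting its output to `log_path`.

    Returns the process exit code, or None if the process was stopped
    because the log file did not grow for `idle_timeout` seconds.
    """
    log_path = Path(log_path)
    with log_path.open("wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        last_size = 0
        last_growth = time.monotonic()
        while proc.poll() is None:
            time.sleep(poll_interval)
            size = log_path.stat().st_size
            if size > last_size:
                # Output grew: remember the new size and restart the idle clock.
                last_size, last_growth = size, time.monotonic()
            elif time.monotonic() - last_growth > idle_timeout:
                # No growth for too long: stop the process, forcefully if needed.
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()
                return None
        return proc.returncode
```

Measuring idleness by file growth rather than wall-clock time means a long but productive run is never interrupted, while a stalled one is stopped after the timeout.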

2.2 Evaluation only

If you already have prepared cases in evaluation/evaluation_inputs and only want final quality comparison/metrics, use evaluation.sh.

bash evaluation.sh [judge_model] [eval_input_dir] [eval_output_dir]

Key arguments:

judge_model: evaluator model (default gemini-2.5-flash)
eval_input_dir: default ./evaluation/evaluation_inputs
eval_output_dir: default ./evaluation/evaluation_outputs
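Before spending judge-model calls, it can help to verify that every case directory is complete. The following sketch assumes the layout listed in the file tree of this README (paper-O.pdf, paper-H.pdf, plus one model PDF per case). `check_eval_inputs` is a hypothetical helper, and treating any other PDF as the model submission is an assumption.

```python
from pathlib import Path

# Reference PDFs expected in every case directory.
REFERENCE_PDFS = ("paper-O.pdf", "paper-H.pdf")


def check_eval_inputs(eval_input_dir):
    """Map each case directory name to a list of problems (empty list = OK)."""
    report = {}
    for case_dir in sorted(Path(eval_input_dir).iterdir()):
        if not case_dir.is_dir():
            continue
        problems = [
            f"missing {name}"
            for name in REFERENCE_PDFS
            if not (case_dir / name).is_file()
        ]
        # Assumption: any PDF that is not a reference PDF is a model submission.
        model_pdfs = [p for p in case_dir.glob("*.pdf") if p.name not in REFERENCE_PDFS]
        if not model_pdfs:
            problems.append("missing model submission PDF")
        report[case_dir.name] = problems
    return report
```

Running `check_eval_inputs("./evaluation/evaluation_inputs")` and printing only the entries with a non-empty list gives a short list of cases to fix before calling evaluation.sh.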

2.3 One command: execution + evaluation

If you want a fully automated end-to-end pipeline with minimal manual operations, use run.sh as the default entry point.

bash run.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [judge_model] [run_eval] [model_submission_name]

run_eval=1: execution + evaluation
run_eval=0: execution only
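Because run.sh takes its arguments positionally, a small wrapper that names each argument can reduce ordering mistakes when launching runs from Python. This is a sketch only: `build_run_command` is a hypothetical helper, and its defaults copy the ones listed above for execution.sh and evaluation.sh. Whether run.sh applies the same defaults, and the default of `run_eval=1`, are assumptions.

```python
def build_run_command(
    max_papers=0,
    start_index=1,
    idle_timeout_seconds=900,
    workspace_root=".",
    source_paper_dir="./paper",
    judge_model="gemini-2.5-flash",
    run_eval=1,
    model_submission_name=None,
):
    """Return the argument list for `bash run.sh ...` in positional order."""
    if max_papers < 0:
        raise ValueError("max_papers must be >= 0 (0 means all papers)")
    if start_index < 1:
        raise ValueError("start_index is 1-based and must be >= 1")
    if run_eval not in (0, 1):
        raise ValueError("run_eval must be 0 (execution only) or 1 (execution + evaluation)")
    cmd = [
        "bash", "run.sh",
        str(max_papers), str(start_index), str(idle_timeout_seconds),
        str(workspace_root), str(source_paper_dir),
        judge_model, str(run_eval),
    ]
    # The last argument is optional, so it is appended only when given.
    if model_submission_name:
        cmd.append(model_submission_name)
    return cmd
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.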

2.4 Example

# Example A: run first 2 papers, then evaluate automatically
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 1

# Example B: run execution only
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 0

📝 Runtime Artifacts and File Tree

During execution, each paper is copied from paper/ into workspace/ and processed independently.
Evaluation inputs and logs are organized under evaluation/.

ReviseBench/
├── paper/                                  # Original paper data (source of truth; not modified)
│   └── <paper_name>/
├── workspace/                              # Per-run working copies used by OpenHands
│   └── <paper_name>/
│       └── revised_paper.pdf               # Model-revised paper generated by execution
├── outputs/
│   └── <paper_name>.jsonl                  # OpenHands runtime logs for each paper
├── .tmp_prompts/
│   └── <paper_name>.prompt.md              # Temporary prompt rendered for each run
├── evaluation/
│   ├── evaluation_inputs/
│   │   └── <paper_name>/
│   │       ├── paper-O.pdf                 # Original paper PDF
│   │       ├── paper-H.pdf                 # Human-revised paper PDF
│   │       └── <model_submission_name>.pdf # Model output PDF copied from workspace
│   └── evaluation_outputs/
│       └── eval_<timestamp>.log            # Evaluation summary/log file
├── execution.sh                            # Batch execution pipeline
├── evaluation.sh                           # Evaluation runner
└── run.sh                                  # One-click execution + evaluation

⚠️ Note: Before each paper run, workspace/<paper_name> will be reset from paper/<paper_name>.
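The reset step can be pictured as "delete the working copy, then copy the source again". The sketch below illustrates that idea in Python. It is not the code used by execution.sh, and `reset_workspace` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def reset_workspace(paper_name, paper_root="paper", workspace_root="workspace"):
    """Replace workspace/<paper_name> with a fresh copy of paper/<paper_name>."""
    source = Path(paper_root) / paper_name
    target = Path(workspace_root) / paper_name
    if not source.is_dir():
        raise FileNotFoundError(f"source paper directory not found: {source}")
    if target.exists():
        # Discard everything left over from a previous run.
        shutil.rmtree(target)
    shutil.copytree(source, target)
    return target
```

One consequence of this behavior is that anything written into workspace/<paper_name> by an earlier run is lost, so copy out any revised_paper.pdf you want to keep before rerunning the same paper.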

🔄 Data Retrieval and Automated Expansion

We observe empirically that OpenHands can expand ReviseBench automatically when given key retrieval and preprocessing code snippets. For reference, we provide the main scripts under data_collection/:

data_collection/paper.py: retrieves paper metadata and review-related signals from OpenReview, with arXiv version checks.
data_collection/review.py: fetches official review content and writes structured review files.
data_collection/latex.py: preprocesses LaTeX projects.
data_collection/config.yaml: configuration entry for retrieval credentials and local paths.

`data_collection/config.yaml` required fields

openreview:
  url: "https://api2.openreview.net"
  venue: "ICLR.cc/2025/Conference"
  username: ""
  pwd: ""
  datadir: "ICLR2025"

path:
  root_dir: ""

openai:
  url: ""
  api_key: ""

Fill in your credentials and paths before running retrieval scripts.
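A short check can flag fields that are still empty before a retrieval script is run. The sketch below works on an already-parsed configuration dictionary, so it does not depend on a particular YAML library. `missing_fields` is a hypothetical helper, and treating every field in the template as mandatory is an assumption (some scripts may not need all of them).

```python
# Sections and keys taken from the config.yaml template above.
REQUIRED_FIELDS = {
    "openreview": ("url", "venue", "username", "pwd", "datadir"),
    "path": ("root_dir",),
    "openai": ("url", "api_key"),
}


def missing_fields(config):
    """Return dotted names of required fields that are absent or blank."""
    missing = []
    for section, keys in REQUIRED_FIELDS.items():
        values = config.get(section) or {}
        for key in keys:
            if not str(values.get(key) or "").strip():
                missing.append(f"{section}.{key}")
    return missing


# The unmodified template, as it would look after parsing.
TEMPLATE = {
    "openreview": {
        "url": "https://api2.openreview.net",
        "venue": "ICLR.cc/2025/Conference",
        "username": "",
        "pwd": "",
        "datadir": "ICLR2025",
    },
    "path": {"root_dir": ""},
    "openai": {"url": "", "api_key": ""},
}

print(missing_fields(TEMPLATE))
```

With PyYAML installed, the dictionary could be obtained with `yaml.safe_load(open("data_collection/config.yaml"))` and passed to `missing_fields` in the same way.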

💡 Acknowledgement

We sincerely thank the OpenHands CLI team for their wonderful work:
https://github.com/OpenHands/OpenHands-CLI
