Can AI Revise Research Papers with Human Review Feedback? An Empirical Study and Benchmark [ACL 2026 Findings]
Zihan Luo, Hong Huang, Jianxun Lian, Yu Chang, Xing Xie, Hai Jin
- Install and configure the openhands CLI (must be available in PATH).
- Requires Python 3.12+ and uv 0.11.6 or newer.
uv tool install openhands --python 3.12
- Install essential Python dependencies for this repository:
pip install -r requirements.txt
Set the Gemini API key for automated evaluation:
export GEMINI_API_KEY=YOUR_GEMINI_KEY
⚠️ Note: Ensure the current environment can access Google GenAI (Gemini) services.
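The setup requirements above can be checked before a long run. The following is a minimal sketch, not part of the repository: the function name `preflight_problems` and its injectable `env` and `which` parameters are hypothetical, introduced only for illustration. It verifies the two conditions stated above (the `openhands` executable is on PATH and `GEMINI_API_KEY` is set); it does not test network access to Gemini.

```python
import os
import shutil


def preflight_problems(env=None, which=shutil.which):
    """Return a list of setup problems (empty if none were found).

    Hypothetical helper: `env` and `which` are injectable so the check
    can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    # The README requires the `openhands` executable to be discoverable on PATH.
    if which("openhands") is None:
        problems.append("openhands CLI not found in PATH")
    # The automated evaluation reads the Gemini key from GEMINI_API_KEY.
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("setup problem:", problem)
```

An empty list means both conditions hold in the current shell.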
Before execution, first set your model for OpenHands by launching the CLI:
openhands
The OpenHands CLI stores its configuration under ~/.openhands/ (created on first run):
- agent_settings.json: persisted agent settings (including LLM_API_KEY, LLM_MODEL, and LLM_BASE_URL)
⚠️ Note: For complete documentation on the OpenHands CLI and how to use it, please visit:
https://docs.openhands.dev/openhands/usage/cli/installation
For a concise, executable walkthrough of environment setup, execution, and evaluation commands, please refer to:
demo.ipynb (located at the repository root)
If you want to run the revision pipeline first, inspect generated papers/logs, or prepare evaluation inputs without calling the judge model yet, use execution.sh.
bash execution.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [model_submission_name]
Key arguments:
- max_papers: maximum number of papers; 0 means all (default 0)
- start_index: 1-based start index (default 1)
- idle_timeout_seconds: stop a paper if its output does not grow within this many seconds (default 900)
- workspace_root: value injected as ROOT_PATH in the prompt (default: repo root)
- source_paper_dir: directory of source papers (default: ./paper)
- model_submission_name: model PDF filename in the evaluation input (without .pdf); inferred from ~/.openhands/agent_settings.json if omitted
⚠️ Note: If you want to test only a small subset before scaling up, set max_papers and start_index (for example, start with max_papers=1).
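Because execution.sh takes positional arguments, setting a later argument means spelling out every earlier one. The sketch below illustrates this by assembling the argument list from the documented defaults. The helper `build_execution_cmd` is hypothetical and not part of the repository. It also assumes the command is launched from the repository root, so `"."` stands in for the "repo root" default of workspace_root.

```python
import shlex


def build_execution_cmd(max_papers=0, start_index=1, idle_timeout_seconds=900,
                        workspace_root=".", source_paper_dir="./paper",
                        model_submission_name=None):
    """Assemble the positional argument list for execution.sh (hypothetical helper)."""
    if max_papers < 0:
        raise ValueError("max_papers must be >= 0 (0 means all papers)")
    if start_index < 1:
        raise ValueError("start_index is 1-based and must be >= 1")
    cmd = [
        "bash", "execution.sh",
        str(max_papers), str(start_index), str(idle_timeout_seconds),
        str(workspace_root), str(source_paper_dir),
    ]
    # The last argument is optional; when omitted, the script infers it.
    if model_submission_name is not None:
        cmd.append(model_submission_name)
    return cmd


if __name__ == "__main__":
    # Smoke test on a single paper, as the note above suggests.
    print(shlex.join(build_execution_cmd(max_papers=1)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.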
If you already have prepared cases in evaluation/evaluation_inputs and only want final quality comparison/metrics, use evaluation.sh.
bash evaluation.sh [judge_model] [eval_input_dir] [eval_output_dir]
Key arguments:
- judge_model: evaluator model (default gemini-2.5-flash)
- eval_input_dir: default ./evaluation/evaluation_inputs
- eval_output_dir: default ./evaluation/evaluation_outputs
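Before calling the judge model, it can help to confirm that every prepared case is complete. The sketch below is a hypothetical helper, not part of the repository. It assumes each case directory under the evaluation input directory should hold paper-O.pdf, paper-H.pdf, and `<model_submission_name>.pdf`, as in the project layout shown in this README; whether evaluation.sh strictly requires all three is an assumption.

```python
from pathlib import Path


def incomplete_cases(eval_input_dir, model_submission_name):
    """Map each case directory name to the expected PDFs it is missing."""
    expected = ["paper-O.pdf", "paper-H.pdf", f"{model_submission_name}.pdf"]
    missing = {}
    for case_dir in sorted(Path(eval_input_dir).iterdir()):
        if not case_dir.is_dir():
            continue  # ignore stray files next to the case directories
        absent = [name for name in expected if not (case_dir / name).is_file()]
        if absent:
            missing[case_dir.name] = absent
    return missing
```

An empty dictionary means every case has all three files; otherwise the keys name the cases to regenerate or fix.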
If you want a fully automated end-to-end pipeline with minimal manual operations, use run.sh as the default entry point.
bash run.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [judge_model] [run_eval] [model_submission_name]
- run_eval=1: execution + evaluation
- run_eval=0: execution only
# Example A: run first 2 papers, then evaluate automatically
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 1
# Example B: run execution only
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 0
During execution, each paper is copied from paper/ into workspace/ and processed independently.
Evaluation inputs and logs are organized under evaluation/.
ReviseBench/
├── paper/                          # Original paper data (source of truth; not modified)
│   └── <paper_name>/
├── workspace/                      # Per-run working copies used by OpenHands
│   └── <paper_name>/
│       └── revised_paper.pdf       # Model-revised paper generated by execution
├── outputs/
│   └── <paper_name>.jsonl          # OpenHands runtime logs for each paper
├── .tmp_prompts/
│   └── <paper_name>.prompt.md      # Temporary prompt rendered for each run
├── evaluation/
│   ├── evaluation_inputs/
│   │   └── <paper_name>/
│   │       ├── paper-O.pdf         # Original paper PDF
│   │       ├── paper-H.pdf         # Human-revised paper PDF
│   │       └── <model_submission_name>.pdf  # Model output PDF copied from workspace
│   └── evaluation_outputs/
│       └── eval_<timestamp>.log    # Evaluation summary/log file
├── execution.sh                    # Batch execution pipeline
├── evaluation.sh                   # Evaluation runner
└── run.sh                          # One-click execution + evaluation
⚠️ Note: Before each paper run, workspace/<paper_name> will be reset from paper/<paper_name>.
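The reset behavior described in the note can be pictured as "delete the working copy, then copy the source again". The sketch below is one way to express that idea in Python; it is an illustration under that assumption, not the code execution.sh actually runs, and the function and parameter names are hypothetical.

```python
import shutil
from pathlib import Path


def reset_workspace(paper_name, source_dir="paper", workspace_dir="workspace"):
    """Replace workspace/<paper_name> with a fresh copy of paper/<paper_name>."""
    src = Path(source_dir) / paper_name
    dst = Path(workspace_dir) / paper_name
    if not src.is_dir():
        raise FileNotFoundError(f"source paper directory not found: {src}")
    # Drop any leftovers from a previous run so the agent starts clean.
    if dst.exists():
        shutil.rmtree(dst)
    # copytree creates missing parent directories and leaves the source untouched.
    shutil.copytree(src, dst)
    return dst
```

One practical consequence: anything you want to keep from a previous run (for example revised_paper.pdf) should be copied out of workspace/ before rerunning the same paper.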
We empirically observe that OpenHands can expand ReviseBench automatically when it is given the key retrieval and preprocessing code snippets. We provide these snippets under data_collection/ for reference:
- data_collection/paper.py: retrieves paper metadata and review-related signals from OpenReview, with arXiv version checks.
- data_collection/review.py: fetches official review content and writes structured review files.
- data_collection/latex.py: preprocesses LaTeX projects.
- data_collection/config.yaml: configuration entry for retrieval credentials and local paths.
openreview:
  url: "https://api2.openreview.net"
  venue: "ICLR.cc/2025/Conference"
  username: ""
  pwd: ""
  datadir: "ICLR2025"
path:
  root_dir: ""
openai:
  url: ""
  api_key: ""
Fill in your credentials and paths before running the retrieval scripts.
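A quick check for unfilled entries can catch an incomplete config before a retrieval script fails partway through. The sketch below is a hypothetical helper, not part of data_collection/. It assumes the configuration has already been parsed into nested dictionaries (for example with a YAML parser such as PyYAML) and that the keys are nested under openreview, path, and openai as the template suggests. Which fields the scripts actually require is not stated in this README, so the helper simply reports every empty value.

```python
def empty_fields(config, prefix=""):
    """Return dotted paths of all entries whose value is an empty string or None."""
    found = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as `openreview` or `openai`.
            found.extend(empty_fields(value, prefix=f"{path}."))
        elif value is None or value == "":
            found.append(path)
    return found


if __name__ == "__main__":
    # Mirrors the unfilled template; the nesting is an assumption.
    template = {
        "openreview": {
            "url": "https://api2.openreview.net",
            "venue": "ICLR.cc/2025/Conference",
            "username": "",
            "pwd": "",
            "datadir": "ICLR2025",
        },
        "path": {"root_dir": ""},
        "openai": {"url": "", "api_key": ""},
    }
    for field in empty_fields(template):
        print("unfilled:", field)
```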
We sincerely thank the OpenHands CLI team for their wonderful work:
https://github.com/OpenHands/OpenHands-CLI