CGCL-codes/ReviseBench

📚 ReviseBench

Can AI Revise Research Papers with Human Review Feedback? An Empirical Study and Benchmark [ACL 2026 Findings]

Zihan Luo, Hong Huang, Jianxun Lian, Yu Chang, Xing Xie, Hai Jin

⚙️ Environment Setup

1.1 Install required tools and packages

  • Install and configure the openhands CLI (must be available in PATH).
  • Requires Python 3.12+ and uv 0.11.6 or newer.
uv tool install openhands --python 3.12
  • Install essential Python dependencies for this repository:
pip install -r requirements.txt

1.2 Configure Evaluation API keys

Set Gemini API key for automated evaluation:

export GEMINI_API_KEY=YOUR_GEMINI_KEY

⚠️ Note: Ensure the current environment can access Google GenAI (Gemini) services.
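
Before launching an evaluation, it can help to confirm that the key is actually visible to the process. The following is a minimal sketch, not part of the repository: the helper name `gemini_key_status` is hypothetical, and it only inspects the environment variable without contacting Gemini.

```python
import os


def gemini_key_status(env=None):
    """Classify the GEMINI_API_KEY value as 'ok', 'missing', or 'placeholder'."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        return "missing"
    # The README's export line uses this literal as a stand-in value.
    if key == "YOUR_GEMINI_KEY":
        return "placeholder"
    return "ok"


if __name__ == "__main__":
    print(f"GEMINI_API_KEY: {gemini_key_status()}")
```

A result of `ok` only means the variable is set; network access to Gemini still has to be verified separately.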

1.3 Set execution model (OpenHands CLI)

Before running the pipeline, launch the OpenHands CLI once and set your model:

openhands

OpenHands CLI will store configuration under ~/.openhands/ (created on first run):

  • agent_settings.json: persisted agent settings (including LLM_API_KEY, LLM_MODEL, and LLM_BASE_URL)
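
To check which model is currently configured without opening the file by hand, a small reader like the one below may be useful. It is a sketch under assumptions: the exact JSON layout of agent_settings.json is not documented here, so the hypothetical helper `find_setting` searches nested objects for a key named `LLM_MODEL` and falls back to `model`. It deliberately never prints the API key.

```python
import json
from pathlib import Path


def find_setting(node, wanted):
    """Depth-first search for the first scalar value whose key matches `wanted` (case-insensitive)."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == wanted.lower() and not isinstance(value, (dict, list)):
                return value
        for value in node.values():
            found = find_setting(value, wanted)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_setting(item, wanted)
            if found is not None:
                return found
    return None


def configured_model(settings_path=None):
    """Return the configured model name, or None if the file or key is absent."""
    path = Path(settings_path) if settings_path else Path.home() / ".openhands" / "agent_settings.json"
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    # Key names are assumptions: try the name used in this README, then a shorter variant.
    return find_setting(data, "LLM_MODEL") or find_setting(data, "model")


if __name__ == "__main__":
    print(configured_model())
```

If the function returns `None` after you have configured OpenHands, the key is probably stored under a different name than assumed here.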

⚠️ Note: For complete OpenHands CLI documentation, see https://docs.openhands.dev/openhands/usage/cli/installation.

🚀 How to Run

2.0 Notebook demo (quick walkthrough)

For a concise, executable walkthrough of environment setup, execution, and evaluation commands, please refer to:

  • demo.ipynb (located at the repository root)

2.1 Execution only

If you want to run the revision pipeline first, inspect generated papers/logs, or prepare evaluation inputs without calling the judge model yet, use execution.sh.

bash execution.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [model_submission_name]

Key arguments:

  • max_papers: max number of papers; 0 means all (default 0)
  • start_index: 1-based start index (default 1)
  • idle_timeout_seconds: stop processing a paper if its output stops growing for this many seconds (default 900)
  • workspace_root: path injected into the prompt as ROOT_PATH (default: repo root)
  • source_paper_dir: source papers (default: ./paper)
  • model_submission_name: model PDF filename in eval input (without .pdf); inferred from ~/.openhands/agent_settings.json if omitted
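
The idle timeout can be pictured as a watchdog on the size of an output file. The sketch below illustrates that idea only; it is not taken from execution.sh, whose actual mechanism may differ. The function name `wait_until_idle` and the `poll_seconds`, `clock`, and `sleep` parameters are hypothetical.

```python
import os
import time


def wait_until_idle(path, idle_timeout_seconds=900, poll_seconds=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Block until `path` has not grown for `idle_timeout_seconds`; return its last size."""
    last_size = -1
    last_growth = clock()
    while True:
        size = os.path.getsize(path) if os.path.exists(path) else 0
        now = clock()
        if size > last_size:
            # Output grew since the previous poll, so restart the idle timer.
            last_size, last_growth = size, now
        elif now - last_growth >= idle_timeout_seconds:
            return size
        sleep(poll_seconds)
```

A caller would typically start the agent process, run this function on the paper's log file, and terminate the process when it returns.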

⚠️ Note: If you want to test only a small subset before scaling up, set max_papers and start_index (for example, start with max_papers=1).
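
Because execution.sh takes positional arguments, every earlier argument must be supplied in order to set a later one. The following sketch assembles the command from named values using the defaults listed above. The helper `build_execution_cmd` is hypothetical, and using `.` for the repository root assumes the command is run from the repository directory.

```python
def build_execution_cmd(max_papers=0, start_index=1, idle_timeout_seconds=900,
                        workspace_root=".", source_paper_dir="./paper",
                        model_submission_name=None):
    """Return the argv list for execution.sh with arguments in their documented order."""
    cmd = [
        "bash", "execution.sh",
        str(max_papers),
        str(start_index),
        str(idle_timeout_seconds),
        str(workspace_root),
        str(source_paper_dir),
    ]
    # The last argument is optional; when omitted, the script infers it.
    if model_submission_name:
        cmd.append(model_submission_name)
    return cmd


if __name__ == "__main__":
    # Smoke-test configuration: a single paper, starting from the first one.
    print(" ".join(build_execution_cmd(max_papers=1)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.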

2.2 Evaluation only

If you already have prepared cases in evaluation/evaluation_inputs and only want final quality comparison/metrics, use evaluation.sh.

bash evaluation.sh [judge_model] [eval_input_dir] [eval_output_dir]

Key arguments:

  • judge_model: evaluator model (default gemini-2.5-flash)
  • eval_input_dir: default ./evaluation/evaluation_inputs
  • eval_output_dir: default ./evaluation/evaluation_outputs
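
Before calling the judge model, it may be worth checking that each case folder is complete. The sketch below assumes the layout described in the file tree of this README (paper-O.pdf, paper-H.pdf, and one model PDF per case); the helper `split_cases` is hypothetical and is not part of evaluation.sh.

```python
from pathlib import Path

# Reference PDFs every case is expected to contain, per the README's file tree.
REFERENCE_PDFS = ("paper-O.pdf", "paper-H.pdf")


def split_cases(eval_input_dir, model_submission_name):
    """Return (ready_case_names, {incomplete_case_name: [missing files]})."""
    ready, incomplete = [], {}
    root = Path(eval_input_dir)
    if not root.is_dir():
        return ready, incomplete
    expected = REFERENCE_PDFS + (f"{model_submission_name}.pdf",)
    for case in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [name for name in expected if not (case / name).is_file()]
        if missing:
            incomplete[case.name] = missing
        else:
            ready.append(case.name)
    return ready, incomplete
```

Running it against ./evaluation/evaluation_inputs with your model submission name shows which papers would be skipped or would fail for lack of inputs.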

2.3 One command: execution + evaluation

If you want a fully automated end-to-end pipeline with minimal manual operations, use run.sh as the default entry point.

bash run.sh [max_papers] [start_index] [idle_timeout_seconds] [workspace_root] [source_paper_dir] [judge_model] [run_eval] [model_submission_name]
  • run_eval=1: execution + evaluation
  • run_eval=0: execution only

2.4 Example

# Example A: run first 2 papers, then evaluate automatically
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 1

# Example B: run execution only
bash run.sh 2 1 900 /root/ReviseBench /root/ReviseBench/paper gemini-2.5-flash 0

📝 Runtime Artifacts and File Tree

During execution, each paper is copied from paper/ into workspace/ and processed independently.
Evaluation inputs and logs are organized under evaluation/.

ReviseBench/
├── paper/                                  # Original paper data (source of truth; not modified)
│   └── <paper_name>/
├── workspace/                              # Per-run working copies used by OpenHands
│   └── <paper_name>/
│       └── revised_paper.pdf               # Model-revised paper generated by execution
├── outputs/
│   └── <paper_name>.jsonl                  # OpenHands runtime logs for each paper
├── .tmp_prompts/
│   └── <paper_name>.prompt.md              # Temporary prompt rendered for each run
├── evaluation/
│   ├── evaluation_inputs/
│   │   └── <paper_name>/
│   │       ├── paper-O.pdf                 # Original paper PDF
│   │       ├── paper-H.pdf                 # Human-revised paper PDF
│   │       └── <model_submission_name>.pdf # Model output PDF copied from workspace
│   └── evaluation_outputs/
│       └── eval_<timestamp>.log            # Evaluation summary/log file
├── execution.sh                            # Batch execution pipeline
├── evaluation.sh                           # Evaluation runner
└── run.sh                                  # One-click execution + evaluation

⚠️ Note: Before each paper run, workspace/<paper_name> will be reset from paper/<paper_name>.
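
The reset step can be sketched as deleting the working copy and copying the source folder again. This is an illustration of the behavior described in the note, not the code used by execution.sh; the function name `reset_workspace` is hypothetical.

```python
import shutil
from pathlib import Path


def reset_workspace(repo_root, paper_name):
    """Replace workspace/<paper_name> with a fresh copy of paper/<paper_name>."""
    src = Path(repo_root) / "paper" / paper_name
    dst = Path(repo_root) / "workspace" / paper_name
    if dst.exists():
        # Discard leftovers from any previous run, including an old revised_paper.pdf.
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst
```

One consequence is that anything you want to keep from workspace/, such as a revised_paper.pdf from an earlier run, should be copied elsewhere before rerunning the same paper.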

🔄 Data Retrieval and Automated Expansion

We observe empirically that OpenHands can expand ReviseBench automatically when given key retrieval and preprocessing code snippets. We provide the key scripts under data_collection/ for reference:

  • data_collection/paper.py: retrieves paper metadata and review-related signals from OpenReview, with arXiv version checks.
  • data_collection/review.py: fetches official review content and writes structured review files.
  • data_collection/latex.py: preprocesses LaTeX projects.
  • data_collection/config.yaml: configuration entry for retrieval credentials and local paths.

data_collection/config.yaml required fields

openreview:
  url: "https://api2.openreview.net"
  venue: "ICLR.cc/2025/Conference"
  username: ""
  pwd: ""
  datadir: "ICLR2025"

path:
  root_dir: ""

openai:
  url: ""
  api_key: ""

Fill in your credentials and paths before running retrieval scripts.
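
A quick check for fields that are still blank can catch a half-filled config before a retrieval script fails midway. The sketch below works on an already-parsed dictionary. Treating every field listed above as mandatory is an assumption, and the helper `empty_fields` is hypothetical rather than part of data_collection/.

```python
# Sections and keys mirror the config.yaml template shown in this README.
REQUIRED_FIELDS = {
    "openreview": ("url", "venue", "username", "pwd", "datadir"),
    "path": ("root_dir",),
    "openai": ("url", "api_key"),
}


def empty_fields(config):
    """Return dotted names of required fields that are missing or blank."""
    problems = []
    for section, keys in REQUIRED_FIELDS.items():
        values = config.get(section) or {}
        for key in keys:
            if not str(values.get(key) or "").strip():
                problems.append(f"{section}.{key}")
    return problems
```

If PyYAML is installed, the dictionary can be obtained with `yaml.safe_load(open("data_collection/config.yaml"))`; an empty list from `empty_fields` then means every listed field has a value.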


💡 Acknowledgement

We sincerely thank the OpenHands CLI team for their wonderful work:
https://github.com/OpenHands/OpenHands-CLI
