- [2025-12-] 🚀 We have released the evaluation code and the TextEditBench dataset!
TextEditBench is a comprehensive benchmark for evaluating Reasoning-aware Text Editing beyond mere rendering. It focuses explicitly on text-centric regions across 14 topics and 6 task types, emphasizing reasoning-intensive scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies.
To comprehensively assess model performance across diverse editing contexts, we establish a Dual-Track Evaluation Framework encompassing Pixel-Level Objective Metrics and MLLM-based Semantic Metrics. In addition, we propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment. Our approach offers a scalable and reproducible alternative to human evaluation while remaining closely aligned with human judgment on complex reasoning chains.
- 🧠 Reasoning-Centric: Introduces the Semantic Expectation (SE) metric.
- 🌍 Diverse Scenarios: Covers 14 topics and 6 task types.
- 📏 Comprehensive Evaluation:
- Track 1 (Pixel-level): SSIM, PSNR, LPIPS, MSE.
- Track 2 (Semantic-level): Powered by GPT-4o, evaluating Instruction Following, Text Accuracy, Visual Consistency, Layout Preservation, and Semantic Expectation.
TextEditBench comprises 1,196 high-quality instances, curated through a rigorous Human-AI-Human verification pipeline. The dataset balances diversity and annotation fidelity by combining Manual Production (58%) with Web-sourced instances (42%).
- 14 Diverse Topics: Broad coverage of daily visual contexts, including Professional Documents, Digital Interfaces, Signage, Menus, and Packaging.
- 6 Atomic Operations: Systematic editing tasks designed to test specific capabilities: Delete, Insert, Change, Relocation, Scaling, and Attribute Transfer.
- Hierarchical Difficulty: Each instance is scored (0-20) based on 10 difficulty attributes and categorized into Easy, Medium, and Hard tiers, enabling fine-grained analysis of model robustness (see the sketch below).
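As an illustration of how these tiers can be used downstream, the sketch below buckets instances by their difficulty score. The field name difficulty_score and the tier thresholds are hypothetical; the released annotations and official cut-offs may differ.

```python
# Hypothetical sketch: bucket instances into difficulty tiers by their 0-20 score.
# The field name "difficulty_score" and the thresholds are illustrative assumptions,
# not the official TextEditBench definitions.
def difficulty_tier(score: int) -> str:
    """Map a 0-20 difficulty score to a coarse tier (illustrative thresholds)."""
    if score <= 6:
        return "Easy"
    if score <= 13:
        return "Medium"
    return "Hard"

def group_by_tier(instances: list[dict]) -> dict[str, list[dict]]:
    """Group instance records (assumed to carry a 'difficulty_score' field) by tier."""
    tiers = {"Easy": [], "Medium": [], "Hard": []}
    for inst in instances:
        tiers[difficulty_tier(inst["difficulty_score"])].append(inst)
    return tiers
```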
Here we provide a quick start guide for evaluating models on TextEditBench.
git clone https://github.com/MATH-finding/TextEditBench.git
cd TextEditBench
conda create -n textedit python=3.10
conda activate textedit
pip install -r requirements.txt
Set up your API key and API base URL in .env for GPT-4o.
OPENAI_API_KEY=${your_api_proxy_provider_key}
OPENAI_API_URL=${your_ark_api_base_url}
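The evaluation scripts read these variables when calling GPT-4o. Below is a minimal sketch of how such a client could be constructed from them, assuming python-dotenv and openai>=1.0; the actual wiring lives in evaluation/GPT-4o_evaluation.py.

```python
# Minimal sketch: build a GPT-4o client from the .env settings.
# Assumes python-dotenv and openai>=1.0 are installed; the real scripts may differ.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY / OPENAI_API_URL from .env

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_URL"],  # proxy or Ark-compatible endpoint
)
```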
Download the TextEditBench data from Hugging Face and unzip it under the root directory.
wget
unzip data.zip
The file structure should be like this:
data/
├── canva/
│   ├── Art_Creative_Expression/
│   │   ├── 001/
│   │   │   ├── 1.jpg
│   │   │   ├── 1_mask.jpg
│   │   │   ├── Art_Creative_Expression_001.json
│   │   │   ├── text_delete_1.jpg
│   │   │   └── text_delete_1_mask.jpg
│   │   └── ...
│   └── ...
└── real/
    ├── Art_Creative_Expression/
    │   ├── 001/
    │   │   ├── 1.jpg
    │   │   ├── 1_mask.jpg
    │   │   ├── Art_Creative_Expression_001.json
    │   │   └── text_delete_1_mask.jpg
    │   └── ...
    └── ...
edited_images/
├── canva/
│   ├── Art_Creative_Expression/
│   │   ├── 001_edited.jpg
│   │   └── ...
│   └── ...
└── real/
    ├── Art_Creative_Expression/
    │   ├── 001_edited.jpg
    │   └── ...
    └── ...
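Based on this layout, the following sketch pairs each instance directory with its corresponding model output. The path convention is inferred from the directory names only; the exact pairing logic used by the evaluation scripts may differ.

```python
# Sketch: pair data/<split>/<topic>/<id>/ with edited_images/<split>/<topic>/<id>_edited.jpg.
# The pairing rule is an assumption based on the directory layout shown above.
from pathlib import Path

DATA_ROOT = Path("data")
EDITED_ROOT = Path("edited_images")

def iter_instances():
    for split_dir in DATA_ROOT.iterdir():                 # canva/ and real/
        if not split_dir.is_dir():
            continue
        for topic_dir in sorted(split_dir.iterdir()):      # e.g. Art_Creative_Expression/
            for inst_dir in sorted(topic_dir.iterdir()):   # e.g. 001/
                edited = EDITED_ROOT / split_dir.name / topic_dir.name / f"{inst_dir.name}_edited.jpg"
                yield inst_dir, edited

# Quick sanity check that every instance has a corresponding edited image.
for inst_dir, edited in iter_instances():
    if not edited.exists():
        print(f"Missing model output for {inst_dir}")
```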
Track 1 (Pixel-level): SSIM, PSNR, LPIPS, MSE.
python evaluation/masked_mse_psnr_evaluation.py path/to/your_model_output.json
python evaluation/masked_ssim_lpips_evaluation.py path/to/your_model_output.json
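For intuition, here is a simplified sketch of a masked MSE/PSNR computation restricted to the edit region. The scripts above are the reference implementation; this sketch assumes 8-bit RGB inputs of identical size and a binary mask.

```python
# Illustrative masked MSE / PSNR between a reference image and an edited image,
# computed only inside the binary edit mask. Not the official implementation.
import numpy as np
from PIL import Image

def masked_mse_psnr(ref_path, edited_path, mask_path, max_val=255.0):
    ref = np.asarray(Image.open(ref_path).convert("RGB"), dtype=np.float64)
    out = np.asarray(Image.open(edited_path).convert("RGB"), dtype=np.float64)
    mask = np.asarray(Image.open(mask_path).convert("L")) > 127   # binarize the mask
    sq_err = (ref - out) ** 2
    mse = sq_err[mask].mean()                                     # error inside the mask only
    psnr = 10 * np.log10(max_val ** 2 / mse) if mse > 0 else float("inf")
    return mse, psnr
```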
Track 2 (Semantic-level): Powered by GPT-4o, evaluating Instruction Following, Text Accuracy, Visual Consistency, Layout Preservation, and Semantic Expectation.
python evaluation/GPT-4o_evaluation.py  # For each evaluation script, you will need to change the output file path to your own.
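For reference, the sketch below shows the kind of judging call the script issues, asking GPT-4o to score the five semantic dimensions for one edited image. The prompt wording, scoring scale, and response parsing are illustrative assumptions; the actual rubric lives in evaluation/GPT-4o_evaluation.py.

```python
# Simplified sketch of an MLLM-as-judge call; the official prompt and rubric differ.
import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=os.environ["OPENAI_API_URL"])

def to_data_url(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def judge(source_img, edited_img, instruction):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are given a source image, an edited image, and the editing "
                    f"instruction: '{instruction}'. Rate Instruction Following, Text Accuracy, "
                    "Visual Consistency, Layout Preservation, and Semantic Expectation, "
                    "each on a 1-5 scale, and answer in JSON."
                )},
                {"type": "image_url", "image_url": {"url": to_data_url(source_img)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited_img)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```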
If you find our work or dataset useful, please cite us:
@article{texteditbench2026,
title={TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering},
author={Anonymous Authors},
journal={CVPR Submission},
volume={3050},
year={2026}
}

For any questions, please feel free to open an issue or contact email@example.com.

