- [2025-12-] 🚀 We have released the evaluation code and the TextEditBench dataset!
TextEditBench is a comprehensive benchmark for evaluating Reasoning-aware Text Editing beyond mere rendering. It focuses explicitly on text-centric regions across 14 topics and 6 task types, emphasizing reasoning-intensive scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies.
To comprehensively assess model performance across diverse editing contexts, we establish a Dual-Track Evaluation Framework encompassing Pixel-Level Objective Metrics and MLLM-based Semantic Metrics. In addition, we propose a novel evaluation dimension, Semantic Expectation (SE), which measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment. Our approach offers a scalable and reproducible alternative to human evaluation while remaining closely aligned with human judgment on complex reasoning chains.
- 🧠 Reasoning-Centric: Introduces the Semantic Expectation (SE) metric.
- 🌍 Diverse Scenarios: Covers 14 topics and 6 task types.
- 📏 Comprehensive Evaluation:
- Track 1 (Pixel-level): SSIM, PSNR, LPIPS, MSE.
- Track 2 (Semantic-level): Powered by GPT-4o, evaluating Instruction Following, Text Accuracy, Visual Consistency, Layout Preservation, and Semantic Expectation.
TextEditBench comprises 1,196 high-quality instances, curated through a rigorous Human-AI-Human verification pipeline. The dataset balances diversity and annotation fidelity by combining Manual Production (58%) with Web-sourced instances (42%).
- 14 Diverse Topics: Broad coverage of daily visual contexts, including Professional Documents, Digital Interfaces, Signage, Menus, and Packaging.
- 6 Atomic Operations: Systematic editing tasks designed to test specific capabilities: Delete, Insert, Change, Relocation, Scaling, and Attribute Transfer.
- Hierarchical Difficulty: Each instance is scored (0-20) based on 10 difficulty attributes and categorized into Easy, Medium, and Hard tiers, enabling fine-grained analysis of model robustness (see the sketch below).
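As an illustration of how these tiers can be used downstream, the sketch below buckets instances by their difficulty score. The field name difficulty_score and the tier thresholds are hypothetical; the released annotations and official cut-offs may differ.

```python
# Hypothetical sketch: bucket instances into difficulty tiers by their 0-20 score.
# The field name "difficulty_score" and the thresholds are illustrative assumptions,
# not the official TextEditBench definitions.
def difficulty_tier(score: int) -> str:
    """Map a 0-20 difficulty score to a coarse tier (illustrative thresholds)."""
    if score <= 6:
        return "Easy"
    if score <= 13:
        return "Medium"
    return "Hard"

def group_by_tier(instances: list[dict]) -> dict[str, list[dict]]:
    """Group instance records (assumed to carry a 'difficulty_score' field) by tier."""
    tiers = {"Easy": [], "Medium": [], "Hard": []}
    for inst in instances:
        tiers[difficulty_tier(inst["difficulty_score"])].append(inst)
    return tiers
```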
Here we provide a quick start guide for evaluating models on TextEditBench.
git clone https://github.com/MATH-finding/TextEditBench.git
cd TextEditBench
conda create -n textedit python=3.10
conda activate textedit
pip install -r requirements.txt
Set up your API key and API base URL in .env for GPT-4o.
OPENAI_API_KEY=${your_api_proxy_provider_key}
OPENAI_API_URL=${your_ark_api_base_url}
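The evaluation scripts read these variables when calling GPT-4o. Below is a minimal sketch of how such a client could be constructed from them, assuming python-dotenv and openai>=1.0; the actual wiring lives in evaluation/GPT-4o_evaluation.py.

```python
# Minimal sketch: build a GPT-4o client from the .env settings.
# Assumes python-dotenv and openai>=1.0 are installed; the real scripts may differ.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY / OPENAI_API_URL from .env

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_URL"],  # proxy or Ark-compatible endpoint
)
```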
Download the TextEditBench data from Hugging Face and unzip it under the root directory.
wget
unzip data.zip
The file structure should be like this:
data/
├── canva/
│   ├── Art_Creative_Expression/
│   │   ├── 001/
│   │   │   ├── 1.jpg
│   │   │   ├── 1_mask.jpg
│   │   │   ├── Art_Creative_Expression_001.json
│   │   │   ├── text_delete_1.jpg
│   │   │   └── text_delete_1_mask.jpg
│   │   └── ...
│   └── ...
└── real/
    ├── Art_Creative_Expression/
    │   ├── 001/
    │   │   ├── 1.jpg
    │   │   ├── 1_mask.jpg
    │   │   ├── Art_Creative_Expression_001.json
    │   │   └── text_delete_1_mask.jpg
    │   └── ...
    └── ...
edited_images/
├── canva/
│   ├── Art_Creative_Expression/
│   │   ├── 001_edited.jpg
│   │   └── ...
│   └── ...
└── real/
    ├── Art_Creative_Expression/
    │   ├── 001_edited.jpg
    │   └── ...
    └── ...
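Based on this layout, the following sketch pairs each instance directory with its corresponding model output. The path convention is inferred from the directory names only; the exact pairing logic used by the evaluation scripts may differ.

```python
# Sketch: pair data/<split>/<topic>/<id>/ with edited_images/<split>/<topic>/<id>_edited.jpg.
# The pairing rule is an assumption based on the directory layout shown above.
from pathlib import Path

DATA_ROOT = Path("data")
EDITED_ROOT = Path("edited_images")

def iter_instances():
    for split_dir in DATA_ROOT.iterdir():                 # canva/ and real/
        if not split_dir.is_dir():
            continue
        for topic_dir in sorted(split_dir.iterdir()):      # e.g. Art_Creative_Expression/
            for inst_dir in sorted(topic_dir.iterdir()):   # e.g. 001/
                edited = EDITED_ROOT / split_dir.name / topic_dir.name / f"{inst_dir.name}_edited.jpg"
                yield inst_dir, edited

# Quick sanity check that every instance has a corresponding edited image.
for inst_dir, edited in iter_instances():
    if not edited.exists():
        print(f"Missing model output for {inst_dir}")
```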
Track 1 (Pixel-level): SSIM, PSNR, LPIPS, MSE.
python evaluation/masked_mse_psnr_evaluation.py path/to/your_model_output.json
python evaluation/masked_ssim_lpips_evaluation.py path/to/your_model_output.json
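For intuition, here is a simplified sketch of a masked MSE/PSNR computation restricted to the edit region. The scripts above are the reference implementation; this sketch assumes 8-bit RGB inputs of identical size and a binary mask.

```python
# Illustrative masked MSE / PSNR between a reference image and an edited image,
# computed only inside the binary edit mask. Not the official implementation.
import numpy as np
from PIL import Image

def masked_mse_psnr(ref_path, edited_path, mask_path, max_val=255.0):
    ref = np.asarray(Image.open(ref_path).convert("RGB"), dtype=np.float64)
    out = np.asarray(Image.open(edited_path).convert("RGB"), dtype=np.float64)
    mask = np.asarray(Image.open(mask_path).convert("L")) > 127   # binarize the mask
    sq_err = (ref - out) ** 2
    mse = sq_err[mask].mean()                                     # error inside the mask only
    psnr = 10 * np.log10(max_val ** 2 / mse) if mse > 0 else float("inf")
    return mse, psnr
```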
Track 2 (Semantic-level): Powered by GPT-4o, evaluating Instruction Following, Text Accuracy, Visual Consistency, Layout Preservation, and Semantic Expectation.
python evaluation/GPT-4o_evaluation.py  # For each evaluation script, you will need to change the output file path to your own.
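For reference, the sketch below shows the kind of judging call the script issues, asking GPT-4o to score the five semantic dimensions for one edited image. The prompt wording, scoring scale, and response parsing are illustrative assumptions; the actual rubric lives in evaluation/GPT-4o_evaluation.py.

```python
# Simplified sketch of an MLLM-as-judge call; the official prompt and rubric differ.
import base64
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=os.environ["OPENAI_API_URL"])

def to_data_url(path):
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def judge(source_img, edited_img, instruction):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are given a source image, an edited image, and the editing "
                    f"instruction: '{instruction}'. Rate Instruction Following, Text Accuracy, "
                    "Visual Consistency, Layout Preservation, and Semantic Expectation, "
                    "each on a 1-5 scale, and answer in JSON."
                )},
                {"type": "image_url", "image_url": {"url": to_data_url(source_img)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited_img)}},
            ],
        }],
    )
    return resp.choices[0].message.content
```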
If you find our work or dataset useful, please cite us:
@article{texteditbench2026,
title={TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering},
author={Anonymous Authors},
journal={CVPR Submission},
volume={3050},
year={2026}
}

For any questions, please feel free to open an issue or contact email@example.com.

