📌 Description:
The GPT GCJ Dataset is the largest publicly available dataset of LLM-rewritten Java code. It is built from human-authored Java solutions submitted to Google Code Jam 2020, which GPT-4o rewrote while preserving their functionality. This dataset is useful for:
- LLM-generated code classification
- Machine learning in software forensics
- AI-assisted software engineering research
📂 Dataset Structure:
- original/ → 58,524 human-authored Java files from Google Code Jam 2020
- gpt4o/ → 17,565 Java files rewritten by GPT-4o API
Each rewritten file was produced using the prompt:
"This is Java code. Rewrite it entirely while maintaining functionality."
📥 Download & Usage:
To download the dataset, click "Assets" below and select GPT-GCJ-Dataset.zip.
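Once the archive is extracted, the two folders can be read into labeled samples for classification experiments. The following is a minimal sketch, not part of the dataset's tooling. It assumes the zip extracts to a directory containing original/ and gpt4o/, that the files use the .java extension, and that they may be nested in subfolders. The function name load_labeled_files, the 0/1 label values, and the GPT-GCJ-Dataset path are illustrative choices.

```python
from pathlib import Path

# Assumed layout after extraction: <root>/original/ and <root>/gpt4o/.
# Label values are an illustrative choice: 0 = human-authored, 1 = GPT-4o rewrite.
LABELS = {"original": 0, "gpt4o": 1}


def load_labeled_files(root):
    """Return a list of (path, source_text, label) tuples for every .java file."""
    samples = []
    for dirname, label in LABELS.items():
        # rglob also finds files in nested subfolders, if any exist.
        for path in sorted((Path(root) / dirname).rglob("*.java")):
            # errors="replace" keeps files with unexpected encodings readable.
            text = path.read_text(encoding="utf-8", errors="replace")
            samples.append((path, text, label))
    return samples


if __name__ == "__main__":
    # "GPT-GCJ-Dataset" is a hypothetical extraction directory.
    data = load_labeled_files("GPT-GCJ-Dataset")
    human = sum(1 for _, _, label in data if label == 0)
    print(f"human-authored: {human}, GPT-4o rewritten: {len(data) - human}")
```

Because original/ holds roughly three times as many files as gpt4o/, a classifier trained on these samples will likely need class weighting or subsampling.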
📄 Citation:
If you use this dataset in research, please cite it as follows:
@misc{P24_GCJ,
author = {Paek, Timothy},
title = {GPT GCJ Dataset: The Largest LLM-Generated Code Dataset from Google Code Jam},
year = {2024},
howpublished = {GitHub Repository},
url = {https://github.com/tipaek/GPT-Java-GCJ-Dataset}
}