Skip to content

GPT-GCJ-Dataset v1.0

Latest

Choose a tag to compare

@tipaek tipaek released this 30 Jan 19:22
efe47aa

📌 Description:
The GPT GCJ Dataset is the largest publicly available dataset of LLM-rewritten Java code, containing files rewritten by GPT-4o. It is based on Google Code Jam 2020, where human-authored Java solutions were rewritten by an LLM while preserving functionality. This dataset is useful for:

  • LLM-generated code classification
  • Machine learning in software forensics
  • AI-assisted software engineering research

📂 Dataset Structure:

  • original/ → 58,524 human-authored Java files from Google Code Jam 2020
  • gpt4o/ → 17,565 Java files rewritten by GPT-4o API

Each file was rewritten using the prompt:

"This is Java code. Rewrite it entirely while maintaining functionality."

📥 Download & Usage:
To download the dataset, click "Assets" below and select GPT-GCJ-Dataset.zip.

📄 Citation:
If you use this dataset in research, please cite it as follows:

@misc{P24_GCJ,
  author = {Paek, Timothy},
  title = {GPT GCJ Dataset: The Largest LLM-Generated Code Dataset from Google Code Jam},
  year = {2024},
  howpublished = {GitHub Repository},
  url = {https://github.com/tipaek/GPT-Java-GCJ-Dataset}
}