|
1 | 1 | # GPT-Java-GCJ-Dataset |
2 | | -A dataset for testing large-scale GPT 4o detection. |
| 2 | +<a name="readme-top"></a> |
| 3 | + |
| 4 | +<br /> |
| 5 | +<div align="center"> |
| 6 | + <a href="https://github.com/tipaek/GPT-Java-GCJ-Dataset"> |
| 7 | + <img src="https://cdn.iconscout.com/icon/free/png-256/free-java-file-51-775447.png" alt="logo of Java file" width="80" height="80"> |
| 8 | + |
| 9 | + </a> |
| 10 | + |
| 11 | + <h3 align="center">GPT Java GCJ Source Code Dataset</h3> |
| 12 | + |
| 13 | + <p align="center"> |
| 14 | + A dataset composed of 76,089 total Java source code files from over 1,000 authors in the 2020 Google Code Jam competition and GPT-4o rewritten code for code generation detection. |
| 15 | + <br /> |
| 16 | + <a href="https://github.com/tipaek/GPT-Java-GCJ-Dataset"><strong>Explore the files »</strong></a> |
| 17 | + <br /> |
| 18 | + <br /> |
| 19 | + </p> |
| 20 | +</div> |
| 21 | + |
| 22 | +<!-- TABLE OF CONTENTS --> |
| 23 | +<details> |
| 24 | + <summary>Table of Contents</summary> |
| 25 | + <ol> |
| 26 | + <li> |
| 27 | + <a href="#about-the-project">About The Project</a> |
| 28 | + </li> |
| 29 | + <li> |
| 30 | + <a href="#getting-started">Getting Started</a> |
| 31 | + <ul> |
| 32 | + <li><a href="#composition">File Composition</a></li> |
| 33 | + <li><a href="#installation">Installation</a></li> |
| 34 | + </ul> |
| 35 | + </li> |
| 36 | + <li><a href="#usage">Usage</a></li> |
| 37 | + <li><a href="#contact">Contact</a></li> |
| 38 | + <li><a href="#acknowledgments">Acknowledgments</a></li> |
| 39 | + </ol> |
| 40 | +</details> |
| 41 | + |
| 42 | + |
| 43 | + |
| 44 | +<!-- ABOUT THE PROJECT --> |
| 45 | +## About The Project |
| 46 | +With the release of OpenAI's ChatGPT, code written by GPT is becoming increasingly more common in everyday usage. However, students often use generated code to cheat on exams and homework. Being able to detect code written by GPT could be useful for organizations and schools as a classification or anomaly detection task. I previously created the first ever [dataset](https://github.com/tipaek/GPT-Java-Dataset) for this purpose with the GPT-rewritten task which aims to solve the different author styles resulting from different prompts. This is a significantly upscaled version of the last using the 2020 Google Code Jam dataset. |
| 47 | + |
| 48 | +Here's the general idea: |
| 49 | +* **58,524 human-authored** Java source code files from over **1,000 participants** were retrieved from the 2020 Google Code Jam competition |
| 50 | +* **17,565 of these files were rewritten by GPT-4o** with the prompt: "This is java code. Rewrite it entirely while maintaining functionality." |
| 51 | +* Both the original and rewritten files are present in the final dataset to increase difficulty |
| 52 | +*The **rewriting task** simulates different GPT-4o coding styles by passing in a variety of contexts that emulate the model's ability to have differing outputs depending on the prompt |
| 53 | + |
| 54 | +This dataset serves aims to be a resource for researchers focusing on AI-generated code detection, providing a practical way to gauge real-world capability. |
| 55 | + |
| 56 | +<p align="right">(<a href="#readme-top">back to top</a>)</p> |
| 57 | + |
| 58 | + |
| 59 | + |
| 60 | +<!-- GETTING STARTED --> |
| 61 | +## Getting Started |
| 62 | + |
| 63 | +### Composition |
| 64 | + |
| 65 | +Here's a breakdown of the files in this dataset: |
| 66 | +* 76,089 total files |
| 67 | +* 58,524 files of original authors from the 2020 Google Code Jam |
| 68 | +* 17,565 rewritten files using GPT-4o |
| 69 | + |
| 70 | +### Installation |
| 71 | + |
| 72 | + |
| 73 | +To download this dataset, simply clone the repository: |
| 74 | + |
| 75 | +1. Clone the repository: |
| 76 | + ```bash |
| 77 | + git clone https://github.com/tipaek/GPT-Java-GCJ-Dataset.git |
| 78 | +2. Navigate to the dataset folder: |
| 79 | + ```bash |
| 80 | + cd GPT-Java-GCJ-Dataset/dataset |
| 81 | +
|
| 82 | +<p align="right">(<a href="#readme-top">back to top</a>)</p> |
| 83 | +
|
| 84 | +<!-- USAGE --> |
| 85 | +## Usage |
| 86 | +
|
| 87 | +Researchers can use this dataset to: |
| 88 | +
|
| 89 | +- Evaluate the performance and accuracy of models in detecting GPT-4o under various prompts |
| 90 | +- Build new datasets using this as a base |
| 91 | +
|
| 92 | +<p align="right">(<a href="#readme-top">back to top</a>)</p> |
| 93 | +
|
| 94 | +<!-- CONTACT --> |
| 95 | +## Contact |
| 96 | +
|
| 97 | +Timothy Paek - [LinkedIn](https://www.linkedin.com/in/timothy-paek/) - tipaek@syr.edu |
| 98 | +
|
| 99 | +Project Link: [https://github.com/tipaek/GPT-Java-GCJ-Dataset](https://github.com/tipaek/GPT-Java-GCJ-Dataset) |
| 100 | +
|
| 101 | +<p align="right">(<a href="#readme-top">back to top</a>)</p> |
| 102 | +
|
| 103 | +<!-- ACKNOWLEDGMENTS --> |
| 104 | +## Acknowledgments |
| 105 | +
|
| 106 | +Thanks to the participants of the 2020 Google Code Jam competition and [Jur1cek](https://github.com/Jur1cek/gcj-dataset) for making the creation of this dataset possible. |
| 107 | +
|
| 108 | +<p align="right">(<a href="#readme-top">back to top</a>)</p> |
0 commit comments