Commit 7b58a8e

2 parents a34a64e + a97c713 commit 7b58a8e

1 file changed

+107
-1
lines changed

README.md

Lines changed: 107 additions & 1 deletion
@@ -1,2 +1,108 @@
11
# GPT-Java-GCJ-Dataset
2-
A dataset for testing large-scale GPT 4o detection.
2+
<a name="readme-top"></a>
3+
4+
<br />
5+
<div align="center">
6+
<a href="https://github.com/tipaek/GPT-Java-GCJ-Dataset">
7+
<img src="https://cdn.iconscout.com/icon/free/png-256/free-java-file-51-775447.png" alt="logo of Java file" width="80" height="80">
8+
9+
</a>
10+
11+
<h3 align="center">GPT Java GCJ Source Code Dataset</h3>
12+
13+
<p align="center">
14+
A dataset of 76,089 Java source code files for detecting generated code: human-written submissions from over 1,000 authors in the 2020 Google Code Jam competition, together with GPT-4o rewrites of a subset of them.
15+
<br />
16+
<a href="https://github.com/tipaek/GPT-Java-GCJ-Dataset"><strong>Explore the files »</strong></a>
17+
<br />
18+
<br />
19+
</p>
20+
</div>
21+
22+
<!-- TABLE OF CONTENTS -->
23+
<details>
24+
<summary>Table of Contents</summary>
25+
<ol>
26+
<li>
27+
<a href="#about-the-project">About The Project</a>
28+
</li>
29+
<li>
30+
<a href="#getting-started">Getting Started</a>
31+
<ul>
32+
<li><a href="#composition">File Composition</a></li>
33+
<li><a href="#installation">Installation</a></li>
34+
</ul>
35+
</li>
36+
<li><a href="#usage">Usage</a></li>
37+
<li><a href="#contact">Contact</a></li>
38+
<li><a href="#acknowledgments">Acknowledgments</a></li>
39+
</ol>
40+
</details>
41+
42+
43+
44+
<!-- ABOUT THE PROJECT -->
45+
## About The Project
46+
Since the release of OpenAI's ChatGPT, GPT-written code has become increasingly common in everyday use. One consequence is that students can use generated code to cheat on exams and homework, so detecting GPT-written code, framed as a classification or anomaly detection task, could be useful for schools and other organizations. I previously created the first [dataset](https://github.com/tipaek/GPT-Java-Dataset) for this purpose, built around a GPT rewriting task that aims to account for the different coding styles that result from different prompts. This dataset is a significantly scaled-up version of that one, built on the 2020 Google Code Jam dataset.
47+
48+
Here's the general idea:
49+
* **58,524 human-authored** Java source code files from over **1,000 participants** were retrieved from the 2020 Google Code Jam competition
50+
* **17,565 of these files were rewritten by GPT-4o** with the prompt: "This is java code. Rewrite it entirely while maintaining functionality."
51+
* Both the original and the rewritten files are present in the final dataset, which increases the difficulty of the detection task
52+
* The **rewriting task** simulates a range of GPT-4o coding styles by passing in a variety of contexts, emulating how the model's output differs depending on the prompt
53+
54+
This dataset aims to be a resource for researchers working on AI-generated code detection, providing a practical way to gauge real-world detection capability.
55+
56+
<p align="right">(<a href="#readme-top">back to top</a>)</p>
57+
58+
59+
60+
<!-- GETTING STARTED -->
61+
## Getting Started
62+
63+
### Composition
64+
65+
Here's a breakdown of the files in this dataset:
66+
* 76,089 total files
67+
* 58,524 original human-authored files from the 2020 Google Code Jam
68+
* 17,565 rewritten files using GPT-4o
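
Once the repository has been cloned, the totals above can be checked locally. The following is a minimal sketch rather than part of the dataset's tooling: the function name `count_files_by_extension` is hypothetical, and it assumes only that the Java sources are stored as `.java` files somewhere under the folder passed in.

```python
from collections import Counter
from pathlib import Path


def count_files_by_extension(root):
    """Recursively count files under `root`, keyed by lowercase extension.

    Hypothetical helper for illustration; it makes no assumption about how
    the dataset folder is organized beyond files living somewhere below it.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts
```

Calling `count_files_by_extension("GPT-Java-GCJ-Dataset/dataset")[".java"]` should then report a number that can be compared against the 76,089 files listed above (58,524 + 17,565).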
69+
70+
### Installation
71+
72+
73+
To download this dataset, follow these steps:
74+
75+
1. Clone the repository:
76+
```bash
77+
git clone https://github.com/tipaek/GPT-Java-GCJ-Dataset.git
```
78+
2. Navigate to the dataset folder:
79+
```bash
80+
cd GPT-Java-GCJ-Dataset/dataset
```
81+
82+
<p align="right">(<a href="#readme-top">back to top</a>)</p>
83+
84+
<!-- USAGE -->
85+
## Usage
86+
87+
Researchers can use this dataset to:
88+
89+
- Evaluate how accurately models detect GPT-4o-rewritten code across a variety of input contexts
90+
- Build new datasets using this as a base
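
As a rough illustration of the first use case, the sketch below scores a detector's predictions. It is an assumption-laden example, not part of this repository: the names `detection_metrics` and `majority_baseline_accuracy` are hypothetical, labels are assumed to be `1` for GPT-4o-rewritten files and `0` for human-authored files, and the predictions would come from whatever model is being evaluated. Because the classes are imbalanced (58,524 human-authored files against 17,565 rewritten ones), accuracy is worth comparing against the always-predict-human baseline.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (1 = GPT-4o rewritten)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def majority_baseline_accuracy(n_human, n_gpt):
    """Accuracy of a trivial detector that labels every file with the larger class."""
    return max(n_human, n_gpt) / (n_human + n_gpt)
```

With the file counts stated in this README, `majority_baseline_accuracy(58524, 17565)` is about 0.769, so a detector evaluated on the full dataset needs to beat roughly 77% accuracy to improve on guessing "human" every time. Precision and recall on the rewritten class are usually more informative here.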
91+
92+
<p align="right">(<a href="#readme-top">back to top</a>)</p>
93+
94+
<!-- CONTACT -->
95+
## Contact
96+
97+
Timothy Paek - [LinkedIn](https://www.linkedin.com/in/timothy-paek/) - tipaek@syr.edu
98+
99+
Project Link: [https://github.com/tipaek/GPT-Java-GCJ-Dataset](https://github.com/tipaek/GPT-Java-GCJ-Dataset)
100+
101+
<p align="right">(<a href="#readme-top">back to top</a>)</p>
102+
103+
<!-- ACKNOWLEDGMENTS -->
104+
## Acknowledgments
105+
106+
Thanks to the participants of the 2020 Google Code Jam competition and [Jur1cek](https://github.com/Jur1cek/gcj-dataset) for making the creation of this dataset possible.
107+
108+
<p align="right">(<a href="#readme-top">back to top</a>)</p>
