An automated tool that generates high-quality outputs for LLM finetuning datasets using API-key-based model APIs.
- Overview
- Features
- Installation
- Usage
- Dataset Format
- Directory Structure
- How It Works
- Future Improvements
- Contributing
- Disclaimer
- License
The LLM Finetuning Dataset Generator is an automation tool that fills in the output responses of machine learning datasets. It calls API-key-based model APIs to generate high-quality outputs from your custom system prompts.
- Smart Skip Feature - Automatically skips rows that already have outputs
- High Performance - Generates 500+ rows per run
- Custom Prompts - Support for custom system prompts
- Batch Processing - Process multiple dataset files simultaneously
- Automated Output Generation - Generate responses using advanced LLM models
- Batch Processing - Handle multiple dataset files in one go
- Intelligent Skipping - Skip rows with existing outputs automatically
- Custom System Prompts - Use your own system prompts for generation
- Scalable - Generates ~430K rows per day with a free Nvidia API key on a single device
- Multi-Device Compatible - Also works in the Pydroid3 mobile application
- JSON Support - Works with the standard JSON dataset format
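The throughput figure above works out to roughly five rows per second sustained (430,000 rows / 86,400 seconds). One common way to reach that kind of rate against a network API is to issue requests concurrently. The sketch below only illustrates the idea and is not taken from the project's code: `generate_many`, `fake_generate`, and the worker count are hypothetical, and the API call is replaced by a stub.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_many(prompts: list[str],
                  generate: Callable[[str], str],
                  max_workers: int = 8) -> list[str]:
    """Run `generate` over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # even when individual calls finish out of order.
        return list(pool.map(generate, prompts))


def fake_generate(prompt: str) -> str:
    """Stand-in for a real API call (assumption: one call per row)."""
    return f"response to: {prompt}"


if __name__ == "__main__":
    rows_per_day = 430_000
    print(f"{rows_per_day / 86_400:.2f} rows/second sustained")
    print(generate_many(["a", "b", "c"], fake_generate, max_workers=2))
```

Threads suit this workload because each call spends most of its time waiting on the network. The worker count would need tuning against the provider's rate limits.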
- Clone the repository
    git clone https://github.com/sujalrajpoot/LLM-Finetuning-Dataset-Generator.git
    cd LLM-Finetuning-Dataset-Generator
- Install dependencies
    pip install -r requirements.txt
- Set up your dataset files
  - Place all dataset files in the Dataset-Files/ directory
  - Ensure they follow the required JSON format (see below)
- Prepare your datasets in the required JSON format
- Place them in the Dataset-Files/ folder
- Run the generator:
    python concurrently_main.py
The tool will automatically:
- Process all JSON files in the directory
- Generate outputs for empty fields
- Skip rows with existing outputs
- Save the updated datasets
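Before a run, it can help to see how much work is pending. Here is a minimal sketch, assuming the folder name and row format described in this README. `count_pending` is a hypothetical helper and not part of the tool.

```python
import json
from pathlib import Path


def count_pending(dataset_dir: str) -> dict[str, int]:
    """Return {file name: number of rows whose "output" is empty}."""
    pending = {}
    for path in sorted(Path(dataset_dir).glob("*.json")):
        rows = json.loads(path.read_text(encoding="utf-8"))
        # A row counts as pending when its output is missing or blank.
        pending[path.name] = sum(
            1 for row in rows if not str(row.get("output") or "").strip()
        )
    return pending


if __name__ == "__main__":
    for name, n in count_pending("Dataset-Files").items():
        print(f"{name}: {n} rows to generate")
```

The helper only reads the files, so it is safe to run against real datasets.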
Your dataset files must follow this exact JSON structure:
[
  {
    "instruction": "Summarize the given text into one sentence.",
    "input": "Artificial Intelligence is transforming industries by automating tasks, enhancing decision-making, and improving user experiences across sectors like healthcare, finance, and education.",
    "output": ""
  },
  {
    "instruction": "Translate the following English sentence into French.",
    "input": "The weather is beautiful today.",
    "output": ""
  },
  {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": ""
  },
  {
    "instruction": "Generate three creative business name ideas for a coffee shop.",
    "input": "",
    "output": ""
  },
  {
    "instruction": "Classify the sentiment of the given text as Positive, Negative, or Neutral.",
    "input": "I really love the new phone update; it runs faster and looks amazing!",
    "output": ""
  },
  {
    "instruction": "Write a short poem about the sunset.",
    "input": "",
    "output": ""
  },
  {
    "instruction": "Explain the concept of machine learning in simple terms.",
    "input": "",
    "output": ""
  },
  {
    "instruction": "Convert the following temperature from Celsius to Fahrenheit.",
    "input": "25°C",
    "output": ""
  },
  {
    "instruction": "Write a SQL query to select all users who registered in the last 30 days.",
    "input": "",
    "output": ""
  },
  {
    "instruction": "Create a short ad copy for an eco-friendly water bottle brand.",
    "input": "",
    "output": ""
  }
]

| Field | Description | Required |
|---|---|---|
| instruction | The task or prompt for the model | Yes |
| input | The input data/context | Optional (Yes if the task depends on context) |
| output | The generated response (leave empty for generation) | Yes |
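The field rules above can be checked before spending API calls on a malformed file. The following is a rough sketch rather than the tool's own validation: `validate_rows` is a hypothetical helper, and treating `input` as optional but string-typed is an assumption drawn from the table.

```python
import json


def validate_rows(rows: object) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(rows, list):
        return ["top-level JSON value must be a list of objects"]
    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: expected an object")
            continue
        # "instruction" and "output" are marked as required in the table.
        for key in ("instruction", "output"):
            if not isinstance(row.get(key), str):
                problems.append(f"row {i}: '{key}' must be a string")
        # Assumption: "input" may be omitted, but must be a string if present.
        if "input" in row and not isinstance(row["input"], str):
            problems.append(f"row {i}: 'input' must be a string")
        if isinstance(row.get("instruction"), str) and not row["instruction"].strip():
            problems.append(f"row {i}: 'instruction' must not be empty")
    return problems


if __name__ == "__main__":
    sample = json.loads(
        '[{"instruction": "Translate to French.", "input": "Hello", "output": ""},'
        ' {"instruction": "", "input": ""}]'
    )
    for problem in validate_rows(sample):
        print(problem)
```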
LLM-Finetuning-Dataset-Generator/
├── Config/
│   └── config.py
├── Providers/
│   ├── DeepInfra.py
│   ├── Nvidia.py
│   └── __init__.py
├── dataset_files/
│   ├── dataset_1.json
│   └── dataset_2.json
├── requirements.txt
└── main.py
- File Detection - Scans the Dataset-Files/ directory for JSON files
- Row Analysis - Checks each row for empty output fields
- API Communication - Uses API-key-based model APIs with custom system prompts
- Output Generation - Generates high-quality responses for empty fields
- Smart Skipping - Automatically skips rows with existing outputs
- Auto-Save - Saves updated datasets with generated outputs
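The steps above can be sketched as one small loop per file. This shows the general shape only and is not the project's implementation: `SYSTEM_PROMPT`, `build_user_message`, `fill_file`, and `echo_model` are hypothetical names, and the provider call is replaced by a stub so the example runs offline against a temporary file.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder, not the tool's prompt


def build_user_message(row: dict) -> str:
    """Join the instruction and the optional input into one prompt string."""
    instruction = row["instruction"].strip()
    context = (row.get("input") or "").strip()
    return f"{instruction}\n\n{context}" if context else instruction


def fill_file(path: Path, generate: Callable[[str, str], str]) -> int:
    """Fill empty outputs in one dataset file; return how many rows were generated."""
    rows = json.loads(path.read_text(encoding="utf-8"))
    generated = 0
    for row in rows:
        if (row.get("output") or "").strip():
            continue  # smart skipping: existing outputs stay untouched
        row["output"] = generate(SYSTEM_PROMPT, build_user_message(row))
        generated += 1
    if generated:
        # Auto-save: write the file back only when something changed.
        path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    return generated


def echo_model(system_prompt: str, user_message: str) -> str:
    """Stub standing in for a provider call (assumption: returns plain text)."""
    return f"[generated] {user_message}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.json"
        demo.write_text(json.dumps([
            {"instruction": "Say hi.", "input": "", "output": ""},
            {"instruction": "Say bye.", "input": "", "output": "Bye!"},
        ]), encoding="utf-8")
        print("rows generated:", fill_file(demo, echo_model))
        print(demo.read_text(encoding="utf-8"))
```

Passing the model call in as a function keeps the loop independent of any one provider, so the stub can later be swapped for a real client.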
We're constantly working to make this tool better! Here's what's on the roadmap:
- More LLM Providers - Add support for additional free LLM model providers
- Enhanced Scalability - Improve dataset generation capacity for large datasets
- Better Error Handling - Implement advanced troubleshooting features
- Progress Tracking - Add real-time progress bars and statistics
- Configuration File - Support for external configuration files
- GUI Interface - Optional graphical user interface
- Logging System - Comprehensive logging for debugging
Contributions are welcome! This is an open-source project aimed at improving dataset generation efficiency.
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Write clean, documented code
- Test your changes thoroughly
- Update documentation as needed
- Follow existing code style
- Be respectful and constructive
This tool is intended for educational purposes only - to automate the dataset generation process and enhance productivity in machine learning workflows.
- Educational Use - Designed for learning and research
- No Malicious Intent - Not intended to harm any API or organization
- Responsible Use - Users are responsible for compliance with API terms of service
- Respect Rate Limits - Use responsibly and respect API rate limits
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to all contributors who help improve this tool
- Inspired by the need for efficient dataset preparation in ML/AI workflows
- Built with ❤️ for the open-source community
Have questions or suggestions? Feel free to:
- Open an issue
- Start a discussion
- Star the repository if you find it useful!
Made with ❤️ for the ML/AI Community
Star this repository if you find it helpful!