🤖 LLM Finetuning Dataset Generator

An automated tool that generates high-quality outputs for LLM finetuning datasets by calling hosted model APIs authenticated with an API key.

📋 Table of Contents

Overview
Features
Installation
Usage
Dataset Format
Directory Structure
How It Works
Future Improvements
Contributing
Disclaimer
License
Acknowledgments
Contact

🌟 Overview

The LLM Finetuning Dataset Generator automates filling in the output responses of machine learning datasets. It calls hosted model APIs (authenticated with an API key) and generates each output according to your custom system prompt.

Key Highlights

✨ Smart Skip Feature - Automatically skips rows that already have outputs
⚡ High Performance - Generates 500+ rows per run
🎯 Custom Prompts - Support for custom system prompts
🔄 Batch Processing - Process multiple dataset files simultaneously

🚀 Features

🤖 Automated Output Generation - Generate responses using advanced LLM models
📊 Batch Processing - Handle multiple dataset files in one go
🔍 Intelligent Skipping - Skip rows with existing outputs automatically
⚙️ Custom System Prompts - Use your own system prompts for generation
📈 Scalable - Generates ~430K rows per day on a single device with a free NVIDIA API key
💻 Multi-Device Compatible - Also runs in the Pydroid 3 mobile app
💾 JSON Support - Works with standard JSON dataset format
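
To put the ~430K rows per day figure in perspective, the short calculation below converts it into the sustained request rate it implies. It assumes an uninterrupted 24-hour run and says nothing about what any provider's rate limits actually allow; `required_rate` is a helper name made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def required_rate(rows_per_day: int) -> tuple[float, float]:
    """Return (rows per second, rows per minute) needed to reach a daily total."""
    per_second = rows_per_day / SECONDS_PER_DAY
    return per_second, per_second * 60


per_second, per_minute = required_rate(430_000)
print(f"{per_second:.2f} rows/s, {per_minute:.0f} rows/min")
# prints: 4.98 rows/s, 299 rows/min
```

In other words, the daily figure corresponds to roughly five completed rows per second sustained around the clock.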

📦 Installation

Clone the repository

git clone https://github.com/sujalrajpoot/LLM-Finetuning-Dataset-Generator.git
cd LLM-Finetuning-Dataset-Generator

Install dependencies

pip install -r requirements.txt

Set up your dataset files
- Place all dataset files in the dataset_files/ directory
- Ensure they follow the required JSON format (see below)

🎯 Usage

Prepare your datasets in the required JSON format
Place them in the dataset_files/ folder
Run the generator:
```
python concurrently_main.py
```

The tool will automatically:

- Process all JSON files in the directory
- Generate outputs for empty fields
- Skip rows with existing outputs
- Save the updated datasets
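
Before starting a run, it can help to see how many rows would be generated and how many would be skipped. The following is a minimal dry-run sketch and not part of the tool itself: `count_rows` and `dry_run` are hypothetical helpers, and treating a whitespace-only `output` as empty is an assumption about how the generator decides what to skip.

```python
import json
from pathlib import Path


def count_rows(path: Path) -> tuple[int, int]:
    """Return (pending, done) row counts for one dataset file."""
    rows = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: a missing or whitespace-only output counts as "empty".
    pending = sum(1 for row in rows if not str(row.get("output", "")).strip())
    return pending, len(rows) - pending


def dry_run(directory: str = "dataset_files") -> dict[str, tuple[int, int]]:
    """Map each JSON file name in `directory` to its (pending, done) counts."""
    return {p.name: count_rows(p) for p in sorted(Path(directory).glob("*.json"))}
```

Calling `dry_run()` from the repository root returns a dictionary such as `{"dataset_1.json": (10, 0)}`, meaning ten rows still need outputs and none would be skipped.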

📝 Dataset Format

Your dataset files must follow this exact JSON structure:

[
    {
        "instruction": "Summarize the given text into one sentence.",
        "input": "Artificial Intelligence is transforming industries by automating tasks, enhancing decision-making, and improving user experiences across sectors like healthcare, finance, and education.",
        "output": ""
    },
    {
        "instruction": "Translate the following English sentence into French.",
        "input": "The weather is beautiful today.",
        "output": ""
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Generate three creative business name ideas for a coffee shop.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Classify the sentiment of the given text as Positive, Negative, or Neutral.",
        "input": "I really love the new phone update; it runs faster and looks amazing!",
        "output": ""
    },
    {
        "instruction": "Write a short poem about the sunset.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Explain the concept of machine learning in simple terms.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Convert the following temperature from Celsius to Fahrenheit.",
        "input": "25°C",
        "output": ""
    },
    {
        "instruction": "Write a SQL query to select all users who registered in the last 30 days.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Create a short ad copy for an eco-friendly water bottle brand.",
        "input": "",
        "output": ""
    }
]

Required Fields

| Field | Description | Required |
| --- | --- | --- |
| `instruction` | The task or prompt for the model | ✅ Yes |
| `input` | The input data/context | Optional (needed only when the task depends on context; otherwise leave it as an empty string) |
| `output` | The generated response (leave empty for generation) | ✅ Yes |
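
A malformed row is easier to catch before a run than in the middle of one. The sketch below checks a parsed dataset against the fields described above. It is an illustration, not code from this repository: `validate_rows` is a hypothetical helper, and it assumes all three keys must be present as strings (with `input` and `output` allowed to be empty), which matches the example file but is not confirmed by the tool's source.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_rows(rows: object) -> list[str]:
    """Return human-readable problems; an empty list means the data looks valid."""
    if not isinstance(rows, list):
        return ["top-level JSON value must be a list of row objects"]
    problems = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append(f"row {i}: expected an object, got {type(row).__name__}")
            continue
        for key in REQUIRED_KEYS:
            if key not in row:
                problems.append(f"row {i}: missing key '{key}'")
            elif not isinstance(row[key], str):
                problems.append(f"row {i}: '{key}' must be a string")
        # The instruction is the only field that may never be blank.
        if isinstance(row.get("instruction"), str) and not row["instruction"].strip():
            problems.append(f"row {i}: 'instruction' is empty")
    return problems


sample = json.loads(
    '[{"instruction": "Write a short poem about the sunset.", "input": "", "output": ""},'
    ' {"instruction": "", "input": "25°C"}]'
)
for problem in validate_rows(sample):
    print(problem)
```

Here the first row passes, while the second is reported twice: once for the missing `output` key and once for the blank `instruction`.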

📁 Directory Structure

📁 LLM-Finetuning-Dataset-Generator/
├── 📁 Config/
│   └── 📄 config.py
├── 📁 Providers/
│   ├── 📄 DeepInfra.py
│   ├── 📄 Nvidia.py
│   └── 📄 __init__.py
├── 📁 dataset_files/
│   ├── 📄 dataset_1.json
│   └── 📄 dataset_2.json
├── 📄 LICENSE
├── 📄 README.md
├── 📄 concurrently_main.py
├── 📄 main.py
└── 📄 requirements.txt

⚙️ How It Works

📂 File Detection - Scans the dataset_files/ directory for JSON files
🔍 Row Analysis - Checks each row for an empty output field
🤖 API Communication - Calls a hosted model API (authenticated with an API key) using your custom system prompt
✍️ Output Generation - Generates a response for each empty output field
⏭️ Smart Skipping - Automatically skips rows with existing outputs
💾 Auto-Save - Saves the updated datasets with the generated outputs
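
The steps above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the code in `main.py` or `concurrently_main.py`: the `generate` callable stands in for a real provider call, `echo_generate` is a made-up stub, and the thread pool and the write-then-rename save are design choices of this sketch rather than confirmed behavior of the tool.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

# (system_prompt, instruction, input) -> generated output
Generator = Callable[[str, str, str], str]


def process_file(path: Path, generate: Generator, system_prompt: str) -> int:
    """Fill empty `output` fields in one file in place; return rows generated."""
    rows = json.loads(path.read_text(encoding="utf-8"))
    generated = 0
    for row in rows:
        if str(row.get("output", "")).strip():
            continue  # smart skipping: existing outputs are left untouched
        row["output"] = generate(system_prompt, row["instruction"], row.get("input", ""))
        generated += 1
    if generated:
        # Write to a sibling temp file first so a crash cannot truncate the dataset.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(rows, indent=4, ensure_ascii=False), encoding="utf-8")
        tmp.replace(path)
    return generated


def process_directory(directory: Path, generate: Generator, system_prompt: str,
                      max_workers: int = 4) -> dict[str, int]:
    """Process every JSON file in `directory` using a pool of worker threads."""
    files = sorted(directory.glob("*.json"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = pool.map(lambda p: process_file(p, generate, system_prompt), files)
        return {p.name: n for p, n in zip(files, counts)}


def echo_generate(system_prompt: str, instruction: str, input_text: str) -> str:
    """Hypothetical stand-in for a provider call; a real one would hit the API."""
    return f"[stub] {instruction}"
```

Running `process_directory(Path("dataset_files"), echo_generate, "You are a helpful assistant.")` a second time returns zero for every file, because all rows now have outputs and are skipped.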

🔮 Future Improvements

We're constantly working to make this tool better! Here's what's on the roadmap:

🌐 More LLM Providers - Add support for additional free LLM model providers
📊 Enhanced Scalability - Improve dataset generation capacity for large datasets
🛠️ Better Error Handling - Implement advanced troubleshooting features
📈 Progress Tracking - Add real-time progress bars and statistics
🔧 Configuration File - Support for external configuration files
🎨 GUI Interface - Optional graphical user interface
📝 Logging System - Comprehensive logging for debugging

🤝 Contributing

Contributions are welcome! This is an open-source project aimed at improving dataset generation efficiency.

How to Contribute

🍴 Fork the repository
🌿 Create a feature branch (`git checkout -b feature/AmazingFeature`)
💻 Commit your changes (`git commit -m 'Add some AmazingFeature'`)
📤 Push to the branch (`git push origin feature/AmazingFeature`)
🔃 Open a Pull Request

Contribution Guidelines

Write clean, documented code
Test your changes thoroughly
Update documentation as needed
Follow existing code style
Be respectful and constructive

⚠️ Disclaimer

This tool is intended for educational purposes only - to automate the dataset generation process and enhance productivity in machine learning workflows.

📚 Educational Use - Designed for learning and research
🚫 No Malicious Intent - Not intended to harm any API or organization
⚖️ Responsible Use - Users are responsible for compliance with API terms of service
🔒 Respect Rate Limits - Use responsibly and respect API rate limits
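
One simple way to stay under a requests-per-minute cap is to enforce a minimum gap between calls. The class below is a minimal sketch of that idea; `MinIntervalThrottle` is a hypothetical name, the limit you pass in must come from your provider's own documentation, and nothing here reflects how this tool currently paces its requests.

```python
import threading
import time


class MinIntervalThrottle:
    """Block callers so that successive requests are at least `interval` seconds apart."""

    def __init__(self, max_per_minute: float) -> None:
        self.interval = 60.0 / max_per_minute
        self._next_allowed = time.monotonic()
        self._lock = threading.Lock()  # lets worker threads share one throttle

    def wait(self) -> None:
        """Sleep until the next request slot is free, then reserve the one after it."""
        with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            self._next_allowed = max(time.monotonic(), self._next_allowed) + self.interval
```

Call `throttle.wait()` immediately before each API request. Holding the lock while sleeping is deliberate: it queues concurrent workers so that they cannot all fire at once when a slot opens.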

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Thanks to all contributors who help improve this tool
Inspired by the need for efficient dataset preparation in ML/AI workflows
Built with ❤️ for the open-source community

📧 Contact

Have questions or suggestions? Feel free to:

🐛 Open an issue
💬 Start a discussion
⭐ Star the repository if you find it useful!

Developed by: Sujal Rajpoot
🎯 Full Stack Python Developer & AI Fine-Tuning Expert
🚀 Founder of TrueSyncAI - Custom AI Solutions for Everyone

Made with ❤️ for the ML/AI Community

