sujalrajpoot/LLM-Finetuning-Dataset-Generator

🤖 LLM Finetuning Dataset Generator

An automated tool that generates high-quality outputs for LLM finetuning datasets through API-key-based model APIs.

🌟 Overview

The LLM Finetuning Dataset Generator automates the generation of output responses for machine learning datasets. It calls API-key-based model APIs and fills in each empty output according to your custom system prompt.

Key Highlights

✨ Smart Skip Feature - Automatically skips rows that already have outputs
⚡ High Performance - Generates 500+ rows per run
🎯 Custom Prompts - Supports custom system prompts
🔄 Batch Processing - Processes multiple dataset files simultaneously

🚀 Features

  • 🤖 Automated Output Generation - Generate responses using advanced LLM models
  • 📊 Batch Processing - Handle multiple dataset files in one go
  • 🔍 Intelligent Skipping - Skip rows with existing outputs automatically
  • ⚙️ Custom System Prompts - Use your own system prompts for generation
  • 📈 Scalable - Generates ~430K rows per day on a single device with a free Nvidia API key
  • 💻 Multi-Device Compatible - Also runs in the Pydroid3 mobile application
  • 💾 JSON Support - Works with the standard JSON dataset format
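
To put the throughput figures above in perspective, here is a small back-of-the-envelope calculation. It only rearranges the two numbers quoted in this README (~430K rows per day, 500+ rows per run); treating a run as exactly 500 rows is an assumption made for illustration, not a measured value.

```python
# Rough arithmetic on the figures quoted above.
# Assumption: one "run" yields exactly 500 rows (the README says 500+).
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = 430_000
rows_per_run = 500

rows_per_second = rows_per_day / SECONDS_PER_DAY
runs_per_day = rows_per_day / rows_per_run
seconds_per_run = SECONDS_PER_DAY / runs_per_day

print(f"{rows_per_second:.2f} rows/s, {runs_per_day:.0f} runs/day, {seconds_per_run:.0f} s/run")
# prints "4.98 rows/s, 860 runs/day, 100 s/run"
```

Under that assumption, the daily figure works out to about five rows per second sustained around the clock.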

📦 Installation

  1. Clone the repository
git clone https://github.com/sujalrajpoot/LLM-Finetuning-Dataset-Generator.git
cd LLM-Finetuning-Dataset-Generator
  2. Install dependencies
pip install -r requirements.txt
  3. Set up your dataset files
    • Place all dataset files in the Dataset-Files/ directory
    • Ensure they follow the required JSON format (see below)

🎯 Usage

  1. Prepare your datasets in the required JSON format
  2. Place them in the Dataset-Files/ folder
  3. Run the generator:
    python concurrently_main.py

The tool will automatically:

  • Process all JSON files in the directory
  • Generate outputs for empty fields
  • Skip rows with existing outputs
  • Save the updated datasets

๐Ÿ“ Dataset Format

Your dataset files must follow this exact JSON structure:

[
    {
        "instruction": "Summarize the given text into one sentence.",
        "input": "Artificial Intelligence is transforming industries by automating tasks, enhancing decision-making, and improving user experiences across sectors like healthcare, finance, and education.",
        "output": ""
    },
    {
        "instruction": "Translate the following English sentence into French.",
        "input": "The weather is beautiful today.",
        "output": ""
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Generate three creative business name ideas for a coffee shop.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Classify the sentiment of the given text as Positive, Negative, or Neutral.",
        "input": "I really love the new phone update; it runs faster and looks amazing!",
        "output": ""
    },
    {
        "instruction": "Write a short poem about the sunset.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Explain the concept of machine learning in simple terms.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Convert the following temperature from Celsius to Fahrenheit.",
        "input": "25°C",
        "output": ""
    },
    {
        "instruction": "Write a SQL query to select all users who registered in the last 30 days.",
        "input": "",
        "output": ""
    },
    {
        "instruction": "Create a short ad copy for an eco-friendly water bottle brand.",
        "input": "",
        "output": ""
    }
]

Required Fields

| Field | Description | Required |
|-------|-------------|----------|
| instruction | The task or prompt for the model | ✅ Yes |
| input | The input data/context | ✅ Optional (needed only when the task depends on context) |
| output | The generated response (leave empty for generation) | ✅ Yes |
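
Before a run, it can help to check that a file really matches this structure. The following is a minimal sketch, not part of the tool: `validate_rows` and `REQUIRED_KEYS` are hypothetical names, and requiring all three keys to be present as strings is an assumption drawn from the example dataset above.

```python
import json

# Assumption: every row carries all three keys as strings, as in the example above.
REQUIRED_KEYS = ("instruction", "input", "output")


def validate_rows(rows):
    """Return a list of (row_index, problem) tuples; an empty list means no problems found."""
    if not isinstance(rows, list):
        return [(-1, "top-level JSON value must be a list")]
    problems = []
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append((index, "row is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if key not in row:
                problems.append((index, f"missing key: {key}"))
            elif not isinstance(row[key], str):
                problems.append((index, f"{key} must be a string"))
        # An instruction that is present but blank gives the model nothing to do.
        if isinstance(row.get("instruction"), str) and not row["instruction"].strip():
            problems.append((index, "instruction is empty"))
    return problems


sample = json.loads(
    '[{"instruction": "Write a short poem about the sunset.", "input": "", "output": ""},'
    ' {"instruction": "", "input": ""}]'
)
problems = validate_rows(sample)
print(problems)
```

In this sample the first row passes, while the second is reported twice: once for the missing `output` key and once for the empty instruction.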

๐Ÿ“ Directory Structure

๐Ÿ“ LLM-Finetuning-Dataset-Generator/
โ”œโ”€โ”€ ๐Ÿ“ Config/
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ config.py
โ”œโ”€โ”€ ๐Ÿ“ Providers/
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ DeepInfra.py
โ”‚   โ”œโ”€โ”€ ๐Ÿ“„ Nvidia.py
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ __init__.py
โ”œโ”€โ”€ ๐Ÿ“ dataset_files/
|   โ”œโ”€โ”€ ๐Ÿ“„ dataset_1.json
โ”‚   โ””โ”€โ”€ ๐Ÿ“„ dataset_2.json
โ”œโ”€โ”€ ๐Ÿ“„ requirements.txt
โ””โ”€โ”€ ๐Ÿ“„ main.py

โš™๏ธ How It Works

  1. ๐Ÿ“‚ File Detection - Scans the Dataset-Files/ directory for JSON files
  2. ๐Ÿ” Row Analysis - Checks each row for empty output fields
  3. ๐Ÿค– API Communication - Uses API Key based model APIs with custom system prompts
  4. โœ๏ธ Output Generation - Generates high-quality responses for empty fields
  5. โญ๏ธ Smart Skipping - Automatically skips rows with existing outputs
  6. ๐Ÿ’พ Auto-Save - Saves updated datasets with generated outputs
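
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: `needs_output`, `fill_outputs`, `process_directory`, and `stub_generate` are hypothetical names, and `stub_generate` stands in for the real provider call so the example runs without an API key. The demo writes to a temporary directory rather than Dataset-Files/.

```python
import json
import tempfile
from pathlib import Path


def needs_output(row):
    """A row needs generation when its output is missing or blank."""
    return not str(row.get("output") or "").strip()


def fill_outputs(rows, generate):
    """Fill empty outputs in place and return (generated_count, skipped_count)."""
    generated = skipped = 0
    for row in rows:
        if needs_output(row):
            row["output"] = generate(row["instruction"], row.get("input", ""))
            generated += 1
        else:
            skipped += 1  # smart skipping: existing outputs are left untouched
    return generated, skipped


def process_directory(directory, generate):
    """Process every *.json file in the directory and save it back in place."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        rows = json.loads(path.read_text(encoding="utf-8"))
        summary[path.name] = fill_outputs(rows, generate)
        path.write_text(json.dumps(rows, ensure_ascii=False, indent=4), encoding="utf-8")
    return summary


def stub_generate(instruction, input_text):
    """Placeholder for a model API call (hypothetical); returns a marker string."""
    return f"[stub output for: {instruction}]"


# Demo on a throwaway directory with one empty and one already-filled row.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "dataset_1.json"
    demo_rows = [
        {"instruction": "Write a short poem about the sunset.", "input": "", "output": ""},
        {"instruction": "Translate the following English sentence into French.",
         "input": "The weather is beautiful today.", "output": "Il fait beau aujourd'hui."},
    ]
    demo_file.write_text(json.dumps(demo_rows), encoding="utf-8")
    summary = process_directory(tmp, stub_generate)
    saved_rows = json.loads(demo_file.read_text(encoding="utf-8"))

print(summary)
```

Passing the generator in as a function keeps the skip-and-save logic independent of any particular provider, which is one way to swap a real API client in for the stub.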

🔮 Future Improvements

We're constantly working to make this tool better! Here's what's on the roadmap:

  • 🌐 More LLM Providers - Add support for additional free LLM model providers
  • 📊 Enhanced Scalability - Improve dataset generation capacity for large datasets
  • 🛠️ Better Error Handling - Implement advanced troubleshooting features
  • 📈 Progress Tracking - Add real-time progress bars and statistics
  • 🔧 Configuration File - Support for external configuration files
  • 🎨 GUI Interface - Optional graphical user interface
  • 📝 Logging System - Comprehensive logging for debugging

๐Ÿค Contributing

Contributions are welcome! This is an open-source project aimed at improving dataset generation efficiency.

How to Contribute

  1. ๐Ÿด Fork the repository
  2. ๐ŸŒฟ Create a feature branch (git checkout -b feature/AmazingFeature)
  3. ๐Ÿ’ป Commit your changes (git commit -m 'Add some AmazingFeature')
  4. ๐Ÿ“ค Push to the branch (git push origin feature/AmazingFeature)
  5. ๐Ÿ”ƒ Open a Pull Request

Contribution Guidelines

  • Write clean, documented code
  • Test your changes thoroughly
  • Update documentation as needed
  • Follow existing code style
  • Be respectful and constructive

โš ๏ธ Disclaimer

This tool is intended for educational purposes only - to automate the dataset generation process and enhance productivity in machine learning workflows.

  • 📚 Educational Use - Designed for learning and research
  • 🚫 No Malicious Intent - Not intended to harm any API or organization
  • ⚖️ Responsible Use - Users are responsible for compliance with API terms of service
  • 🔒 Respect Rate Limits - Use responsibly and respect API rate limits
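
One simple way to stay under a provider's rate limit is to space requests evenly. The sketch below is an assumption-laden illustration, not the tool's own mechanism: `Throttle` is a hypothetical helper, and the limit of 40 requests per minute is an arbitrary example value, so check your provider's documented limit. The demo injects a fake clock so it finishes instantly; in real use the defaults (`time.monotonic` and `time.sleep`) apply.

```python
import time


class Throttle:
    """Spaces calls so that at most max_per_minute happen per minute (hypothetical helper)."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self):
        """Block until the next call is allowed; return the number of seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


# Demo with a fake clock that advances only when "sleeping".
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = Throttle(max_per_minute=40, clock=lambda: fake_now[0], sleep=fake_sleep)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # prints [0.0, 1.5, 1.5]
```

Calling `throttle.wait()` immediately before each API request would keep requests at least 1.5 seconds apart at this example limit.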

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Thanks to all contributors who help improve this tool
  • Inspired by the need for efficient dataset preparation in ML/AI workflows
  • Built with โค๏ธ for the open-source community

📧 Contact

Have questions or suggestions? Feel free to:

  • 🐛 Open an issue
  • 💬 Start a discussion
  • ⭐ Star the repository if you find it useful!

Developed by: Sujal Rajpoot
🎯 Full Stack Python Developer & AI Fine-Tuning Expert
🚀 Founder of TrueSyncAI: Custom AI Solutions for Everyone

Made with ❤️ for the ML/AI Community

⭐ Star this repository if you find it helpful!
