
A modern CLI that adds natural language explanations to labeled datasets using Google Gemini. Primarily designed to generate explanations for the UStanceBR corpus, a collection of stance detection datasets composed of tweets annotated with "for" or "against" labels across multiple political targets.


Amorim33/label-explainer


🪄 Label Explainer CLI

A modern CLI that performs stance classification with natural language explanations using AI models (Google Gemini and OpenAI GPT).

This tool is primarily designed for the UStanceBR corpus, a collection of stance detection datasets composed of tweets annotated with "for" or "against" labels across multiple political targets. It performs two main tasks:

  1. Classification + Explanation: Classifies texts and provides natural language explanations
  2. Metrics Calculation: Compares LLM classifications with ground truth labels

📋 Features

  • Interactive CLI: User-friendly command-line interface with guided prompts
  • Stance Classification: Classifies texts as "for" or "against" with natural language explanations
  • Metrics Calculation: Accuracy, precision, recall, F1-score, and confusion matrix
  • Concurrent Processing: Processes texts in batches of 50 using Promise.all for speed
  • Multi-Model Support: Works with GPT-5, Gemini 2.0 Flash, Gemini 2.5 Pro, and Gemini 3 Flash Preview
  • Multi-Language: Explanations available in Portuguese, English, and Spanish
  • Excel Compatibility: Reads and writes Excel files with structured data

🚀 Getting Started

Prerequisites

  1. Install Bun (JavaScript runtime):

    curl -fsSL https://bun.sh/install | bash

    For more options, visit the Bun installation guide.

  2. Clone the Repository:

    git clone https://github.com/Amorim33/label-explainer.git
    cd label-explainer

  3. Install Dependencies:

    bun install

  4. Set Up API Keys: Create a .env file in the root directory:

    # For Google Gemini models:
    GOOGLE_GENERATIVE_AI_API_KEY=your_google_api_key_here
    
    # For OpenAI GPT models:
    OPENAI_API_KEY=your_openai_api_key_here
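The two keys above map to the two providers, so a run only needs the key for the model it uses. The following is a minimal sketch of a pre-flight check, not the CLI's actual code: the helper names `requiredKeyFor` and `assertApiKey`, and the rule that any model starting with `gpt-` belongs to OpenAI, are assumptions made for illustration.

```typescript
// Hypothetical helpers: names and the model-to-provider rule are assumptions.
type Env = Record<string, string | undefined>;

// Returns the name of the environment variable a given model needs.
function requiredKeyFor(model: string): string {
  return model.startsWith("gpt-")
    ? "OPENAI_API_KEY"
    : "GOOGLE_GENERATIVE_AI_API_KEY";
}

// Throws a descriptive error when the key for the chosen model is missing or blank.
function assertApiKey(model: string, env: Env): string {
  const name = requiredKeyFor(model);
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing ${name} in .env (required for ${model})`);
  }
  return value;
}
```

Bun loads .env files automatically, so passing `process.env` as the second argument should be enough when running under Bun.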

📂 Data Setup

Input Excel File Format

Your input Excel files should have the following structure:

  • Column A: Tweet/text to classify

The CLI will prompt you for the input file path when you run it.

For Metrics Calculation

When using the metrics command, you'll need two files:

  1. Generated file: Output from the classify command (text + LLM label)
  2. Original file: Ground truth labels (text in Column A, label in Column C)
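To compare the two files, each generated label has to be paired with its ground-truth label. The sketch below assumes rows are matched by the text in Column A and that both files have already been read into arrays of cells; the CLI may align rows differently (for example by position), and `Row`, `LabelPair`, and `pairLabels` are hypothetical names. Header rows are not handled here.

```typescript
// Hypothetical types: one array of cells per spreadsheet row.
type Row = (string | undefined)[];

interface LabelPair {
  text: string;
  predicted: string;
  actual: string;
}

// Normalizes a label cell so case and stray whitespace do not affect comparison.
const norm = (cell: string | undefined): string => (cell ?? "").trim().toLowerCase();

function pairLabels(generated: Row[], original: Row[]): LabelPair[] {
  // Index ground truth by text (Column A) -> label (Column C, index 2).
  const truth = new Map<string, string>();
  for (const row of original) {
    const text = (row[0] ?? "").trim();
    if (text !== "") truth.set(text, norm(row[2]));
  }

  const pairs: LabelPair[] = [];
  for (const row of generated) {
    const text = (row[0] ?? "").trim();
    const actual = truth.get(text);
    // Texts without a ground-truth match are skipped (an assumption of this sketch).
    if (actual === undefined) continue;
    // Generated label sits in Column B (index 1).
    pairs.push({ text, predicted: norm(row[1]), actual });
  }
  return pairs;
}
```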

🎯 Running the Tool

Interactive CLI

Start the interactive CLI:

bun run src/cli.ts

The CLI provides two main commands:

1. Classify

Run stance classification on a dataset with the following interactive prompts:

  • Model selection: Choose from GPT-5, Gemini 2.0 Flash, Gemini 2.5 Pro, or Gemini 3 Flash Preview
  • Target selection: Select the classification target (bolsonaro, cloroquina, coronavac, globo, igreja, lula)
  • Language: Choose explanation language (Portuguese, English, Spanish)
  • Input file: Path to your Excel file with texts to classify
  • Output file: Path for the output Excel file with classifications and explanations

The classifier processes texts in batches of 50, running the requests within each batch concurrently with Promise.all to reduce total run time.
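The batching pattern can be sketched as follows. This is an illustration under assumptions, not the project's code: `processInBatches` is a hypothetical helper, and `worker` stands in for whatever function calls the LLM for a single text.

```typescript
// Hypothetical helper; `worker` stands in for the real per-text LLM call.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // All calls in a batch run concurrently; the next batch waits for this one.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
  }
  return results;
}
```

Promise.all preserves input order, so results line up with the input texts. It also rejects as soon as one call fails, so a real run would likely want retries or Promise.allSettled to avoid losing a whole batch to one error.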

2. Metrics

Calculates classification metrics by comparing generated labels with ground truth:

  • Overall accuracy, plus precision, recall, and F1-score per class
  • Confusion Matrix visualization
  • Macro-averaged metrics
  • Optional JSON export for detailed analysis
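For reference, the metrics listed above can be computed from a confusion matrix as in the sketch below. It uses the standard definitions; the function name `computeMetrics` and the return shape are assumptions and may not match what the CLI prints or exports.

```typescript
type Label = "for" | "against";
const LABELS: Label[] = ["for", "against"];

interface ClassMetrics { precision: number; recall: number; f1: number; }

// Returns 0 instead of NaN when a denominator is zero.
const ratio = (num: number, den: number): number => (den === 0 ? 0 : num / den);

function computeMetrics(pairs: { actual: Label; predicted: Label }[]) {
  // matrix[a][p] counts texts whose true label is a and predicted label is p.
  const matrix: Record<Label, Record<Label, number>> = {
    for: { for: 0, against: 0 },
    against: { for: 0, against: 0 },
  };
  for (const { actual, predicted } of pairs) matrix[actual][predicted] += 1;

  const perClass = {} as Record<Label, ClassMetrics>;
  for (const label of LABELS) {
    const tp = matrix[label][label];
    const predictedAsLabel = LABELS.reduce((sum, a) => sum + matrix[a][label], 0);
    const actuallyLabel = LABELS.reduce((sum, p) => sum + matrix[label][p], 0);
    const precision = ratio(tp, predictedAsLabel);
    const recall = ratio(tp, actuallyLabel);
    const f1 = ratio(2 * precision * recall, precision + recall);
    perClass[label] = { precision, recall, f1 };
  }

  const correct = LABELS.reduce((sum, l) => sum + matrix[l][l], 0);
  // Macro average: unweighted mean of the per-class F1 scores.
  const macroF1 = LABELS.reduce((sum, l) => sum + perClass[l].f1, 0) / LABELS.length;
  return { accuracy: ratio(correct, pairs.length), perClass, macroF1, matrix };
}
```

Because the macro average weights both classes equally, it can differ noticeably from accuracy when the "for" and "against" classes are imbalanced.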

Available Models

  • gpt-5 - OpenAI's latest model
  • gemini-2.0-flash - Fast and efficient
  • gemini-2.5-pro - More accurate but slower
  • gemini-3-flash-preview - Google's latest flash model (preview)

Available Targets

Default targets for UStanceBR corpus:

  • bolsonaro
  • cloroquina
  • coronavac
  • globo
  • igreja
  • lula

🔄 Processing Workflow

The CLI classifier performs the following steps:

  1. Load Data: Reads texts from the input Excel file
  2. Batch Processing: Processes texts in batches of 50 concurrently
  3. Classify with LLM: Uses AI to classify each text's stance (for/against)
  4. Generate Explanations: Provides natural language explanations for each classification
  5. Save Results: Outputs Excel file with texts, labels, and explanations
  6. Show Summary: Displays classification distribution statistics
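The final summary step amounts to counting labels. A minimal sketch, assuming the summary reports a count and a percentage per label (the exact output of the CLI is not documented here, and `summarize` is a hypothetical name):

```typescript
interface LabelShare { count: number; percent: number; }

// Counts each label and reports its share of the total, rounded to one decimal.
function summarize(labels: string[]): Record<string, LabelShare> {
  const counts = new Map<string, number>();
  for (const label of labels) {
    counts.set(label, (counts.get(label) ?? 0) + 1);
  }

  const summary: Record<string, LabelShare> = {};
  counts.forEach((count, label) => {
    const percent = Math.round((count / labels.length) * 1000) / 10;
    summary[label] = { count, percent };
  });
  return summary;
}
```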

📁 Project Structure

label-explainer/
├── src/
│   ├── prompts/           # AI prompt templates
│   │   └── classification.ts # Prompt for classifying and explaining
│   ├── services/          # Core services
│   │   └── excel.ts       # Excel file operations
│   ├── utils/             # Utility functions
│   │   ├── common.ts      # Common utilities
│   │   └── models.ts      # AI model configurations
│   └── cli.ts            # Interactive CLI (classify & metrics)
└── README.md

📊 Output Format

The CLI generates Excel files with the following columns:

| Column | Content                           |
|--------|-----------------------------------|
| A      | Original text                     |
| B      | LLM-generated label (for/against) |
| C      | LLM explanation                   |

Output file names reflect your choices, such as target and model (e.g., output-bolsonaro-gpt-5.xlsx).
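A small sketch of how one classification could map onto the three output columns, and how a file name like the example could be built. The `Classification` shape, both function names, and the `output-<target>-<model>.xlsx` pattern are assumptions inferred from the example name, not confirmed behavior.

```typescript
// Hypothetical shape of one classified text.
interface Classification {
  text: string;
  label: "for" | "against";
  explanation: string;
}

// Column order follows the layout above: A = text, B = label, C = explanation.
function toOutputRow(c: Classification): [string, string, string] {
  return [c.text, c.label, c.explanation];
}

// Assumed naming pattern, inferred from the example output-bolsonaro-gpt-5.xlsx.
function outputFileName(target: string, model: string): string {
  return `output-${target}-${model}.xlsx`;
}
```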

Adding New Models

  1. Update src/utils/models.ts with your model configuration
  2. Add the model to the ModelType type definition
  3. Add the model to VALID_MODELS in src/cli.ts
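The sketch below shows one way the model list and its type could fit together. The names `ModelType` and `VALID_MODELS` come from the steps above, but their real definitions are not shown in this README, so the shapes here are assumptions, and `my-new-model` is a hypothetical entry. Note that this version derives the type from the array, whereas the project keeps the two in separate files.

```typescript
// Assumed shape: the real definitions in src/utils/models.ts and src/cli.ts may differ.
const VALID_MODELS = [
  "gpt-5",
  "gemini-2.0-flash",
  "gemini-2.5-pro",
  "gemini-3-flash-preview",
  "my-new-model", // hypothetical new entry
] as const;

// Deriving the union type from the array keeps the two in sync.
type ModelType = (typeof VALID_MODELS)[number];

// Type guard for validating user input before it reaches the model layer.
function isValidModel(value: string): value is ModelType {
  return (VALID_MODELS as readonly string[]).indexOf(value) !== -1;
}
```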

Customizing Prompts

The classification prompt is stored in src/prompts/classification.ts. You can customize:

  • Classification criteria for "for" and "against" labels
  • Implicit stance detection rules
  • Response format and language
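To make the three customization points concrete, here is a hypothetical prompt builder. The wording, the JSON response format, and the function name `buildClassificationPrompt` are all invented for illustration; the actual prompt in src/prompts/classification.ts is not reproduced in this README and may look quite different.

```typescript
type Language = "Portuguese" | "English" | "Spanish";

// Hypothetical template; the real prompt lives in src/prompts/classification.ts.
function buildClassificationPrompt(
  text: string,
  target: string,
  language: Language,
): string {
  return [
    // Classification criteria for each label.
    `Classify the stance of the text below toward the target "${target}".`,
    `Answer "for" if the author supports the target.`,
    `Answer "against" if the author opposes the target.`,
    // Implicit stance rule: irony or indirect criticism still counts as a stance.
    `If the stance is only implied, infer it from tone and context.`,
    // Response format and language.
    `Respond as JSON: {"label": "for" | "against", "explanation": string}.`,
    `Write the explanation in ${language}.`,
    "",
    `Text: ${text}`,
  ].join("\n");
}
```

Keeping each concern on its own line makes it easier to change one rule, such as the response language, without touching the others.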

📝 License

MIT — do what you want, just give credit ✨

🙏 Acknowledgments

Built for processing the UStanceBR corpus and designed to be extensible for other NLP tasks.
