An intelligent natural language to SQL system that thinks, explores, validates, and recovers from mistakes — going beyond naive prompt-to-SQL approaches.
Takes natural language questions → Reasons about the database → Generates safe SQL → Returns human-readable answers with full transparency.
User: "Which artist has the most albums?"
System Reasoning:
├── Understanding: aggregation query, moderate complexity
├── Schema Retrieved (Vector Search): Artist, Album
├── Plan: JOIN Artist with Album, GROUP BY, ORDER BY COUNT DESC
├── Generated SQL: SELECT a.Name, COUNT(al.AlbumId)...
└── Execution: Success, 1 row
Answer: "Iron Maiden has the most albums with 21 albums."
| Feature | Description |
|---|---|
| LangGraph Agentic Workflow | State machine-based agent with conditional routing |
| RAG-based Schema Selection | ChromaDB + Google Embeddings for vector similarity search |
| LLM-based Meta Queries | Dynamically generates SQL for schema introspection questions |
| Irrelevant Query Detection | Politely rejects off-topic questions (greetings, general knowledge) |
| Self-Correction | Automatically retries with fixes when queries fail (up to 3 attempts) |
| Reasoning Trace | Shows every step of the decision-making process |
| Ambiguity Handling | Asks clarifying questions for vague queries |
| Data Exploration | Validates entities exist before querying |
| Safe Execution | Read-only queries only, with LIMIT protection |
| Auto Visualization | Generates Chart.js charts for suitable results |
| Streaming UI | Real-time step-by-step updates in the web interface |
The system uses LangGraph to orchestrate an agentic workflow:
┌─────────────────────────────────────────────────────────────┐
│ User Question │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Understanding Node │
│ (Intent, Complexity, Entities, Relevance Check) │
└─────────────────────────────────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
[Irrelevant] [Meta Query] [Normal Query]
│ │ │
▼ ▼ ▼
Reject Politely Handle Meta Get Schema (RAG/LLM)
│ │
▼ ▼
Generate SQL Explore Data
│ │
▼ ▼
Execute Generate Plan
│
▼
Generate SQL
│
▼
Execute & Validate ◄──┐
│ │
┌─────────┴─────────┐ │
▼ ▼ │
[Success] [Error]──┘
│ (Retry 3x)
▼
Generate Visualization
│
▼
Generate Answer
- Python 3.10+
- Google Gemini API key (Get one free)
# Clone the repository
git clone <your-repo-url>
cd nlptosql
# Install dependencies
pip install -r requirements.txt
# Set up environment (use GEMINI_API_KEY or GOOGLE_API_KEY)
echo "GEMINI_API_KEY=your_api_key_here" > .envWeb Interface (Recommended):
python src/server.py
# Open http://localhost:8000CLI Mode:
python main.py "Which artist has the most albums?"nlptosql/
├── main.py # CLI entry point
├── baseline.py # Naive approach for comparison
├── requirements.txt # Dependencies
├── Chinook_Sqlite.sqlite # Sample music database
├── src/
│ ├── server.py # FastAPI web server (streaming)
│ ├── schema.py # Schema management
│ ├── vector_store.py # ChromaDB RAG for schema selection
│ ├── examples_data.py # Few-shot examples for RAG
│ ├── validator.py # SQL validation
│ ├── meta_handler.py # LLM-based meta-query handling
│ ├── static/
│ │ └── index.html # Web UI with Chart.js
│ └── graph/ # LangGraph Agentic Workflow
│ ├── workflow.py # State machine definition
│ ├── nodes.py # Node implementations
│ └── state.py # State type definitions
└── test_suite.py # Test cases
- Replace
Chinook_Sqlite.sqlitewith your database - Update
DB_FILEinsrc/graph/nodes.py:DB_FILE = "your_database.sqlite"
- Delete the
chroma_db/folder (to re-index schema) - Restart the server — the system auto-detects table structure
Edit src/graph/nodes.py in the get_llm() function:
_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", temperature=0)| Type | Example |
|---|---|
| Simple | "How many tracks are there?" |
| Filtering | "Find all tracks longer than 5 minutes" |
| Joins | "List all tracks in the 'Rock' genre" |
| Aggregation | "Total revenue by country, sorted highest first" |
| Complex | "Customers who purchased both Rock and Jazz" |
| Meta | "Which tables have more than 5 columns?" |
| Ambiguous | "Show me the best artists" → asks for clarification |
| Irrelevant | "How are you doing?" → politely declined |
python test_suite.py- ✅ Read-only: Only SELECT queries allowed
- ✅ LIMIT protection: Auto-adds LIMIT 1000 to prevent runaway queries
- ✅ Validation: Checks for dangerous patterns before execution
- ✅ Error recovery: Graceful handling with up to 3 retry attempts
- ✅ Irrelevant query rejection: Won't hallucinate on off-topic questions
langgraph— Agentic workflow orchestrationlangchain-google-genai— Gemini LLM integrationchromadb— Vector store for RAGfastapi/uvicorn— Web server with streamingsqlparse— SQL validationpython-dotenv— Environment management
MIT License