ShreeBohara/codebaseqa

CodebaseQA

Understand any codebase in minutes, not days.

AI-powered codebase understanding, onboarding, and hands-on learning for developers.

Built with TypeScript, Python, Next.js, FastAPI, and OpenAI. License: MIT.


Why CodebaseQA?

CodebaseQA is built for the moment you open an unfamiliar repository and need answers fast.

It gives you:

  • Chat Q&A over real code context (RAG + source citations)
  • Learning paths tailored by persona
  • Interactive lessons with file-linked references and Mermaid diagrams
  • Quizzes and coding challenges (bug hunt, code trace, fill-in-the-blank)
  • Gamification (XP, levels, streaks, achievements, activity heatmap)
  • Full-workspace dependency graph visualization with adaptive module-first overview, scoped drill-down, and PNG export

Use it from the web UI or from the CLI, depending on your workflow.

90-Second Product Flow

  1. Add a GitHub repository and let CodebaseQA index it.
  2. Ask natural-language questions and get answers with source-backed citations.
  3. Generate a persona-based curriculum and open any lesson.
  4. Practice with quizzes/challenges and track progress with XP, streaks, and achievements.
  5. Explore system structure in the dependency graph and export lesson tours for VS Code.

Best For

  • New team members onboarding into large codebases
  • Engineering managers/leads accelerating ramp-up
  • Developers trying to understand architecture and key execution paths
  • Interview prep / self-learning on open-source repositories

Demo Video

Watch the 90-second demo on YouTube.

Architecture Diagram

[Figure: CodebaseQA architecture diagram]

Screenshots

Media pack guide: docs/media/README.md

The gallery below uses the 10 numbered screenshots in docs/media/screenshots.

Web App Flow

  1. Landing page hero
  2. Landing page feature section
  3. Repository import and indexing
  4. Chat home with starter prompts
  5. Chat answer with citations and code snippets
  6. Dependency graph overview
  7. Dependency graph deep inspection panel
  8. Learning role selection
  9. Full-stack learning track map
  10. Lesson workspace with practice tools


Features

| Feature | Description |
| --- | --- |
| Repository Indexing | Clone/index GitHub repos with progress states (pending, cloning, parsing, embedding, completed, failed) |
| RAG Chat | Streaming responses with query expansion, hybrid retrieval, and source snippets |
| Semantic Search | Vector + keyword hybrid search with language/file filters |
| Learning Personas | New Hire, Security Auditor, Full Stack Dev, Archaeologist |
| Lesson Generation | AI-generated lesson markdown, code references, optional Mermaid diagram |
| Quiz Generation | Lesson-based multiple-choice quizzes |
| Challenges | Bug Hunt, Code Trace, Fill-in-the-Blank generation + validation |
| Gamification | XP rewards, 6 levels, streak tracking, achievements, dashboard analytics |
| Dependency Graph | Full-workspace intelligent graph with adaptive module overview, focus mode, progressive edge reveal, deterministic extraction, and PNG export |
| CodeTour Export | Export lesson content as VS Code CodeTour (.tour) |
| CLI Tooling | Index, ask, search, list, lessons, and CodeTour export from terminal |
| Demo Bootstrap | Seed a demo repository via API/UI (/api/repos/demo/seed) |

Quick Start

Prerequisites

  • Docker Desktop (recommended for fastest setup)
  • Or, for local development: Node.js 20+, pnpm, and Python 3.11+
  • At least one supported LLM provider (OpenAI, Anthropic, or Ollama)

Docker Setup

git clone https://github.com/ShreeBohara/codebaseqa.git
cd codebaseqa
cp .env.example .env
# Edit .env and add provider credentials (typically OPENAI_API_KEY)

./scripts/start-docker.sh
# Optional demo seed:
# ./scripts/start-docker.sh --with-demo

Endpoints after startup:

  • Web UI: http://localhost:3000
  • API docs: http://localhost:8000/docs
  • Health check: http://localhost:8000/health
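
To script the wait for startup, one option is to poll the health endpoint. The sketch below is not part of CodebaseQA. It assumes that an HTTP 200 from /health means the API is ready (the response body is not documented here), and the helper names are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/health"  # from the endpoint list above


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(
    url: str = HEALTH_URL,
    attempts: int = 30,
    delay_seconds: float = 1.0,
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Poll url until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:  # assumption: 200 means "ready"
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling wait_until_healthy() with the defaults polls the local API about once per second, up to 30 times. The fetch parameter exists so the loop can be exercised without a running server.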

Local Development

# Install JS workspace deps once (from repo root)
pnpm install

# Terminal 1: API
cd apps/api
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn src.main:app --reload

# Terminal 2: Web
pnpm web:dev

# Optional (direct package command)
# cd apps/web
# pnpm dev

Frontend Troubleshooting

If the UI looks unstyled (plain text/stacked layout), clear Next.js build artifacts and restart:

rm -rf apps/web/.next
pnpm web:dev

Use pnpm web:dev as the canonical frontend start command.


Live Demo Mode

CodebaseQA supports a public-safe runtime mode for hosted demos.

  • Set DEMO_MODE=true to pin the deployment to one featured repository.
  • The default featured repository is vercel/nextjs-subscription-payments (configurable with the DEMO_REPO_* env vars).
  • In demo mode, repo import/delete can be disabled, and chat/learning endpoints apply soft rate limits.
  • The frontend automatically shows a demo banner and hides destructive repo actions.

Key env vars:

  • DEMO_MODE, SEED_DEMO
  • DEMO_REPO_URL, DEMO_REPO_OWNER, DEMO_REPO_NAME, DEMO_REPO_BRANCH
  • DEMO_ALLOW_PUBLIC_IMPORTS, DEMO_BANNER_TEXT, DEMO_BUSY_MODE
  • DEMO_RATE_LIMIT_* knobs for soft guardrails

For local validation:

pnpm web:dev
# in another terminal
cd apps/api && uvicorn src.main:app --reload

For production prewarm after first deploy:

pnpm demo:prewarm

Configuration

CodebaseQA reads settings from environment variables (via apps/api/src/config.py).

Core Settings

| Variable | Description | Default |
| --- | --- | --- |
| DATABASE_URL | SQLAlchemy DB URL | sqlite:///./data/codebaseqa.db |
| CHROMA_PERSIST_DIR | Chroma storage path | ./data/chroma |
| REPOS_DIR | Cloned repository cache path | ./data/repos |
| GITHUB_TOKEN | Needed for private repos / higher API limits | unset |
| MAX_FILES_PER_REPO | Indexing cap per repository | 5000 |
| MAX_FILE_SIZE_KB | Skip files larger than this | 500 |
| DEBUG | API debug mode | false |
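
As an illustration of how these variables and defaults fit together, here is a minimal sketch that reads them from the environment. It is not the code in apps/api/src/config.py, whose implementation is not shown here. The CoreSettings class and the accepted spellings for DEBUG are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CoreSettings:
    """Hypothetical container mirroring the documented defaults."""
    database_url: str = "sqlite:///./data/codebaseqa.db"
    chroma_persist_dir: str = "./data/chroma"
    repos_dir: str = "./data/repos"
    github_token: Optional[str] = None
    max_files_per_repo: int = 5000
    max_file_size_kb: int = 500
    debug: bool = False

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "CoreSettings":
        d = cls()  # instance holding the documented defaults
        return cls(
            database_url=env.get("DATABASE_URL", d.database_url),
            chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", d.chroma_persist_dir),
            repos_dir=env.get("REPOS_DIR", d.repos_dir),
            github_token=env.get("GITHUB_TOKEN") or None,  # empty string counts as unset
            max_files_per_repo=int(env.get("MAX_FILES_PER_REPO", d.max_files_per_repo)),
            max_file_size_kb=int(env.get("MAX_FILE_SIZE_KB", d.max_file_size_kb)),
            # Assumption: these spellings are treated as "true".
            debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        )
```

Passing a plain dict instead of os.environ, for example CoreSettings.from_env({"MAX_FILES_PER_REPO": "200"}), makes it easy to check which values an override would produce.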

LLM & Embeddings

| Variable | Description | Default |
| --- | --- | --- |
| LLM_PROVIDER | openai, anthropic, or ollama | openai |
| EMBEDDING_PROVIDER | openai or ollama | openai |
| OPENAI_API_KEY | OpenAI API key | unset |
| OPENAI_MODEL | OpenAI chat model | gpt-4o |
| OPENAI_EMBEDDING_MODEL | OpenAI embedding model | text-embedding-3-small |
| OPENAI_BASE_URL | OpenAI-compatible endpoint override | unset |
| OPENAI_EMBEDDING_MAX_TOKENS_PER_REQUEST | Max total tokens per embedding request batch | 250000 |
| OPENAI_EMBEDDING_MAX_TEXTS_PER_REQUEST | Max chunk count per embedding request batch | 128 |
| OPENAI_EMBEDDING_REQUEST_CONCURRENCY | Max concurrent embedding requests | 1 |
| OPENAI_EMBEDDING_MIN_SECONDS_BETWEEN_REQUESTS | Minimum delay between embedding requests | 0.0 |
| OPENAI_EMBEDDING_RATE_LIMIT_MAX_RETRIES | Retry attempts for embedding HTTP 429 responses | 6 |
| OPENAI_EMBEDDING_RATE_LIMIT_BASE_BACKOFF_SECONDS | Base seconds for exponential backoff on HTTP 429 | 1.0 |
| OPENAI_EMBEDDING_RATE_LIMIT_MAX_BACKOFF_SECONDS | Maximum wait per retry on HTTP 429 | 30.0 |
| ANTHROPIC_API_KEY | Anthropic API key | unset |
| ANTHROPIC_MODEL | Anthropic model | claude-sonnet-4-20250514 |
| OLLAMA_BASE_URL | Ollama host URL | http://localhost:11434 |
| OLLAMA_MODEL | Ollama generation model | llama3.1 |
| LOCAL_EMBEDDING_MODEL | Ollama embedding model name | nomic-ai/nomic-embed-text-v1.5 |
| LEARNING_V2_ENABLED | Enable Learning V2 syllabus/lesson cache pipeline | false |
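
The two batch limits (max tokens and max texts per request) interact: a batch closes as soon as either would be exceeded. The sketch below shows one way such batching could work. It is an illustration, not the project's implementation, and rough_token_count is a crude stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Iterator, List

MAX_TOKENS_PER_REQUEST = 250_000  # OPENAI_EMBEDDING_MAX_TOKENS_PER_REQUEST default
MAX_TEXTS_PER_REQUEST = 128       # OPENAI_EMBEDDING_MAX_TEXTS_PER_REQUEST default


def rough_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return max(1, len(text.split()))


def batch_texts(
    texts: Iterable[str],
    max_tokens: int = MAX_TOKENS_PER_REQUEST,
    max_texts: int = MAX_TEXTS_PER_REQUEST,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Iterator[List[str]]:
    """Group texts into batches that respect both limits."""
    batch: List[str] = []
    batch_tokens = 0
    for text in texts:
        tokens = count_tokens(text)
        # Close the current batch if adding this text would break either limit.
        if batch and (batch_tokens + tokens > max_tokens or len(batch) >= max_texts):
            yield batch
            batch, batch_tokens = [], 0
        # A single text larger than max_tokens still gets a batch of its own.
        batch.append(text)
        batch_tokens += tokens
    if batch:
        yield batch
```

With the defaults, 300 short chunks split into batches of 128, 128, and 44, because the text-count limit is reached long before the token limit.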

Notes:

  • Docker Compose currently passes only OPENAI_API_KEY by default; to use Anthropic or Ollama in Docker, add the corresponding env vars in docker/docker-compose.yml.
  • For local development, all variables above can be set directly in your shell or .env.
  • Large repositories can trigger temporary embedding rate limits (HTTP 429). Indexing retries automatically; tune the OPENAI_EMBEDDING_* controls above (batch size, pacing, and retry backoff) if needed.
  • VOYAGE_API_KEY exists in config for future provider support, but EMBEDDING_PROVIDER currently supports only openai and ollama.
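
To get a feel for the retry settings, the sketch below computes the wait times that plain capped exponential backoff would produce from the documented defaults. This is an assumption about the general technique; the actual indexing code may differ (for example by adding jitter). RateLimited and call_with_retries are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response."""


def backoff_schedule(
    max_retries: int = 6,        # OPENAI_EMBEDDING_RATE_LIMIT_MAX_RETRIES
    base_seconds: float = 1.0,   # ..._BASE_BACKOFF_SECONDS
    max_seconds: float = 30.0,   # ..._MAX_BACKOFF_SECONDS
) -> List[float]:
    """Wait before each retry: base * 2**attempt, capped at max_seconds."""
    return [min(base_seconds * (2 ** attempt), max_seconds) for attempt in range(max_retries)]


def call_with_retries(
    call: Callable[[], T],
    schedule: List[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call, sleeping through the schedule whenever it is rate limited."""
    for delay in schedule:
        try:
            return call()
        except RateLimited:
            sleep(delay)
    return call()  # final attempt; a RateLimited here propagates to the caller
```

With the defaults the schedule is 1, 2, 4, 8, 16, and 30 seconds, so a request that keeps hitting HTTP 429 waits about 61 seconds in total before giving up. Raising the base or the cap trades indexing speed for fewer failed batches.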

API Endpoints

Repository & Indexing

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/repos/ | Add repository and start background indexing |
| GET | /api/repos/ | List repositories |
| GET | /api/repos/{repo_id} | Get repository details |
| GET | /api/repos/{repo_id}/progress | Stream indexing progress (SSE) |
| DELETE | /api/repos/{repo_id} | Delete repository and indexed data |
| GET | /api/repos/{repo_id}/files/content | Fetch file content by path query param |
| POST | /api/repos/demo/seed | Seed demo repository |
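
The progress endpoint streams Server-Sent Events, so a client has to split the stream into events before it can read a status. The sketch below parses the standard SSE "data:" framing. The payload shape is an assumption: it supposes each event carries JSON with a "status" field holding one of the indexing states listed under Features. Check /docs for the real schema.

```python
import json
from typing import Iterable, Iterator, List, Optional

TERMINAL_STATES = {"completed", "failed"}  # final indexing states from the Features list


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a stream of SSE text lines."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line ends the current event
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data_parts.append(value[1:] if value.startswith(" ") else value)
    if data_parts:  # lenient: flush an unterminated final event
        yield "\n".join(data_parts)


def final_status(lines: Iterable[str], status_key: str = "status") -> Optional[str]:
    """Return the first terminal status seen (assumes JSON payloads, see above)."""
    for payload in iter_sse_data(lines):
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip non-JSON events
        if isinstance(event, dict) and event.get(status_key) in TERMINAL_STATES:
            return event[status_key]
    return None
```

In practice the lines would come from a streaming HTTP response to GET /api/repos/{repo_id}/progress, decoded to text one line at a time.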

Chat & Search

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/chat/sessions | Create chat session |
| GET | /api/chat/sessions/{session_id} | Get session + messages |
| POST | /api/chat/sessions/{session_id}/messages | Stream assistant response (SSE) |
| POST | /api/search/ | Hybrid semantic code search |
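
As a rough illustration of calling the search endpoint from a script, the sketch below builds a JSON POST request with the standard library. Only the method and path come from the table above. The body field names (repo_id, query) are assumptions, so confirm the real request schema at http://localhost:8000/docs before relying on them.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # default API address from Quick Start


def build_search_request(
    repo_id: str,
    query: str,
    base_url: str = API_BASE,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/search/ request."""
    body = {"repo_id": repo_id, "query": query}  # hypothetical field names
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/search/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the request. Keeping construction separate from sending makes the URL and payload easy to inspect first.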

Learning, Graph, Gamification, Challenges

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/learning/personas | List available personas |
| POST | /api/learning/{repo_id}/curriculum | Generate syllabus |
| POST | /api/learning/{repo_id}/lessons/{lesson_id} | Generate lesson content |
| POST | /api/learning/{repo_id}/lessons/{lesson_id}/quiz | Generate quiz |
| GET | /api/learning/{repo_id}/lessons/{lesson_id}/export/codetour | Export lesson as CodeTour |
| GET | /api/learning/{repo_id}/graph | Generate dependency graph |
| GET | /api/learning/{repo_id}/stats | User XP/level/streak stats |
| GET | /api/learning/{repo_id}/activity | Activity heatmap data |
| GET | /api/learning/{repo_id}/achievements | Achievement list + unlock status |
| GET | /api/learning/{repo_id}/progress | Completed lessons |
| POST | /api/learning/{repo_id}/lessons/{lesson_id}/complete | Mark lesson complete + award XP |
| POST | /api/learning/{repo_id}/lessons/{lesson_id}/quiz/result | Submit quiz result + award XP |
| POST | /api/learning/{repo_id}/challenges/complete | Record challenge completion + award XP |
| POST | /api/learning/{repo_id}/graph/viewed | Record graph view event |
| POST | /api/learning/{repo_id}/graph/nodes/viewed | Record unique graph node exploration events |
| POST | /api/learning/{repo_id}/lessons/{lesson_id}/challenge | Generate challenge |
| POST | /api/learning/{repo_id}/challenges/validate/bug_hunt | Validate bug hunt answer |
| POST | /api/learning/{repo_id}/challenges/validate/code_trace | Validate code trace answer |
| POST | /api/learning/{repo_id}/challenges/validate/fill_blank | Validate fill-blank answer |

Platform Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/platform/config | Runtime demo flags for frontend behavior |
| GET | /health | Service health + dependency checks |
| GET | /openapi.json | OpenAPI JSON |
| GET | /openapi.yaml | OpenAPI YAML |
| GET | /api/cache/stats | LLM cache statistics |

CLI Usage

Install:

cd cli
pip install -e .

Optional: set CODEBASEQA_API_URL if your API is not at http://localhost:8000.

Commands:

# Index repository
codebaseqa index https://github.com/expressjs/express

# List repositories
codebaseqa list

# Ask a question
codebaseqa ask <repo_id> "What is the main entry point?"

# Search code
codebaseqa search <repo_id> "authentication middleware"

# List generated lessons (default persona: new_hire)
codebaseqa lessons <repo_id>

# Export lesson as VS Code CodeTour
codebaseqa export-tour <repo_id> <lesson_id>
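
For readers who have not seen a CodeTour file, the sketch below builds the general shape a .tour file takes in the VS Code CodeTour extension: a JSON document with a title and a list of steps, each pointing at a file and line. The exact fields CodebaseQA exports are not shown here, and the file paths and text in the usage example are made up.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class TourStep:
    file: str         # path relative to the workspace root
    line: int         # line number the step is attached to
    description: str  # markdown text shown for the step


def build_tour(title: str, steps: List[TourStep]) -> str:
    """Serialize a tour as JSON in the general CodeTour layout."""
    tour = {"title": title, "steps": [asdict(step) for step in steps]}
    return json.dumps(tour, indent=2)


# Illustrative values only; not output from CodebaseQA.
example = build_tour(
    "Entry point walkthrough",
    [TourStep(file="src/main.py", line=1, description="The application starts here.")],
)
```

CodeTour typically picks up tour files from a .tours directory at the workspace root, so saving an exported lesson there should make it appear in the extension.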

Architecture (Monorepo)

codebaseqa/
├── apps/
│   ├── api/        # FastAPI backend (RAG, indexing, learning, gamification)
│   └── web/        # Next.js frontend UI
├── cli/            # Python CLI client
├── docker/         # Dockerfiles + compose + entrypoint
├── docs/           # Architecture and design notes
└── scripts/        # Local helper scripts

Backend highlights:

  • Tree-sitter semantic parsing for Python, JavaScript, TypeScript, Java, Go, Rust, C#, C++, Ruby
  • Hybrid retrieval (vector + keyword) and query expansion
  • LLM-based reranking for improved relevance
  • SQLite metadata + Chroma vector persistence
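
Hybrid retrieval has to merge two ranked lists, one from vector search and one from keyword search. How CodebaseQA combines them is not described here. The sketch below uses reciprocal rank fusion (RRF), a common technique, purely as an illustration. The file names in the usage example are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative rankings only.
vector_hits = ["auth.py", "db.py", "main.py"]
keyword_hits = ["main.py", "auth.py", "utils.py"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here fused begins with auth.py and main.py, since both appear in the two lists, followed by db.py and utils.py. A reranking step, such as the LLM-based one mentioned above, would then reorder the top of this fused list.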

Testing & Checks

Backend:

cd apps/api
# after activating your virtualenv
python -m pytest tests/unit tests/integration
ruff check src tests

Frontend:

cd apps/web
pnpm lint
pnpm type-check
pnpm test
pnpm build

Workspace shortcuts:

pnpm lint
pnpm test
pnpm type-check
pnpm web:build
pnpm web:verify-css

Current Limitations

  • Very large repositories can still be slow/expensive to index depending on provider/model choice, and may hit temporary embedding rate limits before retries succeed.
  • Lesson/challenge/graph generation quality depends on model capability and retrieved context.
  • Docker setup is optimized for OpenAI defaults unless extra provider vars are explicitly wired.

Author

Shree Bohara


License

MIT
