jameswei/tiny-rag-lab

tiny-rag-lab

tiny-rag-lab is a learning-first RAG engine/laboratory for understanding how classic retrieval-augmented generation works end to end.

The goal is to keep the RAG lifecycle visible: document loading, text normalization, chunking, metadata, embeddings, local vector search, retrieval, prompt assembly, answer generation, citations, evaluation, and failure inspection.

Current Status

Phase 1 through Phase 2.0 are complete. No phase is currently active.

  • Phase 1 — Naive Classic RAG: full pipeline from corpus to grounded answers with citations
  • Phase 1.5 — Retrieval Mechanics: BM25 keyword retrieval, hybrid retrieval, and retriever comparison flags
  • Phase 1.6 — Evaluation Harness: retrieval quality metrics (rag eval) against a prepared QA set
  • Phase 1.7 — Observability And Debugging: retrieve/ask traces, stage latency, and optional JSON trace output
  • Phase 1.8 — RAG Failure Lab: curated failure cases and rag diagnose for baseline vs. intervention retrieval
  • Phase 1.9 — Reranking: fake and cross-encoder reranker interfaces with retrieve/eval/ask/diagnose integration
  • Phase 2.0 — Answer Quality Judging: fake and OpenAI-compatible judge paths for answer metrics and answer-side failure diagnosis

Completed phase contracts:

Phase 1 Result

Phase 1 delivers a minimal but complete CLI-first RAG baseline:

local corpus -> documents -> normalized text -> chunks -> embeddings
-> local vector index -> query embedding -> cosine retrieval
-> grounded prompt -> generated answer with citations
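The chunking step in the pipeline above can be sketched as a sliding window. This is a minimal illustration, not the project's implementation: `chunk_text` is a hypothetical name, and it assumes that chunk size and overlap are measured in characters, which this README does not confirm. The defaults mirror the `--chunk-size 800 --chunk-overlap 120` values shown in the CLI section.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: sizes are assumed to be character counts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(piece)
```

The overlap means the last characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still fully visible in at least one chunk.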

Key decisions:

  • Python implementation
  • argparse CLI
  • primary corpus: IBM watsonxDocsQA
  • local embeddings: sentence-transformers/all-MiniLM-L6-v2
  • OpenAI-compatible online generation for real answers
  • fake embedder and fake generator for tests
  • local index files under .tiny-rag/index/
  • no vector database in Phase 1
  • no LangChain/LlamaIndex/Haystack wrapper in Phase 1
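Without a vector database, cosine retrieval can be a brute-force scan over the stored vectors. The sketch below shows that idea with the standard library only. The function names and the in-memory `dict` index are assumptions for illustration; the actual layout of the files under `.tiny-rag/index/` is not described here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: list[float],
    index: dict[str, list[float]],  # hypothetical: chunk id -> embedding
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the best top_k."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(retrieve_top_k([1.0, 0.0], toy_index, top_k=2))
```

A linear scan like this costs one similarity computation per chunk per query, which is usually acceptable for a small learning corpus and keeps every score easy to inspect.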

Phase 1.5 Result

Phase 1.5 adds inspectable retrieval strategies to compare dense vector search, BM25 keyword search, and hybrid retrieval with Reciprocal Rank Fusion.

query + index -> dense retrieval | BM25 retrieval -> optional RRF fusion
-> ranked chunks and eval reports tagged with retriever=dense|bm25|hybrid
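Reciprocal Rank Fusion combines rankings using only rank positions, so dense cosine scores and BM25 scores never need to be put on the same scale. A minimal sketch follows, assuming the commonly used smoothing constant `k = 60`; the constant and tie-breaking rule used by this project are not stated here, and `reciprocal_rank_fusion` is a hypothetical name.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,  # assumed smoothing constant, not taken from the project
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk ids: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by chunk id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    dense = ["c1", "c2", "c3"]
    bm25 = ["c3", "c1", "c4"]
    for chunk_id, score in reciprocal_rank_fusion([dense, bm25]):
        print(chunk_id, round(score, 5))
```

Chunks that appear in both lists accumulate two terms, which is why `c1` and `c3` rise above chunks that only one retriever found.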

Phase 1.6 Result

Phase 1.6 adds a rag eval command that measures retrieval quality against the prepared qa.jsonl evaluation set. Four deterministic metrics are reported: hit rate @ k, MRR, context precision, and context recall.

qa.jsonl + index -> embed questions -> retrieve top-k -> compare to gold docs
-> hit rate, MRR, context precision, context recall
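The four metrics can be sketched per question and then averaged. The definitions below are common document-level formulations and are assumptions, since the exact definitions used by `rag eval` are not spelled out here; `score_question` and `aggregate` are hypothetical names.

```python
def score_question(retrieved: list[str], gold: set[str]) -> dict[str, float]:
    """Score one question given retrieved doc ids (best first) and gold doc ids."""
    flags = [doc_id in gold for doc_id in retrieved]

    reciprocal_rank = 0.0
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            reciprocal_rank = 1.0 / rank  # only the first relevant hit counts
            break

    return {
        "hit": 1.0 if any(flags) else 0.0,
        "reciprocal_rank": reciprocal_rank,
        # Share of retrieved items that are relevant.
        "context_precision": sum(flags) / len(retrieved) if retrieved else 0.0,
        # Share of gold documents that were retrieved.
        "context_recall": len(set(retrieved) & gold) / len(gold) if gold else 0.0,
    }


def aggregate(per_question: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric; mean 'hit' is hit rate @ k, mean 'reciprocal_rank' is MRR."""
    if not per_question:
        return {}
    return {
        key: sum(scores[key] for scores in per_question) / len(per_question)
        for key in per_question[0]
    }


if __name__ == "__main__":
    rows = [
        score_question(["d2", "d1", "d3"], {"d1", "d4"}),
        score_question(["d9"], {"d1"}),
    ]
    print(aggregate(rows))
```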

Phase 1.7 Result

Phase 1.7 adds trace records and human-readable trace output for retrieve and ask flows. Traces expose the retriever, top-k, ranked chunks, scores, citations, prompt/answer context, and stage latency.

query + retrieval/ask flow -> trace fields -> readable trace and optional JSON
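One way to collect stage latency is a small trace object with a context manager around each stage. This is a sketch under assumptions: the `Trace` class, its field names, and the millisecond unit are hypothetical and do not describe the project's actual trace schema.

```python
import json
import time
from contextlib import contextmanager


class Trace:
    """Collects trace fields and per-stage latency for one query (hypothetical schema)."""

    def __init__(self, query: str, retriever: str, top_k: int) -> None:
        self.fields: dict = {
            "query": query,
            "retriever": retriever,
            "top_k": top_k,
            "stages": [],
        }

    @contextmanager
    def stage(self, name: str):
        """Time the enclosed block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.fields["stages"].append({"name": name, "latency_ms": round(elapsed_ms, 3)})

    def to_json(self) -> str:
        return json.dumps(self.fields, indent=2)


if __name__ == "__main__":
    trace = Trace("question text", retriever="dense", top_k=5)
    with trace.stage("embed_query"):
        time.sleep(0.01)  # stand-in for real work
    with trace.stage("retrieve"):
        time.sleep(0.01)
    print(trace.to_json())
```

Recording the latency in a `finally` block keeps the timing in the trace when a stage fails, which is often the case worth inspecting.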

Phase 1.8 Result

Phase 1.8 adds a failure lab for curated retrieval failure scenarios. The rag diagnose command compares each case's baseline and intervention retrieval config, labels heuristic failure modes, and reports whether failures were confirmed, fixed, moved, or unchanged.

failure cases + index -> baseline retrieval + intervention retrieval
-> failure labels, metrics, and diagnosis report
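The comparison between baseline and intervention can be pictured as a set comparison over failure labels. The sketch below is one possible reading of "confirmed, fixed, moved, or unchanged" and is an assumption throughout: the function name, the label strings, and the exact rules are hypothetical, not the logic `rag diagnose` uses.

```python
def diagnose_outcome(
    expected_label: str,
    baseline_labels: set[str],
    intervention_labels: set[str],
) -> tuple[bool, str]:
    """Return (confirmed, outcome) for one failure case.

    Assumed rules, for illustration only:
      confirmed: the baseline shows the failure the case was curated for
      unchanged: the intervention shows the same labels as the baseline
      fixed:     the intervention shows no failure labels
      moved:     the intervention shows a different set of failure labels
    """
    confirmed = expected_label in baseline_labels
    if intervention_labels == baseline_labels:
        outcome = "unchanged"
    elif not intervention_labels:
        outcome = "fixed"
    else:
        outcome = "moved"
    return confirmed, outcome


if __name__ == "__main__":
    # Label names here are made up for the example.
    print(diagnose_outcome("gold_not_retrieved", {"gold_not_retrieved"}, set()))
    print(diagnose_outcome("gold_not_retrieved", {"gold_not_retrieved"}, {"gold_ranked_low"}))
```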

Phase 1.9 Result

Phase 1.9 adds a reranker abstraction and optional second-pass reranking for retrieve, eval, ask, and diagnose workflows. The default reranker, none, leaves existing behavior unchanged. Fake rerankers keep tests offline, and the cross-encoder path is loaded lazily and gated.

initial candidates -> optional reranker -> final top-k chunks
-> traces and reports with reranker metadata
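A reranker abstraction of this kind might look like the sketch below: one small interface, a pass-through implementation for the none path, and a deterministic fake for offline tests. All class and function names are hypothetical, and the fake's term-overlap scoring is an illustrative choice rather than the project's behavior.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    text: str
    score: float  # first-pass retrieval score


class Reranker(Protocol):
    name: str

    def rerank(self, query: str, candidates: list[Candidate]) -> list[Candidate]: ...


class NoneReranker:
    """Pass-through: keeps the first-pass order."""

    name = "none"

    def rerank(self, query: str, candidates: list[Candidate]) -> list[Candidate]:
        return list(candidates)


class FakeReranker:
    """Offline and deterministic: orders by how many query terms a chunk contains."""

    name = "fake"

    def rerank(self, query: str, candidates: list[Candidate]) -> list[Candidate]:
        terms = set(query.lower().split())

        def overlap(candidate: Candidate) -> int:
            return len(terms & set(candidate.text.lower().split()))

        # sorted() is stable, so ties keep their first-pass order.
        return sorted(candidates, key=overlap, reverse=True)


def final_top_k(query: str, candidates: list[Candidate], reranker: Reranker, top_k: int) -> list[Candidate]:
    """Apply the optional second pass, then cut to top_k."""
    return reranker.rerank(query, candidates)[:top_k]


if __name__ == "__main__":
    pool = [
        Candidate("a", "local vector index files", 0.9),
        Candidate("b", "bm25 keyword retrieval", 0.8),
    ]
    print([c.chunk_id for c in final_top_k("keyword retrieval", pool, FakeReranker(), 1)])
```

Retrieving a larger candidate pool than `top_k` before reranking gives the second pass something to reorder; cutting to `top_k` first would make reranking a no-op for the final set.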

Phase 2.0 Result

Phase 2.0 adds answer-quality judging behind a fakeable interface. rag eval can print retrieval metrics plus answer metrics, rag ask can include a judge verdict in the trace, and rag diagnose can cover answer-side failures such as unsupported answers and citation mismatches.

retrieved context + generated answer -> judge verdict
-> answer metrics, trace verdicts, and answer-side diagnosis
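A fakeable judge interface can be sketched as a protocol plus a rule-based fake that needs no network access. Everything here is an assumption for illustration: the `Verdict` fields, the `[chunk_id]` citation format, and the word-overlap support check are hypothetical and far cruder than an LLM judge.

```python
import re
from dataclasses import dataclass
from typing import Protocol

CITATION = re.compile(r"\[([^\[\]]+)\]")  # assumed citation format: [chunk_id]


@dataclass(frozen=True)
class Verdict:
    supported: bool        # every answer word appears in the retrieved context
    citations_valid: bool  # every cited id refers to a retrieved chunk


class Judge(Protocol):
    def judge(self, question: str, answer: str, context: dict[str, str]) -> Verdict: ...


class FakeJudge:
    """Deterministic stand-in for tests; context maps chunk id -> chunk text."""

    def judge(self, question: str, answer: str, context: dict[str, str]) -> Verdict:
        cited = set(CITATION.findall(answer))
        citations_valid = bool(cited) and cited <= set(context)

        answer_body = CITATION.sub(" ", answer)  # drop citation markers
        answer_words = set(re.findall(r"\w+", answer_body.lower()))
        context_words = set(re.findall(r"\w+", " ".join(context.values()).lower()))
        supported = bool(answer_words) and answer_words <= context_words

        return Verdict(supported=supported, citations_valid=citations_valid)


if __name__ == "__main__":
    ctx = {"c1": "chunks are stored locally"}
    print(FakeJudge().judge("Where are chunks stored?", "chunks are stored locally [c1]", ctx))
    print(FakeJudge().judge("Where are chunks stored?", "chunks live in FAISS [c9]", ctx))
```

The two failing conditions map onto the answer-side failures named above: `supported=False` stands for an unsupported answer, and `citations_valid=False` stands for a citation mismatch.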

CLI

rag index --corpus PATH --index-dir .tiny-rag/index --chunk-size 800 --chunk-overlap 120
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever dense
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever bm25
rag retrieve "question text" --index-dir .tiny-rag/index --top-k 5 --retriever hybrid
rag ask "question text" --index-dir .tiny-rag/index --top-k 5
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever dense
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever bm25
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --top-k 5 --retriever hybrid
rag eval --qa-file corpus/watsonx-docsqa/qa.jsonl --index-dir .tiny-rag/index --judge fake --generator fake
rag diagnose --cases-file tests/fixtures/failure/cases.jsonl --index-dir .tiny-rag/index
rag diagnose --cases-file tests/fixtures/failure/cases.jsonl --index-dir .tiny-rag/index --judge fake --generator fake
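Since the CLI is built on argparse, the `retrieve` subcommand above could be declared roughly as follows. This is a sketch, not the project's parser: the flag names come from the commands listed above, while the default values, help strings, and `build_parser` function are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the retrieve subcommand (defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="rag")
    subcommands = parser.add_subparsers(dest="command", required=True)

    retrieve = subcommands.add_parser("retrieve", help="retrieve ranked chunks for a question")
    retrieve.add_argument("question", help="question text")
    retrieve.add_argument("--index-dir", default=".tiny-rag/index")
    retrieve.add_argument("--top-k", type=int, default=5)
    retrieve.add_argument("--retriever", choices=["dense", "bm25", "hybrid"], default="dense")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["retrieve", "question text", "--retriever", "hybrid"])
    print(args.command, args.question, args.index_dir, args.top_k, args.retriever)
```

Using `choices` makes argparse reject an unknown retriever name before any index is loaded.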

Help is available for each command:

uv run rag --help
uv run rag index --help
uv run rag retrieve --help
uv run rag ask --help
uv run rag eval --help
uv run rag diagnose --help

Development

Install/sync dependencies:

uv sync --group dev

Run tests:

uv run pytest --tb=short -q

Prepare the primary corpus after dependencies are installed:

uv run python scripts/prepare_watsonx_docsqa.py --inspect
uv run python scripts/prepare_watsonx_docsqa.py --output-dir corpus/watsonx-docsqa

Generated corpora and indexes are intentionally ignored by git:

corpus/
.tiny-rag/

Docs

For implementation work, the phase spec and taskboard under docs/phases/ are the source of truth.
