Benchmarks — orientation

This directory is the canonical home of the VectorCode benchmark suite. The published baseline numbers live in BASELINE.md (one level up from this directory, in the repo root) and the regression-gate baseline JSON files live in baseline/. The verification path is a single command:

bash scripts/verify-baseline.sh

For a deeper walkthrough of how to read the numbers and what they mean, see ../docs/benchmarks.md.

Layout

benchmarks/
├── README.md            ← you are here
├── CONTRIBUTING.md      ← how to add / change golden queries
├── corpus.toml          ← corpus definitions (mock-mini, mini, vscode)
├── baseline/            ← committed regression-gate JSON files
│   ├── SCHEMA.md        ← contract between baselines and the comparator
│   ├── baseline-mock-mini.json
│   ├── baseline-mock-mini-structural.json
│   └── baseline-store-mock-mini.json
└── queries/             ← golden query sets, one per baseline
    ├── mock-mini.toml
    ├── mock-mini-structural.toml
    ├── mini.toml
    └── mini_structural.toml

Corpora

Corpus	Source	Purpose
`mock-mini`	`tests/fixtures/mini/` (4 small files)	Public verify path (phase 4.1). No network, no Ollama, no model download.
`mini`	3 small GitHub repos (thiserror, defu, itsdangerous)	Larger integration smoke test. Requires network on first run.
`vscode`	`microsoft/vscode` sparse checkout	Scale benchmark (~15K files). Requires network.

The vscode.toml placeholder query file that used to live under queries/ has been removed (its grade = 0 entries could not gate any regression). The real vscode corpus + query set arrives in phase 4.4.
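To see why all-zero grades cannot gate anything, consider a recall-style metric. The sketch below assumes that a grade greater than 0 marks a document as relevant; the function name, file names, and metric definition are illustrative and not taken from the suite itself.

```python
def recall_at_k(ranked_ids: list[str], grades: dict[str, int], k: int) -> float:
    """Fraction of relevant documents (grade > 0) found in the top k results."""
    relevant = {doc_id for doc_id, grade in grades.items() if grade > 0}
    if not relevant:
        # No relevant documents: the metric carries no signal, so report 0.0.
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical placeholder judgments: every entry has grade 0.
placeholder_grades = {"a.ts": 0, "b.ts": 0, "c.ts": 0}

good_ranking = ["a.ts", "b.ts", "c.ts"]
bad_ranking = ["z.ts", "y.ts", "x.ts"]

print(recall_at_k(good_ranking, placeholder_grades, k=3))  # 0.0
print(recall_at_k(bad_ranking, placeholder_grades, k=3))   # 0.0
```

Both rankings score the same, so a baseline built on such a query file would pass no matter how badly retrieval degraded.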

Quick path

# Verify against the committed baselines (CI uses this).
bash scripts/verify-baseline.sh

# Run the benchmarks without comparing, capturing JSON for inspection.
bash scripts/run-benchmarks.sh
# → benchmarks/results/benchmark-mock-mini-ir-dense.json
# → benchmarks/results/benchmark-mock-mini-structural-dense.json
# → benchmarks/results/bench-store-mock-mini.json
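Conceptually, the verify step compares freshly measured metrics with a committed baseline and fails when a metric regresses. The following is a minimal sketch of that idea, not the actual comparator: the metric names, the flat dictionary shape, and the 5% relative tolerance are all assumptions, and the real contract is the one described in baseline/SCHEMA.md.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return one message per higher-is-better metric that dropped too far."""
    failures = []
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual is None:
            failures.append(f"{name}: missing from current run")
            continue
        # Allow a relative drop of up to `tolerance` before failing.
        floor = expected * (1.0 - tolerance)
        if actual < floor:
            failures.append(f"{name}: {actual:.3f} < allowed {floor:.3f}")
    return failures


# Hypothetical metric names and values, for illustration only.
baseline = {"recall_at_5": 0.80, "mrr": 0.60}
current = {"recall_at_5": 0.70, "mrr": 0.61}

for message in find_regressions(baseline, current):
    print(message)
# prints: recall_at_5: 0.700 < allowed 0.760
```

This sketch only handles higher-is-better metrics. Store metrics such as indexing time, RSS, and disk size are lower-is-better and would need the comparison inverted.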

Mock vs real

The mock-mini baselines are a smoke test, not a measure of real retrieval quality. The mock embedder produces deterministic but semantically meaningless vectors, so ranking scores on this corpus say nothing about relevance. The baselines are still useful for catching:

- Indexing pipeline regressions (a chunk that should appear in the index doesn't).
- Store performance regressions (indexing time, RSS, disk size).
- Schema drift in the comparator or the report format.
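As an illustration of what "deterministic but semantically random" means, here is a minimal sketch of a hash-based mock embedder. It is an assumption about how such an embedder could work, not the project's implementation; `mock_embed`, `cosine`, and the 8-dimensional default are hypothetical.

```python
import hashlib
import math
import struct


def mock_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical mock embedder: hash the text into a unit-length vector."""
    values: list[float] = []
    counter = 0
    while len(values) < dim:
        # Each SHA-256 digest yields 32 bytes, i.e. eight 32-bit integers.
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        for (word,) in struct.iter_unpack(">I", digest):
            # Map the unsigned integer into the range [-1.0, 1.0].
            values.append(word / 0xFFFFFFFF * 2.0 - 1.0)
        counter += 1
    values = values[:dim]
    norm = math.sqrt(sum(v * v for v in values)) or 1.0
    return [v / norm for v in values]


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


same = cosine(mock_embed("parse config file"), mock_embed("parse config file"))
related = cosine(mock_embed("parse config file"), mock_embed("read configuration"))
print(f"identical text: {same:.3f}")
print(f"related text:   {related:.3f}")
```

Identical input always yields the same vector, which keeps the baselines reproducible. The score for the related pair is an arbitrary value that does not reflect meaning, which is why these baselines cannot measure retrieval quality.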

For real IR-quality numbers (thiserror, defu, itsdangerous, vscode with a pinned model), see phase 4.4. The infrastructure added in 4.1 is designed to be model-agnostic so 4.4 needs no code changes.

Adding a baseline

1. Update or add a query file under queries/.
2. Run `bash scripts/run-benchmarks.sh` to capture the new JSON.
3. Edit the corresponding file under baseline/ so it carries just the metric values (see baseline/SCHEMA.md for the shape).
4. Re-run `bash scripts/verify-baseline.sh` to confirm the new baseline passes against itself.
5. Open a PR. CI will gate on the new baseline from then on.
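The edit that reduces a captured report to just its metric values can be scripted. The sketch below assumes a flat report with hypothetical field names (`recall_at_5`, `mrr`, `timestamp`, `host`); the real shape is whatever baseline/SCHEMA.md specifies, so treat this as a starting point only.

```python
import json


def to_baseline(report: dict, metric_keys: list[str]) -> dict:
    """Keep only the named metric values from a full benchmark report."""
    missing = [key for key in metric_keys if key not in report]
    if missing:
        raise KeyError(f"report lacks metrics: {missing}")
    return {key: report[key] for key in metric_keys}


# Hypothetical report shape; real field names come from baseline/SCHEMA.md.
report = {
    "recall_at_5": 0.8,
    "mrr": 0.6,
    "timestamp": "2024-01-01T00:00:00Z",
    "host": "ci-runner",
}

baseline = to_baseline(report, ["recall_at_5", "mrr"])
print(json.dumps(baseline, indent=2, sort_keys=True))
```

Dropping run-specific fields such as timestamps keeps the committed baseline stable across machines, so diffs in review show only metric changes.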

Adding a query

See CONTRIBUTING.md.

Academic Taxonomy & Roadmap

Our benchmarking phases follow established research terminology for LLM agents and retrieval-augmented generation (RAG) systems. Currently, the suite formally implements Phase 1, with subsequent phases under development:

Phase	Approximate technical name	What it measures	Status
1	Information Retrieval Benchmark (Retrieval Evaluation)	Retriever quality (Recall, Precision, etc.)	Implemented (Dense, Sparse, Hybrid, Graph)
2	End-to-End Agent Benchmark (Task-Oriented Agent Evaluation)	Efficiency and capability of the agent when solving problems with the tools.	Implemented (vectorcode corpus, SER=1.80, TER=1.29)
3	Context Efficiency Benchmark (Token Efficiency Evaluation)	Context cost and scalability of the RAG system.	Implemented (vectorcode corpus, SER=1.80, TER=1.29)
