VectorCode dense-only search quality baseline. Run against the curated
mini-corpus (3 repos: thiserror Rust, defu TypeScript, itsdangerous Python)
using the embeddinggemma:latest embedding model via Ollama on ARM (Apple Silicon).
Run Info
| Field | Value |
| --- | --- |
| Date | 2026-06-19 |
| Embedder | Ollama / embeddinggemma:latest (768d) |
| Platform | ARM (Apple Silicon) |
| Corpus | mini (thiserror + defu + itsdangerous) |
| Files indexed | 18 |
| Chunks | 83 |
| Queries | 15 |
| Duration | ~13.4s |
Aggregate Metrics
| Metric | Value |
| --- | --- |
| Recall@5 | 0.3000 |
| Recall@10 | 0.3000 |
| nDCG@10 | 0.2415 |
| MRR | 0.3500 |
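For readers unfamiliar with these metrics, the sketch below shows one common way to compute them for a single query. It assumes binary relevance (a result is either relevant or not) and a ranked list without duplicates; the function names are illustrative and are not taken from the VectorCode codebase, whose exact metric definitions may differ. The aggregate values in the table would then be the mean over all queries.

```rust
use std::collections::HashSet;

/// Recall@k: fraction of the relevant items that appear in the top-k results.
fn recall_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    hits as f64 / relevant.len() as f64
}

/// Reciprocal rank: 1 / (1-based position of the first relevant result),
/// or 0 when no relevant result is returned. MRR is its mean over queries.
fn reciprocal_rank(ranked: &[&str], relevant: &HashSet<&str>) -> f64 {
    ranked
        .iter()
        .position(|id| relevant.contains(id))
        .map_or(0.0, |idx| 1.0 / (idx as f64 + 1.0))
}

/// nDCG@k with binary gains: each relevant hit at 0-based index i
/// contributes 1 / log2(i + 2), normalized by the best achievable DCG.
fn ndcg_at_k(ranked: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .take(k)
        .enumerate()
        .filter(|(_, id)| relevant.contains(*id))
        .map(|(i, _)| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    let ideal_hits = relevant.len().min(k);
    let idcg: f64 = (0..ideal_hits)
        .map(|i| 1.0 / (i as f64 + 2.0).log2())
        .sum();
    if idcg == 0.0 {
        0.0
    } else {
        dcg / idcg
    }
}
```

Under these definitions, Recall@k depends only on which documents make it into the top k, while nDCG and MRR also depend on where they land within it.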
Per-Language Breakdown
| Language | Queries | R@5 > 0 | Best R@5 |
| --- | --- | --- | --- |
| Rust (thiserror) | 5 | 5/5 | 1.0000 |
| TypeScript (defu) | 5 | 0/5 | 0.0000 |
| Python (itsdangerous) | 5 | 1/5 | 0.5000 |
Notes
Dense search with embeddinggemma performs well on Rust code (100% hit rate
on thiserror queries), poorly on TypeScript (0%), and unevenly on Python (20%).
The zero TypeScript results suggest that either the embedding model struggles
with TypeScript semantic search or the defu corpus is too small (3 files).
This baseline will be used to measure improvement in Phases 1.3-1.6
(sparse search, RRF fusion, reranker).
Reproducibility
# Requires: Ollama running with embeddinggemma:latest
ollama pull embeddinggemma:latest
cargo run -- benchmark --corpus mini
Phase 1.3-1.4: Hybrid Search Baseline Verification
Verification that the dense-only baseline is preserved after adding sparse
search (FTS5) and RRF fusion. Default mode remains Dense, so the benchmark
code path is identical.
Run Info
| Field | Value |
| --- | --- |
| Date | 2026-06-19 |
| Embedder | Ollama / embeddinggemma:latest (768d) |
| Platform | ARM (Apple Silicon) |
| Corpus | mini (thiserror + defu + itsdangerous) |
| Files indexed | 18 |
| Chunks | 83 |
| Queries | 15 |
| Duration | 18.52s |
Aggregate Metrics: Dense Mode (after Phase 1.3-1.4)

| Metric | Phase 1.2 | Phase 1.3-1.4 | Delta | Verdict |
| --- | --- | --- | --- | --- |
| Recall@5 | 0.3000 | 0.3000 | ±0% | ✅ Preserved |
| Recall@10 | 0.3000 | 0.3000 | ±0% | ✅ Preserved |
| nDCG@10 | 0.2415 | 0.2947 | +22% | ✅ Improved (variance) |
| MRR | 0.3500 | 0.3667 | +4.8% | ✅ Improved (variance) |
New Capabilities Verified
| Mode | Command | Status |
| --- | --- | --- |
| Dense (default) | vectorcode search "query" | ✅ Unchanged |
| Sparse (FTS5) | vectorcode search --mode sparse "query" | ✅ BM25 lexical |
| Hybrid (RRF) | vectorcode search --mode hybrid "query" | ✅ Dense + Sparse fusion |
Implementation Summary
Schema: v2→v3 migration with chunks_fts FTS5 virtual table + triggers
Engine: SearchStrategy trait with DenseSearcher / SparseSearcher / HybridSearcher
Fusion: rrf_fuse() pure function with configurable K (default 60)
CLI: --mode dense|sparse|hybrid flag
Tests: 617 total (573 unit + 44 integration), all passing
Commits: 6 (5 features + 1 migration fix)
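
As an illustration of the fusion step, here is a minimal sketch of Reciprocal Rank Fusion with a configurable K. The name `rrf_fuse_sketch`, its signature, and the tie-breaking rule are assumptions made for this example; the project's actual `rrf_fuse()` may differ. Each document scores the sum of 1 / (K + rank) over the lists that contain it, with rank starting at 1.

```rust
use std::collections::HashMap;

/// Fuse ranked lists of document ids with Reciprocal Rank Fusion.
/// Hypothetical signature: each inner Vec is one ranking, best result first.
fn rrf_fuse_sketch(lists: &[Vec<String>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            // idx is 0-based, so the 1-based rank is idx + 1.
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (k + idx as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With K = 60, a document ranked first in one list and second in the other scores 1/61 + 1/62, which places it ahead of a document that appears in only one list. Because only ranks are used, the dense similarity scores and the BM25 scores never need to be put on a common scale.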
Conclusion
The dense-only baseline remains intact. Phases 1.3 and 1.4 add lexical
search (FTS5) and RRF fusion without degrading the existing pipeline.
The benchmark confirms that the default mode (Dense) produces results
equivalent to Phase 1.2. The new capabilities (--mode sparse, --mode hybrid)
are operational and verified with smoke tests against the real repository.
Ready for Phase 1.5 (ONNX reranker) and Phase 1.6 (full re-measurement with hybrid).
Phase 1.5-1.6: ONNX Reranker + Re-measurement (2026-06-19)
Multi-mode benchmark comparing dense, sparse, hybrid (RRF), and hybrid-rerank
search strategies over the mini corpus. The reranker is active: a
BGE-Reranker-v2-m3 cross-encoder running on CPU via ONNX Runtime.
The reranker works. nDCG@10 improves by 47% over hybrid, and MRR more than doubles.
The cross-encoder reorders the top-K with much finer-grained relevance scores than
RRF, pushing the correct documents into the top positions.
Recall@5 does not change because the reranker does not discover new documents; it
only reorders what retrieval (dense + sparse) already found. Recall is the
responsibility of retrieval; nDCG and MRR are the responsibility of ranking.
Latency: 32.6s vs 11.3s (~3×). This is the cost of running a 568M-parameter
cross-encoder on CPU alone. The mode is explicitly opt-in (--mode hybrid-rerank),
which lets the user choose between speed and quality. For AI agents that search
only sporadically, the trade-off is acceptable.
Dense vs Hybrid+Rerank: The full pipeline (dense + sparse + RRF + reranker)
outperforms the original dense-only baseline in ranking quality. Sparse alone
(FTS5/BM25) remains weak for natural-language queries, but its value lies in
complementing dense search in the RRF fusion.
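
The reason reranking leaves recall untouched can be seen in a small sketch. It assumes the relevance scores are already available (in the real pipeline they would come from the cross-encoder) and that the reranker reorders the whole candidate list; `rerank_sketch` is a hypothetical name, not the project's API.

```rust
/// Reorder candidates by relevance score, highest first.
/// `scores[i]` is the score of `candidates[i]`, supplied by the caller.
fn rerank_sketch<'a>(candidates: &[&'a str], scores: &[f64]) -> Vec<&'a str> {
    assert_eq!(
        candidates.len(),
        scores.len(),
        "one score per candidate is required"
    );
    let mut paired: Vec<(&'a str, f64)> = candidates
        .iter()
        .copied()
        .zip(scores.iter().copied())
        .collect();
    // Stable sort by descending score: the set of ids is unchanged,
    // only their order differs.
    paired.sort_by(|a, b| b.1.total_cmp(&a.1));
    paired.into_iter().map(|(id, _)| id).collect()
}
```

Since the output is a permutation of the input, any metric that only asks whether a document is present (recall) is unchanged when the evaluated cutoff covers the whole reranked list, while position-sensitive metrics (nDCG, MRR) can move.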
Bugs fixed during Phase 1.5-1.6:
sanitize_fts_query: hyphenated terms ("key-based") caused an FTS5 error
from_cache_with_timeout(): did not download the model automatically
Model URL: Xenova/bge-reranker-v2-m3 does not exist → onnx-community/...
model.onnx with external data → self-contained model_quantized.onnx
token_type_ids: XLM-RoBERTa does not accept this input (only BERT does)
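
To illustrate the first bug: in FTS5 query syntax a bare hyphen is not part of a term, so an unquoted `key-based` fails to parse, whereas a double-quoted string is accepted. The sketch below shows one possible sanitization strategy, quoting every whitespace-separated term and doubling embedded quotes. It is not the project's `sanitize_fts_query`, whose actual behavior is not shown here, and `quote_fts5_terms` is a hypothetical name.

```rust
/// Turn free text into an FTS5 query in which every term is a quoted string.
/// Quoting keeps characters such as '-' from being parsed as query syntax;
/// embedded double quotes are escaped by doubling them.
fn quote_fts5_terms(input: &str) -> String {
    input
        .split_whitespace()
        .map(|term| format!("\"{}\"", term.replace('"', "\"\"")))
        .collect::<Vec<_>>()
        .join(" ")
}
```

For example, `key-based signing` becomes `"key-based" "signing"`. An input containing only whitespace yields an empty string, which a caller would need to handle before passing it to MATCH.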
Verification
616 unit tests and 44 integration tests, all passing
# Requires: Ollama running with embeddinggemma:latest
ollama pull embeddinggemma:latest
# First run with hybrid-rerank triggers automatic model download (~571MB)
cargo run -- search "test" --mode hybrid-rerank
# Run multi-mode benchmark
cargo run -- benchmark --corpus mini --output table --mode all
Phase 2 Graph Benchmark
Structural query benchmark using the knowledge graph. Measures symbol-level
recall and precision for graph-based retrieval (callers, dependents, imports).
Run Info
| Field | Value |
| --- | --- |
| Date | 2026-06-20 |
| Query Set | mini-structural (12 queries) |
| Query Types | 5 callers, 4 imports, 3 dependents |
| Corpus | mini (thiserror + defu + itsdangerous) |
| Files Indexed | 18 |
| Chunks Created | 83 |
| Graph Nodes | Populated via tree-sitter extraction |
Aggregate Metrics
| Metric | Value |
| --- | --- |
| Symbol Recall@5 | 1.0000 |
| Symbol Recall@10 | 1.0000 |
| Symbol Precision@5 | 0.6500 |
Per-Tool Breakdown
| Tool | Queries | R@5 | P@5 | R@10 |
| --- | --- | --- | --- | --- |
| callers | 5 | 1.0000 | 0.8000 | 1.0000 |
| imports | 4 | 1.0000 | 0.5500 | 1.0000 |
| dependents | 3 | 1.0000 | 0.6667 | 1.0000 |
Notes
Structural queries use routing=graph or routing=auto with heuristic classification.
Symbol-level metrics measure exact symbol matches (file::symbol keys).
External imports (e.g., std::fmt) are surfaced via LEFT JOIN in get_imports.
This benchmark complements the semantic retrieval metrics above.
Queries reference symbols from the actual indexed mini corpus (thiserror + defu + itsdangerous).
Perfect recall@5 and recall@10 indicate the graph correctly returns all expected symbols.
Precision@5 of 0.65 reflects queries with fewer than 5 results (graph returns exact sets, not ranked lists).
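
The interaction between exact result sets and Precision@5 can be sketched as follows. The example assumes that precision is divided by k itself rather than by the number of results returned, which is consistent with the note above but is not confirmed by the benchmark code; the function names and the `file::symbol` keys used in the usage note are hypothetical.

```rust
use std::collections::HashSet;

/// Count how many of the first k returned symbol keys are expected.
fn symbol_hits(returned: &[&str], expected: &HashSet<&str>, k: usize) -> usize {
    returned
        .iter()
        .take(k)
        .filter(|key| expected.contains(*key))
        .count()
}

/// Precision@k with a fixed denominator of k (assumption): a query that
/// returns fewer than k results cannot reach 1.0 even if all are correct.
fn symbol_precision_at_k(returned: &[&str], expected: &HashSet<&str>, k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    symbol_hits(returned, expected, k) as f64 / k as f64
}

/// Recall@k: share of the expected symbols found in the first k results.
fn symbol_recall_at_k(returned: &[&str], expected: &HashSet<&str>, k: usize) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    symbol_hits(returned, expected, k) as f64 / expected.len() as f64
}
```

Under this assumption, a query whose graph lookup returns exactly three symbols, all of them expected (say `src/lib.rs::a`, `src/lib.rs::b`, `src/lib.rs::c`), has Recall@5 of 1.0 and Precision@5 of 0.6.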
Reproducibility
# Run structural benchmark
cargo run -- benchmark --corpus mini --queries benchmarks/queries/mini_structural.toml --mode dense