Contributing to the Benchmark Golden Set

This document describes the process for proposing, validating, and merging new golden-set queries into the benchmark harness.

Purpose

The golden query set defines the acceptance criteria for code-search quality. Changes to this set directly affect baseline metrics and regression detection, so they require careful review.

Proposing a New Query

Identify the gap: What search behavior is not covered by existing queries? Examples:
- A common code pattern (e.g., "iterator with error handling")
- A cross-language concept (e.g., "async/await in TypeScript vs Python")
- A regression case (e.g., "search for function with specific signature")
Draft the query: Write the query text in natural language or keyword style. Keep it realistic: what would a developer actually type?
Label relevance: For each file in the corpus that might be returned, assign a grade:
- 3 (highly relevant): Direct match, core example of the concept
- 2 (relevant): Related code, partial match
- 1 (marginally relevant): Tangentially related
- 0 (irrelevant): Not related. Omit these from the file; list only files with grade >= 1.
Validate locally: Run the benchmark with your proposed query:
```
cargo run -- benchmark --corpus mini
```
Verify that the metrics change in the expected direction. If your query has grade-3 files that rank below grade-1 files, investigate why before opening a PR.
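To see why a grade-3 file ranked below a grade-1 file is worth investigating, the following minimal sketch scores two orderings of the same judged files. This document does not name the metrics the harness reports, so the sketch assumes a graded-relevance metric in the style of nDCG with the common exponential gain; the function names `dcg` and `ndcg` are illustrative, not part of the harness.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for grades listed in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg(grades):
    """DCG divided by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Assumed rankings for one query: each number is the grade of the file
# returned at that position.
good_ranking = [3, 2, 1]  # grade-3 file returned first
bad_ranking = [1, 2, 3]   # grade-3 file ranked below the grade-1 file

print(f"good: {ndcg(good_ranking):.3f}")
print(f"bad:  {ndcg(bad_ranking):.3f}")
```

The second ordering scores noticeably lower even though the same files were returned, which is the kind of drop that should prompt a second look at either the judgments or the search behavior.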

Review Process

Open a PR with:
- The new query added to benchmarks/queries/<corpus>.toml
- A brief rationale in the PR description (why this query matters)
- Before/after metric changes
Two reviewers must approve:
- At least one reviewer should verify the relevance judgments are correct
- At least one reviewer should verify the query text is realistic
Merge criteria:
- All CI checks pass (including the benchmark workflow)
- No metric moves by more than the ±0.01 tolerance (unless the change is intentional and explained in the PR)
- Both reviewers have approved

Query Quality Guidelines

- Be specific: "error handling" is too vague; "custom error type with Display trait" is better
- Cross-language coverage: Ensure queries test semantic search across Rust, TypeScript, and Python
- Avoid ambiguity: If a query could match multiple unrelated concepts, split it into separate queries
- Realistic phrasing: Use natural language that developers would actually type, not artificial test strings

Updating Relevance Judgments

If you discover that a file's relevance judgment is incorrect:

- Open a PR with the corrected grade
- Explain why the original grade was wrong
- Follow the same two-reviewer approval process

Metrics Tolerance

When adding queries, expect small metric fluctuations (±0.01). If metrics change by more than this:

- Verify the new query's judgments are correct
- Check if the query exposes a real search quality issue
- Document the expected change in the PR description
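The tolerance check can be sketched as a comparison of before and after metric values. This is a minimal illustration, not the comparator the harness uses: the metric names and values are hypothetical, and the layout of the baseline files is not described here, so the sketch works on plain dictionaries.

```python
TOLERANCE = 0.01  # the ±0.01 band described above


def out_of_tolerance(before, after, tolerance=TOLERANCE):
    """Return {metric: delta} for metrics whose change exceeds the tolerance."""
    flagged = {}
    for name in sorted(before.keys() & after.keys()):
        delta = after[name] - before[name]
        # Small epsilon so a change of exactly 0.01 is not flagged by float noise.
        if abs(delta) > tolerance + 1e-12:
            flagged[name] = round(delta, 4)
    return flagged


# Hypothetical metric names and values, for illustration only.
before = {"ndcg@10": 0.712, "mrr": 0.640}
after = {"ndcg@10": 0.705, "mrr": 0.610}

print(out_of_tolerance(before, after))
```

Any metric that shows up in the result is one to explain in the PR description, whether the cause is a wrong judgment or a real search quality issue.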

Real-corpus verification (thiserror / defu / itsdangerous / vscode)

The mini and vscode query sets in this directory are placeholders for the phase-4.4 real-corpus verification path. The current mock-mini baselines under baseline/ gate the deterministic smoke test only; they are not a published measurement of retrieval quality.

The real-corpus path (with a pinned model and Ollama setup) is intentionally out of scope here. If you need a real baseline, see phase 4.4 in the SDD roadmap. The comparator added in 4.1 is model-agnostic, so 4.4 will only need new JSON files and a workflow update.

Example: Adding a Query

```
[[queries]]
text = "iterator pattern with error propagation"

[[queries.judgments]]
file = "src/iter.rs"
grade = 3

[[queries.judgments]]
file = "src/error.rs"
grade = 2
```

Rationale: Tests whether the search can find iterator implementations that handle errors, a common Rust pattern. The iter.rs file is a direct match (grade 3), while error.rs is related but not the primary target (grade 2).
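Before opening a PR, an entry like the one above can be checked against the grading rules with a few lines of code. The sketch below is an assumption about what such a check might look like, not a tool shipped with the harness; `validate_query` is a hypothetical helper, and the dictionary mirrors how the TOML entry would look once parsed.

```python
VALID_GRADES = {1, 2, 3}  # grade 0 is expressed by leaving the file out


def validate_query(query):
    """Return a list of problems found in one query entry (empty if none)."""
    problems = []
    if not query.get("text", "").strip():
        problems.append("query text is empty")
    seen = set()
    for judgment in query.get("judgments", []):
        path = judgment.get("file")
        grade = judgment.get("grade")
        if not path:
            problems.append("judgment is missing a file path")
        elif path in seen:
            problems.append(f"{path}: listed more than once")
        else:
            seen.add(path)
        if grade not in VALID_GRADES:
            problems.append(f"{path}: grade {grade!r} is not 1, 2, or 3")
    return problems


# The example entry, written as the parsed structure.
entry = {
    "text": "iterator pattern with error propagation",
    "judgments": [
        {"file": "src/iter.rs", "grade": 3},
        {"file": "src/error.rs", "grade": 2},
    ],
}

print(validate_query(entry))  # prints []
```

On Python 3.11 or later, the standard-library `tomllib` module could load the query file into this shape, so the same check could run over every entry in the file.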
