
Run specific Question IDs #26

Open
sreedharsreeram wants to merge 1 commit into main from 02-03_question_id

Conversation

@sreedharsreeram
Contributor

No description provided.

Comment on lines +159 to +161
if (questionIds && questionIds.length > 0) {
targetQuestionIds = questionIds
logger.info(`Using explicit questionIds: ${questionIds.length} questions`)

This comment was marked as outdated.

Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Comment on lines +358 to +364
if (!questionIdValidation || questionIdValidation.invalid.length > 0) {
setError("Please validate patterns before starting the run")
return
}

// Use the expanded question IDs from validation
questionIds = questionIdValidation.expanded

Bug: Changing the benchmark does not clear the question ID validation state, allowing submission with stale validation data from a different benchmark.
Severity: MEDIUM

Suggested Fix

Add a useEffect hook that listens for changes to form.benchmark. When the benchmark is changed, the effect should clear the questionIdValidation state, forcing the user to re-validate their question IDs against the new benchmark before they can submit the form.
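The suggested reset can be modeled as a pure state transition (all names here are hypothetical, sketched from the state fields visible in the PR); in the page itself this logic would live in a useEffect keyed on form.benchmark:

```typescript
// Hypothetical model of the suggested fix: once the benchmark changes, any
// cached validation result belongs to the old benchmark and must be cleared
// before the form can be submitted again.
interface RunFormState {
  benchmark: string
  questionIdValidation: { expanded: string[]; invalid: string[] } | null
}

function onBenchmarkChange(state: RunFormState, newBenchmark: string): RunFormState {
  // Re-selecting the same benchmark keeps the validation; switching clears it.
  if (newBenchmark === state.benchmark) return state
  return { ...state, benchmark: newBenchmark, questionIdValidation: null }
}
```

With the validation state cleared, the existing submit guard (`!questionIdValidation`) would then force a re-validation against the new benchmark.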

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: ui/app/runs/new/page.tsx#L358-L364

Potential issue: When a user validates question IDs for a specific benchmark and then
changes the benchmark without re-validating, the validation state
(`questionIdValidation`) is not cleared. The form allows submission using the stale
validation results from the original benchmark. If any question IDs from the original
benchmark exist in the new benchmark, the backend will silently accept them and execute
the run against an incorrect set of questions, leading to invalid results without user
awareness.

Member

Dhravya commented Feb 17, 2026

I wonder if question ID is the right heuristic to build on, especially because we want to make it interoperable across all benchmarks.

Member

Dhravya commented Feb 17, 2026

Can we test this against Locomo and ConvoMem as well?


@vorflux (bot) left a comment


## Review Summary

This PR adds the ability to run benchmarks against specific question IDs (by direct ID, conversation pattern, or session pattern) — a useful feature for debugging and targeted testing. The backend validation, pattern expansion endpoint, and UI are all solid overall.

However, I'd hold off on merging as-is due to a few issues — one bug, one design concern raised by Dhravya, and some code quality items.

🐛 Bug: questionIds not passed through in compare flow

The initializeComparison function in compare.ts adds questionIds to its type signature, but the function body just does batchManager.createManifest(options) — which does work since it spreads the whole options object. However, createManifest then creates individual run configs via executeRuns, and I don't see where questionIds gets threaded into each individual run's orchestrator call. The batch manager creates manifests but the individual runs may not receive the questionIds. Worth verifying this end-to-end for the compare flow specifically.

🎯 Design concern (from Dhravya's comment)

Dhravya raised a valid point: the expand-ids endpoint pattern matching (/^[a-zA-Z]+-\d+$/ for conversation IDs, -session substring for session IDs) is tightly coupled to the naming conventions of specific benchmarks. LongMemEval uses question_id directly from the dataset (e.g., UUIDs), not conv-N patterns. This won't generalize well across Locomo, ConvoMem, and LongMemEval as requested.

📋 Other items

See inline comments below.

const trimmed = pattern.trim()
if (!trimmed) continue

const expanded: string[] = []

This regex ^[a-zA-Z]+-\d+$ is hardcoded to match patterns like conv-26 but won't work for LongMemEval's question IDs (which come from item.question_id and could be UUIDs like 001be529-...). This is the interoperability concern Dhravya raised — the pattern expansion logic is benchmark-specific.

Consider either:

  1. Making the expansion logic benchmark-aware (each benchmark defines its own pattern rules)
  2. Or simplifying to just support exact question IDs + a generic prefix match (any ID that's a prefix of a question ID expands to all matches)
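Option 2 could be sketched roughly as below (function and variable names are illustrative, not from the PR): exact matches win, otherwise any question ID the pattern prefixes is included, which works the same for conv-N style IDs and UUID-style IDs.

```typescript
// Hypothetical benchmark-agnostic expansion: a pattern matches a question ID
// exactly, or as a plain string prefix. The trade-off is the occasional
// over-broad prefix (e.g. "conv-2" would also match "conv-26_q1").
function expandPattern(allQuestionIds: string[], pattern: string): string[] {
  // An exact ID match takes priority and returns just that question.
  const exact = allQuestionIds.filter((id) => id === pattern)
  if (exact.length > 0) return exact
  // Otherwise treat the pattern as a prefix over the benchmark's IDs.
  return allQuestionIds.filter((id) => id.startsWith(pattern))
}
```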

targetQuestionIds = validIds
logger.info(
`Using explicit questionIds: ${validIds.length} valid questions` +
(invalidIds.length > 0 ? ` (${invalidIds.length} invalid skipped)` : "")

The validation logic here (lines 215-248) is nearly identical to the one added in batch.ts (lines 158-188). This is a ~30-line block duplicated verbatim. Consider extracting a shared validateQuestionIds(allQuestions, questionIds, benchmarkName) utility function.
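A minimal shape for that shared utility might look like this (a sketch only — the real duplicated blocks may carry extra logging or benchmark-specific details not shown here):

```typescript
// Hypothetical shared utility both orchestrator entry points could call
// instead of inlining the same loop: partition the requested IDs into
// those present in the benchmark and those that are not.
interface QuestionIdValidation {
  valid: string[]
  invalid: string[]
}

function validateQuestionIds(
  allQuestionIds: ReadonlySet<string>,
  requested: string[]
): QuestionIdValidation {
  const valid: string[] = []
  const invalid: string[] = []
  for (const id of requested) {
    // Membership in the benchmark's full ID set decides the bucket.
    const bucket = allQuestionIds.has(id) ? valid : invalid
    bucket.push(id)
  }
  return { valid, invalid }
}
```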

@@ -212,6 +223,70 @@ export default function NewRunPage() {
}

The entire validateQuestionIds function (lines 223-285) is copy-pasted identically between runs/new/page.tsx and compare/new/page.tsx. Extract this to a shared utility (e.g., ui/lib/question-id-validation.ts) to avoid the duplication.

Comment on lines +237 to +248
const inputPatterns = questionIdsInput
.split(",")
.map((id) => id.trim())
.filter((id) => id.length > 0)
const uniquePatterns = [...new Set(inputPatterns)]

// Call pattern expansion endpoint
const expansionResult = await expandQuestionIdPatterns(benchmark, uniquePatterns)
const expandedIds = expansionResult.expandedIds

// Fetch all questions to validate expanded IDs exist
const allQuestionIds = new Set<string>()

This fetches all benchmark questions page-by-page just to validate IDs that were already expanded by the server. But the server's expand-ids endpoint already validates against the benchmark's questions internally — if a pattern doesn't match, it returns empty results. The client-side re-validation is redundant and could be slow for large benchmarks. You could trust the server's expansion result and just check for patterns with no results.
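The simplified client-side check could be as small as this (sketched under the assumption that the expand-ids response can report matches per pattern — that shape is a guess, not the endpoint's documented contract):

```typescript
// Hypothetical simplification: trust the server's expansion and only surface
// patterns that matched nothing, instead of re-fetching every question
// page-by-page on the client.
function findUnmatchedPatterns(
  patterns: string[],
  matchesByPattern: Record<string, string[]>
): string[] {
  // A pattern with no entry, or an empty match list, produced zero questions.
  return patterns.filter((p) => (matchesByPattern[p] ?? []).length === 0)
}
```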

fromPhase,
sourceRunId,
} = body
console.log("[API] Extracted sampling:", sampling)

Nit: console.log for debug logging — the rest of the codebase uses logger. Consider removing or converting to logger.debug.

@vorflux

vorflux bot commented Mar 23, 2026

## Correction to my review

I want to retract two points from my earlier review:

1. The "bug" about questionIds not threading through the compare flow — this was wrong. I traced it more carefully and the full chain works correctly:

  • API extracts questionIds → passes to initializeComparison(options)
  • batchManager.createManifest(options) destructures questionIds and sets manifest.targetQuestionIds
  • executeRuns(manifest) passes manifest.targetQuestionIds as questionIds to each orchestrator.run() call (line 219 of batch.ts)

No bug here — sorry for the noise.

2. The console.log nit in runs.ts — the existing code in that file already uses console.log (lines 184, 198), so the PR is just following the established convention. The inconsistency with logger predates this PR and isn't something to flag here.

The remaining points from the review still stand:

  • Duplicated validation logic (~30 lines copy-pasted between orchestrator/index.ts and orchestrator/batch.ts, and again between the two UI pages) — worth extracting to shared utilities
  • Redundant client-side re-validation — the UI fetches all questions page-by-page to re-validate IDs that the server already validated in expand-ids
  • Pattern matching not generalizable across benchmarks (Dhravya's concern) — the ^[a-zA-Z]+-\d+$ regex is specific to ConvoMem/Locomo naming conventions and won't work for LongMemEval's UUID-style question IDs
