Conversation
src/orchestrator/batch.ts (outdated)
```ts
if (questionIds && questionIds.length > 0) {
  targetQuestionIds = questionIds
  logger.info(`Using explicit questionIds: ${questionIds.length} questions`)
```
Force-pushed 25e2b45 to 28a1861.
```ts
if (!questionIdValidation || questionIdValidation.invalid.length > 0) {
  setError("Please validate patterns before starting the run")
  return
}

// Use the expanded question IDs from validation
questionIds = questionIdValidation.expanded
```
Bug: Changing the benchmark does not clear the question ID validation state, allowing submission with stale validation data from a different benchmark.
Severity: MEDIUM
Suggested Fix
Add a useEffect hook that listens for changes to form.benchmark. When the benchmark is changed, the effect should clear the questionIdValidation state, forcing the user to re-validate their question IDs against the new benchmark before they can submit the form.
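In the React page this would amount to `useEffect(() => setQuestionIdValidation(null), [form.benchmark])`. A minimal sketch of the reset logic as a plain state transition (state names are assumptions taken from the comment above, not the page's actual types):

```ts
// Hypothetical state shapes; the real page's types may differ.
type ValidationState = { expanded: string[]; invalid: string[] } | null

interface FormState {
  benchmark: string
  questionIdValidation: ValidationState
}

// Changing the benchmark invalidates any previously validated question IDs,
// so the validation state is cleared and the user must re-validate.
function applyBenchmarkChange(state: FormState, nextBenchmark: string): FormState {
  if (nextBenchmark === state.benchmark) return state
  return { benchmark: nextBenchmark, questionIdValidation: null }
}
```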
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: ui/app/runs/new/page.tsx#L358-L364
Potential issue: When a user validates question IDs for a specific benchmark and then changes the benchmark without re-validating, the validation state (`questionIdValidation`) is not cleared. The form allows submission using the stale validation results from the original benchmark. If any question IDs from the original benchmark exist in the new benchmark, the backend will silently accept them and execute the run against an incorrect set of questions, leading to invalid results without user awareness.
I wonder if question ID is the right heuristic to build on, especially because we want to make it interoperable between all benchmarks.
Can we test this against Locomo and Convoman as well?
## Review Summary
This PR adds the ability to run benchmarks against specific question IDs (by direct ID, conversation pattern, or session pattern) — a useful feature for debugging and targeted testing. The backend validation, pattern expansion endpoint, and UI are all solid overall.
However, I'd hold off on merging as-is due to a few issues — one bug, one design concern raised by Dhravya, and some code quality items.
🐛 Bug: `questionIds` not passed through in compare flow
The `initializeComparison` function in `compare.ts` adds `questionIds` to its type signature, and the function body just does `batchManager.createManifest(options)` — which does work, since it spreads the whole options object. However, `createManifest` then creates individual run configs via `executeRuns`, and I don't see where `questionIds` gets threaded into each individual run's orchestrator call. The batch manager creates manifests, but the individual runs may not receive the `questionIds`. Worth verifying this end-to-end for the compare flow specifically.
🎯 Design concern (from Dhravya's comment)
Dhravya raised a valid point: the `expand-ids` endpoint pattern matching (`/^[a-zA-Z]+-\d+$/` for conversation IDs, a `-session` substring for session IDs) is tightly coupled to the naming conventions of specific benchmarks. LongMemEval uses `question_id` directly from the dataset (e.g., UUIDs), not `conv-N` patterns. This won't generalize well across Locomo, ConvoMem, and LongMemEval as requested.
📋 Other items
See inline comments below.
```ts
const trimmed = pattern.trim()
if (!trimmed) continue

const expanded: string[] = []
```
This regex `^[a-zA-Z]+-\d+$` is hardcoded to match patterns like `conv-26` but won't work for LongMemEval's question IDs (which come from `item.question_id` and could be UUIDs like `001be529-...`). This is the interoperability concern Dhravya raised: the pattern expansion logic is benchmark-specific.
Consider either:
- Making the expansion logic benchmark-aware (each benchmark defines its own pattern rules)
- Or simplifying to just support exact question IDs + a generic prefix match (any ID that's a prefix of a question ID expands to all matches)
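The second option could be sketched roughly as below; `expandPatterns` and its inputs are hypothetical names, not code from this PR:

```ts
// Benchmark-agnostic expansion: each pattern is either an exact question ID
// or a generic prefix that expands to every ID starting with it.
function expandPatterns(
  allQuestionIds: string[],
  patterns: string[],
): Map<string, string[]> {
  const results = new Map<string, string[]>()
  for (const raw of patterns) {
    const pattern = raw.trim()
    if (!pattern) continue
    // An exact match takes priority; otherwise fall back to prefix matching.
    const matches = allQuestionIds.includes(pattern)
      ? [pattern]
      : allQuestionIds.filter((id) => id.startsWith(pattern))
    results.set(pattern, matches)
  }
  return results
}
```

This treats `conv-26`-style IDs and UUID-style IDs uniformly, since it never assumes a naming convention; a pattern whose match list comes back empty is simply invalid.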
```ts
targetQuestionIds = validIds
logger.info(
  `Using explicit questionIds: ${validIds.length} valid questions` +
    (invalidIds.length > 0 ? ` (${invalidIds.length} invalid skipped)` : "")
```
The validation logic here (lines 215-248) is nearly identical to the one added in batch.ts (lines 158-188). This is a ~30-line block duplicated verbatim. Consider extracting a shared `validateQuestionIds(allQuestions, questionIds, benchmarkName)` utility function.
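A sketch of what the extracted helper might look like, using the signature suggested above; the question shape and the warning wording are assumptions:

```ts
// Hypothetical question shape; the real benchmark types may carry more fields.
interface Question {
  question_id: string
}

// Shared between batch.ts and the single-run path: partition the requested
// IDs into those present in the benchmark and those that are not.
function validateQuestionIds(
  allQuestions: Question[],
  questionIds: string[],
  benchmarkName: string,
): { valid: string[]; invalid: string[] } {
  const known = new Set(allQuestions.map((q) => q.question_id))
  const valid: string[] = []
  const invalid: string[] = []
  for (const id of questionIds) {
    if (known.has(id)) valid.push(id)
    else invalid.push(id)
  }
  if (invalid.length > 0) {
    console.warn(`${invalid.length} question IDs not found in ${benchmarkName}`)
  }
  return { valid, invalid }
}
```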
```diff
@@ -212,6 +223,70 @@ export default function NewRunPage() {
  }
```
The entire `validateQuestionIds` function (lines 223-285) is copy-pasted identically between `runs/new/page.tsx` and `compare/new/page.tsx`. Extract this to a shared utility (e.g., `ui/lib/question-id-validation.ts`) to avoid the duplication.
```ts
const inputPatterns = questionIdsInput
  .split(",")
  .map((id) => id.trim())
  .filter((id) => id.length > 0)
const uniquePatterns = [...new Set(inputPatterns)]

// Call pattern expansion endpoint
const expansionResult = await expandQuestionIdPatterns(benchmark, uniquePatterns)
const expandedIds = expansionResult.expandedIds

// Fetch all questions to validate expanded IDs exist
const allQuestionIds = new Set<string>()
```
This fetches all benchmark questions page-by-page just to validate IDs that were already expanded by the server. But the server's `expand-ids` endpoint already validates against the benchmark's questions internally; if a pattern doesn't match, it returns empty results. The client-side re-validation is redundant and could be slow for large benchmarks. You could trust the server's expansion result and just check for patterns with no results.
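Assuming the endpoint reports its matches per pattern (a hypothetical response shape, not necessarily what the PR's endpoint returns), the client-side check could reduce to:

```ts
// Hypothetical response shape for the expand-ids endpoint.
interface ExpansionResult {
  matchesByPattern: Record<string, string[]>
}

// Trust the server's expansion; the only client-side check needed is
// flagging patterns that matched nothing in the benchmark.
function findUnmatchedPatterns(result: ExpansionResult): string[] {
  return Object.entries(result.matchesByPattern)
    .filter(([, ids]) => ids.length === 0)
    .map(([pattern]) => pattern)
}
```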
```ts
  fromPhase,
  sourceRunId,
} = body
console.log("[API] Extracted sampling:", sampling)
```
Nit: `console.log` for debug logging; the rest of the codebase uses `logger`. Consider removing or converting to `logger.debug`.
## Correction to my review
I want to retract two points from my earlier review:
1. The "bug" about … — no bug here, sorry for the noise.
2. The …

The remaining points from the review still stand.

No description provided.