Conversation
src/orchestrator/batch.ts (outdated)
```ts
if (questionIds && questionIds.length > 0) {
  targetQuestionIds = questionIds
  logger.info(`Using explicit questionIds: ${questionIds.length} questions`)
```
Force-pushed 25e2b45 to 28a1861.
```ts
if (!questionIdValidation || questionIdValidation.invalid.length > 0) {
  setError("Please validate patterns before starting the run")
  return
}

// Use the expanded question IDs from validation
questionIds = questionIdValidation.expanded
```
Bug: Changing the benchmark does not clear the question ID validation state, allowing submission with stale validation data from a different benchmark.
Severity: MEDIUM
Suggested Fix
Add a useEffect hook that listens for changes to form.benchmark. When the benchmark is changed, the effect should clear the questionIdValidation state, forcing the user to re-validate their question IDs against the new benchmark before they can submit the form.
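In the React page this would amount to `useEffect(() => setQuestionIdValidation(null), [form.benchmark])`. A minimal sketch of the reset logic as a plain state transition (state names are assumptions taken from the comment above, not the page's actual types):

```ts
// Hypothetical state shapes; the real page's types may differ.
type ValidationState = { expanded: string[]; invalid: string[] } | null

interface FormState {
  benchmark: string
  questionIdValidation: ValidationState
}

// Changing the benchmark invalidates any previously validated question IDs,
// so the validation state is cleared and the user must re-validate.
function applyBenchmarkChange(state: FormState, nextBenchmark: string): FormState {
  if (nextBenchmark === state.benchmark) return state
  return { benchmark: nextBenchmark, questionIdValidation: null }
}
```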
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: ui/app/runs/new/page.tsx#L358-L364
Potential issue: When a user validates question IDs for a specific benchmark and then changes the benchmark without re-validating, the validation state (`questionIdValidation`) is not cleared. The form allows submission using the stale validation results from the original benchmark. If any question IDs from the original benchmark exist in the new benchmark, the backend will silently accept them and execute the run against an incorrect set of questions, leading to invalid results without user awareness.
I wonder if question ID is the right heuristic to build on, especially because we want to make it interoperable between all benchmarks.
Can we test this against Locomo and Convoman as well?
## Review Summary
This PR adds the ability to run benchmarks against specific question IDs (by direct ID, conversation pattern, or session pattern) — a useful feature for debugging and targeted testing. The backend validation, pattern expansion endpoint, and UI are all solid overall.
However, I'd hold off on merging as-is due to a few issues — one bug, one design concern raised by Dhravya, and some code quality items.
🐛 Bug: `questionIds` not passed through in compare flow
The `initializeComparison` function in `compare.ts` adds `questionIds` to its type signature, and the function body just does `batchManager.createManifest(options)` — which does work, since it spreads the whole options object. However, `createManifest` then creates individual run configs via `executeRuns`, and I don't see where `questionIds` gets threaded into each individual run's orchestrator call. The batch manager creates manifests, but the individual runs may not receive the `questionIds`. Worth verifying this end-to-end for the compare flow specifically.
🎯 Design concern (from Dhravya's comment)
Dhravya raised a valid point: the `expand-ids` endpoint pattern matching (`/^[a-zA-Z]+-\d+$/` for conversation IDs, a `-session` substring for session IDs) is tightly coupled to the naming conventions of specific benchmarks. LongMemEval uses `question_id` directly from the dataset (e.g., UUIDs), not `conv-N` patterns. This won't generalize well across Locomo, ConvoMem, and LongMemEval as requested.
📋 Other items
See inline comments below.
```ts
const trimmed = pattern.trim()
if (!trimmed) continue

const expanded: string[] = []
```
This regex `^[a-zA-Z]+-\d+$` is hardcoded to match patterns like `conv-26` but won't work for LongMemEval's question IDs (which come from `item.question_id` and could be UUIDs like `001be529-...`). This is the interoperability concern Dhravya raised: the pattern expansion logic is benchmark-specific.
Consider either:
- Making the expansion logic benchmark-aware (each benchmark defines its own pattern rules)
- Or simplifying to just support exact question IDs + a generic prefix match (any ID that's a prefix of a question ID expands to all matches)
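The second option could be sketched roughly as below; `expandPatterns` and its inputs are hypothetical names, not code from this PR:

```ts
// Benchmark-agnostic expansion: each pattern is either an exact question ID
// or a generic prefix that expands to every ID starting with it.
function expandPatterns(
  allQuestionIds: string[],
  patterns: string[],
): Map<string, string[]> {
  const results = new Map<string, string[]>()
  for (const raw of patterns) {
    const pattern = raw.trim()
    if (!pattern) continue
    // An exact match takes priority; otherwise fall back to prefix matching.
    const matches = allQuestionIds.includes(pattern)
      ? [pattern]
      : allQuestionIds.filter((id) => id.startsWith(pattern))
    results.set(pattern, matches)
  }
  return results
}
```

This treats `conv-26`-style IDs and UUID-style IDs uniformly, since it never assumes a naming convention; a pattern whose match list comes back empty is simply invalid.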
```ts
targetQuestionIds = validIds
logger.info(
  `Using explicit questionIds: ${validIds.length} valid questions` +
    (invalidIds.length > 0 ? ` (${invalidIds.length} invalid skipped)` : "")
```
The validation logic here (lines 215-248) is nearly identical to the one added in batch.ts (lines 158-188). This is a ~30-line block duplicated verbatim. Consider extracting a shared `validateQuestionIds(allQuestions, questionIds, benchmarkName)` utility function.
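A sketch of what the extracted helper might look like, using the signature suggested above; the question shape and the warning wording are assumptions:

```ts
// Hypothetical question shape; the real benchmark types may carry more fields.
interface Question {
  question_id: string
}

// Shared between batch.ts and the single-run path: partition the requested
// IDs into those present in the benchmark and those that are not.
function validateQuestionIds(
  allQuestions: Question[],
  questionIds: string[],
  benchmarkName: string,
): { valid: string[]; invalid: string[] } {
  const known = new Set(allQuestions.map((q) => q.question_id))
  const valid: string[] = []
  const invalid: string[] = []
  for (const id of questionIds) {
    if (known.has(id)) valid.push(id)
    else invalid.push(id)
  }
  if (invalid.length > 0) {
    console.warn(`${invalid.length} question IDs not found in ${benchmarkName}`)
  }
  return { valid, invalid }
}
```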
```diff
@@ -212,6 +223,70 @@ export default function NewRunPage() {
  }
```
The entire `validateQuestionIds` function (lines 223-285) is copy-pasted identically between `runs/new/page.tsx` and `compare/new/page.tsx`. Extract this to a shared utility (e.g., `ui/lib/question-id-validation.ts`) to avoid the duplication.
```ts
const inputPatterns = questionIdsInput
  .split(",")
  .map((id) => id.trim())
  .filter((id) => id.length > 0)
const uniquePatterns = [...new Set(inputPatterns)]

// Call pattern expansion endpoint
const expansionResult = await expandQuestionIdPatterns(benchmark, uniquePatterns)
const expandedIds = expansionResult.expandedIds

// Fetch all questions to validate expanded IDs exist
const allQuestionIds = new Set<string>()
```
This fetches all benchmark questions page-by-page just to validate IDs that were already expanded by the server. But the server's `expand-ids` endpoint already validates against the benchmark's questions internally; if a pattern doesn't match, it returns empty results. The client-side re-validation is redundant and could be slow for large benchmarks. You could trust the server's expansion result and just check for patterns with no results.
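Assuming the endpoint reports its matches per pattern (a hypothetical response shape, not necessarily what the PR's endpoint returns), the client-side check could reduce to:

```ts
// Hypothetical response shape for the expand-ids endpoint.
interface ExpansionResult {
  matchesByPattern: Record<string, string[]>
}

// Trust the server's expansion; the only client-side check needed is
// flagging patterns that matched nothing in the benchmark.
function findUnmatchedPatterns(result: ExpansionResult): string[] {
  return Object.entries(result.matchesByPattern)
    .filter(([, ids]) => ids.length === 0)
    .map(([pattern]) => pattern)
}
```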
```ts
  fromPhase,
  sourceRunId,
} = body
console.log("[API] Extracted sampling:", sampling)
```
Nit: `console.log` for debug logging; the rest of the codebase uses `logger`. Consider removing or converting to `logger.debug`.
## Correction to my review
I want to retract two points from my earlier review:
1. The "bug" about … — no bug here, sorry for the noise.
2. The …

The remaining points from the review still stand.

No description provided.