FE-711: Add Metrics to Experiments#8751
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
1 Skipped Deployment
|
PR SummaryMedium Risk Overview The experiment API and worker protocol switch from The create-experiment drawer adds a Metrics section (built-in kinds, model metrics, custom code with LSP). Simulate view hides the standalone Metrics tab from the sidebar while keeping scenarios and experiments. Reviewed by Cursor Bugbot for commit 09d211d. Bugbot is set up for automated code reviews on this repo. Configure here. |
🤖 Augment PR SummarySummary: This PR introduces first-class Monte Carlo “experiment metrics” for Petrinaut experiments, including runtime/user-defined metric evaluation and UI controls to configure and view metric outputs. Changes:
Technical Notes: Worker-backed experiments remain distribution-only; experiments with expression/aggregated metrics run locally to avoid posting JS code across the worker boundary. 🤖 Was this summary useful? React with 👍 or 👎 |
| color: "neutral.s120", | ||
| cursor: "pointer", | ||
| }); | ||
|
|
||
| const metricCollapseIconStyle = css({ | ||
| transition: "[transform 200ms ease-in-out]", | ||
| "&[data-state=open]": { | ||
| transform: "[rotate(90deg)]", | ||
| }, |
There was a problem hiding this comment.
Metric ID validation inconsistency allows duplicate IDs with different whitespace. The code trims the ID when checking if empty (line 136) but adds the untrimmed ID to the Set (line 142), and checks for duplicates using the untrimmed ID (line 139). This means metric IDs like " test" and "test" would both pass validation as unique IDs, causing potential conflicts.
for (const metricSpec of input.metricSpecs) {
const trimmedId = metricSpec.id.trim();
if (trimmedId === "") {
throw new Error("Metric id is required");
}
if (metricIds.has(trimmedId)) {
throw new Error(`Metric id "${trimmedId}" is duplicated`);
}
metricIds.add(trimmedId);
// ...
}Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
1e2973a to
74a64da
Compare
74a64da to
823f376
Compare
| const distributionAccumulator = | ||
| createMonteCarloMetricHistogramAccumulator(runOutput.binning); | ||
| let distributionValues = runValues; | ||
| let timeSampleCount = 0; |
There was a problem hiding this comment.
Distribution metric timeSampleCount is zero without time aggregation
Low Severity
When a distribution metric has no time aggregation (aggregateTime: "none"), timeSampleCount is initialized to 0 and never updated, so every distribution frame reports timeSampleCount: 0 regardless of how many run samples were collected. This is inconsistent with the scalar path which tracks frame count in scalarFrameCountState.count. Consumers displaying or reasoning about sample counts for distribution frames will see a misleading zero.
Reviewed by Cursor Bugbot for commit 823f376. Configure here.
Support aggregating metric runs (mean/median/min/max/percentiles, heatmap and percentile-line distribution views) and aggregating over time (value, min/max-to-date traces, or collapsing to a single number or distribution). Add a per-block size toggle and render the aggregated distribution as a uPlot bar chart.
Group the metric-kind picker into Built-in / Model metrics / Custom sections with icons, and add the model's custom metrics as read-only expression options.
823f376 to
09d211d
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 09d211d. Configure here.
| runCount: latestFrame?.runCount ?? simulator?.runCount ?? 0, | ||
| frameNumber: firstRunSummary?.frameNumber ?? 0, | ||
| time: firstRunSummary?.currentTime ?? 0, | ||
| runCount: simulator?.runCount ?? 0, |
There was a problem hiding this comment.
Progress reports stale frame number from first run
Low Severity
Both progressFromResult (worker) and getProgressFromResult (local) derive frameNumber and time from simulator.getRunSummary(0), i.e. the first run's individual state. The old code used the distribution metric's latest frame, whose frameNumber came from the simulator-level context.frameNumber—the global counter incremented by every advanceAll() call. If the first run errors while others continue, the reported frameNumber/time in progress will be stale, because a completed or errored run's frame number stops advancing while the simulator-level frame count keeps going.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 09d211d. Configure here.
| frames, | ||
| latestByMetricId, | ||
| }; | ||
| } |
There was a problem hiding this comment.
Local experiment recopies all metric frames every batch
Low Severity
getMetricsState rebuilds the entire frames array via metrics.flatMap(m => [...m.frames]) on every syncStores call, which runs after each batch in the local run loop. As frames accumulate over a long experiment, each batch copies an ever-growing history, making total work quadratic. The worker path's appendMetricFrames has similar spreading costs but only appends new frames, making intent clearer and avoiding re-reading from metric objects.
Reviewed by Cursor Bugbot for commit 09d211d. Configure here.


🌟 What is the purpose of this PR?
🔗 Related links
🚫 Blocked by
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
turbo.json's have been updated to reflect this🐾 Next steps
🛡 What tests cover this?
❓ How to test this?
📹 Demo