FE-711: Add Metrics to Experiments #8751

Open
kube wants to merge 9 commits into main from cf/fe-711-add-metrics-to-monte-carlo-experiments

Conversation

@kube
Collaborator

@kube kube commented May 25, 2026

🌟 What is the purpose of this PR?

🔗 Related links

  • ...

🚫 Blocked by

  • ...

🔍 What does this change?

  • ...

Pre-Merge Checklist 🚀

🚢 Has this modified a publishable library?

This PR:

  • does not modify any publishable blocks or libraries, or modifications do not need publishing
  • modifies an npm-publishable library and I have added a changeset file(s)
  • modifies a Cargo-publishable library and I have amended the version
  • modifies a Cargo-publishable library, but it is not yet ready to publish
  • modifies a block that will need publishing via GitHub action once merged
  • I am unsure / need advice

📜 Does this require a change to the docs?

The changes in this PR:

  • are internal and do not require a docs change
  • are in a state where docs changes are not yet required but will be
  • require changes to docs which are made as part of this PR
  • require changes to docs which are not made in this PR
    • Provide more detail here
  • I am unsure / need advice

🕸️ Does this require a change to the Turbo Graph?

The changes in this PR:

  • do not affect the execution graph
  • affected the execution graph, and the turbo.json's have been updated to reflect this
  • I am unsure / need advice

⚠️ Known issues

🐾 Next steps

🛡 What tests cover this?

❓ How to test this?

  1. Checkout the branch / view the deployment
  2. Try X
  3. Confirm that Y

📹 Demo

@kube kube self-assigned this May 25, 2026
@vercel

vercel Bot commented May 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hash | Ready | Preview, Comment | May 29, 2026 2:01am |
| petrinaut | Ready | Preview, Comment | May 29, 2026 2:01am |

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hashdotdesign-tokens | Ignored | Preview | May 29, 2026 2:01am |

@cursor

cursor Bot commented May 25, 2026

PR Summary

Medium Risk
Large refactor of Monte Carlo experiment outputs and public exports; breaking for consumers of place-token distribution types/messages, mitigated by tests and parallel local/worker paths.

Overview
Monte Carlo experiments no longer hard-code place token count distributions. They now support configurable user-defined metrics (expression code, place-token means, transition firing counts) with scalar or per-run distribution output, mergeable accumulators, and specs compiled in the worker or run locally.

The experiment API and worker protocol switch from distributions / distributionFrames to metrics / metricFrames keyed by metric id. The React layer requires at least one metric when creating an experiment, streams frames into experiment records, and replaces the old place-only timeline with a multi-metric uPlot UI (heatmaps, percentile bands, run/time aggregation).

The create-experiment drawer adds a Metrics section (built-in kinds, model metrics, custom code with LSP). Simulate view hides the standalone Metrics tab from the sidebar while keeping scenarios and experiments.

Reviewed by Cursor Bugbot for commit 09d211d. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions github-actions Bot added area/infra Relates to version control, CI, CD or IaC (area) area/libs Relates to first-party libraries/crates/packages (area) type/eng > frontend Owned by the @frontend team labels May 25, 2026
@augmentcode

augmentcode Bot commented May 25, 2026

🤖 Augment PR Summary

Summary: This PR introduces first-class Monte Carlo “experiment metrics” for Petrinaut experiments, including runtime/user-defined metric evaluation and UI controls to configure and view metric outputs.

Changes:

  • Exports new Monte Carlo metric APIs and types from @hashintel/petrinaut-core (metric specs, user-defined metrics, accumulators).
  • Adds a SimulationFrameReader-based per-run frame reader so metrics can safely inspect place/transition state.
  • Implements numeric and histogram accumulator utilities plus tests for merge/monoid behavior.
  • Adds metric spec compilation (including expression metrics via compileMetric) into user-defined metric configs.
  • Extends createMonteCarloExperiment() with a local execution path when metric callbacks/specs are provided (worker cannot receive executable code).
  • Updates experiments React state/provider to store metric specs, metric frames, and latest-by-id.
  • Adds UI for defining metrics when creating an experiment (including LSP diagnostics for expression metrics).
  • Updates experiment viewing UI to optionally show token-count timeline and a new metrics summary section.
  • Adds architecture/proposal/usage docs for Monte Carlo metrics direction.

Technical Notes: Worker-backed experiments remain distribution-only; experiments with expression/aggregated metrics run locally to avoid posting JS code across the worker boundary.
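The merge/monoid behavior mentioned in the summary can be illustrated with a minimal sketch. The `NumericAccumulator` shape and the function names below are hypothetical and are not the exports added by this PR; the sketch only shows why an accumulator with an identity element and an associative merge lets partial results from separate run batches be combined in any grouping.

```typescript
// Hypothetical shape; the PR's real accumulator types are not shown in this thread.
interface NumericAccumulator {
  count: number;
  sum: number;
  min: number;
  max: number;
}

// Identity element: merging with it leaves any accumulator unchanged.
const emptyAccumulator = (): NumericAccumulator => ({
  count: 0,
  sum: 0,
  min: Infinity,
  max: -Infinity,
});

// Fold one sample into an accumulator without mutating the input.
function addSample(acc: NumericAccumulator, value: number): NumericAccumulator {
  return {
    count: acc.count + 1,
    sum: acc.sum + value,
    min: Math.min(acc.min, value),
    max: Math.max(acc.max, value),
  };
}

// Associative merge of two partial results (for example, two batches of runs).
function mergeAccumulators(
  a: NumericAccumulator,
  b: NumericAccumulator,
): NumericAccumulator {
  return {
    count: a.count + b.count,
    sum: a.sum + b.sum,
    min: Math.min(a.min, b.min),
    max: Math.max(a.max, b.max),
  };
}

// The mean is derived on demand; it is undefined for an empty accumulator.
function mean(acc: NumericAccumulator): number | undefined {
  return acc.count === 0 ? undefined : acc.sum / acc.count;
}
```

Storing `count` and `sum` rather than a running mean is what keeps the merge exact for integer-valued samples; with floating-point samples the merged sum can differ in the last bits depending on grouping.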

@augmentcode augmentcode Bot left a comment

Review completed. 3 suggestions posted.

Comment thread libs/@hashintel/petrinaut/src/react/experiments/provider.tsx Outdated
Comment on lines +134 to +142
color: "neutral.s120",
cursor: "pointer",
});

const metricCollapseIconStyle = css({
transition: "[transform 200ms ease-in-out]",
"&[data-state=open]": {
transform: "[rotate(90deg)]",
},
Contributor

Metric ID validation inconsistency allows duplicate IDs with different whitespace. The code trims the ID when checking if empty (line 136) but adds the untrimmed ID to the Set (line 142), and checks for duplicates using the untrimmed ID (line 139). This means metric IDs like " test" and "test" would both pass validation as unique IDs, causing potential conflicts.

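// Validate against the trimmed id so that " test" and "test" count as duplicates.
const metricIds = new Set<string>();
for (const metricSpec of input.metricSpecs) {
  const trimmedId = metricSpec.id.trim();
  if (trimmedId === "") {
    throw new Error("Metric id is required");
  }
  if (metricIds.has(trimmedId)) {
    throw new Error(`Metric id "${trimmedId}" is duplicated`);
  }
  // Store the trimmed form, matching the lookup above.
  metricIds.add(trimmedId);
  // ...
}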

Spotted by Graphite

Base automatically changed from cm/petrinaut-ai-assistant-mvp to main May 26, 2026 17:11
@github-actions github-actions Bot added area/deps Relates to third-party dependencies (area) area/infra Relates to version control, CI, CD or IaC (area) type/eng > backend Owned by the @backend team area/apps area/apps > hash.design Affects the `hash.design` design site (app) labels May 26, 2026
@kube kube force-pushed the cf/fe-711-add-metrics-to-monte-carlo-experiments branch from 1e2973a to 74a64da Compare May 26, 2026 21:03
@github-actions github-actions Bot removed area/deps Relates to third-party dependencies (area) area/infra Relates to version control, CI, CD or IaC (area) type/eng > backend Owned by the @backend team labels May 26, 2026
const distributionAccumulator =
createMonteCarloMetricHistogramAccumulator(runOutput.binning);
let distributionValues = runValues;
let timeSampleCount = 0;

Distribution metric timeSampleCount is zero without time aggregation

Low Severity

When a distribution metric has no time aggregation (aggregateTime: "none"), timeSampleCount is initialized to 0 and never updated, so every distribution frame reports timeSampleCount: 0 regardless of how many run samples were collected. This is inconsistent with the scalar path which tracks frame count in scalarFrameCountState.count. Consumers displaying or reasoning about sample counts for distribution frames will see a misleading zero.
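One way to read the issue above is that the sample counter should advance for every frame, whatever the time-aggregation mode. The sketch below is an illustration under assumptions, not the PR's code: `AggregateTime`, `DistributionFrame`, and `emitDistributionFrames` are hypothetical names, and `"collapse"` stands in for any mode other than `"none"`.

```typescript
// Hypothetical modes: only "none" is named in the review comment.
type AggregateTime = "none" | "collapse";

// Hypothetical frame shape for a distribution metric.
interface DistributionFrame {
  runValues: number[];
  timeSampleCount: number;
}

function emitDistributionFrames(
  framesOfRunValues: number[][],
  aggregateTime: AggregateTime,
): DistributionFrame[] {
  let timeSampleCount = 0;
  const pooled: number[] = [];
  const out: DistributionFrame[] = [];

  for (const runValues of framesOfRunValues) {
    // Counted unconditionally, so the "none" path never reports zero.
    timeSampleCount += 1;

    if (aggregateTime === "none") {
      // One output frame per time step, carrying the count so far.
      out.push({ runValues: [...runValues], timeSampleCount });
    } else {
      // Pool the values and emit a single frame at the end.
      pooled.push(...runValues);
    }
  }

  if (aggregateTime !== "none" && timeSampleCount > 0) {
    out.push({ runValues: pooled, timeSampleCount });
  }
  return out;
}
```

Whether the count should mean "frames seen so far" or something else for the `"none"` case is a design decision for the PR author; the sketch only shows the counter being kept consistent across both paths.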

Reviewed by Cursor Bugbot for commit 823f376. Configure here.

kube added 9 commits May 28, 2026 23:06
Support aggregating metric runs (mean/median/min/max/percentiles, heatmap
and percentile-line distribution views) and aggregating over time (value,
min/max-to-date traces, or collapsing to a single number or distribution).
Add a per-block size toggle and render the aggregated distribution as a
uPlot bar chart.
Group the metric-kind picker into Built-in / Model metrics / Custom
sections with icons, and add the model's custom metrics as read-only
expression options.
@kube kube force-pushed the cf/fe-711-add-metrics-to-monte-carlo-experiments branch from 823f376 to 09d211d Compare May 29, 2026 01:55
@github-actions github-actions Bot added the area/apps > hash.design Affects the `hash.design` design site (app) label May 29, 2026
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 09d211d. Configure here.

runCount: latestFrame?.runCount ?? simulator?.runCount ?? 0,
frameNumber: firstRunSummary?.frameNumber ?? 0,
time: firstRunSummary?.currentTime ?? 0,
runCount: simulator?.runCount ?? 0,

Progress reports stale frame number from first run

Low Severity

Both progressFromResult (worker) and getProgressFromResult (local) derive frameNumber and time from simulator.getRunSummary(0), i.e. the first run's individual state. The old code used the distribution metric's latest frame, whose frameNumber came from the simulator-level context.frameNumber—the global counter incremented by every advanceAll() call. If the first run errors while others continue, the reported frameNumber/time in progress will be stale, because a completed or errored run's frame number stops advancing while the simulator-level frame count keeps going.
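A possible mitigation, sketched under assumptions, is to derive progress from the furthest-advanced run instead of run 0. The `RunSummary` and `Progress` shapes and the `progressFromRunSummaries` helper are hypothetical; only the `frameNumber` and `currentTime` field names come from the diff context above. Reading the simulator-level frame counter directly, as the old code effectively did, would be an alternative.

```typescript
// Hypothetical per-run summary; field names follow the diff context above.
interface RunSummary {
  frameNumber: number;
  currentTime: number;
}

// Hypothetical progress payload.
interface Progress {
  frameNumber: number;
  time: number;
  runCount: number;
}

// Report the frame number and time of the most advanced run, so a run
// that stopped early (completed or errored) cannot pin progress.
function progressFromRunSummaries(summaries: RunSummary[]): Progress {
  let frameNumber = 0;
  let time = 0;
  for (const summary of summaries) {
    if (summary.frameNumber > frameNumber) {
      frameNumber = summary.frameNumber;
      time = summary.currentTime;
    }
  }
  return { frameNumber, time, runCount: summaries.length };
}
```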

Reviewed by Cursor Bugbot for commit 09d211d. Configure here.

frames,
latestByMetricId,
};
}

Local experiment recopies all metric frames every batch

Low Severity

getMetricsState rebuilds the entire frames array via metrics.flatMap(m => [...m.frames]) on every syncStores call, which runs after each batch in the local run loop. As frames accumulate over a long experiment, each batch copies an ever-growing history, making total work quadratic. The worker path's appendMetricFrames has similar spreading costs but only appends new frames, making intent clearer and avoiding re-reading from metric objects.
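The quadratic cost described above can be avoided by remembering how many frames of each metric have already been copied and appending only the new ones. The following is a minimal sketch, assuming hypothetical `MetricFrame` and `MetricSource` shapes and a hypothetical `createIncrementalFrameCollector` helper; it is not the PR's `getMetricsState` or `appendMetricFrames`.

```typescript
// Hypothetical frame and metric shapes for illustration only.
interface MetricFrame {
  metricId: string;
  frameNumber: number;
  value: number;
}

interface MetricSource {
  id: string;
  frames: MetricFrame[];
}

function createIncrementalFrameCollector() {
  const frames: MetricFrame[] = [];
  const latestByMetricId = new Map<string, MetricFrame>();
  // Number of frames already copied, per metric id.
  const consumed = new Map<string, number>();

  return {
    // Copy only frames added since the previous sync, so total work
    // stays linear in the number of frames produced.
    sync(metrics: MetricSource[]) {
      for (const metric of metrics) {
        const start = consumed.get(metric.id) ?? 0;
        for (const frame of metric.frames.slice(start)) {
          frames.push(frame);
          latestByMetricId.set(metric.id, frame);
        }
        consumed.set(metric.id, metric.frames.length);
      }
      return { frames, latestByMetricId };
    },
  };
}
```

The returned `frames` array is the same mutable array on every call. If the consuming store relies on a new array identity to detect changes, a snapshot copy would still be needed at that boundary, which brings back some of the copying cost.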

Reviewed by Cursor Bugbot for commit 09d211d. Configure here.

Labels

area/apps > hash.design Affects the `hash.design` design site (app) area/libs Relates to first-party libraries/crates/packages (area) type/eng > frontend Owned by the @frontend team

1 participant