[core] Extend OCC fence to all branch-decision writes#2132
Conversation
Temporary diagnostic instrumentation for investigating intermittent CorruptedEventLogError 'step consumer mismatch' failures. Emits console.log lines tagged 'WF_TRACE' at four points: - runWorkflow start: dumps the full event array the replay will consume (eventIds, types, correlationIds, stepNames) plus a sha256 digest - step/hook/sleep subscribe: per-replay correlationId -> name assignment - step consumer mismatch: structured record of the failure including the event index in the SDK's view of the log - runWorkflow end: completed | failed | suspended Used to diff successive replays of the same runId and confirm whether the SDK actually sees the same event array each time.
Peter's PR #2113 fences `wait_completed` writes from the elapsed-wait scan. This commit extends the fence to every other write whose outcome depends on a branch decision the workflow VM made from its loaded event log — per the table @VaguelySerious himself laid out in his PR comment: suspension-handler.ts: - step_created (the smoking gun on wrun_01KSPS7XEGHF4A6WYF4DB03D40) - hook_created - hook_disposed - wait_created runtime.ts terminal writes: - run_completed - run_failed `hook_received` is deliberately NOT fenced (Peter's reasoning preserved verbatim: fencing the user's signal would drop it on contention; stale- snapshot protection belongs on the writes that consume hooks, not the ones that deliver them). The fence value is the load-time tail of the events array passed into `runWorkflow`. `suspension-handler` receives the fence + cursor from the runtime and reloads on conflict; the runtime's terminal writes read the cursor directly. The new `__fenced-write.ts` helper encapsulates the retry loop so we don't have to copy/paste Peter's pattern six times. It's named with the leading-underscore convention to flag it as throwaway diagnostic code, matching `__debug-replay-trace.ts`.
🦋 Changeset detectedLatest commit: a7efa5a The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Two changes both needed for the extended-fence test loop to actually exercise the OCC code path on the server: 1. Hardcode WORKFLOW_SERVER_URL_OVERRIDE to https://workflow-server-83nn57dvc.vercel.sh (preview deployment of workflow-server PR 447, branch alias workflow-server-git-peter-event-write-cas.vercel.sh). The previous preview at workflow-server-7pxaxn4d4.vercel.sh was Pranay's monotonic- append PR 456 \u2014 different fix, doesn't implement the CAS the SDK side now sends. 2. Map HTTP 412 \u2192 EntityConflictError in the world-vercel error mapper. workflow-server PR 447 returns 412 with a 'fence conflict' message for EventLogFenceConflictError; the SDK's existing fence-retry loops (Peter's wait_completed scan + the new ones in suspension-handler and runtime terminal writes) match on /fence conflict/i against the message of an EntityConflictError. Without this mapping the 412 falls through to WorkflowWorldError and the regex match never fires.
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
❌ Some benchmark jobs failed:
Check the workflow run for details. |
🧪 E2E Test Results✅ All tests passed Summary
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
|
VaguelySerious
left a comment
There was a problem hiding this comment.
Let's merge into upstream branch? and then review as one
Strip the WF_TRACE replay tracing that was used to diagnose the CORRUPTED_EVENT_LOG race \u2014 it's served its purpose now that the fix is in. Specifically: - Delete packages/core/src/__debug-replay-trace.ts and its 8 call sites in workflow.ts, step.ts, workflow/hook.ts, workflow/sleep.ts. - Drop the matching [DEBUG] inline narrative comments at each call site. - Rename packages/core/src/runtime/__fenced-write.ts \u2192 fenced-write.ts (the leading-underscore convention marked it as throwaway diagnostic code; the helper is intended to stay). - Trim the file header on fenced-write.ts and the related narrative comment in suspension-handler.ts to drop the failing-runId / PR-number references that only made sense in the debug context. No behavioral change. typecheck clean (0 errors); 1014/1014 unit tests pass (same as parent commit 77f057a).
There was a problem hiding this comment.
Pull request overview
Extends optimistic concurrency control (OCC) “fencing” to additional workflow event writes whose correctness depends on branch decisions made from a potentially stale event-log snapshot, and adjusts step dispatch behavior to avoid duplicate execution under concurrent replay.
Changes:
- Introduce a shared
fencedEventCreatehelper to retry CAS-fenced writes with refresh + idempotency checks. - Apply fencing to
step_created/wait_created/ hook create+dispose paths inhandleSuspension, and to terminalrun_completed/run_failedwrites inruntime.ts. - Improve local queue idempotency by retaining completed idempotency keys (plus tests/docs updates).
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/world-vercel/src/utils.ts | Map HTTP 412 OCC fence conflicts into EntityConflictError for downstream handling. |
| packages/world-testing/src/inline-batches-debug.mts | Tighten debug assertions for race/skip counters. |
| packages/world-local/src/queue.ts | Add a bounded cache of completed idempotency keys to prevent post-completion duplicate dispatch. |
| packages/world-local/src/queue.test.ts | Add coverage for post-completion idempotency dedupe behavior. |
| packages/core/src/workflow/hook.ts | Minor formatting-only change. |
| packages/core/src/step.ts | Minor formatting-only change. |
| packages/core/src/runtime/suspension-handler.ts | Fence branch-decision writes and serialize them to avoid self-conflicts; add refresh/idempotency logic. |
| packages/core/src/runtime/suspension-handler.test.ts | New test ensuring the fence is chained across writes within a suspension. |
| packages/core/src/runtime/fenced-write.ts | New shared helper implementing fence-conflict retries with backoff and caller-provided refresh/idempotency checks. |
| packages/core/src/runtime.ts | Fence terminal writes; owner-scope step queueing when inline execution is possible; crash-recovery exception for redelivery. |
| docs/content/docs/v5/changelog/eager-processing.mdx | Update documentation to reflect owner-scoped queueing semantics. |
| docs/content/docs/v4/changelog/eager-processing.mdx | Same as v5 doc update for v4 docs. |
| .changeset/quick-local-queues.md | Changeset for local queue idempotency behavior. |
| .changeset/chilly-fences-chain.md | Changeset for core fencing + dispatch changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address Copilot review on PR 2132 (#2132 (comment)). The fence-retry loop in runtime/fenced-write.ts detects OCC conflicts via /fence conflict/i.test(err.message). The 412 branch was relying on the server's JSON body to populate that message via errorData.message, but parseResponseBody().catch(() => ({})) swallows JSON parse failures silently — so any non-JSON 412 response (CDN HTML, gateway timeout page, intermediate proxy error) would surface as EntityConflictError("<METHOD> /endpoint -> HTTP 412: Precondition Failed"), the regex would miss it, and the retry loop would mis-classify the conflict as terminal. Prefix the message with `fence conflict:` whenever the parsed body didn't already carry the marker, so the retry detection is robust to response-body parse failures. Tests: world-vercel 69/69 pass.
Summary
Layered on top of #2113. Extends the OCC fence — which currently covers only
wait_completedwrites from the elapsed-wait scan — to every other write whose outcome depends on a branch decision the workflow VM made from its loaded event log:wait_completed(elapsed-wait scan)step_createdwait_createdhook_createdhook_disposedrun_completedrun_failedhook_receivedhook_receivedstays unfenced for the same reason it's unfenced in #2113: fencing the user's signal would drop it on contention. Stale-snapshot protection belongs on the writes that consume hooks, not the writes that deliver them.Why
Under concurrent replay (two invocations of the same run with overlapping but non-identical event-log snapshots), the SDK's deterministic ULID factory can allocate the same
correlationIdfor different step calls depending on which side of aPromise.raceeach replay sees as the winner. Both invocations then try to writestep_createdfor the same correlationId with different stepNames. The server's entity idempotency rejects the loser at the entity level — but if the winner was the stale-view invocation, its branch decision is the one that lands in the log, and future replays will see astep_createdthey don't expect.Fencing the write makes the stale-view invocation lose the CAS, retry against the freshly-loaded tail, observe the events its snapshot was missing, and arrive at the same branch decision the authoritative invocation made.
Implementation
The retry loop from #2113's elapsed-wait scan is extracted into
packages/core/src/runtime/fenced-write.tsso the six new fenced sites can share it without copy-paste. Each call site provides its ownonConflictRefreshthat runs an idempotency check against the reloaded log (e.g. "is thiswait_createdalready in the log? → abort instead of retrying").handleSuspensionreceives the load-time tail eventId + cursor from the runtime, and runs all four of its fenced writes (step_created,hook_created,hook_disposed,wait_created) against that fence. Each successful write advances the fence so chained writes in the same suspension don't conflict with each other.The terminal
run_completed/run_failedwrites inruntime.tsuse the same helper, with the idempotency check verifying no other terminal event has landed since the snapshot.Test plan
@workflow/coreunit tests pass.run.lastKnownEventIdmaterialized as expected.