fix(core): preserve event-log order in hook-vs-sleep replay races#2171

Open
TooTallNate wants to merge 1 commit into codex/runtime-only-reused-sleep-repro from nate/fix-hook-sleep-replay-ordering

@TooTallNate
Member

Summary

Fixes a replay divergence in which a buffered hook payload races a concurrent sleep and the sleep wins a Promise.race that the committed event log says the hook won. On replay this surfaces as CorruptedEventLogError (seen in production as a step_created for one step being consumed by a different step's consumer).

Stacked on top of #2169 (the runtime-only repro). Base is the repro branch so this PR is repro + fix; it will reduce to the fix-only diff once #2169 merges.

Root cause

  • A buffered hook payload (a hook_received consumed before the workflow awaited the hook) was delivered to its consumer only at claim time (iterator.next() / await).
  • A concurrent wait_completed resolved synchronously in its promiseQueue slot — no hydration, fewer microtask hops — while the hook payload reaches the consumer through the async hook iterator (yield await this), which adds hops.
  • In Promise.race([hook, sleep]), sleep could therefore preempt an earlier-in-log hook payload, diverging from the committed log.
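
The hop-count asymmetry described above can be reproduced outside the runtime. The following is a minimal sketch, not code from `@workflow/core`: `afterHops` and `raceHookAgainstSleep` are hypothetical helpers, and the hop counts (3 for the hook, 1 for the sleep) are stand-ins for the async-iterator path and the direct `promiseQueue` slot.

```typescript
// Hypothetical helper: resolves `value` after `hops` microtask hops.
function afterHops<T>(value: T, hops: number): Promise<T> {
  let p: Promise<T> = Promise.resolve(value);
  for (let i = 0; i < hops; i++) {
    p = p.then((v) => v); // each .then adds one microtask hop
  }
  return p;
}

// The hook is listed first (and is earlier in the log) but needs more hops.
function raceHookAgainstSleep(): Promise<string> {
  const hook = afterHops("hook", 3); // models the async hook iterator path
  const sleep = afterHops("sleep", 1); // models the direct promiseQueue slot
  return Promise.race([hook, sleep]);
}

raceHookAgainstSleep().then((winner) => console.log(winner)); // logs "sleep"
```

The sleep wins purely because it settles in fewer microtask ticks, which is the divergence the fix has to rule out.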

This is specific to hook-vs-sleep: hook-vs-step and two-hook ordering already resolve correctly (verified by the added characterization tests), because the existing serial promiseQueue discipline is decryption-time independent for those.

Fix (three timing-independent parts)

  1. Anchor resolution at log position. A buffered hook payload now resolves through a promiseQueue slot chained at its log position (not at the later claim site), so ordering follows the event log regardless of hydration/decryption time.
  2. Cross-entity ordering barrier. Each in-flight buffered delivery registers a barrier keyed by its source hook_received eventId (ctx.pendingHookDeliveries). A later-in-log entity (sleep) defers behind any earlier-in-log in-flight hook delivery via awaitEarlierHookDeliveries.
  3. Macrotask release (hop-count independent). The barrier releases on a macrotask (setTimeout(0)) after the payload is claimed, so the consumer's branch decision — however many await hops deep — always commits before the deferring entity proceeds. This reuses the same macrotask-boundary technique scheduleWhenIdle already relies on; no microtask-hop heuristic. awaitEarlierHookDeliveries bounds its wait with a one-macrotask fallback so an unclaimed payload can't deadlock a deferring entity.
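
Parts 2 and 3 can be modeled roughly as follows. This is an illustrative sketch under stated assumptions, not the PR's implementation: only `pendingHookDeliveries` and `awaitEarlierHookDeliveries` are names from the PR, while `Ctx`, `logIndex`, `registerDelivery`, and `demo` are hypothetical, and event IDs are assumed to look like `evnt_<n>` with `n` giving log position.

```typescript
// Illustrative model only; the real signatures in @workflow/core may differ.
interface Ctx {
  // Keyed by the source hook_received eventId; settles when the barrier releases.
  pendingHookDeliveries: Map<string, Promise<void>>;
}

// Assumption: "evnt_<n>" IDs, where n is the position in the event log.
const logIndex = (eventId: string): number => Number(eventId.split("_")[1]);

// Registers a barrier and returns the callback to invoke once the payload is claimed.
function registerDelivery(ctx: Ctx, eventId: string): () => void {
  let release!: () => void;
  const barrier = new Promise<void>((resolve) => {
    release = resolve;
  });
  ctx.pendingHookDeliveries.set(eventId, barrier);
  return () => {
    // Release on a macrotask so the consumer's pending microtasks run first.
    setTimeout(() => {
      ctx.pendingHookDeliveries.delete(eventId);
      release();
    }, 0);
  };
}

async function awaitEarlierHookDeliveries(ctx: Ctx, eventId: string): Promise<void> {
  const earlier = Array.from(ctx.pendingHookDeliveries.entries())
    .filter(([id]) => logIndex(id) < logIndex(eventId))
    .map(([, barrier]) => barrier);
  if (earlier.length === 0) return;
  // Bounded wait: a one-macrotask fallback so an unclaimed payload cannot deadlock.
  await Promise.race([
    Promise.all(earlier),
    new Promise<void>((resolve) => {
      setTimeout(resolve, 0);
    }),
  ]);
}

// Usage: a sleep at evnt_5 defers behind a hook delivery at evnt_2.
async function demo(): Promise<string[]> {
  const ctx: Ctx = { pendingHookDeliveries: new Map() };
  const order: string[] = [];
  const claimed = registerDelivery(ctx, "evnt_2");
  const sleep = awaitEarlierHookDeliveries(ctx, "evnt_5").then(() => order.push("sleep"));
  claimed(); // the consumer claims the payload...
  let hops: Promise<void> = Promise.resolve();
  for (let i = 0; i < 10; i++) hops = hops.then(() => {}); // ...then takes several hops
  await hops.then(() => order.push("hook"));
  await sleep;
  return order;
}
```

In this model the hook's branch decision is recorded before the sleep proceeds, however many hops the consumer takes, because the deferring side only resumes on a macrotask.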

Why timing-independent

Decryption time, hydration time, and consumer await-chain depth are all irrelevant: a macrotask runs only after the entire pending microtask queue drains. Empirically validated by modeling the race (the iterator path costs 3 microtask hops; a fixed-K microtask release was fragile — the macrotask boundary is not).
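
The macrotask guarantee can be checked in isolation. A small sketch, with `macrotaskAfterMicrotasks` as a hypothetical helper name: a `setTimeout(0)` callback is scheduled before a microtask chain of arbitrary depth, and still runs after it.

```typescript
// Records the order in which a macrotask and a deep microtask chain complete.
function macrotaskAfterMicrotasks(hops: number): Promise<string[]> {
  const order: string[] = [];
  return new Promise<string[]>((resolve) => {
    // Scheduled first, yet runs last: timers fire only after the
    // pending microtask queue has drained.
    setTimeout(() => {
      order.push("macrotask");
      resolve(order);
    }, 0);

    let chain: Promise<void> = Promise.resolve();
    for (let i = 0; i < hops; i++) {
      chain = chain.then(() => {}); // one microtask hop per link
    }
    chain.then(() => order.push("microtasks done"));
  });
}

macrotaskAfterMicrotasks(1000).then((order) => console.log(order));
// logs ["microtasks done", "macrotask"]
```

The result is the same for any value of `hops`, which is why no hop-count constant needs tuning.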

Tests

  • hook-sleep-interaction.test.ts: hook-vs-sleep (the repro, both sync + async-deser modes), hook-vs-step, two-hook slow-decrypt ordering.
  • Full @workflow/core suite: 630/630, typecheck clean, stable across repeated runs, no hangs.

Scope

Pre-existing runtime bug on stable; independent of the OCC / fenced-write work, so it can ship on its own.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com

A buffered hook payload (a `hook_received` consumed before the workflow
awaited the hook) was delivered to its consumer only at claim time
(`iterator.next()`/`await`), and a concurrent `wait_completed` resolved
synchronously with fewer microtask hops. When the workflow did
`Promise.race([hook, sleep])`, sleep could win a race the committed event
log says the hook won, surfacing as `CorruptedEventLogError` on replay
(observed in production: a `step_created` for one step consumed by a
different step's consumer).

Fix, in three timing-independent parts:

1. Resolve a buffered hook payload through a `promiseQueue` slot chained
   at its log position (not at the later claim site), so resolution order
   stays anchored to the event log regardless of hydration/decryption
   time.

2. Register a per-delivery ordering barrier keyed by the source
   `hook_received` eventId (`ctx.pendingHookDeliveries`). A later-in-log
   entity (sleep's `wait_completed`) defers behind any earlier-in-log
   in-flight hook delivery via `awaitEarlierHookDeliveries`.

3. Release the barrier on a MACROTASK (`setTimeout(0)`) after the payload
   is claimed, so the consumer's branch decision — however many await
   hops deep through the async hook iterator — always commits before the
   deferring entity proceeds. This reuses the macrotask-boundary technique
   `scheduleWhenIdle` already relies on and is fully hop-count- and
   decryption-time independent (no microtask-hop heuristic).
   `awaitEarlierHookDeliveries` bounds its wait with a one-macrotask
   fallback so an unclaimed payload cannot deadlock a deferring entity.
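
Part 1 relies on a serial queue whose slots are reserved in log order. The sketch below is a hypothetical model of that discipline (`SerialQueue`, `delay`, and `replayInLogOrder` are illustrative names, not the `promiseQueue` implementation): a slow-to-decrypt hook payload enqueued first still resolves before an instant sleep completion enqueued second.

```typescript
// Hypothetical model of a serial promise queue; not the @workflow/core code.
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  // Reserves a slot now (at log position); `work` starts only after
  // every earlier slot has settled.
  enqueue<T>(work: () => Promise<T>): Promise<T> {
    const slot = this.tail.then(work);
    this.tail = slot.catch(() => undefined); // keep the chain alive on failure
    return slot;
  }
}

const delay = <T>(ms: number, value: T): Promise<T> =>
  new Promise<T>((resolve) => setTimeout(() => resolve(value), ms));

async function replayInLogOrder(): Promise<string[]> {
  const queue = new SerialQueue();
  const resolved: string[] = [];
  // Log order: hook payload first (slow hydration), sleep completion second (instant).
  const hook = queue.enqueue(() => delay(20, "hook")).then((v) => resolved.push(v));
  const sleep = queue.enqueue(() => Promise.resolve("sleep")).then((v) => resolved.push(v));
  await Promise.all([hook, sleep]);
  return resolved; // ["hook", "sleep"], regardless of hydration time
}
```

Because the slot is reserved when the event is consumed from the log rather than when the workflow claims the payload, hydration latency cannot reorder the two resolutions.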

Adds characterization tests covering hook-vs-sleep (the repro),
hook-vs-step, and a two-hook slow-decrypt ordering case. Pre-existing
runtime bug on `stable`, independent of the OCC/fence work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 30, 2026 00:43
@TooTallNate TooTallNate requested a review from a team as a code owner May 30, 2026 00:43
@changeset-bot

changeset-bot Bot commented May 30, 2026

⚠️ No Changeset found

Latest commit: ca169cc

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
@vercel
Contributor

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| example-nextjs-workflow-turbopack | Ready | Preview, Comment | May 30, 2026 12:46am |
| example-nextjs-workflow-webpack | Ready | Preview, Comment | May 30, 2026 12:46am |
| example-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-astro-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-express-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-fastify-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-hono-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-nitro-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-nuxt-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-sveltekit-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-tanstack-start-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workbench-vite-workflow | Ready | Preview, Comment | May 30, 2026 12:46am |
| workflow-swc-playground | Ready | Preview, Comment | May 30, 2026 12:46am |
| workflow-tarballs | Ready | Preview, Comment | May 30, 2026 12:46am |
| workflow-web | Ready | Preview, Comment | May 30, 2026 12:46am |

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a deterministic-replay divergence in @workflow/core where a buffered hook_received payload could lose a Promise.race against a concurrently-resolving sleep (wait_completed) during replay, even when the committed event log indicates the hook branch won. The change anchors buffered hook payload delivery to its event-log position and introduces a cross-entity ordering barrier so later-in-log sleeps defer until earlier hook deliveries are observed.

Changes:

  • Anchor buffered hook_received payload hydration to the payload’s log position via ctx.promiseQueue (instead of scheduling at claim/iterator.next() time).
  • Add ctx.pendingHookDeliveries and awaitEarlierHookDeliveries() to defer wait_completed behind earlier-in-log buffered hook deliveries.
  • Add characterization/regression tests covering hook-vs-sleep, hook-vs-step, and two-hook ordering under slow async hydration.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/core/src/workflow/sleep.ts | Defers wait_completed resolution behind earlier buffered hook deliveries. |
| packages/core/src/workflow/hook.ts | Reworks buffered hook payload handling to resolve at log position and register ordering barriers. |
| packages/core/src/private.ts | Adds pendingHookDeliveries to context and implements awaitEarlierHookDeliveries(). |
| packages/core/src/workflow.ts | Initializes pendingHookDeliveries in the runtime context. |
| packages/core/src/hook-sleep-interaction.test.ts | Adds regression/characterization tests for ordering across hook/sleep/step races. |
| packages/core/src/workflow/sleep.test.ts | Updates test harness context initialization with pendingHookDeliveries. |
| packages/core/src/step.test.ts | Updates test harness context initialization with pendingHookDeliveries. |
| packages/core/src/async-deserialization-ordering.test.ts | Updates test harness context initialization with pendingHookDeliveries. |

Comment on lines +225 to +230
    await Promise.race([
      Promise.all(earlier),
      new Promise<void>((resolve) => {
        setTimeout(resolve, 0);
      }),
    ]);

     * different branch or is suspending), `awaitEarlierHookDeliveries`
     * bounds its wait with a one-macrotask fallback.
     */
    pendingHookDeliveries: Map<string, Promise<void>>;
Comment on lines +423 to +426
// Ordered durable history where the hook branch already won the race
// against a *step* (not a sleep): the hook payload (evnt_2) precedes
// the racing step's completion (evnt_9), and the committed branch is
// the hook (drainStep created at evnt_10).