[core] Optimistic concurrency control for branch-decision event writes #2113
VaguelySerious wants to merge 10 commits
The elapsed-wait scan now snapshots the loaded events' tail eventId and passes it as `lastKnownEventId` on each `wait_completed` write, so a concurrent `resumeHook` that has already advanced the canonical log is detected — the server's CAS rejects the write, we surface it as the existing `EntityConflictError`, and the next iteration re-replays against the fresh event list (mirroring the duplicate-wait fall-through that was already there). `resumeHook` sends `asOfTimestamp` (Date.now() at call time) so the server resolves the fence to the highest eventId strictly before resume time — no client-side event pre-read needed. Plumbed through `CreateEventParams` on `@workflow/world` so future worlds can forward as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
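A minimal sketch of the compare-and-set described above, assuming a hypothetical in-memory log in place of the real server. `InMemoryEventLog`, `AppendResult`, and the `evt_N` ids are illustrative names only; the actual server surfaces a rejection as an error rather than a result value.

```typescript
// Hypothetical in-memory stand-in for the canonical event log.
interface LogEvent {
  eventId: string;
  eventType: string;
}

type AppendResult =
  | { ok: true; event: LogEvent }
  | { ok: false; reason: "fence_conflict"; currentTail: string | undefined };

class InMemoryEventLog {
  private readonly events: LogEvent[] = [];

  tail(): string | undefined {
    return this.events.length > 0
      ? this.events[this.events.length - 1].eventId
      : undefined;
  }

  // Unfenced when lastKnownEventId is omitted; otherwise a compare-and-set
  // against the current tail of the log.
  append(eventType: string, lastKnownEventId?: string): AppendResult {
    const currentTail = this.tail();
    if (lastKnownEventId !== undefined && lastKnownEventId !== currentTail) {
      return { ok: false, reason: "fence_conflict", currentTail };
    }
    const event: LogEvent = {
      eventId: `evt_${this.events.length + 1}`,
      eventType,
    };
    this.events.push(event);
    return { ok: true, event };
  }
}

const log = new InMemoryEventLog();
log.append("wait_created");

// The tick snapshots the tail of the events it loaded.
const fence = log.tail();

// A concurrent resumeHook advances the log with an unfenced write.
log.append("hook_received");

// The tick's stale wait_completed write is rejected instead of landing.
const staleWrite = log.append("wait_completed", fence);

// A write fenced on the freshly reloaded tail is accepted.
const freshWrite = log.append("wait_completed", log.tail());
```

In this sketch the stale write never reaches the log, which is the property the fence is meant to provide; what the tick does after reloading (replay, then possibly a different branch) is outside the sketch.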
🦋 Changeset detected. Latest commit: 98c9741. The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results
❌ Some tests failed
Failed tests: 💻 Local Development (1 failed): fastify-stable (1 failed)
Details by category:
- ✅ ▲ Vercel Production
- ❌ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
📊 Benchmark Results
Each scenario was run on 💻 Local Development and ▲ Production (Vercel); observability: Next.js (Turbopack).
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
Stream Benchmarks (includes TTFB metrics):
- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)
Summary: Fastest Framework by World and Fastest World by Framework (winner determined by most benchmark wins).
❌ Some benchmark jobs failed. Check the workflow run for details.
The lazy-refs branch of createWorkflowRunEventInner forgot to thread `lastKnownEventId` and `asOfTimestamp` into the request body, so the fence was silently dropped for any event whose type went through the lazy path (i.e., not in `eventsNeedingResolve`). The resolve branch already had the forwarding. Caught by Vercel Agent Review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
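One way to keep the two branches from drifting is to build the request body in a single place. The sketch below is an assumption about shape only: `buildEventRequestBody` and `FenceParams` are hypothetical names, not functions from the SDK.

```typescript
interface FenceParams {
  lastKnownEventId?: string;
  asOfTimestamp?: number;
}

// Hypothetical helper: both the resolve branch and the lazy-refs branch would
// call this, so neither can forget to forward the fence fields.
function buildEventRequestBody(
  base: Record<string, unknown>,
  fence: FenceParams,
): Record<string, unknown> {
  const body: Record<string, unknown> = { ...base };
  if (fence.lastKnownEventId !== undefined) {
    body["lastKnownEventId"] = fence.lastKnownEventId;
  }
  if (fence.asOfTimestamp !== undefined) {
    body["asOfTimestamp"] = fence.asOfTimestamp;
  }
  return body;
}

// Fenced write: the fence field is forwarded.
const lazyBody = buildEventRequestBody(
  { eventType: "wait_completed" },
  { lastKnownEventId: "evt_41" },
);

// Unfenced write: no fence fields are added.
const unfencedBody = buildEventRequestBody({ eventType: "hook_received" }, {});
```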
Vercel Agent review acknowledged + addressed in 1e69c82: the lazy branch of
Status after
I don't think this is from anything in this PR:
Reads like a long-standing abort-stream-propagation timing flake that just happened to fire on fastify Vercel Prod this run. Will rerun the job once the workflow run is no longer in-progress; flagging here in case a reviewer hits it before I get back to it.
Co-authored-by: Peter Wielander <mittgfu@gmail.com> Signed-off-by: Peter Wielander <mittgfu@gmail.com>
Good catch: the answer is "yes, conceptually they can race the same way." Here's the breakdown of where we currently fence vs. don't, and what's at risk:
Where the SDK currently sends a fence
Only one site: the elapsed-wait scan in
Where it doesn't, but could race in the same shape
Any write that the workflow runtime makes based on a branch decision driven by the loaded events array can race in the same way the production hook/sleep bug did. Concretely:
What doesn't need a fence (and why)
Why the wait_completed fence alone catches the production bug today
The hook/sleep race needs: (1) tick takes sleep branch with stale events, (2) tick writes wait_completed, and (3) tick writes any sleep-branch follow-up events (more sleeps, the next iteration's allocations). My current fence trips at step (2): the wait_completed CAS fails, the retry loop reloads events, sees
But that's specific to the hook-vs-sleep shape because
Recommendation
For full coverage of the general "stale-snapshot branch decision" race, fence all 6 sites in the table above. Implementation cost is modest: the suspension-handler already has the events array available (the runtime invokes it with the loaded events); we'd thread the tail eventId in the same way I did for the elapsed-wait scan, and add the same retry-on-fence-conflict pattern to each catch block.
Want me to do it in this PR or keep it as a follow-up? My slight lean is follow-up: the current PR cleanly resolves the production bug and it's already a non-trivial behavioral change; the broader fencing is mostly defensive against shapes we haven't seen in the wild. But the risk of leaving them unfenced is real if we're betting that no other workflow patterns trigger the same race.
You're right, I was sloppy. Let me walk through it. Your example:
TooTallNate left a comment
Reviewed against a fresh failing run (wrun_01KSNVVXH82P5GM40D6F9152JF, deployed against the matching workflow-server preview). The fence catches the original hook/sleep production shape, but the run still produced CORRUPTED_EVENT_LOG because the corruption path doesn't run through wait_completed.
TL;DR
The SDK side of this fix only fences one write site — the elapsed-wait scan's wait_completed. Every other branch-decision write from suspension-handler.ts (step_created, hook_created, wait_created, hook_disposed) plus the terminal writes (run_completed, run_failed) still go through unfenced. That leaves the very gap you already articulated in the 2026-05-27T11:02:45Z comment: "A race that doesn't go through wait_completed (e.g., a branch decision around Promise.race([hook, someStep]) where the workflow allocates a different step depending on which side wins) wouldn't be caught — the stale step_created/hook_created would land without a fence check."
Our failing run is exactly that case.
Evidence from wrun_01KSNVVXH82P5GM40D6F9152JF
Reproduced with the WF_TRACE instrumentation from #2127. Critical sequence (eventIds and write timestamps from DynamoDB):
| eventId | type | corr | write time |
|---|---|---|---|
| …3CFWJ | step_completed | step_…PMW (sync iter 3) | 23:21:17.967 |
| …3ZNRF | wait_completed | wait_…PMP | 23:21:18.581 |
| …41F5T | step_created (drain) | step_…PMX | 23:21:18.639 |
What happened:
- Two replays (inv 269, 327) loaded events at `ec=205`: last event was `step_completed PMW`. No `wait_completed` yet. Same digest `94e1262add726a3e` in both.
- Both replays' iter-3 race resolved with wake (hook payloads were buffered from the resume hammer), so the workflow called `drain` and the suspension handler wrote `step_created PMX (drain)` unfenced.
- Concurrently, some other invocation's elapsed-wait scan wrote `wait_completed PMP` at 23:21:18.581, 58 ms before the drain write at 23:21:18.639.
- 52 subsequent replays loaded the same `ec=208 / digest=e8d1f86e7a43de3d` event log and every one of them failed with `step_mismatch` on PMX (their iter-3 race resolved with sleep because `wait_completed` is now in the log, so iter 4's `useStep('sync')` got allocated the correlationId that the log records as `drain`).
The critical point: the missed event (wait_completed PMP) has a higher eventId than the tick's tail (step_completed PMW), so this is precisely the symmetric-case race the current fence design covers — if the write being fenced is the step_created PMX (drain) write. But it isn't, because this PR doesn't fence step writes.
With the current PR applied to this run:
- The hypothetical concurrent `wait_completed` write would be fenced (and either land first or fail CAS and retry).
- But `step_created PMX (drain)` from inv 269/327 is still unfenced, so it still lands, still corrupts the log, and future replays still fail.
Requesting changes
Per your own follow-up table in the 2026-05-27T11:02:45Z comment — the writes that need fences are:
- `suspension-handler.ts`: `step_created`, `hook_created`, `wait_created`, `hook_disposed`
- `runtime.ts` terminal-state writes: `run_completed`, `run_failed`
Fencing each to the tick's loaded events tail (with a parallel retry-on-conflict loop modeled after the one this PR adds for wait_completed) should close the failure mode we're seeing.
The deeper edge case you describe at the end of that comment (eventually-consistent reads + missed event with a lower eventId than the tick's tail) is a separate concern that the current fence design genuinely cannot catch — that's fine to defer. Our run is unambiguously the symmetric case the fence already handles, just at a write site that isn't fenced yet.
What I'd like to see before merge
Either:
- Extend the fence to the writes in your table (preferred — completes the design), or
- Land this PR as-is but with a `KNOWN_ISSUES` note documenting which races it does and doesn't catch, plus a tracking issue for the follow-up.
The production hook/sleep shape this PR validated against does demonstrably stop reproducing — that's solid. But shipping it as the fix for CORRUPTED_EVENT_LOG without the rest of the table is going to be misleading once a Promise.race([hook, step])-shaped run hits prod.
Happy to provide the full WF_TRACE export and DynamoDB dump for wrun_01KSNVVXH82P5GM40D6F9152JF if helpful for designing the extension.
* [DEBUG] Trace replay event log and step/hook/sleep assignments

Temporary diagnostic instrumentation for investigating intermittent CorruptedEventLogError 'step consumer mismatch' failures. Emits console.log lines tagged 'WF_TRACE' at four points:
- runWorkflow start: dumps the full event array the replay will consume (eventIds, types, correlationIds, stepNames) plus a sha256 digest
- step/hook/sleep subscribe: per-replay correlationId -> name assignment
- step consumer mismatch: structured record of the failure including the event index in the SDK's view of the log
- runWorkflow end: completed | failed | suspended

Used to diff successive replays of the same runId and confirm whether the SDK actually sees the same event array each time.

* [DEBUG] Extend OCC fence to all branch-decision writes

Peter's PR #2113 fences `wait_completed` writes from the elapsed-wait scan. This commit extends the fence to every other write whose outcome depends on a branch decision the workflow VM made from its loaded event log, per the table @VaguelySerious himself laid out in his PR comment:

suspension-handler.ts:
- step_created (the smoking gun on wrun_01KSPS7XEGHF4A6WYF4DB03D40)
- hook_created
- hook_disposed
- wait_created

runtime.ts terminal writes:
- run_completed
- run_failed

`hook_received` is deliberately NOT fenced (Peter's reasoning preserved verbatim: fencing the user's signal would drop it on contention; stale-snapshot protection belongs on the writes that consume hooks, not the ones that deliver them).

The fence value is the load-time tail of the events array passed into `runWorkflow`. `suspension-handler` receives the fence + cursor from the runtime and reloads on conflict; the runtime's terminal writes read the cursor directly.

The new `__fenced-write.ts` helper encapsulates the retry loop so we don't have to copy/paste Peter's pattern six times. It's named with the leading-underscore convention to flag it as throwaway diagnostic code, matching `__debug-replay-trace.ts`.

* [DEBUG] Point at Peter's workflow-server PR 447 preview + map HTTP 412

Two changes both needed for the extended-fence test loop to actually exercise the OCC code path on the server:
1. Hardcode WORKFLOW_SERVER_URL_OVERRIDE to https://workflow-server-83nn57dvc.vercel.sh (preview deployment of workflow-server PR 447, branch alias workflow-server-git-peter-event-write-cas.vercel.sh). The previous preview at workflow-server-7pxaxn4d4.vercel.sh was Pranay's monotonic-append PR 456: different fix, doesn't implement the CAS the SDK side now sends.
2. Map HTTP 412 → EntityConflictError in the world-vercel error mapper. workflow-server PR 447 returns 412 with a 'fence conflict' message for EventLogFenceConflictError; the SDK's existing fence-retry loops (Peter's wait_completed scan + the new ones in suspension-handler and runtime terminal writes) match on /fence conflict/i against the message of an EntityConflictError. Without this mapping the 412 falls through to WorkflowWorldError and the regex match never fires.

* fix(core): chain fences for replay-created events
* test: clear workflow server debug override
* fix(core): scope step dispatch to owners
* fix(core): recover wait-raced step dispatch

* Remove diagnostic instrumentation, rename fenced-write helper

Strip the WF_TRACE replay tracing that was used to diagnose the CORRUPTED_EVENT_LOG race; it's served its purpose now that the fix is in. Specifically:
- Delete packages/core/src/__debug-replay-trace.ts and its 8 call sites in workflow.ts, step.ts, workflow/hook.ts, workflow/sleep.ts.
- Drop the matching [DEBUG] inline narrative comments at each call site.
- Rename packages/core/src/runtime/__fenced-write.ts → fenced-write.ts (the leading-underscore convention marked it as throwaway diagnostic code; the helper is intended to stay).
- Trim the file header on fenced-write.ts and the related narrative comment in suspension-handler.ts to drop the failing-runId / PR-number references that only made sense in the debug context.

No behavioral change. typecheck clean (0 errors); 1014/1014 unit tests pass (same as parent commit 77f057a).

* world-vercel: always preserve fence-conflict marker on HTTP 412

Address Copilot review on PR 2132 (#2132 (comment)). The fence-retry loop in runtime/fenced-write.ts detects OCC conflicts via /fence conflict/i.test(err.message). The 412 branch was relying on the server's JSON body to populate that message via errorData.message, but parseResponseBody().catch(() => ({})) swallows JSON parse failures silently, so any non-JSON 412 response (CDN HTML, gateway timeout page, intermediate proxy error) would surface as EntityConflictError("<METHOD> /endpoint -> HTTP 412: Precondition Failed"), the regex would miss it, and the retry loop would mis-classify the conflict as terminal.

Prefix the message with `fence conflict:` whenever the parsed body didn't already carry the marker, so the retry detection is robust to response-body parse failures.

Tests: world-vercel 69/69 pass.

---------

Co-authored-by: Peter Wielander <peter.wielander@vercel.com>
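The marker-preservation rule in the last commit can be sketched as a small pure function. This is an illustration under assumptions, not the world-vercel code: `messageForHttp412` is a hypothetical name, and the fallback message format is copied from the commit text above.

```typescript
const FENCE_MARKER = /fence conflict/i;

// bodyMessage is undefined when the 412 response body could not be parsed
// as JSON (for example an HTML error page from a proxy).
function messageForHttp412(
  method: string,
  endpoint: string,
  bodyMessage?: string,
): string {
  const base =
    bodyMessage ?? `${method} ${endpoint} -> HTTP 412: Precondition Failed`;
  // Add the marker only when the message does not already carry it.
  return FENCE_MARKER.test(base) ? base : `fence conflict: ${base}`;
}

// Server JSON body already carries the marker: passed through unchanged.
const fromJsonBody = messageForHttp412(
  "POST",
  "/events",
  "Fence conflict: tail moved",
);

// Unparseable body: the marker is added client-side from the status alone.
const fromHtmlBody = messageForHttp412("POST", "/events");
```

Because the prefix depends only on the status code, the `/fence conflict/i` check keeps matching even if the server rewords its message or an intermediary replaces the body.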
TooTallNate left a comment
Re-reviewed after b5c567c (extended OCC fence to all branch-decision writes, #2132) and a7efa5a (always preserve fence-conflict marker on HTTP 412).
All three of my earlier inline comments are addressed:
- Fence coverage (runtime.ts:800): now applied to `step_created`, `hook_created`, `hook_disposed`, `wait_created`, `run_completed`, `run_failed` via the shared `fencedEventCreate` helper, in addition to `wait_completed`. All branch-decision writes are fenced.
- Brittle string match (runtime.ts:842): `packages/world-vercel/src/utils.ts` now anchors detection on HTTP 412 and prefixes the message with `fence conflict:` client-side, so the `/fence conflict/i` regex in `fenced-write.ts` cannot regress against server wording changes.
- `hook_received` unfenced (resume-hook.ts:159): confirmed deliberate, preserved.
Stress reproduction with this branch + the paired backend preview: 0/40 cycles surface CORRUPTED_EVENT_LOG (baseline on stable: ~2/40).
LGTM — clearing my CHANGES_REQUESTED block.
- Log each fence-conflict retry at info level (datadog visibility for the retry path between the first conflict and the give-up warning).
- Keep prior fence when a create response is missing `event`; emit a warn instead of silently advancing to a value we didn't observe on the wire. The schema marks `event` as optional for legacy compat; in practice creates always return it, but the type permits drift.
- Replace `events.some(…)` inside the reload-merge loop with a Set-based dedup so the retry path is O(n + m) instead of O(n × m).
- Drop `asOfTimestamp` from CreateEventParams. The original motivation was `resumeHook`-style writes; the runtime keeps `hook_received` unfenced (fencing the user signal would drop it on contention) so nothing in this PR exercises the param. Reintroduce when a real caller appears.
Addresses inline review comments on #2113.
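For illustration, the Set-based dedup mentioned in that commit could look roughly like the following. `mergeReloaded` and `LoggedEvent` are hypothetical names, and the real merge may mutate the local array in place rather than return a copy.

```typescript
interface LoggedEvent {
  eventId: string;
  eventType: string;
}

// Appends reloaded events that are not already present. Building the Set
// costs O(n) and each membership check is O(1) on average, so the merge is
// O(n + m) rather than the O(n × m) of a nested events.some(...) scan.
function mergeReloaded(
  events: LoggedEvent[],
  reloaded: LoggedEvent[],
): LoggedEvent[] {
  const seen = new Set<string>(events.map((e) => e.eventId));
  const merged = [...events];
  for (const candidate of reloaded) {
    if (!seen.has(candidate.eventId)) {
      seen.add(candidate.eventId);
      merged.push(candidate);
    }
  }
  return merged;
}

const local: LoggedEvent[] = [
  { eventId: "evt_1", eventType: "wait_created" },
  { eventId: "evt_2", eventType: "step_completed" },
];
const fromCursor: LoggedEvent[] = [
  { eventId: "evt_2", eventType: "step_completed" },
  { eventId: "evt_3", eventType: "hook_received" },
];
const mergedEvents = mergeReloaded(local, fromCursor);
```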
The four `WorkflowWorldError` subclasses surfaced by the runtime's own calls into the world layer represent infrastructure-level conditions, not user-code failures:
- `EntityConflictError`: CAS rejection on event writes (409 / 412), including OCC fence conflicts that exhausted the in-place retry budget in `fenced-write.ts`.
- `RunExpiredError`: 410, run was cleaned up or already terminal.
- `TooEarlyError`: 425, retry-after timestamp not yet reached.
- `ThrottleError`: 429, rate limited by the workflow backend.
When any of these reach `classifyRunError` the runtime's own retry logic has already exhausted (otherwise the error would have been swallowed upstream). The truthful classification is `RUNTIME_ERROR`, not `USER_ERROR`. Same shape as the existing entries (`WorkflowRuntimeError`, `WorkflowNotRegisteredError`, `StepNotRegisteredError`).
The bare `WorkflowWorldError` parent stays out of the runtime list: it can also surface from user-code `fetch` calls into the workflow API, where `USER_ERROR` is the correct attribution (see the existing "WorkflowWorldError with status 500" test).
Caught during the stress validation of #2113 + workflow-server#447 end-to-end: fence-conflict retries exhausting under the 180-way hook race were surfacing as `USER_ERROR`, masking an obvious infra condition as user code. Tests added for each of the four subclasses.
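A rough sketch of that classification rule, matching on error names for brevity. `classifyWorldError` is a hypothetical stand-in; the real `classifyRunError` presumably inspects error instances and covers more cases than shown here.

```typescript
type RunErrorClass = "RUNTIME_ERROR" | "USER_ERROR";

// Subclass names taken from the commit message; matching by name is a
// simplification made for this sketch.
const RUNTIME_WORLD_ERRORS = new Set<string>([
  "EntityConflictError",
  "RunExpiredError",
  "TooEarlyError",
  "ThrottleError",
]);

function classifyWorldError(errorName: string): RunErrorClass {
  return RUNTIME_WORLD_ERRORS.has(errorName) ? "RUNTIME_ERROR" : "USER_ERROR";
}

// The four subclasses are attributed to the runtime...
const throttled = classifyWorldError("ThrottleError");
// ...while the bare parent stays attributable to user code.
const bareParent = classifyWorldError("WorkflowWorldError");
```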
karthikscale3 left a comment
Reviewed the OCC fencing approach end-to-end — the design is sound and unusually well-documented, and the repro evidence (0/40 vs ~2/40 baseline) is convincing. Three things I'd want addressed before merge, called out inline:
- Duplication: the `wait_completed` loop in `runtime.ts` reimplements `fencedEventCreate`'s entire fence/retry/reload/backoff pattern (incl. a second `MAX_FENCE_RETRIES` and a duplicate `/fence conflict/i` regex). These will drift.
- Owner-scoped dispatch is the riskiest behavioral change and the least directly tested.
- `fenced-write.ts` has no direct unit tests despite being the most subtle new code.
Minor (non-blocking): deterministic 25 * attempt backoff has no jitter (can resync conflicting writers under the exact storm this targets); attempts > MAX_FENCE_RETRIES yields 6 attempts, not the stated 5; please confirm the world-local completedMessages cache can't suppress a legitimate step re-dispatch that reuses a completed idempotency key; and there's stray whitespace in the new resume-hook.ts comment plus two unrelated blank-line additions in step.ts / workflow/hook.ts.
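The two backoff remarks can be made concrete with a small sketch. The loop shape is an assumption (one write attempt, then a give-up check), and `attemptsUntilGiveUp` and `jitteredBackoffMs` are hypothetical helpers, not code from the PR.

```typescript
const MAX_FENCE_RETRIES = 5;

// Counts write attempts when every attempt hits a fence conflict, for a
// given give-up check evaluated after each failed attempt.
function attemptsUntilGiveUp(
  shouldGiveUp: (attempts: number) => boolean,
): number {
  let attempts = 0;
  for (;;) {
    attempts++; // one write attempt, assumed to conflict
    if (shouldGiveUp(attempts)) {
      return attempts;
    }
  }
}

// `attempts > MAX` lets a sixth attempt through before giving up.
const strictGreater = attemptsUntilGiveUp((n) => n > MAX_FENCE_RETRIES);
// `attempts >= MAX` stops after the fifth.
const greaterOrEqual = attemptsUntilGiveUp((n) => n >= MAX_FENCE_RETRIES);

// Full jitter: a uniform delay in [0, 25 * attempt) instead of exactly
// 25 * attempt, so conflicting writers are less likely to retry in lockstep.
function jitteredBackoffMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * 25 * attempt);
}
```

Passing `random` as a parameter is a choice made here so the delay can be pinned in a unit test; production code could simply call `Math.random`.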
    events.length > 0
      ? events[events.length - 1].eventId
      : undefined;
    const MAX_FENCE_RETRIES = 5;
[Emphasis #1 — duplication] This inline wait_completed loop reimplements the exact fence → 412 → reload → idempotency-check → retry-with-backoff pattern that fenced-write.ts was created to centralize — including a second copy of MAX_FENCE_RETRIES = 5 and a duplicated /fence conflict/i regex.
This is the biggest maintainability concern in the PR: two copies of the same protocol will drift, and a future fix to one won't reach the other. The loop's extra requirements (chaining the fence across multiple waits, merging reloaded events into the local events array) look expressible by calling fencedEventCreate per wait with an onConflictRefresh closure that does the merge. If full unification isn't feasible now, please at least share the MAX_FENCE_RETRIES constant and the isFenceConflict helper from fenced-write.ts rather than re-declaring them here.
    }
    }
    const queueablePendingSteps =
[Emphasis #2 — owner-scoped dispatch is the riskiest change] This flips a core invariant: from "queue every pending step (minus inline)" to "queue only owned + recoverable steps, unless a wait is pending." The safety argument holds only if the winning owner always finishes its tick, or a redelivery (metadata.attempt > 1) triggers the recovery set.
The gap I'd want covered by a test: a step whose step_created was written by a concurrent handler that then crashed, observed here on a fresh attempt === 1 delivery with no pending wait — it's not in createdStepCorrelationIds, not in recoverablePendingStepCorrelationIds (empty when attempt === 1), and not the inline step, so neither this handler nor the dead owner dispatches it until a later redelivery bumps attempt. That's the intended crash window, but it's a stall risk that the hook/sleep repro doesn't exercise. Can we add a targeted stress/integration case for owner-crash-after-step_created-before-dispatch (both with and without a pending wait)?
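The stall scenario can be illustrated with a sketch of the selection rule as the comment describes it. The set names come from the comment; `selectQueueableSteps`, the `DispatchInput` shape, and the assumption that a pending wait falls back to queueing every non-inline pending step are guesses made for illustration.

```typescript
interface DispatchInput {
  // correlationIds that have step_created but no completion yet
  pendingSteps: string[];
  // steps whose step_created this handler wrote in the current tick
  createdStepCorrelationIds: Set<string>;
  // populated only on redelivery (attempt > 1); empty on a first delivery
  recoverablePendingStepCorrelationIds: Set<string>;
  inlineStep?: string;
  hasPendingWait: boolean;
}

function selectQueueableSteps(input: DispatchInput): string[] {
  return input.pendingSteps.filter((id) => {
    if (id === input.inlineStep) return false;
    // Assumed fallback: with a pending wait, queue every pending step.
    if (input.hasPendingWait) return true;
    return (
      input.createdStepCorrelationIds.has(id) ||
      input.recoverablePendingStepCorrelationIds.has(id)
    );
  });
}

// A step created by a concurrent handler that crashed before dispatching it,
// observed on a first delivery with no pending wait.
const orphanOnFirstDelivery = selectQueueableSteps({
  pendingSteps: ["step_orphan"],
  createdStepCorrelationIds: new Set<string>(),
  recoverablePendingStepCorrelationIds: new Set<string>(),
  hasPendingWait: false,
});

// The same step once a redelivery places it in the recoverable set.
const orphanOnRedelivery = selectQueueableSteps({
  pendingSteps: ["step_orphan"],
  createdStepCorrelationIds: new Set<string>(),
  recoverablePendingStepCorrelationIds: new Set<string>(["step_orphan"]),
  hasPendingWait: false,
});
```

Under these assumptions the orphaned step is skipped on the first delivery and only picked up after a redelivery, which is the window the requested test would cover.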
     * the abort-vs-rethrow decision (preserves the existing
     * "EntityConflictError → log and skip" behavior for callers that want it).
     */
    export async function fencedEventCreate(
[Emphasis #3 — needs direct unit tests] This helper carries the most subtle logic in the PR (retry budget + off-by-one boundary, abort-vs-rethrow on non-fence EntityConflictError, the missing-result.event path that keeps the prior fence, and fence advancement on success) yet has no dedicated unit test — only suspension-handler.test.ts exercises the happy-path chaining, and the rest is covered solely by the e2e stress run.
A table-driven test here would lock in the contract cheaply: (a) success advances the fence and returns the event; (b) N fence conflicts then success; (c) exceeding MAX_FENCE_RETRIES rethrows; (d) onConflictRefresh returning abort yields written:false; (e) non-fence EntityConflictError honors onEntityConflict abort vs rethrow; (f) missing result.event keeps the prior fence and logs. This is the seam most likely to regress silently.
When a fenced event write rejects with EventLogFenceConflictError, the
SDK previously retried up to MAX_FENCE_RETRIES = 5 times against a
freshly-loaded tail, with linear backoff. Under stress this behaved
poorly in two ways:
1. The retry loop spins against an ever-changing tail. Under high
contention (e.g. a hook flood triggering many concurrent ticks),
exhausting the budget throws an EntityConflictError, which surfaces
as run_failed — a transient infra condition mis-classified as a
terminal failure.
2. The retries amplified the server-side stuck-fence pattern (run.lastKnownEventId
advancing past a non-existent eventId due to the patch-then-PUT
non-atomicity documented in c06d6ce of workflow-server). Every
retry that hit the same stale fence wasted compute and prolonged
the affected window.
Switch to bail-on-conflict: on fence conflict, fencedEventCreate
returns {written: false} immediately. No retry, no throw, no
re-enqueue. A fence conflict means another invocation has the
canonical view of the event log — the canonical invocation is
responsible for whatever progress the workflow needs, and the
losing invocation just exits cleanly.
This matches the existing workflow-server comment ('the @workflow/core
suspension handler swallows it') and the original design intent.
Net change: ~250 lines removed from runtime.ts + fenced-write.ts +
suspension-handler.ts. The custom retry loop in the wait_completed
elapsed-wait scan is also folded into fencedEventCreate.
Behavioral effects:
- USER_ERROR / RUNTIME_ERROR run failures from exhausted fence retries
are eliminated. Fence conflicts no longer mark runs as failed.
- Higher hook-payload throughput under stress: the 180-way race is no
longer amplified by 5x retries per losing lambda.
- The server-side stuck-fence window (patch-then-PUT non-atomicity)
is unchanged — that needs to be addressed in workflow-server, not
here. But the SDK no longer makes it worse by spinning.
Tests: 1018 core tests pass. The previously-tested 'fence-conflict
retries and reloads' behavior is removed; the replacement behavior
('fence-conflict returns {written:false} once') is exercised
implicitly via the suspension-handler integration tests.
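A minimal sketch of the bail-on-conflict contract described in this commit message, simplified to synchronous code. `fencedWriteOnce` is a hypothetical stand-in for `fencedEventCreate`, and `isFenceConflict` here checks only the message marker, whereas the real check also requires an `EntityConflictError`.

```typescript
type FencedWriteResult<E> = { written: true; event: E } | { written: false };

// Simplified marker check; an assumption for this sketch.
function isFenceConflict(err: unknown): boolean {
  return err instanceof Error && /fence conflict/i.test(err.message);
}

// One attempt only: a fence conflict yields { written: false } with no
// retry, no throw, and no re-enqueue. Any other error is rethrown.
function fencedWriteOnce<E>(write: () => E): FencedWriteResult<E> {
  try {
    return { written: true, event: write() };
  } catch (err) {
    if (isFenceConflict(err)) {
      return { written: false }; // another invocation has the canonical view
    }
    throw err;
  }
}

let attempts = 0;
const lost = fencedWriteOnce<string>(() => {
  attempts++;
  throw new Error("fence conflict: tail moved");
});
const won = fencedWriteOnce(() => "evt_7");
```

In this sketch the losing caller sees `written: false` after a single attempt and can exit its tick without marking the run failed.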
Summary
Adds optimistic-concurrency fencing to the event writes that determine workflow branching, closing the hook/sleep race that produces `CORRUPTED_EVENT_LOG` on production runs.
Every write that emits a branch-decision side effect from a stale snapshot is now fenced against the canonical log:
- `wait_completed`: elapsed-wait scan snapshots the loaded events' tail `eventId` and passes it as `lastKnownEventId` on each write. If a concurrent `resumeHook` has advanced the log, the server's CAS rejects.
- `step_created`, `hook_created`, `hook_disposed`, `wait_created`: suspension-time writes fence against the same snapshot, via a new shared `fencedEventCreate` helper (`packages/core/src/runtime/fenced-write.ts`).
- `run_completed`, `run_failed`: terminal writes fence against the snapshot, with idempotency-via-reload so a concurrent terminal write doesn't re-fail the tick.
On a fence-conflict `EntityConflictError`, the runtime retries in-place rather than throwing the whole tick away: it reloads events from the cursor, refreshes the fence, and tries again (up to 5x with backoff). Falling back to queue redelivery turned out to thunder-herd: every redelivery spawns another concurrent tick, which fences-conflicts again, and workflows stall in `running`. If the work was committed by a concurrent writer between attempts, we observe it in the reloaded log and skip the write entirely (idempotency).
`resumeHook` appends `hook_received` unfenced. ULID ordering already places this write after anything committed before us, and applying CAS would only ever reject the hook in favor of an unrelated concurrent write (losing the user's signal). Stale-snapshot protection lives on the tick writes that consume hooks, not on the write that delivers them.
Fence-conflict detection is anchored on HTTP status, not error wording: in `world-vercel`, HTTP 412 responses are always surfaced as `EntityConflictError` with a `fence conflict:` prefix added client-side based on HTTP status (so the marker is present even when the response body fails to parse). The runtime's `isFenceConflict()` check (`EntityConflictError` + `/fence conflict/i`) therefore cannot silently regress against server wording changes.
`CreateEventParams` on `@workflow/world` grows `lastKnownEventId` and `asOfTimestamp` (both optional). Worlds that don't implement OCC can pass them through or ignore them.
Pairs with backend PR vercel/workflow-server#447 which materializes `run.lastKnownEventId` and gates event writes on it. The server's CAS is explicit opt-in: unfenced writers (most paths) still atomically advance the materialized value so fenced writers can chain off it, but they don't reject on contention.
Test plan
Stress reproduction
The original `CORRUPTED_EVENT_LOG` bug reproduces on stable at ~0.1–0.4% of runs under the following shape: `Promise.race([hook, sleep])` with `sleepBranchWaitCount` parallel sleeps when sleep wins, 10 hook payloads per token at `fireAfterMs=3000`.
Stress runs (`REPRO_COUNT=180 REPRO_LOOPS=80 REPRO_CONCURRENCY=50` × 8 parallel cycles, 40 cycles total against this branch + the paired backend preview) show 0/40 cycles surfacing `CORRUPTED_EVENT_LOG`. Baseline on stable surfaces the failure in ~2/40 cycles under the same load.
The earlier residual pattern (sleep-branch waits with a single un-completed `wait_created`) is now closed by extending the fence to `wait_created` itself.
🤖 Generated with Claude Code