feat(core): bounded queue-redelivery retry on decryption failure during replay #2166
TooTallNate wants to merge 1 commit
…ng replay Follow-up to #2145. An AES-GCM auth failure (RuntimeDecryptionError) is terminal for the current attempt's bytes, but often transient at the run level: when the ciphertext came from a truncated/corrupted read of remotely-persisted data (e.g. a partial /refs response), a fresh queue delivery re-fetches the event log + ref payloads and can succeed. Mirror the replay-timeout bounded-redelivery precedent: on managed worlds (processExitTriggersQueueRedelivery), exit the process to trigger queue redelivery for up to DECRYPTION_FAILURE_MAX_RETRIES attempts, then commit run_failed as RUNTIME_ERROR. In-process worlds fail immediately (no queue to re-fetch from, and exiting would kill the host).
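The decision described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `decideRedrive`, `onDecryptionFailure`, `WorldLike`, and the injected `effects` object are hypothetical names, and `attempt` is assumed to be a 1-based delivery count. Only `processExitTriggersQueueRedelivery`, `DECRYPTION_FAILURE_MAX_RETRIES` (stated as 3 in the PR description), `run_failed`, and `RUNTIME_ERROR` come from the PR text.

```typescript
// Illustrative sketch only: signatures and the attempt-counting convention are
// assumptions, not the contents of packages/core/src/runtime/decryption-failure.ts.

// Retry budget; the PR description gives the value as 3.
const DECRYPTION_FAILURE_MAX_RETRIES = 3;

interface WorldLike {
  // True when exiting the process makes the platform redeliver the queue message.
  processExitTriggersQueueRedelivery?: boolean;
}

// Pure decision: redrive only on managed worlds and only within the budget.
// `attempt` is assumed to be the 1-based delivery count of the queue message.
function decideRedrive(world: WorldLike, attempt: number): boolean {
  if (world.processExitTriggersQueueRedelivery !== true) return false;
  return attempt <= DECRYPTION_FAILURE_MAX_RETRIES;
}

type Outcome = "redeliver" | "run_failed";

// Hypothetical call site: side effects are injected so the sketch stays testable.
function onDecryptionFailure(
  world: WorldLike,
  attempt: number,
  effects: { exit: (code: number) => void; failRun: (errorCode: string) => void },
): Outcome {
  if (decideRedrive(world, attempt)) {
    // Managed world, budget remaining: exit so the platform redelivers the message.
    effects.exit(1);
    return "redeliver";
  }
  // In-process world, or budget exhausted: record a terminal failure.
  effects.failRun("RUNTIME_ERROR");
  return "run_failed";
}
```

Keeping the decision in a pure function and the side effects at a single call site matches the split the PR description outlines, and it lets the budget boundary be unit-tested without actually exiting a process.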
🦋 Changeset detected
Latest commit: fe5d9ca
The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages.
📊 Benchmark Results
Benchmarks reported for 💻 Local Development and ▲ Production (Vercel), each with 🔍 Observability for Next.js (Turbopack):
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
Stream Benchmarks (includes TTFB metrics):
- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)
❌ Some benchmark jobs failed: check the workflow run for details.
🧪 E2E Test Results
❌ Some tests failed
Failed tests:
- ▲ Vercel Production (4 failed): astro (1 failed), fastify (1 failed), nextjs-turbopack (1 failed), nextjs-webpack (1 failed)
- 📦 Local Production (1 failed): express-stable (1 failed)
Details by category:
- ❌ ▲ Vercel Production
- ✅ 💻 Local Development
- ❌ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 📋 Other
❌ Some E2E test jobs failed: check the workflow run for details.
Pull request overview
This PR adds bounded queue-redelivery behavior for replay-time RuntimeDecryptionErrors in @workflow/core, allowing managed worlds to retry transient persisted-data read corruption before marking the run failed.
Changes:
- Adds a decryption-failure retry decision helper and retry budget constant.
- Integrates the helper into the workflow runtime error path before writing run_failed.
- Adds unit and runtime coverage plus a patch changeset.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/runtime/decryption-failure.ts | Adds redelivery decision logic and logging for replay decryption failures. |
| packages/core/src/runtime/decryption-failure.test.ts | Covers managed/in-process world behavior and retry-budget boundaries. |
| packages/core/src/runtime/constants.ts | Defines the decryption-failure retry budget. |
| packages/core/src/runtime.ts | Redrives managed-world runs on eligible RuntimeDecryptionErrors before failing the run. |
| packages/core/src/runtime.test.ts | Adds end-to-end key-mismatch tests for redelivery and failure behavior. |
| .changeset/requeue-on-decryption-failure.md | Adds a patch changeset for @workflow/core. |
Summary
Follow-up to #2145 (the RuntimeDecryptionError attribution fix), implementing the bounded-redelivery behavior @pranaygp described in this comment.
An AES-GCM authentication failure is terminal for the bytes/key of the current attempt: we must never continue executing the workflow on top of data we couldn't decrypt. But the failure is often not terminal for the run: when the ciphertext came from a transiently truncated or corrupted read of remotely-persisted data (a partial /refs response, an edge-cache miss returning a partial 200, a proxy drop during streaming), a fresh queue delivery re-fetches the event log and ref payloads from scratch and can succeed.
Previously we committed run_failed immediately, turning a potentially recoverable read failure into a terminal workflow failure.
What changed
Mirrors the existing replay-timeout bounded-redelivery precedent (handleReplayBudgetExhausted):
- Managed worlds (processExitTriggersQueueRedelivery === true, e.g. world-vercel): on attempts ≤ DECRYPTION_FAILURE_MAX_RETRIES (3), the run handler exits the process, which the platform turns into a queue redelivery, so replay restarts from freshly-fetched persisted data. Once the retry budget is exhausted, it commits run_failed with RUNTIME_ERROR.
- In-process worlds (world-local, dev servers, custom in-process worlds): there is no queue to re-fetch from, and process.exit() would kill the user's host, so the run fails immediately with RUNTIME_ERROR.
The new shouldRedriveOnDecryptionFailure() helper (runtime/decryption-failure.ts) is side-effect free apart from logging and returns whether the caller should redrive, keeping the process.exit / run_failed decision at the single existing call site in the run handler.
Test coverage
- runtime/decryption-failure.test.ts: unit tests for the helper across managed/in-process worlds and the retry-budget boundary.
- runtime.test.ts: end-to-end tests driving a real key mismatch (input encrypted with key A, run-key resolves to key B → auth-tag failure during input hydration):
  - managed world, within the retry budget: process.exit(1), no run_failed
  - managed world, retry budget exhausted: run_failed with RUNTIME_ERROR
  - in-process world: run_failed immediately, no exit
All @workflow/core tests pass (1073), full repo typecheck passes (40/40).
Notes
The longer-term improvement @pranaygp mentioned (detecting response truncation / integrity failure at the /refs transport boundary to classify the retryable case more directly) is out of scope here and would be a separate change on the world / workflow-server side.
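To make the failure mode concrete, here is a standalone sketch using Node's built-in `node:crypto` module. It shows that AES-GCM rejects a wrong key and a truncated ciphertext in the same way, at the authentication step, which is why the two cases cannot be told apart from the decryption error alone. The helper names `encryptGcm` and `decryptGcm` and the `Sealed` shape are hypothetical and unrelated to the encryption code in @workflow/core.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical container for the pieces AES-GCM needs on the decrypt side.
interface Sealed {
  iv: Buffer;
  ciphertext: Buffer;
  tag: Buffer;
}

// Encrypt with AES-256-GCM using a fresh 12-byte IV.
function encryptGcm(key: Buffer, plaintext: string): Sealed {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt and verify the auth tag; final() throws if authentication fails.
function decryptGcm(key: Buffer, sealed: Sealed): string {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAuthTag(sealed.tag);
  const plaintext = Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]);
  return plaintext.toString("utf8");
}

// Returns true when decryption fails authentication.
function failsToDecrypt(key: Buffer, sealed: Sealed): boolean {
  try {
    decryptGcm(key, sealed);
    return false;
  } catch {
    return true;
  }
}

const keyA = randomBytes(32);
const keyB = randomBytes(32);
const sealedInput = encryptGcm(keyA, "workflow input");

// Key mismatch, as in the end-to-end tests: encrypted with key A, opened with key B.
const wrongKeyFails = failsToDecrypt(keyB, sealedInput);

// Truncated read: correct key, but only part of the ciphertext arrived.
const truncatedFails = failsToDecrypt(keyA, {
  ...sealedInput,
  ciphertext: sealedInput.ciphertext.subarray(0, 4),
});

console.log({ wrongKeyFails, truncatedFails });
```

Because both cases surface as the same authentication error, retrying with freshly fetched bytes is the only way to distinguish a transient truncated read from a persistent key problem without an integrity check at the transport boundary.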