fix(core,errors): classify SDK encryption failures as RUNTIME_ERROR #2145
SDK-level AES-GCM encrypt/decrypt failures are never the user's fault, but the run-failure classifier was tagging them as USER_ERROR because the native Web Crypto OperationError (most commonly raised by AESCipherJob.onDone on GCM auth-tag mismatch) does not match any RUNTIME_ERROR_CHECKS entry. Introduce a new RuntimeDecryptionError (subclass of WorkflowRuntimeError) that the encryption module throws when subtle.encrypt/subtle.decrypt fails, with the original DOMException as cause plus diagnostic context (operation, byteLength, printable/hex format prefix of the input header). classifyRunError now picks it up via RUNTIME_ERROR_CHECKS, so these failures surface as RUNTIME_ERROR with a proper named class for dashboards and triage.
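A minimal sketch of the classification path described above, assuming simplified stand-ins for `WorkflowRuntimeError`, `RuntimeDecryptionError`, and `classifyRunError`. The real classes likely carry more (slug, docs link), and the `DecryptionContext` and `RunErrorCode` type names are hypothetical:

```typescript
// Simplified stand-ins; not the actual @workflow/errors implementation.
class WorkflowRuntimeError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "WorkflowRuntimeError";
  }
}

// Hypothetical name for the diagnostic context shape.
interface DecryptionContext {
  operation: "encrypt" | "decrypt";
  byteLength?: number;
  formatPrefix?: string;
}

class RuntimeDecryptionError extends WorkflowRuntimeError {
  declare readonly cause?: unknown;
  readonly context?: DecryptionContext;

  constructor(
    message: string,
    options: { cause?: unknown; context?: DecryptionContext } = {},
  ) {
    super(message);
    this.name = "RuntimeDecryptionError";
    this.cause = options.cause; // the original DOMException
    this.context = options.context;
  }

  // Name-based duck check: matches on `name`, not on `instanceof`.
  static is(value: unknown): value is RuntimeDecryptionError {
    return (
      typeof value === "object" &&
      value !== null &&
      (value as { name?: unknown }).name === "RuntimeDecryptionError"
    );
  }
}

type RunErrorCode = "USER_ERROR" | "RUNTIME_ERROR";

// Only the new check is listed here; the real list has other entries.
const RUNTIME_ERROR_CHECKS: Array<(error: unknown) => boolean> = [
  RuntimeDecryptionError.is,
];

function classifyRunError(error: unknown): RunErrorCode {
  return RUNTIME_ERROR_CHECKS.some((check) => check(error))
    ? "RUNTIME_ERROR"
    : "USER_ERROR";
}
```

Under these assumptions a bare `OperationError` still falls through to `USER_ERROR`, while the same failure wrapped in `RuntimeDecryptionError` classifies as `RUNTIME_ERROR`.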
🦋 Changeset detected
Latest commit: 17b9b57
The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
🧪 E2E Test Results
✅ All tests passed
Details by category:
- ✅ ▲ Vercel Production
- ✅ 💻 Local Development
- ✅ 📦 Local Production
- ✅ 🐘 Local Postgres
- ✅ 🪟 Windows
- ✅ 📋 Other
📊 Benchmark Results
Per-benchmark results were reported for 💻 Local Development and ▲ Production (Vercel), with 🔍 Observability links for Express and Next.js (Turbopack). Benchmarks covered:
- workflow with no steps
- workflow with 1 step
- workflow with 10, 25, and 50 sequential steps
- Promise.all with 10, 25, and 50 concurrent steps
- Promise.race with 10, 25, and 50 concurrent steps
- workflow with 10, 25, and 50 sequential data payload steps (10KB)
- workflow with 10, 25, and 50 concurrent data payload steps (10KB)
Stream Benchmarks (includes TTFB metrics):
- workflow with stream
- stream pipeline with 5 transform steps (1MB)
- 10 parallel streams (1MB each)
- fan-out fan-in 10 streams (1MB each)
Summary: Fastest Framework by World and Fastest World by Framework, with the winner determined by most benchmark wins.
❌ Some benchmark jobs failed. Check the workflow run for details.
Pull request overview
Workflow run failures originating in the SDK's AES-GCM encryption layer (most notably Node's native OperationError from AESCipherJob.onDone on GCM auth-tag mismatch) were falling through to USER_ERROR because classifyRunError's name-based duck checks didn't match a raw DOMException. This PR introduces a RuntimeDecryptionError (subclass of WorkflowRuntimeError) that the encryption module always wraps Web Crypto failures in, plus diagnostic context (operation, byte length, header prefix), so failures classify as RUNTIME_ERROR and carry enough telemetry to triangulate root cause on the next occurrence. No root-cause fix is attempted.
Changes:
- New `RuntimeDecryptionError` class + `runtime-decryption-failed` slug in `@workflow/errors` with optional structured `context`.
- Wrap encrypt/decrypt Web Crypto calls in `packages/core/src/encryption.ts` and rewrap the two "encrypted-but-no-key" throws in serialization paths.
- Add `RuntimeDecryptionError.is` to `RUNTIME_ERROR_CHECKS` and cover the new behavior with errors + core tests.
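The wrapping of Web Crypto calls listed above could look roughly like the sketch below. It assumes a simplified `RuntimeDecryptionError` and a hypothetical helper named `withCryptoContext`. The crypto call itself is injected as a callback so the sketch stays independent of any particular runtime's `crypto` global:

```typescript
// Simplified stand-in for the class this PR adds to @workflow/errors.
class RuntimeDecryptionError extends Error {
  declare readonly cause?: unknown;
  readonly context: { operation: "encrypt" | "decrypt"; byteLength: number };

  constructor(
    message: string,
    cause: unknown,
    context: { operation: "encrypt" | "decrypt"; byteLength: number },
  ) {
    super(message);
    this.name = "RuntimeDecryptionError";
    this.cause = cause;
    this.context = context;
  }
}

// Hypothetical helper: runs a crypto operation and rewraps any failure
// with the operation name and the input's byte length.
async function withCryptoContext<T>(
  operation: "encrypt" | "decrypt",
  input: Uint8Array,
  run: () => Promise<T>,
): Promise<T> {
  try {
    return await run();
  } catch (cause) {
    throw new RuntimeDecryptionError(`AES-GCM ${operation} failed`, cause, {
      operation,
      byteLength: input.byteLength,
    });
  }
}
```

In an encryption module, `run` would be something like `() => crypto.subtle.decrypt({ name: "AES-GCM", iv }, key, ciphertext)`, so a native `OperationError` never escapes unwrapped.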
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/errors/src/index.ts | Adds RUNTIME_DECRYPTION_FAILED slug and RuntimeDecryptionError class with name-based .is(). |
| packages/errors/src/runtime-decryption-error.test.ts | New tests covering name, docs link, cause, context shape, and .is() duck check. |
| packages/core/src/encryption.ts | Wraps subtle.encrypt/decrypt failures and length precheck in RuntimeDecryptionError; adds printable/hex diagnostic prefix helper. |
| packages/core/src/encryption.test.ts | New 8-test module: round-trip, length-check, tamper, wrong key, encrypt-only-usage, prefix capture. |
| packages/core/src/serialization/encryption.ts | Switches "encrypted-but-no-key" throw to RuntimeDecryptionError with context. |
| packages/core/src/serialization.ts | Same switch on the deserialize-stream path. |
| packages/core/src/classify-error.ts | Adds RuntimeDecryptionError.is to RUNTIME_ERROR_CHECKS. |
| packages/core/src/classify-error.test.ts | Adds tests for the new mapping and a bare-OperationError sanity check. |
| .changeset/runtime-decryption-error.md | Patch changeset for @workflow/errors and @workflow/core. |
pranaygp left a comment
Left three inline findings from local verification.
Follow-up thought after tracing the runtime path: I think this PR is appropriately scoped to improving attribution ( That said, we should follow up by applying the same bounded-redelivery precedent used for replay timeouts to Concretely, for managed worlds we should let the queue redrive a small bounded number of times (re-fetching the events/ref payload each delivery), then commit terminal |
…x, propagate through serialization wrappers
Addresses review feedback on #2145:
- Add a RuntimeDecryptionError reducer/reviver (+ SerializableSpecial entry + globalThis registration) so its `context` (operation, byteLength, formatPrefix) survives the dehydrate/hydrate run-error round trip instead of being dropped by the generic Error reducer.
- Stop capturing `formatPrefix` in the low-level encryption layer, which only sees the stripped AES payload (nonce bytes), not the outer `encr` marker. The serialization layer now attaches the real envelope prefix.
- Rethrow RuntimeDecryptionError unchanged from the serialize/dehydrate catch blocks instead of reframing it as a SerializationError, so an encryption failure during dehydration stays a RUNTIME_ERROR rather than being misclassified as USER_ERROR.
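To illustrate why a dedicated reducer/reviver matters, here is a rough sketch using plain JSON as the transport. This is not the SDK's serializer: the `Reduced` shape and all function names are hypothetical, and the class is a simplified stand-in.

```typescript
interface DecryptionContext {
  operation: "encrypt" | "decrypt";
  byteLength?: number;
  formatPrefix?: string;
}

// Simplified stand-in for the real error class.
class RuntimeDecryptionError extends Error {
  readonly context?: DecryptionContext;

  constructor(message: string, context?: DecryptionContext) {
    super(message);
    this.name = "RuntimeDecryptionError";
    this.context = context;
  }
}

// Hypothetical wire shape for a reduced error.
type Reduced = { name: string; message: string; context?: DecryptionContext };

// A generic Error reducer keeps only name and message, so `context` is lost.
function reduceGenericError(error: Error): Reduced {
  return { name: error.name, message: error.message };
}

// A dedicated reducer carries `context` along explicitly.
function reduceDecryptionError(error: RuntimeDecryptionError): Reduced {
  return { name: error.name, message: error.message, context: error.context };
}

// The matching reviver rebuilds the named class with its context.
function reviveDecryptionError(data: Reduced): RuntimeDecryptionError {
  return new RuntimeDecryptionError(data.message, data.context);
}

// Simulates a dehydrate/hydrate round trip through JSON.
function roundTrip(reduced: Reduced): Reduced {
  return JSON.parse(JSON.stringify(reduced)) as Reduced;
}
```

With the generic reducer, the hydrated value has no `context`. With the dedicated pair, `operation`, `byteLength`, and `formatPrefix` come back intact.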
Agreed on both points — keeping this PR scoped to attribution, and treating bounded redelivery as a focused follow-up. The reasoning is sound: an AES-GCM auth failure is terminal for the current bytes/key, but if those bytes came from a transiently truncated/corrupted I'll open a follow-up to apply the bounded-redelivery precedent (the same one used for replay timeouts) to The review feedback on this PR has been addressed in the latest commits:
VaguelySerious left a comment
AI review: no blocking issues
- Mirror the catch/enrich/rethrow block from serialization/encryption.ts around the stream-path aesGcmDecrypt() call so auth-tag failures on encrypted stream frames also carry context.formatPrefix = 'encr' (addresses review feedback). Add a tampered-frame test.
- Fix all auto-fixable Biome lint findings in the touched files (template literals, useless try/catch wrappers, optional chaining, non-null assertions).
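A catch/enrich/rethrow block of the kind described above might be sketched as follows. The frame layout (a 4-byte ASCII marker followed by the AES payload), the `decryptFrame` name, and the synchronous callback are assumptions made to keep the example small. The real `aesGcmDecrypt()` is asynchronous and its signature is not shown here.

```typescript
interface DecryptionContext {
  operation: "encrypt" | "decrypt";
  byteLength?: number;
  formatPrefix?: string;
}

// Simplified stand-in; the original `cause` is omitted for brevity.
class RuntimeDecryptionError extends Error {
  readonly context: DecryptionContext;

  constructor(message: string, context: DecryptionContext) {
    super(message);
    this.name = "RuntimeDecryptionError";
    this.context = context;
  }
}

// Assumed layout for this sketch: 4-byte ASCII marker, then the AES payload.
const MARKER_LENGTH = 4;

function decryptFrame(
  frame: Uint8Array,
  decryptPayload: (payload: Uint8Array) => Uint8Array,
): Uint8Array {
  try {
    // The low-level layer only ever sees the stripped payload.
    return decryptPayload(frame.subarray(MARKER_LENGTH));
  } catch (err) {
    if (err instanceof RuntimeDecryptionError) {
      // Enrich: only this layer can see the outer envelope marker.
      let formatPrefix = "";
      frame.subarray(0, MARKER_LENGTH).forEach((byte) => {
        formatPrefix += String.fromCharCode(byte);
      });
      throw new RuntimeDecryptionError(err.message, {
        ...err.context,
        formatPrefix,
      });
    }
    // Anything else is rethrown untouched.
    throw err;
  }
}
```

The low-level context (`operation`, `byteLength`) is kept, and the envelope prefix is added at the one layer that can observe it.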
Backport PR opened against |
Summary
Workflow runs that fail inside the SDK's AES-GCM encryption layer were being misclassified as `USER_ERROR`. SDK-level decryption is never user code (the user never directly invokes `subtle.decrypt`), so failures here should be `RUNTIME_ERROR`.
This PR adds a new error class and wires it into the run-failure classifier, without addressing the root cause of the decryption failure itself (that investigation is ongoing). The change is intentionally narrow: better classification + diagnostic context, so the next mystery report is properly categorized and immediately actionable.
What I observed
Production reports surface as:
`OperationError` from `AESCipherJob.onDone` is what Node's Web Crypto API throws when an AES-GCM auth-tag verification fails. The bare native DOMException doesn't match any of `classifyRunError`'s `RUNTIME_ERROR_CHECKS` (which are name-based duck checks), so it falls through to `USER_ERROR`.
Changes
`@workflow/errors` (`packages/errors/src/index.ts`):
- `RUNTIME_DECRYPTION_FAILED` slug.
- `RuntimeDecryptionError` (extends `WorkflowRuntimeError`) with optional structured `context` (operation, byteLength, formatPrefix).
`@workflow/core`:
- `packages/core/src/encryption.ts`: wrap both `encrypt()` and `decrypt()` Web Crypto calls; rewrap any failure as `RuntimeDecryptionError` with diagnostic context (printable or hex prefix of the input header, byte length, operation). The existing length-precheck now also throws `RuntimeDecryptionError`.
- `packages/core/src/serialization/encryption.ts` & `packages/core/src/serialization.ts`: the two "encrypted-but-no-key" throw paths now use `RuntimeDecryptionError`.
- `packages/core/src/classify-error.ts`: `RuntimeDecryptionError.is` added to `RUNTIME_ERROR_CHECKS` so `classifyRunError` routes these failures to `RUNTIME_ERROR`.
Test coverage
- `packages/errors/src/runtime-decryption-error.test.ts` (new, 6 tests): name, inheritance, docs URL, cause preservation, context shape, name-based `is()` duck check.
- `packages/core/src/encryption.test.ts` (new, 8 tests): happy-path round-trip, length-check failure, GCM auth-tag tamper → `RuntimeDecryptionError` (cause = OperationError), wrong-key decryption → same, encrypt-only key used for encrypt → `RuntimeDecryptionError`, printable + hex format-prefix capture.
- `packages/core/src/classify-error.test.ts` (extended): `RuntimeDecryptionError → RUNTIME_ERROR`, plus a documentation test that a bare native `OperationError` still classifies as `USER_ERROR` (proves the encryption module's wrap is what does the work).
All existing tests still pass:
- `@workflow/errors`: 36/36 ✅
- `@workflow/core`: 1024/1024 ✅
- `pnpm typecheck` (full repo): 40/40 packages ✅
What this does NOT fix
The actual decryption failure. Root cause is still under investigation; see the prior analysis. Strongest current hypothesis remains transport-level corruption/truncation of ciphertext between storage and read (in particular the workflow-server `/refs` endpoint, where a guard against truncated bodies was prototyped on a feature branch but never landed on `main`).
The diagnostic context added here is specifically what we need to triangulate the source on the next occurrence: byte length distinguishes "truncated" from "tampered", and format prefix distinguishes "valid `encr` envelope with bad ciphertext" from "garbage bytes that happened to land in a decrypt path".
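As a rough illustration of how those two fields could drive triage, the sketch below pairs a printable-or-hex prefix helper with a small decision function. Both function names, the 4-byte prefix length, and the 28-byte minimum (12-byte nonce plus 16-byte GCM tag) are assumptions for the example, not values confirmed by the SDK.

```typescript
// Hypothetical helper: printable ASCII is returned as text, anything else as hex.
function describePrefix(bytes: Uint8Array, length = 4): string {
  const head = Array.from(bytes.subarray(0, length));
  const printable =
    head.length > 0 && head.every((byte) => byte >= 0x20 && byte <= 0x7e);
  return printable
    ? String.fromCharCode(...head)
    : head.map((byte) => byte.toString(16).padStart(2, "0")).join("");
}

// Assumption for this sketch: a 12-byte nonce plus a 16-byte GCM tag, so an
// AES payload shorter than 28 bytes cannot be complete.
const MIN_PAYLOAD_BYTES = 12 + 16;

// Hypothetical triage over the diagnostic context of a failed decrypt.
function triageHint(context: { byteLength: number; formatPrefix: string }): string {
  if (context.formatPrefix !== "encr") return "not an encr envelope";
  if (context.byteLength < MIN_PAYLOAD_BYTES) return "likely truncated";
  return "encr envelope with bad ciphertext or key";
}
```

A payload that starts with the bytes of `encr` yields the readable marker, while arbitrary binary yields a hex string that can be searched for in logs.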