Conversation
Add a third WorkloadManager implementation that creates sandboxes via the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which takes priority over Kubernetes and Docker providers.
The fetch() call had no timeout, causing infinite hangs when the gateway accepted requests but never returned responses. Adds AbortSignal.timeout (30s) and consolidates all logging into a single structured event per create() call with timing, status, and error context.
Emit a single canonical log line in a finally block instead of scattered log calls at each early return. Adds business context (envId, envType, orgId, projectId, deploymentVersion, machine) and instanceName to the event. Always emits at info level with ok=true/false for queryability.
Pass business context (runId, envId, orgId, projectId, machine, etc.) as metadata on CreateSandboxRequest instead of relying on env vars. This enables wide event logging in the compute stack without parsing env or leaking secrets.
Passes machine preset cpu and memory as top-level fields on the CreateSandboxRequest so the compute stack can use them for admission control and resource allocation.
Thread timing context from queue consumer through to the compute workload manager's wide event: - dequeueResponseMs: platform dequeue HTTP round-trip - pollingIntervalMs: which polling interval was active (idle vs active) - warmStartCheckMs: warm start check duration All fields are optional to avoid breaking existing consumers.
- Fix instance creation URL from /api/sandboxes to /api/instances - Pass name: runnerId when creating compute instances - Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager - Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer - Handle suspend requests in compute mode via fire-and-forget snapshot with callback - Handle restore in compute mode by calling gateway restore API directly - Wire computeManager into WorkloadServer for compute mode suspend/restore
…re request
Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.
runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.
Gates snapshot/restore behaviour independently of compute mode. When disabled, VMs won't receive the metadata URL and suspend/restore are no-ops. Defaults to off so compute mode can be used without snapshots.
🦋 Changeset detectedLatest commit: 80b62d4 The changes in this PR will be included in the next version bump. This PR includes changesets to release 28 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
WalkthroughAdds a ComputeWorkloadManager for remote instance lifecycle (create, snapshot, restore, delete) and integrates compute mode into the supervisor and workload server, including a compute snapshot callback endpoint and delayed snapshot scheduling. Extends environment schema with compute-related variables and conditional validation when snapshots are enabled. Introduces a TimerWheel implementation and comprehensive tests for delayed dispatch behavior. Propagates timing context fields (dequeueResponseMs, pollingIntervalMs, warmStartCheckMs) through consumer/queue/session/events/types into workload creation and events. Adjusts CLI build image output options for local builds and adds a changeset marking a patch. Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (6)
packages/cli-v3/src/deploy/buildImage.ts (1)
1128-1153: Consider tightening theloadcheck toload === truefor contract clarity.
loadis typed asboolean | undefined, so the truthy checkisLocalBuild && load && !pushsilently falls through totype=imagewhenloadisundefined. In current call sites this is fine (callers always pass a resolvedboolean), but the implicit contract is easy to violate if a future caller passesisLocalBuild: truewithout resolvingloadfirst, unexpectedly gettingtype=imageoutput.♻️ Proposed clarification
- if (isLocalBuild && load && !push) { + if (isLocalBuild && load === true && !push) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/cli-v3/src/deploy/buildImage.ts` around lines 1128 - 1153, The conditional that chooses "type=docker" relies on a truthy check (isLocalBuild && load && !push) which allows load to be undefined; change the guard to explicitly test load === true so the branch only triggers when load is explicitly true. Locate the build function parameters (isLocalBuild, load, push, imageTag) and update the if condition (and any related comment if helpful) so it reads isLocalBuild && load === true && !push, keeping the imageTag handling and returned outputOptions logic unchanged.apps/supervisor/src/env.ts (1)
81-85: Consider validatingCOMPUTE_GATEWAY_AUTH_TOKENwhenCOMPUTE_GATEWAY_URLis set.If the compute gateway always requires authentication, add a superRefine check similar to the snapshots/metadata validation. If unauthenticated gateways are a valid use case (e.g., mTLS or network-level auth), this is fine as-is.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/env.ts` around lines 81 - 85, Add a zod superRefine to the environment schema to enforce that when COMPUTE_GATEWAY_URL is defined, COMPUTE_GATEWAY_AUTH_TOKEN must also be provided: update the env schema around the COMPUTE_GATEWAY_URL and COMPUTE_GATEWAY_AUTH_TOKEN entries to call superRefine (similar to the snapshots/metadata validation) and add an error on the token field when URL is set but token is missing; keep behavior unchanged if URL is undefined to support unauthenticated gateways.packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (1)
18-18: Consider extracting the timing type to avoid inline duplication.The
{ dequeueResponseMs: number; pollingIntervalMs: number }shape is repeated at Line 18 and Line 26 (and similarly inconsumerPool.tsandsession.ts). A shared named type would reduce repetition across the dequeue chain.Example: shared type in types.ts
// In types.ts export type DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs: number; };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts` at line 18, Extract the inline timing shape into a shared exported type (e.g., export type DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs: number; }) in a central types file, then replace the inline occurrences with that type in the function signature for onDequeue (currently onDequeue: (messages: WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number; pollingIntervalMs: number }) => Promise<void>), and update the other files referencing the same shape (consumerPool.ts and session.ts) to import and use DequeueTiming to avoid duplication across the dequeue chain.apps/supervisor/src/index.ts (1)
229-231: Runner ID derivation looks reasonable but is fragile on format assumptions.
message.run.friendlyId.replace("run_", "")uses a plain string replace (only first occurrence), andcheckpoint.id.slice(-8)assumes IDs are always ≥8 characters. Both assumptions are likely safe given the platform's ID format, but consider adding a brief comment documenting the expected format, or using a more defensive approach.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/index.ts` around lines 229 - 231, The runner ID construction is brittle: replace("run_", "") and slice(-8) assume formats; update code around runIdShort, checkpointSuffix, and runnerId to be defensive by (1) deriving runIdShort using a check like message.run.friendlyId.startsWith("run_") ? message.run.friendlyId.substring(4) : message.run.friendlyId (or use a regex to extract the suffix) and (2) computing checkpointSuffix by using checkpoint.id.length >= 8 ? checkpoint.id.slice(-8) : checkpoint.id (or another safe fallback) so you never slice past bounds; also add a short comment above the logic documenting the expected friendlyId and checkpoint.id formats.apps/supervisor/src/workloadManager/compute.ts (1)
72-78:create()duplicates theauthHeadersgetter logic — usethis.authHeadersinstead.The inline header construction at Lines 72–78 is identical to the
private get authHeadersgetter defined at Lines 168–176. All other methods (snapshot,deleteInstance,restore) correctly usethis.authHeaders.♻️ Proposed fix
- const headers: Record<string, string> = { - "Content-Type": "application/json", - }; - - if (this.opts.gatewayAuthToken) { - headers["Authorization"] = `Bearer ${this.opts.gatewayAuthToken}`; - } - // Strip image digest — resolve by tag, not digest const imageRef = opts.image.split("@")[0]!; const url = `${this.opts.gatewayUrl}/api/instances`; ... const [fetchError, response] = await tryCatch( fetch(url, { method: "POST", - headers, + headers: this.authHeaders,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/workloadManager/compute.ts` around lines 72 - 78, The create() method duplicates header construction that already exists in the private getter authHeaders; replace the inline headers block in create() with a single reference to this.authHeaders so it reuses the centralized logic (similar to snapshot, deleteInstance, restore). Locate create() in compute.ts, remove the manual headers Record<string,string> and the conditional Authorization assignment, and use this.authHeaders wherever headers are needed to avoid duplication.apps/supervisor/src/workloadServer/index.ts (1)
454-499: 200 response is sent only after all network calls complete — gateway retries on slow ops could cause duplicate submissions.
submitSuspendCompletionanddeleteInstanceare both awaited beforereply.empty(200)at Line 499. If either call is slow, the gateway's outbound callback will stall waiting for the acknowledgement; if it has its own timeout/retry policy it will re-deliver the callback, potentially triggering duplicatesubmitSuspendCompletioncalls and a seconddeleteInstanceattempt on the same (possibly already-deleted) instance.Consider responding
200immediately after validating the body and processing the callback asynchronously, relying on idempotency guarantees insubmitSuspendCompletion, or ensuringsubmitSuspendCompletionis idempotent for the same(runId, snapshotId)pair.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/workloadServer/index.ts` around lines 454 - 499, The handler currently awaits submitSuspendCompletion and deleteInstance before calling reply.empty(200), which can block the gateway and cause duplicate callbacks; change flow to validate the incoming body, call reply.empty(200) immediately, then process the suspend completion asynchronously (e.g. spawn a background task/Promise) that calls this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and this.computeManager.deleteInstance(body.instance_id) without blocking the response; ensure the background task wraps calls in try/catch, logs errors, makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair, and makes deleteInstance tolerant of already-deleted instances (or check existence before deleting) so retries are safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 226-251: The handler currently logs failures from
computeManager.restore (didRestore === false or caught exceptions in the
try/catch around computeManager.restore) and returns without remediation; update
it to notify the platform before returning by calling the worker session HTTP
API (use this.workerSession.httpClient) to either mark the run failed or request
rescheduling, sending run id (message.run.id/friendlyId), runnerId,
checkpoint.id/location and the error details (or a failure reason when
didRestore is false); invoke this notification in both the didRestore === false
branch and the catch block and swallow/handle any HTTP errors locally so the
handler never silently abandons the run.
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 433-501: The endpoint registered at
"/api/v1/compute/snapshot-complete" (handler for ComputeSnapshotCallbackBody)
lacks authentication allowing arbitrary suspend submissions and instance
deletions; fix by validating an immutable callback secret/signature before
processing: when generating callback URLs embed a per-run or global callback
token, then in this handler verify a signed header or Bearer token (e.g.,
X-Callback-Signature or Authorization) and reject requests with reply.empty(401)
if verification fails; additionally, once authenticated, cross-check the
provided body.instance_id against the expected instance for the runId (look up
the canonical instance for runId before calling
this.computeManager?.deleteInstance) and only call computeManager.deleteInstance
after both signature and instance match succeed; keep logging but avoid acting
on unauthenticated or mismatched requests and return appropriate HTTP codes.
- Around line 270-272: The callback URL construction in workloadServer/index.ts
uses a fallback to "localhost" which breaks external gateways; update validation
and usage so TRIGGER_WORKLOAD_API_DOMAIN is required when
COMPUTE_SNAPSHOTS_ENABLED is true (similar to TRIGGER_METADATA_URL), and remove
the silent default to "localhost" in the callbackUrl assembly. Specifically,
add/adjust env.ts validation to mark TRIGGER_WORKLOAD_API_DOMAIN as mandatory
when COMPUTE_SNAPSHOTS_ENABLED is enabled (or throw early), and change the code
that builds callbackUrl (the const callbackUrl) to fail loudly if
TRIGGER_WORKLOAD_API_DOMAIN is missing rather than using "localhost".
- Around line 266-288: Return 202 is sent before awaiting
computeManager.snapshot, and if snapshotResult is false there is no recovery
path; after the existing snapshotResult check in the block that calls
computeManager.snapshot({ runnerId, callbackUrl, metadata: { runId:
params.runFriendlyId, snapshotFriendlyId: params.snapshotFriendlyId }}), call
workerClient.submitSuspendCompletion (or equivalent API) to notify the platform
of the suspend failure including runnerId,
params.runFriendlyId/snapshotFriendlyId and error context, log the failure via
this.logger.error, and ensure the submitSuspendCompletion call handles/awaits
errors so the run can be re-queued or failed cleanly when snapshotResult is
false.
In `@packages/core/src/v3/runEngineWorker/supervisor/session.ts`:
- Around line 83-93: This change touched the public package `@trigger.dev/core`
(method onDequeue in supervisor/session.ts), so add a changeset entry for
`@trigger.dev/core`: run pnpm run changeset:add, select the `@trigger.dev/core`
package, describe the change and bump the appropriate version, then commit the
generated changeset file alongside your PR so the release tooling will include
this package.
---
Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 81-85: Add a zod superRefine to the environment schema to enforce
that when COMPUTE_GATEWAY_URL is defined, COMPUTE_GATEWAY_AUTH_TOKEN must also
be provided: update the env schema around the COMPUTE_GATEWAY_URL and
COMPUTE_GATEWAY_AUTH_TOKEN entries to call superRefine (similar to the
snapshots/metadata validation) and add an error on the token field when URL is
set but token is missing; keep behavior unchanged if URL is undefined to support
unauthenticated gateways.
In `@apps/supervisor/src/index.ts`:
- Around line 229-231: The runner ID construction is brittle: replace("run_",
"") and slice(-8) assume formats; update code around runIdShort,
checkpointSuffix, and runnerId to be defensive by (1) deriving runIdShort using
a check like message.run.friendlyId.startsWith("run_") ?
message.run.friendlyId.substring(4) : message.run.friendlyId (or use a regex to
extract the suffix) and (2) computing checkpointSuffix by using
checkpoint.id.length >= 8 ? checkpoint.id.slice(-8) : checkpoint.id (or another
safe fallback) so you never slice past bounds; also add a short comment above
the logic documenting the expected friendlyId and checkpoint.id formats.
In `@apps/supervisor/src/workloadManager/compute.ts`:
- Around line 72-78: The create() method duplicates header construction that
already exists in the private getter authHeaders; replace the inline headers
block in create() with a single reference to this.authHeaders so it reuses the
centralized logic (similar to snapshot, deleteInstance, restore). Locate
create() in compute.ts, remove the manual headers Record<string,string> and the
conditional Authorization assignment, and use this.authHeaders wherever headers
are needed to avoid duplication.
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 454-499: The handler currently awaits submitSuspendCompletion and
deleteInstance before calling reply.empty(200), which can block the gateway and
cause duplicate callbacks; change flow to validate the incoming body, call
reply.empty(200) immediately, then process the suspend completion asynchronously
(e.g. spawn a background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.
In `@packages/cli-v3/src/deploy/buildImage.ts`:
- Around line 1128-1153: The conditional that chooses "type=docker" relies on a
truthy check (isLocalBuild && load && !push) which allows load to be undefined;
change the guard to explicitly test load === true so the branch only triggers
when load is explicitly true. Locate the build function parameters
(isLocalBuild, load, push, imageTag) and update the if condition (and any
related comment if helpful) so it reads isLocalBuild && load === true && !push,
keeping the imageTag handling and returned outputOptions logic unchanged.
In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts`:
- Line 18: Extract the inline timing shape into a shared exported type (e.g.,
export type DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs:
number; }) in a central types file, then replace the inline occurrences with
that type in the function signature for onDequeue (currently onDequeue:
(messages: WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.changeset/fix-local-build-load.mdapps/supervisor/src/env.tsapps/supervisor/src/index.tsapps/supervisor/src/workloadManager/compute.tsapps/supervisor/src/workloadManager/types.tsapps/supervisor/src/workloadServer/index.tspackages/cli-v3/src/deploy/buildImage.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/core/src/v3/runEngineWorker/supervisor/session.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
- GitHub Check: sdk-compat / Cloudflare Workers
- GitHub Check: typecheck / typecheck
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
**/*.{ts,tsx}: Always import tasks from@trigger.dev/sdk, never use@trigger.dev/sdk/v3or deprecatedclient.defineJobpattern
Every Trigger.dev task must be exported and have a uniqueidproperty with no timeouts in the run function
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tsapps/supervisor/src/workloadManager/types.tsapps/supervisor/src/workloadServer/index.tsapps/supervisor/src/index.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/cli-v3/src/deploy/buildImage.tsapps/supervisor/src/workloadManager/compute.tsapps/supervisor/src/env.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use zod for validation in packages/core and apps/webapp
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Import from
@trigger.dev/coreusing subpaths only, never import from root
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tsapps/supervisor/src/workloadManager/types.tsapps/supervisor/src/workloadServer/index.tsapps/supervisor/src/index.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/cli-v3/src/deploy/buildImage.tsapps/supervisor/src/workloadManager/compute.tsapps/supervisor/src/env.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tsapps/supervisor/src/workloadManager/types.tsapps/supervisor/src/workloadServer/index.tsapps/supervisor/src/index.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/cli-v3/src/deploy/buildImage.tsapps/supervisor/src/workloadManager/compute.tsapps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tsapps/supervisor/src/workloadManager/types.tsapps/supervisor/src/workloadServer/index.tsapps/supervisor/src/index.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/cli-v3/src/deploy/buildImage.tsapps/supervisor/src/workloadManager/compute.tsapps/supervisor/src/env.ts
{packages,integrations}/**/*
📄 CodeRabbit inference engine (CLAUDE.md)
Add a changeset when modifying any public package in
packages/*orintegrations/*usingpnpm run changeset:add
Files:
packages/core/src/v3/runEngineWorker/supervisor/events.tspackages/core/src/v3/runEngineWorker/supervisor/consumerPool.tspackages/core/src/v3/runEngineWorker/supervisor/session.tspackages/core/src/v3/runEngineWorker/supervisor/queueConsumer.tspackages/cli-v3/src/deploy/buildImage.ts
🧠 Learnings (9)
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to {packages,integrations}/**/* : Add a changeset when modifying any public package in `packages/*` or `integrations/*` using `pnpm run changeset:add`
Applied to files:
.changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Use build extensions in trigger.config.ts (additionalFiles, additionalPackages, aptGet, prismaExtension, etc.) to customize the build
Applied to files:
.changeset/fix-local-build-load.mdpackages/cli-v3/src/deploy/buildImage.tsapps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure build process in trigger.config.ts using `build` object with external packages, extensions, and JSX settings
Applied to files:
.changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Run `npx trigger.devlatest dev` to start the Trigger.dev development server
Applied to files:
.changeset/fix-local-build-load.md
📚 Learning: 2025-11-14T19:24:39.536Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 2685
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.settings/route.tsx:1234-1257
Timestamp: 2025-11-14T19:24:39.536Z
Learning: In the trigger.dev project, version validation for the `useNativeBuildServer` setting cannot be performed at the settings form level because the SDK version is only known at build/deployment time, not when saving project settings.
Applied to files:
.changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2025-06-04T16:02:22.957Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2150
File: apps/supervisor/src/workloadManager/docker.ts:115-115
Timestamp: 2025-06-04T16:02:22.957Z
Learning: In the Trigger.dev codebase, the supervisor component uses DOCKER_ENFORCE_MACHINE_PRESETS while the docker provider component uses ENFORCE_MACHINE_PRESETS. These are separate components with separate environment variable configurations for the same logical concept of enforcing machine presets.
Applied to files:
apps/supervisor/src/env.ts
🧬 Code graph analysis (4)
packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
WorkerApiDequeueResponseBody(73-73)WorkerApiDequeueResponseBody(74-74)
apps/supervisor/src/index.ts (3)
apps/supervisor/src/workloadManager/compute.ts (1)
ComputeWorkloadManager(17-311)packages/core/src/v3/serverOnly/checkpointClient.ts (1)
CheckpointClient(16-122)apps/supervisor/src/env.ts (1)
env(164-164)
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
WorkerApiDequeueResponseBody(73-73)WorkerApiDequeueResponseBody(74-74)packages/core/src/v3/runEngineWorker/supervisor/types.ts (1)
PreDequeueFn(12-15)
apps/supervisor/src/workloadManager/compute.ts (2)
apps/supervisor/src/workloadManager/types.ts (3)
WorkloadManagerOptions(3-14)WorkloadManager(16-18)WorkloadManagerCreateOptions(20-42)apps/supervisor/src/util.ts (1)
getRunnerId(17-25)
🔇 Additional comments (13)
packages/cli-v3/src/deploy/buildImage.ts (2)
201-209: LGTM.
isLocalBuild: falsecorrectly gates the newtype=dockerpath out of remote builds. Sincepush: trueis hardcoded and depot uses--save, theloadoption is intentionally a no-op here.
535-543: LGTM.
loadandpushare already resolved tobooleanviashouldLoad/shouldPush, so the truthy check ingetOutputOptionsis reliable here..changeset/fix-local-build-load.md (1)
1-5: LGTM!Changeset correctly documents a patch-level fix for the
--loadflag issue.apps/supervisor/src/env.ts (1)
154-162: Good conditional validation.The superRefine correctly enforces
TRIGGER_METADATA_URLas a dependency whenCOMPUTE_SNAPSHOTS_ENABLEDis true. The error is scoped to the right path for clear diagnostics.packages/core/src/v3/runEngineWorker/supervisor/events.ts (1)
9-10: LGTM!Optional timing fields are cleanly added to the event type, maintaining backward compatibility.
apps/supervisor/src/workloadManager/types.ts (1)
27-30: LGTM! Clean addition of timing context fields for observability.Nit: The file uses
interfacethroughout, but coding guidelines prefertype. Since this is pre-existing, not worth changing in this PR.packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts (1)
354-359: LGTM!Timing parameter is correctly threaded through the consumer pool's interceptor to the original handler.
apps/supervisor/src/index.ts (2)
282-284: Good use ofperformance.now()for high-resolution timing.Clean pattern for measuring warm-start duration.
83-98: Compute mode initialization is clean.The triple assignment pattern (create → store as
computeManager→ assign asworkloadManager) is clear. TheisComputeModeflag derived from the URL's presence is a good single source of truth.packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)
116-127: LGTM! Timing instrumentation is correctly scoped.
dequeueResponseMsmeasures only the HTTP call, andpollingIntervalMscaptures the interval that scheduled this dequeue cycle. ThelastScheduledIntervalMstracking acrossscheduleNextDequeue→ nextdequeue()call is sound.
33-33:lastScheduledIntervalMstracking is correct.Initialized to
idleIntervalMs(first poll starts from idle state), then updated inscheduleNextDequeuebefore the callback fires. The value read in the nextdequeue()cycle accurately reflects the delay used.Also applies to: 42-42, 148-148
apps/supervisor/src/workloadManager/compute.ts (2)
120-120: Memory field name and unit differ betweencreateandrestore— verify gateway API contracts.
create(Line 120):memory_gb: opts.machine.memory— sends machine memory as-is, using field namememory_gbrestore(Line 273):memory_mb: opts.machine.memory * 1024— multiplies by 1024, using field namememory_mbIf
opts.machine.memoryis in GiB (e.g.,2= 2 GiB), the create endpoint receives2while the restore endpoint receives2048. Confirm this asymmetry is intentional per the gateway's API schemas for/api/instancesvs/api/snapshots/{id}/restore.Also applies to: 273-273
9-9: No action required. The import is correct.
import { tryCatch } from "@trigger.dev/core"is a valid import. The root index.ts explicitly re-exportstryCatchfromutils.ts, making it an intentional root-level export. This does not violate the subpath-only guideline.
…nabled Remove the silent `localhost` fallback for the snapshot callback URL, which would be unreachable from external compute gateways. Add env validation and a runtime guard matching the existing metadata URL pattern.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
apps/supervisor/src/env.ts (1)
154-169: Good cross-field validation — consider also requiringCOMPUTE_GATEWAY_URL.The
superRefinecorrectly gatesTRIGGER_METADATA_URLandTRIGGER_WORKLOAD_API_DOMAINbehindCOMPUTE_SNAPSHOTS_ENABLED. However,COMPUTE_GATEWAY_URLis not validated here. If someone setsCOMPUTE_SNAPSHOTS_ENABLED=truewithoutCOMPUTE_GATEWAY_URL, the env validation passes (requiring the metadata/domain vars) but noComputeWorkloadManageris instantiated — so snapshots never actually work, with no warning at startup.Adding a third check would make the fail-fast behavior consistent:
Suggested addition
.superRefine((data, ctx) => { + if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.COMPUTE_GATEWAY_URL) { + ctx.addIssue({ + code: z.ZodIssueCode.custom, + message: "COMPUTE_GATEWAY_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true", + path: ["COMPUTE_GATEWAY_URL"], + }); + } if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/env.ts` around lines 154 - 169, The superRefine validator currently requires TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED && !data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code: z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so validation fails fast and surfaces the missing gateway variable.apps/supervisor/src/workloadServer/index.ts (1)
439-507: Callback handler logic is sound; one defensive improvement to consider.The completed/failed branching, metadata validation, and suspend-completion submission all look correct. One thing to note: if
submitSuspendCompletionsucceeds butdeleteInstancefails (Line 478), the failure is silently swallowed —deleteInstancereturns abooleanand logs internally, but the callback handler doesn't log the deletion outcome. This could make debugging orphaned instances harder.Optional: log deletion failure
- await this.computeManager?.deleteInstance(body.instance_id); + const deleted = await this.computeManager?.deleteInstance(body.instance_id); + if (deleted === false) { + this.logger.warn("Failed to delete compute instance after suspend completion", { + runId, + instanceId: body.instance_id, + }); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/supervisor/src/workloadServer/index.ts` around lines 439 - 507, The handler currently calls await this.computeManager?.deleteInstance(body.instance_id) without checking the boolean result; update the completed branch after submitSuspendCompletion to capture the deleteOutcome = await this.computeManager?.deleteInstance(body.instance_id) and then log a warning/error via this.logger.warn or this.logger.error if deleteOutcome is falsy (include runId, instanceId, snapshotFriendlyId and any contextual info), otherwise log successful deletion—this ensures failures in computeManager.deleteInstance are visible; reference submitSuspendCompletion, deleteInstance, computeManager, and this.logger when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 266-273: The error response in the branch where
this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED and
env.TRIGGER_WORKLOAD_API_DOMAIN is missing returns { error: "..." } which
doesn't match WorkloadSuspendRunResponseBody; change the reply.json call in that
branch (the block referencing this.computeManager,
env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return the same shape as the
other handlers: { ok: false, error: "TRIGGER_WORKLOAD_API_DOMAIN is not set,
cannot create snapshot callback URL" } (or equivalent message) so the response
conforms to WorkloadSuspendRunResponseBody and the runner can parse it.
---
Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 154-169: The superRefine validator currently requires
TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 439-507: The handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
apps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
**/*.{ts,tsx}: Always import tasks from@trigger.dev/sdk, never use@trigger.dev/sdk/v3or deprecatedclient.defineJobpattern
Every Trigger.dev task must be exported and have a uniqueidproperty with no timeouts in the run function
Files:
apps/supervisor/src/workloadServer/index.tsapps/supervisor/src/env.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Import from
@trigger.dev/coreusing subpaths only, never import from root
Files:
apps/supervisor/src/workloadServer/index.tsapps/supervisor/src/env.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/supervisor/src/workloadServer/index.tsapps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
apps/supervisor/src/workloadServer/index.tsapps/supervisor/src/env.ts
🧠 Learnings (4)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-02-23T12:56:44.997Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Applied to files:
apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp
Applied to files:
apps/supervisor/src/env.ts
🧬 Code graph analysis (2)
apps/supervisor/src/workloadServer/index.ts (2)
apps/supervisor/src/workloadManager/compute.ts (1)
ComputeWorkloadManager(17-311)packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)
WorkloadSuspendRunResponseBody(22-30)WorkloadSuspendRunResponseBody(31-31)
apps/supervisor/src/env.ts (2)
packages/core/src/v3/utils/crypto.ts (1)
randomUUID(1-5)apps/supervisor/src/envUtil.ts (2)
BoolEnv(15-17)AdditionalEnvVars(19-47)
🔇 Additional comments (4)
apps/supervisor/src/env.ts (1)
81-85: LGTM — new compute settings are well-structured.Clean schema additions with sensible defaults (
30_000 mstimeout, snapshots off by default).apps/supervisor/src/workloadServer/index.ts (3)
27-27: LGTM — type-only import forComputeWorkloadManager.Correctly uses
import typeto avoid runtime dependency on the compute module.
57-63: LGTM —ComputeSnapshotCallbackBodyschema is well-defined.Uses
z.objectwith appropriate optional fields and a bounded string enum forstatus.
275-294: Fire-and-forget pattern looks correct for the compute snapshot path.The 202 is returned before the async
snapshot()call, and the failure is logged. Recovery for failed snapshots is acknowledged as a future enhancement per prior discussion.
| if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) { | ||
| if (!env.TRIGGER_WORKLOAD_API_DOMAIN) { | ||
| this.logger.error( | ||
| "TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" | ||
| ); | ||
| reply.json({ error: "Snapshot callbacks not configured" }, false, 500); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Response at Line 271 doesn't conform to WorkloadSuspendRunResponseBody.
The other error responses in this handler (Lines 255–262, 297–304) correctly return { ok: false, error: "..." } satisfies WorkloadSuspendRunResponseBody. This branch returns { error: "..." } without the ok discriminator, so the runner may fail to parse the response.
Proposed fix
- reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+ reply.json(
+ {
+ ok: false,
+ error: "Snapshot callbacks not configured",
+ } satisfies WorkloadSuspendRunResponseBody,
+ false,
+ 500
+ );📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) { | |
| if (!env.TRIGGER_WORKLOAD_API_DOMAIN) { | |
| this.logger.error( | |
| "TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" | |
| ); | |
| reply.json({ error: "Snapshot callbacks not configured" }, false, 500); | |
| return; | |
| } | |
| if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) { | |
| if (!env.TRIGGER_WORKLOAD_API_DOMAIN) { | |
| this.logger.error( | |
| "TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" | |
| ); | |
| reply.json( | |
| { | |
| ok: false, | |
| error: "Snapshot callbacks not configured", | |
| } satisfies WorkloadSuspendRunResponseBody, | |
| false, | |
| 500 | |
| ); | |
| return; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/workloadServer/index.ts` around lines 266 - 273, The
error response in the branch where this.computeManager &&
env.COMPUTE_SNAPSHOTS_ENABLED and env.TRIGGER_WORKLOAD_API_DOMAIN is missing
returns { error: "..." } which doesn't match WorkloadSuspendRunResponseBody;
change the reply.json call in that branch (the block referencing
this.computeManager, env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return
the same shape as the other handlers: { ok: false, error:
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" }
(or equivalent message) so the response conforms to
WorkloadSuspendRunResponseBody and the runner can parse it.
delay compute snapshot requests to avoid wasted work on short-lived waitpoints (e.g. triggerAndWait resolving in <5s). configurable via COMPUTE_SNAPSHOT_DELAY_MS (default 5s).
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 430-433: Change shutdown ordering to quiesce and drain the worker
session before closing the workload API: in the async stop() method, first call
and await a worker-session quiesce/drain operation (e.g.,
workerSession.quiesce() or workerSession.drainPendingWork()) to stop intake and
let in-flight runners/callbacks complete, then call await
this.workloadServer.stop(), and finally call await this.workerSession.stop() if
you need to fully teardown session state; if workerSession lacks a quiesce/drain
API, add one (e.g., WorkerSession.quiesce/ drainPendingWork) that stops new
intake, waits for active tasks/callbacks to finish, and resolves before
workloadServer.stop() is invoked.
In `@apps/supervisor/src/services/timerWheel.ts`:
- Around line 43-50: The delay-to-slot conversion in the TimerWheel constructor
(setting this.delaySlots) currently uses Math.round(opts.delayMs / TICK_MS)
which can undershoot the requested delay; change it to Math.ceil(opts.delayMs /
TICK_MS) so delays never shorten, and search the file for any other uses of
Math.round(... / TICK_MS) (e.g., later scheduling/adjustment logic around
delaySlots) and replace those with Math.ceil as well to keep consistent, correct
rounding behavior.
- Around line 22-24: The onExpire callback should allow async handlers and be
awaited with error handling: change TimerWheelOptions.onExpire signature to
return Promise<void> (or void | Promise<void>), then where you invoke onExpire
for a TimerWheelItem<T> (the current expiration path that deletes the entry
before calling onExpire), await the call inside a try/catch; if it throws, catch
the error, log it, and re-insert/re-schedule the same TimerWheelItem back into
the wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 478-523: The handler currently always calls reply.empty(200) even
when this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.
- Around line 752-760: When stopping the snapshot delay wheel on shutdown
(this.snapshotDelayWheel?.stop()) we currently log and discard the returned
pending items, which loses pending suspend/snapshot work after we've already
responded 202; instead iterate the returned remaining items and either
re-enqueue them or invoke the same expiration handler used by the wheel so each
pending waitpoint still triggers the suspend/snapshot workflow (i.e. call the
TimerWheel onExpire handler or the snapshot processing function for each
item.key/item), or persist them for resume — do not drop them. Ensure you
reference snapshotDelayWheel, TimerWheel.stop(), onExpire() and the remaining
array when making the change so pending entries are handled the same way as when
they naturally expire.
- Around line 117-123: The TimerWheel's onExpire handler is calling the async
dispatchComputeSnapshot() synchronously so its returned promise can reject
unhandled; update the onExpire callback passed to TimerWheel
(snapshotDelayWheel) to call dispatchComputeSnapshot(item.data) and attach
explicit error handling (e.g., .catch(...) to log/report failures) so any
rejection is caught and does not become an unhandled promise rejection;
reference TimerWheel, TimerWheel.tick, snapshotDelayWheel, and
dispatchComputeSnapshot when locating the handler to wrap the call with a
rejection handler.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1d829bf2-d50e-47e1-9d3f-417818c8b908
📒 Files selected for processing (5)
apps/supervisor/src/env.tsapps/supervisor/src/index.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/services/timerWheel.tsapps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
- GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
- GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
- GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
- GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
- GitHub Check: sdk-compat / Bun Runtime
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: sdk-compat / Deno Runtime
- GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead
**/*.{ts,tsx}: In TypeScript SDK usage, always import from@trigger.dev/sdk, never from@trigger.dev/sdk/v3or use deprecated client.defineJob
Import from@trigger.dev/coresubpaths only, never from the root
Use the Run Engine 2.0 (@internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/index.tsapps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use function declarations instead of default exports
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/index.tsapps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
**/*.ts
📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)
**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/index.tsapps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Format code using Prettier before committing
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/index.tsapps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
apps/**/*.{ts,tsx,js}
📄 CodeRabbit inference engine (CLAUDE.md)
When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.tsapps/supervisor/src/index.tsapps/supervisor/src/env.tsapps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/**/*.{js,ts}
📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
Core service logic should be organized in the
src/services/directory
Files:
apps/supervisor/src/services/timerWheel.tsapps/supervisor/src/services/timerWheel.test.ts
**/*.{test,spec}.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use vitest for all tests in the Trigger.dev repository
Files:
apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptivedescribeanditblocks
Tests should avoid mocks or stubs and use the helpers from@internal/testcontainerswhen Redis or Postgres are needed
Use vitest for running unit tests
Files:
apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js}
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.test.{ts,tsx,js}: Use vitest exclusively for testing - never mock anything, use testcontainers instead
Place test files next to source files with .test.ts naming convention (e.g., MyService.ts -> MyService.test.ts)
Test files using Redis or PostgreSQL should use testcontainers helpers (redisTest, postgresTest, containerTest) instead of mocks
Files:
apps/supervisor/src/services/timerWheel.test.ts
apps/supervisor/src/env.ts
📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
Environment configuration should be defined in
src/env.ts
Files:
apps/supervisor/src/env.ts
apps/supervisor/src/workloadServer/**/*.{js,ts}
📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)
HTTP server for workload communication (heartbeats, snapshots) should be implemented in
src/workloadServer/
Files:
apps/supervisor/src/workloadServer/index.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
📚 Learning: 2026-03-03T13:07:33.177Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3166
File: internal-packages/run-engine/src/batch-queue/tests/index.test.ts:711-713
Timestamp: 2026-03-03T13:07:33.177Z
Learning: In `internal-packages/run-engine/src/batch-queue/tests/index.test.ts`, test assertions for rate limiter stubs can use `toBeGreaterThanOrEqual` rather than exact equality (`toBe`) because the consumer loop may call the rate limiter during empty pops in addition to actual item processing, and this over-calling is acceptable in integration tests.
Applied to files:
apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-03-02T12:43:25.254Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Applies to internal-packages/run-engine/src/engine/tests/**/*.test.ts : Implement tests for RunEngine in `src/engine/tests/` using testcontainers for Redis and PostgreSQL containerization
Applied to files:
apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Applied to files:
apps/supervisor/src/index.tsapps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`
Applied to files:
apps/supervisor/src/index.tsapps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
Applied to files:
apps/supervisor/src/index.tsapps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:41.110Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:41.110Z
Learning: Applies to **/*.{ts,tsx} : Use the Run Engine 2.0 (internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions
Applied to files:
apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:56.114Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:56.114Z
Learning: All run lifecycle operations (triggering, completing, cancelling, etc.) should be performed through the singleton run engine instance in `app/v3/runEngine.server.ts` via service calls
Applied to files:
apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: WorkloadManager should abstract Docker vs Kubernetes execution to allow switching between container orchestration platforms
Applied to files:
apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/env.ts : Environment configuration should be defined in `src/env.ts`
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp
Applied to files:
apps/supervisor/src/env.ts
📚 Learning: 2026-03-02T12:43:37.906Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/core/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:37.906Z
Learning: Exercise caution with changes to trigger.dev/core as they affect both the customer-facing SDK and server-side webapp - breaking changes can impact deployed user tasks and the platform simultaneously
Applied to files:
apps/supervisor/src/workloadServer/index.ts
| async stop() { | ||
| this.logger.log("Shutting down"); | ||
| await this.workloadServer.stop(); | ||
| await this.workerSession.stop(); |
There was a problem hiding this comment.
Quiesce the worker session before tearing down the workload API.
workloadServer fronts heartbeats, attempt completion, suspend/continue, and compute snapshot callbacks. Closing it first can strand active runners and in-flight callbacks against a dead server while workerSession is still live. This shutdown path needs a “stop intake, drain, then close the API” order instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/index.ts` around lines 430 - 433, Change shutdown
ordering to quiesce and drain the worker session before closing the workload
API: in the async stop() method, first call and await a worker-session
quiesce/drain operation (e.g., workerSession.quiesce() or
workerSession.drainPendingWork()) to stop intake and let in-flight
runners/callbacks complete, then call await this.workloadServer.stop(), and
finally call await this.workerSession.stop() if you need to fully teardown
session state; if workerSession lacks a quiesce/drain API, add one (e.g.,
WorkerSession.quiesce/ drainPendingWork) that stops new intake, waits for active
tasks/callbacks to finish, and resolves before workloadServer.stop() is invoked.
| export type TimerWheelOptions<T> = { | ||
| /** Called when an item's delay expires. */ | ||
| onExpire: (item: TimerWheelItem<T>) => void; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "timerWheel.ts" -type fRepository: triggerdotdev/trigger.dev
Length of output: 113
🏁 Script executed:
find . -name "workloadServer" -type fRepository: triggerdotdev/trigger.dev
Length of output: 51
🏁 Script executed:
fd -t f timerWheelRepository: triggerdotdev/trigger.dev
Length of output: 159
🏁 Script executed:
cat -n apps/supervisor/src/services/timerWheel.tsRepository: triggerdotdev/trigger.dev
Length of output: 5895
🏁 Script executed:
cat -n apps/supervisor/src/workloadServer/index.ts | sed -n '115,125p;725,745p'Repository: triggerdotdev/trigger.dev
Length of output: 1425
🏁 Script executed:
cat -n apps/supervisor/src/workloadServer/index.ts | head -200 | tail -100Repository: triggerdotdev/trigger.dev
Length of output: 3881
🏁 Script executed:
cat -n apps/supervisor/src/services/timerWheel.test.ts | head -100Repository: triggerdotdev/trigger.dev
Length of output: 3290
🏁 Script executed:
rg "dispatchComputeSnapshot" apps/supervisor/src/workloadServer/index.ts -A 5 -B 5Repository: triggerdotdev/trigger.dev
Length of output: 993
🏁 Script executed:
rg "unhandledRejection|uncaughtException" apps/supervisor/src --type ts -B 3 -A 3Repository: triggerdotdev/trigger.dev
Length of output: 51
🏁 Script executed:
rg "catch|\.on\(" apps/supervisor/src/workloadServer/index.ts | head -20Repository: triggerdotdev/trigger.dev
Length of output: 522
🏁 Script executed:
git ls-files apps/supervisor/src | head -20Repository: triggerdotdev/trigger.dev
Length of output: 886
🏁 Script executed:
rg "process\.on\(" apps/supervisor --type tsRepository: triggerdotdev/trigger.dev
Length of output: 51
🏁 Script executed:
cat -n apps/supervisor/src/index.tsRepository: triggerdotdev/trigger.dev
Length of output: 18909
Handle async onExpire callbacks explicitly.
The onExpire callback type declares () => void, but TypeScript permits assigning async functions here since Promise<void> is assignable to void. Line 157 invokes the callback without awaiting or catching, so if dispatchComputeSnapshot() (line 725, apps/supervisor/src/workloadServer/index.ts) rejects, the rejection goes unhandled. The entry is already deleted at line 156, making the delayed snapshot permanently lost.
Suggested fix
export type TimerWheelOptions<T> = {
/** Called when an item's delay expires. */
- onExpire: (item: TimerWheelItem<T>) => void;
+ onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+ onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
/** Delay in milliseconds before items fire. Clamped to [100, 60000]. */
delayMs: number;
};
@@
- private onExpire: (item: TimerWheelItem<T>) => void;
+ private onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+ private onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
@@
this.intervalId = null;
this.onExpire = opts.onExpire;
+ this.onExpireError = opts.onExpireError;
this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
}
@@
- this.entries.delete(key);
- this.onExpire({ key, data: entry.data });
+ this.entries.delete(key);
+ const item = { key, data: entry.data };
+
+ try {
+ void Promise.resolve(this.onExpire(item)).catch((error) => {
+ this.onExpireError?.(error, item);
+ });
+ } catch (error) {
+ this.onExpireError?.(error, item);
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/services/timerWheel.ts` around lines 22 - 24, The
onExpire callback should allow async handlers and be awaited with error
handling: change TimerWheelOptions.onExpire signature to return Promise<void>
(or void | Promise<void>), then where you invoke onExpire for a
TimerWheelItem<T> (the current expiration path that deletes the entry before
calling onExpire), await the call inside a try/catch; if it throws, catch the
error, log it, and re-insert/re-schedule the same TimerWheelItem back into the
wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.
| constructor(opts: TimerWheelOptions<T>) { | ||
| this.slots = Array.from({ length: NUM_SLOTS }, () => new Set()); | ||
| this.entries = new Map(); | ||
| this.cursor = 0; | ||
| this.intervalId = null; | ||
| this.onExpire = opts.onExpire; | ||
| this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS))); | ||
| } |
There was a problem hiding this comment.
Use Math.ceil() for slot conversion.
Math.round(delayMs / TICK_MS) can shave almost half a tick off the requested delay. For example, a 149ms delay submitted on a tick boundary becomes a single 100ms slot, so the grace window can expire earlier than configured.
🛠️ Minimal fix
- this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
+ this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(opts.delayMs / TICK_MS)));
@@
- this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(delayMs / TICK_MS)));
+ this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(delayMs / TICK_MS)));Also applies to: 90-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/services/timerWheel.ts` around lines 43 - 50, The
delay-to-slot conversion in the TimerWheel constructor (setting this.delaySlots)
currently uses Math.round(opts.delayMs / TICK_MS) which can undershoot the
requested delay; change it to Math.ceil(opts.delayMs / TICK_MS) so delays never
shorten, and search the file for any other uses of Math.round(... / TICK_MS)
(e.g., later scheduling/adjustment logic around delaySlots) and replace those
with Math.ceil as well to keep consistent, correct rounding behavior.
| if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) { | ||
| this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({ | ||
| delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS, | ||
| onExpire: (item) => this.dispatchComputeSnapshot(item.data), | ||
| }); | ||
| this.snapshotDelayWheel.start(); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Expected result:
# - timerWheel.tick() calls onExpire(...) with no await/.catch
# - workloadServer wires onExpire to async dispatchComputeSnapshot()
sed -n '141,159p' apps/supervisor/src/services/timerWheel.ts
echo
sed -n '117,123p' apps/supervisor/src/workloadServer/index.ts
echo
sed -n '721,745p' apps/supervisor/src/workloadServer/index.tsRepository: triggerdotdev/trigger.dev
Length of output: 1785
Catch timer-wheel dispatch failures before they become unhandled rejections.
TimerWheel.tick() invokes onExpire() synchronously without awaiting it, so the promise returned by dispatchComputeSnapshot() is currently dropped. Any rejection inside that async function becomes an unhandled promise rejection, which can crash the supervisor process.
🐛 Proposed fix
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
- onExpire: (item) => this.dispatchComputeSnapshot(item.data),
+ onExpire: (item) => {
+ void this.dispatchComputeSnapshot(item.data).catch((error) => {
+ this.logger.error("Unhandled compute snapshot dispatch error", {
+ runId: item.data.runFriendlyId,
+ runnerId: item.data.runnerId,
+ error,
+ });
+ });
+ },
});
this.snapshotDelayWheel.start();
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/workloadServer/index.ts` around lines 117 - 123, The
TimerWheel's onExpire handler is calling the async dispatchComputeSnapshot()
synchronously so its returned promise can reject unhandled; update the onExpire
callback passed to TimerWheel (snapshotDelayWheel) to call
dispatchComputeSnapshot(item.data) and attach explicit error handling (e.g.,
.catch(...) to log/report failures) so any rejection is caught and does not
become an unhandled promise rejection; reference TimerWheel, TimerWheel.tick,
snapshotDelayWheel, and dispatchComputeSnapshot when locating the handler to
wrap the call with a rejection handler.
| if (body.status === "completed") { | ||
| const result = await this.workerClient.submitSuspendCompletion({ | ||
| runId, | ||
| snapshotId: snapshotFriendlyId, | ||
| body: { | ||
| success: true, | ||
| checkpoint: { | ||
| type: "KUBERNETES", | ||
| location: body.snapshot_id, | ||
| }, | ||
| }, | ||
| }); | ||
|
|
||
| if (result.success) { | ||
| this.logger.info("Suspend completion submitted", { | ||
| runId, | ||
| instanceId: body.instance_id, | ||
| snapshotId: body.snapshot_id, | ||
| }); | ||
| } else { | ||
| this.logger.error("Failed to submit suspend completion", { | ||
| runId, | ||
| snapshotFriendlyId, | ||
| error: result.error, | ||
| }); | ||
| } | ||
| } else { | ||
| const result = await this.workerClient.submitSuspendCompletion({ | ||
| runId, | ||
| snapshotId: snapshotFriendlyId, | ||
| body: { | ||
| success: false, | ||
| error: body.error ?? "Snapshot failed", | ||
| }, | ||
| }); | ||
|
|
||
| if (!result.success) { | ||
| this.logger.error("Failed to submit suspend failure", { | ||
| runId, | ||
| snapshotFriendlyId, | ||
| error: result.error, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| reply.empty(200); |
There was a problem hiding this comment.
Return a non-2xx when suspend completion was not recorded.
Both branches log !result.success and still reply 200. That acknowledges the compute callback even though the platform never received the suspend result, which drops the only retry signal on transient worker API failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/workloadServer/index.ts` around lines 478 - 523, The
handler currently always calls reply.empty(200) even when
this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.
| const remaining = this.snapshotDelayWheel?.stop() ?? []; | ||
| if (remaining.length > 0) { | ||
| this.logger.info("Snapshot delay wheel stopped, dropped pending snapshots", { | ||
| count: remaining.length, | ||
| }); | ||
| this.logger.debug("Dropped snapshot details", { | ||
| runs: remaining.map((item) => item.key), | ||
| }); | ||
| } |
There was a problem hiding this comment.
Don't drop delayed suspends on shutdown after already returning 202.
TimerWheel.stop() returns pending items without firing onExpire(). At this point the runner has already been told to suspend, so logging and discarding these entries means no snapshot request or failure is ever reported for that waitpoint across a supervisor restart.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/supervisor/src/workloadServer/index.ts` around lines 752 - 760, When
stopping the snapshot delay wheel on shutdown (this.snapshotDelayWheel?.stop())
we currently log and discard the returned pending items, which loses pending
suspend/snapshot work after we've already responded 202; instead iterate the
returned remaining items and either re-enqueue them or invoke the same
expiration handler used by the wheel so each pending waitpoint still triggers
the suspend/snapshot workflow (i.e. call the TimerWheel onExpire handler or the
snapshot processing function for each item.key/item), or persist them for resume
— do not drop them. Ensure you reference snapshotDelayWheel, TimerWheel.stop(),
onExpire() and the remaining array when making the change so pending entries are
handled the same way as when they naturally expire.
Adds the
ComputeWorkloadManagerfor routing task execution through the compute gateway, including full checkpoint/restore support.Changes
Compute workload manager (
apps/supervisor/src/workloadManager/compute.ts)Supervisor wiring (
apps/supervisor/src/index.ts)Workload server (
apps/supervisor/src/workloadServer/index.ts)submitSuspendCompletionEnv validation (
apps/supervisor/src/env.ts)