Skip to content

feat(supervisor): compute workload manager#3114

Draft
nicktrn wants to merge 26 commits intomainfrom
feat/compute-workload-manager
Draft

feat(supervisor): compute workload manager#3114
nicktrn wants to merge 26 commits intomainfrom
feat/compute-workload-manager

Conversation

@nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Feb 23, 2026

Adds the ComputeWorkloadManager for routing task execution through the compute gateway, including full checkpoint/restore support.

Changes

Compute workload manager (apps/supervisor/src/workloadManager/compute.ts)

  • Routes VM create, snapshot, delete, and restore through the compute gateway API
  • Wide event logging on create with full timing and context
  • Configurable gateway timeout, auth token, image digest stripping
  • Restore sends name, env override metadata, CPU and memory so the agent can inject them before the VM resumes

Supervisor wiring (apps/supervisor/src/index.ts)

  • Compute mode activated when gateway URL is configured
  • Restore branch derives a unique runnerId per restore cycle, matching iceman's convention
  • Suspend/restore gated behind snapshots enabled flag

Workload server (apps/supervisor/src/workloadServer/index.ts)

  • Suspend handler triggers a compute snapshot (fire-and-forget) when in compute mode with snapshots enabled
  • Snapshot-complete callback endpoint receives the snapshot ID and calls submitSuspendCompletion

Env validation (apps/supervisor/src/env.ts)

  • Compute gateway URL, auth token, and timeout settings
  • Snapshots enabled flag (defaults off — compute mode can run without checkpoints)
  • Metadata URL required when snapshots enabled (validated at startup)

Add a third WorkloadManager implementation that creates sandboxes via
the compute gateway HTTP API (POST /api/sandboxes). Uses native fetch
with no new dependencies. Enabled by setting COMPUTE_GATEWAY_URL, which
takes priority over Kubernetes and Docker providers.
The fetch() call had no timeout, causing infinite hangs when the gateway
accepted requests but never returned responses. Adds AbortSignal.timeout
(30s) and consolidates all logging into a single structured event per
create() call with timing, status, and error context.
Emit a single canonical log line in a finally block instead of scattered
log calls at each early return. Adds business context (envId, envType,
orgId, projectId, deploymentVersion, machine) and instanceName to the
event. Always emits at info level with ok=true/false for queryability.
Pass business context (runId, envId, orgId, projectId, machine, etc.)
as metadata on CreateSandboxRequest instead of relying on env vars.
This enables wide event logging in the compute stack without parsing
env or leaking secrets.
Passes machine preset cpu and memory as top-level fields on the
CreateSandboxRequest so the compute stack can use them for admission
control and resource allocation.
Thread timing context from queue consumer through to the compute
workload manager's wide event:

- dequeueResponseMs: platform dequeue HTTP round-trip
- pollingIntervalMs: which polling interval was active (idle vs active)
- warmStartCheckMs: warm start check duration

All fields are optional to avoid breaking existing consumers.
- Fix instance creation URL from /api/sandboxes to /api/instances
- Pass name: runnerId when creating compute instances
- Add snapshot(), deleteInstance(), and restore() methods to ComputeWorkloadManager
- Add /api/v1/compute/snapshot-complete callback endpoint to WorkloadServer
- Handle suspend requests in compute mode via fire-and-forget snapshot with callback
- Handle restore in compute mode by calling gateway restore API directly
- Wire computeManager into WorkloadServer for compute mode suspend/restore
…re request

Restore calls now send a request body with the runner name, env override metadata,
cpu, and memory so the agent can inject them before the VM resumes. The runner
fetches these overrides from TRIGGER_METADATA_URL at restore time.

runnerId is derived per restore cycle as runner-{runIdShort}-{checkpointSuffix},
matching iceman's pattern.
Gates snapshot/restore behaviour independently of compute mode.
When disabled, VMs won't receive the metadata URL and suspend/restore
are no-ops. Defaults to off so compute mode can be used without snapshots.
@changeset-bot
Copy link

changeset-bot bot commented Feb 23, 2026

🦋 Changeset detected

Latest commit: 80b62d4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages
Name Type
trigger.dev Patch
d3-chat Patch
references-d3-openai-agents Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
references-telemetry Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/redis-worker Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/redis Patch
@internal/replication Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
@internal/sdk-compat-tests Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

Adds a ComputeWorkloadManager for remote instance lifecycle (create, snapshot, restore, delete) and integrates compute mode into the supervisor and workload server, including a compute snapshot callback endpoint and delayed snapshot scheduling. Extends environment schema with compute-related variables and conditional validation when snapshots are enabled. Introduces a TimerWheel implementation and comprehensive tests for delayed dispatch behavior. Propagates timing context fields (dequeueResponseMs, pollingIntervalMs, warmStartCheckMs) through consumer/queue/session/events/types into workload creation and events. Adjusts CLI build image output options for local builds and adds a changeset marking a patch.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(supervisor): compute workload manager' directly matches the main objective of the PR, which introduces the ComputeWorkloadManager for routing task execution through a compute gateway.
Description check ✅ Passed The PR description is well-structured with clear sections explaining the changes, though it does not follow the exact template structure (missing explicit checklist items and sections like Testing/Screenshots).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/compute-workload-manager

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (6)
packages/cli-v3/src/deploy/buildImage.ts (1)

1128-1153: Consider tightening the load check to load === true for contract clarity.

load is typed as boolean | undefined, so the truthy check isLocalBuild && load && !push silently falls through to type=image when load is undefined. In current call sites this is fine (callers always pass a resolved boolean), but the implicit contract is easy to violate if a future caller passes isLocalBuild: true without resolving load first, unexpectedly getting type=image output.

♻️ Proposed clarification
-  if (isLocalBuild && load && !push) {
+  if (isLocalBuild && load === true && !push) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/cli-v3/src/deploy/buildImage.ts` around lines 1128 - 1153, The
conditional that chooses "type=docker" relies on a truthy check (isLocalBuild &&
load && !push) which allows load to be undefined; change the guard to explicitly
test load === true so the branch only triggers when load is explicitly true.
Locate the build function parameters (isLocalBuild, load, push, imageTag) and
update the if condition (and any related comment if helpful) so it reads
isLocalBuild && load === true && !push, keeping the imageTag handling and
returned outputOptions logic unchanged.
apps/supervisor/src/env.ts (1)

81-85: Consider validating COMPUTE_GATEWAY_AUTH_TOKEN when COMPUTE_GATEWAY_URL is set.

If the compute gateway always requires authentication, add a superRefine check similar to the snapshots/metadata validation. If unauthenticated gateways are a valid use case (e.g., mTLS or network-level auth), this is fine as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 81 - 85, Add a zod superRefine to
the environment schema to enforce that when COMPUTE_GATEWAY_URL is defined,
COMPUTE_GATEWAY_AUTH_TOKEN must also be provided: update the env schema around
the COMPUTE_GATEWAY_URL and COMPUTE_GATEWAY_AUTH_TOKEN entries to call
superRefine (similar to the snapshots/metadata validation) and add an error on
the token field when URL is set but token is missing; keep behavior unchanged if
URL is undefined to support unauthenticated gateways.
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (1)

18-18: Consider extracting the timing type to avoid inline duplication.

The { dequeueResponseMs: number; pollingIntervalMs: number } shape is repeated at Line 18 and Line 26 (and similarly in consumerPool.ts and session.ts). A shared named type would reduce repetition across the dequeue chain.

Example: shared type in types.ts
// In types.ts
export type DequeueTiming = {
  dequeueResponseMs: number;
  pollingIntervalMs: number;
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts` at line 18,
Extract the inline timing shape into a shared exported type (e.g., export type
DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs: number; }) in a
central types file, then replace the inline occurrences with that type in the
function signature for onDequeue (currently onDequeue: (messages:
WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.
apps/supervisor/src/index.ts (1)

229-231: Runner ID derivation looks reasonable but is fragile on format assumptions.

message.run.friendlyId.replace("run_", "") uses a plain string replace (only first occurrence), and checkpoint.id.slice(-8) assumes IDs are always ≥8 characters. Both assumptions are likely safe given the platform's ID format, but consider adding a brief comment documenting the expected format, or using a more defensive approach.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/index.ts` around lines 229 - 231, The runner ID
construction is brittle: replace("run_", "") and slice(-8) assume formats;
update code around runIdShort, checkpointSuffix, and runnerId to be defensive by
(1) deriving runIdShort using a check like
message.run.friendlyId.startsWith("run_") ? message.run.friendlyId.substring(4)
: message.run.friendlyId (or use a regex to extract the suffix) and (2)
computing checkpointSuffix by using checkpoint.id.length >= 8 ?
checkpoint.id.slice(-8) : checkpoint.id (or another safe fallback) so you never
slice past bounds; also add a short comment above the logic documenting the
expected friendlyId and checkpoint.id formats.
apps/supervisor/src/workloadManager/compute.ts (1)

72-78: create() duplicates the authHeaders getter logic — use this.authHeaders instead.

The inline header construction at Lines 72–78 is identical to the private get authHeaders getter defined at Lines 168–176. All other methods (snapshot, deleteInstance, restore) correctly use this.authHeaders.

♻️ Proposed fix
-    const headers: Record<string, string> = {
-      "Content-Type": "application/json",
-    };
-
-    if (this.opts.gatewayAuthToken) {
-      headers["Authorization"] = `Bearer ${this.opts.gatewayAuthToken}`;
-    }
-
     // Strip image digest — resolve by tag, not digest
     const imageRef = opts.image.split("@")[0]!;
 
     const url = `${this.opts.gatewayUrl}/api/instances`;
 
     ...
 
       const [fetchError, response] = await tryCatch(
         fetch(url, {
           method: "POST",
-          headers,
+          headers: this.authHeaders,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadManager/compute.ts` around lines 72 - 78, The
create() method duplicates header construction that already exists in the
private getter authHeaders; replace the inline headers block in create() with a
single reference to this.authHeaders so it reuses the centralized logic (similar
to snapshot, deleteInstance, restore). Locate create() in compute.ts, remove the
manual headers Record<string,string> and the conditional Authorization
assignment, and use this.authHeaders wherever headers are needed to avoid
duplication.
apps/supervisor/src/workloadServer/index.ts (1)

454-499: 200 response is sent only after all network calls complete — gateway retries on slow ops could cause duplicate submissions.

submitSuspendCompletion and deleteInstance are both awaited before reply.empty(200) at Line 499. If either call is slow, the gateway's outbound callback will stall waiting for the acknowledgement; if it has its own timeout/retry policy it will re-deliver the callback, potentially triggering duplicate submitSuspendCompletion calls and a second deleteInstance attempt on the same (possibly already-deleted) instance.

Consider responding 200 immediately after validating the body and processing the callback asynchronously, relying on idempotency guarantees in submitSuspendCompletion, or ensuring submitSuspendCompletion is idempotent for the same (runId, snapshotId) pair.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 454 - 499, The
handler currently awaits submitSuspendCompletion and deleteInstance before
calling reply.empty(200), which can block the gateway and cause duplicate
callbacks; change flow to validate the incoming body, call reply.empty(200)
immediately, then process the suspend completion asynchronously (e.g. spawn a
background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 226-251: The handler currently logs failures from
computeManager.restore (didRestore === false or caught exceptions in the
try/catch around computeManager.restore) and returns without remediation; update
it to notify the platform before returning by calling the worker session HTTP
API (use this.workerSession.httpClient) to either mark the run failed or request
rescheduling, sending run id (message.run.id/friendlyId), runnerId,
checkpoint.id/location and the error details (or a failure reason when
didRestore is false); invoke this notification in both the didRestore === false
branch and the catch block and swallow/handle any HTTP errors locally so the
handler never silently abandons the run.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 433-501: The endpoint registered at
"/api/v1/compute/snapshot-complete" (handler for ComputeSnapshotCallbackBody)
lacks authentication allowing arbitrary suspend submissions and instance
deletions; fix by validating an immutable callback secret/signature before
processing: when generating callback URLs embed a per-run or global callback
token, then in this handler verify a signed header or Bearer token (e.g.,
X-Callback-Signature or Authorization) and reject requests with reply.empty(401)
if verification fails; additionally, once authenticated, cross-check the
provided body.instance_id against the expected instance for the runId (look up
the canonical instance for runId before calling
this.computeManager?.deleteInstance) and only call computeManager.deleteInstance
after both signature and instance match succeed; keep logging but avoid acting
on unauthenticated or mismatched requests and return appropriate HTTP codes.
- Around line 270-272: The callback URL construction in workloadServer/index.ts
uses a fallback to "localhost" which breaks external gateways; update validation
and usage so TRIGGER_WORKLOAD_API_DOMAIN is required when
COMPUTE_SNAPSHOTS_ENABLED is true (similar to TRIGGER_METADATA_URL), and remove
the silent default to "localhost" in the callbackUrl assembly. Specifically,
add/adjust env.ts validation to mark TRIGGER_WORKLOAD_API_DOMAIN as mandatory
when COMPUTE_SNAPSHOTS_ENABLED is enabled (or throw early), and change the code
that builds callbackUrl (the const callbackUrl) to fail loudly if
TRIGGER_WORKLOAD_API_DOMAIN is missing rather than using "localhost".
- Around line 266-288: Return 202 is sent before awaiting
computeManager.snapshot, and if snapshotResult is false there is no recovery
path; after the existing snapshotResult check in the block that calls
computeManager.snapshot({ runnerId, callbackUrl, metadata: { runId:
params.runFriendlyId, snapshotFriendlyId: params.snapshotFriendlyId }}), call
workerClient.submitSuspendCompletion (or equivalent API) to notify the platform
of the suspend failure including runnerId,
params.runFriendlyId/snapshotFriendlyId and error context, log the failure via
this.logger.error, and ensure the submitSuspendCompletion call handles/awaits
errors so the run can be re-queued or failed cleanly when snapshotResult is
false.

In `@packages/core/src/v3/runEngineWorker/supervisor/session.ts`:
- Around line 83-93: This change touched the public package `@trigger.dev/core`
(method onDequeue in supervisor/session.ts), so add a changeset entry for
`@trigger.dev/core`: run pnpm run changeset:add, select the `@trigger.dev/core`
package, describe the change and bump the appropriate version, then commit the
generated changeset file alongside your PR so the release tooling will include
this package.

---

Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 81-85: Add a zod superRefine to the environment schema to enforce
that when COMPUTE_GATEWAY_URL is defined, COMPUTE_GATEWAY_AUTH_TOKEN must also
be provided: update the env schema around the COMPUTE_GATEWAY_URL and
COMPUTE_GATEWAY_AUTH_TOKEN entries to call superRefine (similar to the
snapshots/metadata validation) and add an error on the token field when URL is
set but token is missing; keep behavior unchanged if URL is undefined to support
unauthenticated gateways.

In `@apps/supervisor/src/index.ts`:
- Around line 229-231: The runner ID construction is brittle: replace("run_",
"") and slice(-8) assume formats; update code around runIdShort,
checkpointSuffix, and runnerId to be defensive by (1) deriving runIdShort using
a check like message.run.friendlyId.startsWith("run_") ?
message.run.friendlyId.substring(4) : message.run.friendlyId (or use a regex to
extract the suffix) and (2) computing checkpointSuffix by using
checkpoint.id.length >= 8 ? checkpoint.id.slice(-8) : checkpoint.id (or another
safe fallback) so you never slice past bounds; also add a short comment above
the logic documenting the expected friendlyId and checkpoint.id formats.

In `@apps/supervisor/src/workloadManager/compute.ts`:
- Around line 72-78: The create() method duplicates header construction that
already exists in the private getter authHeaders; replace the inline headers
block in create() with a single reference to this.authHeaders so it reuses the
centralized logic (similar to snapshot, deleteInstance, restore). Locate
create() in compute.ts, remove the manual headers Record<string,string> and the
conditional Authorization assignment, and use this.authHeaders wherever headers
are needed to avoid duplication.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 454-499: The handler currently awaits submitSuspendCompletion and
deleteInstance before calling reply.empty(200), which can block the gateway and
cause duplicate callbacks; change flow to validate the incoming body, call
reply.empty(200) immediately, then process the suspend completion asynchronously
(e.g. spawn a background task/Promise) that calls
this.workerClient.submitSuspendCompletion(runId, snapshotFriendlyId, ...) and
this.computeManager.deleteInstance(body.instance_id) without blocking the
response; ensure the background task wraps calls in try/catch, logs errors,
makes submitSuspendCompletion idempotent for the same (runId, snapshotId) pair,
and makes deleteInstance tolerant of already-deleted instances (or check
existence before deleting) so retries are safe.

In `@packages/cli-v3/src/deploy/buildImage.ts`:
- Around line 1128-1153: The conditional that chooses "type=docker" relies on a
truthy check (isLocalBuild && load && !push) which allows load to be undefined;
change the guard to explicitly test load === true so the branch only triggers
when load is explicitly true. Locate the build function parameters
(isLocalBuild, load, push, imageTag) and update the if condition (and any
related comment if helpful) so it reads isLocalBuild && load === true && !push,
keeping the imageTag handling and returned outputOptions logic unchanged.

In `@packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts`:
- Line 18: Extract the inline timing shape into a shared exported type (e.g.,
export type DequeueTiming = { dequeueResponseMs: number; pollingIntervalMs:
number; }) in a central types file, then replace the inline occurrences with
that type in the function signature for onDequeue (currently onDequeue:
(messages: WorkerApiDequeueResponseBody, timing?: { dequeueResponseMs: number;
pollingIntervalMs: number }) => Promise<void>), and update the other files
referencing the same shape (consumerPool.ts and session.ts) to import and use
DequeueTiming to avoid duplication across the dequeue chain.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a70546 and 5089bba.

📒 Files selected for processing (11)
  • .changeset/fix-local-build-load.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • apps/supervisor/src/workloadManager/types.ts
  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/index.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/workloadManager/compute.ts
  • apps/supervisor/src/env.ts
{packages,integrations}/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Add a changeset when modifying any public package in packages/* or integrations/* using pnpm run changeset:add

Files:

  • packages/core/src/v3/runEngineWorker/supervisor/events.ts
  • packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts
  • packages/core/src/v3/runEngineWorker/supervisor/session.ts
  • packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts
  • packages/cli-v3/src/deploy/buildImage.ts
🧠 Learnings (9)
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to {packages,integrations}/**/* : Add a changeset when modifying any public package in `packages/*` or `integrations/*` using `pnpm run changeset:add`

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Use build extensions in trigger.config.ts (additionalFiles, additionalPackages, aptGet, prismaExtension, etc.) to customize the build

Applied to files:

  • .changeset/fix-local-build-load.md
  • packages/cli-v3/src/deploy/buildImage.ts
  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure build process in trigger.config.ts using `build` object with external packages, extensions, and JSX settings

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Run `npx trigger.devlatest dev` to start the Trigger.dev development server

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-14T19:24:39.536Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 2685
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.settings/route.tsx:1234-1257
Timestamp: 2025-11-14T19:24:39.536Z
Learning: In the trigger.dev project, version validation for the `useNativeBuildServer` setting cannot be performed at the settings form level because the SDK version is only known at build/deployment time, not when saving project settings.

Applied to files:

  • .changeset/fix-local-build-load.md
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Configure OpenTelemetry instrumentations and exporters in trigger.config.ts for enhanced logging

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-06-04T16:02:22.957Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2150
File: apps/supervisor/src/workloadManager/docker.ts:115-115
Timestamp: 2025-06-04T16:02:22.957Z
Learning: In the Trigger.dev codebase, the supervisor component uses DOCKER_ENFORCE_MACHINE_PRESETS while the docker provider component uses ENFORCE_MACHINE_PRESETS. These are separate components with separate environment variable configurations for the same logical concept of enforcing machine presets.

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (4)
packages/core/src/v3/runEngineWorker/supervisor/session.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
apps/supervisor/src/index.ts (3)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/serverOnly/checkpointClient.ts (1)
  • CheckpointClient (16-122)
apps/supervisor/src/env.ts (1)
  • env (164-164)
packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • WorkerApiDequeueResponseBody (73-73)
  • WorkerApiDequeueResponseBody (74-74)
packages/core/src/v3/runEngineWorker/supervisor/types.ts (1)
  • PreDequeueFn (12-15)
apps/supervisor/src/workloadManager/compute.ts (2)
apps/supervisor/src/workloadManager/types.ts (3)
  • WorkloadManagerOptions (3-14)
  • WorkloadManager (16-18)
  • WorkloadManagerCreateOptions (20-42)
apps/supervisor/src/util.ts (1)
  • getRunnerId (17-25)
🔇 Additional comments (13)
packages/cli-v3/src/deploy/buildImage.ts (2)

201-209: LGTM.

isLocalBuild: false correctly gates the new type=docker path out of remote builds. Since push: true is hardcoded and depot uses --save, the load option is intentionally a no-op here.


535-543: LGTM.

load and push are already resolved to boolean via shouldLoad/shouldPush, so the truthy check in getOutputOptions is reliable here.

.changeset/fix-local-build-load.md (1)

1-5: LGTM!

Changeset correctly documents a patch-level fix for the --load flag issue.

apps/supervisor/src/env.ts (1)

154-162: Good conditional validation.

The superRefine correctly enforces TRIGGER_METADATA_URL as a dependency when COMPUTE_SNAPSHOTS_ENABLED is true. The error is scoped to the right path for clear diagnostics.

packages/core/src/v3/runEngineWorker/supervisor/events.ts (1)

9-10: LGTM!

Optional timing fields are cleanly added to the event type, maintaining backward compatibility.

apps/supervisor/src/workloadManager/types.ts (1)

27-30: LGTM! Clean addition of timing context fields for observability.

Nit: The file uses interface throughout, but coding guidelines prefer type. Since this is pre-existing, not worth changing in this PR.

packages/core/src/v3/runEngineWorker/supervisor/consumerPool.ts (1)

354-359: LGTM!

Timing parameter is correctly threaded through the consumer pool's interceptor to the original handler.

apps/supervisor/src/index.ts (2)

282-284: Good use of performance.now() for high-resolution timing.

Clean pattern for measuring warm-start duration.


83-98: Compute mode initialization is clean.

The triple assignment pattern (create → store as computeManager → assign as workloadManager) is clear. The isComputeMode flag derived from the URL's presence is a good single source of truth.

packages/core/src/v3/runEngineWorker/supervisor/queueConsumer.ts (2)

116-127: LGTM! Timing instrumentation is correctly scoped.

dequeueResponseMs measures only the HTTP call, and pollingIntervalMs captures the interval that scheduled this dequeue cycle. The lastScheduledIntervalMs tracking across scheduleNextDequeue → next dequeue() call is sound.


33-33: lastScheduledIntervalMs tracking is correct.

Initialized to idleIntervalMs (first poll starts from idle state), then updated in scheduleNextDequeue before the callback fires. The value read in the next dequeue() cycle accurately reflects the delay used.

Also applies to: 42-42, 148-148

apps/supervisor/src/workloadManager/compute.ts (2)

120-120: Memory field name and unit differ between create and restore — verify gateway API contracts.

  • create (Line 120): memory_gb: opts.machine.memory — sends machine memory as-is, using field name memory_gb
  • restore (Line 273): memory_mb: opts.machine.memory * 1024 — multiplies by 1024, using field name memory_mb

If opts.machine.memory is in GiB (e.g., 2 = 2 GiB), the create endpoint receives 2 while the restore endpoint receives 2048. Confirm this asymmetry is intentional per the gateway's API schemas for /api/instances vs /api/snapshots/{id}/restore.

Also applies to: 273-273


9-9: No action required. The import is correct.

import { tryCatch } from "@trigger.dev/core" is a valid import. The root index.ts explicitly re-exports tryCatch from utils.ts, making it an intentional root-level export. This does not violate the subpath-only guideline.

…nabled

Remove the silent `localhost` fallback for the snapshot callback URL,
which would be unreachable from external compute gateways. Add env
validation and a runtime guard matching the existing metadata URL pattern.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
apps/supervisor/src/env.ts (1)

154-169: Good cross-field validation — consider also requiring COMPUTE_GATEWAY_URL.

The superRefine correctly gates TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN behind COMPUTE_SNAPSHOTS_ENABLED. However, COMPUTE_GATEWAY_URL is not validated here. If someone sets COMPUTE_SNAPSHOTS_ENABLED=true without COMPUTE_GATEWAY_URL, the env validation passes (requiring the metadata/domain vars) but no ComputeWorkloadManager is instantiated — so snapshots never actually work, with no warning at startup.

Adding a third check would make the fail-fast behavior consistent:

Suggested addition
   .superRefine((data, ctx) => {
+    if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.COMPUTE_GATEWAY_URL) {
+      ctx.addIssue({
+        code: z.ZodIssueCode.custom,
+        message: "COMPUTE_GATEWAY_URL is required when COMPUTE_SNAPSHOTS_ENABLED is true",
+        path: ["COMPUTE_GATEWAY_URL"],
+      });
+    }
     if (data.COMPUTE_SNAPSHOTS_ENABLED && !data.TRIGGER_METADATA_URL) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/env.ts` around lines 154 - 169, The superRefine validator
currently requires TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.
apps/supervisor/src/workloadServer/index.ts (1)

439-507: Callback handler logic is sound; one defensive improvement to consider.

The completed/failed branching, metadata validation, and suspend-completion submission all look correct. One thing to note: if submitSuspendCompletion succeeds but deleteInstance fails (Line 478), the failure is silently swallowed — deleteInstance returns a boolean and logs internally, but the callback handler doesn't log the deletion outcome. This could make debugging orphaned instances harder.

Optional: log deletion failure
-            await this.computeManager?.deleteInstance(body.instance_id);
+            const deleted = await this.computeManager?.deleteInstance(body.instance_id);
+            if (deleted === false) {
+              this.logger.warn("Failed to delete compute instance after suspend completion", {
+                runId,
+                instanceId: body.instance_id,
+              });
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 439 - 507, The
handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 266-273: The error response in the branch where
this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED and
env.TRIGGER_WORKLOAD_API_DOMAIN is missing returns { error: "..." } which
doesn't match WorkloadSuspendRunResponseBody; change the reply.json call in that
branch (the block referencing this.computeManager,
env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return the same shape as the
other handlers: { ok: false, error: "TRIGGER_WORKLOAD_API_DOMAIN is not set,
cannot create snapshot callback URL" } (or equivalent message) so the response
conforms to WorkloadSuspendRunResponseBody and the runner can parse it.

---

Nitpick comments:
In `@apps/supervisor/src/env.ts`:
- Around line 154-169: The superRefine validator currently requires
TRIGGER_METADATA_URL and TRIGGER_WORKLOAD_API_DOMAIN when
COMPUTE_SNAPSHOTS_ENABLED is true but misses COMPUTE_GATEWAY_URL; add a third
conditional inside superRefine checking if data.COMPUTE_SNAPSHOTS_ENABLED &&
!data.COMPUTE_GATEWAY_URL and call ctx.addIssue with code:
z.ZodIssueCode.custom, a message like "COMPUTE_GATEWAY_URL is required when
COMPUTE_SNAPSHOTS_ENABLED is true", and path: ["COMPUTE_GATEWAY_URL"] so
validation fails fast and surfaces the missing gateway variable.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 439-507: The handler currently calls await
this.computeManager?.deleteInstance(body.instance_id) without checking the
boolean result; update the completed branch after submitSuspendCompletion to
capture the deleteOutcome = await
this.computeManager?.deleteInstance(body.instance_id) and then log a
warning/error via this.logger.warn or this.logger.error if deleteOutcome is
falsy (include runId, instanceId, snapshotFriendlyId and any contextual info),
otherwise log successful deletion—this ensures failures in
computeManager.deleteInstance are visible; reference submitSuspendCompletion,
deleteInstance, computeManager, and this.logger when making the change.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5089bba and 7ed9221.

📒 Files selected for processing (2)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/workloadServer/index.ts
  • apps/supervisor/src/env.ts
🧠 Learnings (4)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
📚 Learning: 2026-02-23T12:56:44.997Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:44.997Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
🧬 Code graph analysis (2)
apps/supervisor/src/workloadServer/index.ts (2)
apps/supervisor/src/workloadManager/compute.ts (1)
  • ComputeWorkloadManager (17-311)
packages/core/src/v3/runEngineWorker/workload/schemas.ts (2)
  • WorkloadSuspendRunResponseBody (22-30)
  • WorkloadSuspendRunResponseBody (31-31)
apps/supervisor/src/env.ts (2)
packages/core/src/v3/utils/crypto.ts (1)
  • randomUUID (1-5)
apps/supervisor/src/envUtil.ts (2)
  • BoolEnv (15-17)
  • AdditionalEnvVars (19-47)
🔇 Additional comments (4)
apps/supervisor/src/env.ts (1)

81-85: LGTM — new compute settings are well-structured.

Clean schema additions with sensible defaults (30_000 ms timeout, snapshots off by default).

apps/supervisor/src/workloadServer/index.ts (3)

27-27: LGTM — type-only import for ComputeWorkloadManager.

Correctly uses import type to avoid runtime dependency on the compute module.


57-63: LGTM — ComputeSnapshotCallbackBody schema is well-defined.

Uses z.object with appropriate optional fields and a bounded string enum for status.


275-294: Fire-and-forget pattern looks correct for the compute snapshot path.

The 202 is returned before the async snapshot() call, and the failure is logged. Recovery for failed snapshots is acknowledged as a future enhancement per prior discussion.

Comment on lines +266 to +273
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Response at Line 271 doesn't conform to WorkloadSuspendRunResponseBody.

The other error responses in this handler (Lines 255–262, 297–304) correctly return { ok: false, error: "..." } satisfies WorkloadSuspendRunResponseBody. This branch returns { error: "..." } without the ok discriminator, so the runner may fail to parse the response.

Proposed fix
-              reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
+              reply.json(
+                {
+                  ok: false,
+                  error: "Snapshot callbacks not configured",
+                } satisfies WorkloadSuspendRunResponseBody,
+                false,
+                500
+              );
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json({ error: "Snapshot callbacks not configured" }, false, 500);
return;
}
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
if (!env.TRIGGER_WORKLOAD_API_DOMAIN) {
this.logger.error(
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL"
);
reply.json(
{
ok: false,
error: "Snapshot callbacks not configured",
} satisfies WorkloadSuspendRunResponseBody,
false,
500
);
return;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 266 - 273, The
error response in the branch where this.computeManager &&
env.COMPUTE_SNAPSHOTS_ENABLED and env.TRIGGER_WORKLOAD_API_DOMAIN is missing
returns { error: "..." } which doesn't match WorkloadSuspendRunResponseBody;
change the reply.json call in that branch (the block referencing
this.computeManager, env.TRIGGER_WORKLOAD_API_DOMAIN and reply.json) to return
the same shape as the other handlers: { ok: false, error:
"TRIGGER_WORKLOAD_API_DOMAIN is not set, cannot create snapshot callback URL" }
(or equivalent message) so the response conforms to
WorkloadSuspendRunResponseBody and the runner can parse it.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/supervisor/src/index.ts`:
- Around line 430-433: Change shutdown ordering to quiesce and drain the worker
session before closing the workload API: in the async stop() method, first call
and await a worker-session quiesce/drain operation (e.g.,
workerSession.quiesce() or workerSession.drainPendingWork()) to stop intake and
let in-flight runners/callbacks complete, then call await
this.workloadServer.stop(), and finally call await this.workerSession.stop() if
you need to fully teardown session state; if workerSession lacks a quiesce/drain
API, add one (e.g., WorkerSession.quiesce/ drainPendingWork) that stops new
intake, waits for active tasks/callbacks to finish, and resolves before
workloadServer.stop() is invoked.

In `@apps/supervisor/src/services/timerWheel.ts`:
- Around line 43-50: The delay-to-slot conversion in the TimerWheel constructor
(setting this.delaySlots) currently uses Math.round(opts.delayMs / TICK_MS)
which can undershoot the requested delay; change it to Math.ceil(opts.delayMs /
TICK_MS) so delays never shorten, and search the file for any other uses of
Math.round(... / TICK_MS) (e.g., later scheduling/adjustment logic around
delaySlots) and replace those with Math.ceil as well to keep consistent, correct
rounding behavior.
- Around line 22-24: The onExpire callback should allow async handlers and be
awaited with error handling: change TimerWheelOptions.onExpire signature to
return Promise<void> (or void | Promise<void>), then where you invoke onExpire
for a TimerWheelItem<T> (the current expiration path that deletes the entry
before calling onExpire), await the call inside a try/catch; if it throws, catch
the error, log it, and re-insert/re-schedule the same TimerWheelItem back into
the wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.

In `@apps/supervisor/src/workloadServer/index.ts`:
- Around line 478-523: The handler currently always calls reply.empty(200) even
when this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.
- Around line 752-760: When stopping the snapshot delay wheel on shutdown
(this.snapshotDelayWheel?.stop()) we currently log and discard the returned
pending items, which loses pending suspend/snapshot work after we've already
responded 202; instead iterate the returned remaining items and either
re-enqueue them or invoke the same expiration handler used by the wheel so each
pending waitpoint still triggers the suspend/snapshot workflow (i.e. call the
TimerWheel onExpire handler or the snapshot processing function for each
item.key/item), or persist them for resume — do not drop them. Ensure you
reference snapshotDelayWheel, TimerWheel.stop(), onExpire() and the remaining
array when making the change so pending entries are handled the same way as when
they naturally expire.
- Around line 117-123: The TimerWheel's onExpire handler is calling the async
dispatchComputeSnapshot() synchronously so its returned promise can reject
unhandled; update the onExpire callback passed to TimerWheel
(snapshotDelayWheel) to call dispatchComputeSnapshot(item.data) and attach
explicit error handling (e.g., .catch(...) to log/report failures) so any
rejection is caught and does not become an unhandled promise rejection;
reference TimerWheel, TimerWheel.tick, snapshotDelayWheel, and
dispatchComputeSnapshot when locating the handler to wrap the call with a
rejection handler.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1d829bf2-d50e-47e1-9d3f-417818c8b908

📥 Commits

Reviewing files that changed from the base of the PR and between 0531a23 and 80b62d4.

📒 Files selected for processing (5)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/workloadServer/index.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Cloudflare Workers
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: In TypeScript SDK usage, always import from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or use deprecated client.defineJob
Import from @trigger.dev/core subpaths only, never from the root
Use the Run Engine 2.0 (@internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
apps/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

When modifying only server components (apps/webapp/, apps/supervisor/, etc.) with no package changes, add a .server-changes/ file instead of a changeset

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/workloadServer/index.ts
apps/supervisor/src/services/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Core service logic should be organized in the src/services/ directory

Files:

  • apps/supervisor/src/services/timerWheel.ts
  • apps/supervisor/src/services/timerWheel.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Tests should avoid mocks or stubs and use the helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for running unit tests

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx,js}: Use vitest exclusively for testing - never mock anything, use testcontainers instead
Place test files next to source files with .test.ts naming convention (e.g., MyService.ts -> MyService.test.ts)
Test files using Redis or PostgreSQL should use testcontainers helpers (redisTest, postgresTest, containerTest) instead of mocks

Files:

  • apps/supervisor/src/services/timerWheel.test.ts
apps/supervisor/src/env.ts

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

Environment configuration should be defined in src/env.ts

Files:

  • apps/supervisor/src/env.ts
apps/supervisor/src/workloadServer/**/*.{js,ts}

📄 CodeRabbit inference engine (apps/supervisor/CLAUDE.md)

HTTP server for workload communication (heartbeats, snapshots) should be implemented in src/workloadServer/

Files:

  • apps/supervisor/src/workloadServer/index.ts
🧠 Learnings (13)
📓 Common learnings
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`
📚 Learning: 2026-03-03T13:07:33.177Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3166
File: internal-packages/run-engine/src/batch-queue/tests/index.test.ts:711-713
Timestamp: 2026-03-03T13:07:33.177Z
Learning: In `internal-packages/run-engine/src/batch-queue/tests/index.test.ts`, test assertions for rate limiter stubs can use `toBeGreaterThanOrEqual` rather than exact equality (`toBe`) because the consumer loop may call the rate limiter during empty pops in addition to actual item processing, and this over-calling is acceptable in integration tests.

Applied to files:

  • apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-03-02T12:43:25.254Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: internal-packages/run-engine/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:25.254Z
Learning: Applies to internal-packages/run-engine/src/engine/tests/**/*.test.ts : Implement tests for RunEngine in `src/engine/tests/` using testcontainers for Redis and PostgreSQL containerization

Applied to files:

  • apps/supervisor/src/services/timerWheel.test.ts
📚 Learning: 2026-02-23T12:56:51.147Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3114
File: apps/supervisor/src/index.ts:226-251
Timestamp: 2026-02-23T12:56:51.147Z
Learning: In the supervisor compute restore flow (apps/supervisor/src/index.ts), the run engine handles retries for failed restore operations, so runs won't get permanently stuck even if computeManager.restore returns false or throws an exception.

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadManager/**/*.{js,ts} : Container orchestration abstraction (Docker or Kubernetes) should be implemented in `src/workloadManager/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/workloadServer/**/*.{js,ts} : HTTP server for workload communication (heartbeats, snapshots) should be implemented in `src/workloadServer/`

Applied to files:

  • apps/supervisor/src/index.ts
  • apps/supervisor/src/workloadServer/index.ts
📚 Learning: 2026-03-02T12:42:41.110Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:41.110Z
Learning: Applies to **/*.{ts,tsx} : Use the Run Engine 2.0 (internal/run-engine) and redis-worker for all new work, not legacy V1 MarQS queue or deprecated V1 functions

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:56.114Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/webapp/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:56.114Z
Learning: All run lifecycle operations (triggering, completing, cancelling, etc.) should be performed through the singleton run engine instance in `app/v3/runEngine.server.ts` via service calls

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: WorkloadManager should abstract Docker vs Kubernetes execution to allow switching between container orchestration platforms

Applied to files:

  • apps/supervisor/src/index.ts
📚 Learning: 2026-03-02T12:42:47.652Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: apps/supervisor/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:42:47.652Z
Learning: Applies to apps/supervisor/src/env.ts : Environment configuration should be defined in `src/env.ts`

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-02-04T16:34:48.876Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/vercel.connect.tsx:13-27
Timestamp: 2026-02-04T16:34:48.876Z
Learning: In apps/webapp/app/routes/vercel.connect.tsx, configurationId may be absent for "dashboard" flows but must be present for "marketplace" flows. Enforce this with a Zod superRefine and pass installationId to repository methods only when configurationId is defined (omit the field otherwise).

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/supervisor/src/env.ts
📚 Learning: 2026-03-02T12:43:37.906Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: packages/core/CLAUDE.md:0-0
Timestamp: 2026-03-02T12:43:37.906Z
Learning: Exercise caution with changes to trigger.dev/core as they affect both the customer-facing SDK and server-side webapp - breaking changes can impact deployed user tasks and the platform simultaneously

Applied to files:

  • apps/supervisor/src/workloadServer/index.ts

Comment on lines 430 to 433
async stop() {
this.logger.log("Shutting down");
await this.workloadServer.stop();
await this.workerSession.stop();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Quiesce the worker session before tearing down the workload API.

workloadServer fronts heartbeats, attempt completion, suspend/continue, and compute snapshot callbacks. Closing it first can strand active runners and in-flight callbacks against a dead server while workerSession is still live. This shutdown path needs a “stop intake, drain, then close the API” order instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/index.ts` around lines 430 - 433, Change shutdown
ordering to quiesce and drain the worker session before closing the workload
API: in the async stop() method, first call and await a worker-session
quiesce/drain operation (e.g., workerSession.quiesce() or
workerSession.drainPendingWork()) to stop intake and let in-flight
runners/callbacks complete, then call await this.workloadServer.stop(), and
finally call await this.workerSession.stop() if you need to fully teardown
session state; if workerSession lacks a quiesce/drain API, add one (e.g.,
WorkerSession.quiesce/ drainPendingWork) that stops new intake, waits for active
tasks/callbacks to finish, and resolves before workloadServer.stop() is invoked.

Comment on lines +22 to +24
export type TimerWheelOptions<T> = {
/** Called when an item's delay expires. */
onExpire: (item: TimerWheelItem<T>) => void;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "timerWheel.ts" -type f

Repository: triggerdotdev/trigger.dev

Length of output: 113


🏁 Script executed:

find . -name "workloadServer" -type f

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

fd -t f timerWheel

Repository: triggerdotdev/trigger.dev

Length of output: 159


🏁 Script executed:

cat -n apps/supervisor/src/services/timerWheel.ts

Repository: triggerdotdev/trigger.dev

Length of output: 5895


🏁 Script executed:

cat -n apps/supervisor/src/workloadServer/index.ts | sed -n '115,125p;725,745p'

Repository: triggerdotdev/trigger.dev

Length of output: 1425


🏁 Script executed:

cat -n apps/supervisor/src/workloadServer/index.ts | head -200 | tail -100

Repository: triggerdotdev/trigger.dev

Length of output: 3881


🏁 Script executed:

cat -n apps/supervisor/src/services/timerWheel.test.ts | head -100

Repository: triggerdotdev/trigger.dev

Length of output: 3290


🏁 Script executed:

rg "dispatchComputeSnapshot" apps/supervisor/src/workloadServer/index.ts -A 5 -B 5

Repository: triggerdotdev/trigger.dev

Length of output: 993


🏁 Script executed:

rg "unhandledRejection|uncaughtException" apps/supervisor/src --type ts -B 3 -A 3

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

rg "catch|\.on\(" apps/supervisor/src/workloadServer/index.ts | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 522


🏁 Script executed:

git ls-files apps/supervisor/src | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 886


🏁 Script executed:

rg "process\.on\(" apps/supervisor --type ts

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

cat -n apps/supervisor/src/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 18909


Handle async onExpire callbacks explicitly.

The onExpire callback type declares () => void, but TypeScript permits assigning async functions here since Promise<void> is assignable to void. Line 157 invokes the callback without awaiting or catching, so if dispatchComputeSnapshot() (line 725, apps/supervisor/src/workloadServer/index.ts) rejects, the rejection goes unhandled. The entry is already deleted at line 156, making the delayed snapshot permanently lost.

Suggested fix
 export type TimerWheelOptions<T> = {
   /** Called when an item's delay expires. */
-  onExpire: (item: TimerWheelItem<T>) => void;
+  onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+  onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
   /** Delay in milliseconds before items fire. Clamped to [100, 60000]. */
   delayMs: number;
 };
@@
-  private onExpire: (item: TimerWheelItem<T>) => void;
+  private onExpire: (item: TimerWheelItem<T>) => void | Promise<void>;
+  private onExpireError?: (error: unknown, item: TimerWheelItem<T>) => void;
@@
     this.intervalId = null;
     this.onExpire = opts.onExpire;
+    this.onExpireError = opts.onExpireError;
     this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
   }
@@
-      this.entries.delete(key);
-      this.onExpire({ key, data: entry.data });
+      this.entries.delete(key);
+      const item = { key, data: entry.data };
+
+      try {
+        void Promise.resolve(this.onExpire(item)).catch((error) => {
+          this.onExpireError?.(error, item);
+        });
+      } catch (error) {
+        this.onExpireError?.(error, item);
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/services/timerWheel.ts` around lines 22 - 24, The
onExpire callback should allow async handlers and be awaited with error
handling: change TimerWheelOptions.onExpire signature to return Promise<void>
(or void | Promise<void>), then where you invoke onExpire for a
TimerWheelItem<T> (the current expiration path that deletes the entry before
calling onExpire), await the call inside a try/catch; if it throws, catch the
error, log it, and re-insert/re-schedule the same TimerWheelItem back into the
wheel (so delayed work like dispatchComputeSnapshot is not permanently lost)
instead of letting the rejection go unhandled after deletion.

Comment on lines +43 to +50
constructor(opts: TimerWheelOptions<T>) {
this.slots = Array.from({ length: NUM_SLOTS }, () => new Set());
this.entries = new Map();
this.cursor = 0;
this.intervalId = null;
this.onExpire = opts.onExpire;
this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use Math.ceil() for slot conversion.

Math.round(delayMs / TICK_MS) can shave almost half a tick off the requested delay. For example, a 149ms delay submitted on a tick boundary becomes a single 100ms slot, so the grace window can expire earlier than configured.

🛠️ Minimal fix
-    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(opts.delayMs / TICK_MS)));
+    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(opts.delayMs / TICK_MS)));
@@
-    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.round(delayMs / TICK_MS)));
+    this.delaySlots = Math.max(1, Math.min(NUM_SLOTS, Math.ceil(delayMs / TICK_MS)));

Also applies to: 90-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/services/timerWheel.ts` around lines 43 - 50, The
delay-to-slot conversion in the TimerWheel constructor (setting this.delaySlots)
currently uses Math.round(opts.delayMs / TICK_MS) which can undershoot the
requested delay; change it to Math.ceil(opts.delayMs / TICK_MS) so delays never
shorten, and search the file for any other uses of Math.round(... / TICK_MS)
(e.g., later scheduling/adjustment logic around delaySlots) and replace those
with Math.ceil as well to keep consistent, correct rounding behavior.

Comment on lines +117 to +123
if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
onExpire: (item) => this.dispatchComputeSnapshot(item.data),
});
this.snapshotDelayWheel.start();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Expected result:
# - timerWheel.tick() calls onExpire(...) with no await/.catch
# - workloadServer wires onExpire to async dispatchComputeSnapshot()
sed -n '141,159p' apps/supervisor/src/services/timerWheel.ts
echo
sed -n '117,123p' apps/supervisor/src/workloadServer/index.ts
echo
sed -n '721,745p' apps/supervisor/src/workloadServer/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1785


Catch timer-wheel dispatch failures before they become unhandled rejections.

TimerWheel.tick() invokes onExpire() synchronously without awaiting it, so the promise returned by dispatchComputeSnapshot() is currently dropped. Any rejection inside that async function becomes an unhandled promise rejection, which can crash the supervisor process.

🐛 Proposed fix
     if (this.computeManager && env.COMPUTE_SNAPSHOTS_ENABLED) {
       this.snapshotDelayWheel = new TimerWheel<DelayedSnapshot>({
         delayMs: env.COMPUTE_SNAPSHOT_DELAY_MS,
-        onExpire: (item) => this.dispatchComputeSnapshot(item.data),
+        onExpire: (item) => {
+          void this.dispatchComputeSnapshot(item.data).catch((error) => {
+            this.logger.error("Unhandled compute snapshot dispatch error", {
+              runId: item.data.runFriendlyId,
+              runnerId: item.data.runnerId,
+              error,
+            });
+          });
+        },
       });
       this.snapshotDelayWheel.start();
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 117 - 123, The
TimerWheel's onExpire handler is calling the async dispatchComputeSnapshot()
synchronously so its returned promise can reject unhandled; update the onExpire
callback passed to TimerWheel (snapshotDelayWheel) to call
dispatchComputeSnapshot(item.data) and attach explicit error handling (e.g.,
.catch(...) to log/report failures) so any rejection is caught and does not
become an unhandled promise rejection; reference TimerWheel, TimerWheel.tick,
snapshotDelayWheel, and dispatchComputeSnapshot when locating the handler to
wrap the call with a rejection handler.

Comment on lines +478 to +523
if (body.status === "completed") {
const result = await this.workerClient.submitSuspendCompletion({
runId,
snapshotId: snapshotFriendlyId,
body: {
success: true,
checkpoint: {
type: "KUBERNETES",
location: body.snapshot_id,
},
},
});

if (result.success) {
this.logger.info("Suspend completion submitted", {
runId,
instanceId: body.instance_id,
snapshotId: body.snapshot_id,
});
} else {
this.logger.error("Failed to submit suspend completion", {
runId,
snapshotFriendlyId,
error: result.error,
});
}
} else {
const result = await this.workerClient.submitSuspendCompletion({
runId,
snapshotId: snapshotFriendlyId,
body: {
success: false,
error: body.error ?? "Snapshot failed",
},
});

if (!result.success) {
this.logger.error("Failed to submit suspend failure", {
runId,
snapshotFriendlyId,
error: result.error,
});
}
}

reply.empty(200);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return a non-2xx when suspend completion was not recorded.

Both branches log !result.success and still reply 200. That acknowledges the compute callback even though the platform never received the suspend result, which drops the only retry signal on transient worker API failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 478 - 523, The
handler currently always calls reply.empty(200) even when
this.workerClient.submitSuspendCompletion returns result.success === false,
which hides failures and prevents retries; update the logic around the two calls
to submitSuspendCompletion (the completed branch and the failure branch) so that
if result.success is false you return a non-2xx response (e.g.,
reply.code(500).send({ error: result.error })) instead of reply.empty(200),
preserving the existing logger.error calls; ensure both places reference the
same result variable from submitSuspendCompletion and that success paths still
reply 200.

Comment on lines +752 to +760
const remaining = this.snapshotDelayWheel?.stop() ?? [];
if (remaining.length > 0) {
this.logger.info("Snapshot delay wheel stopped, dropped pending snapshots", {
count: remaining.length,
});
this.logger.debug("Dropped snapshot details", {
runs: remaining.map((item) => item.key),
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't drop delayed suspends on shutdown after already returning 202.

TimerWheel.stop() returns pending items without firing onExpire(). At this point the runner has already been told to suspend, so logging and discarding these entries means no snapshot request or failure is ever reported for that waitpoint across a supervisor restart.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/supervisor/src/workloadServer/index.ts` around lines 752 - 760, When
stopping the snapshot delay wheel on shutdown (this.snapshotDelayWheel?.stop())
we currently log and discard the returned pending items, which loses pending
suspend/snapshot work after we've already responded 202; instead iterate the
returned remaining items and either re-enqueue them or invoke the same
expiration handler used by the wheel so each pending waitpoint still triggers
the suspend/snapshot workflow (i.e. call the TimerWheel onExpire handler or the
snapshot processing function for each item.key/item), or persist them for resume
— do not drop them. Ensure you reference snapshotDelayWheel, TimerWheel.stop(),
onExpire() and the remaining array when making the change so pending entries are
handled the same way as when they naturally expire.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant