fix(lambda): fail-open isJobQueued — assume queued on API errors#5130
fix(lambda): fail-open isJobQueued — assume queued on API errors#5130vegardx wants to merge 2 commits into
Conversation
Wrap the isJobQueued check in a try/catch that assumes the job is still queued when the GitHub API returns an error (404, rate limit, 502). This prevents silent job drops when the API is transiently unavailable. Previously, any error from getJobForWorkflowRun would propagate up and (combined with the non-ScaleError catch behavior) cause the entire SQS batch to be dropped. Fixes github-aws-runners#5026
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR makes the job-queued verification in scaleUp resilient to GitHub API failures by switching the isJobQueued check to a fail-open approach, and adds a test to validate the new behavior.
Changes:
- Wrap
isJobQueuedin a try/catch and proceed with scaling when the check fails (fail-open). - Add a GHES unit test ensuring runners are still created when
isJobQueuedthrows.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/scale-runners/scale-up.ts | Makes the isJobQueued gate fail-open on errors to avoid dropping scale events during GitHub API issues. |
| lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts | Adds coverage for the new fail-open behavior when GitHub API calls fail. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (enableJobQueuedCheck) { | ||
| let jobQueued = true; | ||
| try { | ||
| jobQueued = await isJobQueued(githubInstallationClient, message); | ||
| } catch (e) { | ||
| messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e }); | ||
| } | ||
| if (!jobQueued) { | ||
| messageLogger.info('No runner will be created, job is not queued.'); | ||
| continue; | ||
| } | ||
| } |
There was a problem hiding this comment.
Intentionally broad. The fail-open philosophy here is that dropping a job is always worse than creating an ephemeral runner that self-terminates in ~30s when no work is available.
Specific cases:
- 404 (job not found): GitHub's API is eventually consistent — a 404 immediately after webhook delivery is a race condition, not a definitive "job doesn't exist". Failing closed here drops the job permanently.
- Auth/permission errors: Transient in practice (token refresh, installation permission propagation delays).
- Unsupported event type:
isJobQueuedalready returnsfalsefor unsupported events (the throw path is only reached on actual API call failures, not event-type mismatches).
Narrowing to specific status codes adds a maintenance surface that breaks when GitHub changes error responses. The max downside of fail-open is one extra idle runner for 30s; the max downside of fail-closed is a permanently dropped job.
| try { | ||
| jobQueued = await isJobQueued(githubInstallationClient, message); | ||
| } catch (e) { | ||
| messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e }); |
There was a problem hiding this comment.
Good point. Fixed in 0999e80 — now logs only { error: err.message, status: err.status } instead of the full error object.
Log only message and status instead of the full error object to avoid leaking request/response metadata from Octokit errors.
Problem
When
enable_job_queued_check = true(default), theisJobQueuedcall has no error handling. IfgetJobForWorkflowRunthrows (404, rate limit, 502), the error propagates up throughscaleUpand hits the generic catch inscaleUpHandler— causing the batch to fail.This is especially problematic in combination with the race condition described in #5026: when SQS delivers messages in multiple batches due to concurrency limits, GitHub API errors during the second batch cause jobs to be silently dropped.
Fix
Wrap the
isJobQueuedcheck in a try/catch. On any error, assume the job is still queued (fail-open) and log a warning. The worst case is creating a runner for a job that has already been handled — the runner will self-terminate when no job is available.Changes
lambdas/functions/control-plane/src/scale-runners/scale-up.ts— try/catch around isJobQueuedlambdas/functions/control-plane/src/scale-runners/scale-up.test.ts— test: API error → runner still createdRisk
Low — fail-open is strictly more resilient than fail-closed for job dispatch. The extra runner (if the job was already handled) self-terminates with no work.
Fixes #5026