Skip to content

fix(lambda): fail-open isJobQueued — assume queued on API errors#5130

Open
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:fix/fail-open-is-job-queued
Open

fix(lambda): fail-open isJobQueued — assume queued on API errors#5130
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:fix/fail-open-is-job-queued

Conversation

@vegardx
Copy link
Copy Markdown

@vegardx vegardx commented May 26, 2026

Problem

When enable_job_queued_check = true (default), the isJobQueued call has no error handling. If getJobForWorkflowRun throws (404, rate limit, 502), the error propagates up through scaleUp and hits the generic catch in scaleUpHandler — causing the batch to fail.

This is especially problematic in combination with the race condition described in #5026: when SQS delivers messages in multiple batches due to concurrency limits, GitHub API errors during the second batch cause jobs to be silently dropped.

Fix

Wrap the isJobQueued check in a try/catch. On any error, assume the job is still queued (fail-open) and log a warning. The worst case is creating a runner for a job that has already been handled — the runner will self-terminate when no job is available.

if (enableJobQueuedCheck) {
  let jobQueued = true;
  try {
    jobQueued = await isJobQueued(githubInstallationClient, message);
  } catch (e) {
    messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });
  }
  if (!jobQueued) {
    messageLogger.info('No runner will be created, job is not queued.');
    continue;
  }
}

Changes

  • lambdas/functions/control-plane/src/scale-runners/scale-up.ts — try/catch around isJobQueued
  • lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts — test: API error → runner still created

Risk

Low — fail-open is strictly more resilient than fail-closed for job dispatch. The extra runner (if the job was already handled) self-terminates with no work.

Fixes #5026

Wrap the isJobQueued check in a try/catch that assumes the job is still
queued when the GitHub API returns an error (404, rate limit, 502). This
prevents silent job drops when the API is transiently unavailable.

Previously, any error from getJobForWorkflowRun would propagate up and
(combined with the non-ScaleError catch behavior) cause the entire SQS
batch to be dropped.

Fixes github-aws-runners#5026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes the job-queued verification in scaleUp resilient to GitHub API failures by switching the isJobQueued check to a fail-open approach, and adds a test to validate the new behavior.

Changes:

  • Wrap isJobQueued in a try/catch and proceed with scaling when the check fails (fail-open).
  • Add a GHES unit test ensuring runners are still created when isJobQueued throws.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
lambdas/functions/control-plane/src/scale-runners/scale-up.ts Makes the isJobQueued gate fail-open on errors to avoid dropping scale events during GitHub API issues.
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts Adds coverage for the new fail-open behavior when GitHub API calls fail.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +419 to 430
if (enableJobQueuedCheck) {
let jobQueued = true;
try {
jobQueued = await isJobQueued(githubInstallationClient, message);
} catch (e) {
messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });
}
if (!jobQueued) {
messageLogger.info('No runner will be created, job is not queued.');
continue;
}
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally broad. The fail-open philosophy here is that dropping a job is always worse than creating an ephemeral runner that self-terminates in ~30s when no work is available.

Specific cases:

  • 404 (job not found): GitHub's API is eventually consistent — a 404 immediately after webhook delivery is a race condition, not a definitive "job doesn't exist". Failing closed here drops the job permanently.
  • Auth/permission errors: Transient in practice (token refresh, installation permission propagation delays).
  • Unsupported event type: isJobQueued already returns false for unsupported events (the throw path is only reached on actual API call failures, not event-type mismatches).

Narrowing to specific status codes adds a maintenance surface that breaks when GitHub changes error responses. The max downside of fail-open is one extra idle runner for 30s; the max downside of fail-closed is a permanently dropped job.

try {
jobQueued = await isJobQueued(githubInstallationClient, message);
} catch (e) {
messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed in 0999e80 — now logs only { error: err.message, status: err.status } instead of the full error object.

Log only message and status instead of the full error object to avoid
leaking request/response metadata from Octokit errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

isJobQueued check races with SQS visibility timeout and label-based runner assignment, silently dropping jobs

2 participants