fix(lambda): fail-open isJobQueued — assume queued on API errors by vegardx · Pull Request #5130 · github-aws-runners/terraform-aws-github-runner

vegardx · 2026-05-26T20:23:52Z

Problem

When enable_job_queued_check is true (the default for non-ephemeral runners), the isJobQueued call has no error handling. If getJobForWorkflowRun throws (404, rate limit, 502), the error propagates up through scaleUp and hits the generic catch in scaleUpHandler, causing the batch to fail.

This is especially problematic in combination with the race condition described in #5026: when SQS delivers messages in multiple batches due to concurrency limits, GitHub API errors during the second batch cause jobs to be silently dropped.

Fix

Wrap the isJobQueued check in a try/catch. On any error, assume the job is still queued (fail-open) and log a warning. The worst case is creating a runner for a job that has already been handled — the runner will self-terminate when no job is available.

if (enableJobQueuedCheck) {
  let jobQueued = true;
  try {
    jobQueued = await isJobQueued(githubInstallationClient, message);
  } catch (e) {
    messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });
  }
  if (!jobQueued) {
    messageLogger.info('No runner will be created, job is not queued.');
    continue;
  }
}
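
The three outcomes of this gate (queued, not queued, check failed) can be sketched in isolation. This is a minimal illustration, not the code in scale-up.ts: `jobQueuedGate`, `QueuedCheck`, and `GateResult` are hypothetical names, and the check is a plain async function rather than a call through the Octokit installation client.

```typescript
// Hypothetical stand-in for `isJobQueued(githubInstallationClient, message)`.
type QueuedCheck = () => Promise<boolean>;

// Hypothetical result shape, used here only to make the outcome observable.
interface GateResult {
  createRunner: boolean;
  warning?: string;
}

// Fail-open gate: only an explicit `false` from the check blocks runner creation.
async function jobQueuedGate(enabled: boolean, check: QueuedCheck): Promise<GateResult> {
  if (!enabled) {
    return { createRunner: true };
  }
  let jobQueued = true; // value kept when the check throws
  let warning: string | undefined;
  try {
    jobQueued = await check();
  } catch (e) {
    const reason = e instanceof Error ? e.message : String(e);
    warning = `isJobQueued check failed, assuming job is still queued: ${reason}`;
  }
  return { createRunner: jobQueued, warning };
}
```

Under these assumptions, a check that resolves to `false` is the only path that skips the runner; a rejected check yields `createRunner: true` together with a warning string.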

Changes

lambdas/functions/control-plane/src/scale-runners/scale-up.ts — try/catch around isJobQueued
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts — test: API error → runner still created

Risk

Low — fail-open is strictly more resilient than fail-closed for job dispatch. The extra runner (if the job was already handled) self-terminates with no work.

Fixes #5026

Wrap the isJobQueued check in a try/catch that assumes the job is still queued when the GitHub API returns an error (404, rate limit, 502). This prevents silent job drops when the API is transiently unavailable. Previously, any error from getJobForWorkflowRun would propagate up and (combined with the non-ScaleError catch behavior) cause the entire SQS batch to be dropped.

Fixes github-aws-runners#5026

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes the job-queued verification in scaleUp resilient to GitHub API failures by switching the isJobQueued check to a fail-open approach, and adds a test to validate the new behavior.

Changes:

Wrap isJobQueued in a try/catch and proceed with scaling when the check fails (fail-open).
Add a GHES unit test ensuring runners are still created when isJobQueued throws.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
lambdas/functions/control-plane/src/scale-runners/scale-up.ts	Makes the `isJobQueued` gate fail-open on errors to avoid dropping scale events during GitHub API issues.
lambdas/functions/control-plane/src/scale-runners/scale-up.test.ts	Adds coverage for the new fail-open behavior when GitHub API calls fail.


vegardx · 2026-05-26T20:58:53Z

+      if (enableJobQueuedCheck) {
+        let jobQueued = true;
+        try {
+          jobQueued = await isJobQueued(githubInstallationClient, message);
+        } catch (e) {
+          messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });
+        }
+        if (!jobQueued) {
+          messageLogger.info('No runner will be created, job is not queued.');
+          continue;
+        }
      }


Intentionally broad. The fail-open philosophy here is that dropping a job is always worse than creating an ephemeral runner that self-terminates in ~30s when no work is available.

Specific cases:

404 (job not found): GitHub's API is eventually consistent — a 404 immediately after webhook delivery is a race condition, not a definitive "job doesn't exist". Failing closed here drops the job permanently.

Auth/permission errors: Transient in practice (token refresh, installation permission propagation delays).

Unsupported event type: isJobQueued already returns false for unsupported events (the throw path is only reached on actual API call failures, not event-type mismatches).

Narrowing to specific status codes adds a maintenance surface that breaks when GitHub changes error responses. The max downside of fail-open is one extra idle runner for 30s; the max downside of fail-closed is a permanently dropped job.
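
For contrast, the narrowed alternative argued against above might look roughly like the sketch below. It is not part of this PR: the allow-list `TRANSIENT_STATUSES` and the function names are hypothetical, and the chosen status codes are only an example of what such a list could contain.

```typescript
// Hypothetical allow-list of statuses treated as transient. Not part of the PR.
const TRANSIENT_STATUSES = new Set<number>([404, 429, 502, 503]);

type Decision = 'assume-queued' | 'rethrow';

// Reads a numeric `status` property if the thrown value has one.
function statusOf(e: unknown): number | undefined {
  if (typeof e !== 'object' || e === null) {
    return undefined;
  }
  const status = (e as { status?: unknown }).status;
  return typeof status === 'number' ? status : undefined;
}

// Narrowed variant: fail open only for allow-listed statuses.
function narrowedDecision(e: unknown): Decision {
  const status = statusOf(e);
  return status !== undefined && TRANSIENT_STATUSES.has(status) ? 'assume-queued' : 'rethrow';
}

// Broad variant, matching the behavior described in this PR: every error fails open.
function broadDecision(_e: unknown): Decision {
  return 'assume-queued';
}
```

With this particular list, an error carrying status 403 (which GitHub can also return when a rate limit is exceeded) or a network error with no status at all would be rethrown by `narrowedDecision`, so the list would need to be kept in step with every failure mode that should not drop a job.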

vegardx · 2026-05-26T20:58:41Z

+        try {
+          jobQueued = await isJobQueued(githubInstallationClient, message);
+        } catch (e) {
+          messageLogger.warn('isJobQueued check failed, assuming job is still queued (fail-open)', { error: e });


Good point. Fixed in 0999e80 — now logs only { error: err.message, status: err.status } instead of the full error object.

Log only message and status instead of the full error object to avoid leaking request/response metadata from Octokit errors.
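
A minimal sketch of that reduction, assuming the caught value may be anything (`unknown`) and that Octokit-style errors expose a numeric `status` next to `message`. The helper name `toSafeLogFields` and the `SafeErrorFields` shape are hypothetical; the commit may simply inline the two fields.

```typescript
// Hypothetical shape of the fields passed to the logger.
interface SafeErrorFields {
  error: string;
  status?: number;
}

// Keep only the message and (if present) the numeric status of a thrown value.
// Anything else on the object, such as request or response metadata, is left out.
function toSafeLogFields(e: unknown): SafeErrorFields {
  const error = e instanceof Error ? e.message : String(e);
  const status = typeof e === 'object' && e !== null ? (e as { status?: unknown }).status : undefined;
  return typeof status === 'number' ? { error, status } : { error };
}

// Example with a fabricated error object that mimics extra request metadata.
const fakeApiError = Object.assign(new Error('Not Found'), {
  status: 404,
  request: { headers: { authorization: 'token example-value' } },
});
const fields = toSafeLogFields(fakeApiError); // { error: 'Not Found', status: 404 }
```

The result would then be passed as the second argument of `messageLogger.warn(...)` in place of `{ error: e }`.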

Copilot AI review requested due to automatic review settings May 26, 2026 20:23

vegardx requested a review from a team as a code owner May 26, 2026 20:23

vegardx mentioned this pull request May 26, 2026

isJobQueued check races with SQS visibility timeout and label-based runner assignment, silently dropping jobs #5026

Open

Copilot AI reviewed May 26, 2026


refactor: sanitize error object in isJobQueued warning log

0999e80


vegardx wants to merge 2 commits into github-aws-runners:main from vegardx:fix/fail-open-is-job-queued
