fix(lambda): return partial CreateFleet instances instead of discarding them by vegardx · Pull Request #5131 · github-aws-runners/terraform-aws-github-runner

vegardx · 2026-05-26T20:32:04Z

Problem

When CreateFleet returns partial success (e.g. "6 of 8 instances created" plus errors), processFleetResult throws ScaleError and discards the successfully-created instance IDs. Those instances boot with no JIT config written to SSM — they are orphaned until the scale-down Lambda reaps them.

The caller (scaleUp) already handles count mismatch by marking unfulfilled messages as batch failures for SQS retry. But it never gets the chance because the throw short-circuits the entire flow.

Impact observed at batch_size ≥ 5: ~9% of jobs stuck permanently in burst tests — instances are launched, waste resources, but never pick up work.

Fix

Return partial instances when at least one was created. Only throw ScaleError when zero instances were created:

if (failedCount > 0 && instances.length > 0) {
  logger.warn(`Partial fleet success: ${instances.length}/${numberOfRunners} created. Returning partial results.`);
  return instances;
}
if (failedCount > 0) {
  throw new ScaleError(runnerParameters.numberOfRunners);
}

Same logic for unrecognized errors — if instances exist, return them.

Changes

lambdas/functions/control-plane/src/aws/runners.ts — return partial instances, only throw on zero
lambdas/functions/control-plane/src/aws/runners.test.ts — new test for partial success with recognized error, updated existing tests for new ScaleError semantics

Behavior change summary

Scenario	Before	After
Partial success + recognized error	Throws ScaleError (orphans instances)	Returns instances, caller retries shortfall
Partial success + unrecognized error	Throws Error (orphans instances)	Returns instances, caller retries shortfall
Zero instances + recognized error	Throws ScaleError(errorCount)	Throws ScaleError(numberOfRunners)
Zero instances + unrecognized error	Throws Error	Throws Error (unchanged)
On-demand failover path	Returns combined	Returns combined (unchanged)

Related to #5024

…ng them When CreateFleet returns partial success (some instances created, some errors), processFleetResult previously threw ScaleError and discarded the successfully-created instance IDs. Those instances would boot with no JIT config in SSM — orphaned until scale-down reaps them. Now returns partial instances when at least one was created. The caller (scaleUp) already handles count mismatch by marking unfulfilled messages as batch failures for SQS retry. ScaleError is only thrown when zero instances were created. Also changes ScaleError to carry numberOfRunners (the full requested count) rather than the count of matched error codes, ensuring SQS retries the correct number of messages.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates fleet creation error handling to avoid throwing when some EC2 instances are successfully created, allowing partial results to be returned (to reduce “orphan” instances) while still throwing ScaleError when zero instances are created.

Changes:

Return created instance IDs when CreateFleet yields partial success (both recognized scale errors and unrecognized errors).
Throw ScaleError(numberOfRunners) when zero instances are created under recognized scaling errors.
Adjust/add tests to validate new “return partial instead of throw” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
lambdas/functions/control-plane/src/aws/runners.ts	Changes `processFleetResult` to return instances on partial success and only throw on zero-instance outcomes.
lambdas/functions/control-plane/src/aws/runners.test.ts	Updates expectations for `ScaleError` count and adds tests for partial-return behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

vegardx · 2026-05-26T21:00:07Z

+      logger.warn(
+        `Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
+          `Returning partial results; caller will retry the shortfall via SQS.`,
+        { data: fleet.Errors },


Good point. Fixed in e8cdc4e — removed SQS reference, reworded to mechanism-agnostic, and added created/requested to log metadata.

vegardx · 2026-05-26T21:00:15Z

+    if (instances.length > 0) {
+      logger.warn(
+        `Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
+          `Returning partial results; caller will retry the shortfall via SQS.`,
+        { data: fleet.Errors },
+      );
+      return instances;
+    }


The caller (scaleUp in scale-up.ts) already handles instances.length < numberOfRunners — it tracks how many instances were successfully created vs. how many SQS messages it received, and marks the shortfall as batchItemFailures. This is the existing contract; we're not changing it.

Introducing a structured return type would be a larger refactor that touches every callsite of createRunner (scale-up, pool, on-demand fallback recursion) for no behavioral benefit — the length comparison already provides the signal.

vegardx · 2026-05-26T21:00:21Z

    await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).rejects.toMatchObject({
      name: 'ScaleError',
-      failedInstanceCount: 2,
+      failedInstanceCount: 1, // numberOfRunners when zero instances created
    });


Agreed. Fixed in e8cdc4e — test now uses numberOfRunners: 3 so the assertion actually validates the value.

vegardx · 2026-05-26T21:00:21Z

+    await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).resolves.toEqual(['i-partial']);
+    expect(mockEC2Client).toHaveReceivedCommandWith(
+      CreateFleetCommand,
+      expectedCreateFleetRequest(defaultExpectedFleetRequestValues),
+    );


Agreed. Fixed in e8cdc4e — test now requests 3 runners and gets 1 instance back, making it a genuine partial-success scenario.

…tests - Reword partial-success log messages to remove SQS reference; add created/requested counts to log metadata - ScaleError test now uses numberOfRunners=3 to validate failedInstanceCount - Partial-success test now requests 3 runners and gets 1 back (truly partial)

vegardx requested a review from a team as a code owner May 26, 2026 20:32

Copilot AI review requested due to automatic review settings May 26, 2026 20:32

Copilot AI reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lambda): return partial CreateFleet instances instead of discarding them#5131

fix(lambda): return partial CreateFleet instances instead of discarding them#5131
vegardx wants to merge 2 commits into
github-aws-runners:mainfrom
vegardx:fix/partial-createfleet-return-instances

vegardx commented May 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

vegardx May 26, 2026

Uh oh!

vegardx May 26, 2026

Uh oh!

vegardx May 26, 2026

Uh oh!

vegardx May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vegardx commented May 26, 2026

Problem

Fix

Changes

Behavior change summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

vegardx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

vegardx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

vegardx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

vegardx May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants