fix(lambda): return partial CreateFleet instances instead of discarding them #5131

Open
vegardx wants to merge 2 commits into github-aws-runners:main from vegardx:fix/partial-createfleet-return-instances

@vegardx vegardx commented May 26, 2026

Problem

When CreateFleet returns partial success (e.g. "6 of 8 instances created" plus errors), processFleetResult throws ScaleError and discards the successfully-created instance IDs. Those instances boot with no JIT config written to SSM — they are orphaned until the scale-down Lambda reaps them.

The caller (scaleUp) already handles count mismatch by marking unfulfilled messages as batch failures for SQS retry. But it never gets the chance because the throw short-circuits the entire flow.

Impact observed at batch_size ≥ 5: ~9% of jobs stuck permanently in burst tests — instances are launched, waste resources, but never pick up work.

Fix

Return partial instances when at least one was created. Only throw ScaleError when zero instances were created:

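// Partial success: keep the instances that were created.
if (failedCount > 0 && instances.length > 0) {
  logger.warn(`Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} created. Returning partial results.`);
  return instances;
}
// Nothing was created: report the full requested count as failed.
if (failedCount > 0) {
  throw new ScaleError(runnerParameters.numberOfRunners);
}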

Same logic for unrecognized errors — if instances exist, return them.
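The branching described above can be sketched in isolation. This is a minimal illustration, not the code in runners.ts: `FleetResultSketch`, `processFleetResultSketch`, and the two-entry `SCALE_ERROR_CODES` set are hypothetical names chosen for the example, and the `ScaleError` class here is a stand-in that models only the `failedInstanceCount` field mentioned in this PR.

```typescript
// Stand-in for the project's ScaleError; only the field this PR mentions is modeled.
class ScaleError extends Error {
  readonly failedInstanceCount: number;
  constructor(failedInstanceCount: number) {
    super(`Failed to create ${failedInstanceCount} instance(s)`);
    this.name = 'ScaleError';
    this.failedInstanceCount = failedInstanceCount;
  }
}

// Hypothetical, simplified view of a CreateFleet response.
interface FleetResultSketch {
  instanceIds: string[];
  errorCodes: string[];
}

// Assumption: an illustrative subset of error codes treated as scale errors.
const SCALE_ERROR_CODES = new Set(['InsufficientInstanceCapacity', 'MaxSpotInstanceCountExceeded']);

function processFleetResultSketch(fleet: FleetResultSketch, numberOfRunners: number): string[] {
  const { instanceIds, errorCodes } = fleet;

  // No errors reported: return whatever was created.
  if (errorCodes.length === 0) {
    return instanceIds;
  }

  // Partial success, recognized or not: return the created instances
  // so the caller can configure them and deal with the shortfall.
  if (instanceIds.length > 0) {
    return instanceIds;
  }

  // Zero instances: a recognized scale error carries the full requested count.
  if (errorCodes.some((code) => SCALE_ERROR_CODES.has(code))) {
    throw new ScaleError(numberOfRunners);
  }
  throw new Error('Create fleet failed with unrecognized errors.');
}
```

With this shape, a caller only sees an exception when there is nothing to configure; any non-empty result is returned even if it is shorter than requested.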

Changes

  • lambdas/functions/control-plane/src/aws/runners.ts — return partial instances, only throw on zero
  • lambdas/functions/control-plane/src/aws/runners.test.ts — new test for partial success with recognized error, updated existing tests for new ScaleError semantics

Behavior change summary

| Scenario | Before | After |
| --- | --- | --- |
| Partial success + recognized error | Throws ScaleError (orphans instances) | Returns instances, caller retries shortfall |
| Partial success + unrecognized error | Throws Error (orphans instances) | Returns instances, caller retries shortfall |
| Zero instances + recognized error | Throws ScaleError(errorCount) | Throws ScaleError(numberOfRunners) |
| Zero instances + unrecognized error | Throws Error | Throws Error (unchanged) |
| On-demand failover path | Returns combined | Returns combined (unchanged) |

Related to #5024

…ng them

When CreateFleet returns partial success (some instances created, some
errors), processFleetResult previously threw ScaleError and discarded
the successfully-created instance IDs. Those instances would boot with
no JIT config in SSM — orphaned until scale-down reaps them.

Now returns partial instances when at least one was created. The caller
(scaleUp) already handles count mismatch by marking unfulfilled messages
as batch failures for SQS retry. ScaleError is only thrown when zero
instances were created.

Also changes ScaleError to carry numberOfRunners (the full requested
count) rather than the count of matched error codes, ensuring SQS
retries the correct number of messages.
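To illustrate why the count carried by ScaleError matters on the zero-instance path, the sketch below compares the two semantics. It is an assumption-laden example: `QueuedMessageSketch` and `messagesToRetry` are hypothetical, the `ScaleError` class is a stand-in, and the real consumer of `failedInstanceCount` may select messages differently.

```typescript
// Stand-in for the project's ScaleError; only failedInstanceCount is modeled.
class ScaleError extends Error {
  readonly failedInstanceCount: number;
  constructor(failedInstanceCount: number) {
    super('Create fleet failed');
    this.name = 'ScaleError';
    this.failedInstanceCount = failedInstanceCount;
  }
}

// Hypothetical minimal message shape.
interface QueuedMessageSketch {
  messageId: string;
}

// Hypothetical helper: retry as many messages as the error reports failed.
function messagesToRetry(messages: QueuedMessageSketch[], error: ScaleError): string[] {
  const count = Math.min(error.failedInstanceCount, messages.length);
  return messages.slice(0, count).map((m) => m.messageId);
}

const batch: QueuedMessageSketch[] = [{ messageId: 'm-1' }, { messageId: 'm-2' }, { messageId: 'm-3' }];

// Scenario: three runners requested, zero created, one matched error code in the response.
const retriedWithErrorCount = messagesToRetry(batch, new ScaleError(1)); // count of matched error codes
const retriedWithRequested = messagesToRetry(batch, new ScaleError(batch.length)); // full requested count

console.log(retriedWithErrorCount.length, retriedWithRequested.length); // prints "1 3"
```

In this scenario, counting matched error codes would retry only one of the three messages, while carrying the requested count retries all three.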
@vegardx vegardx requested a review from a team as a code owner May 26, 2026 20:32
Copilot AI review requested due to automatic review settings May 26, 2026 20:32

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates fleet creation error handling to avoid throwing when some EC2 instances are successfully created, allowing partial results to be returned (to reduce “orphan” instances) while still throwing ScaleError when zero instances are created.

Changes:

  • Return created instance IDs when CreateFleet yields partial success (both recognized scale errors and unrecognized errors).
  • Throw ScaleError(numberOfRunners) when zero instances are created under recognized scaling errors.
  • Adjust/add tests to validate new “return partial instead of throw” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| lambdas/functions/control-plane/src/aws/runners.ts | Changes processFleetResult to return instances on partial success and only throw on zero-instance outcomes. |
| lambdas/functions/control-plane/src/aws/runners.test.ts | Updates expectations for ScaleError count and adds tests for partial-return behavior. |

Comment on lines +205 to +208
logger.warn(
`Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
`Returning partial results; caller will retry the shortfall via SQS.`,
{ data: fleet.Errors },
Author

Good point. Fixed in e8cdc4e — removed SQS reference, reworded to mechanism-agnostic, and added created/requested to log metadata.
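The reworded log call in e8cdc4e is not shown in this thread. As a rough sketch of what a mechanism-agnostic message with created/requested metadata could look like, assuming a hypothetical `buildPartialFleetLog` helper and payload shape (neither is taken from the repository):

```typescript
// Hypothetical payload shape; the project's logger may structure metadata differently.
interface PartialFleetLogSketch {
  message: string;
  metadata: {
    created: number;
    requested: number;
    errors: string[];
  };
}

// Hypothetical helper that builds the warning without naming a retry mechanism.
function buildPartialFleetLog(created: number, requested: number, errors: string[]): PartialFleetLogSketch {
  return {
    message:
      `Partial fleet success: ${created}/${requested} instances created. ` +
      `Returning partial results; the caller handles the shortfall.`,
    // Counts are duplicated in metadata so they can be queried without parsing the message.
    metadata: { created, requested, errors },
  };
}

const entry = buildPartialFleetLog(1, 3, ['InsufficientInstanceCapacity']);
console.log(entry.message, entry.metadata);
```

Keeping the counts in structured metadata makes it possible to filter or aggregate partial-success events by shortfall size.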

Comment on lines +204 to +211
if (instances.length > 0) {
logger.warn(
`Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
`Returning partial results; caller will retry the shortfall via SQS.`,
{ data: fleet.Errors },
);
return instances;
}
Author

The caller (scaleUp in scale-up.ts) already handles instances.length < numberOfRunners — it tracks how many instances were successfully created vs. how many SQS messages it received, and marks the shortfall as batchItemFailures. This is the existing contract; we're not changing it.

Introducing a structured return type would be a larger refactor that touches every callsite of createRunner (scale-up, pool, on-demand fallback recursion) for no behavioral benefit — the length comparison already provides the signal.
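The length comparison described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the code in scale-up.ts: `SqsMessageSketch` and `shortfallFailures` are hypothetical, and the choice of which messages to retry (here, the trailing ones) is arbitrary. The `itemIdentifier` field follows the shape Lambda expects in an SQS partial batch response.

```typescript
// Hypothetical minimal message shape.
interface SqsMessageSketch {
  messageId: string;
}

// Entry shape used in SQS partial batch responses (batchItemFailures).
interface BatchItemFailureSketch {
  itemIdentifier: string;
}

// Hypothetical helper: mark one message for retry per instance that was not created.
function shortfallFailures(messages: SqsMessageSketch[], createdInstanceIds: string[]): BatchItemFailureSketch[] {
  const shortfall = Math.max(messages.length - createdInstanceIds.length, 0);
  // Assumption: which messages are retried is arbitrary; this sketch takes the last ones.
  return messages.slice(messages.length - shortfall).map((m) => ({ itemIdentifier: m.messageId }));
}

// "6 of 8 instances created": two messages should go back to the queue.
const received: SqsMessageSketch[] = Array.from({ length: 8 }, (_, i) => ({ messageId: `m-${i + 1}` }));
const created = ['i-1', 'i-2', 'i-3', 'i-4', 'i-5', 'i-6'];

const failures = shortfallFailures(received, created);
console.log(failures.map((f) => f.itemIdentifier)); // prints [ 'm-7', 'm-8' ]
```

Because the signal is just the difference between two lengths, returning a shorter array from createRunner is enough for the caller to schedule the retries.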

Comment on lines 514 to 517
  await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).rejects.toMatchObject({
    name: 'ScaleError',
-   failedInstanceCount: 2,
+   failedInstanceCount: 1, // numberOfRunners when zero instances created
  });
Author

Agreed. Fixed in e8cdc4e — test now uses numberOfRunners: 3 so the assertion actually validates the value.

Comment on lines +549 to +553
await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).resolves.toEqual(['i-partial']);
expect(mockEC2Client).toHaveReceivedCommandWith(
CreateFleetCommand,
expectedCreateFleetRequest(defaultExpectedFleetRequestValues),
);
Author

Agreed. Fixed in e8cdc4e — test now requests 3 runners and gets 1 instance back, making it a genuine partial-success scenario.

…tests

- Reword partial-success log messages to remove SQS reference; add
  created/requested counts to log metadata
- ScaleError test now uses numberOfRunners=3 to validate failedInstanceCount
- Partial-success test now requests 3 runners and gets 1 back (truly partial)