fix(lambda): return partial CreateFleet instances instead of discarding them#5131
fix(lambda): return partial CreateFleet instances instead of discarding them#5131vegardx wants to merge 2 commits into
Conversation
…ng them When CreateFleet returns partial success (some instances created, some errors), processFleetResult previously threw ScaleError and discarded the successfully-created instance IDs. Those instances would boot with no JIT config in SSM — orphaned until scale-down reaps them. Now returns partial instances when at least one was created. The caller (scaleUp) already handles count mismatch by marking unfulfilled messages as batch failures for SQS retry. ScaleError is only thrown when zero instances were created. Also changes ScaleError to carry numberOfRunners (the full requested count) rather than the count of matched error codes, ensuring SQS retries the correct number of messages.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates fleet creation error handling to avoid throwing when some EC2 instances are successfully created, allowing partial results to be returned (to reduce “orphan” instances) while still throwing ScaleError when zero instances are created.
Changes:
- Return created instance IDs when
CreateFleetyields partial success (both recognized scale errors and unrecognized errors). - Throw
ScaleError(numberOfRunners)when zero instances are created under recognized scaling errors. - Adjust/add tests to validate new “return partial instead of throw” behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lambdas/functions/control-plane/src/aws/runners.ts | Changes processFleetResult to return instances on partial success and only throw on zero-instance outcomes. |
| lambdas/functions/control-plane/src/aws/runners.test.ts | Updates expectations for ScaleError count and adds tests for partial-return behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.warn( | ||
| `Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` + | ||
| `Returning partial results; caller will retry the shortfall via SQS.`, | ||
| { data: fleet.Errors }, |
There was a problem hiding this comment.
Good point. Fixed in e8cdc4e — removed SQS reference, reworded to mechanism-agnostic, and added created/requested to log metadata.
| if (instances.length > 0) { | ||
| logger.warn( | ||
| `Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` + | ||
| `Returning partial results; caller will retry the shortfall via SQS.`, | ||
| { data: fleet.Errors }, | ||
| ); | ||
| return instances; | ||
| } |
There was a problem hiding this comment.
The caller (scaleUp in scale-up.ts) already handles instances.length < numberOfRunners — it tracks how many instances were successfully created vs. how many SQS messages it received, and marks the shortfall as batchItemFailures. This is the existing contract; we're not changing it.
Introducing a structured return type would be a larger refactor that touches every callsite of createRunner (scale-up, pool, on-demand fallback recursion) for no behavioral benefit — the length comparison already provides the signal.
| await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).rejects.toMatchObject({ | ||
| name: 'ScaleError', | ||
| failedInstanceCount: 2, | ||
| failedInstanceCount: 1, // numberOfRunners when zero instances created | ||
| }); |
There was a problem hiding this comment.
Agreed. Fixed in e8cdc4e — test now uses numberOfRunners: 3 so the assertion actually validates the value.
| await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).resolves.toEqual(['i-partial']); | ||
| expect(mockEC2Client).toHaveReceivedCommandWith( | ||
| CreateFleetCommand, | ||
| expectedCreateFleetRequest(defaultExpectedFleetRequestValues), | ||
| ); |
There was a problem hiding this comment.
Agreed. Fixed in e8cdc4e — test now requests 3 runners and gets 1 instance back, making it a genuine partial-success scenario.
…tests - Reword partial-success log messages to remove SQS reference; add created/requested counts to log metadata - ScaleError test now uses numberOfRunners=3 to validate failedInstanceCount - Partial-success test now requests 3 runners and gets 1 back (truly partial)
Problem
When
CreateFleetreturns partial success (e.g. "6 of 8 instances created" plus errors),processFleetResultthrowsScaleErrorand discards the successfully-created instance IDs. Those instances boot with no JIT config written to SSM — they are orphaned until the scale-down Lambda reaps them.The caller (
scaleUp) already handles count mismatch by marking unfulfilled messages as batch failures for SQS retry. But it never gets the chance because the throw short-circuits the entire flow.Impact observed at batch_size ≥ 5: ~9% of jobs stuck permanently in burst tests — instances are launched, waste resources, but never pick up work.
Fix
Return partial instances when at least one was created. Only throw
ScaleErrorwhen zero instances were created:Same logic for unrecognized errors — if instances exist, return them.
Changes
lambdas/functions/control-plane/src/aws/runners.ts— return partial instances, only throw on zerolambdas/functions/control-plane/src/aws/runners.test.ts— new test for partial success with recognized error, updated existing tests for new ScaleError semanticsBehavior change summary
Related to #5024