acceptance: route bundle test clusters through the shared instance pool #5461
Open
renaudhartert-db wants to merge 1 commit into
The cli-isolated integration tests launch ~30 ephemeral clusters per run, each cold-pulling the multi-GB DBR runtime image over the NAT gateway in the deco AWS test account. That NAT egress is the bulk of an opex.eng.deco budget overspend (ES-1912931); the traffic is ~99.6% inbound download, ~3 GB per node.

These bundle acceptance templates set node_type_id directly, bypassing the existing warm instance pool that is already exported to CI as TEST_INSTANCE_POOL_ID and already used by spark-jar-task and integration_whl/base. Routing them through the pool lets nodes reuse a cached runtime image instead of re-pulling it through NAT on every launch.

Adds instance_pool_id: $TEST_INSTANCE_POOL_ID to the cluster-launching templates, matching the existing pattern, and regenerates the affected acceptance goldens.

Co-authored-by: Isaac
Commit: 354063e
24 interesting tests: 15 SKIP, 7 KNOWN, 2 BUG
The cli-isolated integration tests launch a large number of ephemeral clusters per run, and each one cold-pulls the multi-GB Databricks runtime image from the internet through the NAT gateway in the deco AWS test account. That NAT egress is the main driver of an opex.eng.deco budget overspend (ES-1912931): the traffic is about 99.6% inbound download, roughly 3 GB per node, which lines up with one runtime image per cold node.
Most cluster-launching bundle acceptance templates set node_type_id directly, which bypasses the shared warm instance pool. The pool already exists and is already exported to CI as TEST_INSTANCE_POOL_ID, and two templates (spark-jar-task and integration_whl/base) already use it. This change applies the same one-line pattern to the remaining cluster-launching templates, so their nodes come from the pool and reuse a cached runtime image instead of re-pulling it through NAT on every launch.
When instance_pool_id is set, the cluster takes its node type from the pool, so the bundle drops node_type_id and driver_node_type_id; that accounts for the golden updates here. The affected acceptance goldens were regenerated, and the tests pass locally.
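For illustration, the one-line pattern might look roughly like the fragment below. This is a minimal sketch, not a file from this PR: the resource name, cluster_name, spark_version and num_workers values are hypothetical placeholders, and only instance_pool_id: $TEST_INSTANCE_POOL_ID and the removal of node_type_id / driver_node_type_id come from the description above.

```yaml
# Hypothetical bundle template fragment (names and values are placeholders).
resources:
  clusters:
    test_cluster:
      cluster_name: test-cluster
      spark_version: 15.4.x-scala2.12   # placeholder version, assumed
      num_workers: 1                    # placeholder size, assumed
      # Before this change the template set the node type directly, e.g.
      #   node_type_id: <some node type>
      # which launched a cold node outside the pool.
      # After: the node type is inherited from the shared warm pool, so
      # node_type_id and driver_node_type_id are no longer set here.
      instance_pool_id: $TEST_INSTANCE_POOL_ID
```

The design point is that the pool, not the template, owns the node type, which is why the regenerated goldens no longer contain those two fields.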
One autoscale template (resources/clusters/deploy/update-and-resize-autoscale) is intentionally left out for now because its golden has environment-dependent fields that did not regenerate cleanly outside CI; it can be added in a follow-up. A separate, optional follow-up in eng-dev-ecosystem could preload more Spark versions on the pool so that the first use of each version is also warm, but it is not required for the reuse win.