acceptance: route bundle test clusters through the shared instance pool #5461

Open
renaudhartert-db wants to merge 1 commit into main from nat-pool-routing

@renaudhartert-db
Contributor

The cli-isolated integration tests launch a large number of ephemeral clusters per run, and each one cold-pulls the multi-GB Databricks runtime image from the internet through the NAT gateway in the deco AWS test account. That NAT egress is the main driver of an opex.eng.deco budget overspend (ES-1912931): the traffic is about 99.6% inbound download, roughly 3 GB per node, which lines up with one runtime image per cold node.

Most cluster-launching bundle acceptance templates set node_type_id directly, which bypasses the shared warm instance pool. The pool already exists and is already exported to CI as TEST_INSTANCE_POOL_ID, and two templates (spark-jar-task and integration_whl/base) already use it. This change applies the same one-line pattern to the remaining cluster-launching templates, so their nodes come from the pool and reuse a cached runtime image instead of re-pulling it through NAT on every launch.

When instance_pool_id is set, the cluster takes its node type from the pool, so the bundle drops node_type_id and driver_node_type_id; that accounts for the golden updates here. The affected acceptance tests were regenerated and pass locally.
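
As a rough illustration of the effect on a cluster spec, the sketch below applies the same transformation to a plain dictionary. The helper name `route_through_pool`, the dict layout, and all values are hypothetical and chosen for illustration only; the actual change edits the bundle templates directly and does not use code like this.

```python
from typing import Any, Mapping

# Fields that are dropped once the cluster takes its node type from the pool.
NODE_TYPE_KEYS = ("node_type_id", "driver_node_type_id")


def route_through_pool(cluster: Mapping[str, Any], pool_id: str) -> dict[str, Any]:
    """Return a copy of a cluster spec that draws its nodes from an instance pool.

    Hypothetical helper for illustration; not part of the CLI or the test harness.
    """
    if not pool_id:
        raise ValueError("pool_id must be non-empty, e.g. the value of TEST_INSTANCE_POOL_ID")
    # Copy every field except the explicit node types, leaving the input untouched.
    routed = {key: value for key, value in cluster.items() if key not in NODE_TYPE_KEYS}
    routed["instance_pool_id"] = pool_id
    return routed


# Placeholder values; none of them come from the actual templates.
before = {
    "spark_version": "example-spark-version",
    "node_type_id": "example-node-type",
    "num_workers": 1,
}
after = route_through_pool(before, "example-pool-id")
```

Here `after` keeps `spark_version` and `num_workers`, gains `instance_pool_id`, and no longer carries `node_type_id`, which mirrors the shape of the golden updates described above.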

One autoscale template (resources/clusters/deploy/update-and-resize-autoscale) is intentionally left out for now because its golden has environment-dependent fields that did not regenerate cleanly outside CI; it can be added in a follow-up. A separate, optional follow-up in eng-dev-ecosystem can preload more Spark versions on the pool so that the first use of each version is also warm, but that is not required for the reuse win.

The cli-isolated integration tests launch ~30 ephemeral clusters per run, each
cold-pulling the multi-GB DBR runtime image over the NAT gateway in the deco AWS
test account. That NAT egress is the bulk of an opex.eng.deco budget overspend
(ES-1912931); the traffic is ~99.6% inbound download, ~3 GB per node.

These bundle acceptance templates set node_type_id directly, bypassing the
existing warm instance pool that is already exported to CI as
TEST_INSTANCE_POOL_ID and already used by spark-jar-task and integration_whl/base.
Routing them through the pool lets nodes reuse a cached runtime image instead of
re-pulling it through NAT on every launch.

Adds instance_pool_id: $TEST_INSTANCE_POOL_ID to the cluster-launching templates,
matching the existing pattern, and regenerates the affected acceptance goldens.

Co-authored-by: Isaac
@github-actions
Contributor

github-actions Bot commented Jun 7, 2026

Waiting for approval

Based on git history, these people are best suited to review:

  • @andrewnester -- recent work in acceptance/bundle/resources/clusters/lifecycle-started-terraform-error/, acceptance/bundle/resources/clusters/lifecycle-started-toggle/, acceptance/bundle/resources/clusters/lifecycle-started/

Eligible reviewers: @anton-107, @denik, @janniklasrose, @lennartkats-db, @pietern, @shreyas-goenka

Suggestions based on git history. See OWNERS for ownership rules.

@eng-dev-ecosystem-bot
Collaborator

Commit: 354063e

Run: 27093685326

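| Env | 🪲 BUG | 🟨 KNOWN | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|---|
| 🪲 aws linux | 2 | 7 | | 15 | 259 | 923 | 9:57 |
| 🪲 aws windows | 2 | 7 | | 15 | 261 | 921 | 12:20 |
| 🪲 aws-ucws linux | 2 | 1 | 6 | 15 | 355 | 837 | 6:55 |
| 🪲 aws-ucws windows | 2 | 1 | 6 | 15 | 357 | 835 | 11:01 |
| 🪲 azure linux | 2 | 1 | | 17 | 262 | 921 | 9:22 |
| 🪲 azure windows | 2 | 1 | | 17 | 264 | 919 | 10:28 |
| 🪲 azure-ucws linux | 2 | 1 | | 17 | 360 | 833 | 18:17 |
| 🪲 azure-ucws windows | 2 | 1 | | 17 | 362 | 831 | 11:20 |
| 🪲 gcp linux | 2 | 1 | | 17 | 258 | 924 | 9:49 |
| 🪲 gcp windows | 2 | 1 | | 17 | 260 | 922 | 11:30 |
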
24 interesting tests: 15 SKIP, 7 KNOWN, 2 BUG

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K | 🟨 K |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/grants/select | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🪲 TestAccept/bundle/run_as/job_default | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B |
| 🪲 TestAccept/bundle/run_as/job_default/DATABRICKS_BUNDLE_ENGINE=direct | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B | 🪲 B |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |

Top 25 slowest tests (at least 2 minutes):

| duration | env | testname |
|---|---|---|
| 7:52 | azure-ucws linux | TestSQLExecScalar |
| 6:23 | gcp linux | TestSecretsPutSecretStringValue |
| 5:39 | aws linux | TestSecretsPutSecretStringValue |
| 5:38 | azure linux | TestSecretsPutSecretStringValue |
| 5:01 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:47 | gcp windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 4:42 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 4:40 | azure-ucws linux | TestSecretsPutSecretStringValue |
| 4:05 | gcp linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:36 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:31 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:21 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:14 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:08 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 3:05 | aws-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 3:01 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:58 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:53 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:49 | azure linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:46 | aws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:42 | azure-ucws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:37 | aws-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform |
| 2:35 | azure windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:24 | aws windows | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
| 2:21 | azure-ucws linux | TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct |
