google_kubernetes_engine: add management plane operations#6747

Open
ashishsuneja wants to merge 4 commits into GoogleCloudPlatform:master from ashishsuneja:mgmt_plane_gke

@ashishsuneja

Summary
Adds GKE-specific implementations of management plane async methods.

Main changes

  • _IssueAsync, CreateNodePoolAsync, DeleteNodePoolAsync,
    UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
    ResolveNodePoolVersions, _GetLatestOperationName, GetNodePoolNames

How tested

  • Existing GKE test suite passing
  • pyink + lint-diffs clean

Ashish Suneja added 3 commits June 9, 2026 12:47
- _IssueAsync: issues gcloud --async commands, returns op name
- CreateNodePoolAsync: creates nodepool, returns op handle
- DeleteNodePoolAsync: deletes nodepool, returns op handle
- UpgradeNodePoolAsync: upgrades nodepool, returns op handle
- UpdateClusterAsync: toggles label for non-destructive update
- WaitForOperation: polls until DONE/ABORTING
- ResolveNodePoolVersions: auto-detects initial/target versions
- _GetLatestOperationName: fallback op lookup by type+target
- GetNodePoolNames: lists current node pools

Tested:
- Existing GKE test suite passing
- pyink + lint-diffs clean
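
For readers skimming the commit list, the "polls until DONE/ABORTING" behavior of WaitForOperation can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: `wait_for_operation`, `get_status`, the timeout and interval defaults, and the injectable `sleep`/`clock` hooks are all hypothetical names chosen so the loop can be exercised without calling gcloud.

```python
import time
from typing import Callable

# Statuses at which polling stops, per the commit message (DONE/ABORTING).
TERMINAL_STATUSES = frozenset({'DONE', 'ABORTING'})


def wait_for_operation(
    get_status: Callable[[], str],
    timeout_sec: float = 3600.0,  # assumed default, not taken from the PR
    poll_interval_sec: float = 10.0,  # assumed default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
  """Polls get_status() until a terminal status is seen or the timeout hits.

  get_status is a hypothetical callable that returns the operation's current
  status string (for example by describing the operation via gcloud).
  """
  deadline = clock() + timeout_sec
  while True:
    status = get_status()
    if status in TERMINAL_STATUSES:
      return status
    if clock() >= deadline:
      raise TimeoutError(
          f'operation still {status} after {timeout_sec} seconds'
      )
    sleep(poll_interval_sec)
```

Returning the final status rather than a boolean lets the caller decide whether ABORTING should be treated as a failure.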
Operation name string, or empty string if none found.
"""
link_target = target_name or self.name
if op_start_time:

Collaborator

This is good logic; can we just do it every time and require op_start_time? Conditions that aren't needed only add complexity.

@ashishsuneja ashishsuneja Jun 15, 2026

Author

Done — collapsed it. _GetLatestOperationName now always uses the broadened filter (RUNNING/PENDING/DONE with a startTime>= guard) and always takes op_start_time. The guard already prevents matching stale completed ops, so the two-branch conditional was redundant.
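
To make the collapsed filter concrete, here is a minimal sketch of how such a filter string could be assembled. It is not the string used in this PR: `build_operation_filter` is a hypothetical helper, the `operationType` and `status` clauses and the `UPGRADE_NODES` value in the usage line are assumptions about what the filter contains, and only the `targetLink ~ ...` clause and the `startTime>=` guard come from the thread above.

```python
from datetime import datetime, timezone


def build_operation_filter(
    op_type: str, link_target: str, op_start_time: datetime
) -> str:
  """Builds one filter expression covering RUNNING, PENDING and DONE ops.

  Hypothetical helper: the exact clauses in the PR may differ. op_start_time
  must be timezone-aware; it is rendered as an RFC 3339 UTC timestamp.
  """
  start = op_start_time.astimezone(timezone.utc).strftime(
      '%Y-%m-%dT%H:%M:%SZ'
  )
  return (
      f'operationType = {op_type}'
      ' AND status:(RUNNING PENDING DONE)'
      f' AND targetLink ~ {link_target}'
      # The guard that keeps stale, already-completed operations out.
      f' AND startTime >= "{start}"'
  )


print(
    build_operation_filter(
        'UPGRADE_NODES',
        'pool-1',
        datetime(2026, 6, 15, 12, 0, 0, tzinfo=timezone.utc),
    )
)
```

Because the start-time guard is always present, there is no need for a second code path that restricts the match to in-flight operations only.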


def _IssueAsync(self, cmd: util.GcloudCommand) -> str:
  """Issues a gcloud command with --async, returns the operation name."""
  cmd.args.append('--async')

Collaborator

Ah, I was wondering what made these operations async. Got it.

Author

Acknowledged

f'targetLink ~ {link_target}'
f'{time_filter}'
)
for attempt in range(1, max_attempts + 1):

Collaborator

What are some examples where you issue the operation but it isn't visible immediately? Does it just take a minute for results to show up?

A unit test with semi-real output examples would be helpful.

Author

Added unit tests (GoogleKubernetesEngineAsyncOpsTestCase) with captured-style gcloud output: create returns the op name directly (no fallback), upgrade and update fall back and recover from the operations list, and the no-fallback-configured case raises. On the timing: the operation can take a moment to leave PENDING, and fast ops (the label-update) may already be DONE by the time we query — both handled by the broadened filter. In practice on the 100-pool run it resolved on the first query every time; the 5×3s retry is a safety margin for the PENDING-transition window.
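
For reference, the 5×3s retry described above has roughly the following shape. This is a sketch, not the PR's implementation: `find_operation_name` and its `query` and `sleep` parameters are hypothetical, with `query` standing in for whatever call lists operations and returns the first matching name (or an empty string).

```python
import time
from typing import Callable


def find_operation_name(
    query: Callable[[], str],
    max_attempts: int = 5,
    delay_sec: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
  """Calls query() until it yields a non-empty name or attempts run out.

  Returns the operation name, or an empty string if none was found.
  """
  for attempt in range(1, max_attempts + 1):
    name = query().strip()
    if name:
      return name
    # Only wait if another attempt will follow.
    if attempt < max_attempts:
      sleep(delay_sec)
  return ''


# Usage with canned output: the operation shows up on the third query.
responses = iter(['', '', 'operation-123\n'])
print(find_operation_name(lambda: next(responses), sleep=lambda s: None))
```

Injecting `sleep` keeps a unit test of the retry path fast, which fits the canned-output style of the tests mentioned above.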

raise errors.Resource.CreationError(
    f'GKE async command returned no operation name; stderr={stderr}'
)
return op_name

Collaborator

Below you have the _GetLatestOperationName method. Does the command here not output the operation name? Shouldn't the flow just be:

  • start operation, name is returned
  • wait for operation to finish. now done.

Author

For create and delete, that's exactly the flow — _IssueAsync issues --async, gcloud prints the operation name, we wait on it. The exception is clusters upgrade --node-pool and clusters update: those reliably return success with empty stdout (gcloud just doesn't print the name for those two subcommands). Verified on a 100-pool GKE run — 99/99 upgrades and the cluster-update all came back with no operation name, so _GetLatestOperationName recovers it from the operations list. I didn't find a gcloud flag that makes upgrade/update print it; if one exists I'd happily drop the fallback.

# Fallback: gcloud succeeded but printed nothing. Query the operations
# list scoped to this specific nodepool to find the operation name.
logging.info(
'UpgradeNodePoolAsync: ops list fallback for %s: %s',

Collaborator

Why, when, and how frequently does this happen?

Author

Every time, not intermittently — clusters upgrade --node-pool --async returned no operation name on 99/99 upgrades in the 100-pool run. So the fallback runs on every upgrade/update.

try:
  return self._IssueAsync(cmd)
except errors.Resource.CreationError as e:
  if 'returned no operation name' not in str(e):

Collaborator

This logic is shared with UpgradeNodePoolAsync above. Refactor so it lives in one location, perhaps as an optional parameter on _IssueAsync that handles the "no operation name" case. That is, of course, if it is necessary at all.

Author

Done, exactly as suggested: moved the fallback into _IssueAsync behind optional fallback_op_type / fallback_target params, so the "no op name → query operations list" path lives in one place. UpgradeNodePoolAsync and UpdateClusterAsync are now one-liners.
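
A rough sketch of that single-location shape, for readers following the thread. It is a simplification under assumptions rather than the code in this PR: `issue_async`, `run`, `lookup`, and `NoOperationNameError` are hypothetical stand-ins (the PR uses fallback_op_type / fallback_target params and raises errors.Resource.CreationError), and `run` is assumed to return a `(stdout, stderr)` pair.

```python
from typing import Callable, List, Optional, Tuple


class NoOperationNameError(Exception):
  """Hypothetical error for an async command that yields no operation name."""


def issue_async(
    run: Callable[[List[str]], Tuple[str, str]],
    args: List[str],
    lookup: Optional[Callable[[], str]] = None,
) -> str:
  """Runs a command with --async and returns the operation name.

  If stdout is empty and a lookup callable is configured, the name is
  recovered through it (for example by querying the operations list).
  """
  stdout, stderr = run(args + ['--async'])
  op_name = stdout.strip()
  if not op_name and lookup is not None:
    op_name = lookup().strip()
  if not op_name:
    raise NoOperationNameError(
        f'async command returned no operation name; stderr={stderr}'
    )
  return op_name


# Create-style call: the name is printed directly, no fallback involved.
print(issue_async(lambda a: ('operation-1\n', ''), ['create']))
# Upgrade-style call: empty stdout, name recovered through the fallback.
print(issue_async(lambda a: ('', ''), ['upgrade'], lookup=lambda: 'operation-2'))
```

With the fallback optional, callers such as create and delete that always get a name never pay for an extra operations-list query, and a caller with no fallback configured still fails loudly.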

@ashishsuneja force-pushed the mgmt_plane_gke branch 2 times, most recently from 2cd4d5b to d3eec15, on June 17, 2026 19:21