google_kubernetes_engine: add management plane operations #6747
ashishsuneja wants to merge 4 commits into
- _IssueAsync: issues gcloud --async commands, returns op name
- CreateNodePoolAsync: creates nodepool, returns op handle
- DeleteNodePoolAsync: deletes nodepool, returns op handle
- UpgradeNodePoolAsync: upgrades nodepool, returns op handle
- UpdateClusterAsync: toggles label for non-destructive update
- WaitForOperation: polls until DONE/ABORTING
- ResolveNodePoolVersions: auto-detects initial/target versions
- _GetLatestOperationName: fallback op lookup by type+target
- GetNodePoolNames: lists current node pools

Tested:
- Existing GKE test suite passing
- pyink + lint-diffs clean
9acc83e to b78f357
    Operation name string, or empty string if none found.
    """
    link_target = target_name or self.name
    if op_start_time:
This is good logic; can we not just do it every time / require op_start_time? Additional conditions when not needed are just additional complexity.
Done — collapsed it. _GetLatestOperationName now always uses the broadened filter (RUNNING/PENDING/DONE with a startTime>= guard) and always takes op_start_time. The guard already prevents matching stale completed ops, so the two-branch conditional was redundant.
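A rough sketch of what the collapsed, single-branch filter could look like. This is not the PR's code: `build_operations_filter` is a hypothetical helper, and the field names `status`, `operationType`, and `startTime`, along with the exact filter syntax, are assumptions. Only `targetLink ~ {link_target}` and the RUNNING/PENDING/DONE plus `startTime>=` idea come from the discussion above.

```python
from datetime import datetime, timezone


def build_operations_filter(
    op_type: str, link_target: str, op_start_time: datetime
) -> str:
  """Builds one filter string; op_start_time is always required."""
  # Format the guard timestamp in UTC so it compares cleanly with startTime.
  start = op_start_time.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
  return (
      # Include DONE so fast operations that already finished still match.
      'status=(RUNNING,PENDING,DONE) '
      f'AND operationType={op_type} '
      f'AND targetLink ~ {link_target} '
      # The time guard keeps stale completed operations out of the result.
      f'AND startTime>={start}'
  )
```

With `op_start_time` mandatory there is no second code path to test: every caller gets the same broadened filter.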
    def _IssueAsync(self, cmd: util.GcloudCommand) -> str:
      """Issues a gcloud command with --async, returns the operation name."""
      cmd.args.append('--async')
Ah, I was wondering what made these operations async. Got it.
        f'targetLink ~ {link_target}'
        f'{time_filter}'
    )
    for attempt in range(1, max_attempts + 1):
What are some examples where you call the operation but it isn't seen immediately? Does it just take a minute for results to show up?
A unit test with semi-real output examples would be helpful.
Added unit tests (GoogleKubernetesEngineAsyncOpsTestCase) with captured-style gcloud output: create returns the op name directly (no fallback), upgrade and update fall back and recover from the operations list, and the no-fallback-configured case raises. On the timing: the operation can take a moment to leave PENDING, and fast ops (the label-update) may already be DONE by the time we query — both handled by the broadened filter. In practice on the 100-pool run it resolved on the first query every time; the 5×3s retry is a safety margin for the PENDING-transition window.
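For readers following along, the 5×3s retry described above might be sketched like this. `find_operation_name` and its injected `query` and `sleep` callables are hypothetical names chosen so the loop can be exercised without gcloud; the real method in the PR queries the operations list itself.

```python
import time
from typing import Callable


def find_operation_name(
    query: Callable[[], str],
    max_attempts: int = 5,
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
  """Returns the first non-empty operation name, or '' if none appears."""
  for attempt in range(1, max_attempts + 1):
    op_name = query().strip()
    if op_name:
      return op_name
    # Only wait when another attempt is still coming.
    if attempt < max_attempts:
      sleep(delay_seconds)
  return ''
```

Injecting `sleep` keeps a unit test fast: a fake can record the requested delays instead of actually waiting.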
      raise errors.Resource.CreationError(
          f'GKE async command returned no operation name; stderr={stderr}'
      )
    return op_name
Below you have the _GetLatestOperationName command. Does this command here not output the operation name? Shouldn't it just be:
- start operation, name is returned
- wait for operation to finish. now done.
For create and delete, that's exactly the flow — _IssueAsync issues --async, gcloud prints the operation name, we wait on it. The exception is clusters upgrade --node-pool and clusters update: those reliably return success with empty stdout (gcloud just doesn't print the name for those two subcommands). Verified on a 100-pool GKE run — 99/99 upgrades and the cluster-update all came back with no operation name, so _GetLatestOperationName recovers it from the operations list. I didn't find a gcloud flag that makes upgrade/update print it; if one exists I'd happily drop the fallback.
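The "wait on it" half of that flow could be sketched roughly as below. The terminal states DONE/ABORTING come from the commit description; everything else (the `wait_for_operation` name, the `get_status` callable, the interval and timeout defaults) is an assumption made for illustration, not the PR's `WaitForOperation`.

```python
import time
from typing import Callable


def wait_for_operation(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
  """Polls get_status until it reports a terminal state, then returns it."""
  terminal_states = {'DONE', 'ABORTING'}
  waited = 0.0
  while True:
    status = get_status()
    if status in terminal_states:
      return status
    if waited >= timeout:
      raise TimeoutError(f'operation still {status} after {waited:.0f}s')
    sleep(poll_interval)
    waited += poll_interval
```

A caller would pass a `get_status` that describes the operation by the name returned from the async command.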
    # Fallback: gcloud succeeded but printed nothing. Query the operations
    # list scoped to this specific nodepool to find the operation name.
    logging.info(
        'UpgradeNodePoolAsync: ops list fallback for %s: %s',
why/when/how frequently does this happen?
Every time, not intermittently — clusters upgrade --node-pool --async returned no operation name on 99/99 upgrades in the 100-pool run. So the fallback runs on every upgrade/update.
    try:
      return self._IssueAsync(cmd)
    except errors.Resource.CreationError as e:
      if 'returned no operation name' not in str(e):
This logic is shared with UpgradeNodePoolAsync above. Refactor so it lives in only one location, perhaps as an optional parameter in _IssueAsync that handles this "no operation name" case. That is, if it's necessary at all.
Done, exactly as suggested — moved the fallback into _IssueAsync behind optional fallback_op_type / fallback_target params, so the "no op name → query operations list" path lives in one place. UpgradeNodePoolAsync and UpdateClusterAsync are now one-liners
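To illustrate the shape of that refactor, here is a standalone sketch under stated assumptions. The parameter names `fallback_op_type` and `fallback_target` are taken from the reply above; `issue_async`, the injected `run` and `lookup` callables, and `NoOperationNameError` are hypothetical stand-ins for the PR's method, its gcloud invocation, `_GetLatestOperationName`, and `errors.Resource.CreationError`.

```python
from typing import Callable, List, Optional, Tuple


class NoOperationNameError(Exception):
  """Stand-in for the error raised when no operation name can be found."""


def issue_async(
    run: Callable[[List[str]], Tuple[str, str]],
    args: List[str],
    lookup: Optional[Callable[[str, str], str]] = None,
    fallback_op_type: Optional[str] = None,
    fallback_target: Optional[str] = None,
) -> str:
  """Runs a command with --async and returns the operation name."""
  stdout, stderr = run(args + ['--async'])
  op_name = stdout.strip()
  if op_name:
    # Create/delete path: the name is printed directly.
    return op_name
  if lookup and fallback_op_type and fallback_target:
    # Upgrade/update path: recover the name from the operations list.
    op_name = lookup(fallback_op_type, fallback_target)
    if op_name:
      return op_name
  raise NoOperationNameError(
      f'GKE async command returned no operation name; stderr={stderr}'
  )
```

Keeping the fallback behind optional parameters means callers that always get a name pass nothing extra, while upgrade and update each become a single call.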
…sueAsync; add tests
2cd4d5b to d3eec15
Summary
Adds GKE-specific implementations of management plane async methods.
Main changes
UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
ResolveNodePoolVersions, _GetLatestOperationName, GetNodePoolNames
How tested