kubernetes_management: add benchmark + base cluster management plane methods#6746
kubernetes_management: add benchmark + base cluster management plane methods#6746ashishsuneja wants to merge 10 commits into
Conversation
| vm_count: 1 | ||
| vm_spec: | ||
| GCP: | ||
| # us-central1-a: spec primary region for GCP |
There was a problem hiding this comment.
This kind of detail (zone & machine type) should go in like a design doc or similar but doesn't need to live in the code comments itself. We will often override these values when calling the benchmark anyway.
There was a problem hiding this comment.
Done. Dropped the hardcoded per-cloud machine types/zones and the rationale comments — it now uses the shared *default_dual_core anchor (same pattern as provision_node_pools), so the spec detail lives in default_config_constants.yaml and gets overridden via flags at call time as you noted.
| ) | ||
| _SCENARIOS = flags.DEFINE_list( | ||
| "k8s_mgmt_scenarios", | ||
| ["A", "B", "C"], |
There was a problem hiding this comment.
A, B, C are not great scenario names. Make these more descriptive. This might warrant a big comment in addition to more descriptive names.
There was a problem hiding this comment.
one. Renamed to concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning, with a block comment above _VALID_SCENARIOS documenting what each measures. I kept the internal RunScenario* helper names and the ScenarioA* metric tags as-is for now to avoid churning the test suite ahead of the split — happy to harmonize those too if you'd prefer.
There was a problem hiding this comment.
Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).
Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.
There was a problem hiding this comment.
Kept the three names — they genuinely differ, and the adjective names each scenario's distinguishing axis (straight from the design doc): A is concurrent node-pool ops; B is a node-pool create overlapping a cluster update (the defining trait is overlap of two different op types, not concurrency of many of the same); C is large-scale provisioning (defined by volume — up to ~1000 pools — independent of execution mode). And done on the cleanup: purged all A/B/C references from function names, metric tags, and tests.
|
|
||
| def Prepare(benchmark_spec: bm_spec.BenchmarkSpec) -> None: | ||
| del benchmark_spec | ||
| """Asserts the cluster is reachable; deploys spec-defined sleep workload.""" |
There was a problem hiding this comment.
This should always be the case.. Or it should fail a) obviously during setup or b) obviously early in the run. I guess this is y'all trying to do b) fail early in the run but I wonder how much it actually adds.
There was a problem hiding this comment.
The assert isinstance(...) is there for type-narrowing (so pytype resolves cluster.k8s_version), not as a reachability check — the sleep-pod deploy below it is what actually confirms reachability. I've fixed the docstring that implied otherwise and added a comment noting the assert's real purpose. Happy to drop it entirely if you'd rather.
| samples: list[sample.Sample] = [] | ||
|
|
||
| if "A" in scenarios: | ||
| samples += _RunScenarioA(cluster, initial, target) |
There was a problem hiding this comment.
seems like target version is only used in ScenarioA. Back up in CheckPrerequisites consider checking eg "if target version set but scenario A not being run, fail". & then similar checks for the other scenarios - if scenario specific variable set but scenario is not being run, fail.
There was a problem hiding this comment.
Done. CheckPrerequisites now fails if --k8s_mgmt_initial_version/--k8s_mgmt_target_version are set without concurrent_node_pool_ops, or --k8s_mgmt_scale_sweep without large_scale_provisioning, with tests for both. I deliberately left --k8s_mgmt_pipeline_scenario_a out of the cross-check: it defaults to True, so there's no reliable "explicitly set" signal to validate against, unlike the version/scale-sweep flags which default to None/empty. (Note: with the upgrade split, the version-flag cross-check moves to the follow-up PR, since --k8s_mgmt_target_version lives there now.)
| """ | ||
| raise NotImplementedError | ||
|
|
||
| def ResolveNodePoolVersions(self) -> tuple[str, str]: |
There was a problem hiding this comment.
Are you overloading specifically this function in any of the children? If not, please remove. The helper as is doesn't add anything.. just call cluster.BareMinor & cluster.AdjacentMinorBelow directly from kubernetes_management_benchmark.
There was a problem hiding this comment.
It's overridden in all three providers, and GKE's override is load-bearing: GKE rejects bare-minor versions and needs a fully-qualified node version (e.g. 1.34.4-gke.1234), which it resolves via gcloud container get-server-config. Calling BareMinor/AdjacentMinorBelow directly from the benchmark would break GKE — this method is the provider seam that keeps Run() cloud-agnostic. EKS/AKS take the bare-minor path and I've refactored those two to reuse the BareMinor/AdjacentMinorBelow helpers instead of duplicating the parse inline.
There was a problem hiding this comment.
Ok, took a look at https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/6747/changes
IMO you could handle this in GKE with a shared helper + caching. ie have both call into _ResolveNodePoolVersions for GKE & save the results, then just return those saved result. Or potentially call get-server-config twice But does this version number ever change or is it always the same? I know we're doing upgrades/downgrades in this, but idk if that's for the cluster overall or just like particular nodepools or what. If yes, it does change / this approach would add too much complexity, then it is good justification for keeping the ResolveNodePoolVersions function. Thanks for the extra context.
There was a problem hiding this comment.
Yes, it changes — and per node pool, not cluster-wide. The benchmark's whole point in concurrent_node_pool_ops is to create pools at an initial version (N-1) and then upgrade them to a distinct target (N), so initial and target are deliberately different values, resolved independently. They're node-pool versions: each pool is created at initial and upgraded to target individually; the control plane version isn't what we're varying.
On the caching idea — ResolveNodePoolVersions already resolves both initial and target in a single get-server-config call (returns the tuple), so we're not double-calling. The reason it stays a per-provider method rather than collapsing to bare BareMinor/AdjacentMinorBelow calls from the benchmark is that GKE needs the fully-qualified node version (e.g. 1.34.4-gke.1234) from get-server-config, while EKS/AKS take bare-minor and resolve differently — so the method is the provider seam that keeps Run() cloud-agnostic. Given the versions do change, I've kept it. Thanks for digging into #6747.
…methods kubernetes_management_benchmark.py: - Full implementation of Scenarios A, B, C on top of Zach's skeleton - _CleanStartSweep: guard clause, let it fail (no broad except) - Run(): guard clause for version resolution - All helpers: _RunScenarioA/B/C, _TimedAsync, _RunAsync, _OpSamples - _Results accumulator, _AggregateSamples, _OutlierSamples kubernetes_cluster.py: - AddNodepool: delegates to CreateNodePool for standard clusters - CreateNodePool/DeleteNodePool/UpgradeNodePool/UpdateCluster: sync wrappers calling *Async + WaitForOperation - Abstract methods: CreateNodePoolAsync, DeleteNodePoolAsync, UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation, ResolveNodePoolVersions, GetNodePoolNames - BareMinor, AdjacentMinorBelow: version helper functions Tested: - 89 unit tests passing - EKS end-to-end: 99 pools, 100% success all 7 scenarios - pyink + lint-diffs clean
…ng, remove dead code)
…nario-flag cross-validation
6d7c0c6 to
c959389
Compare
| @@ -1,3 +1,10 @@ | |||
| ### New features: | |||
There was a problem hiding this comment.
there's an existing new features section below. Add to there.
Other comments:
- Just one line about adding the benchmark is likely sufficient for all the PRs.
- No need for author credit (PR history shows this)
- No need to specify "add abstract methods" - again probably the whole PR stack can be summarized into a couple bullet points.
There was a problem hiding this comment.
Done — moved it into the existing new-features section, collapsed to a single line, and dropped both the author credit and the abstract-methods bullet.
| """Holds timing and outcome for a single async management-plane operation.""" | ||
|
|
||
| name: str | ||
| init_dur: float |
There was a problem hiding this comment.
Use full length names + provide g3doc descriptions for all the attributes. In particular I'm not sure what the difference is between "initial duration?" & "end to end duration?".
There was a problem hiding this comment.
Done — renamed init_dur→initiation_latency and e2e_dur→end_to_end_latency with a full Attributes: block. The distinction: initiation latency is the time from issuing the async call until it's accepted and an operation handle returns (time to start); end-to-end latency is from issuing the call until the operation fully completes (initiation plus server-side execution).
| ) | ||
| _SCENARIOS = flags.DEFINE_list( | ||
| "k8s_mgmt_scenarios", | ||
| ["A", "B", "C"], |
There was a problem hiding this comment.
Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).
Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.
| success = 0 | ||
|
|
||
| for r in results: | ||
| if isinstance(r, tuple): |
There was a problem hiding this comment.
this theoretically has type list[_OpResult], from above.. is that not actually the case? should we do this conversion beforehand, or is it truly a mixed list, or is it an unnecessary conversion? again brittle is fine / preferred over handling cases which should fail or we generally expect to pass.
There was a problem hiding this comment.
You're right — the isinstance(r, tuple) conversion was dead for production: _RunAsync always yields _OpResult objects, and only the tests were passing tuples. Removed the conversion and updated the tests to construct _OpResult directly, keeping the path strict.
| dict(meta), | ||
| ) | ||
| ) | ||
| samples.append( |
There was a problem hiding this comment.
Below you're doing a percentile / outlier based summation right? But you're also including results for every operation? How many operations do you expect?
There was a problem hiding this comment.
Both, by design: per-op _InitiationLatency/_EndToEndLatency samples preserve the raw distribution, while _AggregateSamples (Mean/StdDev/Min/Median/P90/P99/Max) and _OutlierSamples (IQR-fence) summarize over the successful ops. Op counts are scenario-driven — concurrent_node_pool_ops does --k8s_mgmt_concurrent_nodepools (default 5); large_scale_provisioning runs up to --k8s_mgmt_large_scale_nodepools or the --k8s_mgmt_scale_sweep values (hundreds to ~1000 per the design doc). Aggregates only emit at ≥2 successes and outliers at ≥4, so small scenarios skip them rather than emit noise
| kubernetes_management_benchmark._ScenarioCName(7), | ||
| ) | ||
|
|
||
| def testScenarioCNameFourDigitIndex(self): |
There was a problem hiding this comment.
condense these using a parameterized test
There was a problem hiding this comment.
Done — condensed the repetitive name tests into two @parameterized.named_parameters tests (three cases each).
| cluster = _make_mock_cluster( | ||
| pool_names=['pkbma000', 'pkbmc0001', 'user-pool'] | ||
| ) | ||
| kubernetes_management_benchmark._CleanStartSweep(cluster) |
There was a problem hiding this comment.
probably test as part of cleanup when moving clean start sweep call sites.
There was a problem hiding this comment.
Done — with the sweep relocated to per-scenario teardown and Cleanup(), the test moved to those new call sites (testRunSweepsAfterEachScenario, plus the renamed SweepNodePoolsTest).
…count metrics, fail-fast on zero ops
…f at start; Cleanup reuses sweep
… parameterize name tests
| + "If empty, uses --k8s_mgmt_large_scale_nodepools.", | ||
| ) | ||
|
|
||
| # AKS caps node-pool names at 12 chars — keep all names within that limit. |
There was a problem hiding this comment.
nit: move this to _ConcurrentPoolName's g3doc
There was a problem hiding this comment.
Done — _ConcurrentPoolName now has a g3doc, and the AKS-12-char note lives in it.
| ) | ||
|
|
||
|
|
||
| def _SweepNodePools(cluster: kubernetes_cluster.KubernetesCluster) -> None: |
There was a problem hiding this comment.
nit: rename to Clean or ClearNodePools, as you also use "Sweep" to mean "go through all of these different scale numbers" & its a bit confusing to have 2 meanings in the same file.
There was a problem hiding this comment.
Done — renamed _SweepNodePools → _ClearNodePools. Kept _SweepScales for the scale-loop, so "sweep" now has a single meaning in the file.
| for scale in scales: | ||
| scale_samples = _ScaleToPoolCount(cluster, initial, scale) | ||
| for s in scale_samples: | ||
| s.metadata["large_scale_scale"] = str(scale) |
There was a problem hiding this comment.
Done — renamed the scale/scales concept to goal_nodepools/goal_counts throughout _SweepScales and _ScaleToPoolCount, including the metadata key (large_scale_scale → goal_nodepools).
| get_name=lambda cfg: cfg.name, | ||
| ) | ||
| samples += _OpSamples( | ||
| "ConcurrentOps_Create", create_results, attempted_ops=len(pool_names) |
There was a problem hiding this comment.
is the concurrent_nodepools value in metadata already? If not, add it here or there
There was a problem hiding this comment.
It's already there — concurrent_nodepools is added to run_meta in Run() and tagged onto every sample, so it's in the metadata for all scenarios.
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| class _Results: |
There was a problem hiding this comment.
I don't really like _OpResults nor _Results as names. I think these don't need the _ - any class can just be a Result. Maybe change this one to something like ThreadSafeResults?
There was a problem hiding this comment.
Done — _Results → ThreadSafeResults (no leading underscore, as you suggested). _OpResult → OpTiming
| name: str | ||
| initiation_latency: float | ||
| end_to_end_latency: float | ||
| error: Exception | None = None |
There was a problem hiding this comment.
Again on "when to fail" - should we not just fail the whole benchmark run if an op fails? If you only run one scenario & one of its two ops fails.. then surely the benchmark overall should fail. But extrapolate that & you're only running like 4 or 5 ops in most benchmarks. Maybe the "Sweep through nodepool counts" shouldn't fail in case you eg put in a really high nodepool count & couldn't make it, but I think you could add such exception handling to that case in particular if indeed that's a concern.
There was a problem hiding this comment.
Reworked per both comments. OpTiming is now pure timing — (initiation_latency, end_to_end_latency), no name and no error field. The metric name is supplied by the sample builder (_OpSamples takes (name, OpTiming) pairs), so the result object no longer carries presentation data. This also removes the error field, which ties into the fail-hard change below.
| wait_fn: Callable[[str], None], | ||
| items: list, | ||
| get_name: Callable[[object], str], | ||
| ) -> list[tuple[str, float, float, Exception | None]]: |
There was a problem hiding this comment.
Why isn't this just a list[_OpResult]s ? Seems like if you're making dataclasses you should use them. Also, should _OpResult just be a sample & a convenience structure for making it? & then you add more metadata to it when needed. Or Make OpResult just the actual result without the name & only add that in a convenience structure that makes the sample? Esp if you remove exception handling.
There was a problem hiding this comment.
Agreed — switched to fail-hard by default. _TimedAsync/_RunAsync now let exceptions propagate, so any failed op in concurrent_node_pool_ops or overlapping_cluster_update aborts the benchmark (via RunThreaded), with the real traceback. I kept the one exception you flagged: large_scale_provisioning tolerates partial failure — it reports how many ops failed and lists the failed pool names in metadata, via a separate _RunAsyncTolerant + _LargeScaleSamples path. This also let me drop the broad-except blocks.
| # ── Counts + success rate ────────────────────────────────────────────── | ||
| total = attempted_ops if attempted_ops is not None else len(results) | ||
| executed = len(results) | ||
| if total == 0: |
There was a problem hiding this comment.
added comment above about "should we just fail if any op failed"?
There was a problem hiding this comment.
Done — fail-hard now: any op failure aborts the run (no success-rate to compute), so this counts block is gone for the fail-hard scenarios. Kept tolerance only in large_scale_provisioning per your 166 comment, which reports FailedOps + failed pool names.
|
|
||
|
|
||
| def _OpSamples( | ||
| metric_prefix: str, |
There was a problem hiding this comment.
Wait, so why do OpResults have names if OpSamples overwrites them / relegates their original name to metadata? I think I'm missing stuff but will still leave comment.
There was a problem hiding this comment.
Right — that was redundant. _OpResult→OpTiming now holds only (initiation_latency, end_to_end_latency); the name is supplied by _OpSamples via (name, OpTiming) pairs, so the result no longer carries a name the builder overwrites.
f431c3f to
cec3a9c
Compare
cec3a9c to
c0f8a7b
Compare
fef72ad to
f431c3f
Compare
Summary
Adds full implementation of the kubernetes_management benchmark on top
of Zach's skeleton, and adds management plane abstract methods to
KubernetesCluster base class.
Main changes
in _CleanStartSweep and Run(); no broad excepts
UpgradeNodePool, UpdateCluster (sync wrappers); async abstract
methods; BareMinor, AdjacentMinorBelow helpers
How tested