kubernetes_management: add benchmark + base cluster management plane methods by ashishsuneja · Pull Request #6746 · GoogleCloudPlatform/PerfKitBenchmarker

ashishsuneja · 2026-06-05T17:10:43Z

Summary
Adds full implementation of the kubernetes_management benchmark on top
of Zach's skeleton, and adds management plane abstract methods to
KubernetesCluster base class.

Main changes

kubernetes_management_benchmark.py: Scenarios A, B, C; guard clauses
in _CleanStartSweep and Run(); no broad excepts
kubernetes_cluster.py: AddNodepool, CreateNodePool, DeleteNodePool,
UpgradeNodePool, UpdateCluster (sync wrappers); async abstract
methods; BareMinor, AdjacentMinorBelow helpers

How tested

89 unit tests passing
EKS end-to-end: 99 pools, 100% success all 7 scenarios
pyink + lint-diffs clean

hubatish · 2026-06-08T20:01:19Z

    vm_count: 1
+    vm_spec:
+      GCP:
+        # us-central1-a: spec primary region for GCP


This kind of detail (zone & machine type) should go in like a design doc or similar but doesn't need to live in the code comments itself. We will often override these values when calling the benchmark anyway.

Done. Dropped the hardcoded per-cloud machine types/zones and the rationale comments — it now uses the shared *default_dual_core anchor (same pattern as provision_node_pools), so the spec detail lives in default_config_constants.yaml and gets overridden via flags at call time as you noted.

hubatish · 2026-06-08T20:03:44Z

+)
+_SCENARIOS = flags.DEFINE_list(
+    "k8s_mgmt_scenarios",
+    ["A", "B", "C"],


A, B, C are not great scenario names. Make these more descriptive. This might warrant a big comment in addition to more descriptive names.

one. Renamed to concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning, with a block comment above _VALID_SCENARIOS documenting what each measures. I kept the internal RunScenario* helper names and the ScenarioA* metric tags as-is for now to avoid churning the test suite ahead of the split — happy to harmonize those too if you'd prefer.

Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).

Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.

Kept the three names — they genuinely differ, and the adjective names each scenario's distinguishing axis (straight from the design doc): A is concurrent node-pool ops; B is a node-pool create overlapping a cluster update (the defining trait is overlap of two different op types, not concurrency of many of the same); C is large-scale provisioning (defined by volume — up to ~1000 pools — independent of execution mode). And done on the cleanup: purged all A/B/C references from function names, metric tags, and tests.

hubatish · 2026-06-08T20:12:26Z


 def Prepare(benchmark_spec: bm_spec.BenchmarkSpec) -> None:
-  del benchmark_spec
+  """Asserts the cluster is reachable; deploys spec-defined sleep workload."""


This should always be the case.. Or it should fail a) obviously during setup or b) obviously early in the run. I guess this is y'all trying to do b) fail early in the run but I wonder how much it actually adds.

The assert isinstance(...) is there for type-narrowing (so pytype resolves cluster.k8s_version), not as a reachability check — the sleep-pod deploy below it is what actually confirms reachability. I've fixed the docstring that implied otherwise and added a comment noting the assert's real purpose. Happy to drop it entirely if you'd rather.

hubatish · 2026-06-08T20:16:38Z

+  samples: list[sample.Sample] = []
+
+  if "A" in scenarios:
+    samples += _RunScenarioA(cluster, initial, target)


seems like target version is only used in ScenarioA. Back up in CheckPrerequisites consider checking eg "if target version set but scenario A not being run, fail". & then similar checks for the other scenarios - if scenario specific variable set but scenario is not being run, fail.

Done. CheckPrerequisites now fails if --k8s_mgmt_initial_version/--k8s_mgmt_target_version are set without concurrent_node_pool_ops, or --k8s_mgmt_scale_sweep without large_scale_provisioning, with tests for both. I deliberately left --k8s_mgmt_pipeline_scenario_a out of the cross-check: it defaults to True, so there's no reliable "explicitly set" signal to validate against, unlike the version/scale-sweep flags which default to None/empty. (Note: with the upgrade split, the version-flag cross-check moves to the follow-up PR, since --k8s_mgmt_target_version lives there now.)

hubatish · 2026-06-08T20:21:35Z

+    """
+    raise NotImplementedError
+
+  def ResolveNodePoolVersions(self) -> tuple[str, str]:


Are you overloading specifically this function in any of the children? If not, please remove. The helper as is doesn't add anything.. just call cluster.BareMinor & cluster.AdjacentMinorBelow directly from kubernetes_management_benchmark.

It's overridden in all three providers, and GKE's override is load-bearing: GKE rejects bare-minor versions and needs a fully-qualified node version (e.g. 1.34.4-gke.1234), which it resolves via gcloud container get-server-config. Calling BareMinor/AdjacentMinorBelow directly from the benchmark would break GKE — this method is the provider seam that keeps Run() cloud-agnostic. EKS/AKS take the bare-minor path and I've refactored those two to reuse the BareMinor/AdjacentMinorBelow helpers instead of duplicating the parse inline.

Ok, took a look at https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/6747/changes

IMO you could handle this in GKE with a shared helper + caching. ie have both call into _ResolveNodePoolVersions for GKE & save the results, then just return those saved result. Or potentially call get-server-config twice But does this version number ever change or is it always the same? I know we're doing upgrades/downgrades in this, but idk if that's for the cluster overall or just like particular nodepools or what. If yes, it does change / this approach would add too much complexity, then it is good justification for keeping the ResolveNodePoolVersions function. Thanks for the extra context.

Yes, it changes — and per node pool, not cluster-wide. The benchmark's whole point in concurrent_node_pool_ops is to create pools at an initial version (N-1) and then upgrade them to a distinct target (N), so initial and target are deliberately different values, resolved independently. They're node-pool versions: each pool is created at initial and upgraded to target individually; the control plane version isn't what we're varying.
On the caching idea — ResolveNodePoolVersions already resolves both initial and target in a single get-server-config call (returns the tuple), so we're not double-calling. The reason it stays a per-provider method rather than collapsing to bare BareMinor/AdjacentMinorBelow calls from the benchmark is that GKE needs the fully-qualified node version (e.g. 1.34.4-gke.1234) from get-server-config, while EKS/AKS take bare-minor and resolve differently — so the method is the provider seam that keeps Run() cloud-agnostic. Given the versions do change, I've kept it. Thanks for digging into #6747.

…methods kubernetes_management_benchmark.py: - Full implementation of Scenarios A, B, C on top of Zach's skeleton - _CleanStartSweep: guard clause, let it fail (no broad except) - Run(): guard clause for version resolution - All helpers: _RunScenarioA/B/C, _TimedAsync, _RunAsync, _OpSamples - _Results accumulator, _AggregateSamples, _OutlierSamples kubernetes_cluster.py: - AddNodepool: delegates to CreateNodePool for standard clusters - CreateNodePool/DeleteNodePool/UpgradeNodePool/UpdateCluster: sync wrappers calling *Async + WaitForOperation - Abstract methods: CreateNodePoolAsync, DeleteNodePoolAsync, UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation, ResolveNodePoolVersions, GetNodePoolNames - BareMinor, AdjacentMinorBelow: version helper functions Tested: - 89 unit tests passing - EKS end-to-end: 99 pools, 100% success all 7 scenarios - pyink + lint-diffs clean

…ng, remove dead code)

…nario-flag cross-validation

hubatish · 2026-06-09T18:06:44Z

@@ -1,3 +1,10 @@
+### New features:


there's an existing new features section below. Add to there.
Other comments:

Just one line about adding the benchmark is likely sufficient for all the PRs.

No need for author credit (PR history shows this)

No need to specify "add abstract methods" - again probably the whole PR stack can be summarized into a couple bullet points.

Done — moved it into the existing new-features section, collapsed to a single line, and dropped both the author credit and the abstract-methods bullet.

hubatish · 2026-06-09T18:09:39Z

+  """Holds timing and outcome for a single async management-plane operation."""
+
+  name: str
+  init_dur: float


Use full length names + provide g3doc descriptions for all the attributes. In particular I'm not sure what the difference is between "initial duration?" & "end to end duration?".

Done — renamed init_dur→initiation_latency and e2e_dur→end_to_end_latency with a full Attributes: block. The distinction: initiation latency is the time from issuing the async call until it's accepted and an operation handle returns (time to start); end-to-end latency is from issuing the call until the operation fully completes (initiation plus server-side execution).

hubatish · 2026-06-09T18:13:36Z

+)
+_SCENARIOS = flags.DEFINE_list(
+    "k8s_mgmt_scenarios",
+    ["A", "B", "C"],


Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).

Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.

hubatish · 2026-06-09T18:30:39Z

+  success = 0
+
+  for r in results:
+    if isinstance(r, tuple):


this theoretically has type list[_OpResult], from above.. is that not actually the case? should we do this conversion beforehand, or is it truly a mixed list, or is it an unnecessary conversion? again brittle is fine / preferred over handling cases which should fail or we generally expect to pass.

You're right — the isinstance(r, tuple) conversion was dead for production: _RunAsync always yields _OpResult objects, and only the tests were passing tuples. Removed the conversion and updated the tests to construct _OpResult directly, keeping the path strict.

hubatish · 2026-06-09T18:33:28Z

+            dict(meta),
+        )
+    )
+    samples.append(


Below you're doing a percentile / outlier based summation right? But you're also including results for every operation? How many operations do you expect?

Both, by design: per-op _InitiationLatency/_EndToEndLatency samples preserve the raw distribution, while _AggregateSamples (Mean/StdDev/Min/Median/P90/P99/Max) and _OutlierSamples (IQR-fence) summarize over the successful ops. Op counts are scenario-driven — concurrent_node_pool_ops does --k8s_mgmt_concurrent_nodepools (default 5); large_scale_provisioning runs up to --k8s_mgmt_large_scale_nodepools or the --k8s_mgmt_scale_sweep values (hundreds to ~1000 per the design doc). Aggregates only emit at ≥2 successes and outliers at ≥4, so small scenarios skip them rather than emit noise

hubatish · 2026-06-09T18:34:57Z

+        kubernetes_management_benchmark._ScenarioCName(7),
+    )
+
+  def testScenarioCNameFourDigitIndex(self):


condense these using a parameterized test

Done — condensed the repetitive name tests into two @parameterized.named_parameters tests (three cases each).

hubatish · 2026-06-09T18:38:16Z

+    cluster = _make_mock_cluster(
+        pool_names=['pkbma000', 'pkbmc0001', 'user-pool']
+    )
+    kubernetes_management_benchmark._CleanStartSweep(cluster)


probably test as part of cleanup when moving clean start sweep call sites.

Done — with the sweep relocated to per-scenario teardown and Cleanup(), the test moved to those new call sites (testRunSweepsAfterEachScenario, plus the renamed SweepNodePoolsTest).

…count metrics, fail-fast on zero ops

…f at start; Cleanup reuses sweep

… parameterize name tests

hubatish · 2026-06-11T18:44:14Z

+    + "If empty, uses --k8s_mgmt_large_scale_nodepools.",
+)
+
+# AKS caps node-pool names at 12 chars — keep all names within that limit.


nit: move this to _ConcurrentPoolName's g3doc

Done — _ConcurrentPoolName now has a g3doc, and the AKS-12-char note lives in it.

hubatish · 2026-06-11T18:47:05Z

+  )
+
+
+def _SweepNodePools(cluster: kubernetes_cluster.KubernetesCluster) -> None:


nit: rename to Clean or ClearNodePools, as you also use "Sweep" to mean "go through all of these different scale numbers" & its a bit confusing to have 2 meanings in the same file.

Done — renamed _SweepNodePools → _ClearNodePools. Kept _SweepScales for the scale-loop, so "sweep" now has a single meaning in the file.

hubatish · 2026-06-11T18:54:28Z

+  for scale in scales:
+    scale_samples = _ScaleToPoolCount(cluster, initial, scale)
+    for s in scale_samples:
+      s.metadata["large_scale_scale"] = str(scale)


name goal_nodepools ?

Done — renamed the scale/scales concept to goal_nodepools/goal_counts throughout _SweepScales and _ScaleToPoolCount, including the metadata key (large_scale_scale → goal_nodepools).

hubatish · 2026-06-11T18:54:56Z

+      get_name=lambda cfg: cfg.name,
+  )
+  samples += _OpSamples(
+      "ConcurrentOps_Create", create_results, attempted_ops=len(pool_names)


is the concurrent_nodepools value in metadata already? If not, add it here or there

It's already there — concurrent_nodepools is added to run_meta in Run() and tagged onto every sample, so it's in the metadata for all scenarios.

hubatish · 2026-06-11T20:21:54Z

+# ---------------------------------------------------------------------------
+
+
+class _Results:


I don't really like _OpResults nor _Results as names. I think these don't need the _ - any class can just be a Result. Maybe change this one to something like ThreadSafeResults?

Done — _Results → ThreadSafeResults (no leading underscore, as you suggested). _OpResult → OpTiming

hubatish · 2026-06-11T20:26:23Z

+  name: str
+  initiation_latency: float
+  end_to_end_latency: float
+  error: Exception | None = None


Again on "when to fail" - should we not just fail the whole benchmark run if an op fails? If you only run one scenario & one of its two ops fails.. then surely the benchmark overall should fail. But extrapolate that & you're only running like 4 or 5 ops in most benchmarks. Maybe the "Sweep through nodepool counts" shouldn't fail in case you eg put in a really high nodepool count & couldn't make it, but I think you could add such exception handling to that case in particular if indeed that's a concern.

Reworked per both comments. OpTiming is now pure timing — (initiation_latency, end_to_end_latency), no name and no error field. The metric name is supplied by the sample builder (_OpSamples takes (name, OpTiming) pairs), so the result object no longer carries presentation data. This also removes the error field, which ties into the fail-hard change below.

hubatish · 2026-06-11T20:27:50Z

+    wait_fn: Callable[[str], None],
+    items: list,
+    get_name: Callable[[object], str],
+) -> list[tuple[str, float, float, Exception | None]]:


Why isn't this just a list[_OpResult]s ? Seems like if you're making dataclasses you should use them. Also, should _OpResult just be a sample & a convenience structure for making it? & then you add more metadata to it when needed. Or Make OpResult just the actual result without the name & only add that in a convenience structure that makes the sample? Esp if you remove exception handling.

Agreed — switched to fail-hard by default. _TimedAsync/_RunAsync now let exceptions propagate, so any failed op in concurrent_node_pool_ops or overlapping_cluster_update aborts the benchmark (via RunThreaded), with the real traceback. I kept the one exception you flagged: large_scale_provisioning tolerates partial failure — it reports how many ops failed and lists the failed pool names in metadata, via a separate _RunAsyncTolerant + _LargeScaleSamples path. This also let me drop the broad-except blocks.

hubatish · 2026-06-11T20:28:51Z

+  # ── Counts + success rate ──────────────────────────────────────────────
+  total = attempted_ops if attempted_ops is not None else len(results)
+  executed = len(results)
+  if total == 0:


added comment above about "should we just fail if any op failed"?

Done — fail-hard now: any op failure aborts the run (no success-rate to compute), so this counts block is gone for the fail-hard scenarios. Kept tolerance only in large_scale_provisioning per your 166 comment, which reports FailedOps + failed pool names.

hubatish · 2026-06-11T20:31:10Z

+
+
+def _OpSamples(
+    metric_prefix: str,


Wait, so why do OpResults have names if OpSamples overwrites them / relegates their original name to metadata? I think I'm missing stuff but will still leave comment.

Right — that was redundant. _OpResult→OpTiming now holds only (initiation_latency, end_to_end_latency); the name is supplied by _OpSamples via (name, OpTiming) pairs, so the result no longer carries a name the builder overwrites.

…folds GoogleCloudPlatform#6750)

…Timing/ThreadSafeResults redesign

hubatish reviewed Jun 8, 2026

View reviewed changes

Ashish Suneja added 3 commits June 9, 2026 12:45

kubernetes_management: address review (config anchor, Prepare docstri…

6bdc03e

…ng, remove dead code)

kubernetes_management: descriptive scenario names, grouped flags, sce…

c959389

…nario-flag cross-validation

ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from 6d7c0c6 to c959389 Compare June 9, 2026 12:45

kubernetes_management: split out upgrade path — create/delete only

bf8c1d1

ashishsuneja mentioned this pull request Jun 9, 2026

Mgmt plane upgrade ashishsuneja/PerfKitBenchmarker#10

Open

hubatish reviewed Jun 9, 2026

View reviewed changes

Ashish Suneja added 3 commits June 10, 2026 12:44

kubernetes_management: purge A/B/C naming, extract helpers, separate …

b0c04ec

…count metrics, fail-fast on zero ops

kubernetes_management: sweep node pools after each scenario instead o…

0e1dba3

…f at start; Cleanup reuses sweep

kubernetes_management: condense changelog, document _OpResult fields,…

620fd56

… parameterize name tests

hubatish reviewed Jun 11, 2026

View reviewed changes

Ashish Suneja added 3 commits June 12, 2026 09:38

kubernetes_management: remove BaseEksCluster AddNodepool no-op stub (…

e2a106f

…folds GoogleCloudPlatform#6750)

kubernetes_management: fail-hard by default, tolerant large_scale; Op…

17df7f8

…Timing/ThreadSafeResults redesign

kubernetes_management: rename scale->goal_nodepools (nit 334)

f431c3f

ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from f431c3f to cec3a9c Compare June 17, 2026 10:38

ashishsuneja closed this Jun 17, 2026

ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from cec3a9c to c0f8a7b Compare June 17, 2026 14:55

ashishsuneja reopened this Jun 17, 2026

ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from fef72ad to f431c3f Compare June 17, 2026 19:11

		)


		def _SweepNodePools(cluster: kubernetes_cluster.KubernetesCluster) -> None:

		# ---------------------------------------------------------------------------


		class _Results:

Conversation

ashishsuneja commented Jun 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ashishsuneja Jun 12, 2026 •

edited

Loading