Skip to content

kubernetes_management: add benchmark + base cluster management plane methods#6746

Open
ashishsuneja wants to merge 10 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_benchmark_base
Open

kubernetes_management: add benchmark + base cluster management plane methods#6746
ashishsuneja wants to merge 10 commits into
GoogleCloudPlatform:masterfrom
ashishsuneja:mgmt_plane_benchmark_base

Conversation

@ashishsuneja

Copy link
Copy Markdown

Summary
Adds full implementation of the kubernetes_management benchmark on top
of Zach's skeleton, and adds management plane abstract methods to
KubernetesCluster base class.

Main changes

  • kubernetes_management_benchmark.py: Scenarios A, B, C; guard clauses
    in _CleanStartSweep and Run(); no broad excepts
  • kubernetes_cluster.py: AddNodepool, CreateNodePool, DeleteNodePool,
    UpgradeNodePool, UpdateCluster (sync wrappers); async abstract
    methods; BareMinor, AdjacentMinorBelow helpers

How tested

  • 89 unit tests passing
  • EKS end-to-end: 99 pools, 100% success all 7 scenarios
  • pyink + lint-diffs clean

vm_count: 1
vm_spec:
GCP:
# us-central1-a: spec primary region for GCP

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of detail (zone & machine type) should go in like a design doc or similar but doesn't need to live in the code comments itself. We will often override these values when calling the benchmark anyway.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Dropped the hardcoded per-cloud machine types/zones and the rationale comments — it now uses the shared *default_dual_core anchor (same pattern as provision_node_pools), so the spec detail lives in default_config_constants.yaml and gets overridden via flags at call time as you noted.

)
_SCENARIOS = flags.DEFINE_list(
"k8s_mgmt_scenarios",
["A", "B", "C"],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A, B, C are not great scenario names. Make these more descriptive. This might warrant a big comment in addition to more descriptive names.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one. Renamed to concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning, with a block comment above _VALID_SCENARIOS documenting what each measures. I kept the internal RunScenario* helper names and the ScenarioA* metric tags as-is for now to avoid churning the test suite ahead of the split — happy to harmonize those too if you'd prefer.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).

Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the three names — they genuinely differ, and the adjective names each scenario's distinguishing axis (straight from the design doc): A is concurrent node-pool ops; B is a node-pool create overlapping a cluster update (the defining trait is overlap of two different op types, not concurrency of many of the same); C is large-scale provisioning (defined by volume — up to ~1000 pools — independent of execution mode). And done on the cleanup: purged all A/B/C references from function names, metric tags, and tests.

Comment thread perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py Outdated

def Prepare(benchmark_spec: bm_spec.BenchmarkSpec) -> None:
del benchmark_spec
"""Asserts the cluster is reachable; deploys spec-defined sleep workload."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always be the case.. Or it should fail a) obviously during setup or b) obviously early in the run. I guess this is y'all trying to do b) fail early in the run but I wonder how much it actually adds.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert isinstance(...) is there for type-narrowing (so pytype resolves cluster.k8s_version), not as a reachability check — the sleep-pod deploy below it is what actually confirms reachability. I've fixed the docstring that implied otherwise and added a comment noting the assert's real purpose. Happy to drop it entirely if you'd rather.

samples: list[sample.Sample] = []

if "A" in scenarios:
samples += _RunScenarioA(cluster, initial, target)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like target version is only used in ScenarioA. Back up in CheckPrerequisites consider checking eg "if target version set but scenario A not being run, fail". & then similar checks for the other scenarios - if scenario specific variable set but scenario is not being run, fail.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. CheckPrerequisites now fails if --k8s_mgmt_initial_version/--k8s_mgmt_target_version are set without concurrent_node_pool_ops, or --k8s_mgmt_scale_sweep without large_scale_provisioning, with tests for both. I deliberately left --k8s_mgmt_pipeline_scenario_a out of the cross-check: it defaults to True, so there's no reliable "explicitly set" signal to validate against, unlike the version/scale-sweep flags which default to None/empty. (Note: with the upgrade split, the version-flag cross-check moves to the follow-up PR, since --k8s_mgmt_target_version lives there now.)

Comment thread perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py Outdated
"""
raise NotImplementedError

def ResolveNodePoolVersions(self) -> tuple[str, str]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you overloading specifically this function in any of the children? If not, please remove. The helper as is doesn't add anything.. just call cluster.BareMinor & cluster.AdjacentMinorBelow directly from kubernetes_management_benchmark.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's overridden in all three providers, and GKE's override is load-bearing: GKE rejects bare-minor versions and needs a fully-qualified node version (e.g. 1.34.4-gke.1234), which it resolves via gcloud container get-server-config. Calling BareMinor/AdjacentMinorBelow directly from the benchmark would break GKE — this method is the provider seam that keeps Run() cloud-agnostic. EKS/AKS take the bare-minor path and I've refactored those two to reuse the BareMinor/AdjacentMinorBelow helpers instead of duplicating the parse inline.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, took a look at https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/6747/changes

IMO you could handle this in GKE with a shared helper + caching. ie have both call into _ResolveNodePoolVersions for GKE & save the results, then just return those saved result. Or potentially call get-server-config twice But does this version number ever change or is it always the same? I know we're doing upgrades/downgrades in this, but idk if that's for the cluster overall or just like particular nodepools or what. If yes, it does change / this approach would add too much complexity, then it is good justification for keeping the ResolveNodePoolVersions function. Thanks for the extra context.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it changes — and per node pool, not cluster-wide. The benchmark's whole point in concurrent_node_pool_ops is to create pools at an initial version (N-1) and then upgrade them to a distinct target (N), so initial and target are deliberately different values, resolved independently. They're node-pool versions: each pool is created at initial and upgraded to target individually; the control plane version isn't what we're varying.
On the caching idea — ResolveNodePoolVersions already resolves both initial and target in a single get-server-config call (returns the tuple), so we're not double-calling. The reason it stays a per-provider method rather than collapsing to bare BareMinor/AdjacentMinorBelow calls from the benchmark is that GKE needs the fully-qualified node version (e.g. 1.34.4-gke.1234) from get-server-config, while EKS/AKS take bare-minor and resolve differently — so the method is the provider seam that keeps Run() cloud-agnostic. Given the versions do change, I've kept it. Thanks for digging into #6747.

Ashish Suneja added 3 commits June 9, 2026 12:45
…methods

kubernetes_management_benchmark.py:
- Full implementation of Scenarios A, B, C on top of Zach's skeleton
- _CleanStartSweep: guard clause, let it fail (no broad except)
- Run(): guard clause for version resolution
- All helpers: _RunScenarioA/B/C, _TimedAsync, _RunAsync, _OpSamples
- _Results accumulator, _AggregateSamples, _OutlierSamples

kubernetes_cluster.py:
- AddNodepool: delegates to CreateNodePool for standard clusters
- CreateNodePool/DeleteNodePool/UpgradeNodePool/UpdateCluster: sync
  wrappers calling *Async + WaitForOperation
- Abstract methods: CreateNodePoolAsync, DeleteNodePoolAsync,
  UpgradeNodePoolAsync, UpdateClusterAsync, WaitForOperation,
  ResolveNodePoolVersions, GetNodePoolNames
- BareMinor, AdjacentMinorBelow: version helper functions

Tested:
- 89 unit tests passing
- EKS end-to-end: 99 pools, 100% success all 7 scenarios
- pyink + lint-diffs clean
@ashishsuneja ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from 6d7c0c6 to c959389 Compare June 9, 2026 12:45
Comment thread CHANGES.next.md
@@ -1,3 +1,10 @@
### New features:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an existing new features section below. Add to there.
Other comments:

  • Just one line about adding the benchmark is likely sufficient for all the PRs.
  • No need for author credit (PR history shows this)
  • No need to specify "add abstract methods" - again probably the whole PR stack can be summarized into a couple bullet points.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — moved it into the existing new-features section, collapsed to a single line, and dropped both the author credit and the abstract-methods bullet.

"""Holds timing and outcome for a single async management-plane operation."""

name: str
init_dur: float

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use full length names + provide g3doc descriptions for all the attributes. In particular I'm not sure what the difference is between "initial duration?" & "end to end duration?".

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed init_dur→initiation_latency and e2e_dur→end_to_end_latency with a full Attributes: block. The distinction: initiation latency is the time from issuing the async call until it's accepted and an operation handle returns (time to start); end-to-end latency is from issuing the call until the operation fully completes (initiation plus server-side execution).

)
_SCENARIOS = flags.DEFINE_list(
"k8s_mgmt_scenarios",
["A", "B", "C"],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing caution/waiting for me to take a look is a good idea in case I want to change them again - but I think concurrent_node_pool_ops, overlapping_cluster_update, and large_scale_provisioning mostly work. Some quibbles - Are any of them too long? Why is one concurrent vs the second is overlapping? & then the last large_scale one is sequential rather than concurrent? Or all they all concurrent? (If so perhaps drop or move to comments).

Once you lock down good names, please do rename everything to the new / have no references to scenario A/B/C.

Comment thread perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py Outdated
Comment thread perfkitbenchmarker/linux_benchmarks/kubernetes_management_benchmark.py Outdated
success = 0

for r in results:
if isinstance(r, tuple):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this theoretically has type list[_OpResult], from above.. is that not actually the case? should we do this conversion beforehand, or is it truly a mixed list, or is it an unnecessary conversion? again brittle is fine / preferred over handling cases which should fail or we generally expect to pass.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the isinstance(r, tuple) conversion was dead for production: _RunAsync always yields _OpResult objects, and only the tests were passing tuples. Removed the conversion and updated the tests to construct _OpResult directly, keeping the path strict.

dict(meta),
)
)
samples.append(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below you're doing a percentile / outlier based summation right? But you're also including results for every operation? How many operations do you expect?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both, by design: per-op _InitiationLatency/_EndToEndLatency samples preserve the raw distribution, while _AggregateSamples (Mean/StdDev/Min/Median/P90/P99/Max) and _OutlierSamples (IQR-fence) summarize over the successful ops. Op counts are scenario-driven — concurrent_node_pool_ops does --k8s_mgmt_concurrent_nodepools (default 5); large_scale_provisioning runs up to --k8s_mgmt_large_scale_nodepools or the --k8s_mgmt_scale_sweep values (hundreds to ~1000 per the design doc). Aggregates only emit at ≥2 successes and outliers at ≥4, so small scenarios skip them rather than emit noise

kubernetes_management_benchmark._ScenarioCName(7),
)

def testScenarioCNameFourDigitIndex(self):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condense these using a parameterized test

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — condensed the repetitive name tests into two @parameterized.named_parameters tests (three cases each).

cluster = _make_mock_cluster(
pool_names=['pkbma000', 'pkbmc0001', 'user-pool']
)
kubernetes_management_benchmark._CleanStartSweep(cluster)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably test as part of cleanup when moving clean start sweep call sites.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — with the sweep relocated to per-scenario teardown and Cleanup(), the test moved to those new call sites (testRunSweepsAfterEachScenario, plus the renamed SweepNodePoolsTest).

+ "If empty, uses --k8s_mgmt_large_scale_nodepools.",
)

# AKS caps node-pool names at 12 chars — keep all names within that limit.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move this to _ConcurrentPoolName's g3doc

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — _ConcurrentPoolName now has a g3doc, and the AKS-12-char note lives in it.

)


def _SweepNodePools(cluster: kubernetes_cluster.KubernetesCluster) -> None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to Clean or ClearNodePools, as you also use "Sweep" to mean "go through all of these different scale numbers" & its a bit confusing to have 2 meanings in the same file.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed _SweepNodePools → _ClearNodePools. Kept _SweepScales for the scale-loop, so "sweep" now has a single meaning in the file.

for scale in scales:
scale_samples = _ScaleToPoolCount(cluster, initial, scale)
for s in scale_samples:
s.metadata["large_scale_scale"] = str(scale)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name goal_nodepools ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed the scale/scales concept to goal_nodepools/goal_counts throughout _SweepScales and _ScaleToPoolCount, including the metadata key (large_scale_scale → goal_nodepools).

get_name=lambda cfg: cfg.name,
)
samples += _OpSamples(
"ConcurrentOps_Create", create_results, attempted_ops=len(pool_names)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the concurrent_nodepools value in metadata already? If not, add it here or there

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already there — concurrent_nodepools is added to run_meta in Run() and tagged onto every sample, so it's in the metadata for all scenarios.

# ---------------------------------------------------------------------------


class _Results:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like _OpResults nor _Results as names. I think these don't need the _ - any class can just be a Result. Maybe change this one to something like ThreadSafeResults?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — _Results → ThreadSafeResults (no leading underscore, as you suggested). _OpResult → OpTiming

name: str
initiation_latency: float
end_to_end_latency: float
error: Exception | None = None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again on "when to fail" - should we not just fail the whole benchmark run if an op fails? If you only run one scenario & one of its two ops fails.. then surely the benchmark overall should fail. But extrapolate that & you're only running like 4 or 5 ops in most benchmarks. Maybe the "Sweep through nodepool counts" shouldn't fail in case you eg put in a really high nodepool count & couldn't make it, but I think you could add such exception handling to that case in particular if indeed that's a concern.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked per both comments. OpTiming is now pure timing — (initiation_latency, end_to_end_latency), no name and no error field. The metric name is supplied by the sample builder (_OpSamples takes (name, OpTiming) pairs), so the result object no longer carries presentation data. This also removes the error field, which ties into the fail-hard change below.

wait_fn: Callable[[str], None],
items: list,
get_name: Callable[[object], str],
) -> list[tuple[str, float, float, Exception | None]]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this just a list[_OpResult]s ? Seems like if you're making dataclasses you should use them. Also, should _OpResult just be a sample & a convenience structure for making it? & then you add more metadata to it when needed. Or Make OpResult just the actual result without the name & only add that in a convenience structure that makes the sample? Esp if you remove exception handling.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — switched to fail-hard by default. _TimedAsync/_RunAsync now let exceptions propagate, so any failed op in concurrent_node_pool_ops or overlapping_cluster_update aborts the benchmark (via RunThreaded), with the real traceback. I kept the one exception you flagged: large_scale_provisioning tolerates partial failure — it reports how many ops failed and lists the failed pool names in metadata, via a separate _RunAsyncTolerant + _LargeScaleSamples path. This also let me drop the broad-except blocks.

# ── Counts + success rate ──────────────────────────────────────────────
total = attempted_ops if attempted_ops is not None else len(results)
executed = len(results)
if total == 0:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment above about "should we just fail if any op failed"?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — fail-hard now: any op failure aborts the run (no success-rate to compute), so this counts block is gone for the fail-hard scenarios. Kept tolerance only in large_scale_provisioning per your 166 comment, which reports FailedOps + failed pool names.



def _OpSamples(
metric_prefix: str,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, so why do OpResults have names if OpSamples overwrites them / relegates their original name to metadata? I think I'm missing stuff but will still leave comment.

@ashishsuneja ashishsuneja Jun 12, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — that was redundant. _OpResult→OpTiming now holds only (initiation_latency, end_to_end_latency); the name is supplied by _OpSamples via (name, OpTiming) pairs, so the result no longer carries a name the builder overwrites.

@ashishsuneja ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from f431c3f to cec3a9c Compare June 17, 2026 10:38
@ashishsuneja ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from cec3a9c to c0f8a7b Compare June 17, 2026 14:55
@ashishsuneja ashishsuneja reopened this Jun 17, 2026
@ashishsuneja ashishsuneja force-pushed the mgmt_plane_benchmark_base branch from fef72ad to f431c3f Compare June 17, 2026 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants