add agentic benchmarking on gke #6772
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
| # Used with --gke_provision_mode=native
| #
| # Prerequisites (run once before PKB):
| # python tools/agentic-benchmark/scripts/prerequisite_setup.py \
This tool isn't included in the PR, so this comment doesn't need to be here.
| # For sweeps (cluster pre-exists, PKB skips provision/teardown):
| # The sweep bridge injects --run_stage=run,cleanup automatically.
| gke_python_density:
Internally we put a lot of this info, but externally it is useful. It's probably a good addition.
| @@ -0,0 +1,240 @@
| from google.adk.agents import LlmAgent
| from google.adk.code_executors import GkeCodeExecutor
Where is this file run? From the same machine running PKB or a different one?
| six>=1.13.0
| timeout-decorator
| scipy
| matplotlib
I don't see a reference to this elsewhere with a ctrl-f; is it leftover from an earlier version?
In general we prefer not making many changes to requirements.txt.
| ' beyond the default node pool (e.g. kubernetes_node_scale with 5k nodes).',
| )
| GKE_USE_BETA = flags.DEFINE_boolean(
If we add this flag, IMO it should just be "gcloud_use_beta" (or better, an enum flag such as "gcloud_beta_version" with the values alpha, beta, and None), and referencing it directly from gcp/util.py seems best.
Alternatively, we often say in the provider: "if a preview feature is used, cmd.use_beta_gcloud = True". In general, which feature are you using that needs beta?
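To make the enum suggestion concrete, here is a rough sketch of how a release-track setting could feed into command construction. It uses only the standard library, and `GcloudTrack` and `build_gcloud_command` are hypothetical names for illustration, not existing PKB APIs; the real change would presumably live wherever gcp/util.py assembles gcloud commands.

```python
import enum


class GcloudTrack(enum.Enum):
  """Hypothetical release-track setting; not an existing PKB type."""
  GA = ''
  ALPHA = 'alpha'
  BETA = 'beta'


def build_gcloud_command(track, *args):
  """Returns a gcloud argv list, inserting the release track if one is set."""
  cmd = ['gcloud']
  if track.value:
    cmd.append(track.value)
  cmd.extend(args)
  return cmd


print(build_gcloud_command(
    GcloudTrack.BETA, 'container', 'clusters', 'create', 'my-cluster'))
```

A single enum keeps alpha and beta mutually exclusive, which a pair of boolean flags would not guarantee.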
| by all seven UC benchmark scripts. Each benchmark's Provision() and
| Teardown() functions delegate to the public functions in this module.
| Infrastructure created (in order):
The very premise of this file is incorrect. PKB (and especially, e.g., _Create in google_kubernetes_engine.py) should be handling all of the provisioning logic.
I'm not sure how much of this is a) completely unnecessary because it's handled elsewhere in PKB (for example, we set up subnets and networks automatically if you don't specify a network) or b) indeed necessary but should be located in some other Resource.py class.

+1. Let's set up the cloud infra the PKB-native way.
| chromium_replicas = FLAGS.gke_chromium_replicas
| manifest = """---
| apiVersion: extensions.agents.x-k8s.io/v1alpha1
This should go in a .yaml.j2 file.
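As an illustration of what moving the manifest into a template buys, here is a small sketch that renders a Jinja2 template with the replica count passed in as a variable. The template text stands in for a hypothetical file such as warmpool.yaml.j2; the `kind`, the spec fields, and the function name `render_manifest` are assumptions for the example, and PKB's own manifest helpers would presumably do the rendering in the real change.

```python
import jinja2

# Stand-in for the contents of a hypothetical warmpool.yaml.j2 file.
# Only the apiVersion comes from the PR; the other fields are assumptions.
MANIFEST_TEMPLATE = """\
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: {{ kind }}
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
spec:
  replicas: {{ replicas }}
"""


def render_manifest(**values):
  """Renders the template; a missing variable raises instead of rendering empty."""
  env = jinja2.Environment(undefined=jinja2.StrictUndefined)
  return env.from_string(MANIFEST_TEMPLATE).render(**values)


print(render_manifest(
    kind='SandboxWarmPool', name='chromium-pool', namespace='agentic',
    replicas=3))
```

Keeping the YAML out of the Python source also means the manifest can be reviewed and linted as YAML.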
| return _RunCmd(cmd, check=check, timeout=timeout)
| def _KubectlApply(manifest_str):
Why have you rewritten kubectl apply and _RunKubectl when implementations already exist in container_service/kubectl.py?
| @@ -0,0 +1,362 @@
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).
For easier review and faster iteration, I'd recommend keeping one benchmark in this PR and leaving the other benchmarks for followup PRs. My recommendation is to keep the Python density benchmark.
| @@ -0,0 +1,362 @@
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).
Let's drop "(Use Case B)" from the description. For the published PKB benchmarks, the documentation should clearly state what the benchmarks are about. The ordering of A,B,C... will become stale and confusing to readers.
| @@ -0,0 +1,362 @@
| """PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).
Can we drop "GKE" from the file name and the description? Based on the path this is a Kubernetes benchmark, and presumably this benchmark can be reused for other cloud providers without significant change, right?
| # ---------------------------------------------------------------------------
| flags.DEFINE_integer(
| "gke_python_density",
nit (on "gke_python_density"): Shall we name the flag something like "concurrent_sandbox_count"? "gke" and "python" can already be implied from the file name and description of the benchmark.
| flags.DEFINE_integer(
| "gke_python_density_sample_warmup",
| 0,
| "Number of warmup iterations per session (excluded from stats).",
It's unclear what "warmup iterations" means as it's not mentioned before. Shall we document the workflow in the benchmark description?
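One plausible reading of the help text, sketched below for the sake of discussion: the first N measurements of each session are discarded before any statistics are computed. The function name `summarize_latencies` and the returned keys are hypothetical, and the benchmark's actual workflow may differ.

```python
import statistics


def summarize_latencies(latencies_ms, warmup_iterations=0):
  """Drops the first warmup_iterations values, then summarizes the rest."""
  measured = latencies_ms[warmup_iterations:]
  if not measured:
    raise ValueError('No samples left after discarding warmup iterations.')
  return {
      'count': len(measured),
      'mean_ms': statistics.mean(measured),
      'median_ms': statistics.median(measured),
  }


# The first value plays the role of a slow cold-start iteration.
print(summarize_latencies([950.0, 410.0, 400.0, 390.0], warmup_iterations=1))
```

If this matches the intent, stating it in the benchmark docstring would answer the question for readers of the flag help.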
| # ---------------------------------------------------------------------------
| def _emit(samples, agg, agg_key, metric_suffix, unit, namespace, extra):
Can you document how the metrics emit works and what the parameters are?
Agentic Workload Benchmarking for GKE (PKB Extension)
Summary
Adds a complete benchmarking framework for Agentic Workloads on Google Kubernetes Engine (GKE) — specifically measuring per-operation performance of untrusted Python code execution and headless Chromium browser tasks running under gVisor (GKE Agent Sandbox) isolation.
Motivation
AI agent systems require ephemeral, isolated execution environments (sandboxes) for running untrusted code. Understanding the performance characteristics of these sandboxes under gVisor — including cold-start latency, execution overhead, memory density limits, and scheduling throughput — is critical for production capacity planning.
This framework enables systematic, repeatable measurement of these characteristics across multiple GCP machine families.
Architecture
Benchmark Definitions (7 Use Cases)
- `gke_snapshot`
- `gke_python_density`
- `gke_chromium_density`
- `gke_payload`
- `gke_warmpool`
- `gke_qps`
- `gke_deletion`

Shared Utilities
- `gke_benchmark_utils.py`
- `gke_deploy_utils.py`
- `gke_provision_utils.py`
- `gke_image_build_utils.py`
- `gke_prerequisite_setup.py`

Dual Provisioning Modes
- `custom` mode: Direct `gcloud` calls for full infrastructure control
- `native` mode: Uses PKB's built-in `container_cluster` provisioner with prerequisite script for resources PKB cannot manage

PKB Provider Extensions
Small additions to support GKE preview features:
- `--gke_use_beta` flag (forces `gcloud beta container clusters create`)
- `--gke_additional_flags` list (appended to cluster create)
- `--gke_additional_nodepool_flags` list (appended to node pool create)

In-Cluster Components
ADK Agent (`workloads/adk_agent/`)
A FastAPI service deployed inside GKE:
- Endpoints: `/benchmark/python/density`, `/benchmark/python/payload`, `/benchmark/python/qps`, `/benchmark/chromium/density`
- Connection: `DirectConnection` (in-cluster) or `kubectl port-forward` (dev mode)

Sandbox Scripts (`sandboxed_apps/`)
- `benchmark_density.py`: CPU-bound, syscall-heavy, and import-heavy tasks with RSS tracking
- `benchmark_payload.py`: Payload generation, serialization, and stdout transfer measurement
- `benchmark_qps.py`: Minimal script proving sandbox liveness
- `benchmark_density.js`: Playwright-driven Chromium interaction benchmark

Vibe Coding Workloads (`workloads/vibe_coding/`)
Startup scripts simulating real-world agentic cold-starts:
- `startup_pip_fastapi.sh`: pip install + FastAPI server boot
- `startup_npm_vite.sh`: npm install + Vite dev server boot

Usage
Prerequisites (once per environment)
```bash
python -m perfkitbenchmarker.linux_benchmarks.kubernetes.agentic.gke_prerequisite_setup \
  --project_id=sada-gke-benchmarking2 \
  --region=us-central1 \
  --zone=us-central1-a \
  --machine_type=c4-standard-8
```

Provision Cluster

```bash
python pkb.py --benchmarks=gke_python_density \
  --run_stage=provision \
  --gke_provision_mode=native \
  --project=sada-gke-benchmarking2 \
  --owner=george-kalisse \
  --benchmark_config_file=k8s_agents/config/native_provision_config.yaml \
  --gce_network_name=george-agentic-vpc \
  --gce_subnet_region=us-central1 \
  --zone=us-central1-a \
  --container_cluster_version=1.35.3-gke.1389000 \
  --gke_use_beta=true \
  --gke_additional_flags="--enable-pod-snapshots,--enable-dataplane-v2,--enable-private-nodes,--enable-ip-alias,--master-ipv4-cidr=172.16.0.0/28,--workload-pool=sada-gke-benchmarking2.svc.id.goog,--subnetwork=george-agentic-subnet,--enable-master-authorized-networks,--master-authorized-networks=$(curl -s ifconfig.me)/32" \
  --gke_additional_nodepool_flags="--max-pods-per-node=250" \
  --gke_enable_shielded_nodes=false \
  --run_uri=test \
  --temp_dir=./testing/pkb/c4-standard-8/ucb
```

Run Benchmark

```bash
python pkb.py --benchmarks=gke_python_density \
  --run_stage=prepare,run,cleanup \
  --gke_provision_mode=native \
  --gke_project_id=sada-gke-benchmarking2 \
  --gke_region=us-central1 \
  --gke_zone=us-central1-a \
  --gke_sandbox_machine_type=c4-standard-8 \
  --gke_namespace=agentic \
  --gke_sandbox_version=v0.4.6 \
  --gke_python_density=4 \
  --gke_python_density_sample_count=20 \
  --gke_python_density_sample_warmup=0 \
  --gke_python_density_patch_warmpool=true \
  --gke_python_density_exec_timeout=600 \
  --gke_machine_type=c4-standard-8 \
  --gke_gvisor=true \
  --gke_api_url=http://localhost:8080 \
  --run_uri=test \
  --temp_dir=./testing/pkb/c4-standard-8/ucb
```