add agentic benchmarking on gke by george-kalisse-sada · Pull Request #6772 · GoogleCloudPlatform/PerfKitBenchmarker

george-kalisse-sada · 2026-06-16T08:13:52Z

Agentic Workload Benchmarking for GKE (PKB Extension)

Summary

Adds a complete benchmarking framework for Agentic Workloads on Google Kubernetes Engine (GKE) — specifically measuring per-operation performance of untrusted Python code execution and headless Chromium browser tasks running under gVisor (GKE Agent Sandbox) isolation.

Motivation

AI agent systems require ephemeral, isolated execution environments (sandboxes) for running untrusted code. Understanding the performance characteristics of these sandboxes under gVisor — including cold-start latency, execution overhead, memory density limits, and scheduling throughput — is critical for production capacity planning.

This framework enables systematic, repeatable measurement of these characteristics across multiple GCP machine families.

Architecture

Benchmark Definitions (7 Use Cases)

| Benchmark | Use Case | Measures |
| --- | --- | --- |
| `gke_snapshot` | UC-A: Cold Start & Snapshot | Pod snapshot create/restore latency under CRIU |
| `gke_python_density` | UC-B: Python Density | CEL, TTFE, RSS growth at varying concurrency |
| `gke_chromium_density` | UC-C: Chromium Density | Interaction latency, screenshot time at scale |
| `gke_payload` | UC-D: Payload Transfer | Sandbox→orchestrator data transfer saturation |
| `gke_warmpool` | UC-E: Warmpool Scale-Up | Bulk provisioning speed (0→N pods) |
| `gke_qps` | UC-F: QPS Saturation | Scheduling throughput until pool drain |
| `gke_deletion` | UC-G: Deletion & Cleanup | Bulk deletion latency and IP reclamation |

Shared Utilities

| Module | Purpose |
| --- | --- |
| `gke_benchmark_utils.py` | Agent API interaction, kubectl helpers, warm pool management, port-forward manager, sample construction |
| `gke_deploy_utils.py` | Idempotent workload deployment (CRDs, templates, warm pools, router, ADK agent, PSI reader) |
| `gke_provision_utils.py` | Full GKE infrastructure lifecycle (VPC, NAT, cluster, node pools, AR, IAM) |
| `gke_image_build_utils.py` | Container image builds via Cloud Build (ADK agent, Chrome sandbox, Sandbox Router) |
| `gke_prerequisite_setup.py` | Standalone script for pre-PKB infrastructure (VPC, NAT, AR, SA, images) |

Dual Provisioning Modes

- custom mode: Direct gcloud calls for full infrastructure control
- native mode: Uses PKB's built-in `container_cluster` provisioner, with the prerequisite script covering resources PKB cannot manage

PKB Provider Extensions

Small additions to support GKE preview features:

- `--gke_use_beta` flag (forces `gcloud beta container clusters create`)
- `--gke_additional_flags` list (appended to cluster create)
- `--gke_additional_nodepool_flags` list (appended to node pool create)
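
The effect of these flags on the cluster-create command can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_cluster_create_cmd` is a hypothetical helper name, and PKB's real command construction goes through its own gcloud wrapper.

```python
from typing import List, Sequence


def build_cluster_create_cmd(
    cluster_name: str,
    use_beta: bool = False,
    additional_flags: Sequence[str] = (),
) -> List[str]:
    """Hypothetical helper: assemble a `gcloud container clusters create` argv."""
    cmd = ["gcloud"]
    if use_beta:
        # Mirrors --gke_use_beta: route the call through the beta track.
        cmd.append("beta")
    cmd += ["container", "clusters", "create", cluster_name]
    # Mirrors --gke_additional_flags: each list element is appended verbatim.
    cmd += list(additional_flags)
    return cmd


if __name__ == "__main__":
    argv = build_cluster_create_cmd(
        "demo-cluster",
        use_beta=True,
        additional_flags=["--enable-ip-alias", "--enable-private-nodes"],
    )
    print(" ".join(argv))
```

The node pool flag list would presumably be appended the same way to the node pool create command.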

In-Cluster Components

ADK Agent (`workloads/adk_agent/`)

A FastAPI service deployed inside GKE that:

- Exposes REST endpoints for each benchmark type (`/benchmark/python/density`, `/benchmark/python/payload`, `/benchmark/python/qps`, `/benchmark/chromium/density`)
- Uses a Mock LLM (no real model calls) to drive the ADK Runner through sandbox claim→execute→release cycles
- Connects to sandboxes via DirectConnection (in-cluster) or kubectl port-forward (dev mode)
- Measures both orchestrator-side and sandbox-side metrics
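
To make the claim→execute→release cycle concrete, here is a minimal sketch of timing each phase from the orchestrator side. It is an illustration under assumptions, not the ADK agent's code: `FakeSandboxPool` and `run_cycle` are hypothetical names, and the pool is an in-memory stand-in for real sandbox claims against the cluster.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeSandboxPool:
    """In-memory stand-in for a warm pool (hypothetical, for illustration)."""

    free: List[str] = field(default_factory=lambda: ["sbx-0", "sbx-1"])

    def claim(self) -> str:
        return self.free.pop()

    def release(self, sandbox_id: str) -> None:
        self.free.append(sandbox_id)


def run_cycle(pool: FakeSandboxPool, task: Callable[[str], object]) -> Dict[str, float]:
    """Time each phase of one claim -> execute -> release cycle, in seconds."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    sandbox_id = pool.claim()
    timings["claim_s"] = time.perf_counter() - start

    try:
        start = time.perf_counter()
        task(sandbox_id)
        timings["execute_s"] = time.perf_counter() - start
    finally:
        # Always return the sandbox, even if the task raised.
        start = time.perf_counter()
        pool.release(sandbox_id)
        timings["release_s"] = time.perf_counter() - start

    return timings


if __name__ == "__main__":
    print(run_cycle(FakeSandboxPool(), lambda sandbox_id: sum(range(10_000))))
```

Sandbox-side metrics would be reported by the script running inside the sandbox and merged with these orchestrator-side timings.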

Sandbox Scripts (`sandboxed_apps/`)

- `benchmark_density.py`: CPU-bound, syscall-heavy, and import-heavy tasks with RSS tracking
- `benchmark_payload.py`: Payload generation, serialization, and stdout transfer measurement
- `benchmark_qps.py`: Minimal script proving sandbox liveness
- `benchmark_density.js`: Playwright-driven Chromium interaction benchmark
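
As a rough idea of what "tasks with RSS tracking" can look like, the sketch below times a CPU-bound and an import-heavy task and records peak RSS growth using only the standard library. The function names and task bodies are assumptions for illustration; the actual `benchmark_density.py` may measure RSS differently (for example, current rather than peak RSS). The `resource` module is Unix-only.

```python
import importlib
import resource
import sys
import time
from typing import Callable, Dict


def peak_rss_kib() -> int:
    """Peak resident set size of this process, in KiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak // 1024 if sys.platform == "darwin" else peak


def cpu_bound(n: int = 200_000) -> int:
    """Pure-Python arithmetic loop; no I/O."""
    return sum(i * i for i in range(n))


def import_heavy() -> int:
    """Import a few stdlib modules (illustrative choice of modules)."""
    names = ["json", "decimal", "email.parser"]
    return len([importlib.import_module(name) for name in names])


def measure(task: Callable[[], object]) -> Dict[str, object]:
    """Run one task and report wall time and peak RSS growth."""
    rss_before = peak_rss_kib()
    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start
    return {
        "task": task.__name__,
        "elapsed_s": elapsed,
        "peak_rss_growth_kib": peak_rss_kib() - rss_before,
    }


if __name__ == "__main__":
    for task in (cpu_bound, import_heavy):
        print(measure(task))
```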

Vibe Coding Workloads (`workloads/vibe_coding/`)

Startup scripts simulating real-world agentic cold-starts:

- `startup_pip_fastapi.sh`: pip install + FastAPI server boot
- `startup_npm_vite.sh`: npm install + Vite dev server boot
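
One simple way to turn such a startup script into a cold-start number is to time how long it takes until the server's port accepts connections. The sketch below assumes that definition of "ready"; `wait_for_port` is a hypothetical helper, and the PR's own measurement may use a different readiness signal.

```python
import socket
import time


def wait_for_port(
    host: str,
    port: int,
    timeout_s: float = 60.0,
    interval_s: float = 0.1,
) -> float:
    """Return seconds elapsed until host:port accepts a TCP connection.

    Raises TimeoutError if the port is still not accepting connections
    once timeout_s has elapsed.
    """
    start = time.perf_counter()
    deadline = start + timeout_s
    while time.perf_counter() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return time.perf_counter() - start
        except OSError:
            # Not listening yet (refused or timed out); retry shortly.
            time.sleep(interval_s)
    raise TimeoutError(f"{host}:{port} not ready after {timeout_s}s")
```

A caller would launch the startup script, then call `wait_for_port("127.0.0.1", 8000)` (the port here is only an example) and record the returned duration as the cold-start latency.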

Usage

### Prerequisites (once per environment)

```bash
python -m perfkitbenchmarker.linux_benchmarks.kubernetes.agentic.gke_prerequisite_setup \
    --project_id=sada-gke-benchmarking2 \
    --region=us-central1 \
    --zone=us-central1-a \
    --machine_type=c4-standard-8
```

### Provision Cluster

```bash
python pkb.py --benchmarks=gke_python_density \
    --run_stage=provision \
    --gke_provision_mode=native \
    --project=sada-gke-benchmarking2 \
    --owner=george-kalisse \
    --benchmark_config_file=k8s_agents/config/native_provision_config.yaml \
    --gce_network_name=george-agentic-vpc \
    --gce_subnet_region=us-central1 \
    --zone=us-central1-a \
    --container_cluster_version=1.35.3-gke.1389000 \
    --gke_use_beta=true \
    --gke_additional_flags="--enable-pod-snapshots,--enable-dataplane-v2,--enable-private-nodes,--enable-ip-alias,--master-ipv4-cidr=172.16.0.0/28,--workload-pool=sada-gke-benchmarking2.svc.id.goog,--subnetwork=george-agentic-subnet,--enable-master-authorized-networks,--master-authorized-networks=$(curl -s ifconfig.me)/32" \
    --gke_additional_nodepool_flags="--max-pods-per-node=250" \
    --gke_enable_shielded_nodes=false \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb
```

### Run Benchmark

```bash
python pkb.py --benchmarks=gke_python_density \
    --run_stage=prepare,run,cleanup \
    --gke_provision_mode=native \
    --gke_project_id=sada-gke-benchmarking2 \
    --gke_region=us-central1 \
    --gke_zone=us-central1-a \
    --gke_sandbox_machine_type=c4-standard-8 \
    --gke_namespace=agentic \
    --gke_sandbox_version=v0.4.6 \
    --gke_python_density=4 \
    --gke_python_density_sample_count=20 \
    --gke_python_density_sample_warmup=0 \
    --gke_python_density_patch_warmpool=true \
    --gke_python_density_exec_timeout=600 \
    --gke_machine_type=c4-standard-8 \
    --gke_gvisor=true \
    --gke_api_url=http://localhost:8080 \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb
```

google-cla · 2026-06-16T08:13:57Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

hubatish · 2026-06-16T16:55:30Z

+# Used with --gke_provision_mode=native
+#
+# Prerequisites (run once before PKB):
+#   python tools/agentic-benchmark/scripts/prerequisite_setup.py \


This tool isn't being included & therefore this comment doesn't need to be here.

hubatish · 2026-06-16T16:59:36Z

+# For sweeps (cluster pre-exists, PKB skips provision/teardown):
+#   The sweep bridge injects --run_stage=run,cleanup automatically.
+
+gke_python_density:


Internally we put a lot of this info but externally it is useful.. it's probably a good addition.

hubatish · 2026-06-16T17:00:43Z

@@ -0,0 +1,240 @@
+from google.adk.agents import LlmAgent
+from google.adk.code_executors import GkeCodeExecutor


Where is this file run? From the same machine running PKB or a different one?

hubatish · 2026-06-16T17:04:24Z

 six>=1.13.0
 timeout-decorator
 scipy
+matplotlib


I don't see a reference to this elsewhere with a ctrl-f; is it leftover from an earlier version?
In general we prefer not making many changes to requirements.txt.

hubatish · 2026-06-16T17:11:11Z

    ' beyond the default node pool (e.g. kubernetes_node_scale with 5k nodes).',
 )
+
+GKE_USE_BETA = flags.DEFINE_boolean(


If we add this flag, IMO just make it "gcloud_use_beta" (or actually an enum, "gcloud_beta_version", with values alpha, beta, None); having it referenced from gcp/util.py directly seems best.

Alternatively we often will say in the provider "if preview feature used, cmd.use_beta_gcloud = True". In general what feature are you using that needs beta?

hubatish · 2026-06-16T17:22:22Z

+by all seven UC benchmark scripts.  Each benchmark's Provision() and
+Teardown() functions delegate to the public functions in this module.
+
+Infrastructure created (in order):


The very premise of this file is incorrect. PKB (and especially, e.g., google_kubernetes_engine.py _Create) should be handling all of the provisioning logic.

I'm not sure how much of this is a) completely unnecessary because it's handled elsewhere in PKB (for example, we set up subnets & networks automatically if you don't specify a network) or b) indeed necessary but should be located in some other Resource.py class.

hubatish · 2026-06-16T17:28:38Z

+    chromium_replicas = FLAGS.gke_chromium_replicas
+
+    manifest = """---
+apiVersion: extensions.agents.x-k8s.io/v1alpha1


This manifest should go in some .yaml.j2 file.

hubatish · 2026-06-16T17:29:48Z

+    return _RunCmd(cmd, check=check, timeout=timeout)
+
+
+def _KubectlApply(manifest_str):


Why have you rewritten kubectl apply & _RunKubectl when implementations already exist in container_service/kubectl.py?

roycaihw · 2026-06-16T23:44:48Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


For easier review and faster iteration, I'd recommend keeping one benchmark in this PR and leaving the other benchmarks for follow-up PRs. My recommendation is to keep the Python density benchmark.

roycaihw · 2026-06-16T23:47:27Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


Let's drop "(Use Case B)" from the description. For the published PKB benchmarks, the documentation should clearly state what the benchmarks are about. The ordering of A,B,C... will become stale and confusing to readers.

roycaihw · 2026-06-16T23:49:19Z

@@ -0,0 +1,362 @@
+"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).


Can we drop "GKE" from the file name and the description? Based on the path this is a Kubernetes benchmark, and presumably this benchmark can be reused for other cloud providers without significant change, right?

roycaihw · 2026-06-16T23:54:01Z

+# ---------------------------------------------------------------------------
+
+flags.DEFINE_integer(
+    "gke_python_density",


gke_python_density
nit: Shall we name the flag something like "concurrent_sandbox_count"? gke and python can already be implied based on the file name and description of the benchmark.

roycaihw · 2026-06-16T23:56:16Z

+flags.DEFINE_integer(
+    "gke_python_density_sample_warmup",
+    0,
+    "Number of warmup iterations per session (excluded from stats).",


It's unclear what "warmup iterations" means as it's not mentioned before. Shall we document the workflow in the benchmark description?
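
One possible reading of "warmup iterations", sketched here as an assumption since the workflow isn't documented in the PR: each session runs `warmup + sample_count` iterations and only the last `sample_count` contribute to the reported stats. `timed_iterations` is a hypothetical name.

```python
import statistics
import time
from typing import Callable, Dict


def timed_iterations(
    task: Callable[[], object],
    sample_count: int,
    warmup: int = 0,
) -> Dict[str, float]:
    """Run warmup + sample_count iterations; report stats on the latter only."""
    if sample_count < 1:
        raise ValueError("sample_count must be at least 1")
    latencies = []
    for i in range(warmup + sample_count):
        start = time.perf_counter()
        task()
        elapsed = time.perf_counter() - start
        if i >= warmup:
            # Warmup iterations are executed but excluded from the stats.
            latencies.append(elapsed)
    return {
        "count": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }
```

If that matches the intent, a short paragraph along these lines in the benchmark docstring would make the flag self-explanatory.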

roycaihw · 2026-06-17T00:00:34Z

+by all seven UC benchmark scripts.  Each benchmark's Provision() and
+Teardown() functions delegate to the public functions in this module.
+
+Infrastructure created (in order):


+1. Let's set up the cloud infra the PKB-native way.

roycaihw · 2026-06-17T00:05:19Z

+# ---------------------------------------------------------------------------
+
+
+def _emit(samples, agg, agg_key, metric_suffix, unit, namespace, extra):


Can you document how the metric emission works and what the parameters are?
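
For reference, this is the level of documentation that would help. The sketch below is a guess at the semantics from the parameter names alone; the body of `_emit` isn't shown in this hunk, so the behaviour, the `python_density_` metric prefix, and the local `Sample` stand-in (in place of PKB's sample type) are all assumptions.

```python
from typing import Any, Dict, List, NamedTuple


class Sample(NamedTuple):
    """Local stand-in for a PKB sample record (illustrative only)."""

    metric: str
    value: float
    unit: str
    metadata: Dict[str, Any]


def emit(
    samples: List[Sample],
    agg: Dict[str, Any],
    agg_key: str,
    metric_suffix: str,
    unit: str,
    namespace: str,
    extra: Dict[str, Any],
) -> None:
    """Append one sample for agg[agg_key], if that key is present.

    Args:
      samples: Output list; the new sample is appended in place.
      agg: Aggregated results, keyed by statistic name.
      agg_key: Key in `agg` whose value becomes the sample value.
      metric_suffix: Appended to the (assumed) metric prefix.
      unit: Unit string for the sample, e.g. "seconds".
      namespace: Kubernetes namespace, recorded in the metadata.
      extra: Additional metadata merged into the sample.
    """
    value = agg.get(agg_key)
    if value is None:
        return  # Nothing measured for this key; skip silently.
    metadata = dict(extra)  # Copy so the caller's dict is not mutated.
    metadata["namespace"] = namespace
    samples.append(
        Sample(f"python_density_{metric_suffix}", float(value), unit, metadata)
    )
```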

roycaihw · 2026-06-17T16:01:21Z

cc @yuanwang04 @oceanxie1

add agentic benchmarking on gke

f614265
