add agentic benchmarking on gke #6772

Open
george-kalisse-sada wants to merge 1 commit into GoogleCloudPlatform:master from george-kalisse-sada:sada-gke-agentic-benchmarking
@george-kalisse-sada

Agentic Workload Benchmarking for GKE (PKB Extension)

Summary

Adds a complete benchmarking framework for agentic workloads on Google Kubernetes Engine (GKE). It measures per-operation performance of untrusted Python code execution and headless Chromium browser tasks running under gVisor (GKE Agent Sandbox) isolation.


Motivation

AI agent systems require ephemeral, isolated execution environments (sandboxes) for running untrusted code. Production capacity planning depends on understanding how these sandboxes perform under gVisor, including cold-start latency, execution overhead, memory density limits, and scheduling throughput.

This framework enables systematic, repeatable measurement of these characteristics across multiple GCP machine families.


Architecture

Benchmark Definitions (7 Use Cases)

| Benchmark | Use Case | Measures |
|---|---|---|
| gke_snapshot | UC-A: Cold Start & Snapshot | Pod snapshot create/restore latency under CRIU |
| gke_python_density | UC-B: Python Density | CEL, TTFE, RSS growth at varying concurrency |
| gke_chromium_density | UC-C: Chromium Density | Interaction latency, screenshot time at scale |
| gke_payload | UC-D: Payload Transfer | Sandbox→orchestrator data transfer saturation |
| gke_warmpool | UC-E: Warmpool Scale-Up | Bulk provisioning speed (0→N pods) |
| gke_qps | UC-F: QPS Saturation | Scheduling throughput until pool drain |
| gke_deletion | UC-G: Deletion & Cleanup | Bulk deletion latency and IP reclamation |

Shared Utilities

| Module | Purpose |
|---|---|
| gke_benchmark_utils.py | Agent API interaction, kubectl helpers, warm pool management, port-forward manager, sample construction |
| gke_deploy_utils.py | Idempotent workload deployment (CRDs, templates, warm pools, router, ADK agent, PSI reader) |
| gke_provision_utils.py | Full GKE infrastructure lifecycle (VPC, NAT, cluster, node pools, AR, IAM) |
| gke_image_build_utils.py | Container image builds via Cloud Build (ADK agent, Chrome sandbox, Sandbox Router) |
| gke_prerequisite_setup.py | Standalone script for pre-PKB infrastructure (VPC, NAT, AR, SA, images) |

Dual Provisioning Modes

  • custom mode: Makes direct gcloud calls for full infrastructure control
  • native mode: Uses PKB's built-in container_cluster provisioner, with a prerequisite script for the resources PKB cannot manage

PKB Provider Extensions

Small additions to support GKE preview features:

  • --gke_use_beta flag (forces gcloud beta container clusters create)
  • --gke_additional_flags list (appended to cluster create)
  • --gke_additional_nodepool_flags list (appended to node pool create)
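As a rough illustration of how the first two flags could shape the gcloud invocation, here is a minimal sketch. The helper name `build_cluster_create_cmd` and its parameters are hypothetical and are not taken from the PR; only the flag semantics described in the list above are assumed.

```python
from typing import List, Sequence


def build_cluster_create_cmd(
    cluster_name: str,
    use_beta: bool = False,
    additional_flags: Sequence[str] = (),
) -> List[str]:
    """Assemble a gcloud cluster-create argument list (hypothetical helper)."""
    cmd = ["gcloud"]
    if use_beta:
        # Mirrors --gke_use_beta: route through the beta command track.
        cmd.append("beta")
    cmd += ["container", "clusters", "create", cluster_name]
    # Mirrors --gke_additional_flags: appended verbatim, in the given order.
    cmd += list(additional_flags)
    return cmd


cmd = build_cluster_create_cmd(
    "demo-cluster",
    use_beta=True,
    additional_flags=["--enable-dataplane-v2", "--enable-ip-alias"],
)
print(" ".join(cmd))
```

Presumably --gke_additional_nodepool_flags follows the same append-only pattern for the node pool create command.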

In-Cluster Components

ADK Agent (workloads/adk_agent/)

A FastAPI service deployed inside GKE that:

  • Exposes REST endpoints for each benchmark type (/benchmark/python/density, /benchmark/python/payload, /benchmark/python/qps, /benchmark/chromium/density)
  • Uses a Mock LLM (no real model calls) to drive the ADK Runner through sandbox claim→execute→release cycles
  • Connects to sandboxes via DirectConnection (in-cluster) or kubectl port-forward (dev mode)
  • Measures both orchestrator-side and sandbox-side metrics
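The claim→execute→release cycle and its orchestrator-side timing can be sketched with the standard library alone. Everything in this block (`FakeSandbox`, `claimed_sandbox`, `run_cycle`, and the timing keys) is a hypothetical stand-in; the real service uses FastAPI and the ADK Runner, whose code is not shown in this description.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class FakeSandbox:
    """Stand-in for a sandbox claimed from a warm pool (hypothetical)."""

    def __init__(self) -> None:
        self.released = False

    def execute(self, code: str) -> str:
        # A real sandbox would run `code` under gVisor; this only echoes.
        return f"ran {len(code)} chars"


@contextmanager
def claimed_sandbox(claim: Callable[[], FakeSandbox]) -> Iterator[FakeSandbox]:
    """Claim a sandbox and always release it, even if execution fails."""
    sandbox = claim()
    try:
        yield sandbox
    finally:
        sandbox.released = True


def run_cycle(code: str) -> Dict[str, float]:
    """Time one claim -> execute -> release cycle from the orchestrator side."""
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    with claimed_sandbox(FakeSandbox) as sandbox:
        timings["claim_s"] = time.perf_counter() - start
        exec_start = time.perf_counter()
        sandbox.execute(code)
        timings["execute_s"] = time.perf_counter() - exec_start
    # Total includes the release performed when the with-block exits.
    timings["total_s"] = time.perf_counter() - start
    return timings


timings = run_cycle("print('hello')")
print(timings)
```

Wrapping the claim in a context manager keeps the release on the error path too, which matters when a failed execution would otherwise leak a sandbox from the pool.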

Sandbox Scripts (sandboxed_apps/)

  • benchmark_density.py — CPU-bound, syscall-heavy, and import-heavy tasks with RSS tracking
  • benchmark_payload.py — Payload generation, serialization, and stdout transfer measurement
  • benchmark_qps.py — Minimal script proving sandbox liveness
  • benchmark_density.js — Playwright-driven Chromium interaction benchmark
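A minimal sketch of the kind of measurement described for benchmark_density.py: a CPU-bound task bracketed by peak-RSS readings. The function names and the workload are assumptions for illustration, not the script's actual contents, and the `resource` module limits this to Unix-like systems.

```python
import resource
import sys
from typing import Dict


def max_rss_bytes() -> int:
    """Peak RSS of this process; ru_maxrss is KiB on Linux, bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == "darwin" else rss * 1024


def cpu_bound_task(n: int) -> int:
    """Sum of squares below n: pure-Python arithmetic, no I/O in the loop."""
    return sum(i * i for i in range(n))


def run_with_rss(n: int) -> Dict[str, int]:
    """Run the task and report how much the peak RSS grew while it ran."""
    before = max_rss_bytes()
    result = cpu_bound_task(n)
    after = max_rss_bytes()
    return {"result": result, "rss_growth_bytes": after - before}


report = run_with_rss(10_000)
print(report)
```

Because `ru_maxrss` is a high-water mark, the reported growth is never negative; sampling current RSS would need a different source such as /proc.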

Vibe Coding Workloads (workloads/vibe_coding/)

Startup scripts simulating real-world agentic cold-starts:

  • startup_pip_fastapi.sh — pip install + FastAPI server boot
  • startup_npm_vite.sh — npm install + Vite dev server boot
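One way to turn such a startup script into a cold-start number is to time how long it takes for the server port to accept connections. The sketch below assumes that definition of "ready"; `wait_for_port` is a hypothetical helper, and a local listening socket stands in for the FastAPI or Vite server.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0) -> float:
    """Return seconds elapsed until a TCP connect to host:port succeeds."""
    start = time.monotonic()
    deadline = start + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return time.monotonic() - start
        except OSError:
            # Not listening yet; back off briefly and retry.
            time.sleep(0.05)
    raise TimeoutError(f"{host}:{port} not ready after {timeout_s}s")


# Demo: a local listening socket stands in for the dev server.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
elapsed = wait_for_port("127.0.0.1", server.getsockname()[1], timeout_s=5.0)
server.close()
print(f"ready after {elapsed:.3f}s")
```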

Usage

Prerequisites (once per environment)

python -m perfkitbenchmarker.linux_benchmarks.kubernetes.agentic.gke_prerequisite_setup \
    --project_id=sada-gke-benchmarking2 \
    --region=us-central1 \
    --zone=us-central1-a \
    --machine_type=c4-standard-8

Provision Cluster

python pkb.py --benchmarks=gke_python_density \
    --run_stage=provision \
    --gke_provision_mode=native \
    --project=sada-gke-benchmarking2 \
    --owner=george-kalisse \
    --benchmark_config_file=k8s_agents/config/native_provision_config.yaml \
    --gce_network_name=george-agentic-vpc \
    --gce_subnet_region=us-central1 \
    --zone=us-central1-a \
    --container_cluster_version=1.35.3-gke.1389000 \
    --gke_use_beta=true \
    --gke_additional_flags="--enable-pod-snapshots,--enable-dataplane-v2,--enable-private-nodes,--enable-ip-alias,--master-ipv4-cidr=172.16.0.0/28,--workload-pool=sada-gke-benchmarking2.svc.id.goog,--subnetwork=george-agentic-subnet,--enable-master-authorized-networks,--master-authorized-networks=$(curl -s ifconfig.me)/32" \
    --gke_additional_nodepool_flags="--max-pods-per-node=250" \
    --gke_enable_shielded_nodes=false \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb

Run Benchmark

python pkb.py --benchmarks=gke_python_density \
    --run_stage=prepare,run,cleanup \
    --gke_provision_mode=native \
    --gke_project_id=sada-gke-benchmarking2 \
    --gke_region=us-central1 \
    --gke_zone=us-central1-a \
    --gke_sandbox_machine_type=c4-standard-8 \
    --gke_namespace=agentic \
    --gke_sandbox_version=v0.4.6 \
    --gke_python_density=4 \
    --gke_python_density_sample_count=20 \
    --gke_python_density_sample_warmup=0 \
    --gke_python_density_patch_warmpool=true \
    --gke_python_density_exec_timeout=600 \
    --gke_machine_type=c4-standard-8 \
    --gke_gvisor=true \
    --gke_api_url=http://localhost:8080 \
    --run_uri=test \
    --temp_dir=./testing/pkb/c4-standard-8/ucb

@google-cla

google-cla Bot commented Jun 16, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

# Used with --gke_provision_mode=native
#
# Prerequisites (run once before PKB):
# python tools/agentic-benchmark/scripts/prerequisite_setup.py \

Collaborator

This tool isn't being included & therefore this comment doesn't need to be here.

# For sweeps (cluster pre-exists, PKB skips provision/teardown):
# The sweep bridge injects --run_stage=run,cleanup automatically.

gke_python_density:

Collaborator

Internally we put a lot of this info, but externally it is useful. It's probably a good addition.

@@ -0,0 +1,240 @@
from google.adk.agents import LlmAgent
from google.adk.code_executors import GkeCodeExecutor

Collaborator

Where is this file run? From the same machine running PKB or a different one?

Comment thread requirements.txt
six>=1.13.0
timeout-decorator
scipy
matplotlib

Collaborator

I don't see a reference to this elsewhere with a ctrl-f; is it leftover from an earlier version?
In general we prefer not making many changes to requirements.txt.

' beyond the default node pool (e.g. kubernetes_node_scale with 5k nodes).',
)

GKE_USE_BETA = flags.DEFINE_boolean(

Collaborator

If we add this flag, IMO it should just be "gcloud_use_beta" (or, better, an enum flag "gcloud_beta_version" with the values alpha, beta, and None), and referencing it directly from gcp/util.py seems best.

Alternatively, we often say in the provider "if preview feature used, cmd.use_beta_gcloud = True". In general, what feature are you using that needs beta?

by all seven UC benchmark scripts. Each benchmark's Provision() and
Teardown() functions delegate to the public functions in this module.

Infrastructure created (in order):

Collaborator

The very premise of this file is incorrect. PKB (and especially, for example, google_kubernetes_engine.py _Create) should be handling all of the provisioning logic.

I'm not sure how much of this is a) completely unnecessary because it's handled elsewhere in PKB (for example, we set up subnets and networks automatically if you don't specify a network) or b) indeed necessary but should be located in some other Resource.py class.

+1. Let's set up the cloud infra using the PKB-native way.

chromium_replicas = FLAGS.gke_chromium_replicas

manifest = """---
apiVersion: extensions.agents.x-k8s.io/v1alpha1

Collaborator

This manifest should go in a .yaml.j2 file.

return _RunCmd(cmd, check=check, timeout=timeout)


def _KubectlApply(manifest_str):

Collaborator

Why have you rewritten kubectl apply and _RunKubectl when implementations already exist in container_service/kubectl.py?

@@ -0,0 +1,362 @@
"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).

For easier review and faster iteration, I'd recommend keeping one benchmark in this PR and leaving the other benchmarks for follow-up PRs. My recommendation is to keep the Python density benchmark.

@@ -0,0 +1,362 @@
"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).

Let's drop "(Use Case B)" from the description. For the published PKB benchmarks, the documentation should clearly state what the benchmarks are about. The ordering of A,B,C... will become stale and confusing to readers.

@@ -0,0 +1,362 @@
"""PKB Benchmark: GKE Agent Python Sandbox Density (Use Case B).

Can we drop "GKE" from the file name and the description? Based on the path this is a Kubernetes benchmark, and presumably this benchmark can be reused for other cloud providers without significant change, right?

# ---------------------------------------------------------------------------

flags.DEFINE_integer(
"gke_python_density",

gke_python_density
nit: Shall we name the flag something like "concurrent_sandbox_count"? gke and python can already be implied based on the file name and description of the benchmark.

flags.DEFINE_integer(
"gke_python_density_sample_warmup",
0,
"Number of warmup iterations per session (excluded from stats).",

It's unclear what "warmup iterations" means as it's not mentioned before. Shall we document the workflow in the benchmark description?

# ---------------------------------------------------------------------------


def _emit(samples, agg, agg_key, metric_suffix, unit, namespace, extra):

Can you document how the metric emission works and what the parameters are?

@roycaihw

cc @yuanwang04 @oceanxie1
