agent_sandbox: load generator, metrics, and runnable benchmark by geojaz · Pull Request #6740 · GoogleCloudPlatform/PerfKitBenchmarker

geojaz · 2026-06-04T23:57:58Z

Third in the stacked agent_sandbox series. This is the PR that makes the
benchmark actually run.

Stack / merge order (each branch is cumulative on the one below):

#6730 - skeleton resource, spec, cluster wiring
#6732 - Kubernetes agent sandbox resource implementation
this PR - load generator, metrics, runnable Run

Related (but not technically a dependency): #6741 adds the GKE provider options
(per-nodepool node_labels/taints, etc.) this benchmark uses to run on GKE.

Because this is a cross-fork PR the base has to be master, so until #6730 and
#6732 merge the diff here shows the whole stack. The only new commit in this PR
is the top one (9 files, +1600/-33); that commit is the real review scope, and
GitHub will narrow the diff as the lower PRs land.

What this adds

Load generator (agent_sandbox_loadgen.py): submits SandboxClaim custom
resources at a target QPS through a single shared Kubernetes Watch stream (no
per-claim polling). ClaimDriver handles create/watch with 429 retry and
separate connection pools, LoadGenerator paces submission, and readiness is
tracked with bounded concurrency. Claims reference the warm pool directly via
spec.warmPoolRef (kubernetes-sigs/agent-sandbox#899 replaced
sandboxTemplateRef/warmpool with a single warmPoolRef; the controller
resolves the template through the warm pool), and the default manifest ref is
bumped to the post-#899 main HEAD of kubernetes-sigs/agent-sandbox so the
installed CRDs match.
Metrics (agent_sandbox_metrics.py): startup-time percentiles,
submit/completion QPS, peak concurrency, warm_served_fraction, error counts,
and lifecycle/exec-duration percentiles from the recorded events.
Run wiring: builds the load generator from the load-shape flags, runs it,
and converts the recorded events into PKB samples (the stub Run from the
resource PR returned nothing).
Provision/prepare install split: provision installs only the cluster
scaffolding (gVisor, CRDs, RBAC); the controller Deployment, sandbox template,
and warm pool move to the prepare stage via K8sAgentSandbox.InstallWorkload.
This lets the controller be reinstalled against an existing cluster with
--run_stage=prepare to iterate on controller settings without recreating it.
Prepare calls RefreshSpecFromFlags on resume so the unpickled spec reflects
the current flags. Note: --run_stage=provision alone no longer installs the
controller; use provision,prepare for a full setup.
Adds the kubernetes Python client to requirements.txt, plus unit tests for
the load generator, the metrics, and the provision/prepare split.
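
To make the pacing behaviour concrete, here is a minimal sketch of open-loop submission at a target QPS. It is not the code in agent_sandbox_loadgen.py: the function name `paced_submit`, its parameters, and the injectable `clock`/`sleep` hooks are assumptions made for illustration. The point it shows is that deadlines are derived from the start time, so one slow submit does not push every later submission back.

```python
import time
from typing import Callable, List


def paced_submit(
    submit: Callable[[int], None],
    total: int,
    qps: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
  """Calls submit(i) for i in range(total) on a fixed 1/qps schedule.

  Hypothetical helper, not part of the PR. Returns the offset in seconds
  (relative to the start) at which each submit call was issued.
  """
  if qps <= 0:
    raise ValueError('qps must be positive')
  interval = 1.0 / qps
  start = clock()
  offsets = []
  for i in range(total):
    # Deadline i is anchored to the start time, not to the previous call.
    delay = start + i * interval - clock()
    if delay > 0:
      sleep(delay)
    offsets.append(clock() - start)
    submit(i)
  return offsets
```

In a real run `submit` would create one SandboxClaim; readiness would be observed separately (the PR does this through the shared Watch stream), which keeps the submission loop from blocking on slow claims.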

Scheduling single-source

The gVisor selector and taint were duplicated as literal strings across the
benchmark config, the installer DaemonSet, and the sandbox template, with
nothing keeping them in sync. This untangles scheduling from runtime identity:

Scheduling: the installer DaemonSet and the sandbox pods select the sandbox
nodepool via the pkb_nodepool label PKB already injects on every pool, and
the pod toleration is derived from a single taint constant. nodeSelector
and tolerations are now injected in Python (same pattern as
_configure_controller_manifest) instead of being hardcoded in the manifests.
Runtime identity: runtimeClassName stays runsc, used only for the
RuntimeClass, the containerd registration, and the pod's runtimeClassName.
It is no longer reused as a node selector value.

Known gap until #6741 lands: PKB does not yet apply nodepool taints to the
actual nodes (that wiring is in #6741). So on this branch the canonical taint
lives in a _SANDBOX_TAINT constant with a TODO(#6741), and the taint is
effectively a no-op on the nodes. Concretely: selection onto the sandbox pool
via pkb_nodepool works today, but the fence (keeping other pods off the gVisor
nodes) is not live yet, because nothing taints those nodes until #6741. That gap
closes when #6741 wires node_taints and the constant is swapped for
nodepool.node_taints. The SandboxWarmPool is unchanged; it inherits
scheduling from the SandboxTemplate podTemplate.
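
As an illustration of the single-source idea, the sketch below injects nodeSelector and tolerations into a pod spec dict from one taint constant. The shape and values of `_SANDBOX_TAINT` here are placeholders (the PR names the constant but this page does not show its contents), and `inject_scheduling` is a hypothetical helper, not the function added in the PR.

```python
import copy
from typing import Any, Dict

# Placeholder contents: the key/value/effect below are assumptions, chosen
# only to show the shape of a Kubernetes taint.
_SANDBOX_TAINT = {
    'key': 'example.com/sandbox',
    'value': 'true',
    'effect': 'NoSchedule',
}


def inject_scheduling(
    pod_spec: Dict[str, Any],
    nodepool_name: str,
    taint: Dict[str, str] = _SANDBOX_TAINT,
) -> Dict[str, Any]:
  """Returns a copy of pod_spec pinned to nodepool_name and tolerating taint."""
  patched = copy.deepcopy(pod_spec)
  # Select nodes via the pkb_nodepool label rather than the runtime name.
  patched.setdefault('nodeSelector', {})['pkb_nodepool'] = nodepool_name
  toleration = {
      'key': taint['key'],
      'operator': 'Equal',
      'value': taint['value'],
      'effect': taint['effect'],
  }
  tolerations = patched.setdefault('tolerations', [])
  if toleration not in tolerations:  # Keep repeated calls idempotent.
    tolerations.append(toleration)
  return patched
```

Because the toleration is built from the same constant that would describe the node taint, swapping the constant for nodepool.node_taints later changes both sides together.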

Introduce the agent sandbox as a PKB resource modeled on the kubernetes inference server pattern, replacing the prior linux_package shape. This change adds only the class/spec/registration skeleton plus the cloud-agnostic container_cluster wiring. The install logic and the benchmark are added in follow-up changes.

- BaseAgentSandbox resource and GetAgentSandbox factory, keyed on SANDBOX_TYPE so additional sandbox implementations can coexist.
- BaseAgentSandboxConfigSpec and AgentSandboxConfigDecoder, embeddable under container_cluster in a benchmark config.
- K8sAgentSandbox / K8sAgentSandboxConfigSpec: the Kubernetes (kubernetes-sigs/agent-sandbox) implementation stubs.
- KubernetesCluster constructs and lifecycles cluster.agent_sandbox alongside cluster.inference_server.

…b-specs and flags

Add ControllerSpec / SandboxTemplateSpec / SandboxWarmPoolSpec nested sub-specs, the agent_sandbox_* stack and controller-tuning flags bridged via _ApplyFlags, and rename the old controller_ref flag to agent_sandbox_manifest_ref.

…spec register

The concrete resource module must import its concrete spec module (as wg_serving_inference_server imports wg_serving_inference_server_spec) so the agent_sandbox_* flags and K8sAgentSandboxConfigSpec register at runtime. Without it, a real pkb.py run fails at flag parsing / config decode even though unit tests (which import the spec module directly) pass.

Make the agent_sandbox benchmark run: a SandboxClaim load generator, the metrics it produces, the Run wiring that drives them, and a provision/prepare install split for fast iteration.

The load generator (agent_sandbox_loadgen.py) submits SandboxClaim custom resources at a target QPS through a single shared Kubernetes Watch stream (no per-claim polling). ClaimDriver handles create/watch with 429 retry and separate connection pools, LoadGenerator paces submission, and readiness is tracked with bounded concurrency. Claims reference the warm pool directly via spec.warmPoolRef (kubernetes-sigs/agent-sandbox#899 replaced sandboxTemplateRef/warmpool with a single warmPoolRef; the controller resolves the template through the warm pool), and the default manifest ref is bumped to the post-#899 main HEAD of kubernetes-sigs/agent-sandbox so the installed CRDs match.

The metrics module (agent_sandbox_metrics.py) computes startup-time percentiles, submit/completion QPS, peak concurrency, warm_served_fraction, error counts, and lifecycle/exec-duration percentiles from the recorded events.

The benchmark Run constructs the load generator from the load-shape flags, runs it, and converts the recorded events into PKB samples (the stub Run from the resource PR returned nothing).

Install is split across provision and prepare: provision installs only the cluster scaffolding (gVisor, CRDs, RBAC); the controller Deployment, sandbox template, and warm pool move to the prepare stage via a new K8sAgentSandbox.InstallWorkload. This lets the controller be reinstalled against an existing cluster with --run_stage=prepare to iterate on controller settings without recreating it. Because the benchmark spec is pickled at provision and unpickled without re-applying flags, Prepare calls RefreshSpecFromFlags on a resume so the controller, template, and warm pool config reflect the current command-line flags. Note: --run_stage=provision alone no longer installs the controller; run provision,prepare for a full setup.

Adds the kubernetes Python client to requirements.txt, plus unit tests for the load generator, the metrics, and the provision/prepare split.
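
One of the listed metrics, peak concurrency, can be derived from recorded events with a simple sweep. The sketch below assumes each claim is recorded as a (start, end) pair of timestamps; that event shape and the name `peak_concurrency` are assumptions for illustration and may not match agent_sandbox_metrics.py.

```python
from typing import Iterable, List, Tuple


def peak_concurrency(intervals: Iterable[Tuple[float, float]]) -> int:
  """Returns the largest number of intervals open at the same instant."""
  events: List[Tuple[float, int]] = []
  for start, end in intervals:
    events.append((start, 1))
    events.append((end, -1))
  # Tuple sorting puts -1 before +1 at equal timestamps, so an interval that
  # ends exactly when another starts is not counted as overlapping it.
  events.sort()
  current = 0
  peak = 0
  for _, delta in events:
    current += delta
    peak = max(peak, current)
  return peak
```

The sweep costs one sort over 2N events, which stays cheap even when the load generator records many thousands of claims.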

The gVisor scheduling selector and taint were duplicated as literal strings across the benchmark config, the installer DaemonSet, and the sandbox template, with nothing keeping them in sync. Untangle scheduling from runtime identity:

- Scheduling: select the sandbox nodepool via the pkb_nodepool label PKB already injects on every pool, and derive the pod toleration from a single taint constant. nodeSelector/tolerations are now injected in Python (like _configure_controller_manifest) instead of being hardcoded in the manifests.
- Runtime identity: runtimeClassName stays runsc, used only for the RuntimeClass, containerd registration, and the pod runtimeClassName.

PKB does not yet apply nodepool taints to nodes (that lands in a follow-up), so the canonical taint lives in a _SANDBOX_TAINT constant with a TODO to read it from the nodepool config once that wiring exists. The SandboxWarmPool is unchanged: it inherits scheduling from the SandboxTemplate podTemplate.

hubatish · 2026-06-17T20:28:29Z

+  """
+  sandbox = benchmark_spec.container_cluster.agent_sandbox
+  if sandbox is None:
+    return


This can probably be raised as an error (in general we prefer failing benchmarks to silently continuing).
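
A minimal sketch of the fail-fast variant suggested here, assuming a locally defined exception. `AgentSandboxMissingError` and `require_agent_sandbox` are hypothetical names; in PKB itself an existing error class from the errors module would likely be the better choice.

```python
from typing import Any


class AgentSandboxMissingError(Exception):
  """Hypothetical error type used only for this illustration."""


def require_agent_sandbox(container_cluster: Any) -> Any:
  """Returns container_cluster.agent_sandbox or raises if it is unset."""
  sandbox = getattr(container_cluster, 'agent_sandbox', None)
  if sandbox is None:
    # Fail the benchmark loudly instead of returning and producing no samples.
    raise AgentSandboxMissingError(
        'container_cluster.agent_sandbox is not configured; the benchmark '
        'cannot run without it.'
    )
  return sandbox
```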

hubatish · 2026-06-17T20:33:24Z

+      total=_TOTAL.value)
+  driver = agent_sandbox_loadgen.ClaimDriver(
+      namespace=spec.namespace,
+      template_name=k8s_agent_sandbox._SANDBOX_NAME,


Great to see this is a hardcoded value, which is what it should be. If it needs to be different every run it can have a uri component, but it shouldn't be passed as a flag.

Requesting an actual change, though: make this a public variable (no leading _).

hubatish · 2026-06-17T20:44:41Z

+      a small urllib3 pool.
+    - the exec-plugin bearer-token remap (see _register_bearer_token_auth).
+  """
+  from kubernetes import client  # pylint: disable=import-error,no-name-in-module


I don't like this as a new PKB requirement. Each new requirement adds some load time to our internal runs and memory use on everyone's machines.

My high-level suggestion is mostly to run this from VMs:

We often run load generation from VMs, and that could be an option here, i.e. running entirely from VMs rather than from a cluster. This provides more isolation and same-region but not same-cluster latencies, which can be more indicative of customer use cases (I'm not sure whether sandbox load comes from outside or inside a cluster for real customers).

Similarly, a VM can be used simply to handle additional dependencies, i.e. put this in something like a data script, copy it to a runner VM, run it on that VM, and copy out the results.

Otherwise:

Justify it as a new requirement.

Why is it imported inline rather than up top?

hubatish · 2026-06-17T20:45:44Z

+
+
+
+def percentile(values, pct):


Add pytype annotations in lots of places: https://google.github.io/pytype/
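
For illustration, an annotated version of the `percentile(values, pct)` signature from the hunk above might look like the following. Only the signature appears in the diff; the nearest-rank body here is an assumption and may differ from the method the PR actually uses.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
  """Returns the nearest-rank percentile of values for pct in (0, 100]."""
  if not values:
    raise ValueError('values must be non-empty')
  if not 0 < pct <= 100:
    raise ValueError('pct must be in (0, 100]')
  ordered = sorted(values)
  # Nearest-rank: the smallest element with at least pct% of the data at or
  # below it. Rank is 1-based, hence the -1 when indexing.
  rank = math.ceil(pct * len(ordered) / 100.0)
  return ordered[rank - 1]
```

With annotations like these, pytype can flag callers that pass a dict of events or a string percentile before the benchmark runs.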

geojaz added 10 commits June 3, 2026 17:35

agent_sandbox: add gVisor installer assets and sandbox manifests

be2c8ba

agent_sandbox: port controller-manifest configuration helper with tests

1a41377

agent_sandbox: implement K8sAgentSandbox._Create orchestration and no-op _Delete

5567851

agent_sandbox: add stub benchmark that provisions the resource

bf852e2

agent_sandbox: pass bare manifest names to ApplyManifest and strengthen test

7a385ee

agent_sandbox: use a single shared name for SandboxTemplate and SandboxWarmPool

e03ceaa

This was referenced Jun 4, 2026

gke: private nodes, DNS endpoint, Dataplane V2, cost allocation, monitoring, max_pods_per_node, nodepool labels/taints #6741

Open

eks: support nodepool labels and taints #6744

Open

hubatish reviewed Jun 17, 2026

geojaz wants to merge 11 commits into GoogleCloudPlatform:master from onix-net:geojaz/agent-sandbox-benchmark

2 participants
