agent_sandbox: load generator, metrics, and runnable benchmark #6740

Open
geojaz wants to merge 11 commits into GoogleCloudPlatform:master from onix-net:geojaz/agent-sandbox-benchmark

geojaz commented Jun 4, 2026

Collaborator

Third in the stacked agent_sandbox series. This is the PR that makes the
benchmark actually run.

Stack / merge order (each branch is cumulative on the one below):

  1. #6730 - skeleton resource, spec, cluster wiring
  2. #6732 - Kubernetes agent sandbox resource implementation
  3. this PR - load generator, metrics, runnable Run

Related (but not technically a dependency): #6741 adds the GKE provider options
(per-nodepool node_labels/taints, etc.) this benchmark uses to run on GKE.

Because this is a cross-fork PR the base has to be master, so until #6730 and
#6732 merge the diff here shows the whole stack. The only new commit in this PR
is the top one (9 files, +1600/-33); that commit is the real review scope, and
GitHub will narrow the diff as the lower PRs land.

What this adds

  • Load generator (agent_sandbox_loadgen.py): submits SandboxClaim custom
    resources at a target QPS through a single shared Kubernetes Watch stream (no
    per-claim polling). ClaimDriver handles create/watch with 429 retry and
    separate connection pools, LoadGenerator paces submission, and readiness is
    tracked with bounded concurrency. Claims reference the warm pool directly via
    spec.warmPoolRef (kubernetes-sigs/agent-sandbox#899 replaced
    sandboxTemplateRef/warmpool with a single warmPoolRef; the controller
    resolves the template through the warm pool), and the default manifest ref is
    bumped to the post-#899 main HEAD so the installed CRDs match.
  • Metrics (agent_sandbox_metrics.py): startup-time percentiles,
    submit/completion QPS, peak concurrency, warm_served_fraction, error counts,
    and lifecycle/exec-duration percentiles from the recorded events.
  • Run wiring: builds the load generator from the load-shape flags, runs it,
    and converts the recorded events into PKB samples (the stub Run from the
    resource PR returned nothing).
  • Provision/prepare install split: provision installs only the cluster
    scaffolding (gVisor, CRDs, RBAC); the controller Deployment, sandbox template,
    and warm pool move to the prepare stage via K8sAgentSandbox.InstallWorkload.
    This lets the controller be reinstalled against an existing cluster with
    --run_stage=prepare to iterate on controller settings without recreating it.
    Prepare calls RefreshSpecFromFlags on resume so the unpickled spec reflects
    the current flags. Note: --run_stage=provision alone no longer installs the
    controller; use provision,prepare for a full setup.
  • Adds the kubernetes Python client to requirements.txt, plus unit tests for
    the load generator, the metrics, and the provision/prepare split.
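
The pacing and bounded-concurrency behavior described above can be sketched in plain Python. This is only an illustration under assumptions, not the code in agent_sandbox_loadgen.py: `run_paced`, `submit`, and `max_in_flight` are hypothetical names, and the real LoadGenerator tracks readiness through a Watch stream rather than through the return value of a callable.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_paced(submit, total, qps, max_in_flight,
              clock=time.monotonic, sleep=time.sleep):
  """Calls submit(i) for i in range(total), starting one call every 1/qps s.

  At most max_in_flight calls run at once. When that limit is reached the
  pacing loop blocks, so the achieved rate can fall below the target.
  """
  interval = 1.0 / qps
  slots = threading.BoundedSemaphore(max_in_flight)
  results = [None] * total

  def _worker(i):
    try:
      results[i] = submit(i)
    finally:
      slots.release()  # Free the slot even if submit() raised.

  start = clock()
  with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
    for i in range(total):
      # Pace against absolute deadlines so per-call jitter does not accumulate.
      delay = start + i * interval - clock()
      if delay > 0:
        sleep(delay)
      slots.acquire()
      pool.submit(_worker, i)
  # Leaving the with-block waits for every in-flight call to finish.
  return results
```

Scheduling against absolute deadlines (rather than sleeping a fixed interval after each call) keeps the long-run submission rate close to the target even when individual calls are slow to start.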

Scheduling single-source

The gVisor selector and taint were duplicated as literal strings across the
benchmark config, the installer DaemonSet, and the sandbox template, with
nothing keeping them in sync. This untangles scheduling from runtime identity:

  • Scheduling: the installer DaemonSet and the sandbox pods select the sandbox
    nodepool via the pkb_nodepool label PKB already injects on every pool, and
    the pod toleration is derived from a single taint constant. nodeSelector
    and tolerations are now injected in Python (same pattern as
    _configure_controller_manifest) instead of being hardcoded in the manifests.
  • Runtime identity: runtimeClassName stays runsc, used only for the
    RuntimeClass, the containerd registration, and the pod's runtimeClassName.
    It is no longer reused as a node selector value.
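
As a rough sketch of the "inject in Python" pattern described above: the function below adds a nodeSelector on the pkb_nodepool label and a toleration derived from one taint constant to a pod spec dict. The taint key, value, and effect shown here are placeholders (the PR only states that a single _SANDBOX_TAINT constant exists), and `inject_scheduling` and `toleration_for` are hypothetical helper names.

```python
import copy

# Placeholder taint; the actual key/value/effect in the PR are not shown.
_SANDBOX_TAINT = {
    'key': 'example.com/sandbox',
    'value': 'true',
    'effect': 'NoSchedule',
}


def toleration_for(taint):
  """Builds the pod toleration that matches a single node taint."""
  return {
      'key': taint['key'],
      'operator': 'Equal',
      'value': taint['value'],
      'effect': taint['effect'],
  }


def inject_scheduling(pod_spec, nodepool_name, taint=_SANDBOX_TAINT):
  """Returns a copy of pod_spec pinned to one nodepool and tolerating taint."""
  spec = copy.deepcopy(pod_spec)  # Leave the loaded manifest untouched.
  spec.setdefault('nodeSelector', {})['pkb_nodepool'] = nodepool_name
  tolerations = spec.setdefault('tolerations', [])
  toleration = toleration_for(taint)
  if toleration not in tolerations:  # Keep repeated injection idempotent.
    tolerations.append(toleration)
  return spec
```

Because both the selector and the toleration come from one place, the DaemonSet and the sandbox template cannot drift apart, and runtimeClassName is left exactly as the manifest declares it.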

Known gap until #6741 lands: PKB does not yet apply nodepool taints to the
actual nodes (that wiring is in #6741). So on this branch the canonical taint
lives in a _SANDBOX_TAINT constant with a TODO(#6741), and the taint is
effectively a no-op on the nodes. Concretely: selection onto the sandbox pool
via pkb_nodepool works today, but the fence (keeping other pods off the gVisor
nodes) is not live yet, because nothing taints those nodes until #6741. That gap
closes when #6741 wires node_taints and the constant is swapped for
nodepool.node_taints. The SandboxWarmPool is unchanged; it inherits
scheduling from the SandboxTemplate podTemplate.

geojaz added 10 commits June 3, 2026 17:35
Introduce the agent sandbox as a PKB resource modeled on the kubernetes
inference server pattern, replacing the prior linux_package shape. This
change adds only the class/spec/registration skeleton plus the
cloud-agnostic container_cluster wiring. The install logic and the
benchmark are added in follow-up changes.

- BaseAgentSandbox resource and GetAgentSandbox factory, keyed on
  SANDBOX_TYPE so additional sandbox implementations can coexist.
- BaseAgentSandboxConfigSpec and AgentSandboxConfigDecoder, embeddable
  under container_cluster in a benchmark config.
- K8sAgentSandbox / K8sAgentSandboxConfigSpec: the Kubernetes
  (kubernetes-sigs/agent-sandbox) implementation stubs.
- KubernetesCluster constructs and lifecycles cluster.agent_sandbox
  alongside cluster.inference_server.
…b-specs and flags

Add ControllerSpec / SandboxTemplateSpec / SandboxWarmPoolSpec nested
sub-specs, the agent_sandbox_* stack and controller-tuning flags bridged
via _ApplyFlags, and rename the old controller_ref flag to
agent_sandbox_manifest_ref.
…spec register

The concrete resource module must import its concrete spec module (as
wg_serving_inference_server imports wg_serving_inference_server_spec) so
the agent_sandbox_* flags and K8sAgentSandboxConfigSpec register at
runtime. Without it, a real pkb.py run fails at flag parsing /
config decode even though unit tests (which import the spec module
directly) pass.
Make the agent_sandbox benchmark run: a SandboxClaim load generator, the
metrics it produces, the Run wiring that drives them, and a provision/prepare
install split for fast iteration.

The load generator (agent_sandbox_loadgen.py) submits SandboxClaim custom
resources at a target QPS through a single shared Kubernetes Watch stream (no
per-claim polling). ClaimDriver handles create/watch with 429 retry and
separate connection pools, LoadGenerator paces submission, and readiness is
tracked with bounded concurrency. Claims reference the warm pool directly via
spec.warmPoolRef (kubernetes-sigs/agent-sandbox#899 replaced
sandboxTemplateRef/warmpool with a single warmPoolRef; the controller resolves
the template through the warm pool), and the default manifest ref is bumped to
the post-kubernetes-sigs/agent-sandbox#899 main HEAD so the installed CRDs match.
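
The "429 retry" behavior mentioned above could look roughly like the following. This is a sketch under assumptions: `TooManyRequests` is a stand-in exception (the real driver presumably inspects the Kubernetes client's API error status), and the backoff parameters and the `create_with_retry` name are illustrative rather than taken from the PR.

```python
import random
import time


class TooManyRequests(Exception):
  """Stand-in for an API error that carries HTTP status 429."""

  def __init__(self, retry_after=None):
    super().__init__('429 Too Many Requests')
    self.retry_after = retry_after  # Seconds suggested by the server, if any.


def create_with_retry(create, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
  """Calls create() and retries on TooManyRequests with jittered backoff."""
  if max_attempts < 1:
    raise ValueError('max_attempts must be at least 1')
  for attempt in range(max_attempts):
    try:
      return create()
    except TooManyRequests as e:
      if attempt == max_attempts - 1:
        raise  # Out of attempts: surface the throttling error.
      backoff = min(max_delay, base_delay * 2 ** attempt)
      # Prefer the server's hint; otherwise use full jitter so many
      # throttled submitters do not retry in lockstep.
      delay = e.retry_after if e.retry_after is not None else random.uniform(
          0, backoff)
      sleep(delay)
```

Only the throttling error is retried here; any other exception propagates immediately so that real failures are counted as errors rather than hidden by retries.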

The metrics module (agent_sandbox_metrics.py) computes startup-time
percentiles, submit/completion QPS, peak concurrency, warm_served_fraction,
error counts, and lifecycle/exec-duration percentiles from the recorded
events. The benchmark Run constructs the load generator from the load-shape
flags, runs it, and converts the recorded events into PKB samples (the stub
Run from the resource PR returned nothing).
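
To make the metric definitions concrete, here is one way such values might be derived from recorded events. The `ClaimEvent` record is hypothetical (the PR's actual event schema is not shown), and "peak concurrency" is assumed here to mean the maximum number of claims that were submitted but not yet ready at the same moment.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ClaimEvent:
  """Hypothetical per-claim record; not the PR's real event type."""
  submitted_at: float
  ready_at: Optional[float]  # None if the claim never became ready.
  warm: bool = False  # True if the claim was served from the warm pool.


def startup_times(events: Sequence[ClaimEvent]) -> List[float]:
  """Submit-to-ready latency for every claim that became ready."""
  return [e.ready_at - e.submitted_at for e in events if e.ready_at is not None]


def warm_served_fraction(events: Sequence[ClaimEvent]) -> float:
  """Fraction of ready claims that were served from the warm pool."""
  ready = [e for e in events if e.ready_at is not None]
  if not ready:
    return 0.0
  return sum(1 for e in ready if e.warm) / len(ready)


def peak_concurrency(events: Sequence[ClaimEvent]) -> int:
  """Maximum number of claims pending (submitted, not yet ready) at once."""
  points = []
  for e in events:
    if e.ready_at is None:
      continue
    points.append((e.submitted_at, 1))
    points.append((e.ready_at, -1))
  # At equal timestamps (t, -1) sorts before (t, +1), so a claim that ends
  # exactly when another starts is not counted as overlapping.
  points.sort()
  peak = running = 0
  for _, delta in points:
    running += delta
    peak = max(peak, running)
  return peak
```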

Install is split across provision and prepare: provision installs only the
cluster scaffolding (gVisor, CRDs, RBAC); the controller Deployment, sandbox
template, and warm pool move to the prepare stage via a new
K8sAgentSandbox.InstallWorkload. This lets the controller be reinstalled
against an existing cluster with --run_stage=prepare to iterate on controller
settings without recreating it. Because the benchmark spec is pickled at
provision and unpickled without re-applying flags, Prepare calls
RefreshSpecFromFlags on a resume so the controller, template, and warm pool
config reflect the current command-line flags. Note: --run_stage=provision
alone no longer installs the controller; run provision,prepare for a full
setup.
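
The resume problem described above (a spec restored from a pickle still holds the flag values from the original provision run) can be illustrated with a small sketch. The `ControllerSpec` fields and the `refresh_spec_from_flags` helper below are hypothetical; the PR's RefreshSpecFromFlags and its sub-specs are not shown in this conversation.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Optional


@dataclass(frozen=True)
class ControllerSpec:
  """Hypothetical controller settings; field names are illustrative only."""
  replicas: int = 1
  workers: int = 10


def refresh_spec_from_flags(
    spec: ControllerSpec,
    flag_values: Mapping[str, Optional[int]]) -> ControllerSpec:
  """Returns spec with every explicitly set flag applied on top of it.

  A value of None means the flag was not given on this command line, so
  the value restored from the earlier run is kept.
  """
  overrides = {k: v for k, v in flag_values.items() if v is not None}
  return replace(spec, **overrides)


# The spec as restored on --run_stage=prepare, holding provision-time values.
restored = ControllerSpec(replicas=1, workers=10)
# Only `workers` was passed on the resume command line.
refreshed = refresh_spec_from_flags(restored, {'replicas': None, 'workers': 50})
```

Treating unset flags as "keep the restored value" avoids silently resetting settings that were chosen at provision time.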

Adds the kubernetes Python client to requirements.txt, plus unit tests for the
load generator, the metrics, and the provision/prepare split.
The gVisor scheduling selector and taint were duplicated as literal strings
across the benchmark config, the installer DaemonSet, and the sandbox template,
with nothing keeping them in sync. Untangle scheduling from runtime identity:

- Scheduling: select the sandbox nodepool via the pkb_nodepool label PKB
  already injects on every pool, and derive the pod toleration from a single
  taint constant. nodeSelector/tolerations are now injected in Python (like
  _configure_controller_manifest) instead of being hardcoded in the manifests.
- Runtime identity: runtimeClassName stays runsc, used only for the
  RuntimeClass, containerd registration, and the pod runtimeClassName.

PKB does not yet apply nodepool taints to nodes (that lands in a follow-up), so
the canonical taint lives in a _SANDBOX_TAINT constant with a TODO to read it
from the nodepool config once that wiring exists. The SandboxWarmPool is
unchanged: it inherits scheduling from the SandboxTemplate podTemplate.
"""
  sandbox = benchmark_spec.container_cluster.agent_sandbox
  if sandbox is None:
    return

Collaborator

This can probably be raised as an error (in general we prefer failing benchmarks to silently continuing).
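
A minimal sketch of what failing loudly might look like, assuming a dedicated exception type. `MissingAgentSandboxError` and `require_agent_sandbox` are hypothetical names; in PKB an existing error class would likely be a better fit.

```python
from types import SimpleNamespace


class MissingAgentSandboxError(Exception):
  """Hypothetical error for a benchmark config without an agent sandbox."""


def require_agent_sandbox(container_cluster):
  """Returns the cluster's agent sandbox or raises if it is not configured."""
  sandbox = getattr(container_cluster, 'agent_sandbox', None)
  if sandbox is None:
    raise MissingAgentSandboxError(
        'container_cluster.agent_sandbox is not set; the agent_sandbox '
        'benchmark cannot run without it.')
  return sandbox


# Usage with stand-in objects in place of a real benchmark spec.
configured = SimpleNamespace(agent_sandbox='sandbox-resource')
assert require_agent_sandbox(configured) == 'sandbox-resource'
```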

      total=_TOTAL.value)
  driver = agent_sandbox_loadgen.ClaimDriver(
      namespace=spec.namespace,
      template_name=k8s_agent_sandbox._SANDBOX_NAME,

Collaborator

Great to see this is a hardcoded value (it actually should be; if it needs to differ on every run it could include a URI component, but it shouldn't be passed by flag).

Requested change: make this a public variable (no leading _).

    a small urllib3 pool.
  - the exec-plugin bearer-token remap (see _register_bearer_token_auth).
  """
  from kubernetes import client  # pylint: disable=import-error,no-name-in-module

Collaborator

I don't like this as a new PKB requirement. Each new requirement adds some load time to our internal runs and memory to everyone's machines.

My high-level suggestion is to run this from VMs:

  • We often run load generation from VMs, and that could be an option here, i.e. running entirely from VMs rather than from a cluster. This provides more isolation and same-region-but-not-same-cluster latencies, which can be more indicative of customer use cases (not sure whether sandbox load comes from outside or inside a cluster for real customers).
  • Similarly, a VM can be used simply to handle additional dependencies, i.e. put this in something like a data script, copy it to a runner VM, run it on that VM, and copy out the results.

Otherwise:

  • Justify it as a new requirement.
  • Why is it imported inline rather than at the top?




def percentile(values, pct):

Collaborator

Add type annotations (checked with pytype) in lots of places: https://google.github.io/pytype/
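
For instance, an annotated version of the helper might look like the sketch below. The body of the PR's `percentile` is not shown in this conversation, so the linear-interpolation implementation here is an assumption used only to make the example self-contained; the point is the signature.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
  """Returns the pct-th percentile of values using linear interpolation.

  Args:
    values: Non-empty sequence of samples, in any order.
    pct: Percentile to compute, in the range [0, 100].
  """
  if not values:
    raise ValueError('percentile() requires at least one value')
  if not 0 <= pct <= 100:
    raise ValueError('pct must be between 0 and 100')
  ordered = sorted(values)
  # Fractional index into the sorted samples.
  rank = (pct / 100.0) * (len(ordered) - 1)
  lower = math.floor(rank)
  upper = math.ceil(rank)
  if lower == upper:
    return float(ordered[lower])
  fraction = rank - lower
  return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

With the signature annotated, pytype can flag callers that pass a single number or a string where a sequence of floats is expected.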
