agent_sandbox: Kubernetes agent sandbox resource implementation #6732
geojaz wants to merge 9 commits into
Introduce the agent sandbox as a PKB resource modeled on the Kubernetes inference server pattern, replacing the prior linux_package shape. This change adds only the class/spec/registration skeleton plus the cloud-agnostic container_cluster wiring. The install logic and the benchmark are added in follow-up changes.
- BaseAgentSandbox resource and GetAgentSandbox factory, keyed on SANDBOX_TYPE so additional sandbox implementations can coexist.
- BaseAgentSandboxConfigSpec and AgentSandboxConfigDecoder, embeddable under container_cluster in a benchmark config.
- K8sAgentSandbox / K8sAgentSandboxConfigSpec: the Kubernetes (kubernetes-sigs/agent-sandbox) implementation stubs.
- KubernetesCluster constructs cluster.agent_sandbox and manages its lifecycle alongside cluster.inference_server.
…b-specs and flags
Add ControllerSpec / SandboxTemplateSpec / SandboxWarmPoolSpec nested sub-specs, the agent_sandbox_* stack and controller-tuning flags bridged via _ApplyFlags, and rename the old controller_ref flag to agent_sandbox_manifest_ref.
…spec register
The concrete resource module must import its concrete spec module (as wg_serving_inference_server imports wg_serving_inference_server_spec) so the agent_sandbox_* flags and K8sAgentSandboxConfigSpec register at runtime. Without it, a real pkb.py run fails at flag parsing / config decode even though unit tests (which import the spec module directly) pass.
| inference_server: ( | ||
| kubernetes_inference_server_spec.BaseInferenceServerConfigSpec | None | ||
| ) | ||
| agent_sandbox: agent_sandbox_spec.BaseAgentSandboxConfigSpec | None |
The inference server set the example here, but I'm not sure this is the correct location, as opposed to putting it in the root benchmark_spec.py. In particular, this approach creates circular dependency issues: agent_sandbox wants to reference a cluster (to call methods on it), and the cluster references agent_sandbox to create it.
I believe wg_serving_inference_server.py & kubernetes_inference_server.py get around this by having a parent/abstract service and a child, and/or by not using pytype at that top level. Either way, benchmark_spec.py is probably the right place for this.
See 4daab75 for how example_resource.py was added to benchmark_spec.py. The ConstructAgentSandbox call there can also take a container_cluster in its init.
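To make the suggestion concrete, here is a rough sketch of how owning both resources at the benchmark-spec level avoids the cycle. Everything except the ConstructAgentSandbox name is hypothetical: `ClusterLike`, `FakeCluster`, `BenchmarkSpecSketch`, the `ApplyManifest` method, and the manifest filename are stand-ins, not PKB's real classes or signatures.

```python
from typing import List, Optional, Protocol


class ClusterLike(Protocol):
    """The only cluster surface the sandbox needs (hypothetical)."""

    def ApplyManifest(self, manifest: str) -> None: ...


class AgentSandbox:
    """Depends on the ClusterLike protocol, not on a concrete cluster module."""

    def __init__(self, cluster: ClusterLike):
        self.cluster = cluster

    def Create(self) -> None:
        # Hypothetical manifest name, for illustration only.
        self.cluster.ApplyManifest("sandbox-controller.yaml")


class FakeCluster:
    """Stand-in cluster; it never imports or references AgentSandbox."""

    def __init__(self):
        self.applied: List[str] = []

    def ApplyManifest(self, manifest: str) -> None:
        self.applied.append(manifest)


class BenchmarkSpecSketch:
    """Owns both resources, so neither has to construct the other."""

    def __init__(self):
        self.container_cluster = FakeCluster()
        self.agent_sandbox: Optional[AgentSandbox] = None

    def ConstructAgentSandbox(self) -> None:
        # The cluster is passed in; the dependency points one way only.
        self.agent_sandbox = AgentSandbox(self.container_cluster)
```

With this shape the cluster module has no reason to import the sandbox module, so the import graph stays acyclic and type checking can stay enabled at the top level.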
| 'agent_sandbox_controller_otel_endpoint', None, | ||
| 'OTLP exporter endpoint when tracing is enabled.') | ||
| flags.DEFINE_boolean( | ||
| 'agent_sandbox_controller_leader_elect', False, |
We should be able to set all of these variables through config overrides alone, e.g. --config_override=kubernetes_redis_memtier.container_cluster.agent_sandbox.image=yada; the flags are just a convenience. Convenience flags aren't bad to have, but we probably only need them for the values we're most likely to change manually.
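As an illustration of why a dotted override can reach any nested field without a dedicated flag, here is a small sketch. `apply_override` is a hypothetical helper, not PKB's parser: it leaves values as strings, does no type coercion, and handles one override at a time.

```python
import copy
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Return a copy of config with one 'a.b.c=value' override applied.

    Hypothetical helper that only illustrates the dotted-path idea.
    """
    path, sep, value = override.partition("=")
    if not sep or not path:
        raise ValueError(f"Expected 'path=value', got {override!r}")
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = path.split(".")
    for key in parents:
        # Create intermediate dicts on demand so new fields can be set.
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


config = {
    "kubernetes_redis_memtier": {
        "container_cluster": {"agent_sandbox": {"image": "default"}}
    }
}
updated = apply_override(
    config,
    "kubernetes_redis_memtier.container_cluster.agent_sandbox.image=yada",
)
```

Since every spec field is reachable this way, a dedicated flag mostly saves typing, which supports keeping flags only for the handful of values people change by hand.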
| config_values['runtime_class'] = flag_values.agent_sandbox_runtime_class | ||
| class SandboxWarmPoolSpec(spec.BaseSpec): |
Can you split some of these out into child PRs? This seems to implement the base sandbox plus a bunch of features; each feature as an individual PR would be nice.
In general this seems like a lot of customization. We can likely hardcode many of these values.
Maybe I'm misinterpreting this (I see below that they are all referenced by the parent spec).
| # | ||
| # Targets nodes labelled sandbox.gke.io/runtime=runsc (the label the | ||
| # benchmark applies to the sandbox node pool). | ||
| apiVersion: apps/v1 |
This turned into mostly a "setup spec" PR, so I don't think the YAMLs are used here, right? Please move them to the PR where they are used.
What
Second step of reshaping the agent sandbox into a PKB resource (follows the skeleton in #6730). This fills in the Kubernetes implementation: a config-driven spec, the install orchestration, the data manifests, and a stub benchmark that provisions the resource.
Changes
- K8sAgentSandboxConfigSpec (k8s_agent_sandbox_spec.py): config-driven, with nested controller, sandbox_template, and sandbox_warmpool sub-specs. The sandbox_template block models the upstream SandboxTemplateSpec (pod shape rendered; template-level toggles like network_policy_management / env_vars_injection_policy / service are accepted and validated as stubs). The existing agent_sandbox_* flags are bridged via _ApplyFlags; the old controller_ref flag is renamed to agent_sandbox_manifest_ref.
- K8sAgentSandbox (k8s_agent_sandbox.py): _Create orchestrates gVisor install, controller install, SandboxTemplate apply, and SandboxWarmPool install (private methods reading self.spec). _Delete is a no-op (the ephemeral cluster teardown reclaims the stack). The controller-manifest configuration and install helpers are ported from the prior linux_packages implementation.
- data/agent_sandbox/: gVisor installer assets, and sandbox-template.yaml.j2 parameterized for runtime class, image, resources, and labels. The SandboxTemplate and SandboxWarmPool share a single fixed name.
- agent_sandbox_benchmark.py: a container_cluster.agent_sandbox config that constructs a K8sAgentSandbox. Run returns no samples yet.
- k8s_agent_sandbox_test.py: spec decode + flag overrides, controller-manifest injection, _Create orchestration, _Delete no-op, and benchmark-config construction.
Scope
Reviewable but not yet runnable end to end. The benchmark's Run (load generator + metrics) lands in the next PR, and the GKE/EKS/AKS nodepool changes that make it actually provision come after that. Cloud-agnostic: no provider changes here.
Follow-ups
Run load generation + metrics; then GKE, EKS, AKS provider support.
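For reviewers skimming the description, the _Create ordering and the _Delete no-op listed under Changes can be sketched roughly as below. This is an outline under stated assumptions, not the code in k8s_agent_sandbox.py: the private method names, the `steps` list, and the `manifest_ref` / `warmpool_replicas` spec keys are hypothetical, and each step only records what a real run would execute.

```python
from typing import Any, Dict, List


class K8sAgentSandboxSketch:
    """Hypothetical outline of the install order described in the PR."""

    def __init__(self, spec: Dict[str, Any]):
        self.spec = spec
        self.steps: List[str] = []  # records what a real run would execute

    def _Create(self) -> None:
        # Order matters: runtime first, then controller, template, warm pool.
        self._InstallGvisor()
        self._InstallController()
        self._ApplySandboxTemplate()
        self._InstallWarmPool()

    def _Delete(self) -> None:
        """No-op: the ephemeral cluster teardown reclaims the stack."""

    # Each private step reads its settings from self.spec.
    def _InstallGvisor(self) -> None:
        self.steps.append(f"gvisor:{self.spec['runtime_class']}")

    def _InstallController(self) -> None:
        self.steps.append(f"controller:{self.spec['manifest_ref']}")

    def _ApplySandboxTemplate(self) -> None:
        self.steps.append(f"template:{self.spec['image']}")

    def _InstallWarmPool(self) -> None:
        self.steps.append(f"warmpool:{self.spec['warmpool_replicas']}")
```

Keeping each stage in its own private method is also what makes the orchestration test cheap: a test can patch the four steps and assert only on call order.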