Kubeflow Security Self Assessment by andreyvelich · Pull Request #2201 · cncf/toc

andreyvelich · 2026-06-17T23:40:51Z

Ref: kubeflow/community#996 (comment)
Adding initial Kubeflow security self-assessment

/cc @kfaseela @franciscojavierarceo @juliusvonkohout @chasecadet @thesuperzapper

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

JustinCappos

Overall, I feel like the self assessment document needs more of a security focus. Right now, it's too focused on what Kubeflow is meant to do, instead of how Kubeflow responds when things go wrong. There are clearly some good steps the project has taken from a security standpoint, so I think this can likely be addressed by updating the writing.

JustinCappos · 2026-06-18T13:23:54Z

+entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is
+composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native
+projects that cover every stage of the [AI lifecycle](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle).
+
+Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow
+offers modular, scalable, and extensible tools to support your AI use cases.


This is more like marketing speak and isn't precise about what the security properties of the system are meant to be. What security properties is it meant to provide? What trust assumptions are there? What happens when these are violated, etc.?

Good point, I refactored the overview to make it more security focused. WDYT @JustinCappos?

JustinCappos · 2026-06-18T13:25:10Z

+
+![spark-operator](images/spark-operator.png)
+
+- Spark Operator controller: A controller that watches for events of SparkApplication CRDs and acts on the watch events. It includes a submission runner that runs Spark submit for submissions received from the controller, and a Spark pod monitor that watches for Spark pods and sends pod status updates to the controller.


Is the submission runner validated / trusted?

Yes, the runner should be trusted. @RobuRishabh @vara-bonthu @yuchaoran2011 @ChenYi015 @nabuskey to confirm here.

Yes, the submitter is the operator itself.

Okay, can you try to split this section out so that the security compartments are clearer? In other words, if someone who compromises component X can use that to get into component Y, then X and Y are effectively not isolated. Please be clear about this, as well as about which parts are meant to be isolated.

Note also that sometimes a compromised component can cause a DoS (by failing to schedule jobs, etc.) but cannot read or write sensitive data. That's good to note too.
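To illustrate the kind of compartment breakdown being asked for here, the following is a minimal sketch, not an audited statement about Spark Operator. The component names, the `ESCALATES_TO` edges, and the `IMPACT` labels are all hypothetical assumptions chosen for the example; only the point that the submission runner is part of the operator itself comes from this thread.

```python
from collections import deque

# Hypothetical edges: compromising KEY gives control of each VALUE.
# Illustrative assumptions only, not a verified model of Spark Operator.
ESCALATES_TO = {
    "spark-operator-controller": {"submission-runner", "spark-pod-monitor"},
    "submission-runner": {"spark-operator-controller"},
    "spark-pod-monitor": set(),
    "user-spark-driver": set(),
}

# Hypothetical direct impact of each component once compromised.
IMPACT = {
    "spark-operator-controller": {"dos", "read", "write"},
    "submission-runner": {"dos", "write"},
    "spark-pod-monitor": {"dos"},
    "user-spark-driver": {"dos"},
}


def compartment(start, edges):
    """Return every component an attacker controls after compromising `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated(a, b, edges):
    """Two components are isolated only if neither compromise reaches the other."""
    return b not in compartment(a, edges) and a not in compartment(b, edges)


def total_impact(start, edges, impact):
    """Union of the impacts of everything reachable from `start`."""
    result = set()
    for component in compartment(start, edges):
        result |= impact.get(component, set())
    return result
```

Under these assumed edges, the controller and the submission runner fall into one compartment, while a component whose total impact is only `{"dos"}` can disrupt scheduling but cannot read or write sensitive data.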

JustinCappos · 2026-06-18T13:26:46Z

+- Experiment controller: controller that watches events of Experiment CRDs which manage single
+  hyperparameter tuning job. User can specify several parameters in Experiment such as objective
+  to define metric that user wants to achieve, search space to define set of all hyperparameter
+  values, and search algorithm to use for optimization job (e.g. bayesian optimization )
+
+- Suggestion controller: controller that watches events of Suggestion CRDs which manage set of
+  hyperparameter values that the hyperparameter tuning process has proposed. Suggestion is
+  responsible to manage algorithm service.
+
+- Trail controller: controller that watches events of Trial CRDs which manage one iteration of
+  hyperparameter tuning process. A Trial corresponds to one worker job instance with a list of
+  parameter assignments. The list of parameter assignments corresponds to a Suggestion.
+
+- Katib webhooks: Validates and mutates CRD resources to ensure they conform to Katib standards
+  and best practices. Katib also manages admission webhook to mutate metrics collector sidecar
+  container into Trial workers.


Who runs these? Are they trusted? What if a malicious party gets into one of these controllers, what could they impact?

This is managed by the Katib controllers in the system namespace. These are trusted components, and it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.

Please check out https://github.com/kubeflow/community-distribution#architecture, along with the Pod Security Standards (PSS) restricted/baseline profiles and NetworkPolicies.

it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.

This part (and things like it) needs to be clear up front. What the operator is responsible for doing is really important to surface. This should be clear in the self-assessment docs and also in reasonable places on the Kubeflow site / docs.

JustinCappos · 2026-06-18T13:28:59Z

+- Security and Access Control: Spark Operator leverages Kubernetes RBAC for Spark drivers and
+  executors. This allows administrators to define who can create, modify, or delete SparkApplications
+  and associated pods within the specific namespaces, enabling proper multi-tenant isolation.


So is the assumption that multiple, untrusted parties will be using the same Kubeflow experiment controllers, suggestion controllers, etc.?

Not really. We isolate users by namespace, so with the appropriate ACLs users should not be able to manage other users' resources.

Please take a look at https://github.com/kubeflow/community-distribution#architecture

Okay, please make sure things like this get updated in the assessment doc. Someone should be able to treat it as self-contained and get an idea of the security of Kubeflow...
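As a rough illustration of the namespace-based isolation described above, here is a minimal sketch of a namespace-scoped permission check. It is a simplified model and not the Kubernetes RBAC API: the `Binding` type, the `allowed` helper, and the user and namespace names are all hypothetical.

```python
from typing import NamedTuple


class Binding(NamedTuple):
    """Hypothetical, simplified stand-in for a namespaced role binding."""
    user: str
    namespace: str
    verbs: frozenset


# Illustrative data only: user names and namespaces are made up.
BINDINGS = [
    Binding("alice", "team-a", frozenset({"create", "get", "delete"})),
    Binding("bob", "team-b", frozenset({"create", "get"})),
]


def allowed(user, verb, namespace, bindings=BINDINGS):
    """Allow a request only if a binding grants `verb` to `user` in that namespace."""
    return any(
        b.user == user and b.namespace == namespace and verb in b.verbs
        for b in bindings
    )
```

In this model a user with no binding in a namespace is denied by default, which is the property the assessment doc would need to state explicitly for tenants sharing the same controllers.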


Kubeflow Security Self Assessment

451d147

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich requested a review from a team as a code owner June 17, 2026 23:40

github-actions Bot added needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) labels Jun 17, 2026

github-project-automation Bot added this to TAG Workloads Foundation, TAG Operational Resilience, CNCF TOC Board, TAG Developer Experience, TAG Security and Compliance and TAG Infrastructure Jun 17, 2026

github-actions Bot added the needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) label Jun 17, 2026

github-project-automation Bot moved this to New in CNCF TOC Board Jun 17, 2026

kfaseela self-assigned this Jun 18, 2026

JustinCappos requested changes Jun 18, 2026


andreyvelich added 3 commits June 18, 2026 17:25

Add authors

e452780

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Refactor the overview

26d806a

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Move Security Policies to the Appendix

c71afd6

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>




5 participants
