Kubeflow Security Self Assessment #2201

Open
andreyvelich wants to merge 4 commits into cncf:main from andreyvelich:kf-security-review

@andreyvelich

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich requested a review from a team as a code owner June 17, 2026 23:40
@github-actions github-actions Bot added needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) labels Jun 17, 2026
@github-actions github-actions Bot added the needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) label Jun 17, 2026
@kfaseela kfaseela self-assigned this Jun 18, 2026

@JustinCappos JustinCappos left a comment

Contributor

Overall, I feel like the self assessment document needs more of a security focus. Right now, it's too focused on what Kubeflow is meant to do, instead of how Kubeflow responds when things go wrong. There are clearly some good steps the project has taken from a security standpoint, so I think this can likely be addressed by updating the writing.

Comment thread projects/kubeflow/security-assessment/self-assessment.md
Comment on lines +117 to +122
entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is
composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native
projects that cover every stage of the [AI lifecycle](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle).

Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow
offers modular, scalable, and extensible tools to support your AI use cases.

Contributor

This is more like marketing speak and isn't precise about what the security properties of the system are meant to be. What security properties is it meant to provide? What trust assumptions are there? What happens when these are violated, etc.?

Author

Good point, I refactored the overview to make it more security focused. WDYT @JustinCappos?


![spark-operator](images/spark-operator.png)

- Spark Operator controller: A controller that watches for events of SparkApplication CRDs and acts on the watch events. It includes a submission runner that runs Spark submit for submissions received from the controller, and a Spark pod monitor that watches for Spark pods and sends pod status updates to the controller.

Contributor

Is the submission runner validated / trusted?

@andreyvelich andreyvelich Jun 18, 2026

Author

Yes, the runner should be trusted. @RobuRishabh @vara-bonthu @yuchaoran2011 @ChenYi015 @nabuskey to confirm here.

Yes, the submitter is the operator itself.

Contributor

Okay, can you try to split this section out so that the security compartments are clearer? In other words, if someone manages to compromise component X and that lets them get into component Y, the two are effectively not isolated. Please be clear about this, and about which parts are meant to be isolated.

Note also that a compromised component can sometimes cause a DoS (by failing to schedule jobs, etc.) without being able to read or write sensitive data. That's good to note too.
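
One way to make the compartments explicit is to write down which components can take over which others and then compute what an attacker reaches from a given foothold. The following is a minimal Python sketch of that idea. The component names and edges are hypothetical placeholders chosen for illustration, not the actual Spark Operator or Kubeflow trust graph.

```python
from collections import deque

# Hypothetical "can take over" edges: A -> B means that compromising A
# yields control of B. Illustrative only, not Kubeflow's real trust graph.
CAN_COMPROMISE = {
    "controller": ["submission-runner", "pod-monitor"],
    "submission-runner": ["driver-pod"],
    "pod-monitor": [],
    "driver-pod": ["executor-pod"],
    "executor-pod": [],
    "webhook": [],
}


def blast_radius(start, edges):
    """Return every component reachable from `start`, including itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated(a, b, edges):
    """True if compromising either component never reaches the other."""
    return b not in blast_radius(a, edges) and a not in blast_radius(b, edges)
```

Two components belong to the same compartment whenever `isolated` returns `False` for them. A fuller model would label each edge with the kind of impact (availability only versus read or write access to data), which this sketch does not attempt.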

Comment on lines +184 to +199
- Experiment controller: controller that watches events of Experiment CRDs which manage single
hyperparameter tuning job. User can specify several parameters in Experiment such as objective
to define metric that user wants to achieve, search space to define set of all hyperparameter
values, and search algorithm to use for optimization job (e.g. bayesian optimization )

- Suggestion controller: controller that watches events of Suggestion CRDs which manage set of
hyperparameter values that the hyperparameter tuning process has proposed. Suggestion is
responsible to manage algorithm service.

- Trail controller: controller that watches events of Trial CRDs which manage one iteration of
hyperparameter tuning process. A Trial corresponds to one worker job instance with a list of
parameter assignments. The list of parameter assignments corresponds to a Suggestion.

- Katib webhooks: Validates and mutates CRD resources to ensure they conform to Katib standards
and best practices. Katib also manages admission webhook to mutate metrics collector sidecar
container into Trial workers.

Contributor

Who runs these? Are they trusted? What if a malicious party gets into one of these controllers, what could they impact?

Author

This is managed by the Katib controllers in the system namespace. These are trusted components, and it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.

Please check out https://github.com/kubeflow/community-distribution#architecture and the Pod Security Standards (PSS) restricted/baseline profiles plus NetworkPolicies.

Contributor

> it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.

This part (and things like it) needs to be clear up front. What the operator is responsible for doing is really important to surface. This should be clear in the self-assessment docs and also in reasonable places on the Kubeflow site / docs.

Comment on lines +284 to +286
- Security and Access Control: Spark Operator leverages Kubernetes RBAC for Spark drivers and
executors. This allows administrators to define who can create, modify, or delete SparkApplications
and associated pods within the specific namespaces, enabling proper multi-tenant isolation.

Contributor

So is the assumption that multiple, untrusted parties will be using the same Kubeflow experiment controllers, suggestion controllers, etc.?

Author

Not really. We isolate users by namespace, so with the appropriate ACLs users should not be able to manage other users' resources.
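
As a rough illustration of what namespace-scoped access control can look like, the Python sketch below builds a Kubernetes `Role` and `RoleBinding` that let one user manage `SparkApplication` resources in a single namespace only. The role name, namespace, user name, and verb list are assumptions made for the example; they are not taken from the Kubeflow manifests or the self-assessment.

```python
import json


def spark_tenant_rbac(namespace, user):
    """Build a namespaced Role and RoleBinding for SparkApplication access."""
    role_name = "sparkapplication-editor"  # hypothetical name
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["sparkoperator.k8s.io"],
                "resources": ["sparkapplications"],
                # Assumed verb set; a real deployment may want fewer.
                "verbs": ["get", "list", "watch", "create",
                          "update", "patch", "delete"],
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{user}", "namespace": namespace},
        "subjects": [
            {"kind": "User", "name": user,
             "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {"kind": "Role", "name": role_name,
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
    return role, binding


if __name__ == "__main__":
    # Hypothetical tenant: user "alice" working in namespace "team-a".
    for manifest in spark_tenant_rbac("team-a", "alice"):
        print(json.dumps(manifest, indent=2))
```

Because both objects are namespaced, the binding grants nothing outside `team-a`. This only constrains who can manage the custom resources; it says nothing about what the shared controllers themselves can reach, which is the separate question raised in this thread.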

Contributor

Okay, kindly make sure things like this get updated in the assessment doc. Someone should be able to treat it as self-contained and get an idea of the security of Kubeflow.

Comment thread projects/kubeflow/security-assessment/self-assessment.md Outdated
Comment thread projects/kubeflow/security-assessment/self-assessment.md
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Labels

needs-group: Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied)
needs-kind: Indicates an issue or PR that is missing an issue type or kind (a kind/foo label)
needs-triage: Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied)

5 participants