Kubeflow Security Self Assessment #2201
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
JustinCappos left a comment:
Overall, I feel like the self assessment document needs more of a security focus. Right now, it's too focused on what Kubeflow is meant to do, instead of how Kubeflow responds when things go wrong. There are clearly some good steps the project has taken from a security standpoint, so I think this can likely be addressed by updating the writing.
> entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is
> composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native
> projects that cover every stage of the [AI lifecycle](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle).
>
> Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow
> offers modular, scalable, and extensible tools to support your AI use cases.
This is more like marketing speak and isn't precise about what the security properties of the system are meant to be. What security properties is it meant to provide? What trust assumptions are there? What happens when these are violated, etc.?
Good point, I refactored the overview to make it more security focused. WDYT @JustinCappos?
> - Spark Operator controller: A controller that watches for events of SparkApplication CRDs and acts on the watch events. It includes a submission runner that runs Spark submit for submissions received from the controller, and a Spark pod monitor that watches for Spark pods and sends pod status updates to the controller.
Is the submission runner validated / trusted?
Yes, the runner should be trusted. @RobuRishabh @vara-bonthu @yuchaoran2011 @ChenYi015 @nabuskey to confirm here.
Yes, the submitter is the operator itself.
okay, can you try to split this section out so that the security compartments are clearer? In other words, if someone manages to compromise component X and that lets them get into component Y, they are effectively not isolated. Be clear about this please as well as what parts are meant to be isolated.
Note also that a compromised component can sometimes cause a DoS (by failing to schedule jobs, etc.) without being able to read or write sensitive data. That's good to note too.
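One way to make the compartment question concrete is to write down "compromising X lets an attacker reach Y" edges and compute what is transitively reachable. The following is a minimal sketch, not an analysis of Kubeflow. The only edge taken from this thread is that the operator controller runs the submission runner. Every other component name and edge is a hypothetical placeholder that the project maintainers would need to replace.

```python
from collections import deque

# "Compromise of the key lets an attacker reach each listed value."
# Only controller -> submission-runner comes from the discussion above;
# all other edges are illustrative assumptions.
REACHES = {
    "spark-operator-controller": ["submission-runner", "spark-pod-monitor"],
    "submission-runner": ["spark-driver-pod"],
    "spark-pod-monitor": [],
    "spark-driver-pod": ["spark-executor-pod"],
    "spark-executor-pod": [],
}


def blast_radius(start, reaches):
    """Return every component transitively reachable from a compromised `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in reaches.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def isolated_from(start, reaches):
    """Return the components that a compromise of `start` cannot reach."""
    everything = set(reaches) | {c for targets in reaches.values() for c in targets}
    return everything - blast_radius(start, reaches) - {start}


if __name__ == "__main__":
    for component in REACHES:
        print(component, "->", sorted(blast_radius(component, REACHES)))
```

Components that end up in each other's blast radius are effectively one compartment, and `isolated_from` lists what the document can claim is isolated. A fuller model could also label each edge with the kind of impact (DoS only versus read or write access).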
> - Experiment controller: a controller that watches events of Experiment CRDs, which manage a single
>   hyperparameter tuning job. Users can specify several parameters in an Experiment, such as the objective
>   (the metric the user wants to achieve), the search space (the set of all hyperparameter
>   values), and the search algorithm to use for the optimization job (e.g. Bayesian optimization).
>
> - Suggestion controller: a controller that watches events of Suggestion CRDs, which manage the set of
>   hyperparameter values that the hyperparameter tuning process has proposed. The Suggestion is
>   responsible for managing the algorithm service.
>
> - Trial controller: a controller that watches events of Trial CRDs, which manage one iteration of the
>   hyperparameter tuning process. A Trial corresponds to one worker job instance with a list of
>   parameter assignments. The list of parameter assignments corresponds to a Suggestion.
>
> - Katib webhooks: validate and mutate CRD resources to ensure they conform to Katib standards
>   and best practices. Katib also manages an admission webhook that injects the metrics collector sidecar
>   container into Trial workers.
Who runs these? Are they trusted? What if a malicious party gets into one of these controllers, what could they impact?
This is managed by the Katib controllers in the system namespace. These are trusted components, and it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.
Please check out https://github.com/kubeflow/community-distribution#architecture and PSS restricted/baseline + networkpolicies
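For readers unfamiliar with the controls named here, the sketch below builds a namespace labeled for Pod Security Standards enforcement and a default-deny NetworkPolicy as plain dictionaries, printed as JSON that `kubectl apply -f -` accepts. The namespace name `example-system` and the policy name `default-deny-all` are hypothetical, and this is not taken from the linked repository's manifests.

```python
import json

# The three Pod Security Standards levels defined by Kubernetes.
PSS_LEVELS = ("privileged", "baseline", "restricted")


def hardened_namespace(name, level="restricted"):
    """Build a Namespace manifest that enforces a Pod Security Standards level."""
    if level not in PSS_LEVELS:
        raise ValueError(f"unknown Pod Security Standards level: {level}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Reject pods that violate the chosen level.
                "pod-security.kubernetes.io/enforce": level,
                # Also surface violations as warnings to the client.
                "pod-security.kubernetes.io/warn": level,
            },
        },
    }


def default_deny_policy(namespace):
    """Build a NetworkPolicy that selects all pods and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both are denied
        },
    }


if __name__ == "__main__":
    # "example-system" is a placeholder namespace name.
    for manifest in (hardened_namespace("example-system"), default_deny_policy("example-system")):
        print(json.dumps(manifest, indent=2))
```

A real deployment would add explicit allow rules on top of the default-deny policy for the traffic each controller needs.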
> it is the responsibility of platform administrators to harden them and ensure they are protected from compromise.
This part (and things like this) need to be clear up front. What the operator is responsible for doing is really important to surface. This should be clear in the self assessment docs and also in reasonable places on the kubeflow site / docs.
> - Security and Access Control: Spark Operator leverages Kubernetes RBAC for Spark drivers and
>   executors. This allows administrators to define who can create, modify, or delete SparkApplications
>   and associated pods within the specific namespaces, enabling proper multi-tenant isolation.
So is the assumption that multiple, untrusted parties will be using the same Kubeflow experiment controllers, suggestion controllers, etc.?
Not really. We isolate users by namespaces, so with the appropriate ACLs users should not be able to manage other users' resources.
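As an illustration of what namespace-scoped ACLs could look like, here is a minimal sketch that builds a Kubernetes RBAC Role and RoleBinding limiting a user to SparkApplications in one namespace. The role name, the namespace `team-a`, and the user `alice` are hypothetical, and this is not the project's shipped configuration.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"
ROLE_NAME = "sparkapplication-editor"  # hypothetical role name


def spark_app_role(namespace):
    """Build a Role that allows managing SparkApplications in one namespace only."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": ROLE_NAME, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["sparkoperator.k8s.io"],
                "resources": ["sparkapplications"],
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }
        ],
    }


def bind_user(namespace, user):
    """Build a RoleBinding that grants the Role above to a single user."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{ROLE_NAME}-{user}", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_API_GROUP}],
        "roleRef": {"kind": "Role", "name": ROLE_NAME, "apiGroup": RBAC_API_GROUP},
    }


if __name__ == "__main__":
    # "team-a" and "alice" are placeholders for a tenant namespace and user.
    for manifest in (spark_app_role("team-a"), bind_user("team-a", "alice")):
        print(json.dumps(manifest, indent=2))
```

Because both objects are namespaced, `alice` gets no access to SparkApplications in any other namespace unless a separate binding grants it.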
Please take a look at https://github.com/kubeflow/community-distribution#architecture
Okay, kindly make sure things like this get updated in the assessment doc. Someone should be able to treat it as self-contained and get an idea of the security of Kubeflow.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Ref: kubeflow/community#996 (comment)
Adding initial Kubeflow security self-assessment
/cc @kfaseela @franciscojavierarceo @juliusvonkohout @chasecadet @thesuperzapper