
Commit acd30fe

add instructions for RHOAI 2.13 (#60)
1 parent b5ab2e7 commit acd30fe

17 files changed: +969 −2 lines changed

SETUP.md

Lines changed: 6 additions & 0 deletions
@@ -29,6 +29,12 @@ Instructions are provided for the following OpenShift AI ***stable*** releases:
+ [RHOAI 2.10 Cluster Setup](./setup.RHOAI-v2.10/CLUSTER-SETUP.md)
+ [RHOAI 2.10 Team Setup](./setup.RHOAI-v2.10/TEAM-SETUP.md)
+ [RHOAI 2.10 Uninstall](./setup.RHOAI-v2.10/UNINSTALL.md)
+ OpenShift AI 2.13
+ [RHOAI 2.13 Cluster Setup](./setup.RHOAI-v2.13/CLUSTER-SETUP.md)
+ [RHOAI 2.13 Team Setup](./setup.RHOAI-v2.13/TEAM-SETUP.md)
+ [UPGRADING from RHOAI 2.10](./setup.RHOAI-v2.13/UPGRADE-STABLE.md)
+ [UPGRADING from RHOAI 2.12](./setup.RHOAI-v2.13/UPGRADE-FAST.md)
+ [RHOAI 2.13 Uninstall](./setup.RHOAI-v2.13/UNINSTALL.md)

Instructions are provided for the following OpenShift AI ***fast*** releases:
+ OpenShift AI 2.11

setup.RHOAI-v2.12/UPGRADE.md

Lines changed: 2 additions & 2 deletions
@@ -28,10 +28,10 @@ oc apply -f setup.RHOAI-v2.12/mlbatch-upgrade-configmaps.yaml
Second, approve the install plan replacing the example plan name below with the actual
value on your cluster:
```sh
- oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-st8vh
+ oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-xs6gq
```

- Apply this patch:
+ Third, apply this patch:
```sh
oc apply -f setup.RHOAI-v2.12/mlbatch-rbac-fix.yaml
```

setup.RHOAI-v2.13/CLUSTER-SETUP.md

Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Cluster Setup

The cluster setup installs OpenShift AI and Coscheduler, configures Kueue,
cluster roles, and priority classes.

If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
[MCAD](https://github.com/project-codeflare/mcad), OpenShift AI, or Coscheduler,
make sure to scrub all traces of these installations. In particular, delete the
following custom resource definitions (CRDs) if present on the cluster, making
sure to delete all instances prior to deleting the CRDs:
```sh
# Delete old appwrappers and crd
oc delete appwrappers --all -A
oc delete crd appwrappers.workload.codeflare.dev

# Delete old noderesourcetopologies and crd
oc delete noderesourcetopologies --all -A
oc delete crd noderesourcetopologies.topology.node.k8s.io
```
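
Optionally, confirm that no old CRDs remain; this check is an addition to the
original instructions, and `grep` producing no output means the scrub is complete:
```sh
oc get crd | grep -E 'appwrappers|noderesourcetopologies'
```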

## Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-priorities.yaml
```

## Coscheduler

Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
  scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
  --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
Patch Coscheduler pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
```
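
Optionally, confirm that both patched deployments roll out cleanly before
proceeding; this is an added sanity check, reusing the deployment names from
the patch commands above:
```sh
oc rollout status deployment/scheduler-plugins-controller -n scheduler-plugins
oc rollout status deployment/scheduler-plugins-scheduler -n scheduler-plugins
```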

## OpenShift AI

Create the OpenShift AI subscription:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-subscription.yaml
```
Identify the install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.13.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the
actual value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
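
The installation may take several minutes after approval. One optional way to
watch progress (not part of the original instructions) is to poll the CSV list
until the `rhods-operator` entry reports the `Succeeded` phase:
```sh
oc get csv -n redhat-ods-operator
```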

Create the DSC Initialization:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsci.yaml
```
Create the Data Science Cluster:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsc.yaml
```
The provided DSCI and DSC are intended to install a minimal set of OpenShift
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
remaining components, such as `dashboard`, can be optionally enabled.

The configuration of the managed components differs from the default OpenShift
AI configuration as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`;
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
  - `waitForPodsReady` is disabled,
  - the `LendingLimit` feature gate is enabled,
  - the `enableClusterQueueResources` metric is enabled;
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
  - pod priorities, resource requests, and limits have been adjusted.

To work around [RHOAIENG-7887](https://issues.redhat.com/browse/RHOAIENG-7887),
a race condition in the OpenShift AI installation, do a rolling restart of the
Kueue manager:
```sh
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
```

After the restart, verify that the following lines appear in the
kueue-controller-manager's log:
```sh
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}
```
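
If scanning the full log is tedious, a filter along these lines can surface the
relevant entries (an added convenience, not in the original instructions):
```sh
oc logs deployment/kueue-controller-manager -n redhat-ods-applications | grep pytorchjob
```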

## Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup.RHOAI-v2.13/default-flavor.yaml
```

## Cluster Role

Create the `mlbatch-edit` role:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
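
To observe the dynamically adjusted limits at any point, the slack queue can be
inspected directly (an optional check, not part of the setup proper):
```sh
oc get clusterqueue slack-cluster-queue -o yaml
```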

setup.RHOAI-v2.13/TEAM-SETUP.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# Team Setup

A *team* in MLBatch is a group of users that share a resource quota.

Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
for a discussion of our recommended best practices.

Setting up a new team requires the cluster admin to create a project,
a user group, a quota, a queue, and the required role bindings as described below.

Create the project:
```sh
oc new-project team1
```
Create the user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind the cluster role to the group in the namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
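
Optionally, verify that the role binding exists in the namespace (an added check):
```sh
oc get rolebinding -n team1
```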

Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
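
For example, a fully isolated team would use entries like the following sketch,
which reuses the illustrative GPU quota from above with both limits uncommented:
```yaml
- name: "nvidia.com/gpu"
  nominalQuota: 16
  borrowingLimit: 0 # do not borrow quota from other namespaces
  lendingLimit: 0   # do not lend quota to other namespaces
```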

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue`, as `AppWrappers` will
default to this queue name.
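
Optionally, confirm that the local queue is bound and active (an added check):
```sh
oc get localqueue -n team1
```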

setup.RHOAI-v2.13/UNINSTALL.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# Uninstall

***First, remove all team projects and corresponding cluster queues.***

Then, to uninstall the MLBatch controllers and reclaim the corresponding
namespaces, run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
  dscinitializations.dscinitialization.opendatahub.io \
  datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```
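
Once the uninstall completes, a check like the following (an addition to the
original instructions) should list no remaining MLBatch namespaces:
```sh
oc get namespaces | grep -E 'redhat-ods|scheduler-plugins'
```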

setup.RHOAI-v2.13/UPGRADE-FAST.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Upgrading from RHOAI 2.12

These instructions assume you installed and configured RHOAI 2.12 following
the MLBatch [install instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/CLUSTER-SETUP.md)
or the [upgrade instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/UPGRADE.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
install-st8vh   rhods-operator.2.11.0   Manual     true
install-xs6gq   rhods-operator.2.12.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
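
To confirm the upgrade completed, one option (an addition to these
instructions) is to query the subscription's installed CSV, which should
eventually report `rhods-operator.2.13.0`:
```sh
oc get subscription rhods-operator -n redhat-ods-operator -o jsonpath='{.status.installedCSV}{"\n"}'
```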

setup.RHOAI-v2.13/UPGRADE-STABLE.md

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
# Upgrading from RHOAI 2.10

These instructions assume you installed and configured RHOAI 2.10 following
the MLBatch [install instructions for RHOAI-v2.10](../setup.RHOAI-v2.10/CLUSTER-SETUP.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

First, update the MLBatch modifications to the default RHOAI configuration maps:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-upgrade-stable-configmaps.yaml
```

Second, approve the install plan, replacing the example plan name below with
the actual value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```

Finally, create the Slack Cluster Queue as described in [CLUSTER-SETUP.md for RHOAI 2.13](./CLUSTER-SETUP.md#slack-cluster-queue).
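
As a final check (our addition, not part of the original instructions), verify
that the slack queue now exists:
```sh
oc get clusterqueue slack-cluster-queue
```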

setup.RHOAI-v2.13/coscheduler-priority-patch.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical

setup.RHOAI-v2.13/default-flavor.yaml

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
