docs/AIConformance-1.33.yaml (9 additions, 9 deletions)

@@ -20,58 +20,58 @@ spec:
       description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
       level: SHOULD
       status: "" # Implemented, Not Implemented, Partially Implemented, N/A
-      evidence: "" # URL or reference to documentation/test results
+      evidence: [] # List of URLs or references to documentation/test results
       notes: "" # Must provide a justification when status is N/A
   networking:
     - id: ai_inference
       description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   schedulingOrchestration:
     - id: gang_scheduling
       description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: cluster_autoscaling
       description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: pod_autoscaling
       description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   observability:
     - id: accelerator_metrics
       description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: ai_service_metrics
       description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   security:
     - id: secure_accelerator_access
       description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   operator:
     - id: robust_controller
       description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
docs/AIConformance-1.34.yaml (9 additions, 9 deletions)

@@ -20,58 +20,58 @@ spec:
       description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
       level: MUST
       status: "" # Implemented, Not Implemented, Partially Implemented, N/A
-      evidence: "" # URL or reference to documentation/test results
+      evidence: [] # List of URLs or references to documentation/test results
       notes: "" # Must provide a justification when status is N/A
   networking:
     - id: ai_inference
       description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   schedulingOrchestration:
     - id: gang_scheduling
       description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: cluster_autoscaling
       description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: pod_autoscaling
       description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   observability:
     - id: accelerator_metrics
       description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: ai_service_metrics
       description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   security:
     - id: secure_accelerator_access
       description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   operator:
     - id: robust_controller
       description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."