
Commit cd43229

Merge pull request #11 from janetkuo/evidence
Change evidence to a list
2 parents: b4a572e + fd3c890

2 files changed: +18 −18 lines


docs/AIConformance-1.33.yaml

9 additions, 9 deletions

@@ -20,58 +20,58 @@ spec:
       description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
       level: SHOULD
       status: "" # Implemented, Not Implemented, Partially Implemented, N/A
-      evidence: "" # URL or reference to documentation/test results
+      evidence: [] # List of URLs or references to documentation/test results
       notes: "" # Must provide a justification when status is N/A
   networking:
     - id: ai_inference
       description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   schedulingOrchestration:
     - id: gang_scheduling
       description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: cluster_autoscaling
       description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: pod_autoscaling
       description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   observability:
     - id: accelerator_metrics
       description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: ai_service_metrics
       description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   security:
     - id: secure_accelerator_access
       description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   operator:
     - id: robust_controller
       description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""

docs/AIConformance-1.34.yaml

9 additions, 9 deletions

@@ -20,58 +20,58 @@ spec:
       description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
       level: MUST
       status: "" # Implemented, Not Implemented, Partially Implemented, N/A
-      evidence: "" # URL or reference to documentation/test results
+      evidence: [] # List of URLs or references to documentation/test results
       notes: "" # Must provide a justification when status is N/A
   networking:
     - id: ai_inference
       description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
      notes: ""
   schedulingOrchestration:
     - id: gang_scheduling
       description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: cluster_autoscaling
       description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: pod_autoscaling
       description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   observability:
     - id: accelerator_metrics
       description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
     - id: ai_service_metrics
       description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   security:
     - id: secure_accelerator_access
       description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
   operator:
     - id: robust_controller
       description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
       level: MUST
       status: ""
-      evidence: ""
+      evidence: []
       notes: ""
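With evidence now a list, a single requirement can cite several documents or test reports. A filled-in entry under the new schema might look like the following sketch; the status value and example.com URLs are hypothetical placeholders for illustration, not part of this commit:

  networking:
    - id: ai_inference
      level: MUST
      status: "Implemented"
      evidence: # one or more URLs/references, instead of a single string
        - "https://example.com/gateway-api-conformance-report"
        - "https://example.com/inference-traffic-split-docs"
      notes: ""

A vendor with only one reference would supply a one-element list, and an empty list replaces the old empty string as the unfilled default.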
