94 changes: 94 additions & 0 deletions v1.33/rosa/PRODUCT.yaml
@@ -0,0 +1,94 @@
# Kubernetes AI Conformance Checklist
# Notes: This checklist is based on the Kubernetes AI Conformance document.
# Participants should fill in the 'status', 'evidence', and 'notes' fields for each requirement.

metadata:
kubernetesVersion: v1.33
platformName: "Red Hat OpenShift Service on AWS"
platformVersion: "4.20"
vendorName: "Red Hat"
websiteUrl: "https://www.redhat.com/en/technologies/cloud-computing/openshift/aws"
documentationUrl: "https://docs.redhat.com/en/documentation/red_hat_openshift_service_on_aws/4"
productLogoUrl: "https://www.redhat.com/rhdc/managed-files/Logo-Red_Hat-OpenShift-A-Standard-RGB.svg"
description: "Red Hat OpenShift Service on AWS offers a reduced-cost solution to create a managed Red Hat OpenShift Service on AWS cluster with a focus on efficiency and security."

spec:
accelerators:
- id: dra_support
description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
level: SHOULD
status: "N/A"
evidence: []
notes: ""
networking:
- id: ai_inference
description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
level: MUST
status: "Implemented"
evidence:
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/network_apis/gateway-gateway-networking-k8s-io-v1"
notes: "ROSA exposes this feauture from OCP 4.20 directly without modification"
schedulingOrchestration:
- id: gang_scheduling
description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
level: MUST
status: "Implemented"
evidence:
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/ai_workloads/red-hat-build-of-kueue#gangscheduling"
notes: "Red Hat build of Kueue enables gang admission and was successfully installed and configured on ROSA HCP as part of this validation."
- id: cluster_autoscaling
description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
level: MUST
status: "Implemented"
evidence:
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/hardware_accelerators/about-hardware-accelerators"
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/machine_management/applying-autoscaling"
- "https://www.redhat.com/en/blog/autoscaling-nvidia-gpus-on-red-hat-openshift"
- "https://github.com/tiwillia/cncf-ai-conformance-rosa/blob/main/cluster_autoscaling/test-outputs/TEST-OVERVIEW.md"
notes: "The OpenShift cluster autoscaler implementation satisfies this requirement. We have tested on ROSA HCP in addition to OCP's testing of multip GPUs."
- id: pod_autoscaling
description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
level: MUST
status: "Implemented"
evidence:
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/nodes/automatically-scaling-pods-with-the-custom-metrics-autoscaler-operator#nodes-cma-autoscaling-custom-trigger-prom-gpu_nodes-cma-autoscaling-custom-trigger"
- "https://developers.redhat.com/articles/2025/08/12/boost-ai-efficiency-gpu-autoscaling-openshift#custom_metrics_autoscaler__keda__and_prometheus"
- "https://github.com/tiwillia/cncf-ai-conformance-rosa/blob/main/pod_autoscaling/test-outputs/TEST-SUMMARY.md"
notes: "Successfully demonstrated that a HPA configured for custom GPU utilization metrics scaled pods running heavy GPU-utilization workloads on ROSA HCP."
observability:
- id: accelerator_metrics
description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
level: MUST
status: "Implemented"
evidence:
- "https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/enable-gpu-monitoring-dashboard.html"
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/hardware_accelerators/nvidia-gpu-architecture"
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/hardware_accelerators/amd-gpu-operator"
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/red_hat_build_of_opentelemetry/index"
notes: "As part of the OpenShift observability solution, OpenShift provides comprehensive support for AI accelerators (e.g. NVIDIA, AMD) through dedicated GPU operators that enable standardized metrics collection and monitoring. NVIDIA GPU Operator integrates DCGM-based monitoring, exposing GPU utilization, power consumption (watts), temperature (Celsius), utilization (percent), and memory metrics. AMD GPU Operator with ROCm integration provides equivalent AI accelerator monitoring capabilities. GPU telemetry is exposed via DCGM Exporter for Prometheus consumption through /metrics endpoints. OpenShift observability solution also provides native integration with OpenTelemetry standards via the Red Hat build of OpenTelemetry. This solution is unmodified in ROSA HCP."
- id: ai_service_metrics
description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
level: MUST
status: "Implemented"
evidence:
- "https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/pdf/monitoring/OpenShift_Container_Platform-4.20-Monitoring-en-US.pdf"
notes: "OpenShift provides a fully integrated monitoring system based on Prometheus, which automatically discovers and scrapes metrics endpoints exposed by workloads in the standard Prometheus exposition format, ensuring seamless integration for collecting and displaying key metrics from common AI frameworks and servers. This solution is unmodified in ROSA HCP."
security:
- id: secure_accelerator_access
description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
level: MUST
status: "Implemented"
evidence:
- "https://github.com/tiwillia/cncf-ai-conformance-rosa/blob/main/secure_accelerator_access/test-outputs/SUMMARY.md"
- "https://docs.google.com/document/d/14t54X8N6Xg8P0p6-MKZxySLrSuJQm15M/edit?usp=sharing&ouid=104474006979808168216&rtpof=true&sd=true"
notes: "Successfully demonstrated on ROSA HCP in addition to OCP 4.20 validation."
operator:
- id: robust_controller
description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
level: MUST
status: "Implemented"
evidence:
- "https://developers.redhat.com/articles/2025/04/22/fine-tune-llms-kubeflow-trainer-openshift-ai"
- "https://github.com/tiwillia/cncf-ai-conformance-rosa/blob/main/robust_controller/test-outputs/test_summary.md"
- "https://docs.google.com/document/d/1XKqETt-sXbznYwrX5toJ1pVhcYaoq6PL/edit?usp=sharing&ouid=104474006979808168216&rtpof=true&sd=true"
notes: "Successfully demonstrated on ROSA HCP in addition to OCP 4.20 validation."