Add Cloud Native AI Scheduling Challenges Whitepaper#2164

Open
rajaskakodkar wants to merge 3 commits into
cncf:mainfrom
rajaskakodkar:scheduling-whitepaper

@rajaskakodkar

Adds the Cloud Native Scheduling Challenges Whitepaper

Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>
@rajaskakodkar rajaskakodkar requested review from a team as code owners May 15, 2026 16:51
@github-actions github-actions Bot added needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) and removed needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) labels May 15, 2026

@andreyvelich andreyvelich left a comment

Thanks for this effort @rajaskakodkar!
Overall, this looks great; I left a few thoughts.

Model development has two distinct activities that are often combined:

* **Feature engineering** transforms prepared data into input features the model can use. This involves creating new variables, encoding categorical data, and selecting which features to include. Feature engineering is computationally similar to data preparation—CPU and I/O bound, parallelizable, often triggered by new data.
* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.

Would it make sense to add a topic on hyperparameter optimization (HPO)?

Suggested change
* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.
* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.
* **Hyperparameter tuning** optimizes how the model learns rather than the structure of the model itself. This includes adjusting parameters such as learning rate, batch size, optimizer choice, number of epochs, and dropout rates. Unlike architecture design, hyperparameter tuning is compute-intensive because it requires repeatedly training and evaluating many model variants. These tuning jobs are highly parallelizable and are commonly distributed across GPUs or clusters.

@andreyvelich please suggest a change for the paragraph below since it is no longer necessarily valid re: heavy resource demands only in the next stage. :)


## ML Platform Tools

These tools provide higher-level abstractions for ML workflows:

Suggested change
These tools provide higher-level abstractions for ML workflows:
These tools provide higher-level abstractions for AI workloads:

**For ML engineers working with existing infrastructure:**

1. Understand what scheduling tools are available in your cluster.
2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.

@andreyvelich andreyvelich May 19, 2026

Suggested change
2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.
2. Use the appropriate job abstractions (TrainJob, MPIJob, PyTorchJob, etc.) rather than raw pods.

suggest to also keep PyTorchJob for legacy environments

@andreyvelich andreyvelich Jun 18, 2026

Sounds good. I changed the suggestion @rajaskakodkar.

riaankleinhans added a commit to cncf/automation that referenced this pull request May 22, 2026
processLabelRule previously computed `shouldApply = !foundNamespace`
unconditionally, so a `kind: label` rule with `matchCondition: AND`
behaved identically to a `NOT` rule. Paired NOT/AND rules (e.g. apply
`needs-triage` when no `triage/*` exists, remove it when one does) ended
up firing in exactly the wrong situations: on a fresh PR with no labels
the labeler would add `needs-triage`/`needs-kind`/`needs-group` and
immediately remove them in the same run, and when a `triage/*` label was
later added manually via the UI the paired `needs-*` label would never
be cleared.

Also teach the label-rule `match` parser to understand a single level of
comma-separated brace alternation such as `{toc,tag/*,sub/*}`.
`filepath.Match` on its own does not support braces, so previously such
a pattern only matched a literal label whose name began with `{`.

Adds focused tests for both behaviors, including a regression test that
mirrors the cncf/toc#2164 scenario.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Riaan Kleinhans <riaankleinhans@gmail.com>
@brandtkeller brandtkeller linked an issue May 29, 2026 that may be closed by this pull request
@angellk angellk requested review from angellk and salaboy June 2, 2026 15:49

| Stage | Primary Resources | Duration | Scheduling Characteristics |
| :---- | :---- | :---- | :---- |
| Data Preparation | CPU, storage I/O, network | Minutes to hours | Parallelizable, event-driven, no gang requirement |

It's helpful to denote that there is no gang requirement -- could you please rephrase your suggestion @andreyvelich

| Batch Inference | Variable GPU count | Hours to days | Parallelizable, throughput-oriented |
| Real-time Inference | GPUs with models preloaded | Continuous | Low latency, autoscaling, model serving |

The key insight: different stages need different scheduling strategies. A cluster running the full ML lifecycle must handle event-driven pipelines, interactive notebooks, gang-scheduled training, and latency-sensitive inference—often simultaneously, competing for the same GPU resources.

I am not an expert on this topic, but does it make sense to have a single cluster for all these tasks? Wouldn't it be more practical to have specialized clusters?


### **Traditional HPC Schedulers: Task-Level Scheduling**

In traditional high-performance computing (HPC) environments, schedulers such as Slurm operate at the task level (also called a rank or process group member).

Do we need a link for Slurm? Are there other schedulers in that space? Is Slurm the most used one? Asking for context.


This model naturally supports gang scheduling, topology-aware placement, and reservation-based execution. These capabilities are particularly well-suited for model training workloads.

In HPC-style schedulers, the scheduling unit is a task within a job, with all tasks scheduled together.
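
The job and task relationship described above can be sketched in a few lines of Python. This is a minimal illustration, not how Slurm or any other scheduler is implemented: the `Task`, `Job`, and `gang_place` names, the per-task `gpus` field, and the first-fit node choice are all assumptions made for the example. The point it shows is the all-or-nothing rule: either every task in the job gets a node, or nothing is placed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    """One rank of a job (hypothetical model); `gpus` is its GPU request."""
    rank: int
    gpus: int


@dataclass
class Job:
    """A job is the user-facing unit; its tasks are the scheduling units."""
    name: str
    tasks: List[Task] = field(default_factory=list)


def gang_place(job: Job, free_gpus: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Return {rank: node} if every task fits, else None with nothing placed."""
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement: Dict[int, str] = {}
    for task in job.tasks:
        # First-fit: pick the first node with enough free GPUs for this task.
        node = next((n for n, free in remaining.items() if free >= task.gpus), None)
        if node is None:
            return None  # one task cannot be placed, so the whole job waits
        remaining[node] -= task.gpus
        placement[task.rank] = node
    free_gpus.update(remaining)  # commit capacity only after all tasks fit
    return placement
```

With four tasks of two GPUs each, a cluster with nodes offering 4 and 2 free GPUs places nothing, while one offering 4 and 4 places all four tasks together.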

I would love to see a diagram here, to make the relationship between concepts such as tasks and jobs easier to understand for people who haven't used these systems in the past.

* New data arrives in a storage bucket → trigger a data preparation pipeline
* A model training job completes → trigger an evaluation job
* An upstream job fails → trigger a notification or retry

Having a diagram here might also help to solidify how the concepts connect to each other.

* Reservation locks resources for a specific job. The scheduler identifies which resources will be needed and stops scheduling new work to them, even if they're currently idle.
* Backfill allows small, short jobs to use reserved resources temporarily, as long as they'll finish before the reserved job needs them.
* **Mechanics:** The scheduler estimates when reserved resources will be free (based on running jobs' expected completion), then allows backfill jobs that fit within that window. This requires jobs to declare (or the system to estimate) their expected runtime.
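
The mechanics above can be sketched as two small functions. This is a simplified illustration under stated assumptions, not the algorithm of any particular scheduler: the function names, the `(expected_end_min, gpus)` tuples, and the use of minutes as the time unit are all hypothetical, and the sketch assumes declared runtimes are accurate.

```python
from typing import List, Optional, Tuple


def reservation_start(free_now: int,
                      running: List[Tuple[int, int]],
                      needed: int) -> Optional[int]:
    """Estimate when `needed` GPUs will be free.

    `running` holds (expected_end_min, gpus) for each running job.
    Returns a time in minutes from now, or None if capacity is never enough.
    """
    free = free_now
    if free >= needed:
        return 0
    for end_min, gpus in sorted(running):  # walk jobs in order of expected completion
        free += gpus
        if free >= needed:
            return end_min
    return None


def can_backfill(job_gpus: int, job_runtime_min: int,
                 idle_gpus: int, window_min: int) -> bool:
    """A job may backfill if it fits on idle GPUs and ends before the window closes."""
    return job_gpus <= idle_gpus and job_runtime_min <= window_min
```

For example, with 2 GPUs idle and running jobs that free 4 GPUs at minute 30 and 2 GPUs at minute 90, an 8-GPU reservation is estimated to start at minute 90. A 2-GPU job declaring 60 minutes can backfill in that window; one declaring 120 minutes cannot.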

It feels to me that a summary is needed here before jumping to "What's next"

* **Failure domains.** Placing all workers in the same rack minimizes network hops but means a rack failure kills the entire job. Spreading workers across racks improves resilience but increases communication latency.

The scheduler must balance these concerns. For latency-sensitive training, co-location may be worth the reduced resilience. For long-running jobs, spreading across failure domains and accepting higher latency may be preferable to risking a full restart.

A summary of what was covered here and the main takeaways would help readers who make it to the end leave with a high-level view of this paper.

@salaboy

salaboy commented Jun 11, 2026

Folks, congratulations, these papers are great reads. I've added some comments to improve the reader experience, but besides that this looks awesome.

I've noticed that some terms, like all-reduce, have no references. For readers who are not ML/AI engineers, those terms might need concrete references, as they are mentioned in several places.

I can't wait for this to get published.

* Role-aware scheduling: the scheduler must understand pod roles (e.g., master vs. worker) and preempt workers before masters to avoid job failure
* **Handling failures without full restart:** For gang-scheduled jobs, one worker failure typically crashes the entire job. Elastic training relaxes this—the job continues with the surviving workers, and a replacement worker can join later.

## Budget and Cost Constraints

The budget constraints section covers financial cost (GPU-hours, cloud spend) well. Should power/energy budget also be considered here as a parallel infrastructure constraint?


# What's Next

This paper examined the resource and infrastructure challenges for AI workloads. The final paper in this series, **Solutions and Practical Guidance for AI Workload Scheduling**, catalogs the tools and Kubernetes features that address these challenges, provides a reference table mapping challenges to solutions, and offers practical guidance including real-world use cases.

Should power consumption be included as a resource and infrastructure challenge?

| Topology Awareness (Cluster) | Topology Spread Constraints, DRANET (network DRA Driver) (limited) | KAI, Kueue, Slinky, Volcano | \- | Both | Network topology awareness is emerging |
| Resource Heterogeneity | Node selectors, labels | All batch schedulers | \- | Both | Standard Kubernetes features usually sufficient |
| GPU Sharing | DRA (GA, K8s 1.34+) | KAI | HAMi, KubeRay, Volcano | Both | MIG requires DRA or vendor tools |
| Scalability | Cluster Autoscaler, Karpenter | Armada, KAI, Kueue, Slinky, Volcano | interLink | Both | Large-scale scheduling is challenging |

@kfaseela kfaseela Jun 11, 2026

The Scalability row/inference req autoscaling in the solutions table already covers good tools. Should KEDA be included somewhere? Ref: #2188 and https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/ . Just a question, since we started with scheduling but the table covers scaling as well.

| Preemption | PriorityClass (pod-level) | KAI, Kueue, Slinky, Volcano | \- | Both | Job-level preemption needs external tools |
| Priority Scheduling | PriorityClass | All batch schedulers | \- | Both | Job-level priority in batch schedulers |
| Reservation & Backfill | \- | Slinky, Volcano, YuniKorn | \- | Training | Advanced feature in some schedulers |
| Topology Awareness (Node) | Topology Manager (NUMA), DRA CPU Driver (CPU topology) | KAI, Kueue, Slinky, Volcano | \- | Both | GPU interconnect awareness varies |

May I suggest adding HAMi here? HAMi also has node-level topology awareness (https://project-hami.io/docs/userguide/nvidia-device/scheduling-policy).

@riaankleinhans riaankleinhans added toc toc specific issue triage/valid Issue or PR is valid with enough information to be actionable kind/publication Item related to a publication (blog, tech-paper, etc.) pub/tech-paper Technical paper / whitepaper publication labels Jun 16, 2026
rajaskakodkar and others added 2 commits June 17, 2026 14:38
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Rajas Kakodkar <rajas.kakodkar@broadcom.com>
Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>

@mesutoezdil mesutoezdil left a comment

Two small fixes: a broken markdown link in paper 3 (gang scheduling section) and missing references for the all-reduce term, which salaboy noted has no citations for readers unfamiliar with ML.

* Training a large language model takes weeks to months on thousands of GPUs

For deep learning at scale, training is distributed across multiple machines. Workers communicate using collective operations like all-reduce, which requires all participants to synchronize. This creates the gang scheduling requirement: if you need 64 workers and only 60 are available, the job cannot start. If one worker fails mid-training, the entire job may need to restart from a checkpoint.
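
To make the synchronization requirement concrete, here is a toy, single-process model of a sum all-reduce. It is only a sketch: `all_reduce_sum` and its arguments are hypothetical names, and real collective libraries run across processes and would typically block or time out on a missing worker rather than raise an error as this example does.

```python
from typing import Dict, List


def all_reduce_sum(contributions: Dict[int, List[float]],
                   world_size: int) -> Dict[int, List[float]]:
    """Give every rank the element-wise sum of all ranks' vectors.

    `contributions` maps rank -> that worker's local values (e.g. gradients).
    The operation is only defined when all `world_size` ranks take part.
    """
    if len(contributions) != world_size:
        missing = world_size - len(contributions)
        raise RuntimeError(f"all-reduce cannot complete: {missing} worker(s) missing")
    # Sum position by position across all workers.
    total = [sum(values) for values in zip(*contributions.values())]
    # Every participant ends up with the same reduced result.
    return {rank: list(total) for rank in contributions}
```

Because every rank both contributes to and receives the result, a job sized for 64 workers gains nothing from starting with 60: the first collective step cannot finish until the remaining workers exist.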

Adds a reference to the NCCL all-reduce documentation for readers unfamiliar with collective communication.

Suggested change
For deep learning at scale, training is distributed across multiple machines. Workers communicate using collective operations like [all-reduce](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html), which requires all participants to synchronize. This creates the gang scheduling requirement: if you need 64 workers and only 60 are available, the job cannot start. If one worker fails mid-training, the entire job may need to restart from a checkpoint.

* **Solution approach:** Gang scheduling treats a group of pods as a single unit. Either all pods are scheduled together, or none are. The scheduler waits until sufficient resources are available for the entire job before starting any pods.
* **Lifecycle impact:** Gang scheduling is critical across multiple stages of the AI lifecycle:
* **Distributed training:** Workers use collective communication (all-reduce) that requires every participant. A partial allocation is useless.
* **Multi-pod inference:** Model-parallel deployments and disaggregated serving architectures require all components (e.g., prefill and decode workers) to be running before the system can serve requests.

Same all-reduce reference for consistency across papers.

Suggested change
* **Multi-pod inference:** Model-parallel deployments and disaggregated serving architectures require all components (e.g., prefill and decode workers) to be running before the system can serve requests.
* **Distributed training:** Workers use collective communication ([all-reduce](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html)) that requires every participant. A partial allocation is useless.

* **Multi-pod inference:** Model-parallel deployments and disaggregated serving architectures require all components (e.g., prefill and decode workers) to be running before the system can serve requests.
* **Distributed data preparation:** Parallel jobs that must complete together to produce consistent output benefit from all-or-nothing scheduling.
* **Current state:** Kubernetes-native batch schedulers that support gang scheduling include the coscheduling plugin (via PodGroups), Armada, KAI Scheduler, and Volcano. Native gang scheduling with the Workloads API has been implemented in Kubernetes 1.35 as an alpha feature](https://kubernetes.io/docs/concepts/workloads/workload-api/), with a goal to reach beta in Kubernetes 1.36.

The opening bracket is missing from the markdown link, so it renders as plain text instead of a hyperlink.

Suggested change
* **Current state:** Kubernetes-native batch schedulers that support gang scheduling include the coscheduling plugin (via PodGroups), Armada, KAI Scheduler, and Volcano. Native gang scheduling with the Workloads API has been implemented in Kubernetes 1.35 as [an alpha feature](https://kubernetes.io/docs/concepts/workloads/workload-api/), with a goal to reach beta in Kubernetes 1.36.

* Resource-intensive: Hundreds to thousands of GPUs
* Tightly coupled: All workers must run simultaneously
* Sensitive to topology: Communication speed depends on GPU interconnects

@andreyvelich andreyvelich Jun 18, 2026

Moved this comment: https://github.com/cncf/toc/pull/2164/changes#r3266600508

Suggested change
During the model training phase, users want to tune hyperparameters to optimize how the model learns rather than the structure of the model itself. This includes adjusting parameters such as learning rate, batch size, optimizer choice, number of epochs, and dropout rates. Unlike architecture design, hyperparameter tuning is compute-intensive because it requires repeatedly training and evaluating many model variants. These OptimizationJobs are highly parallelizable and are commonly distributed across GPUs or clusters.


| Stage | Primary Resources | Duration | Scheduling Characteristics |
| :---- | :---- | :---- | :---- |
| Data Preparation | CPU, storage I/O, network | Minutes to hours | Parallelizable, event-driven, no gang requirement |

Could you clarify please @angellk? As I mentioned, data processing jobs might also have gang-scheduling requirements to ensure the driver and executors have sufficient capacity to run.
The only difference is that such jobs can tolerate executor failures, which is uncommon for training jobs.

This is what I meant by elastic by nature.

Labels

kind/publication Item related to a publication (blog, tech-paper, etc.) pub/tech-paper Technical paper / whitepaper publication toc toc specific issue triage/valid Issue or PR is valid with enough information to be actionable

Projects

Status: New

Development

Successfully merging this pull request may close these issues.

[Initiative]: Cloud Native AI Scheduling Challenges Whitepaper

10 participants