Add cost estimates for ragged sort kernels by NuojCheng · Pull Request #4228 · AI-Hypercomputer/maxtext

NuojCheng · 2026-06-22T22:01:20Z

Description

Add cost estimates for ragged gather and ragged gather reduce kernels, for better XLA compiler scheduling.

This PR also adds four flags for ragged kernel cost estimations:

ragged_gather_cost_estimate_flops
ragged_gather_reduce_cost_estimate_flops
ragged_gather_cost_estimate_bytes_accessed
ragged_gather_reduce_cost_estimate_bytes_accessed

Each flag defaults to -1, which means the estimate is auto-computed; any positive value overrides the corresponding flops or bytes-accessed estimate.
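The override semantics above can be sketched as follows. This is a minimal illustration rather than the PR's code: the helper name `resolve_estimate`, the `config` dict, and the placeholder numbers are all hypothetical, and it assumes only strictly positive values count as overrides.

```python
def resolve_estimate(auto_value: int, override: int = -1) -> int:
  """Return the override when it is strictly positive, else the auto-computed value."""
  return override if override > 0 else auto_value


# Hypothetical config mirroring the four flags; only one is overridden here.
config = {
    "ragged_gather_cost_estimate_flops": -1,
    "ragged_gather_reduce_cost_estimate_flops": -1,
    "ragged_gather_cost_estimate_bytes_accessed": 4_000_000,
    "ragged_gather_reduce_cost_estimate_bytes_accessed": -1,
}

# Placeholder auto-computed estimates (made-up numbers for illustration).
auto_flops, auto_bytes = 1_000, 2_000

flops = resolve_estimate(auto_flops, config["ragged_gather_cost_estimate_flops"])
bytes_accessed = resolve_estimate(
    auto_bytes, config["ragged_gather_cost_estimate_bytes_accessed"]
)
print(flops, bytes_accessed)  # 1000 4000000
```

Under this reading, a value of 0 also falls back to the auto-computed estimate, since only values greater than zero replace it.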

FIXES: b/525538961

Tests

Before (xids/265679573): step time 15.53s
After (xids/265674304): step time 15.48s
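For scale, the two step times above work out to roughly a 0.3% improvement. The snippet below is just that arithmetic; it assumes one measurement per configuration, and the PR does not say how this compares to run-to-run noise.

```python
before_s = 15.53  # step time before the change, from the PR description
after_s = 15.48   # step time after the change, from the PR description

delta_s = before_s - after_s
relative_gain = delta_s / before_s
print(f"{delta_s:.2f}s faster per step ({relative_gain:.2%})")
# 0.05s faster per step (0.32%)
```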

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-22T22:06:46Z

Codecov Report

❌ Patch coverage is 48.48485% with 17 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/kernels/ragged/ragged_gather_reduce.py	13.33%	13 Missing ⚠️
src/maxtext/kernels/ragged/ragged_gather.py	71.42%	2 Missing and 2 partials ⚠️

gobbleturk · 2026-06-23T00:05:03Z

  inner_kernel()


+def get_cost_estimate(


Can we allow users to plumb in their own cost estimate to override this? The kernels are not very efficient, so a cost estimated from theory does not reflect the time actually taken.

E.g., just as splash has fwd and bwd cost estimates in base.yml.

gobbleturk · 2026-06-23T01:51:27Z

                              # ragged gather SparseCore kernel. When false (default), use the SparseCore kernel.
 ragged_gather_reduce_fallback: false # when true, unconditionally use the JAX reference implementation instead of the
                                     # ragged gather reduce SparseCore kernel. When false (default), use the SparseCore kernel.
+ragged_gather_cost_estimate_flops: -1 # -1 means auto-compute, any > 0 value overrides the flop cost estimate for the ragged gather kernel


Is there a benefit to exposing four separate cost estimates (fwd dispatch, fwd combine, bwd dispatch, bwd combine)? I guess there are only two main kernels, gather and gather reduce, so we only need these two?

I think two is enough, though there is a slight difference between using and not using weights for the ragged gather kernel.

gobbleturk · 2026-06-23T01:51:44Z

+    hidden_size: int,
+    dtype_bytes: int,
+    has_weights: bool,
+    flops_override: int = -1,


nit: I would just name it cost_estimate instead of flops_override, keeping the same behavior where -1 falls back to a sensible default (identical code, just a different variable name).

Oh, never mind, I see this just overrides the flops, so maybe this makes sense. I forgot that a cost estimate has both bytes and flops. I'm not sure which is the more useful tuning knob for e2e performance (maybe both); I think this is fine for now. The end goal is to help XLA schedule collectives.

It makes more sense to me to be able to override bytes_accessed, since that is the main cost here, but I don't understand exactly how these affect our goal of helping tune XLA schedules.

Added both now

gobbleturk · 2026-06-23T01:57:20Z

+  if flops_override > 0:
+    flops = flops_override
+  else:
+    flops = 2 * padded_input_size * aligned_hidden_size


I'm not sure there are any flops here, actually. Do the flops of a cost_estimate refer to MXU flops? The additions performed here won't happen on the MXU. The flops here are tiny anyway unless the user sets a large override to tune scheduling, so maybe this is fine, but we should default to 0 if the flops of this cost_estimate refer to MXU flops.

The flops are for token/gradient accumulation. Agreed, they are not very useful.
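To make the thread above concrete, here is a hedged sketch of a cost estimate with both overrides. Only the parameter names from the diff and the default flops formula (`2 * padded_input_size * aligned_hidden_size`) come from the PR; the function name `sketch_cost_estimate`, the `bytes_accessed_override` parameter, and the whole bytes-accessed formula are assumptions made for illustration. `CostEstimate` here is a local stand-in carrying the same three fields as `jax.experimental.pallas.CostEstimate`, so the sketch runs without JAX.

```python
from typing import NamedTuple


class CostEstimate(NamedTuple):
  """Local stand-in with the same fields as jax.experimental.pallas.CostEstimate."""
  flops: int
  transcendentals: int
  bytes_accessed: int


def sketch_cost_estimate(
    padded_input_size: int,
    aligned_hidden_size: int,
    dtype_bytes: int,
    has_weights: bool,
    flops_override: int = -1,
    bytes_accessed_override: int = -1,  # hypothetical parameter name
) -> CostEstimate:
  # Default flops follow the formula quoted in the diff; a positive override wins.
  if flops_override > 0:
    flops = flops_override
  else:
    flops = 2 * padded_input_size * aligned_hidden_size

  if bytes_accessed_override > 0:
    bytes_accessed = bytes_accessed_override
  else:
    # ASSUMPTION: every row is read once and written once.
    row_bytes = aligned_hidden_size * dtype_bytes
    bytes_accessed = 2 * padded_input_size * row_bytes
    if has_weights:
      # ASSUMPTION: one weight scalar per row, stored in the same dtype.
      bytes_accessed += padded_input_size * dtype_bytes

  return CostEstimate(flops=flops, transcendentals=0, bytes_accessed=bytes_accessed)


auto = sketch_cost_estimate(1024, 4096, dtype_bytes=2, has_weights=False)
tuned = sketch_cost_estimate(
    1024, 4096, dtype_bytes=2, has_weights=True, bytes_accessed_override=10**9
)
print(auto)
print(tuned)
```

If the flops field turns out to mean MXU flops, as the reviewer asks, the only change under this sketch would be replacing the default flops expression with 0.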

NuojCheng force-pushed the chengnuojin-ragged-cost branch 2 times, most recently from 0065932 to a2caaca Compare June 22, 2026 22:38

NuojCheng added pull ready and removed pull ready labels Jun 22, 2026

NuojCheng marked this pull request as ready for review June 22, 2026 23:29

NuojCheng requested review from A9isha, NicoGrande, RissyRan, SurbhiJainUSC, abhinavclemson, aireenmei, bvandermoon, darisoy, dipannita08, gagika, gobbleturk, hengtaoguo, igorts-git, jiangjy1982, khatwanimohit, richjames0, shralex, suexu1025 and vipannalla as code owners June 22, 2026 23:29

gobbleturk reviewed Jun 23, 2026

Shuwen-Fang self-requested a review June 23, 2026 00:08

Shuwen-Fang approved these changes Jun 23, 2026

NuojCheng force-pushed the chengnuojin-ragged-cost branch from a2caaca to d8b31fe Compare June 23, 2026 01:32

NuojCheng requested a review from michelle-yooh as a code owner June 23, 2026 01:32

gobbleturk reviewed Jun 23, 2026

gobbleturk approved these changes Jun 23, 2026

add cost estimates for ragged sort kernels

5a9aca2

NuojCheng force-pushed the chengnuojin-ragged-cost branch from d8b31fe to 5a9aca2 Compare June 23, 2026 02:17

NuojCheng added the pull ready label Jun 23, 2026

NuojCheng wants to merge 1 commit into main from chengnuojin-ragged-cost
