
tune triton gemm kernel for MI355 DSV3 DP+EP configuration #2016

Open

inkcherry wants to merge 3 commits into ROCm:main from inkcherry:tune

Conversation

inkcherry commented on Feb 10, 2026

Decode-side benchmark results. cc @Duyi-Wang

| M | N | K | Time_old (ms) | Time_new (ms) | Speedup (old/new) | Delta (new-old, ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | 2112 | 7168 | 0.055951 | 0.056129 | 0.997 | 0.000178 |
| 32 | 2112 | 7168 | 0.048471 | 0.048778 | 0.994 | 0.000307 |
| 64 | 2112 | 7168 | 0.046766 | 0.047719 | 0.980 | 0.000953 |
| 128 | 2112 | 7168 | 0.047608 | 0.047576 | 1.001 | -0.000032 |
| 256 | 2112 | 7168 | 0.097757 | 0.046720 | 2.092 | -0.051037 |
| 16 | 4096 | 7168 | 0.055328 | 0.054776 | 1.010 | -0.000552 |
| 32 | 4096 | 7168 | 0.057020 | 0.054329 | 1.050 | -0.002691 |
| 64 | 4096 | 7168 | 0.055331 | 0.056177 | 0.985 | 0.000846 |
| 128 | 4096 | 7168 | 0.049443 | 0.048664 | 1.016 | -0.000779 |
| 256 | 4096 | 7168 | 0.109374 | 0.046923 | 2.331 | -0.062451 |
| 16 | 7168 | 16384 | 0.183842 | 0.055107 | 3.336 | -0.128735 |
| 32 | 7168 | 16384 | 0.192137 | 0.053328 | 3.603 | -0.138809 |
| 64 | 7168 | 16384 | 0.200794 | 0.048574 | 4.134 | -0.152220 |
| 128 | 7168 | 16384 | 0.220417 | 0.052922 | 4.165 | -0.167495 |
| 256 | 7168 | 16384 | 0.208873 | 0.085082 | 2.455 | -0.123791 |
| 16 | 7168 | 2048 | 0.055840 | 0.055515 | 1.006 | -0.000325 |
| 32 | 7168 | 2048 | 0.056272 | 0.024918 | 2.258 | -0.031354 |
| 64 | 7168 | 2048 | 0.024018 | 0.023520 | 1.021 | -0.000498 |
| 128 | 7168 | 2048 | 0.020884 | 0.021803 | 0.958 | 0.000919 |
| 256 | 7168 | 2048 | 0.032093 | 0.020644 | 1.555 | -0.011449 |
| 16 | 16384 | 1536 | 0.024423 | 0.020252 | 1.206 | -0.004171 |
| 32 | 16384 | 1536 | 0.025440 | 0.020578 | 1.236 | -0.004862 |
| 64 | 16384 | 1536 | 0.025345 | 0.018211 | 1.392 | -0.007134 |
| 128 | 16384 | 1536 | 0.026026 | 0.018262 | 1.425 | -0.007764 |
| 256 | 16384 | 1536 | 0.026593 | 0.022947 | 1.159 | -0.003646 |
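
For context, per-shape timings like these are typically collected with `triton.testing.do_bench`; below is a minimal sketch of such a measurement loop, where `gemm_fn` is a placeholder for the A8W8 block-scale GEMM under test (an assumption, not the aiter API):

```python
import torch
import triton.testing

# Hypothetical stand-in for the A8W8 block-scale GEMM being benchmarked.
def gemm_fn(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a @ b

shapes = [(16, 2112, 7168), (256, 4096, 7168), (128, 7168, 16384)]
for m, n, k in shapes:
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    ms = triton.testing.do_bench(lambda: gemm_fn(a, b))  # measured runtime in ms
    print(f"M={m} N={n} K={k}: {ms:.6f} ms")
```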

inkcherry requested review from a team and Copilot on February 10, 2026 04:41
Copilot AI (Contributor) left a comment

Pull request overview

Updates Triton GEMM tuning configs for gfx950 (MI355), targeting the A8W8 block-scale path for specific (N, K) shapes and small-M specializations.

Changes:

  • Add new tuned config files for (N=7168, K=16384) and (N=16384, K=1536).
  • Refine existing per-M-threshold tuning parameters (block sizes, warps/stages, waves_per_eu, k-splitting).
  • Add/adjust ultra-small-M special cases (e.g., M_LEQ_8) for some shapes; see the selection sketch after this list.
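
For reference, a minimal sketch of how such per-M buckets could be resolved at dispatch time, assuming keys of the form `M_LEQ_<threshold>` with an `"any"` fallback as seen in these config files; the actual selection logic in aiter may differ:

```python
import json

def select_gemm_config(config_path: str, m: int) -> dict:
    """Pick the tuning config for a given M: tightest matching M_LEQ_<t> bucket, else 'any'."""
    with open(config_path) as f:
        configs = json.load(f)
    # Gather M_LEQ_<threshold> keys, sorted by threshold ascending so the
    # tightest applicable bucket wins (e.g. M_LEQ_8 before M_LEQ_256).
    buckets = sorted(
        (int(key.rsplit("_", 1)[1]), key)
        for key in configs
        if key.startswith("M_LEQ_")
    )
    for threshold, key in buckets:
        if m <= threshold:
            return configs[key]
    return configs["any"]

# Hypothetical usage against one of the retuned files:
# cfg = select_gemm_config(
#     "aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=2112-K=7168.json", m=8)
```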

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=7168-K=2048.json | Retunes per-M configs and adds an M_LEQ_8 specialization for K=2048. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=7168-K=16384.json | Adds a new tuned config for larger K=16384. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=4096-K=7168.json | Retunes per-M configs and adds an M_LEQ_8 specialization for N=4096/K=7168. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=2112-K=7168.json | Retunes per-M configs, adds M_LEQ_8, and introduces M_LEQ_256. |
| aiter/ops/triton/configs/gemm/gfx950-GEMM-A8W8_BLOCKSCALE-N=16384-K=1536.json | Adds a new tuned config for N=16384/K=1536. |


Comment on lines +62 to +73
"any": {
"BLOCK_SIZE_M": 64,
"BLOCK_SIZE_N": 64,
"BLOCK_SIZE_K": 128,
"GROUP_SIZE_M": 8,
"num_warps": 4,
"num_stages": 2,
"waves_per_eu": 1,
"matrix_instr_nonkdim": 16,
"cache_modifier": null,
"NUM_KSPLIT": 1
}

Copilot AI Feb 10, 2026


This JSON appears to be missing the final closing brace for the root object. The last line closes the any object, but there is no subsequent } to close the top-level {, making the file invalid JSON. Add a final } at end-of-file.
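
A quick guard against this class of error is to round-trip every config file through `json.load`, e.g. as a CI check; a minimal sketch follows (the glob path is illustrative):

```python
import glob
import json

# Raises json.JSONDecodeError on any malformed file, e.g. a missing closing brace.
for path in glob.glob("aiter/ops/triton/configs/gemm/*.json"):
    with open(path) as f:
        json.load(f)
    print(f"valid JSON: {path}")
```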

