
Conversation


@bidlekm bidlekm commented Jan 28, 2026

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Contributor

@zsotakal zsotakal left a comment

Looks good! A few minor things, mostly related to code quality.


for(int i = 0; i < problem_size.group_count; i++)
{
problem_size.Ms.push_back(128 + rand() % 128);
Contributor

Did you intend to change the xdl example?

id_local += grid_size_grp;
}

#undef TRACE_THREAD
Contributor

Is this needed, or is it just a leftover from troubleshooting?

typename BElementwiseOperation,
typename CDEElementwiseOperation,
GemmSpecialization GemmSpec,
ck::index_t NumGemmKPrefetchStage,
Contributor

I think this parameter is XDL only

std::ostringstream err;
err << "Not all gemms have same value for main_k0_block_loop! in " << __FILE__
<< ":" << __LINE__ << ", in function: " << __func__;
// throw std::runtime_error(err.str());
Contributor

Did you intentionally remove the runtime_error?
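
If the check is still meant to be fatal, restoring it would look roughly like this (just a sketch of the pattern already in the quoted code, not a new API):

std::ostringstream err;
err << "Not all gemms have same value for main_k0_block_loop! in " << __FILE__
    << ":" << __LINE__ << ", in function: " << __func__;
// fail loudly instead of silently continuing with inconsistent groups
throw std::runtime_error(err.str());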

arg.c_element_op_);
};

// const auto tail_num =
Contributor

I couldn't find this tail-number logic anywhere in the existing code. Also, there are seemingly related commented-out sections below. Can you explain a bit what the motivation behind it was?

{
using GemmSpecialization = tensor_operation::device::GemmSpecialization;
constexpr bool padM = GemmSpec == GemmSpecialization::MKPadding ||
// using GemmSpecialization = tensor_operation::device::GemmSpecialization;
Contributor

dead code (repeats multiple times)

b_k_n[i].GenerateTensorValue(GeneratorTensor_2<BDataType>{-5, 5}, num_thread);
ck::utils::FillUniformDistributionIntegerValue<ADataType>{-5.f, 5.f}(a_m_k[i]);
ck::utils::FillUniformDistributionIntegerValue<BDataType>{-5.f, 5.f}(b_k_n[i]);
max_abs_in_val = 10.f;
Contributor

Isn't this supposed to be 5.0f?
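
Since the inputs are filled from [-5, 5] right above, something along these lines seems more consistent (assuming max_abs_in_val is meant to bound the absolute value of the generated inputs):

ck::utils::FillUniformDistributionIntegerValue<ADataType>{-5.f, 5.f}(a_m_k[i]);
ck::utils::FillUniformDistributionIntegerValue<BDataType>{-5.f, 5.f}(b_k_n[i]);
// match the [-5, 5] fill range used above
max_abs_in_val = 5.f;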

target_link_libraries(${PROFILER_EXECUTABLE} PRIVATE ${PROFILER_LIBS})

rocm_install(TARGETS ${PROFILER_EXECUTABLE} COMPONENT profiler)

Contributor

This code block looks intended for troubleshooting. Please remove it.

#include "ck/utility/tuple.hpp"

#include "ck/tensor_description/tensor_descriptor.hpp"
#include "ck/tensor_description/tensor_descriptor_helper.hpp"
Contributor

This header is not used.

>(static_cast<void*>(p_shared),
splitk_batch_offset,
kernel_arg,
block_2_etile_map,
Contributor

block_2_etile_map and splitk_batch_offset can be defined as const.
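
For example (the initializer expressions below are hypothetical placeholders for whatever the surrounding code already does; the point is only the added const, since neither object is modified after construction):

// hypothetical helpers standing in for the existing initializer expressions
const auto splitk_batch_offset = make_splitk_batch_offset(kernel_arg);
const auto block_2_etile_map   = make_block_2_etile_map(kernel_arg);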

1, 1, 1, 1, 1))>;

template <typename UnderlyingBlockToCTileMap>
struct OffsettedBlockToCTileMapMLoops
Contributor

OffsettedBlockToCTileMapMLoops is the same struct as in the Xdl variant. Maybe consider moving it to the base class DeviceGroupedGemmFixedNK?
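
Roughly what I have in mind (the file location is illustrative; the body would stay exactly as it is today):

// shared between the Xdl and Wmma fixed-NK devices, e.g. next to
// DeviceGroupedGemmFixedNK (hypothetical location)
template <typename UnderlyingBlockToCTileMap>
struct OffsettedBlockToCTileMapMLoops
{
    // unchanged definition, kept in a single place
};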

};

template <index_t MPerBlock_, index_t NPerBlock_>
struct BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops
Contributor

Same as my other comment. Maybe consider moving this struct definition to the base class?

}
}

// private:
Contributor

I can see this comes from the XDL variant, but this line serves no purpose other than (in the best case) annoying the reader or (in the worst case) making them wonder about the reason behind it and lose time on it.

}
}

// if constexpr(std::is_same<ADataType, ck::bhalf_t>::value)
Contributor

Fix or remove this

{
ignore = tail_num;
lambda(std::integral_constant<TailNumber, TailNumber::Full>{});
// switch(tail_num)
Contributor

Same as my comment above, fix this or remove it.
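
If the dispatch is still wanted, restoring the switch would look roughly like this (which TailNumber values actually need handling here is an assumption on my side):

switch(tail_num)
{
case TailNumber::Full:
    lambda(std::integral_constant<TailNumber, TailNumber::Full>{});
    break;
case TailNumber::Odd:
    lambda(std::integral_constant<TailNumber, TailNumber::Odd>{});
    break;
case TailNumber::Even:
    lambda(std::integral_constant<TailNumber, TailNumber::Even>{});
    break;
default:
    lambda(std::integral_constant<TailNumber, TailNumber::Full>{});
    break;
}

Otherwise drop the ignore and the commented-out lines and keep only the TailNumber::Full call.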

#include "ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp"
#include "ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector.hpp"
#include "ck/library/tensor_operation_instance/add_device_operation_instance.hpp"
#include "ck/utility/loop_scheduler.hpp"
Contributor

This is not used; include scheduler_enum instead.

#include "ck/tensor_operation/gpu/device/tensor_layout.hpp"
#include "ck/tensor_operation/gpu/device/gemm_specialization.hpp"
#include "ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp"
#include "ck/tensor_operation/gpu/grid/gridwise_gemm_pipeline_selector.hpp"
Contributor

This include is not used

static constexpr InstanceVariant InstanceVariants[] = {

make_tuple(GemmDefault, IntrawaveScheduler, PipelineV1),
// make_tuple(GemmDefault, InterwaveScheduler, PipelineV1),
Contributor

Remove those comments

bool pass = true;
for(int kbatch : kbatches)
{
pass &= ck::profiler::profile_grouped_gemm_fixed_nk_impl<ADataType,
Contributor

This function call may throw. What happens if it does? Should you maybe pass fail_if_no_supported_instances as an argument and return early without throwing? Or at least use a try-catch block here?
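
A try-catch sketch of what I mean (the error-handling policy is of course up to you; std::cerr needs <iostream>):

bool pass = true;
for(int kbatch : kbatches)
{
    try
    {
        // same call and template arguments as above
        pass &= ck::profiler::profile_grouped_gemm_fixed_nk_impl</* ... */>(/* ... */);
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << std::endl;
        // or return early here, depending on fail_if_no_supported_instances
        pass = false;
    }
}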

@chris-tsiaousis-hpc
Contributor

Added some comments you might want to address. Great work overall! :)
