[CK-Tile] Improve cshuffle epilogue mfma tile coverage #3701

tenpercent · 2026-02-02T17:54:45Z

Summary

This PR enhances the CShuffleEpilogue test suite by expanding test coverage with parameterized configurations and organizing tests by data type (FP16 and FP8) into separate executables.

Changes

1. Expanded test coverage with parameterized configurations

Added comprehensive test cases covering various epilogue configurations
Introduced parameterized testing for different reduction operations, output types, and scaling modes
Tests now validate multiple combinations of epilogue features to ensure robust functionality

2. Sharded tests into FP16 and FP8 executables

Split test_cshuffle_epilogue.cpp into:
- test_cshuffle_epilogue_fp16.cpp - FP16/BF16 data type tests
- test_cshuffle_epilogue_fp8.cpp - FP8 data type tests
Created test_cshuffle_epilogue_common.hpp to share common test infrastructure
Enables more targeted testing and better organization by data type

3. Follow CMake pattern for FP8 compile flags

Updated test/ck_tile/epilogue/CMakeLists.txt to conditionally add -DCK_TILE_USE_OCP_FP8 compile option when CK_USE_OCP_FP8 is enabled
Follows the established pattern used in other test directories (test/ck_tile/data_type, test/ck_tile/gemm, etc.)
Ensures consistent FP8 flag handling across the codebase

4. Fixed header hygiene

Ensured proper header includes in common test header file
Improved code organization and maintainability

Test plan

[] Built both test executables successfully (test_ck_tile_cshuffle_epilogue_fp16, test_ck_tile_cshuffle_epilogue_fp8)
[] Verified CMake configuration follows codebase conventions
Ready for CI validation

…ions This commit expands the CShuffleEpilogue test coverage with: - Parameterized tests for multiple tile configurations - Half precision (FP16) tests with 5 different tile sizes - FP8 tests with 7 different tile sizes and MFMA configurations - Fix for acc_buffer initialization to fill entire buffer - Improved test organization with TileConfig template - Support for OCP FP8 flag in host code Test configurations cover various wave distributions (1x4, 2x2, 4x1) and XDL sizes to ensure correctness across different hardware patterns. Co-Authored-By: Claude <noreply@anthropic.com>

Split the parameterized tests into separate compilation units to reduce build times and enable parallel compilation: - test_cshuffle_epilogue_fp16.cpp: 5 FP16 configs + scale tests (100 LOC) - test_cshuffle_epilogue_fp8.cpp: 7 FP8 configs (34 LOC) - test_cshuffle_epilogue_common.hpp: Shared test infrastructure (84 LOC) This allows the FP16 and FP8 test suites to compile in parallel, significantly reducing overall build time compared to a single monolithic test file. Co-Authored-By: Claude <noreply@anthropic.com>

- Remove 'using namespace' directive from header file - Add explicit ck_tile:: namespace qualifications in header - Keep 'using namespace ck_tile' in .cpp files for convenience This fixes compilation with -Werror -Wheader-hygiene flags. Co-Authored-By: Claude <noreply@anthropic.com>

…P8 flag Revert the config.hpp host code change and instead follow the established pattern used in other test directories by adding the FP8 compile option directly in CMakeLists.txt when CK_USE_OCP_FP8 is enabled. Co-Authored-By: Claude <noreply@anthropic.com>

Use exactly representable integer values instead of floating-point for unique test data. Since cshuffle only permutes data without arithmetic, this allows exact EXPECT_EQ comparisons instead of EXPECT_FLOAT_EQ. - 16-bit types: Use values 1 to output_size (exactly representable) - 8-bit types: Use (index % 255) + 1 to stay in [1, 255] range Co-Authored-By: Claude <noreply@anthropic.com>

Co-Authored-By: Claude <noreply@anthropic.com>

…omparison Refactor the CShuffleEpilogue tests to properly verify permutation correctness using set-based comparison instead of checking single values. The tests now: - Verify output has correct size and multiple unique values - Check values are in valid range (FP16: positive, FP8: >10 unique values) - For scale tests, compare scaled vs unscaled outputs to verify scaling Co-Authored-By: Claude <noreply@anthropic.com>

Remove duplicated configuration constants from scale tests by reusing the existing HalfConfig_256x256_2x2x1_32x32x8 type alias. Co-Authored-By: Claude <noreply@anthropic.com>

Move CShuffleEpilogueScaleTest (RowCol and Tensor scale tests) from test_cshuffle_epilogue_fp16.cpp into a dedicated test_cshuffle_epilogue_scale target for better test organization and faster incremental builds. Co-Authored-By: Claude <noreply@anthropic.com>

Extract reusable helpers for better code organization: - Add verification namespace with named constants (kMaxFP8Value, kMinUniqueFP8Values, kScaleEpsilon) - Add MakeProblem<Config> type alias to construct problem types from TileConfig, eliminating 11-line duplication in scale tests - Add convert_and_sort_output() helper for reusable output conversion - Add run_scale_comparison_test() helper to bundle unscaled/scaled test execution - Extract verify_permutation_output() for reusable verification logic - Simplify BasicTest from ~45 lines to ~8 lines using new helpers Co-Authored-By: Claude <noreply@anthropic.com>

…e tests - Fix memory leak: add hipFree for m_scale/n_scale buffers in scale tests - Remove unused struct members from CShuffleEpilogueTestResult - Add CK_INSTANTIATE_TYPED_TEST_SUITE macro to reduce pragma duplication - Remove unnecessary :: prefix from MakeProblem usage Co-Authored-By: Claude <noreply@anthropic.com>

Extract scale-specific logic from run_cshuffle_epilogue_test into dedicated helper functions to improve maintainability and readability. Changes: - Add DeviceScaleBuffers RAII wrapper for automatic memory management - Implements move semantics for efficient resource transfer - Prevents memory leaks via automatic cleanup in destructor - Extract buffer allocation into helper functions: - allocate_rowcol_scale_buffers() for per-row/column scaling - allocate_tensor_scale_buffers() for uniform tensor scaling - Extract kernel launch into separate functions: - launch_kernel_with_rowcol_scale() - launch_kernel_with_tensor_scale() - launch_kernel_without_scale() - Add HIP_CHECK_ERROR for all HIP API calls (hipGetLastError, hipDeviceSynchronize) - Use (void) cast for hipFree in destructor/move-assignment (cannot throw) - Reduce main function complexity from 82 to 52 lines (37% reduction) Benefits: - Clear separation of concerns (allocation, launch, cleanup) - Each helper has single, well-defined purpose - Safer resource management via RAII - More consistent error handling across all HIP calls - Easier to extend with new scale types Co-Authored-By: Claude <noreply@anthropic.com>

…fication Update verification logic to properly handle floating-point types: - Change convert_and_sort_output to return std::vector<float> instead of uint32_t - Use tolerance-based uniqueness counting in verify_permutation_output - Replace EXPECT_EQ with EXPECT_NEAR for tensor scale comparison Co-Authored-By: Claude <noreply@anthropic.com>

Change assertion from "at least half" to "all rows" for RowCol scaling verification. If scaling works correctly, every row should show the expected behavior. Co-Authored-By: Claude <noreply@anthropic.com>

Add tests for MFMA tile sizes 16x16x128 and 32x32x64 which are only available on gfx950. The new test file is conditionally compiled when GPU_TARGETS includes gfx950. Co-Authored-By: Claude <noreply@anthropic.com>

Add MfmaDataType template parameter to TileConfig to allow testing FP8-specific tile sizes (KPerXdl=32,64,128) with half_t output. This avoids FP8 range limitations and eliminates special verification logic for 8-bit types. - Use half_t output with fp8_t MFMA type in FP8 tests - Remove kMaxFP8Value and kMinUniqueFP8Values constants - Simplify value generation (no modulo wrapping needed) - Unify verification logic for all output types Co-Authored-By: Claude <noreply@anthropic.com>

Add 1x4 and 4x1 warp layouts in addition to existing 2x2 layout for gfx950-specific tile sizes (16x16x128, 32x32x64). Co-Authored-By: Claude <noreply@anthropic.com>

Add NaN check and clarify that C-shuffle epilogue broadcasts values from the distributed accumulator tile, so duplicates are expected. Verification now checks for: correct size, no NaN, no zeros, and multiple unique values. Co-Authored-By: Claude <noreply@anthropic.com>

…ize unique values The previous verification only checked for more than 1 unique value. Since each thread generates unique values (thread_id * buffer_size + i + 1), we should expect at least kBlockSize unique values in the output. This stricter check verifies that all threads contributed to the output, providing better test coverage for the C-shuffle epilogue correctness. Co-Authored-By: Claude <noreply@anthropic.com>

Define kTestScaleFactor, kIdentityScale, and kScaledColIndex constants in the verification namespace to eliminate hardcoded magic numbers. This ensures the scale factors used in buffer allocation match those used in test verification, making the tests more maintainable. Co-Authored-By: Claude <noreply@anthropic.com>

tenpercent and others added 20 commits January 30, 2026 17:35

Apply clang-format to cshuffle epilogue tests

8263a1e

Co-Authored-By: Claude <noreply@anthropic.com>

[CK_TILE] Refactor scale tests to reuse HalfConfig type alias

e5cf4eb

Remove duplicated configuration constants from scale tests by reusing the existing HalfConfig_256x256_2x2x1_32x32x8 type alias. Co-Authored-By: Claude <noreply@anthropic.com>

[CK_TILE] Strengthen RowCol scale test to require all rows pass

69a9fa4

Change assertion from "at least half" to "all rows" for RowCol scaling verification. If scaling works correctly, every row should show the expected behavior. Co-Authored-By: Claude <noreply@anthropic.com>

[CK_TILE] Add gfx950-specific FP8 CShuffleEpilogue tests

7860706

Add tests for MFMA tile sizes 16x16x128 and 32x32x64 which are only available on gfx950. The new test file is conditionally compiled when GPU_TARGETS includes gfx950. Co-Authored-By: Claude <noreply@anthropic.com>

[CK_TILE] Add more warp layouts to gfx950 CShuffleEpilogue tests

6458734

Add 1x4 and 4x1 warp layouts in addition to existing 2x2 layout for gfx950-specific tile sizes (16x16x128, 32x32x64). Co-Authored-By: Claude <noreply@anthropic.com>

tenpercent marked this pull request as ready for review February 3, 2026 00:46

tenpercent requested review from afagaj, andriy-ca, aosewski, bartekxk, carlushuang, geyyer, illsilin, poyenc and qianfengz as code owners February 3, 2026 00:46

tenpercent requested review from Snektron, ThomasNing, asleepzzz, cgmillette, coderfeli, shumway, vidyasagar-amd and vpietila-amd as code owners February 3, 2026 00:46

clang-format

fb541b6

tenpercent marked this pull request as draft February 3, 2026 04:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CK-Tile] Improve cshuffle epilogue mfma tile coverage #3701

[CK-Tile] Improve cshuffle epilogue mfma tile coverage #3701

tenpercent commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[CK-Tile] Improve cshuffle epilogue mfma tile coverage #3701

Are you sure you want to change the base?

[CK-Tile] Improve cshuffle epilogue mfma tile coverage #3701

Conversation

tenpercent commented Feb 2, 2026

Summary

Changes

1. Expanded test coverage with parameterized configurations

2. Sharded tests into FP16 and FP8 executables

3. Follow CMake pattern for FP8 compile flags

4. Fixed header hygiene

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants