Skip to content

Conversation

@tenpercent
Copy link
Contributor

Summary

This PR enhances the CShuffleEpilogue test suite by expanding test coverage with parameterized configurations and organizing tests by data type (FP16 and FP8) into separate executables.

Changes

1. Expanded test coverage with parameterized configurations

  • Added comprehensive test cases covering various epilogue configurations
  • Introduced parameterized testing for different reduction operations, output types, and scaling modes
  • Tests now validate multiple combinations of epilogue features to ensure robust functionality

2. Sharded tests into FP16 and FP8 executables

  • Split test_cshuffle_epilogue.cpp into:
    • test_cshuffle_epilogue_fp16.cpp - FP16/BF16 data type tests
    • test_cshuffle_epilogue_fp8.cpp - FP8 data type tests
  • Created test_cshuffle_epilogue_common.hpp to share common test infrastructure
  • Enables more targeted testing and better organization by data type

3. Follow CMake pattern for FP8 compile flags

  • Updated test/ck_tile/epilogue/CMakeLists.txt to conditionally add -DCK_TILE_USE_OCP_FP8 compile option when CK_USE_OCP_FP8 is enabled
  • Follows the established pattern used in other test directories (test/ck_tile/data_type, test/ck_tile/gemm, etc.)
  • Ensures consistent FP8 flag handling across the codebase

4. Fixed header hygiene

  • Ensured proper header includes in common test header file
  • Improved code organization and maintainability

Test plan

  • [] Built both test executables successfully (test_ck_tile_cshuffle_epilogue_fp16, test_ck_tile_cshuffle_epilogue_fp8)
  • [] Verified CMake configuration follows codebase conventions
  • Ready for CI validation

tenpercent and others added 20 commits January 30, 2026 17:35
…ions

This commit expands the CShuffleEpilogue test coverage with:
- Parameterized tests for multiple tile configurations
- Half precision (FP16) tests with 5 different tile sizes
- FP8 tests with 7 different tile sizes and MFMA configurations
- Fix for acc_buffer initialization to fill entire buffer
- Improved test organization with TileConfig template
- Support for OCP FP8 flag in host code

Test configurations cover various wave distributions (1x4, 2x2, 4x1)
and XDL sizes to ensure correctness across different hardware patterns.

Co-Authored-By: Claude <noreply@anthropic.com>
Split the parameterized tests into separate compilation units to reduce
build times and enable parallel compilation:

- test_cshuffle_epilogue_fp16.cpp: 5 FP16 configs + scale tests (100 LOC)
- test_cshuffle_epilogue_fp8.cpp: 7 FP8 configs (34 LOC)
- test_cshuffle_epilogue_common.hpp: Shared test infrastructure (84 LOC)

This allows the FP16 and FP8 test suites to compile in parallel,
significantly reducing overall build time compared to a single
monolithic test file.

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove 'using namespace' directive from header file
- Add explicit ck_tile:: namespace qualifications in header
- Keep 'using namespace ck_tile' in .cpp files for convenience

This fixes compilation with -Werror -Wheader-hygiene flags.

Co-Authored-By: Claude <noreply@anthropic.com>
…P8 flag

Revert the config.hpp host code change and instead follow the established
pattern used in other test directories by adding the FP8 compile option
directly in CMakeLists.txt when CK_USE_OCP_FP8 is enabled.

Co-Authored-By: Claude <noreply@anthropic.com>
Use exactly representable integer values instead of floating-point for
unique test data. Since cshuffle only permutes data without arithmetic,
this allows exact EXPECT_EQ comparisons instead of EXPECT_FLOAT_EQ.

- 16-bit types: Use values 1 to output_size (exactly representable)
- 8-bit types: Use (index % 255) + 1 to stay in [1, 255] range

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…omparison

Refactor the CShuffleEpilogue tests to properly verify permutation correctness
using set-based comparison instead of checking single values. The tests now:

- Verify output has correct size and multiple unique values
- Check values are in valid range (FP16: positive, FP8: >10 unique values)
- For scale tests, compare scaled vs unscaled outputs to verify scaling

Co-Authored-By: Claude <noreply@anthropic.com>
Remove duplicated configuration constants from scale tests by reusing
the existing HalfConfig_256x256_2x2x1_32x32x8 type alias.

Co-Authored-By: Claude <noreply@anthropic.com>
Move CShuffleEpilogueScaleTest (RowCol and Tensor scale tests) from
test_cshuffle_epilogue_fp16.cpp into a dedicated test_cshuffle_epilogue_scale
target for better test organization and faster incremental builds.

Co-Authored-By: Claude <noreply@anthropic.com>
Extract reusable helpers for better code organization:
- Add verification namespace with named constants (kMaxFP8Value,
  kMinUniqueFP8Values, kScaleEpsilon)
- Add MakeProblem<Config> type alias to construct problem types from
  TileConfig, eliminating 11-line duplication in scale tests
- Add convert_and_sort_output() helper for reusable output conversion
- Add run_scale_comparison_test() helper to bundle unscaled/scaled
  test execution
- Extract verify_permutation_output() for reusable verification logic
- Simplify BasicTest from ~45 lines to ~8 lines using new helpers

Co-Authored-By: Claude <noreply@anthropic.com>
…e tests

- Fix memory leak: add hipFree for m_scale/n_scale buffers in scale tests
- Remove unused struct members from CShuffleEpilogueTestResult
- Add CK_INSTANTIATE_TYPED_TEST_SUITE macro to reduce pragma duplication
- Remove unnecessary :: prefix from MakeProblem usage

Co-Authored-By: Claude <noreply@anthropic.com>
Extract scale-specific logic from run_cshuffle_epilogue_test into
dedicated helper functions to improve maintainability and readability.

Changes:
- Add DeviceScaleBuffers RAII wrapper for automatic memory management
  - Implements move semantics for efficient resource transfer
  - Prevents memory leaks via automatic cleanup in destructor
- Extract buffer allocation into helper functions:
  - allocate_rowcol_scale_buffers() for per-row/column scaling
  - allocate_tensor_scale_buffers() for uniform tensor scaling
- Extract kernel launch into separate functions:
  - launch_kernel_with_rowcol_scale()
  - launch_kernel_with_tensor_scale()
  - launch_kernel_without_scale()
- Add HIP_CHECK_ERROR for all HIP API calls (hipGetLastError, hipDeviceSynchronize)
- Use (void) cast for hipFree in destructor/move-assignment (cannot throw)
- Reduce main function complexity from 82 to 52 lines (37% reduction)

Benefits:
- Clear separation of concerns (allocation, launch, cleanup)
- Each helper has single, well-defined purpose
- Safer resource management via RAII
- More consistent error handling across all HIP calls
- Easier to extend with new scale types

Co-Authored-By: Claude <noreply@anthropic.com>
…fication

Update verification logic to properly handle floating-point types:
- Change convert_and_sort_output to return std::vector<float> instead of uint32_t
- Use tolerance-based uniqueness counting in verify_permutation_output
- Replace EXPECT_EQ with EXPECT_NEAR for tensor scale comparison

Co-Authored-By: Claude <noreply@anthropic.com>
Change assertion from "at least half" to "all rows" for RowCol scaling
verification. If scaling works correctly, every row should show the
expected behavior.

Co-Authored-By: Claude <noreply@anthropic.com>
Add tests for MFMA tile sizes 16x16x128 and 32x32x64 which are only
available on gfx950. The new test file is conditionally compiled when
GPU_TARGETS includes gfx950.

Co-Authored-By: Claude <noreply@anthropic.com>
Add MfmaDataType template parameter to TileConfig to allow testing
FP8-specific tile sizes (KPerXdl=32,64,128) with half_t output.
This avoids FP8 range limitations and eliminates special verification
logic for 8-bit types.

- Use half_t output with fp8_t MFMA type in FP8 tests
- Remove kMaxFP8Value and kMinUniqueFP8Values constants
- Simplify value generation (no modulo wrapping needed)
- Unify verification logic for all output types

Co-Authored-By: Claude <noreply@anthropic.com>
Add 1x4 and 4x1 warp layouts in addition to existing 2x2 layout
for gfx950-specific tile sizes (16x16x128, 32x32x64).

Co-Authored-By: Claude <noreply@anthropic.com>
Add NaN check and clarify that C-shuffle epilogue broadcasts values
from the distributed accumulator tile, so duplicates are expected.
Verification now checks for: correct size, no NaN, no zeros, and
multiple unique values.

Co-Authored-By: Claude <noreply@anthropic.com>
…ize unique values

The previous verification only checked for more than 1 unique value.
Since each thread generates unique values (thread_id * buffer_size + i + 1),
we should expect at least kBlockSize unique values in the output.

This stricter check verifies that all threads contributed to the output,
providing better test coverage for the C-shuffle epilogue correctness.

Co-Authored-By: Claude <noreply@anthropic.com>
Define kTestScaleFactor, kIdentityScale, and kScaledColIndex constants
in the verification namespace to eliminate hardcoded magic numbers.

This ensures the scale factors used in buffer allocation match those
used in test verification, making the tests more maintainable.

Co-Authored-By: Claude <noreply@anthropic.com>
@tenpercent tenpercent marked this pull request as draft February 3, 2026 04:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants