Refactor vector type to reduce build times #3641

cgmillette · 2026-01-23T17:20:00Z

Proposed changes

Build times are affected by many factors, and much of the cost comes from how we write and use the code. Two critical stages of a build are frontend parsing and backend codegen and compilation.

Frontend Parsing

The length of the code, the include header tree and macro expansions all affect the front-end parsing time.
This PR reduces the parsing time of the vector_type class in dtype_vector.hpp by generalizing away redundant code.

Partial specializations of vector_type for native and non-native datatypes have been generalized to one single class, consolidating all of the data initialization and AsType casting requirements into one place.
The nnvb_data_t_selector class (i.e., non-native vector base dataT selector) has been removed and replaced with scalar_type instantiations, as they serve the same purpose: scalar_type already maps generalized datatypes to native types compatible with ext_vector_t.
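To illustrate the kind of mapping described above, here is a minimal sketch of a trait that maps a non-native wrapper type to a native type of the same width. The names scalar_type_sketch and bf16_sketch are hypothetical stand-ins invented for this example; they are not the definitions in data_type.hpp.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical non-native 16-bit wrapper type (illustration only).
struct bf16_sketch
{
    std::uint16_t bits;
};

// Primary template: a native type maps to itself.
template <typename T>
struct scalar_type_sketch
{
    using type = T;
};

// Specialization: the wrapper maps to the unsigned integer of equal width,
// which a compiler vector extension can store directly.
template <>
struct scalar_type_sketch<bf16_sketch>
{
    using type = std::uint16_t;
};

static_assert(std::is_same<scalar_type_sketch<float>::type, float>::value,
              "native types are left unchanged");
static_assert(sizeof(scalar_type_sketch<bf16_sketch>::type) == sizeof(bf16_sketch),
              "the mapped type keeps the storage width");
```

With a single trait of this shape, one generalized vector class can pick its storage element type for both native and non-native inputs, which is presumably why a second selector class became redundant.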

Backend Codegen

Template instantiation behavior can also affect build times. Recursive instantiations are very slow versus concrete instantiations. The compiler must make multiple passes to expand template instantiations so we need to be careful about how they are used.

Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>.

template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};

Upon further inspection, StaticallyIndexedArray is built on top of a recursive Tuple concatenation.

template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};

This union storage has been removed from the vector_type storage class.

Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays.

template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
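As a rough illustration of why a concrete array is cheaper to instantiate, the sketch below shows a C-style array wrapper with compile-time indexed access. The name static_array_sketch and its at<I>() accessor are assumptions made for this example; the real StaticallyIndexedArray_v2 interface is elided above and may differ.

```cpp
#include <cstddef>

// One class template instantiation per (T, N); no recursive type construction.
template <typename T, std::size_t N>
struct static_array_sketch
{
    // Compile-time indexed read; the bound is checked at compile time.
    template <std::size_t I>
    constexpr const T& at() const
    {
        static_assert(I < N, "index out of range");
        return data_[I];
    }

    static constexpr std::size_t size() { return N; }

    T data_[N];
};

static_assert(sizeof(static_array_sketch<float, 4>) == 4 * sizeof(float),
              "storage is exactly N elements");
```

Instantiating static_array_sketch<float, 64> produces a single type, whereas a halving tuple_concat scheme has to instantiate intermediate types for 32, 16, 8 and so on before the final one exists.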

Fixes

Using the bool datatype with vector_type was previously error-prone. As a native datatype, bool would be stored as bool ext_vector_type(N), which is a packed datatype. This means that, for example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. The union of these datatypes slices the data incorrectly: the bit locations of the packed bools do not match those of the StaticallyIndexedArray member. As such, vector_type now uses C-style array storage for bool instead of ext_vector_type.

template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0

sizeof(StaticallyIndexedArray<NativeVectorT<bool, 1>, 4>) == 4  (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element2 = bit 0 of byte 2
element3 = bit 0 of byte 3

union{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray<NativeVectorT<bool, 1>, 4> d4x1;
};

// union size == 4 which means invalid slicing!
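The mismatch can be modeled in portable C++ without the Clang vector extension. The sketch below is only a model of the two layouts listed above: pack_bools, byte_per_bool_view, overlay and read_lane are hypothetical helpers, and the bytes the packed member does not cover are assumed to be zero.

```cpp
#include <cstdint>

// Packed layout: element i lives in bit i of a single byte.
constexpr std::uint8_t pack_bools(bool e0, bool e1, bool e2, bool e3)
{
    return static_cast<std::uint8_t>((e0 ? 1u : 0u) | (e1 ? 2u : 0u) |
                                     (e2 ? 4u : 0u) | (e3 ? 8u : 0u));
}

// Array layout: element i lives in bit 0 of byte i.
struct byte_per_bool_view
{
    std::uint8_t bytes[4];
};

// Models writing the packed member of the union: only byte 0 is written.
constexpr byte_per_bool_view overlay(std::uint8_t packed)
{
    return byte_per_bool_view{{packed, 0, 0, 0}};
}

// Models reading element i through the array member of the union.
constexpr bool read_lane(const byte_per_bool_view& s, int i)
{
    return (s.bytes[i] & 1u) != 0;
}

// {false, true, true, false} packs to 0b0110.
static_assert(pack_bools(false, true, true, false) == 0x06, "bit i holds element i");
// Element 1 was stored as true, but the array view looks at byte 1 and sees false.
static_assert(!read_lane(overlay(pack_bools(false, true, true, false)), 1),
              "array view disagrees with the packed view");
```

Only element 0 can ever agree between the two views, and only because both happen to use bit 0 of byte 0.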

Math utilities such as next_power_of_two now handle the previously invalid case X < 2
Removed the redundant implementation of next_pow2
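For context on the X < 2 case: a count-leading-zeros formulation evaluates __builtin_clz(X - 1), which is undefined when X - 1 is 0. The following is a minimal sketch of one way to guard it, not the code in math.hpp. The function name is hypothetical, and returning 1 for inputs below 2 is an assumption made here; the PR may define the edge case differently.

```cpp
#include <cstdint>

// Smallest power of two that is >= x, for x up to 2^30.
// Assumption for this sketch: inputs below 2 return 1 instead of reaching
// __builtin_clz(0), which has an undefined result.
constexpr std::int32_t next_power_of_two_sketch(std::int32_t x)
{
    return (x < 2)
               ? std::int32_t{1}
               : (std::int32_t{1} << (32 - __builtin_clz(static_cast<std::uint32_t>(x - 1))));
}

static_assert(next_power_of_two_sketch(0) == 1, "guarded edge case");
static_assert(next_power_of_two_sketch(1) == 1, "guarded edge case");
static_assert(next_power_of_two_sketch(5) == 8, "rounds up");
static_assert(next_power_of_two_sketch(8) == 8, "powers of two are fixed points");
```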

Additions

integer_log2_floor to math.hpp
is_power_of_two_integer to math.hpp

Build Time Analysis

Machine: banff-cyxtera-s78-2
Target: gfx942

Build Target	Threads	Frontend Parse Time (s)	Backend Codegen Time (s)	TotalTime (s)	commitId
device_grouped_conv3d_fwd_bias_bnorm_clamp_instance	1	1452	331	1783	`2e08a7e` (develop)
device_grouped_conv3d_fwd_bias_bnorm_clamp_instance	1	1403	332	1735 (-2.7%)	`fad4235`
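The reported -2.7% can be reproduced from the table, and the same arithmetic splits the change by stage. A small sketch (percent_change is a helper written for this example):

```cpp
// Relative change of candidate versus baseline, in percent.
constexpr double percent_change(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}

// Values taken from the table above (seconds, 1 thread, gfx942).
constexpr double total_delta    = percent_change(1783.0, 1735.0); // about -2.7
constexpr double frontend_delta = percent_change(1452.0, 1403.0); // about -3.4
constexpr double backend_delta  = percent_change(331.0, 332.0);   // about +0.3
```

On this single target the saving comes from the frontend parse time; the backend codegen time is essentially unchanged.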

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Copilot

Pull request overview

This pull request refactors the vector_type implementation to reduce build times by consolidating specialized template implementations into a generalized design. The changes aim to improve frontend parsing times by reducing redundant code and backend codegen times by replacing recursive template instantiations with concrete implementations.

Changes:

Generalizes vector_type partial specializations into a single class with helper structs (vector_type_storage, non_native_vector_base)
Replaces recursive StaticallyIndexedArray with concrete StaticallyIndexedArray_v2
Fixes bool datatype handling with vector_type to avoid data slicing issues
Adds new math utilities: integer_log2_floor and is_power_of_two_integer
Updates next_power_of_two to handle edge cases (X <= 1)
Introduces default scalar_type template specialization with unsigned _BitInt fallback

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
include/ck/utility/statically_indexed_array.hpp	Adds blank line (cosmetic change)
include/ck/utility/math.hpp	Fixes next_power_of_two for edge cases, adds integer_log2_floor and is_power_of_two_integer functions
include/ck/utility/data_type.hpp	Refactors scalar_type with default template and typename keywords, removes redundant next_pow2
include/ck/utility/dtype_vector.hpp	Major refactor: introduces NativeVectorT alias, vector_type_storage helper, generalized vector_type class with is_as_type_cast_valid validation, consolidates non_native_vector_base specializations


Copilot · 2026-01-23T17:26:07Z

include/ck/utility/math.hpp

+    // TODO: x need to be 1 ~ 0x7fffffff
+    // __builtin_clz will produce unexpected result if x is 0;
+    return 31 - __builtin_clz(x);
+}
+
+__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
+{
+    // TODO: x need to be 1 ~ 0x7fffffff
+    return x == (1 << integer_log2_floor(x));


The is_power_of_two_integer function has a potential issue when x equals 1. When x is 1, integer_log2_floor(1) returns 0 (since 31 - __builtin_clz(1) = 31 - 31 = 0), and then (1 << 0) = 1, so the function correctly returns true. However, this edge case behavior should be verified. Additionally, the function can return incorrect results for x <= 0 per the TODO on line 244, but there's no runtime check to prevent such inputs. Consider adding a static_assert or runtime check if this function is called with compile-time constant expressions.

Suggested change

-    // TODO: x need to be 1 ~ 0x7fffffff
-    // __builtin_clz will produce unexpected result if x is 0;
-    return 31 - __builtin_clz(x);
-}
-
-__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
-{
-    // TODO: x need to be 1 ~ 0x7fffffff
-    return x == (1 << integer_log2_floor(x));
+    // x needs to be in the range [1, 0x7fffffff].
+    // __builtin_clz is undefined for x <= 0; guard against that here.
+    return (x > 0) ? (31 - __builtin_clz(static_cast<uint32_t>(x))) : int32_t{-1};
+}
+
+__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
+{
+    // x needs to be in the range [1, 0x7fffffff]. For x <= 0, this is not a power of two.
+    return (x > 0) && (x == (int32_t(1) << integer_log2_floor(x)));
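The edge cases raised in this comment can be verified at compile time. The sketch below restates the guarded bodies as host-only constexpr functions and checks them with static_assert. The _guarded suffix and the omission of the __host__ __device__ qualifiers are choices made for this example, so that it builds with a plain GCC or Clang host compiler.

```cpp
#include <cstdint>

// Host-only restatement of the guarded suggestion; returns -1 for x <= 0.
constexpr std::int32_t integer_log2_floor_guarded(std::int32_t x)
{
    return (x > 0) ? (31 - __builtin_clz(static_cast<std::uint32_t>(x))) : std::int32_t{-1};
}

// The (x > 0) test short-circuits, so the shift never sees a negative count.
constexpr bool is_power_of_two_integer_guarded(std::int32_t x)
{
    return (x > 0) && (x == (std::int32_t{1} << integer_log2_floor_guarded(x)));
}

static_assert(integer_log2_floor_guarded(1) == 0, "31 - clz(1) = 31 - 31 = 0");
static_assert(integer_log2_floor_guarded(0) == -1, "guarded instead of undefined");
static_assert(is_power_of_two_integer_guarded(1), "1 == 1 << 0");
static_assert(!is_power_of_two_integer_guarded(0), "zero is rejected");
static_assert(!is_power_of_two_integer_guarded(-8), "negative values are rejected");
static_assert(!is_power_of_two_integer_guarded(6), "6 != 1 << 2");
```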

tenpercent · 2026-02-03T01:23:07Z

include/ck/utility/data_type.hpp

+ * @tparam T The element type of the vector
+ * @tparam Rank The number of elements in the vector
+ */
+template <typename T, index_t Rank>


using this attribute requires T to be a builtin. maybe it's possible to use some concepts here, conditionally if they are available

Yes, once we have ability to use concepts, we can make some requirements about attribute support. For now we assume Clang supports it.

tenpercent · 2026-02-03T01:25:08Z

include/ck/utility/data_type.hpp

+template <typename T>
+inline constexpr bool is_native_type()
+{
+    return is_same_v<T, double> || is_same_v<T, float> || is_same_v<T, half_t> ||


what's the difference to std::is_arithmetic_v?

std::is_arithmetic_v == false for _Float16 and _BitInt

This was original code, didn't want to change it :)

tenpercent · 2026-02-03T01:28:33Z

include/ck/utility/data_type.hpp

+
+/// @brief scalar_type trait override for uint32_t vector of size 3
+template <>
+struct scalar_type<uint32_t __attribute__((__vector_size__(sizeof(uint32_t) * 3)))>


what's the difference between ext_vector_type and __vector_size__?

Compiler treats them separately. I think I've discovered some compiler bugs that I'll report later. You can play with those here: https://godbolt.org/z/v87KEfx7d

Each one is trivially constructible from the other, however some of the llvm __builtins return types use the vector_size notation while we mostly use the ext_vector_type due to the ability to alias them properly.

That's an interesting observation! I liked this doc regarding vector_size https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html, it explains some details. In godbolt I tried to experiment by defining vectors as:

template<typename T, index_t Rank>
struct NativeVectorT3 {
    using dtype = T;
    using type = __attribute__((__vector_size__(Rank * sizeof(T)))) T;
    static constexpr index_t vector_rank = Rank;
};

This way the template arguments can be deduced and the test passes. I'm not sure if it brings us back to the previous design though XD

Regarding NativeVectorT2, I would agree with the compiler: we allocate a vector of floats that is 4 bytes long, so it is effectively a vector of 1 float, and the test fails.
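The point about byte counts can be checked directly: vector_size takes a size in bytes, while ext_vector_type takes an element count. A minimal sketch using only vector_size, which both GCC and Clang accept (the typedef names and lane_count are hypothetical helpers for this example):

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte float");

// The attribute argument is the total size in bytes, not the number of lanes.
typedef float float4_vs __attribute__((vector_size(4 * sizeof(float))));
typedef float float2_vs __attribute__((vector_size(2 * sizeof(float))));

// Number of elements implied by the byte size of the vector type.
template <typename V, typename T>
constexpr std::size_t lane_count()
{
    return sizeof(V) / sizeof(T);
}

static_assert(sizeof(float4_vs) == 16, "16 bytes");
static_assert(lane_count<float4_vs, float>() == 4, "16 bytes of float is 4 lanes");
static_assert(lane_count<float2_vs, float>() == 2, "8 bytes of float is 2 lanes");
```

By the same arithmetic, passing 4 as the byte size for float gives 4 / 4 = 1 lane, which matches the reading of NativeVectorT2 above.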

geyyer

@cgmillette, thanks for refactoring this! Removing a lot of extra code is good, and cutting build time is even better!

ammallya · 2026-02-03T22:01:39Z

Imported to ROCm/rocm-libraries

cgmillette requested a review from Copilot January 23, 2026 17:20

Copilot started reviewing on behalf of cgmillette January 23, 2026 17:20

Copilot AI reviewed Jan 23, 2026

cgmillette marked this pull request as ready for review January 27, 2026 00:16

cgmillette requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 27, 2026 00:16

cgmillette force-pushed the refactor_vector_type branch from f46b373 to bea883a January 30, 2026 20:49

cgmillette and others added 8 commits February 2, 2026 13:09

Refactor vector_type to reduce build time

c1a5bba

Apply suggestions from code review

8b307fd

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fixes clang formatting and adjust storage class type check

c3e573d

Re-home NativeVectorT and ensure partial specialization with scalar_t…

b731dc1

…ype is visible where needed

Fixes scalar_type definition for llvm builtin mma type

13a5177

Fixes aliasing for vector size of 1

f518830

Workaround adjustment to scalar_type<pk_i4_t>::type. Skips invalid ca…

cc0f42f

…se for pk_i4_t, but should be addressed in the future.

Use generalized vector_type_traits instead of scalar_type traits. Fix…

cffc2f5

…es incorrect slicing calculations from scalar_type::vector_size with f6_pk_t types.

cgmillette force-pushed the refactor_vector_type branch from 8219757 to cffc2f5 February 2, 2026 18:11

cgmillette assigned geyyer Feb 2, 2026

tenpercent reviewed Feb 3, 2026

include/ck/utility/math.hpp

tenpercent reviewed Feb 3, 2026

include/ck/utility/dtype_vector.hpp

tenpercent reviewed Feb 3, 2026

include/ck/utility/dtype_vector.hpp

geyyer reviewed Feb 3, 2026

assistant-librarian bot mentioned this pull request Feb 3, 2026

Refactor vector type to reduce build times ROCm/rocm-libraries#4281

Open

7 tasks

ammallya closed this Feb 3, 2026
