Conversation
…<>(NewOperationTuple) returns.
…not have before. Completed NewFusedOperation, missing tests. Adding FusedInstantiableOperation. Adding size to NewOperationTuple
… nested FusedOperations.
…on, and the DivergentTransformDPP
… get<>(NewOperationTuple) to get_opt, to ease debugging, fixing tests
…tion, and renaming the new ones to the final name
… element. Created a variadic version of TypeListCat_t with fold expressions. Fixed a test that was creating instances of TypeList. Fixed an issue with AreVVEqCN, which was evaluating fk::cn with types that are not CUDA vectors
…ntations-with-fold-expressions-to-reduce-compiler-memory-usage-and-compilation-times * LTS-C++17: Commenting out incomplete code # Conflicts: # tests/examples/inlining_and_LDL_STL.h
Pull request overview
Refactors core compile-time plumbing around TypeList, OperationTuple, and FusedOperation to use fold expressions (reducing recursive template depth), and updates tests/examples accordingly while moving memory_operations.h to algorithms/basic_ops.
Changes:
- Reworks fk::TypeList and TypeListCat_t to reduce recursive instantiation (index-based leaf/getter approach).
- Replaces the recursive OperationTuple construction/access with make_new_operation_tuple(...) + get_opt<Idx>(...) and updates fusing/building code paths.
- Updates many tests/examples/includes to the new operation/fusion tuple APIs and the new memory_operations.h include location.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| utests/core/execution_model/utest_executors.h | Updates include path and adjusts expected fused type alias. |
| utests/algorithm/image_processing/utest_saturate/utest_saturate_common.h | Simplifies saturate expected-value comparison logic. |
| tests/operation_test_utils.h | Updates read-type detection trait usage. |
| tests/operation/test_operation_types.h | Updates operation type lists and type-traits used in compile-time tests. |
| tests/operation/test_operation_tuple.h | Migrates to make_new_operation_tuple / get_opt-style tuple APIs and adds new tuple tests. |
| tests/operation/test_operation_fuser.h | Updates complex fused-type composition and simplifies test logic. |
| tests/operation/test_instantiable_operations.h | Updates type assertions for new operation tuple / instantiable-operation shapes. |
| tests/operation/test_fused_operation.h | Updates assertions to new tuple access and new fused operation type shapes. |
| tests/operation/test_filtered_index_sequence.h | Updates restriction/type-list usage to new operation types and restriction naming. |
| tests/examples/readme_test_code.h | Makes several example constants constexpr and updates include path. |
| tests/examples/inlining_and_LDL_STL.h | Replaces recursive operate() with fold-expression execution; updates pack indexing helper usage and test sizes. |
| tests/data/test_tuple.h | Updates operation tuple type extraction to new Operations typelist model. |
| tests/data/basic_test.h | Migrates operation tuple access to get_opt/TypeAt_t and adds more verbose failure output. |
| tests/buildAPI/batch_build_compilation_time.h | Updates include path and minor formatting. |
| tests/algorithm/test_warp.h | Updates include path. |
| tests/algorithm/test_deinterlace.h | Updates include path. |
| tests/algorithm/test_crop.h | Updates include path. |
| tests/algorithm/test_border_reader.h | Updates include path. |
| tests/algorithm/test_batchresize_build.h | Updates include path. |
| include/fused_kernel/core/utils/vector_utils.h | Updates typelist concatenation usage and minor trait/constexpr robustness tweaks. |
| include/fused_kernel/core/utils/type_lists.h | Reimplements TypeList indexing/catenation via fold expressions to reduce recursion depth. |
| include/fused_kernel/core/utils/parameter_pack_utils.h | Renames parameter-pack accessor to get_arg and updates helpers to use it. |
| include/fused_kernel/core/execution_model/thread_fusion.h | Updates source type list composition and adds explicit cast for elems_per_thread. |
| include/fused_kernel/core/execution_model/operation_model/parent_operations.h | Adds new parent types (MidWrite/Open/Closed) and macros to support new fused-operation kinds. |
| include/fused_kernel/core/execution_model/operation_model/operation_types.h | Extends operation “instance types” (Open/Closed) and introduces generalized OpIs. |
| include/fused_kernel/core/execution_model/operation_model/operation_tuple.h | Replaces recursive operation tuple structure with filtered tuple-of-instances + get_opt + make_new_operation_tuple. |
| include/fused_kernel/core/execution_model/operation_model/operation_data.h | Extends which instance types are considered “has params”. |
| include/fused_kernel/core/execution_model/operation_model/iop_fuser.h | Switches non-batch fusing to go through FusedOperation<>::build(...) and tightens type assertions. |
| include/fused_kernel/core/execution_model/operation_model/instantiable_operations.h | Updates fold-expression piping mechanics and adds Open/Closed instantiable op wrappers. |
| include/fused_kernel/core/execution_model/operation_model/fused_operation.h | Reimplements fused-operation execution using fold expressions and introduces Open/Closed semantics. |
| include/fused_kernel/core/execution_model/executors.h | Updates include path and parameter-pack access helper usage. |
| include/fused_kernel/core/execution_model/data_parallel_patterns.h | Updates parameter-pack access helper usage. |
| include/fused_kernel/core/data/tuple.h | Small reference/formatting adjustments for tuple get. |
| include/fused_kernel/core/data/rawptr.h | Zero-initializes dims members to avoid uninitialized state. |
| include/fused_kernel/core/data/ptr_nd.h | Value-initializes internal members to avoid uninitialized state. |
| include/fused_kernel/core/data/circular_tensor.h | Updates include path. |
| include/fused_kernel/algorithms/image_processing/resize.h | Updates include path. |
| include/fused_kernel/algorithms/image_processing/deinterlace.h | Updates include path. |
| include/fused_kernel/algorithms/basic_ops/memory_operations.h | Removes NVRTC guard around <vector> include and updates header content accordingly. |
Comments suppressed due to low confidence (1)
include/fused_kernel/algorithms/basic_ops/memory_operations.h:24
- <vector> is now included unconditionally. This header previously avoided including the full STL when NVRTC_COMPILER is defined (forward-declaring std::vector instead), which is important because NVRTC typically can’t parse many standard library headers. Reintroduce the NVRTC_COMPILER guard (or otherwise ensure NVRTC builds still work) before including <vector>.
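For reference, a minimal sketch of the guard pattern this comment asks to restore, showing only the conditional include (the previous forward-declaration fallback for std::vector is not reproduced here):

```cpp
// Sketch: only pull in the full STL header outside NVRTC builds; any
// std::vector-based helpers would need to sit behind the same guard.
#if !defined(NVRTC_COMPILER)
#include <vector>
#endif
```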
| #if __cplusplus < 202002L | ||
| namespace std { | ||
| template <typename T> | ||
| struct type_identity { | ||
| using type = T; | ||
| }; | ||
| } // namespace std | ||
| #endif |
This adds a custom std::type_identity in namespace std when compiling pre-C++20. Adding user-defined names to namespace std is undefined behavior, and some C++17 standard libraries already provide std::type_identity as an extension (which would cause a redefinition error). Prefer defining a fk::type_identity (or detail::type_identity) and using that, or gate the definition on a feature-test macro like __cpp_lib_type_identity.
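A minimal sketch of the non-UB alternative suggested above, assuming the library is free to add a detail-namespace helper (the names fk::detail::type_identity / type_identity_t are illustrative):

```cpp
#include <type_traits>

namespace fk { namespace detail {

// C++17-compatible stand-in for std::type_identity that does not inject
// anything into namespace std.
template <typename T>
struct type_identity { using type = T; };

template <typename T>
using type_identity_t = typename type_identity<T>::type;

}} // namespace fk::detail

// Alternatively, gate on the feature-test macro and only define a fallback
// when the standard type is missing:
// #if !defined(__cpp_lib_type_identity)
//   ... define the fallback above ...
// #endif
```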
@copilot open a new pull request to apply changes based on this feedback
| FK_HOST_DEVICE_FUSE void exec(const Point& thread, \ | ||
| const InputType& input, \ | ||
| const OperationDataType& opData) { \ | ||
| Parent::exec(thread, input, opData); \ | ||
| } \ |
DECLARE_OPEN_PARENT defines exec(...) as void, but OpenType operations are expected to return an OutputType (they feed into further computation and fold expressions). The void wrapper will break OpenInstantiableOperation operator| and any OpenType-based fusion. Change the wrapper to return OutputType and forward Parent::exec's return value.
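A sketch of the requested change, mirroring the macro fragment quoted above (this assumes Parent::exec does return OutputType for OpenType parents):

```cpp
// Sketch: declare OutputType as the return type and forward Parent::exec's result.
FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, \
                                    const InputType& input, \
                                    const OperationDataType& opData) { \
    return Parent::exec(thread, input, opData); \
} \
```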
@copilot open a new pull request to apply changes based on this feedback
Pull request overview
Copilot reviewed 38 out of 39 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
utests/algorithm/image_processing/utest_saturate/utest_saturate_common.h:32
- cxp::cmp_greater::f(input, ...) returns a vector-of-bool when input is a CUDA vector type (used by addOneTestAllChannels), which is not usable as the if condition here and can fail to compile or behave incorrectly. The previous fk::get<0>(input) ensured a scalar comparison; consider restoring that (or explicitly comparing a scalar channel / reducing the boolean vector) so this works for both scalar and vector inputs.
include/fused_kernel/algorithms/basic_ops/memory_operations.h:24
- This header now unconditionally includes <vector> and uses std::vector in APIs like SplitWrite::build. Previously it was guarded for NVRTC_COMPILER, presumably because NVRTC builds don't have full libstdc++/libc++ support. Unconditional inclusion can break NVRTC compilation; consider restoring the #if !defined(NVRTC_COMPILER) guard (or moving std::vector-based helpers behind a host-only guard) so device/NVRTC builds don't depend on <vector>.
| #if __cplusplus < 202002L | ||
| namespace std { | ||
| template <typename T> | ||
| struct type_identity { | ||
| using type = T; | ||
| }; | ||
| } // namespace std | ||
| #endif |
Defining std::type_identity inside namespace std is undefined behavior (user code must not add new declarations to std, even conditionally). To avoid UB while keeping C++17 compatibility, define an internal fk::detail::type_identity (or a type_identity_t alias) and use that instead of injecting into std.
@copilot open a new pull request to apply changes based on this feedback
Pull request overview
Copilot reviewed 54 out of 54 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
include/fused_kernel/algorithms/basic_ops/memory_operations.h:24
- memory_operations.h now unconditionally includes <vector>. This header previously had NVRTC-specific handling, and the repo still uses NVRTC_COMPILER guards elsewhere. NVRTC typically cannot include the full STL headers, so this risks breaking NVRTC compilation. Consider restoring the #if !defined(NVRTC_COMPILER) include guard and/or using a forward declaration strategy for std::vector when NVRTC_COMPILER is defined.
…| in InstantiableOperations
Pull request overview
Copilot reviewed 54 out of 54 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
include/fused_kernel/algorithms/basic_ops/memory_operations.h:23
- memory_operations.h now unconditionally includes <vector>, but this header is consumed in NVRTC builds elsewhere in the project (there were previously NVRTC-specific guards here). If NVRTC compilation is still a supported configuration, this can break NVRTC because <vector> (and related parts of the standard library) are often unavailable/unsupported there. Consider restoring the #if !defined(NVRTC_COMPILER) conditional include (and the forward-declaration fallback) or moving the std::vector-dependent APIs behind an NVRTC guard.
| const ActiveThreads activeThreads = getActiveThreads(details, get_arg<0>(iOps...)); | ||
|
|
||
| if (x < activeThreads.x && y < activeThreads.y) { | ||
| Parent::execute_thread(thread, activeThreads, iOps...); | ||
| } | ||
| } | ||
| }; | ||
| #endif // defined(__NVCC__) || defined(__HIPCC__) || defined(__NVRTC__) || defined(NVRTC_COMPILER) | ||
| #endif // defined(__NVCC__) || CLANG_HOST_DEVICE |
The GPU specialization guard for TransformDPP<ParArch::GPU_NVIDIA,...> was narrowed to #if defined(__NVCC__) || CLANG_HOST_DEVICE, but surrounding comments (and another #endif comment later in the file) still reference __NVRTC__ / __HIPCC__. If NVRTC or HIP builds are still supported, this guard change will exclude them from the GPU implementation and likely break compilation/behavior. Please either restore the previous condition (including NVRTC/HIP macros as appropriate) or update the stale comments and build matrix to reflect an intentional support drop.
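For clarity, the broader guard described by the quoted #endif comments would look like this (a sketch, assuming NVRTC and HIP builds are meant to stay supported):

```cpp
// Sketch: keep NVRTC and HIP builds inside the GPU specialization.
#if defined(__NVCC__) || defined(__HIPCC__) || defined(__NVRTC__) || defined(NVRTC_COMPILER)
// ... TransformDPP<ParArch::GPU_NVIDIA, ...> implementation ...
#endif
```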
| #if CLANG_HOST_DEVICE | ||
| const int x = (blockDim.x * blockIdx.x) + threadIdx.x; | ||
| const int y = (blockDim.y * blockIdx.y) + threadIdx.y; | ||
| const int z = blockIdx.z; // So far we only consider the option of using the z dimension to specify n (x*y) thread planes |
The conditional inside TransformDPP<ParArch::GPU_NVIDIA>::exec was changed from #if VS2017_COMPILER || CLANG_HOST_DEVICE to #if CLANG_HOST_DEVICE. On VS2017 + nvcc, CLANG_HOST_DEVICE is false and VS2017_COMPILER is true, but cooperative_groups.h (and the cg alias) are explicitly not included for VS2017 at the top of this file. That means the #else branch using cg::thread_block will fail to compile on VS2017. Please restore the VS2017 condition here (or ensure cg is available in that configuration).
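A sketch of the previous condition this comment asks to restore (the cooperative-groups branch is elided):

```cpp
// Sketch: VS2017 + nvcc cannot use cooperative_groups here, so it must take the
// raw blockIdx/threadIdx path together with the Clang host/device case.
#if VS2017_COMPILER || CLANG_HOST_DEVICE
    const int x = (blockDim.x * blockIdx.x) + threadIdx.x;
    const int y = (blockDim.y * blockIdx.y) + threadIdx.y;
    const int z = blockIdx.z;
#else
    // cg::thread_block-based path (cooperative_groups), unavailable on VS2017
#endif
```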
Pull request overview
Copilot reviewed 54 out of 54 changed files in this pull request and generated 9 comments.
Comments suppressed due to low confidence (2)
include/fused_kernel/algorithms/basic_ops/memory_operations.h:36
- The return type declaration has changed from explicitly specifying the ThreadFusionType to using trailing return type syntax with auto. While this is cleaner, ensure that template argument deduction works correctly, especially for the ELEMS_PER_THREAD template parameter, which has a default value of 1.
include/fused_kernel/algorithms/basic_ops/memory_operations.h:75
- The ReadOp and WriteOp build methods now use perfect forwarding with PtrType&& and std::forward. However, the template instantiation uses std::decay_t<PtrType> to extract the type information. This is correct, but verify that the forwarding doesn't cause issues when the ptr is an lvalue reference, as the build method may be called with both lvalues and rvalues.
| template <typename Enabler, typename... Operations> | ||
| struct FusedOperation_; | ||
|
|
||
| template <typename FirstOp, typename... RemOps> | ||
| struct FusedOperation_<std::enable_if_t<allUnaryTypes<FirstOp, RemOps...> && (sizeof...(RemOps) + 1 > 1)>, FirstOp, RemOps...> { | ||
| // 1. OpenType Specialization | ||
| // Changed to use IOps... for consistency | ||
| template <typename... IOps> | ||
| struct FusedOperation_<std::enable_if_t<!isAnyReadType<FirstType_t<IOps...>> && | ||
| !isWriteType<LastType_t<IOps...>> && | ||
| atLeastOneMidWriteType<IOps...>>, | ||
| IOps...> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<allUnaryTypes<FirstOp, RemOps...> && (sizeof...(RemOps) + 1 > 1)>, FirstOp, RemOps...>; | ||
| using SelfType = FusedOperation_<std::enable_if_t<!isAnyReadType<FirstType_t<IOps...>> && | ||
| !isWriteType<LastType_t<IOps...>> && | ||
| atLeastOneMidWriteType<IOps...>>, | ||
| IOps...>; | ||
|
|
||
| using Parent = OpenOperationParent<typename FirstType_t<IOps...>::Operation::InputType, | ||
| OperationTuple<IOps...>, | ||
| typename LastType_t<IOps...>::Operation::OutputType, | ||
| SelfType, true>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using Parent = | ||
| UnaryOperation<typename FirstOp::InputType, | ||
| typename LastType_t<RemOps...>::OutputType, | ||
| FusedOperation_<std::enable_if_t<allUnaryTypes<FirstOp, RemOps...> && (sizeof...(RemOps) + 1 > 1)>, FirstOp, RemOps...>, | ||
| true>; | ||
| DECLARE_UNARY_PARENT | ||
|
|
||
| using Operations = TypeList<FirstOp, RemOps...>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType& input) { | ||
| return fused_operation_impl::tuple_operate<FirstOp, RemOps...>(input); | ||
| DECLARE_OPEN_PARENT | ||
| using Operations = TypeList<IOps...>; | ||
|
|
||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const InputType& input, const ParamsType& params) { | ||
| return exec_helper(std::make_index_sequence<ParamsType::size>{}, thread, input, params); | ||
| } | ||
| }; | ||
|
|
||
| template <typename Operation> | ||
| struct FusedOperation_<std::enable_if_t<isUnaryType<Operation>>, Operation> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<isUnaryType<Operation>>, Operation>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using Parent = | ||
| UnaryOperation<typename Operation::InputType, | ||
| typename Operation::OutputType, | ||
| FusedOperation_<std::enable_if_t<isUnaryType<Operation>>, Operation>, | ||
| true>; | ||
| DECLARE_UNARY_PARENT | ||
|
|
||
| using Operations = TypeList<Operation>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType& input) { | ||
| return Operation::exec(input); | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE OutputType exec_helper(const std::index_sequence<Idx...>&, | ||
| const Point& thread, | ||
| const InputType& input, | ||
| const ParamsType& params) { | ||
| return (InputFoldType<InputType>{thread, input} | ... | get_opt<Idx>(params)).input; | ||
| } | ||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| struct FusedOperation_<std::enable_if_t<isComputeType<FirstType_t<Operations...>> && | ||
| !allUnaryTypes<Operations...>>, Operations...> { | ||
| // 2. ReadType Specialization | ||
| // Changed to use IOps... for consistency | ||
| template <typename... IOps> | ||
| struct FusedOperation_< | ||
| std::enable_if_t<isAnyReadType<FirstType_t<IOps...>> && !(isWriteType<LastType_t<IOps...>>)>, | ||
| IOps...> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<isComputeType<FirstType_t<Operations...>> && | ||
| !allUnaryTypes<Operations...>>, Operations...>; | ||
| using SelfType = FusedOperation_< | ||
| std::enable_if_t<isAnyReadType<FirstType_t<IOps...>> && !(isWriteType<LastType_t<IOps...>>)>, | ||
| IOps...>; | ||
|
|
||
| using FusedReadDataType = typename std::decay_t<FirstType_t<IOps...>>::Operation::ReadDataType; | ||
| using FusedOutputType = typename LastType_t<IOps...>::Operation::OutputType; | ||
|
|
||
| using Parent = ReadOperation<FusedReadDataType, OperationTuple<IOps...>, FusedOutputType, TF::DISABLED, SelfType, true>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using Parent = | ||
| BinaryOperation<typename FirstType_t<Operations...>::InputType, | ||
| OperationTuple<Operations...>, | ||
| FOOT<LastType_t<Operations...>>, | ||
| SelfType, true>; | ||
| DECLARE_BINARY_PARENT | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType& input, | ||
| const ParamsType& params) { | ||
| return fused_operation_impl::tuple_operate(input, params); | ||
| } | ||
|
|
||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| struct FusedOperation_<std::enable_if_t<isAnyReadType<FirstType_t<Operations...>>>, Operations...> { | ||
| private: | ||
| static constexpr bool isTFEnabled = std::is_same_v<typename FirstType_t<Operations...>::ReadDataType, FOOT<LastType_t<Operations...>>> && ((sizeof...(Operations) > 1) ? false : | ||
| FirstType_t<Operations...>::THREAD_FUSION); | ||
| using SelfType = FusedOperation_<std::enable_if_t<isAnyReadType<FirstType_t<Operations...>>>, Operations...>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using Parent = ReadOperation<typename FirstType_t<Operations...>::ReadDataType, | ||
| OperationTuple<Operations...>, | ||
| FOOT<LastType_t<Operations...>>, | ||
| isTFEnabled ? TF::ENABLED : TF::DISABLED, | ||
| SelfType, true>; | ||
| DECLARE_READ_PARENT | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, | ||
| const ParamsType& params) { | ||
| return fused_operation_impl::tuple_operate(thread, params); | ||
| using Operations = TypeList<IOps...>; | ||
|
|
||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const ParamsType& params) { | ||
| return exec_helper(std::make_index_sequence<ParamsType::size>{}, thread, params); | ||
| } | ||
|
|
||
| FK_HOST_DEVICE_FUSE uint num_elems_x(const Point& thread, | ||
| const OperationDataType& opData) { | ||
| return ParamsType::Operation::num_elems_x(thread, opData.params.instance); | ||
| return FirstType_t<IOps...>::Operation::num_elems_x(thread, get_opt<0>(opData.params)); | ||
| } | ||
|
|
||
| FK_HOST_DEVICE_FUSE uint num_elems_y(const Point& thread, | ||
| const OperationDataType& opData) { | ||
| return ParamsType::Operation::num_elems_y(thread, opData.params.instance); | ||
| return FirstType_t<IOps...>::Operation::num_elems_y(thread, get_opt<0>(opData.params)); | ||
| } | ||
|
|
||
| FK_HOST_DEVICE_FUSE uint num_elems_z(const Point& thread, | ||
| const OperationDataType& opData) { | ||
| return ParamsType::Operation::num_elems_z(thread, opData.params.instance); | ||
| return FirstType_t<IOps...>::Operation::num_elems_z(thread, get_opt<0>(opData.params)); | ||
| } | ||
|
|
||
| FK_HOST_DEVICE_FUSE ActiveThreads getActiveThreads(const OperationDataType& opData) { | ||
| return { num_elems_x(Point(), opData), num_elems_y(Point(), opData), num_elems_z(Point(), opData) }; | ||
| return { num_elems_x(Point{0,0,0}, opData), num_elems_y(Point{0,0,0}, opData), num_elems_z(Point{0,0,0}, opData) }; | ||
| } | ||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| struct FusedOperation_<std::enable_if_t<isWriteType<FirstType_t<Operations...>>>, Operations...> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<isWriteType<FirstType_t<Operations...>>>, Operations...>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using ParamsType = OperationTuple<Operations...>; | ||
| using OutputType = FOOT<LastType_t<Operations...>>; | ||
| using InputType = typename FirstType_t<Operations...>::InputType; | ||
| using InstanceType = MidWriteType; | ||
| // THREAD_FUSION in this case will not be used in the current Transform implementation | ||
| // May be used in future implementations | ||
| static constexpr bool IS_FUSED_OP{ true }; | ||
| static constexpr bool THREAD_FUSION{ false }; | ||
| using WriteDataType = typename FirstType_t<Operations...>::WriteDataType; | ||
| using OperationDataType = OperationData<FusedOperation_<void, Operations...>>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const InputType& input, | ||
| const ParamsType& params) { | ||
| return fused_operation_impl::tuple_operate(thread, input, params); | ||
| } | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const InputType& input, | ||
| const OperationDataType& opData) { | ||
| return exec(thread, input, opData.params); | ||
| } | ||
| using InstantiableType = MidWrite<FusedOperation_<void, Operations...>>; | ||
| FK_HOST_DEVICE_FUSE auto build(const OperationDataType& opData) { | ||
| return InstantiableType{ opData }; | ||
| } | ||
| FK_HOST_DEVICE_FUSE auto build(const ParamsType& params) { | ||
| return InstantiableType{ { params } }; | ||
| } | ||
| template <size_t BATCH_N, typename FirstType, typename... ArrayTypes> | ||
| FK_HOST_FUSE auto build_batch(const std::array<FirstType, BATCH_N>& firstInstance, | ||
| const ArrayTypes&... arrays) { | ||
| return BatchUtils::build_batch<SelfType>(firstInstance, arrays...); | ||
| } | ||
| template <size_t BATCH_N, typename FirstType, typename... ArrayTypes> | ||
| FK_HOST_FUSE auto build(const std::array<FirstType, BATCH_N>& firstInstance, | ||
| const ArrayTypes&... arrays) { | ||
| return BatchWrite<BATCH_N, SelfType>::build(firstInstance, arrays...); | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE OutputType exec_helper(const std::index_sequence<Idx...>&, | ||
| const Point& thread, | ||
| const ParamsType& params) { | ||
| return (thread | ... | get_opt<Idx>(params)).input; | ||
| } | ||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| struct FusedOperation_<std::enable_if_t<isMidWriteType<FirstType_t<Operations...>>>, Operations...> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<isMidWriteType<FirstType_t<Operations...>>>, Operations...>; | ||
| public: | ||
| // 3. UnaryType Specialization | ||
| template <typename... IOps> | ||
| struct FusedOperation_<std::enable_if_t<allUnaryTypes<IOps...>>, IOps...> { | ||
| private: | ||
| using SelfType = FusedOperation_<std::enable_if_t<allUnaryTypes<IOps...>>, IOps...>; | ||
| using Parent = UnaryOperation<typename FirstType_t<IOps...>::Operation::InputType, | ||
| typename LastType_t<IOps...>::Operation::OutputType, SelfType, true>; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| using ParamsType = OperationTuple<Operations...>; | ||
| using OutputType = FOOT<LastType_t<Operations...>>; | ||
| using InputType = typename FirstType_t<Operations...>::InputType; | ||
| using InstanceType = MidWriteType; | ||
| // THREAD_FUSION in this case will not be used in the current Transform implementation | ||
| // May be used in future implementations | ||
| static constexpr bool IS_FUSED_OP{ true }; | ||
| static constexpr bool THREAD_FUSION{ false }; | ||
| using WriteDataType = typename FirstType_t<Operations...>::WriteDataType; | ||
| using OperationDataType = OperationData<FusedOperation_<void, Operations...>>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const InputType& input, | ||
| const ParamsType& params) { | ||
| return fused_operation_impl::tuple_operate(thread, input, params); | ||
| } | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const Point& thread, const InputType& input, | ||
| const OperationDataType& opData) { | ||
| return exec(thread, input, opData.params); | ||
| } | ||
| using InstantiableType = MidWrite<FusedOperation_<void, Operations...>>; | ||
| FK_HOST_DEVICE_FUSE auto build(const OperationDataType& opData) { | ||
| return InstantiableType{ opData }; | ||
| } | ||
| FK_HOST_DEVICE_FUSE auto build(const ParamsType& params) { | ||
| return InstantiableType{ { params } }; | ||
| } | ||
| template <size_t BATCH_N, typename FirstType, typename... ArrayTypes> | ||
| FK_HOST_FUSE auto build_batch(const std::array<FirstType, BATCH_N>& firstInstance, | ||
| const ArrayTypes&... arrays) { | ||
| return BatchUtils::build_batch<SelfType>(firstInstance, arrays...); | ||
| DECLARE_UNARY_PARENT | ||
| using Operations = TypeList<IOps...>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType &input) { | ||
| return exec_helper(std::make_index_sequence<OperationTuple<IOps...>::size>{}, input); | ||
| } | ||
| template <size_t BATCH_N, typename FirstType, typename... ArrayTypes> | ||
| FK_HOST_FUSE auto build(const std::array<FirstType, BATCH_N>& firstInstance, | ||
| const ArrayTypes&... arrays) { | ||
| return BatchWrite<BATCH_N, SelfType>::build(firstInstance, arrays...); | ||
|
|
||
| private: | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE OutputType exec_helper(const std::index_sequence<Idx...>&, const InputType& input) { | ||
| constexpr OperationTuple<IOps...> poTup{}; | ||
| // Optimization, we use a version of operator| that does not use InputTypeFold, | ||
| // thus it does not propagate Point thread, because it is not needed. | ||
| return (input | ... | get_opt<Idx>(poTup)); | ||
| } | ||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| using FusedOperation = FusedOperation_<void, Operations...>; | ||
| // 4. ClosedType Specialization | ||
| template <typename... IOps> | ||
| struct FusedOperation_<std::enable_if_t<isAnyReadType<FirstType_t<IOps...>> && isWriteType<LastType_t<IOps...>>>, | ||
| IOps...> { | ||
| private: | ||
| using SelfType = | ||
| FusedOperation_<std::enable_if_t<isAnyReadType<FirstType_t<IOps...>> && isWriteType<LastType_t<IOps...>>>, IOps...>; | ||
| using Parent = ClosedOperation<OperationTuple<IOps...>, SelfType, true>; | ||
|
|
||
| template <typename FusedOperationType, typename = void> | ||
| struct IsAllUnaryFusedOperation : std::false_type {}; | ||
|
|
||
| template <typename FusedOperationType> | ||
| struct IsAllUnaryFusedOperation<FusedOperationType, std::void_t<typename FusedOperationType::Operations>> : std::true_type {}; | ||
|
|
||
| template <typename FusedOperationType, typename = void> | ||
| struct IsNotAllUnaryFusedOperation : std::true_type {}; | ||
|
|
||
| template <typename FusedOperationType> | ||
| struct IsNotAllUnaryFusedOperation<FusedOperationType, std::void_t<typename FusedOperationType::Operations>> : std::false_type {}; | ||
|
|
||
| template <typename FusedOperationType> | ||
| constexpr bool isAllUnaryFusedOperation = IsAllUnaryFusedOperation<FusedOperationType>::value; | ||
|
|
||
| template <typename FusedOperationType> | ||
| constexpr bool isNotAllUnaryFusedOperation = IsNotAllUnaryFusedOperation<FusedOperationType>::value; | ||
|
|
||
| template <typename IOp, typename Enabler = void> | ||
| struct InstantiableFusedOperationToOperationTuple {}; | ||
|
|
||
| template <typename FusedIOp> | ||
| struct InstantiableFusedOperationToOperationTuple<FusedIOp, std::enable_if_t<isAllUnaryFusedOperation<typename FusedIOp::Operation>, void>> { | ||
| private: | ||
| using SelfType = InstantiableFusedOperationToOperationTuple<FusedIOp, std::enable_if_t<isAllUnaryFusedOperation<typename FusedIOp::Operation>, void>>; | ||
| public: | ||
| FK_STATIC_STRUCT(InstantiableFusedOperationToOperationTuple, SelfType) | ||
| FK_HOST_FUSE auto value(const FusedIOp& iOp) { | ||
| return TypeListToOT<typename FusedIOp::Operation::Operations>{}; | ||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| DECLARE_CLOSED_PARENT | ||
| using Operations = TypeList<IOps...>; | ||
| FK_HOST_DEVICE_FUSE void exec(const Point &thread, const ParamsType ¶ms) { | ||
| exec_helper(std::make_index_sequence<ParamsType::size>{}, thread, params); | ||
| } | ||
| }; | ||
| template <typename FusedIOp> | ||
| struct InstantiableFusedOperationToOperationTuple<FusedIOp, std::enable_if_t<isNotAllUnaryFusedOperation<typename FusedIOp::Operation>, void>> { | ||
| private: | ||
| using SelfType = InstantiableFusedOperationToOperationTuple<FusedIOp, std::enable_if_t<isNotAllUnaryFusedOperation<typename FusedIOp::Operation>, void>>; | ||
| public: | ||
| FK_STATIC_STRUCT(InstantiableFusedOperationToOperationTuple, SelfType) | ||
| FK_HOST_FUSE auto value(const FusedIOp& iOp) { | ||
| return iOp.params; | ||
|
|
||
| private: | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE void exec_helper(const std::index_sequence<Idx...>&, const Point &thread, | ||
| const ParamsType ¶ms) { | ||
| LastType_t<typename ParamsType::Operations>::Operation::exec(thread, | ||
| (thread | ... | get_opt<Idx>(params)).input, | ||
| get_opt<sizeof...(Idx) - 1>(params)); | ||
| } | ||
| }; | ||
|
|
||
| template <typename OperationTupleType, typename Enabler = void> | ||
| struct OperationTupleToInstantiableOperation; | ||
| // 5. WriteType Specialization | ||
| template <typename... IOps> | ||
| struct FusedOperation_<std::enable_if_t<!isAnyReadType<FirstType_t<IOps...>> && isWriteType<LastType_t<IOps...>>>, | ||
| IOps...> { | ||
| private: | ||
| using SelfType = | ||
| FusedOperation_<std::enable_if_t<!isAnyReadType<FirstType_t<IOps...>> && isWriteType<LastType_t<IOps...>>>, | ||
| IOps...>; | ||
| using FusedInputType = typename FirstType_t<IOps...>::Operation::InputType; | ||
| using FusedWriteDataType = typename LastType_t<IOps...>::Operation::WriteDataType; | ||
| using Parent = WriteOperation<FusedInputType, | ||
| OperationTuple<IOps...>, FusedWriteDataType, TF::DISABLED, SelfType, true>; | ||
|
|
||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| DECLARE_WRITE_PARENT | ||
| using Operations = TypeList<IOps...>; | ||
| FK_HOST_DEVICE_FUSE void exec(const Point &thread, const InputType& input, const ParamsType ¶ms) { | ||
| exec_helper(std::make_index_sequence<ParamsType::size - 1>{}, thread, input, params); | ||
| } | ||
|
|
||
| template <typename... Operations> | ||
| struct OperationTupleToInstantiableOperation<OperationTuple<Operations...>, std::enable_if_t<allUnaryTypes<Operations...>, void>> { | ||
| private: | ||
| using SelfType = OperationTupleToInstantiableOperation<OperationTuple<Operations...>, std::enable_if_t<allUnaryTypes<Operations...>, void>>; | ||
| public: | ||
| FK_STATIC_STRUCT(OperationTupleToInstantiableOperation, SelfType) | ||
| FK_HOST_FUSE auto value(const OperationTuple<Operations...>& opTuple) { | ||
| return Instantiable<FusedOperation<Operations...>>{}; | ||
| private: | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE void exec_helper(const std::index_sequence<Idx...> &, const Point &thread, | ||
| const InputType& input, const ParamsType ¶ms) { | ||
| LastType_t<typename ParamsType::Operations>::Operation::exec( | ||
| thread, (InputFoldType<>::build(thread, input) | ... | get_opt<Idx>(params)).input, | ||
| get_opt<ParamsType::size - 1>(params)); | ||
| } | ||
| }; | ||
|
|
||
| template <typename... Operations> | ||
| struct OperationTupleToInstantiableOperation<OperationTuple<Operations...>, std::enable_if_t<notAllUnaryTypes<Operations...>, void>> { | ||
| private: | ||
| using SelfType = OperationTupleToInstantiableOperation<OperationTuple<Operations...>, std::enable_if_t<notAllUnaryTypes<Operations...>, void>>; | ||
| public: | ||
| FK_STATIC_STRUCT(OperationTupleToInstantiableOperation, SelfType) | ||
| FK_HOST_FUSE auto value(const OperationTuple<Operations...>& opTuple) { | ||
| return Instantiable<FusedOperation<Operations...>>{opTuple}; | ||
| // 6. BinaryType (ComputeType) Specialization | ||
| template <typename... IOps> | ||
| struct FusedOperation_<std::enable_if_t<allComputeTypes<IOps...> && !allUnaryTypes<IOps...>>, IOps...> { | ||
| private: | ||
| using SelfType = | ||
| FusedOperation_<std::enable_if_t<allComputeTypes<IOps...> && !allUnaryTypes<IOps...>>, IOps...>; | ||
| using FusedInputType = typename FirstType_t<IOps...>::Operation::InputType; | ||
| using FusedOutputType = typename LastType_t<IOps...>::Operation::OutputType; | ||
| using Parent = BinaryOperation<FusedInputType, OperationTuple<IOps...>, FusedOutputType, SelfType, true>; | ||
|
|
||
| public: | ||
| FK_STATIC_STRUCT(FusedOperation_, SelfType) | ||
| DECLARE_BINARY_PARENT | ||
| using Operations = TypeList<IOps...>; | ||
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType &input, const ParamsType ¶ms) { | ||
| return exec_helper(std::make_index_sequence<ParamsType::size>{}, input, params); | ||
| } | ||
|
|
||
| private: | ||
| template <size_t... Idx> | ||
| FK_HOST_DEVICE_FUSE OutputType exec_helper(const std::index_sequence<Idx...>&, | ||
| const InputType& input, const ParamsType& params) { | ||
| // Optimization, we use a version of operator| that does not use InputTypeFold, | ||
| // thus it does not propagate Point thread, because it is not needed. | ||
| return (input | ... | get_opt<Idx>(params)); | ||
| } | ||
| }; | ||
|
|
The FusedOperation implementation has been completely rewritten using fold expressions instead of recursive template instantiation. The specializations have changed from checking the first/last operation types to using std::enable_if_t with type predicates. While this should reduce compilation times, the logic for determining the FusedOperation type (OpenType, ReadType, UnaryType, ClosedType, WriteType, BinaryType) needs careful verification. Specifically, verify that the specialization priority and std::enable_if_t conditions don't create ambiguous matches or miss valid combinations.
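One lightweight way to do this verification is an "exactly one predicate matches" compile-time check. The following is a self-contained toy with placeholder predicates; the real check would plug in the library's allUnaryTypes / isAnyReadType / isWriteType / allComputeTypes traits for each pack the dispatcher is expected to handle:

```cpp
#include <type_traits>

// Placeholder predicates standing in for the FusedOperation_ enable_if conditions.
template <typename T> constexpr bool looksLikeRead  = std::is_integral_v<T>;
template <typename T> constexpr bool looksLikeWrite = std::is_floating_point_v<T>;

// Count how many dispatch branches would accept T. For every supported input
// this must be exactly 1: 0 means a missed combination, >1 means the partial
// specializations are ambiguous.
template <typename T>
constexpr int matchingBranches = int(looksLikeRead<T> && !looksLikeWrite<T>) +
                                 int(!looksLikeRead<T> && looksLikeWrite<T>);

static_assert(matchingBranches<int> == 1, "ambiguous or missing branch");
static_assert(matchingBranches<float> == 1, "ambiguous or missing branch");

int main() {}
```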
| @@ -1,5 +1,5 @@ | |||
| /* Copyright 2023 Mediaproduccion S.L.U. (Oscar Amoros Huguet) | |||
| Copyright 2023-2025 Oscar Amoros Huguet | |||
| /* Copyright 2023 Grup Mediapro S.L.U. (Oscar Amoros Huguet) | |||
Similarly, the copyright uses "Mediaproduccion S.L.U." instead of the "Grup Mediapro S.L.U." used in other updated files. Ensure consistent naming of the legal entity across all copyright notices.
| using OperationTuple = OperationTuple_<void, Operations...>; | ||
|
|
The OperationTuple structure has been completely refactored from a recursive linked-list style to a flat structure with filtered indices. The old .next member access pattern has been replaced with get_opt<Index>(). This is a major breaking API change. While the tests have been updated, any external code using OperationTuple will need significant refactoring. Consider providing a migration guide or compatibility layer.
| using OperationTuple = OperationTuple_<void, Operations...>; | |
| using FlatOperationTuple = OperationTuple_<void, Operations...>; | |
| /** | |
| * @brief Operation tuple for execution model operations. | |
| * | |
| * Migration note: | |
| * The OperationTuple structure has been refactored from a recursive, | |
| * linked-list-style representation (accessed via `.next`) to a flat | |
| * representation with filtered indices. | |
| * | |
| * Code that previously relied on patterns such as: | |
| * | |
| * auto &first = ops.operation; | |
| * auto &second = ops.next.operation; | |
| * auto &third = ops.next.next.operation; | |
| * | |
| * must be updated to use index-based accessors on the flat tuple, | |
| * for example `get_opt<Index>()` (and/or helpers defined in this | |
| * header) instead of chaining `.next`. | |
| * | |
| * The FlatOperationTuple alias names the new, flat implementation | |
| * explicitly: | |
| * | |
| * using Ops = fk::FlatOperationTuple<Op0, Op1, Op2>; | |
| * | |
| * while OperationTuple is kept as the primary alias for backward | |
| * compatibility with existing code that only depends on the type | |
| * name, not on the old recursive layout. | |
| */ | |
| template <typename... Operations> | |
| using OperationTuple = FlatOperationTuple<Operations...>; |
| template <typename... IOps> | ||
| FK_DEVICE_FUSE void exec(const Details& details, const IOps&... iOps) { | ||
| #if VS2017_COMPILER || CLANG_HOST_DEVICE | ||
| #if CLANG_HOST_DEVICE |
The VS2017_COMPILER macro check has been removed and only CLANG_HOST_DEVICE is checked. If VS2017 support is being dropped, this should be documented. If VS2017 should still be supported but behaves like Clang in this context, that assumption should be verified.
| // =========================================================================== | ||
| // COMPREHENSIVE TESTS FOR DEEPLY NESTED FUSED OPERATIONS | ||
| // Testing all specializations with various nesting depths as requested in PR feedback | ||
| // =========================================================================== | ||
| // Note: OpenType requires MidWriteType operations which are specialized internal operations. | ||
| // OpenType is tested implicitly through the fold expression implementation tests below. | ||
|
|
||
| // Test 1: ClosedType FusedOperation (Read + operations + Write) | ||
| // This combines ReadType at the start and WriteType at the end | ||
| { | ||
| using ClosedFusedOp = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename ClosedFusedOp::InstanceType, fk::ClosedType>, | ||
| "ClosedType FusedOperation not correctly identified"); | ||
| } | ||
|
|
||
| // Test 2: WriteType FusedOperation (operations + Write, no Read at start) | ||
| { | ||
| using WriteFusedOp = fk::FusedOperation< | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename WriteFusedOp::InstanceType, fk::WriteType>, | ||
| "WriteType FusedOperation not correctly identified"); | ||
| } | ||
|
|
||
| // Test 3: ReadType FusedOperation (Read + operations, no Write at end) | ||
| { | ||
| using ReadFusedOp = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Unary<fk::Cast<float, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename ReadFusedOp::InstanceType, fk::ReadType>, | ||
| "ReadType FusedOperation not correctly identified"); | ||
| } | ||
|
|
||
| // Test 4: UnaryType FusedOperation (all operations are Unary) | ||
| { | ||
| using UnaryChainOp = fk::FusedOperation< | ||
| fk::Unary<fk::Cast<int, float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Unary<fk::Cast<double, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename UnaryChainOp::InstanceType, fk::UnaryType>, | ||
| "UnaryType FusedOperation not correctly identified"); | ||
| } | ||
|
|
||
| // Test 5: BinaryType FusedOperation (compute operations, no Read/Write/MidWrite) | ||
| { | ||
| using BinaryChainOp = fk::FusedOperation< | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>> | ||
| >; | ||
| static_assert(std::is_same_v<typename BinaryChainOp::InstanceType, fk::BinaryType>, | ||
| "BinaryType FusedOperation not correctly identified"); | ||
| } | ||
|
|
||
| // Test 6: Deeply nested ReadType (5+ levels) | ||
| { | ||
| using DeeplyNestedRead = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Unary<fk::Cast<float, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename DeeplyNestedRead::InstanceType, fk::ReadType>, | ||
| "Deeply nested ReadType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 7: Deeply nested ClosedType (5+ levels) | ||
| { | ||
| using DeeplyNestedClosed = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename DeeplyNestedClosed::InstanceType, fk::ClosedType>, | ||
| "Deeply nested ClosedType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 8: Deeply nested UnaryType (5+ levels) | ||
| { | ||
| using DeeplyNestedUnary = fk::FusedOperation< | ||
| fk::Unary<fk::Cast<int, float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Unary<fk::Cast<double, float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Unary<fk::Cast<double, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename DeeplyNestedUnary::InstanceType, fk::UnaryType>, | ||
| "Deeply nested UnaryType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 9: Deeply nested BinaryType (5+ levels) | ||
| { | ||
| using DeeplyNestedBinary = fk::FusedOperation< | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Binary<fk::Add<float>> | ||
| >; | ||
| static_assert(std::is_same_v<typename DeeplyNestedBinary::InstanceType, fk::BinaryType>, | ||
| "Deeply nested BinaryType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 10: Deeply nested WriteType (5+ levels) | ||
| { | ||
| using DeeplyNestedWrite = fk::FusedOperation< | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename DeeplyNestedWrite::InstanceType, fk::WriteType>, | ||
| "Deeply nested WriteType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 11: Very deeply nested ClosedType (10+ levels) - stress test | ||
| { | ||
| using VeryDeeplyNestedClosed = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Binary<fk::Div<double>>, | ||
| fk::Unary<fk::Cast<double, float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename VeryDeeplyNestedClosed::InstanceType, fk::ClosedType>, | ||
| "Very deeply nested ClosedType FusedOperation failed"); | ||
| } | ||
|
|
||
| // Test 12: Mixed compute types (Binary and Unary) in deep chain | ||
| { | ||
| using MixedComputeChain = fk::FusedOperation< | ||
| fk::Binary<fk::Add<int>>, | ||
| fk::Unary<fk::Cast<int, float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Binary<fk::Div<double>>, | ||
| fk::Unary<fk::Cast<double, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename MixedComputeChain::InstanceType, fk::BinaryType>, | ||
| "Mixed compute chain should be BinaryType when not all are Unary"); | ||
| } | ||
|
|
||
| // Test 13: Verify complex existing SomeFusedOp is still valid | ||
| // This ensures backward compatibility with existing complex operations | ||
| static_assert(std::is_same_v<typename SomeFusedOp::InstanceType, fk::ReadType>, | ||
| "Complex SomeFusedOp should be ReadType (starts with ReadBack, no Write at end)"); | ||
|
|
||
| // Test 14: Verify operation fusion with .then() for deep chains | ||
| { | ||
| constexpr auto chain1 = fk::Add<int, int, int, fk::UnaryType>::build() | ||
| .then(fk::Cast<int, float>::build()) | ||
| .then(fk::Cast<float, double>::build()) | ||
| .then(fk::Cast<double, float>::build()) | ||
| .then(fk::Cast<float, int>::build()); | ||
|
|
||
| using ChainType = std::decay_t<decltype(chain1)>; | ||
| static_assert(ChainType::Operation::Operations::size == 5, | ||
| "Deep .then() chain should have 5 operations"); | ||
| } | ||
|
|
||
| // Test 15: Verify FusedOperation can be fused again (nesting FusedOperations) | ||
| { | ||
| constexpr auto inner = fk::fuse( | ||
| fk::Add<int, int, int, fk::UnaryType>::build(), | ||
| fk::Cast<int, float>::build() | ||
| ); | ||
| [[maybe_unused]] constexpr auto outer = fk::fuse( | ||
| inner, | ||
| fk::Cast<float, int>::build() | ||
| ); | ||
| // This should compile without errors | ||
| } | ||
|
|
||
| // Test 16: Edge case - Single operation wrapped in FusedOperation | ||
| { | ||
| using SingleOpFused = fk::FusedOperation<fk::Unary<fk::Cast<int, float>>>; | ||
| static_assert(std::is_same_v<typename SingleOpFused::InstanceType, fk::UnaryType>, | ||
| "Single Unary operation should be UnaryType"); | ||
| } | ||
|
|
||
| // Test 17: Edge case - Two operations (minimum for meaningful fusion) | ||
| { | ||
| using TwoOpFused = fk::FusedOperation< | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Unary<fk::Cast<float, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename TwoOpFused::InstanceType, fk::BinaryType>, | ||
| "Two mixed compute ops should be BinaryType"); | ||
| } | ||
|
|
||
| // Test 18: Maximum stress test - 15+ operations deeply nested | ||
| { | ||
| using MaxStressTest = fk::FusedOperation< | ||
| fk::Read<fk::PerThreadRead<fk::ND::_2D, float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Binary<fk::Div<float>>, | ||
| fk::Binary<fk::Add<float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Binary<fk::Sub<float>>, | ||
| fk::Unary<fk::Cast<float, double>>, | ||
| fk::Binary<fk::Div<double>>, | ||
| fk::Binary<fk::Add<double>>, | ||
| fk::Unary<fk::Cast<double, float>>, | ||
| fk::Binary<fk::Mul<float>>, | ||
| fk::Unary<fk::Cast<float, int>>, | ||
| fk::Binary<fk::Add<int>>, | ||
| fk::Write<fk::PerThreadWrite<fk::ND::_2D, int>> | ||
| >; | ||
| static_assert(std::is_same_v<typename MaxStressTest::InstanceType, fk::ClosedType>, | ||
| "Maximum stress test (15+ ops) ClosedType FusedOperation failed"); | ||
| } |
The comprehensive tests for deeply nested fused operations (Tests 1-18) are excellent additions that verify the fold expression implementation works correctly at various nesting depths. However, these tests only verify type correctness via static_assert. Consider adding runtime tests that actually execute these operations with sample data to ensure the execution logic is correct, not just the type deduction.
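As a hedged illustration of the kind of runtime check being suggested, an all-unary chain can be exercised directly, since that specialization exposes a parameterless exec(input). The types are reused from the static tests above; the surrounding harness and the result-accumulation flag (`passed`) are assumptions:

```cpp
// Hypothetical runtime addition (sketch): execute a unary chain on sample data.
{
    using UnaryChainOp = fk::FusedOperation<
        fk::Unary<fk::Cast<int, float>>,
        fk::Unary<fk::Cast<float, double>>,
        fk::Unary<fk::Cast<double, int>>
    >;
    // int -> float -> double -> int should round-trip small integers exactly.
    const int result = UnaryChainOp::exec(42);
    passed &= (result == 42); // 'passed' stands for whatever flag the enclosing test uses
}
```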
| T z; | ||
| FK_HOST_DEVICE_CNST Point_(const T x_ = 0, const T y_ = 0, const T z_ = 0) : x(x_), y(y_), z(z_) {} | ||
| }; | ||
|
|
The Point constructors have been removed, requiring aggregate initialization with braced-init-lists. However, the z coordinate is being set to 0 in many places throughout the codebase. This pattern should be documented or a helper function should be provided to create 2D Points more ergonomically, as Point{x, y, 0} is verbose and error-prone.
| // Helper to ergonomically construct 2D points represented in 3D space | |
| // by fixing the z coordinate to zero. This centralises the common | |
| // pattern `Point_{x, y, 0}` in a single, type-safe place. | |
| template <typename T> | |
| [[nodiscard]] constexpr Point_<T, ND::_3D> make_point2d(T x, T y) { | |
| return Point_<T, ND::_3D>{x, y, T{}}; | |
| } |
| FK_HOST_DEVICE_FUSE OutputType exec(const InputType &input, const OperationDataType &opData) { \ | ||
| return exec(input, opData.params); \ | ||
| } \ | ||
| FK_HOST_FUSE InstantiableType build(const OperationDataType& opData) { return {opData}; } \ | ||
| FK_HOST_FUSE InstantiableType build(const ParamsType& params) { return {{params}}; } |
The exec function implementations in the macros have been changed to call exec(...) directly instead of calling through the Parent. This recursive call pattern could lead to infinite recursion if not properly overloaded. Verify that all Operation types that use these macros provide their own exec implementations that don't rely on the parent's exec.
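A self-contained toy of the overload pattern the macros now rely on (the names here are illustrative, not the library's): the forwarding exec(input, opData) only avoids calling itself because the concrete operation supplies a distinct exec(input, params) overload.

```cpp
#include <iostream>

struct Params { int offset; };
struct OpData { Params params; };

struct MyOp {
    // What the macro provides: unwrap the operation data and forward to the
    // params-based overload defined by the concrete operation.
    static int exec(int input, const OpData& opData) {
        return exec(input, opData.params);
    }
    // What the concrete operation must define. If ParamsType ever equalled
    // OperationDataType, the forwarder above would match itself and recurse.
    static int exec(int input, const Params& params) {
        return input + params.offset;
    }
};

int main() {
    std::cout << MyOp::exec(1, OpData{{2}}) << "\n"; // prints 3
}
```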
| @@ -1,4 +1,5 @@ | |||
| /* Copyright 2023-2024 Mediaproduccion S.L.U. (Oscar Amoros Huguet) | |||
| /* Copyright 2023-2024 Grup Mediapro S.L.U. (Oscar Amoros Huguet) | |||
The copyright line lists "Mediaproduccion S.L.U." but other files in this PR use "Grup Mediapro S.L.U.". Verify which is the correct legal entity name and ensure consistency across all copyright headers.
| thread.y >= params.y1 && thread.y <= params.y2) { | ||
| const Point roiThread(thread.x - params.x1, thread.y - params.y1, thread.z); | ||
| return exec_resize(roiThread, params, backIOp); | ||
| // z is 0 because we do not use it going forward, and batch is handled before calling this exec |
The z dimension initialization comment states "z is 0 because we do not use it going forward, and batch is handled before calling this exec". However, this is an implementation detail that may not be obvious. Consider documenting this more clearly in the function signature or with a more prominent comment explaining why z is always 0 in this context.
| // z is 0 because we do not use it going forward, and batch is handled before calling this exec | |
| // Note: ResizeComplete operates on a single 2D image slice per invocation. | |
| // The z dimension is not used by exec_resize (see num_elems_z(), which always returns 1), | |
| // and any batching or z-dimension iteration is handled by the caller before invoking exec(). | |
| // For this reason we explicitly construct a Point with z = 0 here, after translating x/y | |
| // into the coordinate system of the cropped destination region. |
Pull request overview
Copilot reviewed 56 out of 56 changed files in this pull request and generated 10 comments.
| // ------------------------------------------------------------------ | ||
| // TypeListCat Implementation | ||
| // ------------------------------------------------------------------ | ||
| template <typename... Lists> | ||
| struct TypeListCat { | ||
| using type = decltype((std::declval<Lists>() >> ...)); | ||
| }; | ||
|
|
||
| template<typename... Types> | ||
| struct TypeListCat{}; | ||
|
|
||
| template<typename... Args1, typename... Args2> | ||
| struct TypeListCat<TypeList<Args1...>, TypeList<Args2...>> { | ||
| using type = TypeList<Args1..., Args2...>; | ||
|
|
||
| template <typename... Lists> | ||
| using TypeListCat_t = typename TypeListCat<Lists...>::type; |
The TypeListCat implementation now uses a fold expression over operator>>, which is defined as a friend function that returns TypeList<Ts..., Us...>. However, the operator>> is declared but not defined in the TypeList class. The implementation relies on template argument deduction from a decltype expression in TypeListCat, which evaluates std::declval<Lists>() >> .... This should work, but verify that the operator>> overload resolution is unambiguous and correctly deduces the result type.
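A self-contained sketch of the fold-expression concatenation technique under discussion, written with a free operator>> instead of a hidden friend so the deduction is easy to inspect in isolation (the names mirror the diff, but this is not the library's exact code):

```cpp
#include <type_traits>
#include <utility>

template <typename... Ts> struct TypeList {};

// A declaration is enough: the operator is only used in unevaluated context,
// and its return type carries the concatenated list.
template <typename... Ts, typename... Us>
TypeList<Ts..., Us...> operator>>(TypeList<Ts...>, TypeList<Us...>);

// Folding over >> concatenates any number of lists without the deep pairwise
// recursion of the previous implementation.
template <typename... Lists>
using TypeListCat_t = decltype((std::declval<Lists>() >> ...));

static_assert(std::is_same_v<
    TypeListCat_t<TypeList<int>, TypeList<float, char>, TypeList<double>>,
    TypeList<int, float, char, double>>);

int main() {}
```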
```diff
+template <typename IndexSeq, size_t IdxValue>
+struct GetIndex;
+
+template <typename IndexSeq, typename IndexSeqOut, size_t IdxValue>
+struct GetIndexHelper;
+
+template <size_t... Idx, size_t... IdxOut, size_t IdxValue>
+struct GetIndexHelper<std::index_sequence<Idx...>, std::index_sequence<IdxOut...>, IdxValue> {
+    static constexpr size_t value = []() {
+        size_t result = 0;
+        ((result = (Idx == IdxValue ? IdxOut : result)), ...);
+        return result;
+    }();
+};
+
+template <size_t... Idx, size_t IdxValue>
+struct GetIndex<std::index_sequence<Idx...>, IdxValue> {
+    static constexpr size_t value = GetIndexHelper<std::index_sequence<Idx...>, decltype(std::make_index_sequence<sizeof...(Idx)>()), IdxValue>::value;
+};
+
+// As observed in get<>(Tuple<...>), returning a const& as auto,
+// may lead to local memory accesses in the GPU
+template <size_t Idx, typename... Operations>
+FK_HOST_DEVICE_CNST decltype(auto) get_opt(const OperationTuple<Operations...>& opTuple){
+    if constexpr (isUnaryType<TypeAt_t<Idx, TypeList<Operations...>>>) {
+        return typename TypeAt_t<Idx, TypeList<Operations...>>::Operation::InstantiableType{};
+    } else {
+        return get<GetIndex<typename OperationTuple<Operations...>::Indexes, Idx>::value>(opTuple.instances);
+    }
+}
-template <typename TupleType>
-struct OperationTupleTypeToTypeList;
-
-template <typename... Types>
-struct OperationTupleTypeToTypeList<OperationTuple<Types...>> {
-    using type = TypeList<Types...>;
-};
-
-template <typename TupleType>
-using OTToTypeList = typename OperationTupleTypeToTypeList<TupleType>::type;
-
-template <typename TypeList_t>
-struct TypeListToOperationTuple;
-
-template <typename... Operations>
-struct TypeListToOperationTuple<TypeList<Operations...>> {
-    using type = OperationTuple<Operations...>;
-};
-
-template <typename TypeList_t>
-using TypeListToOT = typename TypeListToOperationTuple<TypeList_t>::type;
-
-template <int INDEX, typename TupleType>
-using get_ot = TypeAt_t<INDEX, OTToTypeList<TupleType>>;
-
-template <int INDEX, typename... OperationTypes>
-FK_HOST_DEVICE_CNST auto getIOp(const OperationTuple<OperationTypes...>& instances) {
-    using SelectedOperation = get_ot<INDEX, OperationTuple<OperationTypes...>>;
-    if constexpr (isUnaryType<SelectedOperation>) {
-        return SelectedOperation::build();
-    } else {
-        return SelectedOperation::build(get<INDEX>(instances));
-    }
-}
```
The OperationTuple refactoring replaces the recursive linked-list structure (with instance and next members) with a flat structure holding a single instances Tuple member. get_opt now uses GetIndex to map an operation index to its instance index, skipping Unary operations that carry no instance. The implementation is subtle, though: GetIndexHelper performs assignments inside a lambda driven by a comma-operator fold, which depends on left-to-right evaluation. C++17 does guarantee that order for comma folds, but the index-mapping logic should still be verified carefully.
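One cheap way to gain that confidence is to exercise the index-mapping trick in isolation; the sketch below copies just GetIndexHelper (nothing else from fk is assumed) and pins its behaviour with static_asserts:

```cpp
#include <cstddef>
#include <utility>

template <typename IndexSeq, typename IndexSeqOut, std::size_t IdxValue>
struct GetIndexHelper;

template <std::size_t... Idx, std::size_t... IdxOut, std::size_t IdxValue>
struct GetIndexHelper<std::index_sequence<Idx...>, std::index_sequence<IdxOut...>, IdxValue> {
    static constexpr std::size_t value = []() {
        std::size_t result = 0;
        // The comma fold is evaluated left to right (guaranteed in C++17);
        // non-matching steps reassign result to itself, so the single match wins.
        ((result = (Idx == IdxValue ? IdxOut : result)), ...);
        return result;
    }();
};

// Example mapping: operation indices {0, 2, 5} occupy instance slots {0, 1, 2}.
static_assert(GetIndexHelper<std::index_sequence<0, 2, 5>,
                             std::index_sequence<0, 1, 2>, 5>::value == 2);
static_assert(GetIndexHelper<std::index_sequence<0, 2, 5>,
                             std::index_sequence<0, 1, 2>, 2>::value == 1);
```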
```diff
 // We perform 5 crops on the image
-constexpr int BATCH = 10;
+constexpr int BATCH = 100;
```
The change in BATCH size from 10 to 100 on line 466 increases the test workload tenfold. That may be intentional (more thorough testing or benchmarking), but it will also lengthen test execution time accordingly. Consider whether the increase is necessary, or whether BATCH should be configurable.
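If the heavier workload is kept, one option is to make it overridable from the build; FK_TEST_BATCH below is a hypothetical macro, not something the repository currently defines:

```cpp
// Keeps the new default of 100, but lets CI or a local build dial it back
// with -DFK_TEST_BATCH=10 without touching the test source.
#ifndef FK_TEST_BATCH
#define FK_TEST_BATCH 100
#endif
constexpr int BATCH = FK_TEST_BATCH;
```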
```diff
-    FK_HOST_DEVICE_CNST Array() {}
-    FK_HOST_DEVICE_CNST T operator[](const size_t& index) const {
+    FK_HOST_DEVICE_CNST const T& operator[](const int index) const {
         return values[index];
     }
-    FK_HOST_DEVICE_CNST T& operator[](const size_t& index) {
+    FK_HOST_DEVICE_CNST T& operator[](const int index) {
+        return values[index];
+    }
+    FK_HOST_DEVICE_CNST const T& operator[](const size_t index) const {
+        return values[index];
+    }
+    FK_HOST_DEVICE_CNST T& operator[](const size_t index) {
         return values[index];
     }
```
The Array struct changes its operator[] parameter type from const size_t& index to const int index and then adds separate overloads for both int and size_t. This could cause ambiguity issues when calling with unsigned integer types other than size_t. The const and non-const versions are also inconsistent in their parameter types (int vs size_t). Consider consolidating to a single template operator[] that accepts any integral type, or ensuring consistent parameter types across all overloads.
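A sketch of the suggested consolidation, using a standalone ArraySketch type with plain constexpr instead of the library's Array and FK_HOST_DEVICE_CNST:

```cpp
#include <cstddef>
#include <type_traits>

// One const/non-const pair templated on the index type accepts int, size_t,
// unsigned int, etc. without creating competing int/size_t candidates.
template <typename T, int SIZE>
struct ArraySketch {
    T values[SIZE];

    template <typename Int, std::enable_if_t<std::is_integral_v<Int>, int> = 0>
    constexpr const T& operator[](const Int index) const {
        return values[index];
    }
    template <typename Int, std::enable_if_t<std::is_integral_v<Int>, int> = 0>
    constexpr T& operator[](const Int index) {
        return values[index];
    }
};

constexpr ArraySketch<int, 3> arr{{1, 2, 3}};
static_assert(arr[1u] == 2 && arr[2] == 3);   // unsigned and signed indices both resolve
```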
```diff
@@ -169,12 +172,18 @@ FK_HOST_CNST auto then(const ContinuationIOp& cIOp, const ContinuationIOps&... c
 
     template <typename Input>
     FK_HOST_DEVICE_CNST friend auto operator|(Input&& input, const OperationData<Operation_t>& opData) {
-        const auto result = Operation::exec(std::forward<Input>(input).input, opData);
-        return InputFoldType<std::decay_t<decltype(result)>>(std::forward<Input>(input).thread, result);
+        return InputFoldType<>::build(std::forward<Input>(input).thread,
+                                      Operation::exec(std::forward<Input>(input).input, opData));
     }
+    FK_HOST_DEVICE_CNST
+    friend typename Operation::OutputType operator|(const typename Operation::InputType &input,
+                                                    const TernaryInstantiableOperation<Operation_t> &opData) {
+        return Operation::exec(input, opData.params, opData.backIOp);
+    }
 
     template <typename PreviousIOp, typename Fuser_t = Fuser>
-    FK_HOST_CNST friend auto operator&(PreviousIOp&& prevIOp, const TernaryInstantiableOperation<Operation_t>& self) {
+    FK_HOST_CNST friend auto operator&(PreviousIOp&& prevIOp,
+                                       const TernaryInstantiableOperation<Operation_t>& self) {
         return Fuser_t::fuse(std::forward<PreviousIOp>(prevIOp), self);
     }
 };
@@ -191,14 +200,21 @@ FK_HOST_CNST auto then(const ContinuationIOp& cIOp, const ContinuationIOps&... c
 struct UnaryInstantiableOperation {
     INSTANTIABLE_OPERATION_DETAILS_IS_ASSERT_THEN(UnaryType)
 
-    template <typename Input>
-    FK_HOST_DEVICE_CNST friend auto operator|(Input&& input, const UnaryInstantiableOperation<Operation_t>& opData) {
-        const auto result = Operation::exec(std::forward<Input>(input).input);
-        return InputFoldType<std::decay_t<decltype(result)>>(std::forward<Input>(input).thread, result);
-    }
+    template <typename InputType>
+    FK_HOST_DEVICE_CNST friend auto operator|(const InputFoldType<InputType>& input,
+                                              const UnaryInstantiableOperation<Operation_t>& opData) {
+        return InputFoldType<>::build(input.thread, Operation::exec(input.input));
+    }
 
+    FK_HOST_DEVICE_CNST friend typename Operation::OutputType operator|(
+        const typename Operation::InputType& input,
+        const UnaryInstantiableOperation<Operation_t> &opData) {
+        return Operation::exec(input);
+    }
```
The operator| overloads for UnaryInstantiableOperation, BinaryInstantiableOperation, and TernaryInstantiableOperation now have two versions: one that works with InputFoldType for fold expressions, and one that works directly with the input type. The second version (e.g., lines 148-150, 209-212) returns the OutputType directly without wrapping in InputFoldType. This dual behavior should be carefully documented as it affects how these operations can be composed. The non-fold version is simpler but can't be used in the middle of a fold expression chain.
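The toy below (Carried and TimesTwo are invented stand-ins, not the fk API) mirrors the dual-overload pattern and makes the composition difference concrete: the wrapped form keeps carrying the thread and can sit anywhere in a chain, while the raw form returns a bare output and is only usable as a terminal or entry-point convenience.

```cpp
#include <iostream>

struct Thread { int x; };

template <typename T>
struct Carried {              // stand-in for InputFoldType<T>
    Thread thread;
    T input;
};

struct TimesTwo {             // stand-in for a Unary instantiable operation
    static int exec(int v) { return v * 2; }

    template <typename T>
    friend auto operator|(const Carried<T>& in, TimesTwo) {
        return Carried<int>{in.thread, TimesTwo::exec(in.input)};  // still chainable
    }
    friend int operator|(int in, TimesTwo) {
        return TimesTwo::exec(in);                                 // bare output, chain ends here
    }
};

int main() {
    const auto chained = Carried<int>{Thread{0}, 3} | TimesTwo{} | TimesTwo{};
    const int  direct  = 3 | TimesTwo{};
    std::cout << chained.input << ' ' << direct << '\n';           // prints: 12 6
}
```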
```cpp
    FK_HOST_DEVICE_FUSE OutputType exec(const InputType &input, const OperationDataType &opData) { \
        return exec(input, opData.params); \
    } \
```
The exec() functions in the DECLARE_*_PARENT macros now call exec() with unpacked parameters (e.g., exec(input, opData.params)) instead of calling Parent::exec(input, opData). This turns a qualified call into an unqualified one, so overload resolution now picks from whatever exec() overloads are visible in the class the macro expands into, and may choose a different overload than the parent's. That is a subtle bug waiting to happen wherever several exec() overloads are in scope. Consider qualifying these calls explicitly (e.g., Child::exec(), or a CRTP pattern) to remove the ambiguity.
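A standalone illustration of the hazard (Parent and Child here are invented, not fk types): the unqualified call binds to whichever exec overload wins resolution in the expanding class, while the qualified call is pinned to the parent's.

```cpp
#include <iostream>

struct Parent {
    static int exec(int v, float p) { return static_cast<int>(v * p); }
};

struct Child : Parent {
    using Parent::exec;                       // parent overload stays visible
    static int exec(int v, double p) { return static_cast<int>(v + p); }

    static int run_unqualified(int v) {
        // Unqualified: both exec's compete; the double overload is the better
        // match for 1.0, so Child::exec is selected.
        return exec(v, 1.0);
    }
    static int run_qualified(int v) {
        // Qualified: always Parent::exec, regardless of what else is in scope.
        return Parent::exec(v, 1.0);
    }
};

int main() {
    std::cout << Child::run_unqualified(3) << ' '   // 4  (3 + 1.0)
              << Child::run_qualified(3)   << '\n'; // 3  (3 * 1.0f)
}
```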
```diff
@@ -257,14 +257,14 @@ namespace fk { // namespace FusedKernel
 #endif
         const Point thread{ x, y, z };
 
-        const ActiveThreads activeThreads = getActiveThreads(details, get<0>(iOps...));
+        const ActiveThreads activeThreads = getActiveThreads(details, get_arg<0>(iOps...));
 
         if (x < activeThreads.x && y < activeThreads.y) {
             Parent::execute_thread(thread, activeThreads, iOps...);
         }
     }
 };
-#endif // defined(__NVCC__) || defined(__HIPCC__) || defined(__NVRTC__) || defined(NVRTC_COMPILER)
+#endif // defined(__NVCC__) || CLANG_HOST_DEVICE
```
The removal of VS2017_COMPILER checks and replacement with just CLANG_HOST_DEVICE on lines 246 and 344 indicates dropping support for Visual Studio 2017. This should be documented in the PR description or release notes as it affects supported compilers.
```cpp
/**
 * @brief OpenInstantiableOperation: represents a IOp that takes the result of the previous IOp as input
 * (which will reside on GPU registers) and ParamsType, plus a Point thread.
 */
template <typename Operation_t>
struct OpenInstantiableOperation final : public OperationData<Operation_t> {
    INSTANTIABLE_OPERATION_DETAILS_IS(OpenType)
    static_assert(std::is_same_v<typename Operation::InstanceType, OpenType>,
                  "Operation is not OpenType");

    INSTANTIABLE_OPERATION_THEN

    template <typename Input>
    FK_HOST_DEVICE_CNST friend auto operator|(Input&& input, const OperationData<Operation_t>& opData) {
        return InputFoldType<>::build(input.thread, Operation::exec(std::forward<Input>(input).thread,
                                                                    std::forward<Input>(input).input, opData));
    }

    template <typename PreviousIOp, typename Fuser_t = Fuser>
    FK_HOST_CNST friend auto operator&(PreviousIOp&& prevIOp, const OpenInstantiableOperation<Operation_t>& self) {
        return Fuser_t::fuse(std::forward<PreviousIOp>(prevIOp), self);
    }
};

/**
 * @brief ClosedInstantiableOperation: represents a IOp that does not take an input and does not return an output
 * It only gets a thread index and it's parameters. It will read and write from/to global memory.
 */
template <typename Operation_t>
struct ClosedInstantiableOperation final : public OperationData<Operation_t> {
    INSTANTIABLE_OPERATION_DETAILS_IS_ASSERT(ClosedType)

    FK_HOST_DEVICE_CNST friend void operator|(const Point& thread, const OperationData<Operation_t>& opData) {
        Operation::exec(thread, opData);
    }
};
```
The OpenInstantiableOperation and ClosedInstantiableOperation classes are newly added (lines 249-284). These represent new operation types in the framework. However:
- OpenInstantiableOperation supports fusion via operator& but ClosedInstantiableOperation does not
- ClosedInstantiableOperation only supports operator| with Point, effectively making it a terminal operation
- The asymmetry between these two (Open can fuse, Closed cannot) should be documented
Verify that this design is intentional and that use cases requiring fusion of Closed operations are properly handled or documented as unsupported.
```diff
 int launch() {
 
-    constexpr fk::Rect test(fk::Point(16, 32), fk::Size(32, 64));
+    constexpr fk::Rect test(fk::Point{16, 32, 0}, fk::Size(32, 64));
```
The change from Point(x, y, z) constructor syntax to Point{x, y, z} aggregate initialization is applied consistently throughout the codebase. It relies on Point no longer having user-defined constructors (they were removed in point.h), which makes this a breaking API change: every Point construction must now use brace initialization, and external code still using the old constructor syntax will stop compiling.
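A standalone analogue of the break (PointAggregate is made up, not fk::Point): once the user-provided constructors are gone, the type is an aggregate and callers must switch from parentheses to braces.

```cpp
struct PointAggregate {   // no user-declared constructors => aggregate
    int x, y, z;
};

int main() {
    // PointAggregate p(16, 32, 0);   // constructor syntax: no longer compiles
    PointAggregate p{16, 32, 0};      // aggregate (brace) initialization: the required form
    return p.x + p.y + p.z;
}
```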
```diff
-    FK_HOST_DEVICE_CNST InputFoldType(const Point& thread_) : thread(thread_) {}
+    template <typename InputT>
+    FK_HOST_DEVICE_FUSE auto build(const Point& thread, InputT&& input) {
```
The new InputFoldType<> specialization (lines 72-77) provides a build() function that constructs an InputFoldType over the concrete input type, so the wrapper is now created on the fly at each step of a fold expression. The original InputFoldType only had a constructor taking a Point, so this changes how the type participates in fold chains. Verify that every use of operator| with a Point works with the new build pattern and that the automatic type deduction does not misbehave in more complex expressions.
Suggested change:

```diff
-    FK_HOST_DEVICE_FUSE auto build(const Point& thread, InputT&& input) {
+    FK_HOST_DEVICE_FUSE static auto build(const Point& thread, InputT&& input) {
```
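Beyond the static qualifier, the build-in-a-fold pattern itself can be exercised in isolation. The sketch below uses invented Carried/PlusOne/run names rather than the fk API: the empty specialization carries only the thread, and build() re-wraps each freshly deduced result so the chain keeps flowing through a left fold over operator|.

```cpp
#include <iostream>
#include <type_traits>
#include <utility>

struct Thread { int x; };

template <typename T = void>
struct Carried {                             // carries thread + current value
    Thread thread;
    T input;
};

template <>
struct Carried<void> {                       // seed, like InputFoldType<>
    Thread thread;

    template <typename InputT>
    static constexpr auto build(const Thread& thread, InputT&& input) {
        return Carried<std::decay_t<InputT>>{thread, std::forward<InputT>(input)};
    }
};

struct PlusOne {
    template <typename T>
    friend constexpr auto operator|(const Carried<T>& in, PlusOne) {
        return Carried<>::build(in.thread, in.input + 1);   // re-wrap the new result
    }
};

template <typename Input, typename... Ops>
constexpr auto run(const Thread& t, Input&& in, const Ops&... ops) {
    return (Carried<>::build(t, std::forward<Input>(in)) | ... | ops);   // left fold
}

int main() {
    constexpr auto out = run(Thread{0}, 1, PlusOne{}, PlusOne{}, PlusOne{});
    static_assert(out.input == 4);
    std::cout << out.input << '\n';          // prints: 4
}
```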
Pull request overview
Copilot reviewed 56 out of 56 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 56 out of 56 changed files in this pull request and generated 4 comments.
No description provided.