If someone is enthusiastic about Kokkos, that could be done as well, but so far there are
- directive-based (OpenMP offloading),
- language-based (Python with Numba, in progress),
- "portable" kernel-based (SYCL) implementations,
but no CUDA or HIP ones, so a CUDA implementation would be useful.