
Asserting current device and CUB stream matches #9119

Open
thom-gg wants to merge 8 commits into NVIDIA:main from thom-gg:validate-device-and-stream-matches

Conversation

@thom-gg

@thom-gg thom-gg commented May 23, 2026

Description

closes #7782

Adds an assertion to all dispatch code to ensure that the current device matches the device of the CUB stream. It is called at the very beginning of each dispatch function, hence the number of modified files.

The assertion itself uses cudaStreamGetDevice, which was introduced in CTK 12.8, so it is guarded by the macro _CCCL_CTK_AT_LEAST(12, 8).

I'm new to the project, so I'm unsure whether there is a better place to call the assertion than in every dispatch file. I'm also unsure whether the assertion belongs in cub/cub/util_device.cuh, where I put it, or elsewhere. Please tell me if this issue should be addressed differently and I'll try to do it!

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@thom-gg thom-gg requested a review from a team as a code owner May 23, 2026 15:52
@thom-gg thom-gg requested a review from fbusato May 23, 2026 15:52
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 23, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 23, 2026
@coderabbitai
Contributor

coderabbitai Bot commented May 23, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Driver-API no-throw wrappers and cub::detail::validate_stream_device(cudaStream_t) were added; dispatch entrypoints now call it up-front and return early on mismatch; tests for cross-device stream behavior were added.

Changes

Stream-device validation layer

Layer: Core validation utility
File(s): cub/cub/util_device.cuh, libcudacxx/include/cuda/__driver/driver_api.h
Summary: Added cub::detail::validate_stream_device(cudaStream_t) and non-throwing driver wrappers (__ctxPushNoThrow, __ctxPopNoThrow, __ctxGetDeviceNoThrow, __streamGetCtxNoThrow).

Layer: Dispatch validation rollout
File(s): cub/cub/device/dispatch/*.cuh (adjacent_difference, batch_memcpy, batched_topk, find, for, histogram, merge, merge_sort, radix_sort, reduce, reduce_by_key, reduce_deterministic, reduce_nondeterministic, rle, scan, scan_by_key, segmented_radix_sort, segmented_reduce, segmented_scan, segmented_sort, select_if, three_way_partition, topk, transform, unique_by_key)
Summary: Inserted validate_stream_device(stream) at start of dispatch entrypoints and helper dispatch functions; each returns early on validation error before PTX/compute-capability queries, temporary-storage sizing, or kernel dispatch.

Layer: Tests
File(s): cub/test/catch2_test_device_for.cu, cub/test/catch2_test_device_for_api.cu
Summary: Added tests: one disables validation and exercises cross-device stream usage; another asserts cudaErrorInvalidDevice when calling ForEachN with a stream from a different device.

Assessment against linked issues

Objective Addressed Explanation
Add stream-device validation to CUB dispatch functions [#7782]

Suggested reviewers

  • fbusato
  • bernhardmgruber
  • srinivasyadav18

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (16)
cub/cub/device/dispatch/dispatch_merge_sort.cuh (1)

406-406: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) with its global namespace-qualified symbol (matching its declaration namespace) instead of using unqualified lookup.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 477-477

cub/cub/device/dispatch/dispatch_radix_sort.cuh (1)

1141-1141: ⚡ Quick win

suggestion: use the global namespace-qualified form of validate_stream_device(stream) at both dispatch entry points to satisfy the free-function qualification rule.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1206-1206

cub/cub/device/dispatch/dispatch_reduce.cuh (1)

481-481: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) from the global namespace in both locations rather than relying on unqualified lookup.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 754-754

cub/cub/device/dispatch/dispatch_reduce_by_key.cuh (1)

609-609: ⚡ Quick win

suggestion: switch both validate_stream_device(stream) calls to the fully qualified global-namespace symbol to comply with the free-function call rule.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 698-698

cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh (1)

342-342: ⚡ Quick win

suggestion: qualify validate_stream_device(stream) from the global namespace instead of calling it unqualified.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh (1)

176-176: ⚡ Quick win

suggestion: call validate_stream_device(stream) via its global namespace-qualified symbol here.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_rle.cuh (1)

608-608: ⚡ Quick win

suggestion: make both validate_stream_device(stream) calls fully qualified from the global namespace to align with project call-qualification rules.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 666-666

cub/cub/device/dispatch/dispatch_scan.cuh (1)

865-865: ⚡ Quick win

suggestion: use the global namespace-qualified form for validate_stream_device(stream) in both locations rather than unqualified calls.
As per coding guidelines "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 933-933

cub/cub/device/dispatch/dispatch_scan_by_key.cuh (1)

599-599: ⚡ Quick win

suggestion: Qualify validate_stream_device from the global namespace in both dispatch entrypoints to match the repository call-style rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 737-737

cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh (1)

620-620: ⚡ Quick win

suggestion: Use a globally qualified call for validate_stream_device at both insertion points to keep dispatch code aligned with repository qualification rules.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 907-907

cub/cub/device/dispatch/dispatch_segmented_reduce.cuh (1)

424-424: ⚡ Quick win

suggestion: Fully qualify validate_stream_device from global scope in both dispatch paths for consistency with the project’s free-function call rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 531-531

cub/cub/device/dispatch/dispatch_segmented_scan.cuh (1)

132-132: ⚡ Quick win

suggestion: Qualify validate_stream_device from the global namespace here to satisfy the repository’s free-function qualification requirement.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

cub/cub/device/dispatch/dispatch_segmented_sort.cuh (1)

692-692: ⚡ Quick win

suggestion: Switch both validate_stream_device invocations to globally qualified form to match the enforced free-function qualification convention.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1285-1285

cub/cub/device/dispatch/dispatch_select_if.cuh (1)

846-846: ⚡ Quick win

suggestion: Apply global qualification to validate_stream_device in both dispatch entrypoints to comply with the project-wide free-function call convention.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 1105-1105

cub/cub/device/dispatch/dispatch_three_way_partition.cuh (1)

367-367: ⚡ Quick win

suggestion: Use globally qualified validate_stream_device calls in both updated dispatch layers to align with the mandatory free-function qualification rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".

Also applies to: 438-438

cub/cub/device/dispatch/dispatch_topk.cuh (1)

478-478: ⚡ Quick win

suggestion: Qualify validate_stream_device from global scope in this dispatch entrypoint to satisfy the repository free-function qualification rule.

As per coding guidelines, "All calls to free functions must be fully qualified from the global namespace, e.g. ::cuda::ceil_div, even when calling functions in the same namespace".
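
The qualification rule repeated in the comments above can be illustrated with a minimal C++ sketch. All names here (lib::detail::validate, user::stream, the dispatch_* helpers) are hypothetical stand-ins, not CCCL code. The sketch only shows that an unqualified call inside a template can be redirected by argument-dependent lookup, while a call qualified from the global namespace cannot.

```cpp
#include <cassert>

namespace lib
{
namespace detail
{
// Hypothetical stand-in for a library helper such as validate_stream_device.
template <class Stream>
int validate(Stream)
{
  return 0;
}

// Unqualified call: argument-dependent lookup may select an overload
// from the namespace of the argument type.
template <class Stream>
int dispatch_unqualified(Stream s)
{
  return validate(s);
}

// Call qualified from the global namespace: always resolves to lib::detail::validate.
template <class Stream>
int dispatch_qualified(Stream s)
{
  return ::lib::detail::validate(s);
}
} // namespace detail
} // namespace lib

namespace user
{
struct stream
{};

// Same name as the library helper. As a non-template exact match, it wins
// overload resolution whenever ADL finds it.
inline int validate(stream)
{
  return 42;
}
} // namespace user

inline void demo()
{
  assert(lib::detail::dispatch_unqualified(user::stream{}) == 42); // picked up user::validate
  assert(lib::detail::dispatch_qualified(user::stream{}) == 0);    // picked the library helper
}
```

In this PR the argument is a cudaStream_t, so the practical risk is likely small; the sketch is meant to explain the guideline, not to claim that a hijack occurs here.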


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d183277-1a01-4666-9f1b-617c330bdabb

📥 Commits

Reviewing files that changed from the base of the PR and between c47f140 and e23c56c.

📒 Files selected for processing (26)
  • cub/cub/device/dispatch/dispatch_adjacent_difference.cuh
  • cub/cub/device/dispatch/dispatch_batch_memcpy.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/dispatch_find.cuh
  • cub/cub/device/dispatch/dispatch_for.cuh
  • cub/cub/device/dispatch/dispatch_histogram.cuh
  • cub/cub/device/dispatch/dispatch_merge.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_reduce.cuh
  • cub/cub/device/dispatch/dispatch_reduce_by_key.cuh
  • cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh
  • cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh
  • cub/cub/device/dispatch/dispatch_rle.cuh
  • cub/cub/device/dispatch/dispatch_scan.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
  • cub/cub/device/dispatch/dispatch_segmented_scan.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_select_if.cuh
  • cub/cub/device/dispatch/dispatch_three_way_partition.cuh
  • cub/cub/device/dispatch/dispatch_topk.cuh
  • cub/cub/device/dispatch/dispatch_transform.cuh
  • cub/cub/device/dispatch/dispatch_unique_by_key.cuh
  • cub/cub/util_device.cuh

Comment thread cub/cub/device/dispatch/dispatch_adjacent_difference.cuh Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2d2b79d9-a47e-40c6-99d9-5d9e96a54b55

📥 Commits

Reviewing files that changed from the base of the PR and between e23c56c and a8d21f9.

📒 Files selected for processing (26)
  • cub/cub/device/dispatch/dispatch_adjacent_difference.cuh
  • cub/cub/device/dispatch/dispatch_batch_memcpy.cuh
  • cub/cub/device/dispatch/dispatch_batched_topk.cuh
  • cub/cub/device/dispatch/dispatch_find.cuh
  • cub/cub/device/dispatch/dispatch_for.cuh
  • cub/cub/device/dispatch/dispatch_histogram.cuh
  • cub/cub/device/dispatch/dispatch_merge.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_reduce.cuh
  • cub/cub/device/dispatch/dispatch_reduce_by_key.cuh
  • cub/cub/device/dispatch/dispatch_reduce_deterministic.cuh
  • cub/cub/device/dispatch/dispatch_reduce_nondeterministic.cuh
  • cub/cub/device/dispatch/dispatch_rle.cuh
  • cub/cub/device/dispatch/dispatch_scan.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/dispatch_segmented_radix_sort.cuh
  • cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
  • cub/cub/device/dispatch/dispatch_segmented_scan.cuh
  • cub/cub/device/dispatch/dispatch_segmented_sort.cuh
  • cub/cub/device/dispatch/dispatch_select_if.cuh
  • cub/cub/device/dispatch/dispatch_three_way_partition.cuh
  • cub/cub/device/dispatch/dispatch_topk.cuh
  • cub/cub/device/dispatch/dispatch_transform.cuh
  • cub/cub/device/dispatch/dispatch_unique_by_key.cuh
  • cub/cub/util_device.cuh
✅ Files skipped from review due to trivial changes (1)
  • cub/cub/device/dispatch/dispatch_histogram.cuh

Comment thread cub/cub/device/dispatch/dispatch_scan_by_key.cuh
Comment thread cub/cub/device/dispatch/dispatch_segmented_reduce.cuh
Contributor

@bernhardmgruber bernhardmgruber left a comment


Thanks a lot for this contribution! Please add a unit test to at least one algorithm calling it with a stream that does not match the current device. This test must be written in a way that it also works if there is only one GPU/device in the system (just succeeding is fine I think). I can try it briefly on my machine where I have two GPUs.

Comment thread cub/cub/util_device.cuh Outdated
Comment on lines +467 to +471
error = cudaStreamGetDevice(stream, &streamDevice);
if (error != cudaSuccess)
{
return error;
}
Contributor


Suggestion: Let's not reuse the error variable:

Suggested change
- error = cudaStreamGetDevice(stream, &streamDevice);
- if (error != cudaSuccess)
- {
-   return error;
- }
+ if (const auto error = cudaStreamGetDevice(stream, &streamDevice); error != cudaSuccess)
+ {
+   return error;
+ }
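
For readers unfamiliar with the C++17 if statement with initializer that this suggestion relies on, here is a minimal sketch. The status enum and the step_a/step_b functions are hypothetical stand-ins for cudaError_t and the CUDA calls, so the example compiles without a GPU toolchain. Each error variable is scoped to its own if statement, which is what avoids reusing a single mutable variable.

```cpp
#include <cassert>

// Hypothetical stand-in for cudaError_t.
enum class status
{
  ok,
  failed
};

// Hypothetical stand-ins for two fallible API calls.
inline status step_a(bool succeed)
{
  return succeed ? status::ok : status::failed;
}

inline status step_b(bool succeed)
{
  return succeed ? status::ok : status::failed;
}

inline status run(bool a_ok, bool b_ok)
{
  // `error` is declared in the init-statement and tested in the condition;
  // it goes out of scope at the end of this if statement.
  if (const auto error = step_a(a_ok); error != status::ok)
  {
    return error;
  }
  // A fresh, independent `error` for the second call.
  if (const auto error = step_b(b_ok); error != status::ok)
  {
    return error;
  }
  return status::ok;
}

inline void demo()
{
  assert(run(true, true) == status::ok);
  assert(run(false, true) == status::failed);
  assert(run(true, false) == status::failed);
}
```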

Comment thread cub/cub/util_device.cuh Outdated
Comment on lines +473 to +477
error = cudaGetDevice(&currentDevice);
if (error != cudaSuccess)
{
return error;
}
Contributor


Suggested change
- error = cudaGetDevice(&currentDevice);
- if (error != cudaSuccess)
- {
-   return error;
- }
+ if (const auto error = cudaGetDevice(&currentDevice); error != cudaSuccess)
+ {
+   return error;
+ }

Comment thread cub/cub/util_device.cuh Outdated
return cudaErrorInvalidDevice;
}
# endif // _CCCL_CTK_AT_LEAST(12,8)
return error;
Contributor


Suggested change
- return error;
+ return cudaSuccess;

Comment thread cub/cub/util_device.cuh Outdated
CUB_RUNTIME_FUNCTION _CCCL_FORCEINLINE cudaError_t validate_stream_device(cudaStream_t stream)
{
cudaError_t error = cudaSuccess;
# if _CCCL_CTK_AT_LEAST(12, 8)
Contributor


Important: sometimes users violate our API requirements, but their software ran fine for a long time. They would be upset if we suddenly enforce requirements, causing their software to break. Let's add a macro to disable this new feature:

Suggested change
- # if _CCCL_CTK_AT_LEAST(12, 8)
+ # if _CCCL_CTK_AT_LEAST(12, 8) && !defined(CCCL_DISABLE_STREAM_DEVICE_CHECK)

If possible, add a unit test that calls a simple algorithm like DeviceFor with a stream and a different current device and define the CCCL_DISABLE_STREAM_DEVICE_CHECK macro, to see whether the escape hatch works.
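
The escape hatch requested here follows a common opt-out macro pattern, sketched below in plain C++. DEMO_DISABLE_STREAM_DEVICE_CHECK and the validate function are hypothetical stand-ins for CCCL_DISABLE_STREAM_DEVICE_CHECK and the real check; device IDs are passed in directly so the sketch runs without CUDA.

```cpp
#include <cassert>

// Hypothetical stand-in for cudaError_t.
enum class status
{
  ok,
  invalid_device
};

// The check is compiled in by default. A user who relied on the old, unchecked
// behavior could define DEMO_DISABLE_STREAM_DEVICE_CHECK (for example via -D on
// the compiler command line) to compile it out again.
inline status validate(int current_device, int stream_device)
{
#if !defined(DEMO_DISABLE_STREAM_DEVICE_CHECK)
  if (current_device != stream_device)
  {
    return status::invalid_device;
  }
#else
  (void) current_device; // silence unused-parameter warnings when disabled
  (void) stream_device;
#endif
  return status::ok;
}

inline void demo()
{
  assert(validate(0, 0) == status::ok);
#if !defined(DEMO_DISABLE_STREAM_DEVICE_CHECK)
  assert(validate(0, 1) == status::invalid_device); // mismatch is reported by default
#else
  assert(validate(0, 1) == status::ok); // mismatch is ignored when opted out
#endif
}
```

Because the macro is evaluated at compile time, a test of the opt-out path has to live in a translation unit that defines the macro before the function is compiled.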

Comment thread cub/cub/util_device.cuh Outdated
Comment on lines +465 to +467
# if _CCCL_CTK_AT_LEAST(12, 8)
int streamDevice;
error = cudaStreamGetDevice(stream, &streamDevice);
Contributor


We can make this function work even before CTK 12.8 by using the CUDA Driver API. We already have this implemented for cuda::stream_ref. It should look like this:

{
  ::CUdevice current_device;
  if (const auto error = ::cuda::__driver::__ctxGetDeviceNoThrow(current_device); error != cudaSuccess)
  {
    return error;
  }

  ::CUcontext stream_ctx;
  if (const auto error = ::cuda::__driver::__streamGetCtxNoThrow(stream_ctx, stream); error != cudaSuccess)
  {
    return error;
  }

  if (const auto error = ::cuda::__driver::__ctxPushNoThrow(stream_ctx); error != cudaSuccess)
  {
    return error;
  }

  ::CUdevice stream_device;
  if (const auto error = ::cuda::__driver::__ctxGetDeviceNoThrow(stream_device); error != cudaSuccess)
  {
    return error;
  }

  if (const auto error = ::cuda::__driver::__ctxPopNoThrow(); error != cudaSuccess)
  {
    return error;
  }

  _CCCL_ASSERT(current_device == stream_device, "current device must match CUB stream device");
}

The only problem is that we need to add __meowNoThrow variants of all context-related driver APIs to <cuda/__driver/driver_api.h>.

If you don't feel comfortable doing this, I will make a follow-up PR after this one is merged :)
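
One detail of the push/query/pop sequence above is that an early return between the push and the pop would leave the stream's context on the stack; a later commit message in this thread mentions making sure the pop always gets executed. The sketch below models that concern with a hypothetical in-memory context stack (result, context, ctx_push, ctx_pop, ctx_get_device, and device_of are all stand-ins, not the CUDA Driver API or the PR's wrappers).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for driver result codes and contexts.
enum class result
{
  success,
  error
};

struct context
{
  int device;
  bool broken; // when true, querying the device fails
};

std::vector<context> g_stack; // emulates the per-thread context stack

inline result ctx_push(const context& c)
{
  g_stack.push_back(c);
  return result::success;
}

inline result ctx_pop()
{
  if (g_stack.empty())
  {
    return result::error;
  }
  g_stack.pop_back();
  return result::success;
}

inline result ctx_get_device(int& device)
{
  if (g_stack.empty() || g_stack.back().broken)
  {
    return result::error;
  }
  device = g_stack.back().device;
  return result::success;
}

// Queries the device of stream_ctx. Once the push succeeded, the pop runs
// unconditionally, even if the query in between failed.
inline result device_of(const context& stream_ctx, int& device)
{
  if (const result pushed = ctx_push(stream_ctx); pushed != result::success)
  {
    return pushed;
  }
  const result queried = ctx_get_device(device);
  const result popped  = ctx_pop();
  return queried != result::success ? queried : popped;
}
```

A scope guard or RAII wrapper would achieve the same effect; the explicit form is used here only because the wrappers in this PR are described as non-throwing functions that return error codes.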

Comment thread cub/cub/util_device.cuh Outdated
Comment on lines +464 to +484
cudaError_t error = cudaSuccess;
# if _CCCL_CTK_AT_LEAST(12, 8)
int streamDevice;
error = cudaStreamGetDevice(stream, &streamDevice);
if (error != cudaSuccess)
{
return error;
}
int currentDevice;
error = cudaGetDevice(&currentDevice);
if (error != cudaSuccess)
{
return error;
}
_CCCL_ASSERT(currentDevice == streamDevice, "current device must match CUB stream device");
if (currentDevice != streamDevice)
{
return cudaErrorInvalidDevice;
}
# endif // _CCCL_CTK_AT_LEAST(12,8)
return error;
Contributor


Critical: Since this is an assertion, we need to make sure all of the CUDA Runtime/Driver calls are done only when assertions are enabled, because they won't get optimized out and can introduce some unwanted overhead.
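
The overhead concern can be made concrete with a small sketch. DEMO_ENABLE_ASSERTIONS, query_stream_device, and g_query_calls are hypothetical names chosen for illustration; the counter stands in for the cost of a runtime or driver call. The point is that the query itself, not just the comparison, sits behind the preprocessor guard, so a build without assertions performs no query at all.

```cpp
#include <cassert>

int g_query_calls = 0; // counts how often the "expensive" query runs

// Hypothetical stand-in for a CUDA Runtime/Driver call. A real call has side
// effects, so the compiler cannot remove it even if its result is unused.
inline int query_stream_device()
{
  ++g_query_calls;
  return 0;
}

inline bool validate(int current_device)
{
#if defined(DEMO_ENABLE_ASSERTIONS)
  // Only builds with assertions enabled pay for the query.
  return query_stream_device() == current_device;
#else
  (void) current_device;
  return true; // no query is issued in this configuration
#endif
}

inline void demo()
{
  assert(validate(0));
#if !defined(DEMO_ENABLE_ASSERTIONS)
  assert(g_query_calls == 0); // the query was never executed
#endif
}
```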

@github-project-automation github-project-automation Bot moved this from In Review to In Progress in CCCL May 25, 2026
@thom-gg thom-gg requested a review from a team as a code owner May 25, 2026 14:45
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bd797629-5166-4034-acec-48de9bb9ed1e

📥 Commits

Reviewing files that changed from the base of the PR and between 8de3755 and 329bf57.

📒 Files selected for processing (4)
  • cub/cub/util_device.cuh
  • cub/test/catch2_test_device_for.cu
  • cub/test/catch2_test_device_for_api.cu
  • libcudacxx/include/cuda/__driver/driver_api.h

Comment thread cub/cub/util_device.cuh Outdated
Comment thread cub/cub/util_device.cuh Outdated
Comment thread cub/test/catch2_test_device_for_api.cu Outdated
Comment thread cub/test/catch2_test_device_for.cu Outdated
Comment thread cub/test/catch2_test_device_for.cu Outdated
Comment thread libcudacxx/include/cuda/__driver/driver_api.h Outdated
… end of tests, and making sure pop always gets executed in validate_stream_device
@thom-gg
Author

thom-gg commented May 25, 2026

Hi, thanks to you both for the feedback:

  • I added the NoThrow variants of the driver API functions
  • used them in validate_stream_device to support all CTK versions, not only >= 12.8
  • stopped reusing the error variables
  • added a macro CCCL_DISABLE_STREAM_DEVICE_CHECK to disable the check
  • guarded the check behind the CCCL_ENABLE_ASSERTIONS macro
  • wrote 2 unit tests that launch DeviceFor on a stream from a device other than the current one: one test that should fail, and one that defines CCCL_DISABLE_STREAM_DEVICE_CHECK and should therefore skip the check and succeed. I wasn't able to run the tests since I only have one GPU, but I did compile them. If there are fewer than 2 GPUs, the tests are skipped.

Happy to keep modifying this if needed :)

@bernhardmgruber
Contributor

/ok to test efd3b9a

Comment thread cub/test/catch2_test_device_for_api.cu Outdated
Comment on lines +426 to +437
int num_devices = 0;
REQUIRE(cudaGetDeviceCount(&num_devices) == cudaSuccess);

if (num_devices < 2)
{
SKIP("Test requires at least 2 CUDA devices");
}

REQUIRE(cudaSetDevice(1) == cudaSuccess);
cudaStream_t stream_on_device_1;
REQUIRE(cudaStreamCreate(&stream_on_device_1) == cudaSuccess);
REQUIRE(cudaSetDevice(0) == cudaSuccess);
Contributor


Suggestion: We have a modern CUDA runtime API now. Off the top of my head:

Suggested change
- int num_devices = 0;
- REQUIRE(cudaGetDeviceCount(&num_devices) == cudaSuccess);
- if (num_devices < 2)
- {
-   SKIP("Test requires at least 2 CUDA devices");
- }
- REQUIRE(cudaSetDevice(1) == cudaSuccess);
- cudaStream_t stream_on_device_1;
- REQUIRE(cudaStreamCreate(&stream_on_device_1) == cudaSuccess);
- REQUIRE(cudaSetDevice(0) == cudaSuccess);
+ if (cuda::devices.size() < 2)
+ {
+   SKIP("Test requires at least 2 CUDA devices");
+ }
+ cuda::stream stream_on_device_1(cuda::devices[1]);
+ REQUIRE(cudaSetDevice(0) == cudaSuccess); // not sure if this is still required, please check

Comment thread cub/test/catch2_test_device_for.cu Outdated
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 2h 27m: Pass: 61%/341 | Total: 10d 10h | Max: 2h 26m | Hits: 15%/1128950

See results here.

…st that defines macro to its own file, using modern cuda runtime api to create streams in tests
@thom-gg
Author

thom-gg commented May 26, 2026

Hi, I implemented your suggestions @bernhardmgruber:

  • Tests now use the modern CUDA runtime API (cuda::devices, cuda::stream, etc.). As you mentioned, it's no longer necessary to switch back to device 0, since the stream constructor uses __ensure_current_context, which restores the state on destruction.

  • I moved the test containing #define CCCL_DISABLE_STREAM_DEVICE_CHECK to another test file to avoid interfering with the other tests.

  • The CI highlighted that dispatch functions are CUB_RUNTIME_FUNCTION, which can be host and device functions, while the driver functions are host-only, so I added a check using NV_IF_TARGET with NV_IS_HOST in the validate_stream_device function. This seems to be why some builds of CUB and Thrust failed.

  • I'm quite confused about why the libcudacxx builds were failing, though. The errors mention issues in libcudacxx/test/libcudacxx/std/numerics/simd/simd.complex/simd_test_utils.h, which as far as I can see does not use the driver API (the only part of libcudacxx I touched, and even there I only added new functions, so it shouldn't affect existing tests).

Tests take forever to build locally on my laptop (over 2 hours per target), so I only tried one target (CUDA 12.0, GCC 12). If you have any ideas about the libcudacxx failure, I'll take them; otherwise I'll try to reproduce it tomorrow.

Comment thread cub/cub/util_device.cuh
Comment on lines +462 to +463
// Validates stream's device is current device. Does nothing if CCCL_DISABLE_STREAM_DEVICE_CHECK or
// !CCCL_DISABLE_STREAM_DEVICE_CHECK or is being called from device
Contributor


Suggested change
- // Validates stream's device is current device. Does nothing if CCCL_DISABLE_STREAM_DEVICE_CHECK or
- // !CCCL_DISABLE_STREAM_DEVICE_CHECK or is being called from device
+ // Validates stream's device is current device. Does nothing if CCCL_DISABLE_STREAM_DEVICE_CHECK is
+ // defined or when being called from device code.

Comment thread cub/cub/util_device.cuh
NV_IS_HOST,
(

# if CCCL_ENABLE_ASSERTIONS && !defined(CCCL_DISABLE_STREAM_DEVICE_CHECK)
Contributor


Important: We cannot put preprocessor directives inside macro argument lists. Please move them outside NV_IF_TARGET(...)
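
As background, the C++ standard leaves the behavior undefined when something that would act as a preprocessing directive appears inside the argument list of a function-like macro. One possible restructuring, sketched here with hypothetical stand-ins (DEMO_IF_HOST for NV_IF_TARGET with NV_IS_HOST, DEMO_CHECK_ENABLED for the combined condition, DEMO_VALIDATE for the check itself), is to make the #if decision outside the macro call and pass only ordinary code as the argument.

```cpp
#include <cassert>

// Hypothetical stand-in for NV_IF_TARGET(NV_IS_HOST, ...): expands its argument as-is.
#define DEMO_IF_HOST(code) code

// Hypothetical stand-in for "assertions enabled and check not disabled".
#define DEMO_CHECK_ENABLED 1

// The conditional compilation happens here, outside any macro argument list.
#if DEMO_CHECK_ENABLED
#  define DEMO_VALIDATE(current, stream) ((current) == (stream))
#else
#  define DEMO_VALIDATE(current, stream) true
#endif

inline bool validate(int current_device, int stream_device)
{
  bool ok = true;
  // The argument contains no # directives, only a braced block statement.
  DEMO_IF_HOST({ ok = DEMO_VALIDATE(current_device, stream_device); })
  return ok;
}

inline void demo()
{
  assert(validate(0, 0));
  assert(!validate(0, 1));
}
```

Another option with the same effect would be to wrap the whole NV_IF_TARGET invocation in the #if/#endif pair.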

Comment thread cub/cub/util_device.cuh
Comment on lines +466 to +468
NV_IF_TARGET(
NV_IS_HOST,
(
Contributor


Suggestion: If you wrap the code passed in a block statement, you get much better formatting:

Suggested change
- NV_IF_TARGET(
-   NV_IS_HOST,
-   (
+ NV_IF_TARGET(
+   NV_IS_HOST,
+   ({

and the closing } later on.


Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Validate current device and CUB stream matches

3 participants