[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper by andralex · Pull Request #9136 · NVIDIA/cccl

andralex · 2026-05-26T22:48:55Z

Summary

Two small, independent follow-ups on top of #8919 that came out of a post-merge review:

cudaFree(0) is now called exactly once per process. backend_ctx<>::impl previously called it on every construction to force lazy CUDA runtime initialization. The runtime is process-global state, so re-checking it for every new context is pure overhead -- particularly on the hot path of back-to-back stream_ctx objects on the same caller stream. Wrap the call in std::call_once. The pre-existing capture-safety gate (initialize_cuda_runtime = false when the user stream is capturing) is unchanged; cudaFree(0) under cudaStreamCaptureModeThreadLocal/Global is rejected with cudaErrorStreamCaptureUnsupported and invalidates the capture, so that upstream guard still matters.
Three copies of the cudaStreamIsCapturing pattern collapse into one helper. Before this PR, the same cudaStreamIsCapturing(stream, &status); status != cudaStreamCaptureStatusNone snippet appeared in places::get_device_from_stream, places::get_stream_id, and as a private stream_ctx::is_capturing. Introduce cuda::experimental::places::is_stream_capturing(cudaStream_t) in __places/stream_pool.cuh -- the natural home, since that file already centralizes the other capture-aware queries -- document why it is the one driver query that is itself capture-safe, re-export it into the stf namespace via stf_places_extended_exports.cuh, and replace the three call sites. No behavior change.

Each item is a separate commit for easy review/revert.

Test plan

CI green on the existing in-capture coverage (legacy_to_stf_in_capture.cu, stream_ctx_lifetime_btb.cu) -- both exercise the call paths touched here.
No new tests warranted: change Add axis for docker builds #1 is a perf/cleanup nop on the existing semantics, change Install curl #2 is a textual refactor.

copy-pr-bot · 2026-05-26T22:48:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

andralex · 2026-05-26T22:49:08Z

/ok to test f5ab3fc

coderabbitai · 2026-05-26T22:53:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9fc93313-661b-4c84-a14a-a82b3c8ef72b

📥 Commits

Reviewing files that changed from the base of the PR and between b993534 and c1d7b8b.

📒 Files selected for processing (3)

cudax/include/cuda/experimental/__places/stream_pool.cuh
cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

✅ Files skipped from review due to trivial changes (1)

cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh

🚧 Files skipped from review as they are similar to previous changes (2)

cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
cudax/include/cuda/experimental/__places/stream_pool.cuh

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added a lightweight helper to detect when a CUDA stream is in capture mode and re-exported it for broader use.
Refactor
- Stream utilities now use the new capture-detection helper for consistent capture-aware behavior.
- CUDA runtime initialization for contexts is now gated to run at most once per process, reducing repeated overhead.
Documentation
- Clarified capture failure modes and default-stream capture semantics in stream context docs.

suggestion:

Walkthrough

Adds inline is_stream_capturing(cudaStream_t), updates stream_pool and stream_ctx to use it, re-exports the helper in the stf namespace, and changes backend_ctx to run the CUDA runtime "kick" at most once per process.

Changes

Stream Capture Helper and Runtime Deduplication

Layer / File(s)	Summary
Extract is_stream_capturing helper and refactor stream_pool `cudax/include/cuda/experimental/__places/stream_pool.cuh`	Introduces `is_stream_capturing(cudaStream_t)` as an inline wrapper around `cudaStreamIsCapturing` and refactors `get_device_from_stream` and `get_stream_id` to call this helper.
Re-export is_stream_capturing in stf namespace `cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh`	Adds `using ::cuda::experimental::places::is_stream_capturing;` in `cuda::experimental::stf`.
Adopt is_stream_capturing in stream_ctx `cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh`	`stream_ctx` constructor and the async-resources `EXPECT` now use `is_stream_capturing(user_stream)`; the private `is_capturing` helper was removed.
Deduplicate CUDA runtime initialization per process `cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh`	Guards the `cudaFree(0)` runtime kick with a function-local `static bool` so the kick runs at most once while retaining the `initialize_cuda_runtime` gate and inline comments documenting capture-safety.

Possibly related PRs

NVIDIA/cccl#8919: Both PRs modify the STF stream capture/runtime-initialization path by changing stream_ctx to detect CUDA stream capture via cudaStreamIsCapturing and updating backend runtime initialization gating.

Suggested labels

stf

Suggested reviewers

caugonnet
alliepiper
bernhardmgruber

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)
144-144: ⚡ Quick win

suggestion: ret should be const cudaError_t since it's never modified.

As per coding guidelines: "All variables that are not modified must use const qualifier."
-          cudaError_t ret = cudaFree(0);
+          const cudaError_t ret = cudaFree(0);
cudax/include/cuda/experimental/__places/stream_pool.cuh (1)

57-62: ⚡ Quick win

important: Bring the new helper and call sites in line with required CCCL function style.

is_stream_capturing should carry a _CCCL_*_API annotation, and free-function calls should be globally qualified (::cudaStreamIsCapturing, ::cuda::experimental::places::is_stream_capturing) per the repo rules.

As per coding guidelines “All functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API” and “All calls to free functions must be fully qualified starting from the global namespace”.

Also applies to: 77-77, 116-116

cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh (1)

145-145: ⚡ Quick win

suggestion: Qualify is_stream_capturing calls from the global namespace.

Use ::cuda::experimental::places::is_stream_capturing(user_stream) at both call sites to match the project’s free-function qualification rule.

As per coding guidelines “All calls to free functions must be fully qualified starting from the global namespace, including calls to functions in the same namespace”.

Also applies to: 165-165

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1e7ce1f-b4b9-4ce0-84aa-c3af4a5dcd96

📥 Commits

Reviewing files that changed from the base of the PR and between 9722f26 and f5ab3fc.

📒 Files selected for processing (4)

cudax/include/cuda/experimental/__places/stream_pool.cuh
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

backend_ctx<>::impl was calling ``cudaFree(0)`` on every construction to force lazy CUDA runtime initialization. The runtime is process-global state, so re-checking it for every new context is pure overhead -- particularly on the hot path of back-to-back contexts on the same caller stream. Gate the call on a plain ``static bool``. ``cudaFree(0)`` is idempotent, so a benign data race on first use just costs a couple of extra (still-correct) driver calls until every thread sees the store. We deliberately avoid ``std::call_once`` and magic-static initialization here: both add an unconditional mutex / double-checked-lock on the steady-state path, which is exactly what we are trying to remove. The pre-existing capture-safety gate is unchanged: callers pass ``initialize_cuda_runtime = false`` when the user stream is capturing, because ``cudaFree(0)`` under ``cudaStreamCaptureModeThreadLocal`` / ``Global`` is rejected with ``cudaErrorStreamCaptureUnsupported`` and invalidates the in-flight capture.

coderabbitai

🧹 Nitpick comments (1)

cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)
149-157: 💤 Low value

suggestion: Plain static bool with concurrent reads/writes is technically undefined behavior in C++. std::atomic<bool> with std::memory_order_relaxed for both load and store would eliminate UB while generating identical code on most architectures (no barriers added). The idempotency argument still holds; this is just about avoiding formally undefined behavior.
-        static bool cuda_runtime_initialized = false;
-        if (!cuda_runtime_initialized)
+        static ::std::atomic<bool> cuda_runtime_initialized{false};
+        if (!cuda_runtime_initialized.load(::std::memory_order_relaxed))
         {
           cudaError_t ret = cudaFree(0);
           // If we are running the task in the context of a CUDA callback, we
           // are not allowed to issue any CUDA API call.
           EXPECT((ret == cudaSuccess || ret == cudaErrorNotPermitted));
-          cuda_runtime_initialized = true;
+          cuda_runtime_initialized.store(true, ::std::memory_order_relaxed);
         }
Would require #include <atomic> (already included in the file).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7114f1ea-1ba7-4f2a-b26d-ed801be1013a

📥 Commits

Reviewing files that changed from the base of the PR and between f5ab3fc and b993534.

📒 Files selected for processing (4)

cudax/include/cuda/experimental/__places/stream_pool.cuh
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

✅ Files skipped from review due to trivial changes (1)

cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

Three independent copies of the ``cudaStreamIsCapturing(stream, &s); s != None`` pattern had grown across cudax: two inline in places.cuh (``get_device_from_stream``, ``get_stream_id``) and one as a private ``stream_ctx::is_capturing`` helper. They all do the same thing -- ask "is this stream currently part of a CUDA graph capture?" -- which is the *only* driver query that is itself capture-safe and so deserves a single named, documented home. Introduce ``cuda::experimental::places::is_stream_capturing(cudaStream_t)`` in ``__places/stream_pool.cuh``, document why it is the one query you can safely call mid-capture, and re-export it into the ``stf`` namespace via ``stf_places_extended_exports.cuh``. Replace the three call sites. No behavior change; the new helper does **not** silently swallow the ``cudaStreamIsCapturing`` return value the way the old ``cap_err == cudaSuccess`` paths did, because that error is always fatal in practice and the existing ``cuda_try`` machinery already gives the right diagnostic.

andralex · 2026-05-27T00:19:02Z

/ok to test c1d7b8b

andralex · 2026-05-27T00:50:20Z

/ok to test 29cd5c1

github-actions · 2026-05-27T01:56:20Z

🥳 CI Workflow Results

🟩 Finished in 1h 04m: Pass: 100%/55 | Total: 19h 56m | Max: 1h 03m | Hits: 34%/96195

See results here.

caugonnet · 2026-05-27T05:02:35Z

+        // ``cudaStreamCaptureModeThreadLocal`` / ``Global`` is rejected with
+        // ``cudaErrorStreamCaptureUnsupported`` *and* invalidates the
+        // in-flight capture.
+        static bool cuda_runtime_initialized = false;


I thought about that too, but this is replacing a very cheap operation we do all the time by some static state that might have bad consequences. We sometimes entirely destroy CUDA contexts (eg. this is what some python stacks do), I am unsure things would work if we don't "initialize" CUDA again but it's probably too defensive ...

andralex requested a review from a team as a code owner May 26, 2026 22:48

github-project-automation Bot added this to CCCL May 26, 2026

andralex requested a review from caugonnet May 26, 2026 22:48

github-project-automation Bot moved this to Todo in CCCL May 26, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 26, 2026

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

andralex force-pushed the andralex/stf-cudafree-once-and-dedupe branch from f5ab3fc to b993534 Compare May 26, 2026 22:54

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

andralex force-pushed the andralex/stf-cudafree-once-and-dedupe branch from b993534 to c1d7b8b Compare May 27, 2026 00:19

Merge branch 'main' into andralex/stf-cudafree-once-and-dedupe

29cd5c1

caugonnet reviewed May 27, 2026

View reviewed changes

caugonnet added the stf Sequential Task Flow programming model label May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136

[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136
andralex wants to merge 3 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cudafree-once-and-dedupe

andralex commented May 26, 2026

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

andralex commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 •

edited

Loading

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

This comment has been minimized.

andralex commented May 27, 2026

Uh oh!

andralex commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

caugonnet May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andralex commented May 26, 2026

Summary

Test plan

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

andralex commented May 26, 2026

Uh oh!

coderabbitai Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

andralex commented May 27, 2026

Uh oh!

andralex commented May 27, 2026

Uh oh!

github-actions Bot commented May 27, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 04m: Pass: 100%/55 | Total: 19h 56m | Max: 1h 03m | Hits: 34%/96195

Uh oh!

caugonnet May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 26, 2026 •

edited

Loading