Skip to content

[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136

Open
andralex wants to merge 3 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cudafree-once-and-dedupe
Open

[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136
andralex wants to merge 3 commits into
NVIDIA:mainfrom
andralex:andralex/stf-cudafree-once-and-dedupe

Conversation

@andralex
Copy link
Copy Markdown
Contributor

Summary

Two small, independent follow-ups on top of #8919 that came out of a post-merge review:

  1. cudaFree(0) is now called exactly once per process. backend_ctx<>::impl previously called it on every construction to force lazy CUDA runtime initialization. The runtime is process-global state, so re-checking it for every new context is pure overhead -- particularly on the hot path of back-to-back stream_ctx objects on the same caller stream. Wrap the call in std::call_once. The pre-existing capture-safety gate (initialize_cuda_runtime = false when the user stream is capturing) is unchanged; cudaFree(0) under cudaStreamCaptureModeThreadLocal/Global is rejected with cudaErrorStreamCaptureUnsupported and invalidates the capture, so that upstream guard still matters.

  2. Three copies of the cudaStreamIsCapturing pattern collapse into one helper. Before this PR, the same cudaStreamIsCapturing(stream, &status); status != cudaStreamCaptureStatusNone snippet appeared in places::get_device_from_stream, places::get_stream_id, and as a private stream_ctx::is_capturing. Introduce cuda::experimental::places::is_stream_capturing(cudaStream_t) in __places/stream_pool.cuh -- the natural home, since that file already centralizes the other capture-aware queries -- document why it is the one driver query that is itself capture-safe, re-export it into the stf namespace via stf_places_extended_exports.cuh, and replace the three call sites. No behavior change.

Each item is a separate commit for easy review/revert.

Test plan

  • CI green on the existing in-capture coverage (legacy_to_stf_in_capture.cu, stream_ctx_lifetime_btb.cu) -- both exercise the call paths touched here.
  • No new tests warranted: change Add axis for docker builds #1 is a perf/cleanup nop on the existing semantics, change Install curl #2 is a textual refactor.

@andralex andralex requested a review from a team as a code owner May 26, 2026 22:48
@andralex andralex requested a review from caugonnet May 26, 2026 22:48
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 26, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 26, 2026
@andralex
Copy link
Copy Markdown
Contributor Author

/ok to test f5ab3fc

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9fc93313-661b-4c84-a14a-a82b3c8ef72b

📥 Commits

Reviewing files that changed from the base of the PR and between b993534 and c1d7b8b.

📒 Files selected for processing (3)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh
  • cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
✅ Files skipped from review due to trivial changes (1)
  • cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
🚧 Files skipped from review as they are similar to previous changes (2)
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
  • cudax/include/cuda/experimental/__places/stream_pool.cuh

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added a lightweight helper to detect when a CUDA stream is in capture mode and re-exported it for broader use.
  • Refactor
    • Stream utilities now use the new capture-detection helper for consistent capture-aware behavior.
    • CUDA runtime initialization for contexts is now gated to run at most once per process, reducing repeated overhead.
  • Documentation
    • Clarified capture failure modes and default-stream capture semantics in stream context docs.

suggestion:

Walkthrough

Adds inline is_stream_capturing(cudaStream_t), updates stream_pool and stream_ctx to use it, re-exports the helper in the stf namespace, and changes backend_ctx to run the CUDA runtime "kick" at most once per process.

Changes

Stream Capture Helper and Runtime Deduplication

Layer / File(s) Summary
Extract is_stream_capturing helper and refactor stream_pool
cudax/include/cuda/experimental/__places/stream_pool.cuh
Introduces is_stream_capturing(cudaStream_t) as an inline wrapper around cudaStreamIsCapturing and refactors get_device_from_stream and get_stream_id to call this helper.
Re-export is_stream_capturing in stf namespace
cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
Adds using ::cuda::experimental::places::is_stream_capturing; in cuda::experimental::stf.
Adopt is_stream_capturing in stream_ctx
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
stream_ctx constructor and the async-resources EXPECT now use is_stream_capturing(user_stream); the private is_capturing helper was removed.
Deduplicate CUDA runtime initialization per process
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
Guards the cudaFree(0) runtime kick with a function-local static bool so the kick runs at most once while retaining the initialize_cuda_runtime gate and inline comments documenting capture-safety.

Possibly related PRs

  • NVIDIA/cccl#8919: Both PRs modify the STF stream capture/runtime-initialization path by changing stream_ctx to detect CUDA stream capture via cudaStreamIsCapturing and updating backend runtime initialization gating.

Suggested labels

stf

Suggested reviewers

  • caugonnet
  • alliepiper
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)

144-144: ⚡ Quick win

suggestion: ret should be const cudaError_t since it's never modified.

As per coding guidelines: "All variables that are not modified must use const qualifier."

-          cudaError_t ret = cudaFree(0);
+          const cudaError_t ret = cudaFree(0);
cudax/include/cuda/experimental/__places/stream_pool.cuh (1)

57-62: ⚡ Quick win

important: Bring the new helper and call sites in line with required CCCL function style.

is_stream_capturing should carry a _CCCL_*_API annotation, and free-function calls should be globally qualified (::cudaStreamIsCapturing, ::cuda::experimental::places::is_stream_capturing) per the repo rules.

As per coding guidelines “All functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API” and “All calls to free functions must be fully qualified starting from the global namespace”.

Also applies to: 77-77, 116-116

cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh (1)

145-145: ⚡ Quick win

suggestion: Qualify is_stream_capturing calls from the global namespace.

Use ::cuda::experimental::places::is_stream_capturing(user_stream) at both call sites to match the project’s free-function qualification rule.

As per coding guidelines “All calls to free functions must be fully qualified starting from the global namespace, including calls to functions in the same namespace”.

Also applies to: 165-165


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a1e7ce1f-b4b9-4ce0-84aa-c3af4a5dcd96

📥 Commits

Reviewing files that changed from the base of the PR and between 9722f26 and f5ab3fc.

📒 Files selected for processing (4)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh
  • cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
  • cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

backend_ctx<>::impl was calling ``cudaFree(0)`` on every construction to
force lazy CUDA runtime initialization. The runtime is process-global
state, so re-checking it for every new context is pure overhead --
particularly on the hot path of back-to-back contexts on the same
caller stream.

Gate the call on a plain ``static bool``. ``cudaFree(0)`` is idempotent,
so a benign data race on first use just costs a couple of extra
(still-correct) driver calls until every thread sees the store. We
deliberately avoid ``std::call_once`` and magic-static initialization
here: both add an unconditional mutex / double-checked-lock on the
steady-state path, which is exactly what we are trying to remove.

The pre-existing capture-safety gate is unchanged: callers pass
``initialize_cuda_runtime = false`` when the user stream is capturing,
because ``cudaFree(0)`` under ``cudaStreamCaptureModeThreadLocal`` /
``Global`` is rejected with ``cudaErrorStreamCaptureUnsupported`` and
invalidates the in-flight capture.
@andralex andralex force-pushed the andralex/stf-cudafree-once-and-dedupe branch from f5ab3fc to b993534 Compare May 26, 2026 22:54
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)

149-157: 💤 Low value

suggestion: Plain static bool with concurrent reads/writes is technically undefined behavior in C++. std::atomic<bool> with std::memory_order_relaxed for both load and store would eliminate UB while generating identical code on most architectures (no barriers added). The idempotency argument still holds; this is just about avoiding formally undefined behavior.

-        static bool cuda_runtime_initialized = false;
-        if (!cuda_runtime_initialized)
+        static ::std::atomic<bool> cuda_runtime_initialized{false};
+        if (!cuda_runtime_initialized.load(::std::memory_order_relaxed))
         {
           cudaError_t ret = cudaFree(0);
           // If we are running the task in the context of a CUDA callback, we
           // are not allowed to issue any CUDA API call.
           EXPECT((ret == cudaSuccess || ret == cudaErrorNotPermitted));
-          cuda_runtime_initialized = true;
+          cuda_runtime_initialized.store(true, ::std::memory_order_relaxed);
         }

Would require #include <atomic> (already included in the file).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7114f1ea-1ba7-4f2a-b26d-ed801be1013a

📥 Commits

Reviewing files that changed from the base of the PR and between f5ab3fc and b993534.

📒 Files selected for processing (4)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh
  • cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
  • cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
✅ Files skipped from review due to trivial changes (1)
  • cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh

@github-actions

This comment has been minimized.

Three independent copies of the ``cudaStreamIsCapturing(stream, &s); s !=
None`` pattern had grown across cudax: two inline in places.cuh
(``get_device_from_stream``, ``get_stream_id``) and one as a private
``stream_ctx::is_capturing`` helper. They all do the same thing -- ask
"is this stream currently part of a CUDA graph capture?" -- which is
the *only* driver query that is itself capture-safe and so deserves a
single named, documented home.

Introduce ``cuda::experimental::places::is_stream_capturing(cudaStream_t)``
in ``__places/stream_pool.cuh``, document why it is the one query you
can safely call mid-capture, and re-export it into the ``stf`` namespace
via ``stf_places_extended_exports.cuh``. Replace the three call sites.
No behavior change; the new helper does **not** silently swallow the
``cudaStreamIsCapturing`` return value the way the old ``cap_err ==
cudaSuccess`` paths did, because that error is always fatal in practice
and the existing ``cuda_try`` machinery already gives the right
diagnostic.
@andralex
Copy link
Copy Markdown
Contributor Author

/ok to test c1d7b8b

@andralex andralex force-pushed the andralex/stf-cudafree-once-and-dedupe branch from b993534 to c1d7b8b Compare May 27, 2026 00:19
@andralex
Copy link
Copy Markdown
Contributor Author

/ok to test 29cd5c1

@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 04m: Pass: 100%/55 | Total: 19h 56m | Max: 1h 03m | Hits: 34%/96195

See results here.

// ``cudaStreamCaptureModeThreadLocal`` / ``Global`` is rejected with
// ``cudaErrorStreamCaptureUnsupported`` *and* invalidates the
// in-flight capture.
static bool cuda_runtime_initialized = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that too, but this is replacing a very cheap operation we do all the time by some static state that might have bad consequences. We sometimes entirely destroy CUDA contexts (eg. this is what some python stacks do), I am unsure things would work if we don't "initialize" CUDA again but it's probably too defensive ...

@caugonnet caugonnet added the stf Sequential Task Flow programming model label May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants