[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136
[STF] Run cudaFree(0) once + deduplicate cudaStreamIsCapturing helper#9136andralex wants to merge 3 commits into
Conversation
|
/ok to test f5ab3fc |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughSummary by CodeRabbit
suggestion: WalkthroughAdds inline is_stream_capturing(cudaStream_t), updates stream_pool and stream_ctx to use it, re-exports the helper in the stf namespace, and changes backend_ctx to run the CUDA runtime "kick" at most once per process. ChangesStream Capture Helper and Runtime Deduplication
Possibly related PRs
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)
144-144: ⚡ Quick winsuggestion:
retshould beconst cudaError_tsince it's never modified.As per coding guidelines: "All variables that are not modified must use const qualifier."
- cudaError_t ret = cudaFree(0); + const cudaError_t ret = cudaFree(0);cudax/include/cuda/experimental/__places/stream_pool.cuh (1)
57-62: ⚡ Quick winimportant: Bring the new helper and call sites in line with required CCCL function style.
is_stream_capturingshould carry a_CCCL_*_APIannotation, and free-function calls should be globally qualified (::cudaStreamIsCapturing,::cuda::experimental::places::is_stream_capturing) per the repo rules.As per coding guidelines “All functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API” and “All calls to free functions must be fully qualified starting from the global namespace”.
Also applies to: 77-77, 116-116
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh (1)
145-145: ⚡ Quick winsuggestion: Qualify
is_stream_capturingcalls from the global namespace.Use
::cuda::experimental::places::is_stream_capturing(user_stream)at both call sites to match the project’s free-function qualification rule.As per coding guidelines “All calls to free functions must be fully qualified starting from the global namespace, including calls to functions in the same namespace”.
Also applies to: 165-165
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a1e7ce1f-b4b9-4ce0-84aa-c3af4a5dcd96
📒 Files selected for processing (4)
cudax/include/cuda/experimental/__places/stream_pool.cuhcudax/include/cuda/experimental/__stf/internal/backend_ctx.cuhcudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuhcudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
backend_ctx<>::impl was calling ``cudaFree(0)`` on every construction to force lazy CUDA runtime initialization. The runtime is process-global state, so re-checking it for every new context is pure overhead -- particularly on the hot path of back-to-back contexts on the same caller stream. Gate the call on a plain ``static bool``. ``cudaFree(0)`` is idempotent, so a benign data race on first use just costs a couple of extra (still-correct) driver calls until every thread sees the store. We deliberately avoid ``std::call_once`` and magic-static initialization here: both add an unconditional mutex / double-checked-lock on the steady-state path, which is exactly what we are trying to remove. The pre-existing capture-safety gate is unchanged: callers pass ``initialize_cuda_runtime = false`` when the user stream is capturing, because ``cudaFree(0)`` under ``cudaStreamCaptureModeThreadLocal`` / ``Global`` is rejected with ``cudaErrorStreamCaptureUnsupported`` and invalidates the in-flight capture.
f5ab3fc to
b993534
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh (1)
149-157: 💤 Low valuesuggestion: Plain
static boolwith concurrent reads/writes is technically undefined behavior in C++.std::atomic<bool>withstd::memory_order_relaxedfor both load and store would eliminate UB while generating identical code on most architectures (no barriers added). The idempotency argument still holds; this is just about avoiding formally undefined behavior.- static bool cuda_runtime_initialized = false; - if (!cuda_runtime_initialized) + static ::std::atomic<bool> cuda_runtime_initialized{false}; + if (!cuda_runtime_initialized.load(::std::memory_order_relaxed)) { cudaError_t ret = cudaFree(0); // If we are running the task in the context of a CUDA callback, we // are not allowed to issue any CUDA API call. EXPECT((ret == cudaSuccess || ret == cudaErrorNotPermitted)); - cuda_runtime_initialized = true; + cuda_runtime_initialized.store(true, ::std::memory_order_relaxed); }Would require
#include <atomic>(already included in the file).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7114f1ea-1ba7-4f2a-b26d-ed801be1013a
📒 Files selected for processing (4)
cudax/include/cuda/experimental/__places/stream_pool.cuhcudax/include/cuda/experimental/__stf/internal/backend_ctx.cuhcudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuhcudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
✅ Files skipped from review due to trivial changes (1)
- cudax/include/cuda/experimental/__stf/internal/stf_places_extended_exports.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
- cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
This comment has been minimized.
This comment has been minimized.
Three independent copies of the ``cudaStreamIsCapturing(stream, &s); s != None`` pattern had grown across cudax: two inline in places.cuh (``get_device_from_stream``, ``get_stream_id``) and one as a private ``stream_ctx::is_capturing`` helper. They all do the same thing -- ask "is this stream currently part of a CUDA graph capture?" -- which is the *only* driver query that is itself capture-safe and so deserves a single named, documented home. Introduce ``cuda::experimental::places::is_stream_capturing(cudaStream_t)`` in ``__places/stream_pool.cuh``, document why it is the one query you can safely call mid-capture, and re-export it into the ``stf`` namespace via ``stf_places_extended_exports.cuh``. Replace the three call sites. No behavior change; the new helper does **not** silently swallow the ``cudaStreamIsCapturing`` return value the way the old ``cap_err == cudaSuccess`` paths did, because that error is always fatal in practice and the existing ``cuda_try`` machinery already gives the right diagnostic.
|
/ok to test c1d7b8b |
b993534 to
c1d7b8b
Compare
|
/ok to test 29cd5c1 |
🥳 CI Workflow Results🟩 Finished in 1h 04m: Pass: 100%/55 | Total: 19h 56m | Max: 1h 03m | Hits: 34%/96195See results here. |
| // ``cudaStreamCaptureModeThreadLocal`` / ``Global`` is rejected with | ||
| // ``cudaErrorStreamCaptureUnsupported`` *and* invalidates the | ||
| // in-flight capture. | ||
| static bool cuda_runtime_initialized = false; |
There was a problem hiding this comment.
I thought about that too, but this is replacing a very cheap operation we do all the time by some static state that might have bad consequences. We sometimes entirely destroy CUDA contexts (eg. this is what some python stacks do), I am unsure things would work if we don't "initialize" CUDA again but it's probably too defensive ...
Summary
Two small, independent follow-ups on top of #8919 that came out of a post-merge review:
cudaFree(0)is now called exactly once per process.backend_ctx<>::implpreviously called it on every construction to force lazy CUDA runtime initialization. The runtime is process-global state, so re-checking it for every new context is pure overhead -- particularly on the hot path of back-to-backstream_ctxobjects on the same caller stream. Wrap the call instd::call_once. The pre-existing capture-safety gate (initialize_cuda_runtime = falsewhen the user stream is capturing) is unchanged;cudaFree(0)undercudaStreamCaptureModeThreadLocal/Globalis rejected withcudaErrorStreamCaptureUnsupportedand invalidates the capture, so that upstream guard still matters.Three copies of the
cudaStreamIsCapturingpattern collapse into one helper. Before this PR, the samecudaStreamIsCapturing(stream, &status); status != cudaStreamCaptureStatusNonesnippet appeared inplaces::get_device_from_stream,places::get_stream_id, and as a privatestream_ctx::is_capturing. Introducecuda::experimental::places::is_stream_capturing(cudaStream_t)in__places/stream_pool.cuh-- the natural home, since that file already centralizes the other capture-aware queries -- document why it is the one driver query that is itself capture-safe, re-export it into thestfnamespace viastf_places_extended_exports.cuh, and replace the three call sites. No behavior change.Each item is a separate commit for easy review/revert.
Test plan
legacy_to_stf_in_capture.cu,stream_ctx_lifetime_btb.cu) -- both exercise the call paths touched here.