fix: data race on last_storage_ptr_ cache in gil_safe_call_once_and_store#6087
Conversation
…tore Under free-threaded CPython (Py_GIL_DISABLED) the GIL provides no mutual exclusion, so the plain pointer last_storage_ptr_ was read and written concurrently without synchronization, a C++ data race. Make it a std::atomic<T *>. Also reorder get_stored() to check is_last_storage_valid() before loading the cached pointer. The writer publishes the pointer before setting the validity flag, so the flag must be observed first for correct acquire/release ordering. Assisted-by: ClaudeCode:claude-fable-5
3f3c573 to
faf4b5d
Compare
rwgk
left a comment
There was a problem hiding this comment.
gpt-5.5:
I reviewed the change in gil_safe_call_once.h: making last_storage_ptr_ atomic and loading it only after the validity flag is observed matches the intended seq-cst publish/observe ordering. The change stays scoped to the subinterpreter-support branch and preserves the existing slow-path behavior.
The functional bug being fixed is a C++ data race that matters when Py_GIL_DISABLED is active, because the GIL no longer serializes reads/writes to last_storage_ptr_.
On normal GIL builds, the same code is compiled in the subinterpreter-support branch on Python 3.12+, but the existing GIL discipline already prevents the problematic concurrent unsynchronized access in the intended usage. So the atomic pointer and reordered load are mostly harmless correctness hardening there, not a user-visible behavioral fix.
One nuance: the reorder in get_stored() is conceptually tied to the atomic publish protocol. Once last_storage_ptr_ becomes atomic, checking is_initialized_by_at_least_one_interpreter_ before reading the pointer is the right way to make the memory ordering argument valid. So both changes belong together, but the reason they are needed is free-threaded Python.
🤖 AI text below 🤖
Problem
In
include/pybind11/gil_safe_call_once.h, thePYBIND11_HAS_SUBINTERPRETER_SUPPORTbranch ofgil_safe_call_once_and_storecaches the per-interpreter storage pointer inT *last_storage_ptr_as a fast path for the single-interpreter case.This plain pointer is:
call_once_and_store_result()(inside thestd::call_oncelambda) and inget_stored(), andget_stored().Under free-threaded CPython (
Py_GIL_DISABLED),gil_scoped_acquireprovides no mutual exclusion, so these concurrent unsynchronized reads/writes of a plain pointer are a C++ data race (undefined behavior).Additionally,
get_stored()loaded the cached pointer before callingis_last_storage_valid(). The writer publishes the pointer (last_storage_ptr_) before setting the validity flag (is_initialized_by_at_least_one_interpreter_), so a reader must observe the flag first and only then load the pointer to get correct acquire/release ordering.Fix
last_storage_ptr_astd::atomic<T *>(<atomic>is already included in this branch). Default seq_cst operations pair with the existing atomic validity flag.get_stored()to checkis_last_storage_valid()first, and only loadlast_storage_ptr_on the fast path; on the slow path, look up the per-interpreter storage and update the cache as before.The change is minimal and behavior-preserving on the non-free-threaded build.
Not addressed here
The separate embedded finalize / re-init staleness concern (a stale
last_storage_ptr_surviving interpreter finalize + re-init within the same process) is not fixed by this PR. It is tracked separately in the issue.Verification
Compiled a scratch translation unit including
<pybind11/gil_safe_call_once.h>and instantiatingpybind11::gil_safe_call_once_and_store<int>withc++ -std=c++17 -fsyntax-onlyagainst Python 3.14 (the subinterpreter branch is active by default for Python >= 3.12). Compiles cleanly.clang-formatapplied.Part of #6084