Skip to content

Commit dad2ede

Browse files
authored
Fix py_thread_callback_SUITE flake on slow CI hosts (#63) (#64)
* Fix py_thread_callback_SUITE flake on slow CI hosts (#63) Pipe traffic between Python threads and the Erlang coordinator was vulnerable to short reads, two-write length+data races, and frame interleaving on the shared async pipe. Loop reads with a monotonic deadline, send each response as a single id-prefixed frame written under a non-blocking dirty-IO NIF, and serialise async writes through one Erlang process per pipe so chunked kernel writes can no longer interleave. Sync workers self-poison and the async pipe fails loud on unrecoverable read errors. Adds high-concurrency, async-concurrent and large-payload regression tests plus a local stress harness. * Bound async-callback in-flight via per-pipe atomics counter The previous backpressure check looked only at the writer's mailbox length at submission time, so concurrent executors could complete after the check and still pile {respond,...} into the writer unbounded. The counter now tracks executors-plus-queued together: incremented at submission, decremented after the writer hands the frame to the NIF. * Drain async pipe after pipe_broken to stop reader busy-fire The reader returned -1 immediately when pipe_broken was set, but left bytes in the fd, so asyncio kept firing the callback while the Erlang writer was still draining its mailbox into the pipe. Discard the bytes and return 0 so the wrapper's while loop exits cleanly and the fd goes quiet once the writer times out. Also document that recovery is fail-loud only — there is no event-loop reference stored and no re-registration path to fail. * Link async writers to coordinator instead of monitoring Monitored writers survived a coordinator crash, leaking the async pipe fd until the OS reclaimed it. spawn_link from the coordinator (which traps exits) propagates the EXIT signal in both directions: a writer death surfaces in the existing trap_exit clause for cleanup, and a coordinator crash now takes the writer with it so supervision can restart from a clean slate. * Document the async_futures_mutex no-Python-call invariant All three callsites already snapshot or modify the dict under the lock and then call set_result / set_exception with the lock released, but the rule was only spelled out next to one of them. Pin it on the struct definition so the next contributor cannot put a future method back under the mutex and reintroduce the deadlock / re-entry hazard. * Free poisoned thread workers instead of leaking them The previous code marked workers poisoned and skipped them in acquire_thread_worker but kept the struct on g_thread_pool_head until NIF unload, so a hot loop of synchronisation failures grew runtime memory linearly. Unlink under g_thread_pool_mutex and free immediately; the lifetime counter still bumps so the diagnostic ceiling and stderr warning fire. Caller now clears tl_thread_worker and the pthread key BEFORE poisoning so no dangling references can survive the free.
1 parent c74bc39 commit dad2ede

9 files changed

Lines changed: 1123 additions & 249 deletions

File tree

c_src/CMakeLists.txt

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,15 @@ option(ENABLE_ASAN "Enable AddressSanitizer" OFF)
5555
option(ENABLE_TSAN "Enable ThreadSanitizer" OFF)
5656
option(ENABLE_UBSAN "Enable UndefinedBehaviorSanitizer" OFF)
5757

58+
# Optional thread-callback I/O tracing. Compiled out by default; turn on
59+
# locally to debug py_thread_callback_SUITE-style flakiness. Logs one
60+
# line per send/receive on stderr.
61+
option(ENABLE_PY_THREAD_CB_TRACE "Trace thread-callback I/O on stderr" OFF)
62+
if(ENABLE_PY_THREAD_CB_TRACE)
63+
message(STATUS "Thread-callback I/O tracing enabled")
64+
add_compile_definitions(PY_THREAD_CB_TRACE=1)
65+
endif()
66+
5867
if(ENABLE_ASAN)
5968
message(STATUS "AddressSanitizer enabled")
6069
add_compile_options(-fsanitize=address -fno-omit-frame-pointer -g -O1)

0 commit comments

Comments
 (0)