Commit dad2ede
* Fix py_thread_callback_SUITE flake on slow CI hosts (#63)
Pipe traffic between Python threads and the Erlang coordinator was
vulnerable to short reads, two-write length+data races, and frame
interleaving on the shared async pipe. Loop reads with a monotonic
deadline, send each response as a single id-prefixed frame written
under a non-blocking dirty-IO NIF, and serialise async writes through
one Erlang process per pipe so chunked kernel writes can no longer
interleave. Sync workers self-poison and the async pipe fails loud on
unrecoverable read errors. Adds high-concurrency, async-concurrent and
large-payload regression tests plus a local stress harness.
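The framing scheme described above can be sketched in Python. This is an illustrative sketch, not the project's code: the 8-byte `(id, length)` header layout, the helper names, and the timeout default are all assumptions made for the example.

```python
import os
import struct
import time

def read_exact(fd, n, deadline):
    """Loop until exactly n bytes arrive, tolerating short reads,
    bounded by a monotonic deadline (hypothetical helper)."""
    buf = bytearray()
    while len(buf) < n:
        if time.monotonic() > deadline:
            raise TimeoutError(f"short read: got {len(buf)}/{n} bytes")
        chunk = os.read(fd, n - len(buf))
        if not chunk:
            raise EOFError("pipe closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)

def read_frame(fd, timeout=5.0):
    """One id-prefixed frame: assumed 8-byte header (u32 id, u32 length),
    then the payload, all read under one deadline."""
    deadline = time.monotonic() + timeout
    frame_id, length = struct.unpack("!II", read_exact(fd, 8, deadline))
    return frame_id, read_exact(fd, length, deadline)

def write_frame(fd, frame_id, payload):
    """Send header and payload in a single write so a concurrent writer
    cannot slip between the length and the data (atomic up to PIPE_BUF;
    larger frames still need the single-writer-process serialisation)."""
    os.write(fd, struct.pack("!II", frame_id, len(payload)) + payload)
```

The single-write framing closes the length/data race for small frames; the per-pipe writer process mentioned above is still needed because the kernel may chunk writes larger than PIPE_BUF.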
* Bound async-callback in-flight count via a per-pipe atomic counter
The previous backpressure check looked only at the writer's mailbox
length at submission time, so concurrent executors could complete
after the check and still pile {respond,...} messages into the
writer's mailbox without bound. The counter now tracks executors and
queued frames together: incremented at submission, decremented only
after the writer hands the frame to the NIF.
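The accounting above can be sketched as follows. The real counter is a C atomic inside the NIF; this hypothetical Python class uses a lock-protected integer to show the same invariant: the count covers an executor from submission until its frame is handed off, so completions that race past a mailbox-length check are still bounded.

```python
import threading

class InFlightLimiter:
    """Illustrative sketch of the per-pipe in-flight bound (names and
    limit semantics are assumptions, not the project's API)."""

    def __init__(self, limit):
        self.limit = limit
        self._count = 0
        self._lock = threading.Lock()

    def try_submit(self):
        with self._lock:
            if self._count >= self.limit:
                return False      # backpressure: reject the submission
            self._count += 1      # covers executor AND queued frame
            return True

    def frame_handed_off(self):
        """Called only after the writer hands the frame to the sink."""
        with self._lock:
            self._count -= 1
```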
* Drain async pipe after pipe_broken to stop reader busy-fire
The reader returned -1 immediately when pipe_broken was set, but
left bytes in the fd, so asyncio kept firing the callback while the
Erlang writer was still draining its mailbox into the pipe. Discard
the bytes and return 0 so the wrapper's while loop exits cleanly and
the fd goes quiet once the writer times out. Also document that
recovery is fail-loud only — there is no event-loop reference stored
and no re-registration path to fail.
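The drain-and-report-EOF behaviour can be sketched like this. The real reader is in the NIF; this hypothetical Python function shows the shape of the fix: once `pipe_broken` is set, leftover bytes are discarded and 0 is returned so the event-loop wrapper's read loop exits instead of busy-firing on -1 with data still in the fd. The chunk size and non-blocking-fd assumption are illustrative.

```python
import os

PIPE_READ_CHUNK = 65536  # assumption: drain granularity

def pipe_read(fd, pipe_broken):
    """Sketch: fd is assumed non-blocking, as it would be under an
    event loop. On a broken pipe, drain and return 0 (EOF)."""
    if pipe_broken:
        try:
            while os.read(fd, PIPE_READ_CHUNK):  # drain and discard
                pass
        except BlockingIOError:
            pass  # non-blocking fd is now empty
        return 0  # EOF: the wrapper's while loop exits cleanly
    return len(os.read(fd, PIPE_READ_CHUNK))
```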
* Link async writers to coordinator instead of monitoring
Monitored writers survived a coordinator crash, leaking the async
pipe fd until the OS reclaimed it. spawn_link from the coordinator
(which traps exits) propagates the EXIT signal in both directions:
a writer death surfaces in the existing trap_exit clause for
cleanup, and a coordinator crash now takes the writer with it so
supervision can restart from a clean slate.
* Document the async_futures_mutex no-Python-call invariant
All three callsites already snapshot or modify the dict under the
lock and then call set_result / set_exception with the lock
released, but the rule was only spelled out next to one of them.
Pin it on the struct definition so the next contributor cannot put
a future method back under the mutex and reintroduce the
deadlock / re-entry hazard.
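The invariant reads naturally as a snapshot-then-call pattern. A minimal sketch, assuming a futures dict keyed by call id (the real state lives in the C struct, and the class and method names here are invented for illustration):

```python
import threading
from concurrent.futures import Future

class AsyncFutures:
    """INVARIANT (mirroring the one pinned on the struct): the dict is
    only read or mutated under _mutex, and Future methods such as
    set_result / set_exception are only invoked AFTER the lock is
    released, so a callback can never re-enter or deadlock on it."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._futures = {}

    def register(self, call_id):
        fut = Future()
        with self._mutex:
            self._futures[call_id] = fut
        return fut

    def resolve(self, call_id, result):
        with self._mutex:
            fut = self._futures.pop(call_id, None)  # snapshot under the lock...
        if fut is not None:
            fut.set_result(result)  # ...call back into Python lock-free
```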
* Free poisoned thread workers instead of leaking them
The previous code marked workers poisoned and skipped them in
acquire_thread_worker but kept the struct on g_thread_pool_head until
NIF unload, so a hot loop of synchronisation failures grew runtime
memory linearly. Unlink under g_thread_pool_mutex and free
immediately; the lifetime counter still bumps so the diagnostic
ceiling and stderr warning fire. Caller now clears tl_thread_worker
and the pthread key BEFORE poisoning so no dangling references can
survive the free.
1 parent c74bc39 · commit dad2ede
9 files changed: 1,123 additions, 249 deletions