feat: experimental cross-process jobserver via POSIX semaphore #13856
Draft
Kha wants to merge 5 commits into
Member
Author
!bench
Benchmark results for 387478f against 2cd9863 are in. There are significant results. @Kha
No significant changes detected.
This PR adds an opt-in cross-process parallelism limit to the Lean runtime. When `LEAN_JOB_SEMAPHORE=/name` points at a POSIX named semaphore, `task_manager`'s standard workers acquire a token before running a task and release it after, so the total number of concurrently running standard workers across all participating processes is bounded by the semaphore's initial value.

When the env var is unset, behavior is unchanged and there is no overhead. `Task.get` releases its token while blocked and reacquires before resuming, so a worker waiting on a sub-task cannot starve the global pool. Dedicated workers (priority above `LEAN_MAX_PRIO`) and `LEAN_SYNC_PRIO` tasks bypass the worker loop and so do not consume tokens.

This is intended for experimentation, not production: Linux and macOS only (Windows is a no-op), no `MAKEFLAGS` parsing, no crash-recovery for tokens leaked by killed processes, and no Lake integration, so callers must create and destroy the semaphore themselves.
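As a rough illustration of the gating described above, the following Python sketch models the worker loop with an in-process `threading.Semaphore` standing in for the POSIX named semaphore. It is not the runtime's C++ code: `run_gated` and its bookkeeping are hypothetical names, and threads in one process stand in for workers spread across several processes.

```python
import threading
import time


def run_gated(num_workers, num_tasks, limit):
    """Run num_tasks short tasks on num_workers threads, gated by `limit` tokens.

    Returns the peak number of tasks observed running at the same time.
    """
    # Stand-in for the named semaphore (assumption: same counting semantics).
    tokens = threading.Semaphore(limit)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "remaining": num_tasks}

    def worker():
        while True:
            with lock:
                if state["remaining"] == 0:
                    return
                state["remaining"] -= 1
            tokens.acquire()  # analogue of sem_wait before running a task
            try:
                with lock:
                    state["running"] += 1
                    state["peak"] = max(state["peak"], state["running"])
                time.sleep(0.001)  # simulated work
                with lock:
                    state["running"] -= 1
            finally:
                tokens.release()  # analogue of sem_post after the task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With eight workers and three tokens, the observed peak never exceeds three, which is the bound the semaphore's initial value is meant to enforce.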
This PR makes the experimental jobserver self-bootstrapping: when no `LEAN_JOB_SEMAPHORE` is set in the environment, `task_manager` now creates a fresh named semaphore (`/lean-jobs-<pid>`) sized to `max_std_workers`, exports the name via `LEAN_JOB_SEMAPHORE` so child processes inherit it, and `sem_unlink`s on exit. `LEAN_JOB_SEMAPHORE_AUTO=N` overrides the size. The creating process does not gate its own workers against the semaphore. The creator is typically an orchestrator (e.g. `lake`) whose workers block on subprocesses; gating it would consume tokens that its child `lean` processes need, deadlocking the pool. Together with the previous patch this means `lake build` participates in cross-process parallelism limiting with no command-line changes.
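The bootstrapping decision can be sketched as a small pure function over an environment dictionary. This is only a model under stated assumptions: `configure_jobserver` and `JobserverConfig` are hypothetical names, no semaphore is actually created or unlinked, and letting an inherited `LEAN_JOB_SEMAPHORE` take precedence over `LEAN_JOB_SEMAPHORE_AUTO` is an assumption rather than something the description states.

```python
from typing import NamedTuple, Optional


class JobserverConfig(NamedTuple):
    name: str             # semaphore name the process will use
    size: Optional[int]   # initial value; only known to the creating process
    gate_workers: bool    # whether this process gates its own workers


def configure_jobserver(env: dict, pid: int, max_std_workers: int) -> JobserverConfig:
    """Decide whether this process joins an existing semaphore or creates one."""
    existing = env.get("LEAN_JOB_SEMAPHORE")
    if existing:
        # Participant: a parent already exported a name, so gate our workers.
        return JobserverConfig(existing, None, True)
    # Creator: pick a fresh name, size it, and export it for child processes.
    name = f"/lean-jobs-{pid}"
    size = int(env.get("LEAN_JOB_SEMAPHORE_AUTO", max_std_workers))
    env["LEAN_JOB_SEMAPHORE"] = name
    # The creator does not gate itself; its workers mostly block on children.
    return JobserverConfig(name, size, False)
```

For example, calling it with an empty environment, pid 1234, and 8 workers yields the name `/lean-jobs-1234` with size 8 and no self-gating, while a child that inherits the updated environment becomes a gated participant.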
This PR avoids a thread-count cascade that the previous prototype provoked under heavy parallel elaboration. The earlier design released and re-acquired tokens through the global semaphore at every `Task.get` boundary; the blocking `sem_wait` on the re-acquire side let further `Task.get` calls inflate `m_max_std_workers` and spawn additional workers, multiplying the live OS thread count and tripping "failed to create thread" under `RLIMIT_NPROC`. In the new design, when a worker calls `Task.get`, it `sem_post`s its token globally so a sibling can pick up the blocked sub-task, then waits. Sibling `release_token` calls in the same process check whether a `Task.get` is actively waiting (registered on `m_parked_cv` after its `m_task_finished_cv.wait` returns) and, if so, hand the freed token directly to that waiter via `m_parked_cv` instead of `sem_post`. The waiter wakes without a blocking `sem_wait`, so the cascade cannot form. Excess releases (`m_parked_tokens >= m_parked_waiters`) still flow back to the global semaphore, so tokens aren't hoarded. Counting waiters only *after* `m_task_finished_cv.wait` is essential: counting them before would route releases to a pool nobody is listening on, starving the global semaphore and deadlocking workers that are blocked in `sem_wait`.
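The routing rule for a freed token can be modelled as follows. This is a simplified sketch, assuming the caller already holds the task manager's lock; the `TokenRouter` class and the `global_posts` counter (which stands in for real `sem_post` calls) are hypothetical, while the two counters mirror `m_parked_waiters` and `m_parked_tokens` from the description.

```python
class TokenRouter:
    """Models where release_token sends a freed token."""

    def __init__(self):
        self.parked_waiters = 0  # Task.get callers registered as waiting
        self.parked_tokens = 0   # tokens already handed to the parked pool
        self.global_posts = 0    # simulated sem_post calls

    def release_token(self) -> str:
        if self.parked_tokens < self.parked_waiters:
            # A waiter in this process still lacks a token: hand it over
            # directly instead of going through the global semaphore.
            self.parked_tokens += 1
            return "parked"
        # Excess release (parked_tokens >= parked_waiters): return the
        # token to the global semaphore so it is not hoarded.
        self.global_posts += 1
        return "global"
```

With no registered waiter every release goes global; with one waiter the first release is parked and the next one flows back to the global semaphore.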
Member
Author
!bench
Benchmark results for 046409e against 2cd9863 are in. There are significant results. @Kha
No significant changes detected.
mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request on May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request on May 27, 2026
This PR fixes a deadlock observed during a stage2 build of Lean and at sem=1 in nested `Task.get` smoke tests. The previous patch routed a freed token to the parked pool only when `m_parked_waiters > 0`, and counted the waiter only after `m_task_finished_cv.wait` returned. But the worker that resolves the sub-task holds the lock continuously through `resolve_core` and `release_token`, so the waiter cannot increment `m_parked_waiters` in between; the release always sees `waiters == 0` and `sem_post`s globally instead. The waiter then wakes up, finds `m_parked_tokens == 0`, and blocks on `m_parked_cv` forever because no further `release_token` is coming. With this patch, the waiter on wake-up first tries the parked pool (in case another in-process release happened to route there), then attempts a non-blocking `sem_trywait` to recover a token the racing release sent to the global semaphore. Only when both fail does it register as a parked waiter and block on `m_parked_cv`. This handles the race without widening the lock scope or changing the waiter-counting policy.
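The wake-up order described above can be sketched like this. It is a model, not the runtime code: `reclaim_on_wake` and the `pool` dictionary are hypothetical, `threading.Semaphore.acquire(blocking=False)` plays the role of `sem_trywait`, and the caller is assumed to hold the task manager's lock.

```python
import threading


def reclaim_on_wake(pool: dict, global_sem: threading.Semaphore) -> str:
    """Try to get a token back after the awaited sub-task has finished."""
    # 1. Parked pool first: another in-process release may have routed here.
    if pool["parked_tokens"] > 0:
        pool["parked_tokens"] -= 1
        return "parked-pool"
    # 2. Non-blocking attempt on the global semaphore, to recover a token
    #    that the racing release posted globally.
    if global_sem.acquire(blocking=False):
        return "global-trywait"
    # 3. Both failed: register as a parked waiter; the caller would now
    #    block on the parked condition variable.
    pool["parked_waiters"] += 1
    return "must-park"
```

Only the third outcome leads to a blocking wait, and by then the waiter is counted, so a later in-process release can find it.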
Member
Author
!bench
Benchmark results for f3c3b98 against 2cd9863 are in. There are significant results. @Kha
No significant changes detected.
mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request on May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request on May 27, 2026
…ersubscription This PR replaces the parked-pool + `sem_trywait`-fallback design with a simpler approach: when `wait_for` cannot reclaim a token non-blockingly, the worker continues running its task un-gated rather than blocking in `sem_wait`. A thread-local `g_holds_token` flag tracks whether the current worker actually has a token; the worker-loop's `release_token` skips its `sem_post` when the flag is false, keeping per-worker token accounting balanced. The previous design either deadlocked (when the in-process parked-pool notification couldn't reach a token that had been taken by another process) or risked re-introducing the original thread-explosion cascade (when the `sem_trywait` fallback hit blocking `sem_wait` under contention). The new design avoids both: no blocking call in `wait_for`'s reclaim, so the cascade can't form; and no in-process-only wakeup, so cross-process token freeing isn't missed. The cost is brief inter-process oversubscription: while a worker runs un-gated, the global cap is exceeded by one. This is bounded per worker by the depth of nested `Task.get` and clears as soon as the worker finishes its current task.
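A minimal model of the flag-based accounting, assuming a per-thread boolean comparable to `g_holds_token`: the `TokenGate` class and its method names are hypothetical, and an in-process `threading.Semaphore` stands in for the named semaphore.

```python
import threading


class TokenGate:
    """Per-thread token accounting with a non-blocking reclaim."""

    def __init__(self, global_sem: threading.Semaphore):
        self._sem = global_sem
        self._tls = threading.local()  # holds the per-thread flag

    def holds_token(self) -> bool:
        return getattr(self._tls, "holds", False)

    def acquire(self):
        # Worker loop: block until a token is available, then run a task.
        self._sem.acquire()
        self._tls.holds = True

    def release(self):
        # Worker loop: post only if this thread actually holds a token.
        if self.holds_token():
            self._tls.holds = False
            self._sem.release()

    def begin_wait(self):
        # Entering a blocking wait on a sub-task: give the token back.
        self.release()

    def end_wait(self):
        # Leaving the wait: try to reclaim without blocking. On failure the
        # worker keeps running un-gated and the flag stays False.
        self._tls.holds = self._sem.acquire(blocking=False)
```

If another participant takes the token during the wait, `end_wait` leaves the flag false, the later `release` posts nothing, and the semaphore count stays balanced once that other participant returns its token.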
Member
Author
!bench
Benchmark results for fefdfc2 against 2cd9863 are in. There are significant results. @Kha
No significant changes detected.
mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request on May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request on May 27, 2026