Skip to content

feat: experimental cross-process jobserver via POSIX semaphore#13856

Draft
Kha wants to merge 5 commits into
leanprover:masterfrom
Kha:push-sntnnpnmsktm
Draft

feat: experimental cross-process jobserver via POSIX semaphore#13856
Kha wants to merge 5 commits into
leanprover:masterfrom
Kha:push-sntnnpnmsktm

Conversation

@Kha
Copy link
Copy Markdown
Member

@Kha Kha commented May 26, 2026

This PR adds an opt-in cross-process parallelism limit to the Lean runtime. When LEAN_JOB_SEMAPHORE=/name points at a POSIX named semaphore, task_manager's standard workers acquire a token before running a task and release it after, so the total number of concurrently running standard workers across all participating processes is bounded by the semaphore's initial value.

When the env var is unset, behavior is unchanged and there is no overhead. Task.get releases its token while blocked and reacquires before resuming, so a worker waiting on a sub-task cannot starve the global pool. Dedicated workers (priority above LEAN_MAX_PRIO) and LEAN_SYNC_PRIO tasks bypass the worker loop and so do not consume tokens.

This is intended for experimentation, not production: Linux and macOS only (Windows is a no-op), no MAKEFLAGS parsing, no crash-recovery for tokens leaked by killed processes, and no Lake integration

@Kha
Copy link
Copy Markdown
Member Author

Kha commented May 26, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented May 26, 2026

Benchmark results for 387478f against 2cd9863 are in. There are significant results. @Kha

  • 🟥 build exited with code 1
  • 🟥 other exited with code 1

No significant changes detected.

Kha added 2 commits May 26, 2026 17:21
This PR adds an opt-in cross-process parallelism limit to the Lean
runtime. When `LEAN_JOB_SEMAPHORE=/name` points at a POSIX named
semaphore, `task_manager`'s standard workers acquire a token before
running a task and release it after, so the total number of concurrently
running standard workers across all participating processes is bounded
by the semaphore's initial value.

When the env var is unset, behavior is unchanged and there is no
overhead. `Task.get` releases its token while blocked and reacquires
before resuming, so a worker waiting on a sub-task cannot starve the
global pool. Dedicated workers (priority above `LEAN_MAX_PRIO`) and
`LEAN_SYNC_PRIO` tasks bypass the worker loop and so do not consume
tokens.

This is intended for experimentation, not production: Linux and macOS
only (Windows is a no-op), no `MAKEFLAGS` parsing, no crash-recovery
for tokens leaked by killed processes, and no Lake integration —
callers must create and destroy the semaphore themselves.
This PR makes the experimental jobserver self-bootstrapping: when no
`LEAN_JOB_SEMAPHORE` is set in the environment, `task_manager` now
creates a fresh named semaphore (`/lean-jobs-<pid>`) sized to
`max_std_workers`, exports the name via `LEAN_JOB_SEMAPHORE` so child
processes inherit it, and `sem_unlink`s on exit. `LEAN_JOB_SEMAPHORE_AUTO=N`
overrides the size.

The creating process does not gate its own workers against the
semaphore. The creator is typically an orchestrator (e.g. `lake`) whose
workers block on subprocesses; gating it would consume tokens that its
child `lean` processes need, deadlocking the pool.

Together with the previous patch this means `lake build` participates
in cross-process parallelism limiting with no command-line changes.
@github-actions github-actions Bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label May 26, 2026
@leanprover-bot
Copy link
Copy Markdown
Collaborator

Reference manual CI status:

  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-05-25 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-05-26 18:11:24)

@github-actions github-actions Bot added the mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN label May 26, 2026
@mathlib-lean-pr-testing mathlib-lean-pr-testing Bot added the builds-mathlib CI has verified that Mathlib builds against this PR label May 26, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

mathlib-lean-pr-testing Bot commented May 26, 2026

Mathlib CI status (docs):

This PR avoids a thread-count cascade that the previous prototype
provoked under heavy parallel elaboration. The earlier design released
and re-acquired tokens through the global semaphore at every
`Task.get` boundary; the blocking `sem_wait` on the re-acquire side
let further `Task.get` calls inflate `m_max_std_workers` and spawn
additional workers, multiplying the live OS thread count and tripping
"failed to create thread" under `RLIMIT_NPROC`.

In the new design, when a worker calls `Task.get`, it `sem_post`s its
token globally so a sibling can pick up the blocked sub-task, then
waits. Sibling `release_token` calls in the same process check whether
a `Task.get` is actively waiting (registered on `m_parked_cv` after
its `m_task_finished_cv.wait` returns) and, if so, hand the freed
token directly to that waiter via `m_parked_cv` instead of
`sem_post`. The waiter wakes without a blocking `sem_wait`, so the
cascade cannot form. Excess releases (`m_parked_tokens >=
m_parked_waiters`) still flow back to the global semaphore, so tokens
aren't hoarded.

Counting waiters only *after* `m_task_finished_cv.wait` is essential:
counting them before would route releases to a pool nobody is
listening on, starving the global semaphore and deadlocking workers
that are blocked in `sem_wait`.
@Kha Kha force-pushed the push-sntnnpnmsktm branch from 387478f to 046409e Compare May 26, 2026 21:03
@Kha
Copy link
Copy Markdown
Member Author

Kha commented May 26, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented May 26, 2026

Benchmark results for 046409e against 2cd9863 are in. There are significant results. @Kha

  • 🟥 build exited with code 137
  • 🟥 other exited with code 137

No significant changes detected.

mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request May 27, 2026
This PR fixes a deadlock observed during a stage2 build of Lean and at
sem=1 in nested `Task.get` smoke tests. The previous patch routed a
freed token to the parked pool only when `m_parked_waiters > 0`, and
counted the waiter only after `m_task_finished_cv.wait` returned. But
the worker that resolves the sub-task holds the lock continuously
through `resolve_core` and `release_token`, so the waiter cannot
increment `m_parked_waiters` in between — the release always sees
`waiters == 0` and `sem_post`s globally instead. The waiter then woke
up, found `m_parked_tokens == 0`, and blocked on `m_parked_cv` forever
because no further `release_token` was coming.

On wake-up, the waiter now tries the parked pool first (in case
another in-process release happened to route there), then attempts a
non-blocking `sem_trywait` to recover a token the racing release sent
to the global semaphore. Only when both fail does it register as a
parked waiter and block on `m_parked_cv`. This handles the race
without widening the lock scope or changing the waiter-counting
policy.
@Kha
Copy link
Copy Markdown
Member Author

Kha commented May 27, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented May 27, 2026

Benchmark results for f3c3b98 against 2cd9863 are in. There are significant results. @Kha

  • 🟥 build exited with code 137
  • 🟥 other exited with code 137

No significant changes detected.

mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request May 27, 2026
@mathlib-lean-pr-testing mathlib-lean-pr-testing Bot removed the builds-mathlib CI has verified that Mathlib builds against this PR label May 27, 2026
…ersubscription

This PR replaces the parked-pool + `sem_trywait`-fallback design with a
simpler approach: when `wait_for` cannot reclaim a token non-blockingly,
the worker continues running its task un-gated rather than blocking in
`sem_wait`. A thread-local `g_holds_token` flag tracks whether the
current worker actually has a token; the worker-loop's `release_token`
skips its `sem_post` when the flag is false, keeping per-worker token
accounting balanced.

The previous design either deadlocked (when the in-process parked-pool
notification couldn't reach a token that had been taken by another
process) or risked re-introducing the original thread-explosion cascade
(when the `sem_trywait` fallback hit blocking `sem_wait` under
contention). The new design avoids both: no blocking call in
`wait_for`'s reclaim, so the cascade can't form; and no in-process-only
wakeup, so cross-process token freeing isn't missed.

The cost is brief inter-process oversubscription: while a worker runs
un-gated, the global cap is exceeded by one. This is bounded per worker
by the depth of nested `Task.get` and clears as soon as the worker
finishes its current task.
@Kha
Copy link
Copy Markdown
Member Author

Kha commented May 27, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented May 27, 2026

Benchmark results for fefdfc2 against 2cd9863 are in. There are significant results. @Kha

  • 🟥 build exited with code 1
  • 🟥 other exited with code 1

No significant changes detected.

mathlib-nightly-testing Bot pushed a commit to leanprover-community/batteries that referenced this pull request May 27, 2026
mathlib-nightly-testing Bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants