
ci: dynamically limit parallel build jobs to prevent OOM errors #3441

Open
prasadn1 wants to merge 2 commits into ml-explore:main from prasadn1:fix-ci-oom

Conversation

@prasadn1

Heavy C++ compilation (especially linking) can exhaust memory on GitHub Actions runners with high core counts but limited RAM, leading to intermittent OOM failures.

This introduces a cross-platform Python script that calculates a safe -j parallel build limit based on the system's available memory, ensuring builds scale safely across different runner types.

Proposed changes

Please include a description of the problem or feature this PR is addressing. If there is a corresponding issue, include the issue #.

No issue number; pulled this from internal bug tracking.

The Problem:
The current build configuration uses unbounded parallelism based on CPU core count (e.g., -j $(nproc) and %NUMBER_OF_PROCESSORS%). Because heavy C++ compilation and linking for ML kernels require significant
RAM per job (often 3-4GB per core), runners with high core counts but limited RAM (e.g., 8 cores / 16GB RAM) attempt to launch too many heavy processes simultaneously, exceeding the memory limit and causing
the build to crash.

The Solution:
This PR introduces a cross-platform Python utility (.github/scripts/set_cmake_parallel.py) that acts as a "defense in depth" system fix:

  1. It dynamically queries the system's total physical memory.
  2. It calculates a memory-safe parallel job limit (reserving ~3-4GB per job for Windows/MSVC and ~3GB per job for Linux/GCC).
  3. It caps the build by setting the CMAKE_BUILD_PARALLEL_LEVEL environment variable.

This ensures the build system remains hardware-aware, protecting the current 16GB runners from OOMs while allowing the build to automatically and safely scale up if/when higher-spec 32GB or 64GB runners are
provisioned.
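The calculation the script performs can be sketched as follows. This is a minimal illustration, not the actual contents of `.github/scripts/set_cmake_parallel.py`; the function name, the per-job memory figures, and the use of hard-coded runner values are all illustrative:

```python
import os

def compute_job_limit(total_mem_gb: float, cpu_count: int, gb_per_job: float) -> int:
    """Return a parallel job count bounded by both CPU cores and physical RAM.

    Never returns less than 1 so the build can always make progress.
    """
    mem_limited_jobs = int(total_mem_gb // gb_per_job)
    return max(1, min(cpu_count, mem_limited_jobs))

if __name__ == "__main__":
    # Illustrative values: an 8-core / 16 GB runner with MSVC (~4 GB per job).
    # The real script would query the host's actual core count and memory.
    jobs = compute_job_limit(total_mem_gb=16, cpu_count=8, gb_per_job=4)
    # CMake honors this variable when invoked as `cmake --build` without -j.
    os.environ["CMAKE_BUILD_PARALLEL_LEVEL"] = str(jobs)
    print(jobs)
```

On the 8-core / 16 GB example above this yields 4 jobs instead of 8, while a 64 GB runner with the same core count would stay CPU-bound at 8.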

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works (Note: CI configuration changes are difficult to unit-test. I have verified the Darwin logic locally, but the Linux/Windows branches will need to be verified by the CI run on this PR).
  • [ ] I have updated the necessary documentation (if needed)

Collaborator

@zcbenz zcbenz left a comment


We tried limiting parallel builds before, but it made compilation times unbearably long. The special thing in our case is that only a few kernels take insane amounts of RAM to build, so using swap space ends up being a better solution for us:

- name: Set swap space
  if: ${{ startsWith(inputs.toolkit, 'cuda') }}
  uses: pierotofy/set-swap-space@fc79b3f67fa8a838184ce84a674ca12238d2c761
  with:
    swap-size-gb: 16

@prasadn1
Author

If swap is the preferred stopgap, I can update this PR to expand the Windows pagefile during setup so it mirrors the Linux workaround. Alternatively, if you want a stricter build-system fix, we could use CMake Job Pools. We could define a heavy_compilation_pool limited to 1 or 2 jobs and explicitly assign only those specific memory-hungry kernels to it. That allows the other 95% of the framework to safely compile at -j$(nproc) while ensuring the heavy kernels serialize. This should prevent both OOMs and swap-thrashing without penalizing total CI time. Are either of these options helpful?

@zcbenz
Collaborator

zcbenz commented Apr 23, 2026

I didn't know about CMake Job Pools before, and it sounds like a perfect solution! Most of the heavy kernels are under backend/cuda/quantized/qmm.

For Windows we are only doing a CPU build though, because the free runner does not have enough disk space for installing the CUDA toolkit and building.

Replaces the global -j parallelism limit with CMake JOB_POOLS.
Global -j limits artificially starve the CPU during the shallow parts of the build tree. Instead, this defines a 'heavy_compilation_pool' (max 2 jobs) and explicitly assigns the massive qmm generated targets to it.

This allows the vast majority of the framework to compile at -j$(nproc), while mathematically bounding the memory footprint of the heaviest template instantiations, preventing OOMs without relying on OS swap space.
@prasadn1
Author

I've pivoted the PR to implement a strict build-system fix using CMake JOB_POOLS (supported by the Ninja generator).

  1. I defined a heavy_compilation_pool with a concurrency limit of 2 in the root CMakeLists.txt.
  2. I explicitly assigned the generated .cu and .cpp targets in backend/cuda/quantized/qmm to this pool.

This should allow 95% of the framework to safely compile at -j$(nproc), keeping the build fast, while mathematically guaranteeing that the heaviest template instantiations serialize.
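A sketch of what the pool assignment might look like; the target name below is illustrative, not the actual generated target in the repository, and JOB_POOL_COMPILE is only honored by the Ninja generator:

```cmake
# Declare a pool capped at 2 concurrent compile jobs (Ninja generator only).
set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)

# Assign only the memory-hungry targets to the pool; everything else
# still compiles at full -j parallelism.
# "mlx_qmm_kernels" is a hypothetical target name for illustration.
set_property(TARGET mlx_qmm_kernels
             PROPERTY JOB_POOL_COMPILE heavy_compilation_pool)
```

CMake also offers JOB_POOL_LINK for the analogous case where linking, rather than compilation, is the memory hog.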

Regarding Windows: do you happen to know which specific files/targets in the core C++ or CPU/Metal backends are causing the memory spikes on the Windows and ARM runners? If we can identify the heaviest targets there, we can assign them to this heavy_compilation_pool as well.

Review thread on CMakeLists.txt (outdated)

# Define a job pool for heavy template metaprogramming tasks (e.g., quantized matmul)
# Limit to 2 concurrent jobs to prevent OOM on standard GitHub Actions runners (16GB RAM)
set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)
Collaborator


Can you enable the job pool only when the CI environment variable is true?

Author


Done

@zcbenz
Collaborator

zcbenz commented Apr 24, 2026

Regarding windows: Do you happen to know which specific files/targets in the core C++ or CPU/Metal backends are causing the memory spikes on the Windows and ARM runners? If we can identify the heaviest targets there, we can assign them to this heavy_compilation_pool as well

The CPU/Metal backends are totally fine building on CI, this is only a CUDA kernel problem.

Gate the job pool and pool assignments behind `if(DEFINED ENV{CI})`
so local developer builds retain full parallelism.
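The gating described in this commit might look like the following sketch; the target name is again hypothetical, and `DEFINED ENV{CI}` matches how GitHub Actions exposes the CI variable:

```cmake
# Only cap heavy targets on CI runners; local developer builds
# keep full parallelism.
if(DEFINED ENV{CI})
  set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)
  set_property(TARGET mlx_qmm_kernels
               PROPERTY JOB_POOL_COMPILE heavy_compilation_pool)
endif()
```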
