
ci: dynamically limit parallel build jobs to prevent OOM errors #3441

Open
prasadn1 wants to merge 2 commits into ml-explore:main from prasadn1:fix-ci-oom

Conversation

@prasadn1

Heavy C++ compilation (especially linking) can exhaust memory on GitHub Actions runners with high core counts but limited RAM, leading to intermittent OOM failures.

This introduces a cross-platform Python script that calculates a safe -j parallel build limit based on the system's available memory, ensuring builds scale safely across different runner types.

Proposed changes

Please include a description of the problem or feature this PR is addressing. If there is a corresponding issue, include the issue #.

No issue number; pulled this from internal bug tracking.

The Problem:
The current build configuration uses unbounded parallelism based on CPU core count (e.g., -j $(nproc) and %NUMBER_OF_PROCESSORS%). Because heavy C++ compilation and linking for ML kernels require significant
RAM per job (often 3-4GB per core), runners with high core counts but limited RAM (e.g., 8 cores / 16GB RAM) attempt to launch too many heavy processes simultaneously, exceeding the memory limit and causing
the build to crash.

The Solution:
This PR introduces a cross-platform Python utility (.github/scripts/set_cmake_parallel.py) that acts as a "defense in depth" system fix:

  1. It dynamically queries the system's total physical memory.
  2. It calculates a memory-safe parallel job limit (reserving ~3-4GB per job for Windows/MSVC and ~3GB per job for Linux/GCC).
  3. It caps the build by setting the CMAKE_BUILD_PARALLEL_LEVEL environment variable.

This ensures the build system remains hardware-aware, protecting the current 16GB runners from OOMs while allowing the build to automatically and safely scale up if/when higher-spec 32GB or 64GB runners are
provisioned.
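The calculation the script performs can be sketched as follows. This is a minimal illustration, not the actual contents of `.github/scripts/set_cmake_parallel.py`; the function name, the per-job memory figures, and the use of hard-coded runner values are all illustrative:

```python
import os

def compute_job_limit(total_mem_gb: float, cpu_count: int, gb_per_job: float) -> int:
    """Return a parallel job count bounded by both CPU cores and physical RAM.

    Never returns less than 1 so the build can always make progress.
    """
    mem_limited_jobs = int(total_mem_gb // gb_per_job)
    return max(1, min(cpu_count, mem_limited_jobs))

if __name__ == "__main__":
    # Illustrative values: an 8-core / 16 GB runner with MSVC (~4 GB per job).
    # The real script would query the host's actual core count and memory.
    jobs = compute_job_limit(total_mem_gb=16, cpu_count=8, gb_per_job=4)
    # CMake honors this variable when invoked as `cmake --build` without -j.
    os.environ["CMAKE_BUILD_PARALLEL_LEVEL"] = str(jobs)
    print(jobs)
```

On the 8-core / 16 GB example above this yields 4 jobs instead of 8, while a 64 GB runner with the same core count would stay CPU-bound at 8.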

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works (Note: CI configuration changes are difficult to unit-test. I have verified the Darwin logic locally, but the Linux/Windows branches will need to be verified by the CI run on this PR).
  • [ ] I have updated the necessary documentation (if needed)

Collaborator

@zcbenz zcbenz left a comment


We tried limiting parallel builds before, but it made compilation times unbearably long. The special thing in our case is that only a few kernels take insane amounts of RAM to build, so using swap space ends up being a better solution for us:

- name: Set swap space
  if: ${{ startsWith(inputs.toolkit, 'cuda') }}
  uses: pierotofy/set-swap-space@fc79b3f67fa8a838184ce84a674ca12238d2c761
  with:
    swap-size-gb: 16

@prasadn1
Author

If swap is the preferred stopgap, I can update this PR to expand the Windows pagefile during setup so it mirrors the Linux workaround. Alternatively, if you want a stricter build-system fix, we could use CMake Job Pools. We could define a heavy_compilation_pool limited to 1 or 2 jobs and explicitly assign only those specific memory-hungry kernels to it. That allows the other 95% of the framework to safely compile at -j$(nproc) while ensuring the heavy kernels serialize. This should prevent both OOMs and swap-thrashing without penalizing total CI time. Are either of these options helpful?

@zcbenz
Collaborator

zcbenz commented Apr 23, 2026

I didn't know about CMake Job Pools before, and it sounds like a perfect solution! Most of the heavy kernels are under backend/cuda/quantized/qmm.

For Windows we are only doing a CPU build though, because the free runner does not have enough disk space for installing the CUDA toolkit and building.

Replaces the global -j parallelism limit with CMake JOB_POOLS.
Global -j limits artificially starve the CPU during the shallow parts of the build tree. Instead, this defines a 'heavy_compilation_pool' (max 2 jobs) and explicitly assigns the massive qmm generated targets to it.

This allows the vast majority of the framework to compile at -j$(nproc), while mathematically bounding the memory footprint of the heaviest template instantiations, preventing OOMs without relying on OS swap space.
@prasadn1
Author

I've pivoted the PR to implement a strict build-system fix using CMake JOB_POOLS (supported by the Ninja generator).

  1. I defined a heavy_compilation_pool with a concurrency limit of 2 in the root CMakeLists.txt.
  2. I explicitly assigned the generated .cu and .cpp targets in backend/cuda/quantized/qmm to this pool.

This should allow 95% of the framework to safely compile at -j$(nproc), keeping the build fast, while mathematically guaranteeing that the heaviest template instantiations serialize.
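A sketch of what the pool assignment might look like; the target name below is illustrative, not the actual generated target in the repository, and JOB_POOL_COMPILE is only honored by the Ninja generator:

```cmake
# Declare a pool capped at 2 concurrent compile jobs (Ninja generator only).
set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)

# Assign only the memory-hungry targets to the pool; everything else
# still compiles at full -j parallelism.
# "mlx_qmm_kernels" is a hypothetical target name for illustration.
set_property(TARGET mlx_qmm_kernels
             PROPERTY JOB_POOL_COMPILE heavy_compilation_pool)
```

CMake also offers JOB_POOL_LINK for the analogous case where linking, rather than compilation, is the memory hog.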

Regarding Windows: do you happen to know which specific files/targets in the core C++ or CPU/Metal backends are causing the memory spikes on the Windows and ARM runners? If we can identify the heaviest targets there, we can assign them to this heavy_compilation_pool as well.

Review thread on CMakeLists.txt (outdated)

# Define a job pool for heavy template metaprogramming tasks (e.g., quantized matmul)
# Limit to 2 concurrent jobs to prevent OOM on standard GitHub Actions runners (16GB RAM)
set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)
Collaborator


Can you enable the job pool only when the CI environment variable is true?

Author


Done

@zcbenz
Collaborator

zcbenz commented Apr 24, 2026

Regarding windows: Do you happen to know which specific files/targets in the core C++ or CPU/Metal backends are causing the memory spikes on the Windows and ARM runners? If we can identify the heaviest targets there, we can assign them to this heavy_compilation_pool as well

The CPU/Metal backends are totally fine building on CI, this is only a CUDA kernel problem.

Gate the job pool and pool assignments behind `if(DEFINED ENV{CI})`
so local developer builds retain full parallelism.
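The gating described in this commit might look like the following sketch; the target name is again hypothetical, and `DEFINED ENV{CI}` matches how GitHub Actions exposes the CI variable:

```cmake
# Only cap heavy targets on CI runners; local developer builds
# keep full parallelism.
if(DEFINED ENV{CI})
  set_property(GLOBAL PROPERTY JOB_POOLS heavy_compilation_pool=2)
  set_property(TARGET mlx_qmm_kernels
               PROPERTY JOB_POOL_COMPILE heavy_compilation_pool)
endif()
```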
