Description
Name of failing test
uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5' && uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2' && pytest -v -s models/language/generation -m hybrid_model --num-shards=$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$BUILDKITE_PARALLEL_JOB
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
Parallel test execution in Shard 4 - C++ extension compilation failure during test runtime
Failure: RuntimeError during JIT compilation - "Error compiling objects for extension"
Stack trace highlights:
- torch/utils/cpp_extension.py:2612 in _run_ninja_build
- _write_ninja_file_and_compile_objects → ninja build process
- Extension compilation through setuptools/Cython build_ext
Configuration: Parallel pytest execution (shard 4 of multi-shard run)
Likely cause: JIT compilation failure for PyTorch custom extensions on ROCm. When vLLM imports model code, PyTorch attempts to compile custom CUDA/ROCm kernels on the fly using ninja. The compilation fails on ROCm, possibly due to:
- missing ROCm compilation toolchain components (hipcc, rocm-dev packages)
- incompatible compiler flags between CUDA and ROCm compilation paths
- parallel shard conflicts where multiple test processes simultaneously attempt to compile the same extension to the same cache location, or
- insufficient memory/resources during parallel compilation.
The "One of the processes failed with 1" message confirms a process crash during the build phase.
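To help narrow down the first and third candidate causes, here is a minimal sketch that checks whether the build tools are on PATH and gives each shard its own extension cache directory. The helper names and the `vllm_ci` base directory are hypothetical. `TORCH_EXTENSIONS_DIR` is the environment variable that `torch.utils.cpp_extension` reads for JIT builds, so this only helps if the failing build goes through that JIT path; builds driven by setuptools `build_ext` use their own build directory and would not be affected.

```python
import os
import shutil
import tempfile


def find_missing_tools(tools=("hipcc", "ninja")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def shard_extensions_dir(base_dir, shard_id):
    """Build a per-shard cache path so shards never share a build directory."""
    return os.path.join(base_dir, f"torch_extensions_shard_{shard_id}")


if __name__ == "__main__":
    # BUILDKITE_PARALLEL_JOB is the shard id already used in the test command.
    shard_id = os.environ.get("BUILDKITE_PARALLEL_JOB", "0")

    # Hypothetical base location; any writable directory would do.
    base_dir = os.path.join(tempfile.gettempdir(), "vllm_ci")
    ext_dir = shard_extensions_dir(base_dir, shard_id)
    os.makedirs(ext_dir, exist_ok=True)

    # Must be set before the extension is built for it to take effect.
    os.environ["TORCH_EXTENSIONS_DIR"] = ext_dir

    missing = find_missing_tools()
    if missing:
        print("Missing build tools:", ", ".join(missing))
    print("Extension cache for this shard:", ext_dir)
```

If the failure disappears with per-shard directories, that points toward the cache conflict; if `hipcc` or `ninja` is reported missing, the toolchain explanation becomes more likely. Neither outcome is conclusive on its own.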
📝 History of failing test
AMD-CI build Buildkite references:
- 1041
- 1077
- 1088
- 1109
- 1111
CC List.
No response