
tests: skip local NVML runtime mismatches while preserving CI failures#1739

Open
cpcloud wants to merge 4 commits into NVIDIA:main from cpcloud:fix/nvml-local-skip-ci-fail

Conversation

@cpcloud
Contributor

@cpcloud cpcloud commented Mar 7, 2026

Summary

  • Centralize NVML runtime gating through require_nvml_runtime_or_skip_local fixtures in cuda_bindings and cuda_core tests.
  • Treat NVML init/load failures (including driver/library mismatch and missing NVML shared library) as local skips, but continue to fail in CI by re-raising.
  • Cover the behavior with regression tests and apply fixture-based gating across NVML-dependent test modules/fixtures.
  • Rationale: after a driver upgrade without a reboot (a common local developer state), NVML can report a temporary driver/library mismatch; local runs should skip NVML-dependent tests instead of failing collection, while CI should still fail fast for real infra regressions.
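The skip-locally/fail-in-CI gating summarized above can be sketched as a pytest fixture. This is a minimal sketch: the exception classes and the `nvml_init` probe below are illustrative stand-ins for the real error types and init call in the bindings, not the actual API.

```python
import os

import pytest

# Illustrative stand-ins for the NVML init/load failures described above;
# the real bindings raise their own error types.
class NvmlDriverMismatchError(RuntimeError):
    """Driver/library version mismatch, e.g. after a driver upgrade without a reboot."""

class NvmlLibraryNotFoundError(RuntimeError):
    """The NVML shared library could not be loaded."""

LOCAL_SKIP_ERRORS = (NvmlDriverMismatchError, NvmlLibraryNotFoundError)

def classify_nvml_failure(exc, *, in_ci):
    """Decide how the fixture should react to an NVML init failure."""
    if isinstance(exc, LOCAL_SKIP_ERRORS) and not in_ci:
        return "skip"   # local run: skip NVML-dependent tests cleanly
    return "raise"      # CI, or an unexpected error: fail fast

def nvml_init():
    # Hypothetical probe standing in for the bindings' real NVML init call.
    raise NvmlDriverMismatchError("driver/library version mismatch")

@pytest.fixture
def require_nvml_runtime_or_skip_local():
    try:
        nvml_init()
    except Exception as exc:
        if classify_nvml_failure(exc, in_ci=bool(os.environ.get("CI"))) == "skip":
            pytest.skip(f"NVML runtime unavailable locally: {exc}")
        raise
```

Keeping the decision in a small pure function makes the local-vs-CI behavior easy to cover with regression tests, independent of any real NVML installation.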

Test plan

  • pixi run --manifest-path cuda_bindings pytest cuda_bindings/tests --override-ini norecursedirs=examples -k "not test_cufile"
  • CI=1 pixi run --manifest-path cuda_bindings pytest cuda_bindings/tests/nvml/test_init.py::test_init_ref_count (expected error on NVML mismatch in CI mode)
  • pixi run --manifest-path cuda_core test (currently blocked in this workspace by unrelated import mismatch: cuda.core._resource_handles does not export expected C function create_culink_handle)
  • CI=1 pixi run --manifest-path cuda_core pytest cuda_core/tests/system/test_system_system.py::test_num_devices (same unrelated import mismatch blocker)

Made with Cursor

@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 7, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


cpcloud added 3 commits March 7, 2026 11:51

Driver upgrades without a reboot can temporarily leave NVML in a driver/library mismatch state, which is a common local developer scenario. Route NVML-dependent checks through shared fixtures/helpers so local runs skip cleanly while CI still fails fast on real NVML init/load regressions.

Made-with: Cursor

Run repository hooks and keep the NVML fixture changes compliant by applying ruff import ordering and formatting adjustments.

Made-with: Cursor

Apply hook-driven import ordering/spacing updates introduced by rebasing onto upstream/main so pre-commit passes cleanly.

Made-with: Cursor
@cpcloud force-pushed the fix/nvml-local-skip-ci-fail branch from 3ae1ece to 4272269 on March 7, 2026 16:52
@cpcloud
Contributor Author

cpcloud commented Mar 7, 2026

/ok to test

@github-actions

github-actions bot commented Mar 7, 2026

Contributor

@mdboom left a comment

This looks great, in terms of centralizing all of this logic.

However, I'm not sure why some tests unrelated to NVML now have this tagging.

And within system/test_system_system.py, we need to keep most of those tests still running even when NVML is totally missing.



@pytest.mark.parametrize("change_device", [True, False])
@pytest.mark.usefixtures("require_nvml_runtime_or_skip_local")
Contributor

Why do the tests here now require a working NVML? These tests predate NVML in cuda_bindings... What's the root cause?

Contributor Author

I'll look into why this ends up being required.

Contributor Author

get_num_devices will use NVML if it's available. So, yeah, they're pre-existing tests, but they're hitting other APIs now.

Now that these route through NVML when it's available, they're a place where we need to skip when NVML is present but fails in an expected way.

What's the root cause?

The root cause is that I upgraded the driver without rebooting. Since NVML is a driver library, I can no longer use it without a reboot.

I don't want to reboot to keep working in the repo, especially if I'm working on something unrelated to any of this code.
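The situation described above can be sketched as follows: a device-count helper that prefers NVML whenever the library loads, so a loadable-but-mismatched NVML surfaces an error in tests that predate NVML. All names here are illustrative stand-ins, not the real cuda_bindings API.

```python
class NvmlError(RuntimeError):
    pass

NVML_LIBRARY_PRESENT = True  # the shared library still loads after an upgrade...

def nvml_device_count():
    # ...but initialization fails until the machine is rebooted
    raise NvmlError("driver/library version mismatch")

def cuda_device_count():
    return 2  # placeholder value from the CUDA driver path

def get_num_devices():
    if NVML_LIBRARY_PRESENT:
        return nvml_device_count()  # mismatch error propagates into the test
    return cuda_device_count()
```

Because the NVML path is taken whenever the library is present, the mismatch error reaches tests that never mentioned NVML, which is what the fixture-based gating is meant to absorb locally.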


from .conftest import skip_if_nvml_unsupported

pytestmark = skip_if_nvml_unsupported
Contributor

Most of the tests in this file are expected to run, other than test_gpu_driver_version, even without an NVML available.

Contributor Author

I'll look into why this ends up being required.

Contributor Author

Not required

Remove the module-level pytestmark from test_system_system.py and the
per-test require_nvml_runtime_or_skip_local markers from test_memory.py.
These tests don't inherently need NVML; the NVML-specific tests already
have individual @skip_if_nvml_unsupported decorators.

Made-with: Cursor
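The per-test marking that commit moves to can be sketched as below. The availability probe is a hypothetical stand-in for the real check in the tests' conftest; the test names mirror the ones discussed above.

```python
import pytest

def _nvml_available():
    # Hypothetical probe; the real check lives in the tests' conftest.
    return False

# A module-level `pytestmark = skip_if_nvml_unsupported` would skip every test
# in the file; decorating only the NVML-specific tests keeps the rest running.
skip_if_nvml_unsupported = pytest.mark.skipif(
    not _nvml_available(), reason="NVML runtime unavailable"
)

@skip_if_nvml_unsupported
def test_gpu_driver_version():
    pass  # needs a working NVML runtime

def test_num_devices():
    pass  # still collected and run even when NVML is missing
```

This keeps most of `test_system_system.py` exercising the CUDA paths even on machines where NVML is missing or mismatched, while only the genuinely NVML-dependent tests skip.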