fix(pt): select device before loading torch ops by njzjz-bot · Pull Request #5612 · deepmodeling/deepmd-kit

njzjz-bot · 2026-06-30T01:46:50Z

Summary

- Add a shared PyTorch backend helper that selects the rank-local CUDA/HIP device before PyTorch CUDA queries or custom-op loading can create a default context.
- Move the PT C++ model init paths (`DeepPotPT`, `DeepSpinPT`, `DeepTensorPT`, `DeepPotPTExpt`, `DeepSpinPTExpt`) to call this helper before `deepmd::load_op_library()`.
- This should avoid each MPI rank leaving a small unused context on GPU 0 while preserving CPU fallback behavior.
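The ordering described in the summary can be sketched in isolation. This is only an illustration under stated assumptions: `FakeRuntime`, `preselect_device_sketch`, and `init_sketch` are hypothetical stand-ins rather than deepmd-kit, CUDA/HIP, or PyTorch APIs, and the `gpu_rank % device_count` mapping is an assumption that the PR text does not spell out. The sketch also derives `gpu_enabled` from the device count, whereas the walkthrough says the real helper uses `torch::cuda::is_available()`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU runtime; not a real CUDA/HIP/Torch API.
struct FakeRuntime {
  int device_count = 0;             // number of visible GPUs
  int current_device = 0;           // device that a new context would land on
  std::vector<int> contexts;        // devices on which a context was created
  std::vector<std::string> events;  // call order, kept for inspection

  void set_device(int id) {
    current_device = id;
    events.push_back("set_device");
  }

  // Loading a GPU-aware library creates a context on whichever device
  // is current at that moment.
  void load_op_library() {
    if (device_count > 0) {
      contexts.push_back(current_device);
    }
    events.push_back("load_op_library");
  }
};

// Sketch of the helper: choose the rank-local device before anything else
// touches the GPU. Assumption: the mapping is gpu_rank % device_count.
inline void preselect_device_sketch(FakeRuntime& rt, int gpu_rank,
                                    int& gpu_id, bool& gpu_enabled) {
  assert(gpu_rank >= 0);
  gpu_enabled = rt.device_count > 0;
  gpu_id = gpu_enabled ? gpu_rank % rt.device_count : 0;
  if (gpu_enabled) {
    rt.set_device(gpu_id);
  }
}

// Sketch of an init path: helper first, op-library loading second.
// Returns the selected GPU id, or -1 when falling back to CPU.
inline int init_sketch(FakeRuntime& rt, int gpu_rank) {
  int gpu_id = 0;
  bool gpu_enabled = false;
  preselect_device_sketch(rt, gpu_rank, gpu_id, gpu_enabled);
  rt.load_op_library();
  return gpu_enabled ? gpu_id : -1;
}
```

With four visible devices and rank 6, the sketch records its only context on device 2. Calling `load_op_library()` before `set_device()` would have recorded device 0 instead, which is the symptom this PR targets. With no visible devices, no context is recorded and the sketch reports CPU fallback.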

Verification

- `git diff --check HEAD~1..HEAD`
- Static check: verified that `preselect_torch_device` precedes `deepmd::load_op_library()` in all touched PT init paths.

Fixes #4171

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Summary by CodeRabbit

Bug Fixes
- Improved device selection so models start on the correct GPU more reliably.
- Reduced cases where GPU-enabled environments could accidentally fall back to the wrong device.
- CPU fallback behavior remains intact when no supported GPU is available.

Preselect the CUDA/HIP device from the rank-local GPU before PyTorch CUDA queries or torch custom-op loading can create a default-device context. This avoids each MPI rank leaving an unused context on GPU 0 in multi-GPU LAMMPS runs.

Fixes deepmodeling#4171

Authored by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5)

Signed-off-by: njzjz-bot (driven by OpenClaw (model: custom-chat-jinzhezeng-group/gpt-5.5))[bot] <48687836+njzjz-bot@users.noreply.github.com>

coderabbitai · 2026-06-30T01:51:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 73de44b and 050f97e.

📒 Files selected for processing (6)

source/api_cc/include/commonPT.h
source/api_cc/src/DeepPotPT.cc
source/api_cc/src/DeepPotPTExpt.cc
source/api_cc/src/DeepSpinPT.cc
source/api_cc/src/DeepSpinPTExpt.cc
source/api_cc/src/DeepTensorPT.cc

📝 Walkthrough

A new inline helper deepmd::preselect_torch_device is added to commonPT.h. It centralizes rank-local GPU selection (via CUDA/ROCm or Torch APIs) and sets gpu_id/gpu_enabled. All five PyTorch model init functions (DeepPotPT, DeepPotPTExpt, DeepSpinPT, DeepSpinPTExpt, DeepTensorPT) replace their duplicated inline GPU-selection logic with a single call to this helper.

Changes

GPU Preselection Refactor

| Layer / File(s) | Summary |
| --- | --- |
| `preselect_torch_device` helper: `source/api_cc/include/commonPT.h` | Adds `#include "device.h"` and defines `deepmd::preselect_torch_device`, selecting a rank-local GPU via `DPGetDeviceCount`/`DPSetDevice` (under `GOOGLE_CUDA`/`TENSORFLOW_USE_ROCM`) or `torch::cuda::device_count()`, and assigning `gpu_enabled` from `torch::cuda::is_available()`. |
| Adopt helper in all init paths: `source/api_cc/src/DeepPotPT.cc`, `source/api_cc/src/DeepPotPTExpt.cc`, `source/api_cc/src/DeepSpinPT.cc`, `source/api_cc/src/DeepSpinPTExpt.cc`, `source/api_cc/src/DeepTensorPT.cc` | Each `::init` replaces its inline `torch::cuda::device_count()` / `torch::cuda::is_available()` / `DPSetDevice` block with `preselect_torch_device(gpu_rank, gpu_id, gpu_enabled)`. `DeepTensorPT.cc` additionally adds the `commonPT.h` include. Subsequent CUDA-vs-CPU device construction and logging remain unchanged. |
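The compile-time choice of where the device count comes from, as the table describes it, might look roughly like the sketch below. All names are placeholders: `SKETCH_HAVE_NATIVE_RUNTIME` plays the role of `GOOGLE_CUDA`/`TENSORFLOW_USE_ROCM`, while `native_device_count` and `framework_device_count` are hypothetical stand-ins for `DPGetDeviceCount` and `torch::cuda::device_count()`. No real signature is implied, and the fixed return values exist only so the sketch compiles on its own.

```cpp
// Hypothetical placeholder for a native CUDA/ROCm device-count query.
inline int native_device_count() { return 2; }

// Hypothetical placeholder for a framework-level device-count query.
inline int framework_device_count() { return 1; }

// Pick the query at compile time: prefer the native runtime when the build
// enables it, otherwise fall back to the framework query.
inline int visible_device_count() {
#if defined(SKETCH_HAVE_NATIVE_RUNTIME)
  return native_device_count();
#else
  return framework_device_count();
#endif
}
```

Keeping this branch inside one shared helper means the five init paths do not each need their own preprocessor block.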

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: preselecting the device before loading Torch ops. |
| Linked Issues check | ✅ Passed | The changes match issue `#4171` by selecting the rank-local GPU before op-library loading in all touched PT init paths. |
| Out of Scope Changes check | ✅ Passed | No obvious unrelated changes are present beyond the device-preselection refactor needed for the linked bug fix. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


codecov · 2026-06-30T02:46:46Z

Codecov Report

❌ Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.98%. Comparing base (73de44b) to head (050f97e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| source/api_cc/include/commonPT.h | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5612      +/-   ##
==========================================
- Coverage   81.98%   81.98%   -0.01%     
==========================================
  Files         959      959              
  Lines      105430   105423       -7     
  Branches     4071     4069       -2     
==========================================
- Hits        86442    86426      -16     
- Misses      17518    17528      +10     
+ Partials     1470     1469       -1


njzjz · 2026-06-30T05:41:29Z

Need a real test on a machine with multiple GPUs.

github-actions Bot added the C++ label Jun 30, 2026

njzjz marked this pull request as draft June 30, 2026 05:41

njzjz-bot wants to merge 1 commit into deepmodeling:master from njzjz-bot:fix/pt-preselect-device-4171
