
unify rope device #4467

Open
CUHKSZzxy wants to merge 1 commit into InternLM:main from CUHKSZzxy:unify-rope-device

Conversation

@CUHKSZzxy (Collaborator)

Copilot AI review requested due to automatic review settings, March 25, 2026 10:07

Copilot AI (Contributor) left a comment


Pull request overview

This PR follows up on #4466 to make RoPE inverse-frequency initialization consistent across several model implementations/backends by creating the arange/exponent on the default device first and then moving the resulting inv_freq to the target device.

Changes:

  • Update multiple model-specific rotary embedding initializations to compute inv_freq without specifying device in torch.arange, then move inv_freq to the requested device.
  • Update dlinfer rotary embedding init to remove hardcoded CUDA allocation for inv_freq.
  • Adjust default backend’s Dynamic NTK _ntk_inv_freq to match the “no device specified in arange” pattern (but this currently introduces a device mismatch bug).
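The "compute then move" pattern this PR applies can be sketched as follows. This is a minimal illustration of the approach, not lmdeploy's actual code; `make_inv_freq` and its signature are assumptions made for this sketch:

```python
import torch


def make_inv_freq(dim: int, base: float, device: torch.device) -> torch.Tensor:
    """Sketch of the "compute then move" RoPE init pattern (illustrative
    helper, not an actual lmdeploy function)."""
    # Create the arange/exponent on the default device (usually CPU) ...
    exponent = torch.arange(0, dim, 2, dtype=torch.int64).float() / dim
    inv_freq = 1.0 / (base**exponent)
    # ... then move only the final inv_freq tensor to the target device.
    return inv_freq.to(device)


inv_freq = make_inv_freq(dim=128, base=10000.0, device=torch.device('cpu'))
print(inv_freq.shape)  # torch.Size([64])
```

Creating the intermediate tensors on the default device and moving only the result avoids hardcoding a device such as `'cuda'` inside the arithmetic, which is what the dlinfer change above removes.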

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • lmdeploy/pytorch/models/qwen3_5.py: compute inv_freq, then move it to the device for default RoPE params.
  • lmdeploy/pytorch/models/qwen2_vl.py: vision rotary embedding now moves inv_freq to the device after computation.
  • lmdeploy/pytorch/models/qwen2_5_vl.py: same "compute then move" update for the vision rotary embedding.
  • lmdeploy/pytorch/models/glm4_1v.py: same "compute then move" update for the vision rotary embedding.
  • lmdeploy/pytorch/backends/dlinfer/rotary_embedding.py: remove the hardcoded device='cuda' for inv_freq init.
  • lmdeploy/pytorch/backends/default/rotary_embedding.py: change the Dynamic NTK _ntk_inv_freq arange creation (currently breaks CUDA).


        base = self.base * ((self.scaling_factor * seq_len / self.max_position_embeddings) -
                            (self.scaling_factor - 1))**(self.dim / (self.dim - 2))
-       inv_freq = 1.0 / (base**(torch.arange(0, self.dim, 2, dtype=torch.int64, device=device).float() / self.dim))
+       inv_freq = 1.0 / (base**(torch.arange(0, self.dim, 2, dtype=torch.int64).float() / self.dim))

Copilot AI Mar 25, 2026


In _ntk_inv_freq, base is derived from seq_len (so it will be on seq_len.device), but torch.arange(...) is now created on the default device (CPU). This will raise a device-mismatch error when seq_len is on CUDA (common during inference). Ensure the exponent tensor is on the same device as base/seq_len (e.g., move the arange to seq_len.device before the pow), or compute the full NTK inv_freq on CPU and then move it to x.device in forward before torch.where.

Suggested change:

-       inv_freq = 1.0 / (base**(torch.arange(0, self.dim, 2, dtype=torch.int64).float() / self.dim))
+       inv_freq = 1.0 / (base**(torch.arange(0, self.dim, 2, dtype=torch.int64, device=seq_len.device).float() / self.dim))

@CUHKSZzxy (Collaborator, Author) replied:


This will be handled by

        if self.inv_freq.device != x.device:
            self.inv_freq = self.inv_freq.to(x.device)

so there is no need to add the device during the initial calculation.
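The lazy device move the author describes can be sketched as follows. This is an illustrative module under assumed names and structure, not lmdeploy's actual class:

```python
import torch


class LazyDeviceRoPE(torch.nn.Module):
    """Illustrative sketch of deferring the inv_freq device move to forward;
    the class name and structure are assumptions, not lmdeploy's code."""

    def __init__(self, dim: int = 64, base: float = 10000.0):
        super().__init__()
        # inv_freq starts on the default device (typically CPU).
        self.inv_freq = 1.0 / (base**(torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Defer the device move until forward: if inv_freq is not yet on the
        # input's device, move it once and cache the moved tensor.
        if self.inv_freq.device != x.device:
            self.inv_freq = self.inv_freq.to(x.device)
        # Real code would compute cos/sin from positions here; elided.
        return self.inv_freq


rope = LazyDeviceRoPE()
out = rope(torch.zeros(4))  # input on CPU, so inv_freq stays on CPU
```

With this pattern, the one-time `.to(x.device)` in `forward` makes the device used in `__init__` irrelevant, which is why the review suggestion is unnecessary here.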

