[Bugfix] Fix the Eagle3 inference failure issue. #4544
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix an inference failure with Eagle3 speculative decoding. The changes involve correctly setting the attention state for Eagle3, adjusting how attention masks are generated for prefill with cache hits, and reusing pre-computed attention masks. Overall, the changes appear to correctly address the issue. However, I've found a critical bug in the calculation of max_seq_len within get_splitfuse_attn_mask which could lead to a runtime error. I've provided a specific suggestion to fix this.
```python
max_seq_len = max(seq_lens, default=0)
if hasattr(max_seq_len, "item"):
    max_seq_len = int(max_seq_len.item())
else:
    max_seq_len = int(max_seq_len) if max_seq_len else 0
```
The method for calculating max_seq_len is incorrect when seq_lens is a tensor. The max() built-in function cannot be used on a tensor to find its maximum value; seq_lens.max() should be used instead. The current implementation will raise a ValueError if seq_lens is a tensor. This logic can be simplified and corrected to properly handle a tensor input as specified by the type hint.
Suggested change:
```diff
-max_seq_len = max(seq_lens, default=0)
-if hasattr(max_seq_len, "item"):
-    max_seq_len = int(max_seq_len.item())
-else:
-    max_seq_len = int(max_seq_len) if max_seq_len else 0
+max_seq_len = seq_lens.max().item() if seq_lens.numel() > 0 else 0
```
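A minimal runnable sketch of the suggested fix (the helper name `compute_max_seq_len` is illustrative, not from the patch; it assumes `seq_lens` is a 1-D `torch.Tensor`, as the type hint indicates):

```python
import torch

def compute_max_seq_len(seq_lens: torch.Tensor) -> int:
    # Tensor.max() reduces natively (the Python built-in max() would try to
    # iterate the tensor element by element instead); .item() unwraps the
    # 0-d result. Guard the empty case, where Tensor.max() raises.
    return int(seq_lens.max().item()) if seq_lens.numel() > 0 else 0

assert compute_max_seq_len(torch.tensor([5, 12, 7])) == 12
assert compute_max_seq_len(torch.empty(0, dtype=torch.int64)) == 0
```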
fixed
Force-pushed bd75832 to 64e221d
Force-pushed 64e221d to 6e4dd8e
Qwen/Qwen3-32B eagle3 accuracy failed.

Command:
```bash
vllm serve Qwen/Qwen3-32B \
  --served-model-name Qwen3-32B \
  --tensor-parallel-size 4 \
  --trust-remote-code \
  --speculative-config '{"method":"eagle3","model":"/root/.cache/modelscope/hub/models/AngelSlim/Qwen3-32B_eagle3","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \
  --max-num-batched-tokens 4096 \
  --max-model-len 4096
```
```python
cann_version = getattr(torch.version, "cann", "")
target_device = device or self.device
use_chunked_mask = (seq_lens is None or position is None
                    or dtype is None or cann_version.startswith("8.3"))
```
Hardcoding the version "8.3" in this check hurts maintainability. It is recommended to define the version condition as a named constant (e.g., MIN_CANN_VERSION_FOR_OPTIMIZED_MASK = "8.3"), as sketched below.
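A possible shape for that refactor, using the reviewer's suggested constant name (the helper `should_use_chunked_mask` is hypothetical; the condition itself is copied from the diff above):

```python
import torch

# Reviewer-suggested constant replacing the hardcoded "8.3" literal.
MIN_CANN_VERSION_FOR_OPTIMIZED_MASK = "8.3"

def should_use_chunked_mask(seq_lens, position, dtype) -> bool:
    # Fall back to the chunked mask when any required input is missing, or
    # when the CANN version matches the threshold, mirroring the PR's check.
    cann_version = getattr(torch.version, "cann", "")
    return (seq_lens is None or position is None or dtype is None
            or cann_version.startswith(MIN_CANN_VERSION_FOR_OPTIMIZED_MASK))
```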
fixed
```python
if target_device is None:
    raise ValueError(
        "splitfuse_attn_mask requires device for non-chunked mask")
max_seq_len = seq_lens.max().item() if seq_lens.numel() > 0 else 0
```
Magic number: the fallback value 0 for max_seq_len should be defined as a constant (e.g., DEFAULT_MAX_SEQ_LEN = 0); see the sketch below.
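A minimal sketch of that suggestion (the constant name comes from the comment above; `seq_lens` is an illustrative input):

```python
import torch

# Fallback when there are no sequences to build a mask for.
DEFAULT_MAX_SEQ_LEN = 0

seq_lens = torch.tensor([8, 3, 15])  # hypothetical input
max_seq_len = (seq_lens.max().item()
               if seq_lens.numel() > 0 else DEFAULT_MAX_SEQ_LEN)
print(max_seq_len)  # 15
```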
fixed
Force-pushed 12b7323 to 3388cc6
Force-pushed 3388cc6 to c4a11a7
What this PR does / why we need it?
Fix the Eagle3 inference failure issue.
Error message: "EngineCore encountered an issue. See stack trace (above) for the root cause."
Fixes #4323
How was this patch tested?
```bash
vllm serve /nfs/1_AscendPackage/05_weights_public/Qwen3-32B \
  --served-model-name Qwen3-32B \
  -tp 4 \
  --host "0.0.0.0" \
  --port "8000" \
  --trust-remote-code \
  --speculative-config '{"method":"eagle3","model":"/home/scd/qwen3_32b_eagle3/","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \
  --max-num-batched-tokens 4096 \
  --max-model-len 4096
```

vLLM version: v0.11.0
vLLM-ascend version: v0.11.0rc2
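For reference, one way to smoke-test the server started above is through vLLM's OpenAI-compatible endpoint (this snippet is illustrative and not part of the patch; the host, port, and model name follow the serve command):

```python
import requests

resp = requests.post(
    "http://0.0.0.0:8000/v1/chat/completions",
    json={
        "model": "Qwen3-32B",  # matches --served-model-name above
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=120,
)
# Print the generated reply to confirm Eagle3 speculative decoding serves
# requests without the reported EngineCore failure.
print(resp.json()["choices"][0]["message"]["content"])
```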