[refactor] [mla]: independently passing q_nope & q_rope #28368
base: main
Conversation
Signed-off-by: vnadathur <glvikramn@gmail.com> Co-Authored-By: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Code Review
This pull request refactors the Multi-Head Latent Attention (MLA) layers to pass the q_nope and q_pe components of the query tensor as separate arguments, instead of a single concatenated tensor. This is a good preparatory step for introducing custom attention ops that can leverage this separation for optimizations. The changes are applied consistently across the affected attention backends and model layers.
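To make the interface change concrete, here is a minimal sketch of a call site before and after the refactor. The function names, `backend` object, and tensor shapes are illustrative assumptions, not the actual vLLM `MLACommonImpl` signature.

```python
import torch

# Illustrative only: a stand-in for an MLA attention backend whose forward()
# previously took one concatenated query tensor and now takes the two parts.
def call_backend_before(q_nope: torch.Tensor, q_pe: torch.Tensor, backend):
    # q_nope: [num_tokens, num_heads, qk_nope_head_dim]
    # q_pe:   [num_tokens, num_heads, qk_rope_head_dim]
    q = torch.cat([q_nope, q_pe], dim=-1)  # extra allocation + copy at the call site
    return backend.forward(q)

def call_backend_after(q_nope: torch.Tensor, q_pe: torch.Tensor, backend):
    # The components flow through unchanged; a backend that can consume them
    # separately (e.g. a custom decode kernel) never pays for the concat.
    return backend.forward(q_nope, q_pe)
```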
However, I've identified a potential performance regression in the prefill path for the MLACommonImpl backend. The new API forces a split -> slice -> cat sequence of operations, which is less efficient than the previous slice-only approach due to an extra memory allocation and copy. I've left a detailed comment with a suggestion to address this. Other than that, the refactoring looks solid.
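A small illustration of the allocation concern, under assumed shapes: slicing the last dimension of a tensor returns a view and allocates nothing, while `torch.cat` materializes a new tensor and copies both inputs.

```python
import torch

num_tokens, num_heads = 8, 16
nope_dim, rope_dim = 128, 64   # assumed head dims, for illustration only

# Previous approach: one query tensor, sliced into its two halves.
# Slicing returns views; no new memory is allocated.
q = torch.randn(num_tokens, num_heads, nope_dim + rope_dim)
q_nope_view = q[..., :nope_dim]
q_pe_view = q[..., nope_dim:]

# New API at the prefill boundary: the halves arrive as separate tensors, but
# the prefill kernel still consumes a single concatenated query, so they must
# be re-joined; torch.cat allocates and copies.
q_nope = torch.randn(num_tokens, num_heads, nope_dim)
q_pe = torch.randn(num_tokens, num_heads, rope_dim)
q_cat = torch.cat([q_nope, q_pe], dim=-1)
```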
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: vnadathur <glvikramn@gmail.com>
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com> Co-Authored-By: vnadathur <236933696+vnadathur@users.noreply.github.com>
Signed-off-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
Thanks for this work! Do you have any benchmarking results for what kind of speedup we can expect from this?
@MatthewBonanni Here are the benchmark results with and without the split. There is other work in flight on refactoring the MLA backends, for example #27501; this PR just enables more of the work coming from the original issue and the adjacent PRs listed in the description. Ran DeepSeek-V2.5 on 8xH100.
BENCHMARK UTILIZING SPLIT
BENCHMARK NORMAL MLA BACKEND: Config
I'm tracking this in issue #26567.
Reference: #24620 and #25103
One thing of note: the prefill path still requires a concatenated q (because of kernel expectations), so I focused on de-concatenating only the decode path.
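A rough sketch of that asymmetry, with hypothetical kernel callables standing in for the real prefill and decode implementations (names and dispatch are assumptions, not vLLM's actual code):

```python
import torch
from typing import Callable

# Hypothetical dispatch: prefill still wants a concatenated query (kernel
# expectation), while decode consumes the two components directly.
def mla_attention(
    q_nope: torch.Tensor,
    q_pe: torch.Tensor,
    is_prefill: bool,
    prefill_kernel: Callable[[torch.Tensor], torch.Tensor],
    decode_kernel: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
) -> torch.Tensor:
    if is_prefill:
        # Concatenate only where the kernel requires it.
        return prefill_kernel(torch.cat([q_nope, q_pe], dim=-1))
    # Decode path stays de-concatenated end to end.
    return decode_kernel(q_nope, q_pe)
```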
cc @ProExpertProg @MatthewBonanni @LucasWilkinson