[OMNIML-2852] [2/n] Add Core Sparse Attention Infrastructure #527
base: main
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #527      +/-  ##
==========================================
+ Coverage   74.64%   74.95%   +0.31%
==========================================
  Files         183      192       +9
  Lines       18542    18939     +397
==========================================
+ Hits        13840    14196     +356
- Misses       4702     4743      +41

View full report in Codecov by Sentry.
Force-pushed from 54bfe2c to 0ce1376
Signed-off-by: Kai Xu <kaix@nvidia.com>
Force-pushed from fc9d285 to 5d027e0
Hi @kaix-nv, could you split this code change further? This PR has 3000+ lines of changes and many file moves.
# Create registry for sparse attention modules
SparseAttentionRegistry = _DMRegistryCls("SparseAttention", SparseAttentionModule)
Can we use a single registry for all sparsity algorithms and modes, and then a top-level mts.sparsify(model, mode=...), so that all algorithms (e.g. weight or attention sparsity) are invoked through one shared API instead of a separate API per algorithm?
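For illustration, the shared entry point could look roughly like this (a sketch only: the "skip_softmax" mode name and the toy model below are assumptions, not existing ModelOpt API):

```python
# Sketch of a single shared sparsify entry point for all sparsity algorithms.
# The "skip_softmax" mode name is hypothetical; "sparse_magnitude" is an
# existing weight-sparsity mode used here only for contrast.
import torch.nn as nn
import modelopt.torch.sparsity as mts

model = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # any torch model

# Weight sparsity and the proposed attention sparsity would both go through
# the same top-level call, selected by mode:
model = mts.sparsify(model, mode="sparse_magnitude")
model = mts.sparsify(model, mode="skip_softmax")  # hypothetical attention-sparsity mode
```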
run_example_command(cmd_parts, "llm_sparsity/attention_sparsity")

@minimum_gpu(1)
No need for the 1-GPU marker. All tests are run on machines with one or more GPUs anyway.
import modelopt.torch.sparsity.attention_sparsity as sparse_attn

# Skip all tests if GPU is not available
pytestmark = pytest.mark.skipif(not torch.cuda.is_available(), reason="GPU not available")
Tests inside tests/gpu don't need a GPU check; it's assumed they are run only on GPU-enabled machines.
The same applies to all the test files.
hidden_size=512,
intermediate_size=1024,
Why do we need such large hidden and intermediate sizes? Can we use 32/64 instead?
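For example, a much smaller test config could look like this (a hypothetical sketch; the config class and the extra fields are assumptions for illustration):

```python
# Tiny hypothetical test config in the spirit of the suggestion above.
from transformers import LlamaConfig

tiny_cfg = LlamaConfig(
    hidden_size=32,        # instead of 512
    intermediate_size=64,  # instead of 1024
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
)
```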
These tests seem unnecessary.
What does this PR do?
Type of change: New feature

Overview: This PR adds sparse attention support to ModelOpt, applying attention sparsity via a skip-softmax method to enable inference speedups for LLMs.
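Conceptually, skip softmax drops attention weights that contribute negligibly to the output so the corresponding positions can be skipped. A minimal illustrative sketch follows (the threshold criterion is an assumption and not necessarily the exact method implemented in this PR):

```python
# Illustrative-only sketch of the skip-softmax idea.
import torch

def skip_softmax(scores: torch.Tensor, threshold: float = 1e-3) -> torch.Tensor:
    """Softmax followed by dropping near-zero attention weights so the
    corresponding key/value positions can be skipped during inference."""
    probs = torch.softmax(scores, dim=-1)
    probs = torch.where(probs < threshold, torch.zeros_like(probs), probs)
    # Renormalize the surviving weights so each row still sums to 1.
    return probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-12)

# Example: attention scores for (batch=1, heads=2, queries=4, keys=4)
scores = torch.randn(1, 2, 4, 4)
sparse_attn_weights = skip_softmax(scores)
```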
Key Features:
Design doc
Usage
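(A hypothetical usage sketch, based only on names visible elsewhere in this PR; the sparsify() entry point and its signature are assumptions, not confirmed API:)

```python
# Hypothetical usage sketch; the sparse_attn.sparsify call is an assumption.
import modelopt.torch.sparsity.attention_sparsity as sparse_attn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

# "SKIP_SOFTMAX_DEFAULT" is the config name used in the MMLU command below.
model = sparse_attn.sparsify(model, config="SKIP_SOFTMAX_DEFAULT")
```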
Testing
Unit Test
ALL PASSED.
Accuracy
Benchmark: MMLU
Model: Qwen/Qwen3-4B
Cmd: python mmlu.py --model_name causal --model_path Qwen/Qwen3-4B --sparse_cfg SKIP_SOFTMAX_DEFAULT
Before your PR is "Ready for review"
Additional Information