IFU release v2.6 by wangye805 · Pull Request #406 · ROCm/TransformerEngine

wangye805 · 2026-01-03T16:27:03Z

Description

IFU of upstream release_v2.6 (at commit c90a720), based on dev commit 669b556

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Resolve several conflicts

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…end` (#1965)
Update utils.py
Fix the condition error of the FP8 attention in `get_attention_backend`
Signed-off-by: yuzhongw-nvidia <yuzhongw@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>

* exclude 9.10.0/.1 for certain configs
* fix kv_channels
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add get_backend to tests
* add init files
* fix numerics and cuda graph tests
* fix jax tests
* remove prints
* minor changes after renaming
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix import structure and rename get_attention_backends
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix docs and benchmarks
* fix get backend calls
* Revert "fix get backend calls" This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc.
* Revert "fix docs and benchmarks" This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9.
* fix docs, benchmarks and pre-commit ci
* fix dpa/mha flash attn selection
* fix rng states
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix ModelConfig
* fix backend selection on Ampere
* fix issues from last merge
* Update tests/pytorch/utils.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* remove initialization of rng_states to None
* redefine ModelConfig
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix ModelConfig
* fix seed for CP tests
* Update tests/pytorch/test_sanity.py Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* move fixture from utils to individual tests
* fix CI
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

…ug quantizer (#1963)
* Debug linear layer when saving original input and using debug quantizer
* Workaround bugs with quantizing with only column-wise usage
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Remove unused imports
* Avoid unnecessary row-wise data
* Workaround bugs with quantizing with only column-wise usage. FP8 does not support transpose-only cast.
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fixed conflicts
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Minor code refactoring to avoid unnecessary checks
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed typo
* Fixed dBias accumulation error due to initialization. Minor code refactoring
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Test case to reproduce the init error
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed rowwise dbias error
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Changed ptx API
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Added a struct for two packed FP8 values
* Rolled back to scalar code for columnwise scaling due to its better performance
* Minor corrections
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Rebased on main
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixes per code review
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Removed constexpr in C++ test suite to build faster
* Computed activations are now numerically truncated to InputType before scaling. Improved test suite.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Minor refactoring
* Minor refactoring
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Modified mismatches checks of MXFP8 to address FP8 numerics
* Implemented Jeremy's fixes to JAX test suite with an intermediate downcast
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Reduced the dims of the test tensors to improve CI runtime
* Fixed memory alignment issue. Compute dbias without downcast.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix current device for cuDNN/cuBLAS handles
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add unit test
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* use weight device and improve tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

…on-MXFP8 recipes. (#1962)
* add manage_primitives() helper
* disable GEMM primitives for non-MXFP8 recipes
* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE
* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples
* fix use_jax_gemm contextmanager
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
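The commit above introduces NVTE_JAX_CUSTOM_CALLS and deprecates NVTE_JAX_CUSTOM_CALLS_RE. As a rough illustration of what such a deprecation can look like, the sketch below prefers the new variable and falls back to the old one with a warning. Only the two variable names come from the commit message; the function `read_custom_calls_setting` is hypothetical, and the sketch does not reproduce Transformer Engine's actual parsing or the value syntax, neither of which is shown in this PR.

```python
import os
import warnings

NEW_VAR = "NVTE_JAX_CUSTOM_CALLS"      # name taken from the commit message
OLD_VAR = "NVTE_JAX_CUSTOM_CALLS_RE"   # deprecated name from the commit message


def read_custom_calls_setting(env=None):
    """Hypothetical helper: return (raw_value, variable_name_used).

    The new variable wins when both are set. The old variable is still
    honored, but a DeprecationWarning is emitted. The raw string is
    returned untouched; interpreting it is out of scope for this sketch.
    """
    env = os.environ if env is None else env
    if NEW_VAR in env:
        return env[NEW_VAR], NEW_VAR
    if OLD_VAR in env:
        warnings.warn(
            f"{OLD_VAR} is deprecated; use {NEW_VAR} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[OLD_VAR], OLD_VAR
    return None, None
```

Passing a plain dict as `env` keeps the helper easy to exercise in tests without touching the real process environment.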

…elism correctly for sequence-parallel inputs (#1980)
* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism
* corrected handling of FSDP sharding for the RHS operand
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral
* fixed linting issues
* added assert on sequence-parallel options when GemmPrimitive is disabled
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* optimize static grad outputs
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

tests/pytorch/attention/test_attention.py

tests/pytorch/test_numerics.py

tests/pytorch/attention/test_kv_cache.py

tests/pytorch/distributed/test_sanity.py

transformer_engine/common/normalization/common.h

transformer_engine/common/util/rocm_dequantize_kernels.cuh

benchmarks/attention/benchmark_attention_rocm.py

ipanfilo

Please update the copyright year in the files that this PR continues to modify.

ipanfilo · 2026-02-01T20:29:51Z

.github/workflows/rocm-ci.yml


-          HIP_VISIBLE_DEVICES=3 ci/core.sh > /workspace/core_sgpu.log 2>&1 &
-          core_pid=$!; echo Core test pid $!
+          ci/core.sh > /workspace/core_sgpu.log 2>&1 


Why is it a part of release?

This comes from cherry-picking #426 from the dev branch. We found that the single-GPU (sgpu) C++ core gtest always uses all 8 GPUs, which can affect the sm_count PyTorch pytests.
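The workflow change discussed here drops the trailing `&` so that the core test runs in the foreground and finishes before anything else touches the GPUs. A minimal Python sketch of the same serialization idea follows; the helper `run_serialized` and the placeholder commands are hypothetical stand-ins, not the contents of `ci/core.sh` or the actual CI steps.

```python
import subprocess
import sys


def run_serialized(commands):
    """Run each command to completion before starting the next one.

    subprocess.run blocks until the child exits, so no two commands
    ever overlap. Returns the list of exit codes in order.
    """
    exit_codes = []
    for cmd in commands:
        completed = subprocess.run(cmd, check=False)
        exit_codes.append(completed.returncode)
    return exit_codes


# Placeholder commands standing in for the core test and the later
# PyTorch tests; they only print a label.
demo_commands = [
    [sys.executable, "-c", "print('core tests')"],
    [sys.executable, "-c", "print('pytorch tests')"],
]
```

Calling `run_serialized(demo_commands)` runs the two placeholders back to back. The trade-off is longer wall-clock time in exchange for tests that no longer compete for the same devices.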

tests/pytorch/test_numerics.py

* [CI] Skipped test_gpt_full_activation_recompute tests for gfx950
* [CI] Skipped unsupported test_basic_linear_quantized tests on gfx950
* [CI] Fixed test_numerics, test_norms, test_fused_optimizer failures for gfx950 ci enablement
* [CI] Disabled gfx950 support until FP8 GEMM layout coverage is verified with hipblaslt
* [CI] [gfx950] Disable cudaGraph for gemm and grouped-gemm
* Addressed reviews
* [CI] Add MI355 nodes to github actions workflow
* [CI] Update docker image
* [CI] add MI355 runner matrix and keep matrix legs independent
* Skip unstable Gemm tests on gfx950
* Addressed reviews
* Guard gfx950 TN skip by ROCm version and adjust MXFP8 Dq test size
* Removed ROCM7.2 guards
* Reverted ROCM7.2 guards
* Update rocm-ci.yml
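Several of the commits above skip tests on gfx950. The sketch below shows one generic way to express an architecture-based skip using only the standard library's `unittest`; the repository's own tests are pytests and may do this differently. `should_skip_on_arch`, `CURRENT_ARCH`, and the test class are hypothetical, and the architecture string is hard-coded here rather than queried from a device.

```python
import unittest


def should_skip_on_arch(arch, unsupported_archs=("gfx950",)):
    """Hypothetical predicate: True when tests should be skipped on `arch`."""
    return arch in unsupported_archs


# Assumption for illustration: a real suite would query the device for
# its architecture instead of hard-coding it.
CURRENT_ARCH = "gfx950"


class BasicLinearQuantizedTest(unittest.TestCase):
    @unittest.skipIf(
        should_skip_on_arch(CURRENT_ARCH),
        "quantized linear test not supported on this architecture",
    )
    def test_quantized_linear(self):
        # Placeholder body; the real test logic is not part of this sketch.
        self.assertTrue(True)
```

Keeping the decision in a small pure function makes the skip condition itself testable without any GPU present.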

KshitijLakhani and others added 16 commits July 20, 2025 12:43

Changed VERSION to 2.6.0

bf5b217

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

[PyTorch] Remove GH pinned deps (#1961)

c7d0271

* Remove GH pinned deps
* Pin onnxscript
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

[PyTorch] Reset FP8 weight workspace if usages are invalid (#1972)

787acff

Reset FP8 weight workspace if usages are invalid
Signed-off-by: Tim Moon <tmoon@nvidia.com>

[JAX] Fix current scaling test_helper.py and enable test_helper.py in L0 (#1990)

928dfa8

Fix current scaling test_helper.py and enable test_helper.py in L0
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

Fix runtime lib loading for cuDNN (#1989)

e02e289

Fix cuDNN lib runtime loading and simplify
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

Fix cudnn versioning support in PyTorch DPA and Fused attn (#1991)

21d7410

Fix cudnn versioning support in PyTorch DPA and Fused attn
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>

Fix the use-after-free bug in unfused normalization (#2002)

c90a720

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Merge remote-tracking branch 'upstream/release_v2.6' into yewang12/ifu_release_v2.6_rocm

966a4ac

wangye805 requested review from ipanfilo and wenchenvincent as code owners January 3, 2026 16:27

wangye805 force-pushed the release_v2.6_rocm branch from 40b85a9 to 669b556 on January 6, 2026 23:04

[ROCm] Resolve conflicts

97556c6

wangye805 force-pushed the yewang12/ifu_release_v2.6_rocm branch from a90850f to 97556c6 on January 9, 2026 19:37

ipanfilo reviewed Jan 13, 2026

ipanfilo reviewed Jan 30, 2026

benchmarks/attention/benchmark_attention_rocm.py (outdated)

wangye805 and others added 2 commits January 31, 2026 11:00

[ROCm] address reviewer comments

3bf9150

CI: Serialize core sgpu test (#426)

fdac02a

* Run core with all GPUs
* Change n_parallel_jobs number to use all GPUs

wangye805 requested a review from ipanfilo February 1, 2026 16:21

ipanfilo reviewed Feb 1, 2026

[ROCm] add EnvVarCleaner definition and update copyright years

bf75169

wangye805 requested a review from ipanfilo February 2, 2026 04:09

wangye805 force-pushed the yewang12/ifu_release_v2.6_rocm branch from fd073bf to 9b12e69 on February 2, 2026 19:54

wangye805 and others added 2 commits February 3, 2026 22:28

[ROCm] cherrypick the timeout setting for jax distributed pytests hang

a512352

Do not fail CI on known failed JAX test (#421)

4998d8f

wangye805 wants to merge 23 commits into release_v2.6_rocm from yewang12/ifu_release_v2.6_rocm

15 participants