Add NVTE_BACKWARD_OVERRIDE=high_precision|dequantized #2644

zianglih wants to merge 63 commits into NVIDIA:main from

Conversation
Greptile Summary

This PR adds the `NVTE_BACKWARD_OVERRIDE=high_precision|dequantized` env var.

Confidence Score: 5/5. Safe to merge — all previously flagged blocking issues are resolved; the remaining findings are P2 style suggestions. All previously raised P0/P1 concerns (duplicate recipe fields, recipe `None` crash, LayerNormMLP assertion message quality, unnecessary saved tensors, DelayedScaling interaction) are addressed in this revision. The feature is guarded by explicit assertions with clear error messages for unsupported combinations. The only new findings are a defensive `getattr` suggestion in fuser.py and a minor asymmetry in empty-tensor guards for MXFP8 storage, both P2. Comprehensive test coverage was added.

Important files changed: transformer_engine/pytorch/ops/fuser.py (minor: direct attribute access on `recipe.backward_override`), transformer_engine/pytorch/tensor/storage/mxfp8_tensor_storage.py (minor: asymmetric empty-tensor guard in `dequantize` vs `_FromMXFP8Func.forward`).
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Forward Pass - quantized fprop] --> B{NVTE_BACKWARD_OVERRIDE}
    B -->|None| C[Default: save rowwise+columnwise quantized tensors]
    B -->|high_precision| D[Save original unquantized input and weight]
    B -->|dequantized| E[Save rowwise-only quantized tensors]
    C --> F[Backward: quantized dgrad and wgrad GEMMs]
    D --> G[Backward: high-precision dgrad and wgrad using original fp16/bf16/fp32 operands]
    E --> H[Backward: dequantize saved tensors then high-precision GEMMs]
    subgraph Supported
        L[Linear]
        M[LayerNormLinear]
        N[GroupedLinear]
        O[ops.Linear / fused ops]
    end
    subgraph Unsupported - assertion error with clear message
        P[LayerNormMLP]
        Q[DelayedScaling recipe]
    end
```
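The save-for-backward dispatch in the flowchart can be sketched in plain Python. This is a toy model of the branching only, not Transformer Engine code; the function name and tensor arguments are illustrative.

```python
def select_saved_tensors(override, x, w, x_q_rowwise, x_q_colwise,
                         w_q_rowwise, w_q_colwise):
    """Toy model: which operands fprop saves for backward under each
    NVTE_BACKWARD_OVERRIDE mode (all names here are illustrative)."""
    if override is None:
        # Default: keep both rowwise and columnwise quantized copies
        # for the quantized dgrad/wgrad GEMMs.
        return {"x": (x_q_rowwise, x_q_colwise),
                "w": (w_q_rowwise, w_q_colwise)}
    if override == "high_precision":
        # Save the original unquantized operands; backward GEMMs run
        # in fp16/bf16/fp32.
        return {"x": (x,), "w": (w,)}
    if override == "dequantized":
        # Save rowwise-only quantized tensors; backward dequantizes
        # them before the high-precision GEMMs.
        return {"x": (x_q_rowwise,), "w": (w_q_rowwise,)}
    raise ValueError(f"unknown NVTE_BACKWARD_OVERRIDE value: {override!r}")
```

Note that both override modes save fewer tensors per operand than the default path.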
Reviews (43): Last reviewed commit: "Merge branch 'main' into keep-bwd"

I'll work on potential unit test breakage.
```diff
  # Note: dgrad GEMM requires row-wise usage, wgrad GEMM
  # requires column-wise usage
- if ctx.grad_output_quantizer is not None:
+ if ctx.grad_output_quantizer is not None and use_fp8_bwd:
```
this line seems redundant since you already skip the quantization step in base.py grad_output_preprocess?
```python
not ctx.use_bias
and not ctx.requires_wgrad
and ctx.grad_output_quantizer is not None
and use_fp8_bwd
```
```python
recipe = cls.get_fp8_recipe()
if recipe is not None and recipe.delayed():
    # Ignore NVTE_KEEP_BACKWARD_UNQUANTIZED when delayed scaling is used
    return False
```
Maybe it's better to assert an error for delayed scaling? Okay with both.

I agree. If the user specifies an unsupported combination, I think it's better to fail loudly than to secretly disobey their instructions.
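The "fail loudly" behavior agreed on here could look roughly like the sketch below. The function name, the use of the recipe's `delayed()` method, and the message wording are assumptions for illustration, not the merged code.

```python
def use_backward_override(recipe, override):
    """Sketch: decide whether the backward override applies, asserting
    instead of silently ignoring it under delayed scaling."""
    if override is None:
        return False
    # Fail loudly rather than secretly disobeying the user's instruction;
    # the `recipe is None` check also avoids crashing when no recipe is set.
    assert recipe is None or not recipe.delayed(), (
        "NVTE_BACKWARD_OVERRIDE is not supported with the DelayedScaling recipe"
    )
    return True
```

The `recipe is None` guard mirrors the "recipe None crash" concern flagged in the review summary.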
```diff
  # Note: dgrad GEMM requires row-wise usage, wgrad GEMM
  # requires column-wise usage
- if ctx.grad_output_quantizer is not None:
+ if ctx.grad_output_quantizer is not None and use_fp8_bwd:
```
this seems redundant too if we skip quant in grad_output_preprocess
Naming part I agree, but I have no strong opinion.
Hi @ksivaman, thanks for reviewing!

Currently the

Yes, we have very good reasons in RL use cases, since it best preserves the chain rule and serves as an STE. Our experiments showed clearly more stable gradient curves compared with

Yes, I can change the naming to
Can you clarify the dequant method here? For fprop, we quantize and get input_fp8 and weight_fp8, and then for backward you dequantize both, is that right?
Hi @zhongbozhu,

This is exactly right. The fprop uses the quantized compute specified by the quantization recipe, with no behavioral changes. In bwd, input_fp8 is dequantized for a high-precision wgrad, weight_fp8 is dequantized for a high-precision dgrad, the gradient is always kept in high precision, and gradient quantization never happens. The motivation for this dequantized design is RL.
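The dequantized-backward arithmetic described above can be illustrated with a toy symmetric quantizer in NumPy. `fake_quant`, the bit width, and the scale choice are illustrative stand-ins for TE's FP8 quantization, not its actual implementation.

```python
import numpy as np

def fake_quant(t, n_bits=4):
    """Toy symmetric quantize-dequantize: scale to n_bits, round, rescale.
    A stand-in for FP8 quantization, not TE's implementation."""
    scale = np.abs(t).max() / (2 ** (n_bits - 1) - 1)
    return np.round(t / scale) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # activation
w = rng.standard_normal((3, 8)).astype(np.float32)   # weight
g = rng.standard_normal((4, 3)).astype(np.float32)   # incoming gradient dL/dy

x_dq, w_dq = fake_quant(x), fake_quant(w)

# fprop: quantized compute (unchanged by the override)
y = x_dq @ w_dq.T                                    # shape (4, 3)

# "dequantized" backward: high-precision GEMMs on the dequantized fprop
# operands; the gradient itself is never quantized, so the chain rule
# through y = x_dq @ w_dq.T holds exactly (an STE through the rounding).
dgrad = g @ w_dq                                     # dL/dx_dq, shape (4, 8)
wgrad = g.T @ x_dq                                   # dL/dw_dq, shape (3, 8)

# "high_precision" backward: same GEMMs on the original operands instead
dgrad_hp = g @ w
wgrad_hp = g.T @ x
```

The `dequantized` gradients are exact gradients of the quantized fprop, while the `high_precision` ones use the original operands and therefore deviate slightly from the fprop's actual compute graph.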
Signed-off-by: Ziang Li <ziangli@umich.edu>

…zed` Signed-off-by: Ziang Li <ziangli@umich.edu>

`NVTE_BACKWARD_MODE=default|unquant|dequant` → `NVTE_BACKWARD_OVERRIDE=high_precision|dequantized`

Signed-off-by: Ziang Li <ziangli@umich.edu>
@victordion I think you are describing the default TE 1d recipe or requantized behavior.

Right. My mistake. My mental model assumed there was a requantize happening. Thanks for responding!

Regarding the env var design: since this feature is mainly used by RL, there has to be a way for the user to directly override the bwd behavior in the RL framework instead of plumbing all the way through Megatron.

/te-ci L0 L1

All PyTorch CI passed. Some failed JAX tests are due to

/te-ci L0 L1

/te-ci L0 L1
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci L0 L1

Failed JAX ci is unrelated to this PR: B200: L40:

Description
~~Add an NVTE_KEEP_BACKWARD_UNQUANTIZED env var for quantized fprop + high precision wgrad & dgrad.~~
~~Add NVTE_BACKWARD_MODE=default|unquant|dequant env var~~
Add `NVTE_BACKWARD_OVERRIDE=high_precision|dequantized` env var:

- `high_precision`: quantized fprop + high-precision wgrad & dgrad using the unquantized activation and weight
- `dequantized`: quantized fprop + high-precision wgrad & dgrad using the activation and weight dequantized directly from the fprop quantized values

The motivation for this `dequantized` design is RL. Unlike pre-training, which only needs to preserve a coarse optimization direction and convergence, RL gradients are noisy and useful updates are small and delicate. If gradient quantization and chain-rule violation are present, noise dominates the true and fragile update signal and the model will collapse. This `dequantized` design avoids gradient quantization and effectively preserves the chain rule.
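As a sketch of the env-var contract described above, a consumer of `NVTE_BACKWARD_OVERRIDE` might validate it as below. The helper function itself is hypothetical; only the variable name and its two values come from this PR.

```python
import os

VALID_OVERRIDES = ("high_precision", "dequantized")

def read_backward_override(env=os.environ):
    """Hypothetical validator for NVTE_BACKWARD_OVERRIDE; only the
    variable name and its two values come from this PR."""
    value = env.get("NVTE_BACKWARD_OVERRIDE")
    if value is None:
        return None  # unset: default quantized backward, no override
    if value not in VALID_OVERRIDES:
        raise ValueError(
            f"NVTE_BACKWARD_OVERRIDE must be one of {VALID_OVERRIDES}, got {value!r}"
        )
    return value
```

Failing on an unrecognized value matches the "fail loudly" preference expressed in the review thread.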
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: