Fix logical sharding resolution in NNX by xibinliu · Pull Request #4205 · AI-Hypercomputer/maxtext

xibinliu · 2026-06-19T00:27:20Z

Description

In pure NNX training runs, model variables retrieve physical PartitionSpecs via get_nnx_named_sharding_with_scan_axis in maxtext_utils.py. Previously, this helper used Flax core SPMD's from_sharding_rules to map logical names to physical axes. However, from_sharding_rules resolves rules by converting the rules list into a dictionary (last-write-wins). This caused fallback rules sharing the same logical name (e.g. 'embed') to overwrite preceding specific rules, dropping essential axes like fsdp_transpose and leading to unsharded parameter percentage assertion errors.
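The last-write-wins effect described above can be reproduced with plain Python. This is a minimal sketch that does not call Flax at all: the rule values are made-up examples, and it only illustrates why turning an ordered rules list into a dictionary discards the earlier, more specific rule.

```python
# Two rules share the logical name "embed"; the first is the specific one,
# the second is the fallback. The axis names are illustrative only.
rules = (
    ("embed", ("fsdp", "fsdp_transpose")),
    ("embed", ("tensor_transpose",)),
)

# Converting to a dict keeps only the LAST entry for each key,
# so the fallback silently replaces the specific rule.
last_write = dict(rules)["embed"]

# Scanning the list in order keeps the FIRST entry for the name.
first_match = next(axes for name, axes in rules if name == "embed")

print("dict lookup:", last_write)
print("first match:", first_match)
```

With the dictionary lookup, `fsdp_transpose` never reaches the resulting spec, which is consistent with the unsharded-parameter assertion failures mentioned above.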

Additionally, resolving specifications independently for each dimension without tracking assigned axes could bind a single physical axis (like fsdp_transpose) to multiple positional dimensions of a tensor, causing DuplicateSpecError.

To fix this:

Replaced from_sharding_rules with a Rules-first resolution loop that matches rules sequentially (first-match-wins), matching Flax Linen's mapping behavior.
Implemented an assigned_axes tracker within the loop to ensure physical mesh axes are bound to at most one dimension per tensor.
Added unit tests covering sequential matching (first-match-wins) and duplicate physical axis prevention during resolution.
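The resolution loop and the assigned-axes tracker listed above can be sketched in pure Python. The function name `resolve_logical_axes` and its signature are hypothetical and are not taken from the PR; the sketch assumes each rule value is a string or a tuple of strings, and it does not handle `None` rules or repeated logical names.

```python
def resolve_logical_axes(logical_axes, rules):
    """Hypothetical rules-first resolver (illustration only, not the PR's code).

    Walks `rules` in priority order. A rule applies only if its logical name
    is still unresolved and none of its mesh axes are already bound.
    """
    resolved = [None] * len(logical_axes)
    assigned_axes = set()  # physical mesh axes already bound to a dimension
    for logical_name, mesh_axes in rules:
        if logical_name not in logical_axes:
            continue
        pos = logical_axes.index(logical_name)
        if resolved[pos] is not None:
            continue  # first-match-wins: an earlier rule already resolved this
        candidate = (mesh_axes,) if isinstance(mesh_axes, str) else tuple(mesh_axes)
        if assigned_axes.intersection(candidate):
            continue  # would bind a mesh axis to two dimensions; try later rules
        resolved[pos] = mesh_axes
        assigned_axes.update(candidate)
    return resolved


# First-match-wins: the specific rule beats the fallback.
first = resolve_logical_axes(("embed", "mlp"), (("embed", "fsdp"), ("embed", "stage")))

# Duplicate prevention: "fsdp" is already bound when the "mlp" rule is reached.
dedup = resolve_logical_axes(
    ("embed", "mlp"), (("embed", ("fsdp", "stage")), ("mlp", "fsdp"))
)
print(first, dedup)
```

Dimensions left as `None` in the result are unsharded, which is how the skipped `mlp` rule in the second call shows up.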

Tests

Log with Gemma3-12B (2x v6e-256)

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-19T00:30:59Z

Codecov Report

❌ Patch coverage is 90.56604% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/utils/model_creation_utils.py	73.33%	2 Missing and 2 partials ⚠️
src/maxtext/utils/sharding.py	96.87%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

NuojCheng · 2026-06-22T16:51:43Z

  )


+def _resolve_logical_sharding(out_sharding, context_rules, local_rules) -> list:


could you move this function to utils/sharding.py? The goal is to move all sharding related util functions to this file.

Moved the get_nnx_named_sharding_with_scan_axis() to sharding.py, and moved the corresponding unit tests accordingly.

NuojCheng · 2026-06-22T16:57:23Z

+    # We define rules for 'embed' mapping to 'fsdp' (specific) then 'layers' (fallback)
+    rules = (
+        ("embed", "fsdp"),
+        ("embed", "layers"),


layers is not a physical axis name, is it? Maybe use something else, e.g. expert?

Changed to "stage"; "layers" is now used only as a logical name.

NuojCheng · 2026-06-22T16:57:38Z

+        ("embed", "layers"),
+    )
+    with nn_partitioning.axis_rules(rules):
+      with jax.set_mesh(self.mesh):


with jax.set_mesh(self.mesh), nn_partitioning.axis_rules(rules):

NuojCheng · 2026-06-22T17:07:33Z

+    # When matching 'mlp', 'fsdp' is already bound, so it is skipped (unassigned/None).
+    rules = (
+        ("embed", ("fsdp", "layers")),
+        ("mlp", "fsdp"),


IMO this is an error caused by the rule. We should throw an error if there is a conflict instead of silently solving it.

In base.yml, both mlp and embed map to the "fsdp_transpose" physical axis:

['mlp', ['fsdp_transpose', 'tensor', 'tensor_sequence', 'autoregressive']], ... ['embed', ['fsdp', 'fsdp_transpose', 'tensor_transpose', 'context', 'expert']],

And this is used in MLP layer:

kernel_axes=("embed", "mlp"),

I guess this is not a config issue.

Linen resolved this by calling remove_size_one_mesh_axis on the physical specs. We will now apply the same to NNX.

I would expect 'embed' to be downgraded to the next item (maxtext/src/maxtext/configs/base.yml, line 558 in 0b9f604):

['embed', ['fsdp', 'tensor_transpose', 'context', 'expert']],

so the conflict no longer exists. Since your function automatically moves on to the next item in the list, I think the error you mentioned should be resolved?

You are right. Checked the Linen implementation, and it does fall to the next item.
I modified the NNX code to re-use the linen nn.logical_to_mesh_axes(), instead of implementing the same func for NNX.
Now they should have the same behaviour.

NuojCheng

Overall I like this feature. I was concerned that this function would silently hide some errors that should be explicitly raised, e.g. conflicting logical rules. I agree we should improve from_sharding_rules if it simply treats the logical rules as a dictionary.

xibinliu · 2026-06-23T01:56:57Z

Overall I like this feature. I was concerned that this function would silently hide some errors that should be explicitly raised, e.g. conflicting logical rules. I agree we should improve from_sharding_rules if it simply treats the logical rules as a dictionary.

With a pure NNX implementation of MaxText models, are we still allowed to import Linen functions? If the answer is yes, then we can continue to re-use the Linen implementation. I think we can decide during the Linen removal phase. (PR#12)

Pure NNX training runs previously used custom logical sharding resolution helpers which diverged from the standard Flax Linen path, causing logical axis fallback mismatch and DuplicateSpecErrors when multiple logical dimensions mapped to a single physical axis. This change aligns the NNX path with Flax Linen and consolidates utilities:

1. Replaced the custom rules resolution logic with standard Flax Linen `logical_to_mesh_axes` to ensure identical behavior for rules mapping.
2. Added the `remove_size_one_mesh_axis` reduction step inside the NNX variable resolver to strip size-1 axes from the PartitionSpec, preventing JAX from raising DuplicateSpecError on models with overlapping axis mappings.
3. Aligned the variable wrappers and extraction lifecycle:
   - `sharding.nnx_construct_named_sharding` and `sharding.get_nnx_var_named_sharding_with_scan_axis` retain standard Flax NNX `Variable` / `Param` wrappers to maintain structural type compatibility during multi-tree maps in trainer setup.
   - `maxtext_utils_nnx.nnx_extract_named_sharding` extracts clean JAX-native `NamedSharding` trees for compilation and device dispatch.
4. Cleaned up comments and unit tests (in `sharding_nnx_test.py` and `maxtext_utils_nnx_test.py`) to verify behavior on local meshes and support CPU-only testing environments by avoiding host offloading during JIT.
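The size-one reduction described above can be illustrated with a small pure-Python sketch. The helper `drop_size_one_axes`, its signature, and the mesh shape are assumptions made for illustration; this is not the actual `remove_size_one_mesh_axis` from MaxText, whose signature is not shown in this PR.

```python
def drop_size_one_axes(spec, mesh_shape):
    """Hypothetical stand-in for a size-one axis reduction (illustration only).

    `spec` is a tuple with one entry per tensor dimension: None, a mesh axis
    name, or a tuple of mesh axis names. `mesh_shape` maps axis name to size.
    """
    cleaned = []
    for entry in spec:
        if entry is None:
            cleaned.append(None)
        elif isinstance(entry, str):
            cleaned.append(entry if mesh_shape[entry] > 1 else None)
        else:
            kept = tuple(axis for axis in entry if mesh_shape[axis] > 1)
            cleaned.append(kept if kept else None)
    return tuple(cleaned)


# Assumed mesh: only "fsdp" has more than one device along it.
mesh_shape = {"fsdp": 4, "fsdp_transpose": 1, "tensor": 1}

# "fsdp_transpose" appears in both dimensions, which would be a duplicate.
spec = (("fsdp", "fsdp_transpose"), ("fsdp_transpose", "tensor"))

cleaned = drop_size_one_axes(spec, mesh_shape)
print(cleaned)
```

In this example the repeated `fsdp_transpose` entry disappears because the axis has size 1 on the assumed mesh. A mesh where that axis is larger than 1 would still carry the overlap, so the reduction alone does not replace conflict-free rules.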

xibinliu force-pushed the xibin/nnx_sharding branch 3 times, most recently from e044d17 to 0730724 Compare June 19, 2026 02:06

xibinliu marked this pull request as ready for review June 19, 2026 06:12

NuojCheng reviewed Jun 22, 2026

View reviewed changes

xibinliu force-pushed the xibin/nnx_sharding branch from 0730724 to 80b40f7 Compare June 22, 2026 22:59

xibinliu requested review from Lumosis, gpolovets1, jrplatin, mailvijayasingh, mitalisi and patemotter as code owners June 22, 2026 22:59

xibinliu force-pushed the xibin/nnx_sharding branch 9 times, most recently from 94b165c to c4c6d80 Compare June 23, 2026 01:30

xibinliu force-pushed the xibin/nnx_sharding branch from c4c6d80 to 0750360 Compare June 23, 2026 02:41

