[Diloco] add DCN bandwidth throttling options#4219

Open
khatwanimohit wants to merge 1 commit into main from mohit/dcn_throttling

@khatwanimohit khatwanimohit commented Jun 22, 2026

Collaborator

Description

This change adds programmatic traffic control (tc) bandwidth throttling to MaxText and includes a dedicated DCN bandwidth microbenchmark script. This makes it possible to test multi-slice training under simulated network conditions (e.g., limited inter-slice DCN bandwidth).

Key Changes

• Centralized Throttling Utilities: Implemented apply_dcn_throttling and cleanup_dcn_throttling in train_utils.py using the Linux traffic control (tc) utility with a Token Bucket Filter (tbf) queuing discipline.
• Training Integration: Integrated DCN throttling into train.py's main training loop, applying the configured traffic rules at startup and cleaning them up in the finally block of train_loop.
• Config Specifications: Introduced default configurations in base.yml and types.py for the following variables:

  - dcn_bandwidth_limit
  - dcn_bandwidth_burst
  - dcn_bandwidth_latency
  - dcn_bandwidth_interface

• Microbenchmarking Script: Added scripts/dcn_bandwidth_test.py, which creates a hybrid multi-slice device mesh and runs collective shard-mapped psum operations to measure and print the latency and achieved DCN bandwidth under throttling constraints.
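
The tc/tbf approach described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the helper names (build_tbf_add_cmd, build_qdisc_del_cmd, run_checked, throttled) are hypothetical, and it assumes the four config values map directly onto the tbf rate, burst, and latency parameters plus the interface name. The interface "eth0" and rate "1gbit" are placeholders. The command runner is injected so the sketch can be dry-run without root privileges, which real tc calls require.

```python
import contextlib
import subprocess
from typing import Callable, Iterator, List


def build_tbf_add_cmd(interface: str, rate: str, burst: str, latency: str) -> List[str]:
  """Builds the argv that attaches a Token Bucket Filter as the root qdisc."""
  return [
      "tc", "qdisc", "add", "dev", interface, "root", "tbf",
      "rate", rate, "burst", burst, "latency", latency,
  ]


def build_qdisc_del_cmd(interface: str) -> List[str]:
  """Builds the argv that removes the root qdisc, undoing the throttle."""
  return ["tc", "qdisc", "del", "dev", interface, "root"]


def run_checked(cmd: List[str]) -> None:
  """Default runner: executes the command and raises on a non-zero exit."""
  subprocess.run(cmd, check=True)


@contextlib.contextmanager
def throttled(
    interface: str,
    rate: str,
    burst: str,
    latency: str,
    run: Callable[[List[str]], None] = run_checked,
) -> Iterator[None]:
  """Applies the throttle on entry and always removes it on exit."""
  run(build_tbf_add_cmd(interface, rate, burst, latency))
  try:
    yield
  finally:
    # Mirrors cleanup in a finally block: this runs even if the body raises.
    run(build_qdisc_del_cmd(interface))


if __name__ == "__main__":
  # Dry run: record the commands instead of executing them.
  issued: List[List[str]] = []
  with throttled("eth0", "1gbit", "10mb", "50ms", run=issued.append):
    pass
  for cmd in issued:
    print(" ".join(cmd))
```

Passing a recording runner such as a list's append method, as in the dry run above, makes the add/delete pairing checkable in unit tests on machines where tc is unavailable.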

If the change fixes a bug or a GitHub issue, please include a link, e.g.,:
FIXES: b/526650570

Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.

Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.

Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

diloco_sync_period: 36
diloco_outer_lr: 0.3
diloco_outer_momentum: 0.9
dcn_bandwidth_limit: ""

Collaborator

Please add comments

Comment thread scripts/dcn_bandwidth_test.py Outdated
@@ -0,0 +1,155 @@
# Copyright 2026 Google LLC

Collaborator

Why is it not under tests/?

Collaborator Author

This test is meant to run on a multislice setup. I can add it there, but then I will have to --ignore it in various places so pytest does not pick it up.

codecov Bot commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 48.27586% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/utils/train_utils.py | 46.42% | 14 Missing and 1 partial ⚠️ |

@khatwanimohit khatwanimohit force-pushed the mohit/dcn_throttling branch from 8356d90 to 1314ca9 Compare June 22, 2026 18:31

@richjames0 richjames0 left a comment

Collaborator

lgtm with a couple of nits

Comment thread scripts/dcn_bandwidth_test.py Outdated
    for line in route_output.splitlines():
      if "default" in line:
        return line.split("dev")[1].strip().split()[0]
  except Exception: # pylint: disable=bare-except

Collaborator

seems a bit broad
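
One way to narrow the handler, sketched under assumptions: the route text is taken to come from `ip route show`, the fallback interface name "eth0" is a placeholder, and both function names are hypothetical rather than taken from the script. Parsing is separated from the subprocess call so that only the errors the call can raise are caught, and a missing `dev` field yields None instead of an IndexError.

```python
import subprocess
from typing import Optional


def parse_default_interface(route_output: str) -> Optional[str]:
  """Returns the device named on the first 'default' route line, or None."""
  for line in route_output.splitlines():
    fields = line.split()
    if fields and fields[0] == "default" and "dev" in fields:
      idx = fields.index("dev")
      # Guard against a trailing 'dev' with no interface name after it.
      if idx + 1 < len(fields):
        return fields[idx + 1]
  return None


def get_default_interface(fallback: str = "eth0") -> str:
  """Looks up the default-route interface, using `fallback` on failure."""
  try:
    route_output = subprocess.check_output(["ip", "route", "show"], text=True)
  except (OSError, subprocess.CalledProcessError):
    # OSError covers a missing `ip` binary; CalledProcessError covers a
    # non-zero exit status. Other exceptions propagate to the caller.
    return fallback
  return parse_default_interface(route_output) or fallback
```

Because the parser is a pure function, it can be unit-tested with canned route output on any machine.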

Comment thread scripts/dcn_bandwidth_test.py Outdated

  total_devices = jax.device_count()
  if total_devices != (dcn_size * ici_size):
    raise ValueError(

Collaborator

Unnecessary line break? Or maybe it's right on the line limit.

Comment thread scripts/dcn_bandwidth_test.py Outdated
      dcn_bandwidth_burst="10mb",
      dcn_bandwidth_latency="50ms",
      dcn_bandwidth_interface=interface,
  )

Collaborator

maybe add more tests for different configs?

Comment thread src/maxtext/trainers/pre_train/train.py Outdated
      state,
  ) = train_utils.setup_train_loop(config, recorder)

  train_utils.apply_dcn_throttling(config)

Collaborator

Can we have a simple DCN performance regression test in the PR description?
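
A rough sketch of what such a check could compute, under stated assumptions: achieved bandwidth is approximated as payload bytes over wall-clock seconds, the 10% tolerance is arbitrary, and all function names are hypothetical. Only the bit-based tc rate suffixes are handled here (tc accepts others as well). Real psum traffic depends on the collective algorithm, and the tbf burst lets short transfers exceed the nominal rate, so this is a coarse sanity check rather than an exact model.

```python
import re

# SI bit-rate suffixes as interpreted by tc (kbit = 1000 bits per second).
_RATE_UNITS = {"bit": 1.0, "kbit": 1e3, "mbit": 1e6, "gbit": 1e9, "tbit": 1e12}


def parse_rate_bits_per_sec(rate: str) -> float:
  """Parses a rate such as '1gbit' or '500mbit' into bits per second."""
  match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*", rate.lower())
  if not match or match.group(2) not in _RATE_UNITS:
    raise ValueError(f"Unsupported rate string: {rate!r}")
  return float(match.group(1)) * _RATE_UNITS[match.group(2)]


def achieved_bits_per_sec(payload_bytes: int, seconds: float) -> float:
  """Approximates throughput as transferred bits over elapsed seconds."""
  if seconds <= 0:
    raise ValueError("seconds must be positive")
  return payload_bytes * 8 / seconds


def within_limit(payload_bytes: int, seconds: float, limit: str, tolerance: float = 0.10) -> bool:
  """Returns True if measured throughput stays within the throttle limit."""
  ceiling = parse_rate_bits_per_sec(limit) * (1.0 + tolerance)
  return achieved_bits_per_sec(payload_bytes, seconds) <= ceiling
```

For example, moving 125,000,000 bytes in one second is 1e9 bits per second, which passes against a "1gbit" limit, while twice that payload in the same time does not.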

Comment thread src/maxtext/trainers/pre_train/train.py Outdated
state,
) = train_utils.setup_train_loop(config, recorder)

train_utils.apply_dcn_throttling(config)

Collaborator

Rename the method to maybe_apply_dcn_throttling. Add a comment on this line that the default flag value is false; it's otherwise concerning to see throttling applied in train.py.
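
A minimal sketch of the suggested guard, assuming that throttling is considered disabled when dcn_bandwidth_limit is the empty string (the default shown in the base.yml snippet). The apply_fn parameter is a hypothetical injection point standing in for the real tc call; the PR's actual signature is not shown here.

```python
from types import SimpleNamespace
from typing import Any, Callable


def maybe_apply_dcn_throttling(config: Any, apply_fn: Callable[[Any], None]) -> bool:
  """Applies throttling only when a limit is configured.

  Returns True if throttling was applied, False if it was skipped.
  """
  # dcn_bandwidth_limit defaults to "" (assumed to mean "disabled"), so a
  # default training run never touches the network configuration.
  if not config.dcn_bandwidth_limit:
    return False
  apply_fn(config)
  return True


if __name__ == "__main__":
  applied = []
  default_config = SimpleNamespace(dcn_bandwidth_limit="")
  throttled_config = SimpleNamespace(dcn_bandwidth_limit="1gbit")
  print(maybe_apply_dcn_throttling(default_config, applied.append))
  print(maybe_apply_dcn_throttling(throttled_config, applied.append))
```

Returning a boolean also tells the caller whether the matching cleanup step is needed in the finally block.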

@khatwanimohit khatwanimohit force-pushed the mohit/dcn_throttling branch from d979229 to 0978b3d Compare June 23, 2026 01:44
@khatwanimohit khatwanimohit force-pushed the mohit/dcn_throttling branch from 0978b3d to 98f0c5d Compare June 23, 2026 02:58
5 participants