[Diloco] add DCN bandwidth throttling options #4219
diloco_sync_period: 36
diloco_outer_lr: 0.3
diloco_outer_momentum: 0.9
dcn_bandwidth_limit: ""
@@ -0,0 +1,155 @@
# Copyright 2026 Google LLC
Why is this not under tests/?
This test is meant to run on a multislice setup. I can add it there, but then I will have to --ignore it in various places so pytest does not pick it up.
Force-pushed: 8356d90 to 1314ca9
richjames0 left a comment
lgtm with a couple of nits
    for line in route_output.splitlines():
      if "default" in line:
        return line.split("dev")[1].strip().split()[0]
  except Exception:  # pylint: disable=bare-except

  total_devices = jax.device_count()
  if total_devices != (dcn_size * ici_size):
    raise ValueError(
Unnecessary line break? Or maybe it's right at the line-length limit.
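For reference on the interface lookup in the hunk above: `line.split("dev")` splits on the substring, so an interface whose name itself contains "dev" would break it. A minimal token-based sketch follows; the function name `default_interface` and the sample route text are hypothetical, not taken from the PR.

```python
def default_interface(route_output):
    """Return the device named on the default route, or None if absent.

    Hypothetical helper: parses text in the style of `ip route` output by
    whitespace-separated tokens instead of splitting on the substring "dev".
    """
    for line in route_output.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "default" and "dev" in tokens:
            idx = tokens.index("dev")
            if idx + 1 < len(tokens):
                return tokens[idx + 1]
    return None


# Illustrative input only; real output varies by host.
sample = (
    "default via 10.0.0.1 dev eth0 proto dhcp metric 100\n"
    "10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5\n"
)
print(default_interface(sample))
```

Returning None instead of raising leaves it to the caller to decide whether a missing default route is an error.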
      dcn_bandwidth_burst="10mb",
      dcn_bandwidth_latency="50ms",
      dcn_bandwidth_interface=interface,
  )
maybe add more tests for different configs?
      state,
  ) = train_utils.setup_train_loop(config, recorder)

  train_utils.apply_dcn_throttling(config)
can we have a simple DCN performance regression test in PR description?
      state,
  ) = train_utils.setup_train_loop(config, recorder)

  train_utils.apply_dcn_throttling(config)
Rename the method to maybe_apply_dcn_throttling. Add a comment on this line that the default flag value is false; otherwise it's concerning to see throttling applied in train.py.
Force-pushed: d979229 to 0978b3d
Force-pushed: 0978b3d to 98f0c5d
Description
This change introduces programmatic traffic control (tc) bandwidth throttling support in MaxText and includes a dedicated DCN bandwidth microbenchmark script. This makes it possible to test under simulated network conditions (e.g., limited inter-slice DCN bandwidth) during multi-slice training.
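As a rough illustration of what tc-based throttling involves, the sketch below builds the argument lists for installing and removing a Token Bucket Filter (tbf) root qdisc. The helper names, the interface "eth0", and the rate "1gbit" are assumptions for illustration; the burst and latency defaults mirror the values in the PR's test, and this is not the PR's actual implementation.

```python
import shlex


def build_tbf_commands(interface, rate, burst="10mb", latency="50ms"):
    """Return (apply, cleanup) argv lists for a tbf root qdisc on `interface`."""
    if not interface or not rate:
        raise ValueError("interface and rate must be non-empty")
    apply_cmd = [
        "tc", "qdisc", "replace", "dev", interface, "root",
        "tbf", "rate", rate, "burst", burst, "latency", latency,
    ]
    cleanup_cmd = ["tc", "qdisc", "del", "dev", interface, "root"]
    return apply_cmd, cleanup_cmd


def maybe_build_tbf_commands(limit, interface):
    """Return None when no limit is configured (the empty-string default)."""
    if not limit:
        return None
    return build_tbf_commands(interface, limit)


# "eth0" and "1gbit" are illustrative values, not taken from the PR.
commands = maybe_build_tbf_commands("1gbit", "eth0")
for argv in commands:
    print(shlex.join(argv))
```

The commands are only printed here. Running them (for example with `subprocess.run(argv, check=True)`) requires root privileges and changes the host's network configuration, so the cleanup command belongs in a `finally` block.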
Key Changes
• Centralized Throttling Utilities: Implemented apply_dcn_throttling and cleanup_dcn_throttling in train_utils.py using the Linux traffic control (tc) utility with a Token Bucket Filter (tbf) queuing discipline.
• Training Integration: Integrated DCN throttling into the main training loop in train.py, applying the configured traffic rules at startup and removing them in the finally block of train_loop.
• Config Specifications: Introduced defaults in base.yml and types.py for the new throttling options, including dcn_bandwidth_limit (empty by default).
• Microbenchmarking Script: Added scripts/dcn_bandwidth_test.py, which creates a hybrid multi-slice device mesh and runs shard-mapped collective psum operations to measure and print latency and achieved DCN bandwidth under throttling constraints.
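To make "achieved DCN bandwidth" concrete, here is a minimal sketch of one way to turn a measured psum latency into a bandwidth figure. It assumes a ring-style all-reduce, in which each participant sends about 2(n-1)/n times the payload; the script in this PR may use a different formula, and both function names are hypothetical.

```python
import statistics
import time


def allreduce_bandwidth_gbps(payload_bytes, num_slices, seconds):
    """Estimate per-participant bandwidth in Gbit/s for one all-reduce.

    Assumes a ring all-reduce, where each participant sends roughly
    2 * (n - 1) / n times the payload size.
    """
    if num_slices < 2:
        raise ValueError("need at least two slices for cross-slice traffic")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bytes_on_wire = 2 * (num_slices - 1) / num_slices * payload_bytes
    return bytes_on_wire * 8 / seconds / 1e9


def median_latency(fn, warmup=2, iters=5):
    """Time `fn` and return the median wall-clock latency in seconds.

    With JAX, `fn` must block until the result is ready; otherwise only the
    asynchronous dispatch is timed.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Example: a 1 GB payload reduced across 2 slices in 1 second.
print(allreduce_bandwidth_gbps(1_000_000_000, 2, 1.0))
```

Comparing this estimate against the configured limit is a quick sanity check that the throttle is actually in effect.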
FIXES: b/526650570