Skip to content

fix(generate_wan): avoid process0-only gate for profiling run#426

Open
edgexyz wants to merge 1 commit into
AI-Hypercomputer:mainfrom
edgexyz:main
Open

fix(generate_wan): avoid process0-only gate for profiling run#426
edgexyz wants to merge 1 commit into
AI-Hypercomputer:mainfrom
edgexyz:main

Conversation

@edgexyz

@edgexyz edgexyz commented Jun 22, 2026

Copy link
Copy Markdown

Root Cause:
In commit 589c3d5, the profiling pass in generate_wan.run() was gated on
max_utils.profiler_enabled(config). That helper includes an implicit
jax.process_index() == 0 guard (via _jax_profiler_enabled /
_ml_diagnostics_profiler_enabled) so it returns True only on the coordinator.
This causes only process 0 to enter the profiling block and call call_pipeline,
while all other processes skip it and return. Since call_pipeline issues
collective JAX ops (AllReduce, AllGather, etc.) that require a barrier from every
process, the job hangs indefinitely with no error message. The bug is invisible on
single-device runs, which masked it.

Fix:
Replace the gate with original_enable_profiler — the raw boolean saved from
config.enable_profiler before the warm-up pass temporarily disables it. This is
True on all processes when the user sets enable_profiler=True, so every process
enters the profiling pass together and collectives succeed. The
jax.process_index() == 0 guard inside profiler_enabled is correct for profiler
context management and remains untouched.

Verified end-to-end on a 4-process TPU v4 pod (16 devices), WAN2.2-I2V
1920×1088×81, enable_profiler=True:

Before (589c3d5): warm-up completed, profiling pass deadlocked.

generation_time: 522.09s   # warm-up ok on all 4 processes
# hung — generation_time_with_profiler never logged

DEADLINE_EXCEEDED: Barrier timed out.
# of tasks that reached the barrier: 3/4.
The first task at the barrier: 0.
Aborted (core dumped)

Process 0 blocked inside call_pipeline; processes 1–3 exited and hit the
shutdown barrier first (3/4 reached it → timeout → abort).

After (f62927f): all processes complete, XLA trace produced.

Warm generation                : count=4, mean=522.26s
Warm generation with profiler  : count=4, mean=530.24s
Trace artifact count           : 2

@edgexyz edgexyz requested a review from entrpn as a code owner June 22, 2026 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant