Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions runners/launch_mi355x-amds.sh
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,31 @@
echo "=== Slurm job stderr ==="
tail -100 "$err_file"
echo "========================"
# Surface the real failure class in the Actions UI. Without this, a
# launch failure shows only the generic "No benchmark result files
# found" from benchmark-multinode-tmpl.yml. Order matters: check the
# deterministic recipe error (model-not-found, #1581) before the
# transport-flake patterns (#1584 MoRI/readiness) so a config bug is
# never mislabeled as a flake.
if [[ -n "${GITHUB_ACTIONS:-}" ]]; then
local sig=""
if grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then
sig="recipe-error: model not found (deterministic - check MODEL/MODEL_PATH, not MoRI)"
elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then
sig="transport-flake: readiness/warmup timeout (MoRI pd-disagg)"
elif grep -qiE "Fp8BlockwiseQuant.*IntraNode|dispatch_combine|combine.*IntraNode" "$err_file"; then
sig="config-error: MoRI fp8_blockwise combine needs IntraNode (disable TBO/SDMA on FP4 prefill, #1584)"
elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then
sig="transport-flake: MoRI KV-transport error"
elif grep -qiE "segfault|Segmentation fault|signal 11|core dumped|gpucore" "$err_file"; then
sig="transport-flake: server segfault / core dump"
fi
if [[ -n "$sig" ]]; then
echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::${sig} (see slurm .err artifact)"
else
echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
fi
fi

Check failure on line 94 in runners/launch_mi355x-amds.sh

View check run for this annotation

Claude / Claude Code Review

::error:: annotation fires on successful runs

The new `::error::` annotation block in `cleanup_and_save_logs` is gated only on stderr being non-empty and `GITHUB_ACTIONS` being set — neither implies the job actually failed. Since the trap runs on EXIT unconditionally (line 101) and SLURM .err is essentially always non-empty on a successful MoRI run (NCCL/MoRI INFO banners, tqdm to stderr, Python warnings, srun output), every green run will emit a red `::error title=AMD disagg job ... failed::` banner — inverting the PR's stated goal. Fix: c
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error annotation emits on successful runs with stderr

Medium Severity

The cleanup_and_save_logs EXIT trap fires on all exits, including successful ones. The new ::error:: annotation block only checks whether the .err file is non-empty (-s), not whether the job actually failed. Slurm .err files commonly contain content on success (Python deprecation warnings, CUDA init messages, library logs). On a successful run with any stderr output, this will either match a broad pattern like MoRI (line 84 — likely present in normal operational logs) and emit a false transport-flake annotation, or fall through and emit the misleading "Unclassified failure" annotation. This defeats the purpose of surfacing real failure classes by flooding the Actions UI with false positives.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 992b90f. Configure here.

Comment on lines +70 to +94
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new ::error:: annotation block in cleanup_and_save_logs is gated only on stderr being non-empty and GITHUB_ACTIONS being set — neither implies the job actually failed. Since the trap runs on EXIT unconditionally (line 101) and SLURM .err is essentially always non-empty on a successful MoRI run (NCCL/MoRI INFO banners, tqdm to stderr, Python warnings, srun output), every green run will emit a red ::error title=AMD disagg job ... failed:: banner — inverting the PR's stated goal. Fix: capture local rc=$? as the very first line of the function and gate the entire ::error:: emission on rc -ne 0 (leaving the diagnostic .err tail unconditional).

Extended reasoning...

What the bug is

cleanup_and_save_logs is registered as a generic EXIT trap at line 101 (trap cleanup_and_save_logs EXIT), so it runs on every script exit — both success (exit 0) and failure. The new annotation block (lines 76–94) is guarded by only two conditions:

  1. [[ -s "$err_file" ]] — stderr file is non-empty (line 66 enclosing block)
  2. [[ -n "${GITHUB_ACTIONS:-}" ]] — running under Actions (line 76)

Neither indicates that the job actually failed. There is no $? capture and no exit-code gate.

Why a successful run will hit it

SGLang/vLLM multinode jobs with MoRI + NCCL + SLURM essentially always produce non-empty stderr on success:

  • tqdm writes progress bars to stderr by default
  • NCCL/RCCL prints INFO banners to stderr
  • warnings.warn(...) writes to stderr
  • SGLang/vLLM server INFO logs often route to stderr
  • srun --unbuffered and docker stop $(docker ps -a -q) || true (job.slurm) leak banners and "no such container" messages to stderr — the trailing || true swallows the exit code but not the output
  • The submit.sh wrapper sets --error=slurm_job-%j.err, so all of the above lands in the file the trap inspects

Two false-positive paths on green runs

  1. Pattern match (line 84): grep -qiE "MoRI|mori_conn|pd[- ]?disagg" matches case-insensitively against the literal string MoRI anywhere in stderr. For a MoRI-based benchmark this is effectively guaranteed (framework name appears in INFO/log lines and startup banners). Result: ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error.
  2. Else branch (line 92): Even when no class matches, any non-empty unmatched stderr emits ::error title=AMD disagg job <id> failed::Unclassified failure - see last 100 lines of slurm .err above.

Either path renders a red banner in the Actions UI claiming the job FAILED on a run whose exit code is 0 and whose result JSON was written normally.

Step-by-step proof

  1. Script runs, trap cleanup_and_save_logs EXIT is set (line 101).
  2. Multinode benchmark completes successfully, exit code 0.
  3. cleanup_and_save_logs fires on EXIT.
  4. err_file=$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.err is non-empty (contains NCCL/MoRI INFO banners + tqdm + srun stderr).
  5. [[ -s "$err_file" ]] → true → enter the block.
  6. [[ -n "${GITHUB_ACTIONS:-}" ]] → true (we are in CI) → enter annotation block.
  7. First three patterns (Model not found, ReadTimeout, Fp8BlockwiseQuant) don't match a clean run.
  8. Fourth elif: grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file" → matches the literal "MoRI" in any startup log line → sig="transport-flake: MoRI KV-transport error".
  9. Line 90 emits ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error (see slurm .err artifact).
  10. Actions UI shows a red error banner on a successful, exit-0 run.

Existing code already acknowledges this

The pre-existing if [[ -s "$err_file" ]] guard at line 66 plus the unconditional tail -100 block already implicitly acknowledges that .err is routinely non-empty on non-failure runs (otherwise the tail print wouldn't be conditional). What the new block does is escalate that benign non-emptiness into a misleading ::error:: annotation.

Impact

The annotation directly defeats the PR's stated goal ("surface real failure class", "fail loudly and legibly"). After a few green-but-red runs, reviewers learn to ignore the annotation entirely, and the next real failure looks identical to the prior false positives.

Fix

Capture the trap-entry exit code as the very first line of cleanup_and_save_logs and gate only the ::error:: emission on it. The diagnostic .err tail (lines 67–69) is fine to leave unconditional since it's only printed output, not an Actions annotation:

cleanup_and_save_logs() {
    local rc=$?  # MUST be first statement
    # ... existing cp/tail logic unchanged ...
    if [[ $rc -ne 0 && -n "${GITHUB_ACTIONS:-}" && -s "$err_file" ]]; then
        # existing classifier + ::error:: emission
    fi
}

fi
sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true
}
Expand Down
25 changes: 15 additions & 10 deletions utils/bench_serving/benchmark_serving.py
Original file line number Diff line number Diff line change
Expand Up @@ -879,6 +879,21 @@ def main(args: argparse.Namespace):
lora_modules=args.lora_modules,
))

# Gate the run BEFORE writing any result file. A sub-threshold (or
# zero-completion) run must not leave a schema-valid JSON on disk:
# downstream collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml)
# treat file *existence* as success, so a written-then-failed file looks
# successful. Raising here keeps disk state consistent with the exit code.
max_failure_rate = 0.05
completed = benchmark_result["completed"]
failure_rate = 1 - completed / args.num_prompts
if failure_rate > max_failure_rate:
raise SystemExit(
f"FAIL: request failure rate {failure_rate:.1%} exceeds "
f"{max_failure_rate:.0%} threshold "
f"({completed}/{args.num_prompts} completed)"
)

# Save config and results to json
if args.save_result:
result_json: Dict[str, Any] = {}
Expand Down Expand Up @@ -940,16 +955,6 @@ def main(args: argparse.Namespace):
json.dump(result_json, outfile)
save_to_pytorch_benchmark_format(args, result_json, file_name)

max_failure_rate = 0.05
completed = benchmark_result["completed"]
failure_rate = 1 - completed / args.num_prompts
if failure_rate > max_failure_rate:
raise SystemExit(
f"FAIL: request failure rate {failure_rate:.1%} exceeds "
f"{max_failure_rate:.0%} threshold "
f"({completed}/{args.num_prompts} completed)"
)


if __name__ == "__main__":
parser = FlexibleArgumentParser(
Expand Down