ci(disagg): fail before writing result file + surface real failure class #1591
base: main
    @@ -67,6 +67,31 @@
             echo "=== Slurm job stderr ==="
             tail -100 "$err_file"
             echo "========================"
    +        # Surface the real failure class in the Actions UI. Without this, a
    +        # launch failure shows only the generic "No benchmark result files
    +        # found" from benchmark-multinode-tmpl.yml. Order matters: check the
    +        # deterministic recipe error (model-not-found, #1581) before the
    +        # transport-flake patterns (#1584 MoRI/readiness) so a config bug is
    +        # never mislabeled as a flake.
    +        if [[ -n "${GITHUB_ACTIONS:-}" ]]; then
    +            local sig=""
    +            if grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then
    +                sig="recipe-error: model not found (deterministic - check MODEL/MODEL_PATH, not MoRI)"
    +            elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then
    +                sig="transport-flake: readiness/warmup timeout (MoRI pd-disagg)"
    +            elif grep -qiE "Fp8BlockwiseQuant.*IntraNode|dispatch_combine|combine.*IntraNode" "$err_file"; then
    +                sig="config-error: MoRI fp8_blockwise combine needs IntraNode (disable TBO/SDMA on FP4 prefill, #1584)"
    +            elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then
    +                sig="transport-flake: MoRI KV-transport error"
    +            elif grep -qiE "segfault|Segmentation fault|signal 11|core dumped|gpucore" "$err_file"; then
    +                sig="transport-flake: server segfault / core dump"
    +            fi
    +            if [[ -n "$sig" ]]; then
    +                echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::${sig} (see slurm .err artifact)"
    +            else
    +                echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
    +            fi
    +        fi
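The "order matters" comment in the hunk above can be checked in isolation. The following is a minimal sketch, not code from the PR: `classify_failure` is a hypothetical helper that repackages three of the diff's patterns as a first-match-wins chain, and the log lines are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): prints a short label for the first
# pattern family that matches, mirroring the if/elif order used in the diff.
classify_failure() {
    local err_file="$1"
    if grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then
        echo "recipe-error"
    elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then
        echo "transport-flake: timeout"
    elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then
        echo "transport-flake: MoRI"
    else
        echo "unclassified"
    fi
}

# Invented log content: a MoRI line appears BEFORE the model-not-found line.
demo_log="$(mktemp)"
printf '%s\n' \
    "MoRI connector shutting down" \
    "FATAL: Model 'demo-model' not found" > "$demo_log"

# The deterministic recipe error wins because its branch is tested first,
# regardless of where the matching lines sit in the file.
classify_failure "$demo_log"   # prints: recipe-error
rm -f "$demo_log"
```

Swapping the first and third branches would label the same log as a MoRI flake, which is the mislabeling the diff comment warns about.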
Check failure on line 94 in runners/launch_mi355x-amds.sh
Comment on lines +70 to +94
Contributor
🔴 The new

What the bug is
Neither indicates that the job actually failed. There is no

Why a successful run will hit it
SGLang/vLLM multinode jobs with MoRI + NCCL + SLURM essentially always produce non-empty stderr on success:

Two false-positive paths on green runs
Either path renders a red banner in the Actions UI claiming the job FAILED on a run whose exit code is 0 and whose result JSON was written normally.

Step-by-step proof

Existing code already acknowledges this
The pre-existing

Impact
The annotation directly defeats the PR's stated goal ("surface real failure class", "fail loudly and legibly"). After a few green-but-red runs, reviewers learn to ignore the annotation entirely, and the next real failure looks identical to the prior false positives.

Fix
Capture the trap-entry exit code as the very first line of

cleanup_and_save_logs() {
local rc=$? # MUST be first statement
# ... existing cp/tail logic unchanged ...
if [[ $rc -ne 0 && -n "${GITHUB_ACTIONS:-}" && -s "$err_file" ]]; then
# existing classifier + ::error:: emission
fi
}
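
The gating idea in the snippet above can be tried outside the runner script. This is a minimal sketch under stated assumptions: `cleanup_demo` and `simulate_job` are hypothetical names standing in for the real handler and the Slurm job, and the subshell only simulates a script exiting with a given status.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real EXIT handler.
cleanup_demo() {
    local rc=$?   # must be the first statement: any earlier command overwrites $?
    if [ "$rc" -ne 0 ]; then
        echo "::error title=demo job failed::exit code ${rc}"
    fi
}

# Simulate a job script: install the trap in a subshell, then exit with the
# requested status so the handler sees it in $?.
simulate_job() {
    ( trap cleanup_demo EXIT; exit "$1" )
}

simulate_job 0            # success: the handler runs but prints nothing
simulate_job 3 || true    # failure: prints ::error title=demo job failed::exit code 3
```

Because the status is read before anything else runs in the handler, later `cp` or `tail` calls inside the trap cannot mask the job's real exit code.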

         fi
         sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true
     }
Error annotation emits on successful runs with stderr
Medium Severity
The cleanup_and_save_logs EXIT trap fires on all exits, including successful ones. The new ::error:: annotation block only checks whether the .err file is non-empty (-s), not whether the job actually failed. Slurm .err files commonly contain content on success (Python deprecation warnings, CUDA init messages, library logs). On a successful run with any stderr output, this will either match a broad pattern like MoRI (line 84, likely present in normal operational logs) and emit a false transport-flake annotation, or fall through and emit the misleading "Unclassified failure" annotation. This defeats the purpose of surfacing real failure classes by flooding the Actions UI with false positives.

Reviewed by Cursor Bugbot for commit 992b90f.
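
The broad-pattern concern is easy to reproduce. The sketch below is illustrative only: `matches_mori_pattern` is a hypothetical wrapper around the PR's MoRI regex, and the log line is an invented example of benign startup output, not taken from a real run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the MoRI pattern from the diff.
matches_mori_pattern() {
    printf '%s\n' "$1" | grep -qiE "MoRI|mori_conn|pd[- ]?disagg"
}

# Invented example of an informational line a healthy run might log.
benign_line="INFO: MoRI transport initialized on 8 ranks"

if matches_mori_pattern "$benign_line"; then
    echo "benign line would be labeled as a MoRI transport flake"
fi
```

Any line that merely mentions the transport by name matches, so the pattern alone cannot distinguish a failure from normal operation; the exit code has to carry that signal.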