Skip to content

feat(launcher): add Megatron-Bridge quantize/generate/export wrappers#1767

Open
yueshen2016 wants to merge 1 commit into
mainfrom
yueshen/launcher-megatron-bridge
Open

feat(launcher): add Megatron-Bridge quantize/generate/export wrappers#1767
yueshen2016 wants to merge 1 commit into
mainfrom
yueshen/launcher-megatron-bridge

Conversation

@yueshen2016

@yueshen2016 yueshen2016 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Type of change: new example

Adds Megatron-Bridge PTQ wrappers to the launcher so users can quantize / generate / export Hugging Face models through Megatron-Bridge on Slurm or Docker, mirroring the existing common/megatron_lm/{quantize,export}/ and common/megatron_bridge/import/ conventions.

  • Common scripts (tools/launcher/common/megatron_bridge/):

    • quantize/quantize.sh — wraps /opt/Megatron-Bridge/examples/quantization/quantize.py
    • generate/generate.sh — wraps /opt/Megatron-Bridge/examples/quantization/ptq_generate.py
    • export/export.sh — wraps /opt/Megatron-Bridge/examples/quantization/export.py

    All three assume nvcr.io/nvidia/nemo:26.04+ (Megatron-Bridge preinstalled), read configuration from env vars (HF_MODEL_ID, MEGATRON_SAVE_PATH/MEGATRON_LOAD_PATH, EXPORT_QUANT_CFG, TP/PP/EP/ETP, DTYPE, …), and launch via python -m torch.distributed.run with multi-node rendezvous (--nnodes/--node-rank/--master-addr/--master-port) derived from SLURM_NNODES/SLURM_NODEID/SLURM_NODELIST, falling back to --standalone for single-node.

  • Per-model example YAMLs (tools/launcher/examples/nvidia/):

    • NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 2 nodes × 4 GPUs (TP=2 PP=4 EP=2 for PTQ, TP=8 EP=8 for generate, TP=1 PP=8 for export), nvfp4 config.
    • NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 4 nodes × 4 GPUs for PTQ/generate (TP=8 PP=2 EP=8), 2 nodes × 4 GPUs for export (TP=1 PP=8), mamba_moe_nvfp4_conservative per super3 docs.
    • NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 18 nodes × 4 GPUs for PTQ/generate (TP=8 PP=9 EP=8, super-nvfp4-max-calib.yaml recipe, PYTORCH_ALLOC_CONF=expandable_segments:True), 3 nodes × 4 GPUs for export (TP=1 PP=12 EP=8 — PP bumped from doc's 9 so the world maps cleanly to 4-GPU nodes; 108 layers / 12 = 9 layers/stage), per ultra3 docs.

Usage

# Quantize Nemotron-3-Nano-30B-A3B to NVFP4 via Megatron-Bridge:
cd tools/launcher
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml --yes

# Smoke-test generation on the quantized MCore checkpoint:
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml --yes

# Export to deployable HF format:
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml --yes

# Same three-step flow works for Super and Ultra by swapping the model subdir.

Testing

  • bash -n syntax check on all three shell scripts (pass).
  • uv run launch.py --dryrun --yes on all 9 example YAMLs (all parse, resolve <<global_vars.X>>, and build the Fiddle SandboxPipeline cleanly).
  • End-to-end Slurm runs not yet exercised; need cluster access with the right node count for super (16 GPUs) and ultra (72 GPUs).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ — all files are new; no existing files modified.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A — no copied code, no new PIP dependencies.
  • Did you write any new necessary tests?: ❌ — the launcher's existing --dryrun path is the validation surface for example YAMLs and was used here; no new unit tests added. Happy to add launcher YAML regression tests if requested.
  • Did you update Changelog?: ❌ — pure launcher-example addition, no library API change. Can add a one-liner under 0.46 "New Features" if preferred (prior precedent: the moonshotai/ launcher examples were mentioned in 0.46).
  • Did you get Claude approval on this PR?: ❌ — to be run via /claude review after PR creation.

Additional Information

  • Upstream reference scripts: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/quantization
  • Note on trust_remote_code=True: the Nemotron-3 family requires it, and all three Nemotron YAMLs set TRUST_REMOTE_CODE=true consistent with the upstream NVIDIA-NeMo/Nemotron docs.
  • The Ultra-export YAML deliberately deviates from the upstream doc's --pp 9 to PP=12 so the world (12 ranks) maps cleanly to 3 × 4-GPU nodes; this is called out in the YAML's header comment.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Megatron-Bridge tools for model quantization, text generation, and checkpoint export workflows.
    • Added pre-configured job templates for three NVIDIA Nemotron model variants (Nano-30B, Super-120B, Ultra-550B) to streamline distributed training and inference operations.

@yueshen2016 yueshen2016 requested a review from a team as a code owner June 17, 2026 22:46
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds three reusable Bash launcher scripts (quantize.sh, export.sh, generate.sh) under tools/launcher/common/megatron_bridge/ for distributed Megatron-Bridge PTQ, HuggingFace checkpoint export, and text generation jobs. Also adds nine Slurm YAML pipeline configurations covering PTQ, export, and generate operations for three Nemotron model variants (Nano-30B, Super-120B, Ultra-550B).

Changes

Megatron-Bridge Launcher Scripts and Model Configs

Layer / File(s) Summary
PTQ quantization launcher script
tools/launcher/common/megatron_bridge/quantize/quantize.sh
Validates HF_MODEL_ID, sets defaults for OUTPUT_DIR, EXPORT_QUANT_CFG, MEGATRON_SAVE_PATH, and parallelism (TP, PP, EP, ETP, CALIB_SIZE), auto-detects NPROC_PER_NODE via nvidia-smi, builds Slurm-derived or standalone torchrun rendezvous args, assembles optional flags (compress, weight-only, KV cache, trust remote code, HF file lock, prompts), applies runtime workaround to rewrite deprecated dataset IDs, then execs torchrun invoking quantize.py.
HuggingFace export launcher script
tools/launcher/common/megatron_bridge/export/export.sh
Validates HF_MODEL_ID and MEGATRON_LOAD_PATH, sets output/parallelism/dtype defaults, auto-detects NPROC_PER_NODE, builds torchrun rendezvous args from Slurm or standalone mode, conditionally adds EXPORT_EXTRA_MODULES and TRUST_REMOTE_CODE flags, execs torchrun with export.py, then lists export directory and prints hf_quant_config.json if present.
PTQ generation launcher script
tools/launcher/common/megatron_bridge/generate/generate.sh
Validates HF_MODEL_ID and MEGATRON_LOAD_PATH, sets TP/PP/EP/ETP and OSL defaults, derives Slurm-based or standalone rendezvous args (including optional MASTER_ADDR via scontrol), builds optional flags from TRUST_REMOTE_CODE and PROMPTS, logs configuration, then execs torchrun invoking ptq_generate.py.
PTQ job configs (3 models)
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
Each config defines pipeline metadata and global output_dir, wires task_0 to invoke quantize.sh with model-specific env vars (HF model path, quantization config, parallelism, calibration size, trust remote code), and provisions Slurm/container execution with resource shapes from 2 nodes/4 GPUs (Nano, 4 h) to 18 nodes/4 GPUs (Ultra, 12 h).
Export job configs (3 models)
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
Each config invokes export.sh with HF model/load paths and EXPORT_DIR, fixes TP=1 with model-specific PP (8 for Nano/Super, 12 for Ultra), uses bfloat16 dtype and TRUST_REMOTE_CODE=true, and specifies Slurm shapes of 2 nodes (Nano) and 3 nodes (Ultra) with 2-hour limits.
Generate job configs (3 models)
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml, tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml
Each config invokes generate.sh with HF model id, templated MEGATRON_LOAD_PATH, model-specific TP/PP/EP/ETP, TRUST_REMOTE_CODE=true, and optional PYTORCH_ALLOC_CONF. Slurm shapes range from 2 nodes/4 GPUs (Nano) to 4 nodes/4 GPUs (Super) to 18 nodes/1 task per node/4 GPUs (Ultra), with 1–2 hour limits.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

cherry-pick-0.45.0

Suggested reviewers

  • kevalmorabia97
  • h-guo18
  • ChenhanYu

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error PR contains 9 hardcoded instances of TRUST_REMOTE_CODE: "true" across all example YAML files, violating SECURITY.md rule #3: trust_remote_code must be exposed as caller-configurable parameter, no... Comment out TRUST_REMOTE_CODE lines in all 9 YAML files with a note stating it should be set only when explicitly needed for trusted model sources, matching the pattern already flagged in review comments.
✅ Passed checks (5 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the main change: it introduces Megatron-Bridge quantize/generate/export wrapper scripts to the launcher framework, which is the primary focus of the changeset.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yueshen/launcher-megatron-bridge

Comment @coderabbitai help to get the list of available commands and usage tips.

@yueshen2016

Copy link
Copy Markdown
Contributor Author

/claude review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/megatron_bridge/export/export.sh`:
- Around line 67-74: After the MASTER_ADDR resolution attempt from
SLURM_NODELIST in the multi-node block (when NNODES > 1), add an explicit
validation check to ensure MASTER_ADDR is not empty. If MASTER_ADDR is still
unresolved at that point, exit the script immediately with a clear error message
before the RDZV_ARGS are constructed. This prevents passing an empty
--master-addr value and failing downstream.

In `@tools/launcher/common/megatron_bridge/generate/generate.sh`:
- Around line 58-65: When NNODES is greater than 1, the script attempts to
construct RDZV_ARGS with MASTER_ADDR, but if the Slurm hostname discovery using
scontrol fails to resolve, MASTER_ADDR remains empty and gets passed as an empty
value in the rendezvous arguments, causing runtime failure. Add a validation
check after the conditional block that attempts to set MASTER_ADDR that
explicitly verifies MASTER_ADDR is not empty when NNODES is greater than 1, and
fail with a clear error message if it is empty, preventing the script from
proceeding to build RDZV_ARGS with an invalid master address.

In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 68-75: When NNODES is greater than 1, the code attempts to resolve
MASTER_ADDR from SLURM_NODELIST using scontrol, but if this resolution fails the
variable remains empty and RDZV_ARGS is still assembled with the empty value,
causing failures to surface later in torchrun with unclear error messages. Add a
validation check after the conditional block that attempts to set MASTER_ADDR to
verify it is not empty when NNODES > 1, and exit immediately with a clear error
message if MASTER_ADDR is unresolved, preventing the script from proceeding to
assemble RDZV_ARGS with undefined values.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`:
- Line 34: The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE
environment variable set to "true" which removes user control over remote code
execution security settings. Remove the line containing TRUST_REMOTE_CODE:
"true" from the megatron_bridge_export.yaml file, as the underlying shell
scripts (export.sh, generate.sh, quantize.sh) already support this as an
optional flag with a safe default that lets users explicitly opt in when needed,
restoring proper security boundaries.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`:
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE environment variable
assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration
files in tools/launcher/examples/nvidia/. This security-sensitive setting should
not be hardcoded in configs; instead, allow it to default to false and require
users to explicitly enable it per-run after vetting the model source. Apply this
change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for
export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from
each environment variables section.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`:
- Line 40: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the megatron_bridge_ptq.yaml configuration file, which enables remote code
execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE:
"true" line entirely from the configuration, or if it must be present, set it to
"false" to enforce a safe default. This requires operators to explicitly opt-in
to remote code execution for trusted artifacts rather than allowing it
unconditionally.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml`:
- Line 37: The TRUST_REMOTE_CODE environment variable is hardcoded as true in
the configuration, but per security guidelines this flag should not be baked
into default configs. Either remove the TRUST_REMOTE_CODE: "true" line entirely
and require users to explicitly pass this flag when running the example if
needed, or if this is a documented exception for NVIDIA's official trusted
artifacts, keep the line but add a clear inline comment explaining why this
security-sensitive flag is being enabled for this specific case. Apply the same
change consistently across all Nemotron variant examples (Super, Nano, Ultra)
and MiniMax examples where this pattern appears.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml`:
- Line 35: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the YAML launcher config, which forces users to enable remote code execution
without explicit opt-in. Replace the hardcoded "true" value with a pattern that
allows runtime override while defaulting to false, such as using parameter
expansion syntax (e.g., `${TRUST_REMOTE_CODE:-false}`) so users can choose to
enable it explicitly by setting the environment variable before running the
config, rather than having it forced on by default. This change should be
applied to the TRUST_REMOTE_CODE assignment in the launcher YAML file and
similar hardcoded instances across other model config files.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`:
- Line 38: The megatron_bridge_export.yaml file contains a hardcoded
TRUST_REMOTE_CODE: "true" environment variable without justification. Either
remove this line entirely if it is not necessary, or if the
Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline
YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is
required for this specific model and confirming that the artifact provenance has
been vetted and is trusted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ebef21ea-c5a1-4eb2-adb2-6d9b32bbf043

📥 Commits

Reviewing files that changed from the base of the PR and between 769ea5f and fd9fd4c.

📒 Files selected for processing (12)
  • tools/launcher/common/megatron_bridge/export/export.sh
  • tools/launcher/common/megatron_bridge/generate/generate.sh
  • tools/launcher/common/megatron_bridge/quantize/quantize.sh
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml

Comment on lines +67 to +74
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
else

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when MASTER_ADDR is unresolved for multi-node runs.

For NNODES>1, Line 73 can end up passing an empty --master-addr if SLURM_NODELIST resolution fails. Add an explicit guard and exit early with a clear error.

Suggested patch
 if [[ "${NNODES}" -gt 1 ]]; then
     if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
         MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
     fi
+    if [[ -z "${MASTER_ADDR}" ]]; then
+        echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2
+        exit 1
+    fi
     MASTER_PORT="${MASTER_PORT:-29500}"
     RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
                "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/export/export.sh` around lines 67 - 74,
After the MASTER_ADDR resolution attempt from SLURM_NODELIST in the multi-node
block (when NNODES > 1), add an explicit validation check to ensure MASTER_ADDR
is not empty. If MASTER_ADDR is still unresolved at that point, exit the script
immediately with a clear error message before the RDZV_ARGS are constructed.
This prevents passing an empty --master-addr value and failing downstream.

Comment on lines +58 to +65
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
else

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a hard check for MASTER_ADDR before building multi-node rendezvous args.

When NNODES>1, the script may construct --master-addr= with an empty value if Slurm hostname discovery does not resolve, causing avoidable runtime failure.

Suggested patch
 if [[ "${NNODES}" -gt 1 ]]; then
     if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
         MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
     fi
+    if [[ -z "${MASTER_ADDR}" ]]; then
+        echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2
+        exit 1
+    fi
     MASTER_PORT="${MASTER_PORT:-29500}"
     RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
                "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/generate/generate.sh` around lines 58 -
65, When NNODES is greater than 1, the script attempts to construct RDZV_ARGS
with MASTER_ADDR, but if the Slurm hostname discovery using scontrol fails to
resolve, MASTER_ADDR remains empty and gets passed as an empty value in the
rendezvous arguments, causing runtime failure. Add a validation check after the
conditional block that attempts to set MASTER_ADDR that explicitly verifies
MASTER_ADDR is not empty when NNODES is greater than 1, and fail with a clear
error message if it is empty, preventing the script from proceeding to build
RDZV_ARGS with an invalid master address.

Comment on lines +68 to +75
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
else

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when multi-node rendezvous address is unresolved.

When NNODES > 1, RDZV_ARGS is assembled even if MASTER_ADDR stays empty (e.g., missing SLURM_NODELIST or scontrol unavailable), which pushes the failure downstream into torchrun with a less actionable error.

Suggested fix
 if [[ "${NNODES}" -gt 1 ]]; then
     if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
         MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
     fi
+    if [[ -z "${MASTER_ADDR}" ]]; then
+        echo "[ERROR] MASTER_ADDR is required for multi-node runs (set MASTER_ADDR or SLURM_NODELIST)." >&2
+        exit 1
+    fi
     MASTER_PORT="${MASTER_PORT:-29500}"
     RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
                "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
 else
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 68 -
75, When NNODES is greater than 1, the code attempts to resolve MASTER_ADDR from
SLURM_NODELIST using scontrol, but if this resolution fails the variable remains
empty and RDZV_ARGS is still assembled with the empty value, causing failures to
surface later in torchrun with unclear error messages. Add a validation check
after the conditional block that attempts to set MASTER_ADDR to verify it is not
empty when NNODES > 1, and exit immediately with a clear error message if
MASTER_ADDR is unresolved, preventing the script from proceeding to assemble
RDZV_ARGS with undefined values.

- EP: "1"
- ETP: "1"
- DTYPE: bfloat16
- TRUST_REMOTE_CODE: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and examine the target YAML file
find . -type f -name "megatron_bridge_export.yaml" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 363


🏁 Script executed:

# Also search for similar YAML configs to understand patterns
find tools/launcher/examples -type f -name "*.yaml" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 776


🏁 Script executed:

# Search for how TRUST_REMOTE_CODE is used in the codebase
rg "TRUST_REMOTE_CODE" -A 2 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 14710


🏁 Script executed:

# Search for MLM_MODEL_CFG and QUANT_CFG usage patterns in YAML files
rg "MLM_MODEL_CFG|QUANT_CFG" tools/launcher/examples -A 1 -B 1

Repository: NVIDIA/Model-Optimizer

Length of output: 5902


🏁 Script executed:

# View the specific file mentioned in the review
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1879


🏁 Script executed:

# Check if there's any documentation or comments about TRUST_REMOTE_CODE in launcher configs
rg "TRUST_REMOTE_CODE" tools/launcher -B 5 -A 2 | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 11043


🏁 Script executed:

# Look for any README or documentation in the launcher examples directory
find tools/launcher/examples -type f -name "README*" -o -name "*.md"

Repository: NVIDIA/Model-Optimizer

Length of output: 48


Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable.

This setting forces remote code execution for all runs, removing user control over a critical security boundary. Per security guidelines, do not hardcode trust_remote_code=True; instead let callers explicitly opt in at submission time when they can justify trusting the model source.

The shell scripts (export.sh, generate.sh, quantize.sh) already support this as an optional flag with a safe default ([[ "${TRUST_REMOTE_CODE:-false}" == "true" ]]), so removing this line simply restores user control without breaking the pipeline.

Suggested patch
      - DTYPE: bfloat16
-     - TRUST_REMOTE_CODE: "true"
    slurm_config:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- TRUST_REMOTE_CODE: "true"
- DTYPE: bfloat16
slurm_config:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`
at line 34, The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE
environment variable set to "true" which removes user control over remote code
execution security settings. Remove the line containing TRUST_REMOTE_CODE:
"true" from the megatron_bridge_export.yaml file, as the underlying shell
scripts (export.sh, generate.sh, quantize.sh) already support this as an
optional flag with a safe default that lets users explicitly opt in when needed,
restoring proper security boundaries.

Source: Coding guidelines

- PP: "1"
- EP: "8"
- ETP: "1"
- TRUST_REMOTE_CODE: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the actual file content and context
cd tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 2>/dev/null && pwd && ls -la || echo "Directory may not exist, searching for the file..."

Repository: NVIDIA/Model-Optimizer

Length of output: 672


🏁 Script executed:

# Find the file and examine its content
fd -type f -name "megatron_bridge_generate.yaml" | head -5 | xargs -I {} sh -c 'echo "File: {}" && cat -n {}'

Repository: NVIDIA/Model-Optimizer

Length of output: 237


🏁 Script executed:

# Search for TRUST_REMOTE_CODE usage in Python code to understand how it's consumed
rg "TRUST_REMOTE_CODE" -A 3 -B 3 --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 1536


🏁 Script executed:

# Check for similar environment variable patterns in other launcher configs
fd -type f -name "*.yaml" -path "*/tools/launcher/examples/*" | xargs grep -l "TRUST_REMOTE_CODE" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 239


🏁 Script executed:

# Look for how environment variables are used in launcher code
fd -type f -name "*.py" -path "*/tools/launcher/*" | head -10 | xargs ls -lh

Repository: NVIDIA/Model-Optimizer

Length of output: 1442


🏁 Script executed:

# Examine the actual megatron_bridge_generate.yaml file
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1939


🏁 Script executed:

# Also check the export version to see if it has the same pattern
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1879


🏁 Script executed:

# Search for documentation on how launcher configs use environment variables
find tools/launcher -name "*.md" -o -name "README*" | xargs cat 2>/dev/null | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 3272


🏁 Script executed:

# Check how fakequant_worker.py uses the TRUST_REMOTE_CODE env var
cat -n examples/vllm_serve/fakequant_worker.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 5192


🏁 Script executed:

# Search for other launcher YAML configs with TRUST_REMOTE_CODE or similar patterns
grep -r "TRUST_REMOTE_CODE" tools/launcher --include="*.yaml"

Repository: NVIDIA/Model-Optimizer

Length of output: 1348


🏁 Script executed:

# Check if there's any documentation about environment variable handling in launcher configs
find tools/launcher -type f \( -name "*.py" -o -name "*.md" \) | xargs grep -l "environment\|TRUST_REMOTE_CODE" 2>/dev/null | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 243


🏁 Script executed:

# Check the launcher scripts to understand how environment variables flow from YAML to actual execution
find tools/launcher/common -name "*.sh" | head -3 | xargs head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 3937


🏁 Script executed:

# Check if there are any configuration override mechanisms in the launcher
cat tools/launcher/core.py | grep -A 10 -B 5 "environment\|override" | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1613


🏁 Script executed:

# Look for comments or documentation about setting TRUST_REMOTE_CODE in launcher configs
grep -r "trust.remote\|TRUST_REMOTE" tools/launcher --include="*.md" --include="*.py" -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Check the launcher's query.py to see if environment variables can be overridden
find tools/launcher/common -name "query.py" -exec head -50 {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 1848


🏁 Script executed:

# Check if there's any documentation or comments explaining why TRUST_REMOTE_CODE is hardcoded
grep -B 10 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 419


🏁 Script executed:

# Check if launcher configs support environment variable overrides at runtime
cat tools/launcher/launch.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 4138


🏁 Script executed:

# Look for documentation on how to override environment variables when running launcher
grep -r "override\|environment" tools/launcher/docs/ tools/launcher/*.md 2>/dev/null | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1889


🏁 Script executed:

# Check the shell script that actually runs - see if it respects or overrides TRUST_REMOTE_CODE
cat tools/launcher/common/megatron_bridge/generate/generate.sh 2>/dev/null | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1911


🏁 Script executed:

# Look for any comments or inline documentation about the TRUST_REMOTE_CODE setting in similar files
cat tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1567


🏁 Script executed:

# Check if these launcher YAML files are used in actual CICD or if they're just examples
grep -r "megatron_bridge_generate" . --include="*.yml" --include="*.yaml" --include="*.py" | grep -v "examples" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Look at how the environment variables are actually consumed in the shell script
cat -n tools/launcher/common/megatron_bridge/generate/generate.sh | grep -A 30 "TRUST_REMOTE_CODE"

Repository: NVIDIA/Model-Optimizer

Length of output: 2096


🏁 Script executed:

# Check if there are any warnings or comments in the generated shell scripts about trust_remote_code
find tools/launcher/common -name "*.sh" -exec grep -l "trust" {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 612


🏁 Script executed:

# Look at the launcher configuration documentation to see if there's guidance on setting sensitive env vars
cat tools/launcher/docs/configuration.md 2>/dev/null || echo "File not found"

Repository: NVIDIA/Model-Optimizer

Length of output: 4251


🏁 Script executed:

# Check the CONTRIBUTING.md for any guidance on launcher configs and security
grep -A 10 "launcher\|environment\|TRUST" tools/launcher/docs/contributing.md 2>/dev/null || grep -A 10 "launcher\|environment" CONTRIBUTING.md 2>/dev/null | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 277


🏁 Script executed:

# Check if these are actual production configs or just examples
find tools/launcher/examples -type f -name "*.yaml" | wc -l

Repository: NVIDIA/Model-Optimizer

Length of output: 69


🏁 Script executed:

# Verify the broader scope - how many YAML configs have this issue
grep -r "TRUST_REMOTE_CODE.*true" tools/launcher --include="*.yaml" | wc -l

Repository: NVIDIA/Model-Optimizer

Length of output: 68


🏁 Script executed:

# Check if the shell scripts document that these env vars can be overridden
cat -n tools/launcher/common/megatron_bridge/quantize/quantize.sh | grep -A 10 -B 5 "TRUST_REMOTE_CODE"

Repository: NVIDIA/Model-Optimizer

Length of output: 1944


🏁 Script executed:

# Look for any warnings or notes about security in the example configs
grep -r "security\|trusted\|remote.code" tools/launcher/examples --include="*.yaml" --include="*.md"

Repository: NVIDIA/Model-Optimizer

Length of output: 6233


Remove hardcoded TRUST_REMOTE_CODE: "true" from launcher configs; require per-run override instead.

These Megatron-Bridge launcher configs in tools/launcher/examples/nvidia/ hardcode TRUST_REMOTE_CODE: "true", violating the coding guideline that states: "Do not hardcode trust_remote_code=True". The guideline explicitly requires this security-sensitive setting to default to false and be controlled per-run after model-source vetting.

This pattern appears in all 9 NVIDIA Nemotron launcher configs (Nano, Super, Ultra—across export, generate, and ptq tasks). Either remove the environment variable entirely (shell scripts default to false) or document it as an optional per-run override, allowing users to enable it explicitly when needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`
at line 35, Remove the hardcoded TRUST_REMOTE_CODE environment variable
assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration
files in tools/launcher/examples/nvidia/. This security-sensitive setting should
not be hardcoded in configs; instead, allow it to default to false and require
users to explicitly enable it per-run after vetting the model source. Apply this
change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for
export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from
each environment variables section.

Source: Coding guidelines

- PP: "4"
- EP: "2"
- ETP: "1"
- TRUST_REMOTE_CODE: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not hardcode remote-code execution opt-in in the committed config.

Setting TRUST_REMOTE_CODE: "true" by default makes every run execute model-provided code. Prefer a safe default (false/unset) and require explicit operator opt-in for trusted artifacts.

As per coding guidelines: “Do not hardcode trust_remote_code=True.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`
at line 40, The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the megatron_bridge_ptq.yaml configuration file, which enables remote code
execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE:
"true" line entirely from the configuration, or if it must be present, set it to
"false" to enforce a safe default. This requires operators to explicitly opt-in
to remote code execution for trusted artifacts rather than allowing it
unconditionally.

Source: Coding guidelines

- EP: "8"
- ETP: "1"
- DTYPE: bfloat16
- TRUST_REMOTE_CODE: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, verify the file exists and check its content
find . -path "*tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"

Repository: NVIDIA/Model-Optimizer

Length of output: 166


🏁 Script executed:

# Search for the specific YAML file and read it
fd -e yaml "megatron_bridge_export.yaml" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 357


🏁 Script executed:

# Check if the file exists and its content around line 38
if [ -f "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml" ]; then
  echo "=== File content around line 38 ==="
  sed -n '30,45p' "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"
  echo ""
  echo "=== Full file content ==="
  cat "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"
else
  echo "File not found at specified path"
fi

Repository: NVIDIA/Model-Optimizer

Length of output: 2662


🏁 Script executed:

# Search for TRUST_REMOTE_CODE usage in launcher configs and scripts
rg "TRUST_REMOTE_CODE" --type yaml -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 8526


🏁 Script executed:

# Check for similar patterns in other YAML configs
rg "trust_remote_code|TRUST_REMOTE_CODE" -i --type yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 4802


🏁 Script executed:

# Look at the launcher scripts to understand how environment variables are processed
find tools/launcher -name "*.py" -o -name "*.sh" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1011


🏁 Script executed:

# Check the export.sh script to understand how TRUST_REMOTE_CODE is used
if [ -f "tools/launcher/common/megatron_bridge/export/export.sh" ]; then
  echo "=== export.sh script ==="
  cat "tools/launcher/common/megatron_bridge/export/export.sh"
else
  echo "File not found"
fi

Repository: NVIDIA/Model-Optimizer

Length of output: 4140


🏁 Script executed:

# Also check if there's a README or documentation about the launcher
find tools/launcher -name "README*" -o -name "*.md" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 291


🏁 Script executed:

# Look at the launch.py to understand how env vars are used
find tools/launcher -name "launch.py" -o -name "launcher.py" | xargs ls -la 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 140


🏁 Script executed:

# Check the README for launcher guidelines
cat tools/launcher/README.md | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 3006


🏁 Script executed:

# Look at the launch.py to see if there's any context about how environment variables are handled
head -50 tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2207


🏁 Script executed:

# Check if there are other scripts that also use TRUST_REMOTE_CODE similarly
grep -r "TRUST_REMOTE_CODE" tools/launcher/common --include="*.sh" -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 3507


🏁 Script executed:

# Check the inline comments in other similar export/quantize configs to see if they document the reason
grep -B 5 -A 1 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 207


Remove hardcoded TRUST_REMOTE_CODE: "true" or document why it's required.

Per the coding guidelines, do not hardcode trust_remote_code=True. While the underlying shell script correctly defaults to false, this YAML example hardcodes the override without justification. If --trust-remote-code is required for this model, add an inline YAML comment explaining why it is necessary (e.g., "Nemotron models require custom modeling code") and confirming artifact provenance is vetted.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`
at line 38, The megatron_bridge_export.yaml file contains a hardcoded
TRUST_REMOTE_CODE: "true" environment variable without justification. Either
remove this line entirely if it is not necessary, or if the
Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline
YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is
required for this specific model and confirming that the artifact provenance has
been vetted and is trusted.

Source: Coding guidelines

@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.31%. Comparing base (769ea5f) to head (be54e63).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1767      +/-   ##
==========================================
+ Coverage   74.28%   74.31%   +0.03%     
==========================================
  Files         511      511              
  Lines       56360    56360              
==========================================
+ Hits        41867    41884      +17     
+ Misses      14493    14476      -17     
Flag Coverage Δ
regression 14.70% <ø> (+0.06%) ⬆️
unit 54.33% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

CALIB_SIZE="${CALIB_SIZE:-512}"

if [[ -z "${NPROC_PER_NODE}" ]]; then
NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] (applies to all three scripts — quantize.sh, generate.sh, export.sh)

NPROC_PER_NODE auto-detects via nvidia-smi --query-gpu=count, but the 9 example YAMLs request gpus_per_node: 4 from Slurm and rely on this detection rather than setting NPROC_PER_NODE explicitly.

On most Slurm clusters cgroup isolation means nvidia-smi only sees the allocated GPUs, so this works. But on setups where nvidia-smi reports all physical GPUs on the node (e.g. an 8-GPU node with only 4 allocated), NPROC_PER_NODE becomes 8 while Slurm allocated 4, so nproc_per_node × nnodes no longer equals the intended TP×PP world size — torchrun then fails with an opaque rendezvous/world-size error.

Since the YAMLs already pin gpus_per_node, consider either passing NPROC_PER_NODE: "4" explicitly in the configs, or deriving it from SLURM_GPUS_ON_NODE (which reflects the allocation) when present:

if [[ -z "${NPROC_PER_NODE}" ]]; then
    NPROC_PER_NODE="${SLURM_GPUS_ON_NODE:-$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)}"
fi

Non-blocking — the current path is correct on cgroup-isolated clusters, which is the common case.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Summary

Scope reviewed: All 12 changed files (3 new bash launcher scripts + 9 new example YAMLs). The PR is entirely under tools/launcher/ — no modelopt/ library changes — so the ModelOpt-specific concerns (mode registration, modelopt_state schema/migration, HF/TRT-LLM export-API compatibility, plugin laziness) do not apply here.

Findings by severity: CRITICAL: 0 · IMPORTANT: 0 · SUGGESTION: 1

What I verified

  • World-size arithmetic across all 9 configs is consistent (TP×PP == nodes×gpus_per_node): Nano PTQ 2×4=8 / TP2·PP4; Super PTQ 8·PP2=16 = 4×4; Ultra PTQ/generate 8·9=72 = 18×4; export configs all TP=1 with PP matching node×GPU counts. The Ultra-export PP=9→12 deviation is correctly documented in the YAML header (108 layers / 12 = 9 layers/stage, maps cleanly to 3×4-GPU nodes).
  • Script structure mirrors the existing common/megatron_lm/{quantize,export} and common/megatron_bridge/import conventions: env-var validation, Slurm-derived rendezvous with standalone fallback, cd into the example dir for sibling-module imports, exec torchrun (export.sh intentionally omits exec so it can ls/cat hf_quant_config.json afterward).
  • Optional-flag assembly (COMPRESS, WEIGHT_ONLY, EXPORT_KV_CACHE_QUANT, EXPORT_EXTRA_MODULES, PROMPTS) is correct and safely defaulted.

Deliberately not duplicated (already gated/flagged)

  • Hardcoded TRUST_REMOTE_CODE: "true" in all 9 YAMLs — this is a CodeRabbit-gated security anti-pattern (already flagged by the pre-merge check and CodeRabbit's 9 inline comments). The author documents it as a Nemotron-3 family requirement matching upstream NVIDIA-NeMo docs. Out of scope for this review per the division of labor.
  • Empty MASTER_ADDR if scontrol resolution fails in the multi-node branch — already posted by CodeRabbit on all three scripts.

Sole new finding (non-blocking)

  • [SUGGESTION] NPROC_PER_NODE auto-detection via nvidia-smi can over-count on non-cgroup-isolated Slurm nodes where more physical GPUs exist than were allocated, breaking the world-size invariant. Suggested preferring SLURM_GPUS_ON_NODE or pinning NPROC_PER_NODE in the YAMLs. Correct as-is on cgroup-isolated clusters (the common case).

Risk level: Low. Additive-only tooling change; no library/API surface touched, backward compatible by construction.

Add three shell-script wrappers under tools/launcher/common/megatron_bridge/
that drive /opt/Megatron-Bridge/examples/quantization/{quantize,ptq_generate,
export}.py via torchrun, with multi-node rendezvous derived from Slurm env.
Also add per-model launcher YAMLs for Nemotron-3 Nano-30B-A3B, Super-120B-A12B,
and Ultra-550B-A55B following the upstream NVIDIA-NeMo/Nemotron quantization
docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the yueshen/launcher-megatron-bridge branch from fd9fd4c to be54e63 Compare June 18, 2026 00:15

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 61-63: After the NPROC_PER_NODE auto-detection block that uses
nvidia-smi, add an explicit validation check to ensure NPROC_PER_NODE is not
empty and contains a numeric value. If the validation fails, the script should
exit immediately with a clear error message indicating that NPROC_PER_NODE
detection failed or produced an invalid value, rather than allowing the script
to proceed to the torchrun invocation where the error would be less actionable.
- Around line 104-113: The exec command invokes quantize.py with argument names
that do not match the upstream parser's expected snake_case names. Replace the
kebab-case argument names with their correct snake_case equivalents: change
--hf-model-id to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg,
--megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to
--pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples.
Additionally, remove the --etp argument entirely since it has no corresponding
parser argument and will cause an unrecognized argument error at runtime.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`:
- Around line 28-36: The environment section in the launcher YAML is missing two
required environment variables according to the guidelines. Add MLM_MODEL_CFG
set to the HuggingFace repo ID (which should match the HF_MODEL_ID value of
/hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add QUANT_CFG set
to the appropriate quantization configuration value for this model. Both
variables should be added as list items in the environment section alongside the
existing variables like TP, PP, EP, and ETP.
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable
setting from the megatron_bridge_generate.yaml configuration file. This setting
forces trust-remote-code execution by default, which violates security
guidelines by automatically executing arbitrary Python code from checkpoints.
Instead, allow users to explicitly opt-in to this security-sensitive flag
through their own configuration or command-line arguments rather than having it
enforced by default in the configuration file.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c05886a-833a-41e4-b3b9-fe5bf0ab7e10

📥 Commits

Reviewing files that changed from the base of the PR and between fd9fd4c and be54e63.

📒 Files selected for processing (12)
  • tools/launcher/common/megatron_bridge/export/export.sh
  • tools/launcher/common/megatron_bridge/generate/generate.sh
  • tools/launcher/common/megatron_bridge/quantize/quantize.sh
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
✅ Files skipped from review due to trivial changes (2)
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
  • tools/launcher/common/megatron_bridge/generate/generate.sh
  • tools/launcher/common/megatron_bridge/export/export.sh
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
  • tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml

Comment on lines +61 to +63
if [[ -z "${NPROC_PER_NODE}" ]]; then
NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate NPROC_PER_NODE before torchrun invocation.

If auto-detection yields empty/non-numeric output, the script defers failure to torchrun with a less actionable error. Add an explicit sanity check and fail fast.

Minimal fail-fast guard
 if [[ -z "${NPROC_PER_NODE}" ]]; then
     NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)
 fi
+if ! [[ "${NPROC_PER_NODE}" =~ ^[1-9][0-9]*$ ]]; then
+    echo "[ERROR] NPROC_PER_NODE must be a positive integer (set NPROC_PER_NODE explicitly if auto-detect fails)." >&2
+    exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 61 -
63, After the NPROC_PER_NODE auto-detection block that uses nvidia-smi, add an
explicit validation check to ensure NPROC_PER_NODE is not empty and contains a
numeric value. If the validation fails, the script should exit immediately with
a clear error message indicating that NPROC_PER_NODE detection failed or
produced an invalid value, rather than allowing the script to proceed to the
torchrun invocation where the error would be less actionable.

Comment on lines +104 to +113
exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \
--hf-model-id "${HF_MODEL_ID}" \
--export-quant-cfg "${EXPORT_QUANT_CFG}" \
--megatron-save-path "${MEGATRON_SAVE_PATH}" \
--tp "${TP}" \
--pp "${PP}" \
--ep "${EP}" \
--etp "${ETP}" \
--calib-size "${CALIB_SIZE}" \
"${EXTRA_FLAGS[@]}" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify launcher flags vs parser-defined flags in repo sources.
LAUNCHER="tools/launcher/common/megatron_bridge/quantize/quantize.sh"
PARSER="examples/megatron_bridge/quantize.py"

echo "== Launcher forwarded flags =="
rg -n --no-heading --pcre2 '\s--[a-z0-9][a-z0-9-]*' "$LAUNCHER" \
  | sed -E 's/.*(--[a-z0-9][a-z0-9-]*).*/\1/' \
  | sort -u

echo
echo "== Parser add_argument options =="
rg -n --no-heading --pcre2 'add_argument\(\s*"--[A-Za-z0-9_]+"' "$PARSER" \
  | sed -E 's/.*"--([^"]+)".*/--\1/' \
  | sort -u

echo
echo "== Potential launcher-only flags (likely mismatch) =="
comm -23 \
  <(rg -o --no-line-number --pcre2 '(?<=\s)--[a-z0-9][a-z0-9-]*' "$LAUNCHER" | sort -u) \
  <(rg -o --no-line-number --pcre2 '--[A-Za-z0-9_]+' "$PARSER" | sort -u)

Repository: NVIDIA/Model-Optimizer

Length of output: 898


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the actual parser file for all add_argument calls
echo "=== Full parser argument definitions ==="
rg -A 2 'add_argument' examples/megatron_bridge/quantize.py | head -100

echo ""
echo "=== Check for any conditional or dynamic argument handling ==="
rg -n 'EXPORT_QUANT_CFG|megatron_save_path|etp|NPROC' examples/megatron_bridge/quantize.py

echo ""
echo "=== Launcher script context around args ==="
sed -n '100,115p' tools/launcher/common/megatron_bridge/quantize/quantize.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 1818


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify the launcher is meant to call quantize.py
echo "=== Check launcher command line ==="
rg -A 5 'exec python' tools/launcher/common/megatron_bridge/quantize/quantize.sh

echo ""
echo "=== Check if quantize.py exists and its location ==="
find . -name "quantize.py" -type f | grep -E "(megatron|bridge)" | head -5

echo ""
echo "=== Verify argument mismatch by checking specific mappings ==="
echo "Launcher argument mapping needed:"
echo "  --hf-model-id → --hf_model_name_or_path"
echo "  --export-quant-cfg → --quant_cfg"
echo "  --megatron-save-path → --export_megatron_path"
echo "  --tp → --tp_size"
echo "  --pp → --pp_size"
echo "  --ep → --ep_size"
echo "  --etp → (NO PARSER ARGUMENT)"
echo "  --calib-size → --calib_num_samples"

Repository: NVIDIA/Model-Optimizer

Length of output: 822


Launcher CLI arguments are incompatible with the upstream quantize.py parser.

The launcher forwards kebab-case arguments (lines 105–113) that do not match the parser's snake_case argument names. For example:

  • --hf-model-id → parser expects --hf_model_name_or_path
  • --export-quant-cfg → parser expects --quant_cfg
  • --megatron-save-path → parser expects --export_megatron_path
  • --tp, --pp, --ep → parser expects --tp_size, --pp_size, --ep_size
  • --calib-size → parser expects --calib_num_samples
  • --etp has no corresponding parser argument

This will cause argparse to fail at runtime with unrecognized argument errors.

Fix: align launcher arguments to parser contract
 exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \
-    --hf-model-id "${HF_MODEL_ID}" \
-    --export-quant-cfg "${EXPORT_QUANT_CFG}" \
-    --megatron-save-path "${MEGATRON_SAVE_PATH}" \
-    --tp "${TP}" \
-    --pp "${PP}" \
-    --ep "${EP}" \
-    --etp "${ETP}" \
-    --calib-size "${CALIB_SIZE}" \
+    --hf_model_name_or_path "${HF_MODEL_ID}" \
+    --quant_cfg "${EXPORT_QUANT_CFG}" \
+    --export_megatron_path "${MEGATRON_SAVE_PATH}" \
+    --tp_size "${TP}" \
+    --pp_size "${PP}" \
+    --ep_size "${EP}" \
+    --calib_num_samples "${CALIB_SIZE}" \
     "${EXTRA_FLAGS[@]}" \

Remove --etp "${ETP}" unless it is handled elsewhere in the parser.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 104
- 113, The exec command invokes quantize.py with argument names that do not
match the upstream parser's expected snake_case names. Replace the kebab-case
argument names with their correct snake_case equivalents: change --hf-model-id
to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg,
--megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to
--pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples.
Additionally, remove the --etp argument entirely since it has no corresponding
parser argument and will cause an unrecognized argument error at runtime.

Comment on lines +28 to +36
environment:
- HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
- MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calib
- TP: "8"
- PP: "9"
- EP: "8"
- ETP: "1"
- TRUST_REMOTE_CODE: "true"
- PYTORCH_ALLOC_CONF: "expandable_segments:True"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required model-config environment variables for launcher YAMLs.

This new model config omits MLM_MODEL_CFG and QUANT_CFG, which are required by the launcher YAML guidelines for newly added model configs.

Suggested change
     environment:
+      - MLM_MODEL_CFG: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
+      - QUANT_CFG: super-nvfp4-max-calib
       - HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
       - MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calib

As per coding guidelines, “Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config” and “Set QUANT_CFG environment variable … when adding a new model config.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`
around lines 28 - 36, The environment section in the launcher YAML is missing
two required environment variables according to the guidelines. Add
MLM_MODEL_CFG set to the HuggingFace repo ID (which should match the HF_MODEL_ID
value of /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add
QUANT_CFG set to the appropriate quantization configuration value for this
model. Both variables should be added as list items in the environment section
alongside the existing variables like TP, PP, EP, and ETP.

Source: Coding guidelines

- PP: "9"
- EP: "8"
- ETP: "1"
- TRUST_REMOTE_CODE: "true"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hardcoding remote-code execution opt-in.

TRUST_REMOTE_CODE: "true" forces --trust-remote-code for every run, which weakens the default security posture and violates the repository security rule.

Suggested change
-      - TRUST_REMOTE_CODE: "true"
+      # Set only when explicitly needed for a trusted model source.
+      # - TRUST_REMOTE_CODE: "true"

As per coding guidelines, “Do not hardcode trust_remote_code=True … This flag tells Transformers to execute arbitrary Python shipped with a checkpoint.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`
at line 35, Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable
setting from the megatron_bridge_generate.yaml configuration file. This setting
forces trust-remote-code execution by default, which violates security
guidelines by automatically executing arbitrary Python code from checkpoints.
Instead, allow users to explicitly opt-in to this security-sensitive flag
through their own configuration or command-line arguments rather than having it
enforced by default in the configuration file.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant