feat(launcher): add Megatron-Bridge quantize/generate/export wrappers by yueshen2016 · Pull Request #1767 · NVIDIA/Model-Optimizer

yueshen2016 · 2026-06-17T22:46:19Z

What does this PR do?

Type of change: new example

Adds Megatron-Bridge PTQ wrappers to the launcher so users can quantize / generate / export Hugging Face models through Megatron-Bridge on Slurm or Docker, mirroring the existing common/megatron_lm/{quantize,export}/ and common/megatron_bridge/import/ conventions.

Common scripts (tools/launcher/common/megatron_bridge/):
- quantize/quantize.sh — wraps /opt/Megatron-Bridge/examples/quantization/quantize.py
- generate/generate.sh — wraps /opt/Megatron-Bridge/examples/quantization/ptq_generate.py
- export/export.sh — wraps /opt/Megatron-Bridge/examples/quantization/export.py
All three assume nvcr.io/nvidia/nemo:26.04+ (Megatron-Bridge preinstalled), read configuration from env vars (HF_MODEL_ID, MEGATRON_SAVE_PATH/MEGATRON_LOAD_PATH, EXPORT_QUANT_CFG, TP/PP/EP/ETP, DTYPE, …), and launch via python -m torch.distributed.run with multi-node rendezvous (--nnodes/--node-rank/--master-addr/--master-port) derived from SLURM_NNODES/SLURM_NODEID/SLURM_NODELIST, falling back to --standalone for single-node.
Per-model example YAMLs (tools/launcher/examples/nvidia/):
- NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 2 nodes × 4 GPUs (TP=2 PP=4 EP=2 for PTQ, TP=8 EP=8 for generate, TP=1 PP=8 for export), nvfp4 config.
- NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 4 nodes × 4 GPUs for PTQ/generate (TP=8 PP=2 EP=8), 2 nodes × 4 GPUs for export (TP=1 PP=8), mamba_moe_nvfp4_conservative per super3 docs.
- NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_{ptq,generate,export}.yaml — 18 nodes × 4 GPUs for PTQ/generate (TP=8 PP=9 EP=8, super-nvfp4-max-calib.yaml recipe, PYTORCH_ALLOC_CONF=expandable_segments:True), 3 nodes × 4 GPUs for export (TP=1 PP=12 EP=8 — PP bumped from doc's 9 so the world maps cleanly to 4-GPU nodes; 108 layers / 12 = 9 layers/stage), per ultra3 docs.

Usage

# Quantize Nemotron-3-Nano-30B-A3B to NVFP4 via Megatron-Bridge:
cd tools/launcher
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml --yes

# Smoke-test generation on the quantized MCore checkpoint:
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml --yes

# Export to deployable HF format:
uv run launch.py --yaml examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml --yes

# Same three-step flow works for Super and Ultra by swapping the model subdir.

Testing

bash -n syntax check on all three shell scripts (pass).
uv run launch.py --dryrun --yes on all 9 example YAMLs (all parse, resolve <<global_vars.X>>, and build the Fiddle SandboxPipeline cleanly).
End-to-end Slurm runs not yet exercised; need cluster access with the right node count for super (16 GPUs) and ultra (72 GPUs).

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ — all files are new; no existing files modified.
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A — no copied code, no new PIP dependencies.
Did you write any new necessary tests?: ❌ — the launcher's existing --dryrun path is the validation surface for example YAMLs and was used here; no new unit tests added. Happy to add launcher YAML regression tests if requested.
Did you update Changelog?: ❌ — pure launcher-example addition, no library API change. Can add a one-liner under 0.46 "New Features" if preferred (prior precedent: the moonshotai/ launcher examples were mentioned in 0.46).
Did you get Claude approval on this PR?: ❌ — to be run via /claude review after PR creation.

Additional Information

Upstream reference scripts: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/quantization
Note on trust_remote_code=True: the Nemotron-3 family requires it, and all three Nemotron YAMLs set TRUST_REMOTE_CODE=true consistent with the upstream NVIDIA-NeMo/Nemotron docs.
The Ultra-export YAML deliberately deviates from the upstream doc's --pp 9 to PP=12 so the world (12 ranks) maps cleanly to 3 × 4-GPU nodes; this is called out in the YAML's header comment.

Summary by CodeRabbit

Release Notes

New Features
- Added Megatron-Bridge tools for model quantization, text generation, and checkpoint export workflows.
- Added pre-configured job templates for three NVIDIA Nemotron model variants (Nano-30B, Super-120B, Ultra-550B) to streamline distributed training and inference operations.

coderabbitai · 2026-06-17T22:46:33Z

📝 Walkthrough

Walkthrough

Adds three reusable Bash launcher scripts (quantize.sh, export.sh, generate.sh) under tools/launcher/common/megatron_bridge/ for distributed Megatron-Bridge PTQ, HuggingFace checkpoint export, and text generation jobs. Also adds nine Slurm YAML pipeline configurations covering PTQ, export, and generate operations for three Nemotron model variants (Nano-30B, Super-120B, Ultra-550B).

Changes

Megatron-Bridge Launcher Scripts and Model Configs

Layer / File(s)	Summary
PTQ quantization launcher script `tools/launcher/common/megatron_bridge/quantize/quantize.sh`	Validates `HF_MODEL_ID`, sets defaults for `OUTPUT_DIR`, `EXPORT_QUANT_CFG`, `MEGATRON_SAVE_PATH`, and parallelism (`TP`, `PP`, `EP`, `ETP`, `CALIB_SIZE`), auto-detects `NPROC_PER_NODE` via `nvidia-smi`, builds Slurm-derived or standalone torchrun rendezvous args, assembles optional flags (compress, weight-only, KV cache, trust remote code, HF file lock, prompts), applies runtime workaround to rewrite deprecated dataset IDs, then `exec`s torchrun invoking `quantize.py`.
HuggingFace export launcher script `tools/launcher/common/megatron_bridge/export/export.sh`	Validates `HF_MODEL_ID` and `MEGATRON_LOAD_PATH`, sets output/parallelism/dtype defaults, auto-detects `NPROC_PER_NODE`, builds torchrun rendezvous args from Slurm or standalone mode, conditionally adds `EXPORT_EXTRA_MODULES` and `TRUST_REMOTE_CODE` flags, `exec`s torchrun with `export.py`, then lists export directory and prints `hf_quant_config.json` if present.
PTQ generation launcher script `tools/launcher/common/megatron_bridge/generate/generate.sh`	Validates `HF_MODEL_ID` and `MEGATRON_LOAD_PATH`, sets `TP`/`PP`/`EP`/`ETP` and `OSL` defaults, derives Slurm-based or standalone rendezvous args (including optional `MASTER_ADDR` via `scontrol`), builds optional flags from `TRUST_REMOTE_CODE` and `PROMPTS`, logs configuration, then `exec`s torchrun invoking `ptq_generate.py`.
PTQ job configs (3 models) `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml`	Each config defines pipeline metadata and global `output_dir`, wires `task_0` to invoke `quantize.sh` with model-specific env vars (HF model path, quantization config, parallelism, calibration size, trust remote code), and provisions Slurm/container execution with resource shapes from 2 nodes/4 GPUs (Nano, 4 h) to 18 nodes/4 GPUs (Ultra, 12 h).
Export job configs (3 models) `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`	Each config invokes `export.sh` with HF model/load paths and `EXPORT_DIR`, fixes `TP=1` with model-specific `PP` (8 for Nano/Super, 12 for Ultra), uses `bfloat16` dtype and `TRUST_REMOTE_CODE=true`, and specifies Slurm shapes of 2 nodes (Nano) and 3 nodes (Ultra) with 2-hour limits.
Generate job configs (3 models) `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml`, `tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`	Each config invokes `generate.sh` with HF model id, templated `MEGATRON_LOAD_PATH`, model-specific TP/PP/EP/ETP, `TRUST_REMOTE_CODE=true`, and optional `PYTORCH_ALLOC_CONF`. Slurm shapes range from 2 nodes/4 GPUs (Nano) to 4 nodes/4 GPUs (Super) to 18 nodes/1 task per node/4 GPUs (Ultra), with 1–2 hour limits.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

cherry-pick-0.45.0

Suggested reviewers

kevalmorabia97
h-guo18
ChenhanYu

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	PR contains 9 hardcoded instances of `TRUST_REMOTE_CODE: "true"` across all example YAML files, violating SECURITY.md rule `#3`: trust_remote_code must be exposed as caller-configurable parameter, no...	Comment out TRUST_REMOTE_CODE lines in all 9 YAML files with a note stating it should be set only when explicitly needed for trusted model sources, matching the pattern already flagged in review comments.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main change: it introduces Megatron-Bridge quantize/generate/export wrapper scripts to the launcher framework, which is the primary focus of the changeset.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch yueshen/launcher-megatron-bridge

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

yueshen2016 · 2026-06-17T22:54:23Z

/claude review

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/megatron_bridge/export/export.sh`:
- Around line 67-74: After the MASTER_ADDR resolution attempt from
SLURM_NODELIST in the multi-node block (when NNODES > 1), add an explicit
validation check to ensure MASTER_ADDR is not empty. If MASTER_ADDR is still
unresolved at that point, exit the script immediately with a clear error message
before the RDZV_ARGS are constructed. This prevents passing an empty
--master-addr value and failing downstream.

In `@tools/launcher/common/megatron_bridge/generate/generate.sh`:
- Around line 58-65: When NNODES is greater than 1, the script attempts to
construct RDZV_ARGS with MASTER_ADDR, but if the Slurm hostname discovery using
scontrol fails to resolve, MASTER_ADDR remains empty and gets passed as an empty
value in the rendezvous arguments, causing runtime failure. Add a validation
check after the conditional block that attempts to set MASTER_ADDR that
explicitly verifies MASTER_ADDR is not empty when NNODES is greater than 1, and
fail with a clear error message if it is empty, preventing the script from
proceeding to build RDZV_ARGS with an invalid master address.

In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 68-75: When NNODES is greater than 1, the code attempts to resolve
MASTER_ADDR from SLURM_NODELIST using scontrol, but if this resolution fails the
variable remains empty and RDZV_ARGS is still assembled with the empty value,
causing failures to surface later in torchrun with unclear error messages. Add a
validation check after the conditional block that attempts to set MASTER_ADDR to
verify it is not empty when NNODES > 1, and exit immediately with a clear error
message if MASTER_ADDR is unresolved, preventing the script from proceeding to
assemble RDZV_ARGS with undefined values.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`:
- Line 34: The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE
environment variable set to "true" which removes user control over remote code
execution security settings. Remove the line containing TRUST_REMOTE_CODE:
"true" from the megatron_bridge_export.yaml file, as the underlying shell
scripts (export.sh, generate.sh, quantize.sh) already support this as an
optional flag with a safe default that lets users explicitly opt in when needed,
restoring proper security boundaries.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`:
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE environment variable
assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration
files in tools/launcher/examples/nvidia/. This security-sensitive setting should
not be hardcoded in configs; instead, allow it to default to false and require
users to explicitly enable it per-run after vetting the model source. Apply this
change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for
export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from
each environment variables section.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`:
- Line 40: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the megatron_bridge_ptq.yaml configuration file, which enables remote code
execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE:
"true" line entirely from the configuration, or if it must be present, set it to
"false" to enforce a safe default. This requires operators to explicitly opt-in
to remote code execution for trusted artifacts rather than allowing it
unconditionally.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml`:
- Line 37: The TRUST_REMOTE_CODE environment variable is hardcoded as true in
the configuration, but per security guidelines this flag should not be baked
into default configs. Either remove the TRUST_REMOTE_CODE: "true" line entirely
and require users to explicitly pass this flag when running the example if
needed, or if this is a documented exception for NVIDIA's official trusted
artifacts, keep the line but add a clear inline comment explaining why this
security-sensitive flag is being enabled for this specific case. Apply the same
change consistently across all Nemotron variant examples (Super, Nano, Ultra)
and MiniMax examples where this pattern appears.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml`:
- Line 35: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the YAML launcher config, which forces users to enable remote code execution
without explicit opt-in. Replace the hardcoded "true" value with a pattern that
allows runtime override while defaulting to false, such as using parameter
expansion syntax (e.g., `${TRUST_REMOTE_CODE:-false}`) so users can choose to
enable it explicitly by setting the environment variable before running the
config, rather than having it forced on by default. This change should be
applied to the TRUST_REMOTE_CODE assignment in the launcher YAML file and
similar hardcoded instances across other model config files.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`:
- Line 38: The megatron_bridge_export.yaml file contains a hardcoded
TRUST_REMOTE_CODE: "true" environment variable without justification. Either
remove this line entirely if it is not necessary, or if the
Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline
YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is
required for this specific model and confirming that the artifact provenance has
been vetted and is trusted.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ebef21ea-c5a1-4eb2-adb2-6d9b32bbf043

📥 Commits

Reviewing files that changed from the base of the PR and between 769ea5f and fd9fd4c.

📒 Files selected for processing (12)

tools/launcher/common/megatron_bridge/export/export.sh
tools/launcher/common/megatron_bridge/generate/generate.sh
tools/launcher/common/megatron_bridge/quantize/quantize.sh
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml

coderabbitai · 2026-06-17T22:54:36Z

+if [[ "${NNODES}" -gt 1 ]]; then
+    if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
+        MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
+    fi
+    MASTER_PORT="${MASTER_PORT:-29500}"
+    RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
+               "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
+else


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when MASTER_ADDR is unresolved for multi-node runs.

For NNODES>1, Line 73 can end up passing an empty --master-addr if SLURM_NODELIST resolution fails. Add an explicit guard and exit early with a clear error.

Suggested patch

if [[ "${NNODES}" -gt 1 ]]; then if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) fi + if [[ -z "${MASTER_ADDR}" ]]; then + echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2 + exit 1 + fi MASTER_PORT="${MASTER_PORT:-29500}" RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/megatron_bridge/export/export.sh` around lines 67 - 74, After the MASTER_ADDR resolution attempt from SLURM_NODELIST in the multi-node block (when NNODES > 1), add an explicit validation check to ensure MASTER_ADDR is not empty. If MASTER_ADDR is still unresolved at that point, exit the script immediately with a clear error message before the RDZV_ARGS are constructed. This prevents passing an empty --master-addr value and failing downstream.

coderabbitai · 2026-06-17T22:54:36Z

+if [[ "${NNODES}" -gt 1 ]]; then
+    if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
+        MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
+    fi
+    MASTER_PORT="${MASTER_PORT:-29500}"
+    RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
+               "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
+else


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a hard check for MASTER_ADDR before building multi-node rendezvous args.

When NNODES>1, the script may construct --master-addr= with an empty value if Slurm hostname discovery does not resolve, causing avoidable runtime failure.

Suggested patch

if [[ "${NNODES}" -gt 1 ]]; then if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) fi + if [[ -z "${MASTER_ADDR}" ]]; then + echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2 + exit 1 + fi MASTER_PORT="${MASTER_PORT:-29500}" RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/megatron_bridge/generate/generate.sh` around lines 58 - 65, When NNODES is greater than 1, the script attempts to construct RDZV_ARGS with MASTER_ADDR, but if the Slurm hostname discovery using scontrol fails to resolve, MASTER_ADDR remains empty and gets passed as an empty value in the rendezvous arguments, causing runtime failure. Add a validation check after the conditional block that attempts to set MASTER_ADDR that explicitly verifies MASTER_ADDR is not empty when NNODES is greater than 1, and fail with a clear error message if it is empty, preventing the script from proceeding to build RDZV_ARGS with an invalid master address.

coderabbitai · 2026-06-17T22:54:36Z

+if [[ "${NNODES}" -gt 1 ]]; then
+    if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
+        MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
+    fi
+    MASTER_PORT="${MASTER_PORT:-29500}"
+    RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
+               "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
+else


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when multi-node rendezvous address is unresolved.

When NNODES > 1, RDZV_ARGS is assembled even if MASTER_ADDR stays empty (e.g., missing SLURM_NODELIST or scontrol unavailable), which pushes the failure downstream into torchrun with a less actionable error.

Suggested fix

if [[ "${NNODES}" -gt 1 ]]; then if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) fi + if [[ -z "${MASTER_ADDR}" ]]; then + echo "[ERROR] MASTER_ADDR is required for multi-node runs (set MASTER_ADDR or SLURM_NODELIST)." >&2 + exit 1 + fi MASTER_PORT="${MASTER_PORT:-29500}" RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}") else

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 68 - 75, When NNODES is greater than 1, the code attempts to resolve MASTER_ADDR from SLURM_NODELIST using scontrol, but if this resolution fails the variable remains empty and RDZV_ARGS is still assembled with the empty value, causing failures to surface later in torchrun with unclear error messages. Add a validation check after the conditional block that attempts to set MASTER_ADDR to verify it is not empty when NNODES > 1, and exit immediately with a clear error message if MASTER_ADDR is unresolved, preventing the script from proceeding to assemble RDZV_ARGS with undefined values.

coderabbitai · 2026-06-17T22:54:36Z

+      - EP: "1"
+      - ETP: "1"
+      - DTYPE: bfloat16
+      - TRUST_REMOTE_CODE: "true"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and examine the target YAML file find . -type f -name "megatron_bridge_export.yaml" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 363

🏁 Script executed:

# Also search for similar YAML configs to understand patterns find tools/launcher/examples -type f -name "*.yaml" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 776

🏁 Script executed:

# Search for how TRUST_REMOTE_CODE is used in the codebase rg "TRUST_REMOTE_CODE" -A 2 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 14710

🏁 Script executed:

# Search for MLM_MODEL_CFG and QUANT_CFG usage patterns in YAML files rg "MLM_MODEL_CFG|QUANT_CFG" tools/launcher/examples -A 1 -B 1

Repository: NVIDIA/Model-Optimizer

Length of output: 5902

🏁 Script executed:

# View the specific file mentioned in the review cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1879

🏁 Script executed:

# Check if there's any documentation or comments about TRUST_REMOTE_CODE in launcher configs rg "TRUST_REMOTE_CODE" tools/launcher -B 5 -A 2 | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 11043

🏁 Script executed:

# Look for any README or documentation in the launcher examples directory find tools/launcher/examples -type f -name "README*" -o -name "*.md"

Repository: NVIDIA/Model-Optimizer

Length of output: 48

Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable.

This setting forces remote code execution for all runs, removing user control over a critical security boundary. Per security guidelines, do not hardcode trust_remote_code=True; instead let callers explicitly opt in at submission time when they can justify trusting the model source.

The shell scripts (export.sh, generate.sh, quantize.sh) already support this as an optional flag with a safe default ([[ "${TRUST_REMOTE_CODE:-false}" == "true" ]]), so removing this line simply restores user control without breaking the pipeline.

Suggested patch

- DTYPE: bfloat16 - - TRUST_REMOTE_CODE: "true" slurm_config:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- TRUST_REMOTE_CODE: "true"

- DTYPE: bfloat16

slurm_config:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml` at line 34, The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE environment variable set to "true" which removes user control over remote code execution security settings. Remove the line containing TRUST_REMOTE_CODE: "true" from the megatron_bridge_export.yaml file, as the underlying shell scripts (export.sh, generate.sh, quantize.sh) already support this as an optional flag with a safe default that lets users explicitly opt in when needed, restoring proper security boundaries.

Source: Coding guidelines

coderabbitai · 2026-06-17T22:54:36Z

+      - PP: "1"
+      - EP: "8"
+      - ETP: "1"
+      - TRUST_REMOTE_CODE: "true"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the actual file content and context cd tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 2>/dev/null && pwd && ls -la || echo "Directory may not exist, searching for the file..."

Repository: NVIDIA/Model-Optimizer

Length of output: 672

🏁 Script executed:

# Find the file and examine its content fd -type f -name "megatron_bridge_generate.yaml" | head -5 | xargs -I {} sh -c 'echo "File: {}" && cat -n {}'

Repository: NVIDIA/Model-Optimizer

Length of output: 237

🏁 Script executed:

# Search for TRUST_REMOTE_CODE usage in Python code to understand how it's consumed rg "TRUST_REMOTE_CODE" -A 3 -B 3 --type py

Repository: NVIDIA/Model-Optimizer

Length of output: 1536

🏁 Script executed:

# Check for similar environment variable patterns in other launcher configs fd -type f -name "*.yaml" -path "*/tools/launcher/examples/*" | xargs grep -l "TRUST_REMOTE_CODE" 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 239

🏁 Script executed:

# Look for how environment variables are used in launcher code fd -type f -name "*.py" -path "*/tools/launcher/*" | head -10 | xargs ls -lh

Repository: NVIDIA/Model-Optimizer

Length of output: 1442

🏁 Script executed:

# Examine the actual megatron_bridge_generate.yaml file cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1939

🏁 Script executed:

# Also check the export version to see if it has the same pattern cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 1879

🏁 Script executed:

# Search for documentation on how launcher configs use environment variables find tools/launcher -name "*.md" -o -name "README*" | xargs cat 2>/dev/null | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 3272

🏁 Script executed:

# Check how fakequant_worker.py uses the TRUST_REMOTE_CODE env var cat -n examples/vllm_serve/fakequant_worker.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 5192

🏁 Script executed:

# Search for other launcher YAML configs with TRUST_REMOTE_CODE or similar patterns grep -r "TRUST_REMOTE_CODE" tools/launcher --include="*.yaml"

Repository: NVIDIA/Model-Optimizer

Length of output: 1348

🏁 Script executed:

# Check if there's any documentation about environment variable handling in launcher configs find tools/launcher -type f $ -name "*.py" -o -name "*.md" $ | xargs grep -l "environment\|TRUST_REMOTE_CODE" 2>/dev/null | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 243

🏁 Script executed:

# Check the launcher scripts to understand how environment variables flow from YAML to actual execution find tools/launcher/common -name "*.sh" | head -3 | xargs head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 3937

🏁 Script executed:

# Check if there are any configuration override mechanisms in the launcher cat tools/launcher/core.py | grep -A 10 -B 5 "environment\|override" | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1613

🏁 Script executed:

# Look for comments or documentation about setting TRUST_REMOTE_CODE in launcher configs grep -r "trust.remote\|TRUST_REMOTE" tools/launcher --include="*.md" --include="*.py" -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 48

🏁 Script executed:

# Check the launcher's query.py to see if environment variables can be overridden find tools/launcher/common -name "query.py" -exec head -50 {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 1848

🏁 Script executed:

# Check if there's any documentation or comments explaining why TRUST_REMOTE_CODE is hardcoded grep -B 10 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 419

🏁 Script executed:

# Check if launcher configs support environment variable overrides at runtime cat tools/launcher/launch.py | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 4138

🏁 Script executed:

# Look for documentation on how to override environment variables when running launcher grep -r "override\|environment" tools/launcher/docs/ tools/launcher/*.md 2>/dev/null | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1889

🏁 Script executed:

# Check the shell script that actually runs - see if it respects or overrides TRUST_REMOTE_CODE cat tools/launcher/common/megatron_bridge/generate/generate.sh 2>/dev/null | head -50

Repository: NVIDIA/Model-Optimizer

Length of output: 1911

🏁 Script executed:

# Look for any comments or inline documentation about the TRUST_REMOTE_CODE setting in similar files cat tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 1567

🏁 Script executed:

# Check if these launcher YAML files are used in actual CICD or if they're just examples grep -r "megatron_bridge_generate" . --include="*.yml" --include="*.yaml" --include="*.py" | grep -v "examples" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 48

🏁 Script executed:

# Look at how the environment variables are actually consumed in the shell script cat -n tools/launcher/common/megatron_bridge/generate/generate.sh | grep -A 30 "TRUST_REMOTE_CODE"

Repository: NVIDIA/Model-Optimizer

Length of output: 2096

🏁 Script executed:

# Check if there are any warnings or comments in the generated shell scripts about trust_remote_code find tools/launcher/common -name "*.sh" -exec grep -l "trust" {} \;

Repository: NVIDIA/Model-Optimizer

Length of output: 612

🏁 Script executed:

# Look at the launcher configuration documentation to see if there's guidance on setting sensitive env vars cat tools/launcher/docs/configuration.md 2>/dev/null || echo "File not found"

Repository: NVIDIA/Model-Optimizer

Length of output: 4251

🏁 Script executed:

# Check the CONTRIBUTING.md for any guidance on launcher configs and security grep -A 10 "launcher\|environment\|TRUST" tools/launcher/docs/contributing.md 2>/dev/null || grep -A 10 "launcher\|environment" CONTRIBUTING.md 2>/dev/null | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 277

🏁 Script executed:

# Check if these are actual production configs or just examples find tools/launcher/examples -type f -name "*.yaml" | wc -l

Repository: NVIDIA/Model-Optimizer

Length of output: 69

🏁 Script executed:

# Verify the broader scope - how many YAML configs have this issue grep -r "TRUST_REMOTE_CODE.*true" tools/launcher --include="*.yaml" | wc -l

Repository: NVIDIA/Model-Optimizer

Length of output: 68

🏁 Script executed:

# Check if the shell scripts document that these env vars can be overridden cat -n tools/launcher/common/megatron_bridge/quantize/quantize.sh | grep -A 10 -B 5 "TRUST_REMOTE_CODE"

Repository: NVIDIA/Model-Optimizer

Length of output: 1944

🏁 Script executed:

# Look for any warnings or notes about security in the example configs grep -r "security\|trusted\|remote.code" tools/launcher/examples --include="*.yaml" --include="*.md"

Repository: NVIDIA/Model-Optimizer

Length of output: 6233

Remove hardcoded TRUST_REMOTE_CODE: "true" from launcher configs; require per-run override instead.

These Megatron-Bridge launcher configs in tools/launcher/examples/nvidia/ hardcode TRUST_REMOTE_CODE: "true", violating the coding guideline that states: "Do not hardcode trust_remote_code=True". The guideline explicitly requires this security-sensitive setting to default to false and be controlled per-run after model-source vetting.

This pattern appears in all 9 NVIDIA Nemotron launcher configs (Nano, Super, Ultra—across export, generate, and ptq tasks). Either remove the environment variable entirely (shell scripts default to false) or document it as an optional per-run override, allowing users to enable it explicitly when needed.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml` at line 35, Remove the hardcoded TRUST_REMOTE_CODE environment variable assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration files in tools/launcher/examples/nvidia/. This security-sensitive setting should not be hardcoded in configs; instead, allow it to default to false and require users to explicitly enable it per-run after vetting the model source. Apply this change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from each environment variables section.

Source: Coding guidelines

coderabbitai · 2026-06-17T22:54:36Z

+      - PP: "4"
+      - EP: "2"
+      - ETP: "1"
+      - TRUST_REMOTE_CODE: "true"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not hardcode remote-code execution opt-in in the committed config.

Setting TRUST_REMOTE_CODE: "true" by default makes every run execute model-provided code. Prefer a safe default (false/unset) and require explicit operator opt-in for trusted artifacts.

As per coding guidelines: “Do not hardcode trust_remote_code=True.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml` at line 40, The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in the megatron_bridge_ptq.yaml configuration file, which enables remote code execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE: "true" line entirely from the configuration, or if it must be present, set it to "false" to enforce a safe default. This requires operators to explicitly opt-in to remote code execution for trusted artifacts rather than allowing it unconditionally.

Source: Coding guidelines

coderabbitai · 2026-06-17T22:54:36Z

+      - EP: "8"
+      - ETP: "1"
+      - DTYPE: bfloat16
+      - TRUST_REMOTE_CODE: "true"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, verify the file exists and check its content find . -path "*tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"

Repository: NVIDIA/Model-Optimizer

Length of output: 166

🏁 Script executed:

# Search for the specific YAML file and read it fd -e yaml "megatron_bridge_export.yaml" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 357

🏁 Script executed:

# Check if the file exists and its content around line 38 if [ -f "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml" ]; then echo "=== File content around line 38 ===" sed -n '30,45p' "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml" echo "" echo "=== Full file content ===" cat "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml" else echo "File not found at specified path" fi

Repository: NVIDIA/Model-Optimizer

Length of output: 2662

🏁 Script executed:

# Search for TRUST_REMOTE_CODE usage in launcher configs and scripts rg "TRUST_REMOTE_CODE" --type yaml -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 8526

🏁 Script executed:

# Check for similar patterns in other YAML configs rg "trust_remote_code|TRUST_REMOTE_CODE" -i --type yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 4802

🏁 Script executed:

# Look at the launcher scripts to understand how environment variables are processed find tools/launcher -name "*.py" -o -name "*.sh" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1011

🏁 Script executed:

# Check the export.sh script to understand how TRUST_REMOTE_CODE is used if [ -f "tools/launcher/common/megatron_bridge/export/export.sh" ]; then echo "=== export.sh script ===" cat "tools/launcher/common/megatron_bridge/export/export.sh" else echo "File not found" fi

Repository: NVIDIA/Model-Optimizer

Length of output: 4140

🏁 Script executed:

# Also check if there's a README or documentation about the launcher find tools/launcher -name "README*" -o -name "*.md" | head -10

Repository: NVIDIA/Model-Optimizer

Length of output: 291

🏁 Script executed:

# Look at the launch.py to understand how env vars are used find tools/launcher -name "launch.py" -o -name "launcher.py" | xargs ls -la 2>/dev/null

Repository: NVIDIA/Model-Optimizer

Length of output: 140

🏁 Script executed:

# Check the README for launcher guidelines cat tools/launcher/README.md | head -100

Repository: NVIDIA/Model-Optimizer

Length of output: 3006

🏁 Script executed:

# Look at the launch.py to see if there's any context about how environment variables are handled head -50 tools/launcher/launch.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2207

🏁 Script executed:

# Check if there are other scripts that also use TRUST_REMOTE_CODE similarly grep -r "TRUST_REMOTE_CODE" tools/launcher/common --include="*.sh" -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 3507

🏁 Script executed:

# Check the inline comments in other similar export/quantize configs to see if they document the reason grep -B 5 -A 1 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml

Repository: NVIDIA/Model-Optimizer

Length of output: 207

Remove hardcoded TRUST_REMOTE_CODE: "true" or document why it's required.

Per the coding guidelines, do not hardcode trust_remote_code=True. While the underlying shell script correctly defaults to false, this YAML example hardcodes the override without justification. If --trust-remote-code is required for this model, add an inline YAML comment explaining why it is necessary (e.g., "Nemotron models require custom modeling code") and confirming artifact provenance is vetted.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml` at line 38, The megatron_bridge_export.yaml file contains a hardcoded TRUST_REMOTE_CODE: "true" environment variable without justification. Either remove this line entirely if it is not necessary, or if the Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is required for this specific model and confirming that the artifact provenance has been vetted and is trusted.

Source: Coding guidelines

codecov · 2026-06-17T22:56:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.31%. Comparing base (769ea5f) to head (be54e63).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1767      +/-   ##
==========================================
+ Coverage   74.28%   74.31%   +0.03%     
==========================================
  Files         511      511              
  Lines       56360    56360              
==========================================
+ Hits        41867    41884      +17     
+ Misses      14493    14476      -17

Flag	Coverage Δ
regression	`14.70% <ø> (+0.06%)`	⬆️
unit	`54.33% <ø> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude · 2026-06-17T22:59:31Z

+CALIB_SIZE="${CALIB_SIZE:-512}"
+
+if [[ -z "${NPROC_PER_NODE}" ]]; then
+    NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)


[SUGGESTION] (applies to all three scripts — quantize.sh, generate.sh, export.sh)

NPROC_PER_NODE auto-detects via nvidia-smi --query-gpu=count, but the 9 example YAMLs request gpus_per_node: 4 from Slurm and rely on this detection rather than setting NPROC_PER_NODE explicitly.

On most Slurm clusters cgroup isolation means nvidia-smi only sees the allocated GPUs, so this works. But on setups where nvidia-smi reports all physical GPUs on the node (e.g. an 8-GPU node with only 4 allocated), NPROC_PER_NODE becomes 8 while Slurm allocated 4, so nproc_per_node × nnodes no longer equals the intended TP×PP world size — torchrun then fails with an opaque rendezvous/world-size error.

Since the YAMLs already pin gpus_per_node, consider either passing NPROC_PER_NODE: "4" explicitly in the configs, or deriving it from SLURM_GPUS_ON_NODE (which reflects the allocation) when present:

if [[ -z "${NPROC_PER_NODE}" ]]; then NPROC_PER_NODE="${SLURM_GPUS_ON_NODE:-$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)}" fi

Non-blocking — the current path is correct on cgroup-isolated clusters, which is the common case.

claude

Claude review passed — no blocking issues found. LGTM

Summary

Scope reviewed: All 12 changed files (3 new bash launcher scripts + 9 new example YAMLs). The PR is entirely under tools/launcher/ — no modelopt/ library changes — so the ModelOpt-specific concerns (mode registration, modelopt_state schema/migration, HF/TRT-LLM export-API compatibility, plugin laziness) do not apply here.

Findings by severity: CRITICAL: 0 · IMPORTANT: 0 · SUGGESTION: 1

What I verified

World-size arithmetic across all 9 configs is consistent (TP×PP == nodes×gpus_per_node): Nano PTQ 2×4=8 / TP2·PP4; Super PTQ 8·PP2=16 = 4×4; Ultra PTQ/generate 8·9=72 = 18×4; export configs all TP=1 with PP matching node×GPU counts. The Ultra-export PP=9→12 deviation is correctly documented in the YAML header (108 layers / 12 = 9 layers/stage, maps cleanly to 3×4-GPU nodes).
Script structure mirrors the existing common/megatron_lm/{quantize,export} and common/megatron_bridge/import conventions: env-var validation, Slurm-derived rendezvous with standalone fallback, cd into the example dir for sibling-module imports, exec torchrun (export.sh intentionally omits exec so it can ls/cat hf_quant_config.json afterward).
Optional-flag assembly (COMPRESS, WEIGHT_ONLY, EXPORT_KV_CACHE_QUANT, EXPORT_EXTRA_MODULES, PROMPTS) is correct and safely defaulted.

Deliberately not duplicated (already gated/flagged)

Hardcoded TRUST_REMOTE_CODE: "true" in all 9 YAMLs — this is a CodeRabbit-gated security anti-pattern (already flagged by the pre-merge check and CodeRabbit's 9 inline comments). The author documents it as a Nemotron-3 family requirement matching upstream NVIDIA-NeMo docs. Out of scope for this review per the division of labor.
Empty MASTER_ADDR if scontrol resolution fails in the multi-node branch — already posted by CodeRabbit on all three scripts.

Sole new finding (non-blocking)

[SUGGESTION] NPROC_PER_NODE auto-detection via nvidia-smi can over-count on non-cgroup-isolated Slurm nodes where more physical GPUs exist than were allocated, breaking the world-size invariant. Suggested preferring SLURM_GPUS_ON_NODE or pinning NPROC_PER_NODE in the YAMLs. Correct as-is on cgroup-isolated clusters (the common case).

Risk level: Low. Additive-only tooling change; no library/API surface touched, backward compatible by construction.

Add three shell-script wrappers under tools/launcher/common/megatron_bridge/ that drive /opt/Megatron-Bridge/examples/quantization/{quantize,ptq_generate, export}.py via torchrun, with multi-node rendezvous derived from Slurm env. Also add per-model launcher YAMLs for Nemotron-3 Nano-30B-A3B, Super-120B-A12B, and Ultra-550B-A55B following the upstream NVIDIA-NeMo/Nemotron quantization docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: James Shen <yueshen@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 61-63: After the NPROC_PER_NODE auto-detection block that uses
nvidia-smi, add an explicit validation check to ensure NPROC_PER_NODE is not
empty and contains a numeric value. If the validation fails, the script should
exit immediately with a clear error message indicating that NPROC_PER_NODE
detection failed or produced an invalid value, rather than allowing the script
to proceed to the torchrun invocation where the error would be less actionable.
- Around line 104-113: The exec command invokes quantize.py with argument names
that do not match the upstream parser's expected snake_case names. Replace the
kebab-case argument names with their correct snake_case equivalents: change
--hf-model-id to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg,
--megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to
--pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples.
Additionally, remove the --etp argument entirely since it has no corresponding
parser argument and will cause an unrecognized argument error at runtime.

In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`:
- Around line 28-36: The environment section in the launcher YAML is missing two
required environment variables according to the guidelines. Add MLM_MODEL_CFG
set to the HuggingFace repo ID (which should match the HF_MODEL_ID value of
/hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add QUANT_CFG set
to the appropriate quantization configuration value for this model. Both
variables should be added as list items in the environment section alongside the
existing variables like TP, PP, EP, and ETP.
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable
setting from the megatron_bridge_generate.yaml configuration file. This setting
forces trust-remote-code execution by default, which violates security
guidelines by automatically executing arbitrary Python code from checkpoints.
Instead, allow users to explicitly opt-in to this security-sensitive flag
through their own configuration or command-line arguments rather than having it
enforced by default in the configuration file.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7c05886a-833a-41e4-b3b9-fe5bf0ab7e10

📥 Commits

Reviewing files that changed from the base of the PR and between fd9fd4c and be54e63.

📒 Files selected for processing (12)

tools/launcher/common/megatron_bridge/export/export.sh
tools/launcher/common/megatron_bridge/generate/generate.sh
tools/launcher/common/megatron_bridge/quantize/quantize.sh
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml

✅ Files skipped from review due to trivial changes (2)

tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml

🚧 Files skipped from review as they are similar to previous changes (8)

tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
tools/launcher/common/megatron_bridge/generate/generate.sh
tools/launcher/common/megatron_bridge/export/export.sh
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml

coderabbitai · 2026-06-18T00:21:36Z

+if [[ -z "${NPROC_PER_NODE}" ]]; then
+    NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)
+fi


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate NPROC_PER_NODE before torchrun invocation.

If auto-detection yields empty/non-numeric output, the script defers failure to torchrun with a less actionable error. Add an explicit sanity check and fail fast.

Minimal fail-fast guard

if [[ -z "${NPROC_PER_NODE}" ]]; then NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1) fi +if ! [[ "${NPROC_PER_NODE}" =~ ^[1-9][0-9]*$ ]]; then + echo "[ERROR] NPROC_PER_NODE must be a positive integer (set NPROC_PER_NODE explicitly if auto-detect fails)." >&2 + exit 1 +fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 61 - 63, After the NPROC_PER_NODE auto-detection block that uses nvidia-smi, add an explicit validation check to ensure NPROC_PER_NODE is not empty and contains a numeric value. If the validation fails, the script should exit immediately with a clear error message indicating that NPROC_PER_NODE detection failed or produced an invalid value, rather than allowing the script to proceed to the torchrun invocation where the error would be less actionable.

coderabbitai · 2026-06-18T00:21:36Z

+exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \
+    --hf-model-id "${HF_MODEL_ID}" \
+    --export-quant-cfg "${EXPORT_QUANT_CFG}" \
+    --megatron-save-path "${MEGATRON_SAVE_PATH}" \
+    --tp "${TP}" \
+    --pp "${PP}" \
+    --ep "${EP}" \
+    --etp "${ETP}" \
+    --calib-size "${CALIB_SIZE}" \
+    "${EXTRA_FLAGS[@]}" \


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify launcher flags vs parser-defined flags in repo sources. LAUNCHER="tools/launcher/common/megatron_bridge/quantize/quantize.sh" PARSER="examples/megatron_bridge/quantize.py" echo "== Launcher forwarded flags ==" rg -n --no-heading --pcre2 '\s--[a-z0-9][a-z0-9-]*' "$LAUNCHER" \ | sed -E 's/.*(--[a-z0-9][a-z0-9-]*).*/\1/' \ | sort -u echo echo "== Parser add_argument options ==" rg -n --no-heading --pcre2 'add_argument\(\s*"--[A-Za-z0-9_]+"' "$PARSER" \ | sed -E 's/.*"--([^"]+)".*/--\1/' \ | sort -u echo echo "== Potential launcher-only flags (likely mismatch) ==" comm -23 \ <(rg -o --no-line-number --pcre2 '(?<=\s)--[a-z0-9][a-z0-9-]*' "$LAUNCHER" | sort -u) \ <(rg -o --no-line-number --pcre2 '--[A-Za-z0-9_]+' "$PARSER" | sort -u)

Repository: NVIDIA/Model-Optimizer

Length of output: 898

🏁 Script executed:

#!/bin/bash set -euo pipefail # Check the actual parser file for all add_argument calls echo "=== Full parser argument definitions ===" rg -A 2 'add_argument' examples/megatron_bridge/quantize.py | head -100 echo "" echo "=== Check for any conditional or dynamic argument handling ===" rg -n 'EXPORT_QUANT_CFG|megatron_save_path|etp|NPROC' examples/megatron_bridge/quantize.py echo "" echo "=== Launcher script context around args ===" sed -n '100,115p' tools/launcher/common/megatron_bridge/quantize/quantize.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 1818

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify the launcher is meant to call quantize.py echo "=== Check launcher command line ===" rg -A 5 'exec python' tools/launcher/common/megatron_bridge/quantize/quantize.sh echo "" echo "=== Check if quantize.py exists and its location ===" find . -name "quantize.py" -type f | grep -E "(megatron|bridge)" | head -5 echo "" echo "=== Verify argument mismatch by checking specific mappings ===" echo "Launcher argument mapping needed:" echo " --hf-model-id → --hf_model_name_or_path" echo " --export-quant-cfg → --quant_cfg" echo " --megatron-save-path → --export_megatron_path" echo " --tp → --tp_size" echo " --pp → --pp_size" echo " --ep → --ep_size" echo " --etp → (NO PARSER ARGUMENT)" echo " --calib-size → --calib_num_samples"

Repository: NVIDIA/Model-Optimizer

Length of output: 822

Launcher CLI arguments are incompatible with the upstream quantize.py parser.

The launcher forwards kebab-case arguments (lines 105–113) that do not match the parser's snake_case argument names. For example:

--hf-model-id → parser expects --hf_model_name_or_path

--export-quant-cfg → parser expects --quant_cfg

--megatron-save-path → parser expects --export_megatron_path

--tp, --pp, --ep → parser expects --tp_size, --pp_size, --ep_size

--calib-size → parser expects --calib_num_samples

--etp has no corresponding parser argument

This will cause argparse to fail at runtime with unrecognized argument errors.

Fix: align launcher arguments to parser contract

exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \ - --hf-model-id "${HF_MODEL_ID}" \ - --export-quant-cfg "${EXPORT_QUANT_CFG}" \ - --megatron-save-path "${MEGATRON_SAVE_PATH}" \ - --tp "${TP}" \ - --pp "${PP}" \ - --ep "${EP}" \ - --etp "${ETP}" \ - --calib-size "${CALIB_SIZE}" \ + --hf_model_name_or_path "${HF_MODEL_ID}" \ + --quant_cfg "${EXPORT_QUANT_CFG}" \ + --export_megatron_path "${MEGATRON_SAVE_PATH}" \ + --tp_size "${TP}" \ + --pp_size "${PP}" \ + --ep_size "${EP}" \ + --calib_num_samples "${CALIB_SIZE}" \ "${EXTRA_FLAGS[@]}" \

Remove --etp "${ETP}" unless it is handled elsewhere in the parser.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 104 - 113, The exec command invokes quantize.py with argument names that do not match the upstream parser's expected snake_case names. Replace the kebab-case argument names with their correct snake_case equivalents: change --hf-model-id to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg, --megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to --pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples. Additionally, remove the --etp argument entirely since it has no corresponding parser argument and will cause an unrecognized argument error at runtime.

coderabbitai · 2026-06-18T00:21:36Z

+    environment:
+      - HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
+      - MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calib
+      - TP: "8"
+      - PP: "9"
+      - EP: "8"
+      - ETP: "1"
+      - TRUST_REMOTE_CODE: "true"
+      - PYTORCH_ALLOC_CONF: "expandable_segments:True"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required model-config environment variables for launcher YAMLs.

This new model config omits MLM_MODEL_CFG and QUANT_CFG, which are required by the launcher YAML guidelines for newly added model configs.

Suggested change

environment: + - MLM_MODEL_CFG: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 + - QUANT_CFG: super-nvfp4-max-calib - HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 - MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calib

As per coding guidelines, “Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config” and “Set QUANT_CFG environment variable … when adding a new model config.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml` around lines 28 - 36, The environment section in the launcher YAML is missing two required environment variables according to the guidelines. Add MLM_MODEL_CFG set to the HuggingFace repo ID (which should match the HF_MODEL_ID value of /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add QUANT_CFG set to the appropriate quantization configuration value for this model. Both variables should be added as list items in the environment section alongside the existing variables like TP, PP, EP, and ETP.

Source: Coding guidelines

coderabbitai · 2026-06-18T00:21:36Z

+      - PP: "9"
+      - EP: "8"
+      - ETP: "1"
+      - TRUST_REMOTE_CODE: "true"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hardcoding remote-code execution opt-in.

TRUST_REMOTE_CODE: "true" forces --trust-remote-code for every run, which weakens the default security posture and violates the repository security rule.

Suggested change

- - TRUST_REMOTE_CODE: "true" + # Set only when explicitly needed for a trusted model source. + # - TRUST_REMOTE_CODE: "true"

As per coding guidelines, “Do not hardcode trust_remote_code=True … This flag tells Transformers to execute arbitrary Python shipped with a checkpoint.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml` at line 35, Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable setting from the megatron_bridge_generate.yaml configuration file. This setting forces trust-remote-code execution by default, which violates security guidelines by automatically executing arbitrary Python code from checkpoints. Instead, allow users to explicitly opt-in to this security-sensitive flag through their own configuration or command-line arguments rather than having it enforced by default in the configuration file.

Source: Coding guidelines

yueshen2016 requested a review from a team as a code owner June 17, 2026 22:46

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

claude Bot reviewed Jun 17, 2026

View reviewed changes

claude Bot approved these changes Jun 17, 2026

View reviewed changes

yueshen2016 force-pushed the yueshen/launcher-megatron-bridge branch from fd9fd4c to be54e63 Compare June 18, 2026 00:15

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Conversation

yueshen2016 commented Jun 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Pre-merge checks failed

❌ Failed checks (1 error)

Uh oh!

yueshen2016 commented Jun 17, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Summary

What I verified

Deliberately not duplicated (already gated/flagged)

Sole new finding (non-blocking)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

yueshen2016 commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

codecov Bot commented Jun 17, 2026 •

edited

Loading