feat(launcher): add Megatron-Bridge quantize/generate/export wrappers#1767
feat(launcher): add Megatron-Bridge quantize/generate/export wrappers#1767yueshen2016 wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughAdds three reusable Bash launcher scripts ( ChangesMegatron-Bridge Launcher Scripts and Model Configs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
/claude review |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/megatron_bridge/export/export.sh`:
- Around line 67-74: After the MASTER_ADDR resolution attempt from
SLURM_NODELIST in the multi-node block (when NNODES > 1), add an explicit
validation check to ensure MASTER_ADDR is not empty. If MASTER_ADDR is still
unresolved at that point, exit the script immediately with a clear error message
before the RDZV_ARGS are constructed. This prevents passing an empty
--master-addr value and failing downstream.
In `@tools/launcher/common/megatron_bridge/generate/generate.sh`:
- Around line 58-65: When NNODES is greater than 1, the script attempts to
construct RDZV_ARGS with MASTER_ADDR, but if the Slurm hostname discovery using
scontrol fails to resolve, MASTER_ADDR remains empty and gets passed as an empty
value in the rendezvous arguments, causing runtime failure. Add a validation
check after the conditional block that attempts to set MASTER_ADDR that
explicitly verifies MASTER_ADDR is not empty when NNODES is greater than 1, and
fail with a clear error message if it is empty, preventing the script from
proceeding to build RDZV_ARGS with an invalid master address.
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 68-75: When NNODES is greater than 1, the code attempts to resolve
MASTER_ADDR from SLURM_NODELIST using scontrol, but if this resolution fails the
variable remains empty and RDZV_ARGS is still assembled with the empty value,
causing failures to surface later in torchrun with unclear error messages. Add a
validation check after the conditional block that attempts to set MASTER_ADDR to
verify it is not empty when NNODES > 1, and exit immediately with a clear error
message if MASTER_ADDR is unresolved, preventing the script from proceeding to
assemble RDZV_ARGS with undefined values.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`:
- Line 34: The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE
environment variable set to "true" which removes user control over remote code
execution security settings. Remove the line containing TRUST_REMOTE_CODE:
"true" from the megatron_bridge_export.yaml file, as the underlying shell
scripts (export.sh, generate.sh, quantize.sh) already support this as an
optional flag with a safe default that lets users explicitly opt in when needed,
restoring proper security boundaries.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`:
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE environment variable
assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration
files in tools/launcher/examples/nvidia/. This security-sensitive setting should
not be hardcoded in configs; instead, allow it to default to false and require
users to explicitly enable it per-run after vetting the model source. Apply this
change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for
export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from
each environment variables section.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`:
- Line 40: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the megatron_bridge_ptq.yaml configuration file, which enables remote code
execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE:
"true" line entirely from the configuration, or if it must be present, set it to
"false" to enforce a safe default. This requires operators to explicitly opt-in
to remote code execution for trusted artifacts rather than allowing it
unconditionally.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml`:
- Line 37: The TRUST_REMOTE_CODE environment variable is hardcoded as true in
the configuration, but per security guidelines this flag should not be baked
into default configs. Either remove the TRUST_REMOTE_CODE: "true" line entirely
and require users to explicitly pass this flag when running the example if
needed, or if this is a documented exception for NVIDIA's official trusted
artifacts, keep the line but add a clear inline comment explaining why this
security-sensitive flag is being enabled for this specific case. Apply the same
change consistently across all Nemotron variant examples (Super, Nano, Ultra)
and MiniMax examples where this pattern appears.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml`:
- Line 35: The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the YAML launcher config, which forces users to enable remote code execution
without explicit opt-in. Replace the hardcoded "true" value with a pattern that
allows runtime override while defaulting to false, such as using parameter
expansion syntax (e.g., `${TRUST_REMOTE_CODE:-false}`) so users can choose to
enable it explicitly by setting the environment variable before running the
config, rather than having it forced on by default. This change should be
applied to the TRUST_REMOTE_CODE assignment in the launcher YAML file and
similar hardcoded instances across other model config files.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`:
- Line 38: The megatron_bridge_export.yaml file contains a hardcoded
TRUST_REMOTE_CODE: "true" environment variable without justification. Either
remove this line entirely if it is not necessary, or if the
Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline
YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is
required for this specific model and confirming that the artifact provenance has
been vetted and is trusted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ebef21ea-c5a1-4eb2-adb2-6d9b32bbf043
📒 Files selected for processing (12)
tools/launcher/common/megatron_bridge/export/export.shtools/launcher/common/megatron_bridge/generate/generate.shtools/launcher/common/megatron_bridge/quantize/quantize.shtools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
| if [[ "${NNODES}" -gt 1 ]]; then | ||
| if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then | ||
| MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) | ||
| fi | ||
| MASTER_PORT="${MASTER_PORT:-29500}" | ||
| RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ | ||
| "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}") | ||
| else |
There was a problem hiding this comment.
Fail fast when MASTER_ADDR is unresolved for multi-node runs.
For NNODES>1, Line 73 can end up passing an empty --master-addr if SLURM_NODELIST resolution fails. Add an explicit guard and exit early with a clear error.
Suggested patch
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
+ if [[ -z "${MASTER_ADDR}" ]]; then
+ echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2
+ exit 1
+ fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/megatron_bridge/export/export.sh` around lines 67 - 74,
After the MASTER_ADDR resolution attempt from SLURM_NODELIST in the multi-node
block (when NNODES > 1), add an explicit validation check to ensure MASTER_ADDR
is not empty. If MASTER_ADDR is still unresolved at that point, exit the script
immediately with a clear error message before the RDZV_ARGS are constructed.
This prevents passing an empty --master-addr value and failing downstream.
| if [[ "${NNODES}" -gt 1 ]]; then | ||
| if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then | ||
| MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) | ||
| fi | ||
| MASTER_PORT="${MASTER_PORT:-29500}" | ||
| RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ | ||
| "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}") | ||
| else |
There was a problem hiding this comment.
Add a hard check for MASTER_ADDR before building multi-node rendezvous args.
When NNODES>1, the script may construct --master-addr= with an empty value if Slurm hostname discovery does not resolve, causing avoidable runtime failure.
Suggested patch
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
+ if [[ -z "${MASTER_ADDR}" ]]; then
+ echo "[ERROR] MASTER_ADDR is required when NNODES>1 (or provide SLURM_NODELIST)" >&2
+ exit 1
+ fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/megatron_bridge/generate/generate.sh` around lines 58 -
65, When NNODES is greater than 1, the script attempts to construct RDZV_ARGS
with MASTER_ADDR, but if the Slurm hostname discovery using scontrol fails to
resolve, MASTER_ADDR remains empty and gets passed as an empty value in the
rendezvous arguments, causing runtime failure. Add a validation check after the
conditional block that attempts to set MASTER_ADDR that explicitly verifies
MASTER_ADDR is not empty when NNODES is greater than 1, and fail with a clear
error message if it is empty, preventing the script from proceeding to build
RDZV_ARGS with an invalid master address.
| if [[ "${NNODES}" -gt 1 ]]; then | ||
| if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then | ||
| MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1) | ||
| fi | ||
| MASTER_PORT="${MASTER_PORT:-29500}" | ||
| RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \ | ||
| "--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}") | ||
| else |
There was a problem hiding this comment.
Fail fast when multi-node rendezvous address is unresolved.
When NNODES > 1, RDZV_ARGS is assembled even if MASTER_ADDR stays empty (e.g., missing SLURM_NODELIST or scontrol unavailable), which pushes the failure downstream into torchrun with a less actionable error.
Suggested fix
if [[ "${NNODES}" -gt 1 ]]; then
if [[ -z "${MASTER_ADDR}" && -n "${SLURM_NODELIST}" ]]; then
MASTER_ADDR=$(scontrol show hostname "${SLURM_NODELIST}" 2>/dev/null | head -n 1)
fi
+ if [[ -z "${MASTER_ADDR}" ]]; then
+ echo "[ERROR] MASTER_ADDR is required for multi-node runs (set MASTER_ADDR or SLURM_NODELIST)." >&2
+ exit 1
+ fi
MASTER_PORT="${MASTER_PORT:-29500}"
RDZV_ARGS=("--nnodes=${NNODES}" "--node-rank=${NODE_RANK}" \
"--master-addr=${MASTER_ADDR}" "--master-port=${MASTER_PORT}")
else🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 68 -
75, When NNODES is greater than 1, the code attempts to resolve MASTER_ADDR from
SLURM_NODELIST using scontrol, but if this resolution fails the variable remains
empty and RDZV_ARGS is still assembled with the empty value, causing failures to
surface later in torchrun with unclear error messages. Add a validation check
after the conditional block that attempts to set MASTER_ADDR to verify it is not
empty when NNODES > 1, and exit immediately with a clear error message if
MASTER_ADDR is unresolved, preventing the script from proceeding to assemble
RDZV_ARGS with undefined values.
| - EP: "1" | ||
| - ETP: "1" | ||
| - DTYPE: bfloat16 | ||
| - TRUST_REMOTE_CODE: "true" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find and examine the target YAML file
find . -type f -name "megatron_bridge_export.yaml" | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 363
🏁 Script executed:
# Also search for similar YAML configs to understand patterns
find tools/launcher/examples -type f -name "*.yaml" | head -10Repository: NVIDIA/Model-Optimizer
Length of output: 776
🏁 Script executed:
# Search for how TRUST_REMOTE_CODE is used in the codebase
rg "TRUST_REMOTE_CODE" -A 2 -B 2Repository: NVIDIA/Model-Optimizer
Length of output: 14710
🏁 Script executed:
# Search for MLM_MODEL_CFG and QUANT_CFG usage patterns in YAML files
rg "MLM_MODEL_CFG|QUANT_CFG" tools/launcher/examples -A 1 -B 1Repository: NVIDIA/Model-Optimizer
Length of output: 5902
🏁 Script executed:
# View the specific file mentioned in the review
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 1879
🏁 Script executed:
# Check if there's any documentation or comments about TRUST_REMOTE_CODE in launcher configs
rg "TRUST_REMOTE_CODE" tools/launcher -B 5 -A 2 | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 11043
🏁 Script executed:
# Look for any README or documentation in the launcher examples directory
find tools/launcher/examples -type f -name "README*" -o -name "*.md"Repository: NVIDIA/Model-Optimizer
Length of output: 48
Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable.
This setting forces remote code execution for all runs, removing user control over a critical security boundary. Per security guidelines, do not hardcode trust_remote_code=True; instead let callers explicitly opt in at submission time when they can justify trusting the model source.
The shell scripts (export.sh, generate.sh, quantize.sh) already support this as an optional flag with a safe default ([[ "${TRUST_REMOTE_CODE:-false}" == "true" ]]), so removing this line simply restores user control without breaking the pipeline.
Suggested patch
- DTYPE: bfloat16
- - TRUST_REMOTE_CODE: "true"
slurm_config:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - TRUST_REMOTE_CODE: "true" | |
| - DTYPE: bfloat16 | |
| slurm_config: |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml`
at line 34, The YAML configuration file contains a hardcoded TRUST_REMOTE_CODE
environment variable set to "true" which removes user control over remote code
execution security settings. Remove the line containing TRUST_REMOTE_CODE:
"true" from the megatron_bridge_export.yaml file, as the underlying shell
scripts (export.sh, generate.sh, quantize.sh) already support this as an
optional flag with a safe default that lets users explicitly opt in when needed,
restoring proper security boundaries.
Source: Coding guidelines
| - PP: "1" | ||
| - EP: "8" | ||
| - ETP: "1" | ||
| - TRUST_REMOTE_CODE: "true" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, check the actual file content and context
cd tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 2>/dev/null && pwd && ls -la || echo "Directory may not exist, searching for the file..."Repository: NVIDIA/Model-Optimizer
Length of output: 672
🏁 Script executed:
# Find the file and examine its content
fd -type f -name "megatron_bridge_generate.yaml" | head -5 | xargs -I {} sh -c 'echo "File: {}" && cat -n {}'Repository: NVIDIA/Model-Optimizer
Length of output: 237
🏁 Script executed:
# Search for TRUST_REMOTE_CODE usage in Python code to understand how it's consumed
rg "TRUST_REMOTE_CODE" -A 3 -B 3 --type pyRepository: NVIDIA/Model-Optimizer
Length of output: 1536
🏁 Script executed:
# Check for similar environment variable patterns in other launcher configs
fd -type f -name "*.yaml" -path "*/tools/launcher/examples/*" | xargs grep -l "TRUST_REMOTE_CODE" 2>/dev/nullRepository: NVIDIA/Model-Optimizer
Length of output: 239
🏁 Script executed:
# Look for how environment variables are used in launcher code
fd -type f -name "*.py" -path "*/tools/launcher/*" | head -10 | xargs ls -lhRepository: NVIDIA/Model-Optimizer
Length of output: 1442
🏁 Script executed:
# Examine the actual megatron_bridge_generate.yaml file
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 1939
🏁 Script executed:
# Also check the export version to see if it has the same pattern
cat -n tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 1879
🏁 Script executed:
# Search for documentation on how launcher configs use environment variables
find tools/launcher -name "*.md" -o -name "README*" | xargs cat 2>/dev/null | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 3272
🏁 Script executed:
# Check how fakequant_worker.py uses the TRUST_REMOTE_CODE env var
cat -n examples/vllm_serve/fakequant_worker.py | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 5192
🏁 Script executed:
# Search for other launcher YAML configs with TRUST_REMOTE_CODE or similar patterns
grep -r "TRUST_REMOTE_CODE" tools/launcher --include="*.yaml"Repository: NVIDIA/Model-Optimizer
Length of output: 1348
🏁 Script executed:
# Check if there's any documentation about environment variable handling in launcher configs
find tools/launcher -type f \( -name "*.py" -o -name "*.md" \) | xargs grep -l "environment\|TRUST_REMOTE_CODE" 2>/dev/null | head -5Repository: NVIDIA/Model-Optimizer
Length of output: 243
🏁 Script executed:
# Check the launcher scripts to understand how environment variables flow from YAML to actual execution
find tools/launcher/common -name "*.sh" | head -3 | xargs head -30Repository: NVIDIA/Model-Optimizer
Length of output: 3937
🏁 Script executed:
# Check if there are any configuration override mechanisms in the launcher
cat tools/launcher/core.py | grep -A 10 -B 5 "environment\|override" | head -50Repository: NVIDIA/Model-Optimizer
Length of output: 1613
🏁 Script executed:
# Look for comments or documentation about setting TRUST_REMOTE_CODE in launcher configs
grep -r "trust.remote\|TRUST_REMOTE" tools/launcher --include="*.md" --include="*.py" -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
# Check the launcher's query.py to see if environment variables can be overridden
find tools/launcher/common -name "query.py" -exec head -50 {} \;Repository: NVIDIA/Model-Optimizer
Length of output: 1848
🏁 Script executed:
# Check if there's any documentation or comments explaining why TRUST_REMOTE_CODE is hardcoded
grep -B 10 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 419
🏁 Script executed:
# Check if launcher configs support environment variable overrides at runtime
cat tools/launcher/launch.py | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 4138
🏁 Script executed:
# Look for documentation on how to override environment variables when running launcher
grep -r "override\|environment" tools/launcher/docs/ tools/launcher/*.md 2>/dev/null | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 1889
🏁 Script executed:
# Check the shell script that actually runs - see if it respects or overrides TRUST_REMOTE_CODE
cat tools/launcher/common/megatron_bridge/generate/generate.sh 2>/dev/null | head -50Repository: NVIDIA/Model-Optimizer
Length of output: 1911
🏁 Script executed:
# Look for any comments or inline documentation about the TRUST_REMOTE_CODE setting in similar files
cat tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml | head -40Repository: NVIDIA/Model-Optimizer
Length of output: 1567
🏁 Script executed:
# Check if these launcher YAML files are used in actual CICD or if they're just examples
grep -r "megatron_bridge_generate" . --include="*.yml" --include="*.yaml" --include="*.py" | grep -v "examples" | head -10Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
# Look at how the environment variables are actually consumed in the shell script
cat -n tools/launcher/common/megatron_bridge/generate/generate.sh | grep -A 30 "TRUST_REMOTE_CODE"Repository: NVIDIA/Model-Optimizer
Length of output: 2096
🏁 Script executed:
# Check if there are any warnings or comments in the generated shell scripts about trust_remote_code
find tools/launcher/common -name "*.sh" -exec grep -l "trust" {} \;Repository: NVIDIA/Model-Optimizer
Length of output: 612
🏁 Script executed:
# Look at the launcher configuration documentation to see if there's guidance on setting sensitive env vars
cat tools/launcher/docs/configuration.md 2>/dev/null || echo "File not found"Repository: NVIDIA/Model-Optimizer
Length of output: 4251
🏁 Script executed:
# Check the CONTRIBUTING.md for any guidance on launcher configs and security
grep -A 10 "launcher\|environment\|TRUST" tools/launcher/docs/contributing.md 2>/dev/null || grep -A 10 "launcher\|environment" CONTRIBUTING.md 2>/dev/null | head -40Repository: NVIDIA/Model-Optimizer
Length of output: 277
🏁 Script executed:
# Check if these are actual production configs or just examples
find tools/launcher/examples -type f -name "*.yaml" | wc -lRepository: NVIDIA/Model-Optimizer
Length of output: 69
🏁 Script executed:
# Verify the broader scope - how many YAML configs have this issue
grep -r "TRUST_REMOTE_CODE.*true" tools/launcher --include="*.yaml" | wc -lRepository: NVIDIA/Model-Optimizer
Length of output: 68
🏁 Script executed:
# Check if the shell scripts document that these env vars can be overridden
cat -n tools/launcher/common/megatron_bridge/quantize/quantize.sh | grep -A 10 -B 5 "TRUST_REMOTE_CODE"Repository: NVIDIA/Model-Optimizer
Length of output: 1944
🏁 Script executed:
# Look for any warnings or notes about security in the example configs
grep -r "security\|trusted\|remote.code" tools/launcher/examples --include="*.yaml" --include="*.md"Repository: NVIDIA/Model-Optimizer
Length of output: 6233
Remove hardcoded TRUST_REMOTE_CODE: "true" from launcher configs; require per-run override instead.
These Megatron-Bridge launcher configs in tools/launcher/examples/nvidia/ hardcode TRUST_REMOTE_CODE: "true", violating the coding guideline that states: "Do not hardcode trust_remote_code=True". The guideline explicitly requires this security-sensitive setting to default to false and be controlled per-run after model-source vetting.
This pattern appears in all 9 NVIDIA Nemotron launcher configs (Nano, Super, Ultra—across export, generate, and ptq tasks). Either remove the environment variable entirely (shell scripts default to false) or document it as an optional per-run override, allowing users to enable it explicitly when needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml`
at line 35, Remove the hardcoded TRUST_REMOTE_CODE environment variable
assignment (set to "true") from all NVIDIA Nemotron launcher YAML configuration
files in tools/launcher/examples/nvidia/. This security-sensitive setting should
not be hardcoded in configs; instead, allow it to default to false and require
users to explicitly enable it per-run after vetting the model source. Apply this
change across all 9 Nemotron launcher configs (Nano, Super, Ultra variants for
export, generate, and ptq tasks) by deleting the TRUST_REMOTE_CODE line from
each environment variables section.
Source: Coding guidelines
| - PP: "4" | ||
| - EP: "2" | ||
| - ETP: "1" | ||
| - TRUST_REMOTE_CODE: "true" |
There was a problem hiding this comment.
Do not hardcode remote-code execution opt-in in the committed config.
Setting TRUST_REMOTE_CODE: "true" by default makes every run execute model-provided code. Prefer a safe default (false/unset) and require explicit operator opt-in for trusted artifacts.
As per coding guidelines: “Do not hardcode trust_remote_code=True.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml`
at line 40, The TRUST_REMOTE_CODE environment variable is hardcoded to "true" in
the megatron_bridge_ptq.yaml configuration file, which enables remote code
execution by default and poses a security risk. Remove the TRUST_REMOTE_CODE:
"true" line entirely from the configuration, or if it must be present, set it to
"false" to enforce a safe default. This requires operators to explicitly opt-in
to remote code execution for trusted artifacts rather than allowing it
unconditionally.
Source: Coding guidelines
| - EP: "8" | ||
| - ETP: "1" | ||
| - DTYPE: bfloat16 | ||
| - TRUST_REMOTE_CODE: "true" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, verify the file exists and check its content
find . -path "*tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"Repository: NVIDIA/Model-Optimizer
Length of output: 166
🏁 Script executed:
# Search for the specific YAML file and read it
fd -e yaml "megatron_bridge_export.yaml" | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 357
🏁 Script executed:
# Check if the file exists and its content around line 38
if [ -f "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml" ]; then
echo "=== File content around line 38 ==="
sed -n '30,45p' "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"
echo ""
echo "=== Full file content ==="
cat "tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml"
else
echo "File not found at specified path"
fiRepository: NVIDIA/Model-Optimizer
Length of output: 2662
🏁 Script executed:
# Search for TRUST_REMOTE_CODE usage in launcher configs and scripts
rg "TRUST_REMOTE_CODE" --type yaml -B 3 -A 3Repository: NVIDIA/Model-Optimizer
Length of output: 8526
🏁 Script executed:
# Check for similar patterns in other YAML configs
rg "trust_remote_code|TRUST_REMOTE_CODE" -i --type yamlRepository: NVIDIA/Model-Optimizer
Length of output: 4802
🏁 Script executed:
# Look at the launcher scripts to understand how environment variables are processed
find tools/launcher -name "*.py" -o -name "*.sh" | head -20Repository: NVIDIA/Model-Optimizer
Length of output: 1011
🏁 Script executed:
# Check the export.sh script to understand how TRUST_REMOTE_CODE is used
if [ -f "tools/launcher/common/megatron_bridge/export/export.sh" ]; then
echo "=== export.sh script ==="
cat "tools/launcher/common/megatron_bridge/export/export.sh"
else
echo "File not found"
fiRepository: NVIDIA/Model-Optimizer
Length of output: 4140
🏁 Script executed:
# Also check if there's a README or documentation about the launcher
find tools/launcher -name "README*" -o -name "*.md" | head -10Repository: NVIDIA/Model-Optimizer
Length of output: 291
🏁 Script executed:
# Look at the launch.py to understand how env vars are used
find tools/launcher -name "launch.py" -o -name "launcher.py" | xargs ls -la 2>/dev/nullRepository: NVIDIA/Model-Optimizer
Length of output: 140
🏁 Script executed:
# Check the README for launcher guidelines
cat tools/launcher/README.md | head -100Repository: NVIDIA/Model-Optimizer
Length of output: 3006
🏁 Script executed:
# Look at the launch.py to see if there's any context about how environment variables are handled
head -50 tools/launcher/launch.pyRepository: NVIDIA/Model-Optimizer
Length of output: 2207
🏁 Script executed:
# Check if there are other scripts that also use TRUST_REMOTE_CODE similarly
grep -r "TRUST_REMOTE_CODE" tools/launcher/common --include="*.sh" -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 3507
🏁 Script executed:
# Check the inline comments in other similar export/quantize configs to see if they document the reason
grep -B 5 -A 1 "TRUST_REMOTE_CODE" tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yamlRepository: NVIDIA/Model-Optimizer
Length of output: 207
Remove hardcoded TRUST_REMOTE_CODE: "true" or document why it's required.
Per the coding guidelines, do not hardcode trust_remote_code=True. While the underlying shell script correctly defaults to false, this YAML example hardcodes the override without justification. If --trust-remote-code is required for this model, add an inline YAML comment explaining why it is necessary (e.g., "Nemotron models require custom modeling code") and confirming artifact provenance is vetted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml`
at line 38, The megatron_bridge_export.yaml file contains a hardcoded
TRUST_REMOTE_CODE: "true" environment variable without justification. Either
remove this line entirely if it is not necessary, or if the
Nemotron-3-Ultra-550B model requires custom remote code execution, add an inline
YAML comment above the TRUST_REMOTE_CODE: "true" line explaining why it is
required for this specific model and confirming that the artifact provenance has
been vetted and is trusted.
Source: Coding guidelines
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1767 +/- ##
==========================================
+ Coverage 74.28% 74.31% +0.03%
==========================================
Files 511 511
Lines 56360 56360
==========================================
+ Hits 41867 41884 +17
+ Misses 14493 14476 -17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| CALIB_SIZE="${CALIB_SIZE:-512}" | ||
|
|
||
| if [[ -z "${NPROC_PER_NODE}" ]]; then | ||
| NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1) |
There was a problem hiding this comment.
[SUGGESTION] (applies to all three scripts — quantize.sh, generate.sh, export.sh)
NPROC_PER_NODE auto-detects via nvidia-smi --query-gpu=count, but the 9 example YAMLs request gpus_per_node: 4 from Slurm and rely on this detection rather than setting NPROC_PER_NODE explicitly.
On most Slurm clusters cgroup isolation means nvidia-smi only sees the allocated GPUs, so this works. But on setups where nvidia-smi reports all physical GPUs on the node (e.g. an 8-GPU node with only 4 allocated), NPROC_PER_NODE becomes 8 while Slurm allocated 4, so nproc_per_node × nnodes no longer equals the intended TP×PP world size — torchrun then fails with an opaque rendezvous/world-size error.
Since the YAMLs already pin gpus_per_node, consider either passing NPROC_PER_NODE: "4" explicitly in the configs, or deriving it from SLURM_GPUS_ON_NODE (which reflects the allocation) when present:
if [[ -z "${NPROC_PER_NODE}" ]]; then
NPROC_PER_NODE="${SLURM_GPUS_ON_NODE:-$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)}"
fiNon-blocking — the current path is correct on cgroup-isolated clusters, which is the common case.
There was a problem hiding this comment.
Claude review passed — no blocking issues found. LGTM
Summary
Scope reviewed: All 12 changed files (3 new bash launcher scripts + 9 new example YAMLs). The PR is entirely under tools/launcher/ — no modelopt/ library changes — so the ModelOpt-specific concerns (mode registration, modelopt_state schema/migration, HF/TRT-LLM export-API compatibility, plugin laziness) do not apply here.
Findings by severity: CRITICAL: 0 · IMPORTANT: 0 · SUGGESTION: 1
What I verified
- World-size arithmetic across all 9 configs is consistent (
TP×PP == nodes×gpus_per_node): Nano PTQ 2×4=8 / TP2·PP4; Super PTQ 8·PP2=16 = 4×4; Ultra PTQ/generate 8·9=72 = 18×4; export configs all TP=1 with PP matching node×GPU counts. The Ultra-export PP=9→12 deviation is correctly documented in the YAML header (108 layers / 12 = 9 layers/stage, maps cleanly to 3×4-GPU nodes). - Script structure mirrors the existing
common/megatron_lm/{quantize,export}andcommon/megatron_bridge/importconventions: env-var validation, Slurm-derived rendezvous with standalone fallback,cdinto the example dir for sibling-module imports,exec torchrun(export.sh intentionally omitsexecso it canls/cat hf_quant_config.jsonafterward). - Optional-flag assembly (
COMPRESS,WEIGHT_ONLY,EXPORT_KV_CACHE_QUANT,EXPORT_EXTRA_MODULES,PROMPTS) is correct and safely defaulted.
Deliberately not duplicated (already gated/flagged)
- Hardcoded
TRUST_REMOTE_CODE: "true"in all 9 YAMLs — this is a CodeRabbit-gated security anti-pattern (already flagged by the pre-merge check and CodeRabbit's 9 inline comments). The author documents it as a Nemotron-3 family requirement matching upstream NVIDIA-NeMo docs. Out of scope for this review per the division of labor. - Empty
MASTER_ADDRifscontrolresolution fails in the multi-node branch — already posted by CodeRabbit on all three scripts.
Sole new finding (non-blocking)
- [SUGGESTION]
NPROC_PER_NODEauto-detection vianvidia-smican over-count on non-cgroup-isolated Slurm nodes where more physical GPUs exist than were allocated, breaking the world-size invariant. Suggested preferringSLURM_GPUS_ON_NODEor pinningNPROC_PER_NODEin the YAMLs. Correct as-is on cgroup-isolated clusters (the common case).
Risk level: Low. Additive-only tooling change; no library/API surface touched, backward compatible by construction.
Add three shell-script wrappers under tools/launcher/common/megatron_bridge/
that drive /opt/Megatron-Bridge/examples/quantization/{quantize,ptq_generate,
export}.py via torchrun, with multi-node rendezvous derived from Slurm env.
Also add per-model launcher YAMLs for Nemotron-3 Nano-30B-A3B, Super-120B-A12B,
and Ultra-550B-A55B following the upstream NVIDIA-NeMo/Nemotron quantization
docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
fd9fd4c to
be54e63
Compare
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh`:
- Around line 61-63: After the NPROC_PER_NODE auto-detection block that uses
nvidia-smi, add an explicit validation check to ensure NPROC_PER_NODE is not
empty and contains a numeric value. If the validation fails, the script should
exit immediately with a clear error message indicating that NPROC_PER_NODE
detection failed or produced an invalid value, rather than allowing the script
to proceed to the torchrun invocation where the error would be less actionable.
- Around line 104-113: The exec command invokes quantize.py with argument names
that do not match the upstream parser's expected snake_case names. Replace the
kebab-case argument names with their correct snake_case equivalents: change
--hf-model-id to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg,
--megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to
--pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples.
Additionally, remove the --etp argument entirely since it has no corresponding
parser argument and will cause an unrecognized argument error at runtime.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`:
- Around line 28-36: The environment section in the launcher YAML is missing two
required environment variables according to the guidelines. Add MLM_MODEL_CFG
set to the HuggingFace repo ID (which should match the HF_MODEL_ID value of
/hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add QUANT_CFG set
to the appropriate quantization configuration value for this model. Both
variables should be added as list items in the environment section alongside the
existing variables like TP, PP, EP, and ETP.
- Line 35: Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable
setting from the megatron_bridge_generate.yaml configuration file. This setting
forces trust-remote-code execution by default, which violates security
guidelines by automatically executing arbitrary Python code from checkpoints.
Instead, allow users to explicitly opt-in to this security-sensitive flag
through their own configuration or command-line arguments rather than having it
enforced by default in the configuration file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7c05886a-833a-41e4-b3b9-fe5bf0ab7e10
📒 Files selected for processing (12)
tools/launcher/common/megatron_bridge/export/export.shtools/launcher/common/megatron_bridge/generate/generate.shtools/launcher/common/megatron_bridge/quantize/quantize.shtools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yamltools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
✅ Files skipped from review due to trivial changes (2)
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_ptq.yaml
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_ptq.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_generate.yaml
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_export.yaml
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_ptq.yaml
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_export.yaml
- tools/launcher/common/megatron_bridge/generate/generate.sh
- tools/launcher/common/megatron_bridge/export/export.sh
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_generate.yaml
- tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_export.yaml
| if [[ -z "${NPROC_PER_NODE}" ]]; then | ||
| NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1) | ||
| fi |
There was a problem hiding this comment.
Validate NPROC_PER_NODE before torchrun invocation.
If auto-detection yields empty/non-numeric output, the script defers failure to torchrun with a less actionable error. Add an explicit sanity check and fail fast.
Minimal fail-fast guard
if [[ -z "${NPROC_PER_NODE}" ]]; then
NPROC_PER_NODE=$(nvidia-smi --query-gpu=count --format=csv,noheader,nounits | head -n 1)
fi
+if ! [[ "${NPROC_PER_NODE}" =~ ^[1-9][0-9]*$ ]]; then
+ echo "[ERROR] NPROC_PER_NODE must be a positive integer (set NPROC_PER_NODE explicitly if auto-detect fails)." >&2
+ exit 1
+fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 61 -
63, After the NPROC_PER_NODE auto-detection block that uses nvidia-smi, add an
explicit validation check to ensure NPROC_PER_NODE is not empty and contains a
numeric value. If the validation fails, the script should exit immediately with
a clear error message indicating that NPROC_PER_NODE detection failed or
produced an invalid value, rather than allowing the script to proceed to the
torchrun invocation where the error would be less actionable.
| exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \ | ||
| --hf-model-id "${HF_MODEL_ID}" \ | ||
| --export-quant-cfg "${EXPORT_QUANT_CFG}" \ | ||
| --megatron-save-path "${MEGATRON_SAVE_PATH}" \ | ||
| --tp "${TP}" \ | ||
| --pp "${PP}" \ | ||
| --ep "${EP}" \ | ||
| --etp "${ETP}" \ | ||
| --calib-size "${CALIB_SIZE}" \ | ||
| "${EXTRA_FLAGS[@]}" \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify launcher flags vs parser-defined flags in repo sources.
LAUNCHER="tools/launcher/common/megatron_bridge/quantize/quantize.sh"
PARSER="examples/megatron_bridge/quantize.py"
echo "== Launcher forwarded flags =="
rg -n --no-heading --pcre2 '\s--[a-z0-9][a-z0-9-]*' "$LAUNCHER" \
| sed -E 's/.*(--[a-z0-9][a-z0-9-]*).*/\1/' \
| sort -u
echo
echo "== Parser add_argument options =="
rg -n --no-heading --pcre2 'add_argument\(\s*"--[A-Za-z0-9_]+"' "$PARSER" \
| sed -E 's/.*"--([^"]+)".*/--\1/' \
| sort -u
echo
echo "== Potential launcher-only flags (likely mismatch) =="
comm -23 \
<(rg -o --no-line-number --pcre2 '(?<=\s)--[a-z0-9][a-z0-9-]*' "$LAUNCHER" | sort -u) \
<(rg -o --no-line-number --pcre2 '--[A-Za-z0-9_]+' "$PARSER" | sort -u)Repository: NVIDIA/Model-Optimizer
Length of output: 898
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the actual parser file for all add_argument calls
echo "=== Full parser argument definitions ==="
rg -A 2 'add_argument' examples/megatron_bridge/quantize.py | head -100
echo ""
echo "=== Check for any conditional or dynamic argument handling ==="
rg -n 'EXPORT_QUANT_CFG|megatron_save_path|etp|NPROC' examples/megatron_bridge/quantize.py
echo ""
echo "=== Launcher script context around args ==="
sed -n '100,115p' tools/launcher/common/megatron_bridge/quantize/quantize.shRepository: NVIDIA/Model-Optimizer
Length of output: 1818
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the launcher is meant to call quantize.py
echo "=== Check launcher command line ==="
rg -A 5 'exec python' tools/launcher/common/megatron_bridge/quantize/quantize.sh
echo ""
echo "=== Check if quantize.py exists and its location ==="
find . -name "quantize.py" -type f | grep -E "(megatron|bridge)" | head -5
echo ""
echo "=== Verify argument mismatch by checking specific mappings ==="
echo "Launcher argument mapping needed:"
echo " --hf-model-id → --hf_model_name_or_path"
echo " --export-quant-cfg → --quant_cfg"
echo " --megatron-save-path → --export_megatron_path"
echo " --tp → --tp_size"
echo " --pp → --pp_size"
echo " --ep → --ep_size"
echo " --etp → (NO PARSER ARGUMENT)"
echo " --calib-size → --calib_num_samples"Repository: NVIDIA/Model-Optimizer
Length of output: 822
Launcher CLI arguments are incompatible with the upstream quantize.py parser.
The launcher forwards kebab-case arguments (lines 105–113) that do not match the parser's snake_case argument names. For example:
--hf-model-id→ parser expects--hf_model_name_or_path--export-quant-cfg→ parser expects--quant_cfg--megatron-save-path→ parser expects--export_megatron_path--tp,--pp,--ep→ parser expects--tp_size,--pp_size,--ep_size--calib-size→ parser expects--calib_num_samples--etphas no corresponding parser argument
This will cause argparse to fail at runtime with unrecognized argument errors.
Fix: align launcher arguments to parser contract
exec python -m torch.distributed.run --nproc_per_node "${NPROC_PER_NODE}" "${RDZV_ARGS[@]}" quantize.py \
- --hf-model-id "${HF_MODEL_ID}" \
- --export-quant-cfg "${EXPORT_QUANT_CFG}" \
- --megatron-save-path "${MEGATRON_SAVE_PATH}" \
- --tp "${TP}" \
- --pp "${PP}" \
- --ep "${EP}" \
- --etp "${ETP}" \
- --calib-size "${CALIB_SIZE}" \
+ --hf_model_name_or_path "${HF_MODEL_ID}" \
+ --quant_cfg "${EXPORT_QUANT_CFG}" \
+ --export_megatron_path "${MEGATRON_SAVE_PATH}" \
+ --tp_size "${TP}" \
+ --pp_size "${PP}" \
+ --ep_size "${EP}" \
+ --calib_num_samples "${CALIB_SIZE}" \
"${EXTRA_FLAGS[@]}" \Remove --etp "${ETP}" unless it is handled elsewhere in the parser.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/launcher/common/megatron_bridge/quantize/quantize.sh` around lines 104
- 113, The exec command invokes quantize.py with argument names that do not
match the upstream parser's expected snake_case names. Replace the kebab-case
argument names with their correct snake_case equivalents: change --hf-model-id
to --hf_model_name_or_path, --export-quant-cfg to --quant_cfg,
--megatron-save-path to --export_megatron_path, --tp to --tp_size, --pp to
--pp_size, --ep to --ep_size, and --calib-size to --calib_num_samples.
Additionally, remove the --etp argument entirely since it has no corresponding
parser argument and will cause an unrecognized argument error at runtime.
| environment: | ||
| - HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 | ||
| - MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calib | ||
| - TP: "8" | ||
| - PP: "9" | ||
| - EP: "8" | ||
| - ETP: "1" | ||
| - TRUST_REMOTE_CODE: "true" | ||
| - PYTORCH_ALLOC_CONF: "expandable_segments:True" |
There was a problem hiding this comment.
Add required model-config environment variables for launcher YAMLs.
This new model config omits MLM_MODEL_CFG and QUANT_CFG, which are required by the launcher YAML guidelines for newly added model configs.
Suggested change
environment:
+ - MLM_MODEL_CFG: nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
+ - QUANT_CFG: super-nvfp4-max-calib
- HF_MODEL_ID: /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
- MEGATRON_LOAD_PATH: <<global_vars.output_dir>>/NVIDIA-Nemotron-3-Ultra-550B-A55B-super-nvfp4-max-calibAs per coding guidelines, “Set MLM_MODEL_CFG environment variable to the HuggingFace repo ID when adding a new model config” and “Set QUANT_CFG environment variable … when adding a new model config.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`
around lines 28 - 36, The environment section in the launcher YAML is missing
two required environment variables according to the guidelines. Add
MLM_MODEL_CFG set to the HuggingFace repo ID (which should match the HF_MODEL_ID
value of /hf-local/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16) and add
QUANT_CFG set to the appropriate quantization configuration value for this
model. Both variables should be added as list items in the environment section
alongside the existing variables like TP, PP, EP, and ETP.
Source: Coding guidelines
| - PP: "9" | ||
| - EP: "8" | ||
| - ETP: "1" | ||
| - TRUST_REMOTE_CODE: "true" |
There was a problem hiding this comment.
Avoid hardcoding remote-code execution opt-in.
TRUST_REMOTE_CODE: "true" forces --trust-remote-code for every run, which weakens the default security posture and violates the repository security rule.
Suggested change
- - TRUST_REMOTE_CODE: "true"
+ # Set only when explicitly needed for a trusted model source.
+ # - TRUST_REMOTE_CODE: "true"As per coding guidelines, “Do not hardcode trust_remote_code=True … This flag tells Transformers to execute arbitrary Python shipped with a checkpoint.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@tools/launcher/examples/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_generate.yaml`
at line 35, Remove the hardcoded TRUST_REMOTE_CODE: "true" environment variable
setting from the megatron_bridge_generate.yaml configuration file. This setting
forces trust-remote-code execution by default, which violates security
guidelines by automatically executing arbitrary Python code from checkpoints.
Instead, allow users to explicitly opt-in to this security-sensitive flag
through their own configuration or command-line arguments rather than having it
enforced by default in the configuration file.
Source: Coding guidelines
What does this PR do?
Type of change: new example
Adds Megatron-Bridge PTQ wrappers to the launcher so users can quantize / generate / export Hugging Face models through Megatron-Bridge on Slurm or Docker, mirroring the existing
common/megatron_lm/{quantize,export}/andcommon/megatron_bridge/import/conventions.Common scripts (
tools/launcher/common/megatron_bridge/):quantize/quantize.sh— wraps/opt/Megatron-Bridge/examples/quantization/quantize.pygenerate/generate.sh— wraps/opt/Megatron-Bridge/examples/quantization/ptq_generate.pyexport/export.sh— wraps/opt/Megatron-Bridge/examples/quantization/export.pyAll three assume
nvcr.io/nvidia/nemo:26.04+(Megatron-Bridge preinstalled), read configuration from env vars (HF_MODEL_ID,MEGATRON_SAVE_PATH/MEGATRON_LOAD_PATH,EXPORT_QUANT_CFG,TP/PP/EP/ETP,DTYPE, …), and launch viapython -m torch.distributed.runwith multi-node rendezvous (--nnodes/--node-rank/--master-addr/--master-port) derived fromSLURM_NNODES/SLURM_NODEID/SLURM_NODELIST, falling back to--standalonefor single-node.Per-model example YAMLs (
tools/launcher/examples/nvidia/):NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/megatron_bridge_{ptq,generate,export}.yaml— 2 nodes × 4 GPUs (TP=2 PP=4 EP=2 for PTQ, TP=8 EP=8 for generate, TP=1 PP=8 for export),nvfp4config.NVIDIA-Nemotron-3-Super-120B-A12B-BF16/megatron_bridge_{ptq,generate,export}.yaml— 4 nodes × 4 GPUs for PTQ/generate (TP=8 PP=2 EP=8), 2 nodes × 4 GPUs for export (TP=1 PP=8),mamba_moe_nvfp4_conservativeper super3 docs.NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16/megatron_bridge_{ptq,generate,export}.yaml— 18 nodes × 4 GPUs for PTQ/generate (TP=8 PP=9 EP=8,super-nvfp4-max-calib.yamlrecipe,PYTORCH_ALLOC_CONF=expandable_segments:True), 3 nodes × 4 GPUs for export (TP=1 PP=12 EP=8 — PP bumped from doc's 9 so the world maps cleanly to 4-GPU nodes; 108 layers / 12 = 9 layers/stage), per ultra3 docs.Usage
Testing
bash -nsyntax check on all three shell scripts (pass).uv run launch.py --dryrun --yeson all 9 example YAMLs (all parse, resolve<<global_vars.X>>, and build the FiddleSandboxPipelinecleanly).Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/A — no copied code, no new PIP dependencies.--dryrunpath is the validation surface for example YAMLs and was used here; no new unit tests added. Happy to add launcher YAML regression tests if requested.moonshotai/launcher examples were mentioned in 0.46)./claude reviewafter PR creation.Additional Information
trust_remote_code=True: the Nemotron-3 family requires it, and all three Nemotron YAMLs setTRUST_REMOTE_CODE=trueconsistent with the upstream NVIDIA-NeMo/Nemotron docs.--pp 9toPP=12so the world (12 ranks) maps cleanly to 3 × 4-GPU nodes; this is called out in the YAML's header comment.Summary by CodeRabbit
Release Notes