Skip to content

refine submit script#1798

Open
hhaAndroid wants to merge 2 commits into
InternLM:mainfrom
hhaAndroid:refine_submit
Open

refine submit script#1798
hhaAndroid wants to merge 2 commits into
InternLM:mainfrom
hhaAndroid:refine_submit

Conversation

@hhaAndroid
Copy link
Copy Markdown
Collaborator

@hhaAndroid hhaAndroid commented May 15, 2026

背景

run_rl_submit.sh 同时承担 Ray 集群启动和 Ray Jobs 提交逻辑。原脚本里部分业务环境变量通过 shell export 设置,部分变量通过 --runtime-env-json 注入,存在来源不一致的问题。

在 Ray Jobs 模式下,更稳妥的做法是:Ray 系统启动变量只用于 ray start,训练入口和 Ray task/actor 需要的业务变量统一通过 job-level runtime_env 注入。

修改内容

  • 将业务运行环境统一整理到 RUNTIME_ENV_JSON 中,避免依赖 raylet 启动时继承 shell 环境。
  • 补充 job runtime env 中缺失的关键变量:
    • WORLD_SIZE
    • MASTER_PORT
    • RAY_MASTER_ADDR
    • ACCELERATOR
    • RAY_MAX_CONCURRENCY
    • LMDEPLOY_LOG_FILE
    • XTUNER_RL_MEM_DIR
    • SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION
  • 修正原脚本中 RAY_MAX_CONCURRENCY 没有正确注入的问题。
  • RUNTIME_ENV_JSON 改为 heredoc 形式,保持变量注入逻辑直观、易维护。
  • 保留外部环境变量覆盖能力,例如用户仍可通过 WORK_DIRXTUNER_LOG_LEVELRAY_MAX_CONCURRENCY 等变量覆盖默认值。

验证

  • 通过 bash 语法检查:
bash -n examples/v1/scripts/run_rl_submit.sh

注意事项

如果用户在目前写法下启动 ·examples/v1/scripts/run_rl_submit.sh· 前通过 export xxxx=yyy 来设置环境变量,而不是注入到 --runtime-env-json 中,那么依然是生效的,但是不推荐。因为目前正好 ray submit job 进程在 head 节点启动,rl.py 也在 head 节点,所以是可以的。

@hhaAndroid hhaAndroid requested review from CyCle1024 and YanhuiDua May 15, 2026 10:39
@hhaAndroid
Copy link
Copy Markdown
Collaborator Author

@claude review

@@ -116,7 +115,7 @@ echo OUPUT_DIR is ${WORK_DIR}
if [ "$RAY_RANK" -eq 0 ]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Warning — Behavior change: The old code injected XTUNER_MAX_CONCURRENCY=8192 into the runtime env, which was consumed by agent_env.py and composed_env.py for @ray.remote(max_concurrency=...). After this PR, those files read RAY_MAX_CONCURRENCY instead, and this script's runtime env will inject RAY_MAX_CONCURRENCY=${RAY_MAX_CONCURRENCY} which resolves to 2048 (set at line 64).

This is a 4x reduction in actor max_concurrency for lagent environments (8192 → 2048). Is this intentional? If the higher concurrency was needed for NPU workloads, consider updating line 64 to export RAY_MAX_CONCURRENCY=8192.

Comment on lines +148 to +155
"XTUNER_USE_LMDEPLOY": "${XTUNER_USE_LMDEPLOY}",
"XTUNER_USE_VLLM": "${XTUNER_USE_VLLM}",
"PYTORCH_CUDA_ALLOC_CONF": "${PYTORCH_CUDA_ALLOC_CONF_VALUE}",
"LMDEPLOY_LOG_FILE": "${LMDEPLOY_LOG_FILE}",
"XTUNER_RL_MEM_DIR": "${XTUNER_RL_MEM_DIR}",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN": "1",
"SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION": "False"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: Nit: SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN and SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION are now injected unconditionally for all backends (lmdeploy, vllm included). The old code only set these for the sglang case. While this is likely harmless (other backends won't read sglang-specific vars), it's a slight change in semantics. Similarly, LMDEPLOY_LOG_FILE and XTUNER_RL_MEM_DIR will be set to (potentially empty) values for all backends/accelerators.

Consider whether this is desirable for clarity, or whether conditional logic would be more explicit.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 15, 2026

Claude: ## Summary

Good cleanup. Unifying business env vars through Ray job runtime_env instead of shell inheritance is the correct approach for multi-node reliability. The heredoc format is cleaner and the bug fix for the undefined XTUNER_MAX_CONCURRENCY in the GPU script is a solid catch.

Issues

Warning

  • [run_rl_submit_npu.sh:115] The rename from XTUNER_MAX_CONCURRENCY to RAY_MAX_CONCURRENCY changes the effective actor max_concurrency for lagent environments from 8192 → 2048 on NPU. Please confirm this is intentional.

Nit

  • [run_rl_submit.sh:148-155] sglang-specific env vars (SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN, SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION) are injected unconditionally for all backends. Harmless but slightly imprecise.

Verdict

COMMENT — looks good overall, just want confirmation on the NPU concurrency behavior change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant