refine submit script by hhaAndroid · Pull Request #1798 · InternLM/xtuner

hhaAndroid · 2026-05-15T10:26:05Z

背景

run_rl_submit.sh 同时承担 Ray 集群启动和 Ray Jobs 提交逻辑。原脚本里部分业务环境变量通过 shell export 设置，部分变量通过 --runtime-env-json 注入，存在来源不一致的问题。

在 Ray Jobs 模式下，更稳妥的做法是：Ray 系统启动变量只用于 ray start，训练入口和 Ray task/actor 需要的业务变量统一通过 job-level runtime_env 注入。

修改内容

将业务运行环境统一整理到 RUNTIME_ENV_JSON 中，避免依赖 raylet 启动时继承 shell 环境。
补充 job runtime env 中缺失的关键变量：
- WORLD_SIZE
- MASTER_PORT
- RAY_MASTER_ADDR
- ACCELERATOR
- RAY_MAX_CONCURRENCY
- LMDEPLOY_LOG_FILE
- XTUNER_RL_MEM_DIR
- SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION
修正原脚本中 RAY_MAX_CONCURRENCY 没有正确注入的问题。
将 RUNTIME_ENV_JSON 改为 heredoc 形式，保持变量注入逻辑直观、易维护。
保留外部环境变量覆盖能力，例如用户仍可通过 WORK_DIR、XTUNER_LOG_LEVEL、RAY_MAX_CONCURRENCY 等变量覆盖默认值。

验证

通过 bash 语法检查：

bash -n examples/v1/scripts/run_rl_submit.sh

注意事项

如果用户在目前写法下启动 ·examples/v1/scripts/run_rl_submit.sh· 前通过 export xxxx=yyy 来设置环境变量，而不是注入到 --runtime-env-json 中，那么依然是生效的，但是不推荐。因为目前正好 ray submit job 进程在 head 节点启动，rl.py 也在 head 节点，所以是可以的。

hhaAndroid · 2026-05-15T10:39:40Z

@claude review

claude · 2026-05-15T10:42:16Z

@@ -116,7 +115,7 @@ echo OUPUT_DIR is ${WORK_DIR}
 if [ "$RAY_RANK" -eq 0 ]; then


Claude: Warning — Behavior change: The old code injected XTUNER_MAX_CONCURRENCY=8192 into the runtime env, which was consumed by agent_env.py and composed_env.py for @ray.remote(max_concurrency=...). After this PR, those files read RAY_MAX_CONCURRENCY instead, and this script's runtime env will inject RAY_MAX_CONCURRENCY=${RAY_MAX_CONCURRENCY} which resolves to 2048 (set at line 64).

This is a 4x reduction in actor max_concurrency for lagent environments (8192 → 2048). Is this intentional? If the higher concurrency was needed for NPU workloads, consider updating line 64 to export RAY_MAX_CONCURRENCY=8192.

claude · 2026-05-15T10:42:20Z

+    "XTUNER_USE_LMDEPLOY": "${XTUNER_USE_LMDEPLOY}",
+    "XTUNER_USE_VLLM": "${XTUNER_USE_VLLM}",
+    "PYTORCH_CUDA_ALLOC_CONF": "${PYTORCH_CUDA_ALLOC_CONF_VALUE}",
+    "LMDEPLOY_LOG_FILE": "${LMDEPLOY_LOG_FILE}",
+    "XTUNER_RL_MEM_DIR": "${XTUNER_RL_MEM_DIR}",
+    "CUDA_DEVICE_MAX_CONNECTIONS": "1",
+    "SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN": "1",
+    "SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION": "False"


Claude: Nit: SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN and SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION are now injected unconditionally for all backends (lmdeploy, vllm included). The old code only set these for the sglang case. While this is likely harmless (other backends won't read sglang-specific vars), it's a slight change in semantics. Similarly, LMDEPLOY_LOG_FILE and XTUNER_RL_MEM_DIR will be set to (potentially empty) values for all backends/accelerators.

Consider whether this is desirable for clarity, or whether conditional logic would be more explicit.

claude · 2026-05-15T10:42:38Z

Claude: ## Summary

Good cleanup. Unifying business env vars through Ray job runtime_env instead of shell inheritance is the correct approach for multi-node reliability. The heredoc format is cleaner and the bug fix for the undefined XTUNER_MAX_CONCURRENCY in the GPU script is a solid catch.

Issues

Warning

[run_rl_submit_npu.sh:115] The rename from XTUNER_MAX_CONCURRENCY to RAY_MAX_CONCURRENCY changes the effective actor max_concurrency for lagent environments from 8192 → 2048 on NPU. Please confirm this is intentional.

Nit

[run_rl_submit.sh:148-155] sglang-specific env vars (SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN, SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION) are injected unconditionally for all backends. Harmless but slightly imprecise.

Verdict

COMMENT — looks good overall, just want confirmation on the NPU concurrency behavior change.

hhaAndroid added 2 commits May 15, 2026 10:25

refine submit script

b6973c7

remove XTUNER_MAX_CONCURRENCY

880d9c3

hhaAndroid requested review from CyCle1024 and YanhuiDua May 15, 2026 10:39

claude Bot reviewed May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refine submit script#1798

refine submit script#1798
hhaAndroid wants to merge 2 commits into
InternLM:mainfrom
hhaAndroid:refine_submit

hhaAndroid commented May 15, 2026 •

edited

Loading

Uh oh!

hhaAndroid commented May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot May 15, 2026

Uh oh!

claude Bot commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -116,7 +115,7 @@ echo OUPUT_DIR is ${WORK_DIR}
		if [ "$RAY_RANK" -eq 0 ]; then

Conversation

hhaAndroid commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

背景

修改内容

验证

注意事项

Uh oh!

hhaAndroid commented May 15, 2026

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 15, 2026

Issues

Warning

Nit

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hhaAndroid commented May 15, 2026 •

edited

Loading