refine submit script #1798
Conversation
@claude review
@@ -116,7 +115,7 @@ echo OUPUT_DIR is ${WORK_DIR}
if [ "$RAY_RANK" -eq 0 ]; then
Claude: Warning — Behavior change: The old code injected XTUNER_MAX_CONCURRENCY=8192 into the runtime env, which was consumed by agent_env.py and composed_env.py for @ray.remote(max_concurrency=...). After this PR, those files read RAY_MAX_CONCURRENCY instead, and this script's runtime env will inject RAY_MAX_CONCURRENCY=${RAY_MAX_CONCURRENCY} which resolves to 2048 (set at line 64).
This is a 4x reduction in actor max_concurrency for lagent environments (8192 → 2048). Is this intentional? If the higher concurrency was needed for NPU workloads, consider updating line 64 to export RAY_MAX_CONCURRENCY=8192.
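If the higher concurrency really is needed only on NPU, one hedged option is to key the default off the accelerator instead of hard-coding a single value. This is a sketch, not the script's actual logic; `ACCELERATOR` and `RAY_MAX_CONCURRENCY` are names from this PR, and the `npu` value is an assumption:

```shell
# Sketch only: keep the old concurrency on NPU, use the new default elsewhere.
# "npu" as the ACCELERATOR value is assumed, not taken from the script.
if [ "${ACCELERATOR:-gpu}" = "npu" ]; then
  export RAY_MAX_CONCURRENCY="${RAY_MAX_CONCURRENCY:-8192}"  # old XTUNER_MAX_CONCURRENCY value
else
  export RAY_MAX_CONCURRENCY="${RAY_MAX_CONCURRENCY:-2048}"  # new default (line 64 of the diff)
fi
echo "RAY_MAX_CONCURRENCY=${RAY_MAX_CONCURRENCY}"
```

The `:-` defaults also preserve user overrides: an `export RAY_MAX_CONCURRENCY=...` set before the script wins on either branch.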
"XTUNER_USE_LMDEPLOY": "${XTUNER_USE_LMDEPLOY}",
"XTUNER_USE_VLLM": "${XTUNER_USE_VLLM}",
"PYTORCH_CUDA_ALLOC_CONF": "${PYTORCH_CUDA_ALLOC_CONF_VALUE}",
"LMDEPLOY_LOG_FILE": "${LMDEPLOY_LOG_FILE}",
"XTUNER_RL_MEM_DIR": "${XTUNER_RL_MEM_DIR}",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN": "1",
"SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION": "False"
Claude: Nit: SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN and SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION are now injected unconditionally for all backends (lmdeploy, vllm included). The old code only set these for the sglang case. While this is likely harmless (other backends won't read sglang-specific vars), it's a slight change in semantics. Similarly, LMDEPLOY_LOG_FILE and XTUNER_RL_MEM_DIR will be set to (potentially empty) values for all backends/accelerators.
Consider whether this is desirable for clarity, or whether conditional logic would be more explicit.
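If conditional injection is preferred, the backend-specific fragment can be built before the JSON is assembled. A sketch only: `ROLLOUT_BACKEND` is a hypothetical variable name standing in for however this script selects a backend; the sglang variable names are from this diff:

```shell
# Sketch: append sglang-specific vars only when the sglang backend is selected.
# ROLLOUT_BACKEND is hypothetical; adapt to the script's real backend switch.
SGLANG_ENV=""
if [ "${ROLLOUT_BACKEND:-}" = "sglang" ]; then
  SGLANG_ENV=',
    "SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN": "1",
    "SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION": "False"'
fi
ENV_VARS_JSON="{
    \"XTUNER_USE_LMDEPLOY\": \"${XTUNER_USE_LMDEPLOY:-0}\"${SGLANG_ENV}
}"
echo "$ENV_VARS_JSON"
```

Because `SGLANG_ENV` carries its own leading comma, the result stays valid JSON whether or not the sglang block is included.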
Claude: ## Summary

Good cleanup. Unifying the business env vars through the Ray job `runtime_env` is a sound direction.

### Issues

- Warning: the NPU actor `max_concurrency` drop (8192 → 2048) flagged above.
- Nit: sglang-specific vars are now injected unconditionally for all backends.

### Verdict

COMMENT — looks good overall, just want confirmation on the NPU concurrency behavior change.
Background

`run_rl_submit.sh` handles both Ray cluster startup and Ray Jobs submission. In the original script, some business env vars were set via shell `export` while others were injected via `--runtime-env-json`, leaving their sources inconsistent. In Ray Jobs mode, the more robust approach is: use Ray system startup variables only for `ray start`, and inject the business variables needed by the training entrypoint and by Ray tasks/actors uniformly through the job-level `runtime_env`.

Changes
- Business env vars are now consolidated in `RUNTIME_ENV_JSON`, so the job no longer relies on the raylet inheriting the shell environment at startup. Injected variables include `WORLD_SIZE`, `MASTER_PORT`, `RAY_MASTER_ADDR`, `ACCELERATOR`, `RAY_MAX_CONCURRENCY`, `LMDEPLOY_LOG_FILE`, `XTUNER_RL_MEM_DIR`, and `SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION`.
- Fixed an issue where `RAY_MAX_CONCURRENCY` was not injected correctly.
- Rewrote `RUNTIME_ENV_JSON` as a heredoc to keep the variable-injection logic straightforward and easy to maintain.
- Defaults can still be overridden via variables such as `WORK_DIR`, `XTUNER_LOG_LEVEL`, and `RAY_MAX_CONCURRENCY`.

Verification
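As a local sanity check, the heredoc form described here can be sketched as follows. The variable names come from this PR; the exact JSON layout of the real script is an assumption, and the `env_vars` key is the standard Ray `runtime_env` field:

```shell
# Sketch of building RUNTIME_ENV_JSON via a heredoc (layout assumed, names from this PR).
WORLD_SIZE="${WORLD_SIZE:-8}"
MASTER_PORT="${MASTER_PORT:-6379}"
RAY_MAX_CONCURRENCY="${RAY_MAX_CONCURRENCY:-2048}"
RUNTIME_ENV_JSON=$(cat <<EOF
{
  "env_vars": {
    "WORLD_SIZE": "${WORLD_SIZE}",
    "MASTER_PORT": "${MASTER_PORT}",
    "RAY_MAX_CONCURRENCY": "${RAY_MAX_CONCURRENCY}"
  }
}
EOF
)
# Quick check that the interpolated result actually parses as JSON:
echo "$RUNTIME_ENV_JSON" | python3 -c 'import json,sys; json.load(sys.stdin)' && echo "runtime env JSON OK"
```

The same string would then be passed as `ray job submit --runtime-env-json "$RUNTIME_ENV_JSON" ...`.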
Notes
With the current setup, if a user sets environment variables via `export xxxx=yyy` before launching `examples/v1/scripts/run_rl_submit.sh`, instead of injecting them through `--runtime-env-json`, they still take effect — but this is not recommended. It only works because the `ray job submit` process happens to start on the head node, and `rl.py` also runs on the head node.
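The reason plain `export` happens to work is ordinary process inheritance on a single node. A minimal stand-in demo, where a subshell plays the role of a same-node child process (this is an illustration, not the submit script itself):

```shell
# Demo only: a child process on the SAME node inherits exported shell vars.
# This is why `export xxxx=yyy` currently works, and why it would stop working
# the moment the consuming process runs on another node (runtime_env still would).
export XTUNER_LOG_LEVEL=DEBUG          # not recommended; prefer runtime_env injection
CHILD_VALUE=$(sh -c 'printf %s "$XTUNER_LOG_LEVEL"')
echo "child sees: ${CHILD_VALUE}"
```

A Ray worker on a different node is not a child of the submitting shell, so it sees none of these exports; only the job-level `runtime_env` reaches it.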