[Feature] Support weight update from disk #1939
Open
PengchengShi00 wants to merge 1 commit into
Design: Updating Rollout Weights from Disk
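
I. Goal

During RL training, when a rollout server exits abnormally and is restarted, support restoring its model parameters from a HuggingFace-format checkpoint on disk, so that recovery does not have to rely on the training side re-synchronizing the weights through the full IPC/NCCL tensor stream.

This capability is a third update mode alongside the existing IPC and NCCL weight updates: `disk`. It primarily serves server recovery, and it can also act as a fallback for the weight synchronization path later on.

II. Prerequisites
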
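- On restart, a rollout server may come up without loading real weights.
  - LMDeploy: start with `empty_init=True`.
  - SGLang: start with `load_format="dummy"`.
- The upstream caller must pass the path of the HF checkpoint to load.
- When LMDeploy uses `disk` mode, it also needs to know whether the underlying path reuses IPC or NCCL.
  - `update_weight` is currently triggered by the train controller.
- When SGLang uses `disk` mode, there is no need to distinguish between IPC and NCCL.
  - It uses `update_weights_from_disk`.
- In the recovery scenario, it must be explicit which rollout servers need to reload weights this time, so that servers still working normally are not affected.
  - When calling `update_rollout_info()`, have `server_url_dict` contain only the URLs of the restarted servers (to be discussed).

III. Design principles
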
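- `disk` is a top-level weight update mode; it is not equivalent to IPC or NCCL.
- IPC and NCCL are only the underlying transport paths that LMDeploy may reuse in `disk` mode.

IV. Design
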
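1. Configuration and metadata

Add disk-related fields to `RolloutWeightUpdateInfo`:
- `transport_type="disk"`: selects the top-level mode that updates weights from disk.
- `hf_weight_path`: path of the HF checkpoint to load, taken from `rollout_config` or its `extra_rollout_config`.
- `disk_update_upstream_transport`: LMDeploy only; indicates whether the disk update ultimately reuses the `ipc` or the `nccl` path.

2. UpdateWeighter
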
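`update_rollout_info` no longer accepts `train_rollout_mode`; it takes `weight_update_mode` directly:
- `ipc`: the existing colocated (shared-GPU) IPC update.
- `nccl`: the existing NCCL update for disaggregated training and inference.
- `disk`: new; restores weights from an HF checkpoint path.

`weight_update_mode` determines which `WeightTransport` is constructed.

3. DiskWeightTransport
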
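As context for this subsection, here is a minimal sketch of how the disk-related fields and `weight_update_mode` from the two previous subsections could fit together. It is not the PR's implementation: the `backend` field, the validation rules, and the `select_transport_name` helper are assumptions made for illustration, and the real `RolloutWeightUpdateInfo` presumably carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import Literal, Optional

WeightUpdateMode = Literal["ipc", "nccl", "disk"]


@dataclass
class RolloutWeightUpdateInfo:
    """Hypothetical, trimmed-down version with only the fields named in this design."""

    transport_type: WeightUpdateMode
    backend: str  # assumption: "sglang" or "lmdeploy"; not a field named in the design
    hf_weight_path: Optional[str] = None
    disk_update_upstream_transport: Optional[Literal["ipc", "nccl"]] = None

    def __post_init__(self) -> None:
        # Only disk mode needs the extra fields.
        if self.transport_type != "disk":
            return
        if not self.hf_weight_path:
            raise ValueError("disk mode requires hf_weight_path")
        # Assumed rule: LMDeploy must say which existing path it reuses.
        if self.backend == "lmdeploy" and self.disk_update_upstream_transport not in ("ipc", "nccl"):
            raise ValueError("LMDeploy disk mode requires disk_update_upstream_transport")


def select_transport_name(info: RolloutWeightUpdateInfo) -> str:
    """Return the name of the transport class that the update mode would select."""
    names = {
        "ipc": "IPCWeightTransport",
        "nccl": "NCCLWeightTransport",
        "disk": "DiskWeightTransport",
    }
    try:
        return names[info.transport_type]
    except KeyError:
        raise ValueError(f"unknown weight update mode: {info.transport_type!r}") from None
```

Validating at construction time means a recovery request with a missing checkpoint path fails before any rollout server is contacted, which seems preferable to failing halfway through an update.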
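`DiskWeightTransport` delegates to a different disk adapter per backend:
- SGLang: does not consume the tensor batches produced by `WeightIterator`; it uses `hf_weight_path` directly.
- LMDeploy: consumes the tensor batches produced by `WeightIterator.iter_disk_hf_batches()` and reuses the existing IPC/NCCL transport.

SGLang adapter: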
- Calls the `update_weights_from_disk` endpoint.
- Passes `model_path=hf_weight_path`.
- Sends the request to the target rollout servers in `rollout_server_url_dict`.

LMDeploy adapter:
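- Uses `disk_update_upstream_transport` to decide whether to reuse IPC or NCCL.
- `WeightIterator.iter_batch_groups()` switches to `iter_disk_hf_batches()` when `transport_type="disk"` and the backend is not SGLang.
- `iter_disk_hf_batches()` can read either an HF `model.safetensors.index.json` or a single-file `model.safetensors`, and yields `WeightUpdateBatch` objects sized by `update_weight_bucket_size_in_gb`.
- Reuses `IPCWeightTransport` or `NCCLWeightTransport` and keeps using LMDeploy's existing `update_weights` / `update_weights_from_distributed` interfaces.
- Ends with a `finished=True` batch, which triggers finalize on the rollout side.

V. Open questions
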
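- SGLang: only an `update_weights_from_disk` request is sent, and the SGLang server loads the weights from the HF checkpoint path on its own. This path does not consume GPU memory or transfer bandwidth on unrelated train ranks.
- LMDeploy: only the weight source of `WeightIterator` changes, from exporting the training-side model parameters to reading the HF checkpoint; the subsequent IPC/NCCL send, server-side receive, and finalize steps stay the same as in the existing weight update path.
- `disk + NCCL` reuses the existing NCCL weight update path, so potential conflicts with the communication group used by regular NCCL weight updates need attention, for example group name reuse, leftover communication groups on the rollout side, and recovery being triggered concurrently with a regular update.

Possible optimization on the LMDeploy side:
- Give `disk + NCCL` a dedicated communication group name or an explicit teardown policy, so that it does not share uncleaned communication state with the regular NCCL update path.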