[Do Not Merge] Save model pdparams for dynamic-loading tests #7789

Open
DDDivano wants to merge 1 commit into develop from save_model_for_rl

Conversation

@DDDivano (Collaborator) commented May 12, 2026

Motivation

To develop tests for the RL dynamic loading feature, the pdparams weight files for each TP rank need to be saved ahead of time, so that the loading flow of DynamicWeightManager can be verified for correctness.

Modifications

  • fastdeploy/worker/gpu_model_runner.py: add save logic at the end of the load_model() method
    • When the environment variable FD_SAVE_PDPARAMS=1 is set, weights are saved automatically after the model finishes loading
    • FD_SAVE_DIR specifies the output directory (defaults to the model directory)
    • Rank 0 copies config / tokenizer and other configuration files
    • Every rank saves its own TP shard, named model_state.tp{rank}.{gpu_id}.pdparams
    • If a proposer (speculative decoding) exists, its weights are merged into the saved state dict

Usage or Command

FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy ...

Accuracy Tests

N/A (this PR is a testing utility and does not affect model inference accuracy)

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Added functionality to save model parameters and configuration files when FD_SAVE_PDPARAMS is set. Includes handling for saving directory and rank-based file copying.
Copilot AI review requested due to automatic review settings May 12, 2026 07:12
@paddle-bot
paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds an environment-variable-gated "export weights/config" capability to GPUModelRunner.load_model(): when FD_SAVE_PDPARAMS=1 is set, model parameters (plus selected config/tokenizer files) are saved to a target directory, for dynamic-loading tests and debugging.

Changes:

  • New weight/config export logic gated by FD_SAVE_PDPARAMS + FD_SAVE_DIR (shard files written per rank)
  • Rank 0 additionally copies selected config and tokenizer files from the model directory to the export directory
  • Proposer/MTP model parameters are merged into the saved weights when present

Additional notes (manual follow-up required):

  • The PR title lacks the repository's required tag (e.g. [Feature]...), and the "[Do Not Merge]" prefix does not conform to merge conventions.
  • The Motivation/Modifications/Usage sections of the PR description template are incomplete; the new environment variables should also be documented (e.g. docs/usage/environment_variables.md) to avoid a "hidden switch".

Comment on lines +1434 to +1436

```python
visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
meta_src_id = int(visible_devices[int(os.getenv("FLAGS_selected_gpus", "0"))])
rank = paddle.distributed.get_rank()
```

Comment on lines +1429 to +1473

```python
# Usage: FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy...
if os.getenv("FD_SAVE_PDPARAMS", "0") == "1":
    import shutil
    import glob as glob_mod

    visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
    meta_src_id = int(visible_devices[int(os.getenv("FLAGS_selected_gpus", "0"))])
    rank = paddle.distributed.get_rank()

    # Determine save directory: FD_SAVE_DIR or default to model directory
    save_dir = os.getenv("FD_SAVE_DIR", self.fd_config.model_config.model)
    os.makedirs(save_dir, exist_ok=True)

    # Copy config and tokenizer files (only rank 0 to avoid race)
    if rank == 0:
        src_dir = self.fd_config.model_config.model
        copy_patterns = [
            "config.json", "generation_config.json",
            "tokenizer*", "added_tokens.json",
            "special_tokens_map.json", "chat_template*",
        ]
        for pattern in copy_patterns:
            for f in glob_mod.glob(os.path.join(src_dir, pattern)):
                dst = os.path.join(save_dir, os.path.basename(f))
                if not os.path.exists(dst):
                    shutil.copy2(f, dst)

    # Save model weights (main model + proposer/MTP model if exists)
    model_state_dict = self.model.state_dict()
    if hasattr(self, 'proposer') and self.proposer is not None and hasattr(self.proposer, 'model'):
        proposer_state_dict = self.proposer.model.state_dict()
        model_state_dict.update(proposer_state_dict)
        logger.info(f"Including proposer model weights ({len(proposer_state_dict)} params)")

    clean_state_dict = {
        k: paddle.to_tensor(v.contiguous().numpy())
        for k, v in model_state_dict.items()
    }
    model_path = os.path.join(
        save_dir,
        f"model_state.tp{rank}.{meta_src_id}.pdparams",
    )
    paddle.save(clean_state_dict, model_path, safetensors=True)
    del clean_state_dict
    logger.info(f"Saved model state dict to {model_path}")
```
@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 15:35:34

📋 Review summary

PR overview: adds support for the FD_SAVE_PDPARAMS environment variable, saving model weights as pdparams files after load_model() completes, for testing the RL dynamic-loading feature.
Scope of change: fastdeploy/worker/gpu_model_runner.py
Impact tags: [RL] [Engine]

📝 PR convention check

The title uses Chinese full-width brackets and has no official tag; every section of the description is empty or placeholder-only and must be filled in.

Suggested title (copy directly):

  • [RL] Add FD_SAVE_PDPARAMS environment variable to save model pdparams for RL dynamic-loading tests (Do Not Merge)

Suggested PR description (copy directly; it must reproduce the full structure of the checklist §D2 template):

## Motivation
To develop tests for the RL dynamic loading feature, the pdparams weight files for each TP rank need to be saved ahead of time, so that the loading flow of `DynamicWeightManager` can be verified for correctness.

## Modifications
- `fastdeploy/worker/gpu_model_runner.py`: add save logic at the end of the `load_model()` method
  - When the environment variable `FD_SAVE_PDPARAMS=1` is set, weights are saved automatically after the model finishes loading
  - `FD_SAVE_DIR` specifies the output directory (defaults to the model directory)
  - Rank 0 copies config / tokenizer and other configuration files
  - Every rank saves its own TP shard, named `model_state.tp{rank}.{gpu_id}.pdparams`
  - If a proposer (speculative decoding) exists, its weights are merged into the saved state dict

## Usage or Command
```bash
FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy ...
```

## Accuracy Tests
N/A (this PR is a testing utility and does not affect model inference accuracy)

## Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/worker/gpu_model_runner.py:1435 | visible_devices index out of range: an IndexError is raised when FLAGS_selected_gpus ≥ the length of visible_devices |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py:1464 | .contiguous().numpy() + paddle.to_tensor() makes two full copies; large models can easily OOM |

Overall assessment

This is a temporary debugging PR (Do Not Merge) with clear overall logic. However, the visible_devices out-of-range P0 bug must be fixed before merging, and the tensor conversion should be optimized to save memory.

```python
import glob as glob_mod

visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
meta_src_id = int(visible_devices[int(os.getenv("FLAGS_selected_gpus", "0"))])
```

🔴 Bug: visible_devices can be indexed out of range; an IndexError is raised when FLAGS_selected_gpus ≥ the length of visible_devices.

For example, CUDA_VISIBLE_DEVICES=0 with FLAGS_selected_gpus=1 triggers it; likewise, an empty CUDA_VISIBLE_DEVICES makes split(',') produce [''], and int('') raises as well.

Suggested fix:

```python
visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
local_rank = int(os.getenv("FLAGS_selected_gpus", "0"))
if local_rank < len(visible_devices) and visible_devices[local_rank].strip():
    meta_src_id = int(visible_devices[local_rank].strip())
else:
    meta_src_id = local_rank  # fallback
```
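The bounds-checked lookup suggested above can be factored into a small pure function and exercised without a GPU. This is a sketch: `resolve_meta_src_id` is a hypothetical name, and the environment values are passed in as plain strings so the edge cases are easy to test.

```python
def resolve_meta_src_id(cuda_visible_devices: str, selected_gpus: str) -> int:
    """Map the process-local GPU index (FLAGS_selected_gpus) to a physical
    device id from CUDA_VISIBLE_DEVICES, falling back to the local rank
    when the list is too short or the entry is empty."""
    visible_devices = cuda_visible_devices.split(",")
    local_rank = int(selected_gpus or "0")
    if local_rank < len(visible_devices) and visible_devices[local_rank].strip():
        return int(visible_devices[local_rank].strip())
    return local_rank  # fallback: out-of-range index or empty entry
```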

```python
logger.info(f"Including proposer model weights ({len(proposer_state_dict)} params)")

clean_state_dict = {
    k: paddle.to_tensor(v.contiguous().numpy())
```

🟡 Suggestion: .contiguous().numpy() + paddle.to_tensor() performs two full GPU→CPU→GPU memory copies, which can easily trigger OOM for large models (tens of GB of parameters) and doubles the time cost.

paddle.save can save GPU tensors directly, with no numpy round-trip. Suggested simplification:

```python
clean_state_dict = {k: v.contiguous() for k, v in model_state_dict.items()}
```

Or pass model_state_dict directly (even better if it is already contiguous).

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 15:42:03

CI report generated from the code below (updated every 30 minutes):


1 Task overview

There are 2 required failures, 3 required jobs running, and 1 required job waiting; handle the failures, then wait for the remaining jobs to finish.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 41 (0) | 41 | 31 | 3 | 5 | 2 | 0 |

2 Task status summary

2.1 Required tasks: 4/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Fix suggestion | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 8s | PR issue: new logger.info calls require designated RD approval | Request review from xyxinyang or zyyzghb | Job | - |
| ❌ | Pre Commit | 46s | PR issue: gpu_model_runner.py code style is non-compliant | Run pre-commit to fix formatting | Job | - |
| ⏳ | run_ce_cases | - | Running | - | Job | - |
| ⏳ | run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏸️ | run_4_cards_tests | - | Waiting | - | - | - |
| ✅ | The other 4 required tasks passed | - | - | - | - | - |

2.2 Optional tasks — 27/31 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Check PR Template | 15s | Job | - |
| ❌ | run_iluvatar_cases | - | Job | - |
| ❌ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 27 optional tasks passed | - | - | - |

3 Failure details (required only)

Approval — code approval check (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code conventions
  • Confidence: high
  • Root cause summary: the PR adds logger.info calls, which require FastDeploy RD approval
  • Analyzer: generic analysis (fallback)

Root cause details:
The PR adds 2 new logger.info(...) calls in gpu_model_runner.py, triggering FastDeploy's approval-check mechanism. The script scripts/check_approval.sh detected the logging-behavior change and requires approval from a designated RD (exit code 6 means "there are unresolved approval errors").

Key log:

Detected log modification in diff:
+                logger.info(f"Including proposer model weights ({len(proposer_state_dict)} params)")
+            logger.info(f"Saved model state dict to {model_path}")
0. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior.
There are 1 approved errors.

Fix suggestions:

  1. Please ask @xyxinyang (zhouchong) or @zyyzghb (zhangyongyue) to review and approve this PR

Fix summary: request approval from xyxinyang or zyyzghb for this PR

Related change: fastdeploy/worker/gpu_model_runner.py — new logger.info log output
Link: view log

Pre Commit — code formatting check (confidence: high)

Pre Commit

  • Status: ❌ Failed
  • Error type: code conventions
  • Confidence: high
  • Root cause summary: gpu_model_runner.py formatting is non-compliant; the black/isort hooks did not pass
  • Analyzer: generic analysis (fallback)

Root cause details:
fastdeploy/worker/gpu_model_runner.py was committed without pre-commit formatting; both the black (code formatting) and isort (import sorting) hooks detected issues and modified the file. Run pre-commit locally and commit again.

Key log:

black....................................................................Failed
- hook id: black
- files were modified by this hook
reformatted fastdeploy/worker/gpu_model_runner.py

isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing .../fastdeploy/worker/gpu_model_runner.py

Fix suggestions:

  1. Install and run pre-commit locally: pip install pre-commit==4.2.0 clang-format==13.0.0 && pre-commit install
  2. Check the file's formatting: pre-commit run --files fastdeploy/worker/gpu_model_runner.py
  3. git add the files fixed by pre-commit and commit again

Fix summary: run pre-commit locally to fix formatting, then resubmit

Related change: fastdeploy/worker/gpu_model_runner.py — import ordering and code formatting issues
Link: view log

3 participants