[Do Not Merge] Save model pdparams for testing dynamic loading #7789
Conversation
Added functionality to save model parameters and configuration files when FD_SAVE_PDPARAMS is set. Includes handling for saving directory and rank-based file copying.
Thanks for your contribution!
Pull request overview
This PR adds an environment-variable-triggered "export weights/config" capability to GPUModelRunner.load_model(): when FD_SAVE_PDPARAMS=1 is set, model parameters (plus selected config/tokenizer files) are saved to a specified directory, for testing and debugging the dynamic-loading feature.
Changes:
- Adds weight/config export logic controlled by FD_SAVE_PDPARAMS + FD_SAVE_DIR (shard files are written per rank)
- rank 0 additionally copies selected config and tokenizer files from the model directory to the export directory
- Attempts to include proposer/MTP model parameters in the saved weights

Additional notes (manual follow-up needed):
- The PR title does not carry a tag as the repository requires (e.g. [Feature] ...), and "【Do Not Merge】" in the title violates the merge conventions.
- The Motivation/Modifications/Usage sections of the PR description template are not filled in; the newly added environment variables should also be documented (e.g. in docs/usage/environment_variables.md) to avoid hidden switches.
```python
# Usage: FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy...
if os.getenv("FD_SAVE_PDPARAMS", "0") == "1":
    import shutil
    import glob as glob_mod

    visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
    meta_src_id = int(visible_devices[int(os.getenv("FLAGS_selected_gpus", "0"))])
    rank = paddle.distributed.get_rank()

    # Determine save directory: FD_SAVE_DIR or default to model directory
    save_dir = os.getenv("FD_SAVE_DIR", self.fd_config.model_config.model)
    os.makedirs(save_dir, exist_ok=True)

    # Copy config and tokenizer files (only rank 0 to avoid race)
    if rank == 0:
        src_dir = self.fd_config.model_config.model
        copy_patterns = [
            "config.json", "generation_config.json",
            "tokenizer*", "added_tokens.json",
            "special_tokens_map.json", "chat_template*",
        ]
        for pattern in copy_patterns:
            for f in glob_mod.glob(os.path.join(src_dir, pattern)):
                dst = os.path.join(save_dir, os.path.basename(f))
                if not os.path.exists(dst):
                    shutil.copy2(f, dst)

    # Save model weights (main model + proposer/MTP model if exists)
    model_state_dict = self.model.state_dict()
    if hasattr(self, 'proposer') and self.proposer is not None and hasattr(self.proposer, 'model'):
        proposer_state_dict = self.proposer.model.state_dict()
        model_state_dict.update(proposer_state_dict)
        logger.info(f"Including proposer model weights ({len(proposer_state_dict)} params)")

    clean_state_dict = {
        k: paddle.to_tensor(v.contiguous().numpy())
        for k, v in model_state_dict.items()
    }
    model_path = os.path.join(
        save_dir,
        f"model_state.tp{rank}.{meta_src_id}.pdparams",
    )
    paddle.save(clean_state_dict, model_path, safetensors=True)
    del clean_state_dict
    logger.info(f"Saved model state dict to {model_path}")
```
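The rank-0 file-copy step can be exercised in isolation. The sketch below (a standalone helper with the hypothetical name `copy_config_files`, using the same glob patterns as the diff) copies matching config/tokenizer files and skips files already present in the destination:

```python
import glob
import os
import shutil
import tempfile

def copy_config_files(src_dir: str, save_dir: str) -> list:
    """Copy config/tokenizer files matching the PR's patterns; skip existing ones."""
    copy_patterns = [
        "config.json", "generation_config.json",
        "tokenizer*", "added_tokens.json",
        "special_tokens_map.json", "chat_template*",
    ]
    copied = []
    for pattern in copy_patterns:
        for f in glob.glob(os.path.join(src_dir, pattern)):
            dst = os.path.join(save_dir, os.path.basename(f))
            if not os.path.exists(dst):
                shutil.copy2(f, dst)
                copied.append(os.path.basename(f))
    return copied

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        for name in ("config.json", "tokenizer_config.json", "model.bin"):
            with open(os.path.join(src, name), "w") as fh:
                fh.write("{}")
        # model.bin matches no pattern, so only the two config files are copied
        print(sorted(copy_config_files(src, dst)))  # ['config.json', 'tokenizer_config.json']
```

Because existing destination files are skipped, running the copy twice is a no-op the second time, which is what makes the rank-0 guard safe to retry.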
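Each rank writes its own shard using the `model_state.tp{rank}.{meta_src_id}.pdparams` naming scheme. A small pure-Python sketch (hypothetical helper names; `rank` and `meta_src_id` are passed in rather than read from paddle or the environment) shows the round trip between a shard path and its identifiers, which a loader-side test could reuse:

```python
import os
import re

def shard_path(save_dir: str, rank: int, meta_src_id: int) -> str:
    """Build the per-rank shard filename used by the save logic."""
    return os.path.join(save_dir, f"model_state.tp{rank}.{meta_src_id}.pdparams")

def parse_shard_name(filename: str):
    """Recover (rank, gpu_id) from a shard filename; None if it does not match."""
    m = re.fullmatch(r"model_state\.tp(\d+)\.(\d+)\.pdparams", os.path.basename(filename))
    return (int(m.group(1)), int(m.group(2))) if m else None

print(shard_path("/tmp/out", 0, 3))                    # /tmp/out/model_state.tp0.3.pdparams
print(parse_shard_name("model_state.tp1.2.pdparams"))  # (1, 2)
```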
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 15:35:34
📋 Review summary
PR overview: adds support for the FD_SAVE_PDPARAMS environment variable; after load_model() completes, model weights are saved as pdparams files, for testing the RL dynamic-loading feature.
Changed files: fastdeploy/worker/gpu_model_runner.py
Impact tags: [RL] [Engine]
📝 PR convention check
The title uses full-width Chinese brackets and carries no official tag; every section of the description is empty or placeholder-only and must be filled in.
Suggested title (copy-paste ready):
[RL] Add FD_SAVE_PDPARAMS environment variable to save model pdparams for RL dynamic-loading tests (Do Not Merge)
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Developing and testing the RL dynamic-loading feature requires saving the pdparams weight file for each TP rank ahead of time, to verify that the `DynamicWeightManager` loading flow is correct.
## Modifications
- `fastdeploy/worker/gpu_model_runner.py`: adds save logic at the end of the `load_model()` method
  - When the environment variable `FD_SAVE_PDPARAMS=1` is set, weights are saved automatically after model loading completes
  - `FD_SAVE_DIR` selects the output directory (defaults to the model directory)
  - rank 0 copies config / tokenizer files
  - Every rank saves its own TP weight shard, named `model_state.tp{rank}.{gpu_id}.pdparams`
  - If a proposer (speculative decoding) exists, its weights are merged in and saved as well
## Usage or Command
```bash
FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy ...
```
## Accuracy Tests
N/A (this PR is a test utility and does not affect model inference accuracy)
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/worker/gpu_model_runner.py:1435 | visible_devices can be indexed out of range: an IndexError is raised when FLAGS_selected_gpus ≥ len(visible_devices) |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py:1464 | .contiguous().numpy() + paddle.to_tensor() makes two full copies and can easily OOM on large models |
Overall assessment
This is a temporary debugging PR (Do Not Merge) and the overall logic is clear, but the visible_devices out-of-range access is a P0 bug that must be fixed before merging; the tensor conversion should also be optimized to save memory.
```python
import glob as glob_mod

visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
meta_src_id = int(visible_devices[int(os.getenv("FLAGS_selected_gpus", "0"))])
```
🔴 Bug: visible_devices can be indexed out of range; an IndexError is raised when the value of FLAGS_selected_gpus is ≥ len(visible_devices).
For example, CUDA_VISIBLE_DEVICES=0 with FLAGS_selected_gpus=1 triggers it; likewise, when CUDA_VISIBLE_DEVICES is an empty string, split(',') yields [''] and int('') raises as well.
Suggested fix:

```python
visible_devices = os.getenv("CUDA_VISIBLE_DEVICES", "0").split(",")
local_rank = int(os.getenv("FLAGS_selected_gpus", "0"))
if local_rank < len(visible_devices) and visible_devices[local_rank].strip():
    meta_src_id = int(visible_devices[local_rank].strip())
else:
    meta_src_id = local_rank  # fallback
```
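The suggested fix can be lifted into a standalone helper (hypothetical name `resolve_meta_src_id`, taking the two environment values as arguments) so the edge cases from the comment are easy to unit-test without touching the environment:

```python
def resolve_meta_src_id(cuda_visible_devices: str, flags_selected_gpus: str) -> int:
    """Map the process-local GPU index to a physical device id.

    Falls back to the local rank itself when CUDA_VISIBLE_DEVICES is empty
    or shorter than the local index (the cases that raised IndexError /
    ValueError in the original code).
    """
    visible_devices = cuda_visible_devices.split(",")
    local_rank = int(flags_selected_gpus or "0")
    if local_rank < len(visible_devices) and visible_devices[local_rank].strip():
        return int(visible_devices[local_rank].strip())
    return local_rank  # fallback

# The two failure cases from the review comment, plus the normal path:
print(resolve_meta_src_id("0", "1"))    # 1  (index out of range -> fallback)
print(resolve_meta_src_id("", "0"))     # 0  (empty string -> fallback)
print(resolve_meta_src_id("2,3", "1"))  # 3  (normal mapping)
```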
```python
logger.info(f"Including proposer model weights ({len(proposer_state_dict)} params)")

clean_state_dict = {
    k: paddle.to_tensor(v.contiguous().numpy())
```

🟡 Suggestion: .contiguous().numpy() followed by paddle.to_tensor() performs two full GPU→CPU→GPU copies; for a large model (tens of GB of parameters) this can easily OOM and doubles the time spent.
paddle.save can save GPU tensors directly, with no numpy round-trip. Suggested simplification:

```python
clean_state_dict = {k: v.contiguous() for k, v in model_state_dict.items()}
```

Or pass model_state_dict in directly (even better if it is already contiguous).
CI report generated from the code below (updated every 30 minutes):

1 Task overview
2 required checks failed, 3 required checks are running, and 1 required check is pending; fix the failures, then wait for the remaining tasks to finish.

2 Task status summary
2.1 Required tasks: 4/10 passed
2.2 Optional tasks: 27/31 passed

3 Failure details (required only)
Approval — code-approval check (confidence: high)
Fix summary: request approval for this PR from xyxinyang or zyyzghb
Pre Commit — code-format check (confidence: high)
Fix summary: run pre-commit locally to fix code formatting, then push again
Motivation
Developing and testing the RL dynamic-loading feature requires saving the pdparams weight file for each TP rank ahead of time, to verify that the `DynamicWeightManager` loading flow is correct.

Modifications
- `fastdeploy/worker/gpu_model_runner.py`: adds save logic at the end of the `load_model()` method
- When `FD_SAVE_PDPARAMS=1` is set, weights are saved automatically after model loading completes
- `FD_SAVE_DIR` selects the output directory (defaults to the model directory)
- Each TP shard is named `model_state.tp{rank}.{gpu_id}.pdparams`

Usage or Command

```bash
FD_SAVE_PDPARAMS=1 FD_SAVE_DIR=/path/to/output python -m fastdeploy ...
```

Accuracy Tests
N/A (this PR is a test utility and does not affect model inference accuracy)

Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.