[RL] Reuse GDR checkpoint transfer handle #8078

Open
jackyYang6 wants to merge 2 commits into PaddlePaddle:develop from jackyYang6:jacky/optimize-checkpoint-transfer-handle-init-develop

jackyYang6 (Contributor) commented Jun 25, 2026

Motivation

Avoid repeated CheckpointTransfer initialization during GDR dynamic weight updates. Reusing the initialized handle reduces repeated setup overhead across multiple update steps.

Modifications

  • Cache the GDR CheckpointTransfer handle in DynamicWeightManager.
  • Lazily initialize the handle on the first GDR weight update.
  • Reuse the cached handle for later update_weights_by_gdr calls.
  • Destroy and reset the cached handle when an update fails.
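
The lifecycle in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeCheckpointTransfer` and its `transfer()` method are hypothetical stand-ins for the real transfer handle, and the class is deliberately named `DynamicWeightManagerSketch`. Only the idea of a lazily created, cached handle that is destroyed and reset on failure comes from the description.

```python
class FakeCheckpointTransfer:
    """Hypothetical stand-in for the real CheckpointTransfer handle."""

    instances_created = 0  # counts how often the expensive setup would run

    def __init__(self):
        FakeCheckpointTransfer.instances_created += 1
        self.closed = False

    def transfer(self, fail=False):
        # Assumed method name; simulates one weight transfer.
        if fail:
            raise RuntimeError("simulated transfer failure")
        return "ok"

    def cleanup(self):
        self.closed = True


class DynamicWeightManagerSketch:
    """Illustrative only; not the FastDeploy implementation."""

    def __init__(self):
        self._gdr_ct_handle = None  # created lazily on the first GDR update

    def _ensure_gdr_handle(self):
        # Lazy initialization: build the handle once, then reuse it.
        if self._gdr_ct_handle is None:
            self._gdr_ct_handle = FakeCheckpointTransfer()
        return self._gdr_ct_handle

    def _destroy_gdr_handle(self):
        # Reset the cache first so a failing cleanup cannot leave a stale handle.
        handle, self._gdr_ct_handle = self._gdr_ct_handle, None
        if handle is not None:
            handle.cleanup()

    def update_weights_by_gdr(self, fail=False):
        handle = self._ensure_gdr_handle()
        try:
            return handle.transfer(fail=fail)
        except Exception:
            # Drop the possibly broken handle so the next call re-initializes.
            self._destroy_gdr_handle()
            raise
```

In this sketch, two successful updates share one handle, and an update that raises leaves `_gdr_ct_handle` as `None`, so the following update creates a fresh handle.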

Usage or Command

No new user-facing command. Existing GDR weight update flow is unchanged.

Accuracy Tests

Not applicable. This PR only changes checkpoint-transfer handle initialization behavior and does not affect model outputs.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No unit tests added because this is a handle lifecycle optimization for GDR runtime behavior.
  • Provide accuracy results. Not applicable; no model output changes.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

This comment was marked as outdated.

jackyYang6 force-pushed the jacky/optimize-checkpoint-transfer-handle-init-develop branch from ee3f166 to b69ad2a on June 25, 2026 11:47
PaddlePaddle-bot

This comment was marked as outdated.

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6d9a8f4). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/rl/dynamic_weight_manager.py | 86.95% | 2 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8078   +/-   ##
==========================================
  Coverage           ?   67.52%           
==========================================
  Files              ?      475           
  Lines              ?    66907           
  Branches           ?    10317           
==========================================
  Hits               ?    45182           
  Misses             ?    18857           
  Partials           ?     2868           

| Flag | Coverage Δ |
|---|---|
| GPU | 77.55% <86.95%> (?) |
| XPU | 6.95% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.


PaddlePaddle-bot left a comment

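🤖 Paddle-CI-Agent | pr_review | 2026-06-26 11:04:01

📋 Review summary

PR overview: Cache the GDR CheckpointTransfer handle to avoid re-initializing the transfer handle on every dynamic weight update.
Change scope: fastdeploy/rl/dynamic_weight_manager.py, tests/rl/test_dynamic_weight_gdr.py
Impact tag: [RL]

Issues

No new blocking issues found. PR convention issues are reported in the section below and are not repeated here.

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | _destroy_gdr_handle() swallows cleanup() exceptions without any logging. | ⚠️ Still present |
| F2 | The cached GDR CheckpointTransfer is not released on the sleep/clear weights path. | ⚠️ Still present |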
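
One possible way to address F1 and F2 is sketched below. It is an assumption-laden illustration rather than a proposed patch: `HandleOwnerSketch`, `clear_weights`, and the logger name are hypothetical, and the real sleep/clear entry points in the runner may be named and structured differently. The sketch keeps teardown best-effort but logs the swallowed exception, and releases the cached handle when weights are cleared.

```python
import logging

logger = logging.getLogger("gdr_handle_sketch")  # hypothetical logger name


class HandleOwnerSketch:
    """Illustrative only; mirrors the cached-handle pattern from this PR."""

    def __init__(self):
        self._gdr_ct_handle = None

    def _destroy_gdr_handle(self):
        # Reset the cache first so the handle is never reused after teardown.
        handle, self._gdr_ct_handle = self._gdr_ct_handle, None
        if handle is None:
            return
        try:
            handle.cleanup()
        except Exception:
            # F1: stay best-effort, but record the failure with its traceback.
            logger.warning("GDR handle cleanup failed", exc_info=True)

    def clear_weights(self):
        # F2: hypothetical clear path that also releases the cached handle.
        self._destroy_gdr_handle()
```

Logging at warning level with `exc_info=True` keeps the traceback available without turning a cleanup failure into a failure of the clear path itself.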

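📝 PR convention check

Compliant. The title uses the official [RL] tag, and the PR description includes the Motivation, Modifications, Usage or Command, Accuracy Tests, and Checklist sections required by checklist §D2.

Overall assessment

This round traced, in risk-priority order, GDR handle creation, reuse, and exception cleanup, the runner update/clear/sleep call chain, and the newly added unit tests. Apart from the unresolved previous findings, no new issues requiring inline comments were found.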

PaddlePaddle-bot commented Jun 26, 2026

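🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-26 13:14:59 UTC+08:00

This CI report is generated from the following code (refreshed every 30 minutes):
PR commit: 9238c11 | Merge base: 6d9a8f4 (branch: develop)


1 Required jobs: 8/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 39 | 3 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Approval | Requires approval | High | Job |
| xpu_8cards_case_test / run_xpu_8cards_cases | Flaky | Medium | Job |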

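2 Failure details

🔴 Approval: requires approval (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

  • Root cause summary: manual approval required
  • Suggested fix: complete the manual approval
  • Related changes: no code analysis needed
🔴 xpu_8cards_case_test / run_xpu_8cards_cases: flaky (confidence: medium)

Analyzer: generic analysis (fallback)
Failed cases:

| Case | Error summary |
|---|---|
| tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation | After the PD disaggregation service passed its health check, the OpenAI interface returned empty text and the keyword assertion failed |

Key log (verbatim):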

PD分离服务启动成功!耗时 10 秒
模型回复:
PD分离测试失败: 响应内容不符合预期:
assert False
FAILED tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py::test_pd_separation
=================== 1 failed, 3 passed in 404.94s (0:06:44) ====================
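  • Root cause summary: empty reply to the first request in XPU PD disaggregation
    The failure occurs at the response keyword assertion in tests/xpu_ci/8cards_cases/test_pd_21b_ep4tp1.py:306. The log shows that the P/D nodes passed the health check, but nothing follows the 模型回复: ("model reply:") line printed by this case. The three later PD disaggregation cases in the same job all returned normal Chinese replies and passed, so this looks more like an occasional empty reply or service-readiness jitter in the XPU PD/RDMA scenario.

Suggested fix:

  1. Rerun the job first. If the failure reproduces, investigate why the first XPU PD disaggregation request returns an empty completion right after the health check passes, or add a stricter first-request readiness check or retry on the test side.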
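
A test-side retry of the kind suggested above could look roughly like this. It is a minimal sketch with hypothetical names (`first_nonempty_reply`, `send_request`); it is not taken from the failing test, and the real test would pass in whatever callable issues the completion request and returns its text.

```python
import time


def first_nonempty_reply(send_request, attempts=3, delay_s=1.0):
    """Call send_request() until it returns non-empty text or attempts run out.

    send_request is any zero-argument callable returning the reply text.
    Returns the last reply received, which may still be empty.
    """
    reply = ""
    for attempt in range(attempts):
        reply = send_request()
        if reply and reply.strip():
            return reply
        if attempt < attempts - 1:
            # Give the freshly started service a moment before retrying.
            time.sleep(delay_s)
    return reply
```

The keyword assertion would then run on the returned text, so a persistent empty reply still fails the test instead of being masked.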

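Related changes: This PR only modifies the GDR CheckpointTransfer handle caching logic in fastdeploy/rl/dynamic_weight_manager.py (_gdr_ct_handle, _ensure_gdr_handle). The failing job is launched for XPU PD disaggregation with --cache-transfer-protocol rdma, and neither FD_USE_GDR_CHECKPOINT_TRANSFER nor CheckpointTransfer config appears in its log, so there is no sign that the code path changed by this PR was exercised.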

3 participants