
Conversation

@YanhuiDua YanhuiDua commented Jan 29, 2026

tests/ray runtime before optimization: ~25 min

tests/ray runtime after optimization: 12 min 56 s

@YanhuiDua YanhuiDua requested a review from CyCle1024 January 29, 2026 12:23
```diff
- responses = ray.get(self.test_flow.run.remote(), timeout=300)["data_groups"]
- finished_samples_count = sum(1 for data in responses for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
+ responses = ray.get(test_flow.run.remote(), timeout=300)["data_groups"]
+ finished_samples_count = sum(1 for data in responses[0] for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
```
Collaborator commented:
Why is [0] now needed on responses here? Was there a bug before, or did the test_flow.run interface change?


```python
async def run_both():
    return await asyncio.gather(
        self._run_rollout(dense_model_path, 4, 1, pg1),  # tp
```
Collaborator commented:

Suggestion: at the key call sites for the parallelism configuration, spell out the tp/ep kwargs explicitly for better readability; otherwise it is hard to tell what 4, 1 and 1, 4 mean.

```python
expert_parallel_size=ep_size,
context_length=self.context_length,
worker_log_dir=self.worker_log_dir,
dist_port_base=38000,
```
Collaborator commented:

Please watch the dist_port_base setting in RolloutConfig for every test that can run in parallel. By default the occupied port segment is [dist_port_base, dist_port_base + 1024 * pg_num_gpu], where pg_num_gpu is simply the number of GPUs in that placement group. Parallel tasks that occupy the same port segment may end up auto-assigned the same IP port.

The factor 1024 has not yet been integrated into RolloutConfig; it can be integrated if needed.

@CyCle1024 CyCle1024 commented Jan 30, 2026

@YanhuiDua test_update_weight.py and test_rl_update_weight.py could be written with a more streamlined flow.
Current:

Update weight rollout: init train -> init rollout with empty init -> update_weight -> rollout
Ref rollout: new init rollout -> rollout

Suggested:

Ref rollout: init rollout -> rollout
Update weight rollout: init train -> rollout sleep (drop weight and kvcache) -> rollout wakeup (empty weight) -> update_weight -> rollout

The advantage of the suggested scheme is that it performs one fewer initialization of the inference engine. The drawback is that it requires the inference engine to reliably drop its weights during sleep, so it depends on the sleep implementation (the current lmdeploy implementation satisfies this condition).
