
Conversation

@YanhuiDua YanhuiDua commented Jan 29, 2026

tests/ray runtime before optimization: ~25 min

tests/ray runtime after optimization: 12 min 56 s

@YanhuiDua YanhuiDua requested a review from CyCle1024 January 29, 2026 12:23
```diff
- responses = ray.get(self.test_flow.run.remote(), timeout=300)["data_groups"]
- finished_samples_count = sum(1 for data in responses for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
+ responses = ray.get(test_flow.run.remote(), timeout=300)["data_groups"]
+ finished_samples_count = sum(1 for data in responses[0] for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
```
Collaborator commented:
Why is [0] now needed on responses here? Was there a bug before, or did the test_flow.run interface change?


```python
async def run_both():
    return await asyncio.gather(
        self._run_rollout(dense_model_path, 4, 1, pg1),  # tp
```
Collaborator commented:

Suggestion: at the key call sites for the parallelism configuration, spell out the tp/ep kwargs explicitly for better readability; otherwise it is hard to tell what 4, 1 and 1, 4 mean.

```python
expert_parallel_size=ep_size,
context_length=self.context_length,
worker_log_dir=self.worker_log_dir,
dist_port_base=38000,
```
Collaborator commented:

Please watch the dist_port_base setting in RolloutConfig for every test that can run in parallel. By default the occupied port segment is [dist_port_base, dist_port_base + 1024 * pg_num_gpu], where pg_num_gpu is simply the number of GPUs in that placement group. Parallel tasks that occupy the same port segment may end up auto-assigned the same IP port.

The factor 1024 has not yet been integrated into RolloutConfig; it can be integrated if needed.

@CyCle1024 CyCle1024 commented Jan 30, 2026

@YanhuiDua test_update_weight.py and test_rl_update_weight.py could be written with a more streamlined flow.
Current:

Update weight rollout: init train -> init rollout with empty init -> update_weight -> rollout
Ref rollout: new init rollout -> rollout

Suggested:

Ref rollout: init rollout -> rollout
Update weight rollout: init train -> rollout sleep (drop weight and kvcache) -> rollout wakeup (empty weight) -> update_weight -> rollout

The advantage of the suggested scheme is that it performs one fewer initialization of the inference engine. The drawback is that it requires the inference engine to reliably drop its weights during sleep, so it depends on the sleep implementation (the current lmdeploy implementation satisfies this condition).
