Optimize rollout-related ut execution time #1463
base: main
Conversation
2690df1 to 772de6a
responses = ray.get(self.test_flow.run.remote(), timeout=300)["data_groups"]
finished_samples_count = sum(1 for data in responses for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
responses = ray.get(test_flow.run.remote(), timeout=300)["data_groups"]
finished_samples_count = sum(1 for data in responses[0] for item in data if item.env.rollout.finish_reason == "stop" or item.env.rollout.finish_reason == "length")
Why does responses need the [0] index here — was there a bug before, or has the test_flow.run interface changed?
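To pin down what the question is about, here is a sketch of the counting logic with the data shape it implies; the extra level of nesting is an inference from the added `[0]`, not something confirmed by the PR:

```python
# Minimal sketch of the finished-sample count from the quoted diff.
# Assumption: run() now returns {"data_groups": [...]} with one more list
# level than before, hence the added [0] at the call site.
def count_finished(data_groups):
    """Count items whose rollout stopped normally ("stop") or hit the length limit."""
    return sum(
        1
        for group in data_groups
        for item in group
        if item.env.rollout.finish_reason in ("stop", "length")
    )

# Old call site: count_finished(responses)
# New call site: count_finished(responses[0])
```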
async def run_both():
    return await asyncio.gather(
        self._run_rollout(dense_model_path, 4, 1, pg1),  # tp
Suggestion: at the key call sites for the parallel configuration, spell out the tp and ep kwargs explicitly; it reads much better, otherwise the positional "4, 1" and "1, 4" are hard to interpret.
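One possible reading of this suggestion is sketched below; the keyword names tensor_parallel_size / expert_parallel_size are assumptions based on the RolloutConfig fields quoted further down, and moe_model_path / pg2 are hypothetical counterparts to the dense case, not the PR's actual signature:

```python
# Hedged sketch: pass tp/ep as explicit kwargs so "4, 1" vs "1, 4" is unambiguous.
async def run_both():
    return await asyncio.gather(
        self._run_rollout(dense_model_path, tensor_parallel_size=4, expert_parallel_size=1, placement_group=pg1),
        self._run_rollout(moe_model_path, tensor_parallel_size=1, expert_parallel_size=4, placement_group=pg2),
    )
```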
expert_parallel_size=ep_size,
context_length=self.context_length,
worker_log_dir=self.worker_log_dir,
dist_port_base=38000,
Please watch the dist_port_base setting in RolloutConfig for every test that can run in parallel. By default the occupied port range is [dist_port_base, dist_port_base + 1024 * pg_num_gpu], where pg_num_gpu is simply the number of GPUs in that placement group. Parallel tasks that end up with overlapping port ranges can auto-acquire the same IP/port.
The factor 1024 has not been integrated into RolloutConfig yet; it can be integrated if needed.
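A small arithmetic sketch of the port-range rule described above; PORTS_PER_GPU = 1024 is the factor mentioned in the comment, and the GPU counts are illustrative:

```python
# Each task occupies [dist_port_base, dist_port_base + 1024 * pg_num_gpu),
# so parallel tests must pick bases whose ranges do not overlap.
PORTS_PER_GPU = 1024  # not yet a RolloutConfig field, per the comment above

def next_dist_port_base(prev_base: int, prev_pg_num_gpu: int) -> int:
    """Smallest base that starts after the previous task's port range."""
    return prev_base + PORTS_PER_GPU * prev_pg_num_gpu

task1_base = 38000                               # 4-GPU placement group
task2_base = next_dist_port_base(task1_base, 4)  # 38000 + 1024*4 = 42096
```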
@YanhuiDua test_update_weight.py and test_rl_update_weight.py allow a more optimized flow. Suggestion: the advantage of the suggested approach is that it skips one initialization of the inference engine; the drawback is that it requires the inference engine to reliably drop its weights while sleeping, i.e. it depends on the sleep implementation (the current lmdeploy implementation satisfies this condition).
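A rough sketch of the flow being suggested; every name below (init_rollout_engine, run_rollout_and_check, sleep, update_weights, wake_up, train_one_step) is an illustrative placeholder, not the repository's or lmdeploy's actual API:

```python
# Reuse one inference engine across both weight versions instead of
# initializing it twice. This only works if the engine really drops its
# weights while sleeping (true for the current lmdeploy implementation,
# per the comment above). All names here are hypothetical placeholders.
engine = init_rollout_engine(model_path)   # first (and only) engine init
run_rollout_and_check(engine)              # rollout with the original weights

engine.sleep()                             # engine must drop weights here
new_state_dict = train_one_step()          # produce updated weights
engine.update_weights(new_state_dict)      # load the updated weights
engine.wake_up()

run_rollout_and_check(engine)              # second rollout, no re-initialization
```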
Before optimization, tests/ray run time: ~25 min

After optimization, tests/ray run time: 12 min 56 s