feat: Add native support for PyTorch Profiler in ms-swift#9449
feat: Add native support for PyTorch Profiler in ms-swift#9449qq1243196045 wants to merge 18 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a performance profiling framework (DistProfiler) integrated with PyTorch's profiler, adding configuration arguments, trainer mixins, and callback hooks for standard and Megatron trainers. The feedback identifies several critical issues, including potential AttributeErrors in BaseArguments and Profiler, inconsistent default values for profiler_tool, and Python compatibility issues with the | operator in isinstance. Additionally, improvements are suggested for handling None values in config union/intersection operations, replacing print statements with standard logging, and using existing utility functions like get_dist_setting to retrieve distributed ranks.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a distributed performance profiling feature integrated with PyTorch's profiler, adding configuration arguments, callbacks for standard and Megatron trainers, and utility classes to manage profiling sessions. Key feedback highlights several critical issues: a missing logger import in profile.py that will cause a NameError, an invalid attribute check on BaseArguments that blocks initialization, a potential resource leak in discrete profiling mode if exceptions are raised, potential AttributeError issues due to class-level state tracking in Profiler, and an incorrect ordering of type assertions in TorchProfilerToolConfig.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a distributed performance profiling framework integrated with PyTorch's profiler, adding profiler arguments, trainer callbacks, and a DistProfiler utility with an annotate decorator. The code review identified several critical issues: an over-restrictive assertion in profile_args.py that crashes the application when standard callbacks are absent; signature and fallback limitations in the annotate decorator that prevent profiling standalone functions; a potential AttributeError when stopping an uninitialized profiler; and a missing try...finally block in discrete profiling mode that could leak resources if an exception occurs.
| if hasattr(self, 'callbacks'): | ||
| if self.enable_profiler and 'profiler' not in self.callbacks: | ||
| self.callbacks.append('profiler') | ||
| if 'profiler' in self.callbacks and not self.enable_profiler: | ||
| self.enable_profiler = True | ||
| if self.enable_profiler: | ||
| assert 'profiler' in self.callbacks, \ | ||
| 'Profiler callback must be included in callbacks when profiler is enabled.' | ||
| if 'profiler' in self.callbacks: | ||
| assert self.enable_profiler, \ | ||
| 'Profiler callback is included in callbacks but profiler is not enabled.' | ||
| else: | ||
| assert not self.enable_profiler, \ | ||
| 'Profiler cannot be enabled without callbacks attribute or with profiler callback missing in callbacks.' |
There was a problem hiding this comment.
The assertion assert not self.enable_profiler in the else block will raise an AssertionError and crash the application whenever enable_profiler is set to True on any arguments class that does not have a callbacks attribute (such as BaseArguments, DeployArguments, EvalArguments, or ExportArguments). This prevents using the profiler for inference, evaluation, or custom training loops (like RLHF rollout/actor phases) where standard trainer callbacks are not used.
We should remove this assertion to allow enabling the profiler without requiring a callbacks attribute.
| if hasattr(self, 'callbacks'): | |
| if self.enable_profiler and 'profiler' not in self.callbacks: | |
| self.callbacks.append('profiler') | |
| if 'profiler' in self.callbacks and not self.enable_profiler: | |
| self.enable_profiler = True | |
| if self.enable_profiler: | |
| assert 'profiler' in self.callbacks, \ | |
| 'Profiler callback must be included in callbacks when profiler is enabled.' | |
| if 'profiler' in self.callbacks: | |
| assert self.enable_profiler, \ | |
| 'Profiler callback is included in callbacks but profiler is not enabled.' | |
| else: | |
| assert not self.enable_profiler, \ | |
| 'Profiler cannot be enabled without callbacks attribute or with profiler callback missing in callbacks.' | |
| if hasattr(self, 'callbacks'): | |
| if self.enable_profiler and 'profiler' not in self.callbacks: | |
| self.callbacks.append('profiler') | |
| if 'profiler' in self.callbacks and not self.enable_profiler: | |
| self.enable_profiler = True | |
| if self.enable_profiler: | |
| assert 'profiler' in self.callbacks, \ | |
| 'Profiler callback must be included in callbacks when profiler is enabled.' | |
| if 'profiler' in self.callbacks: | |
| assert self.enable_profiler, \ | |
| 'Profiler callback is included in callbacks but profiler is not enabled.' |
There was a problem hiding this comment.
BaseArguments 不应该直接使用,一般都是被其他Arguments继承,例如SftArguments,是有callback 属性的,并且profiler功能依赖profiler callback ,因此我必须确保profiler callback和enable_profiler开启。而如果没有 profiler callback, 那也确实说明不该开启profiler
| def decorator(func): | ||
|
|
||
| @functools.wraps(func) | ||
| def wrapper(self_instance, *args, **kwargs_inner): | ||
| profiler = getattr(self_instance, 'profiler', None) | ||
|
|
||
| if (not profiler or not profiler.check_enable() or not profiler.check_this_step() | ||
| or not profiler.check_this_rank()): | ||
| return func(self_instance, *args, **kwargs_inner) | ||
|
|
||
| impl = profiler._impl | ||
| if hasattr(impl, 'annotate'): | ||
| try: | ||
| actual_decorator = impl.annotate( | ||
| message=message, color=color, domain=domain, category=category, **kwargs_outer) | ||
|
|
||
| return actual_decorator(func)(self_instance, *args, **kwargs_inner) | ||
| except Exception: | ||
| return func(self_instance, *args, **kwargs_inner) | ||
| return func(self_instance, *args, **kwargs_inner) | ||
|
|
||
| return wrapper | ||
|
|
||
| return decorator |
There was a problem hiding this comment.
The current implementation of the annotate decorator has two major issues:
- It forces
self_instanceas a required first positional argument in the wrapper, which will raise aTypeErrorwhen decorating standalone functions or methods with no arguments. - It silently skips profiling if the class instance does not have a
profilerattribute, even if a global profiler is active and running.
We can fix both issues by:
- Using
*args, **kwargs_innerto support any function signature (standalone or method). - Falling back to the active global profiler (
_current_profiler) if the first argument does not have aprofilerattribute.
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs_inner):
profiler = None
if args:
profiler = getattr(args[0], 'profiler', None)
if profiler is None:
profiler = getattr(cls, '_current_profiler', None)
if (not profiler or not profiler.check_enable() or not profiler.check_this_step()
or not profiler.check_this_rank()):
return func(*args, **kwargs_inner)
impl = profiler._impl
if hasattr(impl, 'annotate'):
try:
actual_decorator = impl.annotate(
message=message, color=color, domain=domain, category=category, **kwargs_outer)
return actual_decorator(func)(*args, **kwargs_inner)
except Exception:
return func(*args, **kwargs_inner)
return func(*args, **kwargs_inner)
return wrapperThere was a problem hiding this comment.
该annotate必须要求传入self_instance,standalone functions 是不支持profiler的。并且需要self_instance 拥有profiler 属性,这一点在profiler callback的时候可以保证
| class DistProfiler: | ||
|
|
There was a problem hiding this comment.
Add a class-level variable _current_profiler to keep track of the active global profiler instance. This will be used as a fallback in the annotate decorator when the decorated function is a standalone function or a method of a class that does not have a profiler attribute.
class DistProfiler:
_current_profiler = None| def start(self, **kwargs): | ||
| if self.check_enable() and self.check_this_rank(): | ||
| self._this_step = True | ||
| return getattr(self._impl, 'start', lambda **_: None)(**kwargs) | ||
|
|
||
| def stop(self): | ||
| if self.check_enable() and self.check_this_rank(): | ||
| self._this_step = False | ||
| return getattr(self._impl, 'stop', lambda: None)() |
There was a problem hiding this comment.
Update the start and stop methods to register and unregister the active profiler instance in the class-level _current_profiler variable.
def start(self, **kwargs):
if self.check_enable() and self.check_this_rank():
self._this_step = True
DistProfiler._current_profiler = self
return getattr(self._impl, 'start', lambda **_: None)(**kwargs)
def stop(self):
if self.check_enable() and self.check_this_rank():
self._this_step = False
if DistProfiler._current_profiler is self:
DistProfiler._current_profiler = None
return getattr(self._impl, 'stop', lambda: None)()Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
please run: |

PR type
PR information
This PR introduces built-in support for PyTorch Profiler,users can now generate detailed execution traces (compatible with Chrome Tracing) to visualize CPU/GPU activities, memory usage, and kernel-level performance.
The implementation is highly extensible, making it straightforward to add support for other tools like nsys down the road.
To support fine-grained performance analysis in Reinforcement Learning (e.g., RLHF), this PR introduces a flexible annotation mechanism. It allows developers to independently profile specific computational roles by simply wrapping their functions with @DistProfiler.annotate.
For example, you can distinguish between the Rollout and Actor phases as follows:
This design ensures that as the framework evolves to support complex multi-role training, performance bottlenecks can be easily isolated and visualized for each specific stage.
Experiment results
Key Arguments