Skip to content

feat: Add native support for PyTorch Profiler in ms-swift#9449

Open
qq1243196045 wants to merge 18 commits into
modelscope:mainfrom
qq1243196045:Profile
Open

feat: Add native support for PyTorch Profiler in ms-swift#9449
qq1243196045 wants to merge 18 commits into
modelscope:mainfrom
qq1243196045:Profile

Conversation

@qq1243196045
Copy link
Copy Markdown
Contributor

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR introduces built-in support for PyTorch Profiler,users can now generate detailed execution traces (compatible with Chrome Tracing) to visualize CPU/GPU activities, memory usage, and kernel-level performance.

The implementation is highly extensible, making it straightforward to add support for other tools like nsys down the road.

To support fine-grained performance analysis in Reinforcement Learning (e.g., RLHF), this PR introduces a flexible annotation mechanism. It allows developers to independently profile specific computational roles by simply wrapping their functions with @DistProfiler.annotate.
For example, you can distinguish between the Rollout and Actor phases as follows:

class RolloutTrainer:
    @DistProfiler.annotate(color="red", role="rollout")
    def train(self):
        # rollout logic
        pass

class ActorTrainer:
    @DistProfiler.annotate(color="red", role="actor")
    def train(self):
        # actor logic
        pass

This design ensures that as the framework evolves to support complex multi-role training, performance bottlenecks can be easily isolated and visualized for each specific stage.

Experiment results

image

Key Arguments

--enable_profiler true 
--profiler_save_path ./profiler_output 
--profiler_ranks 0 1 
--profiler_contents "cpu" "cuda" "stack" 
--profiler_steps 1  

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a performance profiling framework (DistProfiler) integrated with PyTorch's profiler, adding configuration arguments, trainer mixins, and callback hooks for standard and Megatron trainers. The feedback identifies several critical issues, including potential AttributeErrors in BaseArguments and Profiler, inconsistent default values for profiler_tool, and Python compatibility issues with the | operator in isinstance. Additionally, improvements are suggested for handling None values in config union/intersection operations, replacing print statements with standard logging, and using existing utility functions like get_dist_setting to retrieve distributed ranks.

Comment thread swift/trainers/arguments.py Outdated
Comment thread swift/utils/profiler/torch_profile.py Outdated
Comment thread swift/utils/profiler/config.py Outdated
Comment thread swift/arguments/base_args/profile_args.py Outdated
Comment thread swift/utils/profiler/config.py
Comment thread swift/utils/profiler/torch_profile.py Outdated
Comment thread swift/utils/profiler/torch_profile.py Outdated
Comment thread swift/utils/profiler/torch_profile.py Outdated
Comment thread swift/utils/profiler/profile.py
Comment thread swift/utils/profiler/profile.py Outdated
qq1243196045 and others added 11 commits May 29, 2026 21:08
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@qq1243196045
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a distributed performance profiling feature integrated with PyTorch's profiler, adding configuration arguments, callbacks for standard and Megatron trainers, and utility classes to manage profiling sessions. Key feedback highlights several critical issues: a missing logger import in profile.py that will cause a NameError, an invalid attribute check on BaseArguments that blocks initialization, a potential resource leak in discrete profiling mode if exceptions are raised, potential AttributeError issues due to class-level state tracking in Profiler, and an incorrect ordering of type assertions in TorchProfilerToolConfig.

Comment thread swift/utils/profiler/profile.py
Comment thread swift/arguments/base_args/profile_args.py
Comment thread swift/utils/profiler/torch_profile.py
Comment thread swift/utils/profiler/torch_profile.py
Comment thread swift/utils/profiler/config.py Outdated
qq1243196045 and others added 4 commits May 29, 2026 21:59
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@qq1243196045
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a distributed performance profiling framework integrated with PyTorch's profiler, adding profiler arguments, trainer callbacks, and a DistProfiler utility with an annotate decorator. The code review identified several critical issues: an over-restrictive assertion in profile_args.py that crashes the application when standard callbacks are absent; signature and fallback limitations in the annotate decorator that prevent profiling standalone functions; a potential AttributeError when stopping an uninitialized profiler; and a missing try...finally block in discrete profiling mode that could leak resources if an exception occurs.

Comment on lines +26 to +39
if hasattr(self, 'callbacks'):
if self.enable_profiler and 'profiler' not in self.callbacks:
self.callbacks.append('profiler')
if 'profiler' in self.callbacks and not self.enable_profiler:
self.enable_profiler = True
if self.enable_profiler:
assert 'profiler' in self.callbacks, \
'Profiler callback must be included in callbacks when profiler is enabled.'
if 'profiler' in self.callbacks:
assert self.enable_profiler, \
'Profiler callback is included in callbacks but profiler is not enabled.'
else:
assert not self.enable_profiler, \
'Profiler cannot be enabled without callbacks attribute or with profiler callback missing in callbacks.'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The assertion assert not self.enable_profiler in the else block will raise an AssertionError and crash the application whenever enable_profiler is set to True on any arguments class that does not have a callbacks attribute (such as BaseArguments, DeployArguments, EvalArguments, or ExportArguments). This prevents using the profiler for inference, evaluation, or custom training loops (like RLHF rollout/actor phases) where standard trainer callbacks are not used.

We should remove this assertion to allow enabling the profiler without requiring a callbacks attribute.

Suggested change
if hasattr(self, 'callbacks'):
if self.enable_profiler and 'profiler' not in self.callbacks:
self.callbacks.append('profiler')
if 'profiler' in self.callbacks and not self.enable_profiler:
self.enable_profiler = True
if self.enable_profiler:
assert 'profiler' in self.callbacks, \
'Profiler callback must be included in callbacks when profiler is enabled.'
if 'profiler' in self.callbacks:
assert self.enable_profiler, \
'Profiler callback is included in callbacks but profiler is not enabled.'
else:
assert not self.enable_profiler, \
'Profiler cannot be enabled without callbacks attribute or with profiler callback missing in callbacks.'
if hasattr(self, 'callbacks'):
if self.enable_profiler and 'profiler' not in self.callbacks:
self.callbacks.append('profiler')
if 'profiler' in self.callbacks and not self.enable_profiler:
self.enable_profiler = True
if self.enable_profiler:
assert 'profiler' in self.callbacks, \
'Profiler callback must be included in callbacks when profiler is enabled.'
if 'profiler' in self.callbacks:
assert self.enable_profiler, \
'Profiler callback is included in callbacks but profiler is not enabled.'

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseArguments 不应该直接使用,一般都是被其他Arguments继承,例如SftArguments,是有callback 属性的,并且profiler功能依赖profiler callback ,因此我必须确保profiler callback和enable_profiler开启。而如果没有 profiler callback, 那也确实说明不该开启profiler

Comment on lines +103 to +126
def decorator(func):

@functools.wraps(func)
def wrapper(self_instance, *args, **kwargs_inner):
profiler = getattr(self_instance, 'profiler', None)

if (not profiler or not profiler.check_enable() or not profiler.check_this_step()
or not profiler.check_this_rank()):
return func(self_instance, *args, **kwargs_inner)

impl = profiler._impl
if hasattr(impl, 'annotate'):
try:
actual_decorator = impl.annotate(
message=message, color=color, domain=domain, category=category, **kwargs_outer)

return actual_decorator(func)(self_instance, *args, **kwargs_inner)
except Exception:
return func(self_instance, *args, **kwargs_inner)
return func(self_instance, *args, **kwargs_inner)

return wrapper

return decorator
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of the annotate decorator has two major issues:

  1. It forces self_instance as a required first positional argument in the wrapper, which will raise a TypeError when decorating standalone functions or methods with no arguments.
  2. It silently skips profiling if the class instance does not have a profiler attribute, even if a global profiler is active and running.

We can fix both issues by:

  • Using *args, **kwargs_inner to support any function signature (standalone or method).
  • Falling back to the active global profiler (_current_profiler) if the first argument does not have a profiler attribute.
        def decorator(func):

            @functools.wraps(func)
            def wrapper(*args, **kwargs_inner):
                profiler = None
                if args:
                    profiler = getattr(args[0], 'profiler', None)
                if profiler is None:
                    profiler = getattr(cls, '_current_profiler', None)

                if (not profiler or not profiler.check_enable() or not profiler.check_this_step()
                        or not profiler.check_this_rank()):
                    return func(*args, **kwargs_inner)

                impl = profiler._impl
                if hasattr(impl, 'annotate'):
                    try:
                        actual_decorator = impl.annotate(
                            message=message, color=color, domain=domain, category=category, **kwargs_outer)

                        return actual_decorator(func)(*args, **kwargs_inner)
                    except Exception:
                        return func(*args, **kwargs_inner)
                return func(*args, **kwargs_inner)

            return wrapper

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

该annotate必须要求传入self_instance,standalone functions 是不支持profiler的。并且需要self_instance 拥有profiler 属性,这一点在profiler callback的时候可以保证

Comment on lines +12 to +13
class DistProfiler:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a class-level variable _current_profiler to keep track of the active global profiler instance. This will be used as a fallback in the annotate decorator when the decorated function is a standalone function or a method of a class that does not have a profiler attribute.

class DistProfiler:

    _current_profiler = None

Comment on lines +83 to +91
def start(self, **kwargs):
if self.check_enable() and self.check_this_rank():
self._this_step = True
return getattr(self._impl, 'start', lambda **_: None)(**kwargs)

def stop(self):
if self.check_enable() and self.check_this_rank():
self._this_step = False
return getattr(self._impl, 'stop', lambda: None)()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the start and stop methods to register and unregister the active profiler instance in the class-level _current_profiler variable.

    def start(self, **kwargs):
        if self.check_enable() and self.check_this_rank():
            self._this_step = True
            DistProfiler._current_profiler = self
            return getattr(self._impl, 'start', lambda **_: None)(**kwargs)

    def stop(self):
        if self.check_enable() and self.check_this_rank():
            self._this_step = False
            if DistProfiler._current_profiler is self:
                DistProfiler._current_profiler = None
            return getattr(self._impl, 'stop', lambda: None)()

Comment thread swift/utils/profiler/torch_profile.py
Comment thread swift/utils/profiler/torch_profile.py
qq1243196045 and others added 2 commits June 1, 2026 11:05
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@Jintao-Huang
Copy link
Copy Markdown
Collaborator

please run:

pip install pre-commit
pre-commit run --all-files

@qq1243196045
Copy link
Copy Markdown
Contributor Author

please run:

pip install pre-commit
pre-commit run --all-files
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants