@moonli moonli commented Nov 18, 2025

Summary:
The current error message logs the actor name and stack trace when an exception occurs (ActorError/SupervisionError). The endpoint name appears only in the stack trace.

In cases such as a proc crash, however, the supervision error has no stack trace, so the endpoint name cannot be shown.

This diff adds an error log message that precedes the supervision error and prints both the actor name and the endpoint name of the call. This helps users figure out where the failure happened in all error cases.
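
To illustrate the idea, here is a minimal sketch of logging the actor name and endpoint name just before an error is surfaced. It is not the code in this diff: the logger name and the helpers `format_call_failure` and `log_call_failure` are hypothetical, and only the standard `logging` module is assumed.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("endpoint_failure_sketch")


def format_call_failure(actor_name: str, endpoint_name: str) -> str:
    """Build a message that names both the actor and the endpoint of the failed call."""
    return f"call to endpoint '{endpoint_name}' on actor '{actor_name}' failed"


def log_call_failure(actor_name: str, endpoint_name: str, error: BaseException) -> str:
    """Emit an error log line before the error itself is raised or reported.

    This works even when `error` carries no stack trace, because the
    actor and endpoint names are passed in explicitly by the caller.
    """
    message = format_call_failure(actor_name, endpoint_name)
    logger.error("%s: %s", message, error)
    return message
```

A caller that knows which endpoint it invoked could call `log_call_failure("trainer", "step", err)` before re-raising `err`, so the log names the endpoint even when the error has no stack trace.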

Differential Revision: D87353113

meta-codesync bot commented Nov 18, 2025

@moonli has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87353113.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 18, 2025
@moonli moonli force-pushed the export-D87353113 branch 2 times, most recently from bceb874 to 9073454 Compare November 21, 2025 23:22
…ytorch#1917)

Summary:

The current error message logs the actor name and stack trace when an exception occurs (ActorError/SupervisionError). The endpoint name appears only in the stack trace.

In cases such as a proc crash, however, the supervision error has no stack trace, so the endpoint name cannot be shown.

This diff adds the user actor name and endpoint name to the ActorError and SupervisionError exceptions.
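
As a rough sketch of what carrying this context on an exception can look like, the example below defines a stand-in exception class. `CallContextError` and its fields are hypothetical and are not the actual ActorError or SupervisionError definitions changed by this diff.

```python
from typing import Optional


class CallContextError(Exception):
    """Hypothetical stand-in for an error that records which call failed."""

    def __init__(
        self,
        message: str,
        actor_name: Optional[str] = None,
        endpoint_name: Optional[str] = None,
    ) -> None:
        super().__init__(message)
        self.message = message
        self.actor_name = actor_name
        self.endpoint_name = endpoint_name

    def __str__(self) -> str:
        # Without any call context, fall back to the plain message.
        if self.actor_name is None and self.endpoint_name is None:
            return self.message
        return f"{self.message} (actor={self.actor_name}, endpoint={self.endpoint_name})"
```

Because the names live on the exception itself, they remain available when no stack trace exists, which is the proc-crash case described above.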


github issue: meta-pytorch#1899

Differential Revision: D87353113
Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported
