Commit bceb874

moonli authored and facebook-github-bot committed
log endpoint name when supervision error happens
Summary: The current error message logs the actor name and stack trace when an exception (ActorError/SupervisionError) happens, and the endpoint name appears in that stack trace. In cases such as a proc crash, however, the supervision error has no stack trace, so the endpoint name is never shown. This diff adds an error log message that precedes the supervision error and prints both the actor name and the endpoint name of the call, so users can locate the failure in all error cases.

github issue: #1899

Differential Revision: D87353113
1 parent 36c0579 commit bceb874
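The pattern described in the summary can be sketched in isolation as follows. This is a minimal illustration, not the Monarch code: `FakeReceiver`, `ListHandler`, `call_endpoint`, and the names `"train_step"` and `"trainer"` are hypothetical stand-ins, and the sketch uses a bare `raise` where the diff uses `raise e`.

```python
import asyncio
import logging
from typing import List, Tuple


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted log messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


class FakeReceiver:
    """Hypothetical stand-in for a port receiver whose peer process crashed."""

    async def recv(self) -> int:
        # A crash surfaces here as an exception that carries no endpoint name.
        raise RuntimeError("proc crashed")


async def call_endpoint(receiver: FakeReceiver, method_name: str, actor_name: str) -> int:
    try:
        return await receiver.recv()
    except Exception:
        # Log the endpoint and actor names first, then propagate the original error.
        logging.error(
            f'Endpoint call "{method_name}" failed on actor mesh "{actor_name}"'
        )
        raise


def run_demo() -> Tuple[List[str], str]:
    """Run one failing call; return the captured log messages and the error text."""
    handler = ListHandler()
    root = logging.getLogger()
    root.addHandler(handler)
    try:
        asyncio.run(call_endpoint(FakeReceiver(), "train_step", "trainer"))
        error = ""
    except RuntimeError as e:
        error = str(e)
    finally:
        root.removeHandler(handler)
    return handler.messages, error
```

Calling `run_demo()` shows that the log line naming the endpoint is emitted before the original exception reaches the caller, which is the ordering the summary relies on.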

1 file changed: +28 -3 lines changed

python/monarch/_src/actor/endpoint.py

Lines changed: 28 additions & 3 deletions
@@ -7,6 +7,7 @@
 # pyre-strict
 
 import functools
+import logging
 import time
 from abc import ABC, abstractmethod
 from typing import (
@@ -233,7 +234,15 @@ def choose(self, *args: P.args, **kwargs: P.kwargs) -> Future[R]:
             1,
         )
         async def process() -> R:
-            result = await r.recv()
+            try:
+                result = await r.recv()
+            except Exception as e:
+                from monarch._src.actor.actor_mesh import current_actor_name
+
+                logging.error(
+                    f'Endpoint call "{method_name}" failed on actor mesh "{current_actor_name()}"'
+                )
+                raise e
             return result
 
         return Future(coro=process())
@@ -260,7 +269,15 @@ def call_one(self, *args: P.args, **kwargs: P.kwargs) -> Future[R]:
             1,
         )
         async def process() -> R:
-            result = await r.recv()
+            try:
+                result = await r.recv()
+            except Exception as e:
+                from monarch._src.actor.actor_mesh import current_actor_name
+
+                logging.error(
+                    f'Endpoint call "{method_name}" failed on actor mesh "{current_actor_name()}"'
+                )
+                raise e
             return result
 
         return Future(coro=process())
@@ -289,7 +306,15 @@ async def process() -> "ValueMesh[R]":
 
             results: List[R] = [None] * extent.nelements  # pyre-fixme[9]
             for _ in range(extent.nelements):
-                rank, value = await r._recv()
+                try:
+                    rank, value = await r._recv()
+                except Exception as e:
+                    from monarch._src.actor.actor_mesh import current_actor_name
+
+                    logging.error(
+                        f'Endpoint call "{method_name}" failed on actor mesh "{current_actor_name()}"'
+                    )
+                    raise e
                 results[rank] = value
             call_shape = Shape(
                 extent.labels,
