-
Notifications
You must be signed in to change notification settings - Fork 473
swarm - switch to handoff node only after current node stops #1147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| logger.debug("reason=<%s> | stopping execution", reason) | ||
| break | ||
|
|
||
| # Get current node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed a few inline comments because I felt the code was already self explanatory.
src/strands/multiagent/swarm.py
Outdated
| self.state.node_history.append(current_node) | ||
|
|
||
| # After self.state add current node, swarm state finish updating, we persist here | ||
| self.hooks.invoke_callbacks(AfterNodeCallEvent(self, current_node.node_id, invocation_state)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reiterate, setting self.state.current_node = handoff_node in the handoff tool means that AfterNodeCallEvent is emitted with a current node_id that does not match the self.state.current_node.node_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, for supporting interrupts, we can't have self.state.current_node update to the handoff node if the current node is interrupted.
|
|
||
| registry.add_callback(MultiAgentInitializedEvent, lambda event: self.initialize_multi_agent(event.source)) | ||
| registry.add_callback(AfterNodeCallEvent, lambda event: self.sync_multi_agent(event.source)) | ||
| registry.add_callback(BeforeNodeCallEvent, lambda event: self.sync_multi_agent(event.source)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say we have successfully executed one node and are now executing the handoff node. If we crash on the handoff node, we would be left in different states depending on which event we persist on:
AfterNodeCallEvent: Current node is not set to the handoff node in session because the handoff node hasn't yet emitted itsAfterNodeCallEvent. This means if we resume after crashing on the handoff node, we will be starting again from the first node.BeforeNodeCallEvent: Current node is set to the handoff node in session because the handoff node already emitted itsBeforeNodeCallEvent. This means if we resume after crashing on the handoff node, we will be starting again from the handoff node.
In short, persisting on AfterNodeCallEvent only made sense when setting the current node to the handoff in the handoff tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes also apply to Graph?
Initilized -> Before->1st node->After->Before->2nd node->After, so for 2nd node(handoff node) before/after is kinda same?
…nto swarm-handoff-delayed
| # After self.state add current node, swarm state finish updating, we persist here | ||
| await self.hooks.invoke_callbacks_async( | ||
| AfterNodeCallEvent(self, current_node.node_id, invocation_state) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reiterate, setting self.state.current_node = handoff_node in the handoff tool means that AfterNodeCallEvent is emitted with a current node_id that does not match the self.state.current_node.node_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, for supporting interrupts, we can't have self.state.current_node update to the handoff node if the current node is interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now current_node means next nodes:
next_nodes = ([self.state.current_node.node_id] if self.state.completion_status == Status.EXECUTING and self.state.current_node else [] )
| # Total metrics across all agents | ||
| accumulated_metrics: Metrics = field(default_factory=lambda: Metrics(latencyMs=0)) | ||
| execution_time: int = 0 # Total execution time in milliseconds | ||
| handoff_node: SwarmNode | None = None # The agent to execute next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to introduce this handoff_node as next node, then we also need to update persist logic for mapping next_nodes_to_execute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate? I tested that the persisting logic continues to work as expected. What makes this work the same is syncing the swarm on BeforeNodeCallEvent as opposed to AfterNodeCallEvent. I make a comment about this down below.
| current_node = self.state.handoff_node | ||
|
|
||
| self.state.handoff_node = None | ||
| self.state.current_node = current_node |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here self.state.current_node is previous next_node so this is actually the same node, only naming change right.
Description
Set the handoff node to current only after the current node finishes. Currently, we make the switch in the middle of the current node execution. It is important to fix this for a few reasons:
AfterNodeCallEventwith the current node id andstate.current_nodeset to the handoff node. This is going to cause customer confusion.Related Issues
#204
Documentation PR
Implementation detail
Type of Change
Bug fix
Testing
How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
hatch run prepare: Relying on existing unit testshatch test tests_integ/test_multiagent_swarm.pyChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.