Problem Statement
In multi-agent graph workflows, when a sub-agent (custom node) needs to collect user input mid-execution via event.interrupt(), the interrupt does not cleanly propagate across agent boundaries. The parent orchestrator must re-invoke the sub-agent from scratch after receiving the user's response. This means the sub-agent loses all local execution state, and the full orchestration chain replays from the beginning - adding latency and requiring external state management to reconstruct context.
This is a fundamental limitation for deterministic multi-step workflows where sub-agents conditionally require user input at various points in the execution flow.
Proposed Solution
ntroduce a first-class interrupt/resume primitive for sub-agents within a multi-agent graph that:
1/ Allows a sub-agent to call interrupt() and have the graph executor preserve the sub-agent's full local state (variables, tool state, conversation history, position in workflow)
2/ Upon receiving user input, the graph executor resumes the sub-agent in-place from the exact point of interruption - without re-dispatching from the parent orchestrator
3/ The orchestrator is notified of the interrupt/resume event but does not need to replay the workflow
This could be implemented as:
Serializable sub-agent checkpoints at the sub-agent level
A yield-style coroutine pattern where sub-agents suspend and resume natively
Graph-level session state that tracks each node's execution position
Use Case
A 13-step diagnostic flow where each step may require technician input (clarification, confirmation, selection from options). Built on AgentCore Runtime + Python Flask + React. Currently text-only, expanding to 15-20 additional deterministic workflows with similar interrupt/resume patterns. Each workflow has conditional branches where different sub-agents handle different diagnostic paths and may need user input at any point.
Alternatives Solutions
No response
Additional Context
No response
Problem Statement
In multi-agent graph workflows, when a sub-agent (custom node) needs to collect user input mid-execution via event.interrupt(), the interrupt does not cleanly propagate across agent boundaries. The parent orchestrator must re-invoke the sub-agent from scratch after receiving the user's response. This means the sub-agent loses all local execution state, and the full orchestration chain replays from the beginning - adding latency and requiring external state management to reconstruct context.
This is a fundamental limitation for deterministic multi-step workflows where sub-agents conditionally require user input at various points in the execution flow.
Proposed Solution
ntroduce a first-class interrupt/resume primitive for sub-agents within a multi-agent graph that:
1/ Allows a sub-agent to call interrupt() and have the graph executor preserve the sub-agent's full local state (variables, tool state, conversation history, position in workflow)
2/ Upon receiving user input, the graph executor resumes the sub-agent in-place from the exact point of interruption - without re-dispatching from the parent orchestrator
3/ The orchestrator is notified of the interrupt/resume event but does not need to replay the workflow
This could be implemented as:
Serializable sub-agent checkpoints at the sub-agent level
A yield-style coroutine pattern where sub-agents suspend and resume natively
Graph-level session state that tracks each node's execution position
Use Case
A 13-step diagnostic flow where each step may require technician input (clarification, confirmation, selection from options). Built on AgentCore Runtime + Python Flask + React. Currently text-only, expanding to 15-20 additional deterministic workflows with similar interrupt/resume patterns. Each workflow has conditional branches where different sub-agents handle different diagnostic paths and may need user input at any point.
Alternatives Solutions
No response
Additional Context
No response