feat: Add CheckpointService for agent state persistence #4586

Open
drahnreb wants to merge 5 commits into google:main from drahnreb:feat/graph-agent-pr5

Conversation

@drahnreb

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

Problem:
Long-running agent workflows need state snapshots for recovery, debugging, and audit trails. There is no built-in checkpoint mechanism that composes existing ADK session/artifact services.

Solution:
Add a stateless CheckpointService built on SessionService and ArtifactService primitives, with delta compression, concurrent session locking, and configurable retention. Also includes CheckpointableMixin for any agent type, GraphCheckpointCallback for node-level checkpointing, and a comprehensive error model (CheckpointNotFoundError, CheckpointCorruptedError, DeltaChainBrokenError) with telemetry.

What's included:

  • src/google/adk/checkpoints/checkpoint_service.py, models.py, mixins.py, callback.py, utils.py, __init__.py
  • src/google/adk/agents/graph/checkpoint_callback.py — GraphCheckpointCallback
  • Updated graph/__init__.py with checkpoint exports
  • Updated test_graph_agent.py with final test additions
  • Updated test_interrupt_integration.py with checkpoint+interrupt integration tests
  • 7 test files (~105 tests): test_checkpoint_service.py, test_checkpoint_coverage.py, test_checkpoint_delta_chain.py, test_checkpoint_locks.py, test_checkpoint_mixin.py, test_checkpoint_utils.py, test_callback.py
  • 7 samples (graph_agent_advanced, graph_agent_agent_driven_checkpoint, graph_agent_agent_driven_topology, graph_agent_dynamic_topology, graph_agent_hitl, graph_agent_parallel_features, graph_agent_todo_queue, examples 04/12)

Part 5 of 5 — see tracking issue #4581. Stacked on #4585.

Testing Plan

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
pytest tests/unittests/checkpoints/ -v — ~105 tests ✅
pytest tests/unittests/agents/test_interrupt_integration.py -v — checkpoint+interrupt integration ✅
All prior tests still pass (regression) ✅
Final state matches original branch: git diff feat/graph-agent-v4 feat/graph-agent-v3 -- src/ tests/ contributing/ docs/ → zero diff on our files ✅

Manual End-to-End (E2E) Tests:

7 checkpoint sample agents import and instantiate successfully.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

Additional context

Part 5 of 5 (final). Depends on all prior PRs: #4582, #4583, #4584, #4585. Core CheckpointService is agent-agnostic; only GraphCheckpointCallback depends on GraphAgent.

Total across all 5 PRs: ~727 tests, 26 samples, 6 design docs.

@google-cla
google-cla bot commented Feb 22, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @drahnreb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and manageability of long-running agent workflows by introducing a dedicated CheckpointService. This service provides a standardized mechanism for persisting agent state, particularly for GraphAgent, ensuring that complex, multi-step processes can be reliably resumed, debugged, and audited. It builds upon existing ADK infrastructure to offer advanced features like delta compression and concurrent locking, making agent workflows more resilient to interruptions and failures.

Highlights

  • CheckpointService for Agent State Persistence: Introduced a stateless CheckpointService to enable state snapshots for long-running agent workflows, supporting recovery, debugging, and audit trails. This service leverages existing SessionService and ArtifactService primitives.
  • GraphAgent Integration: Integrated checkpointing capabilities directly into GraphAgent through a new GraphCheckpointCallback, allowing for node-level checkpointing and agent-driven checkpoint requests.
  • Advanced Checkpointing Features: The CheckpointService supports delta compression for efficient storage, concurrent session locking to prevent data corruption, and configurable retention policies for managing checkpoint history.
  • Comprehensive Error Model and Telemetry: Implemented a robust error model (CheckpointNotFoundError, CheckpointCorruptedError, DeltaChainBrokenError) and integrated OpenTelemetry for enhanced observability and diagnostics of checkpoint operations.
  • Extensive Examples and Documentation: Added numerous new documentation files and 7 sample agents demonstrating various GraphAgent features, including advanced patterns, interrupt handling, parallel execution, and the newly added checkpointing functionality.
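
To make the locking and retention highlights concrete, here is a minimal sketch under stated assumptions: one `asyncio.Lock` per session serializes writers, and a retention cap keeps only the newest N snapshots. `ToyCheckpointStore` and its methods are hypothetical and are not the PR's CheckpointService API.

```python
from __future__ import annotations

import asyncio
from collections import defaultdict
from typing import Any


class ToyCheckpointStore:
  """Hypothetical in-memory store; not the PR's CheckpointService."""

  def __init__(self, max_checkpoints: int = 3) -> None:
    if max_checkpoints < 1:
      raise ValueError("max_checkpoints must be at least 1")
    self._max = max_checkpoints
    self._locks = defaultdict(asyncio.Lock)  # one lock per session id
    self._history = defaultdict(list)  # session id -> saved states

  async def save(self, session_id: str, state: dict[str, Any]) -> int:
    """Append a snapshot under the session lock, then apply retention."""
    async with self._locks[session_id]:
      history = self._history[session_id]
      history.append(dict(state))  # shallow copy so later mutation is not shared
      del history[: -self._max]  # drop everything except the newest N
      return len(history)

  def history(self, session_id: str) -> list[dict[str, Any]]:
    """Return a copy of the retained snapshots, oldest first."""
    return list(self._history[session_id])
```

Locking per session rather than globally lets unrelated sessions checkpoint in parallel while still preventing interleaved writes within one session.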

Changelog
  • contributing/docs/advanced_graph_patterns.md
    • Added documentation for advanced GraphAgent patterns, including dynamic task queues, nested invocations, and conditional parallel groups.
  • contributing/docs/graph_agent_design.md
    • Added a design document outlining GraphAgent's motivation, key capabilities, architecture, and comparison to other workflow agents.
  • contributing/docs/graph_node_types.md
    • Added documentation detailing various GraphAgent node types and their use cases.
  • contributing/docs/interrupt_service_architecture.md
    • Added a design document for the InterruptService, covering its architecture, features, and security aspects.
  • contributing/docs/pattern_apis.md
    • Added documentation for GraphAgent pattern APIs, including DynamicNode, NestedGraphNode, and DynamicParallelGroup.
  • contributing/samples/graph_agent_advanced/README.md
    • Added a README for an advanced GraphAgent example demonstrating checkpointing, LLM-based interrupt reasoning, and custom observability.
  • contributing/samples/graph_agent_advanced/__init__.py
    • Added an initialization file for the advanced GraphAgent example.
  • contributing/samples/graph_agent_advanced/agent.py
    • Added an advanced GraphAgent example showcasing checkpointing, LLM-based interrupt reasoning, and flexible interrupt timings.
  • contributing/samples/graph_agent_advanced/root_agent.yaml
    • Added a YAML configuration file for the advanced GraphAgent example.
  • contributing/samples/graph_agent_agent_driven_checkpoint/README.md
    • Added a README for an agent-driven checkpoint example.
  • contributing/samples/graph_agent_agent_driven_checkpoint/agent.py
    • Added an example demonstrating agent-driven checkpointing where an LLM decides when to create checkpoints.
  • contributing/samples/graph_agent_agent_driven_topology/README.md
    • Added a README for an agent-driven topology example.
  • contributing/samples/graph_agent_agent_driven_topology/agent.py
    • Added an example demonstrating agent-driven dynamic topology modification within a function node.
  • contributing/samples/graph_agent_basic/README.md
    • Added a README for a basic GraphAgent example.
  • contributing/samples/graph_agent_basic/agent.py
    • Added a basic GraphAgent example demonstrating conditional routing.
  • contributing/samples/graph_agent_basic/root_agent.yaml
    • Added a YAML configuration file for the basic GraphAgent example.
  • contributing/samples/graph_agent_dynamic_queue/README.md
    • Added a README for a dynamic task queue example.
  • contributing/samples/graph_agent_dynamic_queue/agent.py
    • Added an example demonstrating a dynamic task queue pattern with runtime agent dispatch.
  • contributing/samples/graph_agent_dynamic_topology/README.md
    • Added a README for a dynamic topology example.
  • contributing/samples/graph_agent_dynamic_topology/__init__.py
    • Added an initialization file for the dynamic topology example.
  • contributing/samples/graph_agent_dynamic_topology/agent.py
    • Added an example demonstrating dynamic topology modification at runtime based on LLM planner decisions.
  • contributing/samples/graph_agent_hitl/README.md
    • Added a README for a Human-In-The-Loop (HITL) example.
  • contributing/samples/graph_agent_hitl/__init__.py
    • Added an initialization file for the HITL example.
  • contributing/samples/graph_agent_hitl/agent.py
    • Added an example demonstrating HITL with risk-gated approval and selective checkpointing.
  • contributing/samples/graph_agent_hitl_orchestrated/README.md
    • Added a README for an orchestrated HITL pipeline example.
  • contributing/samples/graph_agent_hitl_orchestrated/agent.py
    • Added an example demonstrating a composable HITL orchestrated pipeline using NestedGraphNode.
  • contributing/samples/graph_agent_hitl_review/README.md
    • Added a README for a HITL content review workflow example.
  • contributing/samples/graph_agent_hitl_review/agent.py
    • Added an example demonstrating a HITL content review workflow with a human approval loop.
  • contributing/samples/graph_agent_multi_agent/README.md
    • Added a README for a multi-agent research workflow example.
  • contributing/samples/graph_agent_multi_agent/agent.py
    • Added an example demonstrating a multi-agent research workflow with parallel execution and a quality review loop.
  • contributing/samples/graph_agent_parallel_features/README.md
    • Added a README for a parallel execution and rewind features example.
  • contributing/samples/graph_agent_parallel_features/__init__.py
    • Added an initialization file for the parallel features example.
  • contributing/samples/graph_agent_parallel_features/agent.py
    • Added an example demonstrating comprehensive parallel execution and rewind features.
  • contributing/samples/graph_agent_pattern_dynamic_node/README.md
    • Added a README for a DynamicNode pattern example.
  • contributing/samples/graph_agent_pattern_dynamic_node/agent.py
    • Added an example demonstrating the DynamicNode pattern for runtime agent selection.
  • contributing/samples/graph_agent_pattern_nested_graph/README.md
    • Added a README for a NestedGraphNode pattern example.
  • contributing/samples/graph_agent_pattern_nested_graph/agent.py
    • Added an example demonstrating the NestedGraphNode pattern for hierarchical workflow composition.
  • contributing/samples/graph_agent_pattern_parallel_group/README.md
    • Added a README for a DynamicParallelGroup pattern example.
  • contributing/samples/graph_agent_pattern_parallel_group/agent.py
    • Added an example demonstrating the DynamicParallelGroup pattern for Tree of Thoughts.
  • contributing/samples/graph_agent_react_pattern/README.md
    • Added a README for a ReAct pattern example.
  • contributing/samples/graph_agent_react_pattern/agent.py
    • Added an example demonstrating the ReAct pattern for reasoning and acting loops.
  • contributing/samples/graph_agent_todo_queue/README.md
    • Added a README for a TODO queue example.
  • contributing/samples/graph_agent_todo_queue/__init__.py
    • Added an initialization file for the TODO queue example.
  • contributing/samples/graph_agent_todo_queue/agent.py
    • Added an example demonstrating a TODO queue orchestrator with resume-safe checkpointing.
  • contributing/samples/graph_examples/01_basic/__init__.py
    • Added an initialization file for the basic graph example.
  • contributing/samples/graph_examples/01_basic/agent.py
    • Added a basic GraphAgent example demonstrating a simple directed workflow.
  • contributing/samples/graph_examples/02_conditional_routing/__init__.py
    • Added an initialization file for the conditional routing example.
  • contributing/samples/graph_examples/02_conditional_routing/agent.py
    • Added a conditional routing GraphAgent example demonstrating state-based decision making.
  • contributing/samples/graph_examples/03_cyclic_execution/__init__.py
    • Added an initialization file for the cyclic execution example.
  • contributing/samples/graph_examples/03_cyclic_execution/agent.py
    • Added a cyclic execution GraphAgent example demonstrating loops and iteration control.
  • contributing/samples/graph_examples/04_checkpointing/__init__.py
    • Added an initialization file for the checkpointing example.
  • contributing/samples/graph_examples/04_checkpointing/agent.py
    • Added a checkpointing GraphAgent example demonstrating automatic state persistence.
  • contributing/samples/graph_examples/05_interrupts_basic/__init__.py
    • Added an initialization file for the basic interrupts example.
  • contributing/samples/graph_examples/05_interrupts_basic/agent.py
    • Added a basic interrupts GraphAgent example demonstrating all interrupt actions.
  • contributing/samples/graph_examples/06_interrupts_reasoning/__init__.py
    • Added an initialization file for the interrupts reasoning example.
  • contributing/samples/graph_examples/06_interrupts_reasoning/agent.py
    • Added an interrupts with reasoning GraphAgent example demonstrating condition-based action selection.
  • contributing/samples/graph_examples/07_callbacks/__init__.py
    • Added an initialization file for the callbacks example.
  • contributing/samples/graph_examples/07_callbacks/agent.py
    • Added a callbacks GraphAgent example demonstrating node lifecycle hooks.
  • contributing/samples/graph_examples/08_rewind/__init__.py
    • Added an initialization file for the rewind example.
  • contributing/samples/graph_examples/08_rewind/agent.py
    • Added a rewind GraphAgent example demonstrating time-travel debugging.
  • contributing/samples/graph_examples/09_parallel_wait_all/__init__.py
    • Added an initialization file for the parallel wait all example.
  • contributing/samples/graph_examples/09_parallel_wait_all/agent.py
    • Added a parallel wait all GraphAgent example demonstrating concurrent node execution.
  • contributing/samples/graph_examples/10_parallel_wait_any/__init__.py
    • Added an initialization file for the parallel wait any example.
  • contributing/samples/graph_examples/10_parallel_wait_any/agent.py
    • Added a parallel wait any GraphAgent example demonstrating racing multiple data sources.
  • contributing/samples/graph_examples/11_parallel_wait_n/__init__.py
    • Added an initialization file for the parallel wait N example.
  • contributing/samples/graph_examples/11_parallel_wait_n/agent.py
    • Added a parallel wait N GraphAgent example demonstrating proceeding after N of M branches complete.
  • contributing/samples/graph_examples/12_parallel_checkpointing/__init__.py
    • Added an initialization file for the parallel checkpointing example.
  • contributing/samples/graph_examples/12_parallel_checkpointing/agent.py
    • Added a parallel checkpointing GraphAgent example demonstrating state persistence across parallel execution.
  • contributing/samples/graph_examples/13_parallel_interrupts/__init__.py
    • Added an initialization file for the parallel interrupts example.
  • contributing/samples/graph_examples/13_parallel_interrupts/agent.py
    • Added a parallel interrupts GraphAgent example demonstrating interrupt handling inside parallel branches.
  • contributing/samples/graph_examples/14_parallel_rewind/__init__.py
    • Added an initialization file for the parallel rewind example.
  • contributing/samples/graph_examples/14_parallel_rewind/agent.py
    • Added a parallel rewind GraphAgent example demonstrating rewind with parallel workflows.
  • contributing/samples/graph_examples/15_enhanced_routing/__init__.py
    • Added an initialization file for the enhanced routing example.
  • contributing/samples/graph_examples/15_enhanced_routing/agent.py
    • Added an enhanced routing GraphAgent example demonstrating priority, weighted, and fallback routing.
  • contributing/samples/graph_examples/README.md
    • Added a comprehensive README for all GraphAgent examples, detailing features and usage.
  • contributing/samples/graph_examples/__init__.py
    • Added an initialization file for the graph examples directory.
  • contributing/samples/graph_examples/example_utils.py
    • Added utility functions for GraphAgent examples, including LLM mode toggling.
  • contributing/samples/graph_examples/run_all_examples.sh
    • Added a shell script to run all GraphAgent examples.
  • contributing/samples/graph_examples/run_example.py
    • Added a utility script to run individual GraphAgent examples with optional tracing and LLM mode.
  • docs/future-work/dynamic-topology-modification.md
    • Added a detailed design document outlining the implementation plan for dynamic topology modification in GraphAgent.
  • src/google/adk/__init__.py
    • Updated __all__ to remove Context and added GraphAgent related imports.
  • src/google/adk/agents/__init__.py
    • Updated __all__ to remove Context and added GraphAgent, GraphNode, GraphState, START, END imports.
  • src/google/adk/agents/graph/__init__.py
    • Added an initialization file for the new graph module, exporting all its components.
  • src/google/adk/agents/graph/callbacks.py
    • Added callbacks infrastructure for GraphAgent, including NodeCallbackContext and EdgeCallbackContext.
  • src/google/adk/agents/graph/checkpoint_callback.py
    • Added GraphCheckpointCallback for node-level checkpointing within GraphAgent workflows.
  • src/google/adk/agents/graph/evaluation_metrics.py
    • Added custom evaluation metrics (graph_path_match, state_contains_keys, node_execution_count) for GraphAgent workflows.
  • src/google/adk/agents/graph/graph_agent.py
    • Added the core GraphAgent class, implementing graph execution logic, conditional routing, and integration with interrupts and telemetry.
  • src/google/adk/agents/graph/graph_agent_config.py
    • Added Pydantic models for configuring GraphAgent via YAML, including nodes, edges, interrupts, and parallel groups.
  • src/google/adk/agents/graph/graph_agent_state.py
    • Added GraphAgentState for tracking GraphAgent's execution state, such as current node, iteration, and path.
  • src/google/adk/agents/graph/graph_edge.py
    • Added EdgeCondition for defining conditional edges with priority and weight for advanced routing.
  • src/google/adk/agents/graph/graph_events.py
    • Added typed events (GraphEvent, GraphEventType, GraphStreamMode) for graph execution streaming and monitoring.
  • src/google/adk/agents/graph/graph_export.py
    • Added functions to export graph structure and execution data in D3-compatible JSON format for visualization.
  • src/google/adk/agents/graph/graph_interrupt_handler.py
    • Added a mixin (GraphInterruptMixin) to encapsulate interrupt handling logic for graph-based agents.
  • src/google/adk/agents/graph/graph_node.py
    • Added GraphNode to wrap any ADK agent or custom function as a node in the graph.
  • src/google/adk/agents/graph/graph_rewind.py
    • Added rewind_to_node function for time-travel debugging and re-execution of graph workflows.
  • src/google/adk/agents/graph/graph_state.py
    • Added GraphState as the primary container for managing domain data within graph execution.
  • src/google/adk/agents/graph/graph_telemetry.py
    • Added telemetry mixins (AgentTelemetryMixin, GraphTelemetryMixin) for graph observability and tracing.
  • src/google/adk/agents/graph/interrupt.py
    • Added interrupt modes (InterruptMode) and configuration (InterruptConfig, InterruptAction) for human-in-the-loop interactions.
  • src/google/adk/agents/graph/interrupt_reasoner.py
    • Added InterruptReasoner for LLM-based decision making on interrupt messages.
  • src/google/adk/agents/graph/interrupt_service.py
    • Added InterruptService for dynamic runtime interrupts, including message queuing, pause/resume, and cancellation.
  • src/google/adk/agents/graph/parallel.py
    • Added parallel execution components (ParallelNodeGroup, JoinStrategy, ErrorPolicy) for concurrent node execution.
  • src/google/adk/agents/graph/patterns.py
    • Added pattern APIs for DynamicNode, NestedGraphNode, and DynamicParallelGroup to support advanced graph architectures.
  • src/google/adk/agents/graph/state_utils.py
    • Added utility functions for state parsing and Pydantic JSON encoding.
  • src/google/adk/checkpoints/__init__.py
    • Updated __all__ to include CheckpointService and CheckpointCallback.
  • src/google/adk/checkpoints/callback.py
    • Added CheckpointCallback for agent-agnostic checkpointing lifecycle events.
  • src/google/adk/checkpoints/checkpoint_service.py
    • Added CheckpointService for managing agent state persistence, including creation, listing, and restoration of checkpoints.
  • src/google/adk/checkpoints/errors.py
    • Added custom error classes specific to checkpointing operations.
  • src/google/adk/checkpoints/models.py
    • Added Pydantic models for checkpoint metadata and state snapshots.
  • src/google/adk/checkpoints/tracing.py
    • Added OpenTelemetry tracing for checkpoint operations.
  • src/google/adk/checkpoints/utils.py
    • Added utility functions for checkpointing operations.
  • src/google/adk/sessions/__init__.py
    • Updated __all__ to include InMemorySessionService.
  • src/google/adk/sessions/in_memory_session_service.py
    • Added InMemorySessionService for in-memory session management.
  • src/google/adk/telemetry/__init__.py
    • Updated __all__ to include graph_tracing.
  • src/google/adk/telemetry/graph_tracing.py
    • Added OpenTelemetry tracing for GraphAgent components.
  • src/google/adk/telemetry/tracing.py
    • Updated __all__ to include tracer.
  • tests/unittests/agents/test_graph_agent.py
    • Added unit tests for the core GraphAgent functionality.
  • tests/unittests/agents/test_graph_callbacks.py
    • Added unit tests for GraphAgent callbacks.
  • tests/unittests/agents/test_graph_checkpoint_callback.py
    • Added unit tests for GraphCheckpointCallback.
  • tests/unittests/agents/test_graph_config.py
    • Added unit tests for GraphAgent configuration parsing.
  • tests/unittests/agents/test_graph_edge.py
    • Added unit tests for EdgeCondition and routing logic.
  • tests/unittests/agents/test_graph_interrupt_handler.py
    • Added unit tests for GraphAgent's interrupt handling mixin.
  • tests/unittests/agents/test_graph_node.py
    • Added unit tests for GraphNode behavior.
  • tests/unittests/agents/test_graph_parallel.py
    • Added unit tests for parallel execution features.
  • tests/unittests/agents/test_graph_patterns.py
    • Added unit tests for GraphAgent pattern APIs.
  • tests/unittests/agents/test_graph_rewind.py
    • Added unit tests for graph rewind functionality.
  • tests/unittests/agents/test_graph_state.py
    • Added unit tests for GraphState and its utility methods.
  • tests/unittests/agents/test_graph_telemetry.py
    • Added unit tests for graph telemetry and tracing.
  • tests/unittests/agents/test_interrupt_integration.py
    • Added integration tests for interrupt service functionality.
  • tests/unittests/agents/test_interrupt_reasoner.py
    • Added unit tests for InterruptReasoner.
  • tests/unittests/agents/test_interrupt_service.py
    • Added unit tests for InterruptService.
  • tests/unittests/checkpoints/test_checkpoint_service.py
    • Added unit tests for CheckpointService.
  • tests/unittests/checkpoints/test_checkpoint_tracing.py
    • Added unit tests for checkpoint tracing.
  • tests/unittests/checkpoints/test_checkpoint_utils.py
    • Added unit tests for checkpoint utility functions.
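
The changelog describes the 11_parallel_wait_n example as "proceeding after N of M branches complete". That join strategy can be sketched with plain asyncio as follows. This is an assumption-laden illustration, not the PR's ParallelNodeGroup or JoinStrategy code; `wait_n` is a hypothetical helper.

```python
from __future__ import annotations

import asyncio
from typing import Any, Callable, Coroutine


async def wait_n(
    branches: list[Callable[[], Coroutine[Any, Any, Any]]], n: int
) -> list[Any]:
  """Run all branches concurrently, return the first n results, cancel the rest."""
  if not 1 <= n <= len(branches):
    raise ValueError("n must be between 1 and the number of branches")
  tasks = [asyncio.create_task(branch()) for branch in branches]
  results: list[Any] = []
  try:
    # as_completed yields awaitables in completion order.
    for finished in asyncio.as_completed(tasks):
      results.append(await finished)
      if len(results) == n:
        break
  finally:
    for task in tasks:
      task.cancel()  # no-op for tasks that already finished
    # Wait for cancellations to settle so no task is left pending.
    await asyncio.gather(*tasks, return_exceptions=True)
  return results
```

With n equal to the number of branches this behaves like "wait all", and with n = 1 like "wait any", so one helper covers the three join modes the examples list.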

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is an impressive and comprehensive pull request that introduces the CheckpointService and a powerful GraphAgent with extensive features like conditional routing, parallel execution, interrupts, and checkpointing. The inclusion of detailed design documents, architectural patterns, and a wide array of examples is highly commendable and will be of great value to users. The code is well-structured and shows careful consideration for edge cases, security, and performance.

My review includes a few suggestions for improving maintainability by refactoring some complex methods and correcting minor inconsistencies in documentation and scripts. One notable change is the removal of Context from the public API in google.adk.__init__, which is a breaking change that should probably be highlighted in the PR description for visibility.

"01_basic"
"02_conditional_routing"
"03_cyclic_execution"
"03_enhanced_routing"
Contributor

high

The example name 03_enhanced_routing seems incorrect as the corresponding directory is named 15_enhanced_routing. This will cause the script to fail when trying to run this example. Please update it to 15_enhanced_routing.

Suggested change
- "03_enhanced_routing"
+ "15_enhanced_routing"

Author

Fixed. The run_all_examples.sh numbering has been corrected to match the actual example directory names (01_basic through 15_enhanced_routing).
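
A mismatch like this can be caught before the script runs. The following is a minimal sketch, not part of the PR: `missing_examples` is a hypothetical helper, and it assumes one sub-directory per example under the examples directory.

```python
from __future__ import annotations

from pathlib import Path


def missing_examples(names: list[str], examples_dir: Path) -> list[str]:
  """Return the listed example names that have no matching sub-directory."""
  existing = {p.name for p in examples_dir.iterdir() if p.is_dir()}
  return [name for name in names if name not in existing]
```

Running a check of this kind in CI against the list in run_all_examples.sh would flag an entry such as "03_enhanced_routing" as soon as it diverges from the directory names.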

Comment on lines 76 to 79
"user_content": types.Content(
    role="user",
    parts=[types.Part(text=task_data)]
)
Contributor

medium

In the code example for creating a context for an agent, you are using types.Content and types.Part. It seems the types module from google.genai is used but not imported in this snippet. Please add from google.genai import types to make the example self-contained and runnable.

Author

Acknowledged: types is imported via google.genai.types in all sample files.

# GraphAgent Design Document

**Author**: ADK Team
**Date**: 2026-01-25
Contributor

medium

The date 2026-01-25 is in the future. This might be a placeholder or a typo. To avoid confusion, consider replacing it with a placeholder like YYYY-MM-DD or removing the date line if it's not essential.

# InterruptService Architecture

**Author**: ADK Team
**Date**: 2026-02-04
Contributor

medium

The date 2026-02-04 is in the future. This could be confusing for readers. It might be better to use a placeholder like YYYY-MM-DD or remove the date if it's not critical for the document.

| 12_parallel_checkpointing | ✅ | - | ✅ | - | - | - | Parallel |
| 13_parallel_interrupts | ✅ | - | - | ✅ | - | - | Parallel |
| 14_parallel_rewind | ✅ | ✅ | - | - | - | - | Parallel |
| 15_enhanced_routing | - | - | - | - | - | - | Advanced |
Contributor

medium

There seems to be a mismatch in the example numbering. The feature matrix lists 15_enhanced_routing as 03_enhanced_routing. Please correct the example number to 15_enhanced_routing for consistency with the file structure.

Author

Fixed: numbering corrected in run_all_examples.sh.

Comment on lines +16 to +20
from .graph import END
from .graph import GraphAgent
from .graph import GraphNode
from .graph import GraphState
from .graph import START
Contributor

medium

It's great to see the new GraphAgent components being exported. However, I noticed that Context is no longer exported from this package's __all__ list (and also from the top-level google.adk package). This is a significant breaking change for users who might be importing Context from here. It would be beneficial to mention this change in the pull request description to ensure users are aware of the API update.

Author

Context is imported and exported in __init__.py at lines 16 and 36.
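
A disagreement like this one can be settled mechanically by diffing the package's `__all__` before and after the change. A minimal sketch follows. `removed_exports` is a hypothetical helper, and any lists passed to it would have to be read from the two revisions of the package, which is not shown here.

```python
from __future__ import annotations


def removed_exports(old_all: list[str], new_all: list[str]) -> list[str]:
  """Return names present in the old __all__ but absent from the new one."""
  return sorted(set(old_all) - set(new_all))
```

An empty result means no public name was dropped. Any name that does appear is a candidate breaking change worth calling out in the PR description.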

Comment on lines 1047 to 1757
async def _run_async_impl(
    self, ctx: InvocationContext
) -> AsyncGenerator[Event, None]:
  """Core graph execution logic.

  Executes nodes in graph order, following conditional edges,
  supporting loops and human-in-the-loop interrupts.

  Args:
    ctx: Invocation context

  Yields:
    Events from graph execution

  Raises:
    ValueError: If start node not set or graph structure invalid
  """
  if not self.start_node:
    raise ValueError("Start node not set. Call set_start() first.")

  # Register session with InterruptService if enabled
  if self.interrupt_service:
    self.interrupt_service.register_session(ctx.session.id)

  # Get effective telemetry config for nested graph inheritance
  effective_config = self._get_effective_telemetry_config(ctx)

  with tracer.start_as_current_span(
      f"graph_agent_execution {self.name}"
  ) as span:
    span.set_attribute("graph_agent.name", self.name)
    span.set_attribute("graph_agent.start_node", self.start_node)
    span.set_attribute("graph_agent.max_iterations", self.max_iterations)
    try:
      # Load execution tracking state (BaseAgentState pattern)
      agent_state = (
          self._load_agent_state(ctx, GraphAgentState) or GraphAgentState()
      )

      # Store telemetry config for nested graph inheritance
      if effective_config:
        agent_state.telemetry_config_dict = effective_config.model_dump()

      # Initialize domain data from session state or user input.
      # Exclude graph-internal keys to prevent circular references
      # (state.data["graph_data"] → state.data) and keep domain data clean.
      domain_data = {
          k: v
          for k, v in ctx.session.state.items()
          if not k.startswith("_") and k not in _GRAPH_INTERNAL_KEYS
      }
      if domain_data:
        state = GraphState(data=domain_data)
      else:
        # Extract text from Content object
        user_text = ""
        if (
            hasattr(ctx, "user_content")
            and ctx.user_content
            and ctx.user_content.parts
        ):
          user_text = (
              ctx.user_content.parts[0].text
              if ctx.user_content.parts[0].text
              else ""
          )
        state = GraphState(data={"input": user_text})

      # Track which parallel groups have been executed
      executed_parallel_groups = set(agent_state.executed_parallel_groups)

      # ADK resumability: resume from saved node or start fresh.
      #
      # Design note: SequentialAgent ONLY emits state events when
      # ctx.is_resumable is True, because its state events serve only
      # resumability. GraphAgent's state events serve multiple consumers
      # (rewind, interrupts, telemetry) that are orthogonal to
      # resumability. Therefore:
# - Per-iteration state events: always emitted (multi-consumer)
# - Resume skip: first iteration skipped when resuming (already
# persisted before pause, avoids duplicate)
# - end_of_agent: guarded by is_resumable (purely a resumability
# lifecycle signal, has no other consumers)
# - Interrupt/cancellation state saves: always emitted (they
# serve interrupt functionality, not just resumability)
current_node_name, iteration, resuming = self._get_resume_state(
agent_state
)
pause_invocation = False

while current_node_name and iteration < self.max_iterations:
iteration += 1
current_node = self.nodes[current_node_name]

# Check for immediate cancellation (ESC-like interrupt)
# Allows user to abort execution at any time, not just at pause points
if self.interrupt_service and not self.interrupt_service.is_active(
ctx.session.id
):
logger.info(
"GraphAgent execution cancelled (immediate interrupt) for"
f" session {ctx.session.id}"
)
# Save partial state before cancelling (enables resume/restart)
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
yield Event(
author=self.name,
content=types.Content(
parts=[types.Part(text="⚠️ Execution cancelled by user")]
),
actions=EventActions(
escalate=False,
state_delta={
"graph_cancelled": True,
"graph_cancelled_at_node": current_node_name,
"graph_iteration": iteration,
"graph_data": state.data,
"graph_path": list(agent_state.path),
"graph_can_resume": True,
},
),
)
break # Exit immediately but state is saved

# Track execution path in agent_state
agent_state.path.append(current_node_name)
agent_state.iteration = iteration
agent_state.current_node = current_node_name
agent_state.node_invocations.setdefault(current_node_name, []).append(
ctx.invocation_id
)

# ADK resumability: reset sub-agent states on cycle revisit
# (mirrors LoopAgent pattern at loop_agent.py:114)
if (
agent_state.path.count(current_node_name) > 1
and current_node.agent
):
ctx.reset_sub_agent_states(current_node.agent.name)

# Track agent path for nested graph support
if self.name not in agent_state.agent_path:
agent_state.agent_path.append(self.name)

# Persist execution tracking via agent_state event.
# These events are consumed by rewind, interrupts, and telemetry
# (not just resumability), so they're always emitted.
# Skip only on first iteration when resuming (already persisted).
if not resuming:
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
else:
resuming = False # Only skip first iteration after resume

# Invoke before_node_callback (custom observability)
if self.before_node_callback:
from .callbacks import NodeCallbackContext

callback_ctx = NodeCallbackContext(
node=current_node,
state=state,
iteration=iteration,
invocation_context=ctx,
metadata={
"agent_path": list(agent_state.agent_path),
"path": list(agent_state.path),
},
)

# Execute callback with telemetry
callback_start_time = time.time()
with graph_tracing.tracer.start_as_current_span(
"graph_callback before_node"
) as cb_span:
# Add attributes with additional_attributes support
attrs = self._get_telemetry_attributes(
{
graph_tracing.GRAPH_CALLBACK_TYPE: "before_node",
graph_tracing.GRAPH_AGENT_NAME: self.name,
graph_tracing.GRAPH_NODE_NAME: current_node_name,
},
effective_config=effective_config,
)
for key, value in attrs.items():
cb_span.set_attribute(key, value)

try:
event = await self.before_node_callback(callback_ctx)
if event:
yield event

# Record success (check sampling)
cb_span.set_attribute("graph.callback.success", True)
if self._should_sample(effective_config=effective_config):
callback_latency_ms = (
time.time() - callback_start_time
) * 1000
graph_tracing.record_callback_execution(
callback_type="before_node",
agent_name=self.name,
latency_ms=callback_latency_ms,
success=True,
)

except Exception as e:
# Record failure (check sampling)
cb_span.set_attribute("graph.callback.success", False)
cb_span.set_attribute("graph.callback.error", str(e))
if self._should_sample(effective_config=effective_config):
callback_latency_ms = (
time.time() - callback_start_time
) * 1000
graph_tracing.record_callback_execution(
callback_type="before_node",
agent_name=self.name,
latency_ms=callback_latency_ms,
success=False,
)
logger.error(
"before_node_callback failed for node"
f" '{current_node_name}': {e}",
exc_info=True,
)
# Continue execution despite callback error

# Handle BEFORE-node interrupt (validation timing)
if (
self._should_interrupt_before(current_node_name)
and self.interrupt_service
):
_b_events, _b_ctrl = await self._handle_before_node_interrupt(
current_node_name, current_node, state, ctx, agent_state
)
for _e in _b_events:
yield _e
# Persist agent_state after interrupt handler may have mutated it
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
if _b_ctrl == "break":
break
elif _b_ctrl is not None:
if isinstance(_b_ctrl, tuple):
current_node_name = _b_ctrl[1]
continue

# Check if current node is part of a parallel group
parallel_group_info = self._find_parallel_group(current_node_name)
if parallel_group_info:
group_id, parallel_group = parallel_group_info

# Check if this group has already been executed
if group_id in executed_parallel_groups:
# Group already executed, skip this node
logger.info(
f"Skipping node '{current_node_name}' - already executed as"
f" part of parallel group '{group_id}'"
)
# Route to next node from this node's edges
next_node_name = self._get_next_node_with_telemetry(
current_node, state
)
if next_node_name is None:
if current_node_name in self.end_nodes:
break
else:
raise ValueError(
f"Node {current_node_name} has no outgoing edges and is"
" not an end node"
)
current_node_name = next_node_name
continue

# Execute entire parallel group
logger.info(
f"Executing parallel group '{group_id}' with nodes:"
f" {parallel_group.nodes}"
)

# Execute parallel group with telemetry
parallel_start_time = time.time()
with graph_tracing.tracer.start_as_current_span(
f"parallel_group {group_id}"
) as pg_span:
# Add attributes with additional_attributes support
attrs = self._get_telemetry_attributes(
{
graph_tracing.GRAPH_PARALLEL_NODE_COUNT: len(
parallel_group.nodes
),
graph_tracing.GRAPH_PARALLEL_STRATEGY: (
parallel_group.join_strategy.value
),
graph_tracing.GRAPH_PARALLEL_WAIT_N: (
parallel_group.wait_n
),
graph_tracing.GRAPH_AGENT_NAME: self.name,
},
effective_config=effective_config,
)
for key, value in attrs.items():
pg_span.set_attribute(key, value)

# Collect all events from parallel execution
completed_count = 0
async for event in execute_parallel_group(
parallel_group,
self.nodes,
state,
ctx,
self._execute_node,
):
yield event
# Count completions (rough estimate based on events)
if event.author != self.name:
completed_count = min(
completed_count + 1, len(parallel_group.nodes)
)

# Record parallel group metrics (check sampling)
pg_span.set_attribute(
"graph.parallel.completed_count", completed_count
)
if self._should_sample(effective_config=effective_config):
parallel_latency_ms = (time.time() - parallel_start_time) * 1000
graph_tracing.record_parallel_group_execution(
agent_name=self.name,
node_count=len(parallel_group.nodes),
strategy=parallel_group.join_strategy.value,
latency_ms=parallel_latency_ms,
completed_count=completed_count,
)

# Mark group as executed
executed_parallel_groups.add(group_id)
agent_state.executed_parallel_groups = list(
executed_parallel_groups
)

# After parallel group completes, determine next node
# Use the current node's edges to determine routing
# (all nodes in group should have same outgoing edges)
next_node_name = self._get_next_node_with_telemetry(
current_node, state
)

if next_node_name is None:
# No more nodes after parallel group
if current_node_name in self.end_nodes:
break
else:
raise ValueError(
f"Parallel group '{group_id}' has no outgoing edges and"
f" node '{current_node_name}' is not an end node"
)

current_node_name = next_node_name
continue # Skip individual node execution, continue to next iteration

# Execute node with immediate cancellation support
# Check cancellation while streaming events from node execution
output_holder: Dict[str, Any] = {"output": ""}
try:
async for event in self._execute_node(
current_node,
state,
ctx,
effective_config,
output_holder=output_holder,
iteration=iteration,
):
# Check for immediate cancellation DURING node execution
if (
self.interrupt_service
and not self.interrupt_service.is_active(ctx.session.id)
):
logger.info(
"GraphAgent execution cancelled (immediate interrupt"
f" during node '{current_node_name}') for session"
f" {ctx.session.id}"
)
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
yield Event(
author=self.name,
content=types.Content(
parts=[
types.Part(
text=(
"⚠️ Execution cancelled during node"
f" '{current_node_name}'"
)
)
]
),
actions=EventActions(
escalate=False,
state_delta={
"graph_cancelled": True,
"graph_cancelled_at_node": current_node_name,
"graph_data": state.data,
"graph_partial_output": output_holder["output"],
"graph_can_resume": True,
},
),
)
return
yield event
except asyncio.CancelledError:
# Task cancelled externally (e.g., timeout, user abort)
logger.info(
f"GraphAgent task cancelled during node '{current_node_name}'"
f" for session {ctx.session.id}"
)
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
yield Event(
author=self.name,
content=types.Content(
parts=[
types.Part(
text=(
"⚠️ Task cancelled during node"
f" '{current_node_name}'"
)
)
]
),
actions=EventActions(
escalate=False,
state_delta={
"graph_task_cancelled": True,
"graph_cancelled_at_node": current_node_name,
"graph_data": state.data,
"graph_partial_output": output_holder["output"],
"graph_can_resume": True,
},
),
)
raise

# ADK resumability: check if node execution was paused
if output_holder.get("pause"):
pause_invocation = True
return

# Sync session state into GraphState.data FIRST so that
# output_mapper receives the most up-to-date values and can
# override them. Agents write routing signals via state_delta
# (the ADK-standard pattern); this sync makes those values
# visible to edge condition lambdas (which receive GraphState)
# without requiring an explicit output_mapper.
# Internal keys (prefix '_') are excluded.
for _sk, _sv in ctx.session.state.items():
if not _sk.startswith("_") and _sk not in _GRAPH_INTERNAL_KEYS:
state.data[_sk] = _sv

# Update state with node output (output_mapper runs AFTER
# session sync, so it can override synced values when needed)
output = output_holder["output"]
if output:
# Track state before reduction for telemetry
had_previous_value = current_node.name in state.data
reducer_start = time.time()

# Apply output mapper with reducer
prev_state = state
state = current_node.output_mapper(output, state)
if state is None:
# Custom output_mapper mutated in-place but forgot to return
state = prev_state

# Record state reducer telemetry (check sampling)
reducer_latency_ms = (time.time() - reducer_start) * 1000
if self._should_sample(effective_config=effective_config):
graph_tracing.record_state_reducer(
node_name=current_node.name,
reducer_type=current_node.reducer.name,
state_key=current_node.name,
agent_name=self.name,
latency_ms=reducer_latency_ms,
had_previous_value=had_previous_value,
)

# Record output mapper telemetry
is_default_mapper = (
current_node.output_mapper.__name__
== "_default_output_mapper"
)
graph_tracing.record_mapper(
node_name=current_node.name,
mapper_type="output",
agent_name=self.name,
latency_ms=reducer_latency_ms,
is_default=is_default_mapper,
)

# Invoke after_node_callback (custom observability)
if self.after_node_callback:
from .callbacks import NodeCallbackContext

callback_ctx = NodeCallbackContext(
node=current_node,
state=state,
iteration=iteration,
invocation_context=ctx,
metadata={
"output": output,
"agent_path": list(agent_state.agent_path),
"path": list(agent_state.path),
},
)

# Execute callback with telemetry
callback_start_time = time.time()
with graph_tracing.tracer.start_as_current_span(
"graph_callback after_node"
) as cb_span:
# Add attributes with additional_attributes support
attrs = self._get_telemetry_attributes(
{
graph_tracing.GRAPH_CALLBACK_TYPE: "after_node",
graph_tracing.GRAPH_AGENT_NAME: self.name,
graph_tracing.GRAPH_NODE_NAME: current_node_name,
},
effective_config=effective_config,
)
for key, value in attrs.items():
cb_span.set_attribute(key, value)

try:
event = await self.after_node_callback(callback_ctx)
if event:
yield event

# Record success (check sampling)
cb_span.set_attribute("graph.callback.success", True)
if self._should_sample(effective_config=effective_config):
callback_latency_ms = (
time.time() - callback_start_time
) * 1000
graph_tracing.record_callback_execution(
callback_type="after_node",
agent_name=self.name,
latency_ms=callback_latency_ms,
success=True,
)

except Exception as e:
# Record failure (check sampling)
cb_span.set_attribute("graph.callback.success", False)
cb_span.set_attribute("graph.callback.error", str(e))
if self._should_sample(effective_config=effective_config):
callback_latency_ms = (
time.time() - callback_start_time
) * 1000
graph_tracing.record_callback_execution(
callback_type="after_node",
agent_name=self.name,
latency_ms=callback_latency_ms,
success=False,
)
logger.error(
"after_node_callback failed for node"
f" '{current_node_name}': {e}",
exc_info=True,
)
# Continue execution despite callback error

# Emit graph metadata event for evaluation framework
# This will be captured in Invocation.intermediate_data by EvaluationGenerator
# Set partial=True so is_final_response() returns False (making it an intermediate event)
graph_metadata = {
"graph_node": current_node_name,
"graph_iteration": iteration,
"graph_path": list(agent_state.path),
"node_invocations": {
name: len(invocs)
for name, invocs in agent_state.node_invocations.items()
},
"graph_state": dict(state.data),
}
yield Event(
author=f"{self.name}#metadata",
content=types.Content(
parts=[types.Part(text=f"[GraphMetadata] {graph_metadata}")]
),
partial=True, # Mark as intermediate event
)

# Handle AFTER-node interrupt (retrospective feedback timing)
# This enables retrospective feedback: observe past, steer future
if (
self._should_interrupt_after(current_node_name)
and self.interrupt_service
):
_a_events, _a_ctrl = await self._handle_after_node_interrupt(
current_node_name, state, ctx, agent_state
)
for _e in _a_events:
yield _e
# Persist agent_state after interrupt handler may have mutated it
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
if _a_ctrl == "break":
break
elif _a_ctrl is not None:
if isinstance(_a_ctrl, tuple):
current_node_name = _a_ctrl[1]
continue

# Checkpointing - yield event with state_delta to persist checkpoint
# Note: For full checkpoint/resume functionality, use CheckpointCallback
if self.checkpointing:
ctx.set_agent_state(self.name, agent_state=agent_state)
yield self._create_agent_state_event(ctx)
yield Event(
author=self.name,
content=types.Content(
parts=[types.Part(text=f"Checkpoint: {current_node_name}")]
),
actions=EventActions(
state_delta={
"graph_data": state.data,
"graph_checkpoint": {
"node": current_node_name,
"iteration": iteration,
},
}
),
)

# Inject transient execution data for edge conditions
state.data["_graph_iteration"] = agent_state.iteration
state.data["_graph_path"] = list(agent_state.path)
state.data["_conditions"] = dict(agent_state.conditions)

# Get next node via conditional routing
next_node_name = self._get_next_node_with_telemetry(
current_node, state
)

# Clean up transient keys
for _tk in ("_graph_iteration", "_graph_path", "_conditions"):
state.data.pop(_tk, None)
if next_node_name is None:
# No more edges - check if we're at an end node
if current_node_name in self.end_nodes:
break
else:
# Not at an end node and no edges - error
raise ValueError(
f"Node {current_node_name} has no outgoing edges and is not"
" an end node"
)

current_node_name = next_node_name

# Record iteration metrics (check sampling)
if self._should_sample(effective_config=effective_config):
graph_tracing.record_graph_iteration(
agent_name=self.name,
iteration=iteration,
path_length=len(agent_state.path),
)

# ADK resumability: skip final response + end_of_agent when paused
if not pause_invocation:
# Final response - yield event with graph metadata
# Include last node's output ONLY if:
# 1. explicit final_output is set, OR
# 2. last node was a function (doesn't yield events, so we need to show output)
# Don't include output for agent nodes (they already yielded their output)
final_output = state.data.get("final_output", "")

# If no explicit final_output, check if last node was a function
if not final_output and current_node_name:
last_node = self.nodes.get(current_node_name)
if last_node and last_node.function:
# Function node - include its output
final_output = state.data.get(current_node_name, "")

response_text = f"{final_output}"

yield Event(
author=self.name,
content=types.Content(parts=[types.Part(text=response_text)]),
actions=EventActions(
state_delta={
"graph_data": state.data,
"graph_iterations": iteration,
"graph_path": list(agent_state.path),
}
),
)
# end_of_agent is guarded by is_resumable because it is purely a
# resumability lifecycle signal (tells the runner "this agent is
# done, don't re-run it on resume"). Unlike per-iteration state
# events which serve rewind/interrupts/telemetry, end_of_agent
# has no other consumers.
if ctx.is_resumable:
ctx.set_agent_state(self.name, end_of_agent=True)
yield self._create_agent_state_event(ctx)

finally:
# Unregister session from InterruptService and finalize tracing
if self.interrupt_service:
self.interrupt_service.unregister_session(ctx.session.id)
span.set_attribute("graph_agent.completed", True)

# Interrupt methods inherited from GraphInterruptMixin
Contributor

medium

The _run_async_impl method is quite long and complex, handling various aspects like state initialization, iteration, interrupts, callbacks, checkpointing, and parallel execution. For better readability and maintainability, consider refactoring this method by extracting some of its logic into smaller, more focused helper methods. For instance, the callback invocation logic, which is similar for before_node_callback and after_node_callback, could be a good candidate for a helper function.

Author

Fixed. Extracted 4 helper methods from _run_async_impl: _execute_callback, _sync_state_and_reduce, _build_cancellation_events, and _execute_parallel_phase. Method reduced from ~541 to ~395 lines. The helpers encapsulate distinct concerns: callback execution, state synchronization, cancellation event building, and parallel phase orchestration.
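For readers following the refactor, the shared callback logic could look roughly like the sketch below. It is not the PR's `_execute_callback`; `run_node_callback` and the `record` parameter are hypothetical, tracing spans are omitted, and unlike the quoted code it records metrics unconditionally instead of checking a sampling decision first.

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def run_node_callback(callback, callback_ctx, callback_type, record):
  """Await a node callback, record latency and outcome, and swallow errors.

  Returns the event produced by the callback (or None) so the caller can
  yield it from its own async generator.
  """
  start = time.perf_counter()
  event = None
  success = True
  try:
    event = await callback(callback_ctx)
  except Exception as e:
    # asyncio.CancelledError derives from BaseException, so it still propagates.
    success = False
    logger.error("%s callback failed: %s", callback_type, e, exc_info=True)
  record(
      callback_type=callback_type,
      latency_ms=(time.perf_counter() - start) * 1000,
      success=success,
  )
  return event


async def _demo():
  records = []

  async def ok(ctx):
    return f"event for {ctx}"

  async def boom(ctx):
    raise RuntimeError("callback bug")

  def record(**fields):
    records.append(fields)

  before = await run_node_callback(ok, "node_a", "before_node", record)
  after = await run_node_callback(boom, "node_a", "after_node", record)
  return before, after, records


before_event, after_event, demo_records = asyncio.run(_demo())
```

Returning the event instead of yielding it keeps the helper a plain coroutine, so the before and after call sites each shrink to one `await` plus an optional `yield`.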

Comment on lines 97 to 267
  async def _handle_before_node_interrupt(
      self,
      current_node_name: str,
      current_node: GraphNode,
      state: GraphState,
      ctx: InvocationContext,
      agent_state: GraphAgentState,
  ) -> Tuple[List[Event], str | Tuple[str, str] | None]:
    """Handle a BEFORE-node interrupt and return events + routing control.

    Args:
      current_node_name: Name of the node about to execute.
      current_node: The GraphNode about to execute (needed for "skip").
      state: Current graph state.
      ctx: Invocation context.
      agent_state: Execution tracking state.

    Returns:
      Tuple of (events_to_yield, control) where control is:
        - None: proceed to normal node execution.
        - "rerun": re-run current node (continue the loop).
        - "break": exit the main loop immediately.
        - ("go_back", target_node): jump to target_node.
        - ("skip", next_node | None): skip node, route to next_node.
    """
    assert self.interrupt_service is not None
    interrupt_message = await self._check_interrupt_with_telemetry(
        ctx.session.id, "before"
    )
    if not interrupt_message:
      return [], None

    action_result = await self._process_interrupt_message(
        interrupt_message, state, current_node_name, ctx, agent_state
    )

    should_escalate = (
        action_result == "pause"
        if isinstance(action_result, str)
        else (isinstance(action_result, tuple) and action_result[0] == "pause")
    )

    event = Event(
        author=self.name,
        content=types.Content(
            parts=[
                types.Part(
                    text=(
                        "\U0001f6d1 INTERRUPT (BEFORE):"
                        f" {interrupt_message.text}"
                    )
                )
            ]
        ),
        actions=EventActions(
            escalate=should_escalate,
            state_delta={
                "interrupt_message": interrupt_message.text,
                "interrupt_timing": "before",
                "interrupt_node": current_node_name,
            },
        ),
    )

    if isinstance(action_result, tuple):
      action, target_node = action_result
      if action == "go_back":
        return [event], ("go_back", target_node)
    elif action_result == "rerun":
      return [event], "rerun"
    elif action_result == "skip":
      next_node_name = self._get_next_node_with_telemetry(current_node, state)  # type: ignore[attr-defined]
      return (
          [event],
          ("skip", next_node_name) if next_node_name else "break",
      )
    elif action_result == "pause":
      try:
        resumed = await self.interrupt_service.wait_if_paused(ctx.session.id)
        if not resumed:
          return [event], "break"
      except TimeoutError:
        return [event], "break"

    return [event], None

  async def _handle_after_node_interrupt(
      self,
      current_node_name: str,
      state: GraphState,
      ctx: InvocationContext,
      agent_state: GraphAgentState,
  ) -> Tuple[List[Event], str | Tuple[str, str] | None]:
    """Handle an AFTER-node interrupt and return events + routing control.

    Args:
      current_node_name: Name of the node that just executed.
      state: Current graph state (includes the node's output).
      ctx: Invocation context.
      agent_state: Execution tracking state.

    Returns:
      Tuple of (events_to_yield, control) where control is:
        - None: accept results and proceed to next node.
        - "rerun": re-run current node.
        - "break": exit the main loop.
        - ("go_back", target_node): jump to target_node.
    """
    assert self.interrupt_service is not None
    interrupt_message = await self._check_interrupt_with_telemetry(
        ctx.session.id, "after"
    )
    if not interrupt_message:
      return [], None

    action_result = await self._process_interrupt_message(
        interrupt_message, state, current_node_name, ctx, agent_state
    )

    should_escalate = (
        action_result == "pause"
        if isinstance(action_result, str)
        else (isinstance(action_result, tuple) and action_result[0] == "pause")
    )

    state_delta_dict: Dict[str, Any] = {
        "interrupt_message": interrupt_message.text,
        "interrupt_timing": "after",
        "interrupt_metadata": interrupt_message.metadata,
        "interrupt_action": interrupt_message.action,
        "interrupt_node": current_node_name,
    }

    event = Event(
        author=self.name,
        content=types.Content(
            parts=[
                types.Part(
                    text=(
                        "\U0001f6d1 INTERRUPT (AFTER):"
                        f" {interrupt_message.text}"
                    )
                )
            ]
        ),
        actions=EventActions(
            escalate=should_escalate, state_delta=state_delta_dict
        ),
    )

    if isinstance(action_result, tuple):
      action, target_node = action_result
      if action == "go_back":
        return [event], ("go_back", target_node)
    elif action_result == "rerun":
      return [event], "rerun"
    elif action_result == "pause":
      try:
        resumed = await self.interrupt_service.wait_if_paused(ctx.session.id)
        if not resumed:
          logger.info(
              "GraphAgent execution cancelled for session %s",
              ctx.session.id,
          )
          return [event], "break"
      except TimeoutError as e:
        logger.warning("Interrupt wait timeout: %s", e)
        return [event], "break"

    return [event], None

Contributor

medium

The methods _handle_before_node_interrupt and _handle_after_node_interrupt share a significant amount of logic. They both check for an interrupt message, process it, create an event, and then handle the action result. To reduce code duplication and improve maintainability, you could refactor this into a single helper method that takes the interrupt mode ('before' or 'after') as a parameter and handles the common logic, while keeping the mode-specific differences (like the skip action) inside conditional blocks.

Author

Already refactored: both methods delegate to _check_interrupt_with_telemetry, _process_interrupt_message, and _dispatch_interrupt_action. Remaining differences are intentional (BEFORE needs current_node for skip, AFTER includes metadata fields).
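The action-to-control mapping that both handlers share can be summarized as a pure function. The sketch below mirrors the branches visible in the quoted hunk; it is not the PR's `_dispatch_interrupt_action`, and the `timing`, `next_node`, and `resumed` parameters are assumptions made so the example runs without an interrupt service (the real code awaits `wait_if_paused` and treats a timeout like a failed resume).

```python
from typing import Optional, Tuple, Union

Control = Union[None, str, Tuple[str, str]]


def dispatch_interrupt_action(
    action_result,
    *,
    timing: str,
    next_node: Optional[str] = None,
    resumed: bool = True,
) -> Control:
  """Map a processed interrupt action to a loop-control value."""
  if isinstance(action_result, tuple):
    action, target_node = action_result
    if action == "go_back":
      return ("go_back", target_node)
    return None  # Other tuple actions fall through to normal execution.
  if action_result == "rerun":
    return "rerun"
  # Only BEFORE-node interrupts can skip: the node has not run yet.
  if action_result == "skip" and timing == "before":
    return ("skip", next_node) if next_node else "break"
  # A pause that is not resumed (cancelled or timed out) ends the loop.
  if action_result == "pause" and not resumed:
    return "break"
  return None


print(dispatch_interrupt_action(("go_back", "draft"), timing="after"))
print(dispatch_interrupt_action("skip", timing="before", next_node="review"))
print(dispatch_interrupt_action("pause", timing="after", resumed=False))
```

Keeping the mapping free of I/O makes the timing-specific difference (only BEFORE handles "skip") easy to unit test.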

@drahnreb force-pushed the feat/graph-agent-pr5 branch 5 times, most recently from 3282d2e to 5c2bdb5 on February 22, 2026 16:34
Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive suite of new examples and documentation for the GraphAgent framework, showcasing advanced features such as checkpointing, LLM-based interrupt reasoning, callback-based observability, flexible interrupt timings, immediate cancellation, dynamic task queues, agent-driven topology, Human-In-The-Loop (HITL) workflows, multi-agent coordination, and various parallel execution strategies (WAIT_ALL, WAIT_ANY, WAIT_N) with rewind integration. The changes include new READMEs, Python agent scripts, and YAML configurations for each example, demonstrating how to build complex, adaptive, and fault-tolerant agentic workflows.

A review comment highlights a NameError in the scenario_5_all_interrupt_timings function within contributing/samples/graph_agent_advanced/agent.py, noting that paper_writer_agent and peer_reviewer_agent are not properly scoped and suggesting they be re-instantiated within the scenario function to resolve the issue.

Comment on lines +598 to +599
graph_before.add_node("write", agent=paper_writer_agent)
graph_before.add_node("peer_review", agent=peer_reviewer_agent)
Contributor

high

The agents paper_writer_agent and peer_reviewer_agent are not defined within the scope of the scenario_5_all_interrupt_timings function. They are created inside build_research_workflow but are not accessible here, which will lead to a NameError when this scenario is executed.

To fix this, you should create new instances of these agents within the scenario, similar to how other scenarios are structured.

Suggested change
- graph_before.add_node("write", agent=paper_writer_agent)
- graph_before.add_node("peer_review", agent=peer_reviewer_agent)
+ # Create fresh agents for this scenario to avoid parent conflicts.
+ (_, _, _, paper_writer_agent, peer_reviewer_agent) = _create_research_agents()
+ # Add simple nodes
+ graph_before.add_node("write", agent=paper_writer_agent)
+ graph_before.add_node("peer_review", agent=peer_reviewer_agent)

Author

Fixed: scenario_5 now calls _create_research_agents() to get its own scoped references.
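The scoping problem and its fix can be reproduced in isolation. The sketch below uses hypothetical stand-ins (`StubAgent`, `create_stub_agents`) rather than the sample's real agents, and only illustrates why a name bound inside one function is invisible to another.

```python
from dataclasses import dataclass


@dataclass
class StubAgent:
  """Hypothetical stand-in for an agent object; not an ADK class."""

  name: str


def create_stub_agents():
  """Factory returning fresh agent instances on every call."""
  return StubAgent("paper_writer"), StubAgent("peer_reviewer")


def build_workflow():
  # These names are local to build_workflow and vanish when it returns.
  stub_writer, stub_reviewer = create_stub_agents()
  return [stub_writer, stub_reviewer]


def broken_scenario():
  # stub_writer is not defined in this scope, so calling this raises NameError.
  return [stub_writer]  # noqa: F821


def fixed_scenario():
  # The scenario asks the factory for its own instances.
  stub_writer, stub_reviewer = create_stub_agents()
  return [stub_writer, stub_reviewer]
```

Because the factory builds new objects on each call, the fixed scenario also avoids sharing one agent instance between two graphs, which is the "parent conflicts" concern mentioned in the suggested change.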

@drahnreb force-pushed the feat/graph-agent-pr5 branch 3 times, most recently from aa73841 to 7eee735 on February 22, 2026 23:43
@drahnreb
Author

Addressing review feedback

Force-pushed with the following fixes:

High:

  • NameError in scenario_5: scenario_5_all_interrupt_timings now calls _create_research_agents() to get its own scoped agent references.
  • Stale session fallback: Agent-driven checkpoint and topology samples use explicit None check with warning.

Medium:

  • Sample output accumulation: final += text / result += ... in 4 sample files.
  • Interrupt handler duplication: Already refactored — both methods delegate to shared helpers (_check_interrupt_with_telemetry, _process_interrupt_message, _dispatch_interrupt_action). Remaining differences are intentional.
  • Context export: Verified present in __init__.py.

@drahnreb
Author

@gemini-code-assist please re-review

Contributor

gemini-code-assist bot left a comment

Code Review

This is an impressive pull request that introduces a powerful GraphAgent for complex workflow orchestration and a CheckpointService for state persistence. The implementation is robust, with thoughtful handling of concurrency, state management, and error conditions. The extensive set of new samples and documentation is particularly valuable for understanding and adopting these new features. My review includes a minor code cleanup suggestion and a documentation enhancement to make the new codebase even more discoverable.

- `src/google/adk/agents/graph/graph_agent.py` - Core orchestration
- `src/google/adk/agents/graph/interrupt_reasoner.py` - LLM reasoning
- `src/google/adk/agents/graph/callbacks.py` - Callback infrastructure
- `src/google/adk/agents/graph/interrupt_service.py` - Interrupt management
Contributor

medium

The list of code to explore is very helpful! Since this example heavily features checkpointing, it would be beneficial to also include a reference to the new CheckpointService here. For example:

- `src/google/adk/checkpoints/checkpoint_service.py` - Checkpoint management

Author

Fixed: added `src/google/adk/checkpoints/checkpoint_service.py` - Checkpoint management to the list.

review = state.data.get("peer_review", {})
# LLM agents store output as JSON string; parse if needed
if isinstance(review, str):
  import json
Copy link
Contributor

medium

This import json statement is redundant as json is already imported at the top of the file (line 16). It's also a good practice to keep all imports at the top of the file for better readability and to avoid potential side effects, as recommended by PEP 8.

Author

Fixed — removed duplicate import json (already imported at line 16).
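
For reference, the pattern the reviewer describes can be sketched as below: `json` is imported once at module level and the parsing step lives in a small helper. The helper name `load_peer_review` and the fallback to an empty dict on malformed JSON are assumptions for illustration, not code from the PR.

```python
import json  # single module-level import, as PEP 8 recommends


def load_peer_review(data: dict) -> dict:
  """Return the peer review as a dict, parsing it if it is a JSON string."""
  review = data.get("peer_review", {})
  # LLM agents may store output as a JSON string; parse if needed.
  if isinstance(review, str):
    try:
      review = json.loads(review)
    except json.JSONDecodeError:
      # Assumption: treat unparseable output as an empty review.
      review = {}
  # Guard against valid JSON that is not an object (e.g. a list or number).
  return review if isinstance(review, dict) else {}
```

Keeping the import at the top also means an import failure surfaces at module load rather than in the middle of a graph run.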

Add GraphAgent for building directed-graph workflows with conditional
routing, cyclic execution, state management with reducers, typed events,
streaming, callbacks, rewind, resumability, telemetry with OpenTelemetry
tracing, evaluation metrics, and CLI graph visualization for GraphAgent
topologies. Includes samples and design documentation.

Add DynamicNode (runtime agent selection), NestedGraphNode (hierarchical
workflow composition), and DynamicParallelGroup (variable-count concurrent
execution). Extends CLI visualization with pattern-aware rendering
(diamond, parallelogram, sub-cluster shapes). Includes pattern samples,
node type reference, and design documentation.

Add stateless CheckpointService using SessionService and ArtifactService
primitives with delta compression, concurrent session locking, and
configurable retention. Includes CheckpointableMixin for any agent type,
GraphCheckpointCallback for node-level checkpointing, and comprehensive
error model with telemetry. Includes checkpoint samples.
@drahnreb drahnreb force-pushed the feat/graph-agent-pr5 branch from 7eee735 to 1c03f63 Compare February 23, 2026 13:38
@drahnreb
Author

@gemini-code-assist please re-review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the CheckpointService and associated components to enable state persistence, delta compression, and human-in-the-loop interrupts for GraphAgent workflows. The implementation is comprehensive, including a stateless service, node-level callbacks, and extensive telemetry. My review identified a few areas for improvement, primarily concerning the use of internal asyncio.Queue attributes, performance overhead in state size validation, and potential edge cases in state delta calculation. Addressing these will enhance the robustness and performance of the persistence layer.

"data": state.data,
}
try:
  result = eval(code, {"__builtins__": safe_builtins}, namespace) # noqa: S307
Contributor

security-high high

The use of eval() for edge conditions, even with AST validation, remains a potential security risk if the sandbox is bypassed. While the current whitelist is restrictive, ensure that any future additions to _SAFE_NAMES or _SAFE_BUILTINS are carefully vetted.
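
To make the "AST validation plus whitelist" idea concrete, here is a minimal sketch of that general technique. It is not the PR's validator: `SAFE_BUILTINS`, `ALLOWED_NODES`, and `safe_eval_condition` are hypothetical names, the whitelist contents are illustrative, and the sketch assumes Python 3.9+ (where subscripts no longer wrap their index in `ast.Index`).

```python
import ast

# Hypothetical whitelists; the PR's _SAFE_NAMES / _SAFE_BUILTINS may differ.
SAFE_BUILTINS = {"len": len, "min": min, "max": max}
ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.And, ast.Or, ast.UnaryOp, ast.Not,
    ast.Compare, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
    ast.In, ast.NotIn, ast.Name, ast.Load, ast.Constant, ast.Subscript,
    ast.Call,
)


def safe_eval_condition(code: str, data: dict) -> bool:
  """Evaluate an edge condition after rejecting any non-whitelisted syntax."""
  tree = ast.parse(code, mode="eval")
  allowed_names = set(SAFE_BUILTINS) | {"data"}
  for node in ast.walk(tree):
    # Attribute access, lambdas, comprehensions, etc. are not in the list.
    if not isinstance(node, ALLOWED_NODES):
      raise ValueError(f"disallowed syntax: {type(node).__name__}")
    if isinstance(node, ast.Name) and node.id not in allowed_names:
      raise ValueError(f"disallowed name: {node.id}")
    # Only direct calls to whitelisted builtins are permitted.
    if isinstance(node, ast.Call) and not (
        isinstance(node.func, ast.Name) and node.func.id in SAFE_BUILTINS
    ):
      raise ValueError("only whitelisted builtins may be called")
  compiled = compile(tree, "<condition>", "eval")
  namespace = {**SAFE_BUILTINS, "data": data}
  return bool(eval(compiled, {"__builtins__": {}}, namespace))
```

Because attribute access is rejected outright, the usual escape route through dunder attributes (for example `data.__class__`) never reaches `eval`. Any name or node type added to the whitelists widens that surface, which is the vetting concern raised above.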

# asyncio.Queue stores items in _queue (a collections.deque).
# This avoids drain/requeue and eliminates QueueFull risk.
queue = self._message_queues[session_id]
messages = list(queue._queue) # type: ignore[attr-defined]
Contributor

medium

Accessing the internal _queue attribute of asyncio.Queue is discouraged as it is an implementation detail and may change in future Python versions. While it provides a non-destructive peek, it's better to use a more robust approach if possible, or at least document this dependency clearly.
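
One possible "more robust approach" is a small queue type that owns its storage and exposes a public snapshot method, so nothing depends on `asyncio.Queue` internals. The sketch below is illustrative only: `PeekableQueue` and `snapshot` are hypothetical names, the queue is unbounded, and it omits `task_done`/`join`.

```python
import asyncio
import collections
from typing import Any, Deque, List


class PeekableQueue:
  """Unbounded async FIFO with a public, non-destructive snapshot()."""

  def __init__(self) -> None:
    self._items: Deque[Any] = collections.deque()
    self._not_empty = asyncio.Condition()

  async def put(self, item: Any) -> None:
    async with self._not_empty:
      self._items.append(item)
      self._not_empty.notify()  # wake one waiting consumer

  async def get(self) -> Any:
    async with self._not_empty:
      # Block until at least one item is available.
      await self._not_empty.wait_for(lambda: len(self._items) > 0)
      return self._items.popleft()

  def snapshot(self) -> List[Any]:
    """Return a copy of the pending items without consuming them."""
    return list(self._items)


async def _demo():
  queue = PeekableQueue()
  await queue.put("msg-1")
  await queue.put("msg-2")
  pending = queue.snapshot()  # non-destructive peek
  first = await queue.get()   # normal consumption still works
  return pending, first, queue.snapshot()


demo_result = asyncio.run(_demo())
```

The trade-off is owning a little more code in exchange for not relying on a private attribute. If the direct `_queue` access is kept instead, a comment documenting the dependency (as the reviewer suggests) is the lighter-weight option.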

Comment on lines +324 to +326
state_size = len(
    json.dumps(state_snapshot, cls=PydanticJSONEncoder).encode("utf-8")
)
Contributor

medium

Calculating the state size by performing a full json.dumps on every checkpoint creation can introduce significant performance overhead, especially for large states or high-frequency checkpointing. Consider a more efficient way to estimate size or make this validation optional.
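
A sketch of one way to act on this suggestion: make the limit optional and stop encoding as soon as it is exceeded, using the standard library's `JSONEncoder.iterencode`. The function name `exceeds_size_limit` and the `max_bytes` parameter are hypothetical; the PR's `PydanticJSONEncoder` could presumably be passed as `encoder_cls`, but that is an assumption.

```python
import json
from typing import Any, Optional, Type


def exceeds_size_limit(
    state: Any,
    max_bytes: Optional[int],
    encoder_cls: Type[json.JSONEncoder] = json.JSONEncoder,
) -> bool:
  """Return True if the JSON encoding of state is larger than max_bytes."""
  if max_bytes is None:
    return False  # validation disabled: skip serialization entirely
  total = 0
  # iterencode yields the encoding in chunks, so we never build the full
  # string and can stop as soon as the limit is crossed.
  for chunk in encoder_cls().iterencode(state):
    total += len(chunk.encode("utf-8"))
    if total > max_bytes:
      return True
  return False
```

This bounds memory use and the work done for oversized states. For states under the limit it still encodes everything, and in CPython the chunked path may be slower than a one-shot `json.dumps`, so measuring before adopting it would be prudent.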

Comment on lines +1738 to +1745
delta = {}
for _k, _v in state.data.items():
  if (
      not _k.startswith("_")
      and _k not in _GRAPH_INTERNAL_KEYS
      and ctx.session.state.get(_k) != _v
  ):
    delta[_k] = _v
Contributor

medium

The delta calculation for state_delta events relies on a direct equality check (ctx.session.state.get(_k) != _v). This may be inefficient for large nested dictionaries or lists, and might not correctly detect changes in mutable objects if they are modified in-place. Consider using a more robust deep comparison or ensuring that state updates always use new object instances.
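
The in-place mutation concern can be reproduced in isolation. In the sketch below, `compute_delta` is a hypothetical stand-alone version of the loop under review, and plain dicts stand in for `state.data` and `ctx.session.state`. When both sides hold the same list object, an in-place append changes both and the `!=` check sees no difference; comparing against a deep-copied baseline detects it.

```python
import copy


def compute_delta(previous: dict, current: dict, internal_keys=frozenset()) -> dict:
  """Keys in current whose values differ from previous, skipping internals."""
  return {
      key: value
      for key, value in current.items()
      if not key.startswith("_")
      and key not in internal_keys
      and previous.get(key) != value
  }


# Pitfall: both states reference the same list, so mutation is invisible.
items = ["a"]
session_state = {"items": items}
graph_state = {"items": items, "_internal": 1}
graph_state["items"].append("b")
missed = compute_delta(session_state, graph_state)  # {} (change not detected)

# Mitigation: keep a deep-copied baseline taken before the node runs.
live_state = {"items": ["a"]}
baseline = copy.deepcopy(live_state)
live_state["items"].append("b")
detected = compute_delta(baseline, live_state)  # {"items": ["a", "b"]}
```

The deep copy has its own cost for large states, so the alternative the reviewer mentions, always writing new object instances on update, avoids both the aliasing problem and the copy.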
