feat: Add CheckpointService for agent state persistence #4586
drahnreb wants to merge 5 commits into google:main from
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes
Hello @drahnreb, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness and manageability of long-running agent workflows by introducing a dedicated `CheckpointService`.
Highlights
Code Review
This is an impressive and comprehensive pull request that introduces the CheckpointService and a powerful GraphAgent with extensive features like conditional routing, parallel execution, interrupts, and checkpointing. The inclusion of detailed design documents, architectural patterns, and a wide array of examples is highly commendable and will be of great value to users. The code is well-structured and shows careful consideration for edge cases, security, and performance.
My review includes a few suggestions for improving maintainability by refactoring some complex methods and correcting minor inconsistencies in documentation and scripts. One notable change is the removal of Context from the public API in google.adk.__init__, which is a breaking change that should probably be highlighted in the PR description for visibility.
```shell
"01_basic"
"02_conditional_routing"
"03_cyclic_execution"
"03_enhanced_routing"
```
Fixed. The run_all_examples.sh numbering has been corrected to match the actual example directory names (01_basic through 15_enhanced_routing).
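A hedged alternative worth considering (this is a sketch, not the PR's actual `run_all_examples.sh`): deriving the example list from the directory layout instead of hard-coding names makes this class of numbering drift impossible. The `list_examples` function and the `NN_name` glob are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the PR's run_all_examples.sh: derive the example
# list from the directory layout so a hard-coded array can't drift out of
# sync with the folders (the numbering bug flagged above).
set -euo pipefail

list_examples() {
  local root="$1"
  # Examples follow an NN_name convention (01_basic ... 15_enhanced_routing).
  find "$root" -maxdepth 1 -mindepth 1 -type d -name '[0-9][0-9]_*' \
      -printf '%f\n' | sort
}

# Demo against a throwaway layout.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/01_basic" "$demo_dir/02_conditional_routing"
list_examples "$demo_dir"
rm -rf "$demo_dir"
```

Sorting on the numeric prefix keeps execution order stable without maintaining a parallel list in the script.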
```python
"user_content": types.Content(
    role="user",
    parts=[types.Part(text=task_data)]
)
```
Acknowledged: types is imported via google.genai.types in all sample files.
```markdown
# GraphAgent Design Document

**Author**: ADK Team
**Date**: 2026-01-25
```
```markdown
# InterruptService Architecture

**Author**: ADK Team
**Date**: 2026-02-04
```
| 12_parallel_checkpointing | ✅ | - | ✅ | - | - | - | Parallel |
| 13_parallel_interrupts | ✅ | - | - | ✅ | - | - | Parallel |
| 14_parallel_rewind | ✅ | ✅ | - | - | - | - | Parallel |
| 15_enhanced_routing | - | - | - | - | - | - | Advanced |
Fixed: numbering corrected in run_all_examples.sh.
```python
from .graph import END
from .graph import GraphAgent
from .graph import GraphNode
from .graph import GraphState
from .graph import START
```
It's great to see the new GraphAgent components being exported. However, I noticed that Context is no longer exported from this package's __all__ list (and also from the top-level google.adk package). This is a significant breaking change for users who might be importing Context from here. It would be beneficial to mention this change in the pull request description to ensure users are aware of the API update.
Context is imported and exported in `__init__.py` at lines 16 and 36.
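Since this thread is about a public symbol silently disappearing from `__all__`, a small regression guard can catch it in CI. This is a hypothetical sketch, not ADK code: `missing_exports` and the stand-in module are assumptions, used here in place of the real `google.adk` packages.

```python
import types as _types

# Hypothetical sketch (not ADK code): a guard that reports public symbols
# such as Context that are missing from a package's __all__ or namespace.
def missing_exports(module, expected):
  """Return the expected public names absent from the module or its __all__."""
  exported = set(getattr(module, "__all__", ()))
  return {
      name
      for name in expected
      if name not in exported or not hasattr(module, name)
  }

# Stand-in for google.adk.agents; the real check would import the package.
fake_pkg = _types.ModuleType("fake_adk_agents")
fake_pkg.Context = object()
fake_pkg.GraphAgent = object()
fake_pkg.__all__ = ["GraphAgent"]  # Context accidentally omitted

print(missing_exports(fake_pkg, {"Context", "GraphAgent"}))  # {'Context'}
```

Wiring such a check into the test suite turns an accidental breaking change into a failing test instead of a reviewer catch.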
```python
async def _run_async_impl(
    self, ctx: InvocationContext
) -> AsyncGenerator[Event, None]:
  """Core graph execution logic.

  Executes nodes in graph order, following conditional edges,
  supporting loops and human-in-the-loop interrupts.

  Args:
    ctx: Invocation context

  Yields:
    Events from graph execution

  Raises:
    ValueError: If start node not set or graph structure invalid
  """
  if not self.start_node:
    raise ValueError("Start node not set. Call set_start() first.")

  # Register session with InterruptService if enabled
  if self.interrupt_service:
    self.interrupt_service.register_session(ctx.session.id)

  # Get effective telemetry config for nested graph inheritance
  effective_config = self._get_effective_telemetry_config(ctx)

  with tracer.start_as_current_span(
      f"graph_agent_execution {self.name}"
  ) as span:
    span.set_attribute("graph_agent.name", self.name)
    span.set_attribute("graph_agent.start_node", self.start_node)
    span.set_attribute("graph_agent.max_iterations", self.max_iterations)
    try:
      # Load execution tracking state (BaseAgentState pattern)
      agent_state = (
          self._load_agent_state(ctx, GraphAgentState) or GraphAgentState()
      )

      # Store telemetry config for nested graph inheritance
      if effective_config:
        agent_state.telemetry_config_dict = effective_config.model_dump()

      # Initialize domain data from session state or user input.
      # Exclude graph-internal keys to prevent circular references
      # (state.data["graph_data"] → state.data) and keep domain data clean.
      domain_data = {
          k: v
          for k, v in ctx.session.state.items()
          if not k.startswith("_") and k not in _GRAPH_INTERNAL_KEYS
      }
      if domain_data:
        state = GraphState(data=domain_data)
      else:
        # Extract text from Content object
        user_text = ""
        if (
            hasattr(ctx, "user_content")
            and ctx.user_content
            and ctx.user_content.parts
        ):
          user_text = (
              ctx.user_content.parts[0].text
              if ctx.user_content.parts[0].text
              else ""
          )
        state = GraphState(data={"input": user_text})

      # Track which parallel groups have been executed
      executed_parallel_groups = set(agent_state.executed_parallel_groups)

      # ADK resumability: resume from saved node or start fresh.
      #
      # Design note: SequentialAgent ONLY emits state events when
      # ctx.is_resumable is True, because its state events serve only
      # resumability. GraphAgent's state events serve multiple consumers
      # (rewind, interrupts, telemetry) that are orthogonal to
      # resumability. Therefore:
      #   - Per-iteration state events: always emitted (multi-consumer)
      #   - Resume skip: first iteration skipped when resuming (already
      #     persisted before pause, avoids duplicate)
      #   - end_of_agent: guarded by is_resumable (purely a resumability
      #     lifecycle signal, has no other consumers)
      #   - Interrupt/cancellation state saves: always emitted (they
      #     serve interrupt functionality, not just resumability)
      current_node_name, iteration, resuming = self._get_resume_state(
          agent_state
      )
      pause_invocation = False

      while current_node_name and iteration < self.max_iterations:
        iteration += 1
        current_node = self.nodes[current_node_name]

        # Check for immediate cancellation (ESC-like interrupt).
        # Allows user to abort execution at any time, not just at
        # pause points.
        if self.interrupt_service and not self.interrupt_service.is_active(
            ctx.session.id
        ):
          logger.info(
              "GraphAgent execution cancelled (immediate interrupt) for"
              f" session {ctx.session.id}"
          )
          # Save partial state before cancelling (enables resume/restart)
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
          yield Event(
              author=self.name,
              content=types.Content(
                  parts=[types.Part(text="⚠️ Execution cancelled by user")]
              ),
              actions=EventActions(
                  escalate=False,
                  state_delta={
                      "graph_cancelled": True,
                      "graph_cancelled_at_node": current_node_name,
                      "graph_iteration": iteration,
                      "graph_data": state.data,
                      "graph_path": list(agent_state.path),
                      "graph_can_resume": True,
                  },
              ),
          )
          break  # Exit immediately but state is saved

        # Track execution path in agent_state
        agent_state.path.append(current_node_name)
        agent_state.iteration = iteration
        agent_state.current_node = current_node_name
        agent_state.node_invocations.setdefault(current_node_name, []).append(
            ctx.invocation_id
        )

        # ADK resumability: reset sub-agent states on cycle revisit
        # (mirrors LoopAgent pattern at loop_agent.py:114)
        if (
            agent_state.path.count(current_node_name) > 1
            and current_node.agent
        ):
          ctx.reset_sub_agent_states(current_node.agent.name)

        # Track agent path for nested graph support
        if self.name not in agent_state.agent_path:
          agent_state.agent_path.append(self.name)

        # Persist execution tracking via agent_state event.
        # These events are consumed by rewind, interrupts, and telemetry
        # (not just resumability), so they're always emitted.
        # Skip only on first iteration when resuming (already persisted).
        if not resuming:
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
        else:
          resuming = False  # Only skip first iteration after resume

        # Invoke before_node_callback (custom observability)
        if self.before_node_callback:
          from .callbacks import NodeCallbackContext

          callback_ctx = NodeCallbackContext(
              node=current_node,
              state=state,
              iteration=iteration,
              invocation_context=ctx,
              metadata={
                  "agent_path": list(agent_state.agent_path),
                  "path": list(agent_state.path),
              },
          )

          # Execute callback with telemetry
          callback_start_time = time.time()
          with graph_tracing.tracer.start_as_current_span(
              "graph_callback before_node"
          ) as cb_span:
            # Add attributes with additional_attributes support
            attrs = self._get_telemetry_attributes(
                {
                    graph_tracing.GRAPH_CALLBACK_TYPE: "before_node",
                    graph_tracing.GRAPH_AGENT_NAME: self.name,
                    graph_tracing.GRAPH_NODE_NAME: current_node_name,
                },
                effective_config=effective_config,
            )
            for key, value in attrs.items():
              cb_span.set_attribute(key, value)

            try:
              event = await self.before_node_callback(callback_ctx)
              if event:
                yield event

              # Record success (check sampling)
              cb_span.set_attribute("graph.callback.success", True)
              if self._should_sample(effective_config=effective_config):
                callback_latency_ms = (
                    time.time() - callback_start_time
                ) * 1000
                graph_tracing.record_callback_execution(
                    callback_type="before_node",
                    agent_name=self.name,
                    latency_ms=callback_latency_ms,
                    success=True,
                )

            except Exception as e:
              # Record failure (check sampling)
              cb_span.set_attribute("graph.callback.success", False)
              cb_span.set_attribute("graph.callback.error", str(e))
              if self._should_sample(effective_config=effective_config):
                callback_latency_ms = (
                    time.time() - callback_start_time
                ) * 1000
                graph_tracing.record_callback_execution(
                    callback_type="before_node",
                    agent_name=self.name,
                    latency_ms=callback_latency_ms,
                    success=False,
                )
              logger.error(
                  "before_node_callback failed for node"
                  f" '{current_node_name}': {e}",
                  exc_info=True,
              )
              # Continue execution despite callback error

        # Handle BEFORE-node interrupt (validation timing)
        if (
            self._should_interrupt_before(current_node_name)
            and self.interrupt_service
        ):
          _b_events, _b_ctrl = await self._handle_before_node_interrupt(
              current_node_name, current_node, state, ctx, agent_state
          )
          for _e in _b_events:
            yield _e
          # Persist agent_state after interrupt handler may have mutated it
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
          if _b_ctrl == "break":
            break
          elif _b_ctrl is not None:
            if isinstance(_b_ctrl, tuple):
              current_node_name = _b_ctrl[1]
            continue

        # Check if current node is part of a parallel group
        parallel_group_info = self._find_parallel_group(current_node_name)
        if parallel_group_info:
          group_id, parallel_group = parallel_group_info

          # Check if this group has already been executed
          if group_id in executed_parallel_groups:
            # Group already executed, skip this node
            logger.info(
                f"Skipping node '{current_node_name}' - already executed as"
                f" part of parallel group '{group_id}'"
            )
            # Route to next node from this node's edges
            next_node_name = self._get_next_node_with_telemetry(
                current_node, state
            )
            if next_node_name is None:
              if current_node_name in self.end_nodes:
                break
              else:
                raise ValueError(
                    f"Node {current_node_name} has no outgoing edges and is"
                    " not an end node"
                )
            current_node_name = next_node_name
            continue

          # Execute entire parallel group
          logger.info(
              f"Executing parallel group '{group_id}' with nodes:"
              f" {parallel_group.nodes}"
          )

          # Execute parallel group with telemetry
          parallel_start_time = time.time()
          with graph_tracing.tracer.start_as_current_span(
              f"parallel_group {group_id}"
          ) as pg_span:
            # Add attributes with additional_attributes support
            attrs = self._get_telemetry_attributes(
                {
                    graph_tracing.GRAPH_PARALLEL_NODE_COUNT: len(
                        parallel_group.nodes
                    ),
                    graph_tracing.GRAPH_PARALLEL_STRATEGY: (
                        parallel_group.join_strategy.value
                    ),
                    graph_tracing.GRAPH_PARALLEL_WAIT_N: (
                        parallel_group.wait_n
                    ),
                    graph_tracing.GRAPH_AGENT_NAME: self.name,
                },
                effective_config=effective_config,
            )
            for key, value in attrs.items():
              pg_span.set_attribute(key, value)

            # Collect all events from parallel execution
            completed_count = 0
            async for event in execute_parallel_group(
                parallel_group,
                self.nodes,
                state,
                ctx,
                self._execute_node,
            ):
              yield event
              # Count completions (rough estimate based on events)
              if event.author != self.name:
                completed_count = min(
                    completed_count + 1, len(parallel_group.nodes)
                )

            # Record parallel group metrics (check sampling)
            pg_span.set_attribute(
                "graph.parallel.completed_count", completed_count
            )
            if self._should_sample(effective_config=effective_config):
              parallel_latency_ms = (time.time() - parallel_start_time) * 1000
              graph_tracing.record_parallel_group_execution(
                  agent_name=self.name,
                  node_count=len(parallel_group.nodes),
                  strategy=parallel_group.join_strategy.value,
                  latency_ms=parallel_latency_ms,
                  completed_count=completed_count,
              )

          # Mark group as executed
          executed_parallel_groups.add(group_id)
          agent_state.executed_parallel_groups = list(
              executed_parallel_groups
          )

          # After parallel group completes, determine next node.
          # Use the current node's edges to determine routing
          # (all nodes in group should have same outgoing edges).
          next_node_name = self._get_next_node_with_telemetry(
              current_node, state
          )

          if next_node_name is None:
            # No more nodes after parallel group
            if current_node_name in self.end_nodes:
              break
            else:
              raise ValueError(
                  f"Parallel group '{group_id}' has no outgoing edges and"
                  f" node '{current_node_name}' is not an end node"
              )

          current_node_name = next_node_name
          continue  # Skip individual node execution, go to next iteration

        # Execute node with immediate cancellation support.
        # Check cancellation while streaming events from node execution.
        output_holder: Dict[str, Any] = {"output": ""}
        try:
          async for event in self._execute_node(
              current_node,
              state,
              ctx,
              effective_config,
              output_holder=output_holder,
              iteration=iteration,
          ):
            # Check for immediate cancellation DURING node execution
            if (
                self.interrupt_service
                and not self.interrupt_service.is_active(ctx.session.id)
            ):
              logger.info(
                  "GraphAgent execution cancelled (immediate interrupt"
                  f" during node '{current_node_name}') for session"
                  f" {ctx.session.id}"
              )
              ctx.set_agent_state(self.name, agent_state=agent_state)
              yield self._create_agent_state_event(ctx)
              yield Event(
                  author=self.name,
                  content=types.Content(
                      parts=[
                          types.Part(
                              text=(
                                  "⚠️ Execution cancelled during node"
                                  f" '{current_node_name}'"
                              )
                          )
                      ]
                  ),
                  actions=EventActions(
                      escalate=False,
                      state_delta={
                          "graph_cancelled": True,
                          "graph_cancelled_at_node": current_node_name,
                          "graph_data": state.data,
                          "graph_partial_output": output_holder["output"],
                          "graph_can_resume": True,
                      },
                  ),
              )
              return
            yield event
        except asyncio.CancelledError:
          # Task cancelled externally (e.g., timeout, user abort)
          logger.info(
              f"GraphAgent task cancelled during node '{current_node_name}'"
              f" for session {ctx.session.id}"
          )
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
          yield Event(
              author=self.name,
              content=types.Content(
                  parts=[
                      types.Part(
                          text=(
                              "⚠️ Task cancelled during node"
                              f" '{current_node_name}'"
                          )
                      )
                  ]
              ),
              actions=EventActions(
                  escalate=False,
                  state_delta={
                      "graph_task_cancelled": True,
                      "graph_cancelled_at_node": current_node_name,
                      "graph_data": state.data,
                      "graph_partial_output": output_holder["output"],
                      "graph_can_resume": True,
                  },
              ),
          )
          raise

        # ADK resumability: check if node execution was paused
        if output_holder.get("pause"):
          pause_invocation = True
          return

        # Sync session state into GraphState.data FIRST so that
        # output_mapper receives the most up-to-date values and can
        # override them. Agents write routing signals via state_delta
        # (the ADK-standard pattern); this sync makes those values
        # visible to edge condition lambdas (which receive GraphState)
        # without requiring an explicit output_mapper.
        # Internal keys (prefix '_') are excluded.
        for _sk, _sv in ctx.session.state.items():
          if not _sk.startswith("_") and _sk not in _GRAPH_INTERNAL_KEYS:
            state.data[_sk] = _sv

        # Update state with node output (output_mapper runs AFTER
        # session sync, so it can override synced values when needed)
        output = output_holder["output"]
        if output:
          # Track state before reduction for telemetry
          had_previous_value = current_node.name in state.data
          reducer_start = time.time()

          # Apply output mapper with reducer
          prev_state = state
          state = current_node.output_mapper(output, state)
          if state is None:
            # Custom output_mapper mutated in-place but forgot to return
            state = prev_state

          # Record state reducer telemetry (check sampling)
          reducer_latency_ms = (time.time() - reducer_start) * 1000
          if self._should_sample(effective_config=effective_config):
            graph_tracing.record_state_reducer(
                node_name=current_node.name,
                reducer_type=current_node.reducer.name,
                state_key=current_node.name,
                agent_name=self.name,
                latency_ms=reducer_latency_ms,
                had_previous_value=had_previous_value,
            )

          # Record output mapper telemetry
          is_default_mapper = (
              current_node.output_mapper.__name__
              == "_default_output_mapper"
          )
          graph_tracing.record_mapper(
              node_name=current_node.name,
              mapper_type="output",
              agent_name=self.name,
              latency_ms=reducer_latency_ms,
              is_default=is_default_mapper,
          )

        # Invoke after_node_callback (custom observability)
        if self.after_node_callback:
          from .callbacks import NodeCallbackContext

          callback_ctx = NodeCallbackContext(
              node=current_node,
              state=state,
              iteration=iteration,
              invocation_context=ctx,
              metadata={
                  "output": output,
                  "agent_path": list(agent_state.agent_path),
                  "path": list(agent_state.path),
              },
          )

          # Execute callback with telemetry
          callback_start_time = time.time()
          with graph_tracing.tracer.start_as_current_span(
              "graph_callback after_node"
          ) as cb_span:
            # Add attributes with additional_attributes support
            attrs = self._get_telemetry_attributes(
                {
                    graph_tracing.GRAPH_CALLBACK_TYPE: "after_node",
                    graph_tracing.GRAPH_AGENT_NAME: self.name,
                    graph_tracing.GRAPH_NODE_NAME: current_node_name,
                },
                effective_config=effective_config,
            )
            for key, value in attrs.items():
              cb_span.set_attribute(key, value)

            try:
              event = await self.after_node_callback(callback_ctx)
              if event:
                yield event

              # Record success (check sampling)
              cb_span.set_attribute("graph.callback.success", True)
              if self._should_sample(effective_config=effective_config):
                callback_latency_ms = (
                    time.time() - callback_start_time
                ) * 1000
                graph_tracing.record_callback_execution(
                    callback_type="after_node",
                    agent_name=self.name,
                    latency_ms=callback_latency_ms,
                    success=True,
                )

            except Exception as e:
              # Record failure (check sampling)
              cb_span.set_attribute("graph.callback.success", False)
              cb_span.set_attribute("graph.callback.error", str(e))
              if self._should_sample(effective_config=effective_config):
                callback_latency_ms = (
                    time.time() - callback_start_time
                ) * 1000
                graph_tracing.record_callback_execution(
                    callback_type="after_node",
                    agent_name=self.name,
                    latency_ms=callback_latency_ms,
                    success=False,
                )
              logger.error(
                  "after_node_callback failed for node"
                  f" '{current_node_name}': {e}",
                  exc_info=True,
              )
              # Continue execution despite callback error

        # Emit graph metadata event for evaluation framework.
        # This will be captured in Invocation.intermediate_data by
        # EvaluationGenerator. Set partial=True so is_final_response()
        # returns False (making it an intermediate event).
        graph_metadata = {
            "graph_node": current_node_name,
            "graph_iteration": iteration,
            "graph_path": list(agent_state.path),
            "node_invocations": {
                name: len(invocs)
                for name, invocs in agent_state.node_invocations.items()
            },
            "graph_state": dict(state.data),
        }
        yield Event(
            author=f"{self.name}#metadata",
            content=types.Content(
                parts=[types.Part(text=f"[GraphMetadata] {graph_metadata}")]
            ),
            partial=True,  # Mark as intermediate event
        )

        # Handle AFTER-node interrupt (retrospective feedback timing).
        # This enables retrospective feedback: observe past, steer future.
        if (
            self._should_interrupt_after(current_node_name)
            and self.interrupt_service
        ):
          _a_events, _a_ctrl = await self._handle_after_node_interrupt(
              current_node_name, state, ctx, agent_state
          )
          for _e in _a_events:
            yield _e
          # Persist agent_state after interrupt handler may have mutated it
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
          if _a_ctrl == "break":
            break
          elif _a_ctrl is not None:
            if isinstance(_a_ctrl, tuple):
              current_node_name = _a_ctrl[1]
            continue

        # Checkpointing - yield event with state_delta to persist checkpoint.
        # Note: for full checkpoint/resume functionality, use CheckpointCallback.
        if self.checkpointing:
          ctx.set_agent_state(self.name, agent_state=agent_state)
          yield self._create_agent_state_event(ctx)
          yield Event(
              author=self.name,
              content=types.Content(
                  parts=[types.Part(text=f"Checkpoint: {current_node_name}")]
              ),
              actions=EventActions(
                  state_delta={
                      "graph_data": state.data,
                      "graph_checkpoint": {
                          "node": current_node_name,
                          "iteration": iteration,
                      },
                  }
              ),
          )

        # Inject transient execution data for edge conditions
        state.data["_graph_iteration"] = agent_state.iteration
        state.data["_graph_path"] = list(agent_state.path)
        state.data["_conditions"] = dict(agent_state.conditions)

        # Get next node via conditional routing
        next_node_name = self._get_next_node_with_telemetry(
            current_node, state
        )

        # Clean up transient keys
        for _tk in ("_graph_iteration", "_graph_path", "_conditions"):
          state.data.pop(_tk, None)
        if next_node_name is None:
          # No more edges - check if we're at an end node
          if current_node_name in self.end_nodes:
            break
          else:
            # Not at an end node and no edges - error
            raise ValueError(
                f"Node {current_node_name} has no outgoing edges and is not"
                " an end node"
            )

        current_node_name = next_node_name

        # Record iteration metrics (check sampling)
        if self._should_sample(effective_config=effective_config):
          graph_tracing.record_graph_iteration(
              agent_name=self.name,
              iteration=iteration,
              path_length=len(agent_state.path),
          )

      # ADK resumability: skip final response + end_of_agent when paused
      if not pause_invocation:
        # Final response - yield event with graph metadata.
        # Include last node's output ONLY if:
        #   1. explicit final_output is set, OR
        #   2. last node was a function (doesn't yield events, so we need
        #      to show output).
        # Don't include output for agent nodes (they already yielded
        # their output).
        final_output = state.data.get("final_output", "")

        # If no explicit final_output, check if last node was a function
        if not final_output and current_node_name:
          last_node = self.nodes.get(current_node_name)
          if last_node and last_node.function:
            # Function node - include its output
            final_output = state.data.get(current_node_name, "")

        response_text = f"{final_output}"

        yield Event(
            author=self.name,
            content=types.Content(parts=[types.Part(text=response_text)]),
            actions=EventActions(
                state_delta={
                    "graph_data": state.data,
                    "graph_iterations": iteration,
                    "graph_path": list(agent_state.path),
                }
            ),
        )
        # end_of_agent is guarded by is_resumable because it is purely a
        # resumability lifecycle signal (tells the runner "this agent is
        # done, don't re-run it on resume"). Unlike per-iteration state
        # events which serve rewind/interrupts/telemetry, end_of_agent
        # has no other consumers.
        if ctx.is_resumable:
          ctx.set_agent_state(self.name, end_of_agent=True)
          yield self._create_agent_state_event(ctx)

    finally:
      # Unregister session from InterruptService and finalize tracing
      if self.interrupt_service:
        self.interrupt_service.unregister_session(ctx.session.id)
      span.set_attribute("graph_agent.completed", True)

# Interrupt methods inherited from GraphInterruptMixin
```
The _run_async_impl method is quite long and complex, handling various aspects like state initialization, iteration, interrupts, callbacks, checkpointing, and parallel execution. For better readability and maintainability, consider refactoring this method by extracting some of its logic into smaller, more focused helper methods. For instance, the callback invocation logic, which is similar for before_node_callback and after_node_callback, could be a good candidate for a helper function.
Fixed. Extracted 4 helper methods from _run_async_impl: _execute_callback, _sync_state_and_reduce, _build_cancellation_events, and _execute_parallel_phase. Method reduced from ~541 to ~395 lines. The helpers encapsulate distinct concerns: callback execution, state synchronization, cancellation event building, and parallel phase orchestration.
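For reviewers skimming this thread, the shape of the extracted callback helper might look roughly like the following. This is a hypothetical sketch, not the PR's actual `_execute_callback`: the signature, the `record` hook, and the demo harness are all assumptions; it only illustrates the factoring of the duplicated before/after telemetry logic.

```python
import asyncio
import time

# Hypothetical sketch of a shared callback helper: one wrapper handles
# timing, success/failure recording, and error swallowing for both
# before_node and after_node callbacks. Not the PR's actual code.
async def execute_callback(callback, callback_ctx, *, callback_type, record):
  """Run a node callback, record telemetry, and never abort the graph."""
  start = time.time()
  try:
    event = await callback(callback_ctx)
    record(callback_type=callback_type,
           latency_ms=(time.time() - start) * 1000, success=True)
    return event
  except Exception:
    record(callback_type=callback_type,
           latency_ms=(time.time() - start) * 1000, success=False)
    return None  # Swallow: callback failures must not stop execution


async def _demo():
  calls = []
  record = lambda **kw: calls.append(kw)

  async def ok_cb(ctx):
    return "event"

  async def bad_cb(ctx):
    raise RuntimeError("boom")

  assert await execute_callback(
      ok_cb, None, callback_type="before_node", record=record) == "event"
  assert await execute_callback(
      bad_cb, None, callback_type="after_node", record=record) is None
  return calls

calls = asyncio.run(_demo())
print([(c["callback_type"], c["success"]) for c in calls])
# [('before_node', True), ('after_node', False)]
```

Collapsing both call sites onto one helper also guarantees the two paths can never diverge in how they record latency or failures.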
```python
async def _handle_before_node_interrupt(
    self,
    current_node_name: str,
    current_node: GraphNode,
    state: GraphState,
    ctx: InvocationContext,
    agent_state: GraphAgentState,
) -> Tuple[List[Event], str | Tuple[str, str] | None]:
  """Handle a BEFORE-node interrupt and return events + routing control.

  Args:
    current_node_name: Name of the node about to execute.
    current_node: The GraphNode about to execute (needed for "skip").
    state: Current graph state.
    ctx: Invocation context.
    agent_state: Execution tracking state.

  Returns:
    Tuple of (events_to_yield, control) where control is:
    - None: proceed to normal node execution.
    - "rerun": re-run current node (continue the loop).
    - "break": exit the main loop immediately.
    - ("go_back", target_node): jump to target_node.
    - ("skip", next_node | None): skip node, route to next_node.
  """
  assert self.interrupt_service is not None
  interrupt_message = await self._check_interrupt_with_telemetry(
      ctx.session.id, "before"
  )
  if not interrupt_message:
    return [], None

  action_result = await self._process_interrupt_message(
      interrupt_message, state, current_node_name, ctx, agent_state
  )

  should_escalate = (
      action_result == "pause"
      if isinstance(action_result, str)
      else (isinstance(action_result, tuple) and action_result[0] == "pause")
  )

  event = Event(
      author=self.name,
      content=types.Content(
          parts=[
              types.Part(
                  text=(
                      "\U0001f6d1 INTERRUPT (BEFORE):"
                      f" {interrupt_message.text}"
                  )
              )
          ]
      ),
      actions=EventActions(
          escalate=should_escalate,
          state_delta={
              "interrupt_message": interrupt_message.text,
              "interrupt_timing": "before",
              "interrupt_node": current_node_name,
          },
      ),
  )

  if isinstance(action_result, tuple):
    action, target_node = action_result
    if action == "go_back":
      return [event], ("go_back", target_node)
  elif action_result == "rerun":
    return [event], "rerun"
  elif action_result == "skip":
    next_node_name = self._get_next_node_with_telemetry(current_node, state)  # type: ignore[attr-defined]
    return (
        [event],
        ("skip", next_node_name) if next_node_name else "break",
    )
  elif action_result == "pause":
    try:
      resumed = await self.interrupt_service.wait_if_paused(ctx.session.id)
      if not resumed:
        return [event], "break"
    except TimeoutError:
      return [event], "break"

  return [event], None

async def _handle_after_node_interrupt(
    self,
    current_node_name: str,
    state: GraphState,
    ctx: InvocationContext,
    agent_state: GraphAgentState,
) -> Tuple[List[Event], str | Tuple[str, str] | None]:
  """Handle an AFTER-node interrupt and return events + routing control.

  Args:
    current_node_name: Name of the node that just executed.
    state: Current graph state (includes the node's output).
    ctx: Invocation context.
    agent_state: Execution tracking state.

  Returns:
    Tuple of (events_to_yield, control) where control is:
    - None: accept results and proceed to next node.
    - "rerun": re-run current node.
    - "break": exit the main loop.
    - ("go_back", target_node): jump to target_node.
  """
  assert self.interrupt_service is not None
  interrupt_message = await self._check_interrupt_with_telemetry(
      ctx.session.id, "after"
  )
  if not interrupt_message:
    return [], None

  action_result = await self._process_interrupt_message(
      interrupt_message, state, current_node_name, ctx, agent_state
  )

  should_escalate = (
      action_result == "pause"
      if isinstance(action_result, str)
      else (isinstance(action_result, tuple) and action_result[0] == "pause")
  )

  state_delta_dict: Dict[str, Any] = {
      "interrupt_message": interrupt_message.text,
      "interrupt_timing": "after",
      "interrupt_metadata": interrupt_message.metadata,
      "interrupt_action": interrupt_message.action,
      "interrupt_node": current_node_name,
  }

  event = Event(
      author=self.name,
      content=types.Content(
          parts=[
              types.Part(
                  text=(
                      "\U0001f6d1 INTERRUPT (AFTER):"
                      f" {interrupt_message.text}"
                  )
              )
          ]
      ),
      actions=EventActions(
          escalate=should_escalate, state_delta=state_delta_dict
      ),
  )

  if isinstance(action_result, tuple):
    action, target_node = action_result
    if action == "go_back":
      return [event], ("go_back", target_node)
  elif action_result == "rerun":
    return [event], "rerun"
  elif action_result == "pause":
    try:
      resumed = await self.interrupt_service.wait_if_paused(ctx.session.id)
      if not resumed:
        logger.info(
            "GraphAgent execution cancelled for session %s",
            ctx.session.id,
        )
        return [event], "break"
    except TimeoutError as e:
      logger.warning("Interrupt wait timeout: %s", e)
      return [event], "break"

  return [event], None
```
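To make the routing contract of these handlers concrete, here is a minimal, hypothetical sketch of how a caller might interpret the `(events, control)` return value. The `route` function and its decision strings are illustrative stand-ins, not the actual GraphAgent loop.

```python
# Hypothetical consumer of the interrupt-handler control value.
# Control mirrors the documented return shape: None, "rerun", "break",
# ("go_back", node), or ("skip", node | None).
from typing import Optional, Tuple, Union

Control = Union[None, str, Tuple[str, Optional[str]]]


def route(control: Control, current: str) -> Tuple[str, Optional[str]]:
    """Map an interrupt control value to a (decision, next_node) pair."""
    if control is None:
        return "execute", current      # proceed to normal node execution
    if control == "rerun":
        return "rerun", current        # re-run the same node
    if control == "break":
        return "stop", None            # exit the main loop
    action, target = control           # ("go_back", node) or ("skip", node)
    return "jump", target              # jump to the targeted node


print(route(None, "write"))               # ('execute', 'write')
print(route("break", "write"))            # ('stop', None)
print(route(("go_back", "plan"), "write"))  # ('jump', 'plan')
```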
The methods _handle_before_node_interrupt and _handle_after_node_interrupt share a significant amount of logic. They both check for an interrupt message, process it, create an event, and then handle the action result. To reduce code duplication and improve maintainability, you could refactor this into a single helper method that takes the interrupt mode ('before' or 'after') as a parameter and handles the common logic, while keeping the mode-specific differences (like the skip action) inside conditional blocks.
Already refactored: both methods delegate to _check_interrupt_with_telemetry, _process_interrupt_message, and _dispatch_interrupt_action. Remaining differences are intentional (BEFORE needs current_node for skip, AFTER includes metadata fields).
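For illustration, the shared-builder part of the reviewer's suggestion could look roughly like the sketch below: one helper parameterized by timing, with the AFTER-only fields added conditionally. The field names mirror the diff; the helper itself is hypothetical and not part of the PR.

```python
# Hypothetical shared helper: builds the state_delta for an interrupt
# event, keyed by timing ("before" or "after"). AFTER interrupts carry
# the extra metadata/action fields seen in _handle_after_node_interrupt.
from typing import Any, Dict, Optional


def build_interrupt_state_delta(
    timing: str,
    node: str,
    text: str,
    metadata: Optional[Dict[str, Any]] = None,
    action: Optional[str] = None,
) -> Dict[str, Any]:
    delta: Dict[str, Any] = {
        "interrupt_message": text,
        "interrupt_timing": timing,
        "interrupt_node": node,
    }
    if timing == "after":  # only AFTER interrupts include these fields
        delta["interrupt_metadata"] = metadata
        delta["interrupt_action"] = action
    return delta


before = build_interrupt_state_delta("before", "write", "pause please")
after = build_interrupt_state_delta("after", "write", "redo", {"k": 1}, "rerun")
```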
Force-pushed 3282d2e to 5c2bdb5.
Code Review
This pull request introduces a comprehensive suite of new examples and documentation for the GraphAgent framework. It showcases advanced features such as checkpointing, LLM-based interrupt reasoning, callback-based observability, flexible interrupt timings, immediate cancellation, dynamic task queues, agent-driven topology, Human-In-The-Loop (HITL) workflows, multi-agent coordination, and various parallel execution strategies (WAIT_ALL, WAIT_ANY, WAIT_N) with rewind integration. The changes include new READMEs, Python agent scripts, and YAML configurations for each example, demonstrating how to build complex, adaptive, and fault-tolerant agentic workflows. A review comment highlights a NameError in the scenario_5_all_interrupt_timings function within contributing/samples/graph_agent_advanced/agent.py, noting that paper_writer_agent and peer_reviewer_agent are not properly scoped and suggesting they be re-instantiated within the scenario function to resolve the issue.
```python
graph_before.add_node("write", agent=paper_writer_agent)
graph_before.add_node("peer_review", agent=peer_reviewer_agent)
```
The agents paper_writer_agent and peer_reviewer_agent are not defined within the scope of the scenario_5_all_interrupt_timings function. They are created inside build_research_workflow but are not accessible here, which will lead to a NameError when this scenario is executed.
To fix this, you should create new instances of these agents within the scenario, similar to how other scenarios are structured.
Suggested change:

```python
# Create fresh agents for this scenario to avoid parent conflicts.
(_, _, _, paper_writer_agent, peer_reviewer_agent) = _create_research_agents()
# Add simple nodes
graph_before.add_node("write", agent=paper_writer_agent)
graph_before.add_node("peer_review", agent=peer_reviewer_agent)
```
Fixed: scenario_5 now calls _create_research_agents() to get its own scoped references.
Force-pushed aa73841 to 7eee735.
Addressing review feedback

Force-pushed with the following fixes:

High:
Medium:
@gemini-code-assist please re-review
Code Review
This is an impressive pull request that introduces a powerful GraphAgent for complex workflow orchestration and a CheckpointService for state persistence. The implementation is robust, with thoughtful handling of concurrency, state management, and error conditions. The extensive set of new samples and documentation is particularly valuable for understanding and adopting these new features. My review includes a minor code cleanup suggestion and a documentation enhancement to make the new codebase even more discoverable.
- `src/google/adk/agents/graph/graph_agent.py` - Core orchestration
- `src/google/adk/agents/graph/interrupt_reasoner.py` - LLM reasoning
- `src/google/adk/agents/graph/callbacks.py` - Callback infrastructure
- `src/google/adk/agents/graph/interrupt_service.py` - Interrupt management
Fixed — added `src/google/adk/checkpoints/checkpoint_service.py` - Checkpoint management to the list.
```python
review = state.data.get("peer_review", {})
# LLM agents store output as JSON string; parse if needed
if isinstance(review, str):
  import json
```
Fixed — removed duplicate import json (already imported at line 16).
Add GraphAgent for building directed-graph workflows with conditional routing, cyclic execution, state management with reducers, typed events, streaming, callbacks, rewind, resumability, telemetry with OpenTelemetry tracing, evaluation metrics, and CLI graph visualization for GraphAgent topologies. Includes samples and design documentation.
Add DynamicNode (runtime agent selection), NestedGraphNode (hierarchical workflow composition), and DynamicParallelGroup (variable-count concurrent execution). Extends CLI visualization with pattern-aware rendering (diamond, parallelogram, sub-cluster shapes). Includes pattern samples, node type reference, and design documentation.
Add stateless CheckpointService using SessionService and ArtifactService primitives with delta compression, concurrent session locking, and configurable retention. Includes CheckpointableMixin for any agent type, GraphCheckpointCallback for node-level checkpointing, and comprehensive error model with telemetry. Includes checkpoint samples.
Force-pushed 7eee735 to 1c03f63.
@gemini-code-assist please re-review
Code Review
This pull request introduces the CheckpointService and associated components to enable state persistence, delta compression, and human-in-the-loop interrupts for GraphAgent workflows. The implementation is comprehensive, including a stateless service, node-level callbacks, and extensive telemetry. My review identified a few areas for improvement, primarily concerning the use of internal asyncio.Queue attributes, performance overhead in state size validation, and potential edge cases in state delta calculation. Addressing these will enhance the robustness and performance of the persistence layer.
```python
    "data": state.data,
}
try:
  result = eval(code, {"__builtins__": safe_builtins}, namespace)  # noqa: S307
```
```python
# asyncio.Queue stores items in _queue (a collections.deque).
# This avoids drain/requeue and eliminates QueueFull risk.
queue = self._message_queues[session_id]
messages = list(queue._queue)  # type: ignore[attr-defined]
```
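The review summary flags this reliance on `asyncio.Queue`'s private `_queue` attribute. One hedged alternative, sketched below, is to keep a `collections.deque` as the canonical buffer and expose non-destructive snapshots from it; `PeekableQueue` is illustrative and not part of the PR.

```python
# Sketch: a peekable message buffer that stays on the public API
# surface instead of reading asyncio.Queue._queue.
import asyncio
from collections import deque


class PeekableQueue:
    def __init__(self) -> None:
        self._items: deque = deque()
        self._event = asyncio.Event()

    def put_nowait(self, item) -> None:
        self._items.append(item)
        self._event.set()

    def snapshot(self) -> list:
        """Non-destructive view of pending items, no drain/requeue."""
        return list(self._items)

    async def get(self):
        while not self._items:
            self._event.clear()
            await self._event.wait()
        return self._items.popleft()


async def demo() -> list:
    q = PeekableQueue()
    q.put_nowait("pause")
    q.put_nowait("resume")
    return q.snapshot()


print(asyncio.run(demo()))  # ['pause', 'resume']
```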
```python
state_size = len(
    json.dumps(state_snapshot, cls=PydanticJSONEncoder).encode("utf-8")
)
```
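The review summary flags the overhead of serializing the whole snapshot just to measure its size. A hedged sketch of one mitigation: serialize once and reuse both the bytes (for storage) and their length (for the limit check). The function name and 1 MiB limit below are illustrative assumptions, not the PR's actual values.

```python
# Sketch: single serialization pass shared between size validation
# and persistence, instead of a dumps() call used only for len().
import json

MAX_STATE_BYTES = 1 * 1024 * 1024  # illustrative 1 MiB cap


def serialize_with_limit(snapshot: dict, limit: int = MAX_STATE_BYTES) -> bytes:
    payload = json.dumps(snapshot).encode("utf-8")  # serialize exactly once
    if len(payload) > limit:
        raise ValueError(f"state too large: {len(payload)} > {limit} bytes")
    return payload  # caller persists these bytes directly


data = serialize_with_limit({"step": 3, "notes": "ok"})
```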
```python
delta = {}
for _k, _v in state.data.items():
  if (
      not _k.startswith("_")
      and _k not in _GRAPH_INTERNAL_KEYS
      and ctx.session.state.get(_k) != _v
  ):
    delta[_k] = _v
```
The delta calculation for state_delta events relies on a direct equality check (ctx.session.state.get(_k) != _v). This may be inefficient for large nested dictionaries or lists, and might not correctly detect changes in mutable objects if they are modified in-place. Consider using a more robust deep comparison or ensuring that state updates always use new object instances.
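The deep-copied-baseline approach this comment points toward can be sketched as follows: snapshot the state with `copy.deepcopy` before the node runs, then diff against that snapshot, so in-place mutation of nested objects is still detected. `compute_delta` is a hypothetical helper, not the PR's implementation.

```python
# Sketch: detect in-place mutations by diffing against a deep-copied
# baseline taken before the node executed.
import copy


def compute_delta(baseline: dict, current: dict,
                  internal_keys: frozenset = frozenset()) -> dict:
    delta = {}
    for k, v in current.items():
        if k.startswith("_") or k in internal_keys:
            continue  # skip private/internal keys, as the diff above does
        if k not in baseline or baseline[k] != v:
            delta[k] = v
    return delta


state = {"results": [1, 2], "_tmp": 0}
baseline = copy.deepcopy(state)   # snapshot before mutation
state["results"].append(3)        # in-place change to a nested list
print(compute_delta(baseline, state))  # {'results': [1, 2, 3]}
```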
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
2. Or, if no issue exists, describe the change:
Problem:
Long-running agent workflows need state snapshots for recovery, debugging, and audit trails. There is no built-in checkpoint mechanism that composes existing ADK session/artifact services.
Solution:

Add stateless `CheckpointService` using `SessionService` and `ArtifactService` primitives with delta compression, concurrent session locking, and configurable retention. Includes `CheckpointableMixin` for any agent type, `GraphCheckpointCallback` for node-level checkpointing, and comprehensive error model (`CheckpointNotFoundError`, `CheckpointCorruptedError`, `DeltaChainBrokenError`) with telemetry.

What's included:

- `src/google/adk/checkpoints/` — `checkpoint_service.py`, `models.py`, `mixins.py`, `callback.py`, `utils.py`, `__init__.py`
- `src/google/adk/agents/graph/checkpoint_callback.py` — `GraphCheckpointCallback`
- `graph/__init__.py` with checkpoint exports
- `test_graph_agent.py` with final test additions
- `test_interrupt_integration.py` with checkpoint+interrupt integration tests
- `test_checkpoint_service.py`, `test_checkpoint_coverage.py`, `test_checkpoint_delta_chain.py`, `test_checkpoint_locks.py`, `test_checkpoint_mixin.py`, `test_checkpoint_utils.py`, `test_callback.py`
- Samples: `graph_agent_advanced`, `graph_agent_agent_driven_checkpoint`, `graph_agent_agent_driven_topology`, `graph_agent_dynamic_topology`, `graph_agent_hitl`, `graph_agent_parallel_features`, `graph_agent_todo_queue`, examples 04/12

Part 5 of 5 — see tracking issue #4581. Stacked on #4585.
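To illustrate the delta-compression idea described above (a full snapshot followed by deltas, with a broken-chain error when a link is missing), here is a minimal sketch. The class and method names are hypothetical simplifications, not the PR's actual `CheckpointService` API.

```python
# Sketch: delta-chain checkpointing. Restore replays deltas onto the
# base snapshot; a missing (pruned) delta breaks the chain, analogous
# to the PR's DeltaChainBrokenError.
from typing import List, Optional


class DeltaChainBroken(Exception):
    """Raised when a delta required for restore is missing."""


class DeltaChain:
    def __init__(self, base: dict) -> None:
        self.base = dict(base)
        self.deltas: List[Optional[dict]] = []

    def checkpoint(self, delta: dict) -> int:
        """Record a delta; returns a 1-based checkpoint id."""
        self.deltas.append(dict(delta))
        return len(self.deltas)

    def restore(self, upto: int) -> dict:
        """Rebuild state by applying deltas 1..upto onto the base."""
        state = dict(self.base)
        for i, d in enumerate(self.deltas[:upto], start=1):
            if d is None:  # a pruned delta breaks the chain
                raise DeltaChainBroken(f"delta {i} missing")
            state.update(d)
        return state


chain = DeltaChain({"step": 0})
chain.checkpoint({"step": 1})
cid = chain.checkpoint({"step": 2, "done": True})
print(chain.restore(cid))  # {'step': 2, 'done': True}
```

Retention then becomes a question of which deltas may be pruned without severing the chain below the oldest checkpoint callers still restore.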
Testing Plan
Unit Tests:
Manual End-to-End (E2E) Tests:
7 checkpoint sample agents import and instantiate successfully.
Checklist
Additional context
Part 5 of 5 (final). Depends on all prior PRs: #4582, #4583, #4584, #4585. Core `CheckpointService` is agent-agnostic; only `GraphCheckpointCallback` depends on GraphAgent.

Total across all 5 PRs: ~727 tests, 26 samples, 6 design docs.