@rahul2393 (Contributor) commented Dec 24, 2025

This PR changes the Java Spanner client's multiplexed session creation behavior to retry on transient errors. Previously, if the initial CreateSession RPC failed, the error was cached permanently and returned to all subsequent requests without any retry attempt.

Problem

The current implementation fetches a multiplexed session once during client initialization:

  • On success: The session is cached and the maintainer refreshes it before expiration
  • On failure: The error is cached permanently and returned to all RPCs forever

As a result, a single transient failure during initialization (e.g., DEADLINE_EXCEEDED, UNAVAILABLE) leaves the client broken for its entire lifetime, even after the server recovers.
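The failure mode can be illustrated with a minimal sketch that uses the JDK's CompletableFuture as a stand-in for the client's session future. The class and method names here are hypothetical and are not taken from the client's code; the point is only that a future completed once with an error hands that same error to every later caller.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the pre-PR behavior: the creation result is
// stored in a future that is completed exactly once, at initialization.
final class StickySessionHolder {
  private final CompletableFuture<String> session = new CompletableFuture<>();

  // Called once at client initialization with the CreateSession outcome.
  void completeInitialCreation(String sessionName, Throwable error) {
    if (error != null) {
      session.completeExceptionally(error);
    } else {
      session.complete(sessionName);
    }
  }

  // Every RPC reads the same future. If it failed, join() throws a
  // CompletionException wrapping the original error on every call.
  String getSession() {
    return session.join();
  }
}
```

With this shape, a DEADLINE_EXCEEDED during initialization is indistinguishable from a permanent failure, which is the behavior the PR sets out to change.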

Solution

Implement retry-on-access semantics:

| Error Type | Behavior |
|---|---|
| UNIMPLEMENTED | Cache error, fall back to regular sessions, never retry |
| DatabaseNotFoundException | Cache error, never retry (permanent) |
| InstanceNotFoundException | Cache error, never retry (permanent) |
| DEADLINE_EXCEEDED | Cache error temporarily, retry on next RPC access |
| UNAVAILABLE | Cache error temporarily, retry on next RPC access |
| Other transient errors | Cache error temporarily, retry on next RPC access |
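The table above can be read as a small classification function. The sketch below is illustrative only: the enum names and the FailureClassifier class are hypothetical, and the PR's actual checks (which operate on the client's exception types) may be structured differently.

```java
// Hypothetical failure kinds mirroring the rows of the table.
enum CreationFailure {
  UNIMPLEMENTED, DATABASE_NOT_FOUND, INSTANCE_NOT_FOUND,
  DEADLINE_EXCEEDED, UNAVAILABLE, OTHER_TRANSIENT
}

// Hypothetical policies mirroring the "Behavior" column.
enum RetryPolicy { FALL_BACK_NEVER_RETRY, FAIL_NEVER_RETRY, RETRY_ON_NEXT_ACCESS }

final class FailureClassifier {
  private FailureClassifier() {}

  static RetryPolicy classify(CreationFailure failure) {
    switch (failure) {
      case UNIMPLEMENTED:
        // Multiplexed sessions are unsupported: use regular sessions instead.
        return RetryPolicy.FALL_BACK_NEVER_RETRY;
      case DATABASE_NOT_FOUND:
      case INSTANCE_NOT_FOUND:
        // The resource does not exist: retrying cannot help.
        return RetryPolicy.FAIL_NEVER_RETRY;
      default:
        // Everything else is treated as transient.
        return RetryPolicy.RETRY_ON_NEXT_ACCESS;
    }
  }
}
```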

Implementation Details

  1. State Management: Replaced the single SettableApiFuture with explicit state tracking:

    • AtomicReference<ApiFuture<SessionReference>> for the session
    • AtomicReference<Throwable> for the last creation error
    • ReentrantLock + CountDownLatch + AtomicBoolean for coordinating concurrent creation attempts

  2. Thread Safety: Creation attempts are coordinated under a blocking lock, which avoids a race in which a waiting thread reads an old latch that has already been counted down and replaced.

  3. Maintainer Behavior: The maintainer now starts even when the initial creation fails with a transient error, which enables background retries.
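A rough sketch of how these pieces might fit together is shown below. It is not the PR's implementation: RetryOnAccessHolder is a hypothetical class, the session is a plain String rather than an ApiFuture<SessionReference>, the Supplier stands in for the CreateSession RPC (throwing a RuntimeException on failure), and permanent-error handling is omitted. What it tries to show is the lock-protected decision between "start a new attempt" and "wait on the current latch".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch of retry-on-access with a single in-flight attempt.
final class RetryOnAccessHolder {
  private final Supplier<String> createSession; // stand-in for the CreateSession RPC
  private final AtomicReference<String> session = new AtomicReference<>();
  private final AtomicReference<Throwable> lastError = new AtomicReference<>();
  private final AtomicBoolean creationInProgress = new AtomicBoolean(false);
  private final ReentrantLock lock = new ReentrantLock();
  private CountDownLatch latch = new CountDownLatch(0); // guarded by lock

  RetryOnAccessHolder(Supplier<String> createSession) {
    this.createSession = createSession;
  }

  String getOrCreate() {
    CountDownLatch attempt;
    boolean creator;
    lock.lock();
    try {
      String existing = session.get();
      if (existing != null) {
        return existing;
      }
      // Decided under the lock, so a waiter can never pick up a latch that
      // has already been counted down and replaced by a newer attempt.
      creator = creationInProgress.compareAndSet(false, true);
      if (creator) {
        latch = new CountDownLatch(1);
      }
      attempt = latch;
    } finally {
      lock.unlock();
    }

    if (creator) {
      try {
        session.set(createSession.get());
      } catch (RuntimeException e) {
        lastError.set(e); // cached only until the next access retries
      } finally {
        lock.lock();
        try {
          creationInProgress.set(false);
        } finally {
          lock.unlock();
        }
        attempt.countDown(); // release every RPC waiting on this attempt
      }
    } else {
      try {
        attempt.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for session", e);
      }
    }

    String created = session.get();
    if (created != null) {
      return created;
    }
    throw new IllegalStateException("session creation failed", lastError.get());
  }
}
```

In this sketch the failed attempt's error is returned to the creator and to every waiter on that attempt, while the next call after the failure starts a fresh attempt.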

Behavior Diagram

flowchart TD
    subgraph "Session Creation Flow"
        A[Client Initialization] --> B[Trigger CreateSession RPC]
        B --> C{Creation Result?}
        
        C -->|Success| D[Store Session Reference]
        D --> E[Start Maintainer]
        E --> F[Session Ready for Use]
        
        C -->|UNIMPLEMENTED| G[Mark as Unimplemented]
        G --> H[Fall Back to Regular Sessions]
        
        C -->|DatabaseNotFoundException<br/>InstanceNotFoundException| I[Cache Error as Permanent]
        I --> J[Do NOT Start Maintainer]
        J --> K[Return Error on All RPCs - No Retry]
        
        C -->|Transient Error<br/>DEADLINE_EXCEEDED, UNAVAILABLE, etc.| L[Cache Error Temporarily]
        L --> M[Start Maintainer]
        M --> N[Allow Retry on Next RPC Access]
    end

    subgraph "Retry on Access Flow"
        O[RPC Request] --> P{Session Exists?}
        P -->|Yes| Q[Use Existing Session]
        P -->|No| R{Check Error Type}
        
        R -->|UNIMPLEMENTED| S[Throw UNIMPLEMENTED Error]
        R -->|ResourceNotFoundException| T[Throw ResourceNotFoundException]
        R -->|No Permanent Error| U{Creation In Progress?}
        
        U -->|Yes| V[Wait for Creation to Complete]
        U -->|No| W[Trigger New CreateSession RPC]
        W --> V
        V --> X{Creation Result?}
        X -->|Success| Q
        X -->|Failure| Y[Throw Error to Caller]
    end

    subgraph "Maintainer Flow"
        Z[Maintainer Running] --> AA{Session Expired or Null?}
        AA -->|No| AB[Sleep and Check Again]
        AB --> AA
        AA -->|Yes| AC{Permanent Error?}
        AC -->|UNIMPLEMENTED or ResourceNotFound| AD[Stop - No Retry]
        AC -->|No| AE[Trigger CreateSession RPC]
        AE --> AF{Creation Result?}
        AF -->|Success| AG[Update Session Reference]
        AG --> AB
        AF -->|Failure| AH[Log Error, Continue Loop]
        AH --> AB
    end
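One iteration of the maintainer flow above could be sketched as follows. This is a single-threaded illustration under stated assumptions: MaintainerSketch, its Outcome enum, and the PermanentFailure exception are hypothetical names, the Supplier stands in for the CreateSession RPC, and a real maintainer would run this step periodically on a background executor.

```java
import java.util.function.Supplier;

// Hypothetical single-step model of the maintainer loop in the diagram.
final class MaintainerSketch {
  enum Outcome { SLEPT, STOPPED, REFRESHED, FAILED }

  // Hypothetical marker for UNIMPLEMENTED / resource-not-found failures.
  static final class PermanentFailure extends RuntimeException {
    PermanentFailure(String message) {
      super(message);
    }
  }

  private final Supplier<String> createSession; // stand-in for the CreateSession RPC
  private String session;          // null until a creation succeeds
  private boolean expired;         // set when the session needs a refresh
  private boolean permanentError;  // set after a non-retryable failure

  MaintainerSketch(Supplier<String> createSession) {
    this.createSession = createSession;
  }

  void markExpired() {
    expired = true;
  }

  String currentSession() {
    return session;
  }

  // One pass of "Session Expired or Null?" -> "Permanent Error?" -> create.
  Outcome runOnce() {
    if (session != null && !expired) {
      return Outcome.SLEPT; // nothing to do until the next check
    }
    if (permanentError) {
      return Outcome.STOPPED; // never retry permanent errors
    }
    try {
      session = createSession.get();
      expired = false;
      return Outcome.REFRESHED;
    } catch (RuntimeException e) {
      // A real maintainer would log the error here and keep looping.
      permanentError = e instanceof PermanentFailure;
      return Outcome.FAILED;
    }
  }
}
```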

Concurrent Request Handling

When multiple RPCs arrive during a failed/pending session creation:

Timeline:
0s     - Initial CreateSession RPC sent
15s    - RPC1 arrives, waits for creation
20s    - RPC2 arrives, waits for creation  
25s    - RPC3 arrives, waits for creation
30s    - CreateSession fails with DEADLINE_EXCEEDED
         → All waiting RPCs (RPC1, RPC2, RPC3) receive the error
         → Maintainer is started for background retry
35s    - RPC4 arrives, triggers NEW CreateSession attempt
         → Other RPCs arriving now wait for this attempt
40s    - CreateSession succeeds
         → RPC4 and all waiting RPCs proceed with new session

@rahul2393 rahul2393 requested review from a team as code owners December 24, 2025 07:39
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: spanner Issues related to the googleapis/java-spanner API. labels Dec 24, 2025
@gemini-code-assist

Summary of Changes

Hello @rahul2393, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request makes multiplexed session creation in the Spanner client library more robust. It introduces a retry-on-access mechanism that lets the client recover from transient session creation failures by retrying automatically. Concurrency primitives synchronize session creation attempts, and the revised error handling ensures that only non-recoverable errors (like UNIMPLEMENTED) lead to a permanent fallback to regular sessions, which improves reliability when using multiplexed sessions.

Highlights

  • Concurrency Control for Session Creation: Introduced ReentrantLock and CountDownLatch to synchronize and manage concurrent attempts at creating multiplexed sessions, preventing race conditions and ensuring efficient waiting for session availability.
  • Retry-on-Access Mechanism: Implemented a getOrCreateSessionReference() method that automatically retries multiplexed session creation upon access if no session exists or if a previous creation attempt failed (unless the error is UNIMPLEMENTED).
  • Improved Error Handling and Fallback: Refactored session creation success and failure callbacks (onSessionCreatedSuccessfully, onSessionCreationFailed) to store transient errors and allow retries, while still correctly marking multiplexed sessions as unsupported and falling back to regular sessions for UNIMPLEMENTED errors.
  • Maintainer Integration: The MultiplexedSessionMaintainer now actively participates in the session creation retry logic, attempting to create or refresh sessions if needed and not already in progress.
  • Comprehensive Test Coverage: Added new unit tests to validate various scenarios, including sticky errors, successful retries after initial failures, and correct fallback behavior for UNIMPLEMENTED errors.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a retry-on-access mechanism for multiplexed session creation, which is a valuable improvement for resilience against transient failures. The new concurrency logic using ReentrantLock and CountDownLatch is a good approach. However, I've identified a critical race condition in the implementation that could lead to deadlocks. My review includes a suggested refactoring to resolve this issue, ensuring the concurrent code is safe. The accompanying tests for the new functionality are well-written and cover the intended behavior.

@rahul2393 rahul2393 force-pushed the fix-multiplex-session-init branch from 7b5db3d to de2a5f7 Compare December 24, 2025 08:07
@rahul2393 rahul2393 force-pushed the fix-multiplex-session-init branch from de2a5f7 to 81b39cd Compare December 24, 2025 08:12
@rahul2393 rahul2393 force-pushed the fix-multiplex-session-init branch from 81b39cd to 3235d25 Compare December 24, 2025 08:23
@rahul2393 rahul2393 closed this Dec 24, 2025