Skip to content

fix: evict cached Temporal client on bad-client failures#329

Open
rupesh-parab-one-app wants to merge 1 commit into
temporalio:mainfrom
rupesh-parab-one-app:fix/evict-client-on-deletion-cleanup-failure
Open

fix: evict cached Temporal client on bad-client failures#329
rupesh-parab-one-app wants to merge 1 commit into
temporalio:mainfrom
rupesh-parab-one-app:fix/evict-client-on-deletion-cleanup-failure

Conversation

@rupesh-parab-one-app
Copy link
Copy Markdown

@rupesh-parab-one-app rupesh-parab-one-app commented May 21, 2026

What

Cached Temporal SDK clients are now evicted consistently when the controller observes failures that indicate the cached client may no longer be usable.

This expands the original deletion-cleanup-only fix to cover both places that reuse a cached client:

  • the main Reconcile path after GetWorkerDeploymentState / DescribeWorkerDeployment
  • handleDeletion while cleaning up Temporal server-side Worker Deployment data

The shared shouldEvictClient(err) predicate keeps the existing auth behavior and adds bounded recovery for transport/connectivity failures:

  • serviceerror.PermissionDenied
  • Unauthenticated
  • context.DeadlineExceeded
  • serviceerror.Unavailable

It intentionally does not evict on broader server/application responses such as ResourceExhausted, context.Canceled, or NotFound.

Why

In #328 we observed the controller repeatedly reusing the same cached SDK client after DescribeWorkerDeployment returned context.DeadlineExceeded. The controller did not recover until the manager pod restarted and dropped the in-memory pool.

The earlier narrow version of this PR only evicted in handleDeletion, but reviewers correctly pointed out that the main Reconcile path has the same shape: get a cached client, call Describe, then requeue on transport failure without evicting. This PR now closes that recovery gap in both paths.

Changes

  • internal/controller/worker_controller.go
    • adds shouldEvictClient(err) next to isAccessDeniedErr
    • replaces main Reconcile isAccessDeniedErr eviction checks with shouldEvictClient
    • updates handleDeletion to use the same predicate instead of evicting on every non-nil return
  • internal/controller/reconciler_events_test.go
    • adds TestShouldEvictClient to lock down included/excluded error classes
    • adds TestReconcile_EvictsCachedClientOnTransportFailure
    • keeps the deletion cleanup regression test and its NotFound-retains-client case
    • keeps the no-op Close() on stubTemporalClient so EvictClient can close test clients safely

Test plan

  • go test ./internal/controller -run 'Test(ShouldEvictClient|Reconcile_EvictsCachedClientOnTransportFailure|HandleDeletion_EvictsCachedClientOnTemporalFailure|Reconcile_DescribeWorkerDeploymentNotFound)' -count=1
  • go test ./internal/controller -count=1

Closes #328.

@rupesh-parab-one-app rupesh-parab-one-app requested review from a team and jlegrone as code owners May 21, 2026 13:28
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 21, 2026

CLA assistant check
All committers have signed the CLA.

Reusing a cached SDK client after access or transport failures can keep the
controller wedged on the same unhealthy client until the manager pod restarts
and drops the in-memory pool.

Centralize the eviction decision in shouldEvictClient and use it from both the
main Reconcile path and WorkerDeployment deletion cleanup. The predicate keeps
the existing PermissionDenied/Unauthenticated behavior and adds transport cases
that benefit from redialing: context.DeadlineExceeded and serviceerror.Unavailable.
It intentionally leaves ResourceExhausted, context.Canceled, and domain responses
such as NotFound alone so ordinary server-side or lifecycle responses do not
churn otherwise healthy clients.

Add regressions for Reconcile and deletion cleanup so a cached client returning
context.DeadlineExceeded is evicted before the next reconcile retries.
@rupesh-parab-one-app rupesh-parab-one-app force-pushed the fix/evict-client-on-deletion-cleanup-failure branch from 9b56a0d to 8bc7c67 Compare May 27, 2026 04:38
@rupesh-parab-one-app rupesh-parab-one-app changed the title fix: evict cached Temporal client when deletion cleanup fails fix: evict cached Temporal client on bad-client failures May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WorkerDeployment deletion finalizer infinite-loops on transport-level Temporal SDK errors (cached client never evicted)

2 participants