fix: evict cached Temporal client on bad-client failures #329
Open
rupesh-parab-one-app wants to merge 1 commit into
Reusing a cached SDK client after access or transport failures can keep the controller wedged on the same unhealthy client until the manager pod restarts and drops the in-memory pool. Centralize the eviction decision in shouldEvictClient and use it from both the main Reconcile path and WorkerDeployment deletion cleanup. The predicate keeps the existing PermissionDenied/Unauthenticated behavior and adds transport cases that benefit from redialing: context.DeadlineExceeded and serviceerror.Unavailable. It intentionally leaves ResourceExhausted, context.Canceled, and domain responses such as NotFound alone so ordinary server-side or lifecycle responses do not churn otherwise healthy clients. Add regressions for Reconcile and deletion cleanup so a cached client returning context.DeadlineExceeded is evicted before the next reconcile retries.
What

Cached Temporal SDK clients are now evicted consistently when the controller observes failures that indicate the cached client may no longer be usable.

This expands the original deletion-cleanup-only fix to cover both places that reuse a cached client:
- Reconcile path after GetWorkerDeploymentState / DescribeWorkerDeployment
- handleDeletion while cleaning up Temporal server-side Worker Deployment data

The shared shouldEvictClient(err) predicate keeps the existing auth behavior and adds bounded recovery for transport/connectivity failures:
- serviceerror.PermissionDenied
- Unauthenticated
- context.DeadlineExceeded
- serviceerror.Unavailable

It intentionally does not evict on broader server/application responses such as ResourceExhausted, context.Canceled, or NotFound.

Why

In #328 we observed the controller repeatedly reusing the same cached SDK client after DescribeWorkerDeployment returned context.DeadlineExceeded. The controller did not recover until the manager pod restarted and dropped the in-memory pool.

The earlier narrow version of this PR only evicted in handleDeletion, but reviewers correctly pointed out that the main Reconcile path has the same shape: get a cached client, call Describe, then requeue on transport failure without evicting. This PR now closes that recovery gap in both paths.

Changes

internal/controller/worker_controller.go
- Add shouldEvictClient(err) next to isAccessDeniedErr
- Replace isAccessDeniedErr eviction checks with shouldEvictClient
- Update handleDeletion to use the same predicate instead of evicting on every non-nil return

internal/controller/reconciler_events_test.go
- Add TestShouldEvictClient to lock down included/excluded error classes
- Add TestReconcile_EvictsCachedClientOnTransportFailure
- Add Close() on stubTemporalClient so EvictClient can close test clients safely

Test plan

- go test ./internal/controller -run 'Test(ShouldEvictClient|Reconcile_EvictsCachedClientOnTransportFailure|HandleDeletion_EvictsCachedClientOnTemporalFailure|Reconcile_DescribeWorkerDeploymentNotFound)' -count=1
- go test ./internal/controller -count=1

Closes #328.