[client] Fix stale metadata on readOnlyGateway by adding RetryableGatewayClientProxy#3390
[client] Fix stale metadata on readOnlyGateway by adding RetryableGatewayClientProxy#3390loserwang1024 wants to merge 1 commit into
Conversation
fresh-borzoni
left a comment
There was a problem hiding this comment.
@loserwang1024 Thank you for the very important PR, left some comments, PTAL
| private final AdminReadOnlyGateway readOnlyGateway; | ||
| private final MetadataUpdater metadataUpdater; | ||
|
|
||
| private static final int READ_ONLY_GATEWAY_MAX_RETRIES = 3; |
There was a problem hiding this comment.
With maxRetries=3, bootstrap reinit needs 4 refreshes. You only get 3 per request.
Shall we loop inside updateMetadata until either success or null-triggered bootstrap?
| cause); | ||
| // Run metadata refresh and retry on a separate thread to avoid | ||
| // blocking Netty IO threads that may complete the failed future. | ||
| CompletableFuture.runAsync( |
There was a problem hiding this comment.
do we want some backoff?
I mean 3 retries fire in milliseconds, seems wasteful on slow DNS or restarting pods.
| * <li>Metadata refresh is triggered, which marks the failed server as unavailable | ||
| * <li>After N failed refreshes, all servers are marked unavailable, triggering re-initialization | ||
| * from bootstrap servers | ||
| * <li>The next retry succeeds with the refreshed server addresses |
There was a problem hiding this comment.
ditto: only true when maxRetries > cluster_size
| GatewayClientProxy.createGatewayProxy( | ||
| metadataUpdater::getCoordinatorServer, client, AdminGateway.class); | ||
| this.readOnlyGateway = | ||
| AdminGateway rawReadOnlyGateway = |
There was a problem hiding this comment.
Shall we add TODO for writes, since they are still broken?
| private final AdminReadOnlyGateway readOnlyGateway; | ||
| private final MetadataUpdater metadataUpdater; | ||
|
|
||
| private static final int READ_ONLY_GATEWAY_MAX_RETRIES = 3; |
There was a problem hiding this comment.
Shall we make it a ConfigOption to make more operator-friendly?
| this.readOnlyGateway = | ||
| AdminGateway rawReadOnlyGateway = | ||
| GatewayClientProxy.createGatewayProxy( | ||
| metadataUpdater::getRandomTabletServer, client, AdminGateway.class); |
There was a problem hiding this comment.
AdminReadOnlyGateway.class?
| cause); | ||
| // Run metadata refresh and retry on a separate thread to avoid | ||
| // blocking Netty IO threads that may complete the failed future. | ||
| CompletableFuture.runAsync( |
There was a problem hiding this comment.
runAsync without an executor uses ForkJoinPool.commonPool(), should we use a dedicated executor instead?
| cause); | ||
| // Run metadata refresh and retry on a separate thread to avoid | ||
| // blocking Netty IO threads that may complete the failed future. | ||
| CompletableFuture.runAsync( |
There was a problem hiding this comment.
Every retry here fires its own updateMetadata call, and that method's synchronized(this) block is the same paths use to refresh leader info.
Example: during a rolling upgrade, N concurrent failing admin calls × 3 retries all queue up behind one lock, and the data plane's refreshes wait in the same line.
Could we share one in-flight refresh across concurrent retriers?
Purpose
Linked issue: close #3389
Brief change log
Tests
API and Format
Documentation