HDDS-14793. Fix race condition in XceiverClientGrpc#connectToDatanode causing intermittent NPEs.#9997
Conversation
…rpc#sendCommandAsync`.
…e` causing intermittent NPEs.
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @ptlrs for working on this.
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
Outdated
Show resolved
Hide resolved
|
Thanks for the review @adoroszlai, I have pushed the fixes. |
hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
Outdated
Show resolved
Hide resolved
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @ptlrs for updating the patch. compute logic looks good now.
| if (details == null || !dnChannelInfoMap.containsKey(details.getID())) { | ||
| return false; | ||
| } | ||
|
|
||
| private boolean isConnected(ManagedChannel channel) { | ||
| return channel != null && !channel.isTerminated() && !channel.isShutdown(); | ||
| return !dnChannelInfoMap.get(details.getID()).isChannelInactive(); |
There was a problem hiding this comment.
Avoid separate containsKey and get.
if (details == null) {
return false;
}
ChannelInfo info = dnChannelInfoMap.get(details.getID());
return info != null && !info.isChannelInactive();| .map(ChannelInfo::getChannel) | ||
| .collect(Collectors.toList()); |
There was a problem hiding this comment.
ChannelInfo.channel may be null (or at least there are checks for that elsewhere).
| .map(ChannelInfo::getChannel) | |
| .collect(Collectors.toList()); | |
| .map(ChannelInfo::getChannel) | |
| .filter(Objects::nonNull) | |
| .collect(Collectors.toList()); |
| throw new IOException("This channel is not connected."); | ||
| } | ||
|
|
||
| ManagedChannel channel = channels.get(dn.getID()); | ||
| // If the channel doesn't exist for this specific datanode or the channel | ||
| // is closed, just reconnect | ||
| if (!isConnected(channel)) { | ||
| // If the channel doesn't exist for this specific datanode or the channel is closed, just reconnect | ||
| if (!isConnected(dn)) { | ||
| reconnect(dn); | ||
| } | ||
|
|
||
| } | ||
|
|
||
| private void reconnect(DatanodeDetails dn) | ||
| throws IOException { | ||
| ManagedChannel channel; | ||
| try { | ||
| connectToDatanode(dn); | ||
| channel = channels.get(dn.getID()); | ||
| } catch (Exception e) { | ||
| throw new IOException("Error while connecting", e); | ||
| } | ||
|
|
||
| if (!isConnected(channel)) { | ||
| if (!isConnected(dn)) { | ||
| throw new IOException("This channel is not connected."); | ||
| } |
There was a problem hiding this comment.
Given that connectToDatanode handles all cases ("closed", "already connected", "needs new connection"), checkOpen can be simplified:
connectToDatanode(dn);
if (!isConnected(dn)) {
throw new IOException("This channel is not connected.");
}and reconnect can be removed.
| } catch (RuntimeException e) { | ||
| LOG.error("Failed to create channel to datanode {}", dn, e); | ||
| throw new IOException(e.getCause()); |
There was a problem hiding this comment.
We need to keep the translation back from unchecked exception to IOException, because callers may not handle the former.
try {
dnChannelInfoMap.compute(...);
} catch (UncheckedIOException e) {
LOG.error("Failed to create channel to datanode {}", dn, e);
throw e.getCause();
}…pc to prevent runtime exceptions.
|
@adoroszlai thanks for the instructions on how to run the Flaky test CI . After running it on master and a test branch which has the latest changes of this PR we see the following results:
|
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @ptlrs for updating the patch.
| try { | ||
| connectToDatanode(dn); | ||
| channel = channels.get(dn.getID()); | ||
| } catch (Exception e) { | ||
| throw new IOException("Error while connecting", e); | ||
| } |
There was a problem hiding this comment.
With simplified checkOpen, there are two test cases that need to be updated, sorry for missing that.
Expecting actual:
"Error while connecting"
to contain:
"This channel is not connected"
at org.apache.hadoop.hdds.scm.TestXceiverClientManager.testFreeByReference(TestXceiverClientManager.java:162)
...
at org.apache.hadoop.hdds.scm.TestXceiverClientManager.testFreeByEviction(TestXceiverClientManager.java:211)
Let's remove this try-catch, we don't need to wrap the exception from connectToDatanode.
With that, exception will have the correct message (Client is closed). Please update the test to reflect that.
diff --git hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
index a9dbcb9456..1f9ac0a122 100644
--- hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
+++ hadoop-hdds/client/src/main/java/org/apache/hadoop/hdds/scm/XceiverClientGrpc.java
@@ -738,11 +738,7 @@ private void decreasePendingMetricsAndReleaseSemaphore() {
private void checkOpen(DatanodeDetails dn)
throws IOException {
- try {
- connectToDatanode(dn);
- } catch (Exception e) {
- throw new IOException("Error while connecting", e);
- }
+ connectToDatanode(dn);
if (!isConnected(dn)) {
throw new IOException("This channel is not connected.");
diff --git hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestXceiverClientManager.java hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestXceiverClientManager.java
index 06d19e5575..9468cec94f 100644
--- hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestXceiverClientManager.java
+++ hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/TestXceiverClientManager.java
@@ -159,7 +159,7 @@ public void testFreeByReference(@TempDir Path metaDir) throws IOException {
Throwable t = assertThrows(IOException.class,
() -> ContainerProtocolCalls.createContainer(client1,
container1.getContainerInfo().getContainerID(), null));
- assertThat(t.getMessage()).contains("This channel is not connected");
+ assertThat(t.getMessage()).contains("Client is closed");
clientManager.releaseClient(client2, false);
}
@@ -208,7 +208,7 @@ public void testFreeByEviction(@TempDir Path metaDir) throws IOException {
Throwable t = assertThrows(IOException.class,
() -> ContainerProtocolCalls.createContainer(client1,
container1.getContainerInfo().getContainerID(), null));
- assertThat(t.getMessage()).contains("This channel is not connected");
+ assertThat(t.getMessage()).contains("Client is closed");
clientManager.releaseClient(client2, false);
}
What changes were proposed in this pull request?
The
XceiverClientGrpc#connectToDatanodeintermittently fails with an NPE.The problem is that for a given datanode, there is a race condition between creating a channel and creating a stub.
When a new channel is created for a DN, it is put into the
channelsmap. However, presence of a channel in the map does not imply that the corresponding stub for the same DN also exists in theasyncStubsmap.If the stub is accessed after creating a channel but before the creation of stub, we can get an NPE.
This PR fixes the problem by:
dnChannelInfoMapfor both the channels and stubs instead of two independent mapsChannelInfoclass to group the channel and stubWhat is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14793
How was this patch tested?
CI: https://github.com/ptlrs/ozone/actions/runs/23703558972
Flaky test runner: https://github.com/ptlrs/ozone/actions/runs/23823575690