HDFS-17916: DataStreamer#processDatanodeOrExternalError() fails to return byte arrays to ByteArrayManager#8466

Merged
steveloughran merged 2 commits into apache:trunk from HubSpot:HDFS-17916
May 26, 2026

Conversation

@charlesconnell
Contributor

Description of PR

A certain code path in the DFS client DataStreamer appears to discard DFSPacket objects without returning their contained byte arrays to the ByteArrayManager. I discovered this bug at my company after we had HBase server threads hung for hours at ByteArrayManager#allocate(). Because the leak only happens in an error-handling path, the problem requires an unhealthy HDFS cluster in order to be exposed.

I took a heap dump of a high-uptime but relatively healthy HBase server, and found evidence of leaked byte arrays there too. In the heap dump, the two FixedLengthManagers both had numAllocated = 9, but there were zero live DFSPacket objects. This suggests that the byte arrays and their containing DFSPackets had been garbage collected without FixedLengthManager ever being told they were released.
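
The reasoning above can be illustrated with a small sketch. BoundedArrayPool and LeakDemo below are hypothetical stand-ins, not Hadoop's ByteArrayManager or FixedLengthManager, and the timeout on allocate() is an assumption added so the demo terminates instead of hanging. The point is only that the garbage collector reclaiming an array does not change the pool's own count.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical bounded pool; it is NOT Hadoop's ByteArrayManager. */
final class BoundedArrayPool {
  private final Semaphore permits;
  private final int limit;
  private final int arrayLength;

  BoundedArrayPool(int limit, int arrayLength) {
    this.limit = limit;
    this.arrayLength = arrayLength;
    this.permits = new Semaphore(limit);
  }

  /** Returns a new array, or null if no permit frees up within the timeout. */
  byte[] allocate(long timeoutMillis) {
    try {
      if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return null;
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
    return new byte[arrayLength];
  }

  /** Hands the permit back; this call is the only way the pool learns an array is free. */
  void release(byte[] array) {
    permits.release();
  }

  /** Number of arrays the pool believes are still in use. */
  int numAllocated() {
    return limit - permits.availablePermits();
  }
}

final class LeakDemo {
  /** Allocates n arrays, drops every reference without release(), and returns the pool's count. */
  static int leakAll(BoundedArrayPool pool, int n) {
    for (int i = 0; i < n; i++) {
      pool.allocate(10); // result discarded: eligible for GC, but the permit is never returned
    }
    return pool.numAllocated();
  }

  public static void main(String[] args) {
    BoundedArrayPool pool = new BoundedArrayPool(9, 1024);
    System.out.println("numAllocated after leak: " + leakAll(pool, 9));
    System.out.println("next allocate succeeded: " + (pool.allocate(50) != null));
  }
}
```

In this sketch the count stays at the limit with no live arrays, and the next allocate() times out. With an allocator that waits indefinitely instead of timing out, the same state would show up as a hung caller.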

In DataStreamer.java starting at line 1410, the DFSPacket that is remove()'d from dataQueue is allowed to be garbage collected without further interaction.

    if (!streamerClosed && dfsClient.clientRunning) {
      if (stage == BlockConstructionStage.PIPELINE_CLOSE) {
        // If we had an error while closing the pipeline, we go through a fast-path
        // where the BlockReceiver does not run. Instead, the DataNode just finalizes
        // the block immediately during the 'connect ack' process. So, we want to pull
        // the end-of-block packet from the dataQueue, since we don't actually have
        // a true pipeline to send it over.
        //
        // We also need to set lastAckedSeqno to the end-of-block Packet's seqno, so that
        // a client waiting on close() will be aware that the flush finished.
        synchronized (dataQueue) {
          DFSPacket endOfBlockPacket = dataQueue.remove();  // remove the end of block packet
          // Close any trace span associated with this Packet
          Span span = endOfBlockPacket.getSpan();
          if (span != null) {
            span.finish();
            endOfBlockPacket.setSpan(null);
          }
          assert endOfBlockPacket.isLastPacketInBlock();
          assert lastAckedSeqno == endOfBlockPacket.getSeqno() - 1;
          lastAckedSeqno = endOfBlockPacket.getSeqno();
          pipelineRecoveryCount = 0;
          dataQueue.notifyAll();
        }
        endBlock();
      } else {
        initDataStreaming();
      }
    } 

This PR adds this line in order to return the packet's buffer to the ByteArrayManager:

endOfBlockPacket.releaseBuffer(byteArrayManager);
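
As a rough illustration of why that one call matters, here is a self-contained sketch. CountingPool, Packet, and FastPathDemo are hypothetical names invented for this example and do not reproduce DFSPacket or ByteArrayManager; in particular, making releaseBuffer() a no-op on a second call is an assumption of the sketch, not a statement about Hadoop's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical pool that only tracks how many arrays are outstanding. */
final class CountingPool {
  private final AtomicInteger outstanding = new AtomicInteger();

  byte[] allocate(int length) {
    outstanding.incrementAndGet();
    return new byte[length];
  }

  void release(byte[] array) {
    outstanding.decrementAndGet();
  }

  int outstanding() {
    return outstanding.get();
  }
}

/** Hypothetical packet that owns exactly one pooled buffer. */
final class Packet {
  private byte[] buf;

  Packet(CountingPool pool, int length) {
    this.buf = pool.allocate(length);
  }

  /** Returns the buffer to the pool once; later calls do nothing (an assumption of this sketch). */
  void releaseBuffer(CountingPool pool) {
    if (buf != null) {
      pool.release(buf);
      buf = null;
    }
  }
}

final class FastPathDemo {
  /** Removes the head packet from a queue and optionally releases its buffer before dropping it. */
  static int drainHead(boolean releaseOnRemove) {
    CountingPool pool = new CountingPool();
    Deque<Packet> dataQueue = new ArrayDeque<>();
    dataQueue.add(new Packet(pool, 512));

    Packet endOfBlock = dataQueue.remove();
    if (releaseOnRemove) {
      endOfBlock.releaseBuffer(pool);
    }
    // endOfBlock goes out of scope here; only the explicit release updates the pool.
    return pool.outstanding();
  }

  public static void main(String[] args) {
    System.out.println("outstanding without release: " + drainHead(false));
    System.out.println("outstanding with release: " + drainHead(true));
  }
}
```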

Contains content generated by Claude Opus 4.7

How was this patch tested?

New unit tests added

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

If an AI tool was used:

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ docker 0m 32s Docker failed to build run-specific yetus/hadoop:tp-2658}.
Subsystem Report/Notes
GITHUB PR #8466
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8466/1/console
versions git=2.34.1
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@charlesconnell
Contributor Author

Yetus failure appears unrelated to this PR

Contributor

@ZanderXu ZanderXu left a comment

LGTM +1

@ndimiduk
Member

ndimiduk commented May 8, 2026

Looks right by me as well. Heya @ZanderXu , anything else to do here? Do you mind handling the commits on this one? I'm not technically a committer.

@ndimiduk
Member

@steveloughran @cnauroth @jojochuang any chance you have a moment to see this one home? Much appreciated!

// release() is performed by the streamer thread, so allow a brief
// moment for that thread to settle after close() returns.
GenericTestUtils.waitFor(() -> managers.countAllocated() == 0, 50, 5000);
assertEquals(0, managers.countAllocated(),
Contributor

assertj

assertThat(managers.countAllocated())
  .describedAs("count allocated")
  .isEqualTo(0);

Contributor Author

done
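
The wait-then-assert shape used in the test under review can be sketched without any Hadoop or AssertJ dependency. Poll.waitFor below is a hypothetical helper written for this example; it only imitates the idea of polling a condition at an interval with an overall timeout, and its boolean return value is a simplification chosen here. The 100 ms delay in demo() is an arbitrary stand-in for a background thread that finishes its release shortly after the caller returns.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

final class Poll {
  /** Re-checks the condition every intervalMillis until it holds or timeoutMillis elapses. */
  static boolean waitFor(BooleanSupplier condition, long intervalMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        return false;
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  /** Simulates a background thread that releases one buffer shortly after the caller returns. */
  static boolean demo() {
    AtomicInteger allocated = new AtomicInteger(1);
    Thread streamer = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      allocated.decrementAndGet();
    });
    streamer.start();
    // Asserting immediately would race with the background thread; poll first.
    return waitFor(() -> allocated.get() == 0, 50, 5000);
  }

  public static void main(String[] args) {
    System.out.println("count reached zero: " + demo());
  }
}
```

Polling first and asserting afterwards keeps the final assertion meaningful: if the wait times out, the assertion message still reports the actual count.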


ByteArrayManager bam =
fs.getClient().getClientContext().getByteArrayManager();
assertTrue(bam instanceof ByteArrayManager.Impl,
Contributor

Can you use AssertJ asserts, as they can be backported to branch-3.4 (Java 8, JUnit 4)?

assertThat(bam)
  .describedAs("expected bounded ByteArrayManager")
  .isInstanceOf(ByteArrayManager.Impl.class);

Contributor Author

done

@steveloughran
Contributor

@ndimiduk I don't go near HDFS, but for you I will, especially as this is so simple production-side. I commented on the tests, mainly about using AssertJ.

@charlesconnell once this is in, you'll need to submit backport PRs to branch-3.5 (straightforward) and branch-3.4, where the JUnit 4/5 migration adds work. Using AssertJ means the only homework here is changing the import of the Test class.

@charlesconnell
Contributor Author

Thank you for the review @steveloughran, I've made your requested changes

Contributor

@steveloughran steveloughran left a comment

+1 pending test results

@charlesconnell
Contributor Author

Tests ran for 24 hours and were killed

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 54s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 1m 57s Maven dependency ordering for branch
+1 💚 mvninstall 52m 41s trunk passed
+1 💚 compile 5m 57s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 6m 29s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 checkstyle 2m 14s trunk passed
+1 💚 mvnsite 3m 16s trunk passed
+1 💚 javadoc 2m 33s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 2m 35s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 8m 14s trunk passed
+1 💚 shadedclient 37m 3s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 30s Maven dependency ordering for patch
+1 💚 mvninstall 2m 15s the patch passed
+1 💚 compile 5m 25s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javac 5m 25s the patch passed
+1 💚 compile 6m 2s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 javac 6m 2s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 45s the patch passed
+1 💚 mvnsite 2m 27s the patch passed
+1 💚 javadoc 1m 36s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 javadoc 1m 41s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 spotbugs 7m 40s the patch passed
+1 💚 shadedclient 36m 56s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 2m 37s hadoop-hdfs-client in the patch passed.
-1 ❌ unit 255m 0s /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt hadoop-hdfs in the patch passed.
+1 💚 asflicense 0m 51s The patch does not generate ASF License warnings.
447m 26s
Reason Tests
Failed junit tests hadoop.hdfs.server.datanode.fsdataset.impl.TestFsVolumeList
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8466/3/artifact/out/Dockerfile
GITHUB PR #8466
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 3447ae1728f0 5.15.0-174-generic #184-Ubuntu SMP Fri Mar 13 18:41:50 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 853a790
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8466/3/testReport/
Max. process+thread count 2382 (vs. ulimit of 10000)
modules C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8466/3/console
versions git=2.43.0 maven=3.9.15 spotbugs=4.9.7
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@charlesconnell
Contributor Author

charlesconnell commented May 22, 2026

@steveloughran I believe that the failed tests here are unrelated to my change. If you are satisfied with it, could you please merge?

Contributor

@steveloughran steveloughran left a comment

+1
failures look unrelated to me too, given this is the error case. I have just had to review this code carefully though to be confident that this is good.

Can I add that the code around line 1292 also seems to need a similar cleanup: the buffer release should be moved into the finally block, rather than sitting at the end of the successful path.
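
For readers unfamiliar with the pattern being suggested, the sketch below contrasts releasing at the end of the success path with releasing in a finally block. FinallyReleaseDemo and its methods are hypothetical and model only the general idiom; the sketch makes no claim about what the DataStreamer code near line 1292 actually does or whether it can fail between acquiring and releasing.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class FinallyReleaseDemo {
  /** Releases only at the end of the success path; an exception skips the release. */
  static void releaseAtEnd(AtomicInteger outstanding, boolean fail) {
    outstanding.incrementAndGet();   // buffer handed out
    if (fail) {
      throw new IllegalStateException("simulated failure");
    }
    outstanding.decrementAndGet();   // buffer returned
  }

  /** Releases in finally, so the buffer is returned on both the success and failure paths. */
  static void releaseInFinally(AtomicInteger outstanding, boolean fail) {
    outstanding.incrementAndGet();   // buffer handed out
    try {
      if (fail) {
        throw new IllegalStateException("simulated failure");
      }
    } finally {
      outstanding.decrementAndGet(); // buffer returned regardless of outcome
    }
  }

  /** Runs one variant with a simulated failure and returns how many buffers remain outstanding. */
  static int leftoverAfterFailure(boolean useFinally) {
    AtomicInteger outstanding = new AtomicInteger();
    try {
      if (useFinally) {
        releaseInFinally(outstanding, true);
      } else {
        releaseAtEnd(outstanding, true);
      }
    } catch (IllegalStateException expected) {
      // The failure is part of the demonstration.
    }
    return outstanding.get();
  }

  public static void main(String[] args) {
    System.out.println("release at end, after failure: " + leftoverAfterFailure(false));
    System.out.println("release in finally, after failure: " + leftoverAfterFailure(true));
  }
}
```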

@steveloughran steveloughran merged commit 135e36a into apache:trunk May 26, 2026
3 of 6 checks passed
@steveloughran
Contributor

@charlesconnell merged to trunk. Can you submit a backport PR against branch-3.5 so we can see how things go there?

@charlesconnell
Contributor Author

@steveloughran Thank you for the merge. The backport is open at #8516.

The code around line 1292 could be made easier to reason about, but I don't think there is a byte array leak there. The DFSPacket held in the local variable named one stays in the ackQueue until line 1288. I think we can assume nothing can fail between line 1288, when it's taken from the ackQueue, and line 1292, when it's freed. Any packets that are still in the ackQueue will be freed eventually when closeInternal() runs, which would happen if there were failures elsewhere in the work() loop.

@charlesconnell charlesconnell deleted the HDFS-17916 branch May 26, 2026 13:39

5 participants