HDDS-15382. Close idle connection for Datanode GRPC server #10371

Open
ChenSammi wants to merge 2 commits into apache:master from ChenSammi:HDDS-15382

@ChenSammi
Contributor

What changes were proposed in this pull request?

HDDS-15149 limits the number of pending connections allowed for the Datanode gRPC server.

This JIRA adds another defensive protection: close a connection if it has been idle for 15 minutes.

It also reduces so.backlog from the default of 4096 to 256. 4096 seems excessive, and the IPC server's so.backlog is also 256.
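
For readers unfamiliar with the backlog setting, here is a minimal JDK-only sketch of what the value controls. It is not code from this PR: the Datanode sets the backlog through the Netty-based gRPC server, while this sketch uses a plain `java.net.ServerSocket`, and the class and method names are hypothetical. The backlog is only a hint for the length of the queue of connections waiting to be accepted, and the operating system may clamp it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical illustration of a listen backlog; not code from this PR. */
final class BacklogSketch {
  // Default proposed in this PR for hdds.datanode.grpc.so.backlog.
  static final int PROPOSED_BACKLOG = 256;

  /**
   * Opens a listening socket on an ephemeral loopback port with the given
   * backlog hint, then closes it and returns the port that was bound.
   */
  static int bindWithBacklog(int backlog) {
    // The backlog is a hint for the pending-connection queue length;
    // the OS is free to apply its own upper limit.
    try (ServerSocket server =
             new ServerSocket(0, backlog, InetAddress.getLoopbackAddress())) {
      return server.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("bound port " + bindWithBacklog(PROPOSED_BACKLOG));
  }
}
```

A smaller backlog means fewer not-yet-accepted connections can queue up before new attempts are refused or delayed, which is the kind of bound this change is after.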

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-15382

How was this patch tested?

Existing unit tests.

Contributor

Copilot AI left a comment

Pull request overview

This PR hardens the Datanode gRPC server against connection exhaustion and lingering idle clients by introducing server-side idle/keepalive handling and by lowering the default socket backlog to align with IPC defaults.

Changes:

  • Configure the Datanode gRPC server to enforce a max idle connection time and to use HTTP/2 keepalive pings/timeouts.
  • Reduce the default hdds.datanode.grpc.so.backlog from 4096 to 256.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/XceiverServerGrpc.java
Description: Adds gRPC Netty server connection idle and keepalive settings.

File: hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
Description: Lowers the default gRPC server socket backlog and updates the config default.

.executor(readExecutors)
// If a client does not send an actual functional business RPC for 15 minutes,
// the server kicks them off with a GOAWAY frame.
.maxConnectionIdle(15, TimeUnit.MINUTES)

Should we also set maxConnectionAge() to 1 hour?

Contributor Author

For a long-running application, an Ozone client can exist for hours, even days.
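
To make the difference between the two limits concrete, here is a small sketch. It is a simplified model, not gRPC's implementation: the class and method names are hypothetical, idleness is modelled as the time since the connection last had an RPC in flight, and details such as jitter or grace periods are ignored. The 1-hour age value is the reviewer's suggestion and is not part of this PR.

```java
import java.time.Duration;

/** Hypothetical model of an idle limit versus an age limit. */
final class ConnectionLimitsSketch {
  // Idle limit set in this PR.
  static final Duration MAX_IDLE = Duration.ofMinutes(15);
  // Age limit suggested in review; not adopted in this PR.
  static final Duration MAX_AGE = Duration.ofHours(1);

  /** True when the connection has had no RPC in flight for at least MAX_IDLE. */
  static boolean closedByIdleLimit(Duration sinceLastActiveRpc) {
    return sinceLastActiveRpc.compareTo(MAX_IDLE) >= 0;
  }

  /** True when the connection has existed for at least MAX_AGE, busy or not. */
  static boolean closedByAgeLimit(Duration sinceConnect) {
    return sinceConnect.compareTo(MAX_AGE) >= 0;
  }

  public static void main(String[] args) {
    // A client connected for 6 hours whose last RPC finished 10 seconds ago.
    Duration age = Duration.ofHours(6);
    Duration idle = Duration.ofSeconds(10);
    System.out.println("idle limit closes it: " + closedByIdleLimit(idle));
    System.out.println("age limit closes it: " + closedByAgeLimit(age));
  }
}
```

In this model an active long-lived client is untouched by the idle limit but would be cut off by an age limit, which is the concern raised in the reply above.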

// If the server fires a ping and the client fails to respond with a
// PING ACK within 30 seconds, the server assumes the socket is a dead
// "zombie connection" and immediately destroys the TCP socket.
.keepAliveTimeout(30, TimeUnit.SECONDS)

15s?

Contributor Author

15s is a little aggressive.

Comment on lines +147 to +148
// if the network wire or client machine is still alive.
.keepAliveTime(5, TimeUnit.MINUTES)

1min?

Contributor Author

1min is a little aggressive.
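
As a rough way to compare the values discussed in the two keepalive threads, the sketch below works out the trade-off as plain arithmetic. It assumes a silent dead peer is noticed after roughly one ping interval plus the ack timeout, and that one ping is sent per interval on an otherwise quiet connection. Both are approximations, and the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical arithmetic helper for comparing keepalive settings. */
final class KeepAliveWindowSketch {
  /** Approximate time to notice a dead peer: ping interval plus ack timeout. */
  static Duration detectionWindow(Duration keepAliveTime, Duration keepAliveTimeout) {
    return keepAliveTime.plus(keepAliveTimeout);
  }

  /** Pings per hour on a connection that carries no other traffic. */
  static long pingsPerHour(Duration keepAliveTime) {
    return Duration.ofHours(1).getSeconds() / keepAliveTime.getSeconds();
  }

  public static void main(String[] args) {
    // Values in this PR: ping every 5 minutes, wait 30 seconds for the ack.
    Duration prWindow =
        detectionWindow(Duration.ofMinutes(5), Duration.ofSeconds(30));
    // Values suggested in review: ping every minute, wait 15 seconds.
    Duration suggestedWindow =
        detectionWindow(Duration.ofMinutes(1), Duration.ofSeconds(15));

    System.out.println("PR: " + prWindow.getSeconds() + " s, "
        + pingsPerHour(Duration.ofMinutes(5)) + " pings/hour");
    System.out.println("Suggested: " + suggestedWindow.getSeconds() + " s, "
        + pingsPerHour(Duration.ofMinutes(1)) + " pings/hour");
  }
}
```

Under these assumptions the PR's values detect a dead peer in about 330 seconds at 12 pings per hour per quiet connection, while the suggested values detect it in about 75 seconds at 60 pings per hour.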

@ChenSammi requested a review from jojochuang on May 28, 2026, 02:17
3 participants