Skip to content

Add broker drain API#18709

Open
xiangfu0 wants to merge 1 commit into
apache:masterfrom
xiangfu0:broker-drain-api
Open

Add broker drain API#18709
xiangfu0 wants to merge 1 commit into
apache:masterfrom
xiangfu0:broker-drain-api

Conversation

@xiangfu0

@xiangfu0 xiangfu0 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add a broker-local drain manager plus POST /drain and GET /drain/status admin APIs.
  • Reject new broker query admissions during drain with BrokerShuttingDown while waiting for already accepted queries to finish.
  • Mark the broker shutdownInProgress=true, remove it from brokerResource, fail health checks while draining, and restore stale drain state on startup.

User manual

To gracefully drain the broker and stop it after accepted queries finish:

curl -X POST "http://<broker-host>:<admin-port>/drain?timeoutMs=-1&shutdown=true"

timeoutMs=-1 uses pinot.broker.delayShutdownTimeMs. Use shutdown=false when an external orchestrator should terminate the process after drain completes:

curl -X POST "http://<broker-host>:<admin-port>/drain?timeoutMs=30000&shutdown=false"
curl "http://<broker-host>:<admin-port>/drain/status"

While draining, the broker health check returns 503 and new valid queries sent to this broker fail with BrokerShuttingDown (212), so clients should retry another broker.

Sample query

SELECT COUNT(*) FROM myTable;

If this query is sent to a draining broker, the broker returns BrokerShuttingDown instead of accepting it for execution.

Tests

  • ./mvnw -pl pinot-broker -am -Dtest=BrokerDrainManagerTest,BrokerRequestHandlerDelegateTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw spotless:apply -pl pinot-broker,pinot-spi,pinot-controller
  • ./mvnw checkstyle:check -pl pinot-broker,pinot-spi,pinot-controller
  • ./mvnw license:format -pl pinot-broker,pinot-spi,pinot-controller
  • ./mvnw license:check -pl pinot-broker,pinot-spi,pinot-controller
  • git diff --check upstream/master...HEAD

Review notes

  • Self-review found and fixed one shutdown lifecycle issue: shutdown=true now delays the asynchronous stop callback briefly so the drain API response is not raced by admin server shutdown.
  • No remaining self-review findings.

@codecov-commenter

codecov-commenter commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 46.78899% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.74%. Comparing base (ff7df48) to head (9414905).

Files with missing lines Patch % Lines
...apache/pinot/broker/broker/BrokerDrainManager.java 64.70% 44 Missing and 4 partials ⚠️
...r/requesthandler/BrokerRequestHandlerDelegate.java 38.23% 17 Missing and 4 partials ⚠️
...broker/api/resources/PinotBrokerDrainResource.java 0.00% 15 Missing ⚠️
...pinot/broker/api/resources/PinotClientRequest.java 0.00% 12 Missing ⚠️
...e/pinot/broker/broker/helix/BaseBrokerStarter.java 0.00% 12 Missing ⚠️
...t/broker/api/resources/PinotBrokerHealthCheck.java 0.00% 6 Missing ⚠️
...pinot/broker/broker/BrokerAdminApiApplication.java 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18709      +/-   ##
============================================
- Coverage     64.75%   64.74%   -0.01%     
  Complexity     1319     1319              
============================================
  Files          3391     3393       +2     
  Lines        210891   211085     +194     
  Branches      33105    33129      +24     
============================================
+ Hits         136552   136677     +125     
- Misses        63320    63385      +65     
- Partials      11019    11023       +4     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 ?
java-21 64.74% <46.78%> (-0.01%) ⬇️
temurin 64.74% <46.78%> (-0.01%) ⬇️
unittests 64.74% <46.78%> (-0.01%) ⬇️
unittests1 56.94% <100.00%> (-0.01%) ⬇️
unittests2 37.21% <46.78%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one high-signal issue; see inline comment.

}

private void startDrain() {
if (!_draining.compareAndSet(false, true)) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startDrain() marks the broker draining before getConnectedHelixManager() and the Helix writes succeed. Because BaseBrokerStarter exposes the admin API before _participantHelixManager.connect(), a startup-window POST /drain (or any Helix write failure here) returns 500 but still leaves _draining set, so later queries are rejected until restart. Please make this transition rollback-safe or only flip the local drain state after the Helix preconditions/writes succeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants