[fix](fe) Skip decommissioning BE for stream load#65049
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
There was a problem hiding this comment.
Pull request overview
This PR tightens FE backend selection for stream load/txn stream load and related cloud load paths to avoid routing new load traffic to backends that are decommissioned or in decommissioning (draining) state.
Changes:
- Add a
needNonDecommissionedflag toBeSelectionPolicyand enforce it during candidate matching. - Apply the non-decommissioned constraint to stream-load redirect (
LoadAction) and transaction stream-load backend selection (InsertStreamTxnExecutor). - Update cloud stream-load redirect selection and group-commit backend availability checks to skip decommissioned/decommissioning backends, with added unit test coverage.
Review Checkpoints (per FE review guide)
- Goal & correctness: The change directly addresses the stated bug (load availability previously ignored decommission/decommissioning) and adds targeted unit tests.
- Scope/focus: Small and focused; changes are limited to selection policy + the affected selection call sites.
- Concurrency: No new concurrency primitives or shared-state mutation in production code paths; only selection predicates/policies are updated.
- Error handling: Existing error paths remain; selection failures still surface via exceptions (e.g., “No available backend…”).
- Parallel paths: Both non-cloud (policy-based selection) and cloud-specific selection/availability checks are updated as described.
- Test coverage: Adds/extends FE unit tests covering the new exclusion behavior for policy selection, txn-load selection, and cloud stream-load redirect selection.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/system/BeSelectionPolicy.java | Introduces needNonDecommissioned and filters out decommissioned/decommissioning backends when enabled. |
| fe/fe-core/src/main/java/org/apache/doris/qe/InsertStreamTxnExecutor.java | Ensures txn stream-load backend selection excludes decommissioned/decommissioning backends. |
| fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/LoadAction.java | Ensures stream-load redirect backend selection excludes decommissioned/decommissioning backends. |
| fe/fe-core/src/main/java/org/apache/doris/load/StreamLoadHandler.java | In cloud redirect selection, filters out decommissioned/decommissioning backends in addition to load-availability. |
| fe/fe-core/src/main/java/org/apache/doris/load/GroupCommitManager.java | Treats decommissioning as unavailable for group commit backend selection (now consistently across modes). |
| fe/fe-core/src/test/java/org/apache/doris/system/SystemInfoServiceTest.java | Adds unit coverage for needNonDecommissioned behavior in backend policy selection. |
| fe/fe-core/src/test/java/org/apache/doris/qe/InsertStreamTxnExecutorTest.java | Updates test to cover skipping a decommissioning backend in txn-load selection. |
| fe/fe-core/src/test/java/org/apache/doris/load/StreamLoadHandlerTest.java | Adds a new unit test ensuring cloud stream-load selection skips decommissioning backends. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/review |
|
run buildall |
There was a problem hiding this comment.
I found one blocking propagation gap: the new non-decommissioned stream-load selection policy is applied to the changed direct redirect, txn stream-load, cloud redirect, and group-commit paths, but two FE-originated local stream-load submitters still select with only needLoadAvailable().
Critical checkpoint conclusions:
- Goal and tests: the PR addresses the stated stream-load decommissioning issue for the changed paths and adds unit coverage, but the coverage does not exercise UploadAction/LoadSubmitter or local-mode MysqlLoadManager stream-load submission.
- Scope and parallel paths: the implementation is small and readable, but the same backend-selection invariant is missing from parallel stream-load entry points.
- Concurrency, lifecycle, config, and compatibility: no new locking, persistent state, config item, protocol, or storage compatibility issue found.
- Behavior and observability: error paths remain existing LoadException/UserException surfaces; the already-existing Copilot toString diagnostic comment covers the missing policy flag in diagnostics, so I did not duplicate it.
- Validation: I reviewed the GitHub PR diff and surrounding call paths; I did not run the FE unit tests in this review environment.
Subagent conclusions: optimizer-rewrite reported NO_NEW_VALUABLE_FINDINGS. tests-session-config proposed TSC-1; I verified and accepted it as MAIN-1, with the inline comment below. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same MAIN-1 comment set.
User focus: no additional user-provided review focus.
TPC-H: Total hot run time: 30119 ms |
8d709c6 to
64a8832
Compare
|
/review |
TPC-DS: Total hot run time: 174276 ms |
ClickBench: Total hot run time: 25.27 s |
64a8832 to
f18ff9a
Compare
|
/review |
|
run buildall |
There was a problem hiding this comment.
Automated review found one remaining issue.
The functional selector changes are mostly consistent: stream-load redirect, upload submitter, MySQL load, txn stream load, cloud stream-load selection, and group-commit availability now avoid decommissioned/decommissioning BEs. Existing threads for BeSelectionPolicy.toString(), LoadSubmitter/MysqlLoadManager, and the test name are already addressed in head 64a8832.
Critical checkpoint conclusions:
- Goal/test proof: The PR addresses redirect/selector gaps and adds focused FE unit tests, but one changed group-commit diagnostic is incomplete.
- Scope/focus: The code change is small and focused on FE backend selection.
- Concurrency/lifecycle: No new shared mutable lifecycle beyond existing Env/Config test swaps; tests restore state.
- Config/compatibility/persistence: No new config, thrift, storage, or persistence format.
- Parallel paths: Most sibling stream-load selectors were updated; Nereids FileLoadScanNode was reviewed and dismissed because stream-load scan ranges carry no execution locations.
- Test coverage: Added unit tests cover policy, redirect selector siblings, cloud stream-load selection, local MySQL load, LoadSubmitter, and txn stream-load selection. I could not run them locally because thirdparty/installed and thirdparty/installed/bin/protoc are missing in this runner.
- User focus: No additional user-provided review focus.
Subagent conclusions: optimizer-rewrite reported no new valuable findings and challenged M-1; after verification M-1 was dismissed. tests-session-config duplicated M-1 and proposed TSC-001, accepted as the single inline comment M-2. Convergence round 1 ended with both subagents returning NO_NEW_VALUABLE_FINDINGS for the M-2-only final set.
TPC-H: Total hot run time: 29655 ms |
There was a problem hiding this comment.
Automated review found one remaining issue: the new non-decommissioned selection policy still is not applied to two parallel FE-created new-load selector families.
Critical checkpoint conclusions:
- Goal and tests: The PR covers the changed direct stream-load redirect, upload submitter, MySQL load, txn stream-load, cloud stream-load, and group-commit paths, and adds focused FE unit tests. The fix is still incomplete for streaming CDC/write and routine-load task scheduling selectors.
- Scope and parallel paths: The implementation is focused, but parallel new-load paths remain outside the new policy.
- Concurrency/lifecycle: No new locking or lifecycle issue found in the production change. The new tests restore mutated Env/Config globals in finally blocks.
- Config/compatibility/persistence: No new config, thrift/storage format, edit-log, or rolling-upgrade compatibility issue found.
- Behavior and observability: Existing error paths remain exception-based. The already-open group-commit diagnostic thread covers the local decommissioning detail omission, so I did not duplicate it.
- Test coverage: I reviewed the listed unit tests and changed call paths. I could not run the FE unit tests in this checkout because thirdparty/installed and thirdparty/installed/bin/protoc are missing.
Subagent conclusions: optimizer-rewrite proposed OPT-1, dismissed as inherited from the PR head parent and outside this PR's commit patch; later proposed OPT-2, accepted and merged into the single inline comment. tests-session-config proposed TEST-1, accepted and merged into the same inline comment. Convergence round 2 ended with both live subagents returning NO_NEW_VALUABLE_FINDINGS for the combined M-2/M-3 final comment set.
User focus: No additional user-provided review focus.
TPC-DS: Total hot run time: 174244 ms |
ClickBench: Total hot run time: 25.25 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
f18ff9a to
8d44d35
Compare
|
/review |
There was a problem hiding this comment.
I found one remaining correctness issue in the routine-load fallback after the new non-decommissioned policy is applied.
Critical checkpoints:
- Goal/test: the PR mostly implements skipping decommissioned/decommissioning BEs for stream-load and related selector paths, with focused unit tests, but one routine-load allocation fallback can still return an ineligible previous BE under saturation.
- Scope: the change is focused FE load/backend-selection logic; no new config, storage format, or FE/BE protocol compatibility concern was found.
- Concurrency/lifecycle: no new locking or lifecycle issue found; the accepted issue is deterministic selection logic in the scheduler path.
- Parallel paths: direct stream-load redirect, transaction stream load, upload submitter, MySQL load, streaming-job selection, group commit cache/random selection, and routine-load local/cloud selection were reviewed. Existing threads cover earlier selector gaps; this review comments only on a surviving routine-load fallback.
- Tests: current tests cover the main selector filters, but the saturated previous-BE fallback is not covered. I could not run FE UTs here because thirdparty/installed/bin/protoc is missing.
Subagent conclusions: optimizer-rewrite proposed OPT-1, which became the inline comment below. tests-session-config reported NO_NEW_VALUABLE_FINDINGS. After MAIN-1 was merged into the proposed final comment set, both live subagents returned NO_NEW_VALUABLE_FINDINGS in convergence round 1.
8d44d35 to
ffa1749
Compare
|
/review |
ffa1749 to
3d2c558
Compare
|
Updated in 3d2c558: moved the non-decommissioned requirement into |
|
/review |
There was a problem hiding this comment.
Reviewed live head 3d2c558. I found one remaining load-selection correctness issue in the cloud routine-load path. Existing review threads for BeSelectionPolicy diagnostics, local stream-load selectors, group-commit diagnostics, streaming-job/routine-load decommissioning coverage, and previous-BE reuse were treated as known context and not duplicated.
Validation: static review of the live GitHub patch and related FE load-routing paths; git diff --check c7874e4..3d2c558 is clean. I did not run FE UTs because thirdparty/installed/bin/protoc is missing in this checkout.
Subagent conclusions: optimizer-rewrite proposed OR-65049-001 for the inherited LoadAction IPv6 parser, which was dismissed as out-of-PR after verifying the live PR file list/current parent patch. tests-session-config found no new candidates. After the live-head refresh, convergence round 2 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS for the same single-comment set.
|
Codex automated review failed and did not complete. Error: Codex completed, but no new pull request review was submitted for the current head SHA. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
There was a problem hiding this comment.
Reviewed live head 3d2c558 after refreshing the current GitHub review threads. I am not adding new inline comments because the only substantiated remaining issue from this pass is already covered by live inline thread #65049 (comment) on CloudRoutineLoadManager.java:57.
Critical checkpoint conclusions:
- Goal and tests: the PR aims to keep new load routing away from decommissioned/decommissioning BEs. The main changed selectors and tests cover that direction; the already-open cloud routine-load thread covers the remaining load-disabled/shutdown predicate gap.
- Scope and parallel paths: reviewed BeSelectionPolicy, local/cloud stream-load redirect, LoadSubmitter, MysqlLoadManager, StreamingJobUtils, local/cloud routine load, previous-BE reuse, group commit, insert stream txn, and the adjacent FileLoadScan/FederationBackendPolicy load policy users. No additional non-duplicate issue survived verification.
- Concurrency/lifecycle/config: no new locking or persistent metadata path was introduced. The added tests restore the global Env/SystemInfoService and Config state they mutate.
- Compatibility/observability: no FE/BE protocol or storage format compatibility change was found. Existing review threads already covered policy/group-commit diagnostics.
- Test validation:
git diff --check 9d7d3a2c4f437e0c2962bde1a49848ed0535063c 3d2c5588e3ace9b534d262bf3ce9aeab66dae1f9is clean. I did not run FE unit tests because this checkout lacksthirdparty/installedandthirdparty/installed/bin/protoc.
User focus: no additional user-provided review focus was present.
Subagent conclusions: optimizer-rewrite proposed OPT-FPW-1 for the cloud routine-load load-availability gap; the main review verified it but marked it duplicate_of_existing_thread because live comment 3502806750 already reports the same issue on the current head. tests-session-config reported no separate valuable findings. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same no-new-inline-comment set.
3d2c558 to
ef0cf95
Compare
|
Updated in ef0cf95: minimized the implementation after moving the non-decommissioned requirement into needLoadAvailable(). Removed LoadSubmitter/MysqlLoadManager production visibility changes and their UTs; kept only policy test, direct cloud stream-load test, and routine-load fallback test. |
|
/review |
There was a problem hiding this comment.
Automated review found one additional load-selection issue.
Critical checkpoint conclusions:
- Goal and tests: the PR aims to keep new load routing away from decommissioned/decommissioning BEs. The policy-backed stream-load, group-commit, streaming-job, and routine-load task selectors mostly move in that direction, and the added tests cover core policy and selected local/cloud paths. One direct Kafka/Kinesis metadata proxy selector family still bypasses the new non-decommissioning predicate.
- Scope and parallel paths: reviewed the live changed-file list plus adjacent LoadSubmitter, MysqlLoadManager, InsertStreamTxnExecutor, FileLoadScan/FederationBackendPolicy, routine-load task scheduling, Kafka/Kinesis metadata utilities, and existing GitHub threads. The cloud routine-load load-disabled/shutdown gap is already covered by live thread r3502806750 and was not duplicated.
- Concurrency/lifecycle/config: no new persistent metadata, lock ordering, or FE/BE protocol compatibility issue was found. The changed tests restore the global Env/SystemInfoService and Config values they mutate.
- Test validation: review was static only in this checkout because
.worktree_initialized,thirdparty/installed, andthirdparty/installed/bin/protocare missing, so FE unit tests/build were not run here.
User focus: no additional user-provided review focus was present.
Subagent conclusions: optimizer-rewrite proposed the Kafka/Kinesis metadata proxy selector gap, which was verified and became the inline comment. tests-session-config reported no separate valuable finding. The cloud routine-load load-availability issue was dismissed as duplicate of existing thread r3502806750. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for this same one-comment final set.
| public Builder needLoadAvailable() { | ||
| policy.needLoadAvailable = true; | ||
| policy.needNonDecommissioned = true; | ||
| return this; |
There was a problem hiding this comment.
By making needLoadAvailable() imply needNonDecommissioned(), the policy-backed load selectors now skip draining BEs, but the Kafka/Kinesis routine-load metadata proxy selectors still bypass this policy. KafkaUtil.getInfoRequest() and KinesisUtil.getInfoRequest() build candidates with backend.isLoadAvailable() && !backend.isDecommissioned() only; isLoadAvailable() does not check isDecommissioning(), and those helpers then send BackendServiceProxy.getInfo to the selected BE. So an alive decommissioning BE can still receive new routine-load Kafka/Kinesis metadata RPCs even though the PR moves the other load-selection paths away from draining BEs. Please add the same non-decommissioning predicate or a shared helper there, with focused coverage for decommissioning metadata-proxy candidates.
There was a problem hiding this comment.
Fixed in ac51c3d: KafkaUtil.getInfoRequest() and KinesisUtil.getInfoRequest() now also skip Backend.isDecommissioning() metadata-proxy candidates, matching the non-decommissioned load-selection behavior.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Stream load backend selection could still choose a backend that is being decommissioned because load availability only checked alive/load-disabled/shutdown state. This can redirect new load requests or assign new load tasks to a draining backend. This change makes BeSelectionPolicy.needLoadAvailable() also require non-decommissioned backends, so policy-based load backend selection skips decommissioned and decommissioning backends by default. Direct cloud load selection paths that do not use BeSelectionPolicy also skip decommissioned and decommissioning backends. Routine load previous-BE reuse now only reuses the previous backend when it is still eligible.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.system.SystemInfoServiceTest
- ./run-fe-ut.sh --run org.apache.doris.load.StreamLoadHandlerTest,org.apache.doris.load.routineload.RoutineLoadBackendSelectionTest
- Behavior changed: Yes. Load backend selection skips decommissioned and decommissioning backends.
- Does this need documentation: No
ef0cf95 to
ac51c3d
Compare
|
/review |
There was a problem hiding this comment.
I found two remaining correctness issues in the decommissioning-BE selection changes.
Critical checkpoint conclusions:
- Goal/test proof: the PR moves most load selectors away from decommissioned/decommissioning BEs and adds focused unit tests, but the Kafka/Kinesis metadata fallback and routine-load slot accounting still leave reachable gaps.
- Scope/focus: the change is mostly focused on FE load/backend selection. No additional user-provided focus points were listed.
- Concurrency/lifecycle: no new lock-order or lifecycle issue found in the changed code; the routine-load issue is scheduling state consistency between slot accounting and eligible backend selection.
- Config/compatibility/persistence: no new config, serialization, or edit-log compatibility issue found.
- Parallel paths: direct cloud/local stream load, group commit, streaming job, routine-load, and Kafka/Kinesis metadata paths were checked. Two parallel/fallback paths still need fixes.
- Tests: added tests cover several fixed paths, but they do not cover the Kafka/Kinesis blacklist fallback or the routine-load scheduler idle-slot mismatch. I did not run FE UTs because this checkout lacks thirdparty/installed and thirdparty/installed/bin/protoc, which fe/AGENTS.md requires before FE build/test execution.
Subagent conclusions: optimizer-rewrite proposed OR-1, accepted as M-1. tests-session-config independently reported TS-1 as a duplicate of OR-1 and found no separate test/session/config issue. I added M-2 from main review. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same final comment set.
| if (backend != null && backend.isLoadAvailable() | ||
| && !backend.isDecommissioned() | ||
| && !backend.isDecommissioning() | ||
| && !failedBeIds.contains(beId) |
There was a problem hiding this comment.
The new primary predicate skips decommissioning BEs, but the fallback below still bypasses it when backendIds is empty. Lines 255-260 repopulate candidates from routineLoadManager.getBlacklist() with only backend != null, so a BE that was blacklisted by an earlier metadata retry and later enters decommissioning can still be selected here and receive BackendServiceProxy.getInfo once all primary candidates are filtered out. KinesisUtil has the same fallback at lines 112-115. Please apply the same load-available/non-decommissioned predicate to the blacklist fallback, or share candidate construction, before sending metadata RPCs.
| // try to reuse the object cache as much as possible | ||
| if (previousBeIdleTaskNum == maxIdleSlotNum) { | ||
| if (previousBeAvailable && previousBeIdleTaskNum == maxIdleSlotNum) { | ||
| return previousBeId; |
There was a problem hiding this comment.
This still allows a saturated previous BE to be reused when the only idle capacity comes from a BE that the new availability filter excludes. The scheduler gate uses getClusterIdleSlotNum(), whose slot map is still built from getAllBackendIds(true) and therefore counts alive decommissioning BEs. After this PR, getAvailableBackendIds() excludes that draining BE, so with one saturated eligible previous BE and one idle decommissioning BE, the loop below leaves maxIdleSlotNum == 0 and resultBeId == -1, but this tie check returns the previous BE anyway and allocateTaskToBe() submits another task to that saturated backend. Please keep routine-load slot accounting aligned with the new eligible-backend predicate, and only reuse the previous BE when it has a positive idle slot.
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Stream load backend selection could still choose a backend that is being decommissioned because load availability only checks alive/load-disabled/shutdown state. This can redirect new stream load requests to a draining backend. This change adds an explicit non-decommissioned selection policy and applies it to stream load redirect, upload submitter, MySQL load, and transaction stream load backend selection. Cloud stream load redirect and group commit backend availability now also skip backends that are decommissioned or decommissioning.
Release note
None
Check List (For Author)