[fix](fe) Skip decommissioning BE for stream load #65049

Open
liaoxin01 wants to merge 1 commit into apache:master from liaoxin01:codex/skip-decommissioning-be-stream-load

Conversation

@liaoxin01 liaoxin01 commented Jun 30, 2026
Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Stream load backend selection could still choose a backend that is being decommissioned because load availability only checks alive/load-disabled/shutdown state. This can redirect new stream load requests to a draining backend. This change adds an explicit non-decommissioned selection policy and applies it to stream load redirect, upload submitter, MySQL load, and transaction stream load backend selection. Cloud stream load redirect and group commit backend availability now also skip backends that are decommissioned or decommissioning.
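To make the gap concrete, here is a minimal Java sketch of the two checks described in the summary. It is not the Doris implementation: `SketchBackend`, `StreamLoadSelectionSketch`, and every field and method name below are hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a backend node; NOT the real Doris Backend class.
final class SketchBackend {
    final long id;
    final boolean alive;
    final boolean loadDisabled;
    final boolean shutdown;
    final boolean decommissioned;
    final boolean decommissioning;

    SketchBackend(long id, boolean alive, boolean loadDisabled, boolean shutdown,
                  boolean decommissioned, boolean decommissioning) {
        this.id = id;
        this.alive = alive;
        this.loadDisabled = loadDisabled;
        this.shutdown = shutdown;
        this.decommissioned = decommissioned;
        this.decommissioning = decommissioning;
    }

    // Load availability as the summary describes it: alive, not load-disabled, not shut down.
    boolean isLoadAvailable() {
        return alive && !loadDisabled && !shutdown;
    }
}

final class StreamLoadSelectionSketch {
    // Previous behavior: a draining backend passes because only load availability is checked.
    static List<Long> selectByLoadAvailability(List<SketchBackend> backends) {
        List<Long> ids = new ArrayList<>();
        for (SketchBackend be : backends) {
            if (be.isLoadAvailable()) {
                ids.add(be.id);
            }
        }
        return ids;
    }

    // Behavior after the change: decommissioned and decommissioning backends are skipped too.
    static List<Long> selectNonDecommissioned(List<SketchBackend> backends) {
        List<Long> ids = new ArrayList<>();
        for (SketchBackend be : backends) {
            if (be.isLoadAvailable() && !be.decommissioned && !be.decommissioning) {
                ids.add(be.id);
            }
        }
        return ids;
    }
}
```

In this sketch a backend that is alive but draining is returned by the first method and filtered out by the second, which is the behavior change the PR describes.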

Release note

None

Check List (For Author)

  • Test: Unit Test
    • ./run-fe-ut.sh --run org.apache.doris.system.SystemInfoServiceTest,org.apache.doris.qe.InsertStreamTxnExecutorTest,org.apache.doris.load.StreamLoadHandlerTest,org.apache.doris.httpv2.util.LoadSubmitterTest,org.apache.doris.load.loadv2.MysqlLoadManagerTest
  • Behavior changed: Yes. Stream load backend selection skips decommissioned and decommissioning backends.
  • Does this need documentation: No

Copilot AI review requested due to automatic review settings June 30, 2026 14:21
@liaoxin01 liaoxin01 requested a review from gavinchou as a code owner June 30, 2026 14:21
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

gavinchou previously approved these changes Jun 30, 2026
@github-actions github-actions Bot added the approved label (indicates a PR has been approved by one committer) Jun 30, 2026
@github-actions

Contributor

PR approved by at least one committer and no changes requested.

@github-actions

Contributor

PR approved by anyone and no changes requested.

Copilot AI left a comment

Contributor

Pull request overview

This PR tightens FE backend selection for stream load/txn stream load and related cloud load paths to avoid routing new load traffic to backends that are decommissioned or in decommissioning (draining) state.

Changes:

  • Add a needNonDecommissioned flag to BeSelectionPolicy and enforce it during candidate matching.
  • Apply the non-decommissioned constraint to stream-load redirect (LoadAction) and transaction stream-load backend selection (InsertStreamTxnExecutor).
  • Update cloud stream-load redirect selection and group-commit backend availability checks to skip decommissioned/decommissioning backends, with added unit test coverage.
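The first bullet can be pictured as an opt-in builder flag that the match step enforces. The following is only a sketch under stated assumptions: `SelectionPolicySketch` and `PolicyBackend` are hypothetical classes, and the real `BeSelectionPolicy` may be structured differently. The `toString()` override is included because a policy flag missing from diagnostics is easy to overlook.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical backend view; field names are assumptions for illustration.
final class PolicyBackend {
    final long id;
    final boolean loadAvailable;
    final boolean decommissioned;
    final boolean decommissioning;

    PolicyBackend(long id, boolean loadAvailable, boolean decommissioned, boolean decommissioning) {
        this.id = id;
        this.loadAvailable = loadAvailable;
        this.decommissioned = decommissioned;
        this.decommissioning = decommissioning;
    }
}

// Sketch of a selection policy with an opt-in non-decommissioned flag.
final class SelectionPolicySketch {
    private final boolean needLoadAvailable;
    private final boolean needNonDecommissioned;

    private SelectionPolicySketch(boolean needLoadAvailable, boolean needNonDecommissioned) {
        this.needLoadAvailable = needLoadAvailable;
        this.needNonDecommissioned = needNonDecommissioned;
    }

    // A backend matches only if it satisfies every requirement that was switched on.
    boolean isMatch(PolicyBackend be) {
        if (needLoadAvailable && !be.loadAvailable) {
            return false;
        }
        if (needNonDecommissioned && (be.decommissioned || be.decommissioning)) {
            return false;
        }
        return true;
    }

    List<Long> candidateIds(List<PolicyBackend> backends) {
        List<Long> ids = new ArrayList<>();
        for (PolicyBackend be : backends) {
            if (isMatch(be)) {
                ids.add(be.id);
            }
        }
        return ids;
    }

    // Printing every flag keeps "no available backend" messages complete.
    @Override
    public String toString() {
        return "needLoadAvailable=" + needLoadAvailable
                + ", needNonDecommissioned=" + needNonDecommissioned;
    }

    static final class Builder {
        private boolean needLoadAvailable;
        private boolean needNonDecommissioned;

        Builder needLoadAvailable() {
            this.needLoadAvailable = true;
            return this;
        }

        Builder needNonDecommissioned() {
            this.needNonDecommissioned = true;
            return this;
        }

        SelectionPolicySketch build() {
            return new SelectionPolicySketch(needLoadAvailable, needNonDecommissioned);
        }
    }
}
```

A call site in this sketch would opt in with `new SelectionPolicySketch.Builder().needLoadAvailable().needNonDecommissioned().build()`.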

Review Checkpoints (per FE review guide)

  • Goal & correctness: The change directly addresses the stated bug (load availability previously ignored decommission/decommissioning) and adds targeted unit tests.
  • Scope/focus: Small and focused; changes are limited to selection policy + the affected selection call sites.
  • Concurrency: No new concurrency primitives or shared-state mutation in production code paths; only selection predicates/policies are updated.
  • Error handling: Existing error paths remain; selection failures still surface via exceptions (e.g., “No available backend…”).
  • Parallel paths: Both non-cloud (policy-based selection) and cloud-specific selection/availability checks are updated as described.
  • Test coverage: Adds/extends FE unit tests covering the new exclusion behavior for policy selection, txn-load selection, and cloud stream-load redirect selection.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/system/BeSelectionPolicy.java | Introduces needNonDecommissioned and filters out decommissioned/decommissioning backends when enabled. |
| fe/fe-core/src/main/java/org/apache/doris/qe/InsertStreamTxnExecutor.java | Ensures txn stream-load backend selection excludes decommissioned/decommissioning backends. |
| fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/LoadAction.java | Ensures stream-load redirect backend selection excludes decommissioned/decommissioning backends. |
| fe/fe-core/src/main/java/org/apache/doris/load/StreamLoadHandler.java | In cloud redirect selection, filters out decommissioned/decommissioning backends in addition to load-availability. |
| fe/fe-core/src/main/java/org/apache/doris/load/GroupCommitManager.java | Treats decommissioning as unavailable for group commit backend selection (now consistently across modes). |
| fe/fe-core/src/test/java/org/apache/doris/system/SystemInfoServiceTest.java | Adds unit coverage for needNonDecommissioned behavior in backend policy selection. |
| fe/fe-core/src/test/java/org/apache/doris/qe/InsertStreamTxnExecutorTest.java | Updates test to cover skipping a decommissioning backend in txn-load selection. |
| fe/fe-core/src/test/java/org/apache/doris/load/StreamLoadHandlerTest.java | Adds a new unit test ensuring cloud stream-load selection skips decommissioning backends. |

Comment thread fe/fe-core/src/test/java/org/apache/doris/qe/InsertStreamTxnExecutorTest.java Outdated
@liaoxin01

Contributor Author

/review

@liaoxin01

Contributor Author

run buildall

@github-actions github-actions Bot left a comment

Contributor

I found one blocking propagation gap: the new non-decommissioned stream-load selection policy is applied to the changed direct redirect, txn stream-load, cloud redirect, and group-commit paths, but two FE-originated local stream-load submitters still select with only needLoadAvailable().

Critical checkpoint conclusions:

  • Goal and tests: the PR addresses the stated stream-load decommissioning issue for the changed paths and adds unit coverage, but the coverage does not exercise UploadAction/LoadSubmitter or local-mode MysqlLoadManager stream-load submission.
  • Scope and parallel paths: the implementation is small and readable, but the same backend-selection invariant is missing from parallel stream-load entry points.
  • Concurrency, lifecycle, config, and compatibility: no new locking, persistent state, config item, protocol, or storage compatibility issue found.
  • Behavior and observability: error paths remain existing LoadException/UserException surfaces; the already-existing Copilot toString diagnostic comment covers the missing policy flag in diagnostics, so I did not duplicate it.
  • Validation: I reviewed the GitHub PR diff and surrounding call paths; I did not run the FE unit tests in this review environment.

Subagent conclusions: optimizer-rewrite reported NO_NEW_VALUABLE_FINDINGS. tests-session-config proposed TSC-1; I verified and accepted it as MAIN-1, with the inline comment below. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same MAIN-1 comment set.

User focus: no additional user-provided review focus.

@hello-stephen

Contributor
TPC-H: Total hot run time: 30119 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8d709c6573d2d71dd7724968b5584044268f7878, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17639	4094	4114	4094
q2	2043	320	220	220
q3	10268	1473	870	870
q4	4682	482	338	338
q5	7517	886	581	581
q6	184	179	142	142
q7	841	867	622	622
q8	9327	1449	1557	1449
q9	5687	4461	4430	4430
q10	6779	1828	1539	1539
q11	513	360	324	324
q12	708	564	450	450
q13	18116	3440	2781	2781
q14	272	273	248	248
q15	q16	788	774	716	716
q17	1033	976	1086	976
q18	7026	5725	5667	5667
q19	1312	1322	1213	1213
q20	750	680	560	560
q21	5966	2740	2592	2592
q22	449	366	307	307
Total cold run time: 101900 ms
Total hot run time: 30119 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4447	4343	4353	4343
q2	290	319	215	215
q3	4555	5010	4426	4426
q4	2106	2196	1383	1383
q5	4506	4342	4369	4342
q6	239	185	134	134
q7	2015	2193	1727	1727
q8	2626	2227	2392	2227
q9	8298	7946	7998	7946
q10	4853	4754	4308	4308
q11	606	423	398	398
q12	789	768	565	565
q13	3313	3628	3006	3006
q14	308	300	274	274
q15	q16	721	738	658	658
q17	1407	1404	1399	1399
q18	8041	7142	7234	7142
q19	1177	1084	1110	1084
q20	2231	2230	1928	1928
q21	5376	4691	4507	4507
q22	527	480	418	418
Total cold run time: 58431 ms
Total hot run time: 52430 ms

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from 8d709c6 to 64a8832 June 30, 2026 14:58
@github-actions github-actions Bot removed the approved label (indicates a PR has been approved by one committer) Jun 30, 2026
@liaoxin01

Contributor Author

/review

@hello-stephen

Contributor
TPC-DS: Total hot run time: 174276 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8d709c6573d2d71dd7724968b5584044268f7878, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4331	635	493	493
query6	472	215	204	204
query7	4960	654	349	349
query8	334	192	172	172
query9	8739	4094	4121	4094
query10	478	364	315	315
query11	5944	2368	2175	2175
query12	162	102	97	97
query13	1280	603	438	438
query14	6313	5341	4988	4988
query14_1	4532	4372	4338	4338
query15	219	203	180	180
query16	1039	491	449	449
query17	939	731	603	603
query18	2444	493	355	355
query19	213	206	167	167
query20	113	112	108	108
query21	239	165	139	139
query22	13596	13626	13472	13472
query23	17386	16583	16166	16166
query23_1	16263	16336	16303	16303
query24	7475	1818	1325	1325
query24_1	1349	1323	1335	1323
query25	579	466	405	405
query26	1340	387	214	214
query27	2593	599	403	403
query28	4475	2068	2054	2054
query29	1128	643	496	496
query30	333	260	232	232
query31	1119	1093	990	990
query32	135	64	64	64
query33	522	320	263	263
query34	1164	1122	690	690
query35	763	815	673	673
query36	1392	1391	1207	1207
query37	204	119	93	93
query38	1887	1736	1650	1650
query39	926	916	888	888
query39_1	883	874	865	865
query40	240	157	148	148
query41	67	65	63	63
query42	93	90	93	90
query43	332	326	286	286
query44	1465	794	789	789
query45	215	203	190	190
query46	1147	1218	757	757
query47	2396	2378	2282	2282
query48	423	393	297	297
query49	580	432	311	311
query50	1116	422	334	334
query51	4415	4394	4345	4345
query52	81	85	74	74
query53	273	287	204	204
query54	283	242	213	213
query55	75	76	66	66
query56	297	290	296	290
query57	1429	1404	1312	1312
query58	302	279	251	251
query59	1612	1673	1406	1406
query60	314	267	252	252
query61	154	149	149	149
query62	710	651	576	576
query63	244	203	210	203
query64	2535	779	603	603
query65	4892	4766	4726	4726
query66	1846	514	387	387
query67	29727	29860	29446	29446
query68	3172	1556	1059	1059
query69	447	326	264	264
query70	1060	948	979	948
query71	370	338	318	318
query72	2831	2619	2311	2311
query73	882	780	418	418
query74	5103	4985	4772	4772
query75	2630	2611	2239	2239
query76	2309	1226	828	828
query77	357	377	279	279
query78	12466	12428	11762	11762
query79	1355	1236	773	773
query80	635	558	486	486
query81	462	324	280	280
query82	971	155	118	118
query83	397	323	294	294
query84	333	166	137	137
query85	936	617	520	520
query86	364	319	292	292
query87	1861	1825	1772	1772
query88	3772	2849	2825	2825
query89	477	411	361	361
query90	1978	218	194	194
query91	202	191	161	161
query92	65	64	59	59
query93	1518	1571	1027	1027
query94	575	357	332	332
query95	804	520	485	485
query96	1010	813	363	363
query97	2686	2694	2603	2603
query98	216	205	196	196
query99	1185	1147	1010	1010
Total cold run time: 259224 ms
Total hot run time: 174276 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.27 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8d709c6573d2d71dd7724968b5584044268f7878, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.00	0.00	0.01
query2	0.10	0.05	0.05
query3	0.25	0.14	0.14
query4	1.60	0.14	0.14
query5	0.24	0.24	0.22
query6	1.28	1.06	1.08
query7	0.04	0.01	0.00
query8	0.06	0.04	0.03
query9	0.38	0.30	0.30
query10	0.53	0.54	0.54
query11	0.18	0.13	0.14
query12	0.18	0.15	0.14
query13	0.47	0.47	0.47
query14	1.01	1.01	0.99
query15	0.62	0.60	0.60
query16	0.31	0.35	0.33
query17	1.08	1.15	1.05
query18	0.23	0.21	0.21
query19	2.00	1.98	2.00
query20	0.02	0.01	0.02
query21	15.46	0.21	0.13
query22	4.92	0.06	0.05
query23	16.14	0.30	0.12
query24	2.98	0.41	0.33
query25	0.11	0.07	0.04
query26	0.73	0.22	0.16
query27	0.04	0.04	0.04
query28	3.52	0.91	0.54
query29	12.49	4.35	3.52
query30	0.28	0.16	0.16
query31	2.77	0.59	0.32
query32	3.22	0.59	0.48
query33	3.12	3.23	3.20
query34	15.53	4.18	3.49
query35	3.50	3.50	3.54
query36	0.55	0.43	0.44
query37	0.10	0.07	0.07
query38	0.06	0.05	0.04
query39	0.05	0.03	0.03
query40	0.18	0.16	0.15
query41	0.09	0.04	0.03
query42	0.04	0.02	0.02
query43	0.05	0.03	0.03
Total cold run time: 96.51 s
Total hot run time: 25.27 s

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from 64a8832 to f18ff9a June 30, 2026 15:09
@liaoxin01

Contributor Author

/review

@liaoxin01

Contributor Author

run buildall

@github-actions github-actions Bot left a comment

Contributor

Automated review found one remaining issue.

The functional selector changes are mostly consistent: stream-load redirect, upload submitter, MySQL load, txn stream load, cloud stream-load selection, and group-commit availability now avoid decommissioned/decommissioning BEs. Existing threads for BeSelectionPolicy.toString(), LoadSubmitter/MysqlLoadManager, and the test name are already addressed in head 64a8832.

Critical checkpoint conclusions:

  • Goal/test proof: The PR addresses redirect/selector gaps and adds focused FE unit tests, but one changed group-commit diagnostic is incomplete.
  • Scope/focus: The code change is small and focused on FE backend selection.
  • Concurrency/lifecycle: No new shared mutable lifecycle beyond existing Env/Config test swaps; tests restore state.
  • Config/compatibility/persistence: No new config, thrift, storage, or persistence format.
  • Parallel paths: Most sibling stream-load selectors were updated; Nereids FileLoadScanNode was reviewed and dismissed because stream-load scan ranges carry no execution locations.
  • Test coverage: Added unit tests cover policy, redirect selector siblings, cloud stream-load selection, local MySQL load, LoadSubmitter, and txn stream-load selection. I could not run them locally because thirdparty/installed and thirdparty/installed/bin/protoc are missing in this runner.
  • User focus: No additional user-provided review focus.
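The incomplete group-commit diagnostic mentioned under "Goal/test proof" is about error text that omits the decommissioning state. As a rough illustration only, a message builder that lists every rejection reason might look like the sketch below. `GroupCommitDiagnosticsSketch` and its parameters are hypothetical; this is not the message format `GroupCommitManager` actually produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Doris. It shows one way to keep every
// rejection reason, including decommissioning, visible in a diagnostic.
final class GroupCommitDiagnosticsSketch {
    static String describe(long backendId, boolean alive,
                           boolean decommissioned, boolean decommissioning) {
        List<String> reasons = new ArrayList<>();
        if (!alive) {
            reasons.add("not alive");
        }
        if (decommissioned) {
            reasons.add("decommissioned");
        }
        if (decommissioning) {
            reasons.add("decommissioning");
        }
        if (reasons.isEmpty()) {
            return "backend " + backendId + " is available";
        }
        return "backend " + backendId + " is unavailable: " + String.join(", ", reasons);
    }
}
```

Collecting reasons into a list, instead of returning on the first failed check, means a newly added predicate cannot silently disappear from the message.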

Subagent conclusions: optimizer-rewrite reported no new valuable findings and challenged M-1; after verification M-1 was dismissed. tests-session-config duplicated M-1 and proposed TSC-001, accepted as the single inline comment M-2. Convergence round 1 ended with both subagents returning NO_NEW_VALUABLE_FINDINGS for the M-2-only final set.

@hello-stephen

Contributor
TPC-H: Total hot run time: 29655 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f18ff9ae3c2d985bcda4f372d58d0dc47728e188, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17616	4104	4073	4073
q2	2045	322	191	191
q3	10378	1385	811	811
q4	4687	467	342	342
q5	7499	883	589	589
q6	186	169	136	136
q7	801	838	628	628
q8	9431	1535	1609	1535
q9	5699	4486	4414	4414
q10	6765	1795	1532	1532
q11	509	351	321	321
q12	721	557	449	449
q13	18124	3434	2762	2762
q14	265	263	237	237
q15	q16	792	785	719	719
q17	1002	1049	939	939
q18	7065	5739	5546	5546
q19	1322	1318	1098	1098
q20	775	654	521	521
q21	5923	2709	2512	2512
q22	446	365	300	300
Total cold run time: 102051 ms
Total hot run time: 29655 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4391	4415	4308	4308
q2	292	310	213	213
q3	4671	4987	4388	4388
q4	2090	2159	1386	1386
q5	4480	4337	4379	4337
q6	235	179	130	130
q7	1748	2229	1756	1756
q8	2681	2239	2235	2235
q9	8212	8145	7840	7840
q10	4815	4759	4267	4267
q11	597	421	403	403
q12	778	777	536	536
q13	3304	3549	2877	2877
q14	309	294	289	289
q15	q16	707	766	636	636
q17	1398	1364	1481	1364
q18	7921	7452	7499	7452
q19	1178	1123	1131	1123
q20	2253	2205	1936	1936
q21	5301	4582	4407	4407
q22	506	457	400	400
Total cold run time: 57867 ms
Total hot run time: 52283 ms

@github-actions github-actions Bot left a comment

Contributor

Automated review found one remaining issue: the new non-decommissioned selection policy still is not applied to two parallel FE-created new-load selector families.

Critical checkpoint conclusions:

  • Goal and tests: The PR covers the changed direct stream-load redirect, upload submitter, MySQL load, txn stream-load, cloud stream-load, and group-commit paths, and adds focused FE unit tests. The fix is still incomplete for streaming CDC/write and routine-load task scheduling selectors.
  • Scope and parallel paths: The implementation is focused, but parallel new-load paths remain outside the new policy.
  • Concurrency/lifecycle: No new locking or lifecycle issue found in the production change. The new tests restore mutated Env/Config globals in finally blocks.
  • Config/compatibility/persistence: No new config, thrift/storage format, edit-log, or rolling-upgrade compatibility issue found.
  • Behavior and observability: Existing error paths remain exception-based. The already-open group-commit diagnostic thread covers the local decommissioning detail omission, so I did not duplicate it.
  • Test coverage: I reviewed the listed unit tests and changed call paths. I could not run the FE unit tests in this checkout because thirdparty/installed and thirdparty/installed/bin/protoc are missing.
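The note that the tests restore mutated Env/Config globals in finally blocks refers to a common save/swap/restore pattern. A minimal sketch of that pattern follows, assuming a hypothetical static holder `SketchConfig.deployMode`; it does not use the real Doris `Config` or `Env` classes.

```java
import java.util.function.Supplier;

// Hypothetical global setting; stands in for a static config field mutated by tests.
final class SketchConfig {
    static volatile String deployMode = "local";
}

final class GlobalSwapSketch {
    // Runs body with the global temporarily replaced, then restores the saved
    // value even if body throws, so later tests see the original state.
    static <T> T withDeployMode(String mode, Supplier<T> body) {
        String saved = SketchConfig.deployMode;
        SketchConfig.deployMode = mode;
        try {
            return body.get();
        } finally {
            SketchConfig.deployMode = saved;
        }
    }
}
```

The restore sits in `finally` rather than after the call, so a failing assertion inside the body cannot leak the swapped value into other tests.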

Subagent conclusions: optimizer-rewrite proposed OPT-1, dismissed as inherited from the PR head parent and outside this PR's commit patch; later proposed OPT-2, accepted and merged into the single inline comment. tests-session-config proposed TEST-1, accepted and merged into the same inline comment. Convergence round 2 ended with both live subagents returning NO_NEW_VALUABLE_FINDINGS for the combined M-2/M-3 final comment set.

User focus: No additional user-provided review focus.

@hello-stephen

Contributor
TPC-DS: Total hot run time: 174244 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f18ff9ae3c2d985bcda4f372d58d0dc47728e188, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4345	641	505	505
query6	492	218	206	206
query7	4913	615	344	344
query8	362	194	165	165
query9	8773	4095	4064	4064
query10	516	338	306	306
query11	5945	2365	2175	2175
query12	164	101	97	97
query13	1270	628	406	406
query14	6249	5264	4997	4997
query14_1	4310	4301	4322	4301
query15	213	203	180	180
query16	988	437	445	437
query17	976	730	595	595
query18	2452	484	359	359
query19	211	198	168	168
query20	113	111	113	111
query21	238	160	139	139
query22	13615	13606	13358	13358
query23	17338	16487	16174	16174
query23_1	16332	16373	16219	16219
query24	7614	1795	1296	1296
query24_1	1320	1313	1320	1313
query25	571	482	394	394
query26	1354	351	221	221
query27	2584	619	403	403
query28	4450	2066	2031	2031
query29	1115	655	511	511
query30	342	275	234	234
query31	1123	1104	988	988
query32	124	64	65	64
query33	555	340	266	266
query34	1224	1159	676	676
query35	776	780	673	673
query36	1400	1391	1221	1221
query37	165	110	91	91
query38	1909	1716	1658	1658
query39	915	921	899	899
query39_1	865	879	871	871
query40	240	169	137	137
query41	69	64	64	64
query42	99	97	97	97
query43	324	326	285	285
query44	1413	789	800	789
query45	206	192	175	175
query46	1043	1205	759	759
query47	2358	2345	2284	2284
query48	405	429	306	306
query49	591	421	328	328
query50	1104	442	324	324
query51	4434	4344	4416	4344
query52	87	86	76	76
query53	277	275	202	202
query54	277	229	221	221
query55	74	70	69	69
query56	315	303	283	283
query57	1443	1416	1346	1346
query58	305	245	243	243
query59	1570	1654	1470	1470
query60	299	268	252	252
query61	154	150	153	150
query62	704	652	573	573
query63	247	210	238	210
query64	2543	752	599	599
query65	4839	4791	4677	4677
query66	1850	512	383	383
query67	29695	29856	29632	29632
query68	3141	1541	1058	1058
query69	413	301	291	291
query70	1054	925	939	925
query71	348	331	303	303
query72	2838	2811	2348	2348
query73	848	766	428	428
query74	5127	5001	4740	4740
query75	2637	2589	2236	2236
query76	2344	1213	803	803
query77	357	390	293	293
query78	12462	12594	11804	11804
query79	1424	1110	768	768
query80	750	535	465	465
query81	473	318	291	291
query82	1021	164	126	126
query83	394	325	301	301
query84	323	164	134	134
query85	958	581	518	518
query86	398	297	282	282
query87	1838	1843	1763	1763
query88	3737	2823	2818	2818
query89	475	404	357	357
query90	1912	206	195	195
query91	206	188	161	161
query92	67	62	57	57
query93	1712	1541	1027	1027
query94	618	347	331	331
query95	796	576	480	480
query96	1126	852	350	350
query97	2705	2731	2565	2565
query98	230	206	199	199
query99	1163	1150	1025	1025
Total cold run time: 259588 ms
Total hot run time: 174244 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f18ff9ae3c2d985bcda4f372d58d0dc47728e188, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.01	0.01	0.01
query2	0.10	0.05	0.06
query3	0.26	0.13	0.13
query4	1.61	0.13	0.13
query5	0.25	0.22	0.23
query6	1.28	1.05	1.05
query7	0.03	0.01	0.00
query8	0.06	0.04	0.04
query9	0.39	0.30	0.32
query10	0.56	0.58	0.56
query11	0.19	0.15	0.14
query12	0.18	0.14	0.14
query13	0.48	0.49	0.48
query14	1.02	1.03	1.00
query15	0.62	0.60	0.59
query16	0.32	0.32	0.31
query17	1.11	1.12	1.08
query18	0.23	0.21	0.22
query19	2.06	1.89	1.90
query20	0.01	0.01	0.01
query21	15.43	0.23	0.14
query22	4.87	0.05	0.06
query23	16.12	0.31	0.13
query24	2.97	0.43	0.33
query25	0.12	0.06	0.05
query26	0.73	0.21	0.16
query27	0.04	0.04	0.03
query28	3.53	0.94	0.56
query29	12.47	4.35	3.52
query30	0.27	0.15	0.15
query31	2.77	0.58	0.32
query32	3.22	0.61	0.49
query33	3.14	3.25	3.16
query34	15.62	4.23	3.52
query35	3.55	3.53	3.54
query36	0.55	0.44	0.41
query37	0.08	0.06	0.06
query38	0.05	0.04	0.03
query39	0.03	0.03	0.03
query40	0.17	0.17	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.04
Total cold run time: 96.67 s
Total hot run time: 25.25 s

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 58.33% (7/12) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 42.31% (11/26) 🎉
Increment coverage report
Complete coverage report

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from f18ff9a to 8d44d35 July 1, 2026 00:38
@liaoxin01

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found one remaining correctness issue in the routine-load fallback after the new non-decommissioned policy is applied.

Critical checkpoints:

  • Goal/test: the PR mostly implements skipping decommissioned/decommissioning BEs for stream-load and related selector paths, with focused unit tests, but one routine-load allocation fallback can still return an ineligible previous BE under saturation.
  • Scope: the change is focused FE load/backend-selection logic; no new config, storage format, or FE/BE protocol compatibility concern was found.
  • Concurrency/lifecycle: no new locking or lifecycle issue found; the accepted issue is deterministic selection logic in the scheduler path.
  • Parallel paths: direct stream-load redirect, transaction stream load, upload submitter, MySQL load, streaming-job selection, group commit cache/random selection, and routine-load local/cloud selection were reviewed. Existing threads cover earlier selector gaps; this review comments only on a surviving routine-load fallback.
  • Tests: current tests cover the main selector filters, but the saturated previous-BE fallback is not covered. I could not run FE UTs here because thirdparty/installed/bin/protoc is missing.
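The saturated previous-BE fallback can be illustrated with a small sketch. Everything here is an assumption made for illustration: the three-step order, the idle-slot map, and the names `RoutineLoadFallbackSketch`, `chooseBackend`, and `NO_BACKEND` are hypothetical and do not describe the actual routine-load scheduler. The point is only that the last-resort branch re-checks eligibility instead of returning the previous backend unconditionally.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the Doris routine-load scheduler.
final class RoutineLoadFallbackSketch {
    static final long NO_BACKEND = -1L;

    // eligibleIds: backends that are load-available and not decommissioned/decommissioning.
    // idleSlots: backend id -> number of free task slots (missing means zero).
    static long chooseBackend(long previousBeId, Set<Long> eligibleIds,
                              Map<Long, Integer> idleSlots) {
        // 1. Prefer the previous backend while it is eligible and has a free slot.
        if (eligibleIds.contains(previousBeId)
                && idleSlots.getOrDefault(previousBeId, 0) > 0) {
            return previousBeId;
        }
        // 2. Otherwise take the eligible backend with the most free slots.
        long best = NO_BACKEND;
        int bestIdle = 0;
        for (Long id : eligibleIds) {
            int idle = idleSlots.getOrDefault(id, 0);
            if (idle > bestIdle) {
                best = id;
                bestIdle = idle;
            }
        }
        if (best != NO_BACKEND) {
            return best;
        }
        // 3. Saturated: fall back to the previous backend only if it is still eligible.
        return eligibleIds.contains(previousBeId) ? previousBeId : NO_BACKEND;
    }
}
```

A unit test for the uncovered case would set every eligible backend to zero idle slots, pass a decommissioning backend as the previous one, and expect `NO_BACKEND`.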

Subagent conclusions: optimizer-rewrite proposed OPT-1, which became the inline comment below. tests-session-config reported NO_NEW_VALUABLE_FINDINGS. After MAIN-1 was merged into the proposed final comment set, both live subagents returned NO_NEW_VALUABLE_FINDINGS in convergence round 1.

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from 8d44d35 to ffa1749 July 1, 2026 01:39
@liaoxin01

Contributor Author

/review

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from ffa1749 to 3d2c558 July 1, 2026 01:51
@liaoxin01

Contributor Author

Updated in 3d2c558: moved the non-decommissioned requirement into BeSelectionPolicy.Builder.needLoadAvailable(), so policy-based load backend selection skips decommissioned/decommissioning BEs by default. Removed the duplicated .needNonDecommissioned() calls from the load selector call sites while keeping the direct cloud/filter paths explicit. Re-ran SystemInfoServiceTest; in the earlier targeted run, all other related selector tests had already passed before the SystemInfoService expectation update.
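A rough sketch of what "moved into needLoadAvailable()" can mean: the single builder call switches on both requirements, so call sites cannot forget the second one. `DefaultingPolicySketch` is a hypothetical class written for illustration and is not the actual `BeSelectionPolicy.Builder` code.

```java
// Hypothetical sketch; the real BeSelectionPolicy.Builder may differ.
final class DefaultingPolicySketch {
    private final boolean needLoadAvailable;
    private final boolean needNonDecommissioned;

    private DefaultingPolicySketch(boolean needLoadAvailable, boolean needNonDecommissioned) {
        this.needLoadAvailable = needLoadAvailable;
        this.needNonDecommissioned = needNonDecommissioned;
    }

    // Returns true when a backend with the given state satisfies this policy.
    boolean accepts(boolean loadAvailable, boolean decommissionedOrDecommissioning) {
        if (needLoadAvailable && !loadAvailable) {
            return false;
        }
        if (needNonDecommissioned && decommissionedOrDecommissioning) {
            return false;
        }
        return true;
    }

    static final class Builder {
        private boolean needLoadAvailable;
        private boolean needNonDecommissioned;

        // Asking for load availability also implies "not draining" in this sketch,
        // so every policy-based load selector inherits the rule by default.
        Builder needLoadAvailable() {
            this.needLoadAvailable = true;
            this.needNonDecommissioned = true;
            return this;
        }

        DefaultingPolicySketch build() {
            return new DefaultingPolicySketch(needLoadAvailable, needNonDecommissioned);
        }
    }
}
```

The trade-off is that paths which filter backends directly, without going through the policy, still have to apply the check themselves.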

@liaoxin01

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Reviewed live head 3d2c558. I found one remaining load-selection correctness issue in the cloud routine-load path. Existing review threads for BeSelectionPolicy diagnostics, local stream-load selectors, group-commit diagnostics, streaming-job/routine-load decommissioning coverage, and previous-BE reuse were treated as known context and not duplicated.

Validation: static review of the live GitHub patch and related FE load-routing paths; git diff --check c7874e4..3d2c558 is clean. I did not run FE UTs because thirdparty/installed/bin/protoc is missing in this checkout.

Subagent conclusions: optimizer-rewrite proposed OR-65049-001 for the inherited LoadAction IPv6 parser, which was dismissed as out-of-PR after verifying the live PR file list/current parent patch. tests-session-config found no new candidates. After the live-head refresh, convergence round 2 ended with both subagents reporting NO_NEW_VALUABLE_FINDINGS for the same single-comment set.

Comment thread fe/fe-core/src/main/java/org/apache/doris/cloud/load/CloudRoutineLoadManager.java Outdated
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Codex automated review failed and did not complete.

Error: Codex completed, but no new pull request review was submitted for the current head SHA.
Workflow run: https://github.com/apache/doris/actions/runs/28487602488

Please inspect the workflow logs and rerun the review after the underlying issue is resolved.

@github-actions github-actions Bot left a comment

Contributor

Reviewed live head 3d2c558 after refreshing the current GitHub review threads. I am not adding new inline comments because the only substantiated remaining issue from this pass is already covered by live inline thread #65049 (comment) on CloudRoutineLoadManager.java:57.

Critical checkpoint conclusions:

  • Goal and tests: the PR aims to keep new load routing away from decommissioned/decommissioning BEs. The main changed selectors and tests cover that direction; the already-open cloud routine-load thread covers the remaining load-disabled/shutdown predicate gap.
  • Scope and parallel paths: reviewed BeSelectionPolicy, local/cloud stream-load redirect, LoadSubmitter, MysqlLoadManager, StreamingJobUtils, local/cloud routine load, previous-BE reuse, group commit, insert stream txn, and the adjacent FileLoadScan/FederationBackendPolicy load policy users. No additional non-duplicate issue survived verification.
  • Concurrency/lifecycle/config: no new locking or persistent metadata path was introduced. The added tests restore the global Env/SystemInfoService and Config state they mutate.
  • Compatibility/observability: no FE/BE protocol or storage format compatibility change was found. Existing review threads already covered policy/group-commit diagnostics.
  • Test validation: git diff --check 9d7d3a2c4f437e0c2962bde1a49848ed0535063c 3d2c5588e3ace9b534d262bf3ce9aeab66dae1f9 is clean. I did not run FE unit tests because this checkout lacks thirdparty/installed and thirdparty/installed/bin/protoc.
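The load-disabled/shutdown predicate gap noted for cloud routine load arises when a direct filter path checks only one half of the eligibility rule. One way to picture a fix is a single composed predicate shared by such paths. The sketch below uses `java.util.function.Predicate`; `CloudBackendView` and the constant names are hypothetical and are not taken from `CloudRoutineLoadManager`.

```java
import java.util.function.Predicate;

// Hypothetical backend view; field names are assumptions for illustration.
final class CloudBackendView {
    final boolean alive;
    final boolean loadDisabled;
    final boolean shutdown;
    final boolean decommissioned;
    final boolean decommissioning;

    CloudBackendView(boolean alive, boolean loadDisabled, boolean shutdown,
                     boolean decommissioned, boolean decommissioning) {
        this.alive = alive;
        this.loadDisabled = loadDisabled;
        this.shutdown = shutdown;
        this.decommissioned = decommissioned;
        this.decommissioning = decommissioning;
    }
}

final class CloudRoutineLoadPredicateSketch {
    static final Predicate<CloudBackendView> LOAD_AVAILABLE =
            be -> be.alive && !be.loadDisabled && !be.shutdown;

    static final Predicate<CloudBackendView> NON_DECOMMISSIONED =
            be -> !be.decommissioned && !be.decommissioning;

    // Direct filter paths use the conjunction, so neither half can be dropped.
    static final Predicate<CloudBackendView> ELIGIBLE_FOR_NEW_LOAD =
            LOAD_AVAILABLE.and(NON_DECOMMISSIONED);
}
```

Filtering with only `NON_DECOMMISSIONED` would still admit a load-disabled backend, which is the kind of gap the open thread describes.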

User focus: no additional user-provided review focus was present.

Subagent conclusions: optimizer-rewrite proposed OPT-FPW-1 for the cloud routine-load load-availability gap; the main review verified it but marked it duplicate_of_existing_thread because live comment 3502806750 already reports the same issue on the current head. tests-session-config reported no separate valuable findings. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same no-new-inline-comment set.

@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from 3d2c558 to ef0cf95 July 1, 2026 02:20
@liaoxin01

Contributor Author

Updated in ef0cf95: minimized the implementation after moving the non-decommissioned requirement into needLoadAvailable(). Removed LoadSubmitter/MysqlLoadManager production visibility changes and their UTs; kept only policy test, direct cloud stream-load test, and routine-load fallback test.

@liaoxin01

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review found one additional load-selection issue.

Critical checkpoint conclusions:

  • Goal and tests: the PR aims to keep new load routing away from decommissioned/decommissioning BEs. The policy-backed stream-load, group-commit, streaming-job, and routine-load task selectors mostly move in that direction, and the added tests cover core policy and selected local/cloud paths. One direct Kafka/Kinesis metadata proxy selector family still bypasses the new non-decommissioning predicate.
  • Scope and parallel paths: reviewed the live changed-file list plus adjacent LoadSubmitter, MysqlLoadManager, InsertStreamTxnExecutor, FileLoadScan/FederationBackendPolicy, routine-load task scheduling, Kafka/Kinesis metadata utilities, and existing GitHub threads. The cloud routine-load load-disabled/shutdown gap is already covered by live thread r3502806750 and was not duplicated.
  • Concurrency/lifecycle/config: no new persistent metadata, lock ordering, or FE/BE protocol compatibility issue was found. The changed tests restore the global Env/SystemInfoService and Config values they mutate.
  • Test validation: review was static only in this checkout because .worktree_initialized, thirdparty/installed, and thirdparty/installed/bin/protoc are missing, so FE unit tests/build were not run here.

User focus: no additional user-provided review focus was present.

Subagent conclusions: optimizer-rewrite proposed the Kafka/Kinesis metadata proxy selector gap, which was verified and became the inline comment. tests-session-config reported no separate valuable finding. The cloud routine-load load-availability issue was dismissed as duplicate of existing thread r3502806750. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for this same one-comment final set.

public Builder needLoadAvailable() {
    policy.needLoadAvailable = true;
    policy.needNonDecommissioned = true;
    return this;

Contributor

By making needLoadAvailable() imply needNonDecommissioned(), the policy-backed load selectors now skip draining BEs, but the Kafka/Kinesis routine-load metadata proxy selectors still bypass this policy. KafkaUtil.getInfoRequest() and KinesisUtil.getInfoRequest() build candidates with backend.isLoadAvailable() && !backend.isDecommissioned() only; isLoadAvailable() does not check isDecommissioning(), and those helpers then send BackendServiceProxy.getInfo to the selected BE. So an alive decommissioning BE can still receive new routine-load Kafka/Kinesis metadata RPCs even though the PR moves the other load-selection paths away from draining BEs. Please add the same non-decommissioning predicate or a shared helper there, with focused coverage for decommissioning metadata-proxy candidates.
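One possible shape for the shared helper requested here is sketched below. `SketchBackend` and `MetadataProxyCandidates` are hypothetical names introduced for illustration; only the three state checks (`isLoadAvailable()`, `isDecommissioned()`, `isDecommissioning()`) come from the comment above, and the real `KafkaUtil`/`KinesisUtil` code may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Backend; it models only the three checks named in the review.
final class SketchBackend {
    final long id;
    private final boolean loadAvailable;
    private final boolean decommissioned;
    private final boolean decommissioning;

    SketchBackend(long id, boolean loadAvailable, boolean decommissioned, boolean decommissioning) {
        this.id = id;
        this.loadAvailable = loadAvailable;
        this.decommissioned = decommissioned;
        this.decommissioning = decommissioning;
    }

    boolean isLoadAvailable() { return loadAvailable; }
    boolean isDecommissioned() { return decommissioned; }
    boolean isDecommissioning() { return decommissioning; }
}

// Hypothetical helper that both metadata-proxy selectors could share.
final class MetadataProxyCandidates {
    // A backend qualifies only if it can take load and is neither drained nor draining.
    static boolean isEligible(SketchBackend backend) {
        return backend != null
                && backend.isLoadAvailable()
                && !backend.isDecommissioned()
                && !backend.isDecommissioning();
    }

    // Collect the ids of all eligible backends, preserving input order.
    static List<Long> select(List<SketchBackend> backends) {
        List<Long> ids = new ArrayList<>();
        for (SketchBackend backend : backends) {
            if (isEligible(backend)) {
                ids.add(backend.id);
            }
        }
        return ids;
    }
}
```

Keeping the predicate in one place means a later change to the eligibility rules would apply to both selectors at once.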

Contributor Author

Fixed in ac51c3d: KafkaUtil.getInfoRequest() and KinesisUtil.getInfoRequest() now also skip Backend.isDecommissioning() metadata-proxy candidates, matching the non-decommissioned load-selection behavior.

### What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Stream load backend selection could still choose a backend that is being decommissioned because load availability only checked alive/load-disabled/shutdown state. This can redirect new load requests or assign new load tasks to a draining backend. This change makes BeSelectionPolicy.needLoadAvailable() also require non-decommissioned backends, so policy-based load backend selection skips decommissioned and decommissioning backends by default. Direct cloud load selection paths that do not use BeSelectionPolicy also skip decommissioned and decommissioning backends. Routine load previous-BE reuse now only reuses the previous backend when it is still eligible.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.system.SystemInfoServiceTest
    - ./run-fe-ut.sh --run org.apache.doris.load.StreamLoadHandlerTest,org.apache.doris.load.routineload.RoutineLoadBackendSelectionTest
- Behavior changed: Yes. Load backend selection skips decommissioned and decommissioning backends.
- Does this need documentation: No
@liaoxin01 liaoxin01 force-pushed the codex/skip-decommissioning-be-stream-load branch from ef0cf95 to ac51c3d Compare July 1, 2026 03:40
@liaoxin01

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

I found two remaining correctness issues in the decommissioning-BE selection changes.

Critical checkpoint conclusions:

  • Goal/test proof: the PR moves most load selectors away from decommissioned/decommissioning BEs and adds focused unit tests, but the Kafka/Kinesis metadata fallback and routine-load slot accounting still leave reachable gaps.
  • Scope/focus: the change is mostly focused on FE load/backend selection. No additional user-provided focus points were listed.
  • Concurrency/lifecycle: no new lock-order or lifecycle issue found in the changed code; the routine-load issue is scheduling state consistency between slot accounting and eligible backend selection.
  • Config/compatibility/persistence: no new config, serialization, or edit-log compatibility issue found.
  • Parallel paths: direct cloud/local stream load, group commit, streaming job, routine-load, and Kafka/Kinesis metadata paths were checked. Two parallel/fallback paths still need fixes.
  • Tests: added tests cover several fixed paths, but they do not cover the Kafka/Kinesis blacklist fallback or the routine-load scheduler idle-slot mismatch. I did not run FE UTs because this checkout lacks thirdparty/installed and thirdparty/installed/bin/protoc, which fe/AGENTS.md requires before FE build/test execution.

Subagent conclusions: optimizer-rewrite proposed OR-1, accepted as M-1. tests-session-config independently reported TS-1 as a duplicate of OR-1 and found no separate test/session/config issue. I added M-2 from main review. Final convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same final comment set.

if (backend != null && backend.isLoadAvailable()
        && !backend.isDecommissioned()
        && !backend.isDecommissioning()
        && !failedBeIds.contains(beId)

Contributor

The new primary predicate skips decommissioning BEs, but the fallback below still bypasses it when backendIds is empty. Lines 255-260 repopulate candidates from routineLoadManager.getBlacklist() with only backend != null, so a BE that was blacklisted by an earlier metadata retry and later enters decommissioning can still be selected here and receive BackendServiceProxy.getInfo once all primary candidates are filtered out. KinesisUtil has the same fallback at lines 112-115. Please apply the same load-available/non-decommissioned predicate to the blacklist fallback, or share candidate construction, before sending metadata RPCs.
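The "share candidate construction" option could look roughly like the sketch below. `FallbackBackend` and `MetadataCandidateBuilder` are hypothetical; the sketch also assumes that the primary pass skips blacklisted backends and that the fallback should still honor `failedBeIds`, neither of which is confirmed by the excerpt above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for Backend with only the states relevant to this comment.
final class FallbackBackend {
    final boolean loadAvailable;
    final boolean decommissioned;
    final boolean decommissioning;

    FallbackBackend(boolean loadAvailable, boolean decommissioned, boolean decommissioning) {
        this.loadAvailable = loadAvailable;
        this.decommissioned = decommissioned;
        this.decommissioning = decommissioning;
    }
}

final class MetadataCandidateBuilder {
    // Same predicate for the primary pass and the blacklist fallback.
    static boolean isEligible(FallbackBackend be) {
        return be != null && be.loadAvailable && !be.decommissioned && !be.decommissioning;
    }

    static List<Long> buildCandidates(Map<Long, FallbackBackend> backendsById,
                                      List<Long> primaryIds,
                                      Set<Long> blacklistedIds,
                                      Set<Long> failedBeIds) {
        List<Long> candidates = new ArrayList<>();
        // Primary pass: assumed to exclude blacklisted and previously failed backends.
        for (Long beId : primaryIds) {
            if (!blacklistedIds.contains(beId) && !failedBeIds.contains(beId)
                    && isEligible(backendsById.get(beId))) {
                candidates.add(beId);
            }
        }
        // Fallback: retry blacklisted backends, but only those that are still eligible.
        if (candidates.isEmpty()) {
            for (Long beId : blacklistedIds) {
                if (!failedBeIds.contains(beId) && isEligible(backendsById.get(beId))) {
                    candidates.add(beId);
                }
            }
        }
        return candidates;
    }
}
```

With this structure, a blacklisted backend that later starts decommissioning is filtered out of the fallback as well, so the caller may end up with an empty candidate list and has to handle that case explicitly.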

  // try to reuse the object cache as much as possible
- if (previousBeIdleTaskNum == maxIdleSlotNum) {
+ if (previousBeAvailable && previousBeIdleTaskNum == maxIdleSlotNum) {
      return previousBeId;

Contributor

This still allows a saturated previous BE to be reused when the only idle capacity comes from a BE that the new availability filter excludes. The scheduler gate uses getClusterIdleSlotNum(), whose slot map is still built from getAllBackendIds(true) and therefore counts alive decommissioning BEs. After this PR, getAvailableBackendIds() excludes that draining BE, so with one saturated eligible previous BE and one idle decommissioning BE, the loop below leaves maxIdleSlotNum == 0 and resultBeId == -1, but this tie check returns the previous BE anyway and allocateTaskToBe() submits another task to that saturated backend. Please keep routine-load slot accounting aligned with the new eligible-backend predicate, and only reuse the previous BE when it has a positive idle slot.
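The scenario above can be reproduced with a small model. In the sketch below, `RoutineLoadSlotSketch` and its methods are hypothetical and do not mirror the actual scheduler code; variable names such as `maxIdleSlotNum`, `resultBeId`, and `previousBeIdleTaskNum` are borrowed from the comment, and the idle-slot map is an assumed input.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical model of routine-load backend choice with aligned slot accounting.
final class RoutineLoadSlotSketch {
    static final long NO_BACKEND = -1L;

    // Count idle slots on eligible backends only, so the scheduler gate
    // and the selection loop agree on how much capacity exists.
    static int eligibleIdleSlotNum(Map<Long, Integer> idleSlotsByBe, Set<Long> eligibleBeIds) {
        int total = 0;
        for (Map.Entry<Long, Integer> entry : idleSlotsByBe.entrySet()) {
            if (eligibleBeIds.contains(entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }

    static long chooseBackend(Map<Long, Integer> idleSlotsByBe, Set<Long> eligibleBeIds,
                              long previousBeId) {
        long resultBeId = NO_BACKEND;
        int maxIdleSlotNum = 0;
        // Pick the eligible backend with the most idle slots (strictly positive).
        for (Long beId : eligibleBeIds) {
            int idle = idleSlotsByBe.getOrDefault(beId, 0);
            if (idle > maxIdleSlotNum) {
                maxIdleSlotNum = idle;
                resultBeId = beId;
            }
        }
        boolean previousBeAvailable = eligibleBeIds.contains(previousBeId);
        int previousBeIdleTaskNum = idleSlotsByBe.getOrDefault(previousBeId, 0);
        // Reuse the previous backend only when it is eligible, has a free slot,
        // and ties with the best candidate.
        if (previousBeAvailable && previousBeIdleTaskNum > 0
                && previousBeIdleTaskNum == maxIdleSlotNum) {
            return previousBeId;
        }
        return resultBeId;
    }
}
```

In this model, a saturated previous backend paired with an idle decommissioning backend yields zero eligible idle slots and `NO_BACKEND`, instead of another task on the saturated backend.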

@github-actions github-actions Bot added the approved label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

PR approved by at least one committer and no changes requested.

Labels

approved (indicates a PR has been approved by one committer), dev/3.1.x, dev/4.0.x, dev/4.1.x, reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants