[feature](runtime-filter) Add adaptive global runtime filter tree publish #64851

Open
BiteTheDDDDt wants to merge 1 commit into apache:master from BiteTheDDDDt:codex/runtime-filter-tree-publish

@BiteTheDDDDt

Contributor

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: Global runtime filters are currently published from the merge coordinator directly to every scan target. When a merged filter is large and the cluster has many BEs, the coordinator sends many large duplicate RPC attachments. This PR adds adaptive tree publish for global runtime filters so each relay node forwards a bounded number of copies based on runtime_filter_tree_publish_max_send_bytes.
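
The bounded-fanout idea can be sketched as follows. This is a minimal illustration, not the PR's code: `PublishTask`, `compute_fanout`, `split_targets`, and `build_tasks` are hypothetical names, and deriving the fanout as the byte budget divided by the filter size is an assumption. The description states only that the number of copies per node is bounded by runtime_filter_tree_publish_max_send_bytes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PublishTask:
    """One RPC from the current node: deliver to `target`, which relays to `forward_targets`."""
    target: str
    forward_targets: List[str] = field(default_factory=list)


def compute_fanout(filter_bytes: int, max_send_bytes: int, num_targets: int) -> int:
    """Copies one node may send; 0 means tree publish is disabled (assumed rule)."""
    if max_send_bytes <= 0 or filter_bytes <= 0 or num_targets <= 0:
        return 0
    return max(1, min(num_targets, max_send_bytes // filter_bytes))


def split_targets(targets: List[str], fanout: int) -> List[List[str]]:
    """Split targets into at most `fanout` contiguous slices of near-equal size."""
    fanout = min(fanout, len(targets))
    base, extra = divmod(len(targets), fanout)
    slices, start = [], 0
    for i in range(fanout):
        size = base + (1 if i < extra else 0)
        slices.append(targets[start:start + size])
        start += size
    return slices


def build_tasks(targets: List[str], fanout: int) -> List[PublishTask]:
    """Direct publish when fanout is 0; otherwise the head of each slice relays to the rest."""
    if not targets:
        return []
    if fanout <= 0:
        return [PublishTask(t) for t in targets]
    return [PublishTask(s[0], s[1:]) for s in split_targets(targets, fanout)]


targets = [f"be{i}" for i in range(10)]
fanout = compute_fanout(filter_bytes=16 << 20, max_send_bytes=64 << 20,
                        num_targets=len(targets))
tasks = build_tasks(targets, fanout)
```

With a 16 MiB filter and a 64 MiB budget, the coordinator in this sketch sends 4 copies instead of 10, and each relay would apply the same split to its own forward targets.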

Release note

Add runtime_filter_tree_publish_max_send_bytes session variable to control adaptive global runtime filter tree publish. Set it to 0 to disable tree publish.

Check List (For Author)

  • Test:
    • Manual test: git diff --cached --check
    • Manual test: build-support/check-format.sh
    • Manual test: DORIS_HOME=/mnt/disk8/xiaolei/codespaces/doris2 DORIS_THIRDPARTY=/mnt/disk8/xiaolei/codespaces/doris2/thirdparty ninja -C be/build_Release src/exec/CMakeFiles/Exec.dir/runtime_filter/runtime_filter_mgr.cpp.o src/service/CMakeFiles/Service.dir/internal_service.cpp.o
    • Manual test: ./build.sh --fe
  • Behavior changed: Yes. Large global runtime filters can be published through an adaptive relay tree instead of direct coordinator fanout.
  • Does this need documentation: No

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

BiteTheDDDDt marked this pull request as ready for review on June 25, 2026 09:44
@BiteTheDDDDt

Contributor Author

/review

@BiteTheDDDDt

Contributor Author

run buildall

Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated

github-actions bot left a comment

Contributor

I found two blocking issues in the adaptive runtime-filter tree publish path.

Critical checkpoint conclusions:

  • Goal: the PR implements adaptive tree publish, but the current implementation is not equivalent to direct publish on error handling and is not safe for mixed-version BE rolling upgrades.
  • Scope: the code is mostly focused on runtime-filter publish, session option forwarding, and protocol fields.
  • Concurrency/lifecycle: the new async relay closure is self-owned, but downstream RPC failure lifecycle is only logged instead of reaching the query cancellation path.
  • Configuration: runtime_filter_tree_publish_max_send_bytes is forwarded to BE and 0 disables tree publish; existing review threads already cover default/fallback wording.
  • Compatibility: the new forward_targets protobuf field needs receiver capability gating before a BE can be selected as a relay.
  • Parallel paths: direct publish and relay publish now diverge for RPC/status failure handling.
  • Tests: helper unit tests cover splitting/fanout/task construction, but not relay failure propagation or mixed old/new BE fallback.
  • Observability/performance: existing threads already cover INFO logging; no separate new comment from this pass.
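
The capability gating requested in the Compatibility checkpoint could look roughly like the sketch below. It assumes the coordinator has some per-BE signal that the receiver understands forward_targets; the `relay_capable` map and the `plan_relays` helper are hypothetical, and the thread does not say how such a capability would be detected. The sketch shows only the gating rule (old BEs are never chosen as relays), not the byte-budget bound.

```python
from typing import Dict, List


def plan_relays(targets: List[str], relay_capable: Dict[str, bool],
                fanout: int) -> Dict[str, List[str]]:
    """Map each first-hop target to the leaves it forwards to.

    Only BEs flagged as relay-capable may receive forward targets.
    BEs without the flag are treated as old and are only ever leaves.
    """
    capable = [t for t in targets if relay_capable.get(t, False)]
    if fanout <= 0 or not capable:
        # No safe relay exists: fall back to direct publish to every target.
        return {t: [] for t in targets}
    relays = capable[:fanout]
    leaves = [t for t in targets if t not in relays]
    plan: Dict[str, List[str]] = {r: [] for r in relays}
    for i, leaf in enumerate(leaves):
        plan[relays[i % len(relays)]].append(leaf)
    return plan


targets = ["new1", "old1", "new2", "old2", "new3"]
capability = {"new1": True, "new2": True, "new3": True, "old1": False, "old2": False}
plan = plan_relays(targets, capability, fanout=2)
```

Old BEs still receive the filter in this plan, but only as leaves, so nothing sits behind a node that would ignore the relay field.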

User focus: no additional user-provided review focus was present.

Subagent conclusions: optimizer-rewrite produced MAIN-001 and duplicate notes for the capability and out-of-scope Nereids candidates. tests-session-config produced MAIN-002 and an out-of-scope Nereids candidate. The Nereids candidate was dismissed because GitHub's authoritative PR file list contains only the seven runtime-filter/session/protocol files. Convergence round 1 ended with both live subagents replying NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set.

Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
BiteTheDDDDt force-pushed the codex/runtime-filter-tree-publish branch 2 times, most recently from e904aa9 to 2b24ec5 on June 25, 2026 10:25
@BiteTheDDDDt

Contributor Author

/review

@BiteTheDDDDt

Contributor Author

run buildall

github-actions bot left a comment

Contributor

Reviewed the PR diff and related runtime-filter paths. I am not adding new inline comments because the substantiated correctness concern I still see is already covered by the existing rolling-upgrade thread, and the other prior relay-error-policy concern has been addressed in the current head.

Critical checkpoint conclusions:

  • Goal: the PR adds adaptive tree publishing for global runtime filters, controlled by runtime_filter_tree_publish_max_send_bytes. The main direct, tree-root, and relay helper paths are present.
  • Scope: the GitHub PR surface is the seven files listed by the harness and GitHub API; local git diff base..head contains unrelated stacked FE/regression changes, so I kept this review scoped to the authoritative PR files.
  • Concurrency and lifecycle: async BRPC callbacks in the root publish path keep callback state in GlobalMergeContext; relay child RPCs now keep their request/callback state in RuntimeFilterRelayRpcClosure. I did not find a new lifetime issue in the current code.
  • Config propagation: the new FE session variable is forwarded through SessionVariable.toThrift() into TQueryOptions, and BE disables tree publish if the field is not set or is set to 0.
  • Compatibility: the rolling BE upgrade risk from forwarding through an old BE is still a valid concern, but it is already raised in the existing inline thread at runtime_filter_mgr.cpp:658, so I did not duplicate it.
  • Parallel paths: direct publish remains the fallback when tree publish is disabled or target fragment IDs are unavailable; relay forwarding now uses the same query-aware callback type as direct publish for child RPC responses.
  • Tests: the added BE unit tests cover fanout splitting, fanout calculation, and task construction. I did not run BE/FE tests in this runner because thirdparty/installed is missing; git diff --check passed for the seven authoritative PR files.
  • Observability: the tree/relay publish messages are VLOG_NOTICE in the current head; the earlier INFO-log concern is already in the existing review context.
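
The enable/fallback rule from the Config propagation and Parallel paths checkpoints can be condensed into a small sketch. `QueryOptions` here is a stand-in for the optional field carried in TQueryOptions, and `use_tree_publish` is a hypothetical helper, not a function from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryOptions:
    # Stand-in for the optional thrift field; None models "field not set".
    runtime_filter_tree_publish_max_send_bytes: Optional[int] = None


def tree_publish_budget(opts: QueryOptions) -> int:
    """Return the byte budget, or 0 when the option is unset or non-positive."""
    value = opts.runtime_filter_tree_publish_max_send_bytes
    return value if value is not None and value > 0 else 0


def use_tree_publish(opts: QueryOptions, has_target_fragment_ids: bool) -> bool:
    """Tree publish needs a positive budget and known target fragment IDs."""
    return tree_publish_budget(opts) > 0 and has_target_fragment_ids
```

In every other case the sketch falls back to direct publish, which matches the fallback described in the review at this head.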

Subagent conclusions:

  • optimizer-rewrite reported NO_NEW_VALUABLE_FINDINGS in the initial pass and again in convergence round 1.
  • tests-session-config reported NO_NEW_VALUABLE_FINDINGS in the initial pass and again in convergence round 1.
  • No subagent candidates became new inline comments; duplicate suppression kept the rolling-upgrade compatibility issue on the existing thread.

User focus: no additional user-provided review focus was present.

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor
TPC-H: Total hot run time: 29778 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 2b24ec53c74d400900eba62a253b6bc1435a6d2c, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17680	4145	4110	4110
q2	2041	341	194	194
q3	10280	1464	854	854
q4	4683	479	343	343
q5	7493	901	599	599
q6	183	176	139	139
q7	793	841	625	625
q8	9368	1583	1698	1583
q9	5674	4574	4604	4574
q10	6784	1813	1561	1561
q11	450	275	258	258
q12	632	425	287	287
q13	18092	3516	2814	2814
q14	267	273	246	246
q15
q16	792	779	709	709
q17	1003	990	1007	990
q18	7136	5749	5589	5589
q19	1181	1332	1159	1159
q20	504	426	276	276
q21	5725	2560	2601	2560
q22	452	369	308	308
Total cold run time: 101213 ms
Total hot run time: 29778 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4468	4396	4370	4370
q2	320	351	230	230
q3	4621	5020	4431	4431
q4	2107	2191	1376	1376
q5	4505	4423	4353	4353
q6	240	182	130	130
q7	1782	2182	1789	1789
q8	2572	2311	2295	2295
q9	8477	8477	8181	8181
q10	4837	4802	4359	4359
q11	574	442	417	417
q12	768	810	552	552
q13	3359	3673	2932	2932
q14	305	299	273	273
q15
q16	720	741	638	638
q17	1393	1400	1524	1400
q18	8066	7455	7475	7455
q19	1206	1108	1104	1104
q20	2239	2222	1951	1951
q21	5664	4599	4527	4527
q22	558	453	408	408
Total cold run time: 58781 ms
Total hot run time: 53171 ms
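
The totals above are consistent with reading the columns as one cold run, two hot runs, and the better of the two hot runs, all in milliseconds. That reading is an inference from the numbers, not something the report states. A short check over the Round 1 rows (q15 reports no timings and is omitted):

```python
# Round 1 rows copied from the report: query, cold, hot 1, hot 2 (ms).
ROUND1 = """
q1 17680 4145 4110
q2 2041 341 194
q3 10280 1464 854
q4 4683 479 343
q5 7493 901 599
q6 183 176 139
q7 793 841 625
q8 9368 1583 1698
q9 5674 4574 4604
q10 6784 1813 1561
q11 450 275 258
q12 632 425 287
q13 18092 3516 2814
q14 267 273 246
q16 792 779 709
q17 1003 990 1007
q18 7136 5749 5589
q19 1181 1332 1159
q20 504 426 276
q21 5725 2560 2601
q22 452 369 308
"""

rows = [line.split() for line in ROUND1.strip().splitlines()]
cold_total = sum(int(r[1]) for r in rows)
# Best hot time per query is the minimum of the two hot runs.
hot_total = sum(min(int(r[2]), int(r[3])) for r in rows)

# Round 2 hot total (runtime_filter_mode=off) as reported above.
HOT_TOTAL_FILTERS_OFF_MS = 53171
slowdown_without_filters = round(HOT_TOTAL_FILTERS_OFF_MS / hot_total, 2)
```

Under that reading, both sums reproduce the reported Round 1 totals, and the hot total with runtime filters off is about 1.79 times the Round 1 hot total.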

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171915 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2b24ec53c74d400900eba62a253b6bc1435a6d2c, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4326	633	487	487
query6	445	193	181	181
query7	4878	591	299	299
query8	337	185	172	172
query9	8773	4146	4147	4146
query10	445	319	275	275
query11	5984	2387	2169	2169
query12	156	106	103	103
query13	1349	649	449	449
query14	6301	5451	5037	5037
query14_1	4333	4321	4320	4320
query15	223	217	180	180
query16	1057	474	477	474
query17	1150	739	604	604
query18	2724	487	371	371
query19	216	201	153	153
query20	121	112	106	106
query21	218	141	118	118
query22	13653	13574	13492	13492
query23	17381	16500	16109	16109
query23_1	16280	16236	16315	16236
query24	7559	1817	1327	1327
query24_1	1327	1315	1324	1315
query25	585	482	399	399
query26	1311	329	179	179
query27	2643	567	344	344
query28	4452	2057	2061	2057
query29	1103	636	518	518
query30	317	237	199	199
query31	1109	1088	970	970
query32	103	63	62	62
query33	533	334	263	263
query34	1228	1139	663	663
query35	751	805	690	690
query36	1411	1352	1189	1189
query37	158	111	97	97
query38	1891	1748	1682	1682
query39	934	914	925	914
query39_1	888	871	882	871
query40	221	124	103	103
query41	66	61	66	61
query42	89	86	88	86
query43	335	329	288	288
query44	1484	821	802	802
query45	205	193	180	180
query46	1088	1263	771	771
query47	2356	2313	2234	2234
query48	408	430	296	296
query49	578	421	320	320
query50	1076	384	263	263
query51	4494	4372	4327	4327
query52	84	81	72	72
query53	261	258	193	193
query54	262	221	196	196
query55	75	70	67	67
query56	240	218	230	218
query57	1420	1400	1305	1305
query58	247	214	227	214
query59	1700	1680	1472	1472
query60	284	251	229	229
query61	152	152	151	151
query62	698	650	585	585
query63	233	193	200	193
query64	2536	772	612	612
query65	4880	4842	4799	4799
query66	1788	471	378	378
query67	28886	28847	28658	28658
query68	3174	1630	1001	1001
query69	424	324	256	256
query70	1069	994	964	964
query71	293	231	212	212
query72	2966	2658	2115	2115
query73	897	790	431	431
query74	5108	4969	4790	4790
query75	2579	2544	2184	2184
query76	2326	1259	812	812
query77	362	386	286	286
query78	12447	12613	11839	11839
query79	1426	1264	747	747
query80	584	458	401	401
query81	458	288	236	236
query82	628	158	137	137
query83	366	285	251	251
query84	265	150	116	116
query85	917	528	416	416
query86	404	310	282	282
query87	1840	1854	1776	1776
query88	3761	2805	2780	2780
query89	436	397	331	331
query90	1870	188	188	188
query91	178	170	142	142
query92	67	62	55	55
query93	1598	1519	838	838
query94	554	361	327	327
query95	690	381	435	381
query96	1116	794	367	367
query97	2713	2686	2580	2580
query98	215	208	203	203
query99	1202	1165	1028	1028
Total cold run time: 258056 ms
Total hot run time: 171915 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.41 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2b24ec53c74d400900eba62a253b6bc1435a6d2c, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.00	0.00	0.00
query2	0.10	0.05	0.05
query3	0.26	0.15	0.13
query4	1.61	0.15	0.15
query5	0.25	0.25	0.23
query6	1.29	1.04	1.04
query7	0.04	0.01	0.01
query8	0.06	0.04	0.04
query9	0.42	0.32	0.32
query10	0.58	0.61	0.55
query11	0.18	0.14	0.14
query12	0.18	0.14	0.14
query13	0.47	0.50	0.48
query14	1.04	1.02	1.01
query15	0.63	0.59	0.60
query16	0.33	0.33	0.33
query17	1.15	1.16	1.14
query18	0.23	0.22	0.22
query19	2.02	2.00	1.94
query20	0.02	0.01	0.01
query21	15.44	0.23	0.14
query22	4.87	0.05	0.05
query23	16.15	0.32	0.12
query24	2.99	0.42	0.31
query25	0.12	0.05	0.05
query26	0.75	0.21	0.15
query27	0.04	0.04	0.03
query28	3.48	0.90	0.54
query29	12.52	4.35	3.46
query30	0.27	0.15	0.15
query31	2.77	0.62	0.31
query32	3.22	0.60	0.49
query33	3.17	3.17	3.19
query34	15.57	4.23	3.57
query35	3.57	3.54	3.53
query36	0.58	0.44	0.44
query37	0.09	0.07	0.07
query38	0.04	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.16	0.15
query41	0.08	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 96.87 s
Total hot run time: 25.41 s

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 40.00% (6/15) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 23.48% (58/247) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.91% (21590/39319)
Line Coverage 38.38% (206292/537554)
Region Coverage 34.43% (162338/471498)
Branch Coverage 35.47% (71138/200544)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 51.42% (127/247) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28465/38377)
Line Coverage 58.02% (310055/534356)
Region Coverage 54.75% (259305/473599)
Branch Coverage 56.11% (112654/200787)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 51.42% (127/247) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28466/38377)
Line Coverage 58.03% (310078/534356)
Region Coverage 54.75% (259280/473599)
Branch Coverage 56.11% (112654/200787)

Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
BiteTheDDDDt force-pushed the codex/runtime-filter-tree-publish branch 3 times, most recently from af5375f to a03874a on June 26, 2026 07:54
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
BiteTheDDDDt force-pushed the codex/runtime-filter-tree-publish branch from a03874a to ac55c5c on June 26, 2026 08:19
@BiteTheDDDDt

Contributor Author

run buildall

BiteTheDDDDt force-pushed the codex/runtime-filter-tree-publish branch from ac55c5c to 2cfe4b9 on June 26, 2026 08:24
@BiteTheDDDDt

Contributor Author

run buildall

@BiteTheDDDDt

Contributor Author

/review

github-actions bot left a comment

Contributor

Reviewed the current PR head 2cfe4b9d882fe5d913cd7157cf9961293d51ce24 against the seven authoritative changed files.

I am not adding new inline comments in this pass. The main correctness issue I can still substantiate is the rolling-upgrade compatibility risk for tree publish through the new PPublishFilterRequestV2.forward_targets relay field: an older selected relay BE can ignore descendants and only apply its own fragment_ids. That is already covered by existing inline thread discussion_r3473478336, so I am not duplicating it.
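
The failure mode described here can be reproduced with a toy delivery model. This is a simplified sketch, not the BE relay code: the `forward_tree` map, the `understands_forward_targets` flags, and the `deliver` helper are hypothetical, and an old BE is modelled as applying the filter locally while silently ignoring its forward targets.

```python
from typing import Dict, List, Optional, Set


def deliver(node: str,
            forward_tree: Dict[str, List[str]],
            understands_forward_targets: Dict[str, bool],
            received: Optional[Set[str]] = None) -> Set[str]:
    """Return the set of BEs that end up receiving the filter."""
    if received is None:
        received = set()
    received.add(node)  # every receiver applies the filter to its own fragments
    if understands_forward_targets.get(node, False):
        for child in forward_tree.get(node, []):
            deliver(child, forward_tree, understands_forward_targets, received)
    return received


tree = {"be1": ["be2", "be3"], "be2": ["be4", "be5"]}
all_nodes = {"be1", "be2", "be3", "be4", "be5"}

all_new = {n: True for n in all_nodes}
mixed = dict(all_new, be2=False)  # be2 is an old BE chosen as a relay

missed_when_all_new = all_nodes - deliver("be1", tree, all_new)
missed_when_mixed = all_nodes - deliver("be1", tree, mixed)
```

In the mixed case be4 and be5 never receive the filter, and no node reports an error, which is why the thread asks for a capability check before a BE is selected as a relay.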

Critical checkpoint conclusions:

  • Goal: the PR adds adaptive global runtime-filter tree publish controlled by runtime_filter_tree_publish_max_send_bytes; the current implementation covers FE option propagation, protobuf relay fields, BE fanout/task construction, relay forwarding, and helper unit tests.
  • Scope/focus: the change is mostly focused on runtime-filter publish, with one FE session variable and protocol fields. No additional user-provided focus was supplied.
  • Concurrency/lifecycle: relay RPC closures now keep request/response/controller state alive and use HandleErrorBrpcCallback; no new non-duplicate lifecycle issue found.
  • Compatibility: the old-BE relay compatibility issue remains the key concern, but it is already raised inline. Target-fragment-id assumptions and fallback removal are also already active review-thread context.
  • Configuration/protocol: the new session variable is non-negative checked, forwarded through SessionVariable.toThrift(), and consumed from TQueryOptions; protobuf fields are appended.
  • Parallel paths: current Nereids and legacy coordinator target builders both populate TRuntimeFilterTargetParamsV2.target_fragment_ids; BE load paths without the optional query option leave tree publish disabled.
  • Tests: new BE unit coverage checks fanout, slicing, and task construction. I did not run FE/BE tests in this runner because thirdparty/installed is absent; git diff --check over the PR files passed.

Subagent conclusions:

  • optimizer-rewrite: no new valuable candidate; it verified runtime-filter publish/relay semantics and marked existing compatibility/error-policy concerns as already covered or fixed in current head.
  • tests-session-config: no new valuable candidate; it verified session variable propagation, thrift/proto additions, diff hygiene, and noted the same local test limitation.
  • Convergence round 1: both live subagents replied NO_NEW_VALUABLE_FINDINGS for the same current ledger/comment set after the last main ledger update.

@hello-stephen

Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 77.38% (1895/2449)
Line Coverage 64.45% (34000/52753)
Region Coverage 64.85% (17492/26973)
Branch Coverage 54.06% (9376/17344)

@hello-stephen

Contributor

FE UT Coverage Report

Increment line coverage 18.18% (2/11) 🎉
Increment coverage report
Complete coverage report

@hello-stephen

Contributor
TPC-H: Total hot run time: 29548 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 2cfe4b9d882fe5d913cd7157cf9961293d51ce24, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17727	4085	4020	4020
q2	2001	309	189	189
q3	10319	1465	822	822
q4	4682	469	340	340
q5	7573	877	589	589
q6	184	173	136	136
q7	771	845	627	627
q8	9341	1619	1653	1619
q9	5591	4542	4542	4542
q10	6820	1827	1553	1553
q11	445	280	247	247
q12	632	433	289	289
q13	18138	3389	2783	2783
q14	272	262	243	243
q15
q16	795	780	712	712
q17	999	967	936	936
q18	7366	5813	5571	5571
q19	1151	1119	1099	1099
q20	503	413	268	268
q21	6158	2921	2637	2637
q22	473	375	326	326
Total cold run time: 101941 ms
Total hot run time: 29548 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5197	4850	4720	4720
q2	332	364	234	234
q3	5012	5333	4710	4710
q4	2085	2188	1397	1397
q5	4777	4847	4677	4677
q6	233	180	132	132
q7	1874	1793	1582	1582
q8	2431	2148	2064	2064
q9	7994	7674	7603	7603
q10	4727	4705	4241	4241
q11	537	393	349	349
q12	735	739	515	515
q13	2947	3336	2751	2751
q14	278	271	258	258
q15
q16	685	694	610	610
q17	1288	1270	1255	1255
q18	7184	7085	6751	6751
q19	1129	1112	1120	1112
q20	2223	2234	1941	1941
q21	5283	4562	4440	4440
q22	522	468	403	403
Total cold run time: 57473 ms
Total hot run time: 51745 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172355 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2cfe4b9d882fe5d913cd7157cf9961293d51ce24, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query5	4311	628	507	507
query6	446	184	201	184
query7	4833	560	291	291
query8	331	195	169	169
query9	8761	4132	4059	4059
query10	460	311	279	279
query11	5858	2353	2153	2153
query12	159	110	102	102
query13	1288	608	433	433
query14	6162	5345	5073	5073
query14_1	4411	4339	4428	4339
query15	218	206	186	186
query16	1025	477	480	477
query17	1135	752	592	592
query18	2559	472	352	352
query19	208	188	147	147
query20	122	110	114	110
query21	224	146	120	120
query22	13633	13549	13478	13478
query23	17553	16555	16264	16264
query23_1	16269	16313	16417	16313
query24	7470	1847	1314	1314
query24_1	1337	1311	1319	1311
query25	567	483	386	386
query26	1307	335	171	171
query27	2654	603	360	360
query28	4480	2068	2042	2042
query29	1105	602	503	503
query30	308	231	199	199
query31	1132	1071	959	959
query32	110	62	57	57
query33	507	310	240	240
query34	1169	1176	631	631
query35	779	777	671	671
query36	1376	1367	1212	1212
query37	151	106	96	96
query38	1878	1684	1671	1671
query39	924	915	878	878
query39_1	858	905	870	870
query40	217	119	97	97
query41	64	64	65	64
query42	93	90	89	89
query43	344	330	292	292
query44	1457	803	793	793
query45	220	194	194	194
query46	1082	1212	760	760
query47	2368	2351	2239	2239
query48	409	420	285	285
query49	566	424	314	314
query50	943	345	244	244
query51	4456	4387	4314	4314
query52	82	84	70	70
query53	253	265	191	191
query54	263	228	200	200
query55	77	70	65	65
query56	234	221	211	211
query57	1434	1398	1314	1314
query58	246	215	236	215
query59	1655	1672	1457	1457
query60	287	252	224	224
query61	150	150	151	150
query62	686	645	589	589
query63	235	193	200	193
query64	2502	752	612	612
query65	4885	4810	4791	4791
query66	1754	468	337	337
query67	28850	28762	28705	28705
query68	3224	1596	1010	1010
query69	407	312	283	283
query70	1063	993	912	912
query71	292	233	216	216
query72	2866	2617	2380	2380
query73	874	814	450	450
query74	5118	4945	4755	4755
query75	2578	2555	2170	2170
query76	2351	1176	790	790
query77	360	382	297	297
query78	12503	12606	11875	11875
query79	1361	1219	772	772
query80	875	461	377	377
query81	492	279	238	238
query82	590	155	129	129
query83	346	273	251	251
query84	265	144	113	113
query85	900	497	401	401
query86	397	298	286	286
query87	1835	1827	1767	1767
query88	3718	2801	2816	2801
query89	436	381	336	336
query90	1779	186	191	186
query91	206	162	131	131
query92	67	62	59	59
query93	1517	1449	893	893
query94	609	363	321	321
query95	674	486	350	350
query96	1080	822	368	368
query97	2692	2717	2585	2585
query98	223	204	200	200
query99	1159	1149	1040	1040
Total cold run time: 257126 ms
Total hot run time: 172355 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.01 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2cfe4b9d882fe5d913cd7157cf9961293d51ce24, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.01	0.01	0.01
query2	0.10	0.05	0.04
query3	0.26	0.14	0.16
query4	1.62	0.13	0.13
query5	0.23	0.22	0.22
query6	1.22	1.11	1.04
query7	0.04	0.00	0.00
query8	0.05	0.04	0.04
query9	0.38	0.31	0.30
query10	0.55	0.56	0.53
query11	0.19	0.14	0.13
query12	0.18	0.14	0.14
query13	0.46	0.48	0.47
query14	1.02	1.01	1.01
query15	0.60	0.58	0.59
query16	0.32	0.33	0.32
query17	1.09	1.13	1.06
query18	0.22	0.21	0.21
query19	2.03	1.92	2.00
query20	0.02	0.02	0.01
query21	15.43	0.21	0.14
query22	4.93	0.06	0.05
query23	16.15	0.32	0.13
query24	2.96	0.40	0.32
query25	0.10	0.05	0.04
query26	0.72	0.20	0.14
query27	0.05	0.05	0.03
query28	3.58	0.95	0.53
query29	12.47	4.35	3.44
query30	0.27	0.15	0.16
query31	2.76	0.58	0.31
query32	3.23	0.60	0.49
query33	3.15	3.18	3.16
query34	15.55	4.22	3.49
query35	3.54	3.48	3.53
query36	0.55	0.43	0.45
query37	0.09	0.07	0.06
query38	0.06	0.04	0.04
query39	0.03	0.03	0.03
query40	0.18	0.17	0.15
query41	0.08	0.04	0.03
query42	0.03	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.54 s
Total hot run time: 25.01 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 25.56% (57/223) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.87% (21596/39362)
Line Coverage 38.36% (206339/537923)
Region Coverage 34.42% (162343/471686)
Branch Coverage 35.46% (71134/200620)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 52.47% (117/223) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.20% (28508/38420)
Line Coverage 58.08% (310578/534725)
Region Coverage 54.89% (260041/473787)
Branch Coverage 56.19% (112860/200863)

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 3.17% (6/189) 🎉
Increment coverage report
Complete coverage report

Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
…lish

Issue Number: N/A

Related PR: N/A

Problem Summary: Global runtime filters are currently published from the merge coordinator directly to every scan target. When a merged filter is large and the cluster has many BEs, the coordinator sends many large duplicate RPC attachments. This change adds adaptive tree publish for global runtime filters so each relay node forwards a bounded number of copies based on runtime_filter_tree_publish_max_send_bytes.

Add runtime_filter_tree_publish_max_send_bytes session variable to control adaptive global runtime filter tree publish. Set it to 0 to disable tree publish.

- Test:
    - Manual test: git diff --cached --check
    - Manual test: build-support/check-format.sh
    - Manual test: DORIS_HOME=/mnt/disk8/xiaolei/codespaces/doris2 DORIS_THIRDPARTY=/mnt/disk8/xiaolei/codespaces/doris2/thirdparty ninja -C be/build_Release src/exec/CMakeFiles/Exec.dir/runtime_filter/runtime_filter_mgr.cpp.o src/service/CMakeFiles/Service.dir/internal_service.cpp.o
    - Manual test: ./build.sh --fe
- Behavior changed: Yes. Large global runtime filters can be published through an adaptive relay tree instead of direct coordinator fanout.
- Does this need documentation: No
BiteTheDDDDt force-pushed the codex/runtime-filter-tree-publish branch from 2cfe4b9 to 75171cb on July 1, 2026 07:45
@BiteTheDDDDt

Contributor Author

run buildall

@BiteTheDDDDt

Contributor Author

/review

github-actions bot left a comment

Contributor

Automated review summary for PR 64851 at head 75171cb.

I did not add new inline comments because the substantiated issues from this pass are already covered by existing inline threads:

  • discussion_r3473478336: tree publish still depends on the new forward_targets field being understood by relay BEs; mixed-version BEs can drop all descendants behind an old relay.
  • discussion_r3503788011: async relay child RPC failures still go through the generic HandleErrorBrpcCallback log path without query_id, filter_id, or target address, despite the reply saying this is done.
  • discussion_r3473478331: relay child RPC error policy was improved in the current head with HandleErrorBrpcCallback and synchronous stub-init failure propagation; I did not find a distinct new error-policy issue beyond the existing thread.
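
The logging gap in discussion_r3503788011 amounts to the failure callback not carrying enough context. A minimal sketch of a context-carrying handler follows; `RelayRpcContext` and `make_failure_handler` are hypothetical names, the message format is invented for illustration, and cancelling the query on failure is an assumed policy rather than the behaviour confirmed in the thread.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RelayRpcContext:
    """Identifiers captured when the relay RPC is issued (hypothetical type)."""
    query_id: str
    filter_id: int
    target: str


def make_failure_handler(ctx: RelayRpcContext,
                         log: Callable[[str], None],
                         cancel_query: Callable[[str, str], None]) -> Callable[[str], None]:
    """Build a callback that logs with full context, then cancels the query."""
    def on_failure(error: str) -> None:
        message = (f"runtime filter relay publish failed: query_id={ctx.query_id} "
                   f"filter_id={ctx.filter_id} target={ctx.target} error={error}")
        log(message)
        cancel_query(ctx.query_id, message)
    return on_failure


logs: List[str] = []
cancelled: List[Tuple[str, str]] = []
handler = make_failure_handler(
    RelayRpcContext(query_id="q-123", filter_id=7, target="10.0.0.5:8060"),
    log=logs.append,
    cancel_query=lambda query_id, reason: cancelled.append((query_id, reason)),
)
handler("connection refused")
```

Capturing the identifiers when the RPC is issued keeps the callback self-contained, so an asynchronous failure can be traced to a specific query, filter, and target.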

Critical checkpoints:

  • Goal/behavior: adaptive global runtime filter tree publish is implemented and the helper tests cover split/fanout/task construction, but mixed-version relay capability remains unresolved in the existing thread.
  • Scope: the change is focused across BE runtime-filter publish, internal-service relay forwarding, FE session variable propagation, and proto/thrift IDL.
  • Concurrency/lifecycle: no new lock-order issue found; async request/response/controller lifetime appears owned by the relay closure and callback shared pointers.
  • Config/compatibility: runtime_filter_tree_publish_max_send_bytes is forwarded through TQueryOptions; compatibility remains the main risk because forward_targets is ignored by older BEs.
  • Parallel paths: both Nereids RuntimeFiltersThriftBuilder and legacy Coordinator populate target_fragment_ids; I did not find a distinct parallel join/aggregate routing bug.
  • Testing: helper BE unit tests were added, but I could not run tests or regenerate proto/thrift outputs in this checkout because thirdparty/installed and thirdparty/installed/bin/protoc are missing.
  • Observability: the existing failed-forward-target logging thread still appears not fully addressed for async RPC failures.
  • User focus: no additional user-provided review focus was supplied.

Subagent conclusions:

  • optimizer-rewrite: no candidates appended; convergence round 1 returned NO_NEW_VALUABLE_FINDINGS.
  • tests-session-config: no candidates appended; convergence round 1 returned NO_NEW_VALUABLE_FINDINGS; test/protoc execution was also blocked by missing thirdparty dependencies.

@hello-stephen

Contributor
TPC-H: Total hot run time: 30097 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 75171cb7c77f903c96542ea473e4083fe0ecd070, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17682	4175	4179	4175
q2	2009	330	202	202
q3	10478	1452	825	825
q4	4755	474	344	344
q5	8187	869	584	584
q6	338	182	140	140
q7	921	846	627	627
q8	10721	1707	1696	1696
q9	5945	4531	4478	4478
q10	6860	1803	1515	1515
q11	496	356	319	319
q12	736	553	435	435
q13	18140	3396	2822	2822
q14	264	267	239	239
q15
q16	793	773	713	713
q17	939	1023	966	966
q18	6952	5798	5503	5503
q19	1156	1315	1189	1189
q20	785	675	568	568
q21	5571	2670	2456	2456
q22	435	366	301	301
Total cold run time: 104163 ms
Total hot run time: 30097 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4539	4487	4487	4487
q2	302	325	212	212
q3	4664	5028	4435	4435
q4	2093	2188	1368	1368
q5	4584	4402	4967	4402
q6	288	224	165	165
q7	2118	1829	1707	1707
q8	2662	2260	2378	2260
q9	8337	7932	7792	7792
q10	4850	4771	4357	4357
q11	626	433	398	398
q12	781	753	537	537
q13	3387	3581	2937	2937
q14	322	319	283	283
q15	q16	709	733	637	637
q17	1413	1507	1388	1388
q18	8111	7469	7292	7292
q19	1120	1069	1150	1069
q20	2224	2212	1930	1930
q21	5318	4689	4494	4494
q22	528	457	394	394
Total cold run time: 58976 ms
Total hot run time: 52544 ms
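
The TPC-H tables above have no column headers. Here is a minimal sketch of how the two totals appear to be derived, assuming the columns are query name, cold run, first hot run, second hot run, and the faster of the two hot runs. That reading is inferred from the numbers rather than stated by the bot, and `QueryTimes`, `Totals`, and `summarize` are illustrative names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One table row. The column meanings are an assumption inferred from the totals.
struct QueryTimes {
    std::string name;
    std::int64_t cold_ms;  // first column after the query name
    std::int64_t hot1_ms;  // second column
    std::int64_t hot2_ms;  // third column
};

struct Totals {
    std::int64_t cold_ms = 0;
    std::int64_t hot_ms = 0;
};

// Total cold time is the sum of cold runs; total hot time is the sum of the
// faster hot run of each query (the value repeated in the last column).
Totals summarize(const std::vector<QueryTimes>& rows) {
    Totals t;
    for (const auto& r : rows) {
        t.cold_ms += r.cold_ms;
        t.hot_ms += std::min(r.hot1_ms, r.hot2_ms);
    }
    return t;
}
```

For the first three Round 1 rows (q1, q2, q3), `summarize` gives 30169 ms cold and 5202 ms hot. The q15 row carries no timings in either round, so it contributes nothing to the totals.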

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173927 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 75171cb7c77f903c96542ea473e4083fe0ecd070, data reload: false

query5	4382	624	491	491
query6	477	222	208	208
query7	4843	598	361	361
query8	339	188	176	176
query9	8788	4034	4024	4024
query10	479	358	302	302
query11	5977	2378	2150	2150
query12	157	102	110	102
query13	1273	644	453	453
query14	6241	5297	4966	4966
query14_1	4300	4295	4306	4295
query15	227	218	183	183
query16	1102	514	460	460
query17	1172	749	613	613
query18	2745	491	356	356
query19	219	193	165	165
query20	113	108	106	106
query21	238	155	147	147
query22	13598	13594	13450	13450
query23	17450	16553	16181	16181
query23_1	16400	16330	16348	16330
query24	7502	1795	1313	1313
query24_1	1323	1300	1316	1300
query25	573	473	403	403
query26	1345	368	205	205
query27	2541	597	392	392
query28	4415	2051	2063	2051
query29	1107	633	515	515
query30	342	265	234	234
query31	1135	1097	986	986
query32	119	64	62	62
query33	543	335	268	268
query34	1239	1129	648	648
query35	774	777	689	689
query36	1399	1430	1224	1224
query37	159	122	100	100
query38	1935	1697	1650	1650
query39	920	916	899	899
query39_1	882	909	880	880
query40	247	162	139	139
query41	65	73	66	66
query42	94	97	94	94
query43	323	323	275	275
query44	1471	776	772	772
query45	202	186	178	178
query46	1126	1201	732	732
query47	2342	2400	2267	2267
query48	418	370	287	287
query49	588	428	313	313
query50	1026	427	330	330
query51	4395	4420	4342	4342
query52	84	86	75	75
query53	274	274	206	206
query54	273	229	207	207
query55	73	71	65	65
query56	303	289	279	279
query57	1461	1432	1323	1323
query58	272	259	246	246
query59	1569	1624	1406	1406
query60	316	272	255	255
query61	147	152	157	152
query62	694	642	582	582
query63	242	207	221	207
query64	2519	767	629	629
query65	4867	4820	4765	4765
query66	1825	527	375	375
query67	29724	29569	29420	29420
query68	3111	1522	942	942
query69	411	308	272	272
query70	1075	981	941	941
query71	365	324	301	301
query72	2984	2564	2333	2333
query73	822	783	408	408
query74	5097	5015	4766	4766
query75	2605	2592	2249	2249
query76	2318	1213	809	809
query77	349	369	279	279
query78	12301	12529	11851	11851
query79	1427	1200	765	765
query80	1307	547	474	474
query81	555	328	285	285
query82	563	152	121	121
query83	388	319	297	297
query84	287	162	134	134
query85	993	608	517	517
query86	419	293	283	283
query87	1836	1819	1760	1760
query88	3765	2838	2795	2795
query89	454	406	354	354
query90	1966	196	196	196
query91	203	193	166	166
query92	59	61	58	58
query93	1647	1554	963	963
query94	720	358	322	322
query95	785	497	586	497
query96	1147	814	332	332
query97	2690	2679	2586	2586
query98	216	207	196	196
query99	1170	1154	1035	1035
Total cold run time: 260055 ms
Total hot run time: 173927 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.25 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 75171cb7c77f903c96542ea473e4083fe0ecd070, data reload: false

query1	0.01	0.01	0.01
query2	0.11	0.05	0.06
query3	0.26	0.14	0.13
query4	1.61	0.13	0.14
query5	0.25	0.22	0.22
query6	1.26	1.09	1.08
query7	0.04	0.01	0.00
query8	0.06	0.04	0.04
query9	0.37	0.30	0.31
query10	0.55	0.57	0.53
query11	0.19	0.14	0.14
query12	0.19	0.14	0.15
query13	0.48	0.48	0.48
query14	1.00	1.00	1.02
query15	0.61	0.60	0.59
query16	0.30	0.31	0.32
query17	1.13	1.10	1.13
query18	0.22	0.21	0.21
query19	2.03	1.92	1.97
query20	0.02	0.01	0.02
query21	15.47	0.22	0.13
query22	4.84	0.05	0.06
query23	16.11	0.32	0.12
query24	3.03	0.44	0.33
query25	0.12	0.06	0.04
query26	0.75	0.20	0.15
query27	0.04	0.03	0.03
query28	3.58	0.88	0.55
query29	12.53	4.30	3.47
query30	0.27	0.16	0.16
query31	2.77	0.61	0.32
query32	3.22	0.61	0.49
query33	3.29	3.24	3.32
query34	15.50	4.20	3.47
query35	3.48	3.50	3.51
query36	0.55	0.44	0.43
query37	0.09	0.08	0.07
query38	0.05	0.04	0.03
query39	0.04	0.03	0.03
query40	0.17	0.16	0.16
query41	0.08	0.03	0.04
query42	0.04	0.03	0.02
query43	0.05	0.04	0.04
Total cold run time: 96.76 s
Total hot run time: 25.25 s
