[fix](mtmv): serialize alterJob with running tasks to prevent concurrent refresh on same MTMV #64958

Open
yujun777 wants to merge 1 commit into apache:master from yujun777:fix-mtmv-alterjob-concurrency

Conversation

@yujun777 commented Jun 29, 2026

Contributor

Problem

ALTER MTMV REFRESH ON COMMIT calls alterJob() which does dropJob() + createJob(), creating a new MTMVJob instance with a new ReentrantReadWriteLock. If the CREATE MTMV immediate build task is still running under the old job's writeLock, the new on-commit task can acquire the new job's writeLock concurrently, causing two refresh tasks to operate on the same MTMV simultaneously.

This race triggers "partition not found" when the on-commit task reads partition metadata via UpdateMvByPartitionCommand.constructTableWithPredicates(), while the immediate build task is in the middle of replacing partitions.
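
The failure mode can be pictured with a deliberately simplified, single-threaded replay of the interleaving. This is a toy model, not Doris code: `PartitionRaceDemo`, its id-to-name map, and the assumption that a rebuild re-creates partitions under fresh ids are all hypothetical and exist only to show how a reader holding stale partition references can hit "partition not found".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionRaceDemo {
    // Toy catalog: partition id -> partition name (hypothetical, not Doris metadata).
    private final Map<Long, String> partitions = new HashMap<>();
    private long nextId = 1;

    void addPartition(String name) {
        partitions.put(nextId++, name);
    }

    // What a refresh task might capture before planning its work.
    List<Long> snapshotIds() {
        return new ArrayList<>(partitions.keySet());
    }

    // Assumed rebuild behavior: drop every partition and re-create it under a new id.
    void replaceAll() {
        List<String> names = new ArrayList<>(partitions.values());
        partitions.clear();
        for (String name : names) {
            addPartition(name);
        }
    }

    String lookup(long id) {
        String name = partitions.get(id);
        if (name == null) {
            throw new IllegalStateException("partition not found: " + id);
        }
        return name;
    }

    // Replays the racy order: task B snapshots, task A replaces, task B looks up.
    static String interleave() {
        PartitionRaceDemo table = new PartitionRaceDemo();
        table.addPartition("p2026");
        List<Long> seenByTaskB = table.snapshotIds();
        table.replaceAll();
        try {
            return table.lookup(seenByTaskB.get(0));
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(interleave()); // prints "partition not found: 1"
    }
}
```

With mutual exclusion between the two tasks, the snapshot and the lookup would both happen either before or after the replacement, so the stale id would never be used.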

Root Cause

  • alterJob() drops the old job and creates a new one
  • The new job has a brand new ReentrantReadWriteLock instance
  • MTMVTask.runTask() acquires the writeLock from getJobOrJobException()
  • After drop+create, getJobOrJobException() returns the new job with a different lock
  • Two tasks can hold two different locks → no mutual exclusion
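
The last two points can be checked in isolation with plain `java.util.concurrent` locks. The sketch below is not the Doris implementation: `SplitLockDemo` and its nested `Job` class are hypothetical stand-ins, assuming only that each job instance owns its own `ReentrantReadWriteLock`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class SplitLockDemo {
    // Hypothetical stand-in for a job; each instance owns a separate lock.
    static final class Job {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    }

    // Holds held's write lock, then reports whether another thread
    // can take wanted's write lock at the same time.
    static boolean secondWriterGetsIn(Job held, Job wanted) {
        held.lock.writeLock().lock(); // first task is running
        try {
            AtomicBoolean acquired = new AtomicBoolean(false);
            Thread other = new Thread(() -> {
                // second task locks whichever job it looked up
                if (wanted.lock.writeLock().tryLock()) {
                    acquired.set(true);
                    wanted.lock.writeLock().unlock();
                }
            });
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return acquired.get();
        } finally {
            held.lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        Job oldJob = new Job();
        Job newJob = new Job(); // what drop + create leaves behind
        System.out.println("same job: " + secondWriterGetsIn(oldJob, oldJob));
        System.out.println("replaced job: " + secondWriterGetsIn(oldJob, newJob));
    }
}
```

Running `main` prints `same job: false` and `replaced job: true`: the second writer is excluded only when both tasks resolve to the same job instance.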

Fix

In alterJob(), acquire the existing job's writeLock before drop+create, ensuring all running tasks complete before the job is rebuilt. This serializes the alter with any in-flight tasks using the same lock instance.

Change

Single file: fe/fe-core/src/main/java/org/apache/doris/mtmv/MTMVJobManager.java

public void alterJob(MTMV mtmv, boolean isReplay) {
    MTMVJob oldJob = getJobByMTMV(mtmv);
    if (!isReplay && oldJob != null) {
        oldJob.writeLock();   // block until running tasks complete
    }
    try {
        dropJob(mtmv, isReplay);
        createJob(mtmv, isReplay);
    } finally {
        if (!isReplay && oldJob != null) {
            oldJob.writeUnlock();
        }
    }
}
  • No deadlock risk: lock ordering is always PER-JOB → GLOBAL, no reverse path
  • Replay path (isReplay=true) unchanged
  • Null-safe on oldJob
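
To see the intended ordering outside of Doris, the following sketch reproduces the shape of the patched method with hypothetical names (`SerializedAlterDemo`, `Job`, `alter`, and a plain map in place of the job manager). It assumes the running task already holds the job's write lock when the alter begins; it does not model the replay path or the manager's global lock.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class SerializedAlterDemo {
    // Hypothetical stand-in for a job; each instance owns its own lock.
    static final class Job {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    }

    private final Map<Long, Job> jobs = new ConcurrentHashMap<>();
    final List<String> events = new CopyOnWriteArrayList<>();

    void create(long id) {
        jobs.put(id, new Job());
    }

    // Same shape as the patched alterJob(): hold the old lock across drop + create.
    void alter(long id) {
        Job old = jobs.get(id);
        if (old != null) {
            old.lock.writeLock().lock(); // waits for a task holding this lock
        }
        try {
            jobs.remove(id);
            jobs.put(id, new Job());
            events.add("alter done");
        } finally {
            if (old != null) {
                old.lock.writeLock().unlock();
            }
        }
    }

    static List<String> run() {
        SerializedAlterDemo manager = new SerializedAlterDemo();
        manager.create(1L);
        Job job = manager.jobs.get(1L);
        CountDownLatch taskHoldsLock = new CountDownLatch(1);
        Thread task = new Thread(() -> {
            job.lock.writeLock().lock();
            try {
                taskHoldsLock.countDown();
                Thread.sleep(50); // simulated refresh work
                manager.events.add("task done");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                job.lock.writeLock().unlock();
            }
        });
        task.start();
        try {
            taskHoldsLock.await(); // alter starts only after the task owns the lock
            manager.alter(1L);
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return manager.events;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [task done, alter done]
    }
}
```

Because `alter` blocks on the same lock instance the task holds, the job is replaced only after the task has released it.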

🤖 Generated with Claude Code

…t refresh

ALTER MTMV REFRESH ON COMMIT calls alterJob() which does dropJob() + createJob(),
creating a new MTMVJob instance with a new ReentrantReadWriteLock. If the CREATE
MTMV immediate build task is still running under the old job's writeLock, the new
on-commit task can acquire the new job's writeLock concurrently, causing two refresh
tasks to operate on the same MTMV simultaneously and triggering "partition not found".

Fix: In alterJob(), acquire the existing job's writeLock before drop+create, ensuring
all running tasks complete before the job is rebuilt. This serializes the alter with
any in-flight tasks using the same lock instance.

Co-Authored-By: Claude <noreply@anthropic.com>
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yujun777 changed the title from "fix(mtmv): serialize alterJob with running tasks to prevent concurrent refresh on same MTMV" to "[fix](mtmv): serialize alterJob with running tasks to prevent concurrent refresh on same MTMV" on Jun 29, 2026
@yujun777

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29395 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8e024d852d991549438a9ac79ad765a7abaacf00, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17813	4150	4035	4035
q2	2059	307	184	184
q3	10294	1437	853	853
q4	4678	470	339	339
q5	7501	857	585	585
q6	187	194	141	141
q7	789	866	636	636
q8	9392	1514	1547	1514
q9	5590	4533	4541	4533
q10	6843	1782	1518	1518
q11	440	280	253	253
q12	636	422	287	287
q13	18236	3400	2736	2736
q14	263	266	243	243
q15	q16	813	785	714	714
q17	1058	1082	1092	1082
q18	6822	5661	5597	5597
q19	1214	1338	1147	1147
q20	500	408	274	274
q21	5709	2686	2421	2421
q22	444	360	303	303
Total cold run time: 101281 ms
Total hot run time: 29395 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4356	4265	4250	4250
q2	329	351	227	227
q3	4606	4989	4422	4422
q4	2066	2156	1390	1390
q5	4480	4293	4337	4293
q6	234	182	136	136
q7	1740	1656	2040	1656
q8	2568	2321	2248	2248
q9	8366	8449	8143	8143
q10	4808	4749	4332	4332
q11	575	445	385	385
q12	739	756	552	552
q13	3242	3664	3026	3026
q14	294	307	278	278
q15	q16	735	730	658	658
q17	1386	1347	1416	1347
q18	7984	7273	7266	7266
q19	1194	1182	1117	1117
q20	2277	2215	1945	1945
q21	5344	4661	4480	4480
q22	512	464	407	407
Total cold run time: 57835 ms
Total hot run time: 52558 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173548 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8e024d852d991549438a9ac79ad765a7abaacf00, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query5	4312	623	501	501
query6	433	185	169	169
query7	4834	562	306	306
query8	358	181	163	163
query9	8739	4107	4070	4070
query10	446	310	258	258
query11	5954	2385	2164	2164
query12	151	99	97	97
query13	1230	621	420	420
query14	6297	5275	4957	4957
query14_1	4288	4270	4288	4270
query15	209	202	182	182
query16	1023	474	441	441
query17	957	699	595	595
query18	2439	477	337	337
query19	205	180	139	139
query20	109	106	111	106
query21	231	145	116	116
query22	13668	13599	13407	13407
query23	17558	16638	16182	16182
query23_1	16265	16277	16323	16277
query24	7518	1788	1327	1327
query24_1	1328	1306	1330	1306
query25	565	478	404	404
query26	1306	310	176	176
query27	2674	602	361	361
query28	4492	2075	2076	2075
query29	1195	634	518	518
query30	320	242	201	201
query31	1126	1073	945	945
query32	111	64	63	63
query33	527	338	263	263
query34	1234	1224	649	649
query35	777	797	693	693
query36	1386	1394	1275	1275
query37	163	111	96	96
query38	1900	1720	1664	1664
query39	936	932	880	880
query39_1	883	872	882	872
query40	230	130	107	107
query41	72	70	67	67
query42	96	93	91	91
query43	317	327	285	285
query44	1514	816	799	799
query45	204	193	178	178
query46	1107	1204	774	774
query47	2383	2402	2249	2249
query48	410	431	312	312
query49	600	427	329	329
query50	1017	372	264	264
query51	4418	4442	4274	4274
query52	83	87	73	73
query53	260	273	191	191
query54	280	267	214	214
query55	77	73	71	71
query56	263	265	241	241
query57	1440	1411	1322	1322
query58	270	234	222	222
query59	1553	1659	1449	1449
query60	290	260	247	247
query61	207	145	156	145
query62	702	647	580	580
query63	236	191	192	191
query64	2541	752	600	600
query65	4856	4750	4788	4750
query66	1814	465	339	339
query67	30086	29638	29631	29631
query68	3119	1517	988	988
query69	414	311	268	268
query70	1046	921	971	921
query71	301	240	216	216
query72	3175	2693	2393	2393
query73	848	816	426	426
query74	5135	4975	4741	4741
query75	2575	2543	2181	2181
query76	2303	1222	822	822
query77	344	379	276	276
query78	12461	12403	11902	11902
query79	1458	1185	773	773
query80	601	475	397	397
query81	452	280	238	238
query82	638	158	120	120
query83	357	280	246	246
query84	311	145	117	117
query85	880	526	423	423
query86	395	292	272	272
query87	1836	1858	1776	1776
query88	3730	2845	2811	2811
query89	444	395	342	342
query90	1993	179	184	179
query91	167	162	132	132
query92	65	64	55	55
query93	1508	1558	864	864
query94	546	360	310	310
query95	687	464	354	354
query96	1024	804	352	352
query97	2689	2686	2593	2593
query98	217	209	200	200
query99	1159	1130	1011	1011
Total cold run time: 258786 ms
Total hot run time: 173548 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.19 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 8e024d852d991549438a9ac79ad765a7abaacf00, data reload: false

query1	0.01	0.01	0.01
query2	0.10	0.06	0.05
query3	0.25	0.14	0.13
query4	1.60	0.14	0.13
query5	0.29	0.22	0.23
query6	1.25	1.05	1.08
query7	0.03	0.01	0.00
query8	0.06	0.04	0.04
query9	0.40	0.32	0.31
query10	0.54	0.58	0.56
query11	0.20	0.13	0.14
query12	0.18	0.15	0.14
query13	0.48	0.48	0.49
query14	1.02	1.00	1.01
query15	0.62	0.58	0.59
query16	0.32	0.32	0.32
query17	1.14	1.10	1.19
query18	0.24	0.22	0.21
query19	2.09	1.95	2.02
query20	0.01	0.01	0.01
query21	15.47	0.19	0.13
query22	4.92	0.05	0.05
query23	16.12	0.31	0.11
query24	3.03	0.43	0.35
query25	0.13	0.06	0.03
query26	0.73	0.21	0.15
query27	0.05	0.03	0.05
query28	3.48	0.95	0.54
query29	12.53	4.34	3.45
query30	0.27	0.14	0.17
query31	2.77	0.62	0.31
query32	3.24	0.58	0.48
query33	3.22	3.23	3.24
query34	15.56	4.15	3.50
query35	3.53	3.52	3.50
query36	0.58	0.44	0.40
query37	0.09	0.07	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.16	0.14
query41	0.09	0.03	0.03
query42	0.06	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 97.02 s
Total hot run time: 25.19 s

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 71.43% (5/7) 🎉
Increment coverage report
Complete coverage report

@yujun777

Contributor Author

run cloud_p0

@hello-stephen

Contributor

FE Regression Coverage Report

Increment line coverage 3.70% (5/135) 🎉
Increment coverage report
Complete coverage report

@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

PR approved by at least one committer and no changes requested.

@github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) on Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

PR approved by anyone and no changes requested.

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.x, dev/4.0.x, dev/4.1.x, reviewed

3 participants