@github-actions
Contributor

Cherry-picked from #60376

… shared sample_infos (#60376)

Previously, all compaction types (base, cumulative, full) shared a
single sample_infos vector per tablet. When different compaction types
ran concurrently on the same tablet, one compaction could resize
sample_infos while another was indexing into it, causing an
out-of-bounds access and a crash.

Crash stack:

```gdb
*** Aborted at 1769502009 (unix time) try "date -d @1769502009" if you are using GNU date ***
*** Current BE git commitID: 0c75960cd13 ***
*** SIGABRT unknown detail explain (@0x4c61) received by PID 19553 (TID 20096 OR 0x7b7f13caa640) from PID 19553; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/common/signal_handler.h:420
 1# 0x00007F82B398B520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill at ./nptl/pthread_kill.c:89
 3# raise at ../sysdeps/posix/raise.c:27
 4# abort at ./stdlib/abort.c:81
 5# 0x000055BA75135461 in /mnt/hdd01/ci/doris-deploy-branch-selectdb-doris-4.0-cloud/be/lib/doris_be
 6# std::vector >::operator[](unsigned long) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/stl_vector.h:1263
 7# doris::estimate_batch_size(int, std::shared_ptr, long) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/merger.cpp:416
 8# doris::Merger::vertical_merge_rowsets(std::shared_ptr, doris::ReaderType, doris::TabletSchema const&, std::vector, std::allocator > > const&, doris::RowsetWriter*, unsigned int, long, doris::Merger::Statistics*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/merger.cpp:496
 9# doris::Compaction::merge_input_rowsets() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:210
10# doris::CloudCompactionMixin::execute_compact_impl(long) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:1490
11# doris::CloudCompactionMixin::execute_compact() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:1528
12# doris::CloudBaseCompaction::execute_compact() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/cloud/cloud_base_compaction.cpp:296
13# doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0::operator()() const at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/cloud/cloud_storage_engine.cpp:806
14# void std::__invoke_impl const&)::$_0&>(std::__invoke_other, doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
15# std::enable_if const&)::$_0&>, void>::type std::__invoke_r const&)::$_0&>(doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
16# std::_Function_handler const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
17# std::function::operator()() const at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
18# doris::FunctionRunnable::run() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/threadpool.cpp:60
19# doris::ThreadPool::dispatch_thread() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/threadpool.cpp:616
20# void std::__invoke_impl(std::__invoke_memfun_deref, void (doris::ThreadPool::*&)(), doris::ThreadPool*&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:76
21# std::__invoke_result::type std::__invoke(void (doris::ThreadPool::*&)(), doris::ThreadPool*&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:98
22# void std::_Bind::__call(std::tuple<>&&, std::_Index_tuple<0ul>) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:515
23# void std::_Bind::operator()<, void>() at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:600
24# void std::__invoke_impl&>(std::__invoke_other, std::_Bind&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
25# std::enable_if&>, void>::type std::__invoke_r&>(std::_Bind&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
26# std::_Function_handler >::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
27# std::function::operator()() const at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
28# doris::Thread::supervise_thread(void*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/thread.cpp:460
29# asan_thread_start(void*) in /mnt/hdd01/ci/doris-deploy-branch-selectdb-doris-4.0-cloud/be/lib/doris_be
30# start_thread at ./nptl/pthread_create.c:442
31# 0x00007F82B3A6F850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
```

Root cause:
  Base, full, and cumulative compactions can run concurrently on the same tablet
  They all share a single sample_infos vector
  resize() and operator[] are not in the same critical section, so another compaction can resize the vector between the two calls
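
The interleaving described above can be replayed deterministically in a single thread. The following is a minimal sketch, not the Doris code: `SharedState`, `prepare`, `index_is_valid`, and `replay_interleaving` are hypothetical names, and it assumes the two compactions size the vector to different group counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one per-group sample record.
struct SampleInfo {
    std::int64_t bytes;
    std::int64_t rows;
};

// Hypothetical: a single vector shared by every compaction type on a tablet.
struct SharedState {
    std::vector<SampleInfo> sample_infos;
};

// Step 1 of a merge: size the vector to this compaction's group count.
void prepare(SharedState& state, std::size_t group_count) {
    state.sample_infos.resize(group_count);
}

// Step 2 of a merge: check whether indexing group_idx would stay in bounds.
bool index_is_valid(const SharedState& state, std::size_t group_idx) {
    return group_idx < state.sample_infos.size();
}

// Replays the problematic order: compaction A prepares, compaction B prepares
// with its own group count, then A tries to read its last group.
// Assumes groups_a >= 1. Returns true if A's read would still be in bounds.
bool replay_interleaving(std::size_t groups_a, std::size_t groups_b) {
    SharedState state;
    prepare(state, groups_a);   // A: resize to its own group count
    prepare(state, groups_b);   // B: resize again before A reads
    return index_is_valid(state, groups_a - 1);
}
```

With these assumptions, `replay_interleaving(5, 2)` returns false: B shrinks the vector to 2 entries, so A's index 4 is out of range. In the real code, the equivalent unchecked `operator[]` call is the frame that aborts in the stack trace.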

Fix:
  Separate sample_infos for each compaction type (cumu/base/full)
  Each type has its own mutex and vector
  Add getter methods to select the correct sample_infos by ReaderType
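
A rough sketch of the shape of this fix, under stated assumptions: the `ReaderType` enum values, `SampleInfo` fields, and the `TabletSampleInfos` class with its `reset`, `update`, and `size` methods are all hypothetical illustrations, not the actual Doris definitions. Each compaction type gets its own vector and mutex, and a private getter selects the pair by `ReaderType`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical enum; the real Doris ReaderType has different members.
enum class ReaderType { BaseCompaction, CumulativeCompaction, FullCompaction };

// Hypothetical per-group sample record.
struct SampleInfo {
    std::int64_t bytes;
    std::int64_t rows;
};

class TabletSampleInfos {
public:
    // Sizes the vector of one compaction type; other types are untouched.
    void reset(ReaderType type, std::size_t group_count) {
        Slot& slot = slot_for(type);
        std::lock_guard<std::mutex> guard(slot.lock);
        slot.infos.resize(group_count);
    }

    // Bounds check and write happen under the same lock as resize().
    // Returns false instead of indexing out of range.
    bool update(ReaderType type, std::size_t group_idx, SampleInfo info) {
        Slot& slot = slot_for(type);
        std::lock_guard<std::mutex> guard(slot.lock);
        if (group_idx >= slot.infos.size()) {
            return false;
        }
        slot.infos[group_idx] = info;
        return true;
    }

    std::size_t size(ReaderType type) {
        Slot& slot = slot_for(type);
        std::lock_guard<std::mutex> guard(slot.lock);
        return slot.infos.size();
    }

private:
    // One mutex and one vector per compaction type.
    struct Slot {
        std::mutex lock;
        std::vector<SampleInfo> infos;
    };

    // Getter that selects the slot for a given reader type.
    Slot& slot_for(ReaderType type) {
        switch (type) {
        case ReaderType::BaseCompaction:
            return base_;
        case ReaderType::FullCompaction:
            return full_;
        case ReaderType::CumulativeCompaction:
            return cumu_;
        }
        return cumu_;  // not reached for valid enum values
    }

    Slot base_;
    Slot cumu_;
    Slot full_;
};
```

In this sketch, a base compaction calling `reset(ReaderType::BaseCompaction, 5)` and a cumulative compaction calling `reset(ReaderType::CumulativeCompaction, 2)` no longer affect each other's vector, so the base compaction can still update index 4. Whether the real getters return references, pointers, or copies is not stated in the PR description.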
@github-actions github-actions bot requested a review from yiguolei as a code owner January 30, 2026 16:06
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring reopened this Jan 30, 2026
@hello-stephen
Contributor

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 85.71% (36/42) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
|---|---|
| Function Coverage | 52.98% (19017/35898) |
| Line Coverage | 36.10% (177063/490484) |
| Region Coverage | 32.73% (137225/419309) |
| Branch Coverage | 33.59% (59454/176981) |

3 participants