Deadlock or Wrong Results
Files Affected
stan/lib/stan_math/stan/math/prim/functor/reduce_sum.hpp: tbb::parallel_reduce call site
Root Cause Analysis
When reduce_sum calls are nested (one reduce_sum called from inside another's ReduceFunction), a thread that is waiting for the inner reduction can pick up tasks from the outer one, and partial sums can be corrupted:
// inner_fn is any ordinary ReduceFunction (see the test below for one).
struct outer_fn {
  double operator()(const std::vector<double>& slice, size_t start, size_t end,
                    std::ostream* msgs) const {
    // Inner reduce_sum inside outer reduce_sum!
    const int grainsize = 1;
    return stan::math::reduce_sum<inner_fn>(slice, grainsize, msgs);
  }
};

std::vector<double> outer_data(100, 1.0);
double result = stan::math::reduce_sum<outer_fn>(outer_data, 5, nullptr);
The Problem:
While a thread waits for a nested parallel algorithm to finish, TBB's scheduler lets it run any other task that is ready in the same arena, including tasks that belong to the outer reduction:
Outer reduce_sum:
  Chunk A: running inner reduce_sum
  Chunk B: ready to run
  Chunk C: ready to run
Inner reduce_sum (from Chunk A):
  Task 1: running
  Task 2: ready
TBB's scheduler:
  "The thread running Chunk A is waiting on the inner reduce_sum, and Chunk B is ready → run Chunk B on that thread"
Result:
- Work from different reduction trees interleaves on the same thread
- Partial sums from the inner reduction can be added to the outer one incorrectly
- Deadlock is possible if all threads end up waiting on the wrong tasks
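The interleaving described above can be illustrated with a toy model. This is a single-threaded sketch, not TBB or Stan code: it assumes the reduction body keeps its partial sum in a per-thread scratch accumulator, and the names ToyThread, sum_chunk, and partial_sum_of_a are hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy model only: one "thread", one per-thread scratch accumulator, and a
// queue of ready outer chunks. None of these names come from TBB or Stan.
struct ToyThread {
  double scratch = 0.0;                      // per-thread partial sum
  std::deque<std::function<void()>> ready;   // outer chunks ready to run
  bool run_other_work_while_waiting = true;  // models an unisolated wait

  // Called where a real scheduler would block on the inner reduction.
  void wait_for_inner() {
    if (!run_other_work_while_waiting) return;
    while (!ready.empty()) {
      auto task = std::move(ready.front());
      ready.pop_front();
      task();  // an unrelated outer chunk runs in the middle of this one
    }
  }
};

// Sums `chunk` through the per-thread scratch accumulator, pausing halfway
// to "wait" the way an outer chunk waits on its inner reduce_sum.
double sum_chunk(ToyThread& t, const std::vector<double>& chunk) {
  t.scratch = 0.0;
  const std::size_t half = chunk.size() / 2;
  for (std::size_t i = 0; i < half; ++i) t.scratch += chunk[i];
  t.wait_for_inner();  // other chunks may overwrite t.scratch here
  for (std::size_t i = half; i < chunk.size(); ++i) t.scratch += chunk[i];
  return t.scratch;
}

// Runs chunk A with chunk B queued as ready work; returns A's partial sum.
double partial_sum_of_a(bool run_other_work_while_waiting) {
  ToyThread t;
  t.run_other_work_while_waiting = run_other_work_while_waiting;
  const std::vector<double> a(4, 1.0);   // correct partial sum: 4
  const std::vector<double> b(4, 10.0);  // correct partial sum: 40
  double b_sum = 0.0;
  t.ready.push_back([&] { b_sum = sum_chunk(t, b); });
  return sum_chunk(t, a);
}
```

In this model, partial_sum_of_a(true) returns 42 instead of 4, because chunk B runs in the middle of chunk A and leaves its own total in the shared accumulator. With the flag set to false, chunk A finishes undisturbed and returns 4.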
The Fix
// BEFORE (a thread waiting on the reduction may run unrelated tasks):
tbb::parallel_reduce(range, worker);
return_type result = worker.sum_;

// AFTER (the waiting thread only runs tasks spawned inside the lambda):
return_type result(0);
tbb::this_task_arena::isolate([&] {
  tbb::parallel_reduce(range, worker);
  result = worker.sum_;
});
return result;
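tbb::this_task_arena::isolate() runs the lambda as an isolated region within the current arena. While the calling thread waits inside that region, it only executes tasks spawned there, so it cannot pick up an unrelated chunk of the outer reduction. Threads outside the region can still steal the inner tasks, so the inner reduction keeps its parallelism.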
Test Coverage
#include <stan/math.hpp>
#include <gtest/gtest.h>
#include <vector>

TEST(StanMathPrim_reduce_sum, nested_reduce_sum_isolation) {
  // Plain serial sum over the slice it is given.
  struct inner_fn {
    double operator()(const std::vector<double>& slice, size_t, size_t,
                      std::ostream*) const {
      double s = 0;
      for (auto x : slice) s += x;
      return s;
    }
  };
  struct outer_fn {
    double operator()(const std::vector<double>& slice, size_t, size_t,
                      std::ostream* msgs) const {
      // Inner reduce_sum
      return stan::math::reduce_sum<inner_fn>(slice, 1, msgs);
    }
  };
  std::vector<double> outer_data(100, 1.0);
  // Without isolation, task stealing could corrupt partial sums
  // With isolation, each reduce_sum respects its boundary
  double result = stan::math::reduce_sum<outer_fn>(outer_data, 5, nullptr);
  EXPECT_DOUBLE_EQ(result, 100.0);
}
Impact
Before Fix:
- Nested reduce_sum calls can silently produce wrong results
- Symptoms: unpredictable values, race conditions, possible deadlock
- Only manifests under specific threading/load conditions
- Very difficult to debug
After Fix:
- Task boundaries respected
- Nested reduce_sum works correctly
- Deterministic results