
Wait for blocksync goroutines on Stop to fix leveldb shutdown panic #3415

Draft
masih wants to merge 1 commit into main from masih/panic-leveldb-iter-tm

Conversation

masih (Collaborator) commented May 11, 2026

`Reactor.OnStart` and `BlockPool.OnStart` started their long-running goroutines (`requestRoutine`, `poolRoutine`, `processBlockSyncCh`, `processPeerUpdates`, `makeRequestersRoutine`) with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and `Stop()` never waited for them. The outer `ctx` also outlived `Stop`, so the goroutines kept running after `Stop` returned.

During node shutdown this raced `nodeImpl.OnStop`'s `blockStore.Close()`: `poolRoutine`, still inside `SaveBlock -> Base() -> bs.db.Iterator`, observed its leveldb table reader released and panicked with "leveldb/table: reader released".

Route each goroutine through `BaseService.Spawn` so it is tracked by the WaitGroup and bound to `inner.ctx`. `Stop()` now cancels them and blocks until they exit, which happens before the node closes the BlockStore DB. Add a regression test that asserts no blocksync goroutines remain after `Reactor.Stop()` returns.


Note

Medium Risk
Touches blocksync shutdown/concurrency by changing how long-running goroutines are started and awaited, which can affect node sync and shutdown behavior. Scope is limited and covered by a new regression test, but failures could manifest as hangs or incomplete shutdown.

Overview
Fixes blocksync shutdown ordering by starting Reactor and BlockPool long-running routines via BaseService.Spawn() instead of raw go calls, ensuring they are bound to the service’s inner context and are waited on during Stop().

Updates the `Reactor.OnStop` documentation to reflect that `Stop()` now blocks until `requestRoutine`, `poolRoutine`, `processBlockSyncCh`, and `processPeerUpdates` exit, and adds a regression test (`TestReactor_OnStopWaitsForGoroutines`) asserting that no blocksync goroutines remain after `Reactor.Stop()` returns, preventing the LevelDB shutdown panic.

Reviewed by Cursor Bugbot for commit e4972b7. Bugbot is set up for automated code reviews on this repo.

github-actions Bot commented May 11, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | May 11, 2026, 1:23 PM |


codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.24%. Comparing base (0543e0e) to head (e4972b7).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-tendermint/internal/blocksync/reactor.go | 66.66% | 8 Missing ⚠️ |
Additional details and impacted files


@@           Coverage Diff           @@
##             main    #3415   +/-   ##
=======================================
  Coverage   59.24%   59.24%           
=======================================
  Files        2110     2110           
  Lines      174149   174170   +21     
=======================================
+ Hits       103175   103193   +18     
- Misses      62041    62044    +3     
  Partials     8933     8933           
| Flag | Coverage Δ |
| --- | --- |
| sei-chain-pr | 70.59% <71.42%> (?) |
| sei-db | 70.41% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-tendermint/internal/blocksync/pool.go | 81.58% <100.00%> (+1.01%) ⬆️ |
| sei-tendermint/internal/blocksync/reactor.go | 61.53% <66.66%> (+0.27%) ⬆️ |

@masih masih marked this pull request as ready for review May 11, 2026 13:50
@masih masih requested review from sei-will and wen-coding May 11, 2026 13:54
@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



```go
r.Spawn("poolRoutine", func(ctx context.Context) error {
	r.poolRoutine(ctx, false)
	return nil
})
```

SwitchToConsensus receives prematurely-cancelled inner context

Medium Severity

Switching `poolRoutine` from `go r.poolRoutine(ctx, false)` to `r.Spawn(...)` changes the `ctx` it receives from the outer (node-scoped) context to the Reactor's `inner.ctx`. Inside `poolRoutine`, this `ctx` is passed to `r.consReactor.SwitchToConsensus(ctx, state, ...)` on line 495. The consensus reactor now receives a context that is cancelled when the blocksync `Reactor.Stop()` runs, rather than when the node shuts down. This can prematurely cancel consensus operations that are expected to outlive the blocksync reactor.

Additional Locations (1)


@masih masih marked this pull request as draft May 11, 2026 17:35
masih (Collaborator, Author) commented May 11, 2026

Marked back as draft to take a closer look at reactor code before opening back up for review

@pompon0 pompon0 self-requested a review May 11, 2026 17:45

3 participants