Wait for blocksync goroutines on Stop to fix leveldb shutdown panic #3415
masih wants to merge 1 commit
Conversation
Reactor.OnStart and BlockPool.OnStart started their long-running goroutines (requestRoutine, poolRoutine, processBlockSyncCh, processPeerUpdates, makeRequestersRoutine) with raw `go fn(ctx)` using the outer context. They were therefore not registered with the BaseService WaitGroup, and Stop() never waited for them. The outer ctx also outlived Stop, so the goroutines kept running after Stop returned.

During node shutdown this raced nodeImpl.OnStop's blockStore.Close(): poolRoutine, still inside SaveBlock -> Base() -> bs.db.Iterator, observed its leveldb table reader released and panicked with "leveldb/table: reader released".

Route each goroutine through BaseService.Spawn so it is tracked by the WaitGroup and bound to inner.ctx. Stop() now cancels them and blocks until they exit, which happens before the node closes the BlockStore DB. Add a regression test that asserts no blocksync goroutines remain after Reactor.Stop() returns.
Codecov Report ❌
Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3415   +/-   ##
=======================================
  Coverage   59.24%   59.24%
=======================================
  Files        2110     2110
  Lines      174149   174170    +21
=======================================
+ Hits       103175   103193    +18
- Misses      62041    62044     +3
  Partials     8933     8933
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e4972b7.
	r.Spawn("poolRoutine", func(ctx context.Context) error {
		r.poolRoutine(ctx, false)
		return nil
	})
SwitchToConsensus receives prematurely-cancelled inner context
Medium Severity
Switching poolRoutine from `go r.poolRoutine(ctx, false)` to `r.Spawn(...)` changes the ctx it receives from the outer (node-scoped) context to the Reactor's `inner.ctx`. Inside poolRoutine, this ctx is passed to `r.consReactor.SwitchToConsensus(ctx, state, ...)` on line 495. The consensus reactor now receives a context that is cancelled when the blocksync `Reactor.Stop()` runs, rather than when the node shuts down. This can prematurely cancel consensus operations that are expected to outlive the blocksync reactor.
Marked back as draft to take a closer look at the reactor code before opening back up for review.


Note: Medium Risk
Touches blocksync shutdown/concurrency by changing how long-running goroutines are started and awaited, which can affect node sync and shutdown behavior. Scope is limited and covered by a new regression test, but failures could manifest as hangs or incomplete shutdown.
Overview

Fixes blocksync shutdown ordering by starting `Reactor` and `BlockPool` long-running routines via `BaseService.Spawn()` instead of raw `go` calls, ensuring they are bound to the service's inner context and are waited on during `Stop()`.

Updates `Reactor.OnStop` documentation to reflect that `Stop()` now blocks until `requestRoutine`, `poolRoutine`, `processBlockSyncCh`, and `processPeerUpdates` exit, and adds a regression test (`TestReactor_OnStopWaitsForGoroutines`) that asserts no blocksync goroutines remain after `Reactor.Stop()` returns, preventing the LevelDB shutdown panic.