sql/backfill: report panics in backfill goroutines#169062
sql/backfill: report panics in backfill goroutines#169062trunk-io[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
The index backfiller and the MVCC index merger each spawn goroutines that, until now, had no panic recovery. A panic inside the goroutine spawned by indexBackfiller.Run (or one re-thrown from a ctxgroup worker by g.Wait) would tear down the SQL pod with no Sentry report and no CRDB-formatted log entry — only the Go runtime's bare stderr dump. Switch the indexBackfiller goroutine to stopper.RunAsyncTaskEx so that the stopper's recover wrapper reports the panic to Sentry before re-panicking. The MVCC index merger keeps its bare goroutine because Run depends on g.Wait returning before the deferred memory monitor cleanup runs (a refused stopper task would force an early return with workers still using the bound account); instead it gets a defer logcrash.RecoverAndReportPanic so the same Sentry visibility applies. Both changes are defense in depth: the SQL pod still crashes after a panic in either path, but now the crash is observable instead of silent. Informs: cockroachdb#169059 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
😎 Merged successfully - details. |
spilchen
left a comment
There was a problem hiding this comment.
@spilchen made 2 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on mw5h and rafiss).
pkg/sql/rowexec/indexbackfiller.go line 578 at r1 (raw file):
// we loop over progCh, which is closed only after the goroutine returns. if startErr := ib.flowCtx.Stopper().RunAsyncTaskEx(ctx, stop.TaskOpts{ TaskName: "indexBackfiller-runBackfill",
nit: we don't typically use camel case for task names. I'm fine if you want to ignore this, I just thought it looked a bit odd.
rafiss
left a comment
There was a problem hiding this comment.
TFTR!
/trunk merge
@rafiss made 2 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on mw5h and spilchen).
pkg/sql/rowexec/indexbackfiller.go line 578 at r1 (raw file):
Previously, spilchen wrote…
nit: we don't typically use camel case for task names. I'm fine if you want to ignore this, I just thought it looked a bit odd.
we use camel case (and also a few other non-standard formats) for this in a few other places:
cockroach/pkg/backup/backup_processor.go
Line 210 in 02c2c76
since there's no unified convention, i'll keep this as is
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. merge conflict cherry-picking 5fdf1a2 to blathers/backport-release-25.4-169062 Backport to branch 25.4.x failed. See errors above. merge conflict cherry-picking 5fdf1a2 to blathers/backport-release-26.1-169062 Backport to branch 26.1.x failed. See errors above. merge conflict cherry-picking 5fdf1a2 to blathers/backport-release-26.2-169062 Backport to branch 26.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
The index backfiller and the MVCC index merger each spawn goroutines that, until now, had no panic recovery. A panic inside the goroutine spawned by indexBackfiller.Run (or one re-thrown from a ctxgroup worker by g.Wait) would tear down the SQL pod with no Sentry report and no CRDB-formatted log entry — only the Go runtime's bare stderr dump.
Switch the indexBackfiller goroutine to stopper.RunAsyncTaskEx so that the stopper's recover wrapper reports the panic to Sentry before re-panicking. The MVCC index merger keeps its bare goroutine because Run depends on g.Wait returning before the deferred memory monitor cleanup runs (a refused stopper task would force an early return with workers still using the bound account); instead it gets a
defer logcrash.RecoverAndReportPanic so the same Sentry visibility applies.
Both changes are defense in depth: the SQL pod still crashes after a panic in either path, but now the crash is observable instead of silent.
Informs: #169059
Epic: none
Release note: None