fix(minion): fail single-segment task when segment upload fails#18813
fix(minion): fail single-segment task when segment upload fails#18813tarun11Mavani wants to merge 2 commits into
Conversation
BaseSingleSegmentConversionExecutor caught upload exceptions, logged and metered them, then returned the conversion result normally -- so the minion task reported SUCCESS even though the converted segment was never uploaded. Helix marked the task COMPLETED and never retried, silently leaving the segment un-refreshed/un-purged/un-compacted. The swallow was an accidental control-flow change introduced in apache#10978 (observability for upload/download failures): the download branch metered, logged, and rethrew, but the upload branch omitted the rethrow. Rethrow the upload exception, mirroring the download path, so the task fails and is retried. Move the tarred-file cleanup into a finally block so it still runs on the failure path, and remove the now-dead uploadSuccessful flag. Affects all single-segment conversion tasks (RealtimeToOffline, Purge, RefreshSegment, UpsertCompaction, etc.). On upload failure these now report task FAILURE (and retry) instead of SUCCESS; operators alerting on task state will see failures that were previously hidden.
…andling Adds BaseSingleSegmentConversionExecutorTest covering executeTask's upload- failure path: a failed segment upload must propagate (task fails and retries) instead of being silently reported as success. The test static-mocks SegmentConversionUtils.uploadSegment and uses a test-only executor that stubs the download/CRC/convert/ZK-modifier hooks so executeTask reaches the upload step without a server, controller, or deep store. Includes a success-path control test. Verified to fail on the pre-fix executor and pass on the fixed one.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18813 +/- ##
============================================
+ Coverage 64.82% 64.85% +0.02%
- Complexity 1319 1327 +8
============================================
Files 3388 3388
Lines 210228 210225 -3
Branches 32948 32947 -1
============================================
+ Hits 136282 136343 +61
+ Misses 62978 62903 -75
- Partials 10968 10979 +11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
xiangfu0
left a comment
There was a problem hiding this comment.
Found one high-signal operability issue; see inline comment.
|
|
||
| if (uploadSuccessful) { | ||
| LOGGER.info("Done executing {} on table: {}, segment: {}", taskType, tableNameWithType, segmentName); | ||
| throw e; |
There was a problem hiding this comment.
In METADATA push mode this rethrow turns controller-side push failures into task retries, but the retry comes back through moveSegmentToOutputPinotFS() with the same <segment>.tar.gz target. overwriteOutput is normally left false in MinionTaskUtils.getPushTaskConfig(), so if the first attempt already copied the tar before failing, the retry now dies on Output file already exists before it can resend metadata. That makes transient metadata-push failures sticky instead of self-healing. Can we make the staged tar idempotent across retries or clean it up before rethrowing?
Summary
BaseSingleSegmentConversionExecutorcaught segment-upload exceptions, logged and metered them, and then returned the conversion result normally — so the minion task was reported as SUCCESS even though the converted segment was never uploaded. Helix marked the task COMPLETED and never retried it, silently leavingthe segment un-refreshed / un-purged / un-compacted.
This affects all single-segment conversion tasks:
RealtimeToOfflineSegments,PurgeTask,RefreshSegment,UpsertCompaction, etc.Root cause
The swallow was an accidental control-flow change introduced in #10978 ("Add minion observability for segment upload/download failures"). That PR wrapped both the download and the upload calls to add metrics + logging:
So the upload path has silently swallowed failures since June 2023. The download-path asymmetry shows this was an oversight, not an intentional "don't retry on upload failure" design — the PR's stated intent was observability only.
Change
finallyblock so it still runs on the failure path (the file also lives undertempDataDir, which the outerfinallyalready deletes, so cleanup is preserved either way).uploadSuccessfulflag.Testing
Added
BaseSingleSegmentConversionExecutorTest(new file):testExecuteTaskRethrowsWhenUploadFails— static-mocksSegmentConversionUtils.uploadSegmentto throw and assertsexecuteTaskpropagates the exception (the regression guard for this fix).testExecuteTaskSucceedsWhenUploadSucceeds— control test asserting thesuccess path returns the conversion result and the upload is invoked.
The test uses a test-only executor that stubs the download / CRC / convert / ZK-metadata-modifier hooks so
executeTaskreaches the upload step without a server, controller, or deep store, and restores the mutated process-global state in teardown.