fix(minion): fail single-segment task when segment upload fails#18813

Open
tarun11Mavani wants to merge 2 commits into
apache:masterfrom
tarun11Mavani:fix-minion-single-segment-upload-failure

tarun11Mavani commented Jun 19, 2026

Contributor

Summary

BaseSingleSegmentConversionExecutor caught segment-upload exceptions, logged and metered them, and then returned the conversion result normally, so the minion task was reported as SUCCESS even though the converted segment was never uploaded. Helix marked the task COMPLETED and never retried it, silently leaving the segment un-refreshed / un-purged / un-compacted.

This affects all single-segment conversion tasks: RealtimeToOfflineSegments, PurgeTask, RefreshSegment, UpsertCompaction, etc.

Root cause

The swallow was an accidental control-flow change introduced in #10978 ("Add minion observability for segment upload/download failures"). That PR wrapped both the download and the upload calls to add metrics + logging:

  • the download branch metered, logged, and rethrew the exception;
  • the upload branch metered and logged but omitted the rethrow.

So the upload path has silently swallowed failures since June 2023. The download-path asymmetry shows this was an oversight, not an intentional "don't retry on upload failure" design — the PR's stated intent was observability only.
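
The asymmetry can be illustrated with a minimal, self-contained sketch. This is not the Pinot source: the class, the Step interface, and the failureMeter counter are hypothetical stand-ins for the executor, the download/upload calls, and the minion metric.

```java
import java.io.IOException;

// Hypothetical, simplified model of the two catch blocks; not the actual Pinot code.
final class SwallowedUploadSketch {
  /** Stand-in for a step that may fail (download or upload). */
  interface Step {
    void run() throws IOException;
  }

  // Stand-in for the minion failure metric.
  static int failureMeter = 0;

  // Download branch: meter, log, and rethrow.
  static void download(Step step) throws IOException {
    try {
      step.run();
    } catch (IOException e) {
      failureMeter++;
      System.err.println("download failed: " + e.getMessage());
      throw e;
    }
  }

  // Upload branch before the fix: meter and log, but no rethrow.
  static void uploadBuggy(Step step) {
    try {
      step.run();
    } catch (IOException e) {
      failureMeter++;
      System.err.println("upload failed: " + e.getMessage());
      // The missing "throw e;" means the caller never learns about the failure.
    }
  }

  static String executeTask(Step download, Step upload) throws IOException {
    download(download);
    uploadBuggy(upload);
    return "SUCCESS"; // reached even when the upload threw
  }

  public static void main(String[] args) throws IOException {
    String state = executeTask(
        () -> { },
        () -> { throw new IOException("controller unavailable"); });
    System.out.println(state);
  }
}
```

In this sketch the failed upload is counted and logged, yet executeTask still returns "SUCCESS", which is the behavior described above.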

Change

  • Rethrow the upload exception, mirroring the download path, so the task fails and is retried by the framework.
  • Move the tarred-file cleanup into a finally block so it still runs on the failure path (the file also lives under tempDataDir, which the outer finally already deletes, so cleanup is preserved either way).
  • Remove the now-dead uploadSuccessful flag.
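
A rough sketch of the resulting shape, assuming a hypothetical Uploader interface and plain java.nio.file calls in place of the executor's real upload and file utilities:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of "rethrow + cleanup in finally"; names are illustrative only.
final class RethrowWithCleanupSketch {
  /** Stand-in for the segment upload call. */
  interface Uploader {
    void upload(Path tarredSegment) throws IOException;
  }

  static void uploadAndCleanUp(Path tarredSegment, Uploader uploader) throws IOException {
    try {
      uploader.upload(tarredSegment);
    } catch (IOException e) {
      System.err.println("upload failed: " + e.getMessage());
      throw e; // propagate so the task is reported as failed and can be retried
    } finally {
      deleteQuietly(tarredSegment); // runs on both the success and the failure path
    }
  }

  // Best-effort delete, so a cleanup error cannot mask the original upload exception.
  private static void deleteQuietly(Path path) {
    try {
      Files.deleteIfExists(path);
    } catch (IOException ignored) {
      // In this sketch an enclosing temp-dir cleanup is assumed to remove leftovers.
    }
  }

  public static void main(String[] args) throws IOException {
    Path tarred = Files.createTempFile("segment", ".tar.gz");
    boolean rethrown = false;
    try {
      uploadAndCleanUp(tarred, p -> { throw new IOException("simulated failure"); });
    } catch (IOException e) {
      rethrown = true;
    }
    System.out.println("rethrown=" + rethrown + ", tarExists=" + Files.exists(tarred));
  }
}
```

With the cleanup in finally, no success flag is needed to decide whether to delete the file, which is why a flag like uploadSuccessful becomes dead code.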

Testing

Added BaseSingleSegmentConversionExecutorTest (new file):

  • testExecuteTaskRethrowsWhenUploadFails — static-mocks SegmentConversionUtils.uploadSegment to throw and asserts executeTask propagates the exception (the regression guard for this fix).
  • testExecuteTaskSucceedsWhenUploadSucceeds — control test asserting the
    success path returns the conversion result and the upload is invoked.

The test uses a test-only executor that stubs the download / CRC / convert / ZK-metadata-modifier hooks so executeTask reaches the upload step without a server, controller, or deep store, and restores the mutated process-global state in teardown.
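
The stubbing approach can be sketched without any mocking library. The executor and hook names below are hypothetical, and the upload is modeled as an overridable hook purely to keep the example self-contained; the actual test static-mocks SegmentConversionUtils.uploadSegment instead.

```java
import java.io.IOException;

// Hypothetical, simplified illustration of a test-only executor with stubbed hooks.
final class ExecutorStubTestSketch {
  /** Simplified template-method executor; hook names are illustrative, not Pinot's. */
  abstract static class Executor {
    protected abstract void download() throws IOException;
    protected abstract String convert();
    protected abstract void upload(String converted) throws IOException;

    final String executeTask() throws IOException {
      download();
      String converted = convert();
      upload(converted); // failures must propagate to the caller
      return converted;
    }
  }

  // Test-only executor: download and convert are no-op stubs, upload behavior is injected.
  static Executor stubbed(boolean uploadFails) {
    return new Executor() {
      @Override protected void download() { }
      @Override protected String convert() { return "converted-segment"; }
      @Override protected void upload(String converted) throws IOException {
        if (uploadFails) {
          throw new IOException("simulated upload failure");
        }
      }
    };
  }

  // Regression guard: returns true only if the upload failure reaches the caller.
  static boolean propagatesUploadFailure() {
    try {
      stubbed(true).executeTask();
      return false;
    } catch (IOException expected) {
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("failure propagates: " + propagatesUploadFailure());
    System.out.println("success result: " + stubbed(false).executeTask());
  }
}
```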

BaseSingleSegmentConversionExecutor caught upload exceptions, logged and
metered them, then returned the conversion result normally -- so the minion
task reported SUCCESS even though the converted segment was never uploaded.
Helix marked the task COMPLETED and never retried, silently leaving the
segment un-refreshed/un-purged/un-compacted.

The swallow was an accidental control-flow change introduced in apache#10978
(observability for upload/download failures): the download branch metered,
logged, and rethrew, but the upload branch omitted the rethrow.

Rethrow the upload exception, mirroring the download path, so the task fails
and is retried. Move the tarred-file cleanup into a finally block so it still
runs on the failure path, and remove the now-dead uploadSuccessful flag.

Affects all single-segment conversion tasks (RealtimeToOffline, Purge,
RefreshSegment, UpsertCompaction, etc.). On upload failure these now report
task FAILURE (and retry) instead of SUCCESS; operators alerting on task state
will see failures that were previously hidden.
…andling

Adds BaseSingleSegmentConversionExecutorTest covering executeTask's upload-
failure path: a failed segment upload must propagate (task fails and retries)
instead of being silently reported as success.

The test static-mocks SegmentConversionUtils.uploadSegment and uses a test-only
executor that stubs the download/CRC/convert/ZK-modifier hooks so executeTask
reaches the upload step without a server, controller, or deep store. Includes a
success-path control test. Verified to fail on the pre-fix executor and pass on
the fixed one.

codecov-commenter commented Jun 19, 2026


Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.85%. Comparing base (5526d5b) to head (bda2c72).

Files with missing lines                                 Patch %   Lines
...ion/tasks/BaseSingleSegmentConversionExecutor.java    50.00%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18813      +/-   ##
============================================
+ Coverage     64.82%   64.85%   +0.02%     
- Complexity     1319     1327       +8     
============================================
  Files          3388     3388              
  Lines        210228   210225       -3     
  Branches      32948    32947       -1     
============================================
+ Hits         136282   136343      +61     
+ Misses        62978    62903      -75     
- Partials      10968    10979      +11     
Flag                  Coverage Δ
custom-integration1   100.00% <ø> (ø)
integration           100.00% <ø> (ø)
integration1          100.00% <ø> (ø)
integration2          0.00% <ø> (ø)
java-21               64.85% <50.00%> (+0.02%) ⬆️
temurin               64.85% <50.00%> (+0.02%) ⬆️
unittests             64.85% <50.00%> (+0.02%) ⬆️
unittests1            56.99% <ø> (-0.01%) ⬇️
unittests2            37.30% <50.00%> (+0.03%) ⬆️


xiangfu0 left a comment

Contributor


Found one high-signal operability issue; see inline comment.


if (uploadSuccessful) {
LOGGER.info("Done executing {} on table: {}, segment: {}", taskType, tableNameWithType, segmentName);
throw e;

Contributor


In METADATA push mode this rethrow turns controller-side push failures into task retries, but the retry comes back through moveSegmentToOutputPinotFS() with the same <segment>.tar.gz target. overwriteOutput is normally left false in MinionTaskUtils.getPushTaskConfig(), so if the first attempt already copied the tar before failing, the retry now fails with "Output file already exists" before it can resend the metadata. That makes transient metadata-push failures sticky instead of self-healing. Can we make the staged tar idempotent across retries, or clean it up before rethrowing?
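
One way the "clean it up before rethrowing" option could look, as a sketch only: java.nio.file stands in for PinotFS, Files.copy without REPLACE_EXISTING models overwriteOutput=false, and stageAndPush and MetadataPush are hypothetical names, not Pinot APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: remove the staged tar when the metadata push fails,
// so that a retry does not trip over an existing output file.
final class IdempotentStagingSketch {
  /** Stand-in for the controller-side metadata push. */
  interface MetadataPush {
    void push(Path stagedTar) throws IOException;
  }

  static void stageAndPush(Path localTar, Path stagedTar, MetadataPush push) throws IOException {
    // Without REPLACE_EXISTING this throws FileAlreadyExistsException if the target
    // exists, which models the non-overwriting copy described in the comment above.
    Files.copy(localTar, stagedTar);
    try {
      push.push(stagedTar);
    } catch (IOException e) {
      try {
        Files.deleteIfExists(stagedTar); // undo the staging so the next attempt starts clean
      } catch (IOException cleanupFailure) {
        e.addSuppressed(cleanupFailure);
      }
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("staging-sketch");
    Path localTar = Files.createTempFile(dir, "segment", ".tar.gz");
    Path stagedTar = dir.resolve("staged.tar.gz");
    try {
      stageAndPush(localTar, stagedTar, p -> { throw new IOException("transient push failure"); });
    } catch (IOException e) {
      System.out.println("first attempt failed: " + e.getMessage());
    }
    stageAndPush(localTar, stagedTar, p -> { }); // the retry is not blocked by a leftover tar
    System.out.println("retry staged copy present: " + Files.exists(stagedTar));
  }
}
```

The alternative raised in the comment, making the staging step itself idempotent (for example by overwriting or reusing an existing staged tar), would avoid depending on cleanup succeeding; which is preferable depends on how the output location is shared between attempts.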


3 participants