Skip to content

branch-4.1: [fix](cloud) fix segment footer CORRUPTION not triggering file cache retry #61386#61426

Open
github-actions[bot] wants to merge 1 commit intobranch-4.1from
auto-pick-61386-branch-4.1
Open

branch-4.1: [fix](cloud) fix segment footer CORRUPTION not triggering file cache retry #61386#61426
github-actions[bot] wants to merge 1 commit intobranch-4.1from
auto-pick-61386-branch-4.1

Conversation

@github-actions
Copy link
Contributor

Cherry-picked from #61386

…retry (#61386)

## Proposed changes

The three-tier retry logic in `Segment::_open()` (static method) was
structured as `if-else-if`, so when `open_file()` succeeded but
`_parse_footer()` returned `CORRUPTION`, the retry branch was
unreachable.

### Root cause

```cpp
// Before: if-else-if structure
auto st = fs->open_file(path, &file_reader, &reader_options);
if (st) {                                  // open_file succeeded (almost always)
    segment->_file_reader = ...;
    st = segment->_open(stats);            // _parse_footer() → CORRUPTION stored in st
} else if (st.is<CORRUPTION>() && ...) {   // UNREACHABLE: already entered `if` branch
    // Tier 1: clear cache, retry
    // Tier 2: bypass cache, read remote directly
}
RETURN_IF_ERROR(st);  // CORRUPTION returned without any retry
```

`open_file()` only opens a file handle and rarely returns `CORRUPTION`.
The actual footer checksum validation happens inside `_parse_footer()`
(called via `segment->_open()`). Because the retry was in an `else if`
guarded by the same `st` from `open_file()`, it was never reachable for
the common `_parse_footer()` corruption case.

### Fix

Change `else if` to a separate `if` block, so CORRUPTION from either
`open_file()` or `_parse_footer()` triggers the three-tier retry:

```cpp
// After: independent if blocks
if (st) {
    segment->_file_reader = ...;
    st = segment->_open(stats);   // _parse_footer() → CORRUPTION stored in st
}
// NOW reachable regardless of where CORRUPTION came from
if (st.is<CORRUPTION>() && reader_options.cache_type == FILE_BLOCK_CACHE) {
    // Tier 1: clear file cache, re-download from remote
    // Tier 2: bypass cache entirely, read remote directly
    // Tier 3: remote source itself corrupt (logs warning)
}
```

### Issue

Observed in cloud (S3) deployments: schema change fails with
`CORRUPTION: Bad segment file footer checksum not match`. Log analysis
confirmed that no retry log messages were ever emitted, consistent with
this code bug.

## Tests

- Added `TestFooterCorruptionTriggersRetry` to
`segment_corruption_test.cpp`
- Uses the existing `Segment::parse_footer:magic_number_corruption` sync
point to corrupt the footer magic number on the first `_parse_footer()`
call only (simulating file cache corruption)
- Verifies the segment opens successfully via the retry path
@github-actions github-actions bot requested a review from yiguolei as a code owner March 17, 2026 07:15
@hello-stephen
Copy link
Contributor

run buildall

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.88% (19290/36478)
Line Coverage 36.20% (180144/497590)
Region Coverage 32.70% (139262/425863)
Branch Coverage 33.75% (60611/179614)

@hello-stephen
Copy link
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.49% (25532/35713)
Line Coverage 54.22% (269292/496706)
Region Coverage 51.66% (222197/430138)
Branch Coverage 53.20% (95895/180259)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants