SDSTOR-21465: scrubber phase 1 #413

Open
JacksonYao287 wants to merge 3 commits into eBay:stable/v4.x from JacksonYao287:scrubber-phase-1

@JacksonYao287

Collaborator

No description provided.

@codecov-commenter

codecov-commenter commented May 7, 2026

Codecov Report

❌ Patch coverage is 50.26212% with 759 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (stable/v4.x@e1c23e1). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/lib/homestore_backend/scrub_manager.cpp | 56.28% | 351 Missing and 87 partials ⚠️ |
| src/lib/homestore_backend/hs_http_manager.cpp | 0.42% | 237 Missing ⚠️ |
| src/lib/homestore_backend/hs_pg_manager.cpp | 72.50% | 21 Missing and 12 partials ⚠️ |
| src/lib/homestore_backend/hs_http_manager.hpp | 0.00% | 20 Missing ⚠️ |
| src/lib/homestore_backend/scrub_manager.hpp | 72.97% | 20 Missing ⚠️ |
| ...ib/homestore_backend/replication_state_machine.cpp | 64.70% | 5 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/MPMCPriorityQueue.hpp | 94.59% | 1 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/hs_homeobject.cpp | 66.66% | 0 Missing and 2 partials ⚠️ |
| src/lib/homestore_backend/hs_shard_manager.cpp | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             stable/v4.x     #413   +/-   ##
==============================================
  Coverage               ?   53.14%           
==============================================
  Files                  ?       39           
  Lines                  ?     6902           
  Branches               ?      943           
==============================================
  Hits                   ?     3668           
  Misses                 ?     2823           
  Partials               ?      411           

@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 3 times, most recently from b25a7a3 to 83d0375 Compare May 8, 2026 08:56
Comment thread src/lib/homestore_backend/scrub_manager.cpp
@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 2 times, most recently from 1cec0d0 to 96736e5 Compare May 17, 2026 03:49
@JacksonYao287 JacksonYao287 force-pushed the scrubber-phase-1 branch 6 times, most recently from fe2f1c0 to dfdb099 Compare June 8, 2026 06:27
Comment thread src/lib/homestore_backend/scrub_manager.cpp Outdated

@xiaoxichen xiaoxichen left a comment

Collaborator

The NPE bug should be fixed before merging; the rest LGTM.

The NPE can be triggered if some SM enables scrubbing during a config change or upgrade, which would then crash the whole cluster.

@JacksonYao287 JacksonYao287 requested a review from xiaoxichen June 11, 2026 23:43
@xiaoxichen

Collaborator

The UT failure is in the scrubber test; please take a look.

@JacksonYao287

Collaborator Author

I triggered this UT again and CI passed. I also ran this UT locally several times, but could not reproduce the failure.

I think it is a flaky case.

@JacksonYao287

JacksonYao287 commented Jun 13, 2026

Collaborator Author

I tried several more times, but still cannot reproduce this failure.

I suspect the root cause is an unexpected leader switch during the UT, just like the flaky stuck case we sometimes see in homestore_test_pg/shard/blob.

If an unexpected leader switch happens during the UT:
1. For the scrubber test, the scrub task is cancelled and the leader gets an incorrect scrub report, which causes the UT to fail.

2. For homestore_test_pg/shard/blob, the followers wait for something to happen, while the leader thinks it is no longer the leader (because of the leader switch) and does not schedule some ops. All the members then sync and wait at some point, so the UT gets stuck.

So I think we need to handle the leader switch case in the raft test framework.
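
One possible shape for handling a leader switch in a test helper is sketched below. It is only an illustration under stated assumptions: `FakeRaftGroup` and `run_with_stable_leader` are hypothetical names, not HomeObject, HomeStore, or raft library APIs, and the "term" counter stands in for whatever signal the real framework exposes when leadership changes. The helper accepts a result only if the term did not change while the operation ran, and retries otherwise.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-in for a raft group as seen by a test.
// This is not a real HomeObject/HomeStore type.
struct FakeRaftGroup {
    // Assumed to be bumped every time leadership changes.
    std::atomic<uint64_t> term{1};
};

// Runs `op` and returns its result only if the term was the same before and
// after the call. If a leader switch happened in between, the result is
// discarded and the operation is retried, up to `max_attempts` times.
template <typename T>
std::optional<T> run_with_stable_leader(FakeRaftGroup& group,
                                        const std::function<T()>& op,
                                        int max_attempts) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const uint64_t term_before = group.term.load();
        T result = op();
        if (group.term.load() == term_before) {
            return result;  // no leader switch observed during this attempt
        }
        // Leadership changed mid-operation: the result may be incomplete
        // (for example, a cancelled scrub task), so try again.
    }
    return std::nullopt;  // leadership never stayed stable long enough
}
```

A scrubber test could wrap "trigger scrub and collect the report" in such a helper, so that a report produced across a leader switch is retried instead of being asserted on.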

@xiaoxichen

Collaborator

OK, let's wait a few days for other team members to review. If there are no further comments, let's merge as is.

4 participants