[scrubber] phase1: add scrub manager #395

Open: JacksonYao287 wants to merge 1 commit into eBay:main from JacksonYao287:add-scrub-manager

JacksonYao287 (Collaborator) commented Mar 8, 2026

This PR implements the framework and basic logic of the scrubber, including:
1. Thread model
2. Scrubber RPC
3. Local scrub: deep and shallow scrub for PGs, shards, and blobs

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 70.28807% with 361 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.78%. Comparing base (1746bcc) to head (8ae8569).
⚠️ Report is 150 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib/homestore_backend/scrub_manager.cpp | 67.57% | 196 Missing and 89 partials ⚠️ |
| src/lib/homestore_backend/hs_pg_manager.cpp | 60.92% | 41 Missing and 18 partials ⚠️ |
| src/lib/homestore_backend/scrub_manager.hpp | 90.29% | 10 Missing ⚠️ |
| src/lib/homestore_backend/MPMCPriorityQueue.hpp | 94.28% | 1 Missing and 1 partial ⚠️ |
| src/lib/homestore_backend/hs_shard_manager.cpp | 86.66% | 2 Missing ⚠️ |
| ...ib/homestore_backend/replication_state_machine.cpp | 87.50% | 2 Missing ⚠️ |
| src/lib/homestore_backend/hs_http_manager.cpp | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   63.15%   58.78%   -4.37%     
==========================================
  Files          32       38       +6     
  Lines        1900     6122    +4222     
  Branches      204      800     +596     
==========================================
+ Hits         1200     3599    +2399     
- Misses        600     2122    +1522     
- Partials      100      401     +301     

Add comprehensive scrub infrastructure to detect data corruption and
inconsistencies across replicas in HomeObject. This is phase 1 of the
scrubber implementation.

Key features:

- Implements deep and shallow scrubbing for PG metadata, shards, and blobs
- Supports periodic and manual scrub triggering modes
- Uses priority queue (MPMCPriorityQueue) for scrub task scheduling
- Persists scrub metadata using superblocks to track last scrub times
- Coordinates scrub operations across all replicas in a PG
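
The scheduling idea above (manual and periodic triggers feeding a priority queue, with last scrub times tracked per PG) can be illustrated with a minimal single-threaded sketch. It uses `std::priority_queue` as a stand-in for the PR's `MPMCPriorityQueue`; the names `ScrubTask`, `ScrubTrigger`, `ScrubTaskOrder`, and `periodic_scrub_due`, and the ordering policy itself, are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical types: the PR's real task layout may differ.
enum class ScrubTrigger : uint8_t { Periodic = 0, Manual = 1 };

struct ScrubTask {
    uint16_t pg_id;
    bool deep;                // deep or shallow scrub
    ScrubTrigger trigger;     // how the scrub was requested
    uint64_t last_scrub_sec;  // last completed scrub, seconds since epoch
};

// Assumed policy: manual requests outrank periodic ones; among equal
// triggers, the PG that was scrubbed longest ago is served first.
struct ScrubTaskOrder {
    bool operator()(const ScrubTask& a, const ScrubTask& b) const {
        if (a.trigger != b.trigger) return a.trigger < b.trigger;
        return a.last_scrub_sec > b.last_scrub_sec;
    }
};

// Single-threaded stand-in for the PR's MPMCPriorityQueue.
using ScrubQueue = std::priority_queue<ScrubTask, std::vector<ScrubTask>, ScrubTaskOrder>;

// A periodic scrub is due once `interval_sec` has elapsed since the last one.
bool periodic_scrub_due(uint64_t now_sec, uint64_t last_scrub_sec, uint64_t interval_sec) {
    return now_sec >= last_scrub_sec && now_sec - last_scrub_sec >= interval_sec;
}
```

With this ordering, a manually triggered task is popped before any periodic one, and periodic tasks drain oldest-first.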

Scrub types:

1. **Deep Scrub**: Full data integrity verification
   - PG metadata validation
   - Shard existence and consistency checks
   - Blob hash verification (reads data and computes checksums)
   - Detects corrupted, missing, and inconsistent data across replicas

2. **Shallow Scrub**: Lightweight metadata-only verification
   - Shard existence checks
   - Blob index validation (no data reads)
   - Faster execution for routine checks
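
To make the difference between the two modes concrete, here is a minimal sketch of a per-blob check. It is not the PR's code: `BlobIndex`, `StoredBlob`, `shallow_scrub_blob`, and `deep_scrub_blob` are hypothetical, the index is an in-memory map, and FNV-1a stands in for whatever checksum the scrubber actually computes. It also assumes an expected hash is stored next to the data, whereas the PR compares hashes across replicas.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

enum class ScrubResult { Ok, Missing, Corrupted };

// Hypothetical in-memory stand-in for a blob and its recorded checksum.
struct StoredBlob {
    std::string data;
    uint64_t expected_hash;  // assumed to be recorded at write time
};

using BlobIndex = std::map<uint64_t, StoredBlob>;

// 64-bit FNV-1a, used here only as a placeholder checksum.
uint64_t fnv1a(const std::string& s) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Shallow scrub: consult the index only, never read blob data.
ScrubResult shallow_scrub_blob(const BlobIndex& index, uint64_t blob_id) {
    return index.count(blob_id) ? ScrubResult::Ok : ScrubResult::Missing;
}

// Deep scrub: read the data and recompute the checksum.
ScrubResult deep_scrub_blob(const BlobIndex& index, uint64_t blob_id) {
    auto it = index.find(blob_id);
    if (it == index.end()) return ScrubResult::Missing;
    return fnv1a(it->second.data) == it->second.expected_hash ? ScrubResult::Ok
                                                              : ScrubResult::Corrupted;
}
```

A blob whose bytes have changed passes the shallow check but fails the deep one, which is the trade-off between the two modes.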

Scrub RPC:

- FlatBuffer-based serialization for scrub requests and responses
- Leader sends scrub requests to all replicas
- Followers return scrub maps with their local state
- Retry logic with configurable timeouts for reliability
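
The retry behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the PR's RPC layer: `send_with_retry` and `ScrubMapBytes` are hypothetical names, and the transport is abstracted as any callable that takes a timeout and returns `std::nullopt` when the peer does not answer in time.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>

// Hypothetical alias for a serialized scrub map returned by a follower.
using ScrubMapBytes = std::string;

// Calls `send(timeout)` up to `max_attempts` times and returns the first
// non-empty response, or std::nullopt if every attempt timed out.
// `send` must return a std::optional<...>.
template <typename SendFn>
auto send_with_retry(SendFn&& send, std::chrono::milliseconds timeout, int max_attempts)
    -> decltype(send(timeout)) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (auto resp = send(timeout)) return resp;
    }
    return std::nullopt;
}
```

A leader would presumably run one such loop per follower and treat an exhausted retry budget as an unreachable peer; how the PR reports that case is not stated in the description.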

Scrub reports:

- **ShallowScrubReport**: Tracks missing shards and blobs per peer
- **DeepScrubReport**: Extends shallow report with:
  - Corrupted blobs/shards with error details
  - Inconsistent blobs (different hashes across replicas)
  - Corrupted PG metadata
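
One way to picture how the leader could turn per-replica scrub maps into these reports is the sketch below. The struct names mirror the description, but their fields, the `ScrubMap` alias, and `build_report` are assumptions for illustration; shards, corrupted entries, and PG metadata are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

using PeerId = std::string;
using BlobId = uint64_t;
// Hypothetical scrub map: blob id -> hash reported by one replica.
using ScrubMap = std::map<BlobId, uint64_t>;

struct ShallowScrubReport {
    std::map<PeerId, std::set<BlobId>> missing_blobs;  // per-peer missing blobs
};

struct DeepScrubReport : ShallowScrubReport {
    std::set<BlobId> inconsistent_blobs;  // hashes differ across replicas
};

// Merge the scrub maps of all replicas: a blob is missing on a peer if any
// other peer has it, and inconsistent if peers report different hashes.
DeepScrubReport build_report(const std::map<PeerId, ScrubMap>& maps) {
    std::map<BlobId, std::set<uint64_t>> hashes_by_blob;
    for (const auto& [peer, scrub_map] : maps) {
        (void)peer;
        for (const auto& [blob, hash] : scrub_map) hashes_by_blob[blob].insert(hash);
    }

    DeepScrubReport report;
    for (const auto& [blob, hashes] : hashes_by_blob) {
        if (hashes.size() > 1) report.inconsistent_blobs.insert(blob);
        for (const auto& [peer, scrub_map] : maps) {
            if (scrub_map.count(blob) == 0) report.missing_blobs[peer].insert(blob);
        }
    }
    return report;
}
```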

Range-based scrubbing:

- Scrubs data in configurable ranges to avoid timeouts
- Shard range: 2M shards per request
- Blob range: derived from HDD IOPS for deep scrub; 2M blobs per request for shallow scrub
- Early cancellation support for graceful shutdown
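
The range-based loop with early cancellation might look like the following sketch. `scrub_in_ranges` and its parameters are hypothetical; the PR's actual range sizes and cancellation mechanism are only summarized in the list above, so the cancellation flag here is simply an `std::atomic<bool>` checked between ranges.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

// Walks the half-open id interval [start, end) in chunks of at most
// `range_size`, calling `scrub_range(lo, hi)` for each chunk. The
// cancellation flag is checked before every chunk, so a shutdown request
// takes effect at the next range boundary. Returns how many ids were covered.
uint64_t scrub_in_ranges(uint64_t start, uint64_t end, uint64_t range_size,
                         const std::atomic<bool>& cancelled,
                         const std::function<void(uint64_t, uint64_t)>& scrub_range) {
    if (range_size == 0) return 0;  // guard against a non-advancing loop
    uint64_t covered = 0;
    uint64_t lo = start;
    while (lo < end && !cancelled.load()) {
        // Compute the chunk end without overflowing past `end`.
        uint64_t hi = (end - lo > range_size) ? lo + range_size : end;
        scrub_range(lo, hi);
        covered += hi - lo;
        lo = hi;
    }
    return covered;
}
```

Keeping each request bounded means a single slow range cannot hold an RPC open past its timeout, and a cancelled scrub stops after at most one more range.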

Tests:

1. **DeepScrubTest**: Verifies detection of:
   - Missing blobs on followers
   - Missing shards on followers
   - Corrupted blob data (IO errors)
   - Inconsistent blob hashes across replicas

2. **MPMCPriorityQueue Tests**: Lock-free queue validation
   - Concurrent push/pop operations
   - Priority ordering verification
   - Thread safety under contention
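
As a rough picture of what a concurrent push followed by an ordering check involves, here is a small sketch. It deliberately uses a mutex-guarded `std::priority_queue` (`LockedPriorityQueue`, a hypothetical name) rather than a lock-free structure, so it is a stand-in for the kind of test described and not the PR's `MPMCPriorityQueue` or its test suite.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Mutex-based stand-in; the PR's queue is described as lock-free.
template <typename T>
class LockedPriorityQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        heap_.push(std::move(value));
    }

    // Returns the highest-priority element, or std::nullopt if empty.
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (heap_.empty()) return std::nullopt;
        T value = heap_.top();
        heap_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::priority_queue<T> heap_;
};

// Several producers push disjoint integer ranges concurrently; afterwards
// the queue must drain in strictly descending order with nothing lost.
bool concurrent_push_then_ordered_pop(int producers, int per_producer) {
    LockedPriorityQueue<int> queue;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&queue, p, per_producer] {
            for (int i = 0; i < per_producer; ++i) queue.push(p * per_producer + i);
        });
    }
    for (auto& t : threads) t.join();

    int expected = producers * per_producer - 1;
    while (auto value = queue.try_pop()) {
        if (*value != expected--) return false;
    }
    return expected == -1;
}
```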