DAOS-18541 rebuild: accumulate more OIDs per migrate RPC to reduce RPC count by wangshilong · Pull Request #17704 · daos-stack/daos

wangshilong · 2026-03-14T15:27:24Z

Fix yield-count accounting in the scanner: rebuild_object() is a pure in-memory btree insert and does not need to contribute yield pressure. A send-side batching policy is also introduced: the send ULT defers flushing until at least REBUILD_SEND_BATCH_MIN OIDs are queued or REBUILD_SEND_BATCH_TIMEOUT_SEC seconds have elapsed.

Without batching, a fast scanner floods the destination rank with many small RPCs, exhausting IB receive buffers and triggering timeouts. This is especially severe during reintegration, where all OIDs are concentrated on a single target rank.
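The flush policy described above can be sketched as a single predicate. This is a minimal illustration, not the code in the patch: the `sketch_`/`SKETCH_` names and the value 4096 for the send limit are assumptions. Flushing as soon as the scan is done is also an assumption, inferred from the `scan_done` flag in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration. The patch defines the minimum as
 * REBUILD_SEND_LIMIT / 4; the real REBUILD_SEND_LIMIT is not shown here. */
#define SKETCH_SEND_LIMIT             4096
#define SKETCH_SEND_BATCH_MIN         (SKETCH_SEND_LIMIT / 4)
#define SKETCH_SEND_BATCH_TIMEOUT_SEC 2

/*
 * Decide whether the send ULT should flush the queued OIDs now.
 * queued:     number of OIDs currently pending
 * scan_done:  true once the scanner has finished (assumed flush trigger)
 * wait_start: coarse time (seconds) at which the current wait began
 * now:        current coarse time (seconds)
 */
static bool
sketch_should_flush(uint32_t queued, bool scan_done, uint64_t wait_start, uint64_t now)
{
	if (queued == 0)
		return false; /* nothing to send */
	if (scan_done)
		return true; /* no more OIDs will arrive, drain what is left */
	if (queued >= SKETCH_SEND_BATCH_MIN)
		return true; /* batch is large enough */
	/* Batch is small: flush only if it has waited long enough. */
	return now - wait_start >= SKETCH_SEND_BATCH_TIMEOUT_SEC;
}
```

With these assumed values, a queue of 10 OIDs is held back for up to 2 seconds, while a queue of 1024 or more is sent immediately.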

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

Fix yield-count accounting in the scanner: rebuild_object() is a pure in-memory btree insert and does not need to contribute yield pressure. A send-side batching policy is also introduced: the send ULT defers flushing until at least REBUILD_SEND_BATCH_MIN OIDs are queued or REBUILD_SEND_BATCH_TIMEOUT_SEC seconds have elapsed, preventing a flood of small migrate RPCs when the scanner runs faster than the sender, particularly under reintegration workloads.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

github-actions · 2026-03-14T15:27:38Z

Ticket title is 'Rebuild stuck on Bear cluster'
Status is 'In Progress'
Labels: 'test_2.8'
https://daosio.atlassian.net/browse/DAOS-18541

gnailzenh · 2026-03-15T03:50:47Z

src/rebuild/scan.c

 		if (rc)
 			D_GOTO(out, rc);
-
-		arg->yield_cnt--;


Why is this removed?

I thought rebuild_object() is just a btree insert (a pure in-memory operation), so it is probably OK not to account for it in the yield count. I could add it back.
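To make the trade-off in this thread concrete, here is a rough sketch of yield-budget accounting. It is not the logic in scan.c: the struct, the helper names, and the budget of 4 are hypothetical. Only the idea of decrementing a `yield_cnt` field comes from the diff. It shows how also charging the in-memory insert doubles the number of yields for the same set of objects.

```c
#include <stdbool.h>

#define SKETCH_YIELD_BUDGET 4 /* hypothetical budget between yields */

struct sketch_scan_arg {
	int yield_cnt; /* remaining budget before the next yield */
	int yields;    /* how many times the scanner yielded */
};

/* Charge one unit of work; returns true when the caller should yield. */
static bool
sketch_charge(struct sketch_scan_arg *arg)
{
	if (--arg->yield_cnt > 0)
		return false;
	arg->yield_cnt = SKETCH_YIELD_BUDGET; /* refill after yielding */
	arg->yields++;
	return true;
}

/*
 * Scan nr_objects objects. Each object is charged once for the scan step;
 * if charge_insert is set, the in-memory insert is charged as well.
 * Returns the total number of yields.
 */
static int
sketch_scan(struct sketch_scan_arg *arg, int nr_objects, bool charge_insert)
{
	for (int i = 0; i < nr_objects; i++) {
		sketch_charge(arg); /* per-object scan work */
		if (charge_insert)
			sketch_charge(arg); /* the step this PR stops charging */
	}
	return arg->yields;
}
```

Under these assumptions, scanning 8 objects yields twice when the insert is not charged and four times when it is.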

gnailzenh · 2026-03-15T03:51:22Z

src/rebuild/scan.c

-		if (dbtree_is_empty(tls->rebuild_tree_hdl)) {
+		tree_empty = dbtree_is_empty(tls->rebuild_tree_hdl);
+		scan_done  = tls->rebuild_pool_scan_done;
+


How about adding "now = daos_gettime_coarse()" here?

gnailzenh · 2026-03-15T03:52:39Z

src/rebuild/scan.c

+		if (tree_empty) {
+			/* Reset wait clock and yield to let scan make progress. */
+			tls->rebuild_send_wait_start = daos_gettime_coarse();
 			dss_sleep(0);


I'd suggest changing this to dss_sleep(10) as well, to make it consistent with the code below.

gnailzenh · 2026-03-15T04:37:33Z

src/rebuild/scan.c

+/* Minimum pending objects before the send ULT flushes a batch (25% of max). */
+#define REBUILD_SEND_BATCH_MIN         (REBUILD_SEND_LIMIT / 4)
+/* Maximum seconds to wait for a batch to fill before flushing anyway. */
+#define REBUILD_SEND_BATCH_TIMEOUT_SEC 2


Changing this to 1 second is probably OK.

wangshilong requested review from a team as code owners March 14, 2026 15:27

wangshilong requested review from gnailzenh, kccain and liuxuezhao March 14, 2026 15:31

wangshilong changed the title ~~DAOS-18541 rebuild: reduce redundant migration OID RPCs~~ DAOS-18541 rebuild: batch migration OID send RPCs Mar 14, 2026

wangshilong changed the title ~~DAOS-18541 rebuild: batch migration OID send RPCs~~ DAOS-18541 rebuild: increase migration OID batch size to reduce RPC flood Mar 15, 2026

wangshilong changed the title ~~DAOS-18541 rebuild: increase migration OID batch size to reduce RPC flood~~ DAOS-18541 rebuild: accumulate more OIDs per migrate RPC to reduce RPC count Mar 15, 2026

gnailzenh reviewed Mar 15, 2026

wangshilong wants to merge 1 commit into master from shilongw/DAOS-18541
