DAOS-18633 rebuild: throttle rebuild status logs to reduce overhead by wangshilong · Pull Request #17696 · daos-stack/daos

wangshilong · 2026-03-13T02:31:05Z

Currently, each target dumps its rebuild progress to the log every 2 seconds unconditionally. In a large-scale scenario where a system has 100 pools rebuilding concurrently across 16 targets per rank, running for 10 hours can generate massive amounts of log data (several GBs per rank). This continuous, high-frequency logging (around 50 logs per second per xstream) causes severe I/O contention and negatively impacts overall I/O performance and ULT scheduling.
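The figures above can be checked with a small calculation. The sketch below is illustrative only: the function names are hypothetical and are not part of the DAOS source. It assumes every (pool, target) pair emits exactly one status line per interval.

```c
#include <stdint.h>

/*
 * Status lines written per second, assuming each (pool, target) pair
 * logs exactly once per interval. Hypothetical helper, not DAOS code.
 */
static double
status_lines_per_sec(uint32_t pools, uint32_t targets, uint32_t interval_sec)
{
	return (double)pools * targets / interval_sec;
}

/* Total status lines written over a run of run_sec seconds. */
static uint64_t
status_lines_total(uint32_t pools, uint32_t targets, uint32_t interval_sec,
		   uint32_t run_sec)
{
	/* Number of whole intervals in the run, times lines per interval. */
	return (uint64_t)pools * targets * (run_sec / interval_sec);
}
```

With 100 pools and a 2-second interval, one target (one xstream) writes 50 lines per second, and a 16-target rank writes 800 lines per second, or 28,800,000 lines over 10 hours. If a status line is assumed to be roughly 150 bytes (a guess, not a measured value), that is about 4.3 GB per rank, in line with the "several GBs" estimate. At a 5-minute interval the same run produces 192,000 lines.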

There is no need to print background progress logs this frequently. This patch throttles rebuild status logging from once every 2 seconds to once every 5 minutes. The final status is still printed immediately when a rebuild completes or aborts, so we retain sufficient visibility for debugging while avoiding log storms.
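A minimal sketch of this kind of throttle is shown below. It is not the code from this PR: the struct, the function, and the macro name are hypothetical, and the choice to log on the very first call is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* 5 minutes, per the description above. The macro name is hypothetical. */
#define STATUS_LOG_INTERVAL_SEC 300

/* Per-rebuild throttle state; zero-initialize before first use. */
struct status_log_throttle {
	uint64_t last_log_sec; /* time of the last emitted status line */
	bool     logged_once;  /* false until the first line is emitted */
};

/*
 * Return true if a status line should be emitted now.
 * A finished (completed or aborted) rebuild always logs immediately;
 * otherwise at most one line is emitted per interval.
 */
static bool
status_log_due(struct status_log_throttle *t, uint64_t now_sec, bool finished)
{
	if (finished || !t->logged_once ||
	    now_sec - t->last_log_sec >= STATUS_LOG_INTERVAL_SEC) {
		t->last_log_sec = now_sec;
		t->logged_once  = true;
		return true;
	}
	return false;
}
```

The caller keeps polling at its usual 2-second cadence and only writes the log line when `status_log_due()` returns true, so progress tracking itself is unaffected.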

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).


github-actions · 2026-03-13T02:31:22Z

Errors: Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18633

daosbuild3 · 2026-03-13T03:17:10Z

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17696/1/testReport/

src/rebuild/srv.c

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

kccain

any reason the leader loop doesn't use the same modulo approach as implemented in the tgt status loop?

wangshilong · 2026-03-14T00:36:39Z

any reason the leader loop doesn't use the same modulo approach as implemented in the tgt status loop?

I tried to avoid double when updating the leader part and over-optimized to avoid % operations, but forgot to update the tgt status loop. I guess this is not a big deal.
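For readers without the diff at hand, the two styles discussed here can be sketched as follows. This is a hypothetical illustration, not the code in src/rebuild/srv.c: the macro and function names are invented, and it assumes a loop that wakes every 2 seconds and logs on every 150th wake-up (5 minutes).

```c
#include <stdbool.h>
#include <stdint.h>

#define CHECK_INTERVAL_SEC 2   /* loop wake-up period (assumed) */
#define LOG_INTERVAL_SEC   300 /* desired logging period */
#define LOG_EVERY_N        (LOG_INTERVAL_SEC / CHECK_INTERVAL_SEC)

/* Modulo style: log when the iteration counter hits a multiple of N. */
static bool
log_due_modulo(uint64_t iter)
{
	return iter % LOG_EVERY_N == 0;
}

/*
 * Countdown style: same schedule without a % operation.
 * *remaining must start at 0 so that the first iteration logs.
 */
static bool
log_due_countdown(uint32_t *remaining)
{
	if (*remaining == 0) {
		*remaining = LOG_EVERY_N - 1;
		return true;
	}
	(*remaining)--;
	return false;
}
```

Both functions select the same iterations (0, 150, 300, ...), so the difference is one of consistency between the two loops rather than behavior.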

wangshilong requested review from a team as code owners March 13, 2026 02:31

wangshilong requested review from kccain and liuxuezhao March 13, 2026 02:31

wangshilong requested review from NiuYawei and gnailzenh March 13, 2026 02:31

NiuYawei previously approved these changes Mar 13, 2026

liuxuezhao reviewed Mar 13, 2026

src/rebuild/srv.c (outdated)

src/rebuild/srv.c (outdated)

Address comments and cleanup

873c96c

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>

wangshilong dismissed NiuYawei’s stale review via 873c96c March 13, 2026 07:21

wangshilong requested review from NiuYawei and liuxuezhao March 13, 2026 07:22

liuxuezhao approved these changes Mar 13, 2026

kccain approved these changes Mar 13, 2026

wangshilong wants to merge 2 commits into master from shilongw/DAOS-1863-reduce-log


5 participants
