DAOS-18633 rebuild: throttle rebuild status logs to reduce overhead #17696

Open
wangshilong wants to merge 2 commits into master from shilongw/DAOS-1863-reduce-log

@wangshilong
Contributor

Currently, each target unconditionally dumps its rebuild progress to the log every 2 seconds. In a large-scale scenario where a system has 100 pools rebuilding concurrently across 16 targets per rank, a 10-hour run can generate several GBs of log data per rank. This continuous, high-frequency logging (around 50 log lines per second per xstream) causes severe I/O contention and degrades overall I/O performance and ULT scheduling.
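The figures above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical and it assumes each target emits exactly one progress line per rebuilding pool per interval.

```c
#include <stdint.h>

/* Progress lines per second on one target xstream, assuming every
 * rebuilding pool logs once per interval (an assumption, see above). */
static inline uint64_t
lines_per_sec(uint64_t pools, uint64_t interval_sec)
{
	return pools / interval_sec;
}

/* Total progress lines written by one rank over duration_sec. */
static inline uint64_t
lines_per_rank(uint64_t pools, uint64_t targets, uint64_t interval_sec,
	       uint64_t duration_sec)
{
	return pools * targets * (duration_sec / interval_sec);
}
```

With 100 pools and a 2-second interval, `lines_per_sec(100, 2)` gives 50, matching the rate quoted above. Over 10 hours (36000 seconds) and 16 targets, `lines_per_rank(100, 16, 2, 36000)` gives 28,800,000 lines per rank. At an assumed 100 bytes per line that is roughly 2.9 GB, in line with "several GBs". With a 300-second interval the same call yields 192,000 lines, a 150x reduction.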

There is no need to print background progress logs this frequently. This patch throttles rebuild status logging from once every 2 seconds to once every 5 minutes. The final status is still printed immediately when a rebuild completes or aborts, so we retain sufficient visibility for debugging while avoiding log storms.
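A minimal sketch of this kind of throttle, assuming a status loop that still wakes every 2 seconds. The struct and function names are hypothetical and are not taken from the patch or from the DAOS source; only the 5-minute interval comes from the description.

```c
#include <stdbool.h>
#include <stdint.h>

#define STATUS_LOG_INTERVAL_SEC 300 /* 5 minutes, from the description */

/* Hypothetical per-rebuild logging state; not the actual DAOS structure. */
struct status_log_state {
	uint64_t last_log_sec; /* time the last status line was emitted */
};

/* Return true when a status line should be emitted at time now_sec.
 * A completed or aborted rebuild always logs, bypassing the throttle. */
static inline bool
should_log_status(struct status_log_state *st, uint64_t now_sec,
		  bool done_or_aborted)
{
	if (!done_or_aborted &&
	    now_sec - st->last_log_sec < STATUS_LOG_INTERVAL_SEC)
		return false;

	st->last_log_sec = now_sec;
	return true;
}
```

The caller would invoke `should_log_status()` on every iteration of the polling loop and print only when it returns true, so the loop can keep its 2-second cadence for detecting completion while the log sees at most one progress line per 5 minutes.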

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong requested review from a team as code owners March 13, 2026 02:31
@github-actions

Errors are Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18633

NiuYawei previously approved these changes Mar 13, 2026
@daosbuild3
Collaborator

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Contributor

@kccain kccain left a comment

any reason the leader loop doesn't use the same modulo approach as implemented in the tgt status loop?

@wangshilong
Contributor Author

wangshilong commented Mar 14, 2026

> any reason the leader loop doesn't use the same modulo approach as implemented in the tgt status loop?

I tried to avoid double when updating the leader part and over-optimized to avoid % operations, but I forgot to update the tgt status loop. I guess this is not a big deal.
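For readers following the exchange, the two styles being contrasted might look roughly like the sketch below. Neither function is taken from the patch: the names are hypothetical, the countdown variant is just one possible way to avoid `%`, and the constant assumes a 2-second loop period and a 5-minute log interval.

```c
#include <stdbool.h>
#include <stdint.h>

/* Loop iterations between log lines: 300 s / 2 s (assumed periods). */
#define LOG_EVERY_N_ITERS 150

/* Modulo variant: log on every Nth iteration of the polling loop. */
static inline bool
log_due_modulo(uint64_t iter)
{
	return iter % LOG_EVERY_N_ITERS == 0;
}

/* Countdown variant (hypothetical): same cadence without a % operation.
 * *remaining starts at 0 so that the first iteration logs. */
static inline bool
log_due_countdown(uint32_t *remaining)
{
	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	*remaining = LOG_EVERY_N_ITERS - 1;
	return true;
}
```

Both variants fire on iterations 0, 150, 300, and so on. At one check every 2 seconds the cost of a single `%` is negligible, which is consistent with the author's remark that the difference is not a big deal.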

5 participants