Mschaara/aurora 2.6 test #17670

Draft
mchaarawi wants to merge 34 commits into release/2.6 from mschaara/aurora-2.6-test

Conversation

@mchaarawi
Contributor

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

gnailzenh and others added 23 commits January 24, 2026 10:58
A degraded EC read allocates and registers an extra buffer
to recover data, which may cause ENOMEM in some cases.

This workaround does not prevent dynamic buffer allocation and
registration, but it does provide relatively precise control over the
resources consumed by degraded EC reads.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
For data migration, after being woken up, the ULT should try
to wake up another ULT if resources are still available.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Add a resource bucket so that overall resource consumption does not
  grow on systems configured with more targets
- Track demanded resources and a wait queue for blocked ULTs, and wake
  up as many waiters as the released resources allow
- Code cleanup

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
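
The bucket-and-waiter idea described in the commit above can be illustrated with a minimal, single-threaded sketch. Everything here is hypothetical and is not taken from the DAOS sources: the names `res_bucket`, `bucket_hold`, and `bucket_release`, the fixed-size FIFO queue, and the strict first-come-first-served wake-up order are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 16

/* Hypothetical waiter: one blocked ULT and the units it demands. */
struct waiter {
	int  id;     /* identifies the blocked ULT */
	long demand; /* units it is waiting for */
};

/* Hypothetical bucket: its limit is fixed, independent of target count. */
struct res_bucket {
	long          limit;              /* total units in the bucket */
	long          in_use;             /* units currently held */
	struct waiter waitq[MAX_WAITERS]; /* FIFO ring of blocked requests */
	size_t        head;               /* index of the oldest waiter */
	size_t        count;              /* number of queued waiters */
};

static void bucket_init(struct res_bucket *b, long limit)
{
	b->limit = limit;
	b->in_use = 0;
	b->head = 0;
	b->count = 0;
}

/* Take units if they fit; otherwise queue the request and return false. */
static bool bucket_hold(struct res_bucket *b, int id, long units)
{
	assert(units > 0 && units <= b->limit);
	/* Never overtake earlier waiters, so the queue stays FIFO. */
	if (b->count == 0 && b->in_use + units <= b->limit) {
		b->in_use += units;
		return true;
	}
	assert(b->count < MAX_WAITERS);
	b->waitq[(b->head + b->count) % MAX_WAITERS] = (struct waiter){ id, units };
	b->count++;
	return false;
}

/*
 * Release units, then wake queued waiters in order for as long as their
 * demand fits. Woken ids are stored in woken[]; returns how many were woken.
 */
static size_t bucket_release(struct res_bucket *b, long units,
			     int *woken, size_t max)
{
	size_t n = 0;

	assert(units > 0 && units <= b->in_use);
	b->in_use -= units;
	while (b->count > 0 && n < max) {
		struct waiter *w = &b->waitq[b->head];

		if (b->in_use + w->demand > b->limit)
			break; /* the oldest waiter does not fit yet */
		b->in_use += w->demand;
		woken[n++] = w->id;
		b->head = (b->head + 1) % MAX_WAITERS;
		b->count--;
	}
	return n;
}
```

With a limit of 10 units, a holder of 8 units, and waiters demanding 3, 4, and 5 units, releasing the 8 units wakes the first two waiters (7 units in total) and leaves the third queued. A real implementation would block and resume ULTs rather than return ids.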
Increase the default resource limit

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Fix a reference leak in migrate_fini_one_ult()

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@github-actions

github-actions bot commented Mar 9, 2026

Errors are: component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/Mschaara/aurora

@daosbuild3
Collaborator

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@mchaarawi force-pushed the mschaara/aurora-2.6-test branch from 51e9635 to c22d71c on March 11, 2026 12:04
@daosbuild3
Collaborator

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@mchaarawi force-pushed the mschaara/aurora-2.6-test branch from c22d71c to 839040f on March 11, 2026 16:33
@daosbuild3
Collaborator

@daosbuild3
Collaborator

gnailzenh and others added 4 commits March 12, 2026 15:05
- Hulk data handling is no longer required; it is replaced by the
  starveling mechanism
- Remove the "yield" and simplify the code

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
If a rebuild hang is detected, dump resource bucket information and the waiter queue head

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@mchaarawi force-pushed the mschaara/aurora-2.6-test branch from b337427 to 6b707a3 on March 12, 2026 12:39
@daosbuild3
Collaborator

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
DAOS currently reserves only RF × group_size targets for a GX object,
whereas it should reserve targets_per_domain × RF.

In addition, when the number of reserved targets is a fixed value, the
likelihood of losing those targets grows significantly as the cluster
scales, which can cause collocated shards and extra data movement
during rebuild.

This patch changes the number of reserved targets for a GX object to
be no less than 30% of the targets.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
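
To make the arithmetic in the commit above concrete, here is a small sketch of how such a reservation count might be computed. The function name `gx_reserved_targets`, the choice to take the larger of targets_per_domain × RF and 30% of all targets, the rounding up, and the clamp to the total are all assumptions for illustration; the patch itself may combine these values differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of targets to reserve for a GX object.
 * Assumes the result is max(tgts_per_domain * rf, ceil(30% of total)),
 * capped at the total number of targets.
 */
static uint32_t gx_reserved_targets(uint32_t total_tgts,
				    uint32_t tgts_per_domain, uint32_t rf)
{
	uint32_t by_domain;
	uint32_t by_ratio;
	uint32_t reserve;

	assert(total_tgts > 0);
	by_domain = tgts_per_domain * rf;
	/* ceil(total * 30 / 100) using integer arithmetic */
	by_ratio = (uint32_t)(((uint64_t)total_tgts * 30 + 99) / 100);

	reserve = by_domain > by_ratio ? by_domain : by_ratio;
	if (reserve > total_tgts)
		reserve = total_tgts; /* cannot reserve more than exist */
	return reserve;
}
```

Under these assumptions, a pool with 1000 targets, 16 targets per domain, and RF 2 reserves 300 targets (the 30% floor dominates), while a pool with 64 targets reserves 32 (targets_per_domain × RF dominates). The reservation therefore scales with the cluster instead of staying fixed.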
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
- Remove _get_target and _get_dom; these functions are not required.
  The DAOS pool_map guarantees that extension always appends new children
  after old children. Meanwhile, dom_avail_children (the original
  get_num_domains) only returns the number of valid sub-domains/targets,
  so it doesn't make sense to traverse, which is very expensive for
  large objects.

- Instead of checking bit by bit, tgt_isset_range and dom_isset_range
  can skip the entire byte of the bitmap if it is 0xFF.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
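
The byte-skipping check described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the DAOS implementation: the name `bits_isset_range` is hypothetical, the range is taken to be inclusive, bits are assumed to be numbered from the least significant bit of each byte, and the function is assumed to answer "is every bit in the range set?".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if every bit in [first, last] (inclusive) is set. */
static bool bits_isset_range(const uint8_t *bitmap, uint32_t first,
			     uint32_t last)
{
	uint32_t i = first;

	assert(first <= last && last < UINT32_MAX);
	while (i <= last) {
		/*
		 * Fast path: at a byte boundary, with the whole byte inside
		 * the range and all eight bits set, skip the byte at once.
		 */
		if ((i & 7) == 0 && last - i >= 7 && bitmap[i >> 3] == 0xFF) {
			i += 8;
			continue;
		}
		/* Slow path: test a single bit. */
		if (!(bitmap[i >> 3] & (1u << (i & 7))))
			return false;
		i++;
	}
	return true;
}
```

For a bitmap `{0xFF, 0xFF, 0x0F, 0x00}`, the range 0 to 19 is fully set and needs only two byte comparisons plus four bit tests, while the range 0 to 20 fails on bit 20.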
Skip the big logging loop if DB_PL is not enabled; otherwise
it can loop 10K+ times for nothing on a large system.

pool_map_find_domain() can just return the first domain if domain
type is ROOT.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@mchaarawi force-pushed the mschaara/aurora-2.6-test branch from 6b707a3 to 46b871a on March 12, 2026 13:34
@daosbuild3
Collaborator


4 participants