DAOS-0000 placement: Introduce O(1) fast path for massive GX objects #17667
Draft
wangshilong wants to merge 1 commit into master from
Conversation
Problem:
The traditional jump-hash placement algorithm suffers severe performance degradation when generating layouts for object classes with a massive number of groups (e.g., the GX class). For instance, in a 500-node cluster using EC16P3GX (~16K targets), the group count (grp_nr) easily exceeds 800. During layout generation, the first ~30 groups quickly exhaust all domain-usage bitsets (dom_used). The remaining ~770 groups consistently collide in d_hash_jump(), forcing the algorithm into slow O(D) fallback loops (dom_isset_2ranges). This generates tens of millions of inner-loop bitmap checks (O(G * S * D)), saturating the CPU and causing unacceptable latency spikes during object creation and rebuild layout mapping.

Solution:
Introduce a fast-path optimization designed for objects with heavy group counts, guarded by a new pool layout version to guarantee strict backward compatibility.

1. Bump DAOS_POOL_OBJ_VERSION to 3 (DAOS_POOL_OBJ_VERSION_3).
2. In get_object_layout(), detect whether layout_ver >= 3 and the group count is sufficiently large (jmop_grp_nr >= jmop_dom_nr * 1.5).
3. Under these conditions, bypass the standard non-leaf hash-collision routines and take get_object_layout_gx_fast() instead.
4. The GX fast path uses an OID-seeded Fisher-Yates shuffle to pre-shuffle the domains, then deals targets out to shards sequentially.

By avoiding the collision-driven fallbacks, the time complexity drops from O(G * S * D) to near O(D + G * S), eliminating the massive CPU stalls and providing an estimated 500x-3000x speedup for GX placement on large-scale clusters. Legacy pools (layout <= v2) naturally bypass this path, avoiding any unexpected layout shifts.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
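For context on why the collisions above are so costly: d_hash_jump() is DAOS's jump consistent hash, and (assuming it follows the standard Lamping-Veach formulation) its core can be sketched in C as below. The function name jump_hash here is illustrative, not the DAOS symbol; the point is that the hash only returns a bucket index, so when that bucket is already marked used in dom_used, the caller must rehash or fall back to a linear scan, which is where the O(D) loops come from.

```c
#include <stdint.h>

/*
 * Jump consistent hash (Lamping & Veach style) - a sketch of what
 * d_hash_jump() is assumed to compute.  Maps a 64-bit key to a bucket
 * in [0, num_buckets) with minimal reshuffling when num_buckets grows.
 */
static int32_t jump_hash(uint64_t key, int32_t num_buckets)
{
	int64_t b = -1;
	int64_t j = 0;

	while (j < num_buckets) {
		b = j;
		/* 64-bit LCG step advances the key deterministically */
		key = key * 2862933555777941757ULL + 1;
		/* jump ahead; expected number of iterations is O(ln n) */
		j = (int64_t)((b + 1) *
			      ((double)(1LL << 31) / (double)((key >> 33) + 1)));
	}
	return (int32_t)b;
}
```

Note the output is deterministic for a given key, so once most domains are set in dom_used, every group hashing into a used domain pays the fallback cost again, for every shard.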
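The fast-path idea in step 4 can be sketched as follows: shuffle the domain index array once with an OID-seeded Fisher-Yates pass (O(D)), then deal targets to shards round-robin (O(G * S)), with no per-shard collision checks. This is a minimal illustration, not the actual DAOS code; the names prng_next, shuffle_domains, and deal_shards are hypothetical, and the real logic lives in get_object_layout_gx_fast().

```c
#include <stdint.h>

/* xorshift64: a tiny deterministic PRNG; state must be nonzero */
static uint64_t prng_next(uint64_t *state)
{
	uint64_t x = *state;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*state = x;
	return x;
}

/* OID-seeded Fisher-Yates shuffle of the domain index array: O(D) */
static void shuffle_domains(uint32_t *dom_idx, uint32_t dom_nr, uint64_t oid_seed)
{
	uint64_t state = oid_seed | 1;	/* avoid the all-zero PRNG state */
	uint32_t i;

	if (dom_nr < 2)
		return;
	for (i = dom_nr - 1; i > 0; i--) {
		uint32_t j = (uint32_t)(prng_next(&state) % (i + 1));
		uint32_t tmp = dom_idx[i];

		dom_idx[i] = dom_idx[j];
		dom_idx[j] = tmp;
	}
}

/* Deal shards out over the shuffled domains sequentially: O(G * S) */
static void deal_shards(const uint32_t *dom_idx, uint32_t dom_nr,
			uint32_t *shard_dom, uint32_t shard_nr)
{
	uint32_t s;

	for (s = 0; s < shard_nr; s++)
		shard_dom[s] = dom_idx[s % dom_nr];
}
```

Because the shuffle is seeded from the OID, every client and server computes the same permutation for the same object, which is what makes it safe to skip the collision checks entirely.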
Collaborator
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17667/1/execution/node/1073/log