Filed separately from #2102 per @cjnolet's suggestion. Thanks for looking at it.
Describe the bug
CAGRA's nn_descent graph builder silently produces low-recall graphs at specific corpus sizes. When N % extended_graph_degree == 0, the initial random graph in init_random_graph() has degenerate symmetry that prevents convergence at normal iteration counts. Recall drops from 0.99 to as low as 0.17, deterministically, for any dataset.
The bug is in GnndGraph::init_random_graph() in cpp/src/neighbors/detail/nn_descent.cuh. The function assigns initial random neighbors by indexing into a pre-shuffled sequence (rand_seq):
// nn_descent.cuh, GnndGraph::init_random_graph()
for (size_t i = 0; i < nrow; i++) {
size_t base_idx = i * node_degree + seg_idx * segment_size;
auto h_neighbor_list = h_graph + base_idx;
auto h_dist_list = h_dists.data_handle() + base_idx;
size_t idx = base_idx; // <-- BUG: periodic access into rand_seq
size_t self_in_this_seg = 0;
for (size_t j = 0; j < static_cast<size_t>(segment_size); j++) {
Index_t id = rand_seq[idx % rand_seq.size()] * num_segments + seg_idx;
// ...
idx++;
}
}
idx starts at base_idx (i * node_degree + seg_idx * segment_size) and walks forward through rand_seq. Because base_idx grows linearly with i by stride node_degree, the access pattern into rand_seq is periodic. When gcd(node_degree, rand_seq.size()) is large, which happens exactly when N % extended_graph_degree == 0, groups of nodes receive identical initial neighbors. This creates symmetric partition boundaries across the CUDA thread grid. local_join_kernel_wmma then performs redundant neighbor comparisons across these partitions and early-exits without producing graph updates. The graph barely improves across iterations.
Three hard-coded constants interact to produce extended_graph_degree:
| Constant |
Location |
Value |
DEGREE_ON_DEVICE |
nn_descent_gnnd.hpp:24 |
32 |
NUM_SAMPLES |
nn_descent.cuh:201 |
32 |
| Extension formula |
nn_descent_gnnd.hpp:307-310 |
roundUp32(intermediate_graph_degree * 1.3) |
Common parameter choices (e.g. intermediate_graph_degree=100 -> ext=160, or intermediate_graph_degree=123 -> ext=160 as Elasticsearch uses) hit the bug at round corpus sizes.
Steps/Code to reproduce bug
import cupy as cp
from cuvs.neighbors import cagra
data = cp.random.randn(54400, 128, dtype=cp.float32) # N%160==0 -> FAIL
params = cagra.IndexParams(
graph_degree=34, intermediate_graph_degree=100,
nn_descent_niter=20, build_algo="nn_descent")
index = cagra.build(params, data)
# recall ~0.40. Change 54400 -> 54401: recall jumps to 0.996.
Tested on cuVS 26.06.00, bug is present in the latest release.
Expected behavior
Recall should be consistent regardless of corpus size. Adding or removing a single vector should not cause recall to swing from 0.99 to 0.40.
Environment details:
- Environment location: Bare-metal (Ubuntu 24.04, RTX 4060 Ti 16GB)
- Method of cuVS install: conda (cuVS 26.06.00 standalone, also tested with 25.12.0 bundled in ES 9.4.1)
- CUDA 12.8-13.0, Driver 580.159.03
- Datasets: SIFT-128, Cohere embed-english-v2.0 (768d), AG News + mxbai-embed-large-v1 (1024d)
Additional context
1. Corpus size survey (63 sizes, 100-vector intervals)
Tested at N from 54,000 to 60,200 with graph_degree=34, intermediate_graph_degree=100, niter=20 on 1024d cosine embeddings:
n % 160 == 0: recall ~0.40 (catastrophic)
n % 160 == 80: recall ~0.60 (severe)
n % 160 == 40/120: recall ~0.96 (mildly degraded)
n % 160 == other: recall ~0.997 (normal)
The period 160 matches extended_graph_degree exactly. The failure has multiple severity levels depending on gcd(N, ext).
2. +/-1 vector boundaries (razor-sharp transitions)
Binary search at partition boundaries shows single-vector precision:
54400: 0.40 recall -> 54401: 0.996 (1 vector difference)
55995: 0.998 -> 55996: 0.41
56000: 0.41 -> 56001: 0.997
59900: 0.40 -> 59901: 0.997
3. Data-independent and deterministic
Five random subsets at N=54,400: 0.406, 0.387, 0.409, 0.402, 0.393, all fail.
Three runs on same data: 0.4000, 0.4015, 0.3995. Same size always fails; content doesn't matter.
4. Graph-degree independent
At N=54,400 with inter=100, niter=20:
graph_degree=12: recall=0.388
graph_degree=34: recall=0.414
graph_degree=48: recall=0.425
graph_degree=64: recall=0.429
All fail. The bug is in the extended_graph_degree calculation, not the ratio between graph_degree and ext.
5. Convergence analysis (niter 1-100)
Same data, same params, only corpus size differs by 401 vectors:
niter aligned (N=54,400) non-aligned (N=54,801)
1 0.209 0.723
2 0.211 0.930
20 0.366 0.961 <- default niter
50 0.739 0.973
75 0.971 0.975 <- aligned finally converges
100 0.977 0.975
Iterations to 0.90 recall: non-aligned needs 2, aligned needs 75 (37.5x). The graph topology IS reachable, convergence is just starved by degenerate update patterns at normal iteration counts.
6. Kernel profiling (ncu)
local_join_kernel_wmma at N=54,400 vs N=54,401 (gd=34, inter=100, niter=20):
Aligned Non-aligned
Kernel launches: 20 20
Avg duration: 4,244 us 16,696 us
Total (20 iters): 84,870 us 333,910 us
SM throughput: 34.6% 39.3%
The aligned case runs 3.9x faster but produces 0.37 recall vs 0.96. Same iteration count, same kernel launches, each launch does far less useful work because symmetric partitions cause degenerate neighbor comparisons that early-exit. Build speed is a misleading quality signal.
7. CUPTI cross-validation
Independently confirmed the ncu findings using an internal BPF-based GPU observability tool:
Aligned (N=54,400) Non-aligned (N=54,401) Ratio
Total kernel launches: 164 164 1.0x
Total GPU time: 118.5 ms 339.2 ms 2.86x
Avg kernel duration: 722 us 2,068 us 2.86x
Memory operations: 262 262 1.0x
Memory transferred: 871.7 MB 871.7 MB 1.0x
Sync events: 94 94 1.0x
Identical pipeline structure. Identical memory transfer (byte-for-byte, 188 copy ops, 38 allocs, 36 frees). The entire delta is in kernel computation time. Duration histograms show bimodal distribution: aligned heavy kernels peak at 512us-4ms (early-exit), non-aligned at 8-32ms (productive work).
8. All ext values fail at scale (200K-1M)
Initial testing at 40K-60K suggested only ext=160 was affected. Testing at 200K-1M proved ALL ext values fail:
SIFT-128 (128d), gd=34, niter=20:
| N |
ext=128 |
ext=160 |
ext=192 |
| 200K |
0.994 |
0.322 |
0.992 |
| 400K |
0.992 |
0.305 |
0.993 |
| 600K |
0.989 |
0.320 |
0.970 |
| 800K |
0.825 |
0.317 |
0.992 |
Cohere-768 (768d), gd=34, niter=20:
| N |
ext=128 |
ext=160 |
ext=192 |
| 200K |
0.915 |
0.313 |
0.933 |
| 400K |
0.820 |
0.259 |
0.939 |
| 600K |
0.896 |
0.252 |
0.689 |
| 800K |
0.603 |
0.242 |
0.934 |
Higher dimensions make every ext value fail harder and earlier. ext=128 drops to 0.574 at ~1M on Cohere-768. ext=192 fails at 600K. All N+/-1 controls pass (>0.93).
Clamping to a "safe" ext just shifts the failure to a different corpus size.
9. 1024d cross-validation
At 1024d (AG News + mxbai-embed-large-v1), N=100,000 aligned vs N=100,001 non-aligned, m=48, ext=160:
| nc |
Aligned recall |
Non-aligned recall |
Aligned 1/R@10 |
Non-aligned 1/R@10 |
| 10 |
0.1652 |
0.7968 |
0.804 |
0.984 |
| 100 |
0.3179 |
0.9899 |
0.870 |
1.000 |
| 1000 |
0.6131 |
0.9996 |
0.942 |
1.000 |
At higher dimensionality the bug degrades distance quality too. 1/Ratio@10 (Dimitropoulos & Mamoulis 2025) drops to 0.80 at nc=10, meaning the returned neighbors are measurably farther from the true nearest neighbors, not just reordered.
10. Patched build validation (end-to-end through Elasticsearch)
Built a patched libcuvs.so from v25.12.00 source with the fix below, swapped it into an ES 9.4.1 GPU container, and re-ran the 1024d test. An unpatched instance ran simultaneously as a control.
Aligned (N=100,000, N%160=0):
| nc |
Unpatched recall |
Patched recall |
Unpatched 1/R@10 |
Patched 1/R@10 |
| 10 |
0.1652 |
0.7960 |
0.804 |
0.984 |
| 50 |
0.2614 |
0.9699 |
0.849 |
0.999 |
| 100 |
0.3179 |
0.9903 |
0.870 |
1.000 |
| 200 |
0.3948 |
0.9966 |
0.893 |
1.000 |
| 500 |
0.5199 |
0.9991 |
0.922 |
1.000 |
| 1000 |
0.6131 |
0.9997 |
0.942 |
1.000 |
Non-aligned control (N=100,001), confirms no regression:
| nc |
Unpatched recall |
Patched recall |
| 10 |
0.7968 |
0.7931 |
| 100 |
0.9899 |
0.9908 |
| 1000 |
0.9996 |
0.9997 |
Patched aligned (0.796) matches non-aligned (0.793) at nc=10. Bug eliminated. No regression.
11. ivf_pq is unaffected
Same params, same data:
nn_descent: 0.40-0.99 (size-dependent)
ivf_pq: 0.987 (stable at all sizes)
The bug is specific to nn_descent's init_random_graph(), not CAGRA's graph format or search.
Proposed fix
Option A : Fix init_random_graph() index calculation
In GnndGraph::init_random_graph() (nn_descent.cuh), replace the sequential starting index with a multiplicative hash:
// Before:
size_t idx = base_idx;
// After:
size_t idx = static_cast<size_t>(i) * 2654435761u;
Knuth's golden-ratio hash spreads starting positions uniformly across rand_seq, breaking the periodicity for all corpus sizes and all ext values. The constant is coprime to all practical sizes.
I built a patched libcuvs.so from v25.12.00 with this change, swapped it into an ES 9.4.1 container, and validated end-to-end at 1024d (see evidence #10). The fix eliminates the alignment gap with no regression.
This is the same approach as PR #819, which has been open since April 2025.
Option B: Phantom node padding
When N % ext == 0, pad the working set by one phantom (zero-filled) vector:
bool needs_phantom = (num_rows % extended_graph_degree == 0);
size_t alloc_rows = needs_phantom ? num_rows + 1 : num_rows;
Verified to restore recall from 0.30-0.40 to 0.94-0.998 across SIFT-128 and Cohere-768 at 55K and 200K. More invasive (touches data copy paths in GNND::build()).
Downstream impact
Elasticsearch 9.x uses CAGRA nn_descent for GPU HNSW index builds. With default ES params (m=48 -> intermediate_graph_degree=123 -> ext=160), any corpus where N % 160 == 0 silently produces a low-quality graph. Users see degraded kNN search results with no error or warning.
Related issues
Filed separately from #2102 per @cjnolet's suggestion. Thanks for looking at it.
Describe the bug
CAGRA's nn_descent graph builder silently produces low-recall graphs at specific corpus sizes. When
N % extended_graph_degree == 0, the initial random graph ininit_random_graph()has degenerate symmetry that prevents convergence at normal iteration counts. Recall drops from 0.99 to as low as 0.17, deterministically, for any dataset.The bug is in
GnndGraph::init_random_graph()incpp/src/neighbors/detail/nn_descent.cuh. The function assigns initial random neighbors by indexing into a pre-shuffled sequence (rand_seq):idxstarts atbase_idx(i * node_degree + seg_idx * segment_size) and walks forward throughrand_seq. Becausebase_idxgrows linearly withiby stridenode_degree, the access pattern intorand_seqis periodic. Whengcd(node_degree, rand_seq.size())is large, which happens exactly whenN % extended_graph_degree == 0, groups of nodes receive identical initial neighbors. This creates symmetric partition boundaries across the CUDA thread grid.local_join_kernel_wmmathen performs redundant neighbor comparisons across these partitions and early-exits without producing graph updates. The graph barely improves across iterations.Three hard-coded constants interact to produce
extended_graph_degree:DEGREE_ON_DEVICEnn_descent_gnnd.hpp:24NUM_SAMPLESnn_descent.cuh:201nn_descent_gnnd.hpp:307-310roundUp32(intermediate_graph_degree * 1.3)Common parameter choices (e.g.
intermediate_graph_degree=100->ext=160, orintermediate_graph_degree=123->ext=160as Elasticsearch uses) hit the bug at round corpus sizes.Steps/Code to reproduce bug
Tested on cuVS 26.06.00, bug is present in the latest release.
Expected behavior
Recall should be consistent regardless of corpus size. Adding or removing a single vector should not cause recall to swing from 0.99 to 0.40.
Environment details:
Additional context
1. Corpus size survey (63 sizes, 100-vector intervals)
Tested at N from 54,000 to 60,200 with
graph_degree=34, intermediate_graph_degree=100, niter=20on 1024d cosine embeddings:The period 160 matches
extended_graph_degreeexactly. The failure has multiple severity levels depending ongcd(N, ext).2. +/-1 vector boundaries (razor-sharp transitions)
Binary search at partition boundaries shows single-vector precision:
3. Data-independent and deterministic
Five random subsets at N=54,400:
0.406, 0.387, 0.409, 0.402, 0.393, all fail.Three runs on same data:
0.4000, 0.4015, 0.3995. Same size always fails; content doesn't matter.4. Graph-degree independent
At N=54,400 with
inter=100,niter=20:All fail. The bug is in the
extended_graph_degreecalculation, not the ratio between graph_degree and ext.5. Convergence analysis (niter 1-100)
Same data, same params, only corpus size differs by 401 vectors:
Iterations to 0.90 recall: non-aligned needs 2, aligned needs 75 (37.5x). The graph topology IS reachable, convergence is just starved by degenerate update patterns at normal iteration counts.
6. Kernel profiling (ncu)
local_join_kernel_wmmaat N=54,400 vs N=54,401 (gd=34, inter=100, niter=20):The aligned case runs 3.9x faster but produces 0.37 recall vs 0.96. Same iteration count, same kernel launches, each launch does far less useful work because symmetric partitions cause degenerate neighbor comparisons that early-exit. Build speed is a misleading quality signal.
7. CUPTI cross-validation
Independently confirmed the ncu findings using an internal BPF-based GPU observability tool:
Identical pipeline structure. Identical memory transfer (byte-for-byte, 188 copy ops, 38 allocs, 36 frees). The entire delta is in kernel computation time. Duration histograms show bimodal distribution: aligned heavy kernels peak at 512us-4ms (early-exit), non-aligned at 8-32ms (productive work).
8. All ext values fail at scale (200K-1M)
Initial testing at 40K-60K suggested only ext=160 was affected. Testing at 200K-1M proved ALL ext values fail:
SIFT-128 (128d), gd=34, niter=20:
Cohere-768 (768d), gd=34, niter=20:
Higher dimensions make every ext value fail harder and earlier. ext=128 drops to 0.574 at ~1M on Cohere-768. ext=192 fails at 600K. All N+/-1 controls pass (>0.93).
Clamping to a "safe" ext just shifts the failure to a different corpus size.
9. 1024d cross-validation
At 1024d (AG News + mxbai-embed-large-v1), N=100,000 aligned vs N=100,001 non-aligned, m=48, ext=160:
At higher dimensionality the bug degrades distance quality too. 1/Ratio@10 (Dimitropoulos & Mamoulis 2025) drops to 0.80 at nc=10, meaning the returned neighbors are measurably farther from the true nearest neighbors, not just reordered.
10. Patched build validation (end-to-end through Elasticsearch)
Built a patched
libcuvs.sofrom v25.12.00 source with the fix below, swapped it into an ES 9.4.1 GPU container, and re-ran the 1024d test. An unpatched instance ran simultaneously as a control.Aligned (N=100,000, N%160=0):
Non-aligned control (N=100,001), confirms no regression:
Patched aligned (0.796) matches non-aligned (0.793) at nc=10. Bug eliminated. No regression.
11. ivf_pq is unaffected
Same params, same data:
The bug is specific to nn_descent's
init_random_graph(), not CAGRA's graph format or search.Proposed fix
Option A : Fix init_random_graph() index calculation
In
GnndGraph::init_random_graph()(nn_descent.cuh), replace the sequential starting index with a multiplicative hash:Knuth's golden-ratio hash spreads starting positions uniformly across
rand_seq, breaking the periodicity for all corpus sizes and all ext values. The constant is coprime to all practical sizes.I built a patched
libcuvs.sofrom v25.12.00 with this change, swapped it into an ES 9.4.1 container, and validated end-to-end at 1024d (see evidence #10). The fix eliminates the alignment gap with no regression.This is the same approach as PR #819, which has been open since April 2025.
Option B: Phantom node padding
When
N % ext == 0, pad the working set by one phantom (zero-filled) vector:Verified to restore recall from 0.30-0.40 to 0.94-0.998 across SIFT-128 and Cohere-768 at 55K and 200K. More invasive (touches data copy paths in
GNND::build()).Downstream impact
Elasticsearch 9.x uses CAGRA nn_descent for GPU HNSW index builds. With default ES params (
m=48->intermediate_graph_degree=123->ext=160), any corpus whereN % 160 == 0silently produces a low-quality graph. Users see degraded kNN search results with no error or warning.Related issues