Skip to content

[BUG] nn_descent init_random_graph() produces degenerate graphs when N % extended_graph_degree == 0 #2273

Description

@jdel12

Filed separately from #2102 per @cjnolet's suggestion. Thanks for looking at it.

Describe the bug

CAGRA's nn_descent graph builder silently produces low-recall graphs at specific corpus sizes. When N % extended_graph_degree == 0, the initial random graph in init_random_graph() has degenerate symmetry that prevents convergence at normal iteration counts. Recall drops from 0.99 to as low as 0.17, deterministically, for any dataset.

The bug is in GnndGraph::init_random_graph() in cpp/src/neighbors/detail/nn_descent.cuh. The function assigns initial random neighbors by indexing into a pre-shuffled sequence (rand_seq):

// nn_descent.cuh, GnndGraph::init_random_graph()
for (size_t i = 0; i < nrow; i++) {
  size_t base_idx         = i * node_degree + seg_idx * segment_size;
  auto h_neighbor_list    = h_graph + base_idx;
  auto h_dist_list        = h_dists.data_handle() + base_idx;
  size_t idx              = base_idx;  // <-- BUG: periodic access into rand_seq
  size_t self_in_this_seg = 0;
  for (size_t j = 0; j < static_cast<size_t>(segment_size); j++) {
    Index_t id = rand_seq[idx % rand_seq.size()] * num_segments + seg_idx;
    // ...
    idx++;
  }
}

idx starts at base_idx (i * node_degree + seg_idx * segment_size) and walks forward through rand_seq. Because base_idx grows linearly with i by stride node_degree, the access pattern into rand_seq is periodic. When gcd(node_degree, rand_seq.size()) is large, which happens exactly when N % extended_graph_degree == 0, groups of nodes receive identical initial neighbors. This creates symmetric partition boundaries across the CUDA thread grid. local_join_kernel_wmma then performs redundant neighbor comparisons across these partitions and early-exits without producing graph updates. The graph barely improves across iterations.

Three hard-coded constants interact to produce extended_graph_degree:

Constant Location Value
DEGREE_ON_DEVICE nn_descent_gnnd.hpp:24 32
NUM_SAMPLES nn_descent.cuh:201 32
Extension formula nn_descent_gnnd.hpp:307-310 roundUp32(intermediate_graph_degree * 1.3)

Common parameter choices (e.g. intermediate_graph_degree=100 -> ext=160, or intermediate_graph_degree=123 -> ext=160 as Elasticsearch uses) hit the bug at round corpus sizes.

Steps/Code to reproduce bug

import cupy as cp
from cuvs.neighbors import cagra

data = cp.random.randn(54400, 128, dtype=cp.float32)  # N%160==0 -> FAIL
params = cagra.IndexParams(
    graph_degree=34, intermediate_graph_degree=100,
    nn_descent_niter=20, build_algo="nn_descent")
index = cagra.build(params, data)
# recall ~0.40. Change 54400 -> 54401: recall jumps to 0.996.

Tested on cuVS 26.06.00, bug is present in the latest release.

Expected behavior

Recall should be consistent regardless of corpus size. Adding or removing a single vector should not cause recall to swing from 0.99 to 0.40.

Environment details:

  • Environment location: Bare-metal (Ubuntu 24.04, RTX 4060 Ti 16GB)
  • Method of cuVS install: conda (cuVS 26.06.00 standalone, also tested with 25.12.0 bundled in ES 9.4.1)
  • CUDA 12.8-13.0, Driver 580.159.03
  • Datasets: SIFT-128, Cohere embed-english-v2.0 (768d), AG News + mxbai-embed-large-v1 (1024d)

Additional context

1. Corpus size survey (63 sizes, 100-vector intervals)

Tested at N from 54,000 to 60,200 with graph_degree=34, intermediate_graph_degree=100, niter=20 on 1024d cosine embeddings:

n % 160 == 0:      recall ~0.40 (catastrophic)
n % 160 == 80:     recall ~0.60 (severe)
n % 160 == 40/120: recall ~0.96 (mildly degraded)
n % 160 == other:  recall ~0.997 (normal)

The period 160 matches extended_graph_degree exactly. The failure has multiple severity levels depending on gcd(N, ext).

2. +/-1 vector boundaries (razor-sharp transitions)

Binary search at partition boundaries shows single-vector precision:

54400: 0.40 recall  ->  54401: 0.996   (1 vector difference)
55995: 0.998        ->  55996: 0.41
56000: 0.41         ->  56001: 0.997
59900: 0.40         ->  59901: 0.997
3. Data-independent and deterministic

Five random subsets at N=54,400: 0.406, 0.387, 0.409, 0.402, 0.393, all fail.
Three runs on same data: 0.4000, 0.4015, 0.3995. Same size always fails; content doesn't matter.

4. Graph-degree independent

At N=54,400 with inter=100, niter=20:

graph_degree=12:  recall=0.388
graph_degree=34:  recall=0.414
graph_degree=48:  recall=0.425
graph_degree=64:  recall=0.429

All fail. The bug is in the extended_graph_degree calculation, not the ratio between graph_degree and ext.

5. Convergence analysis (niter 1-100)

Same data, same params, only corpus size differs by 401 vectors:

niter     aligned (N=54,400)     non-aligned (N=54,801)
    1          0.209                  0.723
    2          0.211                  0.930
   20          0.366                  0.961          <- default niter
   50          0.739                  0.973
   75          0.971                  0.975          <- aligned finally converges
  100          0.977                  0.975

Iterations to 0.90 recall: non-aligned needs 2, aligned needs 75 (37.5x). The graph topology IS reachable, convergence is just starved by degenerate update patterns at normal iteration counts.

6. Kernel profiling (ncu)

local_join_kernel_wmma at N=54,400 vs N=54,401 (gd=34, inter=100, niter=20):

                     Aligned       Non-aligned
Kernel launches:     20            20
Avg duration:        4,244 us      16,696 us
Total (20 iters):   84,870 us     333,910 us
SM throughput:       34.6%         39.3%

The aligned case runs 3.9x faster but produces 0.37 recall vs 0.96. Same iteration count, same kernel launches, each launch does far less useful work because symmetric partitions cause degenerate neighbor comparisons that early-exit. Build speed is a misleading quality signal.

7. CUPTI cross-validation

Independently confirmed the ncu findings using an internal BPF-based GPU observability tool:

                        Aligned (N=54,400)    Non-aligned (N=54,401)    Ratio
Total kernel launches:         164                    164              1.0x
Total GPU time:            118.5 ms               339.2 ms            2.86x
Avg kernel duration:         722 us               2,068 us            2.86x
Memory operations:             262                    262              1.0x
Memory transferred:        871.7 MB               871.7 MB            1.0x
Sync events:                    94                     94              1.0x

Identical pipeline structure. Identical memory transfer (byte-for-byte, 188 copy ops, 38 allocs, 36 frees). The entire delta is in kernel computation time. Duration histograms show bimodal distribution: aligned heavy kernels peak at 512us-4ms (early-exit), non-aligned at 8-32ms (productive work).

8. All ext values fail at scale (200K-1M)

Initial testing at 40K-60K suggested only ext=160 was affected. Testing at 200K-1M proved ALL ext values fail:

SIFT-128 (128d), gd=34, niter=20:

N ext=128 ext=160 ext=192
200K 0.994 0.322 0.992
400K 0.992 0.305 0.993
600K 0.989 0.320 0.970
800K 0.825 0.317 0.992

Cohere-768 (768d), gd=34, niter=20:

N ext=128 ext=160 ext=192
200K 0.915 0.313 0.933
400K 0.820 0.259 0.939
600K 0.896 0.252 0.689
800K 0.603 0.242 0.934

Higher dimensions make every ext value fail harder and earlier. ext=128 drops to 0.574 at ~1M on Cohere-768. ext=192 fails at 600K. All N+/-1 controls pass (>0.93).

Clamping to a "safe" ext just shifts the failure to a different corpus size.

9. 1024d cross-validation

At 1024d (AG News + mxbai-embed-large-v1), N=100,000 aligned vs N=100,001 non-aligned, m=48, ext=160:

nc Aligned recall Non-aligned recall Aligned 1/R@10 Non-aligned 1/R@10
10 0.1652 0.7968 0.804 0.984
100 0.3179 0.9899 0.870 1.000
1000 0.6131 0.9996 0.942 1.000

At higher dimensionality the bug degrades distance quality too. 1/Ratio@10 (Dimitropoulos & Mamoulis 2025) drops to 0.80 at nc=10, meaning the returned neighbors are measurably farther from the true nearest neighbors, not just reordered.

10. Patched build validation (end-to-end through Elasticsearch)

Built a patched libcuvs.so from v25.12.00 source with the fix below, swapped it into an ES 9.4.1 GPU container, and re-ran the 1024d test. An unpatched instance ran simultaneously as a control.

Aligned (N=100,000, N%160=0):

nc Unpatched recall Patched recall Unpatched 1/R@10 Patched 1/R@10
10 0.1652 0.7960 0.804 0.984
50 0.2614 0.9699 0.849 0.999
100 0.3179 0.9903 0.870 1.000
200 0.3948 0.9966 0.893 1.000
500 0.5199 0.9991 0.922 1.000
1000 0.6131 0.9997 0.942 1.000

Non-aligned control (N=100,001), confirms no regression:

nc Unpatched recall Patched recall
10 0.7968 0.7931
100 0.9899 0.9908
1000 0.9996 0.9997

Patched aligned (0.796) matches non-aligned (0.793) at nc=10. Bug eliminated. No regression.

11. ivf_pq is unaffected

Same params, same data:

nn_descent: 0.40-0.99 (size-dependent)
ivf_pq:     0.987 (stable at all sizes)

The bug is specific to nn_descent's init_random_graph(), not CAGRA's graph format or search.

Proposed fix

Option A : Fix init_random_graph() index calculation

In GnndGraph::init_random_graph() (nn_descent.cuh), replace the sequential starting index with a multiplicative hash:

// Before:
  size_t idx              = base_idx;

// After:
  size_t idx              = static_cast<size_t>(i) * 2654435761u;

Knuth's golden-ratio hash spreads starting positions uniformly across rand_seq, breaking the periodicity for all corpus sizes and all ext values. The constant is coprime to all practical sizes.

I built a patched libcuvs.so from v25.12.00 with this change, swapped it into an ES 9.4.1 container, and validated end-to-end at 1024d (see evidence #10). The fix eliminates the alignment gap with no regression.

This is the same approach as PR #819, which has been open since April 2025.

Option B: Phantom node padding

When N % ext == 0, pad the working set by one phantom (zero-filled) vector:

bool needs_phantom = (num_rows % extended_graph_degree == 0);
size_t alloc_rows  = needs_phantom ? num_rows + 1 : num_rows;

Verified to restore recall from 0.30-0.40 to 0.94-0.998 across SIFT-128 and Cohere-768 at 55K and 200K. More invasive (touches data copy paths in GNND::build()).

Downstream impact

Elasticsearch 9.x uses CAGRA nn_descent for GPU HNSW index builds. With default ES params (m=48 -> intermediate_graph_degree=123 -> ext=160), any corpus where N % 160 == 0 silently produces a low-quality graph. Users see degraded kNN search results with no error or warning.

Related issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions