[BUG]  nn_descent `init_random_graph()` produces degenerate graphs when N % extended_graph_degree == 0

Filed separately from #2102 per @cjnolet's suggestion. Thanks for looking at it. 

**Describe the bug**

CAGRA's nn_descent graph builder silently produces low-recall graphs at specific corpus sizes. When `N % extended_graph_degree == 0`, the initial random graph in `init_random_graph()` has degenerate symmetry that prevents convergence at normal iteration counts. Recall drops from 0.99 to as low as 0.17, deterministically, for any dataset.

The bug is in `GnndGraph::init_random_graph()` in [`cpp/src/neighbors/detail/nn_descent.cuh`](https://github.com/rapidsai/cuvs/blob/branch-25.12/cpp/src/neighbors/detail/nn_descent.cuh). The function assigns initial random neighbors by indexing into a pre-shuffled sequence (`rand_seq`):

```cpp
// nn_descent.cuh, GnndGraph::init_random_graph()
for (size_t i = 0; i < nrow; i++) {
 size_t base_idx = i * node_degree + seg_idx * segment_size;
 auto h_neighbor_list = h_graph + base_idx;
 auto h_dist_list = h_dists.data_handle() + base_idx;
 size_t idx = base_idx; // <-- BUG: periodic access into rand_seq
 size_t self_in_this_seg = 0;
 for (size_t j = 0; j < static_cast<size_t>(segment_size); j++) {
 Index_t id = rand_seq[idx % rand_seq.size()] * num_segments + seg_idx;
 // ...
 idx++;
 }
}
```

`idx` starts at `base_idx` (`i * node_degree + seg_idx * segment_size`) and walks forward through `rand_seq`. Because `base_idx` grows linearly with `i` by stride `node_degree`, the access pattern into `rand_seq` is periodic. When `gcd(node_degree, rand_seq.size())` is large, which happens exactly when `N % extended_graph_degree == 0`, groups of nodes receive identical initial neighbors. This creates symmetric partition boundaries across the CUDA thread grid. `local_join_kernel_wmma` then performs redundant neighbor comparisons across these partitions and early-exits without producing graph updates. The graph barely improves across iterations.

Three hard-coded constants interact to produce `extended_graph_degree`:

| Constant | Location | Value |
|---|---|---|
| `DEGREE_ON_DEVICE` | `nn_descent_gnnd.hpp:24` | 32 |
| `NUM_SAMPLES` | `nn_descent.cuh:201` | 32 |
| Extension formula | `nn_descent_gnnd.hpp:307-310` | `roundUp32(intermediate_graph_degree * 1.3)` |

Common parameter choices (e.g. `intermediate_graph_degree=100` -> `ext=160`, or `intermediate_graph_degree=123` -> `ext=160` as Elasticsearch uses) hit the bug at round corpus sizes.

**Steps/Code to reproduce bug**

```python
import cupy as cp
from cuvs.neighbors import cagra

data = cp.random.randn(54400, 128, dtype=cp.float32) # N%160==0 -> FAIL
params = cagra.IndexParams(
 graph_degree=34, intermediate_graph_degree=100,
 nn_descent_niter=20, build_algo="nn_descent")
index = cagra.build(params, data)
# recall ~0.40. Change 54400 -> 54401: recall jumps to 0.996.
```

Tested on cuVS 26.06.00, bug is present in the latest release.

**Expected behavior**

Recall should be consistent regardless of corpus size. Adding or removing a single vector should not cause recall to swing from 0.99 to 0.40.

**Environment details:**
- Environment location: Bare-metal (Ubuntu 24.04, RTX 4060 Ti 16GB)
- Method of cuVS install: conda (cuVS 26.06.00 standalone, also tested with 25.12.0 bundled in ES 9.4.1)
- CUDA 12.8-13.0, Driver 580.159.03
- Datasets: SIFT-128, Cohere embed-english-v2.0 (768d), AG News + mxbai-embed-large-v1 (1024d)

**Additional context**

<details>
<summary>1. Corpus size survey (63 sizes, 100-vector intervals)</summary>

Tested at N from 54,000 to 60,200 with `graph_degree=34, intermediate_graph_degree=100, niter=20` on 1024d cosine embeddings:

```
n % 160 == 0: recall ~0.40 (catastrophic)
n % 160 == 80: recall ~0.60 (severe)
n % 160 == 40/120: recall ~0.96 (mildly degraded)
n % 160 == other: recall ~0.997 (normal)
```

The period 160 matches `extended_graph_degree` exactly. The failure has multiple severity levels depending on `gcd(N, ext)`.
</details>

<details>
<summary>2. +/-1 vector boundaries (razor-sharp transitions)</summary>

Binary search at partition boundaries shows single-vector precision:

```
54400: 0.40 recall -> 54401: 0.996 (1 vector difference)
55995: 0.998 -> 55996: 0.41
56000: 0.41 -> 56001: 0.997
59900: 0.40 -> 59901: 0.997
```
</details>

<details>
<summary>3. Data-independent and deterministic</summary>

Five random subsets at N=54,400: `0.406, 0.387, 0.409, 0.402, 0.393`, all fail.
Three runs on same data: `0.4000, 0.4015, 0.3995`. Same size always fails; content doesn't matter.
</details>

<details>
<summary>4. Graph-degree independent</summary>

At N=54,400 with `inter=100`, `niter=20`:
```
graph_degree=12: recall=0.388
graph_degree=34: recall=0.414
graph_degree=48: recall=0.425
graph_degree=64: recall=0.429
```

All fail. The bug is in the `extended_graph_degree` calculation, not the ratio between graph_degree and ext.
</details>

<details>
<summary>5. Convergence analysis (niter 1-100)</summary>

Same data, same params, only corpus size differs by 401 vectors:

```
niter aligned (N=54,400) non-aligned (N=54,801)
 1 0.209 0.723
 2 0.211 0.930
 20 0.366 0.961 <- default niter
 50 0.739 0.973
 75 0.971 0.975 <- aligned finally converges
 100 0.977 0.975
```

Iterations to 0.90 recall: non-aligned needs **2**, aligned needs **75** (37.5x). The graph topology IS reachable, convergence is just starved by degenerate update patterns at normal iteration counts.
</details>

<details>
<summary>6. Kernel profiling (ncu)</summary>

`local_join_kernel_wmma` at N=54,400 vs N=54,401 (gd=34, inter=100, niter=20):

```
 Aligned Non-aligned
Kernel launches: 20 20
Avg duration: 4,244 us 16,696 us
Total (20 iters): 84,870 us 333,910 us
SM throughput: 34.6% 39.3%
```

The aligned case runs **3.9x faster** but produces 0.37 recall vs 0.96. Same iteration count, same kernel launches, each launch does far less useful work because symmetric partitions cause degenerate neighbor comparisons that early-exit. Build speed is a misleading quality signal.
</details>

<details>
<summary>7. CUPTI cross-validation</summary>

Independently confirmed the ncu findings using an internal BPF-based GPU observability tool:

```
 Aligned (N=54,400) Non-aligned (N=54,401) Ratio
Total kernel launches: 164 164 1.0x
Total GPU time: 118.5 ms 339.2 ms 2.86x
Avg kernel duration: 722 us 2,068 us 2.86x
Memory operations: 262 262 1.0x
Memory transferred: 871.7 MB 871.7 MB 1.0x
Sync events: 94 94 1.0x
```

Identical pipeline structure. Identical memory transfer (byte-for-byte, 188 copy ops, 38 allocs, 36 frees). The entire delta is in kernel computation time. Duration histograms show bimodal distribution: aligned heavy kernels peak at 512us-4ms (early-exit), non-aligned at 8-32ms (productive work).
</details>

<details>
<summary>8. All ext values fail at scale (200K-1M)</summary>

Initial testing at 40K-60K suggested only ext=160 was affected. Testing at 200K-1M proved ALL ext values fail:

**SIFT-128 (128d), gd=34, niter=20:**

| N | ext=128 | ext=160 | ext=192 |
|---|---------|---------|---------|
| 200K | 0.994 | **0.322** | 0.992 |
| 400K | 0.992 | **0.305** | 0.993 |
| 600K | 0.989 | **0.320** | 0.970 |
| 800K | **0.825** | **0.317** | 0.992 |

**Cohere-768 (768d), gd=34, niter=20:**

| N | ext=128 | ext=160 | ext=192 |
|---|---------|---------|---------|
| 200K | 0.915 | **0.313** | 0.933 |
| 400K | **0.820** | **0.259** | 0.939 |
| 600K | 0.896 | **0.252** | **0.689** |
| 800K | **0.603** | **0.242** | 0.934 |

Higher dimensions make every ext value fail harder and earlier. ext=128 drops to 0.574 at ~1M on Cohere-768. ext=192 fails at 600K. All N+/-1 controls pass (>0.93).

Clamping to a "safe" ext just shifts the failure to a different corpus size.
</details>

<details>
<summary>9. 1024d cross-validation</summary>

At 1024d (AG News + mxbai-embed-large-v1), N=100,000 aligned vs N=100,001 non-aligned, m=48, ext=160:

| nc | Aligned recall | Non-aligned recall | Aligned 1/R@10 | Non-aligned 1/R@10 |
|----|---------------|-------------------|----------------|-------------------|
| 10 | 0.1652 | 0.7968 | 0.804 | 0.984 |
| 100 | 0.3179 | 0.9899 | 0.870 | 1.000 |
| 1000 | 0.6131 | 0.9996 | 0.942 | 1.000 |

At higher dimensionality the bug degrades distance quality too. 1/Ratio@10 (Dimitropoulos & Mamoulis 2025) drops to 0.80 at nc=10, meaning the returned neighbors are measurably farther from the true nearest neighbors, not just reordered.
</details>

<details>
<summary>10. Patched build validation (end-to-end through Elasticsearch)</summary>

Built a patched `libcuvs.so` from v25.12.00 source with the fix below, swapped it into an ES 9.4.1 GPU container, and re-ran the 1024d test. An unpatched instance ran simultaneously as a control.

**Aligned (N=100,000, N%160=0):**

| nc | Unpatched recall | Patched recall | Unpatched 1/R@10 | Patched 1/R@10 |
|----|------|------|------|------|
| 10 | 0.1652 | **0.7960** | 0.804 | **0.984** |
| 50 | 0.2614 | **0.9699** | 0.849 | **0.999** |
| 100 | 0.3179 | **0.9903** | 0.870 | **1.000** |
| 200 | 0.3948 | **0.9966** | 0.893 | **1.000** |
| 500 | 0.5199 | **0.9991** | 0.922 | **1.000** |
| 1000 | 0.6131 | **0.9997** | 0.942 | **1.000** |

**Non-aligned control (N=100,001), confirms no regression:**

| nc | Unpatched recall | Patched recall |
|----|------|------|
| 10 | 0.7968 | 0.7931 |
| 100 | 0.9899 | 0.9908 |
| 1000 | 0.9996 | 0.9997 |

Patched aligned (0.796) matches non-aligned (0.793) at nc=10. Bug eliminated. No regression.
</details>

<details>
<summary>11. ivf_pq is unaffected</summary>

Same params, same data:
```
nn_descent: 0.40-0.99 (size-dependent)
ivf_pq: 0.987 (stable at all sizes)
```

The bug is specific to nn_descent's `init_random_graph()`, not CAGRA's graph format or search.
</details>

### Proposed fix

**Option A : Fix init_random_graph() index calculation**

In `GnndGraph::init_random_graph()` (`nn_descent.cuh`), replace the sequential starting index with a multiplicative hash:

```cpp
// Before:
 size_t idx = base_idx;

// After:
 size_t idx = static_cast<size_t>(i) * 2654435761u;
```

Knuth's golden-ratio hash spreads starting positions uniformly across `rand_seq`, breaking the periodicity for all corpus sizes and all ext values. The constant is coprime to all practical sizes.

I built a patched `libcuvs.so` from v25.12.00 with this change, swapped it into an ES 9.4.1 container, and validated end-to-end at 1024d (see evidence #10). The fix eliminates the alignment gap with no regression.

This is the same approach as PR #819, which has been open since April 2025.

**Option B: Phantom node padding**

When `N % ext == 0`, pad the working set by one phantom (zero-filled) vector:

```cpp
bool needs_phantom = (num_rows % extended_graph_degree == 0);
size_t alloc_rows = needs_phantom ? num_rows + 1 : num_rows;
```

Verified to restore recall from 0.30-0.40 to 0.94-0.998 across SIFT-128 and Cohere-768 at 55K and 200K. More invasive (touches data copy paths in `GNND::build()`).

### Downstream impact

Elasticsearch 9.x uses CAGRA nn_descent for GPU HNSW index builds. With default ES params (`m=48` -> `intermediate_graph_degree=123` -> `ext=160`), any corpus where `N % 160 == 0` silently produces a low-quality graph. Users see degraded kNN search results with no error or warning.

### Related issues

- #204 - low recall for large datasets (likely shares this root cause)
- #2102 - recall degradation at scale (related but distinct, search param tuning vs build-path bug)
- #184 - early termination
- #1370 - lower recall than CPU pynndescent
- #761 - low recall with high sparsity/dimensionality
- PR #819 - proposed fix (open since April 2025)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] nn_descent `init_random_graph()` produces degenerate graphs when N % extended_graph_degree == 0 #2273

Proposed fix

Downstream impact

Related issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Constant	Location	Value
`DEGREE_ON_DEVICE`	`nn_descent_gnnd.hpp:24`	32
`NUM_SAMPLES`	`nn_descent.cuh:201`	32
Extension formula	`nn_descent_gnnd.hpp:307-310`	`roundUp32(intermediate_graph_degree * 1.3)`

N	ext=128	ext=160	ext=192
200K	0.994	0.322	0.992
400K	0.992	0.305	0.993
600K	0.989	0.320	0.970
800K	0.825	0.317	0.992

N	ext=128	ext=160	ext=192
200K	0.915	0.313	0.933
400K	0.820	0.259	0.939
600K	0.896	0.252	0.689
800K	0.603	0.242	0.934

nc	Aligned recall	Non-aligned recall	Aligned 1/R@10	Non-aligned 1/R@10
10	0.1652	0.7968	0.804	0.984
100	0.3179	0.9899	0.870	1.000
1000	0.6131	0.9996	0.942	1.000

nc	Unpatched recall	Patched recall	Unpatched 1/R@10	Patched 1/R@10
10	0.1652	0.7960	0.804	0.984
50	0.2614	0.9699	0.849	0.999
100	0.3179	0.9903	0.870	1.000
200	0.3948	0.9966	0.893	1.000
500	0.5199	0.9991	0.922	1.000
1000	0.6131	0.9997	0.942	1.000

Uh oh!

[BUG] nn_descent init_random_graph() produces degenerate graphs when N % extended_graph_degree == 0 #2273

Description

Proposed fix

Downstream impact

Related issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[BUG] nn_descent `init_random_graph()` produces degenerate graphs when N % extended_graph_degree == 0 #2273