Support IVF-RaBitQ in cuVS Library #1866
Stardust-SJF wants to merge 143 commits into rapidsai:main
Conversation
- Currently built as a separate library.
- To be merged with the existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.

- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of the ANN benchmarking suite.

- `bits_per_dim` = `ex_bits` + 1
- Also update the supported range of `bits_per_dim` to 2-9 inclusive.
* Fix cuVS build issues with RaBitQ
* Align line formatting and delete unused variables in robust_prune.cuh

…q' into jamxia_cuvs_ivf_rabitq

* Download Eigen automatically via rapids-cmake
* Disable FAISS and DiskANN benchmarks
* Add config files and update README
* Update README and openai_1M config
* Update the Python bench command line
* Update README

Co-authored-by: James Xia <jamxia@nvidia.com>

- Error checking
- Stream-ordered CUDA calls
/ok to test fb26176

/ok to test 994e951
tfeher left a comment:
Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU-accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on the public API and benchmark wrappers).
```cpp
#endif
#ifdef CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
  if constexpr (std::is_same_v<T, float>) {
    if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
```
We don't need `raft_ivf_rabitq`; those aliases were only needed to handle legacy config files.
```diff
- if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
+ if (algo_name == "cuvs_ivf_rabitq") {
```
```cpp
cuvs_ivf_rabitq(Metric metric, int dim, const build_param& param)
  : algo<T>(metric, dim), index_params_(param), dimension_(dim)
{
  // index_params_.metric = parse_metric_type(metric);
```
```cpp
static_assert(std::is_integral_v<algo_base::index_type>);
static_assert(std::is_integral_v<IdxT>);

IdxT* neighbors_idx_t;
```
We often use the `_t` suffix for type names. It would be easier to read if you renamed the variable to `neighbor_idx_ptr` or `neighbor_idx`.

Renamed to `neighbor_idx`.
```cpp
auto queries_view =
  raft::make_device_matrix_view<const T, uint32_t>(queries, batch_size, dimension_);
auto neighbors_view =
  raft::make_device_matrix_view<IdxT, uint32_t>(neighbors_idx_t, batch_size, k);
auto distances_view = raft::make_device_matrix_view<float, uint32_t>(distances, batch_size, k);
```
Please avoid using `uint32_t` as the mdspan indexing type (unless there is a good reason for keeping it 32-bit). The public API uses `int64_t`, right?

Updated the indexing type to `int64_t`.
```cpp
/** The number of iterations searching for kmeans centers (index building). */
uint32_t kmeans_n_iters = 20;
/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
```
It would be better to control the number of points per cluster, since then changing the number of clusters would automatically update the number of points used for kmeans training. We are moving in this direction in our other APIs:
cuvs/cpp/include/cuvs/neighbors/common.hpp, lines 94 to 96 in e75d1bf:
```diff
- double kmeans_trainset_fraction = 0.5;
+ uint32_t max_train_points_per_cluster = 256;
```
Moved to `max_train_points_per_cluster`.
```cpp
/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
/** Flag for using the fast quantize method */
bool fast_quantize_flag = false;
```
Shouldn't we have `true` as the default?

Updated the default to `true`.
```cpp
index(raft::resources const& handle);

/** Construct an empty index yet to be populated. */
index(raft::resources const& handle,
      size_t n_rows,
      uint32_t dim,
      uint32_t n_lists,
      uint32_t bits_per_dim);

/** Construct an empty index. It needs to be trained and then populated. */
index(raft::resources const& handle, const index_params& params, uint32_t dim);
```
Are all these overloads used?

Removed this overload, since it internally calls `index(raft::resources const& handle)` (the latter is required for instantiating an index object before the index parameters are known).
```cpp
 * @param[out] idx reference to ivf_rabitq::index
 *
 */
void build(raft::resources const& handle,
```
Other cuVS indices also have a build API that returns the built index (instead of taking a pointer). If I remember correctly, the version that takes a pointer was needed in the past by our Python wrappers. But now the Python wrappers use the C interface, so we don't necessarily need that overload. @jamxia155, could you clarify with the team what is the preferred set of build overloads we want to provide?

Updated the build API to return the built index instead, as advised by @cjnolet.
Is it just a merge error, or are these changes intentional?

This was a compiler warning-as-error (about the unused variable `prev_edges`) that prevented me from building at some point. I can try reverting this change and see if it still shows up as a warning-as-error.

Reverted the change and nothing seems to be breaking.
tfeher left a comment:
A few more comments for the build method.
```cpp
d_dataset_array = raft::make_device_mdarray<T>(
  handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
```
We should not copy the whole dataset to the index. I would prefer to process it similarly to how other IVF methods in cuVS work:
- Clusters are trained on a subset of the data; only the subset needs to be copied to the GPU.
- We compress the dataset batch-wise; only one batch is copied to the GPU at a time.

Can the compression step be done batch-wise, or do we need to see the whole dataset for that?
We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.
In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be accelerated, but it at least enables building of large datasets.
From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature for the following releases (rather than the initial release)?
At the same time, IVF-RaBitQ (GPU) has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. For CPU-based index construction, we would need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.
Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU, so it won't serve as a stopgap for out-of-core building.
The older construct method was written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.
I pushed a `construct_on_gpu_streaming` method that streams in batches of vectors from a dataset on host for index construction. Index construction is slower by about 3x with multithreading for the host-side gathering step (as tested on a CPU with 24/48 physical/logical cores). However, the subsampling for kmeans clustering is much slower (by around 10x) when running from host data. Having said that, I think this bottleneck will be resolved once out-of-core clustering is available.
cpp/src/neighbors/ivf_rabitq.cu (Outdated)
```cpp
    "kmeans_trainset_fraction, or set large_workspace_resource appropriately.");
  throw;
}
// TODO: a proper sampling
```
The TODO is outdated; `sample_rows` is expected to do a proper sampling.

```diff
- // TODO: a proper sampling
```
cpp/src/neighbors/ivf_rabitq.cu (Outdated)
```cpp
// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
  raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
```
Do we need the else branch? k-means should support different input types.

Thanks for pointing this out! Removed the unnecessary casting.
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
Implement streaming index construction for IVF-RaBitQ to handle datasets that exceed available GPU memory. This enables building indices for large datasets by processing data in batches streamed from host memory.
Key features:
- Automatic detection based on dataset size vs. available workspace
- Complete-cluster batching strategy (no partial clusters across batches)
- OpenMP-parallel host data gathering with a persistent thread pool
- Contiguous data handling in the quantizer for improved performance
- Configurable batch size via the streaming_batch_size parameter
The implementation uses omp_get_max_threads() to scale with available hardware while maintaining efficient memory bandwidth utilization.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add an optional parameter to force streaming construction regardless of dataset size. This provides users with explicit control over the construction method for testing or specific use cases.
When force_streaming is enabled:
- Streaming construction is used even if the dataset fits in GPU memory
- A distinct log message indicates an explicit vs. automatic decision to use streaming construction
Default behavior (force_streaming=false) remains unchanged, with automatic detection based on dataset size vs. available workspace.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Enable control of the force_streaming parameter through JSON benchmark configuration files. Users can now specify force_streaming in the build_param section of their benchmark configs.
Example usage:

```json
"build_param": {
  "nlist": 10000,
  "force_streaming": true,
  ...
}
```

This allows benchmark configurations to explicitly control streaming construction for performance testing and comparison.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation to note that force_streaming has no effect when the dataset is already in device memory, as streaming construction is only applicable for host-to-device data transfer.
Adds a build_forced_streaming test case that explicitly enables streaming construction even for small datasets that fit in GPU memory. This validates the streaming code path with dynamic batch sizing and ensures compatibility with serialization/deserialization.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.
Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)
Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
* data_transformation_batch_opt()
* data_transformation_batch_opt_contiguous()
* rabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables
Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation
Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. It can achieve a strong recall–throughput trade-off while having fast index build speed and a small storage footprint.