
Support IVF-RaBitQ in cuVS Library #1866

Open

Stardust-SJF wants to merge 143 commits into rapidsai:main from Stardust-SJF:cuvs_ivf_rabitq

Conversation

@Stardust-SJF

This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It can achieve a strong recall–throughput trade-off while also offering fast index builds and a small storage footprint.
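
For orientation, here is a minimal usage sketch of what building and searching such an index could look like, assuming the public interface mirrors the other IVF indexes in cuVS. The header path, the n_lists and search_params fields, and the search() call below are assumptions for illustration, not taken verbatim from this PR.

  // Illustrative only: names marked as assumed may differ from the final API.
  #include <cuvs/neighbors/ivf_rabitq.hpp>  // assumed header path

  #include <raft/core/device_mdarray.hpp>
  #include <raft/core/device_mdspan.hpp>
  #include <raft/core/resources.hpp>

  void build_and_search(raft::resources const& handle,
                        raft::device_matrix_view<const float, int64_t> dataset,
                        raft::device_matrix_view<const float, int64_t> queries)
  {
    namespace ivf_rabitq = cuvs::neighbors::ivf_rabitq;

    ivf_rabitq::index_params build_params;
    build_params.n_lists      = 1024;  // number of IVF clusters (assumed field name)
    build_params.bits_per_dim = 4;     // RaBitQ ex_bits + 1; supported range is 2-9

    // build() returns the constructed index (see the build-API discussion below).
    auto index = ivf_rabitq::build(handle, build_params, dataset);

    int64_t topk   = 10;
    auto neighbors = raft::make_device_matrix<int64_t, int64_t>(handle, queries.extent(0), topk);
    auto distances = raft::make_device_matrix<float, int64_t>(handle, queries.extent(0), topk);

    ivf_rabitq::search_params search_params;  // e.g. n_probes; assumed
    ivf_rabitq::search(handle, search_params, index, queries, neighbors.view(), distances.view());
  }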

jamxia155 and others added 30 commits November 3, 2025 08:03
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work
is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
* Fix cuVS build issues with RaBitQ

* Align line formatting && Delete unused variables in robust_prune.cuh
* Download Eigen automatically by rapids-cmake

* Disable FAISS and DISKANN benchmarks

* add config files and update readme

* Update Readme and openai_1M config

* Update python bench command line

* update README

* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
@jamxia155
Contributor

/ok to test fb26176

@jamxia155
Contributor

/ok to test 994e951

@tfeher
Contributor

tfeher commented Mar 4, 2026

/ok to test 994e951

Contributor

@tfeher tfeher left a comment

Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on public API and benchmark wrappers).

#endif
#ifdef CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
if constexpr (std::is_same_v<T, float>) {
if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
Contributor

We don't need to have raft_ivf_rabitq. We needed those only to handle legacy config files.

Suggested change:
- if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
+ if (algo_name == "cuvs_ivf_rabitq") {

Contributor

Updated.

cuvs_ivf_rabitq(Metric metric, int dim, const build_param& param)
: algo<T>(metric, dim), index_params_(param), dimension_(dim)
{
// index_params_.metric = parse_metric_type(metric);
Contributor

Remove if not needed

Contributor

Updated.

static_assert(std::is_integral_v<algo_base::index_type>);
static_assert(std::is_integral_v<IdxT>);

IdxT* neighbors_idx_t;
Contributor

@tfeher tfeher Mar 17, 2026

We often use the _t suffix for type names. It would be easier to read if you renamed the variable to neighbor_idx_ptr or neighbor_idx.

Contributor

Renamed to neighbor_idx.

Comment on lines +151 to +155
auto queries_view =
raft::make_device_matrix_view<const T, uint32_t>(queries, batch_size, dimension_);
auto neighbors_view =
raft::make_device_matrix_view<IdxT, uint32_t>(neighbors_idx_t, batch_size, k);
auto distances_view = raft::make_device_matrix_view<float, uint32_t>(distances, batch_size, k);
Contributor

Please avoid using uint32_t as the mdspan indexing type (unless there is a good reason for keeping it 32 bit). The public API uses int64_t, right?

Contributor

Updated indexing type to int64_t.
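
For reference, the updated views then read roughly as follows (a sketch of the quoted snippet with the indexing type switched to int64_t and the renamed pointer; the surrounding wrapper code is unchanged and not shown):

  auto queries_view =
    raft::make_device_matrix_view<const T, int64_t>(queries, batch_size, dimension_);
  auto neighbors_view =
    raft::make_device_matrix_view<IdxT, int64_t>(neighbor_idx, batch_size, k);
  auto distances_view =
    raft::make_device_matrix_view<float, int64_t>(distances, batch_size, k);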

/** The number of iterations searching for kmeans centers (index building). */
uint32_t kmeans_n_iters = 20;
/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
Contributor

It would be better to control the number of points per cluster, since then changing the number of clusters would automatically update the number of points used for kmeans training. We are moving in this direction in our other APIs:

Suggested change:
- double kmeans_trainset_fraction = 0.5;
+ uint32_t max_train_points_per_cluster = 256;

Contributor

Moved to max_train_points_per_cluster.
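
With that parameter, the k-means trainset size follows from the cluster count rather than from a fixed fraction of the dataset. A minimal sketch of the idea (variable names assumed for illustration, not taken from the PR):

  // Cap the k-means trainset at max_train_points_per_cluster rows per list.
  // n_rows is the dataset size; params holds the index build parameters.
  int64_t n_train = std::min<int64_t>(
    n_rows,
    int64_t(params.n_lists) * int64_t(params.max_train_points_per_cluster));
  // Increasing n_lists now automatically enlarges the subsample used for training.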

/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
/** Flag for using the fast quantize method */
bool fast_quantize_flag = false;
Contributor

Shouldn't we have true as default?

Contributor

Updated default to true.
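
That is, the field from the quoted snippet now reads (only the default value changes):

  /** Flag for using the fast quantize method */
  bool fast_quantize_flag = true;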

Comment on lines +118 to +128
index(raft::resources const& handle);

/** Construct an empty index yet to be populated. */
index(raft::resources const& handle,
size_t n_rows,
uint32_t dim,
uint32_t n_lists,
uint32_t bits_per_dim);

/** Construct an empty index. It needs to be trained and then populated. */
index(raft::resources const& handle, const index_params& params, uint32_t dim);
Contributor

Are all these overloads used?

Contributor

Removed this overload since it internally calls index(raft::resources const& handle) (the latter is required for instantiating an index object before index parameters are known).

* @param[out] idx reference to ivf_rabitq::index
*
*/
void build(raft::resources const& handle,
Contributor

Other cuVS indices also have a build API that returns the built index (instead of taking a pointer). If I remember correctly, the version that takes a pointer was needed in the past by our Python wrappers. But now that the Python wrappers use the C interface, we don't necessarily need that overload. @jamxia155 could you clarify with the team what the preferred set of build overloads is?

Contributor

Updated build API to return the built index instead, as advised by @cjnolet.
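
For context, a rough sketch of the two overload shapes under discussion (signatures assumed by analogy with other cuVS indexes, not copied from this PR):

  // Older style: the caller owns a default-constructed index and build() fills it in.
  void build(raft::resources const& handle,
             const cuvs::neighbors::ivf_rabitq::index_params& params,
             raft::device_matrix_view<const float, int64_t> dataset,
             cuvs::neighbors::ivf_rabitq::index<int64_t>* idx);

  // Preferred style: build() constructs and returns the index by value.
  auto build(raft::resources const& handle,
             const cuvs::neighbors::ivf_rabitq::index_params& params,
             raft::device_matrix_view<const float, int64_t> dataset)
    -> cuvs::neighbors::ivf_rabitq::index<int64_t>;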

Contributor

Is it just a merge error, or are these changes intentional?

Contributor

This was a compiler warning-as-error (about the unused variable prev_edges) that prevented me from building at some point. I can try reverting this change and see if it still shows up as a -Werror failure.

Contributor

Reverted the change and nothing seems to be breaking.

Contributor

@tfeher tfeher left a comment

A few more comments for the build method.

Comment on lines +63 to +64
d_dataset_array = raft::make_device_mdarray<T>(
handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
Contributor

We should not copy the whole dataset into the index. I would prefer to process it similarly to how the other IVF methods in cuVS work:

  1. Clusters are trained on a subset of the data; only that subset needs to be copied to the GPU.
  2. The dataset is compressed batch-wise; only one batch at a time is copied to the GPU.

Can the compression step be done batch-wise, or do we need to see the whole dataset for that?

Contributor

We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.

In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be accelerated, but it would at least enable building indexes for large datasets.

Author

From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature in the following releases (rather than the initial release)?

At the same time, the IVF-RaBitQ (GPU) has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. For CPU-based index construction, we need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.

Contributor

Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU so it won't serve as a stopgap for out-of-core building.

Author

@Stardust-SJF Stardust-SJF Mar 19, 2026

The older construct method is written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.

Contributor

I pushed a construct_on_gpu_streaming method that streams in batches of vectors from a dataset on host for index construction. The index construction is slower by about 3X with multithreading for the host-side gathering step (as tested on a CPU with 24/48 physical/logical cores). However, the subsampling for kmeans clustering is much slower (by around 10X) when running from host data. Having said that, I think this bottleneck will be resolved once out-of-core clustering is available.
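
For readers following along, a rough sketch of the kind of host-to-device batch-streaming loop this refers to. The actual construct_on_gpu_streaming code is not shown in this excerpt, and the real method batches whole clusters; the function name, buffer names, and gather step below are illustrative assumptions only.

  #include <omp.h>

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  #include <raft/core/device_mdarray.hpp>
  #include <raft/core/resource/cuda_stream.hpp>
  #include <raft/core/resources.hpp>
  #include <raft/util/cudart_utils.hpp>

  // Sketch: stream one batch of rows at a time from a host dataset to the GPU.
  void stream_build_batches(raft::resources const& handle,
                            const float* host_dataset,  // [n_rows, dim] on host
                            const int64_t* row_order,   // row indices grouped by cluster
                            int64_t n_rows, int64_t dim, int64_t batch_rows)
  {
    auto stream  = raft::resource::get_cuda_stream(handle);
    auto d_batch = raft::make_device_matrix<float, int64_t>(handle, batch_rows, dim);
    std::vector<float> h_staging(batch_rows * dim);

    for (int64_t start = 0; start < n_rows; start += batch_rows) {
      int64_t rows = std::min(batch_rows, n_rows - start);

      // Host-side gather into a contiguous staging buffer (OpenMP-parallel).
      #pragma omp parallel for
      for (int64_t i = 0; i < rows; i++) {
        const float* src = host_dataset + row_order[start + i] * dim;
        std::copy(src, src + dim, h_staging.data() + i * dim);
      }

      // Stream-ordered copy of one batch to the GPU; wait before the staging buffer
      // is reused (a real implementation would double-buffer with pinned memory).
      raft::copy(d_batch.data_handle(), h_staging.data(), rows * dim, stream);
      raft::resource::sync_stream(handle);

      // ... quantize this batch on the GPU and append it to the index ...
    }
  }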

"kmeans_trainset_fraction, or set large_workspace_resource appropriately.");
throw;
}
// TODO: a proper sampling
Contributor

The TODO is outdated; sample_rows is expected to do proper sampling.

Suggested change:
- // TODO: a proper sampling

Contributor

Removed.

// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
Contributor

Do we need the else branch? k-means should support different input types.

Contributor

Thanks for pointing this out! Removed unnecessary casting.
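
In other words, with the cast gone the subsampling reduces to the single templated call already quoted above, for any supported element type T:

  // Subsample the k-means trainset directly; no float-specific branch is needed.
  raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());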

jamxia155 and others added 20 commits March 18, 2026 06:13
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
  Implement streaming index construction for IVF-RaBitQ to handle datasets
  that exceed available GPU memory. This enables building indices for
  large datasets by processing data in batches streamed from host memory.

  Key features:
  - Automatic detection based on dataset size vs available workspace
  - Complete-cluster batching strategy (no partial clusters across batches)
  - OpenMP parallel host data gathering with persistent thread pool
  - Contiguous data handling in quantizer for improved performance
  - Configurable batch size via streaming_batch_size parameter

The implementation uses omp_get_max_threads() to scale with available
  hardware while maintaining efficient memory bandwidth utilization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Add an optional parameter to force streaming construction regardless of
  dataset size. This provides users with explicit control over the
  construction method for testing or specific use cases.

  When force_streaming is enabled:
  - Streaming construction is used even if dataset fits in GPU memory
  - Distinct log message indicates explicit vs automatic decision to
  use streaming construction

  Default behavior (force_streaming=false) remains unchanged, with
  automatic detection based on dataset size vs available workspace.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Enable control of the force_streaming parameter through JSON benchmark
  configuration files. Users can now specify force_streaming in the
  build_param section of their benchmark configs.

  Example usage:
    "build_param": {
      "nlist": 10000,
      "force_streaming": true,
      ...
    }

  This allows benchmark configurations to explicitly control streaming
  construction for performance testing and comparison.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Update documentation to note that force_streaming has no effect when
  the dataset is already in device memory, as streaming construction is
  only applicable for host-to-device data transfer.
  Adds build_forced_streaming test case that explicitly enables streaming
  construction even for small datasets that fit in GPU memory. This validates
  the streaming code path with dynamic batch sizing and ensures compatibility
  with serialization/deserialization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.

Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
  ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
  data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)

Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
  * data_transformation_batch_opt()
  * data_transformation_batch_opt_contiguous()
  * rabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables

Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation

Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Labels

C++, feature request (New feature or request), non-breaking (Introduces a non-breaking change)

Projects

Status: In Progress


5 participants