Support IVF-RaBitQ in cuVS Library #1866
Stardust-SJF wants to merge 143 commits into rapidsai:main
Conversation
- Currently built as a separate library.
- To be merged with the existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.

- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of the ANN benchmarking suite.

- `bits_per_dim` = `ex_bits` + 1
- Also update the supported range of `bits_per_dim` to 2-9 inclusive.
* Fix cuVS build issues with RaBitQ
* Align line formatting and delete unused variables in robust_prune.cuh

…q' into jamxia_cuvs_ivf_rabitq

* Download Eigen automatically via rapids-cmake
* Disable FAISS and DiskANN benchmarks
* Add config files and update README
* Update README and openai_1M config
* Update the Python bench command line
* Update README

Co-authored-by: James Xia <jamxia@nvidia.com>

- Error checking
- Stream-ordered CUDA calls
/ok to test fb26176

/ok to test 994e951
tfeher left a comment:
Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU-accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on the public API and benchmark wrappers).
```cpp
#endif
#ifdef CUVS_ANN_BENCH_USE_CUVS_IVF_RABITQ
  if constexpr (std::is_same_v<T, float>) {
    if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
```
We don't need `raft_ivf_rabitq`; those aliases were only needed to handle legacy config files.
```diff
- if (algo_name == "raft_ivf_rabitq" || algo_name == "cuvs_ivf_rabitq") {
+ if (algo_name == "cuvs_ivf_rabitq") {
```
```cpp
cuvs_ivf_rabitq(Metric metric, int dim, const build_param& param)
  : algo<T>(metric, dim), index_params_(param), dimension_(dim)
{
  // index_params_.metric = parse_metric_type(metric);
```
```cpp
static_assert(std::is_integral_v<algo_base::index_type>);
static_assert(std::is_integral_v<IdxT>);

IdxT* neighbors_idx_t;
```
We often use the `_t` suffix for type names. It would be easier to read if you renamed the variable to `neighbor_idx_ptr` or `neighbor_idx`.

Renamed to `neighbor_idx`.
```cpp
auto queries_view =
  raft::make_device_matrix_view<const T, uint32_t>(queries, batch_size, dimension_);
auto neighbors_view =
  raft::make_device_matrix_view<IdxT, uint32_t>(neighbors_idx_t, batch_size, k);
auto distances_view = raft::make_device_matrix_view<float, uint32_t>(distances, batch_size, k);
```
Please avoid using `uint32_t` as the mdspan indexing type (unless there is a good reason for keeping it 32-bit). The public API uses `int64_t`, right?

Updated the indexing type to `int64_t`.
```cpp
/** The number of iterations searching for kmeans centers (index building). */
uint32_t kmeans_n_iters = 20;
/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
```
It would be better to control the number of points per cluster, since then changing the number of clusters would automatically update the number of points used for kmeans training. We are moving in this direction in our other APIs:
cuvs/cpp/include/cuvs/neighbors/common.hpp, lines 94 to 96 in e75d1bf:
```diff
- double kmeans_trainset_fraction = 0.5;
+ uint32_t max_train_points_per_cluster = 256;
```
Moved to `max_train_points_per_cluster`.
```cpp
/** The fraction of data to use during iterative kmeans building. */
double kmeans_trainset_fraction = 0.5;
/** Flag for using the fast quantize method */
bool fast_quantize_flag = false;
```
Shouldn't we have `true` as the default?

Updated the default to `true`.
```cpp
index(raft::resources const& handle);

/** Construct an empty index yet to be populated. */
index(raft::resources const& handle,
      size_t n_rows,
      uint32_t dim,
      uint32_t n_lists,
      uint32_t bits_per_dim);

/** Construct an empty index. It needs to be trained and then populated. */
index(raft::resources const& handle, const index_params& params, uint32_t dim);
```
Are all these overloads used?

Removed this overload, since it internally calls `index(raft::resources const& handle)` (the latter is required for instantiating an index object before the index parameters are known).
```cpp
 * @param[out] idx reference to ivf_rabitq::index
 *
 */
void build(raft::resources const& handle,
```
Other cuVS indices also have a build API that returns the built index (instead of taking a pointer). If I remember correctly, the version that takes a pointer was needed in the past by our Python wrappers. But now the Python wrappers use the C interface, so we don't necessarily need that overload. @jamxia155, could you clarify with the team what is the preferred set of build overloads we want to provide?

Updated the build API to return the built index instead, as advised by @cjnolet.
Is it just a merge error, or are these changes intentional?

This was a compiler warning-as-error (about the unused variable `prev_edges`) that prevented me from building at some point. I can try reverting this change and see if it still shows up as a warning-as-error.

Reverted the change and nothing seems to be breaking.
tfeher left a comment:
A few more comments for the build method.
```cpp
d_dataset_array = raft::make_device_mdarray<T>(
  handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
```
We should not copy the whole dataset to the index. I would prefer to process it similarly to how other IVF methods in cuVS work:
- Clusters are trained on a subset of the data; only the subset needs to be copied to the GPU.
- We compress the dataset batch-wise; only one batch is copied to the GPU at a time.

Can the compression step be done batch-wise, or do we need to see the whole dataset for that?
We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.
In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be accelerated, but it at least enables building of large datasets.
From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature for the following releases (rather than the initial release)?
At the same time, IVF-RaBitQ (GPU) has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. For CPU-based index construction, we would need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.
Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU, so it won't serve as a stopgap for out-of-core building.
The older construct method was written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.
I pushed a `construct_on_gpu_streaming` method that streams in batches of vectors from a dataset on host for index construction. Index construction is slower by about 3x with multithreading for the host-side gathering step (as tested on a CPU with 24/48 physical/logical cores). However, the subsampling for kmeans clustering is much slower (by around 10x) when running from host data. Having said that, I think this bottleneck will be resolved once out-of-core clustering is available.
cpp/src/neighbors/ivf_rabitq.cu (Outdated)
```cpp
    "kmeans_trainset_fraction, or set large_workspace_resource appropriately.");
  throw;
}
// TODO: a proper sampling
```
The TODO is outdated; `sample_rows` is expected to do a proper sampling.

```diff
- // TODO: a proper sampling
```
cpp/src/neighbors/ivf_rabitq.cu (Outdated)
```cpp
// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
  raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
```
Do we need the else branch? k-means should support different input types.

Thanks for pointing this out! Removed the unnecessary casting.
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
Implement streaming index construction for IVF-RaBitQ to handle datasets that exceed available GPU memory. This enables building indices for large datasets by processing data in batches streamed from host memory.
Key features:
- Automatic detection based on dataset size vs. available workspace
- Complete-cluster batching strategy (no partial clusters across batches)
- OpenMP-parallel host data gathering with a persistent thread pool
- Contiguous data handling in the quantizer for improved performance
- Configurable batch size via the streaming_batch_size parameter
The implementation uses omp_get_max_threads() to scale with available hardware while maintaining efficient memory bandwidth utilization.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add an optional parameter to force streaming construction regardless of dataset size. This provides users with explicit control over the construction method for testing or specific use cases.
When force_streaming is enabled:
- Streaming construction is used even if the dataset fits in GPU memory
- A distinct log message indicates an explicit vs. automatic decision to use streaming construction
Default behavior (force_streaming=false) remains unchanged, with automatic detection based on dataset size vs. available workspace.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Enable control of the force_streaming parameter through JSON benchmark configuration files. Users can now specify force_streaming in the build_param section of their benchmark configs.
Example usage:

```json
"build_param": {
  "nlist": 10000,
  "force_streaming": true,
  ...
}
```

This allows benchmark configurations to explicitly control streaming construction for performance testing and comparison.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation to note that force_streaming has no effect when the dataset is already in device memory, as streaming construction is only applicable for host-to-device data transfer.
Adds a build_forced_streaming test case that explicitly enables streaming construction even for small datasets that fit in GPU memory. This validates the streaming code path with dynamic batch sizing and ensures compatibility with serialization/deserialization.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.
Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)
Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
* data_transformation_batch_opt()
* data_transformation_batch_opt_contiguous()
* rabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables
Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation
Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. It can achieve a strong recall–throughput trade-off while having fast index build speed and a small storage footprint.