Commit 2ba8f5f

Updated quick_start.md
1 parent 6c7dfa8 commit 2ba8f5f

1 file changed: +21 -21 lines changed

docs/docs/quick_start.md

Lines changed: 21 additions & 21 deletions
@@ -2,10 +2,10 @@

 ## RaBitQ Quantizer

-The RaBitQ Library offers simple interfaces, making it a drop-in replacement for scalar and binary quantization.
-The interface offers two underlying implementations of RaBitQ: one delivers optimal accuracy with longer quantization time, while the other provides near-optimal accuracy with significantly faster quantization.
+The RaBitQ Library provides simple interfaces, making it a drop-in replacement for scalar and binary quantization.
+The interface offers two underlying RaBitQ implementations: one delivers optimal accuracy with longer quantization time, and another delivers near-optimal accuracy with significantly faster quantization.

-The library provides advanced data formats for supporting efficient distance estimation. The details can be found in [Quantizer](rabitq/quantizer.md).
+The library provides advanced data formats to support efficient distance estimation. The details can be found in [Quantizer](rabitq/quantizer.md).

 ### Example Code in C++
 ```cpp
@@ -51,18 +51,18 @@ int main() {


 ## RaBitQ + IVF
-[IVF](https://dl.acm.org/doi/10.1109/TPAMI.2010.57) is a classical clustering-based ANN index. IVF + RaBitQ consumes minimum memory across IVF, HNSW and QG. Powered by [FastScan](https://arxiv.org/abs/1704.07355), it also achieves promising time-accuracy trade-off. To use RaBitQ + IVF, users need to cluster raw data vectors (e.g., using Kmeans), then to quantize each cluster and construct the IVF. The following is an example of using RaBitQ + IVF to search ANN on the deep1M dataset.
+[IVF](https://dl.acm.org/doi/10.1109/TPAMI.2010.57) is a classical clustering-based ANN index. IVF + RaBitQ offers the lowest memory usage compared to IVF, HNSW and QG. Powered by [FastScan](https://arxiv.org/abs/1704.07355), it also provides a promising time-accuracy trade-off. To use RaBitQ + IVF, first cluster the raw vectors (e.g., via KMeans), then quantize each cluster and build the IVF index. Below is an example of using RaBitQ + IVF for ANN search on the Deep1M dataset.

 ### Dataset downloading and clustering
-Use the following commands in the shell to download the deep1M dataset, generate clustering information, and store it on disk.
+Use the following shell commands to download the Deep1M dataset, generate clustering information, and save it to disk.
 ```shell
 wget http://www.cse.cuhk.edu.hk/systems/hash/gqr/dataset/deep1M.tar.gz
 tar -zxvf deep1M.tar.gz
 python python/ivf.py deep1M/deep1M_base.fvecs 4096 deep1M/deep1M_centroids_4096.fvecs deep1M/deep1M_clusterids_4096.ivecs
 ```
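
For intuition about the clustering information generated above, the following sketch assigns each vector to its nearest centroid under squared L2 distance, which is presumably the kind of mapping the cluster IDs file stores. It is a brute-force illustration under that assumption, not the code in `python/ivf.py`; the names `l2_sqr` and `assign_clusters` are hypothetical and not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Squared Euclidean distance between two dim-dimensional vectors.
inline float l2_sqr(const float* a, const float* b, std::size_t dim) {
    float sum = 0.0F;
    for (std::size_t i = 0; i < dim; ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Hypothetical helper: for each row of `data`, return the index of the
// nearest row of `centroids`. Both buffers are flat and row-major.
inline std::vector<std::uint32_t> assign_clusters(
    const std::vector<float>& data,
    const std::vector<float>& centroids,
    std::size_t dim
) {
    assert(dim > 0);
    assert(data.size() % dim == 0 && centroids.size() % dim == 0);
    assert(!centroids.empty());
    const std::size_t num_points = data.size() / dim;
    const std::size_t num_centroids = centroids.size() / dim;
    std::vector<std::uint32_t> ids(num_points, 0);
    for (std::size_t i = 0; i < num_points; ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t c = 0; c < num_centroids; ++c) {
            const float dist = l2_sqr(&data[i * dim], &centroids[c * dim], dim);
            if (dist < best) {
                best = dist;
                ids[i] = static_cast<std::uint32_t>(c);
            }
        }
    }
    return ids;
}
```

A real KMeans run also iterates to refine the centroids; this sketch only covers the final assignment step.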

 ### Example Code in C++ for index construction
-The following codes show how to load deep1M's vector data, centroids information, and cluster ids from disk, build the IVF + RaBitQ index, and finally save the index to disk.
+The following code demonstrates how to load Deep1M's vector data, centroids information, and cluster IDs from disk, build an IVF + RaBitQ index, and save the index back to disk.
 ```cpp
 #include <cstdint>
 #include <iostream>
@@ -133,14 +133,14 @@ int main(int argc, char** argv) {
 return 0;
 }
 ```
-After compilation (suppose it is compiled to an executable named `ivf_build`), run the following command to build the IVF:
+After compilation (suppose it is compiled to an executable named `ivf_build`), run the following command to build the IVF index:
 ```shell
 ./ivf_build deep1M/deep1M_base.fvecs deep1M/deep1M_centroids_4096.fvecs deep1M/deep1M_clusterids_4096.ivecs 4 deep1M/deep1M_rabitqlib_ivf_4.index true
 ```
-This will build an IVF that uses 4 (1+3) bits to quantize each vector for the deep1M dataset through RaBitQ.
+This builds an IVF index for the Deep1M dataset using RaBitQ with 4 (1+3) bits to quantize each vector.
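
As a rough sketch of what the 4-bit setting means for memory, the helpers below estimate the size of one quantization code. They assume the bit count applies per dimension and ignore any per-vector auxiliary data the index may store; the function names are hypothetical and not part of the library.

```cpp
#include <cstddef>

// Bytes needed for the quantization code of one vector, assuming
// `bits_per_dim` bits are stored for each of the `dim` dimensions.
// Per-vector auxiliary data (if any) is not counted.
constexpr std::size_t code_bytes(std::size_t dim, std::size_t bits_per_dim) {
    return (dim * bits_per_dim + 7) / 8;  // round up to whole bytes
}

// Bytes needed for the same vector stored as raw float32 values.
constexpr std::size_t raw_bytes(std::size_t dim) {
    return dim * sizeof(float);
}
```

For a 256-dimensional vector (our understanding of Deep1M's dimensionality), `code_bytes(256, 4)` gives 128 bytes, against 1,024 bytes from `raw_bytes(256)` with 4-byte floats.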

 ### Example Code in C++ for querying
-After building the index, you can execute queries on it. The following codes show how to load ivf index and query from disk, execute queries, and compare the results to the groundtruth.
+After building the index, you can execute queries on it. The following code shows how to load the IVF index and queries from disk, execute the queries, and compare results against the ground truth.

 ```c++
 #include <iostream>
@@ -309,25 +309,25 @@ static std::vector<size_t> get_nprobes(
 return nprobes;
 }
 ```
-To execute queries on deep1M, run the following command for the compiled codes (suppose that it is named `ivf_query`):
+To execute queries on the Deep1M dataset, run the following command for the compiled codes (suppose that it is named `ivf_query`):
 ```shell
 ./ivf_query deep1M/deep1M_rabitqlib_ivf_4.index deep1M/deep1M_query.fvecs deep1M/deep1M_groundtruth.ivecs
 ```
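
Comparing results against the ground truth is typically done with a recall metric. The sketch below shows one common definition, recall@k, as an illustration only; the query program may compute its accuracy differently, and `recall_at_k` is a hypothetical name. It assumes each result list contains no duplicate IDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fraction of the true top-k neighbors that appear in the returned top-k,
// averaged over all queries. `results` and `groundtruth` hold one ID list
// per query; each ground-truth list must contain at least k IDs.
inline double recall_at_k(
    const std::vector<std::vector<std::uint32_t>>& results,
    const std::vector<std::vector<std::uint32_t>>& groundtruth,
    std::size_t k
) {
    assert(results.size() == groundtruth.size());
    assert(k > 0);
    if (results.empty()) {
        return 0.0;
    }
    std::size_t hits = 0;
    for (std::size_t q = 0; q < results.size(); ++q) {
        assert(groundtruth[q].size() >= k);
        const auto first = groundtruth[q].begin();
        const auto last = first + static_cast<std::ptrdiff_t>(k);
        const std::size_t returned = std::min(k, results[q].size());
        for (std::size_t i = 0; i < returned; ++i) {
            // Count the ID if it is among the first k ground-truth IDs.
            if (std::find(first, last, results[q][i]) != last) {
                ++hits;
            }
        }
    }
    return static_cast<double>(hits) / static_cast<double>(results.size() * k);
}
```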


 ## RaBitQ + HNSW
-[HNSW](https://arxiv.org/abs/1603.09320) is a popular graph-based index. HNSW + RaBitQ consumes the more memory than IVF + RaBitQ because it needs to store the edges of every vertex in a graph (e.g., 32 edges = 1,024 bits). In terms of the time-accuracy trade-off, HNSW + RaBitQ and IVF + RaBitQ perform differently across datasets—sometimes the former works better, and sometimes the latter does.
-RaBitQ + HNSW receives raw data vectors as inputs. It first conducts KMeans using a Python script. The centroid vectors will be used in the normalization of data vectors for improving accuracy.
+[HNSW](https://arxiv.org/abs/1603.09320) is a popular graph-based index. Compared to IVF + RaBitQ, HNSW + RaBitQ consumes more memory due to the need to store edges of every vertex in a graph (e.g., 32 edges = 1,024 bits). In terms of time-accuracy trade-off, HNSW + RaBitQ and IVF + RaBitQ perform differently across datasets—each may outperform the other depending on the scenario.
+RaBitQ + HNSW takes raw data vectors as input. It begins with KMeans clustering (via a Python script), and the resulting centroids are used to normalize the data vectors for improved accuracy.
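
The "32 edges = 1,024 bits" figure can be reproduced with a small calculation, sketched below under the assumption that each edge is stored as one 32-bit neighbor ID and that quantization bits are counted per dimension. The function names are hypothetical, and the actual index layout may carry additional data.

```cpp
#include <cstddef>
#include <cstdint>

// Bits spent on the adjacency list of one vertex, assuming each edge is
// stored as a single std::uint32_t neighbor ID.
constexpr std::size_t edge_bits(std::size_t num_edges) {
    return num_edges * sizeof(std::uint32_t) * 8;
}

// Bits per vector for a graph index under the same assumptions:
// quantization code (dim * bits_per_dim) plus the adjacency list.
constexpr std::size_t graph_bits_per_vector(
    std::size_t dim, std::size_t bits_per_dim, std::size_t num_edges
) {
    return dim * bits_per_dim + edge_bits(num_edges);
}
```

With 32 edges, `edge_bits(32)` evaluates to 1,024, which is overhead a clustering-based index does not pay per vector.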

 #### Perform Clustering using Faiss
-First, conduct [Kmeans clustering](https://github.com/VectorDB-NTU/RaBitQ-Library/blob/main/python/ivf.py) on raw data vectors to get centroid vectors. We recommend 16 centroids(clusters). This will save two files: centroids file and cluster ids file.
-Use the following command to conduct KMeans clustering on deep1M dataset.
+First, run [Kmeans clustering](https://github.com/VectorDB-NTU/RaBitQ-Library/blob/main/python/ivf.py) on raw data vectors to get centroid vectors. We recommend using 16 centroids (clusters). This will generate two output files: a centroids file and a cluster IDs file.
+Use the following command to perform KMeans clustering on the Deep1M dataset.
 ```shell
 python python/ivf.py deep1M/deep1M_base.fvecs 16 deep1M/deep1M_centroids_16.fvecs deep1M/deep1M_clusterids_16.ivecs l2
 ```

 #### Example Code in C++ for index construction
-Second, load raw data, centroids, and cluster ids files to build the index. Index file is then saved.
+Second, load raw data, centroids, and cluster IDs files to build the index. Index file is then saved.

 ```cpp
 #include <cstdint>
@@ -430,15 +430,15 @@ int main(int argc, char* argv[]) {
 }

 ```
-After compilation (get an excutable named `hnsw_build`), run the following command to build the HNSW.
+After compilation (resulting in an executable named `hnsw_build`), run the following command to build the HNSW index.
 ```shell
 ./hnsw_build deep1M/deep1M_base.fvecs deep1M/deep1M_centroids_16.fvecs deep1M/deep1M_clusterids_16.ivecs 16 100 5 deep1M/deep1M_c16_b5.index l2 true
 ```
-This will build a HNSW that uses 5(1+4) bits to quantize each vector.
+This will build an HNSW index that uses 5 (1+4) bits to quantize each vector.

 #### Example Code in C++ for querying
-Third, load index, query and groundtruth files to test ANN Search.
-
+Third, load the index, queries and ground truth files to evaluate ANN search performance.
+
 ```cpp
 #include <iostream>
 #include <vector>
@@ -556,13 +556,13 @@ int main(int argc, char* argv[]) {
 }
 }
 ```
-To execute queries on deep1M, run the command for the executable(named `hnsw_query`) after compilation.
+To execute queries on the Deep1M dataset, run the following command for the executable (named `hnsw_query`) after compilation.
 ```shell
 ./hnsw_query deep1M/deep1M_c16_b5.index deep1M/deep1M_query.fvecs deep1M/deep1M_groundtruth.ivecs l2
 ```

 ## RaBitQ + QG ([SymphonyQG](https://dl.acm.org/doi/10.1145/3709730))
-[QG](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0) is a graph-based index originated from the [NGT library](https://github.com/yahoojapan/NGT). Different from HNSW, it creates multiple quantization codes for every vector and carefully re-organizes their layout to minimize random memory accesses in querying. RaBitQ + QG in developped from our research project [SymphonyQG](https://dl.acm.org/doi/10.1145/3709730). Unlike IVF + RaBitQ and HNSW + RaBitQ, which consumes less memory than the raw datasets, RaBitQ + QG consumes more memory to pursue the best time-accuracy trade-off.
+[QG](https://medium.com/@masajiro.iwasaki/fusion-of-graph-based-indexing-and-product-quantization-for-ann-search-7d1f0336d0d0) is a graph-based index originating from the [NGT library](https://github.com/yahoojapan/NGT). Unlike HNSW, it generates multiple quantization codes per vector and carefully re-organizes their layout to minimize random memory accesses during querying. RaBitQ + QG is developed from our research project [SymphonyQG](https://dl.acm.org/doi/10.1145/3709730). In contrast to IVF + RaBitQ and HNSW + RaBitQ, which consumes less memory than the raw datasets, RaBitQ + QG consumes more memory to achieve the best time-accuracy trade-off.


 #### Example Code in C++
