From 4dcdf3573afd91b0941647d2875b6ec9cb6a3d5b Mon Sep 17 00:00:00 2001 From: geetu040 Date: Tue, 13 Feb 2024 00:54:30 +0500 Subject: [PATCH 1/8] Add new example: Build a search engine quickly using Qdrant --- quick-search-engine/README.md | 116 +++ quick-search-engine/quick-search-engine.ipynb | 891 ++++++++++++++++++ quick-search-engine/quick-search-engine.py | 122 +++ 3 files changed, 1129 insertions(+) create mode 100644 quick-search-engine/README.md create mode 100644 quick-search-engine/quick-search-engine.ipynb create mode 100644 quick-search-engine/quick-search-engine.py diff --git a/quick-search-engine/README.md b/quick-search-engine/README.md new file mode 100644 index 0000000..6d33dd9 --- /dev/null +++ b/quick-search-engine/README.md @@ -0,0 +1,116 @@ +# Qdrant Search Engine + +This code contains code and instructions for building a search engine using Qdrant. Qdrant is particularly well-suited for tasks involving similarity search, such as recommendation systems, image or text retrieval, and clustering. + +## Article Overview + +The associated Medium article titled ["Build a search engine in 5 minutes using Qdrant"](https://medium.com/@armaghanshakir/build-a-search-engine-in-5-minutes-using-qdrant-5c7a3ef6ac5) provides a step-by-step guide on building a search engine using Qdrant. + +## Instructions + +### Setting Up + +1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/c/quora-question-pairs). +2. Download the file `train.csv.zip`. +3. Unzip the downloaded file. +4. Save the path to the dataset in the `DATA_PATH` variable. + +```python +DATA_PATH = "/kaggle/working/train.csv" # path to your train.csv file +``` + +### Initializing Constants + +```python +# Name of Qdrant Collection for saving vectors +QD_COLLECTION_NAME = "collection_name" + +# Sample size since the complete dataset is very long and can take long processing time +N = 30_000 +``` + +### Loading Dataset + +Load the dataset using pandas in Python: + +```python +import pandas as pd + +df = pd.read_csv(DATA_PATH) + +print("Shape of DataFrame:", df.shape) +print("First 10 rows:") +df.head(10) +``` + +### Extracting Questions + +Extract questions from the dataframe, remove duplicates, and create a sample to see results on a part of the data: + +```python +# extract the questions from df +questions = pd.concat([df['question1'], df['question2']], axis=0) + +# remove all the duplicate questions +questions = questions.drop_duplicates() + +# print total number of questions +print("Total Questions:", len(questions)) + +# sample questions from complete data to avoid long processing +questions = questions.sample(N) + +# print first 10 questions +print("First 10 Questions:") +questions.iloc[:10] +``` + +### Installing Qdrant and Adding Documents + +First, install qdrant with fastembed: + +```bash +!pip install qdrant-client[fastembed] +``` + +Then add the documents to the Qdrant vector space: + +```python +from qdrant_client import QdrantClient + +client = QdrantClient(":memory:") + +client.add( + collection_name=QD_COLLECTION_NAME, + documents=questions, +) + +print("Completed") +``` + +### Creating a Search Function + +Create a search function that processes the query and prints the results: + +```python +def search(query): + results = client.query( + collection_name=QD_COLLECTION_NAME, + query_text=query, + limit=5 + ) + print("Query:", query) + for i, result in enumerate(results): + print() + print(f"{i+1}) {result.document}") + +search("what is the best early morning meal?") +search("How should one introduce themselves?") +search("Why is the Earth a sphere?") +``` + +## Additional Resources + +- [E-Commerce Products Search Engine Using Qdrant](https://medium.com/@armaghanshakir/e-commerce-products-search-engine-using-qdrant-b65dc6ab1983) +- [Qdrant Documentation](https://qdrant.github.io/) +- [Qdrant Python Client Documentation](https://qdrant-client.readthedocs.io/en/latest/) \ No newline at end of file diff --git a/quick-search-engine/quick-search-engine.ipynb b/quick-search-engine/quick-search-engine.ipynb new file mode 100644 index 0000000..5cd455e --- /dev/null +++ b/quick-search-engine/quick-search-engine.ipynb @@ -0,0 +1,891 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "61a34468", + "metadata": { + "papermill": { + "duration": 0.007485, + "end_time": "2024-02-12T18:05:22.504227", + "exception": false, + "start_time": "2024-02-12T18:05:22.496742", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Overview\n", + "\n", + "### Quora Question Pairs\n", + "\n", + "It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them\n", + "\n", + "### Qdrant\n", + "\n", + "Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service.\n", + "\n", + "### Abstract\n", + "\n", + "This script implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries.\n", + "\n", + "### Methodology\n", + "\n", + "Here's a detailed overview of implementation:\n", + "\n", + "- The script begins by loading the Quora dataset and extracting questions from it. Duplicate questions are removed to ensure uniqueness, and a sample of questions is taken to expedite processing. These questions are then indexed using the `Qdrant library`.\n", + "- A search function is defined to query the indexed questions for similar matches to the user input query. The top similar questions found are displayed as results.\n", + "- Several example queries are provided to demonstrate the functionality of the search engine. These queries cover various topics, allowing users to observe how the engine retrieves relevant matches based on semantic similarity.\n", + "\n", + "### Summary\n", + "In summary, the script offers a practical demonstration of building a search engine for similar questions using real-world data and a specialized library. It provides a starting point for developing more sophisticated search functionalities and can be adapted for various applications requiring semantic similarity matching." + ] + }, + { + "cell_type": "markdown", + "id": "0732ecb4", + "metadata": { + "papermill": { + "duration": 0.006234, + "end_time": "2024-02-12T18:05:22.517556", + "exception": false, + "start_time": "2024-02-12T18:05:22.511322", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Setting Up\n", + "1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs).\n", + "2. Download the file [train.csv.zip](https://www.kaggle.com/competitions/quora-question-pairs/data?select=train.csv.zip).\n", + "3. Unzip the downloaded file.\n", + "4. Save the path to the dataset in `DATA_PATH`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "192c482a", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:24.562387Z", + "iopub.status.busy": "2024-02-12T18:05:24.561855Z", + "iopub.status.idle": "2024-02-12T18:05:24.568197Z", + "shell.execute_reply": "2024-02-12T18:05:24.566603Z" + }, + "papermill": { + "duration": 0.018116, + "end_time": "2024-02-12T18:05:24.570688", + "exception": false, + "start_time": "2024-02-12T18:05:24.552572", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "DATA_PATH = \"/kaggle/working/train.csv\"" + ] + }, + { + "cell_type": "markdown", + "id": "7bf0a4a1", + "metadata": { + "papermill": { + "duration": 0.007003, + "end_time": "2024-02-12T18:05:24.584544", + "exception": false, + "start_time": "2024-02-12T18:05:24.577541", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "## Initialize Constants" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "c87d7fff", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:24.601081Z", + "iopub.status.busy": "2024-02-12T18:05:24.599792Z", + "iopub.status.idle": "2024-02-12T18:05:24.606677Z", + "shell.execute_reply": "2024-02-12T18:05:24.605342Z" + }, + "papermill": { + "duration": 0.01855, + "end_time": "2024-02-12T18:05:24.609984", + "exception": false, + "start_time": "2024-02-12T18:05:24.591434", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "# Name of Qdrant Collection for saving vectors\n", + "QD_COLLECTION_NAME = \"collection_name\"\n", + "\n", + "# Sample size since the complete dataset is very long and can take long processing time\n", + "N = 30_000" + ] + }, + { + "cell_type": "markdown", + "id": "fcc4c336", + "metadata": { + "papermill": { + "duration": 0.007079, + "end_time": "2024-02-12T18:05:24.624181", + "exception": false, + "start_time": "2024-02-12T18:05:24.617102", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Dataset\n", + "- **Title:** Quora Question Pairs\n", + "- **Source:** Kaggle Competition\n", + "- **Link:** [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "c555334b", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:24.641025Z", + "iopub.status.busy": "2024-02-12T18:05:24.640557Z", + "iopub.status.idle": "2024-02-12T18:05:27.489049Z", + "shell.execute_reply": "2024-02-12T18:05:27.487548Z" + }, + "papermill": { + "duration": 2.860333, + "end_time": "2024-02-12T18:05:27.492034", + "exception": false, + "start_time": "2024-02-12T18:05:24.631701", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Shape of DataFrame: (404290, 6)\n", + "First 10 rows:\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idqid1qid2question1question2is_duplicate
0012What is the step by step guide to invest in sh...What is the step by step guide to invest in sh...0
1134What is the story of Kohinoor (Koh-i-Noor) Dia...What would happen if the Indian government sto...0
2256How can I increase the speed of my internet co...How can Internet speed be increased by hacking...0
3378Why am I mentally very lonely? How can I solve...Find the remainder when [math]23^{24}[/math] i...0
44910Which one dissolve in water quikly sugar, salt...Which fish would survive in salt water?0
551112Astrology: I am a Capricorn Sun Cap moon and c...I'm a triple Capricorn (Sun, Moon and ascendan...1
661314Should I buy tiago?What keeps childern active and far from phone ...0
771516How can I be a good geologist?What should I do to be a great geologist?1
881718When do you use シ instead of し?When do you use \"&\" instead of \"and\"?0
991920Motorola (company): Can I hack my Charter Moto...How do I hack Motorola DCX3400 for free internet?0
\n", + "
" + ], + "text/plain": [ + " id qid1 qid2 question1 \\\n", + "0 0 1 2 What is the step by step guide to invest in sh... \n", + "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n", + "2 2 5 6 How can I increase the speed of my internet co... \n", + "3 3 7 8 Why am I mentally very lonely? How can I solve... \n", + "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n", + "5 5 11 12 Astrology: I am a Capricorn Sun Cap moon and c... \n", + "6 6 13 14 Should I buy tiago? \n", + "7 7 15 16 How can I be a good geologist? \n", + "8 8 17 18 When do you use シ instead of し? \n", + "9 9 19 20 Motorola (company): Can I hack my Charter Moto... \n", + "\n", + " question2 is_duplicate \n", + "0 What is the step by step guide to invest in sh... 0 \n", + "1 What would happen if the Indian government sto... 0 \n", + "2 How can Internet speed be increased by hacking... 0 \n", + "3 Find the remainder when [math]23^{24}[/math] i... 0 \n", + "4 Which fish would survive in salt water? 0 \n", + "5 I'm a triple Capricorn (Sun, Moon and ascendan... 1 \n", + "6 What keeps childern active and far from phone ... 0 \n", + "7 What should I do to be a great geologist? 1 \n", + "8 When do you use \"&\" instead of \"and\"? 0 \n", + "9 How do I hack Motorola DCX3400 for free internet? 0 " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "df = pd.read_csv(DATA_PATH)\n", + "\n", + "print(\"Shape of DataFrame:\", df.shape)\n", + "print(\"First 10 rows:\")\n", + "df.head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "f9820bf8", + "metadata": { + "papermill": { + "duration": 0.007339, + "end_time": "2024-02-12T18:05:27.506765", + "exception": false, + "start_time": "2024-02-12T18:05:27.499426", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "## Questions\n", + "Extracting Questions from dataset, removing duplications and sample a portion of data to use for search engine" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "68b9feee", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:27.524292Z", + "iopub.status.busy": "2024-02-12T18:05:27.523161Z", + "iopub.status.idle": "2024-02-12T18:05:27.925310Z", + "shell.execute_reply": "2024-02-12T18:05:27.923902Z" + }, + "papermill": { + "duration": 0.413583, + "end_time": "2024-02-12T18:05:27.928053", + "exception": false, + "start_time": "2024-02-12T18:05:27.514470", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total Questions: 537361\n", + "First 10 Questions:\n" + ] + }, + { + "data": { + "text/plain": [ + "91814 Is Dubai a good place to settle?\n", + "328520 Why is talking to someone \"easier\" when you ha...\n", + "160048 Can Buddha, Jesus and Mohammad be the same per...\n", + "384538 Is light energy real in a way in which heat en...\n", + "247131 How are Altoids so strong?\n", + "102424 How do I view Verizon text messages?\n", + "190401 Which school is better for BS in computer scie...\n", + "153208 How does the US prevent electoral fraud?\n", + "81016 How do I stream live video?\n", + "172748 Free party halls in Chennai triplicane?\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# extract the questions from df\n", + "questions = pd.concat([df['question1'], df['question2']], axis=0)\n", + "\n", + "# remove all the duplicate questions\n", + "questions = questions.drop_duplicates()\n", + "\n", + "# print total number of questions\n", + "print(\"Total Questions:\", len(questions))\n", + "\n", + "# sample questions from complete data to avoid long processing\n", + "questions = questions.sample(N)\n", + "\n", + "# print first 10 questions\n", + "print(\"First 10 Questions:\")\n", + "questions.iloc[:10]" + ] + }, + { + "cell_type": "markdown", + "id": "62566f8f", + "metadata": { + "papermill": { + "duration": 0.007047, + "end_time": "2024-02-12T18:05:27.943195", + "exception": false, + "start_time": "2024-02-12T18:05:27.936148", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Qdrant" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "d6f42199", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:05:27.960408Z", + "iopub.status.busy": "2024-02-12T18:05:27.960023Z", + "iopub.status.idle": "2024-02-12T18:06:09.747078Z", + "shell.execute_reply": "2024-02-12T18:06:09.745908Z" + }, + "papermill": { + "duration": 41.799227, + "end_time": "2024-02-12T18:06:09.750235", + "exception": false, + "start_time": "2024-02-12T18:05:27.951008", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting qdrant-client[fastembed]\r\n", + " Downloading qdrant_client-1.7.3-py3-none-any.whl.metadata (9.3 kB)\r\n", + "Requirement already satisfied: grpcio>=1.41.0 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.60.0)\r\n", + "Collecting grpcio-tools>=1.41.0 (from qdrant-client[fastembed])\r\n", + " Downloading grpcio_tools-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.2 kB)\r\n", + "Collecting httpx>=0.14.0 (from httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", + " Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)\r\n", + "Requirement already satisfied: numpy>=1.21 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.24.4)\r\n", + "Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client[fastembed])\r\n", + " Downloading portalocker-2.8.2-py3-none-any.whl.metadata (8.5 kB)\r\n", + "Requirement already satisfied: pydantic>=1.10.8 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (2.5.3)\r\n", + "Requirement already satisfied: urllib3<3,>=1.26.14 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.26.18)\r\n", + "Collecting fastembed==0.1.1 (from qdrant-client[fastembed])\r\n", + " Downloading fastembed-0.1.1-py3-none-any.whl.metadata (3.8 kB)\r\n", + "Requirement already satisfied: onnx<2.0,>=1.11 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (1.15.0)\r\n", + "Collecting onnxruntime<2.0,>=1.15 (from fastembed==0.1.1->qdrant-client[fastembed])\r\n", + " Downloading onnxruntime-1.17.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.2 kB)\r\n", + "Requirement already satisfied: requests<3.0,>=2.31 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (2.31.0)\r\n", + "Collecting tokenizers<0.14,>=0.13 (from fastembed==0.1.1->qdrant-client[fastembed])\r\n", + " Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m55.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hRequirement already satisfied: tqdm<5.0,>=4.65 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (4.66.1)\r\n", + "Collecting protobuf<5.0dev,>=4.21.6 (from grpcio-tools>=1.41.0->qdrant-client[fastembed])\r\n", + " Downloading protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)\r\n", + "Collecting grpcio>=1.41.0 (from qdrant-client[fastembed])\r\n", + " Downloading grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)\r\n", + "Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from grpcio-tools>=1.41.0->qdrant-client[fastembed]) (69.0.3)\r\n", + "Requirement already satisfied: anyio in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (4.2.0)\r\n", + "Requirement already satisfied: certifi in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (2023.11.17)\r\n", + "Collecting httpcore==1.* (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", + " Downloading httpcore-1.0.2-py3-none-any.whl.metadata (20 kB)\r\n", + "Requirement already satisfied: idna in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (3.6)\r\n", + "Requirement already satisfied: sniffio in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (1.3.0)\r\n", + "Requirement already satisfied: h11<0.15,>=0.13 in /opt/conda/lib/python3.10/site-packages (from httpcore==1.*->httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (0.14.0)\r\n", + "Collecting h2<5,>=3 (from httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", + " Downloading h2-4.1.0-py3-none-any.whl (57 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m57.5/57.5 kB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hRequirement already satisfied: annotated-types>=0.4.0 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (0.6.0)\r\n", + "Requirement already satisfied: pydantic-core==2.14.6 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (2.14.6)\r\n", + "Requirement already satisfied: typing-extensions>=4.6.1 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (4.9.0)\r\n", + "Collecting hyperframe<7,>=6.0 (from h2<5,>=3->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", + " Downloading hyperframe-6.0.1-py3-none-any.whl (12 kB)\r\n", + "Collecting hpack<5,>=4.0 (from h2<5,>=3->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", + " Downloading hpack-4.0.0-py3-none-any.whl (32 kB)\r\n", + "Collecting coloredlogs (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed])\r\n", + " Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hRequirement already satisfied: flatbuffers in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (23.5.26)\r\n", + "Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (21.3)\r\n", + "Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (1.12)\r\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3.0,>=2.31->fastembed==0.1.1->qdrant-client[fastembed]) (3.3.2)\r\n", + "Requirement already satisfied: exceptiongroup>=1.0.2 in /opt/conda/lib/python3.10/site-packages (from anyio->httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (1.2.0)\r\n", + "Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed])\r\n", + " Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hRequirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (3.1.1)\r\n", + "Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.10/site-packages (from sympy->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (1.3.0)\r\n", + "Downloading fastembed-0.1.1-py3-none-any.whl (14 kB)\r\n", + "Downloading grpcio_tools-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.8/2.8 MB\u001b[0m \u001b[31m56.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m63.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading httpx-0.26.0-py3-none-any.whl (75 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m3.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading httpcore-1.0.2-py3-none-any.whl (76 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.9/76.9 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading portalocker-2.8.2-py3-none-any.whl (17 kB)\r\n", + "Downloading qdrant_client-1.7.3-py3-none-any.whl (206 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m206.3/206.3 kB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading onnxruntime-1.17.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.8/6.8 MB\u001b[0m \u001b[31m65.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hDownloading protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl (294 kB)\r\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m294.6/294.6 kB\u001b[0m \u001b[31m12.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", + "\u001b[?25hInstalling collected packages: tokenizers, protobuf, portalocker, hyperframe, humanfriendly, httpcore, hpack, grpcio, httpx, h2, grpcio-tools, coloredlogs, onnxruntime, qdrant-client, fastembed\r\n", + " Attempting uninstall: tokenizers\r\n", + " Found existing installation: tokenizers 0.15.1\r\n", + " Uninstalling tokenizers-0.15.1:\r\n", + " Successfully uninstalled tokenizers-0.15.1\r\n", + " Attempting uninstall: protobuf\r\n", + " Found existing installation: protobuf 3.20.3\r\n", + " Uninstalling protobuf-3.20.3:\r\n", + " Successfully uninstalled protobuf-3.20.3\r\n", + " Attempting uninstall: grpcio\r\n", + " Found existing installation: grpcio 1.60.0\r\n", + " Uninstalling grpcio-1.60.0:\r\n", + " Successfully uninstalled grpcio-1.60.0\r\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n", + "apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.\r\n", + "apache-beam 2.46.0 requires protobuf<4,>3.12.2, but you have protobuf 4.25.2 which is incompatible.\r\n", + "apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.0 which is incompatible.\r\n", + "google-cloud-aiplatform 0.6.0a1 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.11.1 which is incompatible.\r\n", + "google-cloud-automl 1.0.1 requires google-api-core[grpc]<2.0.0dev,>=1.14.0, but you have google-api-core 2.11.1 which is incompatible.\r\n", + "google-cloud-bigquery 2.34.4 requires protobuf<4.0.0dev,>=3.12.0, but you have protobuf 4.25.2 which is incompatible.\r\n", + "google-cloud-bigtable 1.7.3 requires protobuf<4.0.0dev, but you have protobuf 4.25.2 which is incompatible.\r\n", + "google-cloud-datastore 1.15.5 requires protobuf<4.0.0dev, but you have protobuf 4.25.2 which is incompatible.\r\n", + "google-cloud-vision 2.8.0 requires protobuf<4.0.0dev,>=3.19.0, but you have protobuf 4.25.2 which is incompatible.\r\n", + "kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.\r\n", + "kfp 2.5.0 requires protobuf<4,>=3.13.0, but you have protobuf 4.25.2 which is incompatible.\r\n", + "kfp-pipeline-spec 0.2.2 requires protobuf<4,>=3.13.0, but you have protobuf 4.25.2 which is incompatible.\r\n", + "tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.\r\n", + "tensorflow-metadata 0.14.0 requires protobuf<4,>=3.7, but you have protobuf 4.25.2 which is incompatible.\r\n", + "tensorflow-transform 0.14.0 requires protobuf<4,>=3.7, but you have protobuf 4.25.2 which is incompatible.\r\n", + "tensorflowjs 4.16.0 requires packaging~=23.1, but you have packaging 21.3 which is incompatible.\r\n", + "transformers 4.37.0 requires tokenizers<0.19,>=0.14, but you have tokenizers 0.13.3 which is incompatible.\u001b[0m\u001b[31m\r\n", + "\u001b[0mSuccessfully installed coloredlogs-15.0.1 fastembed-0.1.1 grpcio-1.60.1 grpcio-tools-1.60.1 h2-4.1.0 hpack-4.0.0 httpcore-1.0.2 httpx-0.26.0 humanfriendly-10.0 hyperframe-6.0.1 onnxruntime-1.17.0 portalocker-2.8.2 protobuf-3.20.3 qdrant-client-1.7.3 tokenizers-0.13.3\r\n" + ] + } + ], + "source": [ + "!pip install qdrant-client[fastembed]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "58d75809", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:06:09.780598Z", + "iopub.status.busy": "2024-02-12T18:06:09.779364Z", + "iopub.status.idle": "2024-02-12T18:16:18.533873Z", + "shell.execute_reply": "2024-02-12T18:16:18.532156Z" + }, + "papermill": { + "duration": 608.787004, + "end_time": "2024-02-12T18:16:18.550945", + "exception": false, + "start_time": "2024-02-12T18:06:09.763941", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 77.7M/77.7M [00:01<00:00, 44.9MiB/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Completed\n" + ] + } + ], + "source": [ + "from qdrant_client import QdrantClient\n", + "\n", + "client = QdrantClient(\":memory:\")\n", + "\n", + "client.add(\n", + " collection_name=QD_COLLECTION_NAME,\n", + " documents=questions,\n", + ")\n", + "\n", + "print(\"Completed\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "48bb2415", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:18.582461Z", + "iopub.status.busy": "2024-02-12T18:16:18.581711Z", + "iopub.status.idle": "2024-02-12T18:16:18.589335Z", + "shell.execute_reply": "2024-02-12T18:16:18.587929Z" + }, + "papermill": { + "duration": 0.027265, + "end_time": "2024-02-12T18:16:18.592322", + "exception": false, + "start_time": "2024-02-12T18:16:18.565057", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "def search(query):\n", + " results = client.query(\n", + " collection_name=QD_COLLECTION_NAME,\n", + " query_text=query,\n", + " limit=5\n", + " )\n", + " print(\"Query:\", query)\n", + " for i, result in enumerate(results):\n", + " print()\n", + " print(f\"{i+1}) {result.document}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "c132090b", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:18.622919Z", + "iopub.status.busy": "2024-02-12T18:16:18.622422Z", + "iopub.status.idle": "2024-02-12T18:16:18.766211Z", + "shell.execute_reply": "2024-02-12T18:16:18.764524Z" + }, + "papermill": { + "duration": 0.164352, + "end_time": "2024-02-12T18:16:18.770923", + "exception": false, + "start_time": "2024-02-12T18:16:18.606571", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: what is the best earyly morning meal?\n", + "\n", + "1) What’s the best breakfast in the morning?\n", + "\n", + "2) What's the healthiest thing to eat for a quick and easy western breakfast?\n", + "\n", + "3) What high protein foods are good for breakfast?\n", + "\n", + "4) What are the healthiest foods to eat for dinner?\n", + "\n", + "5) What are the best breakfast recipes in India?\n" + ] + } + ], + "source": [ + "search(\"what is the best earyly morning meal?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "93362c88", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:18.835972Z", + "iopub.status.busy": "2024-02-12T18:16:18.835126Z", + "iopub.status.idle": "2024-02-12T18:16:18.977442Z", + "shell.execute_reply": "2024-02-12T18:16:18.975638Z" + }, + "papermill": { + "duration": 0.180746, + "end_time": "2024-02-12T18:16:18.983214", + "exception": false, + "start_time": "2024-02-12T18:16:18.802468", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: How should one introduce themselves?\n", + "\n", + "1) How should you not introduce yourself?\n", + "\n", + "2) How can I introspect myself?\n", + "\n", + "3) What would be your answer to the interview question \"introduce yourself\"?\n", + "\n", + "4) How do I effectively introduce myself in college?\n", + "\n", + "5) How do you get to know yourself?\n" + ] + } + ], + "source": [ + "search(\"How should one introduce themselves?\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "dbb0e763", + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-12T18:16:19.053710Z", + "iopub.status.busy": "2024-02-12T18:16:19.052919Z", + "iopub.status.idle": "2024-02-12T18:16:19.160098Z", + "shell.execute_reply": "2024-02-12T18:16:19.157372Z" + }, + "papermill": { + "duration": 0.144939, + "end_time": "2024-02-12T18:16:19.162846", + "exception": false, + "start_time": "2024-02-12T18:16:19.017907", + "status": "completed" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: Why is the Earth a sphere?\n", + "\n", + "1) Why is the earth called earth?\n", + "\n", + "2) What proof do people who say the Earth is flat and not a sphere have?\n", + "\n", + "3) Why is the earth round?\n", + "\n", + "4) What is the shape of the earth?\n", + "\n", + "5) Why do extraterrestrial bodies always appear as a spherical shape? Why not square or cylindrical?\n" + ] + } + ], + "source": [ + "search(\"Why is the Earth a sphere?\")" + ] + }, + { + "cell_type": "markdown", + "id": "1cc0cd3d", + "metadata": { + "papermill": { + "duration": 0.016819, + "end_time": "2024-02-12T18:16:19.194619", + "exception": false, + "start_time": "2024-02-12T18:16:19.177800", + "status": "completed" + }, + "tags": [] + }, + "source": [ + "# Explore More\n", + "\n", + "- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1)\n", + "- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant)\n", + "- [Qdrant](https://qdrant.tech)\n", + "- [Qdrant Documentation](https://qdrant.tech/documentation/)\n", + "- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)\n", + "- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1096b11d", + "metadata": { + "papermill": { + "duration": 0.029937, + "end_time": "2024-02-12T18:16:19.255323", + "exception": false, + "start_time": "2024-02-12T18:16:19.225386", + "status": "completed" + }, + "tags": [] + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kaggle": { + "accelerator": "none", + "dataSources": [ + { + "databundleVersionId": 323734, + "sourceId": 6277, + "sourceType": "competition" + } + ], + "dockerImageVersionId": 30646, + "isGpuEnabled": false, + "isInternetEnabled": true, + "language": "python", + "sourceType": "notebook" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + }, + "papermill": { + "default_parameters": {}, + "duration": 661.57748, + "end_time": "2024-02-12T18:16:20.604787", + "environment_variables": {}, + "exception": null, + "input_path": "__notebook__.ipynb", + "output_path": "__notebook__.ipynb", + "parameters": {}, + "start_time": "2024-02-12T18:05:19.027307", + "version": "2.5.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/quick-search-engine/quick-search-engine.py b/quick-search-engine/quick-search-engine.py new file mode 100644 index 0000000..8e7bfe0 --- /dev/null +++ b/quick-search-engine/quick-search-engine.py @@ -0,0 +1,122 @@ +# -*- coding: utf-8 -*- +""" + +# Overview + +### Quora Question Pairs + +It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them + +### Qdrant + +Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service. + +### Abstract + +This script implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries. + +### Methodology + +Here's a detailed overview of implementation: + +- The script begins by loading the Quora dataset and extracting questions from it. Duplicate questions are removed to ensure uniqueness, and a sample of questions is taken to expedite processing. These questions are then indexed using the `Qdrant library`. +- A search function is defined to query the indexed questions for similar matches to the user input query. The top similar questions found are displayed as results. +- Several example queries are provided to demonstrate the functionality of the search engine. These queries cover various topics, allowing users to observe how the engine retrieves relevant matches based on semantic similarity. + +### Summary +In summary, the script offers a practical demonstration of building a search engine for similar questions using real-world data and a specialized library. It provides a starting point for developing more sophisticated search functionalities and can be adapted for various applications requiring semantic similarity matching. + +# Setting Up +1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs). +2. Download the file [train.csv.zip](https://www.kaggle.com/competitions/quora-question-pairs/data?select=train.csv.zip). +3. Unzip the downloaded file. +4. Save the path to the dataset in `DATA_PATH`. +""" + +DATA_PATH = "/kaggle/working/train.csv" + +"""## Initialize Constants""" + +# Name of Qdrant Collection for saving vectors +QD_COLLECTION_NAME = "collection_name" + +# Sample size since the complete dataset is very long and can take long processing time +N = 30_000 + +"""# Dataset +- **Title:** Quora Question Pairs +- **Source:** Kaggle Competition +- **Link:** [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs) +""" + +import pandas as pd + +df = pd.read_csv(DATA_PATH) + +print("Shape of DataFrame:", df.shape) +print("First 10 rows:") +df.head(10) + +"""## Questions +Extracting Questions from dataset, removing duplications and sample a portion of data to use for search engine +""" + +# extract the questions from df +questions = pd.concat([df['question1'], df['question2']], axis=0) + +# remove all the duplicate questions +questions = questions.drop_duplicates() + +# print total number of questions +print("Total Questions:", len(questions)) + +# sample questions from complete data to avoid long processing +questions = questions.sample(N) + +# print first 10 questions +print("First 10 Questions:") +questions.iloc[:10] + +"""# Qdrant""" + +# !pip install qdrant-client[fastembed] + +from qdrant_client import QdrantClient + +client = QdrantClient(":memory:") + +client.add( + collection_name=QD_COLLECTION_NAME, + documents=questions, +) + +print("Completed") + +def search(query): + results = client.query( + collection_name=QD_COLLECTION_NAME, + query_text=query, + limit=5 + ) + print("Query:", query) + for i, result in enumerate(results): + print() + print(f"{i+1}) {result.document}") + +search("what is the best earyly morning meal?") + +search("How should one introduce themselves?") + +search("Why is the Earth a sphere?") + +"""# Explore More + +- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1) +- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant) +- [Qdrant](https://qdrant.tech) +- [Qdrant Documentation](https://qdrant.tech/documentation/) +- [Qdrant Python Client Documentation](https://python-client.qdrant.tech) +- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs) + +""" + From 04253a9357d771e4f529e38e8ff73c617f54c828 Mon Sep 17 00:00:00 2001 From: geetu040 Date: Mon, 25 Mar 2024 00:51:08 +0500 Subject: [PATCH 2/8] quick-search-engine: removes py script (identical to the notebook) --- quick-search-engine/quick-search-engine.py | 122 --------------------- 1 file changed, 122 deletions(-) delete mode 100644 quick-search-engine/quick-search-engine.py diff --git a/quick-search-engine/quick-search-engine.py b/quick-search-engine/quick-search-engine.py deleted file mode 100644 index 8e7bfe0..0000000 --- a/quick-search-engine/quick-search-engine.py +++ /dev/null @@ -1,122 +0,0 @@ -# -*- coding: utf-8 -*- -""" - -# Overview - -### Quora Question Pairs - -It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them - -### Qdrant - -Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service. - -### Abstract - -This script implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries. - -### Methodology - -Here's a detailed overview of implementation: - -- The script begins by loading the Quora dataset and extracting questions from it. Duplicate questions are removed to ensure uniqueness, and a sample of questions is taken to expedite processing. These questions are then indexed using the `Qdrant library`. -- A search function is defined to query the indexed questions for similar matches to the user input query. The top similar questions found are displayed as results. -- Several example queries are provided to demonstrate the functionality of the search engine. These queries cover various topics, allowing users to observe how the engine retrieves relevant matches based on semantic similarity. - -### Summary -In summary, the script offers a practical demonstration of building a search engine for similar questions using real-world data and a specialized library. It provides a starting point for developing more sophisticated search functionalities and can be adapted for various applications requiring semantic similarity matching. - -# Setting Up -1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs). -2. Download the file [train.csv.zip](https://www.kaggle.com/competitions/quora-question-pairs/data?select=train.csv.zip). -3. Unzip the downloaded file. -4. Save the path to the dataset in `DATA_PATH`. -""" - -DATA_PATH = "/kaggle/working/train.csv" - -"""## Initialize Constants""" - -# Name of Qdrant Collection for saving vectors -QD_COLLECTION_NAME = "collection_name" - -# Sample size since the complete dataset is very long and can take long processing time -N = 30_000 - -"""# Dataset -- **Title:** Quora Question Pairs -- **Source:** Kaggle Competition -- **Link:** [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs) -""" - -import pandas as pd - -df = pd.read_csv(DATA_PATH) - -print("Shape of DataFrame:", df.shape) -print("First 10 rows:") -df.head(10) - -"""## Questions -Extracting Questions from dataset, removing duplications and sample a portion of data to use for search engine -""" - -# extract the questions from df -questions = pd.concat([df['question1'], df['question2']], axis=0) - -# remove all the duplicate questions -questions = questions.drop_duplicates() - -# print total number of questions -print("Total Questions:", len(questions)) - -# sample questions from complete data to avoid long processing -questions = questions.sample(N) - -# print first 10 questions -print("First 10 Questions:") -questions.iloc[:10] - -"""# Qdrant""" - -# !pip install qdrant-client[fastembed] - -from qdrant_client import QdrantClient - -client = QdrantClient(":memory:") - -client.add( - collection_name=QD_COLLECTION_NAME, - documents=questions, -) - -print("Completed") - -def search(query): - results = client.query( - collection_name=QD_COLLECTION_NAME, - query_text=query, - limit=5 - ) - print("Query:", query) - for i, result in enumerate(results): - print() - print(f"{i+1}) {result.document}") - -search("what is the best earyly morning meal?") - -search("How should one introduce themselves?") - -search("Why is the Earth a sphere?") - -"""# Explore More - -- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1) -- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant) -- [Qdrant](https://qdrant.tech) -- [Qdrant Documentation](https://qdrant.tech/documentation/) -- [Qdrant Python Client Documentation](https://python-client.qdrant.tech) -- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs) - -""" - From 1ea3e0f775f9a2dbf3409987dec4e5e874b57abc Mon Sep 17 00:00:00 2001 From: geetu040 Date: Mon, 25 Mar 2024 01:15:47 +0500 Subject: [PATCH 3/8] quick-search-engine: use pretty_print instead of search --- quick-search-engine/quick-search-engine.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/quick-search-engine/quick-search-engine.ipynb b/quick-search-engine/quick-search-engine.ipynb index 5cd455e..32a3434 100644 --- a/quick-search-engine/quick-search-engine.ipynb +++ b/quick-search-engine/quick-search-engine.ipynb @@ -657,7 +657,7 @@ }, "outputs": [], "source": [ - "def search(query):\n", + "def pretty_print(query):\n", " results = client.query(\n", " collection_name=QD_COLLECTION_NAME,\n", " query_text=query,\n", @@ -709,7 +709,7 @@ } ], "source": [ - "search(\"what is the best earyly morning meal?\")" + "pretty_print(\"what is the best earyly morning meal?\")" ] }, { @@ -752,7 +752,7 @@ } ], "source": [ - "search(\"How should one introduce themselves?\")" + "pretty_print(\"How should one introduce themselves?\")" ] }, { @@ -795,7 +795,7 @@ } ], "source": [ - "search(\"Why is the Earth a sphere?\")" + "pretty_print(\"Why is the Earth a sphere?\")" ] }, { From 01ca718b8ba83ea8d102ad7b6a6bac5b5f38a5af Mon Sep 17 00:00:00 2001 From: geetu040 Date: Mon, 25 Mar 2024 01:16:07 +0500 Subject: [PATCH 4/8] quick-search-engine: add requirements file --- quick-search-engine/requirements.txt | 3 +++ 1 file changed, 3 insertions(+) create mode 100644 quick-search-engine/requirements.txt diff --git a/quick-search-engine/requirements.txt b/quick-search-engine/requirements.txt new file mode 100644 index 0000000..89b3d60 --- /dev/null +++ b/quick-search-engine/requirements.txt @@ -0,0 +1,3 @@ +numpy +pandas +qdrant-client[fastembed] \ No newline at end of file From 78580f0f7481f32fd64d7a1b36e39ce155ae41b4 Mon Sep 17 00:00:00 2001 From: geetu040 Date: Tue, 26 Mar 2024 14:07:56 +0500 Subject: [PATCH 5/8] quick-search-engine: use hf datasets for quora dataset --- quick-search-engine/quick-search-engine.ipynb | 660 ++++-------------- 1 file changed, 139 insertions(+), 521 deletions(-) diff --git a/quick-search-engine/quick-search-engine.ipynb b/quick-search-engine/quick-search-engine.ipynb index 32a3434..6dc8111 100644 --- a/quick-search-engine/quick-search-engine.ipynb +++ b/quick-search-engine/quick-search-engine.ipynb @@ -2,333 +2,75 @@ "cells": [ { "cell_type": "markdown", - "id": "61a34468", + "id": "fcc4c336", "metadata": { "papermill": { - "duration": 0.007485, - "end_time": "2024-02-12T18:05:22.504227", + "duration": 0.007079, + "end_time": "2024-02-12T18:05:24.624181", "exception": false, - "start_time": "2024-02-12T18:05:22.496742", + "start_time": "2024-02-12T18:05:24.617102", "status": "completed" }, "tags": [] }, "source": [ - "# Overview\n", - "\n", - "### Quora Question Pairs\n", - "\n", - "It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them\n", - "\n", - "### Qdrant\n", - "\n", - "Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service.\n", - "\n", - "### Abstract\n", - "\n", - "This script implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries.\n", - "\n", - "### Methodology\n", - "\n", - "Here's a detailed overview of implementation:\n", - "\n", - "- The script begins by loading the Quora dataset and extracting questions from it. Duplicate questions are removed to ensure uniqueness, and a sample of questions is taken to expedite processing. These questions are then indexed using the `Qdrant library`.\n", - "- A search function is defined to query the indexed questions for similar matches to the user input query. The top similar questions found are displayed as results.\n", - "- Several example queries are provided to demonstrate the functionality of the search engine. These queries cover various topics, allowing users to observe how the engine retrieves relevant matches based on semantic similarity.\n", - "\n", - "### Summary\n", - "In summary, the script offers a practical demonstration of building a search engine for similar questions using real-world data and a specialized library. It provides a starting point for developing more sophisticated search functionalities and can be adapted for various applications requiring semantic similarity matching." + "# Dataset" ] }, { "cell_type": "markdown", - "id": "0732ecb4", - "metadata": { - "papermill": { - "duration": 0.006234, - "end_time": "2024-02-12T18:05:22.517556", - "exception": false, - "start_time": "2024-02-12T18:05:22.511322", - "status": "completed" - }, - "tags": [] - }, + "id": "1289b4eb", + "metadata": {}, "source": [ - "# Setting Up\n", - "1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs).\n", - "2. Download the file [train.csv.zip](https://www.kaggle.com/competitions/quora-question-pairs/data?select=train.csv.zip).\n", - "3. Unzip the downloaded file.\n", - "4. Save the path to the dataset in `DATA_PATH`." + "### Loading\n", + "1. Install `datasets` library\n", + "2. Load `Quora` dataset\n", + "3. Extract Questions\n", + "4. Concatenate all the questions" ] }, { "cell_type": "code", - "execution_count": 2, - "id": "192c482a", - "metadata": { - "execution": { - "iopub.execute_input": "2024-02-12T18:05:24.562387Z", - "iopub.status.busy": "2024-02-12T18:05:24.561855Z", - "iopub.status.idle": "2024-02-12T18:05:24.568197Z", - "shell.execute_reply": "2024-02-12T18:05:24.566603Z" - }, - "papermill": { - "duration": 0.018116, - "end_time": "2024-02-12T18:05:24.570688", - "exception": false, - "start_time": "2024-02-12T18:05:24.552572", - "status": "completed" - }, - "tags": [] - }, + "execution_count": null, + "id": "827c1ad1", + "metadata": {}, "outputs": [], "source": [ - "DATA_PATH = \"/kaggle/working/train.csv\"" - ] - }, - { - "cell_type": "markdown", - "id": "7bf0a4a1", - "metadata": { - "papermill": { - "duration": 0.007003, - "end_time": "2024-02-12T18:05:24.584544", - "exception": false, - "start_time": "2024-02-12T18:05:24.577541", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "## Initialize Constants" + "!pip install datasets" ] }, { "cell_type": "code", "execution_count": 3, - "id": "c87d7fff", - "metadata": { - "execution": { - "iopub.execute_input": "2024-02-12T18:05:24.601081Z", - "iopub.status.busy": "2024-02-12T18:05:24.599792Z", - "iopub.status.idle": "2024-02-12T18:05:24.606677Z", - "shell.execute_reply": "2024-02-12T18:05:24.605342Z" - }, - "papermill": { - "duration": 0.01855, - "end_time": "2024-02-12T18:05:24.609984", - "exception": false, - "start_time": "2024-02-12T18:05:24.591434", - "status": "completed" - }, - "tags": [] - }, + "id": "a790c935", + "metadata": {}, "outputs": [], "source": [ - "# Name of Qdrant Collection for saving vectors\n", - "QD_COLLECTION_NAME = \"collection_name\"\n", + "from datasets import load_dataset\n", "\n", - "# Sample size since the complete dataset is very long and can take long processing time\n", - "N = 30_000" - ] - }, - { - "cell_type": "markdown", - "id": "fcc4c336", - "metadata": { - "papermill": { - "duration": 0.007079, - "end_time": "2024-02-12T18:05:24.624181", - "exception": false, - "start_time": "2024-02-12T18:05:24.617102", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "# Dataset\n", - "- **Title:** Quora Question Pairs\n", - "- **Source:** Kaggle Competition\n", - "- **Link:** [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs)" + "dataset = load_dataset(\"quora\", split=\"train\")" ] }, { "cell_type": "code", "execution_count": 4, - "id": "c555334b", - "metadata": { - "execution": { - "iopub.execute_input": "2024-02-12T18:05:24.641025Z", - "iopub.status.busy": "2024-02-12T18:05:24.640557Z", - "iopub.status.idle": "2024-02-12T18:05:27.489049Z", - "shell.execute_reply": "2024-02-12T18:05:27.487548Z" - }, - "papermill": { - "duration": 2.860333, - "end_time": "2024-02-12T18:05:27.492034", - "exception": false, - "start_time": "2024-02-12T18:05:24.631701", - "status": "completed" - }, - "tags": [] - }, + "id": "5234d425", + "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Shape of DataFrame: (404290, 6)\n", - "First 10 rows:\n" - ] - }, { "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
idqid1qid2question1question2is_duplicate
0012What is the step by step guide to invest in sh...What is the step by step guide to invest in sh...0
1134What is the story of Kohinoor (Koh-i-Noor) Dia...What would happen if the Indian government sto...0
2256How can I increase the speed of my internet co...How can Internet speed be increased by hacking...0
3378Why am I mentally very lonely? How can I solve...Find the remainder when [math]23^{24}[/math] i...0
44910Which one dissolve in water quikly sugar, salt...Which fish would survive in salt water?0
551112Astrology: I am a Capricorn Sun Cap moon and c...I'm a triple Capricorn (Sun, Moon and ascendan...1
661314Should I buy tiago?What keeps childern active and far from phone ...0
771516How can I be a good geologist?What should I do to be a great geologist?1
881718When do you use シ instead of し?When do you use \"&\" instead of \"and\"?0
991920Motorola (company): Can I hack my Charter Moto...How do I hack Motorola DCX3400 for free internet?0
\n", - "
" - ], "text/plain": [ - " id qid1 qid2 question1 \\\n", - "0 0 1 2 What is the step by step guide to invest in sh... \n", - "1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... \n", - "2 2 5 6 How can I increase the speed of my internet co... \n", - "3 3 7 8 Why am I mentally very lonely? How can I solve... \n", - "4 4 9 10 Which one dissolve in water quikly sugar, salt... \n", - "5 5 11 12 Astrology: I am a Capricorn Sun Cap moon and c... \n", - "6 6 13 14 Should I buy tiago? \n", - "7 7 15 16 How can I be a good geologist? \n", - "8 8 17 18 When do you use シ instead of し? \n", - "9 9 19 20 Motorola (company): Can I hack my Charter Moto... \n", - "\n", - " question2 is_duplicate \n", - "0 What is the step by step guide to invest in sh... 0 \n", - "1 What would happen if the Indian government sto... 0 \n", - "2 How can Internet speed be increased by hacking... 0 \n", - "3 Find the remainder when [math]23^{24}[/math] i... 0 \n", - "4 Which fish would survive in salt water? 0 \n", - "5 I'm a triple Capricorn (Sun, Moon and ascendan... 1 \n", - "6 What keeps childern active and far from phone ... 0 \n", - "7 What should I do to be a great geologist? 1 \n", - "8 When do you use \"&\" instead of \"and\"? 0 \n", - "9 How do I hack Motorola DCX3400 for free internet? 0 " + "(808580,\n", + " ['What is the step by step guide to invest in share market in india?',\n", + " 'What is the step by step guide to invest in share market?',\n", + " 'What is the story of Kohinoor (Koh-i-Noor) Diamond?',\n", + " 'What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?',\n", + " 'How can I increase the speed of my internet connection while using a VPN?',\n", + " 'How can Internet speed be increased by hacking through DNS?',\n", + " 'Why am I mentally very lonely? How can I solve it?',\n", + " 'Find the remainder when [math]23^{24}[/math] is divided by 24,23?',\n", + " 'Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?',\n", + " 'Which fish would survive in salt water?'])" ] }, "execution_count": 4, @@ -337,99 +79,100 @@ } ], "source": [ - "import pandas as pd\n", - "\n", - "df = pd.read_csv(DATA_PATH)\n", + "questions = []\n", + "for q in dataset['questions']:\n", + "\tquestions.extend(q['text'])\n", "\n", - "print(\"Shape of DataFrame:\", df.shape)\n", - "print(\"First 10 rows:\")\n", - "df.head(10)" + "len(questions), questions[:10]" ] }, { "cell_type": "markdown", - "id": "f9820bf8", - "metadata": { - "papermill": { - "duration": 0.007339, - "end_time": "2024-02-12T18:05:27.506765", - "exception": false, - "start_time": "2024-02-12T18:05:27.499426", - "status": "completed" - }, - "tags": [] - }, + "id": "51dbb60a", + "metadata": {}, "source": [ - "## Questions\n", - "Extracting Questions from dataset, removing duplications and sample a portion of data to use for search engine" + "### Preprocess" ] }, { "cell_type": "code", "execution_count": 5, - "id": "68b9feee", - "metadata": { - "execution": { - "iopub.execute_input": "2024-02-12T18:05:27.524292Z", - "iopub.status.busy": "2024-02-12T18:05:27.523161Z", - "iopub.status.idle": "2024-02-12T18:05:27.925310Z", - "shell.execute_reply": "2024-02-12T18:05:27.923902Z" - }, - "papermill": { - "duration": 0.413583, - "end_time": "2024-02-12T18:05:27.928053", - "exception": false, - "start_time": "2024-02-12T18:05:27.514470", - "status": "completed" - }, - "tags": [] - }, + "id": "4187f4a5", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove all duplicates\n", + "questions = list(set(questions))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f9ceb941", + "metadata": {}, + "outputs": [], + "source": [ + "# Filter shorter or longer questions\n", + "\n", + "min_len = 10\n", + "max_len = 50\n", + "\n", + "def filter_function(question):\n", + "\twords = question.split()\n", + "\tn_words = len(words)\n", + "\tif n_words in range(min_len, max_len):\n", + "\t\treturn True\n", + "\treturn False\n", + "\n", + "questions = list(filter(filter_function, questions))" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "073cf435", + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "\n", + "# Shuffle and Sample the dataset\n", + "# Since complete data is very large\n", + "# and can take longer processing time\n", + "N = 30_000\n", + "\n", + "questions = random.choices(questions, k=N)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "24d1f281", + "metadata": {}, "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Total Questions: 537361\n", - "First 10 Questions:\n" - ] - }, { "data": { "text/plain": [ - "91814 Is Dubai a good place to settle?\n", - "328520 Why is talking to someone \"easier\" when you ha...\n", - "160048 Can Buddha, Jesus and Mohammad be the same per...\n", - "384538 Is light energy real in a way in which heat en...\n", - "247131 How are Altoids so strong?\n", - "102424 How do I view Verizon text messages?\n", - "190401 Which school is better for BS in computer scie...\n", - "153208 How does the US prevent electoral fraud?\n", - "81016 How do I stream live video?\n", - "172748 Free party halls in Chennai triplicane?\n", - "dtype: object" + "(30000,\n", + " ['What is it like to be black (African migrant or African American) in Australia?',\n", + " 'What are these Canada people? Why Canada is not a state of America? Do they look like us?',\n", + " 'Why do we use long transmission line for longer than 240 km?',\n", + " \"I'm 11 and I want my nose pierced, I'm ok with waiting till I'm 12 which is in January, my dad said he will think about when I'm 12, can I u think?\",\n", + " 'What is If there was one thing you would like to change about Quora what would it be? That one thing where Quora need to improve?',\n", + " 'My car steering not working as my wheel got stuck?',\n", + " 'What is the best shipping option for an online business in Nigeria sending products to the USA?',\n", + " 'Who is the best center to ever play in the NBA?',\n", + " 'How are red blood cells structured and how do they function?',\n", + " 'What are some things new employees should know going into their first day at Commerce Bank?'])" ] }, - "execution_count": 5, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "# extract the questions from df\n", - "questions = pd.concat([df['question1'], df['question2']], axis=0)\n", - "\n", - "# remove all the duplicate questions\n", - "questions = questions.drop_duplicates()\n", - "\n", - "# print total number of questions\n", - "print(\"Total Questions:\", len(questions))\n", - "\n", - "# sample questions from complete data to avoid long processing\n", - "questions = questions.sample(N)\n", - "\n", - "# print first 10 questions\n", - "print(\"First 10 Questions:\")\n", - "questions.iloc[:10]" + "len(questions), questions[:10]" ] }, { @@ -451,7 +194,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "id": "d6f42199", "metadata": { "execution": { @@ -469,126 +212,25 @@ }, "tags": [] }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collecting qdrant-client[fastembed]\r\n", - " Downloading qdrant_client-1.7.3-py3-none-any.whl.metadata (9.3 kB)\r\n", - "Requirement already satisfied: grpcio>=1.41.0 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.60.0)\r\n", - "Collecting grpcio-tools>=1.41.0 (from qdrant-client[fastembed])\r\n", - " Downloading grpcio_tools-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.2 kB)\r\n", - "Collecting httpx>=0.14.0 (from httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", - " Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)\r\n", - "Requirement already satisfied: numpy>=1.21 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.24.4)\r\n", - "Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client[fastembed])\r\n", - " Downloading portalocker-2.8.2-py3-none-any.whl.metadata (8.5 kB)\r\n", - "Requirement already satisfied: pydantic>=1.10.8 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (2.5.3)\r\n", - "Requirement already satisfied: urllib3<3,>=1.26.14 in /opt/conda/lib/python3.10/site-packages (from qdrant-client[fastembed]) (1.26.18)\r\n", - "Collecting fastembed==0.1.1 (from qdrant-client[fastembed])\r\n", - " Downloading fastembed-0.1.1-py3-none-any.whl.metadata (3.8 kB)\r\n", - "Requirement already satisfied: onnx<2.0,>=1.11 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (1.15.0)\r\n", - "Collecting onnxruntime<2.0,>=1.15 (from fastembed==0.1.1->qdrant-client[fastembed])\r\n", - " Downloading onnxruntime-1.17.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.2 kB)\r\n", - "Requirement already satisfied: requests<3.0,>=2.31 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (2.31.0)\r\n", - "Collecting tokenizers<0.14,>=0.13 (from fastembed==0.1.1->qdrant-client[fastembed])\r\n", - " Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m55.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hRequirement already satisfied: tqdm<5.0,>=4.65 in /opt/conda/lib/python3.10/site-packages (from fastembed==0.1.1->qdrant-client[fastembed]) (4.66.1)\r\n", - "Collecting protobuf<5.0dev,>=4.21.6 (from grpcio-tools>=1.41.0->qdrant-client[fastembed])\r\n", - " Downloading protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)\r\n", - "Collecting grpcio>=1.41.0 (from qdrant-client[fastembed])\r\n", - " Downloading grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)\r\n", - "Requirement already satisfied: setuptools in /opt/conda/lib/python3.10/site-packages (from grpcio-tools>=1.41.0->qdrant-client[fastembed]) (69.0.3)\r\n", - "Requirement already satisfied: anyio in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (4.2.0)\r\n", - "Requirement already satisfied: certifi in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (2023.11.17)\r\n", - "Collecting httpcore==1.* (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", - " Downloading httpcore-1.0.2-py3-none-any.whl.metadata (20 kB)\r\n", - "Requirement already satisfied: idna in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (3.6)\r\n", - "Requirement already satisfied: sniffio in /opt/conda/lib/python3.10/site-packages (from httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (1.3.0)\r\n", - "Requirement already satisfied: h11<0.15,>=0.13 in /opt/conda/lib/python3.10/site-packages (from httpcore==1.*->httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (0.14.0)\r\n", - "Collecting h2<5,>=3 (from httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", - " Downloading h2-4.1.0-py3-none-any.whl (57 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m57.5/57.5 kB\u001b[0m \u001b[31m2.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hRequirement already satisfied: annotated-types>=0.4.0 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (0.6.0)\r\n", - "Requirement already satisfied: pydantic-core==2.14.6 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (2.14.6)\r\n", - "Requirement already satisfied: typing-extensions>=4.6.1 in /opt/conda/lib/python3.10/site-packages (from pydantic>=1.10.8->qdrant-client[fastembed]) (4.9.0)\r\n", - "Collecting hyperframe<7,>=6.0 (from h2<5,>=3->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", - " Downloading hyperframe-6.0.1-py3-none-any.whl (12 kB)\r\n", - "Collecting hpack<5,>=4.0 (from h2<5,>=3->httpx[http2]>=0.14.0->qdrant-client[fastembed])\r\n", - " Downloading hpack-4.0.0-py3-none-any.whl (32 kB)\r\n", - "Collecting coloredlogs (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed])\r\n", - " Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m46.0/46.0 kB\u001b[0m \u001b[31m1.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hRequirement already satisfied: flatbuffers in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (23.5.26)\r\n", - "Requirement already satisfied: packaging in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (21.3)\r\n", - "Requirement already satisfied: sympy in /opt/conda/lib/python3.10/site-packages (from onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (1.12)\r\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests<3.0,>=2.31->fastembed==0.1.1->qdrant-client[fastembed]) (3.3.2)\r\n", - "Requirement already satisfied: exceptiongroup>=1.0.2 in /opt/conda/lib/python3.10/site-packages (from anyio->httpx>=0.14.0->httpx[http2]>=0.14.0->qdrant-client[fastembed]) (1.2.0)\r\n", - "Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed])\r\n", - " Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.8/86.8 kB\u001b[0m \u001b[31m4.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hRequirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.10/site-packages (from packaging->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (3.1.1)\r\n", - "Requirement already satisfied: mpmath>=0.19 in /opt/conda/lib/python3.10/site-packages (from sympy->onnxruntime<2.0,>=1.15->fastembed==0.1.1->qdrant-client[fastembed]) (1.3.0)\r\n", - "Downloading fastembed-0.1.1-py3-none-any.whl (14 kB)\r\n", - "Downloading grpcio_tools-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.8/2.8 MB\u001b[0m \u001b[31m56.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading grpcio-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m63.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading httpx-0.26.0-py3-none-any.whl (75 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.9/75.9 kB\u001b[0m \u001b[31m3.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading httpcore-1.0.2-py3-none-any.whl (76 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.9/76.9 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading portalocker-2.8.2-py3-none-any.whl (17 kB)\r\n", - "Downloading qdrant_client-1.7.3-py3-none-any.whl (206 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m206.3/206.3 kB\u001b[0m \u001b[31m10.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading onnxruntime-1.17.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.8 MB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.8/6.8 MB\u001b[0m \u001b[31m65.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hDownloading protobuf-4.25.2-cp37-abi3-manylinux2014_x86_64.whl (294 kB)\r\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m294.6/294.6 kB\u001b[0m \u001b[31m12.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", - "\u001b[?25hInstalling collected packages: tokenizers, protobuf, portalocker, hyperframe, humanfriendly, httpcore, hpack, grpcio, httpx, h2, grpcio-tools, coloredlogs, onnxruntime, qdrant-client, fastembed\r\n", - " Attempting uninstall: tokenizers\r\n", - " Found existing installation: tokenizers 0.15.1\r\n", - " Uninstalling tokenizers-0.15.1:\r\n", - " Successfully uninstalled tokenizers-0.15.1\r\n", - " Attempting uninstall: protobuf\r\n", - " Found existing installation: protobuf 3.20.3\r\n", - " Uninstalling protobuf-3.20.3:\r\n", - " Successfully uninstalled protobuf-3.20.3\r\n", - " Attempting uninstall: grpcio\r\n", - " Found existing installation: grpcio 1.60.0\r\n", - " Uninstalling grpcio-1.60.0:\r\n", - " Successfully uninstalled grpcio-1.60.0\r\n", - "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n", - "apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.\r\n", - "apache-beam 2.46.0 requires protobuf<4,>3.12.2, but you have protobuf 4.25.2 which is incompatible.\r\n", - "apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.0 which is incompatible.\r\n", - "google-cloud-aiplatform 0.6.0a1 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.11.1 which is incompatible.\r\n", - "google-cloud-automl 1.0.1 requires google-api-core[grpc]<2.0.0dev,>=1.14.0, but you have google-api-core 2.11.1 which is incompatible.\r\n", - "google-cloud-bigquery 2.34.4 requires protobuf<4.0.0dev,>=3.12.0, but you have protobuf 4.25.2 which is incompatible.\r\n", - "google-cloud-bigtable 1.7.3 requires protobuf<4.0.0dev, but you have protobuf 4.25.2 which is incompatible.\r\n", - "google-cloud-datastore 1.15.5 requires protobuf<4.0.0dev, but you have protobuf 4.25.2 which is incompatible.\r\n", - "google-cloud-vision 2.8.0 requires protobuf<4.0.0dev,>=3.19.0, but you have protobuf 4.25.2 which is incompatible.\r\n", - "kfp 2.5.0 requires google-cloud-storage<3,>=2.2.1, but you have google-cloud-storage 1.44.0 which is incompatible.\r\n", - "kfp 2.5.0 requires protobuf<4,>=3.13.0, but you have protobuf 4.25.2 which is incompatible.\r\n", - "kfp-pipeline-spec 0.2.2 requires protobuf<4,>=3.13.0, but you have protobuf 4.25.2 which is incompatible.\r\n", - "tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.\r\n", - "tensorflow-metadata 0.14.0 requires protobuf<4,>=3.7, but you have protobuf 4.25.2 which is incompatible.\r\n", - "tensorflow-transform 0.14.0 requires protobuf<4,>=3.7, but you have protobuf 4.25.2 which is incompatible.\r\n", - "tensorflowjs 4.16.0 requires packaging~=23.1, but you have packaging 21.3 which is incompatible.\r\n", - "transformers 4.37.0 requires tokenizers<0.19,>=0.14, but you have tokenizers 0.13.3 which is incompatible.\u001b[0m\u001b[31m\r\n", - "\u001b[0mSuccessfully installed coloredlogs-15.0.1 fastembed-0.1.1 grpcio-1.60.1 grpcio-tools-1.60.1 h2-4.1.0 hpack-4.0.0 httpcore-1.0.2 httpx-0.26.0 humanfriendly-10.0 hyperframe-6.0.1 onnxruntime-1.17.0 portalocker-2.8.2 protobuf-3.20.3 qdrant-client-1.7.3 tokenizers-0.13.3\r\n" - ] - } - ], + "outputs": [], "source": [ "!pip install qdrant-client[fastembed]" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 9, + "id": "4911a12e", + "metadata": {}, + "outputs": [], + "source": [ + "# Name of Qdrant Collection for saving vectors\n", + "QD_COLLECTION_NAME = \"collection_name\"" + ] + }, + { + "cell_type": "code", + "execution_count": 10, "id": "58d75809", "metadata": { "execution": { @@ -607,13 +249,6 @@ "tags": [] }, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "100%|██████████| 77.7M/77.7M [00:01<00:00, 44.9MiB/s]\n" - ] - }, { "name": "stdout", "output_type": "stream", @@ -637,7 +272,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 11, "id": "48bb2415", "metadata": { "execution": { @@ -671,7 +306,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 12, "id": "c132090b", "metadata": { "execution": { @@ -696,15 +331,15 @@ "text": [ "Query: what is the best earyly morning meal?\n", "\n", - "1) What’s the best breakfast in the morning?\n", + "1) What is your favorite food for a chilly winter day?\n", "\n", - "2) What's the healthiest thing to eat for a quick and easy western breakfast?\n", + "2) Can you give me some recipes for a healthy and easy packed lunch?\n", "\n", - "3) What high protein foods are good for breakfast?\n", + "3) What is the best meal you ever had in your life?\n", "\n", - "4) What are the healthiest foods to eat for dinner?\n", + "4) What's the first thing you put in your mouth in the morning?\n", "\n", - "5) What are the best breakfast recipes in India?\n" + "5) What is served for breakfast on a typical US army base?\n" ] } ], @@ -714,7 +349,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 13, "id": "93362c88", "metadata": { "execution": { @@ -739,15 +374,15 @@ "text": [ "Query: How should one introduce themselves?\n", "\n", - "1) How should you not introduce yourself?\n", + "1) What is the first step anyone will take before stating his/her own business?\n", "\n", - "2) How can I introspect myself?\n", + "2) How can a fresh graduate face his 1st interview for a bank job if the question is \"say something about yourself / introduce yourself\"?\n", "\n", - "3) What would be your answer to the interview question \"introduce yourself\"?\n", + "3) What is the best way to respond to an email introduction?\n", "\n", - "4) How do I effectively introduce myself in college?\n", + "4) How do I give a welcoming speech to students of the freshman year?\n", "\n", - "5) How do you get to know yourself?\n" + "5) How do I speak up in front of many people?\n" ] } ], @@ -757,7 +392,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 14, "id": "dbb0e763", "metadata": { "execution": { @@ -782,15 +417,15 @@ "text": [ "Query: Why is the Earth a sphere?\n", "\n", - "1) Why is the earth called earth?\n", + "1) Why do extraterrestrial bodies always appear as a spherical shape? Why not square or cylindrical?\n", "\n", - "2) What proof do people who say the Earth is flat and not a sphere have?\n", + "2) If the earth is a sphere, how is it that wherever we stand, we never fall off?\n", "\n", - "3) Why is the earth round?\n", + "3) All things are supposed to fall to the earth because of gravity, then why do clouds float?\n", "\n", - "4) What is the shape of the earth?\n", + "4) Why do some people think that the Earth is flat?\n", "\n", - "5) Why do extraterrestrial bodies always appear as a spherical shape? Why not square or cylindrical?\n" + "5) What is the reason for existence of Earth's magnetic field?\n" ] } ], @@ -821,23 +456,6 @@ "- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)\n", "- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs)\n" ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1096b11d", - "metadata": { - "papermill": { - "duration": 0.029937, - "end_time": "2024-02-12T18:16:19.255323", - "exception": false, - "start_time": "2024-02-12T18:16:19.225386", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [] } ], "metadata": { @@ -871,7 +489,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.13" + "version": "3.11.4" }, "papermill": { "default_parameters": {}, From 5b614c74ba29b1b62591cc36b0050f326d4bb228 Mon Sep 17 00:00:00 2001 From: geetu040 Date: Tue, 26 Mar 2024 14:08:30 +0500 Subject: [PATCH 6/8] quick-search-engine: add version numbers with libraries --- quick-search-engine/requirements.txt | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/quick-search-engine/requirements.txt b/quick-search-engine/requirements.txt index 89b3d60..c7a76c7 100644 --- a/quick-search-engine/requirements.txt +++ b/quick-search-engine/requirements.txt @@ -1,3 +1,2 @@ -numpy -pandas -qdrant-client[fastembed] \ No newline at end of file +datasets==2.18.0 +qdrant-client[fastembed]==1.8.0 \ No newline at end of file From bc711bc55e65c707916b06404b69336e954b646e Mon Sep 17 00:00:00 2001 From: geetu040 Date: Tue, 26 Mar 2024 14:08:58 +0500 Subject: [PATCH 7/8] quick-search-engine: simplify README --- quick-search-engine/README.md | 124 ++++++---------------------------- 1 file changed, 21 insertions(+), 103 deletions(-) diff --git a/quick-search-engine/README.md b/quick-search-engine/README.md index 6d33dd9..351dec4 100644 --- a/quick-search-engine/README.md +++ b/quick-search-engine/README.md @@ -1,116 +1,34 @@ -# Qdrant Search Engine +# Overview -This code contains code and instructions for building a search engine using Qdrant. Qdrant is particularly well-suited for tasks involving similarity search, such as recommendation systems, image or text retrieval, and clustering. +### Quora Question Pairs -## Article Overview +It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them -The associated Medium article titled ["Build a search engine in 5 minutes using Qdrant"](https://medium.com/@armaghanshakir/build-a-search-engine-in-5-minutes-using-qdrant-5c7a3ef6ac5) provides a step-by-step guide on building a search engine using Qdrant. +### Qdrant -## Instructions +Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service. -### Setting Up +### Abstract -1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/c/quora-question-pairs). -2. Download the file `train.csv.zip`. -3. Unzip the downloaded file. -4. Save the path to the dataset in the `DATA_PATH` variable. +This notebook implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries. -```python -DATA_PATH = "/kaggle/working/train.csv" # path to your train.csv file -``` +### Methodology -### Initializing Constants +Here's a detailed overview of implementation: -```python -# Name of Qdrant Collection for saving vectors -QD_COLLECTION_NAME = "collection_name" +- Load the Quora Dataset and apply preprocessing steps. +- Vectorize the textual data and store in a vector space, where questions entered by users can be vectorized and compared in the same vector space - All these steps are covered by internal functionality of Qdrant. +- Several example queries are provided to demonstrate the functionality of the search engine. -# Sample size since the complete dataset is very long and can take long processing time -N = 30_000 -``` +### Summary -### Loading Dataset +In summary, the notebook demonstrates how easily and efficiently, complete search engine can be created using Qdrant Vector Database and Client. -Load the dataset using pandas in Python: +### Explore More! -```python -import pandas as pd - -df = pd.read_csv(DATA_PATH) - -print("Shape of DataFrame:", df.shape) -print("First 10 rows:") -df.head(10) -``` - -### Extracting Questions - -Extract questions from the dataframe, remove duplicates, and create a sample to see results on a part of the data: - -```python -# extract the questions from df -questions = pd.concat([df['question1'], df['question2']], axis=0) - -# remove all the duplicate questions -questions = questions.drop_duplicates() - -# print total number of questions -print("Total Questions:", len(questions)) - -# sample questions from complete data to avoid long processing -questions = questions.sample(N) - -# print first 10 questions -print("First 10 Questions:") -questions.iloc[:10] -``` - -### Installing Qdrant and Adding Documents - -First, install qdrant with fastembed: - -```bash -!pip install qdrant-client[fastembed] -``` - -Then add the documents to the Qdrant vector space: - -```python -from qdrant_client import QdrantClient - -client = QdrantClient(":memory:") - -client.add( - collection_name=QD_COLLECTION_NAME, - documents=questions, -) - -print("Completed") -``` - -### Creating a Search Function - -Create a search function that processes the query and prints the results: - -```python -def search(query): - results = client.query( - collection_name=QD_COLLECTION_NAME, - query_text=query, - limit=5 - ) - print("Query:", query) - for i, result in enumerate(results): - print() - print(f"{i+1}) {result.document}") - -search("what is the best early morning meal?") -search("How should one introduce themselves?") -search("Why is the Earth a sphere?") -``` - -## Additional Resources - -- [E-Commerce Products Search Engine Using Qdrant](https://medium.com/@armaghanshakir/e-commerce-products-search-engine-using-qdrant-b65dc6ab1983) -- [Qdrant Documentation](https://qdrant.github.io/) -- [Qdrant Python Client Documentation](https://qdrant-client.readthedocs.io/en/latest/) \ No newline at end of file +- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1) +- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant) +- [Qdrant](https://qdrant.tech) +- [Qdrant Documentation](https://qdrant.tech/documentation/) +- [Qdrant Python Client Documentation](https://python-client.qdrant.tech) +- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs) From fcbfbaadabd76672ebb86effde087f487688cb93 Mon Sep 17 00:00:00 2001 From: geetu040 Date: Tue, 26 Mar 2024 14:13:11 +0500 Subject: [PATCH 8/8] quick-search-engine: move README content to notebook overview --- quick-search-engine/README.md | 34 --------------- quick-search-engine/quick-search-engine.ipynb | 41 +++++++++++++++++++ 2 files changed, 41 insertions(+), 34 deletions(-) delete mode 100644 quick-search-engine/README.md diff --git a/quick-search-engine/README.md b/quick-search-engine/README.md deleted file mode 100644 index 351dec4..0000000 --- a/quick-search-engine/README.md +++ /dev/null @@ -1,34 +0,0 @@ -# Overview - -### Quora Question Pairs - -It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them - -### Qdrant - -Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service. - -### Abstract - -This notebook implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries. - -### Methodology - -Here's a detailed overview of implementation: - -- Load the Quora Dataset and apply preprocessing steps. -- Vectorize the textual data and store in a vector space, where questions entered by users can be vectorized and compared in the same vector space - All these steps are covered by internal functionality of Qdrant. -- Several example queries are provided to demonstrate the functionality of the search engine. - -### Summary - -In summary, the notebook demonstrates how easily and efficiently, complete search engine can be created using Qdrant Vector Database and Client. - -### Explore More! - -- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1) -- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant) -- [Qdrant](https://qdrant.tech) -- [Qdrant Documentation](https://qdrant.tech/documentation/) -- [Qdrant Python Client Documentation](https://python-client.qdrant.tech) -- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs) diff --git a/quick-search-engine/quick-search-engine.ipynb b/quick-search-engine/quick-search-engine.ipynb index 6dc8111..524fa0e 100644 --- a/quick-search-engine/quick-search-engine.ipynb +++ b/quick-search-engine/quick-search-engine.ipynb @@ -1,5 +1,46 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "b549c3c7", + "metadata": {}, + "source": [ + "# Overview\n", + "\n", + "### Quora Question Pairs\n", + "\n", + "It is a large corpus of different questions and is used to detect similar/repeating questions by understanding the semantic meaning of them\n", + "\n", + "### Qdrant\n", + "\n", + "Qdrant is an Open-Source Vector Database and Vector Search Engine written in Rust. It provides fast and scalable vector similarity search service.\n", + "\n", + "### Abstract\n", + "\n", + "This notebook implements a search engine using the `Quora Duplicate Questions` dataset and the `Qdrant library`. It aims to identify similar questions based on user input queries.\n", + "\n", + "### Methodology\n", + "\n", + "Here's a detailed overview of implementation:\n", + "\n", + "- Load the Quora Dataset and apply preprocessing steps.\n", + "- Vectorize the textual data and store in a vector space, where questions entered by users can be vectorized and compared in the same vector space - All these steps are covered by internal functionality of Qdrant.\n", + "- Several example queries are provided to demonstrate the functionality of the search engine.\n", + "\n", + "### Summary\n", + "\n", + "In summary, the notebook demonstrates how easily and efficiently, complete search engine can be created using Qdrant Vector Database and Client.\n", + "\n", + "### Explore More!\n", + "\n", + "- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1)\n", + "- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant)\n", + "- [Qdrant](https://qdrant.tech)\n", + "- [Qdrant Documentation](https://qdrant.tech/documentation/)\n", + "- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)\n", + "- [Quora Question Pair](https://www.kaggle.com/competitions/quora-question-pairs)\n" + ] + }, { "cell_type": "markdown", "id": "fcc4c336",