From 96df443d7602908751f51a860ed4d5972ec2f7ea Mon Sep 17 00:00:00 2001 From: Viraj Agarwal Date: Mon, 17 Nov 2025 14:49:06 +0530 Subject: [PATCH 1/4] DA-1319: update: frontmatter and tutorial content for RAG with Couchbase and OpenAI - Changed tutorial path and titles to reflect the use of Couchbase Search Vector Index instead of FTS. - Updated descriptions to clarify the integration of Couchbase's vector search capabilities with OpenAI embeddings. - Removed the fts_index.json file as it is no longer needed. - Enhanced the RAG notebook to include updated content and examples using the new index types. - Adjusted tags in frontmatter to align with the new terminology. --- ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 437 +++++++++++- haystack/fts/frontmatter.md | 14 +- ...ts_index.json => search_vector_index.json} | 0 ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 671 +++++++++++++++--- haystack/gsi/frontmatter.md | 18 +- 5 files changed, 991 insertions(+), 149 deletions(-) rename haystack/fts/{fts_index.json => search_vector_index.json} (100%) diff --git a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index d31b6509..429841a1 100644 --- a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -4,11 +4,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# BBC News Dataset RAG Pipeline with Couchbase and OpenAI\n", + "# BBC News Dataset RAG Pipeline with Haystack, Couchbase Search Vector Index, and OpenAI\n", "\n", "This notebook demonstrates how to build a Retrieval Augmented Generation (RAG) system using:\n", "- The BBC News dataset containing real-time news articles\n", - "- Couchbase Capella as the vector store with FTS (Full Text Search)\n", + "- Couchbase Capella Search Vector Index for low-latency vector retrieval\n", "- Haystack framework for the RAG pipeline\n", "- OpenAI for embeddings and text generation\n", 
"\n", @@ -26,9 +26,225 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting datasets\n", + " Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)\n", + "Collecting haystack-ai\n", + " Downloading haystack_ai-2.20.0-py3-none-any.whl.metadata (15 kB)\n", + "Collecting couchbase-haystack\n", + " Using cached couchbase_haystack-2.1.0-py3-none-any.whl.metadata (31 kB)\n", + "Collecting openai\n", + " Downloading openai-2.8.0-py3-none-any.whl.metadata (29 kB)\n", + "Collecting pandas\n", + " Using cached pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (91 kB)\n", + "Collecting filelock (from datasets)\n", + " Using cached filelock-3.20.0-py3-none-any.whl.metadata (2.1 kB)\n", + "Collecting numpy>=1.17 (from datasets)\n", + " Downloading numpy-2.3.5-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)\n", + "Collecting pyarrow>=21.0.0 (from datasets)\n", + " Using cached pyarrow-22.0.0-cp313-cp313-macosx_12_0_arm64.whl.metadata (3.2 kB)\n", + "Collecting dill<0.4.1,>=0.3.0 (from datasets)\n", + " Using cached dill-0.4.0-py3-none-any.whl.metadata (10 kB)\n", + "Collecting requests>=2.32.2 (from datasets)\n", + " Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)\n", + "Collecting httpx<1.0.0 (from datasets)\n", + " Using cached httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)\n", + "Collecting tqdm>=4.66.3 (from datasets)\n", + " Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)\n", + "Collecting xxhash (from datasets)\n", + " Using cached xxhash-3.6.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (13 kB)\n", + "Collecting multiprocess<0.70.19 (from datasets)\n", + " Downloading multiprocess-0.70.18-py313-none-any.whl.metadata (7.2 kB)\n", + "Collecting fsspec<=2025.10.0,>=2023.1.0 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Downloading 
fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)\n", + "Collecting huggingface-hub<2.0,>=0.25.0 (from datasets)\n", + " Downloading huggingface_hub-1.1.4-py3-none-any.whl.metadata (13 kB)\n", + "Requirement already satisfied: packaging in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets) (25.0)\n", + "Collecting pyyaml>=5.1 (from datasets)\n", + " Using cached pyyaml-6.0.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (2.4 kB)\n", + "Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Downloading aiohttp-3.13.2-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.1 kB)\n", + "Collecting anyio (from httpx<1.0.0->datasets)\n", + " Using cached anyio-4.11.0-py3-none-any.whl.metadata (4.1 kB)\n", + "Collecting certifi (from httpx<1.0.0->datasets)\n", + " Downloading certifi-2025.11.12-py3-none-any.whl.metadata (2.5 kB)\n", + "Collecting httpcore==1.* (from httpx<1.0.0->datasets)\n", + " Using cached httpcore-1.0.9-py3-none-any.whl.metadata (21 kB)\n", + "Collecting idna (from httpx<1.0.0->datasets)\n", + " Using cached idna-3.11-py3-none-any.whl.metadata (8.4 kB)\n", + "Collecting h11>=0.16 (from httpcore==1.*->httpx<1.0.0->datasets)\n", + " Using cached h11-0.16.0-py3-none-any.whl.metadata (8.3 kB)\n", + "Collecting hf-xet<2.0.0,>=1.2.0 (from huggingface-hub<2.0,>=0.25.0->datasets)\n", + " Downloading hf_xet-1.2.0-cp37-abi3-macosx_11_0_arm64.whl.metadata (4.9 kB)\n", + "Collecting shellingham (from huggingface-hub<2.0,>=0.25.0->datasets)\n", + " Using cached shellingham-1.5.4-py2.py3-none-any.whl.metadata (3.5 kB)\n", + "Collecting typer-slim (from huggingface-hub<2.0,>=0.25.0->datasets)\n", + " Downloading typer_slim-0.20.0-py3-none-any.whl.metadata (16 kB)\n", + "Collecting typing-extensions>=3.7.4.3 (from huggingface-hub<2.0,>=0.25.0->datasets)\n", + " Using cached typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)\n", + "Collecting docstring-parser (from haystack-ai)\n", + " Using 
cached docstring_parser-0.17.0-py3-none-any.whl.metadata (3.5 kB)\n", + "Collecting filetype (from haystack-ai)\n", + " Using cached filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)\n", + "Collecting haystack-experimental (from haystack-ai)\n", + " Using cached haystack_experimental-0.14.2-py3-none-any.whl.metadata (18 kB)\n", + "Collecting jinja2 (from haystack-ai)\n", + " Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting jsonschema (from haystack-ai)\n", + " Using cached jsonschema-4.25.1-py3-none-any.whl.metadata (7.6 kB)\n", + "Collecting lazy-imports (from haystack-ai)\n", + " Using cached lazy_imports-1.1.0-py3-none-any.whl.metadata (11 kB)\n", + "Collecting more-itertools (from haystack-ai)\n", + " Using cached more_itertools-10.8.0-py3-none-any.whl.metadata (39 kB)\n", + "Collecting networkx (from haystack-ai)\n", + " Using cached networkx-3.5-py3-none-any.whl.metadata (6.3 kB)\n", + "Collecting posthog!=3.12.0 (from haystack-ai)\n", + " Downloading posthog-7.0.1-py3-none-any.whl.metadata (6.0 kB)\n", + "Collecting pydantic (from haystack-ai)\n", + " Using cached pydantic-2.12.4-py3-none-any.whl.metadata (89 kB)\n", + "Requirement already satisfied: python-dateutil in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai) (2.9.0.post0)\n", + "Collecting tenacity!=8.4.0 (from haystack-ai)\n", + " Using cached tenacity-9.1.2-py3-none-any.whl.metadata (1.2 kB)\n", + "Collecting backports-datetime-fromisoformat (from couchbase-haystack)\n", + " Using cached backports_datetime_fromisoformat-2.0.3-cp313-cp313-macosx_10_13_universal2.whl\n", + "Collecting couchbase==4.* (from couchbase-haystack)\n", + " Using cached couchbase-4.5.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (23 kB)\n", + "Collecting distro<2,>=1.7.0 (from openai)\n", + " Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)\n", + "Collecting jiter<1,>=0.10.0 (from openai)\n", + " Downloading 
jiter-0.12.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (5.2 kB)\n", + "Collecting sniffio (from openai)\n", + " Using cached sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)\n", + "Collecting annotated-types>=0.6.0 (from pydantic->haystack-ai)\n", + " Using cached annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)\n", + "Collecting pydantic-core==2.41.5 (from pydantic->haystack-ai)\n", + " Using cached pydantic_core-2.41.5-cp313-cp313-macosx_11_0_arm64.whl.metadata (7.3 kB)\n", + "Collecting typing-inspection>=0.4.2 (from pydantic->haystack-ai)\n", + " Using cached typing_inspection-0.4.2-py3-none-any.whl.metadata (2.6 kB)\n", + "Collecting pytz>=2020.1 (from pandas)\n", + " Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)\n", + "Collecting tzdata>=2022.7 (from pandas)\n", + " Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)\n", + "Collecting aiohappyeyeballs>=2.5.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl.metadata (5.9 kB)\n", + "Collecting aiosignal>=1.4.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached aiosignal-1.4.0-py3-none-any.whl.metadata (3.7 kB)\n", + "Collecting attrs>=17.3.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached attrs-25.4.0-py3-none-any.whl.metadata (10 kB)\n", + "Collecting frozenlist>=1.1.1 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached frozenlist-1.8.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (20 kB)\n", + "Collecting multidict<7.0,>=4.5 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached multidict-6.7.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (5.3 kB)\n", + "Collecting propcache>=0.2.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached 
propcache-0.4.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (13 kB)\n", + "Collecting yarl<2.0,>=1.17.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets)\n", + " Using cached yarl-1.22.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (75 kB)\n", + "Requirement already satisfied: six>=1.5 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from posthog!=3.12.0->haystack-ai) (1.17.0)\n", + "Collecting backoff>=1.10.0 (from posthog!=3.12.0->haystack-ai)\n", + " Using cached backoff-2.2.1-py3-none-any.whl.metadata (14 kB)\n", + "Collecting charset_normalizer<4,>=2 (from requests>=2.32.2->datasets)\n", + " Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl.metadata (37 kB)\n", + "Collecting urllib3<3,>=1.21.1 (from requests>=2.32.2->datasets)\n", + " Using cached urllib3-2.5.0-py3-none-any.whl.metadata (6.5 kB)\n", + "Collecting rich (from haystack-experimental->haystack-ai)\n", + " Downloading rich-14.2.0-py3-none-any.whl.metadata (18 kB)\n", + "Collecting MarkupSafe>=2.0 (from jinja2->haystack-ai)\n", + " Using cached markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (2.7 kB)\n", + "Collecting jsonschema-specifications>=2023.03.6 (from jsonschema->haystack-ai)\n", + " Using cached jsonschema_specifications-2025.9.1-py3-none-any.whl.metadata (2.9 kB)\n", + "Collecting referencing>=0.28.4 (from jsonschema->haystack-ai)\n", + " Using cached referencing-0.37.0-py3-none-any.whl.metadata (2.8 kB)\n", + "Collecting rpds-py>=0.7.1 (from jsonschema->haystack-ai)\n", + " Downloading rpds_py-0.29.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (4.1 kB)\n", + "Collecting markdown-it-py>=2.2.0 (from rich->haystack-experimental->haystack-ai)\n", + " Using cached markdown_it_py-4.0.0-py3-none-any.whl.metadata (7.3 kB)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from rich->haystack-experimental->haystack-ai) 
(2.19.2)\n", + "Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich->haystack-experimental->haystack-ai)\n", + " Using cached mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)\n", + "Collecting click>=8.0.0 (from typer-slim->huggingface-hub<2.0,>=0.25.0->datasets)\n", + " Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)\n", + "Downloading datasets-4.4.1-py3-none-any.whl (511 kB)\n", + "Using cached dill-0.4.0-py3-none-any.whl (119 kB)\n", + "Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB)\n", + "Using cached httpx-0.28.1-py3-none-any.whl (73 kB)\n", + "Using cached httpcore-1.0.9-py3-none-any.whl (78 kB)\n", + "Downloading huggingface_hub-1.1.4-py3-none-any.whl (515 kB)\n", + "Downloading hf_xet-1.2.0-cp37-abi3-macosx_11_0_arm64.whl (2.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.7/2.7 MB\u001b[0m \u001b[31m20.5 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading multiprocess-0.70.18-py313-none-any.whl (151 kB)\n", + "Downloading haystack_ai-2.20.0-py3-none-any.whl (624 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m624.7/624.7 kB\u001b[0m \u001b[31m20.4 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hUsing cached couchbase_haystack-2.1.0-py3-none-any.whl (33 kB)\n", + "Using cached couchbase-4.5.0-cp313-cp313-macosx_11_0_arm64.whl (4.3 MB)\n", + "Downloading openai-2.8.0-py3-none-any.whl (1.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.0/1.0 MB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hUsing cached anyio-4.11.0-py3-none-any.whl (109 kB)\n", + "Using cached distro-1.9.0-py3-none-any.whl (20 kB)\n", + "Downloading jiter-0.12.0-cp313-cp313-macosx_11_0_arm64.whl (318 kB)\n", + "Using cached pydantic-2.12.4-py3-none-any.whl (463 kB)\n", + "Using cached pydantic_core-2.41.5-cp313-cp313-macosx_11_0_arm64.whl (1.9 MB)\n", + "Using cached 
typing_extensions-4.15.0-py3-none-any.whl (44 kB)\n", + "Using cached pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl (10.7 MB)\n", + "Downloading aiohttp-3.13.2-cp313-cp313-macosx_11_0_arm64.whl (489 kB)\n", + "Using cached multidict-6.7.0-cp313-cp313-macosx_11_0_arm64.whl (43 kB)\n", + "Using cached yarl-1.22.0-cp313-cp313-macosx_11_0_arm64.whl (93 kB)\n", + "Using cached aiohappyeyeballs-2.6.1-py3-none-any.whl (15 kB)\n", + "Using cached aiosignal-1.4.0-py3-none-any.whl (7.5 kB)\n", + "Using cached annotated_types-0.7.0-py3-none-any.whl (13 kB)\n", + "Using cached attrs-25.4.0-py3-none-any.whl (67 kB)\n", + "Using cached frozenlist-1.8.0-cp313-cp313-macosx_11_0_arm64.whl (49 kB)\n", + "Using cached h11-0.16.0-py3-none-any.whl (37 kB)\n", + "Using cached idna-3.11-py3-none-any.whl (71 kB)\n", + "Downloading numpy-2.3.5-cp313-cp313-macosx_14_0_arm64.whl (5.1 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.1/5.1 MB\u001b[0m \u001b[31m18.5 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hDownloading posthog-7.0.1-py3-none-any.whl (145 kB)\n", + "Using cached requests-2.32.5-py3-none-any.whl (64 kB)\n", + "Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl (208 kB)\n", + "Using cached urllib3-2.5.0-py3-none-any.whl (129 kB)\n", + "Using cached backoff-2.2.1-py3-none-any.whl (15 kB)\n", + "Downloading certifi-2025.11.12-py3-none-any.whl (159 kB)\n", + "Using cached propcache-0.4.1-cp313-cp313-macosx_11_0_arm64.whl (46 kB)\n", + "Using cached pyarrow-22.0.0-cp313-cp313-macosx_12_0_arm64.whl (34.2 MB)\n", + "Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)\n", + "Using cached pyyaml-6.0.3-cp313-cp313-macosx_11_0_arm64.whl (173 kB)\n", + "Using cached sniffio-1.3.1-py3-none-any.whl (10 kB)\n", + "Using cached tenacity-9.1.2-py3-none-any.whl (28 kB)\n", + "Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)\n", + "Using cached 
typing_inspection-0.4.2-py3-none-any.whl (14 kB)\n", + "Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)\n", + "Using cached docstring_parser-0.17.0-py3-none-any.whl (36 kB)\n", + "Using cached filelock-3.20.0-py3-none-any.whl (16 kB)\n", + "Using cached filetype-1.2.0-py2.py3-none-any.whl (19 kB)\n", + "Using cached haystack_experimental-0.14.2-py3-none-any.whl (79 kB)\n", + "Using cached jinja2-3.1.6-py3-none-any.whl (134 kB)\n", + "Using cached markupsafe-3.0.3-cp313-cp313-macosx_11_0_arm64.whl (12 kB)\n", + "Using cached jsonschema-4.25.1-py3-none-any.whl (90 kB)\n", + "Using cached jsonschema_specifications-2025.9.1-py3-none-any.whl (18 kB)\n", + "Using cached referencing-0.37.0-py3-none-any.whl (26 kB)\n", + "Downloading rpds_py-0.29.0-cp313-cp313-macosx_11_0_arm64.whl (360 kB)\n", + "Using cached lazy_imports-1.1.0-py3-none-any.whl (18 kB)\n", + "Using cached more_itertools-10.8.0-py3-none-any.whl (69 kB)\n", + "Using cached networkx-3.5-py3-none-any.whl (2.0 MB)\n", + "Downloading rich-14.2.0-py3-none-any.whl (243 kB)\n", + "Using cached markdown_it_py-4.0.0-py3-none-any.whl (87 kB)\n", + "Using cached mdurl-0.1.2-py3-none-any.whl (10.0 kB)\n", + "Using cached shellingham-1.5.4-py2.py3-none-any.whl (9.8 kB)\n", + "Downloading typer_slim-0.20.0-py3-none-any.whl (47 kB)\n", + "Downloading click-8.3.1-py3-none-any.whl (108 kB)\n", + "Using cached xxhash-3.6.0-cp313-cp313-macosx_11_0_arm64.whl (30 kB)\n", + "Installing collected packages: pytz, filetype, xxhash, urllib3, tzdata, typing-extensions, tqdm, tenacity, sniffio, shellingham, rpds-py, pyyaml, pyarrow, propcache, numpy, networkx, multidict, more-itertools, mdurl, MarkupSafe, lazy-imports, jiter, idna, hf-xet, h11, fsspec, frozenlist, filelock, docstring-parser, distro, dill, couchbase, click, charset_normalizer, certifi, backports-datetime-fromisoformat, backoff, attrs, annotated-types, aiohappyeyeballs, yarl, typing-inspection, typer-slim, requests, referencing, pydantic-core, pandas, 
multiprocess, markdown-it-py, jinja2, httpcore, anyio, aiosignal, rich, pydantic, posthog, jsonschema-specifications, httpx, aiohttp, openai, jsonschema, huggingface-hub, datasets, haystack-experimental, haystack-ai, couchbase-haystack\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m66/66\u001b[0m [couchbase-haystack]aystack-ai]hub]fications]\n", + "\u001b[1A\u001b[2KSuccessfully installed MarkupSafe-3.0.3 aiohappyeyeballs-2.6.1 aiohttp-3.13.2 aiosignal-1.4.0 annotated-types-0.7.0 anyio-4.11.0 attrs-25.4.0 backoff-2.2.1 backports-datetime-fromisoformat-2.0.3 certifi-2025.11.12 charset_normalizer-3.4.4 click-8.3.1 couchbase-4.5.0 couchbase-haystack-2.1.0 datasets-4.4.1 dill-0.4.0 distro-1.9.0 docstring-parser-0.17.0 filelock-3.20.0 filetype-1.2.0 frozenlist-1.8.0 fsspec-2025.10.0 h11-0.16.0 haystack-ai-2.20.0 haystack-experimental-0.14.2 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface-hub-1.1.4 idna-3.11 jinja2-3.1.6 jiter-0.12.0 jsonschema-4.25.1 jsonschema-specifications-2025.9.1 lazy-imports-1.1.0 markdown-it-py-4.0.0 mdurl-0.1.2 more-itertools-10.8.0 multidict-6.7.0 multiprocess-0.70.18 networkx-3.5 numpy-2.3.5 openai-2.8.0 pandas-2.3.3 posthog-7.0.1 propcache-0.4.1 pyarrow-22.0.0 pydantic-2.12.4 pydantic-core-2.41.5 pytz-2025.2 pyyaml-6.0.3 referencing-0.37.0 requests-2.32.5 rich-14.2.0 rpds-py-0.29.0 shellingham-1.5.4 sniffio-1.3.1 tenacity-9.1.2 tqdm-4.67.1 typer-slim-0.20.0 typing-extensions-4.15.0 typing-inspection-0.4.2 tzdata-2025.2 urllib3-2.5.0 xxhash-3.6.0 yarl-1.22.0\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ "%pip install datasets haystack-ai couchbase-haystack openai pandas" ] @@ -44,9 +260,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + 
"/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], "source": [ "import getpass\n", "import base64\n", @@ -134,12 +359,12 @@ "\n", "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", "\n", - "**INDEX_NAME** is the name of the FTS search index we will use for vector search operations." + "**INDEX_NAME** is the name of the Search Vector Index used for vector search operations." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, "outputs": [], "source": [ @@ -153,15 +378,26 @@ "OPENAI_API_KEY = input(\"OpenAI API Key: \")\n", "\n", "# Check if the variables are correctly loaded\n", - "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, CB_OPENAI_API_KEY]):\n", + "if not all([CB_CONNECTION_STRING, CB_USERNAME, CB_PASSWORD, CB_BUCKET_NAME, CB_SCOPE_NAME, CB_COLLECTION_NAME, CB_INDEX_NAME, OPENAI_API_KEY]):\n", " raise ValueError(\"All configuration variables must be provided.\")" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Bucket 'b' already exists.\n", + "Scope 's' already exists.\n", + "Collection 'c' already exists in scope 's'.\n", + "Search Vector Index 'vector_search' already exists at scope level.\n" + ] + } + ], "source": [ "from couchbase.cluster import Cluster \n", "from couchbase.options import ClusterOptions\n", @@ -208,8 +444,8 @@ " collection_manager.create_collection(collection_name=CB_COLLECTION_NAME, scope_name=CB_SCOPE_NAME)\n", " 
print(f\"Collection '{CB_COLLECTION_NAME}' created successfully.\")\n", "\n", - "# Create search index from search_index.json file at scope level\n", - "with open('fts_index.json', 'r') as search_file:\n", + "# Create Search Vector Index from search_vector_index.json file at scope level\n", + "with open('search_vector_index.json', 'r') as search_file:\n", " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", " \n", " # Update search index definition with user inputs\n", @@ -229,13 +465,13 @@ " try:\n", " # Check if index exists at scope level\n", " existing_index = scope_search_manager.get_index(search_index_name)\n", - " print(f\"Search index '{search_index_name}' already exists at scope level.\")\n", + " print(f\"Search Vector Index '{search_index_name}' already exists at scope level.\")\n", " except Exception as e:\n", - " print(f\"Search index '{search_index_name}' does not exist at scope level. Creating search index from fts_index.json...\")\n", - " with open('fts_index.json', 'r') as search_file:\n", + " print(f\"Search Vector Index '{search_index_name}' does not exist at scope level. 
Creating index from search_vector_index.json...\")\n", + " with open('search_vector_index.json', 'r') as search_file:\n", " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", " scope_search_manager.upsert_index(search_index_definition)\n", - " print(f\"Search index '{search_index_name}' created successfully at scope level.\")" + " print(f\"Search Vector Index '{search_index_name}' created successfully at scope level.\")" ] }, { @@ -249,9 +485,32 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading TMDB dataset...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating train split: 100%|██████████| 4803/4803 [00:00<00:00, 123144.70 examples/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total movies found: 4803\n", + "Created 4800 documents with valid overviews\n" + ] + } + ], "source": [ "# Load TMDB dataset\n", "print(\"Loading TMDB dataset...\")\n", @@ -300,9 +559,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Couchbase document store initialized successfully.\n" + ] + } + ], "source": [ "# Initialize document store\n", "document_store = CouchbaseSearchDocumentStore(\n", @@ -335,7 +602,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "metadata": {}, "outputs": [], "source": [ @@ -360,7 +627,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "metadata": {}, "outputs": [], "source": [ @@ -380,9 +647,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "🚅 Components\n", + " - cleaner: 
DocumentCleaner\n", + " - embedder: OpenAIDocumentEmbedder\n", + " - writer: DocumentWriter\n", + "🛤️ Connections\n", + " - cleaner.documents -> embedder.documents (list[Document])\n", + " - embedder.documents -> writer.documents (list[Document])" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "# Create indexing pipeline\n", "index_pipeline = Pipeline()\n", @@ -406,16 +691,55 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 4it [00:06, 1.73s/it]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Processed batch 1: 100 documents\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 4it [00:06, 1.66s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Processed batch 2: 100 documents\n", + "\n", + "Successfully processed 200 documents\n", + "Sample document metadata: {'title': 'Four Rooms', 'genres': '[{\"id\": 80, \"name\": \"Crime\"}, {\"id\": 35, \"name\": \"Comedy\"}]', 'original_language': 'en', 'popularity': 22.87623, 'release_date': '1995-12-09', 'vote_average': 6.5, 'vote_count': 530, 'budget': 4000000, 'revenue': 4300000}\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], "source": [ "# Run indexing pipeline\n", "\n", "if documents:\n", " # Process documents in batches for better performance\n", " batch_size = 100\n", - " total_docs = len(documents)\n", + " total_docs = len(documents[:200])\n", " \n", " for i in range(0, total_docs, batch_size):\n", " batch = documents[i:i + batch_size]\n", @@ -439,9 +763,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + 
"text": [ + "PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RAG pipeline created successfully.\n" + ] + } + ], "source": [ "# Define RAG prompt template\n", "prompt_template = \"\"\"\n", @@ -489,12 +828,36 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=== Retrieved Documents ===\n", + "Id: 006b97c08110cb1b9b58e03943c91fa9412cfe7a2a22830ba5b9e3eb0c342344 Title: Run Lola Run\n", + "Id: 33543dab4c048c9467d632f319e02bca94da6f178250c14d26eabfb30911a823 Title: Mambo Italiano\n", + "Id: 94c55246e02c290767531f6359b5f44145191e3f2d62a3a64ed4718a666be9f2 Title: Good bye, Lenin!\n", + "Id: 00b4d1f455e45fbffa39f72be6de635bdcdb6b8a04289ba4aea41061700b9096 Title: Mean Streets\n", + "Id: 9241f819303fe61a25e05469856c01a8843d53a6ce7cec340bf0def848ddb470 Title: Magnolia\n", + "\n", + "=== Final Answer ===\n", + "Question: Why did Manni call Lolla?\n", + "Answer: Manni called Lola because he lost 100,000 DM in a subway train that belongs to a very bad guy, and he needs her help to raise the money within 20 minutes to prevent him from having to rob a store to get the money.\n", + "\n", + "Sources:\n", + "-> Run Lola Run\n", + "-> Mambo Italiano\n", + "-> Good bye, Lenin!\n", + "-> Mean Streets\n", + "-> Magnolia\n" + ] + } + ], "source": [ "# Example question\n", - "question = \"Who does Savva want to save from the vicious hyenas?\"\n", + "question = \"Why did Manni call Lolla?\"\n", "\n", "# Run the RAG pipeline\n", "result = rag_pipeline.run(\n", @@ -531,10 +894,10 @@ "source": [ "# 
Conclusion\n", "\n", - "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the BBC News dataset. This demonstrates how to combine vector search capabilities with large language models to answer questions about current events using real-time information.\n", + "In this tutorial, we built a Retrieval-Augmented Generation (RAG) system using Couchbase Capella, OpenAI, and Haystack with the BBC News dataset. This demonstrates how to combine Couchbase Search Vector Index with large language models to answer questions about current events using real-time information.\n", "\n", "The key components include:\n", - "- **Couchbase Capella** for vector storage and FTS-based retrieval\n", + "- **Couchbase Capella Search Vector Index** for vector storage and retrieval\n", "- **Haystack** for pipeline orchestration and component management \n", "- **OpenAI** for embeddings (`text-embedding-3-large`) and text generation (`gpt-4o`)\n", "\n", @@ -544,7 +907,7 @@ ], "metadata": { "kernelspec": { - "display_name": "haystack", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -558,7 +921,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/haystack/fts/frontmatter.md b/haystack/fts/frontmatter.md index d29ba8fe..5225174d 100644 --- a/haystack/fts/frontmatter.md +++ b/haystack/fts/frontmatter.md @@ -1,12 +1,12 @@ --- # frontmatter -path: "/tutorial-openai-haystack-rag-with-fts" -title: "Retrieval-Augmented Generation (RAG) with OpenAI, Haystack and Couchbase Search Vector Index" -short_title: "RAG with OpenAI, Haystack and Couchbase Search Vector Index" +path: "/tutorial-openai-haystack-rag-with-search-vector-index" +title: "RAG with OpenAI, Haystack, and Couchbase Search Vector Index" +short_title: "RAG with OpenAI, Haystack, and Search Vector Index" description: - - Learn how to build 
a semantic search engine using Couchbase's Search Vector Index. - - This tutorial demonstrates how to integrate Couchbase's vector search capabilities with the embeddings generated by OpenAI Services. - - You will understand how to perform Retrieval-Augmented Generation (RAG) using Haystack, Couchbase and OpenAI services. + - Learn how to build a semantic search engine using the Couchbase Search Vector Index. + - This tutorial demonstrates how Haystack integrates Couchbase Search Vector Index with embeddings generated by OpenAI services. + - Perform Retrieval-Augmented Generation (RAG) using Haystack with Couchbase and OpenAI services. content_type: tutorial filter: sdk technology: @@ -15,7 +15,7 @@ tags: - OpenAI - Artificial Intelligence - Haystack - - FTS + - Search Vector Index sdk_language: - python length: 60 Mins diff --git a/haystack/fts/fts_index.json b/haystack/fts/search_vector_index.json similarity index 100% rename from haystack/fts/fts_index.json rename to haystack/fts/search_vector_index.json diff --git a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index d07c49b0..b044ca4d 100644 --- a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -6,17 +6,17 @@ "source": [ "# Introduction\n", "\n", - "In this guide, we will walk you through building a Retrieval Augmented Generation (RAG) application using Couchbase Capella as the database, [gpt-4o](https://platform.openai.com/docs/models/gpt-4o) model as the large language model provided by OpenAI. We will use the [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings/embedding-models) model for generating embeddings.\n", + "In this guide, we will walk you through building a Retrieval Augmented Generation (RAG) application with Haystack orchestrating OpenAI models and Couchbase Capella. 
We will use the [gpt-4o](https://platform.openai.com/docs/models/gpt-4o) model for response generation and the [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings/embedding-models) model for generating embeddings.\n", "\n", "This notebook demonstrates how to build a RAG system using:\n", "- The [BBC News dataset](https://huggingface.co/datasets/RealTimeData/bbc_news_alltime) containing news articles\n", - "- Couchbase Capella as the vector store with GSI (Global Secondary Index) for vector search\n", + "- Couchbase Capella Hyperscale and Composite Vector Indexes for vector search\n", "- Haystack framework for the RAG pipeline\n", "- OpenAI for embeddings and text generation\n", "\n", - "We leverage Couchbase's Global Secondary Index (GSI) vector search capabilities to create and manage vector indexes, enabling efficient semantic search capabilities. GSI provides high-performance vector search with support for both Hyperscale Vector Indexes and Composite Vector Indexes, designed to scale to billions of vectors with low memory footprint and optimized concurrent operations.\n", + "We leverage Couchbase's Hyperscale and Composite Vector Indexes to enable efficient semantic search at scale. Hyperscale indexes prioritize high-throughput vector similarity across billions of vectors with a compact on-disk footprint, while Composite indexes blend scalar predicates with a vector column to narrow candidate sets before similarity search. For a deeper dive into how these indexes work, see the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).\n", "\n", - "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. 
This tutorial will equip you with the knowledge to create a fully functional RAG system using OpenAI Services and Haystack with Couchbase's advanced GSI vector search." + "Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial shows how to combine OpenAI Services and Haystack with Couchbase's Hyperscale and Composite Vector Indexes to deliver a production-ready RAG workflow." ] }, { @@ -73,9 +73,204 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: pandas>=2.1.4 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from -r requirements.txt (line 1)) (2.3.3)\n", + "Requirement already satisfied: datasets>=2.14.5 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from -r requirements.txt (line 2)) (4.4.1)\n", + "Collecting setuptools>=75.8.0 (from -r requirements.txt (line 3))\n", + " Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)\n", + "Requirement already satisfied: couchbase-haystack==2.* in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from -r requirements.txt (line 4)) (2.1.0)\n", + "Collecting transformers>=4.49.0 (from transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Downloading transformers-4.57.1-py3-none-any.whl.metadata (43 kB)\n", + "Collecting tensorflow>=2.18.0 (from -r requirements.txt (line 6))\n", + " Downloading tensorflow-2.20.0-cp313-cp313-macosx_12_0_arm64.whl.metadata (4.5 kB)\n", + "Requirement already satisfied: backports-datetime-fromisoformat in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.0.3)\n", + 
"Requirement already satisfied: couchbase==4.* in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from couchbase-haystack==2.*->-r requirements.txt (line 4)) (4.5.0)\n", + "Requirement already satisfied: haystack-ai>=2.3.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.20.0)\n", + "Requirement already satisfied: numpy>=1.26.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pandas>=2.1.4->-r requirements.txt (line 1)) (2.3.5)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pandas>=2.1.4->-r requirements.txt (line 1)) (2.9.0.post0)\n", + "Requirement already satisfied: pytz>=2020.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pandas>=2.1.4->-r requirements.txt (line 1)) (2025.2)\n", + "Requirement already satisfied: tzdata>=2022.7 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pandas>=2.1.4->-r requirements.txt (line 1)) (2025.2)\n", + "Requirement already satisfied: filelock in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (3.20.0)\n", + "Requirement already satisfied: pyarrow>=21.0.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (22.0.0)\n", + "Requirement already satisfied: dill<0.4.1,>=0.3.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (0.4.0)\n", + "Requirement already satisfied: requests>=2.32.2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (2.32.5)\n", + "Requirement already satisfied: httpx<1.0.0 in 
/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (0.28.1)\n", + "Requirement already satisfied: tqdm>=4.66.3 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (4.67.1)\n", + "Requirement already satisfied: xxhash in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (3.6.0)\n", + "Requirement already satisfied: multiprocess<0.70.19 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (0.70.18)\n", + "Requirement already satisfied: fsspec<=2025.10.0,>=2023.1.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (2025.10.0)\n", + "Requirement already satisfied: huggingface-hub<2.0,>=0.25.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (1.1.4)\n", + "Requirement already satisfied: packaging in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (25.0)\n", + "Requirement already satisfied: pyyaml>=5.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from datasets>=2.14.5->-r requirements.txt (line 2)) (6.0.3)\n", + "Requirement already satisfied: aiohttp!=4.0.0a0,!=4.0.0a1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (3.13.2)\n", + "Requirement already satisfied: anyio in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from httpx<1.0.0->datasets>=2.14.5->-r requirements.txt (line 2)) (4.11.0)\n", + "Requirement already satisfied: certifi in 
/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from httpx<1.0.0->datasets>=2.14.5->-r requirements.txt (line 2)) (2025.11.12)\n", + "Requirement already satisfied: httpcore==1.* in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from httpx<1.0.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.0.9)\n", + "Requirement already satisfied: idna in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from httpx<1.0.0->datasets>=2.14.5->-r requirements.txt (line 2)) (3.11)\n", + "Requirement already satisfied: h11>=0.16 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from httpcore==1.*->httpx<1.0.0->datasets>=2.14.5->-r requirements.txt (line 2)) (0.16.0)\n", + "Requirement already satisfied: hf-xet<2.0.0,>=1.2.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.2.0)\n", + "Requirement already satisfied: shellingham in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.5.4)\n", + "Requirement already satisfied: typer-slim in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets>=2.14.5->-r requirements.txt (line 2)) (0.20.0)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from huggingface-hub<2.0,>=0.25.0->datasets>=2.14.5->-r requirements.txt (line 2)) (4.15.0)\n", + "Collecting huggingface-hub<2.0,>=0.25.0 (from datasets>=2.14.5->-r requirements.txt (line 2))\n", + " Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)\n", + "Collecting regex!=2019.12.17 (from transformers>=4.49.0->transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Using cached 
regex-2025.11.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (40 kB)\n", + "Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers>=4.49.0->transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Using cached tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.8 kB)\n", + "Collecting safetensors>=0.4.3 (from transformers>=4.49.0->transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Using cached safetensors-0.6.2-cp38-abi3-macosx_11_0_arm64.whl.metadata (4.1 kB)\n", + "Collecting absl-py>=1.0.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)\n", + "Collecting astunparse>=1.6.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)\n", + "Collecting flatbuffers>=24.3.25 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Using cached flatbuffers-25.9.23-py2.py3-none-any.whl.metadata (875 bytes)\n", + "Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading gast-0.6.0-py3-none-any.whl.metadata (1.3 kB)\n", + "Collecting google_pasta>=0.1.1 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading google_pasta-0.2.0-py3-none-any.whl.metadata (814 bytes)\n", + "Collecting libclang>=13.0.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading libclang-18.1.1-1-py2.py3-none-macosx_11_0_arm64.whl.metadata (5.2 kB)\n", + "Collecting opt_einsum>=2.3.2 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading opt_einsum-3.4.0-py3-none-any.whl.metadata (6.3 kB)\n", + "Collecting protobuf>=5.28.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading protobuf-6.33.1-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)\n", + "Requirement already satisfied: six>=1.12.0 in 
/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from tensorflow>=2.18.0->-r requirements.txt (line 6)) (1.17.0)\n", + "Collecting termcolor>=1.1.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading termcolor-3.2.0-py3-none-any.whl.metadata (6.4 kB)\n", + "Collecting wrapt>=1.11.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading wrapt-2.0.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (9.0 kB)\n", + "Collecting grpcio<2.0,>=1.24.3 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading grpcio-1.76.0-cp313-cp313-macosx_11_0_universal2.whl.metadata (3.7 kB)\n", + "Collecting tensorboard~=2.20.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading tensorboard-2.20.0-py3-none-any.whl.metadata (1.8 kB)\n", + "Collecting keras>=3.10.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading keras-3.12.0-py3-none-any.whl.metadata (5.9 kB)\n", + "Collecting h5py>=3.11.0 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading h5py-3.15.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (3.0 kB)\n", + "Collecting ml_dtypes<1.0.0,>=0.5.1 (from tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading ml_dtypes-0.5.3-cp313-cp313-macosx_10_13_universal2.whl.metadata (8.9 kB)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from requests>=2.32.2->datasets>=2.14.5->-r requirements.txt (line 2)) (3.4.4)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from requests>=2.32.2->datasets>=2.14.5->-r requirements.txt (line 2)) (2.5.0)\n", + "Collecting markdown>=2.6.8 (from tensorboard~=2.20.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading markdown-3.10-py3-none-any.whl.metadata (5.1 kB)\n", + "Collecting pillow (from 
tensorboard~=2.20.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading pillow-12.0.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.8 kB)\n", + "Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard~=2.20.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading tensorboard_data_server-0.7.2-py3-none-any.whl.metadata (1.1 kB)\n", + "Collecting werkzeug>=1.0.1 (from tensorboard~=2.20.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading werkzeug-3.1.3-py3-none-any.whl.metadata (3.7 kB)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (2.6.1)\n", + "Requirement already satisfied: aiosignal>=1.4.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.4.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (25.4.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.8.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (6.7.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from 
aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (0.4.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspec[http]<=2025.10.0,>=2023.1.0->datasets>=2.14.5->-r requirements.txt (line 2)) (1.22.0)\n", + "Collecting wheel<1.0,>=0.23.0 (from astunparse>=1.6.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)\n", + "Requirement already satisfied: docstring-parser in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.17.0)\n", + "Requirement already satisfied: filetype in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (1.2.0)\n", + "Requirement already satisfied: haystack-experimental in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.14.2)\n", + "Requirement already satisfied: jinja2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (3.1.6)\n", + "Requirement already satisfied: jsonschema in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (4.25.1)\n", + "Requirement already satisfied: lazy-imports in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (1.1.0)\n", + "Requirement already satisfied: more-itertools in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from 
haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (10.8.0)\n", + "Requirement already satisfied: networkx in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (3.5)\n", + "Requirement already satisfied: openai>=1.99.2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.8.0)\n", + "Requirement already satisfied: posthog!=3.12.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (7.0.1)\n", + "Requirement already satisfied: pydantic in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.12.4)\n", + "Requirement already satisfied: tenacity!=8.4.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (9.1.2)\n", + "Requirement already satisfied: rich in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6)) (14.2.0)\n", + "Collecting namex (from keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading namex-0.1.0-py3-none-any.whl.metadata (322 bytes)\n", + "Collecting optree (from keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6))\n", + " Downloading optree-0.18.0-cp313-cp313-macosx_11_0_arm64.whl.metadata (34 kB)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from openai>=1.99.2->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (1.9.0)\n", + "Requirement already satisfied: jiter<1,>=0.10.0 in 
/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from openai>=1.99.2->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.12.0)\n", + "Requirement already satisfied: sniffio in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from openai>=1.99.2->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (1.3.1)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pydantic->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.41.5 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pydantic->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.41.5)\n", + "Requirement already satisfied: typing-inspection>=0.4.2 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from pydantic->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.4.2)\n", + "Requirement already satisfied: backoff>=1.10.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from posthog!=3.12.0->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (2.2.1)\n", + "Collecting torch>=2.2 (from transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Downloading torch-2.9.1-cp313-none-macosx_11_0_arm64.whl.metadata (30 kB)\n", + "Collecting accelerate>=0.26.0 (from transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)\n", + "Requirement already satisfied: psutil in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from accelerate>=0.26.0->transformers[torch]>=4.49.0->-r requirements.txt (line 5)) (7.1.3)\n", + "Collecting sympy>=1.13.3 (from torch>=2.2->transformers[torch]>=4.49.0->-r 
requirements.txt (line 5))\n", + " Using cached sympy-1.14.0-py3-none-any.whl.metadata (12 kB)\n", + "Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch>=2.2->transformers[torch]>=4.49.0->-r requirements.txt (line 5))\n", + " Using cached mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)\n", + "Requirement already satisfied: MarkupSafe>=2.1.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from werkzeug>=1.0.1->tensorboard~=2.20.0->tensorflow>=2.18.0->-r requirements.txt (line 6)) (3.0.3)\n", + "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from jsonschema->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (2025.9.1)\n", + "Requirement already satisfied: referencing>=0.28.4 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from jsonschema->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.37.0)\n", + "Requirement already satisfied: rpds-py>=0.7.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from jsonschema->haystack-ai>=2.3.0->couchbase-haystack==2.*->-r requirements.txt (line 4)) (0.29.0)\n", + "Requirement already satisfied: markdown-it-py>=2.2.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from rich->keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6)) (4.0.0)\n", + "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from rich->keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6)) (2.19.2)\n", + "Requirement already satisfied: mdurl~=0.1 in /Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.10.0->tensorflow>=2.18.0->-r requirements.txt (line 6)) (0.1.2)\n", + "Requirement already satisfied: click>=8.0.0 in 
/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages (from typer-slim->huggingface-hub<2.0,>=0.25.0->datasets>=2.14.5->-r requirements.txt (line 2)) (8.3.1)\n", + "Using cached setuptools-80.9.0-py3-none-any.whl (1.2 MB)\n", + "Downloading transformers-4.57.1-py3-none-any.whl (12.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.0/12.0 MB\u001b[0m \u001b[31m17.9 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hUsing cached huggingface_hub-0.36.0-py3-none-any.whl (566 kB)\n", + "Using cached tokenizers-0.22.1-cp39-abi3-macosx_11_0_arm64.whl (2.9 MB)\n", + "Downloading tensorflow-2.20.0-cp313-cp313-macosx_12_0_arm64.whl (200.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m200.7/200.7 MB\u001b[0m \u001b[31m16.7 MB/s\u001b[0m \u001b[33m0:00:12\u001b[0mm0:00:01\u001b[0m00:01\u001b[0m\n", + "\u001b[?25hDownloading grpcio-1.76.0-cp313-cp313-macosx_11_0_universal2.whl (11.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m11.8/11.8 MB\u001b[0m \u001b[31m18.7 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0mm0:00:01\u001b[0m0:01\u001b[0m\n", + "\u001b[?25hDownloading ml_dtypes-0.5.3-cp313-cp313-macosx_10_13_universal2.whl (663 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m663.8/663.8 kB\u001b[0m \u001b[31m10.7 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading tensorboard-2.20.0-py3-none-any.whl (5.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.5/5.5 MB\u001b[0m \u001b[31m16.6 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hDownloading tensorboard_data_server-0.7.2-py3-none-any.whl (2.4 kB)\n", + "Downloading absl_py-2.3.1-py3-none-any.whl (135 kB)\n", + "Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)\n", + "Downloading 
wheel-0.45.1-py3-none-any.whl (72 kB)\n", + "Using cached flatbuffers-25.9.23-py2.py3-none-any.whl (30 kB)\n", + "Downloading gast-0.6.0-py3-none-any.whl (21 kB)\n", + "Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)\n", + "Downloading h5py-3.15.1-cp313-cp313-macosx_11_0_arm64.whl (2.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.8/2.8 MB\u001b[0m \u001b[31m21.1 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading keras-3.12.0-py3-none-any.whl (1.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m17.6 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading libclang-18.1.1-1-py2.py3-none-macosx_11_0_arm64.whl (25.8 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m25.8/25.8 MB\u001b[0m \u001b[31m18.7 MB/s\u001b[0m \u001b[33m0:00:01\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hDownloading markdown-3.10-py3-none-any.whl (107 kB)\n", + "Downloading opt_einsum-3.4.0-py3-none-any.whl (71 kB)\n", + "Downloading protobuf-6.33.1-cp39-abi3-macosx_10_9_universal2.whl (427 kB)\n", + "Using cached regex-2025.11.3-cp313-cp313-macosx_11_0_arm64.whl (288 kB)\n", + "Using cached safetensors-0.6.2-cp38-abi3-macosx_11_0_arm64.whl (432 kB)\n", + "Downloading termcolor-3.2.0-py3-none-any.whl (7.7 kB)\n", + "Downloading accelerate-1.11.0-py3-none-any.whl (375 kB)\n", + "Downloading torch-2.9.1-cp313-none-macosx_11_0_arm64.whl (74.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m74.5/74.5 MB\u001b[0m \u001b[31m21.5 MB/s\u001b[0m \u001b[33m0:00:03\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hUsing cached sympy-1.14.0-py3-none-any.whl (6.3 MB)\n", + "Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)\n", + "Downloading werkzeug-3.1.3-py3-none-any.whl (224 kB)\n", + "Downloading 
wrapt-2.0.1-cp313-cp313-macosx_11_0_arm64.whl (61 kB)\n", + "Downloading namex-0.1.0-py3-none-any.whl (5.9 kB)\n", + "Downloading optree-0.18.0-cp313-cp313-macosx_11_0_arm64.whl (346 kB)\n", + "Downloading pillow-12.0.0-cp313-cp313-macosx_11_0_arm64.whl (4.7 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.7/4.7 MB\u001b[0m \u001b[31m17.9 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hInstalling collected packages: namex, mpmath, libclang, flatbuffers, wrapt, wheel, werkzeug, termcolor, tensorboard-data-server, sympy, setuptools, safetensors, regex, protobuf, pillow, optree, opt_einsum, ml_dtypes, markdown, h5py, grpcio, google_pasta, gast, absl-py, torch, tensorboard, huggingface-hub, astunparse, tokenizers, keras, accelerate, transformers, tensorflow\n", + "\u001b[2K Attempting uninstall: huggingface-hub[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", + "\u001b[2K Found existing installation: huggingface_hub 1.1.4━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", + "\u001b[2K Uninstalling huggingface_hub-1.1.4:[90m╺\u001b[0m\u001b[90m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", + "\u001b[2K Successfully uninstalled huggingface_hub-1.1.4m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m33/33\u001b[0m [tensorflow]3\u001b[0m [tensorflow]s]ub]\n", + "\u001b[1A\u001b[2KSuccessfully installed absl-py-2.3.1 accelerate-1.11.0 astunparse-1.6.3 flatbuffers-25.9.23 gast-0.6.0 google_pasta-0.2.0 grpcio-1.76.0 h5py-3.15.1 huggingface-hub-0.36.0 keras-3.12.0 libclang-18.1.1 markdown-3.10 ml_dtypes-0.5.3 mpmath-1.3.0 namex-0.1.0 opt_einsum-3.4.0 optree-0.18.0 pillow-12.0.0 protobuf-6.33.1 regex-2025.11.3 safetensors-0.6.2 setuptools-80.9.0 sympy-1.14.0 tensorboard-2.20.0 tensorboard-data-server-0.7.2 tensorflow-2.20.0 
termcolor-3.2.0 tokenizers-0.22.1 torch-2.9.1 transformers-4.57.1 werkzeug-3.1.3 wheel-0.45.1 wrapt-2.0.1\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ "# Install required packages\n", "%pip install -r requirements.txt" @@ -91,9 +286,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], "source": [ "import getpass\n", "import base64\n", @@ -139,12 +343,12 @@ "\n", "**OPENAI_API_KEY** is your OpenAI API key which can be obtained from your OpenAI dashboard at [platform.openai.com](https://platform.openai.com/api-keys).\n", "\n", - "**INDEX_NAME** is the name of the GSI vector index we will create for vector search operations." + "**INDEX_NAME** is the name of the Hyperscale or Composite Vector Index we will create for vector search operations." 
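The notebook cells around this hunk collect settings such as `OPENAI_API_KEY` and `INDEX_NAME` interactively with `getpass`. For non-interactive runs (CI, scripted re-execution) it helps to read each value from the environment first and only fall back to prompting. A minimal stdlib-only sketch of that pattern — the helper name and the example variables are illustrative, not necessarily the notebook's exact ones:

```python
import getpass
import os


def read_setting(name: str, secret: bool = False) -> str:
    """Read a setting from the environment, prompting interactively only if unset."""
    value = os.environ.get(name)
    if not value:
        prompt = f"Enter {name}: "
        # Secrets (API keys, passwords) are read without echoing to the terminal.
        value = getpass.getpass(prompt) if secret else input(prompt)
    return value


# Provide a default so the cell also works in a fresh, non-interactive session.
os.environ.setdefault("INDEX_NAME", "vector_search_index")
index_name = read_setting("INDEX_NAME")
print(index_name)  # "vector_search_index" unless INDEX_NAME was already set
```

This keeps the interactive flow of the original cells while letting the same notebook run unattended when the variables are exported beforehand.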
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ @@ -172,7 +376,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ @@ -194,9 +398,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:36:34,556 - INFO - Successfully connected to the Couchbase cluster\n" + ] + } + ], "source": [ "try:\n", " # Initialize the Couchbase Cluster\n", @@ -224,9 +436,19 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Bucket 'b' already exists.\n", + "Scope 's' already exists.\n", + "Collection 'c' already exists in scope 's'.\n" + ] + } + ], "source": [ "from couchbase.management.buckets import CreateBucketSettings\n", "import json\n", @@ -277,9 +499,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loaded the BBC News dataset with 2687 rows\n" + ] + } + ], "source": [ "try:\n", " news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split=\"train\")\n", @@ -297,9 +527,20 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 8, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dataset columns: ['title', 'published_date', 'authors', 'description', 'section', 'content', 'link', 'top_image']\n", + "\n", + "First two examples:\n", + "{'title': [\"Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News\", 'Lockdown DIY linked to Walleys Quarry gases - BBC News'], 'published_date': 
['2024-12-01', '2024-12-01'], 'authors': ['https://www.facebook.com/bbcnews', 'https://www.facebook.com/bbcnews'], 'description': [\"Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.\", 'An academic says an increase in plasterboard sent to landfill could be behind a spike in smells.'], 'section': ['Asia', 'Stoke & Staffordshire'], 'content': ['Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery\\n\\nImran Khan\\'s wife, Bushra Bibi, encouraged protesters into the heart of Pakistan\\'s capital, Islamabad\\n\\nA charred lorry, empty tear gas shells and posters of former Pakistan Prime Minister Imran Khan - it was all that remained of a massive protest led by Khan’s wife, Bushra Bibi, that had sent the entire capital into lockdown. Just a day earlier, faith healer Bibi - wrapped in a white shawl, her face covered by a white veil - stood atop a shipping container on the edge of the city as thousands of her husband’s devoted followers waved flags and chanted slogans beneath her. It was the latest protest to flare since Khan, the 72-year-old cricketing icon-turned-politician, was jailed more than a year ago after falling foul of the country\\'s influential military which helped catapult him to power. “My children and my brothers! You have to stand with me,” Bibi cried on Tuesday afternoon, her voice cutting through the deafening roar of the crowd. “But even if you don’t,” she continued, “I will still stand firm. “This is not just about my husband. It is about this country and its leader.” It was, noted some watchers of Pakistani politics, her political debut. But as the sun rose on Wednesday morning, there was no sign of Bibi, nor the thousands of protesters who had marched through the country to the heart of the capital, demanding the release of their jailed leader. 
While other PMs have fallen out with Pakistan\\'s military in the past, Khan\\'s refusal to stay quiet behind bars is presenting an extraordinary challenge - escalating the standoff and leaving the country deeply divided. Exactly what happened to the so-called “final march”, and Bibi, when the city went dark is still unclear. All eyewitnesses like Samia* can say for certain is that the lights went out suddenly, plunging D Chowk, the square where they had gathered, into blackness.\\n\\nWithin a day of arriving, the protesters had scattered - leaving behind Bibi\\'s burnt-out vehicle\\n\\nAs loud screams and clouds of tear gas blanketed the square, Samia describes holding her husband on the pavement, bloodied from a gun shot to his shoulder. \"Everyone was running for their lives,\" she later told BBC Urdu from a hospital in Islamabad, adding it was \"like doomsday or a war\". \"His blood was on my hands and the screams were unending.” But how did the tide turn so suddenly and decisively? Just hours earlier, protesters finally reached D Chowk late afternoon on Tuesday. They had overcome days of tear gas shelling and a maze of barricaded roads to get to the city centre. Many of them were supporters and workers of the Pakistan Tehreek-e-Insaf (PTI), the party led by Khan. He had called for the march from his jail cell, where he has been for more than a year on charges he says are politically motivated. Now Bibi - his third wife, a woman who had been largely shrouded in mystery and out of public view since their unexpected wedding in 2018 - was leading the charge. 
“We won’t go back until we have Khan with us,” she declared as the march reached D Chowk, deep in the heart of Islamabad’s government district.\\n\\nThousands had marched for days to reach Islamabad, demanding former Prime Minister Imran Khan be released from jail\\n\\nInsiders say even the choice of destination - a place where her husband had once led a successful sit in - was Bibi’s, made in the face of other party leader’s opposition, and appeals from the government to choose another gathering point. Her being at the forefront may have come as a surprise. Bibi, only recently released from prison herself, is often described as private and apolitical. Little is known about her early life, apart from the fact she was a spiritual guide long before she met Khan. Her teachings, rooted in Sufi traditions, attracted many followers - including Khan himself. Was she making her move into politics - or was her sudden appearance in the thick of it a tactical move to keep Imran Khan’s party afloat while he remains behind bars? For critics, it was a move that clashed with Imran Khan’s oft-stated opposition to dynastic politics. There wasn’t long to mull the possibilities. After the lights went out, witnesses say that police started firing fresh rounds of tear gas at around 21:30 local time (16:30 GMT). The crackdown was in full swing just over an hour later. At some point, amid the chaos, Bushra Bibi left. Videos on social media appeared to show her switching cars and leaving the scene. The BBC couldn’t verify the footage. By the time the dust settled, her container had already been set on fire by unknown individuals. By 01:00 authorities said all the protesters had fled.\\n\\nSecurity was tight in the city, and as night fell, lights were switched off - leaving many in the dark as to what exactly happened next\\n\\nEyewitnesses have described scenes of chaos, with tear gas fired and police rounding up protesters. 
One, Amin Khan, said from behind an oxygen mask that he joined the march knowing that, \"either I will bring back Imran Khan or I will be shot\". The authorities have have denied firing at the protesters. They also said some of the protesters were carrying firearms. The BBC has seen hospital records recording patients with gunshot injuries. However, government spokesperson Attaullah Tarar told the BBC that hospitals had denied receiving or treating gunshot wound victims. He added that \"all security personnel deployed on the ground have been forbidden\" from having live ammunition during protests. But one doctor told BBC Urdu that he had never done so many surgeries for gunshot wounds in a single night. \"Some of the injured came in such critical condition that we had to start surgery right away instead of waiting for anaesthesia,\" he said. While there has been no official toll released, the BBC has confirmed with local hospitals that at least five people have died. Police say at least 500 protesters were arrested that night and are being held in police stations. The PTI claims some people are missing. And one person in particular hasn’t been seen in days: Bushra Bibi.\\n\\nThe next morning, the protesters were gone - leaving behind just wrecked cars and smashed glass\\n\\nOthers defended her. “It wasn’t her fault,” insisted another. “She was forced to leave by the party leaders.” Political commentators have been more scathing. “Her exit damaged her political career before it even started,” said Mehmal Sarfraz, a journalist and analyst. But was that even what she wanted? 
Khan has previously dismissed any thought his wife might have her own political ambitions - “she only conveys my messages,” he said in a statement attributed to him on his X account.\\n\\nImran Khan and Bushra Bibi, pictured here arriving at court in May 2023, married in 2018\\n\\nSpeaking to BBC Urdu, analyst Imtiaz Gul calls her participation “an extraordinary step in extraordinary circumstances\". Gul believes Bushra Bibi’s role today is only about “keeping the party and its workers active during Imran Khan’s absence”. It is a feeling echoed by some PTI members, who believe she is “stepping in only because Khan trusts her deeply”. Insiders, though, had often whispered that she was pulling the strings behind the scenes - advising her husband on political appointments and guiding high-stakes decisions during his tenure. A more direct intervention came for the first time earlier this month, when she urged a meeting of PTI leaders to back Khan’s call for a rally. Pakistan’s defence minister Khawaja Asif accused her of “opportunism”, claiming she sees “a future for herself as a political leader”. But Asma Faiz, an associate professor of political science at Lahore University of Management Sciences, suspects the PTI’s leadership may have simply underestimated Bibi. “It was assumed that there was an understanding that she is a non-political person, hence she will not be a threat,” she told the AFP news agency. “However, the events of the last few days have shown a different side of Bushra Bibi.” But it probably doesn’t matter what analysts and politicians think. Many PTI supporters still see her as their connection to Imran Khan. It was clear her presence was enough to electrify the base. “She is the one who truly wants to get him out,” says Asim Ali, a resident of Islamabad. “I trust her. 
Absolutely!”', 'Walleys Quarry was ordered not to accept any new waste as of Friday\\n\\nA chemist and former senior lecturer in environmental sustainability has said powerful odours from a controversial landfill site may be linked to people doing more DIY during the Covid-19 pandemic. Complaints about Walleys Quarry in Silverdale, Staffordshire – which was ordered to close as of Friday – increased significantly during and after coronavirus lockdowns. Issuing the closure notice, the Environment Agency described management of the site as poor, adding it had exhausted all other enforcement tactics at premises where gases had been noxious and periodically above emission level guidelines - which some campaigners linked to ill health locally. Dr Sharon George, who used to teach at Keele University, said she had been to the site with students and found it to be clean and well-managed, and suggested an increase in plasterboard heading to landfills in 2020 could be behind a spike in stenches.\\n\\n“One of the materials that is particularly bad for producing odours and awful emissions is plasterboard,\" she said. “That’s one of the theories behind why Walleys Quarry got worse at that time.” She said the landfill was in a low-lying area, and that some of the gases that came from the site were quite heavy. “They react with water in the atmosphere, so some of the gases you smell can be quite awful and not very good for our health. “It’s why, on some days when it’s colder and muggy and a bit misty, you can smell it more.” Dr George added: “With any landfill, you’re putting things into the ground – and when you put things into the ground, if they can they will start to rot. When they start to rot they’re going to give off gases.” She believed Walleys Quarry’s proximity to people’s homes was another major factor in the amount of complaints that arose from its operation. 
“If you’ve got a gas that people can smell, they’re going to report it much more than perhaps a pollutant that might go unnoticed.”\\n\\nRebecca Currie said she did not think the site would ever be closed\\n\\nLocal resident and campaigner Rebecca Currie said the closure notice served to Walleys Quarry was \"absolutely amazing\". Her son Matthew has had breathing difficulties after being born prematurely with chronic lung disease, and Ms Currie says the site has made his symptoms worse. “I never thought this day was going to happen,” she explained. “We fought and fought for years.” She told BBC Midlands Today: “Our community have suffered. We\\'ve got kids who are really poorly, people have moved homes.”\\n\\nComplaints about Walleys Quarry to Newcastle-under-Lyme Borough Council exceeded 700 in November, the highest amount since 2021 according to council leader Simon Tagg. The Environment Agency (EA), which is responsible for regulating landfill sites, said it had concluded further operation at the site could result in \"significant long-term pollution\". A spokesperson for Walley\\'s Quarry Ltd said the firm rejected the EA\\'s accusations of poor management, and would be challenging the closure notice. Dr George said she believed the EA was likely to be erring on the side of caution and public safety, adding safety standards were strict. She said a lack of landfill space in the country overall was one of the broader issues that needed addressing. 
“As people, we just keep using stuff and then have nowhere to put it, and then when we end up putting it in places like Walleys Quarry that is next to houses, I think that’s where the problems are.”\\n\\nTell us which stories we should cover in Staffordshire'], 'link': ['http://www.bbc.co.uk/news/articles/cvg02lvj1e7o', 'http://www.bbc.co.uk/news/articles/c5yg1v16nkpo'], 'top_image': ['https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/9975/live/b22229e0-ad5a-11ef-83bc-1153ed943d1c.jpg', 'https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/0896/live/55209f80-adb2-11ef-8f6c-f1a86bb055ec.jpg']}\n" + ] + } + ], "source": [ "# Print the first two examples from the dataset\n", "print(\"Dataset columns:\", news_dataset.column_names)\n", @@ -318,9 +559,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 9, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "We have 1749 unique articles in our database.\n" + ] + } + ], "source": [ "import hashlib\n", "\n", @@ -349,9 +598,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 10, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully created embedding models\n" + ] + } + ], "source": [ "try:\n", " # Set up the document embedder for processing documents\n", @@ -381,9 +638,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 11, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:36:51,924 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", + "Embedding dimension: 3072\n" + ] + } + ], "source": [ "test_result = rag_embedder.run(text=\"this is a test sentence\")\n", "test_embedding = test_result[\"embedding\"]\n", @@ -394,18 +660,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Setting Up 
the Couchbase GSI Document Store\n", - "The Couchbase GSI document store is set up to store the documents from the dataset using Couchbase's Global Secondary Index vector search capabilities. This document store is optimized for high-performance vector similarity search operations and can scale to billions of vectors using Haystack's Couchbase integration." + "# Setting Up the Couchbase Vector Document Store\n", + "The Couchbase document store configuration enables both Hyperscale and Composite Vector Indexes. This stores documents from the dataset while keeping embeddings ready for high-performance semantic search, and it scales to billions of vectors through Haystack's Couchbase integration." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully created Couchbase vector document store\n" + ] + } + ], "source": [ "try:\n", - " # Create the Couchbase GSI document store\n", + " # Create the Couchbase vector document store\n", " document_store = CouchbaseQueryDocumentStore(\n", " cluster_connection_string=Secret.from_token(CB_CONNECTION_STRING),\n", " authenticator=CouchbasePasswordAuthenticator(\n", @@ -421,9 +695,9 @@ " search_type=QueryVectorSearchType.ANN,\n", " similarity=QueryVectorSearchSimilarity.L2\n", " )\n", - " print(\"Successfully created GSI document store\")\n", + " print(\"Successfully created Couchbase vector document store\")\n", "except Exception as e:\n", - " raise ValueError(f\"Failed to create GSI document store: {str(e)}\")" + " raise ValueError(f\"Failed to create Couchbase vector document store: {str(e)}\")" ] }, { @@ -438,9 +712,24 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document content preview:\n", + "Content: Bushra Bibi led a protest 
to free Imran Khan - what happened next is a mystery\n", + "\n", + "Imran Khan's wife, Bushra Bibi, encouraged protesters into the heart of Pakistan's capital, Islamabad\n", + "\n", + "A charred lorry, ...\n", + "Metadata: {'title': \"Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News\", 'description': \"Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.\", 'published_date': '2024-12-01', 'link': 'http://www.bbc.co.uk/news/articles/cvg02lvj1e7o'}\n", + "Created 1749 documents\n" + ] + } + ], "source": [ "haystack_documents = []\n", "# Process and store documents\n", @@ -487,9 +776,27 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "🚅 Components\n", + " - cleaner: DocumentCleaner\n", + " - embedder: OpenAIDocumentEmbedder\n", + " - writer: DocumentWriter\n", + "🛤️ Connections\n", + " - cleaner.documents -> embedder.documents (list[Document])\n", + " - embedder.documents -> writer.documents (list[Document])" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "\n", "\n", @@ -517,14 +824,101 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 19, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:29,794 - INFO - Running component cleaner\n", + "2025-11-17 14:42:29,800 - INFO - Running component embedder\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 0it [00:00, ?it/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:31,149 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 1it [00:02, 
2.94s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:33,448 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 2it [00:04, 2.35s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:35,608 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 3it [00:06, 1.85s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:36,509 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Calculating embeddings: 4it [00:07, 1.87s/it]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:37,301 - INFO - Running component writer\n", + "Indexed 100 document chunks\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], "source": [ "# Run the indexing pipeline\n", "if haystack_documents:\n", - " result = indexing_pipeline.run({\"cleaner\": {\"documents\": haystack_documents}})\n", - " print(f\"Indexed {len(result['writer']['documents_written'])} document chunks\")\n", + " result = indexing_pipeline.run({\"cleaner\": {\"documents\": haystack_documents[:100]}})\n", + " print(f\"Indexed {result['writer']['documents_written']} document chunks\")\n", "else:\n", " print(\"No documents created. 
Skipping indexing.\")\n" ] @@ -543,9 +937,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 20, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:40,424 - INFO - Successfully created the OpenAI generator\n" + ] + } + ], "source": [ "try:\n", " # Set up the LLM generator\n", @@ -574,9 +976,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 21, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:41,774 - WARNING - PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n", + "Successfully created RAG pipeline\n" + ] + } + ], "source": [ "# Define RAG prompt template\n", "prompt_template = \"\"\"\n", @@ -628,14 +1039,47 @@ "\n", "This demonstrates how our system combines the power of vector search with language model capabilities to provide accurate, contextual answers based on the information in our database.\n", "\n", - "**Note:** By default, without any GSI vector index, Couchbase uses linear brute force search which compares the query vector against every document in the collection. This works for small datasets but can become slow as the dataset grows." + "**Note:** By default, without any Hyperscale or Composite Vector Index, Couchbase falls back to linear brute-force search that compares the query vector against every document in the collection. This works for small datasets but can become slow as the dataset grows." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 22, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:42:43,017 - INFO - Running component query_embedder\n", + "2025-11-17 14:42:43,636 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", + "2025-11-17 14:42:43,853 - INFO - Running component retriever\n", + "2025-11-17 14:42:43,990 - INFO - Running component prompt_builder\n", + "2025-11-17 14:42:43,990 - INFO - Running component llm\n", + "2025-11-17 14:42:45,914 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", + "2025-11-17 14:42:45,935 - INFO - Running component answer_builder\n", + "=== Retrieved Documents ===\n", + "Id: 3bd611696904f038e5ceff530ab97539a34e6893001e6e71d5518ecfa4a729ff Title: New Zealand v England: Brydon Carse repays faith with Christchurch haul - BBC Sport\n", + "Id: 5cc142cd0535bcb62c2bce08e87714a2ddab9590c06e2935213c0341360953b1 Title: Ireland 22-19 Australia: 'No emotion' for Andy Farrell in winning send-off before Lions sabbatical - BBC Sport\n", + "Id: 96601a80eaf87c11a39a5cada79bdaa50e227cb03ed36e3bb0573c6297767702 Title: Watch: CCTV shows how Daniel Khalife escaped - BBC News\n", + "Id: 875d0e214cbc2d7d0bd475906cf50b74d042a831c7b00b301c39a5bb2d387f07 Title: Troy Deeney's Team of the Week: Saka, Kluivert, Schade, Rashford - BBC Sport\n", + "Id: 62b98b01f3f453b4046fc92fb97cd73022afc2a6e70133c48985eecc658c3db7 Title: World Athletic Awards: Letsile Tebogo and Sifan Hassan named athletes of the year - BBC Sport\n", + "\n", + "=== Final Answer ===\n", + "Question: Who will Daniel Dubois fight in Saudi Arabia on 22 February?\n", + "Answer: The documents do not provide information on who Daniel Dubois will fight in Saudi Arabia on 22 February.\n", + "\n", + "Sources:\n", + "-> New Zealand v England: Brydon Carse repays faith with Christchurch haul - BBC 
Sport\n", + "-> Ireland 22-19 Australia: 'No emotion' for Andy Farrell in winning send-off before Lions sabbatical - BBC Sport\n", + "-> Watch: CCTV shows how Daniel Khalife escaped - BBC News\n", + "-> Troy Deeney's Team of the Week: Saka, Kluivert, Schade, Rashford - BBC Sport\n", + "-> World Athletic Awards: Letsile Tebogo and Sifan Hassan named athletes of the year - BBC Sport\n", + "\n", + "Optimized Hyperscale Vector Search Results (completed in 2.92 seconds):\n" + ] + } + ], "source": [ "# Sample query from the dataset\n", "\n", @@ -670,7 +1114,7 @@ " for doc in answer.documents:\n", " print(f\"-> {doc.meta['title']}\")\n", " # Display search results\n", - " print(f\"\\nOptimized GSI Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(f\"\\nOptimized Hyperscale Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", " #print(result[\"generator\"][\"replies\"][0])\n", "\n", "except Exception as e:\n", @@ -681,39 +1125,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Create GSI Vector Index (Optimized Search)\n", - "\n", - "While the above RAG system works effectively, we can significantly improve query performance by leveraging Couchbase's advanced GSI vector search capabilities.\n", + "# Create Hyperscale or Composite Vector Indexes\n", "\n", - "Couchbase offers three types of vector indexes, but for GSI-based vector search we focus on two main types:\n", - "\n", - "In this section, we'll set up the Couchbase vector store using GSI (Global Secondary Index) for high-performance vector search.\n", - "\n", - "GSI vector search supports two main index types:\n", + "While the above RAG system works effectively, you can significantly improve query performance by enabling Couchbase Capella's Hyperscale or Composite Vector Indexes.\n", "\n", "## Hyperscale Vector Indexes\n", "- Specifically designed for vector searches\n", - "- Perform vector similarity and semantic searches faster than the 
other types of indexes\n", - "- Designed to scale to billions of vectors\n", - "- Most of the index resides in a highly optimized format on disk\n", - "- High accuracy even for vectors with a large number of dimensions\n", - "- Supports concurrent searches and inserts for datasets that are constantly changing\n", + "- Perform vector similarity and semantic searches faster than other index types\n", + "- Scale to billions of vectors while keeping most of the structure in an optimized on-disk format\n", + "- Maintain high accuracy even for vectors with a large number of dimensions\n", + "- Support concurrent searches and inserts for constantly changing datasets\n", "\n", - "Use this type of index when you want to primarily query vector values with a low memory footprint. In general, Hyperscale Vector indexes are the best choice for most applications that use vector searches.\n", + "Use this type of index when you primarily query vector values and need low-latency similarity search at scale. In general, Hyperscale Vector Indexes are the best starting point for most vector search workloads.\n", "\n", "## Composite Vector Indexes\n", - "- Combines a standard Global Secondary index (GSI) with a single vector column\n", - "- Designed for searches using a single vector value along with standard scalar values that filter out large portions of the dataset. 
The scalar attributes in a query reduce the number of vectors the Couchbase Server has to compare when performing a vector search to find similar vectors.\n", - "- Consume a moderate amount of memory and can index billions of documents.\n", - "- Work well for cases where your queries are highly selective — returning a small number of results from a large dataset\n", + "- Combine scalar filters with a single vector column in the same index definition\n", + "- Designed for searches that apply one vector value alongside scalar attributes that remove large portions of the dataset before similarity scoring\n", + "- Consume a moderate amount of memory and can index tens of millions to billions of documents\n", + "- Excel when your queries must return a small, highly targeted result set\n", "\n", - "Use Composite Vector indexes when you want to perform searches of documents using both scalars and a vector where the scalar values filter out large portions of the dataset.\n", + "Use Composite Vector Indexes when you want to perform searches that blend scalar predicates and vector similarity so that the scalar filters tighten the candidate set.\n", "\n", - "For more details, see the [Couchbase Vector Index documentation](https://docs.couchbase.com/server/current/vector-index/use-vector-indexes.html).\n", + "For an in-depth comparison and tuning guidance, review the [Couchbase vector index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html) and the [overview of Capella vector indexes](https://docs.couchbase.com/cloud/vector-index/vectors-and-indexes-overview.html).\n", "\n", "## Understanding Index Configuration (Couchbase 8.0 Feature)\n", "\n", - "The index_description parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n", + "The `index_description` parameter controls how Couchbase optimizes vector storage and search performance through centroids and quantization:\n",
"Format: `IVF[<centroids>],{SQ<bits>|PQ<subquantizers>x<bits>}`\n", "\n", @@ -721,34 +1158,42 @@ "- Controls how the dataset is subdivided for faster searches\n", "- More centroids = faster search, slower training \n", "- Fewer centroids = slower search, faster training\n", - "- If omitted (like IVF,SQ8), Couchbase auto-selects based on dataset size\n", + "- If omitted (like `IVF,SQ8`), Couchbase auto-selects based on dataset size\n", "\n", "**Quantization Options:**\n", - "- SQ (Scalar Quantization): SQ4, SQ6, SQ8 (4, 6, or 8 bits per dimension)\n", - "- PQ (Product Quantization): PQx (e.g., PQ32x8)\n", + "- SQ (Scalar Quantization): `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)\n", + "- PQ (Product Quantization): `PQx` (e.g., `PQ32x8`)\n", "- Higher values = better accuracy, larger index size\n", "\n", "**Common Examples:**\n", - "- IVF,SQ8 - Auto centroids, 8-bit scalar quantization (good default)\n", - "- IVF1000,SQ6 - 1000 centroids, 6-bit scalar quantization \n", - "- IVF,PQ32x8 - Auto centroids, 32 subquantizers with 8 bits\n", + "- `IVF,SQ8` – Auto centroids, 8-bit scalar quantization (good default)\n", + "- `IVF1000,SQ6` – 1000 centroids, 6-bit scalar quantization \n", + "- `IVF,PQ32x8` – Auto centroids, 32 subquantizers with 8 bits\n", "\n", "For detailed configuration options, see the [Quantization & Centroid Settings](https://docs.couchbase.com/server/current/vector-index/hyperscale-vector-index.html#algo_settings).\n", "\n", - "In the code below, we demonstrate creating a BHIVE index for optimal performance. This method takes an index type (BHIVE or COMPOSITE) and description parameter for optimization settings. Alternatively, GSI indexes can be created manually from the Couchbase UI. " + "In the code below, we demonstrate creating a Hyperscale index for optimal performance. You can adapt the same flow to create a COMPOSITE index by replacing the index type and options."
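The description grammar above (`IVF` centroids plus an `SQ`/`PQ` quantization suffix) can be made concrete with a small parser. This is a hypothetical helper for illustration only, not part of the Couchbase SDK; the function name and return shape are assumptions:

```python
import re

def parse_index_description(description):
    """Split a vector index description such as 'IVF1000,SQ6' into its
    centroid count and quantization settings.

    Hypothetical helper for illustration; not part of the Couchbase SDK."""
    match = re.fullmatch(r"IVF(\d*),(SQ[468]|PQ\d+x\d+)", description)
    if not match:
        raise ValueError(f"Unrecognized index description: {description!r}")
    centroids, quantization = match.groups()
    return {
        # An empty centroid count means Couchbase auto-selects it
        # based on dataset size.
        "centroids": int(centroids) if centroids else None,
        "quantization": quantization,
    }

print(parse_index_description("IVF,SQ8"))      # auto centroids, 8-bit scalar
print(parse_index_description("IVF1000,SQ6"))  # 1000 centroids, 6-bit scalar
print(parse_index_description("IVF,PQ32x8"))   # 32 subquantizers, 8 bits each
```

Walking the common examples through such a parser makes the trade-off explicit: the centroid count drives search speed versus training time, while the quantization suffix trades accuracy against index size.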
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 23, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Hyperscale index may already exist or error occurred: InternalServerFailureException()\n" + ] + } + ], "source": [ - "# Create a BHIVE (Hyperscale Vector Index) for optimized vector search\n", + "# Create a Hyperscale Vector Index for optimized vector search\n", "try:\n", - " bhive_index_name = f\"{INDEX_NAME}_bhive\"\n", + " hyperscale_index_name = f\"{INDEX_NAME}_hyperscale\"\n", "\n", - " # Use the cluster connection to create the BHIVE index\n", + " # Use the cluster connection to create the Hyperscale index\n", " scope = cluster.bucket(CB_BUCKET_NAME).scope(SCOPE_NAME)\n", " \n", " options = {\n", @@ -759,39 +1204,70 @@ " \n", " scope.query(\n", " f\"\"\"\n", - " CREATE INDEX {bhive_index_name}\n", + " CREATE INDEX {hyperscale_index_name}\n", " ON {COLLECTION_NAME} (embedding VECTOR)\n", " USING GSI WITH {json.dumps(options)}\n", " \"\"\",\n", " QueryOptions(\n", " timeout=timedelta(seconds=300)\n", " )).execute()\n", - " print(f\"Successfully created BHIVE index: {bhive_index_name}\")\n", + " print(f\"Successfully created Hyperscale index: {hyperscale_index_name}\")\n", "except Exception as e:\n", - " print(f\"BHIVE index may already exist or error occurred: {str(e)}\")\n" + " print(f\"Hyperscale index may already exist or error occurred: {str(e)}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Testing Optimized GSI Vector Search\n", + "# Testing Optimized Hyperscale Vector Search\n", "\n", - "The example below shows running the same RAG query, but now using the BHIVE GSI index we created above. You'll notice improved performance as the index efficiently retrieves data." + "The example below runs the same RAG query, but now uses the Hyperscale index created above. You'll notice improved performance as the index efficiently retrieves data. 
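As a sketch of the Composite variant mentioned above, the helper below builds a `CREATE INDEX` statement that pairs scalar keys with the vector column. The option keys (`dimension`, `similarity`, `description`), the `published_date` scalar field, and the helper itself are assumptions modeled on the Hyperscale example, not code taken from the notebook; adjust them to your schema and Couchbase release:

```python
import json

def composite_index_ddl(index_name, collection, scalar_fields,
                        vector_field="embedding", dimension=3072,
                        similarity="L2", description="IVF,SQ8"):
    """Build a SQL++ CREATE INDEX statement for a Composite Vector Index.

    Illustrative sketch: option keys mirror the Hyperscale example above
    and may need adjusting for your Couchbase release."""
    options = {"dimension": dimension, "similarity": similarity,
               "description": description}
    # Scalar keys come first so they can prune candidates before the
    # vector comparison runs.
    keys = ", ".join(scalar_fields + [f"{vector_field} VECTOR"])
    return (f"CREATE INDEX {index_name} ON {collection} ({keys}) "
            f"USING GSI WITH {json.dumps(options)}")

ddl = composite_index_ddl("bbc_news_composite", "c", ["published_date"])
print(ddl)
# Execute against the same scope as the Hyperscale index, e.g.:
# scope.query(ddl, QueryOptions(timeout=timedelta(seconds=300))).execute()
```

Including `published_date` in the index key lets a date-filtered query discard most of the collection before any vector distances are computed, which is exactly the selective-query case where Composite indexes excel.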
If you create a Composite index, the workflow is identical — Haystack automatically routes queries through the scalar filters before performing the vector similarity search." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-11-17 14:44:55,645 - INFO - Running component query_embedder\n", + "2025-11-17 14:44:56,291 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", + "2025-11-17 14:44:56,508 - INFO - Running component retriever\n", + "2025-11-17 14:44:56,597 - INFO - Running component prompt_builder\n", + "2025-11-17 14:44:56,598 - INFO - Running component llm\n", + "2025-11-17 14:44:59,603 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", + "2025-11-17 14:44:59,610 - INFO - Running component answer_builder\n", + "=== Retrieved Documents ===\n", + "Id: 0aa95f5a2c515683b16ab26312eb2734ca3a7fb72d00992f323b0f72a0e365a6 Title: Gleision mine deaths: Inquest won't be held until 2026 - BBC News\n", + "Id: 8acf5742f95609c38bdf5941f61c337a1934570faaf21cd478daea2c635ddff6 Title: Bob Bryar dead: Former My Chemical Romance drummer dies aged 44 - BBC News\n", + "Id: af754e31be9bfeca0436b023ecb75d1668fc035c36c81a87ded707729b790653 Title: Terry Griffiths: Former world snooker champion dies aged 77 - BBC Sport\n", + "Id: b7434182a5461e6c5e92fa676cd071cd22a1a109d98563f3a5899894605737bd Title: Diddy on Trial - Enter the Diddy-Verse - BBC Sounds\n", + "\n", + "=== Final Answer ===\n", + "Question: What is latest news on the death of Charles Breslin?\n", + "Answer: The latest news on the death of Charles Breslin is that, after a protracted battle, families were informed in 2022 that a full inquest into the deaths of the four miners, including Charles Breslin, would be held. 
However, a pre-inquest hearing in Swansea’s Guildhall announced that the full inquest would not occur until \"the early part of 2026\" due to \"significant complexity\" surrounding the documents needed. The inquest involves a large volume of material, with the Coal Authority estimating 75,000 pages of written documents. Families are eagerly awaiting answers about the disaster at the Gleision colliery in which Charles Breslin and three others died, hopeful that the inquest will provide clarity and relief. The coroner has promised a thorough investigation to determine the source of the water and whether responsible parties were aware of the risks involved.\n", + "\n", + "Sources:\n", + "-> Gleision mine deaths: Inquest won't be held until 2026 - BBC News\n", + "-> Bob Bryar dead: Former My Chemical Romance drummer dies aged 44 - BBC News\n", + "-> Terry Griffiths: Former world snooker champion dies aged 77 - BBC Sport\n", + "-> Diddy on Trial - Enter the Diddy-Verse - BBC Sounds\n", + "\n", + "Optimized Hyperscale Vector Search Results (completed in 3.97 seconds):\n" + ] + } + ], "source": [ - "# Test the optimized GSI vector search with BHIVE index\n", - "query = \"Who will Daniel Dubois fight in Saudi Arabia on 22 February?\"\n", + "# Test the optimized Hyperscale vector search\n", + "query = \"What is latest news on the death of Charles Breslin?\"\n", "\n", "try:\n", - " # The RAG pipeline will automatically use the optimized GSI index\n", - " # Perform the semantic search with GSI optimization\n", + " # The RAG pipeline will automatically use the optimized Hyperscale index\n", + " # Perform the semantic search with Hyperscale optimization\n", " start_time = time.time()\n", " result = rag_pipeline.run({\n", " \"query_embedder\": {\"text\": query},\n", @@ -819,7 +1295,7 @@ " for doc in answer.documents:\n", " print(f\"-> {doc.meta['title']}\")\n", " # Display search results\n", - " print(f\"\\nOptimized GSI Vector Search Results (completed in 
{search_elapsed_time:.2f} seconds):\")\n", + " print(f\"\\nOptimized Hyperscale Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", " #print(result[\"generator\"][\"replies\"][0])\n", "\n", "except Exception as e:\n", @@ -831,16 +1307,15 @@ "metadata": {}, "source": [ "# Conclusion\n", - "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Couchbase Capella's GSI vector search, OpenAI, and Haystack. We used the BBC News dataset, which contains real-time news articles, to demonstrate how RAG can be used to answer questions about current events and provide up-to-date information that extends beyond the LLM's training data.\n", + "In this tutorial, we've built a Retrieval Augmented Generation (RAG) system using Haystack with OpenAI models and Couchbase Capella's Hyperscale and Composite Vector Indexes. Using the BBC News dataset, we demonstrated how modern vector indexes make it possible to answer up-to-date questions that extend beyond an LLM's original training data.\n", "\n", "The key components of our RAG system include:\n", "\n", - "1. **Couchbase Capella GSI Vector Search** as the high-performance vector database for storing and retrieving document embeddings\n", + "1. **Couchbase Capella Hyperscale & Composite Vector Indexes** for high-performance storage and retrieval of document embeddings\n", "2. **Haystack** as the framework for building modular RAG pipelines with flexible component connections\n", "3. **OpenAI Services** for generating embeddings (`text-embedding-3-large`) and LLM responses (`gpt-4o`)\n", - "4. **GSI Vector Indexes** (BHIVE/Composite) for optimized vector search performance\n", "\n", - "This approach allows us to enhance the capabilities of large language models by grounding their responses in specific, up-to-date information from our knowledge base, while leveraging Couchbase's advanced GSI vector search for optimal performance and scalability. 
Haystack's modular pipeline approach provides flexibility and extensibility for building complex RAG applications.\n" + "This approach grounds LLM responses in specific, current information from our knowledge base while taking advantage of Couchbase's advanced vector index options for performance and scale. Haystack's modular pipeline model keeps the solution extensible as you layer in additional data sources or services.\n" ] } ], @@ -860,7 +1335,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.2" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/haystack/gsi/frontmatter.md b/haystack/gsi/frontmatter.md index 1da0623e..f2c67f24 100644 --- a/haystack/gsi/frontmatter.md +++ b/haystack/gsi/frontmatter.md @@ -1,12 +1,15 @@ --- # frontmatter -path: "/tutorial-openai-haystack-rag-with-gsi" -title: "RAG with OpenAI, Haystack and Couchbase Hyperscale and Composite Vector Indexes" -short_title: "RAG with OpenAI, Haystack and Couchbase CVI and HVI" +path: "/tutorial-openai-haystack-rag-with-hyperscale-or-composite-vector-index" +alt_paths: + - "/tutorial-openai-haystack-rag-with-hyperscale-vector-index" + - "/tutorial-openai-haystack-rag-with-composite-vector-index" +title: "RAG with OpenAI, Haystack, and Couchbase Hyperscale & Composite Vector Indexes" +short_title: "RAG with OpenAI, Haystack, and Hyperscale & Composite Indexes" description: - - Learn how to build a semantic search engine using Couchbase's Hyperscale and Composite Vector Indexes. - - This tutorial demonstrates how to integrate Couchbase's GSI vector search capabilities with OpenAI embeddings. - - You will understand how to perform Retrieval-Augmented Generation (RAG) using Haystack, Couchbase and OpenAI services. + - Learn how to build a semantic search engine using Couchbase Hyperscale and Composite Vector Indexes. 
+ - This tutorial demonstrates how Haystack integrates Couchbase Hyperscale and Composite Vector Indexes with embeddings generated by OpenAI services. + - Perform Retrieval-Augmented Generation (RAG) using Haystack with Couchbase and OpenAI services while comparing the two index types. content_type: tutorial filter: sdk technology: @@ -15,7 +18,8 @@ tags: - OpenAI - Artificial Intelligence - Haystack - - GSI + - Hyperscale Vector Index + - Composite Vector Index sdk_language: - python length: 60 Mins From a2960ce2fcee75f4e0320d06ecec7313f5247d7c Mon Sep 17 00:00:00 2001 From: Viraj Agarwal Date: Mon, 17 Nov 2025 14:52:43 +0530 Subject: [PATCH 2/4] DA-1319 rename: folders --- .../RAG_with_Couchbase_Capella_and_OpenAI.ipynb | 0 haystack/{gsi => query_based}/frontmatter.md | 0 haystack/{fts => query_based}/requirements.txt | 0 .../RAG_with_Couchbase_Capella_and_OpenAI.ipynb | 0 haystack/{fts => search_based}/frontmatter.md | 0 haystack/{gsi => search_based}/requirements.txt | 0 haystack/{fts => search_based}/search_vector_index.json | 0 7 files changed, 0 insertions(+), 0 deletions(-) rename haystack/{gsi => query_based}/RAG_with_Couchbase_Capella_and_OpenAI.ipynb (100%) rename haystack/{gsi => query_based}/frontmatter.md (100%) rename haystack/{fts => query_based}/requirements.txt (100%) rename haystack/{fts => search_based}/RAG_with_Couchbase_Capella_and_OpenAI.ipynb (100%) rename haystack/{fts => search_based}/frontmatter.md (100%) rename haystack/{gsi => search_based}/requirements.txt (100%) rename haystack/{fts => search_based}/search_vector_index.json (100%) diff --git a/haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 100% rename from haystack/gsi/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb diff --git a/haystack/gsi/frontmatter.md b/haystack/query_based/frontmatter.md similarity index 100% rename 
from haystack/gsi/frontmatter.md rename to haystack/query_based/frontmatter.md diff --git a/haystack/fts/requirements.txt b/haystack/query_based/requirements.txt similarity index 100% rename from haystack/fts/requirements.txt rename to haystack/query_based/requirements.txt diff --git a/haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb similarity index 100% rename from haystack/fts/RAG_with_Couchbase_Capella_and_OpenAI.ipynb rename to haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb diff --git a/haystack/fts/frontmatter.md b/haystack/search_based/frontmatter.md similarity index 100% rename from haystack/fts/frontmatter.md rename to haystack/search_based/frontmatter.md diff --git a/haystack/gsi/requirements.txt b/haystack/search_based/requirements.txt similarity index 100% rename from haystack/gsi/requirements.txt rename to haystack/search_based/requirements.txt diff --git a/haystack/fts/search_vector_index.json b/haystack/search_based/search_vector_index.json similarity index 100% rename from haystack/fts/search_vector_index.json rename to haystack/search_based/search_vector_index.json From ef85274e04ca4f7b0485651d738c04ad33ce05ce Mon Sep 17 00:00:00 2001 From: Viraj Agarwal Date: Tue, 18 Nov 2025 11:02:14 +0530 Subject: [PATCH 3/4] DA-1319 update: remove execution outputs and reset execution counts - Cleared all output cells to ensure a clean notebook state. - Set execution counts to null for all code cells to allow for fresh execution. - Adjusted the indexing pipeline to process a larger number of documents (from 100 to 1200). - Updated the sample query to reflect current news context. 
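Scaling the indexing run from 100 to 1200 documents, as this commit does, is usually paired with batched writes so each pipeline invocation stays bounded in memory and request size. A minimal, framework-agnostic sketch of that batching pattern (the `batched` helper and the batch size of 100 are illustrative, not taken from the tutorial code):

```python
# Illustrative batching helper: split a document list into fixed-size chunks
# before handing each chunk to an indexing pipeline's writer component.
from itertools import islice

def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

docs = [f"doc-{i}" for i in range(1200)]
for chunk in batched(docs, 100):
    # In the notebook this is where indexing_pipeline.run(...) would be
    # called with the current chunk of documents.
    print(f"indexing {len(chunk)} documents")
```

With 1200 documents and a batch size of 100 this yields twelve full batches; a trailing partial batch is emitted automatically when the total is not a multiple of the batch size.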
--- ...AG_with_Couchbase_Capella_and_OpenAI.ipynb | 570 ++---------------- 1 file changed, 43 insertions(+), 527 deletions(-) diff --git a/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index b044ca4d..c54edcf4 100644 --- a/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -73,204 +73,9 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, - "outputs": [
[removed pip install log output elided]
pillow-12.0.0-cp313-cp313-macosx_11_0_arm64.whl (4.7 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.7/4.7 MB\u001b[0m \u001b[31m17.9 MB/s\u001b[0m \u001b[33m0:00:00\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", - "\u001b[?25hInstalling collected packages: namex, mpmath, libclang, flatbuffers, wrapt, wheel, werkzeug, termcolor, tensorboard-data-server, sympy, setuptools, safetensors, regex, protobuf, pillow, optree, opt_einsum, ml_dtypes, markdown, h5py, grpcio, google_pasta, gast, absl-py, torch, tensorboard, huggingface-hub, astunparse, tokenizers, keras, accelerate, transformers, tensorflow\n", - "\u001b[2K Attempting uninstall: huggingface-hub[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", - "\u001b[2K Found existing installation: huggingface_hub 1.1.4━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", - "\u001b[2K Uninstalling huggingface_hub-1.1.4:[90m╺\u001b[0m\u001b[90m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", - "\u001b[2K Successfully uninstalled huggingface_hub-1.1.4m━━━━━━━━━\u001b[0m \u001b[32m25/33\u001b[0m [tensorboard]\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m33/33\u001b[0m [tensorflow]3\u001b[0m [tensorflow]s]ub]\n", - "\u001b[1A\u001b[2KSuccessfully installed absl-py-2.3.1 accelerate-1.11.0 astunparse-1.6.3 flatbuffers-25.9.23 gast-0.6.0 google_pasta-0.2.0 grpcio-1.76.0 h5py-3.15.1 huggingface-hub-0.36.0 keras-3.12.0 libclang-18.1.1 markdown-3.10 ml_dtypes-0.5.3 mpmath-1.3.0 namex-0.1.0 opt_einsum-3.4.0 optree-0.18.0 pillow-12.0.0 protobuf-6.33.1 regex-2025.11.3 safetensors-0.6.2 setuptools-80.9.0 sympy-1.14.0 tensorboard-2.20.0 tensorboard-data-server-0.7.2 tensorflow-2.20.0 termcolor-3.2.0 tokenizers-0.22.1 torch-2.9.1 transformers-4.57.1 werkzeug-3.1.3 wheel-0.45.1 wrapt-2.0.1\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], + "outputs": [], 
"source": [ "# Install required packages\n", "%pip install -r requirements.txt" @@ -286,18 +91,9 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/viraj.agarwal/Tasks/Task16.5/.venv/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", - " from .autonotebook import tqdm as notebook_tqdm\n" - ] - } - ], + "outputs": [], "source": [ "import getpass\n", "import base64\n", @@ -348,7 +144,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -376,7 +172,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -398,17 +194,9 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:36:34,556 - INFO - Successfully connected to the Couchbase cluster\n" - ] - } - ], + "outputs": [], "source": [ "try:\n", " # Initialize the Couchbase Cluster\n", @@ -436,19 +224,9 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Bucket 'b' already exists.\n", - "Scope 's' already exists.\n", - "Collection 'c' already exists in scope 's'.\n" - ] - } - ], + "outputs": [], "source": [ "from couchbase.management.buckets import CreateBucketSettings\n", "import json\n", @@ -499,17 +277,9 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded the BBC News dataset with 2687 rows\n" - ] - } - ], + "outputs": [], 
"source": [ "try:\n", " news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split=\"train\")\n", @@ -527,20 +297,9 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dataset columns: ['title', 'published_date', 'authors', 'description', 'section', 'content', 'link', 'top_image']\n", - "\n", - "First two examples:\n", - "{'title': [\"Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News\", 'Lockdown DIY linked to Walleys Quarry gases - BBC News'], 'published_date': ['2024-12-01', '2024-12-01'], 'authors': ['https://www.facebook.com/bbcnews', 'https://www.facebook.com/bbcnews'], 'description': [\"Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.\", 'An academic says an increase in plasterboard sent to landfill could be behind a spike in smells.'], 'section': ['Asia', 'Stoke & Staffordshire'], 'content': ['Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery\\n\\nImran Khan\\'s wife, Bushra Bibi, encouraged protesters into the heart of Pakistan\\'s capital, Islamabad\\n\\nA charred lorry, empty tear gas shells and posters of former Pakistan Prime Minister Imran Khan - it was all that remained of a massive protest led by Khan’s wife, Bushra Bibi, that had sent the entire capital into lockdown. Just a day earlier, faith healer Bibi - wrapped in a white shawl, her face covered by a white veil - stood atop a shipping container on the edge of the city as thousands of her husband’s devoted followers waved flags and chanted slogans beneath her. It was the latest protest to flare since Khan, the 72-year-old cricketing icon-turned-politician, was jailed more than a year ago after falling foul of the country\\'s influential military which helped catapult him to power. “My children and my brothers! 
You have to stand with me,” Bibi cried on Tuesday afternoon, her voice cutting through the deafening roar of the crowd. “But even if you don’t,” she continued, “I will still stand firm. “This is not just about my husband. It is about this country and its leader.” It was, noted some watchers of Pakistani politics, her political debut. But as the sun rose on Wednesday morning, there was no sign of Bibi, nor the thousands of protesters who had marched through the country to the heart of the capital, demanding the release of their jailed leader. While other PMs have fallen out with Pakistan\\'s military in the past, Khan\\'s refusal to stay quiet behind bars is presenting an extraordinary challenge - escalating the standoff and leaving the country deeply divided. Exactly what happened to the so-called “final march”, and Bibi, when the city went dark is still unclear. All eyewitnesses like Samia* can say for certain is that the lights went out suddenly, plunging D Chowk, the square where they had gathered, into blackness.\\n\\nWithin a day of arriving, the protesters had scattered - leaving behind Bibi\\'s burnt-out vehicle\\n\\nAs loud screams and clouds of tear gas blanketed the square, Samia describes holding her husband on the pavement, bloodied from a gun shot to his shoulder. \"Everyone was running for their lives,\" she later told BBC Urdu from a hospital in Islamabad, adding it was \"like doomsday or a war\". \"His blood was on my hands and the screams were unending.” But how did the tide turn so suddenly and decisively? Just hours earlier, protesters finally reached D Chowk late afternoon on Tuesday. They had overcome days of tear gas shelling and a maze of barricaded roads to get to the city centre. Many of them were supporters and workers of the Pakistan Tehreek-e-Insaf (PTI), the party led by Khan. He had called for the march from his jail cell, where he has been for more than a year on charges he says are politically motivated. 
Now Bibi - his third wife, a woman who had been largely shrouded in mystery and out of public view since their unexpected wedding in 2018 - was leading the charge. “We won’t go back until we have Khan with us,” she declared as the march reached D Chowk, deep in the heart of Islamabad’s government district.\\n\\nThousands had marched for days to reach Islamabad, demanding former Prime Minister Imran Khan be released from jail\\n\\nInsiders say even the choice of destination - a place where her husband had once led a successful sit in - was Bibi’s, made in the face of other party leader’s opposition, and appeals from the government to choose another gathering point. Her being at the forefront may have come as a surprise. Bibi, only recently released from prison herself, is often described as private and apolitical. Little is known about her early life, apart from the fact she was a spiritual guide long before she met Khan. Her teachings, rooted in Sufi traditions, attracted many followers - including Khan himself. Was she making her move into politics - or was her sudden appearance in the thick of it a tactical move to keep Imran Khan’s party afloat while he remains behind bars? For critics, it was a move that clashed with Imran Khan’s oft-stated opposition to dynastic politics. There wasn’t long to mull the possibilities. After the lights went out, witnesses say that police started firing fresh rounds of tear gas at around 21:30 local time (16:30 GMT). The crackdown was in full swing just over an hour later. At some point, amid the chaos, Bushra Bibi left. Videos on social media appeared to show her switching cars and leaving the scene. The BBC couldn’t verify the footage. By the time the dust settled, her container had already been set on fire by unknown individuals. 
By 01:00 authorities said all the protesters had fled.\\n\\nSecurity was tight in the city, and as night fell, lights were switched off - leaving many in the dark as to what exactly happened next\\n\\nEyewitnesses have described scenes of chaos, with tear gas fired and police rounding up protesters. One, Amin Khan, said from behind an oxygen mask that he joined the march knowing that, \"either I will bring back Imran Khan or I will be shot\". The authorities have have denied firing at the protesters. They also said some of the protesters were carrying firearms. The BBC has seen hospital records recording patients with gunshot injuries. However, government spokesperson Attaullah Tarar told the BBC that hospitals had denied receiving or treating gunshot wound victims. He added that \"all security personnel deployed on the ground have been forbidden\" from having live ammunition during protests. But one doctor told BBC Urdu that he had never done so many surgeries for gunshot wounds in a single night. \"Some of the injured came in such critical condition that we had to start surgery right away instead of waiting for anaesthesia,\" he said. While there has been no official toll released, the BBC has confirmed with local hospitals that at least five people have died. Police say at least 500 protesters were arrested that night and are being held in police stations. The PTI claims some people are missing. And one person in particular hasn’t been seen in days: Bushra Bibi.\\n\\nThe next morning, the protesters were gone - leaving behind just wrecked cars and smashed glass\\n\\nOthers defended her. “It wasn’t her fault,” insisted another. “She was forced to leave by the party leaders.” Political commentators have been more scathing. “Her exit damaged her political career before it even started,” said Mehmal Sarfraz, a journalist and analyst. But was that even what she wanted? 
Khan has previously dismissed any thought his wife might have her own political ambitions - “she only conveys my messages,” he said in a statement attributed to him on his X account.\\n\\nImran Khan and Bushra Bibi, pictured here arriving at court in May 2023, married in 2018\\n\\nSpeaking to BBC Urdu, analyst Imtiaz Gul calls her participation “an extraordinary step in extraordinary circumstances\". Gul believes Bushra Bibi’s role today is only about “keeping the party and its workers active during Imran Khan’s absence”. It is a feeling echoed by some PTI members, who believe she is “stepping in only because Khan trusts her deeply”. Insiders, though, had often whispered that she was pulling the strings behind the scenes - advising her husband on political appointments and guiding high-stakes decisions during his tenure. A more direct intervention came for the first time earlier this month, when she urged a meeting of PTI leaders to back Khan’s call for a rally. Pakistan’s defence minister Khawaja Asif accused her of “opportunism”, claiming she sees “a future for herself as a political leader”. But Asma Faiz, an associate professor of political science at Lahore University of Management Sciences, suspects the PTI’s leadership may have simply underestimated Bibi. “It was assumed that there was an understanding that she is a non-political person, hence she will not be a threat,” she told the AFP news agency. “However, the events of the last few days have shown a different side of Bushra Bibi.” But it probably doesn’t matter what analysts and politicians think. Many PTI supporters still see her as their connection to Imran Khan. It was clear her presence was enough to electrify the base. “She is the one who truly wants to get him out,” says Asim Ali, a resident of Islamabad. “I trust her. 
Absolutely!”', 'Walleys Quarry was ordered not to accept any new waste as of Friday\\n\\nA chemist and former senior lecturer in environmental sustainability has said powerful odours from a controversial landfill site may be linked to people doing more DIY during the Covid-19 pandemic. Complaints about Walleys Quarry in Silverdale, Staffordshire – which was ordered to close as of Friday – increased significantly during and after coronavirus lockdowns. Issuing the closure notice, the Environment Agency described management of the site as poor, adding it had exhausted all other enforcement tactics at premises where gases had been noxious and periodically above emission level guidelines - which some campaigners linked to ill health locally. Dr Sharon George, who used to teach at Keele University, said she had been to the site with students and found it to be clean and well-managed, and suggested an increase in plasterboard heading to landfills in 2020 could be behind a spike in stenches.\\n\\n“One of the materials that is particularly bad for producing odours and awful emissions is plasterboard,\" she said. “That’s one of the theories behind why Walleys Quarry got worse at that time.” She said the landfill was in a low-lying area, and that some of the gases that came from the site were quite heavy. “They react with water in the atmosphere, so some of the gases you smell can be quite awful and not very good for our health. “It’s why, on some days when it’s colder and muggy and a bit misty, you can smell it more.” Dr George added: “With any landfill, you’re putting things into the ground – and when you put things into the ground, if they can they will start to rot. When they start to rot they’re going to give off gases.” She believed Walleys Quarry’s proximity to people’s homes was another major factor in the amount of complaints that arose from its operation. 
“If you’ve got a gas that people can smell, they’re going to report it much more than perhaps a pollutant that might go unnoticed.”\\n\\nRebecca Currie said she did not think the site would ever be closed\\n\\nLocal resident and campaigner Rebecca Currie said the closure notice served to Walleys Quarry was \"absolutely amazing\". Her son Matthew has had breathing difficulties after being born prematurely with chronic lung disease, and Ms Currie says the site has made his symptoms worse. “I never thought this day was going to happen,” she explained. “We fought and fought for years.” She told BBC Midlands Today: “Our community have suffered. We\\'ve got kids who are really poorly, people have moved homes.”\\n\\nComplaints about Walleys Quarry to Newcastle-under-Lyme Borough Council exceeded 700 in November, the highest amount since 2021 according to council leader Simon Tagg. The Environment Agency (EA), which is responsible for regulating landfill sites, said it had concluded further operation at the site could result in \"significant long-term pollution\". A spokesperson for Walley\\'s Quarry Ltd said the firm rejected the EA\\'s accusations of poor management, and would be challenging the closure notice. Dr George said she believed the EA was likely to be erring on the side of caution and public safety, adding safety standards were strict. She said a lack of landfill space in the country overall was one of the broader issues that needed addressing. 
“As people, we just keep using stuff and then have nowhere to put it, and then when we end up putting it in places like Walleys Quarry that is next to houses, I think that’s where the problems are.”\\n\\nTell us which stories we should cover in Staffordshire'], 'link': ['http://www.bbc.co.uk/news/articles/cvg02lvj1e7o', 'http://www.bbc.co.uk/news/articles/c5yg1v16nkpo'], 'top_image': ['https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/9975/live/b22229e0-ad5a-11ef-83bc-1153ed943d1c.jpg', 'https://ichef.bbci.co.uk/ace/standard/3840/cpsprodpb/0896/live/55209f80-adb2-11ef-8f6c-f1a86bb055ec.jpg']}\n" - ] - } - ], + "outputs": [], "source": [ "# Print the first two examples from the dataset\n", "print(\"Dataset columns:\", news_dataset.column_names)\n", @@ -559,17 +318,9 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "We have 1749 unique articles in our database.\n" - ] - } - ], + "outputs": [], "source": [ "import hashlib\n", "\n", @@ -598,17 +349,9 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Successfully created embedding models\n" - ] - } - ], + "outputs": [], "source": [ "try:\n", " # Set up the document embedder for processing documents\n", @@ -638,18 +381,9 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:36:51,924 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", - "Embedding dimension: 3072\n" - ] - } - ], + "outputs": [], "source": [ "test_result = rag_embedder.run(text=\"this is a test sentence\")\n", "test_embedding = test_result[\"embedding\"]\n", @@ -666,17 +400,9 @@ }, { "cell_type": "code", - "execution_count": 12, + 
"execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Successfully created Couchbase vector document store\n" - ] - } - ], + "outputs": [], "source": [ "try:\n", " # Create the Couchbase vector document store\n", @@ -712,24 +438,9 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Document content preview:\n", - "Content: Bushra Bibi led a protest to free Imran Khan - what happened next is a mystery\n", - "\n", - "Imran Khan's wife, Bushra Bibi, encouraged protesters into the heart of Pakistan's capital, Islamabad\n", - "\n", - "A charred lorry, ...\n", - "Metadata: {'title': \"Pakistan protest: Bushra Bibi's march for Imran Khan disappeared - BBC News\", 'description': \"Imran Khan's third wife guided protesters to the heart of the capital - and then disappeared.\", 'published_date': '2024-12-01', 'link': 'http://www.bbc.co.uk/news/articles/cvg02lvj1e7o'}\n", - "Created 1749 documents\n" - ] - } - ], + "outputs": [], "source": [ "haystack_documents = []\n", "# Process and store documents\n", @@ -776,27 +487,9 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "🚅 Components\n", - " - cleaner: DocumentCleaner\n", - " - embedder: OpenAIDocumentEmbedder\n", - " - writer: DocumentWriter\n", - "🛤️ Connections\n", - " - cleaner.documents -> embedder.documents (list[Document])\n", - " - embedder.documents -> writer.documents (list[Document])" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "\n", "\n", @@ -824,100 +517,13 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"2025-11-17 14:42:29,794 - INFO - Running component cleaner\n", - "2025-11-17 14:42:29,800 - INFO - Running component embedder\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Calculating embeddings: 0it [00:00, ?it/s]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:31,149 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Calculating embeddings: 1it [00:02, 2.94s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:33,448 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Calculating embeddings: 2it [00:04, 2.35s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:35,608 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Calculating embeddings: 3it [00:06, 1.85s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:36,509 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Calculating embeddings: 4it [00:07, 1.87s/it]" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:37,301 - INFO - Running component writer\n", - "Indexed 100 document chunks\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n" - ] - } - ], + "outputs": [], "source": [ "# Run the indexing pipeline\n", "if haystack_documents:\n", - " result = indexing_pipeline.run({\"cleaner\": {\"documents\": haystack_documents[:100]}})\n", + " result = indexing_pipeline.run({\"cleaner\": {\"documents\": 
haystack_documents[:1200]}})\n", " print(f\"Indexed {result['writer']['documents_written']} document chunks\")\n", "else:\n", " print(\"No documents created. Skipping indexing.\")\n" @@ -937,17 +543,9 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:40,424 - INFO - Successfully created the OpenAI generator\n" - ] - } - ], + "outputs": [], "source": [ "try:\n", " # Set up the LLM generator\n", @@ -976,18 +574,9 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:41,774 - WARNING - PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.\n", - "Successfully created RAG pipeline\n" - ] - } - ], + "outputs": [], "source": [ "# Define RAG prompt template\n", "prompt_template = \"\"\"\n", @@ -1044,46 +633,13 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:42:43,017 - INFO - Running component query_embedder\n", - "2025-11-17 14:42:43,636 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", - "2025-11-17 14:42:43,853 - INFO - Running component retriever\n", - "2025-11-17 14:42:43,990 - INFO - Running component prompt_builder\n", - "2025-11-17 14:42:43,990 - INFO - Running component llm\n", - "2025-11-17 14:42:45,914 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", - "2025-11-17 14:42:45,935 - INFO - Running 
component answer_builder\n", - "=== Retrieved Documents ===\n", - "Id: 3bd611696904f038e5ceff530ab97539a34e6893001e6e71d5518ecfa4a729ff Title: New Zealand v England: Brydon Carse repays faith with Christchurch haul - BBC Sport\n", - "Id: 5cc142cd0535bcb62c2bce08e87714a2ddab9590c06e2935213c0341360953b1 Title: Ireland 22-19 Australia: 'No emotion' for Andy Farrell in winning send-off before Lions sabbatical - BBC Sport\n", - "Id: 96601a80eaf87c11a39a5cada79bdaa50e227cb03ed36e3bb0573c6297767702 Title: Watch: CCTV shows how Daniel Khalife escaped - BBC News\n", - "Id: 875d0e214cbc2d7d0bd475906cf50b74d042a831c7b00b301c39a5bb2d387f07 Title: Troy Deeney's Team of the Week: Saka, Kluivert, Schade, Rashford - BBC Sport\n", - "Id: 62b98b01f3f453b4046fc92fb97cd73022afc2a6e70133c48985eecc658c3db7 Title: World Athletic Awards: Letsile Tebogo and Sifan Hassan named athletes of the year - BBC Sport\n", - "\n", - "=== Final Answer ===\n", - "Question: Who will Daniel Dubois fight in Saudi Arabia on 22 February?\n", - "Answer: The documents do not provide information on who Daniel Dubois will fight in Saudi Arabia on 22 February.\n", - "\n", - "Sources:\n", - "-> New Zealand v England: Brydon Carse repays faith with Christchurch haul - BBC Sport\n", - "-> Ireland 22-19 Australia: 'No emotion' for Andy Farrell in winning send-off before Lions sabbatical - BBC Sport\n", - "-> Watch: CCTV shows how Daniel Khalife escaped - BBC News\n", - "-> Troy Deeney's Team of the Week: Saka, Kluivert, Schade, Rashford - BBC Sport\n", - "-> World Athletic Awards: Letsile Tebogo and Sifan Hassan named athletes of the year - BBC Sport\n", - "\n", - "Optimized Hyperscale Vector Search Results (completed in 2.92 seconds):\n" - ] - } - ], + "outputs": [], "source": [ "# Sample query from the dataset\n", "\n", - "query = \"Who will Daniel Dubois fight in Saudi Arabia on 22 February?\"\n", + "query = \"What is the latest news on the death of Charles Breslin?\"\n", "\n", "try:\n", " # Perform the semantic
search using the RAG pipeline\n", @@ -1114,7 +670,7 @@ " for doc in answer.documents:\n", " print(f\"-> {doc.meta['title']}\")\n", " # Display search results\n", - " print(f\"\\nOptimized Hyperscale Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", + " print(f\"\\nLinear Vector Search Results (completed in {search_elapsed_time:.2f} seconds):\")\n", " #print(result[\"generator\"][\"replies\"][0])\n", "\n", "except Exception as e:\n", @@ -1177,17 +733,9 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Hyperscale index may already exist or error occurred: InternalServerFailureException()\n" - ] - } - ], + "outputs": [], "source": [ "# Create a Hyperscale Vector Index for optimized vector search\n", "try:\n", @@ -1198,15 +746,14 @@ " \n", " options = {\n", " \"dimension\": 3072, # text-embedding-3-large dimension\n", - " \"description\": \"IVF1024,PQ32x8\",\n", " \"similarity\": \"L2\",\n", " }\n", " \n", " scope.query(\n", " f\"\"\"\n", - " CREATE INDEX {hyperscale_index_name}\n", + " CREATE VECTOR INDEX {hyperscale_index_name}\n", " ON {COLLECTION_NAME} (embedding VECTOR)\n", - " USING GSI WITH {json.dumps(options)}\n", + " WITH {json.dumps(options)}\n", " \"\"\",\n", " QueryOptions(\n", " timeout=timedelta(seconds=300)\n", @@ -1227,40 +774,9 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2025-11-17 14:44:55,645 - INFO - Running component query_embedder\n", - "2025-11-17 14:44:56,291 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings \"HTTP/1.1 200 OK\"\n", - "2025-11-17 14:44:56,508 - INFO - Running component retriever\n", - "2025-11-17 14:44:56,597 - INFO - Running component prompt_builder\n", - "2025-11-17 14:44:56,598 - INFO - Running component llm\n", - "2025-11-17 
14:44:59,603 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions \"HTTP/1.1 200 OK\"\n", - "2025-11-17 14:44:59,610 - INFO - Running component answer_builder\n", - "=== Retrieved Documents ===\n", - "Id: 0aa95f5a2c515683b16ab26312eb2734ca3a7fb72d00992f323b0f72a0e365a6 Title: Gleision mine deaths: Inquest won't be held until 2026 - BBC News\n", - "Id: 8acf5742f95609c38bdf5941f61c337a1934570faaf21cd478daea2c635ddff6 Title: Bob Bryar dead: Former My Chemical Romance drummer dies aged 44 - BBC News\n", - "Id: af754e31be9bfeca0436b023ecb75d1668fc035c36c81a87ded707729b790653 Title: Terry Griffiths: Former world snooker champion dies aged 77 - BBC Sport\n", - "Id: b7434182a5461e6c5e92fa676cd071cd22a1a109d98563f3a5899894605737bd Title: Diddy on Trial - Enter the Diddy-Verse - BBC Sounds\n", - "\n", - "=== Final Answer ===\n", - "Question: What is latest news on the death of Charles Breslin?\n", - "Answer: The latest news on the death of Charles Breslin is that, after a protracted battle, families were informed in 2022 that a full inquest into the deaths of the four miners, including Charles Breslin, would be held. However, a pre-inquest hearing in Swansea’s Guildhall announced that the full inquest would not occur until \"the early part of 2026\" due to \"significant complexity\" surrounding the documents needed. The inquest involves a large volume of material, with the Coal Authority estimating 75,000 pages of written documents. Families are eagerly awaiting answers about the disaster at the Gleision colliery in which Charles Breslin and three others died, hopeful that the inquest will provide clarity and relief. 
The coroner has promised a thorough investigation to determine the source of the water and whether responsible parties were aware of the risks involved.\n", - "\n", - "Sources:\n", - "-> Gleision mine deaths: Inquest won't be held until 2026 - BBC News\n", - "-> Bob Bryar dead: Former My Chemical Romance drummer dies aged 44 - BBC News\n", - "-> Terry Griffiths: Former world snooker champion dies aged 77 - BBC Sport\n", - "-> Diddy on Trial - Enter the Diddy-Verse - BBC Sounds\n", - "\n", - "Optimized Hyperscale Vector Search Results (completed in 3.97 seconds):\n" - ] - } - ], + "outputs": [], "source": [ "# Test the optimized Hyperscale vector search\n", "query = \"What is latest news on the death of Charles Breslin?\"\n", From 6011495e4264218809b7f2f8c27fda5278ea252f Mon Sep 17 00:00:00 2001 From: Viraj Agarwal Date: Tue, 18 Nov 2025 11:15:22 +0530 Subject: [PATCH 4/4] DA-1319 update: address gemini comments --- .../query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb | 4 ++-- .../search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb | 3 +-- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index c54edcf4..629a5ca8 100644 --- a/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/query_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -285,7 +285,7 @@ " news_dataset = load_dataset('RealTimeData/bbc_news_alltime', '2024-12', split=\"train\")\n", " print(f\"Loaded the BBC News dataset with {len(news_dataset)} rows\")\n", "except Exception as e:\n", - " raise ValueError(f\"Error loading TREC dataset: {str(e)}\")" + " raise ValueError(f\"Error loading BBC News dataset: {str(e)}\")" ] }, { @@ -478,7 +478,7 @@ "\n", "In this section, we'll create an indexing pipeline to process our documents. The pipeline will:\n", "\n", - "1. 
Split the documents into smaller chunks using the DocumentSplitter\n", + "1. Clean and preprocess the documents using the DocumentCleaner\n", "2. Generate embeddings for each chunk using our document embedder\n", "3. Store these chunks with their embeddings in our Couchbase document store\n", "\n", diff --git a/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb index 429841a1..afb2063b 100644 --- a/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb +++ b/haystack/search_based/RAG_with_Couchbase_Capella_and_OpenAI.ipynb @@ -384,7 +384,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, "outputs": [ { @@ -469,7 +469,7 @@ " except Exception as e:\n", " print(f\"Search Vector Index '{search_index_name}' does not exist at scope level. Creating index from search_vector_index.json...\")\n", " with open('search_vector_index.json', 'r') as search_file:\n", - " search_index_definition = SearchIndex.from_json(json.load(search_file))\n", + " search_index_definition = SearchIndex.from_json(search_file.read())\n", " scope_search_manager.upsert_index(search_index_definition)\n", " print(f\"Search Vector Index '{search_index_name}' created successfully at scope level.\")" ]
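For reviewers: the Hyperscale index cell changed in this patch builds its DDL by interpolating an options dict via `json.dumps`. The statement assembly can be sketched standalone as below (the index and collection names are illustrative, not the notebook's actual variables):

```python
import json

def build_vector_index_ddl(index_name: str, collection: str,
                           dimension: int = 3072, similarity: str = "L2") -> str:
    """Assemble the CREATE VECTOR INDEX statement used by the notebook cell."""
    options = {
        "dimension": dimension,    # text-embedding-3-large produces 3072-dim vectors
        "similarity": similarity,  # L2 distance, as set in the patch
    }
    return (
        f"CREATE VECTOR INDEX {index_name} "
        f"ON {collection} (embedding VECTOR) "
        f"WITH {json.dumps(options)}"
    )

# Hypothetical names; the notebook interpolates hyperscale_index_name / COLLECTION_NAME.
print(build_vector_index_ddl("news_vector_idx", "news_articles"))
```

Separating the string assembly this way makes it easy to eyeball the generated SQL++ before handing it to `scope.query(...)` with a long `QueryOptions` timeout, as the cell does.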