2 changes: 1 addition & 1 deletion test/test-markdown-frontmatter.js
@@ -6,7 +6,7 @@ const chalk = require('chalk')
// accepted data field values
const sdk_languages = ['nodejs', 'scala', 'python', 'swift', 'csharp', 'objective-c', 'android-java', 'any', 'java', 'kotlin', 'dart', 'golang', 'c++']

const tags = ['Ottoman', 'Ktor', 'REST API', 'Express', 'Flask', 'TLS', 'Configuration', 'Next.js', 'iOS', 'Xcode', '.NET', 'Xamarin', 'Authentication', 'OpenID', 'Keycloak', 'Android', 'P2P', 'UIKit', 'Installation', 'Spring Boot', 'Spring Data', 'Transactions', 'SQL++ (N1QL)', 'Optimization', 'Community Edition', 'Docker', 'Data Modeling', 'Metadata', 'Best Practices', 'Data Ingestion', 'Kafka', 'Support', 'Customer', 'Prometheus', 'Monitoring', 'Observability', 'Metrics', 'Query Workbench', 'ASP.NET', 'linq', 'DBaaS', 'App Services', 'Flutter', 'Gin Gonic', 'FastAPI', 'LangChain', "OpenAI", "Streamlit", 'Google Gemini', 'Nvidia NIM', 'LLama3', 'AWS', 'Artificial Intelligence', 'Cohere', 'Jina AI', 'Mistral AI', 'Ragas', 'Haystack', 'LangGraph', 'Amazon Bedrock', 'CrewAI', 'PydanticAI', 'C++', 'C++ SDK', 'smolagents', 'Ag2', 'Autogen', 'Couchbase Edge Server', 'Deepseek', 'OpenRouter', 'mastra', 'Looker Studio', 'Google Data Studio', 'Connector', 'Couchbase Columnar', 'TAVs', 'Custom Queries', 'Data API', 'FTS', 'GSI', 'Hugging Face', 'LlamaIndex', 'GraphQL', 'Semantic Kernel']
const tags = ['Ottoman', 'Ktor', 'REST API', 'Express', 'Flask', 'TLS', 'Configuration', 'Next.js', 'iOS', 'Xcode', '.NET', 'Xamarin', 'Authentication', 'OpenID', 'Keycloak', 'Android', 'P2P', 'UIKit', 'Installation', 'Spring Boot', 'Spring Data', 'Transactions', 'SQL++ (N1QL)', 'Optimization', 'Community Edition', 'Docker', 'Data Modeling', 'Metadata', 'Best Practices', 'Data Ingestion', 'Kafka', 'Support', 'Customer', 'Prometheus', 'Monitoring', 'Observability', 'Metrics', 'Query Workbench', 'ASP.NET', 'linq', 'DBaaS', 'App Services', 'Flutter', 'Gin Gonic', 'FastAPI', 'LangChain', "OpenAI", "Streamlit", 'Google Gemini', 'Nvidia NIM', 'LLama3', 'AWS', 'Artificial Intelligence', 'Cohere', 'Jina AI', 'Mistral AI', 'Ragas', 'Haystack', 'LangGraph', 'Amazon Bedrock', 'CrewAI', 'PydanticAI', 'C++', 'C++ SDK', 'smolagents', 'Ag2', 'Autogen', 'Couchbase Edge Server', 'Deepseek', 'OpenRouter', 'mastra', 'Looker Studio', 'Google Data Studio', 'Connector', 'Couchbase Columnar', 'TAVs', 'Custom Queries', 'Data API', 'FTS', 'GSI', 'Hugging Face', 'LlamaIndex', 'GraphQL', 'Search Vector Index']

const technologies = ['connectors', 'kv', 'query', 'capella', 'server', 'index', 'mobile', 'fts', 'sync gateway', 'eventing', 'analytics', 'udf', 'vector search', 'react', 'edge-server', 'app-services', 'hyperscale vector index', 'composite vector index']

@@ -1,31 +1,34 @@
---
# frontmatter
path: "/tutorial-python-langchain-pdf-chat"
path: "/tutorial-python-langchain-pdf-chat-with-search-vector-index"
# title and description do not need to be added to markdown, start with H2 (##)
title: Build PDF Chat App With Couchbase Python SDK and LangChain
title: Build PDF Chat App with LangChain and Couchbase Search Vector Index
short_title: Build PDF Chat App
description:
- Construct a PDF Chat App with LangChain, Couchbase Python SDK, Couchbase Vector Search, and Streamlit.
- Learn to upload PDFs into Couchbase Vector Store with LangChain.
- Discover how to use RAG’s for context-based Q&A’s from PDFs with LLMs.
- Construct a PDF Chat App with LangChain, Couchbase Python SDK, Search Vector Index, and Streamlit.
- Learn to upload PDFs into Couchbase Search Vector Store with LangChain.
- Discover how to use RAG for context-based Q&A from PDFs with LLMs.
content_type: tutorial
filter: sdk
technology:
- fts
- vector search
- kv
tags:
- Streamlit
- LangChain
- OpenAI
- Artificial Intelligence
- Search Vector Index
sdk_language:
- python
length: 45 Mins
---

## Introduction

Welcome to this comprehensive guide on constructing an AI-enhanced Chat Application. We will create a dynamic chat interface capable of delving into PDF documents to extract and provide summaries, key facts, and answers to your queries. By the end of this tutorial, you’ll have a powerful tool at your disposal, transforming the way you interact with and utilize the information contained within PDFs.
Welcome to this comprehensive guide on constructing an AI-enhanced Chat Application. We will create a dynamic chat interface capable of delving into PDF documents to extract and provide summaries, key facts, and answers to your queries. By the end of this tutorial, you'll have a powerful tool at your disposal, transforming the way you interact with and utilize the information contained within PDFs.

**This tutorial uses Search Vector Index** with Couchbase's Search service (formerly known as Full Text Search). If you are looking for Vector Search using Query Service with Hyperscale/Composite Vector Indexes, refer to [this tutorial](https://developer.couchbase.com/tutorial-python-langchain-pdf-chat-with-hyperscale-or-composite-vector-index/) instead.

This tutorial will demonstrate how to -

@@ -80,9 +83,9 @@ Specifically, you need to do the following:
- For the purposes of this tutorial, we will be using a specific bucket, scope, and collection. However, you may use any names of your choice; just make sure to update the names in all the steps.
- Create a bucket named `pdf-chat`. We will use the `_default` scope and `_default` collection of this bucket.

### Create the Search Index on Full Text Service
### Create the Search Vector Index

We need to create the Search Index on the Full Text Service in Couchbase. For this demo, you can import the following index using the instructions.
We need to create the Search Vector Index in Couchbase. For this demo, you can import the following index using the instructions.

- [Couchbase Capella](https://docs.couchbase.com/cloud/search/import-search-index.html)
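
If you prefer to create the index programmatically rather than importing it through the UI, the following is a minimal sketch using the Couchbase Python SDK's search index manager. It is not part of the tutorial's own steps: the connection string, credentials, and the `pdf_search_index.json` file name are assumptions you would replace with your exported index definition and cluster details.

```python
# Sketch (assumptions noted): upsert an exported Search Vector Index definition
# with the Couchbase Python SDK instead of importing it through the UI.
import json

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.management.search import SearchIndex
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbases://<your-capella-endpoint>",  # assumption: your connection string
    ClusterOptions(PasswordAuthenticator("<username>", "<password>")),
)

# Load the index definition exported as JSON (hypothetical file name).
with open("pdf_search_index.json") as f:
    index_def = json.load(f)

cluster.search_indexes().upsert_index(
    SearchIndex(
        name=index_def["name"],
        source_name=index_def["sourceName"],  # the source bucket, e.g. "pdf-chat"
        params=index_def.get("params", {}),
    )
)
```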

@@ -207,7 +210,7 @@ LOGIN_PASSWORD = "<password to access the streamlit app>"

### Running the Application

After starting Couchbase server, adding vector index and installing dependencies. Our Application is ready to run.
After starting the Couchbase server, adding the search vector index, and installing the dependencies, our application is ready to run.

In the project's root directory, run the following command:

@@ -271,12 +274,12 @@ LangChain is a powerful library that simplifies the process of building applicat

In the PDF Chat app, LangChain is used for several tasks:

- **Loading and processing PDF documents**: LangChain's [_PDFLoader_](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf/) is used to load the PDF files and convert them into text documents.
- **Text splitting**: LangChain's [_RecursiveCharacterTextSplitter_](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/) is used to split the text from the PDF documents into smaller chunks or passages, which are more suitable for embedding and retrieval.
- **Embedding generation**: LangChain integrates with [various embedding models](https://python.langchain.com/docs/modules/data_connection/text_embedding/), such as OpenAI's embeddings, to convert the text chunks into embeddings.
- **Vector store integration**: LangChain provides a [_CouchbaseSearchVectorStore_](https://python.langchain.com/docs/integrations/vectorstores/couchbase/) class that seamlessly integrates with Couchbase's Vector Search, allowing the app to store and search through the embeddings and their corresponding text.
- **Chains**: LangChain provides various [chains](https://python.langchain.com/docs/modules/chains/) for different requirements. For using RAG concept, we require _Retrieval Chain_ for Retrieval and _Question Answering Chain_ for Generation part. We also add _Prompts_ that guide the language model's behavior and output. These all are combined to form a single chain which gives output from user questions.
- **Streaming Output**: LangChain supports [streaming](https://python.langchain.com/docs/expression_language/streaming/), allowing the app to stream the generated answer to the client in real-time.
- **Loading and processing PDF documents**: LangChain's [_PDFLoader_](https://docs.langchain.com/oss/python/integrations/document_loaders) is used to load the PDF files and convert them into text documents.
- **Text splitting**: LangChain's [_RecursiveCharacterTextSplitter_](https://docs.langchain.com/oss/python/integrations/splitters) is used to split the text from the PDF documents into smaller chunks or passages, which are more suitable for embedding and retrieval.
- **Embedding generation**: LangChain integrates with [various embedding models](https://docs.langchain.com/oss/python/integrations/text_embedding), such as OpenAI's embeddings, to convert the text chunks into embeddings.
- **Vector store integration**: LangChain provides a [_CouchbaseSearchVectorStore_](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) class that seamlessly integrates with Couchbase's Vector Search, allowing the app to store and search through the embeddings and their corresponding text.
**Contributor comment (medium):**

To maintain consistency with the new terminology introduced in this pull request (e.g., in the title and introduction), it would be clearer to use Search Vector Index here instead of the more generic Vector Search.

Suggested change:

- **Vector store integration**: LangChain provides a [_CouchbaseSearchVectorStore_](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) class that seamlessly integrates with Couchbase's Vector Search, allowing the app to store and search through the embeddings and their corresponding text.
- **Vector store integration**: LangChain provides a [_CouchbaseSearchVectorStore_](https://couchbase-ecosystem.github.io/langchain-couchbase/langchain_couchbase.html#couchbase-search-vector-store) class that seamlessly integrates with Couchbase's Search Vector Index, allowing the app to store and search through the embeddings and their corresponding text.

- **Chains**: LangChain provides various [chains](https://api.python.langchain.com/en/latest/langchain/chains.html) for different requirements. For using RAG concept, we require _Retrieval Chain_ for Retrieval and _Question Answering Chain_ for Generation part. We also add _Prompts_ that guide the language model's behavior and output. These all are combined to form a single chain which gives output from user questions.
- **Streaming Output**: LangChain supports [streaming](https://docs.langchain.com/oss/python/langchain/streaming), allowing the app to stream the generated answer to the client in real-time.

By combining Vector Search with Couchbase, RAG, and LangChain; the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models. This approach provides users with a powerful and intuitive way to explore and interact with large PDF files.
**Contributor comment (medium):**

For consistency with the updated terminology used throughout this tutorial, consider replacing Vector Search with Search Vector Index here.

Suggested change:

By combining Vector Search with Couchbase, RAG, and LangChain; the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models. This approach provides users with a powerful and intuitive way to explore and interact with large PDF files.
By combining Search Vector Index with Couchbase, RAG, and LangChain; the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models. This approach provides users with a powerful and intuitive way to explore and interact with large PDF files.
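
To make the vector store piece above concrete, here is a minimal sketch of how it can be wired up with the `langchain-couchbase` package (documented at the link in the bullet list). The bucket, scope, collection, index name, and the choice of OpenAI embeddings are assumptions for illustration, not necessarily the tutorial's exact code.

```python
# Sketch (assumptions noted): connect to Couchbase and create a
# CouchbaseSearchVectorStore backed by the Search Vector Index.
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore
from langchain_openai import OpenAIEmbeddings

cluster = Cluster(
    "couchbases://<your-capella-endpoint>",  # assumption: your connection string
    ClusterOptions(PasswordAuthenticator("<username>", "<password>")),
)
cluster.wait_until_ready(timedelta(seconds=5))

vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name="pdf-chat",        # bucket from the setup step
    scope_name="_default",
    collection_name="_default",
    embedding=OpenAIEmbeddings(),  # assumption: any LangChain embedding model works
    index_name="pdf_search",       # assumption: name of the imported Search Vector Index
)
```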


@@ -390,7 +393,7 @@ with st.form("upload pdf"):

This function ensures that the uploaded PDF file is properly handled, loaded, and prepared for storage or processing in the vector store. It first checks whether a file was actually uploaded. The uploaded file is then saved to a temporary file in `binary` format.

From the temporary file, PDF is loaded in [PyPDFLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf/) from the LangChain library which loads the PDF into [LangChain Document](https://python.langchain.com/docs/modules/data_connection/document_loaders/) Format
From the temporary file, the PDF is loaded using [PyPDFLoader](https://reference.langchain.com/python/langchain_core/document_loaders/) from the LangChain library, which converts it into the [LangChain Document](https://reference.langchain.com/python/langchain_core/document_loaders/) format.

```python
def save_to_vector_store(uploaded_file, vector_store):
@@ -407,7 +410,7 @@ def save_to_vector_store(uploaded_file, vector_store):

### Split Documents

This LangChain document array will contain huge individual files which defeats the purpose while retrieval as we want to send more relevant context to LLM. So we will split it into smaller chunks or passages using LangChain's [_RecursiveCharacterTextSplitter_](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/):
This LangChain document array will contain large individual documents, which defeats the purpose of retrieval, as we want to send more focused, relevant context to the LLM. So we will split it into smaller chunks or passages using LangChain's [_RecursiveCharacterTextSplitter_](https://docs.langchain.com/oss/python/integrations/splitters):

- chunk_size: 1500: This parameter specifies that each chunk should contain approximately 1500 characters.
- chunk_overlap: 150: This parameter ensures that there is an overlap of 150 characters between consecutive chunks. This overlap helps maintain context and prevent important information from being split across chunk boundaries.
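
As a rough illustration of these two steps together (loading the temporary PDF and splitting it), a sketch along these lines is possible; the temporary file path and variable names are assumptions rather than the tutorial's exact code.

```python
# Sketch: load a PDF into LangChain Documents and split it into overlapping chunks.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("/tmp/uploaded.pdf")   # assumption: path of the temporary file
documents = loader.load()                   # one Document per page

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # ~1500 characters per chunk
    chunk_overlap=150,  # 150-character overlap to preserve context across chunks
)
chunks = splitter.split_documents(documents)

# The chunks can then be written to the vector store, e.g.:
# vector_store.add_documents(chunks)
```
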
@@ -436,7 +439,7 @@ After uploading the PDF into Couchbase, we are now ready to utilize the power of

### LangChain Expression Language (LCEL)

We will now utilize the power of LangChain Chains using the [LangChain Expression Language](https://python.langchain.com/docs/expression_language/) (LCEL). LCEL makes it easy to build complex chains from basic components, and supports out of the box functionality such as streaming, parallelism, and logging.
We will now utilize the power of LangChain Chains using the LangChain Expression Language (LCEL). LCEL makes it easy to build complex chains from basic components, and supports out of the box functionality such as streaming, parallelism, and logging.

LCEL is a domain-specific language that provides several key advantages when working with LangChain:

@@ -450,15 +453,15 @@ We will be using LCEL chains in next few sections and will see how LCEL optimize

### Create Retriever Chain

We also create the [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore) of the couchbase vector store. This retriever will be used to retrieve the previously added documents which are similar to current query.
We also create the [retriever](https://docs.langchain.com/oss/python/integrations/retrievers) of the couchbase vector store. This retriever will be used to retrieve the previously added documents which are similar to current query.
**Contributor comment (medium):**

This sentence could be improved for clarity and to follow brand capitalization guidelines. couchbase should be capitalized to Couchbase. Additionally, the phrasing could be more direct and natural.

Suggested change:

We also create the [retriever](https://docs.langchain.com/oss/python/integrations/retrievers) of the couchbase vector store. This retriever will be used to retrieve the previously added documents which are similar to current query.
We also create a [retriever](https://docs.langchain.com/oss/python/integrations/retrievers) for the Couchbase vector store. This retriever is used to retrieve previously added documents that are similar to the current query.


```python
retriever = vector_store.as_retriever()
```
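
Before walking through the prompt and chain sections below, here is a hedged sketch of how such a retriever typically ends up composed in an LCEL pipeline. The prompt wording and the model name are assumptions for illustration only; the tutorial builds its actual chain step by step in the following sections.

```python
# Sketch of an LCEL RAG pipeline built around the retriever (illustrative only).
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI


def format_docs(docs):
    # Join the retrieved document chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)


prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)  # assumption: any chat model works here

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# chain.invoke("What is this PDF about?") returns a grounded answer;
# chain.stream(...) yields the same answer incrementally.
```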

### Prompt Chain

A prompt for a language model is a set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant and coherent language-based output, such as answering questions, completing sentences, or engaging in a conversation. We will use a template and create a [prompt chain](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/) using [_ChatPromptTemplate_](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/#chatprompttemplate) Class of LangChain
A prompt for a language model is a set of instructions or input provided by a user to guide the model's response, helping it understand the context and generate relevant and coherent language-based output, such as answering questions, completing sentences, or engaging in a conversation. We will use a template and create a prompt chain using the [_ChatPromptTemplate_](https://python.langchain.com/docs/modules/model_io/prompts/quick_start/#chatprompttemplate) class from LangChain.

```python
template = """You are a helpful bot. If you cannot answer based on the context provided, respond with a generic answer. Answer the question as truthfully as possible using the context below:
@@ -525,7 +528,7 @@ This section creates an interactive chat interface where users can ask questions
- Add the user's question to the chat history.
- Create a placeholder for streaming the assistant's response.
- Use the chain.stream(question) method to generate the response from the RAG chain.
- [Stream](https://python.langchain.com/docs/use_cases/question_answering/streaming/) the response in real-time by updating the placeholder with each response chunk.
- [Stream](https://docs.langchain.com/oss/python/langchain/streaming) the response in real-time by updating the placeholder with each response chunk.
- Add the final assistant's response to the chat history.

This setup allows users to have a conversational experience, asking questions related to the uploaded PDF, with responses generated by the RAG chain and streamed in real-time. Both the user's questions and the assistant's responses are displayed in the chat interface, along with their respective roles and avatars.
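
For readers who want to see the streaming step in isolation, a minimal Streamlit sketch of the pattern described above might look like the following. The widget labels, session-state key, and the assumption that `chain` is the RAG chain built earlier are illustrative, not the app's real code.

```python
# Sketch: stream a RAG answer into the Streamlit chat UI chunk by chunk.
import streamlit as st

st.session_state.setdefault("messages", [])  # chat history across reruns

if question := st.chat_input("Ask a question about the PDF"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)

    with st.chat_message("assistant"):
        placeholder = st.empty()  # placeholder overwritten as chunks arrive
        answer = ""
        for chunk in chain.stream(question):  # assumption: `chain` is the RAG chain
            answer += chunk
            placeholder.markdown(answer)

    st.session_state.messages.append({"role": "assistant", "content": answer})
```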