1. Introduction to RAG with LangChain#

In this section we will create a simple question-answering (QA) RAG pipeline with OpenAI and LangChain. In the following notebooks we dive deeper into the intricacies of the RAG pipeline.

%load_ext dotenv
%dotenv secrets/secrets.env
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

1.1. Process the document and build the vector store using ChromaDB#

First, we load all the PDF documents using PyPDFLoader through DirectoryLoader.load(). After loading, we have to generate embeddings for the text so that it can be compared with the question when selecting the chunks that provide relevant context. To do that, we first split each document into chunks using the RecursiveCharacterTextSplitter. Then we represent each chunk with OpenAIEmbeddings, which by default uses the text-embedding-ada-002 model. Once the embedding vector for each chunk is generated, it is stored in a database (here a local ChromaDB) called the vectorstore.

# Load every PDF in data/ with PyPDFLoader via the DirectoryLoader
loader = DirectoryLoader('data/', glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Split text into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
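
As a quick sanity check (not part of the original notebook), you can compare how many pages were loaded with how many chunks the splitter produced; the exact numbers depend on your PDFs.

# Hypothetical sanity check: compare page count with chunk count
print(f"Loaded {len(documents)} pages, split into {len(text_chunks)} chunks")
print(text_chunks[0].page_content[:200])  # peek at the first chunk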

# Embed each chunk and store the vectors in a local ChromaDB collection
vectorstore = Chroma.from_documents(documents=text_chunks,
                                    embedding=OpenAIEmbeddings(),
                                    persist_directory="data/vectorstore")
# Note: since Chroma 0.4.x documents are persisted automatically; this call
# only triggers the deprecation warning shown below.
vectorstore.persist()
/Users/sakunaharinda/Documents/Repositories/ragatouille/venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
  warn_deprecated(
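
Because the collection is persisted under data/vectorstore, you can reload it in a later session instead of rebuilding it; a minimal sketch, assuming the same embedding model is used:

# Sketch: reload the persisted collection (assumes the same embedding model)
vectorstore = Chroma(persist_directory="data/vectorstore",
                     embedding_function=OpenAIEmbeddings())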

We can test the vector store by calling its similarity_search method with a query as below. As you can see, we retrieve a list of four documents related to the question. Note that each document has several fields, namely page_content and metadata.

vectorstore.similarity_search("What is QLoRA?")
[Document(page_content='Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA', metadata={'page': 0, 'source': 'data/QLoRA.pdf'}),
 Document(page_content='Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA', metadata={'page': 0, 'source': 'data/QLoRA.pdf'}),
 Document(page_content='Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA', metadata={'page': 0, 'source': 'data/QLoRA.pdf'}),
 Document(page_content='Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA', metadata={'page': 0, 'source': 'data/QLoRA.pdf'})]
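
Each returned Document exposes the fields mentioned above; as a small illustration (not from the original notebook), you can print them directly:

# Illustration: inspect the page_content and metadata of each retrieved Document
for doc in vectorstore.similarity_search("What is QLoRA?"):
    print(doc.metadata["source"], "- page", doc.metadata["page"])
    print(doc.page_content[:120], "...")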

1.2. Retriever initialization#

We initialize the retriever from the created vectorstore so that it retrieves the top 5 most similar chunks for a given question. It uses the similarity score between the question embedding and the chunk embeddings stored in the vector store to identify the most suitable documents to use as context.

retriever = vectorstore.as_retriever(search_kwargs={'k':5})
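
In recent LangChain versions the retriever is itself a Runnable, so (as a quick check, not part of the original notebook) you can invoke it directly and confirm that five chunks come back:

# Quick check: the retriever should return the top 5 chunks for the query
retrieved_docs = retriever.invoke("What is QLoRA?")
print(len(retrieved_docs))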

Next, we pull an already existing prompt from the LangChain Hub; it contains the {context} and {question} placeholders that will be filled in and provided to the LLM.

prompt = hub.pull('rlm/rag-prompt')
prompt
ChatPromptTemplate(input_variables=['context', 'question'], metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])
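
To see exactly what will be sent to the model, you can render the prompt with dummy values for its placeholders (purely illustrative):

# Illustrative: fill the prompt placeholders with dummy values
rendered = prompt.invoke({"context": "QLoRA is a finetuning method.",
                          "question": "What is QLoRA?"})
print(rendered.to_messages()[0].content)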

1.3. Creating the QA chain#

The chain is created using LCEL (LangChain Expression Language). First we define the LLM that will answer the question using the provided context. Here we use the GPT-4 model with temperature 0, which corresponds to greedy decoding; if you want more creative answers, you can increase the temperature. Instead of passing the retrieved Document objects directly as the context, we use the format_docs function to extract the page_content of each retrieved document (without its metadata) and concatenate the results into a single context string. Finally, we define our QA chain: it assigns the formatted retriever output to the context and the user input to the question, using RunnablePassthrough to pass the question unchanged to the next step. The context and question are then filled into the prompt placeholders, the prompt is sent to the LLM, and a StrOutputParser turns the model output into a plain, readable string.

# LLM
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain using LangChain Expression Language
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Finally, we can invoke the chain with a query!

rag_chain.invoke("What is QLoRA?")
'QLoRA is a tool used to finetune models, specifically for tasks such as instruction following and chatbot performance. It is used across multiple model types and scales, including those that would be infeasible to run with regular finetuning. It has been used to finetune more than 1,000 models.'
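
If you also want to see which chunks the answer was based on, a common LCEL pattern (sketched below with the same retriever, prompt, and llm) wraps the chain in a RunnableParallel so the retrieved documents are returned alongside the answer:

from langchain_core.runnables import RunnableParallel

# Sketch: return the retrieved documents together with the generated answer
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

result = rag_chain_with_source.invoke("What is QLoRA?")
print(result["answer"])
print([doc.metadata for doc in result["context"]])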

One very nice feature that LangChain provides is LangSmith, which allows you to visualize your entire chain through a nice GUI. To use LangSmith you have to add two environment variables to your .env file (or set them from within the notebook, as sketched after the list below).

  • LANGCHAIN_TRACING_V2=true

  • LANGCHAIN_API_KEY=<API Key>
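
Alternatively (a sketch, not part of the original notebook), you can set the same variables from within the notebook before running the chain:

import os

# Sketch: enable LangSmith tracing from inside the notebook
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<API Key>"  # replace with your LangSmith API key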

Then after running all the above steps, you can navigate to LangSmith and see something like this.

That’s it for a quick start to RAG with LangChain. Next we will dive deeper into “Query Translation”, which is helpful when dealing with ambiguous user queries.