3. HyDE (Hypothetical Document Embeddings)
Instead of generating alternative queries from the original question, HyDE generates hypothetical documents for a given query. The intuition is that the embedding of such a hypothetical document lands in a neighborhood of the corpus embedding space where similar real documents live, so those documents can be retrieved by vector similarity. RAG can then use these more relevant documents to answer the user query accurately.
Let's try to use HyDE to answer questions through RAG!
%load_ext dotenv
%dotenv secrets/secrets.env
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
First, as in the previous notebooks, we create our vector store and initialize the retriever using OpenAIEmbeddings and Chroma.
loader = DirectoryLoader('data/',glob="*.pdf",loader_cls=PyPDFLoader)
documents = loader.load()
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=text_chunks,
embedding=OpenAIEmbeddings(),
persist_directory="data/vectorstore")
vectorstore.persist()
retriever = vectorstore.as_retriever(search_kwargs={'k':5})
/Users/sakunaharinda/Documents/Repositories/ragatouille/venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
warn_deprecated(
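Before bringing HyDE in, we can optionally keep a baseline of what the retriever returns for the raw question, so that we can compare it later with the documents retrieved via the hypothetical passage. A quick sanity check along these lines might look like this (not part of the original flow; it assumes the retriever exposes the standard Runnable invoke interface):
# Optional baseline: retrieve directly with the raw question so we can compare
# the results with the HyDE-based retrieval further down.
baseline_docs = retriever.invoke("How Low Rank Adapters work in LLMs?")
for doc in baseline_docs:
    print(doc.metadata['source'], doc.page_content[:80])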
Then we ask the LLM to write a "hypothetical" passage on the asked question through a chain.
from langchain.prompts import ChatPromptTemplate
hyde_prompt = ChatPromptTemplate.from_template(
"""
Please write a scientific passage of a paper to answer the following question:\n
Question: {question}\n
Passage:
"""
)
generate_doc_chain = (
{'question': RunnablePassthrough()}
| hyde_prompt
| ChatOpenAI(model='gpt-4',temperature=0)
| StrOutputParser()
)
question = "How Low Rank Adapters work in LLMs?"
generate_doc_chain.invoke(question)
"Low Rank Adapters (LRAs) are a recent development in the field of Large Language Models (LLMs) that aim to reduce the computational and memory requirements of these models while maintaining their performance. The fundamental principle behind LRAs is the use of low-rank approximations to reduce the dimensionality of the model's parameters.\n\nIn the context of LLMs, an adapter is a small neural network that is inserted between the layers of a pre-trained model. The purpose of this adapter is to adapt the pre-trained model to a new task without modifying the original parameters of the model. This allows for efficient transfer learning, as the computational cost of training the adapter is significantly less than retraining the entire model.\n\nLow Rank Adapters take this concept a step further by applying a low-rank approximation to the adapter's parameters. This is achieved by decomposing the weight matrix of the adapter into two smaller matrices, effectively reducing the number of parameters that need to be stored and computed. This decomposition is typically achieved using methods such as singular value decomposition (SVD) or principal component analysis (PCA).\n\nThe use of low-rank approximations in LRAs allows for a significant reduction in the computational and memory requirements of LLMs. Despite this reduction, LRAs are able to maintain a high level of performance, as the low-rank approximation captures the most important features of the data. This makes LRAs an effective tool for adapting pre-trained LLMs to new tasks in a computationally efficient manner."
Using the generated passage, we then retrieve the similar documents using our retriever.
retrieval_chain = generate_doc_chain | retriever
retrieved_docs = retrieval_chain.invoke({"question": question})
retrieved_docs
[Document(page_content='over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the\nchange in weights during model adaptation also has a low "intrinsic rank", leading to our proposed\nLow-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural\nnetwork indirectly by optimizing rank decomposition matrices of the dense layers' change during\nadaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3', metadata={'page': 1, 'source': 'data/LoRA.pdf'}),
Document(page_content='over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the\nchange in weights during model adaptation also has a low "intrinsic rank", leading to our proposed\nLow-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural\nnetwork indirectly by optimizing rank decomposition matrices of the dense layers' change during\nadaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3', metadata={'page': 1, 'source': 'data/LoRA.pdf'}),
Document(page_content='over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the\nchange in weights during model adaptation also has a low "intrinsic rank", leading to our proposed\nLow-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural\nnetwork indirectly by optimizing rank decomposition matrices of the dense layers' change during\nadaptation instead, while keeping the pre-trained weights frozen, as shown in Figure 1. Using GPT-3', metadata={'page': 1, 'source': 'data/LoRA.pdf'}),
Document(page_content='requirements by using a small set of trainable parameters, often termed adapters, while not updating\nthe full model parameters which remain fixed. Gradients during stochastic gradient descent are\npassed through the fixed pretrained model weights to the adapter, which is updated to optimize the\nloss function. LoRA augments a linear projection through an additional factorized projection. Given\na projection XW = Y with X ∈ R^{b×h}, W ∈ R^{h×o}, LoRA computes:\nY = XW + sXL_1L_2, (3)', metadata={'page': 2, 'source': 'data/QLoRA.pdf'}),
Document(page_content='requirements by using a small set of trainable parameters, often termed adapters, while not updating\nthe full model parameters which remain fixed. Gradients during stochastic gradient descent are\npassed through the fixed pretrained model weights to the adapter, which is updated to optimize the\nloss function. LoRA augments a linear projection through an additional factorized projection. Given\na projection XW = Y with X ∈ R^{b×h}, W ∈ R^{h×o}, LoRA computes:\nY = XW + sXL_1L_2, (3)', metadata={'page': 2, 'source': 'data/QLoRA.pdf'})]
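As an aside, the original HyDE paper goes a step further and samples several hypothetical passages, averaging their embeddings before searching, which smooths out any single hallucinated passage. A minimal sketch of that variant, assuming numpy is available and using a temperature above 0 so the sampled passages actually differ:
import numpy as np

# Multi-passage HyDE variant (sketch): sample several hypothetical passages,
# embed each one, average the vectors, and query the store with the mean vector.
embeddings = OpenAIEmbeddings()

sampling_doc_chain = (
    {'question': RunnablePassthrough()}
    | hyde_prompt
    | ChatOpenAI(model='gpt-4', temperature=0.7)  # temperature > 0 for diverse passages
    | StrOutputParser()
)

hypothetical_passages = [sampling_doc_chain.invoke(question) for _ in range(3)]
passage_vectors = embeddings.embed_documents(hypothetical_passages)
mean_vector = np.mean(passage_vectors, axis=0).tolist()

averaged_docs = vectorstore.similarity_search_by_vector(mean_vector, k=5)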
Finally, the documents retrieved based on the "hypothetical" passage are used as the context to answer our original question through the final_rag_chain.
template = """Answer the following question based on the provided context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
final_rag_chain = (
prompt
| ChatOpenAI(model='gpt-4',temperature=0)
| StrOutputParser()
)
final_rag_chain.invoke({"context": retrieved_docs, "question": question})
"Low-Rank Adapters (LoRA) work in large language models (LLMs) by allowing the training of some dense layers in a neural network indirectly. This is done by optimizing rank decomposition matrices of the dense layers' change during adaptation, while keeping the pre-trained weights frozen. LoRA also augments a linear projection through an additional factorized projection. During stochastic gradient descent, gradients are passed through the fixed pre-trained model weights to the adapter, which is updated to optimize the loss function."
Even though this technique can help answer questions, the answer may still be wrong if the documents are retrieved based on an incorrect or hallucinated hypothetical passage.
In the next section, we talk about "Routing" in RAG.