2. Query Transformation#
The main idea behind Query Transformation is to translate/transform the user query in a way that allows the LLM to correctly answer the question. For instance, if the user asks an ambiguous question, our RAG retriever might retrieve incorrect (or ambiguous) documents based on embeddings that are not very relevant to the user's question, leading the LLM to hallucinate answers. There are a few ways to tackle this problem. Some of them are:
Step-back prompting: This involves encouraging the LLM to take a step back from a given question or problem and pose a more abstract, higher-level question that encompasses the essence of the original inquiry.
Least-to-most prompting: This breaks a complex problem down into a series of simpler subproblems that are then solved in sequence. Solving each subproblem is facilitated by the answers to the previously solved subproblems.
Query re-writing (Multi-Query or RAG Fusion): This generates multiple questions from the original question with different wording and perspectives, then retrieves documents using the similarity scores between each generated question and the vector store to answer the original question.
A blog post about query transformation by Langchain can be found here.
Now, let's try to implement the above techniques using LangChain!
%load_ext dotenv
%dotenv secrets/secrets.env
Similar to the Introduction notebook, we first import the libraries, load documents, split them, generate embeddings, store them in a vector store and create the retriever using the vector store.
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.load import loads, dumps
from typing import List
loader = DirectoryLoader('data/',glob="*.pdf",loader_cls=PyPDFLoader)
documents = loader.load()
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=text_chunks,
embedding=OpenAIEmbeddings(),
persist_directory="data/vectorstore")
vectorstore.persist()
/Users/sakunaharinda/Documents/Repositories/ragatouille/venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: Since Chroma 0.4.x the manual persistence method is no longer supported as docs are automatically persisted.
warn_deprecated(
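Note that, as the warning above indicates, from Chroma 0.4.x onwards documents are persisted automatically when a persist_directory is provided, so the explicit vectorstore.persist() call can safely be omitted.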
retriever = vectorstore.as_retriever(search_kwargs={'k':5})
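As a quick sanity check (not part of the original notebook), we can query the retriever directly; since it is a Runnable, invoke should return the 5 chunks most similar to the query:
# Should return a list of 5 Document objects relevant to the query
retriever.invoke("What is QLoRA?")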
2.1. Query Translation#
2.1.1. Multi-Query#
In the multi-query approach, we first use an LLM (here, an instance of GPT-4) to generate 5 different questions based on our original question. To do that, we create a prompt and encapsulate it with the ChatPromptTemplate. Then we create a chain using LCEL that reads the user input and assigns it to the question placeholder of the prompt, sends the prompt to the LLM, and parses the output containing the 5 questions separated by newline characters.
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
"""
You are an intelligent assistant. Your task is to generate 5 questions based on the provided question in different wording and different perspectives to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question: {question}
"""
)
generate_queries = (
{"question": RunnablePassthrough()}
| prompt
| ChatOpenAI(model='gpt-4', temperature=0.7)
| StrOutputParser()
| (lambda x: x.split("\n"))
)
We can check whether our query generation works by invoking the created chain with a query.
generate_queries.invoke("What are the benefits of QLoRA?")
['1. Can you list the advantages of using QLoRA?',
'2. What positive outcomes can be expected from using QLoRA?',
'3. In what ways is QLoRA beneficial?',
'4. How can QLoRA be advantageous?',
'5. What are the positive impacts of using QLoRA?']
Once we get the 5 questions, we retrieve the 5 most relevant documents for each question in parallel (resulting in a list of lists) and create a new document list by taking the unique documents from the union of all the retrieved documents. To do that, we create another chain, retrieval_chain, using LCEL.
def get_context_union(docs: List[List]):
    # Serialize each Document with dumps so they can be de-duplicated via a set,
    # then restore them with loads before reading the page contents
    all_docs = [dumps(d) for doc in docs for d in doc]
    unique_docs = list(set(all_docs))
    return [loads(doc).page_content for doc in unique_docs]  # We only return page contents
retrieval_chain = (
{'question': RunnablePassthrough()}
| generate_queries
| retriever.map()
| get_context_union
)
retrieval_chain.invoke("What are the benefits of QLoRA?")
/Users/sakunaharinda/Documents/Repositories/ragatouille/venv/lib/python3.12/site-packages/langchain_core/_api/beta_decorator.py:87: LangChainBetaWarning: The function `loads` is in beta. It is actively being worked on, so the API may change.
warn_beta(
['trade-off exactly lies for QLoRA tuning, which we leave to future work to explore.\nWe proceed to investigate instruction tuning at scales that would be impossible to explore with full\n16-bit finetuning on academic research hardware.\n5 Pushing the Chatbot State-of-the-art with QLoRA\nHaving established that 4-bit QLORAmatches 16-bit performance across scales, tasks, and datasets\nwe conduct an in-depth study of instruction finetuning up to the largest open-source language models',
'technology. QLORAcan be seen as an equalizing factor that helps to close the resource gap between\nlarge corporations and small teams with consumer GPUs.\nAnother potential source of impact is deployment to mobile phones. We believe our QLORAmethod\nmight enable the critical milestone of enabling the finetuning of LLMs on phones and other low\nresource settings. While 7B models were shown to be able to be run on phones before, QLORAis',
'There are many directions for future works. 1) LoRA can be combined with other efficient adapta-\ntion methods, potentially providing orthogonal improvement. 2) The mechanism behind fine-tuning\nor LoRA is far from clear – how are features learned during pre-training transformed to do well\non downstream tasks? We believe that LoRA makes it more tractable to answer this than full fine-\n12',
'Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA',
'All in all, we believe that QLORAwill have a broadly positive impact making the finetuning of high\nquality LLMs much more widely and easily accessible.\nAcknowledgements\nWe thank Aditya Kusupati, Ofir Press, Ashish Sharma, Margaret Li, Raphael Olivier, Zihao Ye, and\nEvangelia Spiliopoulou for their valuable feedback. Our research was facilitated by the advanced\ncomputational, storage, and networking infrastructure of the Hyak supercomputer system at the']
Finally, we put it all together by creating one final chain that reads the user query, gets the context from the documents returned by the retrieval_chain, adds both the question and the context to the prompt, sends it through the LLM, and produces the final formatted output using the StrOutputParser.
prompt = ChatPromptTemplate.from_template(
"""
Answer the given question using the provided context.\n\nContext: {context}\n\nQuestion: {question}
"""
)
multi_query_chain = (
{'context': retrieval_chain, 'question': RunnablePassthrough()}
| prompt
| ChatOpenAI(model='gpt-4', temperature=0)
| StrOutputParser()
)
multi_query_chain.invoke("What are the benefits of QLoRA?")
'The benefits of QLoRA include matching 16-bit performance across scales, tasks, and datasets with only 4-bit. It makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers. This is because it does not need to calculate the gradients or maintain the optimizer states for most parameters, but only optimizes the injected, much smaller low-rank matrices. Its simple linear design allows the merging of the trainable matrices with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model. QLoRA can also be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs. It might enable the finetuning of LLMs on phones and other low resource settings. Lastly, QLoRA can be used to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across multiple model types and scales.'
After executing all the above cells, you will be able to see a LangSmith trace like this.
2.1.2. RAG Fusion#
In the default multi-query approach, after retrieving the relevant documents for each question generated from our original question, we take the union of all the documents and keep only the unique ones (the same document can be retrieved by multiple questions). However, we did not pay attention to the rank of each document in the context, which matters for the LLM to produce the most correct answer: the individual ranks help us decide which top-k documents to select as the context when we have a huge number of documents but a limited LLM context window. Therefore, in RAG Fusion, while we do exactly the same thing up to retrieving the documents, we use Reciprocal Rank Fusion (RRF) to rank each retrieved document before using them as the context to answer our original question.
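To make the ranking concrete: with the default k=60 used in the implementation below and 0-indexed ranks, a document that appears at rank 0 in one result list and rank 2 in another receives a fused score of 1/(60+0) + 1/(60+2) ≈ 0.0328, whereas a document retrieved only once at rank 0 scores 1/60 ≈ 0.0167. Documents retrieved by several of the generated questions therefore rise to the top.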
def rrf(results: List[List], k=60):
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}

    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Return the reranked results as a list of tuples, each containing the document and its fused score
    return reranked_results
The only difference between the code below and the multi-query code we went through earlier is that we now use our rrf method instead of get_context_union to produce the final list of documents related to our original question (i.e., the context).
from langchain.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template(
"""
You are an intelligent assistant. Your task is to generate 4 questions based on the provided question in different wording and different perspectives to retrieve relevant documents from a vector database. By generating multiple perspectives on the user question, your goal is to help the user overcome some of the limitations of the distance-based similarity search. Provide these alternative questions separated by newlines. Original question: {question}
"""
)
generate_queries = (
{"question": RunnablePassthrough()}
| prompt
| ChatOpenAI(model='gpt-4', temperature=0.7)
| StrOutputParser()
| (lambda x: x.split("\n"))
)
fusion_retrieval_chain = (
{'question': RunnablePassthrough()}
| generate_queries
| retriever.map()
| rrf
)
fusion_retrieval_chain.invoke("What are the benefits of QLoRA?")
[(Document(page_content='technology. QLORAcan be seen as an equalizing factor that helps to close the resource gap between\nlarge corporations and small teams with consumer GPUs.\nAnother potential source of impact is deployment to mobile phones. We believe our QLORAmethod\nmight enable the critical milestone of enabling the finetuning of LLMs on phones and other low\nresource settings. While 7B models were shown to be able to be run on phones before, QLORAis', metadata={'page': 15, 'source': 'data/QLoRA.pdf'}),
0.11480532786885246),
(Document(page_content='Quantization to reduce the average memory footprint by quantizing the quantization\nconstants, and (c) Paged Optimizers to manage memory spikes. We use QLORA\nto finetune more than 1,000 models, providing a detailed analysis of instruction\nfollowing and chatbot performance across 8 instruction datasets, multiple model\ntypes (LLaMA, T5), and model scales that would be infeasible to run with regular\nfinetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA', metadata={'page': 0, 'source': 'data/QLoRA.pdf'}),
0.11163114439324116),
(Document(page_content='All in all, we believe that QLORAwill have a broadly positive impact making the finetuning of high\nquality LLMs much more widely and easily accessible.\nAcknowledgements\nWe thank Aditya Kusupati, Ofir Press, Ashish Sharma, Margaret Li, Raphael Olivier, Zihao Ye, and\nEvangelia Spiliopoulou for their valuable feedback. Our research was facilitated by the advanced\ncomputational, storage, and networking infrastructure of the Hyak supercomputer system at the', metadata={'page': 15, 'source': 'data/QLoRA.pdf'}),
0.09631215742069787)]
Here we format the context by keeping only the page contents, without the metadata or re-ranking scores.
def format_context(documents: List):
    # Each item is a (Document, score) tuple returned by rrf; keep only the page contents
    return "\n\n".join([doc[0].page_content for doc in documents])
prompt = ChatPromptTemplate.from_template(
"""
Answer the given question using the provided context.\n\nContext: {context}\n\nQuestion: {question}
"""
)
fusion_chain = (
{'context': fusion_retrieval_chain | format_context, 'question': RunnablePassthrough()}
| prompt
| ChatOpenAI(model='gpt-4', temperature=0)
| StrOutputParser()
)
fusion_chain.invoke("What are the benefits of QLoRA?")
'The benefits of QLoRA include reducing the average memory footprint by quantizing the quantization constants and managing memory spikes. It allows for the finetuning of more than 1,000 models and provides a detailed analysis of instruction following and chatbot performance across multiple datasets and model types. QLoRA can be seen as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs. It might also enable the critical milestone of enabling the finetuning of LLMs on phones and other low resource settings. QLoRA matches 16-bit performance across scales, tasks, and datasets, making the finetuning of high quality LLMs much more widely and easily accessible.'
After executing all the above cells, you will be able to see a LangSmith trace like this.
2.2. Query Decomposition#
In "Query Translation", we focused on generating multiple questions from our original question with different perspectives (i.e., translating the query) to improve RAG. However, the generated questions all have the same meaning despite the different wording, since it is, in fact, a translation. Therefore, the answers to all the questions are somewhat similar. As a result, while the multi-query approach helps avoid ambiguities in the user query by rewriting it in different ways, it will not help when the user query is complex (e.g., a long mathematical computation).
As a solution, we can break down (i.e., decompose) the original query into multiple sub-problems (as in recursion or dynamic programming) and answer each sub-problem sequentially or in parallel to derive the answer to our original query. This simplifies the prompts and increases the context available to the retrieval process. We do that using "Query Decomposition".
2.2.1. Least-to-Most Prompting#
First, let's look at how to implement Least-to-Most Prompting to break down a complex query into sub-questions and answer them recursively to derive the final answer.
Similar to multi-query and RAG fusion, we first generate a few questions based on our original question. However, our prompt should be different, since we are generating sub-questions by decomposing the original one instead of generating the same question with different perspectives.
from langchain.prompts import ChatPromptTemplate
decomposition_prompt = ChatPromptTemplate.from_template(
"""
You are a helpful assistant that can break down complex questions into simpler parts. \n
Your goal is to decompose the given question into multiple sub-questions that can be answered in isolation to answer the main question in the end. \n
Provide these sub-questions separated by the newline character. \n
Original question: {question}\n
Output (3 queries):
"""
)
query_generation_chain = (
{"question": RunnablePassthrough()}
| decomposition_prompt
| ChatOpenAI(model='gpt-4', temperature=0.7)
| StrOutputParser()
| (lambda x: x.split("\n"))
)
questions = query_generation_chain.invoke("What are the benefits of QLoRA?")
questions
['What is QLoRA?',
'What are the features of QLoRA?',
'How do these features of QLoRA provide benefits?']
After generating the sub-questions, we iterate through them and answer them individually using the least_to_most_chain. We first extract the question from the user input using itemgetter and provide it to our retriever to retrieve the related documents as the context. The q_a_pairs are also provided as part of the user input. Then we populate our prompt and send it to the LLM to get the answer. Each time, we store the sub-question Q_{n-1} and its answer A_{n-1}, since we provide them as context when answering the next question Q_{n}.
from operator import itemgetter
# Create the final prompt template to answer the question with provided context and background Q&A pairs
template = """Here is the question you need to answer:
\n --- \n {question} \n --- \n
Here is any available background question + answer pairs:
\n --- \n {q_a_pairs} \n --- \n
Here is additional context relevant to the question:
\n --- \n {context} \n --- \n
Use the above context and any background question + answer pairs to answer the question: \n {question}
"""
least_to_most_prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model='gpt-4', temperature=0)
least_to_most_chain = (
{'context': itemgetter('question') | retriever,
'q_a_pairs': itemgetter('q_a_pairs'),
'question': itemgetter('question'),
}
| least_to_most_prompt
| llm
| StrOutputParser()
)
q_a_pairs = ""
for q in questions:
    # Answer each sub-question, feeding the previously answered Q&A pairs in as additional context
    answer = least_to_most_chain.invoke({"question": q, "q_a_pairs": q_a_pairs})
    q_a_pairs += f"Question: {q}\n\nAnswer: {answer}\n\n"
After getting answers to the 3 generated sub-questions, we finally answer our original question by invoking the least_to_most_chain once again, but this time with the original question and all the q_a_pairs.
least_to_most_chain.invoke({"question": "What are the benefits of QLoRA?", "q_a_pairs": q_a_pairs})
"The benefits of QLoRA include:\n\n1. Reduced Memory Footprint: QLoRA uses quantization to reduce the average memory footprint, allowing for more efficient memory usage. This is particularly beneficial for devices with limited memory resources.\n\n2. Management of Memory Spikes: QLoRA uses Paged Optimizers to manage memory spikes, ensuring smooth operation even when dealing with large datasets. This can prevent crashes or slowdowns due to memory overload.\n\n3. Model Fine-tuning: QLoRA's ability to finetune models, including large-scale models, allows for improved performance across a variety of tasks. This can lead to better results in areas such as instruction following and chatbot performance.\n\n4. Support for Multiple Model Types: QLoRA supports multiple model types, providing flexibility and versatility, allowing it to be used in a wide range of applications.\n\n5. Combination with Other Methods: QLoRA can be combined with other efficient adaptation methods, potentially providing orthogonal improvement. This means it can enhance the effectiveness of other methods, leading to better overall performance.\n\n6. Hyperparameter Search: QLoRA allows for a hyperparameter search over several variables, allowing for more precise and effective model tuning, leading to improved results.\n\n7. Accessibility: QLoRA is designed to make the process of finetuning large language models more tractable, even in low resource settings like phones. This makes it a useful tool for small teams with limited resources, helping to close the resource gap between large corporations and smaller teams. It also opens up the possibility of deploying high-quality language models on mobile devices, making this technology more widely accessible."
The LangSmith trace for the original question answer will look like this.
Instead of answering the sub-questions sequentially, we can use the LLM to answer them independently of each other and use those answers to derive the final answer to our main question.
prompt = hub.pull('rlm/rag-prompt')
prompt
ChatPromptTemplate(input_variables=['context', 'question'], metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])
def generate_and_answer(question):
    questions = []
    sub_questions = query_generation_chain.invoke(question)
    sub_qa_chain = (
        {'context': RunnablePassthrough() | retriever, 'question': RunnablePassthrough()}
        | prompt
        | ChatOpenAI(model='gpt-4', temperature=0)
        | StrOutputParser()
    )
    # Answer each sub-question independently, without feeding previous answers back in
    for q in sub_questions:
        answer = sub_qa_chain.invoke(q)
        questions.append({"question": q, "answer": answer})
    return questions
qa_pairs = generate_and_answer("What are the benefits of QLoRA?")
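A note on parallelism: since the sub-questions are answered independently of each other, the for loop inside generate_and_answer could be replaced with LangChain's Runnable.batch, which runs its inputs concurrently. A minimal sketch of that replacement (assuming the same sub_qa_chain and sub_questions defined inside the function):
# Hypothetical batched variant of the loop above: answer all sub-questions concurrently.
# max_concurrency simply bounds the number of parallel LLM calls.
answers = sub_qa_chain.batch(sub_questions, config={"max_concurrency": 3})
qa_pairs = [{"question": q, "answer": a} for q, a in zip(sub_questions, answers)]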
def format_qa_pairs(qa_pairs):
    formatted_string = ""
    for i, qa in enumerate(qa_pairs):
        formatted_string += f"Question {i}: {qa['question']}\nAnswer {i}: {qa['answer']}\n\n"
    return formatted_string.strip()
context = format_qa_pairs(qa_pairs)
# Prompt
prompt = ChatPromptTemplate.from_template(
"""
Consider the following Question and Answer Pairs:
{context}
Use these to synthesize an answer to the question: {question}
"""
)
final_rag_chain = (
prompt
| ChatOpenAI(model='gpt-4', temperature=0)
| StrOutputParser()
)
final_rag_chain.invoke({'context': context, 'question': "What are the benefits of QLoRA?"})
"The benefits of QLoRA include its ability to provide detailed analysis of instruction following and chatbot performance across various instruction datasets and model types. It serves as an equalizing factor that helps to close the resource gap between large corporations and small teams with consumer GPUs. It also enables the fine-tuning of large language models on phones and other low resource settings. QLoRA's features such as quantization and paged optimizers reduce the average memory footprint and manage memory spikes, making it possible to fine-tune large-scale models that would be infeasible to run with regular fine-tuning. Furthermore, QLoRA can provide orthogonal improvement when combined with other efficient adaptation methods and makes it more tractable to understand how features learned during pre-training are transformed to perform well on downstream tasks. Lastly, it offers major computational benefits by allowing users to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation."
The LangSmith trace for answering the original question will look like this.
2.2.2. Step back prompting#
Step-back prompting allows LLMs to step back through in-context learning: prompting them to derive high-level abstractions such as concepts and principles for a specific example (i.e., Abstraction). Then, grounded on the documents regarding the high-level concept or principle, the LLM can reason about the solution to the original question (i.e., Reasoning).
For example, if the original question is "What happens to the pressure, P, of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?", a possible step-back question would be "What are the physics principles behind this question?". Then the context (i.e., documents) retrieved for the step-back question will be used as additional context to answer the original question.
To generate such step-back questions, we use few-shot learning to provide a few examples of (question, step-back question) pairs to the LLM.
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
examples = [
{
'input': 'What happens to the pressure, P, of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?',
'output': 'What are the physics principles behind this question?'
},
{
'input': 'Estella Leopold went to which school between Aug 1954 and Nov 1954?',
'output': "What was Estella Leopold's education history?"
}
]
example_prompt = ChatPromptTemplate.from_messages(
[
('human', '{input}'), ('ai', '{output}')
]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
examples=examples,
# This is a prompt template used to format each individual example.
example_prompt=example_prompt,
)
final_prompt = ChatPromptTemplate.from_messages(
[
('system', """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:"""),
few_shot_prompt,
('user', '{question}'),
]
)
final_prompt.format(question= "What are the benefits of QLoRA?")
"System: You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:\nHuman: What happens to the pressure, P, of an ideal gas if the temperature is increased by a factor of 2 and the volume is increased by a factor of 8?\nAI: What are the physics principles behind this question?\nHuman: Estella Leopold went to which school between Aug 1954 and Nov 1954?\nAI: What was Estella Leopold's education history?\nHuman: What are the benefits of QLoRA?"
Then we use the created few-shot prompt to generate the step-back question through a chain.
step_back_query_chain = (
{'question': RunnablePassthrough()}
| final_prompt
| ChatOpenAI(model='gpt-4', temperature=0.7)
| StrOutputParser()
)
step_back_query_chain.invoke("What are the optimal parameters for QLoRA?")
'What factors should be considered when setting parameters for a QLoRA system?'
Finally, we use both the context retrieved for the original question and the context retrieved for the step-back question to answer our original question via the step_back_chain.
response_prompt_template = """You are an expert of world knowledge.
I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant.
Otherwise, ignore them if they are not relevant.
# {normal_context}
# {step_back_context}
# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)
step_back_chain = (
{'normal_context': RunnablePassthrough() | retriever,
'step_back_context': RunnablePassthrough() | step_back_query_chain | retriever,
'question': RunnablePassthrough()
}
| response_prompt
| ChatOpenAI(model='gpt-4', temperature=0)
| StrOutputParser()
)
step_back_chain.invoke("What are the optimal parameters for QLoRA?")
'The optimal parameters for QLoRA are determined through a hyperparameter search over the following variables: LoRA dropout with values { 0.0, 0.05, 0.1}, LoRA r with values { 8, 16, 32, 64, 128, 256}, and LoRA layers which can be {key+query, all attention layers, all FFN layers, all layers, attention + FFN output layers}. The LoRA α is kept fixed and the learning rate is searched, as LoRA α is always proportional to the learning rate.'
The LangSmith trace for the implemented step-back prompting chain will look like this.
In this notebook, we looked at ways to improve the LLM's answers to a user query through "Query Transformation". In summary, query transformation may help us remove ambiguities from the user query and simplify it through techniques such as:
Multi-Query: rewrites the original question from different perspectives and retrieves documents for each rewritten question.
RAG Fusion: not only rewrites the question from different perspectives, but also ranks the documents retrieved for each generated question (via Reciprocal Rank Fusion) to provide the most relevant information for answering the original question.
Least-to-Most Prompting: helps break complex questions down into multiple sub-problems and answers the final question using the sub-problems and their answers as context.
Step-back Prompting: generates a step-back question and uses the documents retrieved for that step-back question as additional context to answer the original question.
In the next section, we will generate Hypothetical Documents, instead of questions, to help LLMs answer questions more accurately through HyDE (Hypothetical Document Embeddings).