5. Query Construction#
After the user asks a question in natural language and the question is routed to a specific datasource (e.g., a vector store, a graph database, etc.), it should be transformed into a structured query to retrieve information from the selected datasource (e.g., Text-to-SQL, Text-to-Cypher, etc.). In this section we will see how query construction can be done using Langchain to transform the natural language question into a structured query.
A detailed blog post published by Langchain can be found here.
%load_ext dotenv
%dotenv secrets/secrets.env
from langchain_community.document_loaders import YoutubeLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
In this section, we will see how a natural language question can be used to construct a query that filters records based on their metadata. To do that, we consider a use case where we need to filter a YouTube playlist based on the metadata of its videos. Let's first find out what metadata is available for a YouTube video.
docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=sVcwVQRHIc8", add_video_info=True
).load()
docs[0].metadata
{'source': 'sVcwVQRHIc8',
'title': 'Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer',
'description': 'Unknown',
'view_count': 147737,
'thumbnail_url': 'https://i.ytimg.com/vi/sVcwVQRHIc8/hq720.jpg',
'publish_date': '2024-04-17 00:00:00',
'length': 9191,
'author': 'freeCodeCamp.org'}
Now let’s assume that we are going to filter the playlist based on the title, content, view_count, publish_date, and length fields. So, once we ask the LLM a question, it should return an object containing the above fields, which is then used to select videos from a playlist/database. To do that, similar to the one we built in the “Routing” section, we define a Pydantic model.
import datetime
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class TutorialSearch(BaseModel):
    """Search over a database/playlist of videos about LLMs and Langchain."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    title_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to video titles. "
            "Should be succinct and only include key words that could be in a video "
            "title."
        ),
    )
    min_view_count: Optional[int] = Field(
        None,
        description="Minimum view count filter, inclusive. Only use if explicitly specified.",
    )
    max_view_count: Optional[int] = Field(
        None,
        description="Maximum view count filter, exclusive. Only use if explicitly specified.",
    )
    earliest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Earliest publish date filter, inclusive. Only use if explicitly specified.",
    )
    latest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Latest publish date filter, exclusive. Only use if explicitly specified.",
    )
    min_length_sec: Optional[int] = Field(
        None,
        description="Minimum video length in seconds, inclusive. Only use if explicitly specified.",
    )
    max_length_sec: Optional[int] = Field(
        None,
        description="Maximum video length in seconds, exclusive. Only use if explicitly specified.",
    )

    def pretty_print(self) -> None:
        # Print only fields that were set to something other than their default.
        for field in self.__fields__:
            if getattr(self, field) is not None and getattr(self, field) != getattr(
                self.__fields__[field], "default", None
            ):
                print(f"{field}: {getattr(self, field)}")
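To see what `pretty_print` does without calling an LLM, here is a quick standalone check. This is a minimal sketch: `MiniSearch` is a trimmed-down stand-in for `TutorialSearch` that we define here for illustration, and it uses plain `pydantic` instead of `langchain_core.pydantic_v1`.

```python
from typing import Optional

from pydantic import BaseModel  # plain pydantic stands in for langchain_core.pydantic_v1


class MiniSearch(BaseModel):
    """Trimmed-down stand-in for TutorialSearch (illustration only)."""

    content_search: str
    min_view_count: Optional[int] = None
    max_view_count: Optional[int] = None

    def pretty_print(self) -> None:
        # Print only the fields that were actually populated.
        for name, value in dict(self).items():
            if value is not None:
                print(f"{name}: {value}")


# max_view_count stays None, so it is omitted from the output:
MiniSearch(content_search="RAG", min_view_count=1000).pretty_print()
```

Only the populated fields are printed, which is exactly how the query objects are displayed in the examples below.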
Next, we build a chain that combines the user question with a prompt, passes it to an LLM that answers according to the defined Pydantic model, and returns the structured answer.
meta_data_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at converting user questions into database queries. "
            "You have access to a database of tutorial videos about LLMs and Langchain. "
            "Given a question, return a database query optimized to retrieve the most relevant results.",
        ),
        ("user", "{question}"),
    ]
)

llm = ChatOpenAI(model="gpt-4", temperature=0)
structured_llm = llm.with_structured_output(TutorialSearch)

meta_data_chain = (
    {"question": RunnablePassthrough()}
    | meta_data_prompt
    | structured_llm
)
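Conceptually, `with_structured_output` has the model emit arguments that match the schema and then validates them into the Pydantic object. The following hand-rolled sketch shows just that validation step; it is not LangChain's internals, and `TutorialSearchMini` plus the hard-coded JSON string are our own illustration of what a model's tool-call output might look like.

```python
import json
from typing import Optional

from pydantic import BaseModel


class TutorialSearchMini(BaseModel):
    """Trimmed schema used only for this illustration."""

    content_search: str
    title_search: str
    min_view_count: Optional[int] = None


# Pretend this JSON came back from the LLM's tool/function call:
llm_json = (
    '{"content_search": "Building RAG using Langchain", '
    '"title_search": "RAG Langchain", "min_view_count": 1000}'
)

# Validation turns the raw JSON into a typed query object:
query = TutorialSearchMini(**json.loads(llm_json))
print(query.title_search)  # RAG Langchain
```

If the model's output were missing a required field or had the wrong type, Pydantic would raise a validation error instead of silently producing a malformed query.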
query = meta_data_chain.invoke("Build RAG using Langchain videos published before January 2024 with at least 1000 views.")
query.pretty_print()
content_search: Building RAG using Langchain
title_search: RAG Langchain
min_view_count: 1000
latest_publish_date: 2024-01-01
This structured query can then be used to filter the records of the database, as shown in this article. In our final section, we will use Text-to-Cypher query construction to retrieve information from a graph database.
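To make the filtering step concrete, the fields returned above map directly onto comparison filters over video metadata. Below is a minimal, self-contained sketch of applying them to a list of metadata dicts; the `matches` helper and the sample data are our own illustration, not a LangChain API.

```python
from datetime import date

# Toy stand-in for the playlist's metadata records:
videos = [
    {"title": "RAG Langchain tutorial", "view_count": 5000,
     "publish_date": date(2023, 6, 1), "length": 3600},
    {"title": "Intro to Python", "view_count": 200,
     "publish_date": date(2024, 3, 1), "length": 600},
]


def matches(video, *, min_view_count=None, latest_publish_date=None):
    # Inclusive lower bound on views, exclusive upper bound on publish date,
    # mirroring the Field descriptions in TutorialSearch.
    if min_view_count is not None and video["view_count"] < min_view_count:
        return False
    if latest_publish_date is not None and video["publish_date"] >= latest_publish_date:
        return False
    return True


# Apply the filters produced for the example question above:
hits = [
    v["title"]
    for v in videos
    if matches(v, min_view_count=1000, latest_publish_date=date(2024, 1, 1))
]
print(hits)  # ['RAG Langchain tutorial']
```

In practice a vector store would combine such metadata filters with the `content_search` similarity query, but the comparison logic is the same.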