5. Query Construction

After the user asks a question in natural language and the question is routed to a specific datasource (e.g., a vector store, a graph database, etc.), it should be transformed into a structured query that retrieves information from the selected datasource (e.g., Text-to-SQL, Text-to-Cypher). In this section we will see how Langchain can be used to perform this query construction, transforming the natural language question into a structured query.
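To make the idea concrete before bringing in Langchain, here is a minimal, stdlib-only sketch of the same principle. No LLM is involved; the `structured_query` dict is hard-coded to stand in for the fields a model would extract from a question like "popular RAG videos", and the table is a toy stand-in for a real datasource:

```python
import sqlite3

# Toy table standing in for a video datasource.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (title TEXT, view_count INTEGER)")
conn.executemany(
    "INSERT INTO videos VALUES (?, ?)",
    [("Intro to RAG", 1500), ("LangChain agents", 300)],
)

# Structured fields a Text-to-SQL step might extract from the question
# "popular RAG videos" (hard-coded here for illustration).
structured_query = {"title_search": "RAG", "min_view_count": 1000}

# The structured fields become a parameterized SQL query.
rows = conn.execute(
    "SELECT title FROM videos WHERE title LIKE ? AND view_count >= ?",
    (f"%{structured_query['title_search']}%", structured_query["min_view_count"]),
).fetchall()
print(rows)  # [('Intro to RAG',)]
```

The point is the division of labor: the language model only has to fill in a few well-defined fields, while the query itself is assembled deterministically (and safely, via parameter binding) from those fields.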

A detailed blog post published by Langchain can be found here.

%load_ext dotenv
%dotenv secrets/secrets.env
from langchain_community.document_loaders import YoutubeLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough

In this section, we will see how a natural language question can be used to construct a query that filters records by their metadata. To do that, we consider a use case where we need to filter a YouTube playlist based on the metadata of its videos. Let's first find out what metadata a YouTube video provides.

docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=sVcwVQRHIc8", add_video_info=True
).load()

docs[0].metadata
{'source': 'sVcwVQRHIc8',
 'title': 'Learn RAG From Scratch – Python AI Tutorial from a LangChain Engineer',
 'description': 'Unknown',
 'view_count': 147737,
 'thumbnail_url': 'https://i.ytimg.com/vi/sVcwVQRHIc8/hq720.jpg',
 'publish_date': '2024-04-17 00:00:00',
 'length': 9191,
 'author': 'freeCodeCamp.org'}

Now let’s assume that we are going to filter the playlist based on the title, content, view_count, publish_date, and length. So, once we ask the LLM a question, it should return an object containing the above fields, which is then used to select videos from a playlist/database. To do that, similar to what we built in the “Routing” section, we define a Pydantic model.

import datetime
from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class TutorialSearch(BaseModel):
    """Search over a database/playlist of videos about LLMs and Langchain."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    title_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to video titles. "
            "Should be succinct and only include key words that could be in a video "
            "title."
        ),
    )
    min_view_count: Optional[int] = Field(
        None,
        description="Minimum view count filter, inclusive. Only use if explicitly specified.",
    )
    max_view_count: Optional[int] = Field(
        None,
        description="Maximum view count filter, exclusive. Only use if explicitly specified.",
    )
    earliest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Earliest publish date filter, inclusive. Only use if explicitly specified.",
    )
    latest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Latest publish date filter, exclusive. Only use if explicitly specified.",
    )
    min_length_sec: Optional[int] = Field(
        None,
        description="Minimum video length in seconds, inclusive. Only use if explicitly specified.",
    )
    max_length_sec: Optional[int] = Field(
        None,
        description="Maximum video length in seconds, exclusive. Only use if explicitly specified.",
    )

    def pretty_print(self) -> None:
        for field in self.__fields__:
            if getattr(self, field) is not None and getattr(self, field) != getattr(
                self.__fields__[field], "default", None
            ):
                print(f"{field}: {getattr(self, field)}")
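Note that `pretty_print` only shows fields whose values differ from their declared defaults, so the optional filters stay hidden unless the question actually sets them. A stdlib-only stand-in (using `dataclasses` rather than Pydantic, with a smaller hypothetical `Search` model) of the same skip-the-defaults logic:

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class Search:
    content_search: str
    min_view_count: Optional[int] = None
    max_view_count: Optional[int] = None

    def pretty_print(self) -> List[str]:
        # Emit only fields whose value is set and differs from the default.
        lines = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None and value != f.default:
                lines.append(f"{f.name}: {value}")
        return lines

# max_view_count keeps its default, so it is omitted from the output.
print(Search(content_search="rag", min_view_count=1000).pretty_print())
# ['content_search: rag', 'min_view_count: 1000']
```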

Next we build a chain that passes the user question, together with a prompt, to an LLM that answers according to the defined Pydantic model and returns the structured result.

meta_data_prompt = ChatPromptTemplate.from_messages(
    [
        ("system",
         "You are an expert at converting user questions into database queries. "
         "You have access to a database of tutorial videos about LLMs and Langchain. "
         "Given a question, return a database query optimized to retrieve the most relevant results."),
        ("user", "{question}"),
    ]
)

llm = ChatOpenAI(model="gpt-4", temperature=0)
structured_llm = llm.with_structured_output(TutorialSearch)

meta_data_chain = (
    {'question': RunnablePassthrough()}
    | meta_data_prompt
    | structured_llm
)

query = meta_data_chain.invoke("Build RAG using Langchain videos published before January 2024 with at least 1000 views.")
query.pretty_print()
content_search: Building RAG using Langchain
title_search: RAG Langchain
min_view_count: 1000
latest_publish_date: 2024-01-01

This structured query can then be used to filter the records of the database, as shown in this article. In our final section, we will use Text-to-Cypher query construction to retrieve information from a graph database.
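As a rough illustration of that filtering step, independent of any particular vector store API, the fields produced by the chain above could be applied to plain metadata dicts like this (the `query` dict is hard-coded to mirror the chain's output; field semantics follow the descriptions in `TutorialSearch`, i.e., inclusive minimums and exclusive maximums):

```python
import datetime

# Metadata records, shaped like the YouTube metadata shown earlier.
videos = [
    {"title": "Learn RAG From Scratch", "view_count": 147737,
     "publish_date": datetime.date(2024, 4, 17)},
    {"title": "RAG with Langchain", "view_count": 5000,
     "publish_date": datetime.date(2023, 11, 2)},
]

# Fields as produced by the chain above (hard-coded for illustration).
query = {"min_view_count": 1000,
         "latest_publish_date": datetime.date(2024, 1, 1)}

def matches(video: dict, query: dict) -> bool:
    if "min_view_count" in query and video["view_count"] < query["min_view_count"]:
        return False  # inclusive lower bound on views
    if "latest_publish_date" in query and video["publish_date"] >= query["latest_publish_date"]:
        return False  # exclusive upper bound on publish date
    return True

print([v["title"] for v in videos if matches(v, query)])
# ['RAG with Langchain']
```

In practice, the content_search and title_search fields would additionally drive similarity search over transcripts and titles, while the numeric and date fields become hard filters like the ones above.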