How to communicate with a knowledge base in natural language: using an LLM to build such a system and evaluating its performance

Doubletapp
12 min read · Apr 29, 2024

Hi, my name is Daniel, and I work in the ML department at Doubletapp. In this article, I will talk about using large language models to optimize business processes.

A Large Language Model (LLM) is a type of language model capable of recognizing and generating meaningful text, as well as other complex data types such as code. These models are trained on massive datasets, often collected from open sources. Thanks to the size of the training data and the large number of parameters, they rank highly on benchmarks for various tasks (such as summarization, QA, code generation, etc.).

However, LLMs still have a number of issues, one of which is hallucination (making up facts). It is hard to blame the model for not knowing how a particular process or product in your company works and trying to come up with a coherent answer anyway. Therefore, we need to provide the LLM with factual information, and it will then give us a personalized, human-readable response.

A question-answering system built on top of factual information like this is called Retrieval-Augmented Generation (RAG). It can be used in various scenarios, such as:

  1. Personalized knowledge base assistant for onboarding new employees.
  2. Optimization of support lines by providing faster and more accurate answers on the front line.
  3. Assistant in online course training, etc.

This article consists of two parts:

  1. We will discuss building an RAG system based on the langchain library.
  2. We will objectively evaluate the performance of the created system using synthetic data in Russian with the RAGAs framework.

The example data in this article comes from the customer knowledge base section about the debit card of the “Yellow Bank”.

As evident from the acronym, any RAG consists of three stages:

  1. Retrieval — searching for the most relevant information in our knowledge base using semantic search, word intersection, etc.
  2. Augment — adding the found information (context) to the prompt for LLM along with the user’s query.
  3. Generate — generating a response by the model considering the context.
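Put together, the three stages look roughly like the sketch below. This is illustrative pseudocode rather than the actual pipeline (which is built step by step later); retriever, llm and prompt_template are placeholders.

# A high-level sketch of the three RAG stages; all names here are placeholders
def rag_answer(question, retriever, llm, prompt_template):
    docs = retriever.get_relevant_documents(question)                     # 1. Retrieval
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = prompt_template.format(context=context, question=question)   # 2. Augment
    return llm.predict(prompt)                                            # 3. Generate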

Retrieval

This is the first and most important step in the system, as the final answer’s accuracy and completeness will depend on it.
The quality of retrieval depends on several components.

Data

The first thing to pay attention to is the data.
Are the topics logically organized? Are they discussed in one place or in several? Can you answer the question yourself using information from the text? If you can’t answer the question, then the system is unlikely to cope with it.

Conversely, data that is tangled in structure and content does not allow the text to be split effectively into small chunks containing complete thoughts. Redundant or contradictory information will hurt search quality: it will be harder to find the correct context.

Data Chunking

Chunking plays an important role in building the retrieval process. How the data is split determines whether the LLM receives the most relevant and complete context for the answer.

There are many text-splitting strategies; one of the most popular is dividing the text into fixed-size parts (chunks). In such cases, it often makes sense to overlap neighbouring chunks so that a train of thought is not lost mid-sentence. Another strategy is thematic splitting, for example by paragraphs or section headings.

In general, the size of the chunk can be determined based on the logic of your data, but this parameter can vary, and it’s worth trying several options. The trade-off is as follows: smaller chunks contain more specific thoughts, and therefore are better detected by our search, but they can worsen the generation process due to the lack of surrounding context.
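As an illustration of the fixed-size strategy with overlap (the header-based splitting I actually use is shown below), here is a minimal sketch; the chunk sizes are arbitrary and docs is assumed to be a list of already loaded documents.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Fixed-size chunks with an overlap, so a thought cut off mid-sentence
# still appears in full in a neighbouring chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # characters per chunk; worth tuning to your data
    chunk_overlap=100,  # shared window between neighbouring chunks
)
chunks = text_splitter.split_documents(docs)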

Search

Search can be roughly divided into two types: vector search and keyword-based search. Vector search is an information-retrieval method in which texts are represented as vectors produced by ML models, and the search then looks for the closest vectors. Vector search is very popular now, but it is not a panacea: its quality depends heavily on the model used to create the embeddings. It is worth trying keyword-based search, such as TF-IDF or BM25, comparing it with the vector variants, or using hybrid search.

Since similar results are not always relevant, filtering by metadata can be an effective strategy. For example, if our knowledge base consists of movie reviews and the user wants only films released after 2000 or with a rating above 8, we can do this with SelfQueryRetriever: it extracts the metadata constraints from the query and filters the results accordingly.

Another interesting strategy is rewriting the user's question, which improves retrieval by correcting poorly formulated queries.
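One possible implementation of such rewriting is sketched below. It is not part of the pipeline built in this article: a small chain simply asks the LLM to reformulate the query before it reaches the retriever, and the prompt text is an assumption.

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser

# The LLM reformulates a possibly vague user question into a search-friendly one
rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite the user's question so that it is clear, specific and well suited for "
    "searching a knowledge base. Keep the original language.\n"
    "Question: {question}\nRewritten question:"
)

query_rewriter = rewrite_prompt | ChatOpenAI(temperature=0) | StrOutputParser()

better_question = query_rewriter.invoke({"question": "card cashback how??"})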

Creating the Retriever

In my case, the data consists of markdown files with relatively small amounts of information within each header. Therefore, when splitting, I will use splitting by headers.

from typing import List, Tuple

from langchain.document_loaders import TextLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter


# Text loader and header splitter
def load_and_split_markdown(filepath: str, splitter: List[Tuple[str, str]]):
    loader = TextLoader(filepath)
    docs = loader.load()

    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=splitter)
    md_header_splits = markdown_splitter.split_text(docs[0].page_content)
    return md_header_splits

To create embeddings for the chunks, I use the “text-embedding-ada-002” model from OpenAI for its easy-to-use API and the high quality of its embeddings.
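For illustration, pinning the embedding model explicitly looks like this (a small sketch; at the time of writing, ada-002 was also LangChain's default for OpenAIEmbeddings, and the query text is made up):

from langchain.embeddings import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
vector = embedding.embed_query("What cashback does the debit card give?")
print(len(vector))  # ada-002 produces 1536-dimensional vectors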

Chroma DB, which is integrated with LangChain, is well suited for storing the resulting vectors. Note that LangChain can work with a variety of vector databases, both open source (ChromaDB, LanceDB, Faiss) and paid alternatives such as Weaviate and Pinecone. For our example, the free ChromaDB will be sufficient.

The retriever is an EnsembleRetriever that combines vector search, keyword-based BM25 search, and the aforementioned SelfQueryRetriever with weights of 0.6 / 0.25 / 0.15.

from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.retrievers.self_query.base import SelfQueryRetriever


def get_retriever(splits, bm25_k, mmr_k,
                  mmr_fetch_k, metadata_field_info,
                  document_content_description):
    llm = OpenAI(temperature=0)

    # Embeddings for vector search
    embedding = OpenAIEmbeddings()

    # DB for our vectors
    vectorstore = Chroma.from_documents(documents=splits, embedding=embedding)

    # Keyword retriever
    bm25_retriever = BM25Retriever.from_documents(splits)
    bm25_retriever.k = bm25_k

    # Vector-based retriever
    mmr_retriever = vectorstore.as_retriever(
        search_type="mmr", search_kwargs={'k': mmr_k, 'fetch_k': mmr_fetch_k}
    )

    # Self Query Retriever
    self_retriever = SelfQueryRetriever.from_llm(
        llm,
        vectorstore,
        document_content_description,
        metadata_field_info,
        verbose=True
    )

    # Retriever combination
    ensemble_retriever = EnsembleRetriever(
        retrievers=[self_retriever, bm25_retriever, mmr_retriever],
        weights=[0.15, 0.25, 0.6]
    )

    return ensemble_retriever

As metadata, I use part of the text (title, subtitle, etc.) from which the information was taken.

from typing import Final

from langchain.chains.query_constructor.base import AttributeInfo

CONTENT_DESCRIPTION: Final = "Description of banking products"

METADATA_INFO: Final = [
    AttributeInfo(
        name="Header",
        description="Part of the document from which the text was taken",
        type="string or list[string]",
    ),
]

Augment

At this stage, we construct a query for our neural network, which consists of the context we found in the previous step and the prompt. The prompt is a kind of instruction for the LLM, telling the network what to do with the context we feed into it.

Variation of the Prompt

Typically, something like “Answer the question using the context below” is used as the base prompt, but we can modify it by adding more details to help the model answer more accurately. For example, the above prompt can be supplemented with: “You are an assistant for the bank’s debit card. Rely only on the information provided below. If you don’t know the answer, respond with ‘I don’t know’.”

Context Augmentation

Often, when splitting the knowledge base into smaller chunks to improve search quality, we sacrifice the completeness of the information fed into the model. To address this issue, we can implement context augmentation by adding a window around the found chunk. This can be done either at the splitting stage — by dividing with a sliding window — or during augmentation.
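A minimal sketch of the second option, augmentation at query time, is shown below. It is not part of the final pipeline here and assumes that splits keeps the chunks in their original document order; the helper name is made up.

def augment_with_window(hit, splits, window=1):
    """Return the retrieved chunk together with `window` neighbouring chunks on each side."""
    idx = splits.index(hit)  # position of the retrieved chunk in the ordered list of splits
    start, end = max(0, idx - window), min(len(splits), idx + window + 1)
    return "\n\n".join(doc.page_content for doc in splits[start:end])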

Performing Augmentation

At this stage, I created a prompt that, in my opinion, aligns well with the task and will be understandable to the model.

PROMPT_TEMPLATE: Final = """
You are an assistant for the bank_name bank's products and answer customer questions.
Use fragments of the obtained context to answer the question.
If you don't know the answer, say you don't know, don't make up an answer.
Use a maximum of three sentences and be concise.\n
Question: {question} \n
Context: {context} \n
Answer:
"""

Generation

Generation is the final stage in the pipeline and involves feeding the prompt and context into the model.

Different Models

For experimentation, it’s worth trying different LLMs depending on the specifics of the data, language, etc. Some models will perform better than others. There may also be additional requirements for on-premise deployment, leaving us only with open-source models.

Here’s an interesting comparison of models in the RAG task that I came across on Habr (the article is in Russian).

Fine-tuning the Model

One way to improve the pipeline’s performance is by fine-tuning the LLM on the domain used in the knowledge base. For example, LoRA/QLoRA approaches can be applied.
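As an illustration of what LoRA-style fine-tuning can look like with the peft library (a sketch only; the base model and hyperparameters are assumptions, and this step was not performed in the pipeline described here):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical choice of an open-source base model
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trained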

Completing the Pipeline

GPT-3.5 Turbo was chosen for generation, as it provides sufficiently good generation quality at a lower cost (compared to GPT-4, for example).

from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough

from settings import Settings  # assumed project settings module holding the prompt, splitter headers and retriever parameters


# Setting up the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

system_message_prompt = SystemMessagePromptTemplate.from_template(Settings.PROMPT_TEMPLATE)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Loading texts
docs = load_and_split_markdown('data/docs/bank_name_docs.md', Settings.HEADERS_TO_SPLIT)

# Setting up the retriever
ensemble_retriever = get_retriever(
    docs,
    Settings.BM25_K, Settings.MMR_K, Settings.MMR_FETCH_K,
    Settings.METADATA_INFO, Settings.CONTENT_DESCRIPTION
)

# RAG pipeline
rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | chat_prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"documents": ensemble_retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

Another feature I added is that the answer indicates the header of the context used during generation.
As a final touch, I added a ChatGPT-like interface for comfortable interaction with the RAG system. Here's what it looks like:

RAG Evaluation

Now that we know how to build RAG, we need to ask ourselves: how well does it work?
For these purposes, we need to create an evaluation system and test what we have achieved, and the RAGAs library will help us with this.

RAGAs is an open-source framework designed to evaluate pipeline components without human intervention. It allows creating a test dataset and obtaining an evaluation of the RAG we built.

To build the test dataset, GPT-3.5 is used to generate questions from the text, and GPT-4, as the most capable model at the time of writing, generates the reference answers. Manually crafted questions and answers can also be added to the dataset if desired.

RAGAs accepts the following input parameters:

  1. question: the user’s question input into RAG
  2. answer: the answer generated by our pipeline
  3. contexts: the contexts used to answer the question
  4. ground_truth: the correct answer to the user’s question

Now let’s talk about the metrics we will use. Formally, they can be divided into two independent parts: generation evaluation and retrieval evaluation.

Faithfulness
This metric aims to identify factual inconsistencies between the generated answer and the context. It allows counting the model’s hallucinations — incorrect information or information not based on the context — relative to all statements in the answer.

Answer Relevancy
How well the generated answer corresponds to the question. It helps understand the extent to which the system’s answers contain incomplete, repetitive, or redundant information.

Context Precision
A numerical measure of how well the obtained context matches the information needed to answer the question. This metric is calculated as the ratio of correctly retrieved chunks to their total number. It helps find the optimal chunk size when splitting the text.

Context Recall
Measures how relevant the retrieved context was relative to the ground_truth answers; it is the only metric that uses them.

Let’s evaluate the Pipeline

First, we need to generate a synthetic dataset with questions based on the knowledge base. RAGAs provides a convenient class called TestGenerator, which lets you create such a dataset in just a few lines. However, its prompts are written in English, so I had to adjust every prompt this class uses in order to receive answers in Russian.

To do this, I added the phrase “Your task is formulated in English, but the answer should be in the language of the context” to each prompt, and then, inheriting from the TestGenerator class, I overrode the methods that use these prompts.

from langchain.prompts import HumanMessagePromptTemplate

SEED_QUESTION = HumanMessagePromptTemplate.from_template(
    """\
Your instructions are given in English but the answer should be in the same language as the context.
Your task is to formulate a question from given context satisfying the rules given below:
1.The question should make sense to humans even when read without the given context.
2.The question should be fully answered from the given context.
3.The question should be framed from a part of context that contains important information. It can also be from tables,code,etc.
4.The answer to the question should not contain any links.
5.The question should be of moderate difficulty.
6.The question must be reasonable and must be understood and responded by humans.
7.Do no use phrases like 'provided context',etc in the question
8.Avoid framing question using word "and" that can be decomposed into more than one question.
9.The question should not contain more than 10 words, make of use of abbreviation wherever possible.

context:{context}
"""  # noqa: E501
)

If you are interested in understanding in detail how the dataset construction loop works in TestGenerator, I recommend reading this article (the article is in Russian).

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM
from generator import RussianTestGenerator

loader = TextLoader('data/docs/bank_name_docs.md')
docs = loader.load()


# Add custom llms and embeddings
generator_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-3.5-turbo"))
critic_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-4"))
embeddings_model = OpenAIEmbeddings()

# Change resulting question type distribution
testset_distribution = {
    "simple": 0.25,
    "reasoning": 0.25,
    "multi_context": 0.25,
    "conditional": 0.25,
}

test_generator = RussianTestGenerator(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings_model=embeddings_model,
    testset_distribution=testset_distribution
)

synth_data = test_generator.generate(docs, test_size=15).to_pandas()

After creating the dataset, we need to obtain answers and the accompanying contexts from the RAG system.

import ast
import unicodedata

from datasets import Dataset
from tqdm import tqdm
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

answers = []
contexts = []

for query in tqdm(synth_data.question.tolist(), desc='Generating answers'):
    answers.append(rag_chain_with_source.invoke(query)['answer'])
    contexts.append([
        unicodedata.normalize('NFKD', doc.page_content)
        for doc in ensemble_retriever.get_relevant_documents(query)
    ])

ground_truth = list(map(ast.literal_eval, synth_data.ground_truth.tolist()))

data = {
    "question": synth_data.question.tolist(),
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truth
}

dataset = Dataset.from_dict(data)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
).to_pandas()

Thus, we obtained a dataset of 15 examples, three of which can be seen in the image.

The average metric values on the dataset are as follows:
context_precision: 0.586185516
context_recall: 0.855654762
faithfulness: 0.852083333
answer_relevancy: 0.836044521

Based on the obtained metrics, the system can be improved mainly through experiments with the retriever: context_precision is noticeably lower than the other scores.

The complete example code with data and evaluation is available on GitHub.

Today we walked through building a RAG system step by step and the nuances of developing each stage. We also obtained a numerical assessment of the system's performance using the RAGAs framework.
Thank you for your attention!
