How to Build an Intelligent QA Chatbot on your data with LLM or ChatGPT

Mahesh
15 min read · Jul 2, 2023

Welcome to the world of intelligent chatbots empowered by large language models (LLMs)!

In this comprehensive article, we will delve into the art of seamlessly integrating LLMs into your organization’s question answering chatbot. Brace yourself for a deep dive into the high-level system design considerations, as well as the nitty-gritty details of the code implementation.

After you’ve finished reading this article, I encourage you to explore my other piece on “How to Deploy a LLM Chatbot” available here: https://mrmaheshrajput.medium.com/how-to-deploy-a-llm-chatbot-7f1e10dd202e

Update — July 13th, 2023:

I would like to highlight a remarkable open-source extension for your Postgres database called pgvector, which enables efficient storage and retrieval of embeddings, facilitating vector similarity search.

Additionally, if you’re seeking a convenient plug-and-play solution to analyze your data files and pose questions using GPT 3.5/4, I highly recommend quivr.

Given the rapid emergence of new tools and models for LLMs and QA with LLMs, it becomes impractical to continually update this article with each new release.

Please note that the information provided in this article may not encompass the latest tools in the field; however, the overall approach discussed will remain applicable for a considerable time.

While the potential of intelligent chatbots and LLM integration is immense, organizations often encounter challenges along the way. Building intelligent chatbots and integrating LLMs with their own organizational knowledge requires careful consideration and attention to various factors. Let’s explore some of the common problems faced by organizations in this endeavor.

  1. One major hurdle is the availability and quality of training data. LLMs require vast amounts of high-quality data for training, but organizations may struggle to gather sufficient and relevant data that represents their specific domain or industry. Additionally, ensuring the accuracy and diversity of the training data is crucial to prevent biased or skewed responses from the chatbot.
  2. Another challenge lies in the integration of LLMs with the organization’s existing knowledge base. Many organizations possess a wealth of proprietary data and domain-specific information. Integrating this knowledge effectively with LLMs can be complex, requiring careful mapping, preprocessing, and structuring of the data. Organizations must also consider data privacy and security concerns when incorporating sensitive information into the chatbot.
  3. Moreover, ensuring the chatbot’s ability to understand and generate contextually relevant responses is a significant obstacle. LLMs, while powerful, can still struggle with context disambiguation and understanding nuanced queries. Organizations need to invest in robust natural language processing techniques and fine-tuning strategies to enhance the chatbot’s comprehension and response generation capabilities.
  4. Scalability is yet another challenge. As organizations grow, the chatbot must handle an increasing volume of queries while maintaining optimal performance. Ensuring the chatbot’s scalability involves addressing issues such as response latency, handling concurrent requests, and efficient resource allocation.
  5. Lastly, user experience plays a vital role in the success of a chatbot. Organizations must design intuitive and user-friendly interfaces, provide seamless integration with existing communication channels, and implement effective error handling and fallback mechanisms. Achieving a balance between automation and human intervention is crucial to create a satisfying user experience.

Overcoming these challenges requires a combination of technical expertise, domain knowledge, and iterative development processes from various functional teams.

In this article I will walk you through:

  1. How to pre-process your organization's proprietary knowledge so it can be fed to (almost) any LLM
  2. Build a tree of your organization's knowledge base for fast retrieval of relevant content, so that we pass only the relevant text to the LLM instead of the whole corpus
  3. Build a QA chatbot from open-source Large Language Models
  4. Give memory to the chatbot
  5. How to deploy it in a scalable environment
  6. Best practices

Gathering Data

Assemble your organization's entire knowledge base in one place. If it is already in one place, such as Confluence (a popular wiki platform and a treasure trove of valuable information for many organizations), SharePoint, Wiki.js or Zoho Wiki, then identify the relevant pages and create a list of filtered content.

Curating a quality dataset is the most important step for any machine learning model. This step takes a significant share of the entire model development life cycle (MDLC), because the whole project depends on the dataset's quality, relevance and pre-processing.

In this example I will build a chatbot on Jira documentation.

On the Jira Cloud resources page, there is documentation on different subjects. I decided to take the topic of ‘Search for issues in Jira’.

Under this topic, there are various child topics related to the parent topic.

You can either use Beautiful Soup (a web scraping library) or manually go to each page, copy the entire webpage, and paste it into a document.

The desired output should be that all your knowledge base articles are in a .txt file in a single folder.
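As a rough illustration, here is a minimal scraping sketch using requests and Beautiful Soup; the URL and output path are hypothetical, and real pages usually need page-specific selectors:

```python
# Minimal scraping sketch; the URL and output filename are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://support.atlassian.com/jira-software-cloud/docs/search-for-issues-in-jira/"
page = requests.get(url, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")

# Keep only the visible text; in practice, target the article body element of the page.
text = soup.get_text(separator="\n", strip=True)

with open("data/search-for-issues-in-jira.txt", "w", encoding="utf-8") as f:
    f.write(text)
```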

Creating Embeddings

Embeddings, in simple terms, are numerical representations of words, sentences, or documents that capture their meaning and context. They are like compact codes or summaries that represent the essence of a piece of text.

Why are embeddings required? Well, here we use them to find similar content (or relevant document) to the input question and only pass the relevant document as a context to the LLM model instead of sending the whole corpus with every question.

There are numerous other use cases for embeddings, but here they are used only for nearest-neighbour search.

To convert our corpus into embeddings, we can use a library called SentenceTransformers.

Since your organization's knowledge base will not change every day, it is suggested to build an ANN index (explained in the next steps) from the embeddings and save it to low-latency storage, such as S3, for faster retrieval.
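A minimal sketch of this step, assuming the sentence-transformers package, a local data/ folder of .txt files, and the all-MiniLM-L6-v2 checkpoint (the folder layout and model name are assumptions):

```python
# Load every knowledge-base article and encode it with a sentence transformer.
import glob
from sentence_transformers import SentenceTransformer

corpus = []
for path in sorted(glob.glob("data/*.txt")):
    with open(path, encoding="utf-8") as f:
        corpus.append(f.read())

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption
embeddings = model.encode(corpus, convert_to_numpy=True, show_progress_bar=True)
print(embeddings.shape)  # (number of documents, embedding dimension)
```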

Warning

As the name of the library implies, it is a sentence transformer, yet we just used it to encode entire documents.

This works only because the model truncates everything beyond its maximum sequence length, and we assume the initial words capture the gist of each document well.

In a production setting, however, you should split your topics into finer segments (smaller chunks) to achieve better retrieval quality.

Build ANN model

Approximate Nearest Neighbour (ANN) models, in simple terms, are techniques that help us find items that are similar or close to a given item from a large dataset, without exhaustively searching through every single item.

Imagine you have a massive collection of items, let’s say pictures of animals. You want to find the most similar picture to a given picture of a dog from this collection. Instead of looking at every single picture one by one, an ANN model allows you to quickly narrow down the search and find similar pictures efficiently.

Similarly, an organization may have hundreds of thousands of subsets of subjects about which an end user can ask our chatbot a question. So instead of trying to find the top-k relevant sources by scanning every single available document in the corpus, an ANN model quickly outputs a few closely matching documents.

ANN models work by creating a space or representation where items are organized based on their similarities. This space is built using mathematical algorithms that transform the data into a more structured form. The items are then mapped to specific locations in this space based on their characteristics or features.

When you want to find the nearest neighbor (most similar item) to a given item, the ANN model doesn’t have to compare it with every single item in the dataset. Instead, it uses clever techniques to navigate through the space and identify a smaller subset of potential candidates that are likely to be close in similarity. This significantly speeds up the search process.

The key idea behind ANN models is that they sacrifice a bit of accuracy in favor of efficiency. They provide an approximate solution that is very close to the exact nearest neighbour, but with much faster search times. So where brute-force KNN (K Nearest Neighbours) has a recall of 1, an ANN model's recall will be less than 1, depending on which model you use. This approximation allows us to handle large datasets and perform similarity searches in real time or near real time.

How it will be used

How FAISS will be used

We utilize FAISS, a similarity search library developed by Facebook Research for dense vectors, to construct a robust search index. When a user poses a question to the chatbot, we follow a specific process at the backend. Initially, we encode the question as an embedding using Sentence Transformer. Subsequently, we feed this embedding into our search index, which enables us to retrieve the most closely matching embedding. This nearest matching embedding is then associated with the corresponding corpus document. Finally, we incorporate this document as contextual information along with the user’s question for the Large Language Model (LLM) to process.

Build a FAISS Index

  1. First we import the faiss library
  2. Then we build the index
  3. Then we add our embeddings to the index
  4. It is good to do some sanity checks. xq is our question, encoded by the sentence transformer. We then use the encoded text to search our index; D and I are the distance and index matrices. A minimal sketch of these steps follows this list.
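The sketch below continues from the corpus, model, and embeddings above; faiss-cpu is assumed, and a flat L2 index is used here for simplicity:

```python
import faiss
import numpy as np

embeddings = np.asarray(embeddings, dtype="float32")
dim = embeddings.shape[1]

index = faiss.IndexFlatL2(dim)  # build the index
index.add(embeddings)           # add our embeddings to the index

# Sanity check: encode a question and search the index.
xq = model.encode(["How do I search for issues in Jira?"])
D, I = index.search(np.asarray(xq, dtype="float32"), k=2)  # distances and indices
print(D, I)
print(corpus[I[0][0]][:300])  # preview the closest matching document
```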

The output of corpus[0] is below:

Output of corpus[0] from above code

You can also pose other questions like “Can I save my search results?”:

Another sample question output from FAISS index

Why use an ANN

In a production environment, a chatbot needs to be responsive and have low latency. If we were to pass the entire corpus to our model, it could take hours to generate a response. Alternatively, employing a brute force, naive approach to find a relevant context from the entire corpus may take several minutes, which is far from ideal for a chatbot.

There are also other popular open-source ANN libraries, such as ScaNN by Google Research and Annoy by Spotify, that you can employ based on your problem.

Check out ANN-Benchmarks for a comprehensive comparison.

Build QA model

There are two main types of Large Language Models currently in the market:

  1. Base LLM
  2. Instruction Tuned LLM

Base LLM

A Base LLM repeatedly predicts the next word based on its text training data. So if I give it a prompt,

Once upon a time there was a unicorn

then it may, by repeatedly predicting one word at a time, come up with a completion that tells a story about a unicorn living in a magical forest with all her unicorn friends.

Now, a downside of this is that if you were to prompt it with

What is the capital of France?

it is quite possible that somewhere on the internet there is a list of quiz questions about France. So it may complete this with

Response:
What is France’s largest city, what is France’s population?

and so on. But what you really want is for it to tell you the capital of France, rather than to list all these questions.

Instruction Tuned LLM

An Instruction Tuned LLM instead tries to follow instructions and will hopefully say,

Response:
The capital of France is Paris

How do you go from a Base LLM to an Instruction Tuned LLM? This is what the process of training an Instruction Tuned LLM, like ChatGPT, looks like. You first train a Base LLM on a lot of data, so hundreds of billions of words, maybe even more. And this is a process that can take months on a large supercomputing system. After you’ve trained the Base LLM, you would then further train the model by fine-tuning it on a smaller set of examples, where the output follows an input instruction.

Going from a Base LLM to an Instruct LLM on your own corpus might not make sense for your organization if you don't have the required data, expertise and resources. But you can use an open-source instruct LLM, or even ChatGPT, for your own QA bot.

Among open-source LLMs, GPT4All is a free-to-use, locally running, privacy-aware chatbot that does not require a GPU or even an internet connection to work on your machine (or in the cloud).

In essence, GPT4All is an ecosystem for training and deploying powerful, customized large language models that run locally on consumer-grade CPUs. Their GPT4All tool can be downloaded from their website for any OS and used for development or play (maybe also for production).

Their goal is to be the best instruction-tuned, assistant-style language model that any person or enterprise can freely use, distribute and build on.

A GPT4All model is a 3GB to 8GB file that you can download and plug into the GPT4All open-source ecosystem software. Nomic AI supports and maintains this software ecosystem to enforce quality and security, alongside spearheading the effort to allow any person or enterprise to easily train and deploy their own on-edge large language models.

After installing GPT4All tool in your machine, this is how it will look:

GPT4All in Linux

For our chatbot we will use the gpt4all pip library.

  1. Import or install the GPT4All library. Be mindful of the version; newer versions do not have the chat_completion method.
  2. Create a model instance.
  3. In the messages list, we set the role to user and include, as context, the text returned by our ANN index search for the user's question.
  4. We call the chat_completion method.
  5. And print the response. A minimal sketch of these steps follows this list.
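A minimal sketch, assuming an older gpt4all release that still exposes chat_completion; the pinned version, model name, and response structure follow that older API and are assumptions:

```python
# pip install gpt4all==0.3.4  (version pin is an assumption; newer releases drop chat_completion)
import gpt4all

model = gpt4all.GPT4All("ggml-gpt4all-j-v1.3-groovy")  # downloads the model on first use

question = "How do I search for issues in Jira?"
context = corpus[I[0][0]]  # document returned by the FAISS search above

messages = [
    {
        "role": "user",
        "content": f"Use the context below to answer the question.\nContext: {context}\nQuestion: {question}",
    }
]

response = model.chat_completion(messages)
print(response["choices"][0]["message"]["content"])
```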

Input:

Input message

Output:

LLM Response

Keeping Context alive

In a production environment for QA, it is desirable for the QA bot to retain the user’s previous messages. Therefore, we include the entire message chain from prior interactions when handling subsequent questions.
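A minimal sketch, continuing from the gpt4all example above; the follow-up question and the response structure are assumptions:

```python
# Append the assistant's last reply and the user's new question to the same
# messages list so the model sees the earlier exchange as context.
messages.append(response["choices"][0]["message"])
messages.append({"role": "user", "content": "Can I save that search for later?"})

response = model.chat_completion(messages)
print(response["choices"][0]["message"]["content"])
```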

Let's ask a new question that requires knowledge of the previous question:

Input 2

Response:

LLM Response

At times, the model may include unrelated questions in its response towards the end. Although this behaviour might seem unusual, it is not uncommon. The language generated by the model may vary, but the underlying meaning remains the same.

So there you have it — a QA bot equipped with contextual understanding.

To enhance your prompting skills, I recommend taking Isa Fulford and Andrew Ng’s short course on ‘ChatGPT Prompt Engineering for Developers’. They also offer other free short courses on ChatGPT programming, including application development and the use of langchain. Be sure to explore these valuable resources.

Other options

You can also try out other open source LLMs based on your business problem.

Falcon 7B Instruct (7B means 7 billion parameters; the original Falcon 40B Instruct has 40 billion parameters and requires 100GB of RAM) also shows promising results.

The following code shows how you can use Falcon 7B Instruct from the transformers library:
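A minimal sketch, assuming the tiiuae/falcon-7b-instruct checkpoint on Hugging Face and illustrative generation parameters; adjust both to your hardware:

```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

question = "How do I search for issues in Jira?"
context = "..."  # placeholder: document returned by the ANN search

prompt = f"Use the context below to answer the question.\nContext: {context}\nQuestion: {question}\nAnswer:"
sequences = generator(
    prompt,
    max_new_tokens=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
print(sequences[0]["generated_text"])
```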

And the output of this LLM:

Falcon 7B Response

I would say the response is not bad, not bad at all.

You can also try Open Assistant, which is also the base model for HuggingChat.

Or if you’ve got enough RAM, go for Vicuna (there is a 13B version as well as a 7B version).

Check out this awesome website where you can chat with almost all open-source LLMs: https://chat.lmsys.org/

The GPT4All tool also has a server mode, where your local tool acts as a server and you can call it via a URL.

Using ChatGPT for QA

You can also use ChatGPT for your QA bot. Half of the above-mentioned process is the same, up to creating the ANN index.

After that, you can pass the context along with the question to the openai.ChatCompletion API.
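A minimal sketch, assuming the openai Python package with the pre-1.0 interface (which exposes openai.ChatCompletion) and an OPENAI_API_KEY environment variable:

```python
import openai  # pre-1.0 openai package; reads OPENAI_API_KEY from the environment

def answer_with_context(question: str, context: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(answer_with_context("Can I save my search results?", "<document returned by the ANN search>"))
```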

You can also use Langchain to build a complete QA bot, including context search and serving. There are many resources available online for that.

ChatGPT LangChain Example for Chatbot Q&A

From notebook to Production

You can download and store the LLM model just like any other machine learning model. It can be deployed on an endpoint for inference. However, there are several other important components in a QA bot project besides the LLM.

  1. One crucial aspect is fast encoding of the input query using an embedding model. This allows for an efficient representation of the query.
  2. Additionally, it involves storing the index (the search structure) and retrieving it during inference. If the index fits, it can be kept in memory to minimize latency.
  3. Another consideration is remembering each user's chat history to maintain context.

When it comes to system design, there are multiple ways to implement this system, each with its own trade-offs. Let’s discuss a high-level design approach.

High Level Design

System Design for QA Model

One approach is to use an AWS Lambda function to convert the input text into embeddings and find the context. Then, pass the input message, together with that context, to a SageMaker endpoint, which generates a response that is returned to the client. To store the user's chat history, you can use RDS Proxy to connect the Lambda function to an RDS (Relational Database Service) database.
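A minimal sketch of such a Lambda handler, assuming boto3, a hypothetical SageMaker endpoint named qa-llm-endpoint, and a placeholder find_context() standing in for the embedding and index lookup described earlier:

```python
import json
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

def find_context(question: str) -> str:
    # Placeholder: encode the question and query the ANN index (see earlier sections).
    return ""

def handler(event, context):
    body = json.loads(event["body"])
    question = body["question"]
    doc_context = find_context(question)

    payload = {"question": question, "context": doc_context}
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName="qa-llm-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    answer = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(answer)}
```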

I recommend checking out “Unfulfilled promise of serverless” by Corey Quinn for further insights.

If you provide customer support through platforms like Discord or Slack, you can create a simple bot specifically for those platforms. This bot can append user messages to an AWS SQS (Simple Queue Service), and the SQS messages can trigger the Lambda function to follow the same process as described above.

System Design with a Discord Bot

By considering these aspects and exploring different implementation approaches, you can build an effective QA bot system.

Best Practices

  1. Use long polling for SQS
  2. Always return a batchItemFailures list from Lambda, so that SQS removes successfully processed messages from the queue and retries only the failed ones
  3. Use a dead-letter queue for your main SQS queue
  4. Prevent harmful behaviour of your bot by making use of SageMaker Model Monitor
  5. Never share your users' chat history with any third party for processing, analytics or any other purpose
  6. And follow the general best practices of your organization

Personal Anecdote

I strongly feel that while Large Language Models (LLMs) have shown impressive capabilities in question answering and numerous other tasks, their role in a Question Answering (QA) bot for an organization does have certain limitations.

1. Contextual Understanding: LLMs excel at understanding and generating responses based on the given context. However, when it comes to analytical and transactional queries, such as retrieving specific data or performing complex operations, LLMs may not possess the necessary domain-specific knowledge or structured access to databases. They might struggle to execute tasks beyond their pre-trained capabilities.

Example: Suppose a user asks an analytical query like, “What are the total sales for Product X in the last quarter?” LLMs may struggle to directly query the organization’s database and provide an accurate answer without additional integration with specific data processing systems.

2. Database Integration: Integrating LLMs with databases and enabling them to perform transactional queries requires additional infrastructure and functionality beyond their core language processing capabilities. While it is possible to incorporate database access and data retrieval mechanisms, it can be challenging to maintain real-time synchronization and efficient query processing within the LLM architecture.

Example: If a user wants to find their meetings for the day, LLMs may struggle to directly query the database and retrieve the relevant information. This type of interaction requires seamless integration between the LLM, database systems and data processing logic. And this is what we should work on: instead of merely pointing users towards self-help, serve them the actual information.

Check out my article on Supercharge Analytics with ChatGPT.

3. Complexity and Resource Constraints: LLMs are computationally intensive models that require significant computational resources and time for training and inference. Integrating complex analytical and transactional capabilities within the LLM framework can further increase resource requirements and may hinder real-time responsiveness.

Example: Performing complex computations or database operations within the LLM architecture, such as running sophisticated analytics on large datasets, may lead to high computational costs and slow response times, making it impractical for interactive user experiences.

4. Task-Specific Tools and APIs: For certain analytical or transactional tasks, specialized tools and APIs designed for that specific purpose may already exist. These tools may offer more efficient and optimized solutions compared to relying solely on LLMs for all tasks.

Example: Instead of using an LLM to query meeting schedules from a database, existing calendaring tools or APIs that directly interact with the organization’s scheduling system may provide a more reliable and efficient approach.

Having read this article, I trust that you have acquired the skills necessary to construct a baseline version of a QA bot tailored to your organization.

If you’re interested in learning about various methods to deploy an LLM, I invite you to read my other article available at: https://mrmaheshrajput.medium.com/how-to-deploy-a-llm-chatbot-7f1e10dd202e

You can connect with me on LinkedIn: https://www.linkedin.com/in/maheshrajput/
