DIY PDF_Chatbot

Mriganka Nath
6 min read · Oct 19, 2023


Here I present how you can make a simple chatbot that lets you interact with PDFs using the power of open-source LLMs. I tried to keep it as simple as possible, avoiding frameworks like LangChain; perhaps later I can improve the chatbot by incorporating LangChain's power and giving the bot more versatility. Here I used Llama 2 from Hugging Face as the LLM, pdfminer to read the PDF, FAISS for embedding storage and retrieval, Sentence Transformers to get the embeddings of our input text (from the PDF), and PyTorch to deal with everything else. For this project, I was very inspired by the open-source community. I have provided their links in the references; please take a moment to explore their work.

Notebook with all the code: https://www.kaggle.com/code/mrinath/pdf-chatbot.

I used the 7b-chat-hf version of Llama 2 as my LLM for answering queries. The 7b-chat version has been instruction-tuned for chatting and responds as if we are conversing with it. However, when dealing with entirely new topics not covered in its training data, it might give incorrect or made-up answers. To address this, we need to provide the LLM with context so it can prepare an accurate response. This is where Sentence Transformers and FAISS come into play.

First, we divide our PDF into chunks, as only a few chunks from the entire document will be used as context for our LLM. Sentence Transformers help us convert these PDF chunks into embeddings using a pretrained transformer. FAISS then compares these embeddings with the embedding of our query and returns the top matches along with their corresponding chunks. So finally, we have our context for the LLM. The whole process can be shown as:

[Figure: overall architecture]

A system where relevant context is first retrieved and the LLM then generates the answer from it is called Retrieval Augmented Generation (RAG). RAG systems not only answer our queries but also let us view the context that was used. I believe this is beneficial because we can see which part of the PDF was utilized, enabling us to double-check the answers.

The first step is loading the LLM. To use Llama 2, you have to request access from Meta, authenticate with Hugging Face, and get an access token. LLMs are very large in terms of memory and are slow even in inference mode while generating. To run them efficiently on our GPUs, we can use quantization; I will add further links if you want to delve deeper into how quantization works. Here I used BitsAndBytesConfig to apply 4-bit quantization, reducing the memory required for the LLM. Additionally, we need to specify stop tokens to tell the LLM when to halt the generation process.

import torch
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

# 4-bit quantization so the 7B model fits comfortably on a single GPU
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# you need a Hugging Face access token (stored as a secret in the notebook)
hf_auth = secret_value_0
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
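
The generate_text function used further down is a standard transformers text-generation pipeline. Here is a minimal sketch of how it can be set up, with stop tokens handled via StoppingCriteria; the exact stop strings and generation parameters below are assumptions, not taken from the notebook.

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

# strings after which generation should stop (example values, an assumption)
stop_list = ['<|endoftext|>', '\nQuestion:']
stop_token_ids = [
    torch.LongTensor(tokenizer(s, add_special_tokens=False)['input_ids']).to(model.device)
    for s in stop_list
]

class StopOnTokens(transformers.StoppingCriteria):
    # return True as soon as the most recent tokens match any stop sequence
    def __call__(self, input_ids, scores, **kwargs):
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

generate_text = transformers.pipeline(
    task='text-generation',
    model=model,
    tokenizer=tokenizer,
    stopping_criteria=transformers.StoppingCriteriaList([StopOnTokens()]),
    return_full_text=False,  # return only the newly generated text
    max_new_tokens=512
)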

Then we preprocess the PDF, dividing it into chunks so that the sentence transformer can convert each chunk into an embedding. Embeddings are rich in semantic information and are useful for retrieval-related tasks.
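
Before chunking, the raw text first has to be pulled out of the PDF. With pdfminer (the pdfminer.six package) this is a one-liner; a minimal sketch, with the file name as a placeholder:

from pdfminer.high_level import extract_text

# extract the raw text of the whole PDF as a single string
text = extract_text("paper.pdf")  # placeholder path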

def preprocess(text, step_size=70, window_size=100):
    # Collapse all whitespace so the PDF text becomes one long string,
    # then slide a window of window_size words over it with stride step_size,
    # so consecutive chunks overlap. Collect the chunks in a list.
    # Code adapted from Abhishek's tutorial; link at the end.

    text = " ".join(text.split())
    text_tokens = text.split()

    sentences = []
    for i in range(0, len(text_tokens), step_size):
        sentence = text_tokens[i : i + window_size]
        if len(sentence) < window_size:
            # drop the trailing window if it is shorter than window_size
            break
        sentences.append(sentence)

    paragraphs = [" ".join(s) for s in sentences]
    return paragraphs

from sentence_transformers import SentenceTransformer

def make_embs(para, query):
    # pretrained sentence transformer, used for both the chunks and the query
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    model.max_seq_length = 512

    embeddings = model.encode(
        para,
        show_progress_bar=True,
        convert_to_tensor=True,
    )
    query_embeddings = model.encode(query, convert_to_tensor=True)

    return embeddings, query_embeddings

Next, we use FAISS to search for embeddings that are “close” to our query. By L2-normalizing the embeddings and using an inner-product index, FAISS effectively compares the chunks and the query by cosine similarity. Here I retrieve the top k = 5 most similar chunks. Top k is a hyperparameter, and you can experiment with fewer or more.

import faiss

def search(embeddings, query_embed):
    # move the tensors to CPU numpy arrays for FAISS
    emb_np = embeddings.cpu().numpy()
    query_np = query_embed.unsqueeze(0).float().cpu().numpy()

    # L2-normalize in place so that inner product equals cosine similarity
    faiss.normalize_L2(emb_np)
    faiss.normalize_L2(query_np)

    res = faiss.StandardGpuResources()
    index = faiss.GpuIndexFlatIP(res, emb_np.shape[1])  # flat inner-product index on the GPU
    index.add(emb_np)

    k = 5  # number of nearest neighbors to search for
    distances, indices = index.search(query_np, k)  # search for the top-k similar embeddings
    return indices

The indices returned here tell us which chunks of the PDF are most similar to our query, and these selected chunks become the context for our LLM.
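
As a rough sketch of this step (the variable names are assumptions, since the notebook's glue code is not reproduced in this post), the retrieved chunks are simply concatenated into a single string:

# paragraphs comes from preprocess(); indices comes from search()
top_chunks = [paragraphs[i] for i in indices[0]]   # top-k most similar chunks
joined_context = " ".join(top_chunks)              # context string handed to the LLM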

Along with the context (all concatenated), we can add a “PROMPT” to guide the LLM on what exactly we want and how the LLM should respond to our query. This is where we can be a bit creative and instruct the LLM accordingly.

PROMPT = """Use the following pieces of context to answer the question at the END.
Don't try to make up an answer. Don't make up new terms and try to be precise while being creative. You can add context
if you are very very sure about your information.
{context}"""

END = "\n<|prompter|>{query}<|endoftext|>"

Finally, we use this query on the LLM and get the required response. I used the MULLER paper, which was recently presented at ICCV 2023.


q = "How is the proposed resizing technique different from other techniques like Bilinear Interpolation and the others?"

# adding prompt to our query
query = PROMPT.format(context=joined_context)
query += END.format(query=q)

response = generate_text(query)


### ANSWER BY LLM

The proposed resizing technique, called MULLER, differs from other techniques like bilinear interpolation in several ways:
1. Non-linearity: Unlike bilinear interpolation, which uses a linear combination of neighboring pixels to estimate the resized image, MULLER uses a non-linear function to map the input image to the target resolution. This allows MULLER to capture more complex relationships between pixels and preserve more visual information.
2. Multi-resolution: MULLER uses a multi-resolution approach, where each layer of the resizer processes the input image at a different scale. This allows MULLER to capture both local and global features of the image, leading to better performance. In contrast, bilinear interpolation only considers the immediate neighbors of a pixel when estimating its resized value.
3. Trainability: MULLER is trained end-to-end, allowing it to learn the optimal mapping between input and output images. This makes MULLER more flexible and adaptable than traditional interpolation methods, which are often heuristic and do not take into account the specific characteristics of the input image.
4. Scalability: MULLER is designed to be scalable, allowing it to handle large images and high-resolution targets without significant computational overhead. This makes MULLER suitable for use in large-scale computer vision applications, where speed and efficiency are critical. In contrast, bilinear interpolation can become computationally expensive for large images, especially when interpolating between high-resolution targets.
5. Content retention: MULLER is designed to preserve the content of the input image during resizing. This is achieved through the use of a non-linear function that maps the input image to the target resolution, rather than simply averaging the values of nearby pixels. This allows MULLER to retain more of the original image's details and texture, leading to better performance in image classification tasks. In contrast, bilinear interpolation can lead to loss of content, especially when interpolating between very different resolutions.

One thing I noticed is that sometimes the LLM doesn’t provide any response at all. In such cases, try rephrasing your question or use your prompt-engineering skills to guide how you want the LLM to respond and how much additional information it should add on its own. I have also observed that sometimes the LLM doesn’t give precise answers, even when the prompt asks it to. This might be due to the context-retrieval part, where the LLM doesn’t get proper context for the answer. We can add a ranking model on top of retrieval to keep only the best contexts, as sketched below.
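
One way to do this (my own suggestion, not part of the original notebook) is a cross-encoder re-ranker from the sentence-transformers library, which scores each (query, chunk) pair directly:

from sentence_transformers import CrossEncoder

# paragraphs and indices as in the retrieval sketch above
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
candidates = [paragraphs[i] for i in indices[0]]           # chunks returned by FAISS
scores = reranker.predict([(q, c) for c in candidates])    # one relevance score per pair
ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)]
joined_context = " ".join(ranked[:3])                      # keep only the best few as context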

So finally we have our LLM-based chatbot, which can work with your PDFs.

Useful Links

1) https://www.youtube.com/watch?v=jkrNMKz9pWU&ab_channel=JeremyHoward — a code-first review of LLMs by Jeremy Howard

References

1) https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476

2) https://gist.github.com/abhishekkrthakur/401c39d422fb6beff1600effe81f498a

3) https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb
