Personal assistant on a local machine with an 8 GB GPU

Khoa Le, Ph.D.
5 min read · Dec 28, 2023


The past year has seen groundbreaking advances in language model technology that open new possibilities and expand what individuals across many fields can do. As a data scientist at a product company, my daily routine involves reading a large number of documents and answering questions from colleagues. An intelligent assistant capable of reading those documents and answering my questions would change how I work and significantly boost productivity.

In this blog, we’ll explore the transformative impact of the latest advancements in LLM (Large Language Model) technology and how it can revolutionize document handling and information retrieval for professionals like myself.

In this tutorial, I create a chatbot based on the Llama 2 large language model and use the LangChain library to build a vector database from multiple sources such as PDFs and web pages. The Llama 2 model is quantized to 4 bits so it can run on an 8 GB GPU. New knowledge is passed as context to the LLM so it can answer questions about the documents. The whole project is on my GitHub.

Full pipeline: a vector database is built from websites and PDFs and combined with a pretrained LLM to create a chatbot assistant.

Language model

The first stage is to create the large language model. Here I use the transformers library from Hugging Face to load the pretrained weights of Llama 2, but you can choose another model available on the Hugging Face Hub. On the first run, you need to provide your Hugging Face account token, which is used to request authorization for the pretrained model and to download the weights to your local machine. The pipeline class then wraps the pretrained model and the tokenizer.
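If you have not authenticated with Hugging Face on your machine yet, one way to do it is shown in the sketch below (it assumes the huggingface_hub package; the token string is a placeholder, and hf_auth is the variable reused in the snippets that follow):

# Authenticate with the Hugging Face Hub so gated models such as Llama 2 can be downloaded
from huggingface_hub import login

hf_auth = "<your Hugging Face access token>"  # placeholder, never commit a real token
login(token=hf_auth)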

from torch import bfloat16
import transformers
import torch
from langchain.llms import HuggingFacePipeline

model_id = 'meta-llama/Llama-2-7b-chat-hf'

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = "cuda"

# load the full-precision model onto the GPU (hf_auth is the token defined above)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    use_auth_token=hf_auth,
).to("cuda:0")
model.eval()

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# wrap the model and tokenizer in a text-generation pipeline
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,   # LangChain expects the full text
    task='text-generation',
    temperature=0.1,         # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=256,      # max number of tokens to generate in the output
    repetition_penalty=1.1,  # without this the output begins repeating
    torch_dtype=torch.float16,
    device='cuda:0'
)

llm = HuggingFacePipeline(pipeline=generate_text)

Then we can test the LLM:

response = llm("Explain the difference between a Data Lakehouse and a Data Warehouse.")
print(response)

The model created above is the full-precision Llama 2 model, which requires roughly 16 GB of GPU memory. To run it on an 8 GB GPU, we need a few modifications:

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.init_device = "cuda"

# load the model in 4-bit precision; device_map='auto' lets accelerate place it on the GPU
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    use_auth_token=hf_auth,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# the quantized model is already placed on the GPU, so no device argument is needed here
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,   # LangChain expects the full text
    task='text-generation',
    temperature=0.1,         # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=256,      # max number of tokens to generate in the output
    repetition_penalty=1.1,  # without this the output begins repeating
)

llm = HuggingFacePipeline(pipeline=generate_text)

With this change, the model is quantized to 4-bit precision and fits within 8 GB of GPU memory. We can rerun the LLM test with this new model.
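As a quick sanity check (a small sketch using standard transformers and PyTorch utilities, not part of the original pipeline), you can inspect how much memory the quantized model actually occupies:

# rough check of the quantized model's memory usage
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")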

Vector store

LangChain is an open-source framework that lets developers combine large language models with other external components to build LLM-powered applications.

The main concept behind the library is that various components can be chained together to build more advanced applications on top of large language models (LLMs); LangChain itself is organized into modules that provide these building blocks.
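As a small illustration of that chaining idea (a minimal sketch that reuses the llm built above; the prompt text and question are just examples), a prompt template and the model can be combined into a chain:

from langchain import PromptTemplate
from langchain.chains import LLMChain

# toy example: chain a prompt template with the LLM built above
example_prompt = PromptTemplate(
    template="Answer in one sentence: {question}",
    input_variables=["question"],
)
chain = LLMChain(llm=llm, prompt=example_prompt)
print(chain.run(question="What is a vector database?"))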

Chroma, often known as ChromaDB, is an open-source embedding database designed to simplify the development of Large Language Model (LLM) applications. It facilitates the storage and retrieval of embeddings and their metadata, along with documents and queries.

from langchain.document_loaders import DirectoryLoader, PyPDFLoader, WebBaseLoader
from langchain.document_loaders.merge import MergedDataLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma


class VectorStoreChromab:
    def __init__(self, output_dir="chroma_db"):
        self.output_dir = output_dir

    def get_loader(self):
        # merge a PDF directory loader with a web loader
        loader_pdf = DirectoryLoader(path='data', glob="*.pdf", loader_cls=PyPDFLoader)
        loader_web = self.get_web_loader()
        self.loader = MergedDataLoader(loaders=[loader_web, loader_pdf])

    def get_web_loader(self):
        web_links = [
            "https://luna16.grand-challenge.org/Data/",
            "https://luna16.grand-challenge.org/",
            "https://luna16.grand-challenge.org/Description/",
            "https://luna16.grand-challenge.org/Evaluation/",
        ]
        return WebBaseLoader(web_links)

    def get_documents(self):
        # interpret the documents and split them into overlapping chunks
        documents = self.loader.load()
        splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        texts = splitter.split_documents(documents)
        return texts

    def get_embeddings(self):
        # create the embedding model
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            model_kwargs={'device': 'cpu'})

    def create(self):
        self.get_loader()
        self.get_embeddings()
        texts = self.get_documents()

        # create and persist the vector store database
        db = Chroma.from_documents(documents=texts, embedding=self.embeddings,
                                   persist_directory=self.output_dir)
        print(f"Saved vector store to {self.output_dir}")
        return db

    def load_vector_store(self):
        self.get_embeddings()
        # load the persisted vector store from the local directory
        db = Chroma(persist_directory=self.output_dir, embedding_function=self.embeddings)
        return db


store = VectorStoreChromab()
db = store.create()  # db is now the Chroma vector store used below

The vector store works by breaking information from the various sources (PDFs and web pages) into chunks; the embeddings of these chunks are then computed and persisted to a local directory. You are free to swap in a different text splitter or embedding model, as sketched below.
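For example (a hypothetical variation, not the configuration used in this project), you could switch to larger chunks and a stronger sentence-transformers model:

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

# hypothetical alternative: larger chunks and a bigger embedding model
alt_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=100)
alt_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cpu'})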

Question and answer

The final section of this tutorial involves creating a prompt template for your query. The template comprises two key elements: the context, i.e., the document chunks retrieved from the vector store built earlier, and the user's question. LangChain's PromptTemplate class combines these elements into the final prompt.

The response is produced by the QA chain, which combines the large language model (LLM), the prompt, and a retriever over the vector store. The chain generates answers by blending the model's existing knowledge with the information retrieved from your documents.

from langchain import PromptTemplate
from langchain.chains import RetrievalQA


def create_prompt_template():
    # prepare the template we will use when prompting the AI
    template = """Use the provided context to answer the user's question.
If you don't know the answer, respond with "I do not know".

Context: {context}
Question: {question}
Answer:
"""
    prompt = PromptTemplate(
        template=template,
        input_variables=['context', 'question'])
    return prompt


def create_qa_chain(db, model):
    prompt = create_prompt_template()

    # build a retriever over the vector store and wrap it in a RetrievalQA chain
    retriever = db.as_retriever(search_kwargs={'k': 2})
    qa_chain = RetrievalQA.from_chain_type(llm=model,
                                           chain_type='stuff',
                                           retriever=retriever,
                                           # return_source_documents=True,
                                           chain_type_kwargs={'prompt': prompt})
    return qa_chain


def generate_response(query, qa_chain):
    return qa_chain({'query': query})['result']


query = "What data does the LUNA16 challenge provide?"  # example question about the indexed documents
qa_chain = create_qa_chain(db, llm)
response = generate_response(query=query, qa_chain=qa_chain)
print(response)

If you don't need the large language model (LLM) to phrase an answer, you can query the vector store directly and retrieve the passages most relevant to your question.

docs = db.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    doc_details = doc.to_json()['kwargs']
    print("Source: ", doc_details['metadata']['source'])
    print("Text: ", doc_details['page_content'], "\n")

Conclusion

In conclusion, this tutorial walked through a practical example of building a chatbot or personal assistant with LLM technology. The key takeaway is that there is no need to retrain the pretrained LLM on your personal knowledge; you simply embed your documents and retrieve the relevant chunks when questions arise. Stay tuned for more content by following me!


Khoa Le, Ph.D.

I do Data Science on Medical Imaging and Finance, and love them both.