How to Build an LLM Application from Scratch?
In practice, building an LLM application from scratch often means starting with a pre-trained model as a base and then fine-tuning it on specific data, because adapting a pre-trained model saves enormous time and compute compared to training an LLM from the ground up.
In this section, we will guide you in building a PDF Q&A System using Google Colab,
which will enable users to upload a PDF and ask questions to retrieve targeted
information from the document. The system uses pre-trained models and data
processing techniques to deliver relevant answers based on PDF content. To
understand how one picks a pre-trained LLM for a given use case, check out this
podcast on How to Choose an LLM for your next AI Project by Dr. Saigeetha
Jegannathan, Chief Data Scientist and AI-ML Leader at IBM (Generative AI, Watson X).
The primary components of this project include:
Document Processing: Splits the PDF into chunks for efficient retrieval.
Model Loading: Loads a language model from Hugging Face for text
generation.
QA System: Uses document embedding and retrieval-based QA to answer
questions related to the PDF content.
Gradio UI: Allows users to upload a PDF, ask questions, and receive answers in
real time.
To get started, please download and then upload the following files to the specified
locations:
1. custom_logger.py inside a folder named utils in the current directory of your Google Colab session (create the folder if it doesn't already exist). If you cannot download the original file, a minimal sketch is shown after this list.
2. config.json in the current directory of your Google Colab session.
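The rest of the code only assumes that utils/custom_logger.py exposes a logger object. If you need a starting point, here is a minimal sketch along those lines; the logger name and message format are assumptions, not the contents of the original file:

# utils/custom_logger.py -- minimal sketch; the original file may differ.
import logging

logger = logging.getLogger("pdf_qa")
logger.setLevel(logging.INFO)

# Attach a console handler only once, so re-running the cell does not duplicate output.
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)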
Then, run the following commands in your Google Colab notebook:
!pip install gradio langchain accelerate sentence_transformers pypdf tiktoken faiss-gpu bitsandbytes
!pip install -U langchain-community
Once the libraries are installed, proceed to import the following:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
import pickle
import os
from utils.custom_logger import logger
1) Document Processing
This module handles the ingestion, segmentation, and embedding of PDF documents to
enable efficient querying. It contains the following three classes:
DataLoadPDF: Handles loading and reading PDF files.
DataSplitter: Manages splitting of document data into chunks with overlap, if
necessary.
EmbeddingManager: Manages the creation, loading, and saving of
embeddings.
A. Loading PDF
First, we define the DataLoadPDF class that takes a PDF file path as input and uses
PyPDFLoader from LangChain to read and extract the document's pages. The
load_data function outputs a list of pages from the PDF, serving as the base content
for further processing.
class DataLoadPDF:
    """
    A class for loading data from a PDF file.
    """
    def __init__(self, file_path):
        """
        Initialize the DataLoadPDF instance.

        Args:
            file_path (str): Path to the PDF file to load.
        """
        self.file_path = file_path

    def load_data(self):
        """
        Load data from the PDF file.

        Returns:
            list: List of pages from the PDF.
        """
        logger.info(f"Reading file {os.path.basename(self.file_path)} ... ")
        loader = PyPDFLoader(self.file_path)
        pages = loader.load()
        return pages
B. Splitting the Data into Chunks
Next, we define the DataSplitter class, which splits long documents into smaller chunks using TokenTextSplitter to ensure efficient processing and improve retrieval accuracy. The class splits the pages into chunks based on chunk_size (the size of each chunk) and chunk_overlap (the overlap between consecutive chunks), and its split_data method returns a list of split document segments, enhancing the model's ability to handle long texts by breaking them down into digestible pieces.
class DataSplitter:
    """
    A class for splitting data into chunks.
    """
    def __init__(self, chunk_size, chunk_overlap):
        """
        Initialize the DataSplitter instance.

        Args:
            chunk_size (int): Size of each chunk.
            chunk_overlap (int): Overlap between consecutive chunks.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_data(self, pages):
        """
        Split data into chunks.

        Args:
            pages (list): List of data pages.

        Returns:
            list: List of split documents.
        """
        logger.info(f"Document splitting with chunk_size {self.chunk_size} and chunk_overlap {self.chunk_overlap} ... ")
        text_splitter = TokenTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            length_function=len
        )
        docs = text_splitter.split_documents(pages)
        return docs
C. Creating Embeddings
After that, we jump to creating embeddings (vectorized representations of text,
enabling the model to understand context and semantics) and define the
EmbeddingManager Class. It relies on:
HuggingFaceEmbeddings to generate embeddings for document chunks.
FAISS, a library for similarity search and clustering, which provides efficient
methods for managing large numbers of embeddings.
Pickle for saving and loading the FAISS index to and from disk.
Let us first define and initialize the constructor of the class with parameters:
model_name represents the name of the embedding model to use, passed
during initialization.
embeddings represents an instance of HuggingFaceEmbeddings initialized with
the model_name for creating embeddings.
class EmbeddingManager:
    """
    A class for managing document embeddings.
    """
    def __init__(self, model_name):
        """
        Initialize the EmbeddingManager instance.

        Args:
            model_name (str): Name of the embedding model.
        """
        self.model_name = model_name
        logger.info(f"Loading embeddings Model {self.model_name} ... ")
        self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)
Now, we define the create_embeddings function. It has the parameter:
docs representing a list of document chunks (usually text chunks from a PDF) for
which embeddings need to be generated.
This method generates embeddings for the provided document chunks using
FAISS.from_documents, which creates a FAISS index (optimized for similarity search)
from the documents and embeddings. It returns self.doc_embedding, a FAISS index
containing the embeddings for the document chunks.
    def create_embeddings(self, docs):
        """
        Create embeddings for documents.

        Args:
            docs (list): List of documents.

        Returns:
            FAISS: Document embeddings.
        """
        logger.info(f"Creating document embeddings for {len(docs)} split ... ")
        self.doc_embedding = FAISS.from_documents(docs, self.embeddings)
        return self.doc_embedding
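To make the role of the FAISS index concrete, here is an illustrative snippet that assumes an EmbeddingManager instance named embedding_manager and a list of split chunks named docs already exist; similarity_search is the LangChain FAISS method that retrieves the chunks closest to a query:

# Illustrative only: build the index, then fetch the 3 chunks most similar to a query.
doc_embedding = embedding_manager.create_embeddings(docs)
similar_chunks = doc_embedding.similarity_search("What is this document about?", k=3)
for chunk in similar_chunks:
    print(chunk.page_content[:200])  # preview each retrieved chunk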
Next, we have the save_embedding function with the parameter:
file_name represents the file name (without path) for storing the embedding
data.
This method checks if the directory embeddings_data exists; if not, it creates the
directory.
It also saves the FAISS embeddings index as a pickle file in the specified directory,
using the provided file name with a .pkl extension.
    def save_embedding(self, file_name):
        """
        Save document embeddings to a file.

        Args:
            file_name (str): Name of the file to save the embeddings.
        """
        embedding_dir = "embeddings_data"
        if not os.path.exists(embedding_dir):
            os.mkdir(embedding_dir)
        file_name = os.path.basename(file_name)
        logger.info(f"Saving document embeddings: {'embeddings_data/'+file_name} ... ")
        with open("embeddings_data/"+file_name+".pkl", "wb") as f:
            pickle.dump(self.doc_embedding, f)
The load_embedding method takes the file_name as the input and loads the FAISS
embeddings index from a specified pickle file in the embeddings_data directory. It
returns the loaded self.doc_embedding.
    def load_embedding(self, file_name):
        """
        Load document embeddings from a file.

        Args:
            file_name (str): Name of the file to load the embeddings.

        Returns:
            FAISS: Loaded document embeddings.
        """
        file_name = os.path.basename(file_name)
        logger.info(f"Loading document embeddings locally: {'embeddings_data/'+file_name} ... ")
        with open("embeddings_data/"+file_name+".pkl", "rb") as f:
            self.doc_embedding = pickle.load(f)
        return self.doc_embedding
The check_embedding_available function checks if a .pkl file with the embeddings
exists in the embeddings_data directory. It logs whether the embedding file exists and
returns True if it does, False otherwise.
    def check_embedding_available(self, file_name):
        """
        Check if document embeddings are available in a file.

        Args:
            file_name (str): Name of the file to check.

        Returns:
            bool: True if document embeddings are available, False otherwise.
        """
        file_name = os.path.basename(file_name)
        doc_check = os.path.isfile("embeddings_data/"+file_name+".pkl")
        logger.info(f"Is document embedding found: {doc_check}")
        return doc_check
D. Document Processing
Finally, we have the DocumentProcessor Class that coordinates the entire data
curation process. It initializes each step and checks if pre-existing embeddings are
available before reprocessing. The process_document method checks if embeddings
are saved; if not, it triggers PDF loading, splitting, embedding creation, and storage. Let
us understand its components in detail:
First, we define the constructor of the class DocumentProcessor with the following
variables:
model_name represents the name of the embedding model to use.
chunk_size represents the size of each document chunk for processing.
chunk_overlap represents the overlap between consecutive chunks, which
helps in maintaining context between chunks.
embedding_manager is an instance of the EmbeddingManager class, used to
manage embedding creation, loading, and saving.
class DocumentProcessor:
    """
    A class for processing documents and managing embeddings.
    """
    def __init__(self, model_name, chunk_size, chunk_overlap):
        """
        Initialize the DocumentProcessor instance.

        Args:
            model_name (str): Name of the embedding model.
            chunk_size (int): Size of each chunk.
            chunk_overlap (int): Overlap between consecutive chunks.
        """
        logger.info(f"Initializing document processor parameters - embedding model_name: {model_name}, chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap} ... ")
        self.model_name = model_name
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_manager = EmbeddingManager(model_name)
We now define the process_document method that takes file_path as input, representing the path to the document file (PDF) to be processed, and performs the following steps:
Check Embedding Availability
First, it checks if embeddings for the document already exist using
check_embedding_available from EmbeddingManager. If available, it loads
and returns these embeddings to avoid redundant computation.
Data Loading
If embeddings aren’t available, it loads the document using DataLoadPDF, which
reads the PDF file and retrieves its content as a series of pages.
Data Splitting
The content is then split into chunks using DataSplitter, based on the specified
chunk size and overlap.
Embedding Creation
The split document chunks are passed to create_embeddings in
EmbeddingManager, which generates embeddings for each chunk.
Saving Embeddings
Finally, it saves the newly created embeddings using save_embedding, which
stores them for future use.
Return value
It returns doc_embedding, the generated embeddings for the document, as a
FAISS (Facebook AI Similarity Search) index for efficient similarity-based
retrieval.
    def process_document(self, file_path):
        """
        Process a document and manage embeddings.

        Args:
            file_path (str): Path to the document file.

        Returns:
            FAISS: Document embeddings.
        """
        if self.embedding_manager.check_embedding_available(file_path):
            return self.embedding_manager.load_embedding(file_path)
        else:
            data_loader = DataLoadPDF(file_path)
            pages = data_loader.load_data()
            data_splitter = DataSplitter(self.chunk_size, self.chunk_overlap)
            docs = data_splitter.split_data(pages)
            doc_embedding = self.embedding_manager.create_embeddings(docs)
            self.embedding_manager.save_embedding(file_path)
            return doc_embedding
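Before moving on, here is a quick, hypothetical standalone run of the document-processing pipeline; the embedding model name and the PDF file name are illustrative placeholders (in the full app, these values come from config.json and the Gradio upload):

# Hypothetical standalone usage of the document-processing pipeline.
processor = DocumentProcessor(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed embedding model
    chunk_size=512,
    chunk_overlap=20,
)
doc_embedding = processor.process_document("sample.pdf")  # hypothetical PDF path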
We are now ready to discuss the code for the question-answering (QA) system. We will
define two classes: ModelLoader, which loads and configures a language model, and
QASystem, which configures the QA system using a retrieval-based approach. But
before we understand the two classes in detail, we need to set up the environment by
running the following code:
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import torch
from utils.custom_logger import logger
2) Model Loading
The ModelLoader class is responsible for loading a language model for text generation.
The __init__ Method initializes the ModelLoader instance.
model_id is the identifier of the pretrained model (e.g., "gpt2" or any causal
language model hosted on Hugging Face).
max_length represents the maximum length of the generated text.
temperature is an LLM parameter that controls the randomness of predictions
(lower values make the model more conservative).
load_int8: If True, loads the model in 8-bit precision for memory efficiency.
class ModelLoader:
    """
    A class responsible for loading the language model.
    """
    def __init__(self, model_id, max_length, temperature, load_int8):
        """
        Initialize the ModelLoader instance.

        Args:
            model_id (str): Identifier of the pretrained model.
            max_length (int): Maximum length of generated text.
            temperature (float): Temperature parameter for text generation.
            load_int8 (bool): Whether to load the model in 8-bit precision.
        """
        self.model_id = model_id
        self.max_length = max_length
        self.temperature = temperature
        self.load_int8 = load_int8
The load_model method loads the language model using Hugging Face’s transformers
library.
Tokenization loads the tokenizer using the AutoTokenizer.from_pretrained
method with model_id.
Model Loading: If load_int8 is True, the model is loaded in 8-bit precision (which
saves memory and is often used for large models); otherwise, it is loaded with
torch.bfloat16 precision for potentially faster inference.
Pipeline Setup sets up a text generation pipeline using the pipeline function
with the loaded model and tokenizer, setting max_length and temperature.
Return statement wraps the pipeline in a HuggingFacePipeline object and
returns it, which can be used in the QA system.
    def load_model(self):
        """
        Load the language model using the specified model_id, max_length, and temperature.

        Returns:
            HuggingFacePipeline: Loaded language model.
        """
        logger.info(f"Loading LLM model {self.model_id} with max_length {self.max_length} and temperature {self.temperature}...\n")
        tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        if self.load_int8:
            model = AutoModelForCausalLM.from_pretrained(
                self.model_id, load_in_8bit=True, device_map="auto"
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                self.model_id, torch_dtype=torch.bfloat16, device_map="auto"
            )
        logger.info("Model is loaded successfully\n")
        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_length=self.max_length,
            temperature=self.temperature,
        )
        llm = HuggingFacePipeline(pipeline=pipe)
        return llm
3) Q&A System
The QASystem class sets up a QA system that uses a retrieval-based approach with
the loaded language model.
The __init__ Method initializes the QASystem instance.
llm is the language model pipeline from ModelLoader.
Prompt Template defines a prompt for the model to answer questions based on
a provided context. The prompt encourages the model to answer only if it knows
the answer, reducing the chances of generating incorrect information.
Prompt Configuration creates a PromptTemplate with the template, setting the
variables {context} and {question} to be replaced dynamically.
Chain Configuration prepares chain_type_kwargs with the prompt template,
which will be passed when setting up the retrieval QA system.
class QASystem:
    """
    A class representing a Question Answering (QA) system.
    """
    def __init__(self, llm):
        """
        Initialize the QASystem instance.

        Args:
            llm (HuggingFacePipeline): Loaded language model for text generation.
        """
        self.llm = llm
        self.prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Answer :"""
        PROMPT = PromptTemplate(
            template=self.prompt_template, input_variables=["context", "question"]
        )
        self.chain_type_kwargs = {
            "prompt": PROMPT,
        }
The setup_retrieval_qa method sets up the retrieval-based QA system using
document embeddings.
doc_embedding represents the document embeddings, likely derived from a
knowledge base, to facilitate retrieval.
RetrievalQA constructs a retrieval-based QA system using the
RetrievalQA.from_chain_type method. This connects the language model to a
retriever that searches through embeddings to find relevant documents.
o chain_type="stuff" specifies the type of chain to be used for combining
results, though this might need adjustment based on the specific use
case.
The method returns the configured RetrievalQA instance.
    def setup_retrieval_qa(self, doc_embedding):
        """
        Set up the retrieval-based QA system.

        Args:
            doc_embedding: Document embedding for retrieval.

        Returns:
            RetrievalQA: Configured retrieval-based QA system.
        """
        logger.info("Setting up retrieval QA system...\n")
        qa = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # You might need to replace this with the appropriate chain type.
            retriever=doc_embedding.as_retriever(),
            chain_type_kwargs=self.chain_type_kwargs,
        )
        return qa
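To see how the last two pieces fit together before we wire up the full app, here is a minimal, hypothetical sketch; the model id and parameter values are placeholders, and the real values are read from config.json in the next step:

# Hypothetical wiring of ModelLoader and QASystem; values are placeholders.
model_loader = ModelLoader(model_id="gpt2", max_length=1024, temperature=0.1, load_int8=False)
llm = model_loader.load_model()
qa = QASystem(llm).setup_retrieval_qa(doc_embedding)  # doc_embedding from DocumentProcessor
print(qa({"query": "What is this document about?"})["result"])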
We will now build the system to find relevant information within the PDF using a
language model from Hugging Face for question-answering and document
embeddings.
Authentication with Hugging Face Hub
The login function authenticates with the Hugging Face Hub using a provided token
(YOUR_HF_TOKEN). Follow these steps by Hugging Face to generate the required
token for running this app. This token gives access to private models or features on
Hugging Face.
from huggingface_hub import login
# Replace 'YOUR_HF_TOKEN' with the token you generated
login(token='YOUR_HF_TOKEN')
Imports and Configurations
Import necessary libraries, including Gradio (for creating the web interface), re (for text
cleaning), and custom modules: DocumentProcessor (for handling PDF document
processing), ModelLoader (to load the language model), and QASystem (for the
question-answering system).
import gradio as gr
import json
import re
from utils.custom_logger import logger
Now, load the settings from the config.json file that you uploaded to the current directory of your Google Colab session. This file contains information such as the model ID, chunk sizes, and other parameters for the language model and document processing. We also use logger to record that the configuration was loaded.
with open('config.json', 'r') as config_file:
    config = json.load(config_file)

logger.info(f"Loaded config file: {config}")
Document Processing and Model Loading
Now, it is time to create an instance of each of the predefined classes.
DocumentProcessor prepares the PDF document by splitting it into chunks,
allowing the model to handle large files more effectively.
ModelLoader loads a language model based on the configuration settings (e.g.,
model ID, max length, temperature).
QASystem initializes a question-answering system with the loaded language
model.
# Loading embedding model
document_processor = DocumentProcessor(
    model_name=config["embedding_model_name"],
    chunk_size=config["chunk_size"],
    chunk_overlap=config["chunk_overlap"],
)

# Load model globally
model_loader = ModelLoader(
    config["model_id"], config["max_length"], config["temperature"], config["load_int8"]
)
llm = model_loader.load_model()
qa_system = QASystem(llm)
Setting Global Variables for PDF and Document Embedding
The variables below track the current PDF file and its document embeddings, allowing
efficient retrieval without reprocessing the PDF every time a question is asked.
# Initialize global variable for doc_embedding
doc_embedding = None
pdf_file_name = None
qa = None
Chatbot Function
It is finally time to define the chatbot function, which performs the following tasks:
Inputs
It takes in a pdf_file (uploaded by the user) and a query (user's question).
Document Check
If the PDF file is new, it resets the doc_embedding and processes the new PDF
with DocumentProcessor.
Setup Retrieval QA
If doc_embedding is not set, it generates document embeddings and sets up the
retrieval-based QA system.
Query Answering
It passes the user’s question to the qa system, retrieves the answer, and cleans
up extra line breaks with re.sub.
Output
It returns the answer, which will be displayed in the Gradio interface.
def chatbot(pdf_file, query):
    global doc_embedding
    global pdf_file_name
    global qa
    if pdf_file_name is None or pdf_file_name != pdf_file.name or doc_embedding is None:
        logger.info("New PDF Found Resetting doc_embedding")
        doc_embedding = None
        pdf_file_name = pdf_file.name
    if doc_embedding is None:
        logger.info("Starting for new doc_embedding")
        doc_embedding = document_processor.process_document(pdf_file.name)
        qa = qa_system.setup_retrieval_qa(doc_embedding)
    result = qa({"query": query})
    return re.sub(r'\n+', '\n', result['result'])
4) Gradio UI Interface
We will now set up a Gradio-based web interface that allows users to upload a PDF
and ask questions about its content. The Gradio Blocks code snippet below does the
following:
Defines a web interface layout using Gradio Blocks.
Creates an upload component (pdf_file) for the PDF, and textboxes for the query
(user question) and output (model's response).
Connects the Submit button to the chatbot function, so when clicked, it executes
the function and displays the result in the output box.
Launches the Gradio demo, with share=True to make the app accessible via a
public link.
with gr.Blocks(theme=gr.themes.Default(primary_hue="red", secondary_hue="pink")) as demo:
    gr.Markdown("# Ask your Question to PDF Document")
    with gr.Row():
        with gr.Column(scale=4):
            pdf_file = gr.File(label="Upload your PDF")
            output = gr.Textbox(label="output", lines=3)
            query = gr.Textbox(label="query")
            btn = gr.Button("Submit")
            btn.click(fn=chatbot, inputs=[pdf_file, query], outputs=[output])

gr.close_all()
demo.launch(share=True)
Once you have run everything, the app will launch with a simple web UI. To test it, you can upload ProjectPro’s “Generative AI Interview Questions and Answers” PDF and ask the question “What is Generative AI?”. The application will return an appropriate response.
That’s all! Pat yourself on the back for making it this far.
Complete ProjectPro's GenAI Certification Course to demonstrate expertise in AI
technologies!
This project (inspired by Vijay Maurya’s PDF-QLM LLM from scratch project on GitHub)
is just one example of the immense potential of LLM-based applications. There are
countless other exciting use cases awaiting your expertise, and ProjectPro is here to
help you with that.
Learn How to Build LLM Model from Scratch with
ProjectPro!
At ProjectPro, you can explore a treasure trove of 250+ solved projects prepared by
industry experts to learn data science, big data, and generative AI from scratch. If you're an
established professional, our customized learning paths and solved projects help you
pick up right where your current skill set and experience leave off. Start building
amazing LLM projects with ProjectPro’s guidance tailored for every level. Subscribe
today!