How to Build an LLM Application from Scratch?
In practice, building an LLM application from scratch often means starting with a pre-trained model as a base and then fine-tuning it on specific data, because adapting a pre-trained model saves enormous time and compute compared to training an LLM from the ground up.
In this section, we will guide you in building a PDF Q&A System using Google Colab,
which will enable users to upload a PDF and ask questions to retrieve targeted
information from the document. The system uses pre-trained models and data
processing techniques to deliver relevant answers based on PDF content. To
understand how one picks a pre-trained LLM for a given use case, check out this
podcast on How to Choose an LLM for your next AI Project by Dr. Saigeetha
Jegannathan, Chief Data Scientist and AI-ML Leader at IBM (Generative AI, Watson X).
The primary components of this project include:
Document Processing: Splits the PDF into chunks for efficient retrieval.
Model Loading: Loads a language model from Hugging Face for text
generation.
QA System: Uses document embedding and retrieval-based QA to answer
questions related to the PDF content.
Gradio UI: Allows users to upload a PDF, ask questions, and receive answers in
real time.
To get started, please download and then upload the following files to the specified
locations:
1. custom_logger.py inside a folder named utils in the current directory of your Google Colab session (create the folder if it doesn't already exist). If you cannot download the original file, a minimal sketch is shown after this list.
2. config.json in the current directory of your Google Colab session.
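The rest of the code only assumes that utils/custom_logger.py exposes a logger object. If you need a starting point, here is a minimal sketch along those lines; the logger name and message format are assumptions, not the contents of the original file:

# utils/custom_logger.py -- minimal sketch; the original file may differ.
import logging

logger = logging.getLogger("pdf_qa")
logger.setLevel(logging.INFO)

# Attach a console handler only once, so re-running the cell does not duplicate output.
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)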
Then, run the following commands in your Google Colab notebook:
!pip install gradio langchain accelerate sentence_transformers pypdf tiktoken faiss-gpu bitsandbytes
!pip install -U langchain-community
Once the libraries are installed, proceed to import the following:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
import pickle
import os
from utils.custom_logger import logger
1) Document Processing
This module handles the ingestion, segmentation, and embedding of PDF documents to
enable efficient querying. It contains the following three classes:
DataLoadPDF: Handles loading and reading PDF files.
DataSplitter: Manages splitting of document data into chunks with overlap, if
necessary.
EmbeddingManager: Manages the creation, loading, and saving of
embeddings.
A. Loading PDF
First, we define the DataLoadPDF class that takes a PDF file path as input and uses
PyPDFLoader from LangChain to read and extract the document's pages. The
load_data function outputs a list of pages from the PDF, serving as the base content
for further processing.
class DataLoadPDF:
    """
    A class for loading data from a PDF file.
    """
    def __init__(self, file_path):
        """
        Initialize the DataLoadPDF instance.

        Args:
            file_path (str): Path to the PDF file to load.
        """
        self.file_path = file_path

    def load_data(self):
        """
        Load data from the PDF file.

        Returns:
            list: List of pages from the PDF.
        """
        logger.info(f"Reading file {os.path.basename(self.file_path)} ... ")
        loader = PyPDFLoader(self.file_path)
        pages = loader.load()
        return pages
B. Splitting the Data into Chunks
Next, we define the DataSplitter class, which splits long documents into smaller chunks using TokenTextSplitter to ensure efficient processing and improve retrieval accuracy. The class splits the pages into chunks based on chunk_size (the size of each chunk) and chunk_overlap (the overlap between consecutive chunks), and its split_data method returns a list of split document segments, enhancing the model's ability to handle long texts by breaking them down into digestible pieces.
class DataSplitter:
    """
    A class for splitting data into chunks.
    """
    def __init__(self, chunk_size, chunk_overlap):
        """
        Initialize the DataSplitter instance.

        Args:
            chunk_size (int): Size of each chunk.
            chunk_overlap (int): Overlap between consecutive chunks.
        """
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def split_data(self, pages):
        """
        Split data into chunks.

        Args:
            pages (list): List of data pages.

        Returns:
            list: List of split documents.
        """
        logger.info(f"Document splitting with chunk_size {self.chunk_size} and chunk_overlap {self.chunk_overlap} ... ")
        text_splitter = TokenTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            length_function=len
        )
        docs = text_splitter.split_documents(pages)
        return docs
C. Creating Embeddings
After that, we jump to creating embeddings (vectorized representations of text,
enabling the model to understand context and semantics) and define the
EmbeddingManager Class. It relies on:
HuggingFaceEmbeddings to generate embeddings for document chunks.
FAISS, a library for similarity search and clustering, which provides efficient
methods for managing large numbers of embeddings.
Pickle for saving and loading the FAISS index to and from disk.
Let us first define and initialize the constructor of the class with parameters:
model_name represents the name of the embedding model to use, passed
during initialization.
embeddings represents an instance of HuggingFaceEmbeddings initialized with
the model_name for creating embeddings.
class EmbeddingManager:
    """
    A class for managing document embeddings.
    """
    def __init__(self, model_name):
        """
        Initialize the EmbeddingManager instance.

        Args:
            model_name (str): Name of the embedding model.
        """
        self.model_name = model_name
        logger.info(f"Loading embeddings Model {self.model_name} ... ")
        self.embeddings = HuggingFaceEmbeddings(model_name=self.model_name)
Now, we define the create_embeddings function. It has the parameter:
docs representing a list of document chunks (usually text chunks from a PDF) for
which embeddings need to be generated.
This method generates embeddings for the provided document chunks using
FAISS.from_documents, which creates a FAISS index (optimized for similarity search)
from the documents and embeddings. It returns self.doc_embedding, a FAISS index
containing the embeddings for the document chunks.
    def create_embeddings(self, docs):
        """
        Create embeddings for documents.

        Args:
            docs (list): List of documents.

        Returns:
            FAISS: Document embeddings.
        """
        logger.info(f"Creating document embeddings for {len(docs)} split ... ")
        self.doc_embedding = FAISS.from_documents(docs, self.embeddings)
        return self.doc_embedding
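To make the role of the FAISS index concrete, here is an illustrative snippet that assumes an EmbeddingManager instance named embedding_manager and a list of split chunks named docs already exist; similarity_search is the LangChain FAISS method that retrieves the chunks closest to a query:

# Illustrative only: build the index, then fetch the 3 chunks most similar to a query.
doc_embedding = embedding_manager.create_embeddings(docs)
similar_chunks = doc_embedding.similarity_search("What is this document about?", k=3)
for chunk in similar_chunks:
    print(chunk.page_content[:200])  # preview each retrieved chunk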
Next, we have the save_embedding function with the parameter:
file_name represents the file name (without path) for storing the embedding
data.
This method checks if the directory embeddings_data exists; if not, it creates the
directory.
It also saves the FAISS embeddings index as a pickle file in the specified directory,
using the provided file name with a .pkl extension.
    def save_embedding(self, file_name):
        """
        Save document embeddings to a file.

        Args:
            file_name (str): Name of the file to save the embeddings.
        """
        embedding_dir = "embeddings_data"
        if not os.path.exists(embedding_dir):
            os.mkdir(embedding_dir)
        file_name = os.path.basename(file_name)
        logger.info(f"Saving document embeddings: {'embeddings_data/'+file_name} ... ")
        with open("embeddings_data/"+file_name+".pkl", "wb") as f:
            pickle.dump(self.doc_embedding, f)
The load_embedding method takes the file_name as the input and loads the FAISS
embeddings index from a specified pickle file in the embeddings_data directory. It
returns the loaded self.doc_embedding.
    def load_embedding(self, file_name):
        """
        Load document embeddings from a file.

        Args:
            file_name (str): Name of the file to load the embeddings.

        Returns:
            FAISS: Loaded document embeddings.
        """
        file_name = os.path.basename(file_name)
        logger.info(f"Loading document embeddings locally: {'embeddings_data/'+file_name} ... ")
        with open("embeddings_data/"+file_name+".pkl", "rb") as f:
            self.doc_embedding = pickle.load(f)
        return self.doc_embedding
The check_embedding_available function checks if a .pkl file with the embeddings
exists in the embeddings_data directory. It logs whether the embedding file exists and
returns True if it does, False otherwise.
    def check_embedding_available(self, file_name):
        """
        Check if document embeddings are available in a file.

        Args:
            file_name (str): Name of the file to check.

        Returns:
            bool: True if document embeddings are available, False otherwise.
        """
        file_name = os.path.basename(file_name)
        doc_check = os.path.isfile("embeddings_data/"+file_name+".pkl")
        logger.info(f"Is document embedding found: {doc_check}")
        return doc_check
D. Document Processing
Finally, we have the DocumentProcessor Class that coordinates the entire data
curation process. It initializes each step and checks if pre-existing embeddings are
available before reprocessing. The process_document method checks if embeddings
are saved; if not, it triggers PDF loading, splitting, embedding creation, and storage. Let
us understand its components in detail:
First, we define the constructor of the class DocumentProcessor with the following
variables:
model_name represents the name of the embedding model to use.
chunk_size represents the size of each document chunk for processing.
chunk_overlap represents the overlap between consecutive chunks, which
helps in maintaining context between chunks.
embedding_manager is an instance of the EmbeddingManager class, used to
manage embedding creation, loading, and saving.
class DocumentProcessor:
    """
    A class for processing documents and managing embeddings.
    """
    def __init__(self, model_name, chunk_size, chunk_overlap):
        """
        Initialize the DocumentProcessor instance.

        Args:
            model_name (str): Name of the embedding model.
            chunk_size (int): Size of each chunk.
            chunk_overlap (int): Overlap between consecutive chunks.
        """
        logger.info(f"Initializing document processor parameters - embedding model_name: {model_name}, chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap} ... ")
        self.model_name = model_name
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_manager = EmbeddingManager(model_name)
We now define the process_document method that takes file_path as input, representing the path to the document file (PDF) to be processed, and performs the following steps:
Check Embedding Availability
First, it checks if embeddings for the document already exist using
check_embedding_available from EmbeddingManager. If available, it loads
and returns these embeddings to avoid redundant computation.
Data Loading
If embeddings aren’t available, it loads the document using DataLoadPDF, which
reads the PDF file and retrieves its content as a series of pages.
Data Splitting
The content is then split into chunks using DataSplitter, based on the specified
chunk size and overlap.
Embedding Creation
The split document chunks are passed to create_embeddings in
EmbeddingManager, which generates embeddings for each chunk.
Saving Embeddings
Finally, it saves the newly created embeddings using save_embedding, which
stores them for future use.
Return value
It returns doc_embedding, the generated embeddings for the document, as a
FAISS (Facebook AI Similarity Search) index for efficient similarity-based
retrieval.
    def process_document(self, file_path):
        """
        Process a document and manage embeddings.

        Args:
            file_path (str): Path to the document file.

        Returns:
            FAISS: Document embeddings.
        """
        if self.embedding_manager.check_embedding_available(file_path):
            return self.embedding_manager.load_embedding(file_path)
        else:
            data_loader = DataLoadPDF(file_path)
            pages = data_loader.load_data()
            data_splitter = DataSplitter(self.chunk_size, self.chunk_overlap)
            docs = data_splitter.split_data(pages)
            doc_embedding = self.embedding_manager.create_embeddings(docs)
            self.embedding_manager.save_embedding(file_path)
            return doc_embedding
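Before moving on, here is a quick, hypothetical standalone run of the document-processing pipeline; the embedding model name and the PDF file name are illustrative placeholders (in the full app, these values come from config.json and the Gradio upload):

# Hypothetical standalone usage of the document-processing pipeline.
processor = DocumentProcessor(
    model_name="sentence-transformers/all-MiniLM-L6-v2",  # assumed embedding model
    chunk_size=512,
    chunk_overlap=20,
)
doc_embedding = processor.process_document("sample.pdf")  # hypothetical PDF path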
We are now ready to discuss the code for the question-answering (QA) system. We will
define two classes: ModelLoader, which loads and configures a language model, and
QASystem, which configures the QA system using a retrieval-based approach. But
before we understand the two classes in detail, we need to set up the environment by
running the following code:
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import torch
from utils.custom_logger import logger
2) Model Loading
The ModelLoader class is responsible for loading a language model for text generation.
The __init__ Method initializes the ModelLoader instance.
model_id is the identifier of the pretrained model (e.g., "gpt2" or any causal
language model hosted on Hugging Face).
max_length represents the maximum length of the generated text.
temperature is an LLM parameter that controls the randomness of predictions
(lower values make the model more conservative).
load_int8: If True, loads the model in 8-bit precision for memory efficiency.
class ModelLoader:
    """
    A class responsible for loading the language model.
    """
    def __init__(self, model_id, max_length, temperature, load_int8):
        """
        Initialize the ModelLoader instance.

        Args:
            model_id (str): Identifier of the pretrained model.
            max_length (int): Maximum length of generated text.
            temperature (float): Temperature parameter for text generation.
            load_int8 (bool): Whether to load the model in 8-bit precision.
        """
        self.model_id = model_id
        self.max_length = max_length
        self.temperature = temperature
        self.load_int8 = load_int8
The load_model method loads the language model using Hugging Face’s transformers
library.
Tokenization loads the tokenizer using the AutoTokenizer.from_pretrained
method with model_id.
Model Loading: If load_int8 is True, the model is loaded in 8-bit precision (which
saves memory and is often used for large models); otherwise, it is loaded with
torch.bfloat16 precision for potentially faster inference.
Pipeline Setup sets up a text generation pipeline using the pipeline function
with the loaded model and tokenizer, setting max_length and temperature.
Return statement wraps the pipeline in a HuggingFacePipeline object and
returns it, which can be used in the QA system.
    def load_model(self):
        """
        Load the language model using the specified model_id, max_length, and temperature.

        Returns:
            HuggingFacePipeline: Loaded language model.
        """
        logger.info(f"Loading LLM model {self.model_id} with max_length {self.max_length} and temperature {self.temperature}...\n")
        tokenizer = AutoTokenizer.from_pretrained(self.model_id)
        if self.load_int8:
            model = AutoModelForCausalLM.from_pretrained(
                self.model_id, load_in_8bit=True, device_map="auto"
            )
        else:
            model = AutoModelForCausalLM.from_pretrained(
                self.model_id, torch_dtype=torch.bfloat16, device_map="auto"
            )
        logger.info("Model is loaded successfully\n")
        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_length=self.max_length,
            temperature=self.temperature,
        )
        llm = HuggingFacePipeline(pipeline=pipe)
        return llm
3) Q&A System
The QASystem class sets up a QA system that uses a retrieval-based approach with
the loaded language model.
The __init__ Method initializes the QASystem instance.
llm is the language model pipeline from ModelLoader.
Prompt Template defines a prompt for the model to answer questions based on
a provided context. The prompt encourages the model to answer only if it knows
the answer, reducing the chances of generating incorrect information.
Prompt Configuration creates a PromptTemplate with the template, setting the
variables {context} and {question} to be replaced dynamically.
Chain Configuration prepares chain_type_kwargs with the prompt template,
which will be passed when setting up the retrieval QA system.
class QASystem:
    """
    A class representing a Question Answering (QA) system.
    """
    def __init__(self, llm):
        """
        Initialize the QASystem instance.

        Args:
            llm (HuggingFacePipeline): Loaded language model for text generation.
        """
        self.llm = llm
        self.prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Answer :"""
        PROMPT = PromptTemplate(
            template=self.prompt_template, input_variables=["context", "question"]
        )
        self.chain_type_kwargs = {
            "prompt": PROMPT,
        }
The setup_retrieval_qa method sets up the retrieval-based QA system using
document embeddings.
doc_embedding represents the document embeddings, likely derived from a
knowledge base, to facilitate retrieval.
RetrievalQA constructs a retrieval-based QA system using the
RetrievalQA.from_chain_type method. This connects the language model to a
retriever that searches through embeddings to find relevant documents.
o chain_type="stuff" specifies the type of chain to be used for combining
results, though this might need adjustment based on the specific use
case.
The method returns the configured RetrievalQA instance.
    def setup_retrieval_qa(self, doc_embedding):
        """
        Set up the retrieval-based QA system.

        Args:
            doc_embedding: Document embedding for retrieval.

        Returns:
            RetrievalQA: Configured retrieval-based QA system.
        """
        logger.info("Setting up retrieval QA system...\n")
        qa = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # You might need to replace this with the appropriate chain type.
            retriever=doc_embedding.as_retriever(),
            chain_type_kwargs=self.chain_type_kwargs,
        )
        return qa
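To see how the last two pieces fit together before we wire up the full app, here is a minimal, hypothetical sketch; the model id and parameter values are placeholders, and the real values are read from config.json in the next step:

# Hypothetical wiring of ModelLoader and QASystem; values are placeholders.
model_loader = ModelLoader(model_id="gpt2", max_length=1024, temperature=0.1, load_int8=False)
llm = model_loader.load_model()
qa = QASystem(llm).setup_retrieval_qa(doc_embedding)  # doc_embedding from DocumentProcessor
print(qa({"query": "What is this document about?"})["result"])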
We will now build the system to find relevant information within the PDF using a
language model from Hugging Face for question-answering and document
embeddings.
Authentication with Hugging Face Hub
The login function authenticates with the Hugging Face Hub using a provided token
(YOUR_HF_TOKEN). Follow these steps by Hugging Face to generate the required
token for running this app. This token gives access to private models or features on
Hugging Face.
from huggingface_hub import login
# Replace 'YOUR_HF_TOKEN' with the token you generated
login(token='YOUR_HF_TOKEN')
Imports and Configurations
Import necessary libraries, including Gradio (for creating the web interface), re (for text
cleaning), and custom modules: DocumentProcessor (for handling PDF document
processing), ModelLoader (to load the language model), and QASystem (for the
question-answering system).
import gradio as gr
import json
import re
from utils.custom_logger import logger
Now, load the settings from the config.json file that you uploaded to the current directory of your Google Colab session. This file contains information such as the model ID, chunk sizes, and other parameters for the language model and document processing. We also use logger to record that the configuration was loaded.
with open('config.json', 'r') as config_file:
    config = json.load(config_file)

logger.info(f"Loaded config file: {config}")
Document Processing and Model Loading
Now, it is time to create an instance of each of the predefined classes.
DocumentProcessor prepares the PDF document by splitting it into chunks,
allowing the model to handle large files more effectively.
ModelLoader loads a language model based on the configuration settings (e.g.,
model ID, max length, temperature).
QASystem initializes a question-answering system with the loaded language
model.
# Loading embedding model
document_processor = DocumentProcessor(
    model_name=config["embedding_model_name"],
    chunk_size=config["chunk_size"],
    chunk_overlap=config["chunk_overlap"],
)

# Load model globally
model_loader = ModelLoader(
    config["model_id"], config["max_length"], config["temperature"], config["load_int8"]
)
llm = model_loader.load_model()
qa_system = QASystem(llm)
Setting Global Variables for PDF and Document Embedding
The variables below track the current PDF file and its document embeddings, allowing
efficient retrieval without reprocessing the PDF every time a question is asked.
# Initialize global variable for doc_embedding
doc_embedding = None
pdf_file_name = None
qa = None
Chatbot Function
It is finally time to define the chatbot function, which performs the following tasks:
Inputs
It takes in a pdf_file (uploaded by the user) and a query (user's question).
Document Check
If the PDF file is new, it resets the doc_embedding and processes the new PDF
with DocumentProcessor.
Setup Retrieval QA
If doc_embedding is not set, it generates document embeddings and sets up the
retrieval-based QA system.
Query Answering
It passes the user’s question to the qa system, retrieves the answer, and cleans
up extra line breaks with re.sub.
Output
It returns the answer, which will be displayed in the Gradio interface.
def chatbot(pdf_file, query):
    global doc_embedding
    global pdf_file_name
    global qa
    if pdf_file_name is None or pdf_file_name != pdf_file.name or doc_embedding is None:
        logger.info("New PDF Found Resetting doc_embedding")
        doc_embedding = None
        pdf_file_name = pdf_file.name
    if doc_embedding is None:
        logger.info("Starting for new doc_embedding")
        doc_embedding = document_processor.process_document(pdf_file.name)
        qa = qa_system.setup_retrieval_qa(doc_embedding)
    result = qa({"query": query})
    return re.sub(r'\n+', '\n', result['result'])
4) Gradio UI Interface
We will now set up a Gradio-based web interface that allows users to upload a PDF
and ask questions about its content. The Gradio Blocks code snippet below does the
following:
Defines a web interface layout using Gradio Blocks.
Creates an upload component (pdf_file) for the PDF, and textboxes for the query
(user question) and output (model's response).
Connects the Submit button to the chatbot function, so when clicked, it executes
the function and displays the result in the output box.
Launches the Gradio demo, with share=True to make the app accessible via a
public link.
with gr.Blocks(theme=gr.themes.Default(primary_hue="red", secondary_hue="pink")) as demo:
    gr.Markdown("# Ask your Question to PDF Document")
    with gr.Row():
        with gr.Column(scale=4):
            pdf_file = gr.File(label="Upload your PDF")
            output = gr.Textbox(label="output", lines=3)
            query = gr.Textbox(label="query")
            btn = gr.Button("Submit")
            btn.click(fn=chatbot, inputs=[pdf_file, query], outputs=[output])

gr.close_all()
demo.launch(share=True)
Once you have run everything, the app will launch with a simple web UI. To test it, you can upload ProjectPro’s “Generative AI Interview Questions and Answers” PDF and ask the question “What is Generative AI?”. The application will return an appropriate response.
That’s all! Pat yourself on the back for making it this far.
Complete ProjectPro's GenAI Certification Course to demonstrate expertise in AI
technologies!
This project (inspired by Vijay Maurya’s PDF-QLM LLM from scratch project on GitHub)
is just one example of the immense potential of LLM-based applications. There are
countless other exciting use cases awaiting your expertise, and ProjectPro is here to
help you with that.
Learn How to Build LLM Model from Scratch with
ProjectPro!
At ProjectPro, you can explore a treasure trove of 250+ solved projects prepared by
industry experts to learn data science, big data, and generative AI from scratch. If you're an
established professional, our customized learning paths and solved projects help you
pick up right where your current skill set and experience leave off. Start building
amazing LLM projects with ProjectPro’s guidance tailored for every level. Subscribe
today!