PRE-TRAINING
Pre-training is the first phase, in which a model learns general patterns and knowledge
from large datasets that often include text, images, or other forms of information.
This stage doesn't focus on task-specific details; instead, the model learns generic
features.
It allows the model to build a broad understanding of language, concepts, or visual
features, depending on its application.
Pre-training in AI refers to the process where a model is trained on a large amount
of general data before being fine-tuned for a specific task.
It's like teaching the AI the basics of a language, patterns, and knowledge about
the world so it has a strong foundation to build on later.
Models are trained on large amounts of unlabeled data.
The models learn generalizable features that can be used later.
The models are then fine-tuned for specific tasks.
Let's take a simple analogy:
Think of it like learning to read before reading a specific subject textbook.
First of all, you learn the alphabet, vocabulary, and grammar.
During pre-training, the model learns:
Word meanings
Sentence structures
Common sense knowledge
Relationships between things
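To make the idea concrete, here is a minimal sketch of the self-supervised next-token
objective behind GPT-style pre-training. The small public gpt2 checkpoint is used purely
for illustration; real pre-training runs this same loss over billions of tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the next-token-prediction objective: the raw text itself
# supplies the training targets, so no human labels are needed.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("The cat sat on the mat.", return_tensors="pt")
# Passing the input ids as labels makes the model compute the (shifted)
# cross-entropy loss of predicting each next token from the ones before it.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss)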
FINE-TUNING
Fine-tuning in AI refers to the process of taking a pretrained model and making
small adjustments to it so it performs better on a specific task or dataset.
Fine-tuning is a process in machine learning where a pre-trained model is further
trained on a smaller, task-specific dataset to improve its performance for a
particular application.
Instead of training a model from scratch, fine-tuning leverages the knowledge the
model has already gained from a large dataset and adapts it to a new but related
task.
🧠 Why Fine-Tune?
Saves time and resources compared to training a model from scratch.
Leverages existing knowledge (like language or image understanding).
Produces better results on niche or domain-specific tasks.
[ Millions of Random Images (cats, cars, trees) ]
            ↓
[ Pre-trained Vision Model (understands general image features) ]
            ↓
[ Fine-tuning with Damaged Car Images (learns the specific task) ]
            ↓
[ Final Model: Damaged Car Detector ]
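As a rough sketch of this pipeline (assuming PyTorch and torchvision, and a hypothetical
two-class "damaged vs. not damaged" dataset), fine-tuning might look like this:

import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on a large, generic image dataset (ImageNet)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task
# (hypothetical 2-class task: damaged vs. not damaged)
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are optimized during fine-tuning
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop over a hypothetical DataLoader of damaged-car images:
# for images, labels in damaged_car_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()

In practice you might also unfreeze the last few backbone layers with a small learning
rate once the new head has converged.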
Parameter-Efficient Fine-Tuning (LoRA, QLoRA)
Parameter-Efficient Fine-Tuning (PEFT) is a technique used in deep learning to
fine-tune large pre-trained models (like GPT, BERT, or Vision Transformers) with
fewer trainable parameters.
This helps save memory, reduce computational cost, and improve efficiency,
especially when adapting models to multiple tasks.
Fine-tuning the full model is expensive and requires significant GPU resources.
PEFT methods update only a small subset of parameters while keeping most of the
model frozen.
Helps when deploying models in resource-constrained environments.
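A quick back-of-the-envelope calculation shows why this matters, using LoRA (covered on
the next slides) as the example and assuming an illustrative 4096-dimensional hidden
size and rank 8:

# One 4096 x 4096 attention projection matrix:
full_finetune = 4096 * 4096          # ~16.8M weights updated if fully fine-tuned
r = 8                                # low-rank adapter size (illustrative)
lora_adapter = 4096 * r + r * 4096   # A (r x 4096) plus B (4096 x r): ~65K weights
print(lora_adapter / full_finetune)  # ~0.004, i.e. about 0.4% of the parameters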
With the rise of large language models (LLMs) like GPT, LLaMA, or BERT-based
models, training, or even fully fine-tuning, these models can be hugely expensive in
terms of compute, memory, and time.
That's where Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA
come in! We will look at them in the following slides.
LoRA stands for Low-Rank Adaptation of Large Language Models.
LoRA (Low-Rank Adaptation) is a technique used in machine learning, particularly in
fine-tuning large language models (LLMs) and other deep learning architectures.
It is designed to reduce the computational and memory costs associated with fine-
tuning massive pre-trained models.
The core idea of LoRA:
Instead of updating the entire set of weights in a large pretrained model during
fine-tuning, LoRA freezes the original weights and adds small trainable matrices
that approximate the changes — using low-rank matrix decomposition.
This approach drastically reduces the number of trainable parameters, making
training:
Faster
Cheaper
Less memory-intensive
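Conceptually, for a frozen weight matrix W, LoRA learns an update of the form
W + (alpha / r) * B * A, where A and B are small low-rank matrices. A minimal,
illustrative PyTorch sketch (not the official implementation; the initialization and
defaults here are assumptions):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear: freeze its weights and add a trainable
    low-rank update so the effective weight becomes W + (alpha / r) * B @ A."""
    def __init__(self, base_layer: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():   # freeze the pretrained W (and bias)
            p.requires_grad = False
        in_f, out_f = base_layer.in_features, base_layer.out_features
        self.lora_A = nn.Parameter(torch.randn(r, in_f) * 0.01)  # (r, in)
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))        # (out, r), starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus low-rank update; because B starts at zero, training
        # begins exactly at the pretrained model's behaviour.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt one 4096 x 4096 projection with only ~65K trainable weights
# layer = LoRALinear(nn.Linear(4096, 4096), r=8)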
1. Start with a Pretrained GPT-style Model
Use a general-purpose language model like LLaMA 2, GPT-J, or GPT-NeoX.
These models have strong general language understanding but lack specific knowledge
of your domain (e.g., customer support, legal advice, medical answers).
2. Prepare Domain-Specific Training Data
Collect conversations, documents, or Q&A logs from your domain.
Examples:
Customer support: Chat logs, ticket resolutions
Legal: Contracts, case summaries
Healthcare: Medical notes, patient inquiries
Clean and tokenize the data to feed into the model in a prompt-response format.
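For example, a single support exchange might be formatted and tokenized like this (the
prompt template and the LLaMA 2 tokenizer are just assumptions; any causal-LM tokenizer
works the same way):

from transformers import AutoTokenizer

# Hypothetical record taken from a customer-support chat log
example = {
    "prompt": "Where is my order?",
    "response": "Let me check your order status. Can you share your tracking number?",
}

# Tokenizer of the base model being fine-tuned (LLaMA 2 requires accepting its license)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# One possible prompt-response template; the exact format is a design choice
text = f"### Question:\n{example['prompt']}\n### Answer:\n{example['response']}"
tokens = tokenizer(text, truncation=True, max_length=512)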
3. Apply LoRA to Key Layers of the Model
LoRA inserts small, trainable matrices (A and B) into existing model layers (like
the query/key/value projections in attention).
These matrices are low-rank adapters that learn the domain-specific patterns.
The original weights of the GPT model are frozen—LoRA only trains these adapters.
Why this matters: You're only adding new knowledge, not overwriting what the model
already knows.
4. Train the Model on Domain Data
Use your domain data to fine-tune just the LoRA adapters.
This teaches the model:
How users typically ask questions in your domain
The correct terminology and response style
Contextual understanding specific to your use case
Example: Instead of replying “I’m not sure,” the fine-tuned model learns to say:
📦 “Let me check your order status. Can you share your tracking number?”
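Putting steps 1-4 together with the Hugging Face peft library might look roughly like
this; the base model id, target module names, and hyperparameters below are assumptions
that vary by architecture:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical base model; any causal LM from the Hub works similarly
# (LLaMA 2 requires accepting its license first)
base_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# LoRA adapters on the attention projections; module names vary by architecture
lora_config = LoraConfig(
    r=8,                                  # rank of the A/B matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # query/value projections (LLaMA naming)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically a fraction of a percent of all weights

# The wrapped model can now be trained on the tokenized domain data with
# transformers.Trainer or a plain PyTorch loop; only the adapters get gradients.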
QLORA
QLoRA = Quantized model + LoRA-based fine-tuning
QLoRA (Quantized Low-Rank Adaptation) is a recent and powerful technique that
combines quantization with low-rank adaptation to enable efficient fine-tuning of
large language models (LLMs) — even on consumer-grade GPUs.
QLoRA extends LoRA by quantizing the frozen base model's weights (typically to 4-bit).
Quantization reduces the precision of the stored weights, lowering memory and
computation requirements while retaining performance; the small LoRA adapters
themselves are still trained in higher precision.
Models fine-tuned with QLoRA remain very effective at understanding and generating
natural language. This makes QLoRA a valuable technique for applications that require a
deep understanding of context, such as language translation, content creation, and even
complex problem-solving tasks.
Load model in 4-bit:
Use bitsandbytes or Hugging Face’s transformers + accelerate
Memory usage drops to ~5-6GB for LLaMA-7B
Inject LoRA adapters:
Freeze original weights
Insert trainable low-rank matrices (A and B) in attention layers
Only train a few million parameters (not billions!)
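In code, this two-step recipe might look roughly like the sketch below, assuming the
transformers, bitsandbytes, and peft libraries and a hypothetical LLaMA-2-7B base model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: load the frozen base model in 4-bit (NF4) precision via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # assumed base model (license acceptance required)
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Step 2: inject higher-precision LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only a few million trainable parameters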
FLASH ATTENTION
📚 Traditional Attention:
Imagine you need to work out how every sentence in a long novel relates to every other
sentence.
You try to compare every sentence to every other sentence at once.
You write all possible relationships out on a giant whiteboard.
Problem: the whiteboard gets too big to handle, so the process is slow and
memory-intensive!
⚡ FlashAttention:
You read the novel in small sections (tiles).
You summarize each section as you go, without writing everything down.
You use a high-speed notepad (GPU SRAM) that’s very fast but small.
Result: You understand the whole book without memory overload, and much faster.
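The sketch below mirrors that analogy in plain PyTorch: it walks through the keys and
values in small tiles and keeps only running statistics (an online softmax), so the full
attention matrix is never written down. It is purely illustrative; the real
FlashAttention fuses this logic into a single GPU kernel that keeps each tile in fast
on-chip SRAM.

import torch

def tiled_attention(q, k, v, block_size=64):
    # q, k, v: (seq_len, head_dim) for a single attention head
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))
    row_sum = torch.zeros(seq_len, 1)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]      # one tile of keys
        v_blk = v[start:start + block_size]      # one tile of values
        scores = (q @ k_blk.T) * scale           # scores against this tile only

        # Online softmax: fold the new tile into the running max / sum / output
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)

        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum

The result matches standard softmax(Q K^T / sqrt(d)) V attention up to floating-point
rounding; only the order of computation and the memory traffic change.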
Benefits of Flash Attention:
Speed & Efficiency:
Significantly faster model training and inference.
Scalability:
Makes it feasible to train and deploy very large models that would otherwise be
impractical due to memory constraints.
Real-world Use Cases:
Reduces the cost and resource demand for training large models in applications like
autonomous driving, natural language understanding, and large-scale data
generation.
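In practice you rarely write the kernel yourself. For example, PyTorch 2.x exposes a
fused attention call that can dispatch to a FlashAttention kernel on supported GPUs; the
shapes and dtype below are just an illustration:

import torch
import torch.nn.functional as F

# Random stand-ins shaped (batch, heads, seq_len, head_dim), half precision on a CUDA GPU
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# May run a FlashAttention-style fused kernel under the hood (PyTorch 2.0+)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)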