
PRE-TRAINING

Pre-training is the first phase, in which a model learns general patterns and knowledge from large datasets that often include text, images, or other forms of information.
This stage doesn't focus on task-specific details; instead, the model learns generic features.
It allows the model to build a broad understanding of language, concepts, or visual features, depending on its application.

Pre-training in AI refers to the process where a model is trained on a large amount of general data before being fine-tuned for a specific task.
It's like teaching the AI the basics of a language, patterns, and knowledge about the world so it has a strong foundation to build on later.

Models are trained on large amounts of unlabeled data.
The models learn generalizable features that can be reused later.
The models are then fine-tuned for specific tasks.

Let's take a simple analogy:

Think of it like learning to read before reading a specific subject textbook.
First, you learn the alphabet, vocabulary, and grammar.

During pre-training, the model learns:


Word meanings
Sentence structures
Common sense knowledge
Relationships between things
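
For GPT-style models, all of this is picked up from one simple objective: predicting the next token of ordinary text. Below is a minimal sketch of that objective using the Hugging Face transformers library; the "gpt2" checkpoint and the sample sentence are illustrative assumptions, not part of the original material.

# Minimal sketch of the next-token prediction objective behind GPT-style
# pre-training. "gpt2" and the sample text are illustrative choices.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Pre-training teaches the model general patterns of language."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model compute the cross-entropy
# loss of predicting each token from the tokens before it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # pre-training repeats this step over billions of tokens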

FINE-TUNING

Fine-tuning in AI refers to the process of taking a pretrained model and making small adjustments to it so it performs better on a specific task or dataset.

Fine-tuning is a process in machine learning where a pre-trained model is further trained on a smaller, task-specific dataset to improve its performance for a particular application.
Instead of training a model from scratch, fine-tuning leverages the knowledge the model has already gained from a large dataset and adapts it to a new but related task.

🧠 Why Fine-Tune?

Saves time and resources compared to training a model from scratch.

Leverages existing knowledge (like language or image understanding).

Produces better results on niche or domain-specific tasks.

[ Millions of Random Images ]  =>  [ Pre-trained Vision Model ]
  (cats, cars, trees)              (understands general image features)

        [ Fine-tuning with Damaged Car Images ]
              (learns specific task)

        [ Final Model: Damaged Car Detector ]

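A hedged sketch of the pipeline in the diagram above, using PyTorch and torchvision: reuse an ImageNet-pretrained backbone and swap in a new head for the damaged-car task. The two-class setup and the freezing strategy are illustrative assumptions.

# Sketch only: a pre-trained vision backbone reused as a damaged-car detector.
import torch
import torch.nn as nn
from torchvision import models

# Backbone pre-trained on millions of generic images (ImageNet).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Replace the generic 1000-class head with a damaged / not-damaged head.
model.fc = nn.Linear(model.fc.in_features, 2)

# Freeze the backbone so only the new head is trained during fine-tuning.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# ...then run a standard cross-entropy training loop on the damaged-car images.
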
PARAMETER-EFFICIENT FINE-TUNING (LoRA, QLoRA)

Parameter-Efficient Fine-Tuning (PEFT) is a technique used in deep learning to fine-tune large pre-trained models (like GPT, BERT, or Vision Transformers) with fewer trainable parameters.
This helps save memory, reduce computational cost, and improve efficiency,
especially when adapting models to multiple tasks.
Fine-tuning the full model is expensive and requires significant GPU resources.
PEFT methods update only a small subset of parameters while keeping most of the
model frozen.
Helps when deploying models in resource-constrained environments.

With the rise of large language models (LLMs) like GPT, LLaMA, or BERT-based
models, training or even fine-tuning these models fully can be hugely expensive in
terms of compute, memory, and time.

That’s where Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA come in! We will see them in the following slides.

LoRA stands for Low-Rank Adaptation of Large Language Models.

LoRA (Low-Rank Adaptation) is a technique used in machine learning, particularly for fine-tuning large language models (LLMs) and other deep learning architectures.
It is designed to reduce the computational and memory costs associated with fine-tuning massive pre-trained models.

The core idea of LoRA:

Instead of updating the entire set of weights in a large pretrained model during fine-tuning, LoRA freezes the original weights and adds small trainable matrices that approximate the changes, using low-rank matrix decomposition.
This approach drastically reduces the number of trainable parameters, making
training:
Faster
Cheaper
Less memory-intensive
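
To make the idea concrete, here is a from-scratch sketch of a single LoRA-style linear layer in PyTorch: the pretrained weight W stays frozen and only the low-rank factors A and B are trained, so the effective weight becomes W + (alpha/r) * B @ A. The dimensions and hyperparameters below are illustrative.

# From-scratch sketch of the LoRA idea for one linear layer (not a library API).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # r x d_in
        self.B = nn.Parameter(torch.zeros(out_features, r))        # d_out x r, starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + (alpha/r) * B @ A; the update is never materialized.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable values vs. ~590,000 in the frozen base layer

Because B starts at zero, the adapted layer initially behaves exactly like the frozen pretrained layer, and training only has to learn the change.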

1. Start with a Pretrained GPT-style Model

Use a general-purpose language model like LLaMA 2, GPT-J, or GPT-NeoX.
These models have strong general language understanding but lack specific knowledge of your domain (e.g., customer support, legal advice, medical answers).

2. Prepare Domain-Specific Training Data

Collect conversations, documents, or Q&A logs from your domain.
Examples:
Customer support: Chat logs, ticket resolutions
Legal: Contracts, case summaries
Healthcare: Medical notes, patient inquiries
Clean and tokenize the data to feed into the model in a prompt-response format.

3. Apply LoRA to Key Layers of the Model

LoRA inserts small, trainable matrices (A and B) into existing model layers (like the query/key/value projections in attention).
These matrices are low-rank adapters that learn the domain-specific patterns.
The original weights of the GPT model are frozen; LoRA only trains these adapters.
Why this matters: You're only adding new knowledge, not overwriting what the model already knows.
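
A minimal sketch of this step with the Hugging Face peft library. GPT-2 is used here only because it downloads freely; for a LLaMA-style model the target modules would typically be ["q_proj", "v_proj"] instead. The rank and scaling values are illustrative.

# Sketch: wrap a pretrained model with LoRA adapters via the peft library.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the A and B adapter matrices
    lora_alpha=16,              # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Only a few hundred thousand adapter weights are trainable; the rest stay frozen.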

4. Train the Model on Domain Data

Use your domain data to fine-tune just the LoRA adapters.
This teaches the model:
How users typically ask questions in your domain
The correct terminology and response style
Contextual understanding specific to your use case
Example: Instead of replying “I’m not sure,” the fine-tuned model learns to say:
📦 “Let me check your order status. Can you share your tracking number?”
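
Continuing the sketch above, one way to train just those adapters with the Hugging Face Trainer; the toy support-chat texts, hyperparameters, and output directory are made-up assumptions.

# Sketch: train only the LoRA adapters on a tiny, made-up domain dataset.
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

texts = [  # stand-ins for real chat logs or ticket resolutions
    "User: Where is my order?\nAgent: Let me check your order status. Can you share your tracking number?",
    "User: My package arrived damaged.\nAgent: Sorry about that! I can arrange a replacement right away.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=peft_model,  # the LoRA-wrapped model from the previous sketch
    args=TrainingArguments(output_dir="lora-support-bot",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # gradients flow only through the adapter weights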

QLORA
QLoRA = Quantized model + LoRA-based fine-tuning

QLoRA (Quantized Low-Rank Adaptation) is a recent and powerful technique that combines quantization with low-rank adaptation to enable efficient fine-tuning of large language models (LLMs), even on consumer-grade GPUs.

QLoRA extends LoRA by quantizing the frozen base model weights (typically to 4-bit) while the LoRA adapter matrices are trained in higher precision. Quantization reduces the precision of the stored weights, lowering memory and computation requirements while retaining performance.
Models fine-tuned with QLoRA remain very effective at understanding and generating natural language. This makes QLoRA a valuable tool for applications that require a deep understanding of context, such as language translation, content creation, and even complex problem-solving tasks.

Load model in 4-bit:

Use bitsandbytes or Hugging Face's transformers + accelerate
Memory usage drops to ~5-6 GB for LLaMA-7B

Inject LoRA adapters:

Freeze original weights
Insert trainable low-rank matrices (A and B) in attention layers
Only train a few million parameters (not billions!)
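
A hedged sketch of those two steps with bitsandbytes and peft; the model name, rank, and target modules are illustrative assumptions (LLaMA-style models name their attention projections q_proj / v_proj).

# Sketch: QLoRA recipe = load the frozen base model in 4-bit, then add LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # illustrative; any causal LM can be used
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # housekeeping for 4-bit training

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a few million trainable params, not billions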

FLASH ATTENTION

📚 Traditional Attention:
You try to compare every sentence in a long novel to every other sentence at once.
You write all possible relationships out on a giant whiteboard.
Problem: The whiteboard gets too big to handle: slow and memory-intensive!

⚡ FlashAttention:
You read the novel in small sections (tiles).
You summarize each section as you go, without writing everything down.
You use a high-speed notepad (GPU SRAM) that’s very fast but small.
Result: You understand the whole book without memory overload, and much faster.
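
In practice you rarely implement this yourself; recent versions of Hugging Face transformers can use FlashAttention kernels if the flash-attn package and a supported GPU are available. The model name below is an illustrative assumption.

# Sketch: opting into FlashAttention kernels when loading a model (requires
# the flash-attn package, a recent transformers version, and a supported GPU).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative model choice
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # tiled, fused attention kernels
    device_map="auto",
)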

Benefits of Flash Attention:

Speed & Efficiency:
Significantly faster model training and inference.

Scalability:
Makes it feasible to train and deploy very large models that would otherwise be impractical due to memory constraints.

Real-world Use Cases:

Reduces the cost and resource demand for training large models in applications like autonomous driving, natural language understanding, and large-scale data generation.
