Machine Learning
Machine learning is the field of study that gives computers the ability to learn from
past experience and past data without being explicitly programmed
o Arthur Samuel, 1959
Or, in simple words, machine learning is the science (or art) of programming
computers to learn from data
Supervised Learning
Definition
Supervised Learning is a type of Machine Learning where the model is trained
using labeled data (data with both inputs and correct outputs). The computer
uses this information to learn the relationship between inputs and outputs.
It’s called “supervised” because it’s like a teacher guiding the computer.
Simple Example
Imagine you want to teach a computer to recognize whether an email is spam
or not:
You show the computer lots of emails.
Each email is labeled as “spam” or “not spam.”
The computer learns what spam emails look like based on these labeled
examples.
Later, you give it a new email, and it predicts whether it’s spam or not.
Supervised learning solves two types of problems:
Regression and Classification
Regression
o It deals with predicting numeric (continuous) values
o E.g.
Population growth prediction
Life expectancy estimation
Market forecasting/prediction
Advertising popularity prediction
Stock price prediction
o Algorithms
Linear Regression (single feature)
Multiple Linear Regression (many features)
Ridge Regression (regularized linear regression)
Lasso Regression (another regularized version)
Support Vector Regression (SVR)
Decision Tree Regression / Random Forest Regression
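As a rough illustration of regression, here is a minimal scikit-learn sketch; the experience/salary numbers are made up purely for demonstration.

# Minimal regression sketch (toy data invented for illustration)
import numpy as np
from sklearn.linear_model import LinearRegression

# X = years of experience, y = salary (hypothetical numbers)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 41000, 46000, 52000])

model = LinearRegression()
model.fit(X, y)                      # learn slope and intercept from the data

print(model.predict([[6]]))          # predict salary for 6 years of experience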
Classification
o Classifies records into predefined classes
o E.g.
Find whether a received email is spam or ham
Identify customer segments
Decide whether a bank loan should be granted
Predict whether a student will pass or fail an examination
o Algorithms
Logistic Regression
Decision Tree
Random Forest
Support Vector Machine (SVM)
K-Nearest Neighbors (KNN)
Naïve Bayes
AdaBoost
Gradient Boosting / XGBoost
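A minimal classification sketch, using scikit-learn's built-in Iris dataset and a Random Forest as a stand-in for any of the algorithms above:

# Minimal classification sketch on the built-in Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)            # learn class boundaries from labeled data

y_pred = clf.predict(X_test)         # assign a class label to each unseen sample
print(accuracy_score(y_test, y_pred))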
Unsupervised Learning
Definition
Unsupervised Learning is when the model is trained on unlabeled data (only inputs,
no correct outputs).
The goal is to find patterns, groups, or structures hidden in the data.
Examples
It solves problems such as clustering, association rules, and dimensionality reduction
1. Clustering – discover the inherent groupings in the data, such as grouping customers by
purchasing behavior
o e.g., K-Means, Hierarchical Clustering, DBSCAN
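A minimal clustering sketch with K-Means; the "customer" numbers below are invented for illustration:

# Minimal clustering sketch: grouping points without labels (toy data)
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical "customers": [annual spend, number of purchases]
X = np.array([[500, 5], [520, 6], [480, 4],
              [5000, 50], [5200, 55], [4800, 45]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # each point gets a cluster id, no labels needed

print(labels)                        # e.g. [0 0 0 1 1 1] (cluster numbering may differ)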
2. Association Rules – finding relationships between items
An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as: people who buy X also tend to buy Y
(e.g., market-basket analysis)
Algorithms
▪ Apriori
▪ Eclat
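A hedged sketch of association-rule mining; it assumes the third-party mlxtend library (not named in these notes) and a made-up basket of transactions:

# Association-rule sketch using the third-party mlxtend library (an assumption;
# the notes only name the Apriori algorithm, not a specific library)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer"],
                ["milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "beer"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])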
3. Dimensionality Reduction
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction
It is a way of converting a higher-dimensional dataset into a lower-dimensional one
while ensuring that it still provides similar information.
In many cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated, requires more resources, and increases
training and testing time.
Because it is very difficult to visualize or make predictions for a training dataset with a
high number of features, dimensionality reduction techniques are required in such cases.
o Feature Selection
Filter
Wrapper
Embedded
o Feature Extraction
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
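A minimal feature-extraction sketch with PCA, reducing the 4 Iris features to 2 components:

# Minimal dimensionality-reduction sketch: project 4 features down to 2 with PCA
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)    # 150 samples x 4 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # same samples, now only 2 columns

print(X_reduced.shape)               # (150, 2)
print(pca.explained_variance_ratio_) # how much information each component keeps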
Reinforcement Learning
A type of machine learning where an agent learns by interacting with an
environment, taking actions, and receiving rewards or penalties.
Goal: Learn the best strategy (policy) to maximize long-term rewards.
It is employed by various software and machines to find the best possible
behavior or path it should take in a specific situation
Reinforcement learning differs from supervised learning: in supervised learning the
training data comes with the answer key, so the model is trained with the correct
answers, whereas in reinforcement learning there is no answer key and the
reinforcement agent decides what to do to perform the given task
In the absence of a training dataset, it is bound to learn from its experience
Like humans learn in the real world – through trial and error. The agent
(software/machine) learns by experience:
o Action → Environment → Reward/Penalty → Learn → Improve
Examples
o Resources management in computer clusters
o Traffic Light Control
o Robotics
o Web system configuration
o Chemistry
Algorithms
o Q-Learning
o Deep Q-Learning
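A minimal tabular Q-Learning sketch on a made-up 5-state corridor, just to show the action → environment → reward → learn loop and the Q-update rule:

# Minimal tabular Q-learning sketch on a tiny made-up corridor environment
# (states 0..4, move left/right, reward only at the rightmost state)
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

rng = np.random.default_rng(0)
for _ in range(500):                 # episodes of trial and error
    state = 0
    while state != n_states - 1:
        # explore sometimes, otherwise act greedily on current Q estimates
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q.argmax(axis=1))              # learned policy: prefers "right" (1) in non-terminal states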
Batch Learning / Offline Learning
In batch learning the model is static: once deployed, it does not change (even after a
year) unless it is retrained.
If the model needs to be updated:
Train the model -> put it on the server -> pull the ML model again -> train again with the
previous plus updated data -> deploy the model again -> repeat the process again
and again
The above process is very time consuming
In simple terms:
Batch learning is a machine learning approach where the
model is trained on the entire dataset all at once, instead of updating
continuously as new data arrives [it avoids incremental learning]. After training, the model
is used for making predictions, and it only changes if it is retrained from scratch on new data
↳ MLOps pipeline
▪ Training using the full set of data can take many hours
▪ Typically train a new system only every 24 hours or even just weekly
▪ Training on the full set of data requires a lot of computing resources (CPU,
memory space, disk space, disk I/O, network I/O)
Online Learning / Incremental Learning
Each learning step is fast and cheap, so the system can learn about new data on the
fly, as it arrives
Online learning is great for systems that receive data as a continuous flow (e.g.,
stock prices) and need to adapt to change rapidly or autonomously
It is also a good option if you have limited computing resources: once an online
learning system has learned about new data instances, it does not need them
anymore, so you can discard them
This can save a huge amount of space
Out-of-core learning: training systems on huge datasets that cannot fit in one
machine’s main memory
When to use:
When there is concept drift
When the data/problem is volatile in nature
When the frequency of data changes is high
When a cost-effective solution is needed
When a faster solution is needed
Learning Rate:
Decides how quickly (and how strongly) the model adapts to new data
One important parameter of online learning systems is how fast they should
adapt to changing data
If you set a high learning rate, then your system will rapidly adapt to new
data, but it will also tend to quickly forget the old data
if you set a low learning rate, the system will have more inertia;
that is, it will learn more slowly, but it will also be less sensitive to
noise in the new data or to sequences of nonrepresentative data points
(outliers)
We need to find the correct learning rate; otherwise the model may either
learn new patterns and forget old patterns,
or learn new patterns too slowly
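A minimal online-learning sketch; it assumes scikit-learn's SGDClassifier with partial_fit, where eta0 acts as the (constant) learning rate, and the incoming data stream is simulated:

# Online-learning sketch: scikit-learn's SGDClassifier can learn incrementally
# with partial_fit; eta0 plays the role of the learning rate discussed above
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(learning_rate="constant", eta0=0.01)  # small eta0 = more inertia
classes = np.array([0, 1])                                # must be declared up front

rng = np.random.default_rng(0)
for _ in range(100):                  # pretend each loop is a new mini-batch arriving
    X_batch = rng.normal(size=(32, 3))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)    # update, then discard the batch

print(clf.predict([[1.0, 1.0, 0.0]])) # expected to predict class 1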
Disadvantages:
Tricky to use
Risky
Online learning is less stable and gives lower accuracy than batch learning
o e.g., if the server is hacked, the hacker can feed spam/bad data to the model
This can be handled using a monitoring system such as anomaly detection,
or the model can be rolled back
Instance Based
The system learns the examples by heart (memorizes the training instances)
Then generalizes to new cases by using a similarity measure to compare them to
the learned examples (or a subset of them)
It is called instance-based because it builds the hypotheses from the training
instances
It is also known as memory-based learning or lazy-learning
Ex. K Nearest Neighbor (KNN)
Model Based
Model-> formula
Train model from training data to estimate model parameters i.e. discover
patterns
Store the built model in a suitable format (e.g., pickle it)
Generalize using the rules of the model rather than the raw training set
Predict the unseen instance (data) using the model
It requires a known model form
It takes less memory compared to the instance based learning
E.g. ▪ Linear Regression
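A minimal model-based sketch: train a Linear Regression, store the built model with pickle, reload it, and predict (toy data invented for illustration):

# Model-based learning sketch: train, store the built model (pickle), reload, predict
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # toy training data (invented)
y = np.array([2.0, 4.1, 5.9, 8.2])

model = LinearRegression().fit(X, y) # estimate the model parameters (slope, intercept)

with open("model.pkl", "wb") as f:   # store the built model in a suitable format
    pickle.dump(model, f)

with open("model.pkl", "rb") as f:   # later: reload and predict on unseen data
    loaded = pickle.load(f)
print(loaded.predict([[5]]))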
End to End for Model Deployment
Model Evaluation
Mean Absolute Error (MAE):
measures the average magnitude of the errors in a set of forecasts, without
considering their direction
measures accuracy for continuous variables
The MAE is a linear score which means that all the individual differences are
weighted equally in the average
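MAE = (1/n) · Σ |yᵢ − ŷᵢ|. A small sketch with made-up numbers, computed both by hand and with scikit-learn:

# MAE sketch: average absolute difference between predictions and true values
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])    # toy values, invented for illustration
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print(np.mean(np.abs(y_true - y_pred)))    # manual: (0.5 + 0 + 1.5 + 1) / 4 = 0.75
print(mean_absolute_error(y_true, y_pred)) # same result via scikit-learn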
Classification
Prediction of class/label values
Classification is a supervised machine learning method where the model tries to predict the
correct label for a given input
In classification, the model is fully trained on the training data and then evaluated on test
data before it is used to predict labels for unseen data
Learners
Eager learners: machine learning algorithms that first build a model on the training dataset
before making any future predictions
They spend more time during the training process because of the eagerness to generalize
better, and they require less time for prediction
Types of classification
Binary Classification
Labels = 2
e.g., Email → Spam / Not Spam
Output = 1 class per instance
The goal is to classify the input into one of two mutually exclusive categories
The training data in such situations is labeled as True/False, 0/1, Spam/Not Spam
Multiclass Classification
Labels > 2
e.g., Person → Upper class / Middle class / Lower class
Output = 1 class per instance
• Each instance belongs to exactly one class out of multiple possible classes.
• Classes are mutually exclusive (choosing one means excluding the others).
• Example:
- Predicting the type of fruit (Apple, Banana, Mango, Orange) → one fruit at a time.
- Handwritten digit recognition (0–9) → only one digit per image.
Multilabel Classification
• Each instance can belong to multiple classes simultaneously.
• Classes are not mutually exclusive (an instance may have zero, one, or several
labels).
• Example:
- Predicting movie genres → a movie can be Action + Comedy + Drama.
- Detecting objects in an image → one picture can have Dog + Car + Tree.
We can use:
Multi-label Decision Tree
Multi-label Gradient Boosting
Multi-label Random Forest
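A minimal multilabel sketch using synthetic data and scikit-learn's MultiOutputClassifier (one approach among several; the multi-label tree/forest/boosting variants above would be set up similarly):

# Multilabel sketch: each sample can carry several labels at once
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

# synthetic data: 100 samples, 3 possible labels per sample (not mutually exclusive)
X, Y = make_multilabel_classification(n_samples=100, n_features=10,
                                      n_classes=3, random_state=0)

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(X, Y)                         # one binary classifier per label under the hood

print(clf.predict(X[:2]))             # e.g. [[1 0 1], [0 1 0]] -- several labels allowed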
Logistic Regression
Designed for Binary Classification
1) Definition
2) How it works
3) Training / Learning
4) Decision Making
5) Evaluation Metrics
6) Assumptions
The independent variables should be independent of each other, i.e. the model should have
little or no multicollinearity
Quite large sample sizes are required
7) Extensions
8) Advantages
9) Disadvantages
Formula:
o sigmoid(z) = 1 / (1 + e^(-z))
Range:
o Output is always between 0 and 1.
o This makes it suitable for probability representation.
In ML usage:
o In binary classification, sigmoid maps raw model output (logit) into a probability
of belonging to the positive class.
o Example: if sigmoid(z) = 0.82 → interpreted as 82% probability of being class
1.
o In multilabel classification, sigmoid is applied to each label independently (since
each can have its own probability).
o In multiclass classification, we usually use softmax instead of sigmoid, because
probabilities must sum to 1 across classes.
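A small NumPy sketch of sigmoid vs. softmax, matching the 0.82 example above:

# Sigmoid vs. softmax sketch (NumPy only)
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # squashes any real number into (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                  # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(1.52))                           # ~0.82 -> "82% probability of class 1"
print(softmax(np.array([2.0, 1.0, 0.1])))      # three class probabilities summing to 1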
🔑 Key Point:
Bayes’ Theorem finds the probability of an event occurring given the probability
of another event that has already occurred
P(A|B) = [P(B|A) · P(A)] / P(B)
P(A|B) → Posterior, P(A) → Prior, P(B|A) → Likelihood, P(B) → Evidence
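A tiny worked example of Bayes' theorem with made-up spam-filter numbers:

# Bayes' theorem sketch with made-up spam-filter numbers (purely illustrative)
p_spam = 0.30            # P(A): prior probability an email is spam
p_word_given_spam = 0.80 # P(B|A): likelihood the word "free" appears in spam
p_word = 0.40            # P(B): evidence, overall probability the word appears

# P(A|B) = P(B|A) * P(A) / P(B)  -> posterior probability of spam given the word
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.6 -> 60% chance the email is spam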
K-Nearest Neighbors (KNN)
A supervised machine learning algorithm which can be used for both classification and
regression problems
K = the number of nearest neighbors
Instance-based algorithm or lazy learner
No model is created: it memorizes the training data, finds the nearest neighbors of the
input, and predicts the output from them
Requires more time for prediction
However, it is widely used for classification in industry
It assumes that similar data points are in close proximity
Working:
Classify by the vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors, measured by a distance function
Choose the optimal value for K by looking at the dataset (usually between 3 and 10 for
most datasets)
Cross-validation is another way to retrospectively determine a good K value, by using an
independent dataset to validate the K value (see the sketch after the applications below)
Disadvantages:
o Computationally expensive – it stores all the training data in memory
o High memory requirement
Applications of KNN
o Recommender system
o Relevant document classification
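A minimal KNN sketch that tries a few K values with cross-validation, as described in the Working section above:

# KNN sketch: try a few K values with cross-validation, as suggested above
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (3, 5, 7, 10):               # typical range mentioned in the notes
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(k, round(score, 3))          # pick the K with the best average accuracy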
SVM [Support Vector Machine]
The main task of the classification problem is to find the best separating
hyperplane/ Decision boundary.
There can be many candidate hyperplanes, which can be either linear or non-linear.
The data points closest to the hyperplane (the ones that define the margin) are called
support vectors.
What exactly are Margins?
The further the data points are from the margin, the more confidently they are
classified.
Margins represent the width of the corridor that the SVM algorithm aims to
maximize when finding the optimal hyperplane to separate different classes
of data. The larger the margin, the greater the confidence in the
classification made by the SVM model.
Linear and Non-Linear SVM
Linear SVM:
Non-Linear SVM
This optimization problem aims to minimize the classification error while maximizing
the margin, which is the distance between the decision boundary and the closest data
points from each class.
Hard constraint that a Support Vector Machine follows: each data point
must lie on the correct side of the margin and there should be no
misclassification.
Hard SVM
Algorithm aims to find the hyperplane that separates the classes with the
maximum margin while strictly enforcing that all data points are correctly
classified.
Assuming that the data is linearly separable, it implies the existence of at
least one hyperplane that can perfectly separate the classes without any
misclassifications.
However, Hard SVM does not tolerate any misclassification errors and
demands the data to be perfectly separable, which can be overly restrictive
and might lead to poor performance on noisy or overlapping datasets.
Soft SVM
Allows some misclassifications (a soft margin), which makes it more robust on noisy or
overlapping data; a regularization parameter (commonly C) controls the trade-off between
margin width and misclassification.
RBF SVM
Polynomial SVM
SVM Library:
from sklearn.svm import SVC
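A minimal SVC sketch; C controls how soft the margin is (a large C behaves closer to a hard margin), and the kernel switches between linear and RBF:

# SVC sketch: linear vs. RBF kernel; C controls how "soft" the margin is
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

hard_ish = SVC(kernel="linear", C=100.0)   # large C: few margin violations tolerated
soft = SVC(kernel="rbf", C=1.0)            # smaller C: wider margin, more tolerance

for clf in (hard_ish, soft):
    clf.fit(X_train, y_train)
    print(clf.kernel, clf.score(X_test, y_test))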
Decision Tree
ID3: a decision tree algorithm that uses information gain based on entropy to select the
best attribute for splitting
CART (Classification and Regression Trees): a decision tree algorithm that uses the Gini
Index (for classification) or variance reduction (for regression) to split data
CART always creates a binary tree
Test condition:
Root Node
Parent Node
Child Node
Leaf Node
Hyper Parameter
max_depth = number of levels / height of the tree
Entropy:
Entropy H(S) = -p·log2(p) - q·log2(q)
Ranges between 0 and 1
If entropy is 1 → not a pure split
If entropy is 0 → pure split
[Plot: impurity vs. class proportion (P+, P−); entropy H(s) peaks at 1 and Gini impurity
peaks at 0.5 when the classes are evenly split at 0.5]
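A small sketch computing entropy and Gini impurity for a binary split, reproducing the 0 / 0.5 / 1 behaviour described above:

# Entropy and Gini impurity sketch for a binary split with class proportions p and q
import numpy as np

def entropy(p):
    q = 1 - p
    if p in (0, 1):
        return 0.0                       # a pure node has zero entropy
    return -p * np.log2(p) - q * np.log2(q)

def gini(p):
    return 1 - (p**2 + (1 - p)**2)       # Gini impurity, maximum 0.5 at p = 0.5

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, round(entropy(p), 3), round(gini(p), 3))
# p = 0.5 gives entropy 1.0 (worst split); p = 0 or 1 gives 0 (pure split)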
Disadvantages
Prone to overfitting if the tree is allowed to grow too deep
Pruning
Post-pruning: allows the tree to grow fully, then removes (cuts back) the branches that add
little predictive power
Based on validation error, cost complexity pruning, or statistical tests
Reduces overfitting while keeping useful splits
More accurate but computationally expensive
If the dataset is big, pre-pruning is used; otherwise post-pruning is used
Of the two, pre-pruning is the most efficient because it prunes while the tree is being built
Pre-pruning = Stop early to avoid overfitting.
Post-pruning = Grow fully, then trim to simplify.
Ensemble Learning
Ensemble learning is the art of combining a diverse set of learners [individual models] to improve
the stability and predictive power of the model.
Bagging
Stands for bootstrap aggregation
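A minimal bagging sketch with scikit-learn's BaggingClassifier around a decision tree (the base estimator is passed positionally because its keyword name differs across scikit-learn versions):

# Bagging sketch: many trees trained on bootstrap samples, predictions aggregated
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,     # 50 diverse trees
                        bootstrap=True,      # each tree sees a random sample with replacement
                        random_state=0)

print(cross_val_score(bag, X, y, cv=5).mean())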