CHAPTER 7
Supervised Learning:
Classification and Regression
WHAT IS SUPERVISED
LEARNING
Supervised learning is the type of machine
learning in which machines are trained using well
"labelled" training data, and on the basis of that
data, machines predict the output. Labelled data
means that the input data is already tagged with
the correct output.
Its two main problem types are classification
and regression, described below.
Supervised learning is a process of providing
input data as well as correct output data to the
machine learning model. The aim of a supervised
learning algorithm is to find a mapping
function that maps the input variables (x) to
the output variables (y).
In the real world, supervised learning can be used
for risk assessment, image classification,
fraud detection, spam filtering, etc.
HOW DOES SUPERVISED LEARNING WORK?
In supervised learning, models are trained
using a labelled dataset, where the model
learns about each type of data. Once the
training process is completed, the model is
tested on test data (a held-out set kept
separate from the training data), and then it
predicts the output.
TYPES OF SUPERVISED MACHINE LEARNING ALGORITHMS:
1. REGRESSION
Regression algorithms are used if there is a
relationship between the input variable and
the output variable. It is used for the
prediction of continuous variables, such as
Weather forecasting, Market Trends, etc.
Below are some popular Regression
algorithms which come under supervised
learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
2. CLASSIFICATION
Classification algorithms are used when
the output variable is categorical, i.e., it
takes one of a finite set of classes such as
Yes-No, Male-Female, True-False, etc.
Spam filtering is a typical example. Below are
some popular classification algorithms which
come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
CLASSIFICATION—A TWO-
STEP PROCESS
Model construction: describing a set of predetermined
classes
• Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute
• The set of tuples used for model construction is the
training set
• The model is represented as classification rules, decision
trees, or mathematical formulae
Model usage: for classifying future or unknown objects
• Estimate the accuracy of the model: the known label of
each test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test-set samples that
are correctly classified by the model
• The test set is independent of the training set (otherwise
overfitting results)
• If the accuracy is acceptable, use the model to classify new
data
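The two-step process maps directly onto library code. Below is a minimal sketch, assuming scikit-learn is available; the iris dataset and the decision tree classifier are illustrative choices, not part of the definition above.

```python
# A minimal sketch of the two-step process, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 1 -- model construction: learn from the labelled training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 -- model usage: compare known test labels with predictions.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {accuracy:.2%}")
```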
PROCESS (1): MODEL
CONSTRUCTION
PROCESS (2): USING THE
MODEL IN PREDICTION
ADVANTAGES OF SUPERVISED LEARNING
• With the help of supervised learning, the
model can predict the output on the
basis of prior experiences.
• In supervised learning, we can have an
exact idea about the classes of objects.
• Supervised learning models help us
solve various real-world problems such
as fraud detection, spam filtering,
etc.
LEARNING STEPS
DISADVANTAGES OF SUPERVISED LEARNING:
• Supervised learning models are not
suitable for handling very complex tasks.
• Supervised learning cannot predict the
correct output if the test data is
different from the training dataset.
• Training requires a lot of computation
time.
• In supervised learning, we need enough
knowledge about the classes of objects.
DECISION TREE
Decision Tree is a Supervised learning
technique that can be used for both
classification and Regression problems, but
mostly it is preferred for solving Classification
problems.
It is a tree-structured classifier, where internal
nodes represent the features of a dataset,
branches represent the decision rules and each
leaf node represents the outcome.
In a Decision tree, there are two types of nodes:
the Decision Node and the Leaf Node. Decision nodes
are used to make a decision and have multiple
branches, whereas leaf nodes are the outputs of those
decisions and do not contain any further branches.
The decisions or tests are performed on the basis of
features of the given dataset.
Definition: It is a graphical representation for getting all
the possible solutions to a problem/decision based on
given conditions.
In order to build a tree, we use the CART
algorithm, which stands for Classification And
Regression Tree algorithm.
A decision tree simply asks a question and, based on
the answer (Yes/No), further splits the tree into
sub-trees.
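As a hedged illustration, scikit-learn's DecisionTreeClassifier implements an optimized variant of CART; the dataset and depth below are chosen only for this sketch. Printing the learned rules makes the decision nodes, branches, and leaves visible:

```python
# Illustrative CART-style tree; criterion selects the split measure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

# Internal nodes test features, branches are the answers, leaves are classes.
print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))
```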
DECISION TREE TERMINOLOGIES
Root Node: Root node is from where the decision tree
starts. It represents the entire dataset, which further gets
divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and
the tree cannot be segregated further after reaching a
leaf node.
Splitting: Splitting is the process of dividing the decision
node/root node into sub-nodes according to the given
conditions.
Branch/Sub Tree: A sub-tree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing the
unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called
the parent node of those sub-nodes, and the sub-nodes are called
its child nodes; the root node is the topmost parent.
ALGORITHM FOR
DECISION TREE
Basic algorithm (a greedy algorithm):
• The tree is constructed in a top-down, recursive,
divide-and-conquer manner
• At the start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning:
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
• There are no samples left
KEY TERMS AND CONCEPTS
(A) Entropy: It measures the randomness or impurity of
a set of datapoints in a decision tree. For a node whose
classes occur with proportions p1, p2, …, pk:
Entropy = - Σ pi log2(pi)
For two classes, entropy lies between 0 and 1.
A lower value of entropy signifies a more
homogeneous dataset with less randomness,
hence better predictions.
A high entropy indicates high disorder.
Entropy can also be more than 1, but only when there
are more than two classes; it then signifies that the
datapoints are spread across many classes, making a
good prediction classification model harder to build.
Example: Find Entropy of the following distribution:
Gender Count
Male 9
Female 5
p(Male)= 9/14
p(Female)= 5/14
So based on the formula, the entropy is:
Entropy = - (5/14 log2(5/14) + 9/14 log2(9/14))
= - (-0.53 - 0.41)
= 0.94
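A small sketch of this calculation in Python; the entropy helper is written here for illustration, not taken from any library:

```python
# Entropy of a class distribution given raw class counts.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))  # 0.94, matching the Male/Female example
```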
Example:
Fruit Colour Taste Count
Yellow Sweet 10
Red Sweet 5
Green Sour 15
Orange Sour 5
Example:
Fruit Colour Count
Yellow 10
Red 5
Green 15
Orange 5
INFORMATION GAIN
Definition: It is a measure of the purity gained by
splitting on an attribute.
IG = Entropy(before split) - Weighted Entropy(after the
split, over all subsets)
Question: Find Entropy and Information Gain
Credit Rating Accommodation Loan Approved
Above 600 Own Yes
Above 600 Own Yes
Above 600 Own Yes
Above 600 Own Yes
Above 600 Own Yes
Above 600 Own Yes
Above 600 Own Yes
Above 600 Rent Yes
Above 600 Rent Yes
Above 600 Rent Yes
Above 600 Rent Yes
Above 600 Other Yes
Below 600 Other Yes
Below 600 Other Yes
Below 600 Other Yes
Below 600 Other Yes
Above 600 Own No
Below 600 Rent No
Below 600 Rent No
Below 600 Rent No
Below 600 Rent No
Below 600 Rent No
Below 600 Rent No
Below 600 Other No
Below 600 Other No
Below 600 Other No
Below 600 Other No
Below 600 Other No
Below 600 Other No
Below 600 Other No
Answer:
First calculate the entropy of the root node,
which has 30 datapoints: 16 with loan approved
and 14 with loan not approved.
p(loan approved) = 16/30
p(loan not approved) = 14/30
Entropy(parent node) = - ((16/30) log2(16/30) + (14/30) log2(14/30))
= 0.997
Taking Credit Rating as the attribute of
split:
Entropy(Above 600) = - ((12/13) log2(12/13) + (1/13) log2(1/13)) = 0.391
Entropy(Below 600) = - ((4/17) log2(4/17) + (13/17) log2(13/17)) = 0.787
Weighted Entropy(Credit Rating) = 13/30 * Entropy(Above 600)
+ 17/30 * Entropy(Below 600)
= 0.170 + 0.446 = 0.616
Information Gain(Credit Rating), IG = Entropy(Parent node) -
Weighted Entropy(Credit Rating)
= 0.997 - 0.616 = 0.381
Taking Accommodation as the attribute
of split:
Entropy(Own) = - ((7/8) log2(7/8) + (1/8) log2(1/8)) = 0.544
Entropy(Rent) = - ((4/10) log2(4/10) + (6/10) log2(6/10)) = 0.971
Entropy(Other) = - ((5/12) log2(5/12) + (7/12) log2(7/12)) = 0.980
Weighted Entropy(Accommodation) = 8/30 * Entropy(Own) +
10/30 * Entropy(Rent) + 12/30 * Entropy(Other)
= 0.145 + 0.324 + 0.392 = 0.861
Information Gain(Accommodation), IG =
Entropy(Parent node) - Weighted
Entropy(Accommodation)
= 0.997 - 0.861 = 0.136
IG(Credit Rating) is almost three times
IG(Accommodation). Hence, Credit Rating is the
better choice for splitting this decision node.
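The whole worked example can be checked with a few lines of Python; the helper functions below are written for this sketch, and the group counts mirror the loan table above:

```python
# Illustrative check of the worked example above.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(groups):
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

parent = entropy([16, 14])                           # ~0.997
credit = weighted_entropy([[12, 1], [4, 13]])        # Above 600, Below 600
accom = weighted_entropy([[7, 1], [4, 6], [5, 7]])   # Own, Rent, Other

print(round(parent - credit, 2))   # IG(Credit Rating)  -> 0.38
print(round(parent - accom, 2))    # IG(Accommodation)  -> 0.14
```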
K-NEAREST NEIGHBOR(KNN) ALGORITHM
• K-Nearest Neighbour is one of the simplest
Machine Learning algorithms based on
Supervised Learning technique.
• The K-NN algorithm assumes similarity
between the new case/data and the available
cases, and puts the new case into the category
most similar to the available categories.
• The K-NN algorithm stores all the available data
and classifies a new data point based on
similarity. This means that when new data
appears, it can easily be classified into a
well-suited category using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well
as for Classification but mostly it is used for the
Classification problems.
• K-NN is a non-parametric algorithm, which
means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because
it does not learn from the training set immediately
instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• At the training phase, the KNN algorithm just
stores the dataset; when it gets new data, it
classifies that data into the category most
similar to the new data.
EXAMPLE:
Suppose we have an image of a creature
that looks similar to both a cat and a dog,
and we want to know whether it is a cat or
a dog. For this identification we can use the
KNN algorithm, as it works on a similarity
measure. Our KNN model will find the
features of the new image that are most
similar to those of the cat and dog images
and, based on the most similar features,
will put it in either the cat or the dog category.
WHY DO WE NEED A K-NN ALGORITHM?
Suppose there are two categories,
Category A and Category B, and we have a
new data point x1. In which of these
categories will this data point lie? To solve
this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify
the category or class of a particular
data point. Consider the below diagram:
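A minimal sketch of this scenario, assuming scikit-learn; the coordinates for Categories A and B are invented for illustration:

```python
# k-NN assigns the new point x1 to the class of its k nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 1], [2, 3],    # Category A
     [6, 5], [7, 7], [8, 6]]    # Category B
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3
knn.fit(X, y)                               # "lazy": fit just stores the data
print(knn.predict([[3, 3]]))                # -> ['A']; its 3 nearest are all A
```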
KNN ALGORITHM IN
DETAIL
Refer to the below link for KNN
explanation and problem sums:
https://www.slideshare.net/Simplilearn/knearest-neighbor-classification-algorithm-how-knn-algorithm-works-knn-algorithm-simplilearn
ADVANTAGES OF KNN
ALGORITHM:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training
data is large.
DISADVANTAGES OF KNN ALGORITHM:
• We always need to determine the value of
K, which may be complex at times.
• The computation cost is high because of
calculating the distance between the new
data point and all the training samples.
SUPPORT VECTOR
MACHINE (SVM)
Support Vector Machine or SVM is one of the most
popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems.
The goal of the SVM algorithm is to create the best
line or decision boundary that can segregate
n-dimensional space into classes so that we can easily
put new data points in the correct category in the
future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in
creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is
termed a Support Vector Machine.
SVM algorithm can be used for Face detection,
image classification, text categorization, etc.
EXAMPLE FOR SVM
Suppose we see a strange cat that also has some
features of dogs. If we want a model that can
accurately identify whether it is a cat or a dog, such a
model can be created using the SVM algorithm.
We will first train our model with lots of images of cats
and dogs so that it can learn their different features,
and then we test it with this strange creature.
The support vector machine creates a decision boundary
between the two classes (cat and dog) and chooses the
extreme cases (support vectors), so it will see the extreme
cases of cats and dogs. On the basis of the support vectors,
it will classify the new creature as a cat. Consider the below diagram:
HYPERPLANE IN THE SVM ALGORITHM:
Hyperplane: There can be multiple
lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to
find out the best decision boundary that helps to
classify the data points. This best boundary is
known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the
number of features present in the dataset: if there
are 2 features (as shown in the image), the
hyperplane will be a straight line, and if there are
3 features, the hyperplane will be a 2-dimensional
plane.
We always create the hyperplane that has the
maximum margin, which means the maximum
distance between the hyperplane and the nearest
data points of either class.
SUPPORT VECTORS IN THE
SVM ALGORITHM
The data points or vectors that are closest
to the hyperplane and which affect the
position of the hyperplane are termed
support vectors. Since these vectors
support the hyperplane, they are called
support vectors.
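As an illustrative sketch with scikit-learn (toy points invented for this example), SVC with a linear kernel finds the maximum-margin hyperplane and exposes the fitted support vectors directly:

```python
# Linear SVM: the support vectors are the points that fix the hyperplane.
from sklearn.svm import SVC

X = [[1, 2], [2, 1], [2, 3],    # class 0
     [6, 5], [7, 7], [8, 6]]    # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)     # the extreme points closest to the boundary
print(clf.predict([[5, 4]]))    # classify a new point
```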
NON-LINEAR SVM
Nonlinear SVM (Support Vector Machine) is necessary
when the data cannot be effectively separated by a
linear decision boundary in the original feature
space.
Nonlinear SVM addresses this limitation by utilizing
kernel functions to map the data from the original
low-dimensional feature space into a higher-dimensional
space where linear separation becomes possible.
The kernel function computes the similarity between
data points, allowing SVM to capture complex
patterns and nonlinear relationships between
features.
By leveraging the kernel trick, nonlinear SVM
provides a powerful tool for solving classification
problems where linear separation is insufficient,
extending its applicability to a wide range of real-
world scenarios.
If data is linearly arranged, then we can separate it by
using a straight line, but for non-linear data, we cannot
draw a single straight line. Consider the below image:
So to separate these data points, we
need to add one more dimension. For
linear data we have used the two
dimensions x and y, so for non-linear
data we will add a third dimension z. A
standard choice for this example is:
z = x² + y²
By adding the third dimension, the
sample space will become as below
image:
So now, SVM will divide the datasets
into classes in the following way.
Consider the below image:
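In code, a sketch of both views of this idea, assuming scikit-learn and NumPy; the blob-and-ring data are generated only for illustration:

```python
# Class 0 is a blob at the origin, class 1 a ring around it: not linearly
# separable in (x, y).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
angles = np.linspace(0, 2 * np.pi, 20, endpoint=False)
inner = rng.normal(0, 0.4, (20, 2))                    # class 0: blob at origin
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]  # class 1: ring, radius 3
X = np.vstack([inner, outer])
y = np.array([0] * 20 + [1] * 20)

# View 1: add the explicit third dimension z = x^2 + y^2, then a linear SVM.
X3 = np.c_[X, (X ** 2).sum(axis=1)]
print(SVC(kernel="linear").fit(X3, y).score(X3, y))    # separable in 3-D

# View 2: the RBF kernel does the mapping implicitly (the kernel trick).
print(SVC(kernel="rbf").fit(X, y).score(X, y))         # separable, no z needed
```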
RANDOM FOREST
ALGORITHM
Random Forest is a popular machine learning
algorithm that belongs to the supervised learning
technique. It can be used for both Classification and
Regression problems in ML. It is based on the
concept of ensemble learning, which is a process
of combining multiple classifiers to solve a complex
problem and to improve the performance of the
model.
As the name suggests, "Random Forest is a
classifier that contains a number of decision
trees on various subsets of the given dataset
and takes the average to improve the
predictive accuracy of that dataset." Instead of
relying on one decision tree, the random forest
takes the prediction from each tree and, based on
the majority vote of those predictions, predicts the
final output.
A greater number of trees in the
forest leads to higher accuracy and
helps prevent the problem of overfitting.
ASSUMPTIONS FOR RANDOM FOREST
Since the random forest combines multiple
trees to predict the class of the dataset, it
is possible that some decision trees may
predict the correct output, while others
may not. But together, all the trees predict
the correct output. Therefore, below are
two assumptions for a better Random
forest classifier:
1. There should be some actual values in
the feature variables of the dataset, so that
the classifier can predict accurate results
rather than guessed results.
2. The predictions from each tree must have
very low correlation with one another.
WHY USE RANDOM FOREST?
• It takes less training time as compared
to other algorithms.
• It predicts output with high accuracy,
even for the large dataset it runs
efficiently.
• It can also maintain accuracy when a
large proportion of data is missing.
HOW DOES RANDOM FOREST
ALGORITHM WORK?
Random Forest works in two phases: the first
is to create the random forest by
combining N decision trees, and the second
is to make predictions using each tree
created in the first phase.
The working process can be explained
in the below steps and diagram:
Step-1: Select random K data points
from the training set.
Step-2: Build the decision trees
associated with the selected data points
(Subsets).
Step-3: Choose the number N for decision trees
that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions
of each decision tree, and assign the new data
points to the category that wins the majority
votes.
The working of the algorithm can be better
understood by the below example:
Example: Suppose there is a dataset that
contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset
is divided into subsets and given to each decision
tree. During the training phase, each decision tree
produces a prediction result, and when a new
data point occurs, then based on the majority of
results, the Random Forest classifier predicts the
final decision. Consider the below image:
HOW DOES RANDOM FOREST
ALGORITHM WORK? (CONTI.)
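A hedged sketch with scikit-learn (dataset chosen only for illustration): n_estimators plays the role of N above, each tree is fitted on a bootstrap subset of the data, and prediction is a majority vote across trees.

```python
# Random forest: an ensemble of decision trees voting on the output.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict(X[:3]))    # majority vote across the 100 trees
```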
LINEAR REGRESSION
Linear regression is one of the easiest and most
popular Machine Learning algorithms. It is a
statistical method that is used for predictive
analysis. Linear regression makes predictions
for continuous/real or numeric variables such
as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear
relationship between a dependent variable (y) and
one or more independent variables (x), hence the
name linear regression. In other words, it finds
how the value of the dependent variable changes
according to the value of the independent
variable.
The linear regression model provides a
sloped straight line representing the
relationship between the variables.
Consider the below image:
Mathematically, we can represent a
linear regression as:
y = a0 + a1x + ε
where:
Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional
degree of freedom)
a1 = linear regression coefficient (scale
factor applied to each input value)
ε = random error
The values of the x and y variables are the
training dataset used for the linear regression
model representation.
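A minimal sketch, assuming scikit-learn; the experience/salary numbers are invented for illustration:

```python
# Fit y = a0 + a1*x on toy data and predict for a new input.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # years of experience
y = np.array([30, 35, 41, 44, 50])        # salary (in thousands)

reg = LinearRegression().fit(x, y)
print(reg.intercept_, reg.coef_[0])       # estimates of a0 and a1
print(reg.predict([[6]]))                 # prediction for a new input
```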
Multiple Linear Regression:
This involves more than one
independent variable and one
dependent variable. The equation for
multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βnXn
where:
• Y is the dependent variable
• X1, X2, …, Xn are the independent
variables
• β0 is the intercept
• β1, β2, …, βn are the slopes
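The same scikit-learn API handles the multiple-variable case: each column of X gets its own slope. A short sketch with invented data:

```python
# Multiple linear regression: two predictors, one target.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1200], [3, 1500], [5, 1100], [7, 2000]])  # two predictors
y = np.array([32, 45, 50, 68])

reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)   # beta0 and [beta1, beta2]
```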
TYPES OF LINEAR REGRESSION
• Simple Linear Regression:
If a single independent variable is used
to predict the value of a numerical
dependent variable, then such a Linear
Regression algorithm is called Simple
Linear Regression.
• Multiple Linear regression:
If more than one independent variable is
used to predict the value of a numerical
dependent variable, then such a Linear
Regression algorithm is called Multiple
Linear Regression.
LINEAR REGRESSION LINE
• Positive Linear Relationship:
If the dependent variable increases on
the Y-axis as the independent variable
increases on the X-axis, then such a
relationship is termed a positive
linear relationship.
LINEAR REGRESSION
LINE(CONT.)
• Negative Linear Relationship:
If the dependent variable decreases on
the Y-axis as the independent variable
increases on the X-axis, then such a
relationship is called a negative linear
relationship.
LOGISTIC REGRESSION
• Logistic regression is one of the most popular
Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for
predicting the categorical dependent variable
using a given set of independent variables.
• Logistic regression predicts the output of a
categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value.
It can be Yes or No, 0 or 1, True or False,
etc.; but instead of giving the exact values 0
and 1, it gives probabilistic values which
lie between 0 and 1.
• Linear Regression is used for solving Regression
problems, whereas Logistic regression is used
for solving the classification problems.
• In Logistic regression, instead of fitting a
regression line, we fit an "S" shaped
logistic function, which predicts two
maximum values (0 or 1).
• The curve from the logistic function
indicates the likelihood of something such
as whether the cells are cancerous or not,
a mouse is obese or not based on its
weight, etc.
• Logistic Regression is a significant
machine learning algorithm because it
has the ability to provide probabilities and
classify new data using continuous and
discrete datasets.
• Logistic Regression can be used to
classify the observations using different
types of data and can easily determine
the most effective variables used for the
classification. The below image shows
the logistic function:
LOGISTIC FUNCTION
(SIGMOID FUNCTION):
• The sigmoid function is a mathematical function
used to map the predicted values to probabilities.
• It maps any real value into another value within a
range of 0 and 1.
• The output of logistic regression must be
between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" form. The
S-form curve is called the sigmoid function or the
logistic function.
• In logistic regression, we use the concept of a
threshold value, which defines the cut-off between
0 and 1: values above the threshold tend to 1,
and values below the threshold tend to 0.
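A tiny sketch of the sigmoid and the threshold rule described above, in plain Python; the 0.5 threshold is a common default, not the only choice:

```python
# Sigmoid maps any real value into (0, 1); a threshold turns it into a label.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

prob = sigmoid(0.8)
label = 1 if prob >= 0.5 else 0     # values above the threshold tend to 1
print(round(prob, 2), label)        # 0.69 1
```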
LOGISTIC REGRESSION EQUATION:
The logistic regression equation can be obtained
from the linear regression equation by taking the
log of the odds:
log( y / (1 − y) ) = b0 + b1x1 + b2x2 + … + bnxn
TYPE OF LOGISTIC REGRESSION:
On the basis of the categories, Logistic
Regression can be classified into three
types:
• Binomial: In binomial Logistic
regression, there can be only two
possible types of the dependent
variables, such as 0 or 1, Pass or Fail,
etc.
• Multinomial: In multinomial Logistic
regression, there can be 3 or more
possible unordered types of the
dependent variable, such as "cat",
"dogs", or "sheep".
• Ordinal: In ordinal Logistic regression,
there can be 3 or more possible ordered
types of the dependent variable, such as
"low", "medium", or "high".
THANK YOU