
UNIT – III: Supervised Learning - II

Classification – Logistic Regression – Decision Tree Regression and Classification – Random Forest Regression and Classification – Support Vector Machine Regression and Classification – Evaluating Classification Models.

Machine Learning for classification

Classification is the process of categorizing data or objects into predefined classes or categories based on their features or attributes.
Machine learning classification is a type of supervised learning technique where an algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
The classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then assigns each new observation to one of a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes are also called targets, labels, or categories.

Unlike regression, the output variable of classification is a category rather than a numeric value, such as "Green or Blue" or "fruit or animal". Since classification is a supervised learning technique, it takes labeled input data, meaning each input comes with its corresponding output.

The main objective of classification machine learning is to build a model that can
accurately assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images
labeled as either dogs or cats and then used to predict the class of new, unseen
images of dogs or cats based on their features such as color, texture, and shape.
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or
categories. Example – On the basis of the given health conditions of a person, we
have to determine whether the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several
classes or categories. For example, on the basis of data about different species
of flowers, we have to determine which species our observation belongs to.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple
and computationally efficient. Some of the linear classification models are as
follows:
 Logistic Regression
 Support Vector Machines having kernel = ‘linear’
 Single-layer Perceptron
 Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They
can capture more complex relationships between the input features and the target
variable. Some of the non-linear classification models are as follows:
 K-Nearest Neighbours
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Ensemble learning classifiers:
 Random Forests,
 AdaBoost,
 Bagging Classifier,
 Voting Classifier,
 ExtraTrees Classifier
 Multi-layer Artificial Neural Networks

Learners in Classification Problems:


In the classification problems, there are two types of learners:

 Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner's case, classification is done on the basis of the most closely related data
stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
 Eager Learners: Eager learners develop a classification model from the training dataset before
receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less
time in prediction. Example: Decision Trees, Naïve Bayes, ANN.

Evaluating a Classification model:


Once our model is complete, it is necessary to evaluate its performance, whether it is a
classification or a regression model. For evaluating a classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted probability deviates from the actual label.
o A lower log loss represents higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

Log Loss = -( y·log(p) + (1 - y)·log(1 - p) )

Where y = actual output (0 or 1), p = predicted probability of the positive class.
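The short Python sketch below is an assumed illustration (using scikit-learn, which these notes do not prescribe) of computing log loss for a handful of predicted probabilities:

# Assumed example: log loss for a binary classifier with scikit-learn.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]             # actual labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3]   # predicted probabilities for class 1

print(log_loss(y_true, y_prob))      # values near 0 indicate a better model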

2. Confusion Matrix:

o The confusion matrix gives us a matrix/table as output that describes the
performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, showing the total number of correct
predictions and incorrect predictions. The matrix looks like the table below:

                     Actual Positive        Actual Negative
Predicted Positive   True Positive          False Positive
Predicted Negative   False Negative         True Negative
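A minimal sketch (assumed example, again using scikit-learn) of building a confusion matrix from actual and predicted labels:

# Assumed example: confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 1]
y_predicted = [1, 0, 0, 1, 1, 1]

# In scikit-learn's output, rows correspond to actual classes and columns to predicted classes.
print(confusion_matrix(y_actual, y_predicted))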

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o The AUC-ROC curve is also used to visualize the performance of multi-class
classification models.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and
FPR (False Positive Rate) on the X-axis.
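A minimal sketch (assumed example) of computing the ROC curve and AUC with scikit-learn:

# Assumed example: ROC curve and AUC for a binary classifier.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]    # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))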

Use cases of Classification Algorithms


Classification algorithms can be used in different places. Below are some popular use cases of
Classification Algorithms:

o Email Spam Detection


o Speech Recognition
o Identification of cancer tumor cells.
o Drugs Classification
o Biometric Identification, etc.

Characteristics of Classification
Here are the characteristics of the classification:
 Categorical Target Variable: Classification deals with predicting categorical
target variables that represent discrete classes or labels. Examples include
classifying emails as spam or not spam, predicting whether a patient has a
high risk of heart disease, or identifying image objects.
 Accuracy and Error Rates: Classification models are evaluated based on
their ability to correctly classify data points. Common metrics include
accuracy, precision, recall, and F1-score.
 Model Complexity: Classification models range from simple linear classifiers
to more complex nonlinear models. The choice of model complexity depends
on the complexity of the relationship between the input features and the target
variable.
 Overfitting and Underfitting: Classification models are susceptible to
overfitting and underfitting. Overfitting occurs when the model learns the
training data too well and fails to generalize to new data.

How does Classification Machine Learning Work?

The basic idea behind classification is to train a model on a labeled dataset, where
the input data is associated with their corresponding output labels, to learn the
patterns and relationships between the input data and output labels. Once the
model is trained, it can be used to predict the output labels for new unseen data.

Model Selection
There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM),
or neural networks. It is important to select a model that is appropriate for your
problem, taking into account the size and complexity of your data, and the
computational resources you have available.
Model Training
Once you have selected a model, the next step is to train it on your training data.
This involves adjusting the parameters of the model to minimize the error
between the predicted class labels and the actual class labels for the training data.
Model Evaluation
After training the model, it is important to evaluate its performance on a validation
set. This gives you a good idea of how well the model is likely to perform on new,
unseen data.
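As a concrete illustration of this select, train and evaluate workflow, the sketch below is an assumed example using scikit-learn and its built-in Iris dataset (neither of which is prescribed by these notes):

# Assumed example: model selection, training and evaluation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labeled dataset split into training and validation sets.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Model selection: choose a classifier appropriate for the problem.
model = LogisticRegression(max_iter=1000)

# Model training: fit the parameters to the training data.
model.fit(X_train, y_train)

# Model evaluation: check performance on data the model has not seen.
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))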

Examples of Machine Learning Classification in Real Life


Classification algorithms are widely used in many real-world applications across
various domains, including:
 Email spam filtering
 Credit risk assessment
 Medical diagnosis
 Image classification
 Sentiment analysis.
 Fraud detection
 Quality control
 Recommendation systems

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.
o Logistic regression is much like linear regression except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is
used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification.
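A minimal sketch (assumed example, using scikit-learn on a synthetic dataset) of how logistic regression produces both class labels and probabilities between 0 and 1:

# Assumed example: logistic regression with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict(X[:5]))         # predicted classes (0 or 1)
print(clf.predict_proba(X[:5]))   # probabilities between 0 and 1 for each class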

Assumptions for Logistic Regression:


o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.

Logistic Regression Equation:


The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In logistic regression, y can be between 0 and 1 only, so we divide the above
equation by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for logistic regression.

Logistic Function - Sigmoid Function

The sigmoid or logistic function is essential for converting predicted values into probabilities
in logistic regression. This function maps any real number to a value between 0 and 1,
ensuring that predictions remain within this probability range. Its "S" shaped curve helps
translate raw scores into a more interpretable format.

A threshold value is used in logistic regression to make decisions based on these


probabilities. For instance, if the predicted probability is above a certain threshold, such as
0.5, the result is 1. If it’s below, it’s classified as 0. This approach allows for clear and
actionable outcomes, such as determining whether a customer will purchase a product or a
patient has a particular condition based on the probability calculated by the sigmoid function.
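A minimal sketch (assumed example) of the sigmoid function and a 0.5 threshold rule in Python:

# Assumed example: sigmoid function and thresholding raw scores at 0.5.
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

raw_scores = np.array([-2.0, -0.3, 0.0, 1.5, 4.0])
probabilities = sigmoid(raw_scores)
predictions = (probabilities >= 0.5).astype(int)   # 1 if probability >= 0.5, else 0

print(probabilities)
print(predictions)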

Type of Logistic Regression:


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".

How Does the Logistic Regression Algorithm Work?


Consider the following example: An organization wants to determine an employee’s salary
increase based on their performance.

For this purpose, a linear regression algorithm will help them decide. Plotting a regression
line by considering the employee’s performance as the independent variable, and the salary
increase as the dependent variable will make their task easier.

Now, what if the organization wants to know whether an employee would get a promotion or
not based on their performance? The above linear graph won’t be suitable in this case. As
such, we clip the line at zero and one, and convert it into a sigmoid curve (S curve).

Based on the threshold values, the organization can decide whether an employee will get a
salary increase or not.

To understand logistic regression, let’s go over the odds of success.

Odds (θ) = Probability of an event happening / Probability of an event not happening

θ = p / (1 - p)

The values of the odds range from zero to ∞, while the values of probability lie between zero and
one.

Consider the equation of a straight line:

𝑦 = 𝛽0 + 𝛽1* 𝑥

Here, 𝛽0 is the y-intercept

𝛽1 is the slope of the line

x is the value of the x coordinate

y is the value of the prediction

Now, to predict the odds of success, we use the following formula:

log[ p(x) / (1 - p(x)) ] = β0 + β1·x

Exponentiating both sides, we have:

p(x) / (1 - p(x)) = e^(β0 + β1·x)

Let Y = e^(β0 + β1·x)

Then p(x) / (1 - p(x)) = Y

p(x) = Y(1 - p(x))

p(x) = Y - Y·p(x)

p(x) + Y·p(x) = Y

p(x)(1 + Y) = Y

p(x) = Y / (1 + Y)

The equation of the sigmoid function is therefore:

p(x) = e^(β0 + β1·x) / (1 + e^(β0 + β1·x)) = 1 / (1 + e^-(β0 + β1·x))

The sigmoid curve obtained from this equation is the "S"-shaped curve described above.
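A quick numeric check (with assumed, hypothetical coefficients) that p = Y / (1 + Y) reproduces the sigmoid:

# Assumed example: verifying p = Y / (1 + Y) equals the sigmoid of b0 + b1*x.
import math

b0, b1, x = -1.0, 2.0, 0.75           # hypothetical coefficients and input
Y = math.exp(b0 + b1 * x)             # odds of success
p = Y / (1 + Y)                       # probability recovered from the odds

print(p)
print(1 / (1 + math.exp(-(b0 + b1 * x))))   # same value via the sigmoid directly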

Now that you know more about logistic regression algorithms, let's look at the advantages of
the logistic regression algorithm.

Advantages of the Logistic Regression Algorithm

 Logistic regression performs better when the data is linearly separable.

 It does not require many computational resources and is highly interpretable.

 Scaling of the input features is not a problem, and it does not require extensive tuning.
 It is easy to implement and train a model using logistic regression.

 It gives a measure of how relevant a predictor is (coefficient size) and its direction of
association (positive or negative).

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after
reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot classify the nodes further; the
final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node
(Distance from the office) and one leaf node based on the corresponding labels. The next decision
node further splits into one decision node (Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offer and Declined offer).
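A minimal sketch (assumed example) of building a decision tree classifier with scikit-learn; its criterion parameter selects the attribute selection measure discussed next:

# Assumed example: decision tree classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion can be "gini" (Gini index) or "entropy" (information gain).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict(X[:5]))    # classes predicted for the first five samples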
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for
the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. With this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness
in the data. Entropy can be calculated as:
Entropy(S) = -P(yes)·log2(P(yes)) - P(no)·log2(P(no))

Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
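A minimal sketch (assumed example) that computes entropy and the information gain of one split using the formulas above:

# Assumed example: entropy and information gain for a binary split.
import math

def entropy(n_yes, n_no):
    total = n_yes + n_no
    result = 0.0
    for count in (n_yes, n_no):
        if count:                          # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Parent node: 9 "yes" and 5 "no"; a feature splits it into two child nodes.
parent = entropy(9, 5)
left_entropy, left_size = entropy(6, 2), 8
right_entropy, right_size = entropy(3, 3), 6
total = left_size + right_size

info_gain = parent - (left_size / total * left_entropy + right_size / total * right_entropy)
print(round(info_gain, 3))                 # the attribute with the highest gain is split first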

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over an attribute with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
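A small sketch (assumed example) of the Gini index computed from a node's class proportions:

# Assumed example: Gini index of a node.
def gini_index(class_probabilities):
    return 1.0 - sum(p ** 2 for p in class_probabilities)

print(gini_index([0.5, 0.5]))    # 0.5: maximally impure binary node
print(gini_index([1.0, 0.0]))    # 0.0: perfectly pure node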

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, and a tree that is too small may not capture
all the important features of the dataset. A technique that decreases the size of the learned tree
without reducing accuracy is known as pruning. There are mainly two types of tree pruning
techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.
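As an assumed illustration, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its decision tree classes:

# Assumed example: cost complexity pruning via ccp_alpha in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# A larger ccp_alpha removes more branches, giving a smaller tree.
print("Unpruned leaves:", unpruned.get_n_leaves())
print("Pruned leaves:  ", pruned.get_n_leaves())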

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Random Forest Algorithm


The Random Forest algorithm is a supervised machine learning algorithm that is extremely
popular and is used for classification and regression problems in machine learning. We know
that a forest comprises numerous trees, and the more trees it has, the more robust it is. Similarly,
the greater the number of trees in a Random Forest, the higher its accuracy and problem-solving
ability. Random Forest is a classifier that contains several decision trees built on various
subsets of the given dataset and takes the average (or majority vote) of their predictions to
improve the predictive accuracy on that dataset. It is based on the concept of ensemble learning,
which is the process of combining multiple classifiers to solve a complex problem and improve
the performance of the model.

Working of Random Forest Algorithm


The following steps explain the working Random Forest Algorithm:

Step 1: Select random samples (with replacement) from the given data or training set.

Step 2: The algorithm constructs a decision tree for each of these training samples.

Step 3: Each decision tree produces a prediction, and voting (for classification) or averaging (for
regression) takes place across the trees.

Step 4: Finally, the most voted prediction result is selected as the final prediction result, as in the
sketch below.
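A minimal sketch (assumed example) of these steps using scikit-learn's RandomForestClassifier, which performs the sampling, tree building and voting internally:

# Assumed example: random forest classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 trees
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))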

This combination of multiple models is called an Ensemble. Ensembling uses two methods:

1. Bagging: Creating different training subsets from the sample training data with replacement is
called Bagging. The final output is based on majority voting.

2. Boosting: Combining weak learners into a strong learner by creating sequential models such that
the final model has the highest accuracy is called Boosting. Examples: AdaBoost, XGBoost.
Bagging: From the principle mentioned above, we can understand that Random Forest uses the
Bagging technique. Let us understand this concept in detail. Bagging, also known as Bootstrap
Aggregation, is used by Random Forest. The process begins with the original random data, which
is resampled into subsets known as bootstrap samples; this step is known as Bootstrapping. The
models are then trained individually on these samples, and combining their individual results is
known as Aggregation. In the last step, all the results are combined, and the generated output is
based on majority voting. This whole procedure is known as Bagging and is done using an
ensemble classifier.

Essential Features of Random Forest

 Miscellany (diversity): Each tree is built from different data and features, so its attributes differ
from those of the other trees; not all trees are the same.

 Immune to the curse of dimensionality: Since each individual tree does not consider all of the
features, the feature space that any single tree works with is reduced.
 Parallelization: We can fully use the CPU to build random forests, since each tree is created
independently from different data and features.

 Train-test split: In a Random Forest we do not have to set aside separate test data, because each
decision tree never sees roughly 30% of the data (the samples left out of its bootstrap sample),
which can be used for validation.

 Stability: The final result is based on bagging, meaning it comes from majority voting or
averaging.

Difference between Decision Tree and Random Forest

Decision Trees vs Random Forest:

 Decision trees usually suffer from the problem of overfitting if they are allowed to grow without
any control. In a Random Forest, since the trees are created from subsets of the data and the final
output is based on averaging or majority ranking, the problem of overfitting does not happen here.

 A single decision tree is comparatively faster in computation; a Random Forest is slower.

 Decision trees use a particular set of rules when a dataset with features is taken as input. A
Random Forest randomly selects observations, builds decision trees, and then obtains the result
based on majority voting; no explicit set of rules is required.

Why Use a Random Forest Algorithm?

There are a lot of benefits to using Random Forest Algorithm, but one of the main advantages
is that it reduces the risk of overfitting and the required training time. Additionally, it offers a
high level of accuracy. Random Forest algorithm runs efficiently in large databases and
produces highly accurate predictions by estimating missing data.

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category
in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider
a diagram in which two different categories are classified using a decision boundary or hyperplane.

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can
accurately identify whether it is a cat or a dog; such a model can be created by using the SVM
algorithm. We first train our model with lots of images of cats and dogs so that it can learn the
different features of cats and dogs, and then we test it with this strange creature. The support
vector machine creates a decision boundary between the two classes (cat and dog) and chooses the
extreme cases (support vectors) of cats and dogs. On the basis of the support vectors, it will
classify the new example as a cat.
The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
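A minimal sketch (assumed example) contrasting a linear and a non-linear SVM classifier in scikit-learn on data that is not linearly separable:

# Assumed example: linear vs non-linear (RBF kernel) SVM with scikit-learn.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# make_moons produces two interleaving half-circles, which are not linearly separable.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)        # non-linear kernel

print("Linear SVM accuracy:", linear_svm.score(X, y))
print("RBF SVM accuracy:   ", rbf_svm.score(X, y))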

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means if
there are 2 features (as shown in image), then hyperplane will be a straight line. And if there are 3
features, then hyperplane will be a 2-dimension plane.

We always create the hyperplane that has the maximum margin, which means the maximum distance
between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below
image:
Since this is a 2-d space, just by using a straight line we can easily separate these two classes. But
there can be multiple lines that separate these classes.

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the support vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:
To separate these data points, we need to add one more dimension. For linear data we have
used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be
calculated as:

z = x² + y²

By adding the third dimension, the sample space becomes three-dimensional, and SVM can divide
the datasets into classes with a linear boundary in this new space. Since we are in 3-d space, this
boundary looks like a plane parallel to the x-axis. If we convert it back into 2-d space with z = 1,
it becomes a circle.

Hence, we get a circumference of radius 1 in the case of non-linear data.
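A minimal sketch (assumed example) of the idea above: adding z = x² + y² as a third feature makes circular data separable by a linear SVM:

# Assumed example: mapping 2-d circular data into 3-d with z = x^2 + y^2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Inner disc (class 0) and outer ring (class 1): not linearly separable in 2-d.
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)    # the extra dimension z = x^2 + y^2
X_3d = np.hstack([X, z])

print("2-d linear SVM accuracy:", SVC(kernel="linear").fit(X, y).score(X, y))
print("3-d linear SVM accuracy:", SVC(kernel="linear").fit(X_3d, y).score(X_3d, y))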

Evaluation Metrics For Classification Model


Evaluating the performance of your classification model is crucial to ensure its
accuracy and effectiveness.

What is a Confusion Matrix?
A confusion matrix is a simple table that shows how well a classification model is
performing by comparing its predictions to the actual results. It breaks down the
predictions into four categories: correct predictions for both classes (true
positives and true negatives) and incorrect predictions (false positives and false
negatives). This helps you understand where the model is making mistakes, so you
can improve it.
The matrix displays the number of instances produced by the model on the test
data.
 True Positive (TP): The model correctly predicted a positive outcome (the
actual outcome was positive).
 True Negative (TN): The model correctly predicted a negative outcome (the
actual outcome was negative).
 False Positive (FP): The model incorrectly predicted a positive outcome (the
actual outcome was negative). Also known as a Type I error.
 False Negative (FN): The model incorrectly predicted a negative outcome
(the actual outcome was positive). Also known as a Type II error.

Why do we need a Confusion Matrix?


A confusion matrix helps you see how well a model is working by showing
correct and incorrect predictions. It also helps calculate key measures
like accuracy, precision, and recall, which give a better idea of performance,
especially when the data is imbalanced.

Metrics based on Confusion Matrix Data


1. Accuracy
Accuracy measures how often the model’s predictions are correct overall. It gives
a general idea of how well the model is performing. However, accuracy can be
misleading, especially with imbalanced datasets where one class dominates. For
example, a model that predicts the majority class correctly most of the time might
have high accuracy but still fail to capture important details about other classes.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
2. Precision
Precision focuses on the quality of the model’s positive predictions. It tells us
how many of the instances predicted as positive are actually positive. Precision is
important in situations where false positives need to be minimized, such as
detecting spam emails or fraud.
Precision = TP / (TP + FP)
3. Recall
Recall measures how well the model identifies all actual positive cases. It shows
the proportion of true positives detected out of all the actual positive instances.
High recall is essential when missing positive cases has significant consequences,
such as in medical diagnoses.
Recall = TP / (TP + FN)
4. F1-Score
F1-score combines precision and recall into a single metric to balance their trade-
off. It provides a better sense of a model’s overall performance, particularly for
imbalanced datasets. The F1 score is helpful when both false positives and false
negatives are important, though it assumes precision and recall are equally
significant, which might not always align with the use case.
F1-Score = 2 · (Precision · Recall) / (Precision + Recall)
5. Specificity
Specificity is another important metric in the evaluation of classification models,
particularly in binary classification. It measures the ability of a model to correctly
identify negative instances. Specificity is also known as the True Negative Rate.
Formula is given by:
Specificity = TN / (TN + FP)
6. Type 1 and Type 2 error
 Type 1 error
o A Type 1 Error occurs when the model incorrectly predicts a
positive instance, but the actual instance is negative. This is also
known as a false positive. Type 1 Errors affect the precision of a
model, which measures the accuracy of positive predictions.
Type 1 Error = FP / (TN + FP)
 Type 2 error
o A Type 2 Error occurs when the model fails to predict a positive
instance, even though it is actually positive. This is also known as
a false negative. Type 2 Errors impact the recall of a model, which
measures how well the model identifies all actual positive cases.
Type 2 Error = FN / (TP + FN)

 Example:
o Scenario: A diagnostic test is used to detect a particular disease in patients.
o Type 1 Error (False Positive): This occurs when the test predicts that a patient has the
disease (positive result), but the patient is actually healthy (negative case).
o Type 2 Error (False Negative): This occurs when the test predicts that the patient is
healthy (negative result), but the patient actually has the disease (positive case).
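A minimal sketch (assumed example) computing the metrics above directly from the confusion-matrix counts; the numbers match the Dog-recognition example that follows:

# Assumed example: metrics from TP, TN, FP and FN counts.
TP, TN, FP, FN = 5, 3, 1, 1

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f1_score    = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)

print(accuracy, precision, recall, round(f1_score, 3), specificity)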
Confusion Matrix for binary classification
A 2x2 confusion matrix is shown below for image recognition of a Dog image or Not Dog image:

                     Predicted: Dog          Predicted: Not Dog
Actual: Dog          True Positive (TP)      False Negative (FN)
Actual: Not Dog      False Positive (FP)     True Negative (TN)

 True Positive (TP): the total count of cases where both the predicted and actual values are Dog.
 True Negative (TN): the total count of cases where both the predicted and actual values are
Not Dog.
 False Positive (FP): the total count of cases where the prediction is Dog while the actual value is
Not Dog.
 False Negative (FN): the total count of cases where the prediction is Not Dog while the actual
value is Dog.

Example: Confusion Matrix for Dog Image Recognition with Numbers


Index      1    2        3    4        5    6        7    8    9        10
Actual     Dog  Dog      Dog  Not Dog  Dog  Not Dog  Dog  Dog  Not Dog  Not Dog
Predicted  Dog  Not Dog  Dog  Not Dog  Dog  Dog      Dog  Dog  Not Dog  Not Dog
Result     TP   FN       TP   TN       TP   FP       TP   TP   TN       TN

 Actual Dog counts = 6
 Actual Not Dog counts = 4
 True Positive counts = 5
 False Positive counts = 1
 True Negative counts = 3
 False Negative counts = 1

                     Predicted: Dog            Predicted: Not Dog
Actual: Dog          True Positive (TP = 5)    False Negative (FN = 1)
Actual: Not Dog      False Positive (FP = 1)   True Negative (TN = 3)
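The same example can be reproduced with scikit-learn (an assumed sketch; the labels parameter puts the positive class "Dog" first):

# Assumed example: the Dog-recognition confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

actual    = ["Dog", "Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Not Dog"]
predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Dog", "Dog", "Not Dog", "Not Dog"]

print(confusion_matrix(actual, predicted, labels=["Dog", "Not Dog"]))   # rows: actual, columns: predicted
print("Accuracy: ", accuracy_score(actual, predicted))
print("Precision:", precision_score(actual, predicted, pos_label="Dog"))
print("Recall:   ", recall_score(actual, predicted, pos_label="Dog"))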