Unit 3
Unlike regression, the output variable of classification is a category rather than a numeric value, such as "Green
or Blue", "fruit or animal", etc. Since classification is a supervised learning
technique, it takes labelled input data, meaning each input comes with its corresponding
output.
The main objective of classification machine learning is to build a model that can
accurately assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images
labeled as either dogs or cats and then used to predict the class of new, unseen
images of dogs or cats based on their features such as color, texture, and shape.
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or
categories. Example – On the basis of the given health conditions of a person, we
have to determine whether the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several
classes or categories. For Example – On the basis of data about different species
of flowers, we have to determine which species our observation belongs to.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple
and computationally efficient. Some of the linear classification models are as
follows:
Logistic Regression
Support Vector Machines with kernel = ‘linear’
Single-layer Perceptron
Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They
can capture more complex relationships between the input features and the target
variable. Some of the non-linear classification models are as follows:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Decision Tree Classification
Ensemble learning classifiers:
Random Forests,
AdaBoost,
Bagging Classifier,
Voting Classifier,
ExtraTrees Classifier
Multi-layer Artificial Neural Networks
Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. Classification is then done on the basis of the most closely related data stored in the
training dataset. Lazy learners take less time in training but more time in prediction.
Example: K-NN algorithm, Case-based reasoning
Eager Learners: Eager learners develop a classification model based on the training dataset before
receiving a test dataset. Opposite to lazy learners, eager learners take more time in learning and less
time in prediction. Example: Decision Trees, Naïve Bayes, ANN.
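As a concrete illustration of the ideas above, the following sketch trains one eager learner (a decision tree) and one lazy learner (K-NN) on the same data. Python with scikit-learn and its built-in Iris dataset are assumed here; they are not part of the original notes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier      # eager learner: builds the model up front
from sklearn.neighbors import KNeighborsClassifier   # lazy learner: stores data, decides at prediction time
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for name, model in [("Decision Tree (eager)", DecisionTreeClassifier(random_state=42)),
                    ("K-NN (lazy)", KNeighborsClassifier(n_neighbors=5))]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))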
2. Confusion Matrix:
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of a classification model across thresholds, we use the AUC-ROC Curve.
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and FPR (False Positive Rate) on the X-axis.
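A brief sketch of how the ROC curve and AUC could be computed in code. Scikit-learn and a synthetic dataset are assumed for illustration; they are not specified in the original notes.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data (assumed example data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR (x-axis) and TPR (y-axis) at each threshold
print("AUC:", roc_auc_score(y_test, scores))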
Characteristics of Classification
Here are the characteristics of the classification:
Categorical Target Variable: Classification deals with predicting categorical
target variables that represent discrete classes or labels. Examples include
classifying emails as spam or not spam, predicting whether a patient has a
high risk of heart disease, or identifying image objects.
Accuracy and Error Rates: Classification models are evaluated based on
their ability to correctly classify data points. Common metrics include
accuracy, precision, recall, and F1-score.
Model Complexity: Classification models range from simple linear classifiers
to more complex nonlinear models. The choice of model complexity depends
on the complexity of the relationship between the input features and the target
variable.
Overfitting and Underfitting: Classification models are susceptible to
overfitting and underfitting. Overfitting occurs when the model learns the
training data too well and fails to generalize to new data, while underfitting
occurs when the model is too simple to capture the underlying patterns in the data.
The basic idea behind classification is to train a model on a labeled dataset, where
the input data is associated with their corresponding output labels, to learn the
patterns and relationships between the input data and output labels. Once the
model is trained, it can be used to predict the output labels for new unseen data.
Model Selection
There are many different models that can be used for classification,
including logistic regression, decision trees, support vector machines (SVM),
or neural networks. It is important to select a model that is appropriate for your
problem, taking into account the size and complexity of your data, and the
computational resources you have available.
Model Training
Once you have selected a model, the next step is to train it on your training data.
This involves adjusting the parameters of the model to minimize the error
between the predicted class labels and the actual class labels for the training data.
Model Evaluation
After training the model, it is important to evaluate its performance on a
validation set. This gives you a good idea of how well the model is likely to
perform on new, unseen data.
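The selection–training–evaluation workflow above could look roughly like the following sketch. Scikit-learn, its Iris dataset, and the choice of an SVM are assumptions made only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)

# Hold out a validation set to estimate performance on unseen data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = SVC(kernel="rbf")          # model selection: any suitable classifier could be chosen here
model.fit(X_train, y_train)        # model training: parameters are fit to the training data
print(classification_report(y_val, model.predict(X_val)))   # model evaluation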
o The equation of the straight line can be written as: y = β0 + β1x1 + β2x2 + … + βnxn
o In Logistic Regression, y can be between 0 and 1 only, so we divide the above equation by (1 − y):
y / (1 − y), which is 0 for y = 0 and +∞ for y = 1.
o But we need a range from −∞ to +∞, so taking the logarithm of the equation, it becomes:
log[ y / (1 − y) ] = β0 + β1x1 + β2x2 + … + βnxn
The above equation is the final equation for Logistic Regression.
The sigmoid or logistic function is essential for converting predicted values into probabilities
in logistic regression. This function maps any real number to a value between 0 and 1,
ensuring that predictions remain within this probability range. Its "S" shaped curve helps
translate raw scores into a more interpretable format.
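As a small illustration of this mapping, the sketch below (NumPy assumed, input values arbitrary) shows the sigmoid squashing any real number into the interval (0, 1):

import numpy as np

def sigmoid(z):
    # Maps any real number z to a value in (0, 1).
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10, -1, 0, 1, 10])))
# -> values close to 0, below 0.5, exactly 0.5, above 0.5, close to 1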
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic Regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types
of the dependent variable, such as "low", "medium", or "high".
Suppose an organization wants to decide how much of a salary increase to give an employee based on
their performance. For this purpose, a linear regression algorithm will help them decide. Plotting a
regression line with the employee’s performance as the independent variable and the salary increase
as the dependent variable will make their task easier.
Now, what if the organization wants to know whether an employee would get a promotion or
not based on their performance? The above linear graph won’t be suitable in this case. As
such, we clip the line at zero and one, and convert it into a sigmoid curve (S curve).
Based on the threshold value, the organization can decide whether an employee will get a
promotion or not.
θ = p / (1 − p)
The values of the odds range from zero to ∞, while the values of probability lie between zero and
one.
The linear equation is y = β0 + β1x. Equating it to the log of the odds gives:
log[ p(x) / (1 − p(x)) ] = β0 + β1x
Let Y = e^(β0 + β1x). Then p(x) / (1 − p(x)) = Y, which can be rearranged step by step:
p(x) = Y (1 − p(x))
p(x) + Y p(x) = Y
p(x) (1 + Y) = Y
p(x) = Y / (1 + Y)
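The final form p(x) = Y / (1 + Y) is exactly the sigmoid of the linear term, which can be checked numerically. The coefficients below are hypothetical values chosen only for illustration (NumPy assumed).

import numpy as np

# Hypothetical coefficients, not from the notes.
beta0, beta1 = -1.0, 0.5
x = np.linspace(-10, 10, 5)

Y = np.exp(beta0 + beta1 * x)      # Y = e^(β0 + β1·x)
p_from_odds = Y / (1 + Y)          # p(x) = Y / (1 + Y), as derived above
p_sigmoid = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # standard sigmoid form

print(np.allclose(p_from_odds, p_sigmoid))   # True: both forms agree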
Now that you know more about logistic regression algorithms, let’s look at the difference
between linear regression and logistic regression.
It does not require many computational resources and is highly interpretable.
There is no problem in scaling the input features, and it requires little hyperparameter tuning.
It is easy to implement and train a model using logistic regression.
It gives a measure of how relevant a predictor is (coefficient size) and its direction of
association (positive or negative).
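For instance, the fitted coefficients can be inspected directly to see each predictor's strength and direction of association. The sketch below assumes scikit-learn and its bundled breast-cancer dataset, neither of which is mentioned in the notes.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

coefs = model.named_steps["logisticregression"].coef_[0]
# Sign gives the direction of association, magnitude the relevance of the predictor.
for name, c in sorted(zip(data.feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)[:5]:
    print(f"{name:25s} {c:+.2f}")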
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
In a decision tree, to predict the class of a given record, the algorithm starts from the root
node of the tree. It compares the value of the root attribute with the corresponding attribute of the
record and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with the sub-nodes and
moves further. It continues this process until it reaches a leaf node of the tree. The complete
process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
these final nodes are called leaf nodes.
Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by
ASM). The root node splits further into the next decision node (Distance from the office) and one
leaf node based on the corresponding labels. The next decision node further splits into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:
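A minimal sketch of building and inspecting a decision tree classifier in code. Scikit-learn is assumed, and the Iris dataset stands in for the job-offer example above.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# criterion="entropy" selects splits by information gain; criterion="gini" uses the Gini index.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0).fit(X, y)

# Print the learned structure: root node, decision nodes, and leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
print("prediction for the first record:", tree.predict(X[:1]))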
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for
the root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure (ASM). Using this measurement, we can easily select the best
attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no
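A small sketch of computing entropy and information gain for a yes/no split, using pure NumPy and made-up counts chosen only for illustration.

import numpy as np

def entropy(labels):
    # Entropy(S) = -P(yes)·log2 P(yes) - P(no)·log2 P(no), generalised to any set of labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent node and a split into two child subsets.
parent = np.array(["yes"] * 9 + ["no"] * 5)
left   = np.array(["yes"] * 6 + ["no"] * 1)
right  = np.array(["yes"] * 3 + ["no"] * 4)

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(round(info_gain, 3))   # information gain of this split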
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j (Pj)²
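The same kind of count-based check works for the Gini index; the node composition below is again a hypothetical example (NumPy assumed).

import numpy as np

def gini(labels):
    # Gini = 1 - sum_j (P_j)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

node = np.array(["yes"] * 9 + ["no"] * 5)
print(round(gini(node), 3))   # lower Gini means a purer node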
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learning tree without
reducing accuracy is known as Pruning. There are mainly two types of tree pruning techniques in use.
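In practice, pruning can be applied while the tree grows (for example, by limiting its depth) or after it is fully grown (for example, cost-complexity pruning). The notes do not name specific techniques, so the scikit-learn sketch below is just one possible illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pre  = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)      # pruning by limiting depth
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)   # cost-complexity pruning

for name, m in [("unpruned", full), ("depth-limited", pre), ("cost-complexity pruned", post)]:
    print(name, "leaves:", m.get_n_leaves(), "test accuracy:", round(m.score(X_test, y_test), 3))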
To better understand the Random Forest algorithm and how it works, it is helpful to first review the
three main types of machine learning.
Working of Random Forest Algorithm
Step 1: Select random samples (with replacement) from the given training dataset.
Step 2: The algorithm constructs a decision tree for every sample of training data.
Step 3: Each decision tree produces a prediction, and voting is performed over these predictions.
Step 4: Finally, select the most voted prediction result as the final prediction result.
This combination of multiple models is called an Ensemble. An ensemble uses two methods:
1. Bagging: Creating different training subsets from the sample training data with replacement is called
Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential models, such that the
final model has the highest accuracy, is called Boosting. Examples: AdaBoost, XGBoost.
Bagging: From the principle mentioned above, we can understand that Random Forest uses the
Bagging technique. Now, let us understand this concept in detail. Bagging, also known as
Bootstrap Aggregation, is the method used by Random Forest. The process begins with the original
data, from which random samples are drawn with replacement; these are known as bootstrap samples,
and the process is known as Bootstrapping. The models are then trained individually on these
samples, each yielding its own result. In the last step, all the results are combined and the final
output is based on majority voting; this combining step is known as Aggregation. The whole
procedure is known as Bagging and is carried out by an ensemble classifier.
Miscellany: Each tree has its own attributes, variety and features compared with the other trees; not
all trees are the same.
Immune to the curse of dimensionality: Since each tree does not consider all the features, the
effective feature space is reduced.
Parallelization: We can fully use the CPU to build random forests, since each tree is created
independently from different data and features.
Train-Test split: In a Random Forest we do not necessarily have to set aside separate test data,
because each tree is trained on a bootstrap sample and never sees roughly 30% of the data (the
out-of-bag samples), which can be used for validation.
Stability: The final result is based on Bagging, meaning the result is based on majority voting or
averaging.
There are many benefits to using the Random Forest algorithm, but one of the main advantages
is that it reduces the risk of overfitting and the required training time. Additionally, it offers a
high level of accuracy. The Random Forest algorithm runs efficiently on large datasets and can
still produce accurate predictions when some data is missing.
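A brief sketch of the above in code, comparing a single decision tree with a bagged ensemble of trees. Scikit-learn and its breast-cancer dataset are assumed for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees, each grown on a bootstrap sample and a random subset of features,
# with the final class decided by majority voting.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", round(single.score(X_test, y_test), 3))
print("random forest accuracy:", round(forest.score(X_test, y_test), 3))
print("out-of-bag estimate:", round(forest.oob_score_, 3))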
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider
the below diagram, in which two different categories are classified using a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that can
accurately identify whether it is a cat or a dog. Such a model can be created using the SVM
algorithm. We first train our model with lots of images of cats and dogs so that it can learn
their different features, and then we test it with this strange creature. The SVM creates a
decision boundary between the two classes (cat and dog) based on the extreme cases (the support
vectors), and on the basis of these support vectors it will classify the creature as a cat.
Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there
are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3
features, the hyperplane is a two-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e. the maximum distance between
the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the
hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called
support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below
image:
Since this is a 2-D space, we can easily separate these two classes just by using a straight line.
But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data,
we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, it becomes:
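A short sketch contrasting a linear SVM with a kernel SVM on data that is not linearly separable. Scikit-learn is assumed, and the concentric-circles dataset plays the role of the non-linear example above.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm    = SVC(kernel="rbf").fit(X_train, y_train)     # the kernel implicitly adds the extra dimension

print("linear SVM accuracy:", round(linear_svm.score(X_test, y_test), 3))
print("kernel SVM accuracy:", round(rbf_svm.score(X_test, y_test), 3))
print("number of support vectors per class:", rbf_svm.n_support_)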
Confusion Matrix
A confusion matrix is a simple table that shows how well a classification model is
performing by comparing its predictions to the actual results. It breaks down the
predictions into four categories: correct predictions for both classes (true
positives and true negatives) and incorrect predictions (false positives and false
negatives). This helps you understand where the model is making mistakes, so you
can improve it.
The matrix displays the number of instances produced by the model on the test
data.
True Positive (TP): The model correctly predicted a positive outcome (the
actual outcome was positive).
True Negative (TN): The model correctly predicted a negative outcome (the
actual outcome was negative).
False Positive (FP): The model incorrectly predicted a positive outcome (the
actual outcome was negative). Also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome
(the actual outcome was positive). Also known as a Type II error.
Example:
True Positive (TP): The total count of cases where both the predicted and the actual
value are Dog.
True Negative (TN): The total count of cases where both the predicted and the actual
value are Not Dog.
False Positive (FP): The total count of cases where the prediction is Dog while the
actual value is Not Dog.
False Negative (FN): The total count of cases where the prediction is Not Dog while
the actual value is Dog.
Index    Actual     Predicted   Result
1        Dog        Dog         TP
2        Dog        Not Dog     FN
3        Dog        Dog         TP
4        Not Dog    Not Dog     TN
5        Dog        Dog         TP
6        Not Dog    Dog         FP
7        Dog        Dog         TP
8        Dog        Dog         TP
9        Not Dog    Not Dog     TN
10       Not Dog    Not Dog     TN
From this table: TP = 5, FN = 1, FP = 1, TN = 3.
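The same counts can be reproduced in code from the ten Dog / Not Dog results above; scikit-learn is assumed here purely for illustration.

from sklearn.metrics import confusion_matrix

actual    = ["Dog", "Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Not Dog"]
predicted = ["Dog", "Not Dog", "Dog", "Not Dog", "Dog", "Dog", "Dog", "Dog", "Not Dog", "Not Dog"]

# Rows are actual classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(actual, predicted, labels=["Dog", "Not Dog"])
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print(cm)
print("TP =", tp, "FN =", fn, "FP =", fp, "TN =", tn)   # TP=5, FN=1, FP=1, TN=3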