Unit 2 Notes

Unit II covers supervised learning, focusing on the distinction between discriminative and generative models, linear regression, and various classification techniques. It explains how supervised learning utilizes labeled data for prediction and outlines the mathematical foundations and applications of different models. Key concepts include linear regression, cost functions, and the impact of outliers on model performance.

UNIT II SUPERVISED LEARNING 12

Introduction - Discriminative and Generative Models - Linear Regression - Least Squares - Underfitting/Overfitting - Cross-Validation - Lasso Regression - Classification - Logistic Regression - Gradient Linear Models - Support Vector Machines - Kernel Methods - Instance-based Methods - K-Nearest Neighbors - Tree-based Methods - Decision Trees - ID3 - CART - Ensemble Methods - Random Forest - Evaluation of Classification Algorithms

UNIT II SUPERVISED LEARNING

Introduction

Supervised learning is a type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on held-out test data (a sample of the dataset that was not used for training), and then it predicts the output.
The working of supervised learning can be easily understood by the following example:

Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:
o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.

Discriminative and Generative Models

Supervised models can be grouped into discriminative and generative models. In simple words, a discriminative model makes predictions on unseen data based on conditional probability and can be used for either classification or regression problem statements. In contrast, a generative model focuses on the distribution of a dataset to return a probability for a given example.
Problem Formulation

Suppose we are working on a classification problem where our task is to decide whether an email is spam or not spam based on the words present in that email. To solve this problem, we have a joint model over
 Labels: Y = y, and
 Features: X = {x1, x2, …, xn}
p(Y, X) = p(y, x1, x2, …, xn)
Now, our goal is to estimate the probability that an email is spam, i.e., P(Y = 1 | X). Both generative and discriminative models can solve this problem, but in different ways.
The approach of Generative Models

In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior probability P(Y) and the likelihood P(X|Y) with the help of the training data and use Bayes' theorem to calculate the posterior probability P(Y|X):
P(Y|X) = P(X|Y) P(Y) / P(X)

The approach of Discriminative Models

In the case of discriminative models, to find the probability, they directly assume some functional
form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the training data.
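As a rough illustration of the two approaches, the sketch below (assuming scikit-learn and an illustrative synthetic dataset; none of these names come from the notes) fits a generative classifier (Gaussian Naive Bayes, which estimates P(Y) and P(X|Y)) and a discriminative one (logistic regression, which estimates P(Y|X) directly) and compares their posterior estimates for a single test example:

# A minimal sketch contrasting a generative and a discriminative classifier.
# The synthetic data and model choices are illustrative, not from the notes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)               # models P(Y) and P(X|Y)
discriminative = LogisticRegression().fit(X_train, y_train)   # models P(Y|X) directly

print("Naive Bayes         P(Y=1|x):", generative.predict_proba(X_test[:1])[0, 1])
print("Logistic regression P(Y=1|x):", discriminative.predict_proba(X_test[:1])[0, 1])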
Discriminative Models

The discriminative model refers to a class of models used in statistical classification, mainly for supervised machine learning. These models are also known as conditional models since they learn the boundaries between classes or labels in a dataset.
Discriminative models (as the literal meaning suggests) separate classes rather than modeling how the data is distributed, and they do not make strong assumptions about the data points. However, these models are not capable of generating new data points. Therefore, the ultimate objective of discriminative models is to separate one class from another.
If there are outliers present in the dataset, discriminative models work better than generative models, i.e., discriminative models are more robust to outliers. However, one major drawback of these models is the misclassification problem, i.e., wrongly classifying a data point.
Mathematical things involved in Discriminative Models

Training discriminative classifiers involves estimating a function f: X -> Y, or the probability P(Y|X):


 Assume some functional form for the probability such as P(Y|X)
 With the help of training data, we estimate the parameters of P(Y|X)
Some Examples of Discriminative Models

 Logistic regression
 Support Vector Machines (SVMs)
 Traditional neural networks
 Nearest neighbor
 Conditional Random Fields (CRFs)
 Decision Trees and Random Forest
Generative Models

Generative models are a class of statistical models that can generate new data instances. These models are used in unsupervised machine learning as a means to perform tasks such as
o Probability and likelihood estimation,
o Modeling data points,
o Describing the phenomenon in data,
o Distinguishing between classes based on these probabilities.
So, generative models focus on the distribution of individual classes in a dataset, and the learning algorithms tend to model the underlying patterns or distribution of the data points. These models use the concept of joint probability and model the instances where a given feature (x) or input and the desired output or label (y) occur together.
These models use probability estimates and likelihood to model data points and differentiate between the different class labels present in a dataset. Unlike discriminative models, these models are also capable of generating new data points. However, they have a major drawback: if outliers are present in the dataset, they affect these models to a significant extent.

Mathematical things involved in Generative Models

Training generative classifiers involves estimating a function f: X -> Y, or the probability P(Y|X):

 Assume some functional form for the probabilities P(Y) and P(X|Y)
 With the help of the training data, estimate the parameters of P(X|Y) and P(Y)
 Use Bayes' theorem to calculate the posterior probability P(Y|X)
Some Examples of Generative Models

 Naïve Bayes
 Bayesian networks
 Markov random fields
 Hidden Markov Models (HMMs)
 Latent Dirichlet Allocation (LDA)
 Generative Adversarial Networks (GANs)
 Autoregressive Model
Difference between Discriminative and Generative Models

Core Idea

Discriminative models draw boundaries in the data space, while generative models try to model how
data is placed throughout the space. A generative model focuses on explaining how the data was
generated, while a discriminative model focuses on predicting the labels of the data.
Mathematical Intuition

In mathematical terms, a discriminative model is trained by learning parameters that maximize the conditional probability P(Y|X), while a generative model learns parameters by maximizing the joint probability P(X, Y).
Applications

Discriminative models recognize existing data, i.e., discriminative modeling identifies tags and sorts data and can be used to classify data, while generative modeling produces something new.
Since these models use different approaches to machine learning, both are suited to specific tasks: generative models are useful for unsupervised learning tasks, while discriminative models are useful for supervised learning tasks.
Outliers

Outliers have a greater impact on generative models than on discriminative models.


Computational Cost

Discriminative models are computationally cheaper than generative models.


Comparison between Discriminative and Generative Models

Let’s see some of the comparisons based on the following criteria between Discriminative and
Generative Models:
 Performance
 Missing Data
 Accuracy Score
 Applications
Based on Performance

Generative models need less data to train than discriminative models, since generative models are more biased as they make stronger assumptions, e.g., the assumption of conditional independence.
Based on Missing Data

In general, if we have missing data in our dataset, generative models can work with it, while discriminative models cannot. This is because, in generative models, we can still estimate the posterior by marginalizing over the unseen variables. For discriminative models, however, we usually require all the features X to be observed.

Based on Accuracy Score

If the assumption of conditional independence is violated, then generative models are less accurate than discriminative models.
Based on Applications

Discriminative models are called "discriminative" since they are useful for discriminating between the labels of Y, i.e., the target outcome, so they can only solve classification problems, while generative models have more applications besides classification, such as
 Sampling,
 Bayesian learning,
 MAP inference, etc.

Linear Regression

Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since it models a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.

Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to the input value)
ε = random error
Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.

o Multiple Linear regression:


If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line

A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-axis,
then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:


If the dependent variable decreases on the Y-axis and independent variable increases on the X-
axis, then such a relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best-fit line, which means the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
Different values of the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line. To do this we use a cost function.

Cost function

 Different values of the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
 The cost function optimizes the regression coefficients or weights. It measures how well a linear regression model is performing.
 We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the actual and predicted values:

MSE = (1/N) Σ (yi − (a1xi + a0))²

Where,

N = total number of observations
yi = actual value
(a1xi + a0) = predicted value

Gradient Descent

 Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
 A regression model uses gradient descent to update the coefficients of the line by reducing the cost function.
 It starts with randomly selected coefficient values and then iteratively updates them to reach the minimum of the cost function.
A short fitting example follows.
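As a minimal sketch (assuming scikit-learn; the small dataset below is made up purely for illustration), the library's LinearRegression estimator finds the coefficients a0 and a1 that minimize the squared-error cost described above:

# A minimal sketch: fitting y = a0 + a1*x with scikit-learn on illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)   # single independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # dependent variable

model = LinearRegression().fit(x, y)
print("a0 (intercept):", model.intercept_)
print("a1 (coefficient):", model.coef_[0])
print("prediction at x = 6:", model.predict([[6]])[0])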
Assumptions of Linear Regression

o Linear relationship between the features and target:


Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients. This can be checked using a Q-Q plot: if the plot shows a straight line without large deviations, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors.

Here are some examples of linear regression:

 Income and expenses

If you know your income and expenses for the last year, you can use linear regression to predict your
future expenses.

 Height and weight

You can use linear regression to model the relationship between a person's height and weight.

 Blood pressure

You can use multiple linear regression to analyze the relationship between height, weight, and
exercise on blood pressure.

 Income and happiness

You can use linear regression to analyze the relationship between income and happiness.

 Chemical mass and time

You can use linear regression to model how the mass of a chemical changes over time.

 Child height and age

You can use linear regression to model how a child's height changes with age.

Least Square LR

Linear regression is a statistical method that uses the least squares method to find the best line to fit
data.

Least squares linear regression (LSLR) is a mathematical method that finds the best fit line for a set of
data. It's also known as the least-squares regression line or the line of best fit.

The least-squares method is a statistical method used to find the line of best fit of the form of an
equation such as y = mx + b to the given data. The curve of the equation is called the regression line.
Our main objective in this method is to reduce the sum of the squares of errors as much as possible.

How it works
 LSLR minimizes the sum of squared errors, or residuals, between the data points and the line
 The line of best fit does not have to pass through every data point, but it minimizes the vertical distance between the data points and the line
 LSLR is often used for scatter plots, where the data is spread out in the x-y plane

What it's used for


 LSLR is used to illustrate trends
 It is used to predict or estimate data values
 It is used by technical analysts to identify trading opportunities and market trends

The least-squares method states that the curve that best fits a given set of observations is the curve having the minimum sum of squared residuals (or deviations, or errors) from the given data points. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d represents the error or deviation from each given point. Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn − f(xn)

The least-squares principle states that the best-fitting curve has the property that the sum of squares of all the deviations from the given values is minimum, i.e.:

S = d1² + d2² + d3² + … + dn² = Σ (yi − f(xi))² is minimum.

Example 1: Find a linear regression equation for a given set of (x, y) data.
Solving the normal equations for this data gives a = 1.5 and b = 0.95.
The linear equation is given by
y = a + bx
Putting the values of a and b in the equation,
the equation of the linear regression line is y = 1.5 + 0.95x.
Example 2
Use the least squares method to determine the equation of the line of best fit for the data below, then plot the line.
xi: 8, 3, 2, 10, 11, 3, 6, 5, 6, 8
yi: 4, 12, 1, 12, 9, 4, 9, 6, 1, 14
Solution:
Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 = 62/10 = 6.2
Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 = 72/10 = 7.2
The straight-line equation is y = a + bx.
The normal equations are
Σy = na + b Σx
Σxy = a Σx + b Σx²
For this data, Σx = 62, Σy = 72, Σxy = 503, Σx² = 468 and n = 10, so
72 = 10a + 62b and 503 = 62a + 468b,
which give b = 566/836 ≈ 0.677 and a = 7.2 − 0.677 × 6.2 ≈ 3.00.
Hence the line of best fit is approximately y = 3.00 + 0.68x.
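The coefficients above can be checked numerically; a minimal sketch with NumPy (np.polyfit performs a degree-1 least-squares fit) on the same data is:

# Verifying the least-squares fit for the Example 2 data with NumPy.
import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8])
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14])

b, a = np.polyfit(x, y, deg=1)   # returns [slope, intercept] for y = a + b*x
print(f"line of best fit: y = {a:.2f} + {b:.2f}x")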
Under-fitting / Overfitting

Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of machine learning models.
The main goal of each machine learning model is to generalize well. Here, generalization refers to the ability of an ML model to provide a suitable output for previously unseen input. It means that after training on the dataset, the model can produce reliable and accurate output on new data. Hence, underfitting and overfitting are the two terms that need to be checked to judge the performance of the model and whether it is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but does
not perform well with the test dataset, then variance occurs.
Overfitting

Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
The chances of overfitting increase the more we train our model.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of overfitting can be understood from the output of a linear regression model whose curve passes through every training point.

In such a plot, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not, because the goal of the regression model is to find the best-fit line; here the curve does not generalize, so it will generate prediction errors on new data.
How to avoid the Overfitting in Model

Both overfitting and underfitting degrade the performance of the machine learning model. But the main cause of poor generalization is overfitting, so there are some ways by which we can reduce its occurrence in our model:
 Cross-Validation
 Training with more data
 Removing features
 Early stopping the training
 Regularization
 Ensembling
Underfitting

Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to which the model may not learn enough from the training data. As a result, it may fail to find the best fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it has reduced accuracy and produces unreliable predictions.
An underfitted model has high bias and low variance.
Example: Underfitting can be seen in the output of a linear regression model fitted to clearly non-linear data.

In such a plot, the model (a straight line) is unable to capture the pattern of the data points.
How to avoid underfitting:

 By increasing the training time of the model.


 By increasing the number of features.
Goodness of Fit

The term "goodness of fit" is taken from statistics, and the goal of a machine learning model is to achieve a good fit. In statistical modeling, it defines how closely the results or predicted values match the true values of the dataset.
A model with a good fit lies between an underfitted and an overfitted model; ideally, it makes predictions with zero error, but in practice this is difficult to achieve.

Cross-Validation

Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data. We can also say that it is a technique to check how a statistical model generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model; we cannot judge it based only on the training dataset. For this purpose, we reserve a particular sample of the dataset that was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.
Hence the basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Evaluate model performance using the validation set. If the model performs well with the validation set, proceed to the next step; otherwise, check for issues.
Methods used for Cross-Validation

There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach

In the validation set approach, we divide our input dataset into a training set and a test or validation set. Each subset is given 50% of the dataset.
This approach has a big disadvantage: we use only 50% of the dataset to train our model, so the model may fail to capture important information in the dataset. It also tends to give an underfitted model.
Leave-P-out cross-validation

In this approach, p data points are left out of the training data. It means that if there are n data points in the original input dataset, then n − p data points are used as the training set and the p data points as the validation set. This complete process is repeated for all possible samples, and the average error is calculated to know the effectiveness of the model.
This technique has a disadvantage: it can be computationally expensive for large p.
Leave one out cross-validation

This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of the training data. It means that for each learning set, only one data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n test sets. It has the following features:
 In this approach, the bias is minimal, as all the data points are used.
 The process is executed n times; hence execution time is high.
 This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against a single data point.
K-Fold Cross-Validation

The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size. These samples are called folds. For each learning set, the prediction function uses k − 1 folds, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than that of other methods.
The steps for k-fold cross-validation are:
 Split the input dataset into k groups
 For each group:
o Take one group as the reserve or test dataset.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate its performance using the test set.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train the model. This process continues until each fold has been used as the test fold.
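A minimal sketch of 5-fold cross-validation with scikit-learn (the model and the synthetic data are illustrative stand-ins, not part of the notes):

# 5-fold cross-validation: each fold is used exactly once as the test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # one accuracy per fold
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())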

Stratified k-fold cross-validation

This technique is similar to k-fold cross-validation with some small changes. This approach works on the concept of stratification, which is the process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices, where the price of some houses can be much higher than that of other houses. To handle such situations, a stratified k-fold cross-validation technique is useful.
Holdout Method

This method is the simplest cross-validation technique of all. In this method, we remove a subset of the training data, train the model on the remaining part of the dataset, and then use the removed subset to get prediction results.
The error that occurs in this process tells us how well our model will perform with unknown data. Although this approach is simple to perform, it still suffers from high variance, and it can sometimes produce misleading results.
Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:
 Under ideal conditions, it provides the optimum output. But for inconsistent data, it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data encountered in machine learning.
 In predictive modeling, the data evolves over a period of time, due to which there may be differences between the training set and the validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.

Applications of Cross-Validation

 This technique can be used to compare the performance of different predictive modeling methods.
 It has great scope in the medical research field.
 It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.

Lasso Regression

"LASSO" stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for the regularization of data models and feature selection.
Lasso regression is a regularization technique used over plain regression methods for more accurate prediction. This model uses shrinkage, where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e., models with fewer parameters). This particular type of regression is well suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination.
Lasso regression uses the L1 regularization technique (discussed below). It is used when we have many features, because it automatically performs feature selection.
Lasso Regularization Techniques

There are two main regularization techniques, namely Ridge Regression and Lasso Regression.
They both differ in the way they assign a penalty to the coefficients.
Regularization

Regularization is an important concept used to avoid overfitting of the data, especially when the training and test performance vary widely.
Regularization is implemented by adding a "penalty" term to the best fit derived from the training data, to achieve lower variance on the test data; it also restricts the influence of the predictor variables on the output variable by compressing their coefficients.
In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients. We can reduce the magnitude of the coefficients by using different types of regression techniques that use regularization to overcome this problem.
L1 Regularization

If a regression model uses the L1 regularization technique, it is called Lasso Regression. If it uses the L2 regularization technique, it is called Ridge Regression. We will study more about these in the later sections.
L1 regularization adds a penalty that is equal to the absolute value of the magnitude of each coefficient. This regularization type can result in sparse models with few coefficients: some coefficients may become zero and be eliminated from the model. Larger penalties result in coefficient values that are closer to zero (ideal for producing simpler models). On the other hand, L2 regularization does not result in the elimination of coefficients or in sparse models. Thus, Lasso regression is easier to interpret than Ridge regression.

Mathematical equation of Lasso Regression


In short, in linear regression, the goal is to minimize the residual sum of squares (RSS) with respect to the parameter values βi.

In LASSO, we also minimize the RSS, but augmented by a regularization term called the L1 penalty:

Minimize: Residual Sum of Squares + λ * (sum of the absolute values of the coefficients)
= RSS + λ Σ |βi|

Where,
 λ denotes the amount of shrinkage.
 λ = 0 implies all features are considered; this is equivalent to linear regression, where only the residual sum of squares is used to build the predictive model.
 λ = ∞ implies no feature is considered, i.e., as λ approaches infinity it eliminates more and more features.
 Bias increases as λ increases.
 Variance increases as λ decreases.
A minimal usage sketch follows.
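The sketch below assumes scikit-learn and an illustrative synthetic dataset; sklearn's alpha parameter plays the role of λ:

# Lasso (L1-regularised) regression: many coefficients are driven exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha is the shrinkage parameter (lambda)
print("coefficients:", np.round(lasso.coef_, 2))
print("non-zero features kept:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])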

Limitation of Lasso Regression:

 Lasso sometimes struggles with certain types of data. If the number of predictors (p) is greater than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant.
 If there are two or more highly collinear variables, then LASSO regression selects one of them more or less arbitrarily, which is not good for the interpretation of the data.

What is the difference between LASSO and ridge regression?

Notice that the difference between ridge regression and LASSO models appears to be very small. It lies only in the fact that the regularization term has a slightly different form. In fact, the expressions we minimize in these models differ only in the norm used for the penalty: in ridge regression we use the norm p = 2, while in LASSO we use the norm p = 1.

The key difference is in how they assign penalties to the coefficients:

 Ridge Regression:

o Performs L2 regularization, i.e., adds penalty equivalent to the square of the


magnitude of coefficients

 Lasso Regression:

o Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the
magnitude of coefficients

Classification-Logistic Regression

o Logistic regression is one of the most popular machine learning algorithms, and it comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving exact values of 0 and 1, it gives probabilistic values which lie between 0 and 1.
o Logistic regression is much like linear regression except for how they are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
o In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
o The curve of the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
o Logistic regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map predicted values to probabilities.
o It maps any real value into another value within the range of 0 and 1.
o The value of the logistic regression output must be between 0 and 1, and it cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the probability of either 0 or 1: values above the threshold tend to 1, and values below the threshold tend to 0. A minimal sketch of the sigmoid and thresholding is shown below.
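The sketch below uses made-up coefficients for the linear score z = b0 + b1·x, purely for illustration:

# Sigmoid maps any real value into (0, 1); predictions are thresholded at 0.5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5                                 # assumed (illustrative) coefficients
x = np.array([1.0, 2.5, 4.0])
probabilities = sigmoid(b0 + b1 * x)
predictions = (probabilities >= 0.5).astype(int)   # values above the threshold -> class 1
print(probabilities, predictions)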
Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.


o The independent variable should not have multi-collinearity.
Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
o We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + … + bnxn
o In logistic regression, y can only be between 0 and 1, so let's divide the above equation by (1 − y):
y / (1 − y); which is 0 for y = 0 and infinity for y = 1
o But we need a range between −[infinity] and +[infinity], so taking the logarithm of the equation, it becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn
The above equation is the final equation for logistic regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cats", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".

Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:

 Data Pre-processing step


 Fitting Logistic Regression to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result

1. Data Pre-processing step:

In this step, we will pre-process/prepare the data so that we can use it in our code efficiently. It will
be the same as we have done in Data pre-processing topic.

#Data Pre-processing Step

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing the dataset
data_set= pd.read_csv('user_data.csv')

#extracting independent and dependent variables (column indices assumed for user_data.csv)
x= data_set.iloc[:, [2, 3]].values
y= data_set.iloc[:, 4].values

#splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.25, random_state=0)

2. Fitting Logistic Regression to the Training set:

We have prepared our dataset, and now we will train the model using the training set. To fit the model to the training set, we will import the LogisticRegression class from the sklearn library.
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create it, we need to import the confusion_matrix function from the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two main parameters, y_true (the actual values) and y_pred (the predicted values returned by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Gradient Linear Models/ Gradient Descent in Linear Regression

A linear regression model attempts to explain the relationship between a dependent variable (output variable) and one or more independent variables (predictor variables) using a straight line.

The first step in finding a linear regression equation is to determine whether there is a relationship between the two variables. We can do this by using the correlation coefficient and a scatter plot. When the correlation coefficient shows that the data is likely to be able to predict future outcomes, and a scatter plot of the data appears to form a straight line, we can use simple linear regression to find a predictive function. Let us consider an example.
Suppose a scatter plot of Sales against Marketing spend shows a linear relationship between them. The next step is to find a straight line between Sales and Marketing that explains the relationship between them. But there can be multiple lines that pass through these points.

So how do we know which of these lines is the best-fit line?

Cost Function

The cost is the error in our predicted value. We will use the Mean Squared Error function to calculate the
cost.
Our goal is to minimize the cost as much as possible in order to find the best fit line. For that, we will
use Gradient Descent Algorithm.

Gradient Descent Algorithm

Gradient descent is an algorithm that finds the best-fit line for a given training dataset in a small number of iterations.

For some combination of m and c, we will get the least error (MSE). That combination of m and c will give us our best-fit line.

The algorithm starts with some values of m and c (usually m = 0, c = 0). We calculate the MSE (cost) at the point m = 0, c = 0. Say the MSE (cost) at m = 0, c = 0 is 100. Then we change the values of m and c by a small amount (a learning step) and notice a decrease in the MSE (cost). We continue doing the same until our loss function is a very small value, or ideally 0 (which means 0 error, or 100% accuracy).

Step by Step Algorithm:

1. Let m = 0 and c = 0. Let L be our learning rate; it could be a small value like 0.01. The learning rate controls how far m and c move along the gradient in each step of gradient descent. Setting it too high makes the path unstable, setting it too low makes convergence slow, and setting it to zero means the model learns nothing from the gradients.

2. Calculate the partial derivative of the cost function with respect to m; call it Dm (how much the cost function changes for a small change in m). Similarly, find the partial derivative with respect to c; call it Dc (how much the cost function changes for a small change in c). For the MSE cost these are
Dm = (−2/n) Σ xi (yi − ŷi) and Dc = (−2/n) Σ (yi − ŷi), where ŷi = m·xi + c.

3. Now update the current values of m and c using the following equations:
m = m − L · Dm
c = c − L · Dc

4. We repeat this process until our cost function is very small (ideally 0).

The gradient descent algorithm gives optimum values of m and c for the linear regression equation. With these values of m and c, we get the equation of the best-fit line and are ready to make predictions. A minimal sketch of this loop is given below.
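The data, learning rate, and iteration count in this sketch are illustrative assumptions:

# Gradient descent for y = m*x + c with the MSE cost described above.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   # illustrative data (true line: y = 2x + 1)

m, c = 0.0, 0.0      # step 1: start with m = 0, c = 0
L = 0.01             # learning rate
n = len(x)

for _ in range(10000):
    y_pred = m * x + c
    Dm = (-2.0 / n) * np.sum(x * (y - y_pred))   # partial derivative w.r.t. m
    Dc = (-2.0 / n) * np.sum(y - y_pred)         # partial derivative w.r.t. c
    m = m - L * Dm                               # step 3: update m and c
    c = c - L * Dc

print("m ≈", round(m, 3), " c ≈", round(c, 3))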

Support Vector Machine

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms, used for classification as well as regression problems. However, it is primarily used for classification problems in machine learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put new data points in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. The two different categories are separated by the decision boundary, or hyperplane, with the support vectors lying closest to it.

Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue.

Since this is a 2-D space, we can separate these two classes using a straight line. But there can be multiple lines that separate these classes.

The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called the hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line.
To separate such data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space becomes a 3-D space in which the two classes can be separated.
In this 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, the boundary becomes:
x² + y² = 1
Hence we get a circumference of radius 1 in the case of non-linear data.


In a typical worked example (classifying whether users purchased an SUV), the trained SVM classifier divides the feature space into two regions, Purchased and Not Purchased, with the purchasers' points falling on one side of the hyperplane and the non-purchasers' points on the other.

Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and
regression tasks. Here’s a detailed description of the SVM algorithm for binary classification:

1. Problem Formulation
Given a set of training examples {(x1, y1), (x2, y2), …, (xn, yn)}, where each xᵢ is a feature vector and each label yᵢ ∈ {−1, +1},

the goal is to find a decision boundary (a hyperplane) that maximizes the margin between the two classes.

2. Define the Hyperplane

A hyperplane can be defined as wᵀx + b = 0, where w is the weight vector (normal to the hyperplane), and b is the bias term. For a correctly classified training point, the equation can be rewritten as the constraint yᵢ(wᵀxᵢ + b) ≥ 1.

3. Maximize the Margin

The margin is the perpendicular distance between the closest data points and the hyperplane, given by 2 / ||w||. To maximize the margin, we minimize ||w||² / 2, subject to the constraints yᵢ(wᵀxᵢ + b) ≥ 1 for all i = 1, …, n.

4. Lagrange Multipliers

Introduce Lagrange multipliers αᵢ ≥ 0 for each constraint, and form the Lagrangian:
L(w, b, α) = ||w||² / 2 − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1]

5. Find the Saddle Point

To find the saddle point of L(w, b, α), compute the gradients with respect to w and b and set them to zero:
∂L/∂w = 0 ⟹ w = Σᵢ αᵢ yᵢ xᵢ
∂L/∂b = 0 ⟹ Σᵢ αᵢ yᵢ = 0

6. Dual Formulation
Substitute these expressions back into the Lagrangian to obtain the dual problem:
maximize Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ), subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0.

7. Solve the Quadratic Programming Problem

The dual problem is a convex quadratic programming problem. Solve it using optimization techniques
such as the Sequential Minimal Optimization (SMO) algorithm, gradient ascent, or specialized quadratic
programming solvers.

8. Obtain w and b

Once you have the optimal αᵢ values, compute the weight vector w:
w = Σᵢ αᵢ yᵢ xᵢ

To find the bias term b, use any support vector (xₛ, yₛ) with αₛ > 0:
b = yₛ − wᵀxₛ

9. Make Predictions
For a new data point x, the predicted class label ŷ can be calculated using:
ŷ = sign(wᵀx + b)

The sign of the result determines the class: if the result is positive, the predicted class is 1, and if the result is negative, the predicted class is −1.

10. Kernel Trick (optional)

For non-linearly separable data, you can use the kernel trick to map the data to a higher-dimensional
space where it becomes linearly separable. Replace the dot product (xᵢᵀxⱼ) in the dual problem with a
kernel function K(xᵢ, xⱼ) that computes the dot product in the higher-dimensional space:

Common kernel functions include the linear kernel (K(xᵢ, xⱼ) = xᵢᵀxⱼ), polynomial kernel (K(xᵢ, xⱼ) =
(xᵢᵀxⱼ + c)ᵈ), and radial basis function (RBF) or Gaussian kernel (K(xᵢ, xⱼ) = exp(-γ||xᵢ — xⱼ||²)).
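As a minimal sketch (plain NumPy; the hyperparameter values c, d and gamma are illustrative), the kernel functions listed above can be written as:

# NumPy versions of the linear, polynomial and RBF kernels for two vectors.
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, c=1.0, d=3):
    return (np.dot(x1, x2) + c) ** d

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), rbf_kernel(a, b))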
11. Make Predictions with Kernels

For a new data point x, the predicted class label ŷ can be calculated using the kernel function:
ŷ = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)

Kernel Methods

Higher-Dimensional Feature Space: By applying a kernel function, the data is transformed into a new, higher-dimensional space where it may become linearly separable. In this new feature space, SVM can find a linear hyperplane that effectively separates the classes, even though the data appeared non-linear in the original space. For example, when red and green points are interleaved and randomly distributed, it is very difficult to solve the classification with a linear classifier, as there is no good straight line that can separate the red and the green points. Here comes the use of a kernel function, which takes the points to a higher dimension, solves the problem there, and returns the output.

The key idea of SVMs is that we do not need to explicitly compute the mapping to the higher-dimensional feature space. Instead, the kernel function computes the similarity between data points in the higher-dimensional space without having to directly compute the coordinates of each point in that space. This allows SVMs to handle complex, non-linear relationships between features while maintaining computational efficiency.

Linear Kernel

Let us say that we have two vectors named x1 and x2; then the linear kernel is defined by the dot product of these two vectors:
K(x1, x2) = x1 · x2

1. The linear kernel is the simplest and most straightforward kernel function.
2. This kernel is used when the data is already linearly separable. It effectively means that no
transformation is applied to the data.
3. Advantages:
 Simple and fast to compute.
 Effective for linearly separable data.
4. Disadvantages:
 Not suitable for complex, non-linear data.

Polynomial Kernel

A polynomial kernel is defined by the following equation:

K(x1, x2) = (x1 · x2 + 1)^d,
Where,

d is the degree of the polynomial and x1 and x2 are vectors

1. The polynomial kernel allows for more complex decision boundaries by adding polynomial features to the data; it is defined as above.
2. This kernel can capture interactions between features up to a certain degree.
3. Advantages:
 Can model interactions between features.
 Suitable for non-linearly separable data.
4. Disadvantages:
 Computationally more expensive than the linear kernel.
 Risk of overfitting with high-degree polynomials.

Radial Basis Function (RBF) Kernel

This kernel is an example of a radial basis function kernel. Below is the equation for it:

K(x1, x2) = exp(−||x1 − x2||² / (2σ²))

The parameter σ plays a very important role in the performance of the Gaussian kernel and should be neither overestimated nor underestimated; it should be carefully tuned according to the problem.

1. The RBF kernel, also known as the Gaussian kernel, is a popular choice due to its flexibility; it is defined as above.
2. This kernel can handle very complex and non-linear relationships.
3. Advantages:
 Can handle a wide range of data distributions.
 Effective in high-dimensional spaces.
4. Disadvantages:
 Requires careful tuning of the σ parameter.
 Can be computationally expensive with large datasets.
Sigmoid Kernel

This kernel is used in neural-network-like areas of machine learning. The activation function for the sigmoid kernel is the bipolar sigmoid (hyperbolic tangent) function. The equation for this kernel is
K(x1, x2) = tanh(α · x1ᵀx2 + c), where α and c are kernel parameters.
Advantages:

 Can be used to model relationships similar to those found in neural networks.

 Simple to implement.

Disadvantages:

 Less commonly used compared to other kernels.

 Can be less effective for certain types of data.

Choosing the Right Kernel

Selecting the appropriate kernel for your SVM model depends on several factors:

 Data Complexity: For linearly separable data, the linear kernel is sufficient. For more complex
data, consider polynomial or RBF kernels.

 Computational Resources: RBF and polynomial kernels are computationally more intensive
than the linear kernel. Ensure that your computational resources can handle the increased
complexity.

 Model Performance: Experiment with different kernels and use cross-validation to determine
which kernel yields the best performance for your specific problem.
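A minimal sketch of that experiment (scikit-learn's SVC on an illustrative non-linear toy dataset; names and parameters are assumptions, not from the notes):

# Comparing SVM kernels with 5-fold cross-validation on a non-linear dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, "kernel: mean CV accuracy =", round(scores.mean(), 3))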

Instance based Methods

Machine learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. It is called instance-based because it builds its hypotheses from the training instances. It is also known as memory-based learning or lazy learning. The time complexity of this approach depends on the size of the training data; the worst-case time complexity is O(n), where n is the number of training instances.

For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails that are already marked as spam, our spam filter would be programmed to also flag emails that are very similar to them. This requires a measure of resemblance between two emails. A similarity measure between two emails could be the same sender, the repetitive use of the same keywords, or something else.
Advantages:

1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt to new data easily, one which is collected as we go.

Disadvantages:

1. Classification costs are high

2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch

K Nearest Neighbor

o K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
o K-NN can be used for regression as well as for classification, but it is mostly used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
o The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find features of the new image similar to those of the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1.
Which of these categories will this data point lie in? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data
point. Consider the below diagram:
How does K-NN work?

o Step-1: Select the number K of the neighbors


o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
Step 1: Selecting the optimal value of K

o Firstly, we will choose the number of neighbors; here we choose K = 5.

Step 2: Calculating distance

o To measure the similarity between the target and training data points, the Euclidean distance is used.
The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors

The k data points with the smallest distances to the target point are the nearest neighbors. By calculating
the Euclidean distance we obtain the nearest neighbors: three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:

o As the majority (3 of the 5) nearest neighbors are from category A, this new data point must
belong to category A.

Step 4 Voting for Classification or Taking Average for Regression

 When you want to classify a data point into a category (like spam or not spam), the K-NN
algorithm looks at the K closest points in the dataset. These closest points are called
neighbors. The algorithm then looks at which category the neighbors belong to and picks the
one that appears the most. This is called majority voting.
 In regression, the algorithm still looks for the K closest points. But instead of voting for a
class as in classification, it takes the average of the values of those K neighbors. This average is
the predicted value for the new point, as illustrated in the sketch below.
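A minimal from-scratch sketch of the steps above (Python). The helper names, the toy points, and the choice of K are illustrative assumptions, not part of the original notes.

import math
from collections import Counter

def euclidean(p, q):
    # Step 2: straight-line distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train_points, train_labels, query, k=5):
    # Steps 3-4: sort by distance and keep the K nearest neighbours
    ranked = sorted(zip((euclidean(p, query) for p in train_points), train_labels))
    neighbours = [label for _, label in ranked[:k]]
    # Step 5: majority vote among the K neighbours
    return Counter(neighbours).most_common(1)[0][0]

def knn_regress(train_points, train_values, query, k=5):
    ranked = sorted(zip((euclidean(p, query) for p in train_points), train_values))
    nearest = [value for _, value in ranked[:k]]
    # Regression: average of the K nearest target values
    return sum(nearest) / k

# Illustrative usage with five labelled points
points = [(1, 1), (2, 1), (3, 4), (6, 5), (7, 7)]
labels = ["A", "A", "A", "B", "B"]
print(knn_classify(points, labels, query=(3, 3), k=3))   # -> "A"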

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:
 There is no particular way to determine the best value for “K”, so we need to try some
values to find the best out of them. The most preferred value for K is 5.
 A very low value for K such as K=1 or K=2 can be noisy and lead to the effects of outliers
in the model.
 Larger values for K produce smoother decision boundaries, but if K is too large the model may
underfit and mix in points from other classes.

Advantages of KNN Algorithm:

 Easy to implement- The K-NN algorithm is easy to implement because its complexity is
relatively low as compared to other machine learning algorithms.

 Easily Adaptable- K-NN stores all data in memory, so when new data points are added, it
automatically adjusts and uses the new data for future predictions.

 Few Hyperparameters – The only parameters required when training a KNN algorithm are the
value of K and the choice of the distance metric.

Disadvantages of KNN Algorithm:


 Doesn’t scale well – K-NN is considered a “lazy” algorithm, meaning it requires a lot of
computing power and memory. This makes it slow, especially with large datasets.
 Curse of Dimensionality – When the number of features increases, K-NN struggles to classify
data accurately, a problem known as the curse of dimensionality: distances become less
meaningful, so the algorithm has a hard time classifying data points properly when the
dimensionality is too high.
 Prone to Overfitting – Because the algorithm is affected by the curse of dimensionality, it is
also prone to the problem of overfitting.

Distance Metrics Used in KNN Algorithm


KNN uses distance metrics to identify nearest neighbour, these neighbours are used for classification
and regression task. To identify nearest neighbour we use below distance metrics:

1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space. You
can think of it like the shortest path you would walk if you were to go directly from one point to another.

2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical lines
(like a grid or city streets). It’s also called “taxicab distance” because a taxi can only drive along the
grid-like streets of a city.

3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan
distances as special cases.
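The three metrics can be written as one small helper, since Minkowski distance with p = 1 gives Manhattan distance and p = 2 gives Euclidean distance. A brief sketch (the function name is my own):

def minkowski(a, b, p=2):
    # p = 1 -> Manhattan (grid/taxicab) distance, p = 2 -> Euclidean (straight-line) distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski((0, 0), (3, 4), p=2))   # Euclidean: 5.0
print(minkowski((0, 0), (3, 4), p=1))   # Manhattan: 7.0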

Applications of the KNN Algorithm


Here are some real life applications of KNN Algorithm.
 Recommendation Systems: Many recommendation systems, such as those used by Netflix or
Amazon, rely on KNN to suggest products or content. KNN looks at user behavior and finds
similar users. If user A and user B have similar preferences, KNN might recommend movies
that user A liked to user B.
 Spam Detection: KNN is widely used in filtering spam emails. By comparing the features of a
new email with those of previously labeled spam and non-spam emails, KNN can predict
whether a new email is spam or not.
 Customer Segmentation: In marketing firms, KNN is used to segment customers based on their
purchasing behavior. By comparing new customers to existing customers, KNN can easily
group customers into segments with similar choices and preferences. This helps businesses
target the right customers with the right products or advertisements.
 Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken
words into text. The algorithm compares the features of the spoken input with those of known
speech patterns. It then predicts the most likely word or command based on the closest matches.

Example
The table represents our data set. We have two columns — Brightness and Saturation. Each row in the
table has a class of either Red or Blue.

How to Calculate Euclidean Distance in the K-Nearest Neighbors Algorithm

Here's the new data entry:

Brightness - 20

Saturation - 35

Class- ?
Let's rearrange the distances in ascending order: Since we chose 5 as the value of K, we'll only
consider the first five rows. That is:

As you can see above, the majority class within the 5 nearest neighbors to the new entry is Red.
Therefore, we'll classify the new entry as Red.
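The original distance table is not reproduced in these notes, so the rows below are hypothetical stand-ins used only to illustrate the calculation for the new entry (Brightness = 20, Saturation = 35); with these made-up rows the majority of the 5 nearest neighbours also happens to be Red.

import math
from collections import Counter

# Hypothetical (Brightness, Saturation, Class) rows -- not the original table
data = [
    (40, 20, "Red"), (50, 50, "Blue"), (60, 90, "Blue"),
    (10, 25, "Red"), (70, 70, "Blue"), (60, 10, "Red"),
    (25, 80, "Blue"),
]
new_point = (20, 35)

# Euclidean distance from every row to the new entry, sorted ascending
ranked = sorted((math.dist((b, s), new_point), label) for b, s, label in data)

k = 5
nearest_labels = [label for _, label in ranked[:k]]
print(nearest_labels)
print(Counter(nearest_labels).most_common(1)[0][0])   # majority class -> "Red" for these rows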
Tree based methods- decision Tree

o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why Use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model. Below
are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A subtree formed by splitting a node of the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: A node that splits into sub-nodes is called the parent node of those sub-nodes, and
the sub-nodes are called its child nodes.

How does the Decision Tree Algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
move further. It continues the process until it reaches the leaf node of the tree. The complete process
can be better understood using the below algorithm:

Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes, and
call the final node a leaf node.

Attribute Selection Measures

While implementing a Decision tree, the main issue arises that how to select the best attribute for the
root node and for sub-nodes. So, to solve such problems there is a technique which is called as
Attribute selection measure or ASM. By this measurement, we can easily select the best attribute for
the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index

Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

Information Gain= Entropy(S)- [(Weighted Avg) *Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:

Entropy(S) = –P(yes) log2 P(yes) – P(no) log2 P(no)


Where,
o S = the current set of samples
o P(yes)= probability of yes
o P(no)= probability of no
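A small sketch of the two formulas above (Python). The class counts in the example split are illustrative, not taken from a specific dataset.

import math

def entropy(labels):
    # Entropy(S) = -sum over classes of p * log2(p)
    total = len(labels)
    probabilities = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(parent_labels, subsets):
    # IG = Entropy(S) - weighted average entropy of the subsets produced by the split
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Illustrative split: 9 "yes" / 5 "no" samples divided into two subsets by some attribute
parent = ["yes"] * 9 + ["no"] * 5
split = [["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3]
print(round(entropy(parent), 3))                  # about 0.940
print(round(information_gain(parent, split), 3))  # positive -> the split reduces impurity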

Gini Index:

o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred as compared to one with a high Gini index.
o The CART algorithm uses the Gini index to create binary splits only.
o Gini index can be calculated using the below formula:
Gini Index = 1 – Σj (Pj)²
Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the important
features of the dataset. Therefore, a technique that decreases the size of the learning tree without
reducing accuracy is known as Pruning. There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follows while
making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

ID3
The ID3 algorithm begins with the original set S as the root node. On each iteration of the algorithm,
it iterates through every unused attribute of the set S and calculates the entropy H(S) or the
information gain IG(S) of that attribute. It then selects the attribute which has the smallest entropy
(or largest information gain) value. The set S is then split or partitioned by the selected attribute to
produce subsets of the data. (For example, a node can be split into child nodes based upon the
subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.)
The algorithm continues to recurse on each subset, considering only attributes never selected before.

Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we take
each of the features and calculate the information gain for each feature. From these calculations, the
information gain is maximum when we split on feature Y. So, for the root node the best-suited feature is
feature Y. Now we can see that while splitting the dataset by feature Y, each child contains a pure subset
of the target variable, so we do not need to split the dataset further. The final tree for the above dataset
would look like this:

2)

For the set X = {a,a,a,b,b,b,b,b}

Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy(X) = –(3/8) log2(3/8) – (5/8) log2(5/8) ≈ 0.954
CART

CART (Classification and Regression Trees) is a variation of the decision tree algorithm. It can handle
both classification and regression tasks.

CART Algorithm

Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from labelled data to
predict unseen data.

 Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.

 Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion: the lower the Gini
impurity, the purer the subset. For regression tasks, CART uses the reduction in residual
(squared) error as the splitting criterion: the split that most reduces the residual error gives the
best fit of the model to the data.

 Pruning: To prevent overfitting of the data, pruning is a technique used to remove the nodes
that contribute little to the model accuracy. Cost complexity pruning and information gain
pruning are two popular pruning techniques. Cost complexity pruning assigns each subtree a
cost that weighs its error against its size (via a complexity parameter) and removes subtrees
whose accuracy gain does not justify the added complexity. Information gain pruning removes
nodes that have a low information gain.

How does CART algorithm works?

The CART algorithm works via the following process:

 The best-split point of each input is obtained.

 Based on the best-split points of each input in Step 1, the new “best” split point is identified.

 Split the chosen input according to the “best” split point.

 Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.

CART algorithm uses Gini Impurity to split the dataset into a decision tree. It does that by searching for
the best homogeneity for the sub nodes, with the help of the Gini index criterion.

Gini index/Gini impurity

The Gini index is a metric for the classification tasks in CART. It stores the sum of squared probabilities
of each class. It computes the degree of probability of a specific variable that is wrongly being classified
when chosen randomly and a variation of the Gini coefficient. It works on categorical variables,
provides outcomes either “successful” or “failure” and hence conducts binary splitting only.
The degree of the Gini index varies from 0 to 1,

 Where 0 depicts that all the elements belong to a certain class, or only one class exists there.

 Gini index close to 1 means a high level of impurity, where each class contains a very small
fraction of elements, and

 A value of 1-1/n occurs when the elements are uniformly distributed into n classes and each
class has an equal probability of 1/n. For example, with two classes, the Gini impurity is 1 – 1/2
= 0.5.

 Mathematically, we can write Gini impurity as Gini = 1 – Σi (pi)², where pi is the probability of
class i in the node.

 In conclusion, Gini impurity is the probability of misclassification, assuming independent
selection of the element and its class based on the class probabilities.
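A brief sketch of the Gini impurity of a node and the weighted Gini of a candidate split (the splitting criterion described above); the class labels used are illustrative.

def gini(labels):
    # Gini = 1 - sum of squared class probabilities in the node
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_of_split(subsets):
    # Weighted Gini impurity of the subsets produced by a candidate split;
    # CART prefers the split with the lowest value
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

left = ["yes", "yes", "no"]            # illustrative left child
right = ["no", "no", "no", "yes"]      # illustrative right child
print(round(gini(left + right), 3))          # impurity of the parent node (~0.49)
print(round(gini_of_split([left, right]), 3))  # impurity after the split (~0.405)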

CART for Classification

A classification tree is an algorithm where the target variable is categorical. The algorithm is then used
to identify the “Class” within which the target variable is most likely to fall. Classification trees are used
when the dataset needs to be split into classes that belong to the response variable (like yes or no)

For classification in decision tree learning algorithm that creates a tree-like structure to predict class
labels. The tree consists of nodes, which represent different decision points, and branches, which
represent the possible result of those decisions. Predicted class labels are present at each leaf node of the
tree.

How Does CART for Classification Work?

CART for classification works by recursively splitting the training data into smaller and smaller subsets
based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each
subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks,
CART uses Gini impurity

 Gini Impurity- Gini impurity measures the probability of misclassifying a random instance
from a subset labeled according to the majority class. Lower Gini impurity means more purity
of the subset.

 Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses
the one that best decreases the Gini impurity of the resultant subsets. This process continues
until a stopping criterion is reached, like a maximum tree depth or a minimum number of
instances in a leaf node.

CART for Regression

A Regression tree is an algorithm where the target variable is continuous and the tree is used to predict
its value. Regression trees are used when the response variable is continuous. For example, if the
response variable is the temperature of the day.
CART for regression is a decision tree learning method that creates a tree-like structure to predict
continuous target variables. The tree consists of nodes that represent different decision points and
branches that represent the possible outcomes of those decisions. Predicted values for the target variable
are stored in each leaf node of the tree.

How Does CART works for Regression?

Regression CART works by splitting the training data recursively into smaller subsets based on specific
criteria. The objective is to split the data in a way that minimizes the residual (squared) error within each
subset.

 Residual Reduction- Residual reduction is a measure of how much the average squared
difference between the predicted values and the actual values for the target variable is reduced
by splitting the subset. The greater the residual reduction, the better the split fits the data.

 Splitting Criteria- CART evaluates every possible split at each node and selects the one that
results in the greatest reduction of residual error in the resulting subsets. This process is repeated
until a stopping criterion is met, such as reaching the maximum tree depth or having too few
instances in a leaf node.
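A minimal sketch of scoring a regression split by the reduction in squared error; this only illustrates the idea of residual reduction, not the full CART procedure, and the target values are made up.

def sse(values):
    # Sum of squared differences from the mean (a leaf predicts the mean of its values)
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def residual_reduction(parent, left, right):
    # How much the candidate split lowers the squared error of the parent node;
    # CART prefers the split with the greatest reduction
    return sse(parent) - (sse(left) + sse(right))

y = [10, 12, 11, 30, 32, 31]      # illustrative continuous targets
left, right = y[:3], y[3:]        # one candidate split
print(residual_reduction(y, left, right))   # large value -> a good split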

POPULAR CART-BASED ALGORITHMS:

 CART (Classification and Regression Trees): The original algorithm that uses binary splits to
build decision trees.

 C4.5 and C5.0: Quinlan's decision-tree algorithms (successors of ID3 and close relatives of
CART) that allow for multiway splits and handle categorical variables more effectively.

 Random Forests: Ensemble methods that use multiple decision trees (often CART) to improve
predictive performance and reduce overfitting.

 Gradient Boosting Machines (GBM): Boosting algorithms that also use decision trees (often
CART) as base learners, sequentially improving model performance.

Advantages of CART

 Results are simple to interpret.

 Classification and regression trees are Nonparametric and Nonlinear.

 Classification and regression trees implicitly perform feature selection.

 Outliers have little effect on CART.

 It requires minimal supervision and produces easy-to-understand models.

Limitations of CART

 Overfitting.

 High variance.

 Low bias (which accompanies the high variance).

 The tree structure may be unstable: small changes in the data can produce a very different tree.


Applications of the CART algorithm

 For quick Data insights.

 In Blood Donors Classification.

 For environmental and ecological data.

 In the financial sectors

Gini index

Gini index is a metric for classification tasks in CART. It stores the sum of squared probabilities of each
class. We can formulate it as: Gini = 1 – Σi (pi)², where pi is the probability of class i.

Outlook

Outlook is a nominal feature. It can be sunny, overcast or rain. The final decisions for the outlook
feature are summarized below.
Temperature

Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild.
Let’s summarize decisions for temperature feature.

Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439

Humidity

Humidity is a binary class feature. It can be high or normal.


Wind

Wind is a binary class similar to humidity. It can be weak and strong.

Time to decide

We’ve calculated the gini index values for each feature. The winner is the outlook feature because its
cost is the lowest.
You might notice that the sub-dataset in the overcast leaf has only yes decisions. This means that the
overcast branch is complete (it is a pure leaf).

We will apply the same principles to the other sub-datasets in the following steps.

Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature,
humidity and wind features respectively.

Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0

Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0

Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5

Gini(Outlook=Sunny and Temp.) = (2/5)x0 + (1/5)x0 + (2/5)x0.5 = 0.2


Gini of humidity for sunny outlook

Humidity | Yes | No | Number of instances
High     | 0   | 3  | 3
Normal   | 2   | 0  | 2

Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0

Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0

Gini(Outlook=Sunny and Humidity) = (3/5)x0 + (2/5)x0 = 0

Gini of wind for sunny outlook

Wind   | Yes | No | Number of instances
Weak   | 1   | 2  | 3
Strong | 1   | 1  | 2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444

Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5

Gini(Outlook=Sunny and Wind) = (3/5)x0.444 + (2/5)x0.5 = 0.266 + 0.2 = 0.466

Decision for sunny outlook

We’ve calculated the gini index scores for each feature when the outlook is sunny. The winner is
humidity because it has the lowest value.

Feature     | Gini index
Temperature | 0.2
Humidity    | 0
Wind        | 0.466
As seen, decision is always no for high humidity and sunny outlook. On the other hand, decision will
always be yes for normal humidity and sunny outlook. This branch is over.

Now, we need to focus on rain outlook.

Rain outlook

Day | Outlook | Temp. | Humidity | Wind   | Decision
4   | Rain    | Mild  | High     | Weak   | Yes
5   | Rain    | Cool  | Normal   | Weak   | Yes
6   | Rain    | Cool  | Normal   | Strong | No
10  | Rain    | Mild  | Normal   | Weak   | Yes
14  | Rain    | Mild  | High     | Strong | No
Gini of temperature for rain outlook

Temperature | Yes | No | Number of instances
Cool        | 1   | 1  | 2
Mild        | 2   | 1  | 3
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5

Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444

Gini(Outlook=Rain and Temp.) = (2/5)x0.5 + (3/5)x0.444 = 0.466

Gini of humidity for rain outlook

Humidity | Yes | No | Number of instances
High     | 1   | 1  | 2
Normal   | 2   | 1  | 3
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5

Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444

Gini(Outlook=Rain and Humidity) = (2/5)x0.5 + (3/5)x0.444 = 0.466

Gini of wind for rain outlook

Wind   | Yes | No | Number of instances
Weak   | 3   | 0  | 3
Strong | 0   | 2  | 2
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0

Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0

Gini(Outlook=Rain and Wind) = (3/5)x0 + (2/5)x0 = 0

Decision for rain outlook

The winner is the wind feature for the rain outlook because it has the minimum gini index score among the features.

Feature     | Gini index
Temperature | 0.466
Humidity    | 0.466
Wind        | 0
Place the wind feature on the rain outlook branch and examine the new sub-datasets.

As seen, the decision is always yes when the wind is weak. On the other hand, the decision is always no
when the wind is strong. This means that this branch is complete, and the tree is finished.
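The gini calculations above can be reproduced with a short script. The sunny-outlook rows below follow the standard play-tennis example this walkthrough uses; since the full table is not reproduced in these notes, treat them as an assumption.

# Sunny-outlook subset of the play-tennis example (assumed standard rows)
sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Decision": "No"},
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Decision": "No"},
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Decision": "No"},
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Decision": "Yes"},
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Decision": "Yes"},
]

def gini(rows):
    # Gini = 1 - sum of squared class probabilities of the decisions in `rows`
    labels = [r["Decision"] for r in rows]
    return 1 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))

def weighted_gini(rows, feature):
    # Weighted Gini impurity after splitting `rows` on `feature`
    total = len(rows)
    values = {r[feature] for r in rows}
    return sum(
        len(subset) / total * gini(subset)
        for subset in ([r for r in rows if r[feature] == v] for v in values)
    )

for feature in ("Temp", "Humidity", "Wind"):
    print(feature, round(weighted_gini(sunny, feature), 3))
# Temp ~0.2, Humidity 0.0, Wind ~0.467 (0.466 in the text due to rounding) -> Humidity wins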

Ensemble methods- Random Forest

Ensemble learning stands out as a powerful technique in machine learning, offering a robust approach to
improving model performance and predictive accuracy. By combining the strengths of multiple
individual models, ensemble methods can often outperform any single model, making them a valuable
part of the machine learning toolkit. This section covers the main ensemble techniques and algorithms,
and some of their real-world applications.
What Is Ensemble Learning?

Ensemble learning refers to a machine learning approach where several models are trained to address a
common problem, and their predictions are combined to enhance the overall performance. The idea
behind ensemble learning is that by combining multiple models, each with its strengths and weaknesses,
the ensemble can achieve better results than any single model alone. Ensemble learning can be applied
to various machine learning tasks, including classification, regression, and clustering. Some common
ensemble learning methods include bagging, boosting, and stacking.

Ensemble Techniques

Ensemble techniques in machine learning involve combining multiple models to improve performance.
One common ensemble technique is bagging, which uses bootstrap sampling to create multiple datasets
from the original data and trains a model on each dataset. Another technique is boosting, which trains
models sequentially, each focusing on the previous models' mistakes. Random forests are a popular
ensemble method that uses decision trees as base learners and combines their predictions to make a final
prediction. Ensemble techniques are effective because they reduce overfitting and improve
generalization, leading to more robust models.

Simple Ensemble Techniques

Simple ensemble techniques combine predictions from multiple models to produce a final prediction.
These techniques are straightforward to implement and can often improve performance compared to
individual models.

Max Voting

In this technique, the final prediction is the most frequent prediction among the base models. For
example, if three base models predict the classes A, B, and A for a given sample, the final prediction
using max voting would be class A, as it appears more frequently.

Averaging

Averaging involves taking the average of predictions from multiple models. This can be particularly
useful for regression problems, where the final prediction is the mean of predictions from all models.
For classification, averaging can be applied to the predicted probabilities for a more confident
prediction.

Weighted Averaging

Weighted averaging is similar, but each model's prediction is given a different weight. The weights can
be assigned based on each model's performance on a validation set or tuned using grid or randomized
search techniques. This allows models with higher performance to have a greater influence on the final
prediction.
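A minimal sketch of max voting, averaging, and weighted averaging on hypothetical model outputs (the predictions and weights are made up for illustration):

from collections import Counter

# Max voting: three hypothetical classifiers predict a class for one sample
class_predictions = ["A", "B", "A"]
print(Counter(class_predictions).most_common(1)[0][0])    # -> "A"

# Averaging: three hypothetical regressors predict a value for one sample
value_predictions = [4.2, 3.8, 4.0]
print(sum(value_predictions) / len(value_predictions))    # -> 4.0

# Weighted averaging: better-performing models receive larger weights
weights = [0.5, 0.2, 0.3]
print(sum(w * p for w, p in zip(weights, value_predictions)))   # about 4.06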

Advanced Ensemble Techniques

Advanced ensemble techniques go beyond basic methods like bagging and boosting to enhance model
performance further. Here are explanations of stacking, blending, bagging, and boosting:
Stacking

Stacking, or stacked generalization, combines multiple base models with a meta-model to make
predictions.

Instead of using simple methods like averaging or voting, stacking trains a meta-model to learn how to
combine the base models' predictions best.

The base models can be diverse to capture different aspects of the data, and the meta-model learns to
weight its predictions based on its performance.

Blending

Blending is similar to stacking but more straightforward.

Instead of a meta-model, blending uses a simple method like averaging or a linear model to combine the
predictions of the base models.

Blending is often used in competitions where simplicity and efficiency are important.

Bagging (Bootstrap Aggregating)

Bagging is a technique where multiple subsets of the dataset are created through bootstrapping
(sampling with replacement).

A base model (often a decision tree) is trained on each subset, and the final prediction is the average (for
regression) or majority vote (for classification) of the individual predictions.

Bagging helps reduce variance and overfitting, especially for unstable models.

Boosting

Boosting is an ensemble technique where base models are trained sequentially, with each subsequent
model focusing on the mistakes of the previous ones.

The final prediction is a weighted sum of the individual models' predictions, with higher weights given
to more accurate models.

Boosting algorithms like AdaBoost, Gradient Boosting, and XGBoost are popular because they improve
model performance.
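A hedged scikit-learn sketch comparing bagging and boosting with their default decision-tree base learners; the library, dataset, and parameter choices are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: trees trained independently on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: learners trained sequentially, each focusing on earlier mistakes
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))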

Bagging and Boosting Algorithms

Random Forest

 Random Forest is a technique in ensemble learning that utilizes a decision tree group to make
predictions.

 The key concept behind Random Forest is introducing randomness in tree-building to create
diverse trees.
 To create each tree, a random subset of the training data is sampled (with replacement), and a
decision tree is trained on this subset.

 Additionally, rather than considering all features, a random subset of features is selected at each
tree node to determine the best split.

 The final prediction of the Random Forest is made by aggregating the predictions of all the
individual trees (e.g., averaging for regression, majority voting for classification).

 Random Forests are robust against overfitting and perform well on many datasets. Compared to
individual decision trees, they are also less sensitive to hyperparameters.

Working of Random Forest Algorithm

The following steps explain the working Random Forest Algorithm:

Step 1: Select random samples from a given data or training set.

Step 2: This algorithm will construct a decision tree for every sampled training subset.

Step 3: Each decision tree produces a prediction; the predictions are combined by majority voting (for classification) or by averaging (for regression).

Step 4: Finally, select the most voted (or averaged) prediction result as the final prediction result.

This combination of multiple models is called Ensemble. Ensemble uses two methods:

1. Bagging: Creating a different training subset from sample training data with replacement is
called Bagging. The final output is based on majority voting.

2. Boosting: Combining weak learners into strong learners by creating sequential models such that
the final model has the highest accuracy is called Boosting. Example: ADA BOOST, XG
BOOST.
Bagging: From the principle mentioned above, we can understand that Random Forest uses Bagging.
Now, let us understand this concept in detail. Bagging is also known as Bootstrap Aggregation, used by
random forest. The process begins with the original random data, which is organised into samples
known as bootstrap samples; this resampling step is known as bootstrapping. The models are then
trained individually on these samples, yielding different results. In the last step, all the results are
combined, and the generated output is based on majority voting; this combination step is known as
aggregation. Bagging is done using an Ensemble Classifier.
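A short scikit-learn sketch of the steps above; the dataset and hyperparameter values are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)

# The forest aggregates the trees' predictions (majority vote for classification)
print(round(accuracy_score(y_test, forest.predict(X_test)), 3))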

Assumptions of Random Forest

 Each tree makes its own decisions: Every tree in the forest makes its own predictions without
relying on others.

 Random parts of the data are used: Each tree is built using random samples and features to
reduce mistakes.

 Enough data is needed: Sufficient data ensures the trees are different and learn unique patterns
and variety.

 Different predictions improve accuracy: Combining the predictions from different trees leads to
a more accurate final result.

Key Benefits

 Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit
all the samples within training data. However, when there’s a robust number of decision trees in
a random forest, the classifier won’t overfit the model since the averaging of uncorrelated trees
lowers the overall variance and prediction error.

 Provides flexibility: Since random forest can handle both regression and classification tasks
with a high degree of accuracy, it is a popular method among data scientists. Feature bagging
also makes the random forest classifier an effective tool for estimating missing values as it
maintains accuracy when a portion of the data is missing.

Key Challenges
 Time-consuming process: Since random forest algorithms can handle large data sets, they can
provide more accurate predictions, but they can be slow to process data because they compute
a prediction for each individual decision tree.

 Requires more resources: Since random forests process larger data sets, they’ll require more
resources to store that data.

 More complex: The prediction of a single decision tree is easier to interpret when compared to a
forest of them.

Random forest applications

The random forest algorithm has been applied across a number of industries, allowing them to make
better business decisions. Some use cases include:

Finance: It is a preferred algorithm over others as it reduces time spent on data management and pre-
processing tasks. It can be used to evaluate customers with high credit risk, to detect fraud, and to solve
option pricing problems.

Healthcare: The random forest algorithm has applications within computational biology, allowing
doctors to tackle problems such as gene expression classification, biomarker discovery, and sequence
annotation. As a result, doctors can make estimates around drug responses to specific medications.

E-commerce: It can be used for recommendation engines for cross-sell purposes.

Evaluation of Classification algorithms

Classification Metrics
In a classification task, our main task is to predict the target variable which is in the form of discrete
values. To evaluate the performance of such a model there are metrics as mentioned below:

 Classification Accuracy
 Logarithmic loss
 Area under Curve
 F1 score
 Precision
 Recall
 Confusion Matrix

Classification Accuracy
Classification accuracy is a fundamental metric for evaluating the performance of a classification
model, providing a quick snapshot of how well the model is performing in terms of correct
predictions. This is calculated as the ratio of correct predictions to the total number of input Samples.

When is it used?
 Classification accuracy is a fundamental metric for evaluating the performance of a
classification model.
 It's often used as the default evaluation metric for generic models.
What does it indicate?
 A model with perfect accuracy has zero false positives and zero false negatives.
Logarithmic Loss
Logarithmic loss, also known as log loss or cross-entropy loss, is a metric used to evaluate the
performance of a classification model. It measures how close a model's predicted probabilities are to
the actual class labels.

How it works
 Log loss penalizes models for incorrect labeling of data classes.
 It takes into account the confidence of predictions, unlike accuracy which is binary.
 A lower log loss value indicates more accurate predictions.
 Log loss is a popular metric for measuring error in machine learning.
When it's used
 Log loss is used to train binary classifiers, which are simple tasks with two labels.
 It's also used to evaluate the performance of sentiment analysis models in natural language
processing.
Examples of use
 Predicting whether it will rain or not rain in a city
 Predicting whether an email is spam or not spam
 Analyzing customer feedback
 Monitoring social media
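For a binary problem, log loss can be written as −(1/N) Σ [y·log(p) + (1 − y)·log(1 − p)], where y is the actual label and p is the predicted probability of the positive class. A small sketch (the labels and probabilities are made up):

import math

def log_loss(y_true, y_prob):
    # Binary cross-entropy: confident wrong probabilities are penalised heavily
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)

y_true = [1, 0, 1, 1]                     # actual labels (e.g. rain / no rain)
confident = [0.9, 0.1, 0.8, 0.7]          # mostly correct, confident probabilities
hesitant = [0.6, 0.6, 0.4, 0.5]           # uncertain or wrong probabilities
print(round(log_loss(y_true, confident), 3))   # smaller log loss (better)
print(round(log_loss(y_true, hesitant), 3))    # larger log loss (worse)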

Area Under Curve(AUC)


The area under the curve (AUC) is a statistical metric that measures a classifier's ability to distinguish
between two classes. It's a key component of the receiver operating characteristic (ROC) curve, which
is used in data science to evaluate model performance
What does AUC measure?
 AUC measures the probability that a model will rank a randomly chosen positive example
higher than a negative example
 AUC is a summary of the ROC curve, which plots the true positive rate (TPR) versus the
false positive rate (FPR)
 AUC is a measure of a model's discriminative power
What does a high AUC mean?
 A higher AUC indicates better model performance
 A perfect model has an AUC of 1, which means it can perfectly distinguish between positive
and negative examples
 An AUC of 0.5 means the model is performing randomly
How is AUC used?
 AUC is used to evaluate binary classifiers, such as spam filters
 AUC is a valuable metric for comparing classifiers without considering the classification
threshold
 AUC is often used in combination with other metrics to evaluate model performance
What are some limitations of AUC?
 AUC may not provide a complete picture of model performance
 AUC doesn't account for the cost of false positives versus false negatives
 AUC can be misleading if the ROC curve is not well-defined
True positive rate:
Also called or termed sensitivity. The True Positive Rate is the proportion of actual positive data points
that are correctly classified as positive, with respect to all data points that are positive.

True Negative Rate

Also called or termed specificity. The True Negative Rate is the proportion of actual negative data
points that are correctly classified as negative, with respect to all data points that are negative.

False-positive Rate
The False Positive Rate is the proportion of actual negative data points that are incorrectly identified as
positive.

F1 Score

It is the harmonic mean of recall and precision: F1 = 2 x (Precision x Recall) / (Precision + Recall). Its
range is [0,1]. This metric tells us how precise our classifier is (how many instances it classifies
correctly) and how robust it is (it does not miss a significant number of instances).

Precision
There is another metric named Precision. Precision is a measure of a model’s performance that tells
you how many of the positive predictions made by the model are actually correct. It is calculated as
the number of true positive predictions divided by the total number of true positive and false positive
predictions: Precision = TP / (TP + FP).
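A short scikit-learn sketch computing these classification metrics on hypothetical predictions; the labels, predictions, and probabilities below are made up for illustration.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities (for AUC)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))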
