Unit 2 Notes
Introduction
Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means input data that is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches
the machine to predict the output correctly. It applies the same concept as a student learning under
the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping function
to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each
type of data. Once the training process is completed, the model is tested on test data (a portion of the
data held out from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled
as a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape. The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
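A minimal sketch of the idea behind this shape example: a rule-based "model" that labels a shape from its number of sides. The function name and inputs are illustrative, not from the notes.
# Sketch: labelling a shape from simple side-based rules (names are illustrative).
def classify_shape(num_sides, all_sides_equal=False):
    if num_sides == 3:
        return "triangle"
    if num_sides == 4 and all_sides_equal:
        return "square"
    if num_sides == 6 and all_sides_equal:
        return "hexagon"
    return "polygon"

print(classify_shape(4, all_sides_equal=True))   # square
print(classify_shape(3))                          # triangle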
Discriminative and Generative Models
In simple words, a discriminative model makes predictions on unseen data based on conditional
probability and can be used for either classification or regression problem statements. By contrast, a
generative model focuses on the distribution of a dataset to return a probability for a given example.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide if an email is spam
or not spam based on the words present in a particular email. To solve this problem, we have a joint
model over
Labels: Y = y, and
Features: X = {x1, x2, …, xn}
The joint distribution over labels and features is p(Y, X) = p(y, x1, x2, …, xn).
Now, our goal is to estimate the probability of a spam email, i.e., P(Y = 1 | X). Both generative and
discriminative models can solve this problem, but in different ways.
The approach of Generative Models
In the case of generative models, to find the conditional probability P(Y|X), they estimate the prior
probability P(Y) and the likelihood P(X|Y) from the training data and use Bayes' theorem to
calculate the posterior probability P(Y|X):
P(Y|X) = P(X|Y) · P(Y) / P(X)
The approach of Discriminative Models
In the case of discriminative models, to find the probability, they directly assume some functional
form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the training data.
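As a hedged illustration of the two approaches, the sketch below fits a generative classifier (Gaussian Naive Bayes, which models P(Y) and P(X|Y)) and a discriminative one (logistic regression, which models P(Y|X) directly) on synthetic data; the dataset and parameters are placeholders, not from the notes.
# Sketch: generative (GaussianNB) vs discriminative (LogisticRegression) estimation of P(Y|X).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

generative = GaussianNB().fit(X_train, y_train)               # learns P(Y) and P(X|Y), applies Bayes' rule
discriminative = LogisticRegression().fit(X_train, y_train)   # learns P(Y|X) directly

print("GaussianNB accuracy:        ", generative.score(X_test, y_test))
print("LogisticRegression accuracy:", discriminative.score(X_test, y_test))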
Discriminative Models
The discriminative model refers to a class of models used in Statistical Classification, mainly used for
supervised machine learning. These types of models are also known as conditional models since they
learn the boundaries between classes or labels in a dataset.
Discriminative models (just as the name implies) separate classes by learning the boundary between
them directly, and they do not make strong assumptions about how the data points were generated.
However, these models are not capable of generating new data points. Therefore, the ultimate
objective of discriminative models is to separate one class from another.
If there are outliers present in the dataset, then discriminative models work better than generative
models, i.e., discriminative models are more robust to outliers. However, one major drawback of
these models is the misclassification problem, i.e., wrongly classifying a data point.
Some commonly used discriminative models:
Logistic regression
Support Vector Machines (SVMs)
Traditional neural networks
Nearest neighbor
Conditional Random Fields (CRFs)
Decision Trees and Random Forest
Generative Models
Generative models are a class of statistical models that can generate new data instances.
These models are used in unsupervised machine learning as a means to perform tasks such as
o Probability and Likelihood estimation,
o Modeling data points,
o To describe the phenomenon in data,
o To distinguish between classes based on these probabilities.
So, Generative models focus on the distribution of individual classes in a dataset and the learning
algorithms tend to model the underlying patterns or distribution of the data points. These models use
the concept of joint probability and create the instances where a given feature (x) or input and the
desired output or label (y) exist at the same time.
These models use probability estimates and likelihood to model data points and differentiate between
different class labels present in a dataset. Unlike discriminative models, these models are also capable
of generating new data points. However, they also have a major drawback: if outliers are present in
the dataset, they affect these types of models to a significant extent.
Some commonly used generative models:
Naïve Bayes
Bayesian networks
Markov random fields
Hidden Markov Models (HMMs)
Latent Dirichlet Allocation (LDA)
Generative Adversarial Networks (GANs)
Autoregressive Model
Difference between Discriminative and Generative Models
Core Idea
Discriminative models draw boundaries in the data space, while generative models try to model how
data is placed throughout the space. A generative model focuses on explaining how the data was
generated, while a discriminative model focuses on predicting the labels of the data.
Mathematical Intuition
Discriminative models learn the conditional probability P(Y|X) directly, i.e., discriminative modeling
identifies the boundary used to tag and sort data and can be used to classify data, while generative
modeling learns the joint probability P(X, Y) and can therefore produce new data instances.
Since these models use different approaches to machine learning, both are suited to specific tasks,
i.e., generative models are useful for unsupervised learning tasks, while discriminative models are
useful for supervised learning tasks.
Let's see some of the comparisons between Discriminative and Generative Models based on the
following criteria:
Outliers
Performance
Missing Data
Accuracy Score
Applications
Based on Outliers
Generative models are affected significantly by outliers, whereas discriminative models are more
robust to them.
Based on Performance
Generative models need less data to train compared with discriminative models, since generative
models are more biased as they make stronger assumptions, i.e., the assumption of conditional
independence.
Based on Missing Data
In general, if we have missing data in our dataset, then generative models can work with these
missing values, while discriminative models cannot. This is because, in generative models, we can
still estimate the posterior by marginalizing over the unseen variables. However, for discriminative
models, we usually require all the features X to be observed.
Based on Accuracy Score
If the assumption of conditional independence is violated, then generative models are less accurate
than discriminative models.
Based on Applications
Discriminative models are called "discriminative" since they are useful for discriminating Y's label,
i.e., the target outcome, so they can only solve classification problems, while generative models have
more applications besides classification, such as
Sampling,
Bayes learning,
MAP inference, etc.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
A linear regression algorithm shows a linear relationship between a dependent variable (y) and one or
more independent variables (x), hence it is called linear regression. Since linear regression shows a
linear relationship, it finds how the value of the dependent variable changes according to the value of
the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the
variables:
y = a0 + a1·x + ε
Here, y is the dependent (target) variable, x is the independent (predictor) variable, a0 is the intercept,
a1 is the slope coefficient, and ε is the random error.
Linear regression can be further divided into two types of algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a Positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a Negative linear relationship.
When working with linear regression, our main goal is to find the best fit line that means the error
between predicted values and actual values should be minimized. The best fit line will have the least
error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so
we need to calculate the best values for a0 and a1 to find the best fit line; to do this we use a cost
function.
Cost function
Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
and the cost function is used to estimate the values of the coefficients for the best fit line.
The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, the average of the squared
errors between the predicted and actual values:
MSE = (1/N) Σ (yi − (a1·xi + a0))²
Where N is the total number of observations, yi is the actual value of the i-th observation, and
(a1·xi + a0) is the predicted value.
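A minimal sketch of the MSE cost for one candidate pair of coefficients a0, a1; the toy data values are illustrative only.
# Sketch: mean squared error cost of a candidate line y_pred = a0 + a1 * x (toy data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 7.9])

def mse_cost(a0, a1, x, y):
    y_pred = a0 + a1 * x
    return np.mean((y - y_pred) ** 2)

print(mse_cost(0.0, 2.0, x, y))   # cost of one candidate line; lower is a better fit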
Gradient Descent
Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
A regression model uses gradient descent to update the coefficients of the line by reducing the cost
function.
This is done by randomly selecting initial coefficient values and then iteratively updating them to
reach the minimum of the cost function.
Assumptions of Linear Regression
o There is a linear relationship between the independent and dependent variables.
o There is little or no multicollinearity between the independent variables.
o The residuals (errors) are independent, normally distributed, and have constant variance
(homoscedasticity).
Examples of Linear Regression
If you know your income and expenses for the last year, you can use linear regression to predict your
future expenses.
You can use linear regression to model the relationship between a person's height and weight.
You can use multiple linear regression to analyze the relationship of height, weight, and exercise
with blood pressure.
You can use linear regression to analyze the relationship between income and happiness.
You can use linear regression to model how the mass of a chemical changes over time.
You can use linear regression to model how a child's height changes with age.
Least Square LR
Linear regression is a statistical method that uses the least squares method to find the best line to fit
data.
Least squares linear regression (LSLR) is a mathematical method that finds the best fit line for a set of
data. It's also known as the least-squares regression line or the line of best fit.
The least-squares method is a statistical method used to find the line of best fit of the form of an
equation such as y = mx + b to the given data. The curve of the equation is called the regression line.
Our main objective in this method is to reduce the sum of the squares of errors as much as possible.
How it works
LSLR minimizes the sum of squared errors, or residuals, between the data points and the line.
The line of best fit doesn't have to pass through every data point, but it does minimize the
vertical distances between the data points and the line.
LSLR is often used for scatter plots, where the data is spread out in the x-y plane.
The least-squares method states that the curve that best fits a given set of observations is the
curve having the minimum sum of squared residuals (or deviations or errors) from the given
data points. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3),
…, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. Also,
suppose that f(x) is the fitting curve and d represents the error or deviation at each given point.
Now, we can write:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least-squares principle states that the best-fitting curve has the property that the sum of squares
of all the deviations from the given values must be a minimum, i.e.:
S = d1² + d2² + … + dn² = Σ (yi − f(xi))² is a minimum.
1) Find a linear regression equation for the following two sets of data:
Solving the normal equations gives a = 1.5 and b = 0.95.
The linear equation is given by
y = a + bx
Now put the values of a and b into the equation.
Hence the equation of linear regression is y = 1.5 + 0.95x.
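The data table for this example is not reproduced in the notes; the sketch below assumes the common textbook dataset x = (2, 4, 6, 8), y = (3, 7, 5, 10), which does reproduce a = 1.5 and b = 0.95.
# Sketch: least-squares fit y = a + b*x, assuming the (missing) data table was
# x = [2, 4, 6, 8] and y = [3, 7, 5, 10]; this reproduces a = 1.5, b = 0.95.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 7.0, 5.0, 10.0])

n = len(x)
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n
print(a, b)   # 1.5, 0.95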
Method 2
Use the least square method to determine the equation of line of best fit for the data. Then plot
the line.
Solution:
Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 = 62/10 = 6.2
Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 = 72/10 = 7.2
The straight-line equation is y = a + bx.
The normal equations are
Σy = na + bΣx
Σxy = aΣx + bΣx²
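A short sketch solving the two normal equations numerically for the xi and yi values listed above; the resulting a and b are approximate (roughly a ≈ 3.0 and b ≈ 0.68).
# Sketch: solving the normal equations for the Method 2 data above.
import numpy as np

x = np.array([8, 3, 2, 10, 11, 3, 6, 5, 6, 8], dtype=float)
y = np.array([4, 12, 1, 12, 9, 4, 9, 6, 1, 14], dtype=float)

n = len(x)
# Normal equations: sum(y) = n*a + b*sum(x)   and   sum(x*y) = a*sum(x) + b*sum(x^2)
A = np.array([[n, x.sum()], [x.sum(), (x ** 2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])
a, b = np.linalg.solve(A, rhs)
print(a, b)   # roughly a ≈ 3.0, b ≈ 0.68 for this data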
Under-fitting / Overfitting
Overfitting and Underfitting are the two main problems that occur in machine learning and degrade
the performance of the machine learning models.
The main goal of each machine learning model is to generalize well. Here generalization defines the
ability of an ML model to provide a suitable output by adapting the given set of unknown input. It
means after providing training on the dataset, it can produce reliable and accurate output. Hence, the
underfitting and overfitting are the two terms that need to be checked for the performance of the
model and whether the model is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help
to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but does
not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points, or more than the
required data points, present in the given dataset. Because of this, the model starts capturing the noise
and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of
the model. An overfitted model has low bias and high variance.
The chance of overfitting increases the more we train our model: the longer we train, the higher the
chance of producing an overfitted model.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:
As we can see from the graph, the model tries to cover all the data points present in the scatter plot.
It may look efficient, but in reality it is not. The goal of the regression model is to find the best fit
line, but here we have not got a best fit line, so it will generate prediction errors.
How to avoid Overfitting in the Model
Both overfitting and underfitting cause the degraded performance of the machine learning model. But
the main cause is overfitting, so there are some ways by which we can reduce the occurrence of
overfitting in our model.
Cross-Validation
Training with more data
Removing features
Early stopping the training
Regularization
Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the underlying trend of
the data. To avoid overfitting, the feeding of training data can be stopped at an early stage, due to
which the model may not learn enough from the training data. As a result, it may fail to find the best
fit for the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data, and hence it
reduces the accuracy and produces unreliable predictions.
An underfitted model has high bias and low variance.
Example: We can understand the underfitting using below output of the linear regression model:
As we can see from the above diagram, the model is unable to capture the data points present in the
plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
Goodness of Fit
The "goodness of fit" term is taken from statistics, and the goal of machine learning models is to
achieve a good fit. In statistical modeling, it defines how closely the results or predicted values match
the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and ideally it makes
predictions with zero error, but in practice this is difficult to achieve.
Cross-Validation
Cross-validation is a technique for validating model efficiency by training the model on a subset of
the input data and testing it on a previously unseen subset of the input data. We can also say that it is
a technique to check how a statistical model generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model: we cannot judge a
model based only on the training dataset. For this purpose, we reserve a particular sample of the
dataset which was not part of the training dataset. After that, we test our model on that sample before
deployment, and this complete process comes under cross-validation. This is something different
from the general train-test split.
Hence the basic steps of cross-validations are:
o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well
with the validation set, perform the further step, else check for the issues.
Methods used for Cross-Validation
There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
Validation Set Approach
We divide our input dataset into a training set and test or validation set in the validation set approach.
Both the subsets are given 50% of the dataset.
But it has one big disadvantage: we are using only 50% of the dataset to train our model, so the
model may miss important information in the data. It also tends to give an underfitted model.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means, if there are a total of n data
points in the original input dataset, then n-p data points will be used as the training dataset and the p
data points as the validation set. This complete process is repeated for all possible combinations of p
points, and the average error is calculated to know the effectiveness of the model.
There is a disadvantage of this technique: it can be computationally expensive for large p.
Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p, we take only 1 data point out
of the training data. It means, in this approach, for each learning set, only one data point is reserved,
and the remaining dataset is used to train the model. This process repeats for each data point. Hence
for n samples, we get n different training sets and n test sets. It has the following features:
In this approach, the bias is minimum as all the data points are used.
The process is executed for n times; hence execution time is high.
This approach leads to high variation in testing the effectiveness of the model as we
iteratively check against one data point.
K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of equal sizes.
These samples are called folds. For each learning set, the prediction function uses k-1 folds, and the
rest of the folds are used for the test set. This approach is a very popular CV approach because it is
easy to understand, and the output is less biased than other methods.
The steps for k-fold cross-validation are:
Split the input dataset into K groups
For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the model using
the test set.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st
iteration, the first fold is reserved for testing the model, and the rest are used to train the model. On
the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This
process continues until each fold has been used as the test fold.
Stratified K-Fold Cross-Validation
This technique is similar to k-fold cross-validation, with some small changes. This approach works on
the stratification concept: the data is rearranged to ensure that each fold or group is a good
representative of the complete dataset. It is one of the best approaches for dealing with bias and
variance.
It can be understood with an example of housing prices: the price of some houses can be much
higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation
technique is useful.
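A hedged sketch of k-fold and stratified k-fold cross-validation with scikit-learn; the dataset and model are placeholders, not from the notes.
# Sketch: k-fold and stratified k-fold cross-validation (dataset and model are placeholders).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # keeps class proportions in each fold

print("K-fold scores:           ", cross_val_score(model, X, y, cv=kf))
print("Stratified k-fold scores:", cross_val_score(model, X, y, cv=skf))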
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we remove a subset of
the training data and use it to get prediction results from a model trained on the rest of the dataset.
The error that occurs in this process tells how well our model will perform with the unknown dataset.
Although this approach is simple to perform, it still faces the issue of high variance, and it also
produces misleading results sometimes.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
For ideal conditions, it provides the optimum output. But for inconsistent data, it may produce
drastically different results. This is one of the big disadvantages of cross-validation, as there is no
certainty about the type of data in machine learning.
In predictive modeling, the data evolves over time, which may create differences between the
training and validation sets. For example, if we create a model for the prediction of stock market
values and the model is trained on the previous 5 years of stock values, the realistic future values for
the next 5 years may be drastically different, so it is difficult to expect correct output in such
situations.
Applications of Cross-Validation
o Comparing the performance of different machine learning models and selecting the best one.
o Tuning the hyperparameters of a model.
o Estimating how well a model will generalize to unseen data before deployment.
Lasso Regression
“LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a statistical formula for
the regularisation of data models and feature selection.
Lasso regression is a regularization technique. It is used over regression methods for more accurate
prediction. This model uses shrinkage, where data values are shrunk towards a central point, such as
the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer
parameters). This particular type of regression is well-suited for models showing high levels of
multicollinearity or when you want to automate certain parts of model selection, like variable
selection/parameter elimination.
Lasso Regression uses L1 regularization technique (will be discussed later). It is used when we have
more features because it automatically performs feature selection.
Lasso Regularization Techniques
There are two main regularization techniques, namely Ridge Regression and Lasso Regression.
They both differ in the way they assign a penalty to the coefficients.
Regularization
Regularization is an important concept that is used to avoid overfitting of the data, especially when
the training and test data vary considerably.
Regularization is implemented by adding a “penalty” term to the best fit derived from the trained
data, to achieve a lesser variance with the tested data and also restricts the influence of predictor
variables over the output variable by compressing their coefficients.
In regularization, we normally keep the same number of features but reduce the magnitude of the
coefficients. We can reduce the magnitude of the coefficients by using different types of regression
techniques which use regularization to overcome this problem, so let us discuss them.
L1 Regularization
If a regression model uses the L1 Regularization technique, then it is called Lasso Regression. If it
used the L2 regularization technique, it’s called Ridge Regression. We will study more about these in
the later sections.
L1 regularization adds a penalty that is equal to the absolute value of the magnitude of the coefficient.
This regularization type can result in sparse models with few coefficients. Some coefficients might
become zero and get eliminated from the model. Larger penalties result in coefficient values that are
closer to zero (ideal for producing simpler models). On the other hand, L2 regularization does not
eliminate coefficients and does not produce sparse models. Thus, Lasso Regression is easier to
interpret compared to Ridge.
In LASSO, we also minimize the RSS, however, augmented by a regularization term called the L1
penalty.
Cost = Residual Sum of Squares (RSS) + λ * (sum of the absolute values of the coefficient magnitudes)
Where,
λ denotes the amount of shrinkage.
λ = 0 implies all features are considered and it is equivalent to the linear regression
where only the residual sum of squares is considered to build a predictive model
λ = ∞ implies no feature is considered, i.e., as λ approaches infinity it eliminates more and
more features
The bias increases with an increase in λ
The variance increases with a decrease in λ
Lasso sometimes struggles with some types of data. If the number of predictors (p) is
greater than the number of observations (n), Lasso will pick at most n predictors as non-
zero, even if all predictors are relevant (or may be used in the test set).
If there are two or more highly collinear variables, then LASSO regression selects one of
them randomly, which is not good for the interpretation of the data.
Notice that the difference between ridge regression and LASSO models appears to be very small. It
only lies in the fact that the regularization term has a slightly different form. In fact, the expression we
minimize in these models only differs in the norm we use for the penalty. In ridge regression, we used
the norm p=2, while in LASSO we use the norm p=1.
Ridge Regression:
o Performs L2 regularization, i.e., adds a penalty equivalent to the square of the magnitude of
coefficients
Lasso Regression:
o Performs L1 regularization, i.e., adds a penalty equivalent to the absolute value of the
magnitude of coefficients
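A hedged sketch contrasting Ridge (L2) and Lasso (L1) in scikit-learn; here alpha plays the role of λ, and the synthetic dataset is only for illustration.
# Sketch: Ridge (L2) vs Lasso (L1) on synthetic data; alpha corresponds to the shrinkage amount λ.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk, but rarely exactly zero
print("Lasso coefficients:", np.round(lasso.coef_, 2))   # several coefficients driven to exactly zero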
Classification-Logistic Regression
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.
o Logistic Regression is much like Linear Regression except in how it is used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:
o The sigmoid function is a mathematical function used to map the predicted values
to probabilities: σ(z) = 1 / (1 + e^(−z)).
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid
function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and values below the
threshold tend to 0.
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + … + bnxn
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above equation
by (1 − y):
y / (1 − y); this is 0 for y = 0 and infinity for y = 1
o But we need a range between −[infinity] and +[infinity], so take the logarithm of the
equation and it becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn
The above equation is the final equation for Logistic Regression.
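A small numeric sketch of the relation above: the sigmoid maps a linear score to a probability in (0, 1), and the log-odds transform recovers the linear score. The value of z used is purely illustrative.
# Sketch: the sigmoid maps a linear score to a probability; the log-odds is its inverse.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.75                       # an illustrative linear score b0 + b1*x1 + ...
p = sigmoid(z)
log_odds = np.log(p / (1 - p))
print(p, log_odds)             # log_odds recovers z (≈ 0.75)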
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as "low", "Medium", or "High".
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use the
same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step
In this step, we will pre-process/prepare the data so that we can use it in our code efficiently. It will
be the same as we have done in the Data Pre-processing topic.
2. Fitting Logistic Regression to the Training set
We have prepared our dataset, and now we will train the model using the training set. To fit the
model to the training set, we will import the LogisticRegression class of the sklearn library.
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
3. Predicting the Test Result
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
#Predicting the test set result
y_pred = classifier.predict(x_test)
4. Test Accuracy of the result
Now we will create the confusion matrix to check the accuracy of the classification. To create it, we
need to import the confusion_matrix function of the sklearn library. After importing the function, we
will call it and store the result in a new variable cm. The function takes two main parameters, y_true
(the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
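Beyond the confusion matrix, the overall accuracy can also be computed directly; a brief sketch, assuming the same y_test and y_pred variables as above.
#Computing the accuracy score (assumes y_test and y_pred from the steps above)
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))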
Linear Regression: Cost Function and Gradient Descent
A linear regression model attempts to explain the relationship between a dependent (output) variable
and one or more independent (predictor) variables using a straight line.
The first step in finding a linear regression equation is to determine if there is a relationship between the
two variables. We can do this by using the Correlation coefficient and scatter plot. When a correlation
coefficient shows that data is likely to be able to predict future outcomes and a scatter plot of the data
appears to form a straight line, we can use simple linear regression to find a predictive function. Let us
consider an example.
From the scatter plot we can see there is a linear relationship between Sales and marketing spend. The
next step is to find a straight line between Sales and Marketing that explains the relationship between
them. But there can be multiple lines that pass through these points.
Cost Function
The cost is the error in our predicted value. We will use the Mean Squared Error function to calculate the
cost.
Our goal is to minimize the cost as much as possible in order to find the best fit line. For that, we will
use Gradient Descent Algorithm.
Gradient Descent is an algorithm that finds the best-fit line for a given training dataset in a smaller
number of iterations.
For some combination of m and c, we will get the least Error (MSE). That combination of m and c will
give us our best fit line.
The algorithm starts with some values of m and c (usually m = 0, c = 0). We calculate the MSE (cost)
at the point m = 0, c = 0. Let's say the MSE (cost) at m = 0, c = 0 is 100. Then we adjust the values of
m and c by a small amount (the learning step) and notice a decrease in MSE (cost). We continue doing
the same until our loss function is a very small value or ideally 0 (which means 0 error, or 100%
accuracy).
1. Let m = 0 and c = 0. Let L be our learning rate; it could be a small value like 0.01 for good accuracy.
The learning rate controls how far m and c move in the direction of the gradient on each step. Setting
it too high would make your path unstable, while setting it too low would make convergence slow.
Setting it to zero means your model isn't learning anything from the gradients.
2. Calculate the partial derivative of the cost function with respect to m. Let this be Dm (how much the
cost function changes with a small change in m):
Dm = (−2/n) Σ xi (yi − ŷi)
Similarly, let the partial derivative of the cost function with respect to c be Dc (how much the cost
function changes with a small change in c):
Dc = (−2/n) Σ (yi − ŷi)
3. Now update the current values of m and c using the following equations:
m = m − L · Dm
c = c − L · Dc
4. We repeat this process until our cost function is very small (ideally 0).
Gradient Descent Algorithm gives optimum values of m and c of the linear regression equation. With
these values of m and c, we will get the equation of the best-fit line and ready to make predictions.
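A hedged sketch of the procedure just described, starting from m = 0, c = 0 with learning rate L; the toy data and number of iterations are illustrative only.
# Sketch: gradient descent for y = m*x + c on toy data (values are illustrative).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 5.0, 7.2, 8.9, 11.1])

m, c = 0.0, 0.0        # start at m = 0, c = 0
L = 0.01               # learning rate
n = len(x)

for _ in range(2000):
    y_pred = m * x + c
    Dm = (-2.0 / n) * np.sum(x * (y - y_pred))   # partial derivative of MSE w.r.t. m
    Dc = (-2.0 / n) * np.sum(y - y_pred)         # partial derivative of MSE w.r.t. c
    m -= L * Dm
    c -= L * Dc

print(m, c)   # close to the best-fit slope and intercept for the toy data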
Support Vector Machine (SVM)
Support Vector Machine, or SVM, is one of the most popular Supervised Learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases
are called support vectors, and hence the algorithm is termed the Support Vector Machine. Consider
the below diagram, in which two different categories are classified using a decision boundary or
hyperplane:
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, i.e., if a dataset can be classified
into two classes using a single straight line, then such data is termed linearly separable data,
and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, i.e., if a dataset
cannot be classified using a straight line, then such data is termed non-linear data, and the
classifier used is called a Non-linear SVM classifier.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair (x1, x2) of coordinates in either green or blue. Consider the below
image:
Since this is a 2-D space, by just using a straight line we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both
classes. These points are called support vectors. The distance between the support vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane
with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we
cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data we have used
two dimensions, x and y, so for non-linear data we will add a third dimension z. It can be calculated
as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it to 2-D space
with z = 1, it becomes:
Support Vector Machine (SVM) is a supervised machine learning algorithm for classification and
regression tasks. Here’s a detailed description of the SVM algorithm for binary classification:
1. Problem Formulation
Given a set of training examples:
{(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where each xᵢ is a feature vector and yᵢ ∈ {−1, +1} is its class label.
The goal is to find a decision boundary (a hyperplane) that maximizes the margin between the two
classes.
2. Hyperplane
A hyperplane can be defined as wᵀx + b = 0, where w is the weight vector (normal to the hyperplane),
and b is the bias term. The classification constraint can be written as
yᵢ(wᵀxᵢ + b) ≥ 1 for all i.
3. Margin
The margin can be calculated as the perpendicular distance between the closest data points and the
hyperplane, given by 2 / ||w||. To maximize the margin, we need to minimize ||w||² / 2, subject to the
constraints yᵢ(wᵀxᵢ + b) ≥ 1.
4. Lagrange Multipliers
Introduce Lagrange multipliers αᵢ ≥ 0 for each constraint, and form the Lagrangian:
L(w, b, α) = ½||w||² − Σᵢ αᵢ [yᵢ(wᵀxᵢ + b) − 1]
5. Stationarity Conditions
To find the saddle point of L(w, b, α), compute the gradients with respect to w and b, and set them to
zero:
∂L/∂w = 0 ⟹ w = Σᵢ αᵢ yᵢ xᵢ
∂L/∂b = 0 ⟹ Σᵢ αᵢ yᵢ = 0
6. Dual Formulation
Substitute these expressions back into the Lagrangian to obtain the dual problem:
maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢᵀxⱼ), subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0.
7. Solve the Dual Problem
The dual problem is a convex quadratic programming problem. Solve it using optimization techniques
such as the Sequential Minimal Optimization (SMO) algorithm, gradient ascent, or specialized
quadratic programming solvers.
8. Obtain w and b
Once you have the optimal αᵢ values, compute the weight vector w:
w = Σᵢ αᵢ yᵢ xᵢ
To find the bias term b, use any support vector (xₛ, yₛ) where αₛ > 0:
b = yₛ − wᵀxₛ
9. Make Predictions
For a new data point x, the predicted class label ŷ can be calculated using:
ŷ = sign(wᵀx + b)
The sign of the result determines the class: if the result is positive, the predicted class is 1, and if the
result is negative, the predicted class is −1.
10. Kernel Trick
For non-linearly separable data, you can use the kernel trick to map the data to a higher-dimensional
space where it becomes linearly separable. Replace the dot product (xᵢᵀxⱼ) in the dual problem with a
kernel function K(xᵢ, xⱼ) that computes the dot product in the higher-dimensional space:
maximize Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
Common kernel functions include the linear kernel (K(xᵢ, xⱼ) = xᵢᵀxⱼ), the polynomial kernel
(K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ), and the radial basis function (RBF) or Gaussian kernel
(K(xᵢ, xⱼ) = exp(−γ||xᵢ − xⱼ||²)).
11. Make Predictions with Kernels
For a new data point x, the predicted class label ŷ can be calculated using the kernel function:
ŷ = sign(Σᵢ αᵢ yᵢ K(xᵢ, x) + b)
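A toy numpy sketch of this kernel prediction rule; the support vectors, multipliers, and bias below are made-up values for illustration, not the output of a real solver.
# Toy sketch of the rule y_hat = sign(sum_i alpha_i * y_i * K(x_i, x) + b).
# The support vectors, alphas and b are made-up illustrative values.
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

support_vectors = np.array([[1.0, 2.0], [2.0, 1.0], [4.0, 4.0]])
alphas = np.array([0.6, 0.4, 1.0])
labels = np.array([1, 1, -1])
b = 0.1

def predict(x_new):
    score = sum(a * y * rbf_kernel(sv, x_new) for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1

print(predict(np.array([1.5, 1.5])))   # +1: close to the positive support vectors
print(predict(np.array([4.0, 4.2])))   # -1: close to the negative support vector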
Kernel Methods
Higher-Dimensional Feature Space: By applying a kernel function, the data is transformed into a new,
higher-dimensional space where the data may become linearly separable. In this new feature space,
SVM can find a linear hyperplane that effectively separates the classes, even though the data appeared
non-linear in the original space. It is very difficult to solve this classification using a linear classifier as
there is no good linear line that should be able to classify the red and the green dots as the points are
randomly distributed. Here comes the use of kernel function which takes the points to higher
dimensions, solves the problem over there and returns the output
The key idea of SVMs is that we don’t need to explicitly compute the mapping to the higher-
dimensional feature space. Instead, the kernel function computes the similarity between data points in
the higher-dimensional space without having to directly compute the coordinates of each point in that
space. This allows SVMs to handle complex, non-linear relationships between features while
maintaining computational efficiency
Linear Kernel
Let us say that we have two vectors named x1 and x2; the linear kernel is defined by the dot product
of these two vectors:
K(x1, x2) = x1 · x2
1. The linear kernel is the simplest and most straightforward kernel function.
2. This kernel is used when the data is already linearly separable. It effectively means that no
transformation is applied to the data.
3. Advantages:
Simple and fast to compute.
Effective for linearly separable data.
4. Disadvantages:
Not suitable for complex, non-linear data.
Polynomial Kernel
1. The polynomial kernel allows for more complex decision boundaries by adding polynomial
features to the data. It is defined as:
K(x1, x2) = (x1ᵀx2 + c)ᵈ, where c is a constant and d is the degree of the polynomial.
2. This kernel can capture interactions between features up to a certain degree.
3. Advantages:
Can model interactions between features.
Suitable for non-linearly separable data.
4. Disadvantages:
Computationally more expensive than the linear kernel.
Risk of overfitting with high-degree polynomials.
Gaussian (RBF) Kernel
This kernel is an example of a radial basis function kernel. Its equation is:
K(x1, x2) = exp(−||x1 − x2||² / (2σ²))
The given σ plays a very important role in the performance of the Gaussian kernel and should neither
be overestimated nor underestimated; it should be carefully tuned according to the problem.
1. The RBF kernel, also known as the Gaussian kernel, is a popular choice due to its flexibility. It
is defined as:
2. This kernel can handle very complex and non-linear relationships.
3. Advantages:
Can handle a wide range of data distributions.
Effective in high-dimensional spaces.
4. Disadvantages:
Requires careful tuning of the σ parameter.
Can be computationally expensive with large datasets.
Sigmoid Kernel
This kernel is used in neural network related areas of machine learning. The activation function for
the sigmoid kernel is the bipolar sigmoid function. The equation for this hyperbolic tangent kernel
function is:
K(x1, x2) = tanh(γ · x1ᵀx2 + c)
Advantages:
Simple to implement.
Disadvantages:
Not guaranteed to be a valid (positive semi-definite) kernel for all parameter choices, so performance
is sensitive to how its parameters are set.
Selecting the appropriate kernel for your SVM model depends on several factors:
Data Complexity: For linearly separable data, the linear kernel is sufficient. For more complex
data, consider polynomial or RBF kernels.
Computational Resources: RBF and polynomial kernels are computationally more intensive
than the linear kernel. Ensure that your computational resources can handle the increased
complexity.
Model Performance: Experiment with different kernels and use cross-validation to determine
which kernel yields the best performance for your specific problem.
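A hedged sketch of comparing kernels with cross-validation in scikit-learn; the dataset is a placeholder used only to show the workflow.
# Sketch: comparing SVM kernels with cross-validation (dataset is a placeholder).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel, gamma="scale"), X, y, cv=5)
    print(kernel, "mean CV accuracy:", round(scores.mean(), 3))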
Instance-Based Learning
The machine learning systems categorized as instance-based learning are systems that learn the
training examples by heart and then generalize to new instances based on some similarity measure.
It is called instance-based because it builds its hypotheses from the training instances. It is also known
as memory-based learning or lazy learning. The time complexity of this algorithm depends upon the
size of the training data; its worst-case time complexity is O(n), where n is the number of training
instances.
For example, if we were to create a spam filter with an instance-based learning algorithm, instead of
just flagging emails that are already marked as spam emails, our spam filter would be programmed
to also flag emails that are very similar to them. This requires a measure of resemblance between
two emails. A similarity measure between two emails could be the same sender or the repetitive use
of the same keywords or something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt to new data easily, one which is collected as we go.
Disadvantages:
1. Classification costs are high, since a local model must be computed for each query.
2. A large amount of memory is required to store the data, and each query involves building a
local model from scratch.
K Nearest Neighbor
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1,
so this data point will lie in which of these categories. To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular dataset.
Consider the below diagram:
How does K-NN work?
Step 1: Selecting the value of K
o Firstly, we will choose the number of neighbors; here we will choose k = 5.
Step 2: Calculating the Distance
o To measure the similarity between the target and training data points, the Euclidean distance
is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target point are nearest neighbors. By calculating
the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:
o As we can see, the majority of the nearest neighbors (3 of the 5) are from category A, hence
this new data point must belong to category A.
When you want to classify a data point into a category (like spam or not spam), the K-NN
algorithm looks at the K closest points in the dataset. These closest points are called
neighbors. The algorithm then looks at which category the neighbors belong to and picks the
one that appears the most. This is called majority voting.
In regression, the algorithm still looks for the K closest points. But instead of voting for a class as in
classification, it takes the average of the values of those K neighbors. This average is the predicted
value for the new point.
Below are some points to remember while selecting the value of K in the K-NN algorithm:
There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
A very low value of K, such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
Large values of K smooth out noise, but a value that is too large may blur the class boundaries
and increase the computation needed.
Advantages of the K-NN Algorithm
Easy to implement- The K-NN algorithm is easy to implement because its complexity is relatively
low compared to other machine learning algorithms.
Easily Adaptable- K-NN stores all data in memory, so when new data points are added, it
automatically adjusts and uses the new data for future predictions.
Few Hyperparameters- The only parameters required when training a K-NN model are the value
of k and the choice of the distance metric.
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space. You
can think of it like the shortest path you would walk if you were to go directly from one point to another.
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical lines
(like a grid or city streets). It’s also called “taxicab distance” because a taxi can only drive along the
grid-like streets of a city.
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan
distances as special cases.
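A short sketch of the three distance metrics on a pair of example points; the point values are illustrative only.
# Sketch: Euclidean, Manhattan and Minkowski distances between two example points.
import numpy as np

p = np.array([20.0, 35.0])   # e.g. (brightness, saturation) of a new entry
q = np.array([40.0, 20.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))
manhattan = np.sum(np.abs(p - q))
minkowski = np.sum(np.abs(p - q) ** 3) ** (1.0 / 3.0)   # Minkowski with order 3

print(euclidean, manhattan, minkowski)
# Order 1 gives Manhattan and order 2 gives Euclidean as special cases of Minkowski.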
Example
The table represents our data set. We have two columns — Brightness and Saturation. Each row in the
table has a class of either Red or Blue.
Brightness - 20
Saturation - 35
Class- ?
Let's arrange the distances in ascending order. Since we chose 5 as the value of K, we'll only
consider the first five rows, that is, the five smallest distances.
As you can see above, the majority class within the 5 nearest neighbors to the new entry is Red.
Therefore, we'll classify the new entry as Red.
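The full brightness/saturation table is not reproduced in the notes; the sketch below uses made-up rows of that form purely to show the K-NN workflow with K = 5.
# Sketch: K-NN on made-up (brightness, saturation) rows; the real table is not in the notes.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[40, 20], [50, 50], [60, 90], [10, 25], [70, 70], [60, 10], [25, 80]]
y_train = ["Red", "Blue", "Blue", "Red", "Blue", "Red", "Blue"]

knn = KNeighborsClassifier(n_neighbors=5)          # K = 5, Euclidean distance by default
knn.fit(X_train, y_train)

new_entry = [[20, 35]]                             # Brightness = 20, Saturation = 35
print(knn.predict(new_entry))                      # majority class of the 5 nearest neighbours
                                                   # ("Red" for this made-up data)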
Tree-Based Methods: Decision Tree
o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes
are the outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further splits
the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model. Below
are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy
to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub Tree: A sub-tree formed by splitting the main tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the
child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real dataset)
attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain possible values for the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; call the
final node a leaf node.
While implementing a decision tree, the main issue that arises is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique called the Attribute
Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the
nodes of the tree. There are two popular techniques for ASM:
o Information Gain
o Gini Index
Information Gain:
Information gain is the measurement of the change in entropy after a dataset is split on an attribute.
The attribute with the highest information gain is chosen for the split:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness
in the data. For a binary outcome it can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
where P(yes) is the probability of "yes" and P(no) is the probability of "no" in the sample S.
Gini Index:
oGini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σⱼ Pⱼ²
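A short sketch computing entropy, information gain, and Gini index for a toy binary split; the class counts below are illustrative only.
# Sketch: entropy, information gain and Gini index for a toy split (counts are illustrative).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = ["yes"] * 9 + ["no"] * 5
left   = ["yes"] * 6 + ["no"] * 1        # one child subset after the split
right  = ["yes"] * 3 + ["no"] * 4        # the other child subset

weighted_child_entropy = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child_entropy

print("Entropy(parent):", round(entropy(parent), 3))
print("Information gain of the split:", round(info_gain, 3))
print("Gini(parent):", round(gini(parent), 3))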
Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important
features of the dataset. A technique that decreases the size of the learning tree without reducing
accuracy is known as pruning. There are mainly two types of tree pruning techniques:
o Cost Complexity Pruning
o Reduced Error Pruning.
Advantages of the Decision Tree
o It is simple to understand as it follows the same process which a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
ID3
The ID3 algorithm begins with the original set S as the root node. On each iteration of the algorithm,
it iterates through every unused attribute of the set S and calculates the entropy H(S) or the
information gain IG(S) of that attribute. It then selects the attribute which has the smallest entropy
(or largest information gain) value. The set S is then split or partitioned by the selected attribute to
produce subsets of the data. (For example, a node can be split into child nodes based upon the
subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.)
The algorithm continues to recurse on each subset, considering only attributes never selected before.
Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we
will take each of the features and calculate the information gain for each feature.
From the above images, we can see that the information gain is maximum when we make a split on
feature Y. So, for the root node best-suited feature is feature Y. Now we can see that while splitting the
dataset by feature Y, the child contains a pure subset of the target variable. So we don’t need to further
split the dataset. The final tree for the above dataset would look like this
2) CART
CART (Classification And Regression Trees) is a variation of the decision tree algorithm. It can
handle both classification and regression tasks.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm that is used for both
classification and regression tasks. It is a supervised learning algorithm that learns from labelled data to
predict unseen data.
Tree structure: CART builds a tree-like structure consisting of nodes and branches. The nodes
represent different decision points, and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label or value for the target
variable.
Splitting criteria: CART uses a greedy approach to split the data at each node. It evaluates all
possible splits and selects the one that best reduces the impurity of the resulting subsets. For
classification tasks, CART uses Gini impurity as the splitting criterion: the lower the Gini
impurity, the purer the subset. For regression tasks, CART splits so as to reduce the residual
(squared) error: the lower the residual error within the subsets, the better the fit of the model
to the data.
Pruning: To prevent overfitting, pruning removes nodes that contribute little to the model's accuracy. Cost complexity pruning and information gain pruning are two popular techniques. Cost complexity pruning assigns each subtree a cost that trades off training error against tree size (controlled by a complexity parameter) and prunes the subtrees that improve the fit the least relative to their size. Information gain pruning calculates the information gain of each node and removes nodes with a low information gain.
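As an illustrative sketch (not part of the original notes), scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its tree estimators; the example below assumes scikit-learn is installed and uses its built-in iris dataset as stand-in data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the effective alpha values at which subtrees would be pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha -> more aggressive pruning -> smaller tree.
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test accuracy={tree.score(X_test, y_test):.3f}")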
The overall splitting procedure is: first, the best split point of each input feature is found; then, based on the best split points of each input from that first step, the overall best split point is identified and the node is split; splitting continues until a stopping rule is satisfied or no further desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does this by searching for the most homogeneous sub-nodes with the help of the Gini index criterion.
The Gini index is the splitting metric for classification tasks in CART. It is based on the sum of squared class probabilities and measures how likely a randomly chosen element would be misclassified if it were labelled at random according to the class distribution in the node; it is a variation of the Gini coefficient. For a categorical variable it treats each candidate outcome as "success" or "failure" and hence conducts binary splitting only.
The Gini index takes values between 0 and 1:
A value of 0 means that all the elements belong to a single class, i.e. the node is pure.
Values close to 1 indicate a high level of impurity, where each class contains only a very small fraction of the elements.
A value of 1 – 1/n occurs when the elements are uniformly distributed over n classes, each with probability 1/n; this is the maximum impurity for n classes. For example, with two equally likely classes, the Gini impurity is 1 – 1/2 = 0.5.
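For concreteness, here is a small self-contained sketch (not from the notes; the class counts are made up) of the Gini impurity calculation:

def gini_impurity(class_counts):
    # Gini = 1 - sum(p_i^2) over the class proportions in a node.
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([5, 0]))     # 0.0  -> pure node, only one class present
print(gini_impurity([5, 5]))     # 0.5  -> two uniformly distributed classes (1 - 1/2)
print(gini_impurity([4, 3, 3]))  # 0.66 -> below the 3-class maximum of 1 - 1/3 ≈ 0.67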
A classification tree is an algorithm where the target variable is categorical. The algorithm is then used
to identify the “Class” within which the target variable is most likely to fall. Classification trees are used
when the dataset needs to be split into classes that belong to the response variable (like yes or no).
CART for classification is a decision tree learning algorithm that creates a tree-like structure to predict class labels. The tree consists of nodes, which represent different decision points, and branches, which represent the possible outcomes of those decisions. Each leaf node of the tree holds a predicted class label.
CART for classification works by recursively splitting the training data into smaller and smaller subsets
based on certain criteria. The goal is to split the data in a way that minimizes the impurity within each
subset. Impurity is a measure of how mixed up the data is in a particular subset. For classification tasks,
CART uses Gini impurity
Gini Impurity - Gini impurity measures the probability of misclassifying a randomly chosen instance from a subset if it were labelled at random according to the class distribution of that subset. Lower Gini impurity means a purer subset.
Splitting Criteria- The CART algorithm evaluates all potential splits at every node and chooses
the one that best decreases the Gini impurity of the resultant subsets. This process continues
until a stopping criterion is reached, like a maximum tree depth or a minimum number of
instances in a leaf node.
A regression tree is an algorithm where the target variable is continuous and the tree is used to predict its value. Regression trees are used when the response variable is continuous, for example when predicting the temperature of the day.
CART for regression is a decision tree learning method that creates a tree-like structure to predict
continuous target variables. The tree consists of nodes that represent different decision points and
branches that represent the possible outcomes of those decisions. Predicted values for the target variable
are stored in each leaf node of the tree.
Regression CART works by splitting the training data recursively into smaller subsets based on specific criteria. The objective is to split the data in a way that minimizes the residual error (the average squared difference between predicted and actual values) within each subset.
Residual Reduction - Residual reduction measures how much the average squared difference between the predicted values and the actual values of the target variable decreases when the subset is split. The greater the residual reduction, the better the model fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the one that
results in the greatest reduction of residual error in the resulting subsets. This process is repeated
until a stopping criterion is met, such as reaching the maximum tree depth or having too few
instances in a leaf node.
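A minimal regression-tree sketch with scikit-learn (the noisy 1-D data below is generated purely for illustration; criterion="squared_error" assumes scikit-learn 1.0 or newer):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# Each split minimizes the residual sum of squares; max_depth is a simple stopping rule.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y)
print(reg.predict([[2.5], [7.5]]))  # piecewise-constant predictions from the leaf nodes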
CART (Classification and Regression Trees): The original algorithm that uses binary splits to
build decision trees.
C4.5 and C5.0: Successors of the ID3 algorithm (closely related to CART) that allow for multiway splits and handle categorical variables more effectively.
Random Forests: Ensemble methods that use multiple decision trees (often CART) to improve
predictive performance and reduce overfitting.
Gradient Boosting Machines (GBM): Boosting algorithms that also use decision trees (often
CART) as base learners, sequentially improving model performance.
Advantages of CART
o CART shares the general advantages of decision trees noted above: it is simple to understand and interpret, requires relatively little data preparation, and handles both classification and regression tasks.
Limitations of CART
o Overfitting.
o High variance.
o Low bias (deep trees fit the training data very closely, which drives the high variance above).
Gini index
The Gini index is the splitting metric for classification tasks in CART. It is based on the sum of squared class probabilities and can be written as
Gini = 1 – Σ (p_i)²,
where p_i is the proportion of instances of class i in the node.
Outlook
Outlook is a nominal feature. It can be sunny, overcast or rain. I will summarize the final decisions for
outlook feature.
Temperature
Similarly, temperature is a nominal feature and it could have 3 different values: Cool, Hot and Mild.
Let’s summarize decisions for temperature feature.
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity
Time to decide
We've calculated Gini index values for each feature. The winner is the outlook feature because its Gini index is the lowest.
Notice that the sub-dataset in the overcast branch has only Yes decisions. This means the overcast branch ends in a leaf.
We will apply the same principle to the remaining sub-datasets in the following steps.
Focus on the sub dataset for sunny outlook. We need to find the gini index scores for temperature,
humidity and wind features respectively.
Humidity     Yes    No    Number of instances
High         0      3     3
Normal       2      0     2

Wind         Yes    No    Number of instances
Weak         1      2     3
Strong       1      1     2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Wind) = (3/5) x 0.444 + (2/5) x 0.5 = 0.266 + 0.2 = 0.466
We've calculated the Gini index scores for each feature when the outlook is sunny. The winner is humidity because it has the lowest value.
Feature        Gini index
Temperature    0.2
Humidity       0
Wind           0.466
As seen, the decision is always No for high humidity with a sunny outlook, and always Yes for normal humidity with a sunny outlook, so this branch is complete.
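The same bookkeeping can be scripted. The short sketch below (not part of the original worked example) recomputes the weighted Gini scores for the sunny-outlook sub-dataset from the (Yes, No) counts in the tables above:

def gini(counts):
    # Gini impurity from a list of class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    # Weighted Gini of a candidate split, given (yes, no) counts per branch.
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

# (Yes, No) counts in the sunny-outlook sub-dataset, taken from the tables above.
sunny_splits = {
    "Humidity": [(0, 3), (2, 0)],   # High, Normal
    "Wind":     [(1, 2), (1, 1)],   # Weak, Strong
}
for feature, groups in sunny_splits.items():
    print(feature, round(weighted_gini(groups), 3))
# Humidity 0.0 and Wind 0.467 -> humidity wins, matching the table above.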
Rain outlook
The sub-dataset for the rain outlook contains 5 instances (for example, Day 6: Rain, Cool, Normal, Strong, No). We now compute the Gini index scores for the temperature, humidity and wind features within this sub-dataset.
Temperature    Yes    No    Number of instances
Cool           1      1     2
Mild           2      1     3
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Temp.) = (2/5) x 0.5 + (3/5) x 0.444 = 0.2 + 0.266 = 0.466
Humidity    Yes    No    Number of instances
High        1      1     2
Normal      2      1     3
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Humidity) = (2/5) x 0.5 + (3/5) x 0.444 = 0.2 + 0.266 = 0.466
Wind        Yes    No    Number of instances
Weak        3      0     3
Strong      0      2     2
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Gini(Wind) = (3/5) x 0 + (2/5) x 0 = 0
For the rain outlook, the winner is the wind feature because it has the minimum Gini index score among the features.
Feature        Gini index
Temperature    0.466
Humidity       0.466
Wind           0
Place the wind feature on the rain-outlook branch and examine the new sub-datasets. As seen, the decision is always Yes when the wind is weak and always No when the wind is strong, so this branch is complete and the tree is finished.
Ensemble learning stands out as a powerful technique in machine learning, offering a robust approach to improving model performance and predictive accuracy. By combining the strengths of multiple individual models, ensemble methods can often outperform any single model, making them a valuable part of the machine learning toolkit. The sections below cover the main ensemble techniques, algorithms, and applications.
What Is Ensemble Learning?
Ensemble learning refers to a machine learning approach where several models are trained to address a
common problem, and their predictions are combined to enhance the overall performance. The idea
behind ensemble learning is that by combining multiple models, each with its strengths and weaknesses,
the ensemble can achieve better results than any single model alone. Ensemble learning can be applied
to various machine learning tasks, including classification, regression, and clustering. Some common
ensemble learning methods include bagging, boosting, and stacking.
Ensemble Techniques
Ensemble techniques in machine learning involve combining multiple models to improve performance.
One common ensemble technique is bagging, which uses bootstrap sampling to create multiple datasets
from the original data and trains a model on each dataset. Another technique is boosting, which trains
models sequentially, each focusing on the previous models' mistakes. Random forests are a popular
ensemble method that uses decision trees as base learners and combines their predictions to make a final
prediction. Ensemble techniques are effective because they reduce overfitting and improve
generalization, leading to more robust models.
Simple ensemble techniques combine predictions from multiple models to produce a final prediction.
These techniques are straightforward to implement and can often improve performance compared to
individual models.
Max Voting
In this technique, the final prediction is the most frequent prediction among the base models. For
example, if three base models predict the classes A, B, and A for a given sample, the final prediction
using max voting would be class A, as it appears more frequently.
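A minimal sketch of max voting over three hypothetical base-model predictions (mirroring the A/B/A example above):

from collections import Counter

def max_vote(predictions):
    # Return the most frequent class label among the base models' predictions.
    return Counter(predictions).most_common(1)[0][0]

# Three base models predict A, B and A for the same sample -> final prediction is A.
print(max_vote(["A", "B", "A"]))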
Averaging
Averaging involves taking the average of predictions from multiple models. This can be particularly
useful for regression problems, where the final prediction is the mean of predictions from all models.
For classification, averaging can be applied to the predicted probabilities for a more confident
prediction.
Weighted Averaging
Weighted averaging is similar, but each model's prediction is given a different weight. The weights can
be assigned based on each model's performance on a validation set or tuned using grid or randomized
search techniques. This allows models with higher performance to have a greater influence on the final prediction.
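As an illustrative sketch (the probabilities and weights below are made up), simple and weighted averaging of predicted probabilities could look like this:

import numpy as np

# Hypothetical predicted probabilities of the positive class from three models.
p1, p2, p3 = np.array([0.9, 0.2, 0.6]), np.array([0.7, 0.4, 0.5]), np.array([0.8, 0.1, 0.7])

simple_avg = (p1 + p2 + p3) / 3
# Weights chosen to favour the model that performed best on a validation set (assumed).
weights = np.array([0.5, 0.2, 0.3])
weighted_avg = weights[0] * p1 + weights[1] * p2 + weights[2] * p3

print(simple_avg)    # approximately [0.8, 0.233, 0.6]
print(weighted_avg)  # approximately [0.83, 0.21, 0.61]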
Advanced ensemble techniques go beyond basic methods like bagging and boosting to enhance model
performance further. Here are explanations of stacking, blending, bagging, and boosting:
Stacking
Stacking, or stacked generalization, combines multiple base models with a meta-model to make
predictions.
Instead of using simple methods like averaging or voting, stacking trains a meta-model to learn how to
combine the base models' predictions best.
The base models can be diverse so that they capture different aspects of the data, and the meta-model learns how much to weight each base model's predictions based on its performance.
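A small stacking sketch with scikit-learn (the base models and meta-model here are chosen arbitrarily for illustration, and the iris dataset is used as stand-in data):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Diverse base models; a logistic-regression meta-model learns how to combine them.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the meta-model is trained on cross-validated base-model predictions
)
print(stack.fit(X, y).score(X, y))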
Blending
Instead of a meta-model, blending uses a simple method like averaging or a linear model to combine the
predictions of the base models.
Blending is often used in competitions where simplicity and efficiency are important.
Bagging
Bagging is a technique where multiple subsets of the dataset are created through bootstrapping
(sampling with replacement).
A base model (often a decision tree) is trained on each subset, and the final prediction is the average (for
regression) or majority vote (for classification) of the individual predictions.
Bagging helps reduce variance and overfitting, especially for unstable models.
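An illustrative bagging sketch with scikit-learn (iris is used as stand-in data; by default BaggingClassifier uses a decision tree as the base model):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# Each of the 50 base trees is trained on a bootstrap sample of the data;
# their predictions are combined by majority vote.
bag = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(bag.score(X, y))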
Boosting
Boosting is an ensemble technique where base models are trained sequentially, with each subsequent
model focusing on the mistakes of the previous ones.
The final prediction is a weighted sum of the individual models' predictions, with higher weights given
to more accurate models.
Boosting algorithms like AdaBoost, Gradient Boosting, and XGBoost are popular because they improve
model performance.
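A brief boosting sketch with scikit-learn (again with iris as stand-in data):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# AdaBoost re-weights training samples so that later learners focus on earlier mistakes;
# gradient boosting fits each new tree to the residual errors of the current ensemble.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
print(ada.score(X, y), gbm.score(X, y))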
Random Forest
Random Forest is a technique in ensemble learning that utilizes a decision tree group to make
predictions.
The key concept behind Random Forest is introducing randomness in tree-building to create
diverse trees.
To create each tree, a random subset of the training data is sampled (with replacement), and a
decision tree is trained on this subset.
Additionally, rather than considering all features, a random subset of features is selected at each
tree node to determine the best split.
The final prediction of the Random Forest is made by aggregating the predictions of all the
individual trees (e.g., averaging for regression, majority voting for classification).
Random Forests are robust against overfitting and perform well on many datasets. Compared to
individual decision trees, they are also less sensitive to hyperparameters.
The working of the random forest algorithm can be summarised in four steps:
Step 1: Select random samples (with replacement) from the training dataset.
Step 2: The algorithm constructs a decision tree for every sampled training subset.
Step 3: Each decision tree votes by producing its own prediction for a new data point.
Step 4: Finally, select the most voted prediction result as the final prediction result.
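A minimal random forest sketch with scikit-learn (iris again serves as stand-in data):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split;
# the forest's prediction is the majority vote of the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))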
This combination of multiple models is called Ensemble. Ensemble uses two methods:
1. Bagging: Creating a different training subset from sample training data with replacement is
called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential models such that the final model has the highest accuracy is called Boosting. Examples: AdaBoost, XGBoost.
Bagging: From the principle mentioned above, we can understand that Random Forest uses the Bagging approach. Let us look at this concept in more detail. Bagging, also known as Bootstrap Aggregation, is the method used by random forest. The process begins with the original random data, which is organised into samples known as bootstrap samples; this step is known as Bootstrapping. Next, a model is trained on each sample individually, yielding different results; combining these individual results is known as Aggregation. In the last step, all the results are combined and the final output is based on majority voting. This overall procedure is known as Bagging and is carried out using an ensemble classifier.
Each tree makes its own decisions: Every tree in the forest makes its own predictions without
relying on others.
Random parts of the data are used: Each tree is built using random samples and features to
reduce mistakes.
Enough data is needed: Sufficient data ensures the trees are different and learn unique patterns
and variety.
Different predictions improve accuracy: Combining the predictions from different trees leads to
a more accurate final result.
Key Benefits
Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit
all the samples within training data. However, when there’s a robust number of decision trees in
a random forest, the classifier won’t overfit the model since the averaging of uncorrelated trees
lowers the overall variance and prediction error.
Provides flexibility: Since random forest can handle both regression and classification tasks
with a high degree of accuracy, it is a popular method among data scientists. Feature bagging
also makes the random forest classifier an effective tool for estimating missing values as it
maintains accuracy when a portion of the data is missing.
Key Challenges
Time-consuming process: Since random forest algorithms can handle large data sets, they can provide more accurate predictions, but they can be slow to process data because they compute results for each individual decision tree.
Requires more resources: Since random forests process larger data sets, they’ll require more
resources to store that data.
More complex: The prediction of a single decision tree is easier to interpret when compared to a
forest of them.
The random forest algorithm has been applied across a number of industries, allowing them to make
better business decisions. Some use cases include:
Finance: It is a preferred algorithm over others as it reduces time spent on data management and pre-processing tasks. It can be used to evaluate customers with high credit risk, to detect fraud, and to solve option pricing problems.
Healthcare: The random forest algorithm has applications within computational biology, allowing doctors to tackle problems such as gene expression classification, biomarker discovery, and sequence annotation. As a result, doctors can make estimates about drug responses to specific medications.
Classification Metrics
In a classification task, our main task is to predict the target variable which is in the form of discrete
values. To evaluate the performance of such a model there are metrics as mentioned below:
Classification Accuracy
Logarithmic loss
Area under Curve
F1 score
Precision
Recall
Confusion Matrix
Classification Accuracy
Classification accuracy is a fundamental metric for evaluating the performance of a classification model, providing a quick snapshot of how well the model is performing in terms of correct predictions. It is calculated as the ratio of correct predictions to the total number of input samples:
Accuracy = (number of correct predictions) / (total number of predictions).
When is it used?
It's often used as the default evaluation metric for generic classification models.
What does it indicate?
A model with perfect accuracy makes zero false positive and zero false negative predictions. Accuracy can, however, be misleading when the classes are highly imbalanced.
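A minimal sketch of the accuracy calculation (the labels below are made up):

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
# correct predictions / total predictions = 5 / 6
print(accuracy_score(y_true, y_pred))  # 0.8333...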
Logarithmic Loss
Logarithmic loss, also known as log loss or cross-entropy loss, is a metric used to evaluate the
performance of a classification model. It measures how close a model's predicted probabilities are to
the actual class labels.
How it works
Log loss penalizes models for incorrect labeling of data classes.
It takes into account the confidence of predictions, unlike accuracy which is binary.
A lower log loss value indicates more accurate predictions.
Log loss is a popular metric for measuring error in machine learning.
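For reference, binary log loss can be written as −(1/N) Σ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)], where pᵢ is the predicted probability of the positive class. A small sketch with made-up labels and probabilities:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
# Predicted probabilities of the positive class for each sample (hypothetical).
y_prob = [0.9, 0.1, 0.8, 0.35]
print(log_loss(y_true, y_prob))  # confident correct predictions keep the loss low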
When it's used
Log loss is commonly used to train binary classifiers, i.e. classification tasks with two labels.
It's also used to evaluate the performance of sentiment analysis models in natural language
processing.
Examples of use
Predicting whether it will rain or not rain in a city
Predicting whether an email is spam or not spam
Analyzing customer feedback
Monitoring social media
False-positive and False-negative Rates
The false positive rate is the proportion of actual negatives that are incorrectly identified as positives, while the false negative rate is the proportion of actual positives that are incorrectly identified as negatives.
F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). Its range is [0, 1]. This metric tells us how precise our classifier is (how many of its positive predictions are correct) and how robust it is (it does not miss a significant number of positive instances).
Precision
There is another metric named Precision. Precision is a measure of a model’s performance that tells
you how many of the positive predictions made by the model are actually correct. It is calculated as
the number of true positive predictions divided by the number of true positive and false positive
predictions.
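A small sketch (with made-up labels) that ties precision, recall and the F1 score together:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 0.75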