UNIT-II
Overview of Machine Learning:
Machine learning is a branch of artificial intelligence: the science of
building machines that can acquire new knowledge and new skills and
recognise existing knowledge. A precise definition of machine
learning is:
"A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T,
as measured by P, improves with experience E." — Tom Mitchell, 1998
Machine learning has been widely used in data mining, computer
vision, natural language processing, biometrics, search engines,
medical diagnostics, detection of credit card fraud, securities market
analysis, DNA sequencing, speech and handwriting
recognition, strategy games and robotics.
Machine learning algorithm classification:
According to the way of learning, machine learning mainly includes:
Supervised learning: Input data is labelled. Supervised learning
establishes a learning process that compares the predicted results with
the actual results of the "training data" (i.e. the input data) and
continuously adjusts the predictive model until the predictions reach
an expected accuracy. Typical problems are classification and
regression. Common algorithms include decision trees, Bayesian
classification, least squares regression, logistic regression, support
vector machines, neural networks, and so on.
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the
output on the basis of prior experiences.
o In supervised learning, we can have an exact idea about the
classes of objects.
o Supervised learning model helps us to solve various real-world
problems such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling very
complex tasks.
o Supervised learning cannot predict the correct output if the test
data is very different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the
classes of objects.
Unsupervised learning: Input data has no labels; the algorithm infers
the intrinsic structure of the data, as in clustering and association
rule learning. Common algorithms include independent component
analysis, K-Means and the Apriori algorithm.
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks as
compared to supervised learning because, in unsupervised
learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled
data in comparison to labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than
supervised learning as it does not have corresponding output.
o The result of the unsupervised learning algorithm might be less
accurate as input data is not labeled, and algorithms do not know
the exact output in advance.
Semi-supervised learning: Only part of the input data is labelled. It is
an extension of supervised learning, often used for classification and
regression. Common algorithms include graph-based inference algorithms,
Laplacian support vector machines, and so on.
Reinforcement learning: Input data is used as feedback to the model,
emphasizing how to act in an environment so as to maximize the
expected benefit. The difference from supervised learning is that it
does not require correct input/output pairs and does not require
precise correction of sub-optimal behaviour. Reinforcement learning is
more focused on online planning and requires a balance between
exploration (of the unknown) and exploitation (of existing knowledge).
Types of Reinforcement: There are two types of Reinforcement:
1. Positive –
Positive reinforcement is when an event that occurs as a result of a
particular behaviour increases the strength and frequency of that
behaviour. In other words, it has a positive effect on behaviour.
Advantages of positive reinforcement:
Maximizes performance
Sustains change for a long period of time
Disadvantages of positive reinforcement:
Too much reinforcement can lead to an overload of states, which
can diminish the results
2. Negative –
Negative reinforcement is the strengthening of a behaviour because a
negative condition is stopped or avoided.
Advantages of negative reinforcement:
Increases behaviour
Provides defiance against a minimum standard of performance
Disadvantages of negative reinforcement:
It only provides enough to meet the minimum behaviour
Supervised Machine Learning
Supervised learning is the type of machine learning in which machines
are trained using well-"labelled" training data, and on the basis of
that data, machines predict the output. Labelled data means that the
input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works
as the supervisor that teaches the machine to predict the output
correctly. It applies the same concept as a student learning under the
supervision of a teacher.
Supervised learning is a process of providing input data as well as
correct output data to the machine learning model. The aim of a
supervised learning algorithm is to find a mapping function that maps
the input variable (x) to the output variable (y).
In the real-world, supervised learning can be used for Risk
Assessment, Image classification, Fraud Detection, spam filtering,
etc.
How Supervised Learning Works
In supervised learning, models are trained using a labelled dataset,
where the model learns about each type of data. Once the training
process is completed, the model is tested on test data (held out from
the original dataset), and then it predicts the output.
The working of Supervised learning can be easily understood by the
below example and diagram:
Suppose we have a dataset of different types of shapes which includes
square, rectangle, triangle, and Polygon. Now the first step is that we
need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then
it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as
a triangle.
o If the given shape has six equal sides, then it will be labelled
as a hexagon.
Now, after training, we test our model using the test set, and the task of
the model is to identify the shape.
The machine is already trained on all types of shapes, and when it finds
a new shape, it classifies the shape on the basis of its number of
sides and predicts the output.
Steps Involved in Supervised Learning:
o First, determine the type of training dataset.
o Collect/Gather the labelled training data.
o Split the dataset into a training dataset, a test dataset, and a
validation dataset.
o Determine the input features of the training dataset, which should
have enough knowledge so that the model can accurately predict
the output.
o Determine the suitable algorithm for the model, such as support
vector machine, decision tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need a
validation set to tune control parameters; this is a subset of the
training dataset.
o Evaluate the accuracy of the model by providing the test set. If
the model predicts the correct outputs, the model is accurate.
Types of supervised Machine learning Algorithms:
Supervised learning can be further divided into two types of problems:
1. Regression
Regression algorithms are used if there is a relationship between the
input variable and the output variable. It is used for the prediction of
continuous variables, such as Weather forecasting, Market Trends, etc.
Below are some popular Regression algorithms which come under
supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is
categorical, meaning the output belongs to a class such as Yes-No,
Male-Female, True-False, etc., as in spam filtering.
Below are some popular Classification algorithms which come under
supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Splitting a dataset:
Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a
machine learning algorithm.
It can be used for classification or regression problems and can be
used for any supervised learning algorithm.
The procedure involves taking a dataset and dividing it into two
subsets. The first subset is used to fit the model and is referred to as
the training dataset. The second subset is not used to train the model;
instead, the input element of the dataset is provided to the model, then
predictions are made and compared to the expected values. This
second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.
The objective is to estimate the performance of the machine learning
model on new data: data not used to train the model.
This is how we expect to use the model in practice. Namely, to fit it
on available data with known inputs and outputs, then make
predictions on new examples in the future where we do not have the
expected output or target values.
The train-test procedure is appropriate when there is a sufficiently
large dataset available.
To train any machine learning model, irrespective of the type of dataset
being used, you have to split the dataset into training data and testing
data. Let us look at how this can be done using the train_test_split
function from sklearn:
from sklearn.model_selection import train_test_split
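A minimal sketch of such a split (the feature matrix X and label vector
y below are hypothetical, made up only for illustration):

from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical data: 100 samples with 3 features each, and binary labels
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)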
The train-test split is not appropriate when the dataset is small. The
reason is that, when a small dataset is split into train and test sets,
there will not be enough data in the training dataset for the model to
learn an effective mapping of inputs to outputs, and there will also not
be enough data in the test set to effectively evaluate the model
performance.
Bayes Theorem of Conditional Probability
Marginal probability is the probability of an event, irrespective of
other random variables. If the random variable is independent of the
other variables, then the marginal probability is simply the probability
of the event; otherwise, it is the probability of the event summed over
all outcomes of the variables it depends on, which is called the sum
rule.
Marginal Probability: The probability of an event irrespective of the
outcomes of other random variables, e.g. P(A).
The joint probability is the probability of two (or more) simultaneous
events, often described in terms of events A and B from two
dependent random variables, e.g. X and Y. The joint probability is
often summarized as just the outcomes, e.g. A and B.
Joint Probability: Probability of two (or more) simultaneous events,
e.g. P(A and B) or P(A, B).
The conditional probability is the probability of one event given the
occurrence of another event, often described in terms of events A and
B from two dependent random variables e.g. X and Y.
Conditional Probability: Probability of one (or more) event given
the occurrence of another event, e.g. P(A given B) or P(A | B).
The joint probability can be calculated using the conditional
probability; for example:
P(A, B) = P(A | B) * P(B)
This is called the product rule. Importantly, the joint probability is
symmetrical, meaning that:
P(A, B) = P(B, A)
The conditional probability can be calculated using the joint
probability; for example:
P(A | B) = P(A, B) / P(B)
The conditional probability is not symmetrical; for example:
P(A | B) != P(B | A)
Now, there is another way to calculate the conditional probability.
Specifically, one conditional probability can be calculated using the
other conditional probability; for example:
P(A|B) = P(B|A) * P(A) / P(B)
The reverse is also true; for example:
P(B|A) = P(A|B) * P(B) / P(A)
This alternate approach of calculating the conditional probability is
useful either when the joint probability is challenging to calculate
(which is most of the time), or when the reverse conditional
probability is available or easy to calculate.
This alternate calculation of the conditional probability is referred to
as Bayes Rule or Bayes Theorem.
Bayes Theorem: Principled way of calculating a conditional
probability without the joint probability.
It is often the case that we do not have access to the denominator
directly, e.g. P(B).
We can calculate it in an alternative way; for example:
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
This gives a formulation of Bayes Theorem that uses this alternate
calculation of P(B), described below:
P(A|B) = P(B|A) * P(A) / P(B|A) * P(A) + P(B|not A) * P(not A)
Or with brackets around the denominator for clarity:
P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))
Note that P(not A) is simply the complement of P(A); for example:
P(not A) = 1 – P(A)
Additionally, if we have P(not B|not A), then we can calculate P(B|not
A) as its complement; for example:
P(B|not A) = 1 – P(not B|not A)
Now that we are familiar with the calculation of Bayes Theorem, let’s
take a closer look at the meaning of the terms in the equation.
Firstly, in general, the result P(A|B) is referred to as the posterior
probability and P(A) is referred to as the prior probability.
P(A|B): Posterior probability.
P(A): Prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is referred
to as the evidence.
P(B|A): Likelihood.
P(B): Evidence.
This allows Bayes Theorem to be restated as:
Posterior = Likelihood * Prior / Evidence
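As a small numeric sketch of this restatement (all probability values
below are invented purely for illustration):

# Hypothetical values: P(A) is the prior, P(B|A) the likelihood,
# P(B|not A) the likelihood under the complement of A.
p_a = 0.01             # P(A)
p_b_given_a = 0.9      # P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# Evidence via the sum rule: P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes Theorem: posterior = likelihood * prior / evidence
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # about 0.0833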
Bayes Theorem for Modeling Hypotheses:
Bayes Theorem is a useful tool in applied machine learning. It
provides a way of thinking about the relationship between data and a
model.
A machine learning algorithm or model is a specific way of thinking
about the structured relationships in the data. In this way, a model can
be thought of as a hypothesis about the relationships in the data, such
as the relationship between input (X) and output (y). The practice of
applied machine learning is the testing and analysis of different
hypotheses (models) on a given dataset.
Bayes Theorem provides a probabilistic model to describe the
relationship between data (D) and a hypothesis (h); for example:
P(h|D) = P(D|h) * P(h) / P(D)
Breaking this down, it says that the probability of a given hypothesis
holding or being true given some observed data can be calculated as
the probability of observing the data given the hypothesis multiplied
by the probability of the hypothesis being true regardless of the data,
divided by the probability of observing the data regardless of the
hypothesis.
Bayes theorem provides a way to calculate the probability of a
hypothesis based on its prior probability, the probabilities of
observing various data given the hypothesis, and the observed data
itself.
Under this framework, each piece of the calculation has a specific
name; for example:
P(h|D): Posterior probability of the hypothesis (the thing we want to
calculate).
P(h): Prior probability of the hypothesis.
This gives a useful framework for thinking about and modeling a
machine learning problem.
If we have some prior domain knowledge about the hypothesis, this is
captured in the prior probability. If we don’t, then all hypotheses may
have the same prior probability.
If the probability of observing the data P(D) increases, then the
probability of the hypothesis holding given the data P(h|D) decreases.
Conversely, if the probability of the hypothesis P(h) and the
probability of observing the data given hypothesis increases, the
probability of the hypothesis holding given the data P(h|D) increases.
The notion of testing different models on a dataset in applied machine
learning can be thought of as estimating the probability of each
hypothesis (h1, h2, h3, … in H) being true given the observed data.
The optimization or seeking the hypothesis with the maximum
posterior probability in modeling is called maximum a posteriori or
MAP for short.
Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis. We can determine the MAP hypotheses
by using Bayes theorem to calculate the posterior probability of each
candidate hypothesis.
Under this framework, the probability of the data (D) is constant as it
is used in the assessment of each hypothesis. Therefore, it can be
removed from the calculation to give the simplified unnormalized
estimate as follows:
max h in H P(h|D) = P(D|h) * P(h)
If we do not have any prior information about the hypothesis being
tested, they can be assigned a uniform probability, and this term too
will be a constant and can be removed from the calculation to give the
following:
max h in H P(h|D) = P(D|h)
That is, the goal is to locate a hypothesis that best explains the
observed data.
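A tiny sketch of this unnormalized MAP search over a handful of
candidate hypotheses (the likelihood and prior values are invented
purely for illustration):

# Hypothetical posterior comparison for three candidate hypotheses
likelihoods = {"h1": 0.20, "h2": 0.05, "h3": 0.10}   # P(D|h)
priors      = {"h1": 0.30, "h2": 0.50, "h3": 0.20}   # P(h)

# Unnormalized posterior P(D|h) * P(h); P(D) is constant and can be dropped
scores = {h: likelihoods[h] * priors[h] for h in likelihoods}
h_map = max(scores, key=scores.get)
print(scores, h_map)  # h1 has the largest score (0.06), so h1 is the MAP hypothesis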
Fitting models like linear regression for predicting a numerical value,
and logistic regression for binary classification can be framed and
solved under the MAP probabilistic framework. This provides an
alternative to the more common maximum likelihood estimation
(MLE) framework.
Bayes Theorem for Classification
Classification is a predictive modeling problem that involves
assigning a label to a given input data sample.
The problem of classification predictive modeling can be framed as
calculating the conditional probability of a class label given a data
sample, for example:
P(class|data) = (P(data|class) * P(class)) / P(data)
Where P(class|data) is the probability of class given the provided data.
This calculation can be performed for each class in the problem and
the class that is assigned the largest probability can be selected and
assigned to the input data.
In practice, it is very challenging to calculate full Bayes Theorem for
classification.
The priors for the class and the data are easy to estimate from a
training dataset, if the dataset is suitably representative of the
broader problem.
Estimating the conditional probability of the observation given the
class, P(data|class), is not feasible unless the number of examples is
extraordinarily large, e.g. large enough to effectively estimate the
probability distribution for all different possible combinations of
values. This is almost never the case; we will not have sufficient
coverage of the domain.
As such, the direct application of Bayes Theorem also becomes
intractable, especially as the number of variables or features (n)
increases.
Naive Bayes Classifier
The Bayes Theorem assumes that each input variable is dependent
upon all other variables. This is a cause of complexity in the
calculation. We can remove this assumption and consider each input
variable as being independent from each other.
This changes the model from a dependent conditional probability
model to an independent conditional probability model and
dramatically simplifies the calculation.
This means that we calculate P(data|class) for each input variable
separately and multiply the results together; for example:
P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … *
P(Xn|class) * P(class) / P(data)
We can also drop the probability of observing the data as it is a
constant for all calculations, for example:
P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … *
P(Xn|class) * P(class)
This simplification of Bayes Theorem is common and widely used for
classification predictive modeling problems and is generally referred
to as Naive Bayes.
Bayes Classifier: Probabilistic model that makes the most probable
prediction for new examples.
What is the most probable classification of the new instance given the
training data?
This is different from the MAP framework that seeks the most
probable hypothesis (model). Instead, we are interested in making a
specific prediction.
The equation below demonstrates how to calculate the conditional
probability for a new instance (vj) given the training data (D) and a
space of hypotheses (H):
P(vj | D) = sum {hi in H} P(vj | hi) * P(hi | D)
Where vj is a new instance to be classified, H is the set of hypotheses
for classifying the instance, hi is a given hypothesis, P(vj | hi) is
the posterior probability for vj given hypothesis hi, and P(hi | D) is
the posterior probability of the hypothesis hi given the data D.
Linear Regression:
Linear Regression Model Representation
Linear regression is an attractive model because the representation is
so simple.
The representation is a linear equation that combines a specific set of
input values (x) the solution to which is the predicted output for that
set of input values (y). As such, both the input values (x) and the
output value are numeric.
The linear equation assigns one scale factor to each input value or
column, called a coefficient and represented by the capital Greek
letter Beta (B). One additional coefficient is also added, giving the
line an additional degree of freedom (e.g. moving up and down on a
two-dimensional plot) and is often called the intercept or the bias
coefficient.
For example, in a simple regression problem (a single x and a single
y), the form of the model would be:
y = B0 + B1*x
In higher dimensions when we have more than one input (x), the line
is called a plane or a hyper-plane. The representation therefore is the
form of the equation and the specific values used for the coefficients
(e.g. B0 and B1 in the above example).
It is common to talk about the complexity of a regression model like
linear regression. This refers to the number of coefficients used in the
model.
When a coefficient becomes zero, it effectively removes the influence
of the input variable on the model and therefore from the prediction
made from the model (0 * x = 0). This becomes relevant if you look at
regularization methods that change the learning algorithm to reduce
the complexity of regression models by putting pressure on the
absolute size of the coefficients, driving some to zero.
Now that we understand the representation used for a linear
regression model, let’s review some ways that we can learn this
representation from data.
Linear Regression Learning the Model
Learning a linear regression model means estimating the values of the
coefficients used in the representation with the data.
In this section we will take a brief look at four techniques to prepare a
linear regression model.
1. Simple Linear Regression
With simple linear regression when we have a single input, we can
use statistics to estimate the coefficients.
This requires that you calculate statistical properties from the data
such as means, standard deviations, correlations and covariance. All
of the data must be available to traverse and calculate statistics.
2. Ordinary Least Squares
When we have more than one input we can use Ordinary Least
Squares to estimate the values of the coefficients.
The Ordinary Least Squares procedure seeks to minimize the sum of
the squared residuals. This means that given a regression line through
the data we calculate the distance from each data point to the
regression line, square it, and sum all of the squared errors together.
This is the quantity that ordinary least squares seeks to minimize.
This approach treats the data as a matrix and uses linear algebra
operations to estimate the optimal values for the coefficients. It means
that all of the data must be available and you must have enough
memory to fit the data and perform matrix operations.
It is unusual to implement the Ordinary Least Squares procedure
yourself unless as an exercise in linear algebra. It is more likely that
you will call a procedure in a linear algebra library. This procedure is
very fast to calculate.
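A minimal sketch of calling such a linear algebra routine to solve the
least squares problem (NumPy assumed; the data are synthetic and made
up for illustration):

import numpy as np

# Hypothetical data: y = 2 + 3*x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)

# Add a column of ones so the intercept B0 is estimated alongside B1
X = np.column_stack([np.ones_like(x), x])

# Solve the least squares problem: minimize ||X @ B - y||^2
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # approximately [2.0, 3.0] -> B0, B1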
3. Gradient Descent
When there are one or more inputs you can use a process of
optimizing the values of the coefficients by iteratively minimizing the
error of the model on your training data.
This operation is called Gradient Descent and works by starting with
random values for each coefficient. The sum of the squared errors is
calculated for each pair of input and output values. A learning rate is
used as a scale factor and the coefficients are updated in the direction
that minimizes the error. The process is repeated until a minimum sum
squared error is achieved or no further improvement is possible.
When using this method, you must select a learning rate (alpha)
parameter that determines the size of the improvement step to take on
each iteration of the procedure.
Gradient descent is often taught using a linear regression model
because it is relatively straightforward to understand. In practice, it is
useful when you have a very large dataset either in the number of
rows or the number of columns that may not fit into memory.
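A minimal sketch of batch gradient descent for simple linear regression
(the data, the learning rate alpha and the iteration count are all
arbitrary illustrative choices):

import numpy as np

# Hypothetical data following roughly y = 1 + 2*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

b0, b1 = 0.0, 0.0   # start with arbitrary coefficients
alpha = 0.01        # learning rate
for _ in range(5000):
    yhat = b0 + b1 * x
    error = yhat - y
    # Gradients of the mean squared error with respect to b0 and b1
    b0 -= alpha * 2 * error.mean()
    b1 -= alpha * 2 * (error * x).mean()

print(round(b0, 2), round(b1, 2))  # close to 1 and 2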
4. Regularization
There are extensions of the training of the linear model called
regularization methods. These seek both to minimize the sum of the
squared error of the model on the training data (as in ordinary least
squares) and to reduce the complexity of the model (such as the number
or absolute size of the coefficients in the model).
Two popular examples of regularization procedures for linear
regression are:
Lasso Regression: where Ordinary Least Squares is modified to also
minimize the absolute sum of the coefficients (called L1
regularization).
Ridge Regression: where Ordinary Least Squares is modified to also
minimize the sum of the squared coefficients (called L2
regularization).
These methods are effective to use when there is collinearity in your
input values and ordinary least squares would overfit the training data.
Now that you know some techniques to learn the coefficients in a
linear regression model, let’s look at how we can use a model to make
predictions on new data.
Making Predictions with Linear Regression
Given the representation is a linear equation, making predictions is as
simple as solving the equation for a specific set of inputs.
Let’s make this concrete with an example. Imagine we are predicting
weight (y) from height (x). Our linear regression model representation
for this problem would be:
y = B0 + B1 * x1
or
weight = B0 + B1 * height
Where B0 is the bias coefficient and B1 is the coefficient for the
height column. We use a learning technique to find a good set of
coefficient values. Once found, we can plug in different height values
to predict the weight.
For example, lets use B0 = 0.1 and B1 = 0.5. Let’s plug them in and
calculate the weight (in kilograms) for a person with the height of 182
centimeters.
weight = 0.1 + 0.5 * 182
weight = 91.1
You can see that the above equation could be plotted as a line in two-
dimensions. The B0 is our starting point regardless of what height we
have. We can run through a range of heights from 100 to 250
centimeters, plug them into the equation and get weight values,
creating our line.
Sample Height vs Weight Linear Regression
Now that we know how to make predictions given a learned linear
regression model, let’s look at some rules of thumb for preparing our
data to make the most of this type of model.
Linear Assumption. Linear regression assumes that the relationship
between your input and output is linear. It does not support anything
else. This may be obvious, but it is good to remember when you have
a lot of attributes. You may need to transform data to make the
relationship linear.
Remove Noise. Linear regression assumes that your input and output
variables are not noisy. Consider using data cleaning operations that
let you better expose and clarify the signal in your data. This is most
important for the output variable and you want to remove outliers in
the output variable (y) if possible.
Remove Collinearity. Linear regression will over-fit your data when
you have highly correlated input variables. Consider calculating
pairwise correlations for your input data and removing the most
correlated.
Gaussian Distributions. Linear regression will make more reliable
predictions if your input and output variables have a Gaussian
distribution. You may get some benefit from using transforms (e.g. log
or Box-Cox) on your variables to make their distribution more
Gaussian-looking.
Rescale Inputs: Linear regression will often make more reliable
predictions if you rescale input variables using standardization or
normalization.
Assumptions in Regression
Regression is a parametric approach: 'parametric' means it makes
assumptions about the data for the purpose of analysis. Because of this
parametric side, regression is restrictive in nature and fails to
deliver good results with data sets that do not fulfil its assumptions.
Therefore, for a successful regression analysis, it is essential to
validate these assumptions.
Let’s look at the important assumptions in regression analysis:
1. There should be a linear and additive relationship between
dependent (response) variable and independent (predictor)
variable(s). A linear relationship suggests that the change in the
response Y due to a one-unit change in X1 is constant, regardless of
the value of X1. An additive relationship suggests that the effect
of X1 on Y is independent of the other variables.
2. There should be no correlation between the residual (error) terms.
The presence of correlation between error terms is known as
autocorrelation.
3. The independent variables should not be correlated with each other.
The presence of correlation among independent variables is known as
multicollinearity.
4. The error terms must have constant variance. This phenomenon
is known as homoskedasticity. The presence of non-constant
variance is referred to as heteroskedasticity.
5. The error terms must be normally distributed.
Regularization
This is a form of regression that constrains, regularizes or shrinks the
coefficient estimates towards zero. In other words, this technique
discourages learning a more complex or flexible model, so as to
avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y
represents the learned relation and β represents the coefficient
estimates for different variables or predictors(X).
Y ≈ β0 + β1X1 + β2X2 + …+ βpXp
The fitting procedure involves a loss function, known as the residual
sum of squares or RSS:
RSS = sum i=1 to n (yi – β0 – sum j=1 to p βjXij)^2
The coefficients are chosen such that they minimize this loss function.
Now, this will adjust the coefficients based on your training data. If
there is noise in the training data, then the estimated coefficients
won’t generalize well to the future data. This is where regularization
comes in and shrinks or regularizes these learned estimates towards
zero.
Ridge Regression
In ridge regression, the RSS is modified by adding a shrinkage quantity,
so that the quantity being minimized is:
RSS + λ sum j=1 to p βj^2
The coefficients are estimated by minimizing this function. Here, λ is
the tuning parameter that decides how much we want to penalize the
flexibility of our model. The increase in flexibility of a model is
represented by an increase in its coefficients, and if we want to
minimize the above function, then these coefficients need to be small.
This is how the ridge regression technique prevents coefficients from
rising too high. Also, notice that we shrink the estimated association
of each variable with the response, except the intercept β0. This
intercept is a measure of the mean value of the response when
xi1 = xi2 = … = xip = 0.
When λ = 0, the penalty term has no effect, and the estimates produced
by ridge regression will be equal to least squares. However, as λ→∞,
the impact of the shrinkage penalty grows, and the ridge regression
coefficient estimates will approach zero. As can be seen, selecting a
good value of λ is critical. Cross-validation comes in handy for this
purpose. The penalty used by this method is based on the (squared) L2
norm of the coefficients, which is why ridge regression is also known
as L2 regularization.
The coefficients that are produced by the standard least squares
method are scale equivariant, i.e. if we multiply each input by c then
the corresponding coefficients are scaled by a factor of 1/c. Therefore,
regardless of how the predictor is scaled, the multiplication of
predictor and coefficient(Xjβj) remains the same. However, this is not
the case with ridge regression, and therefore we need to standardize
the predictors, i.e. bring them to the same scale, before performing
ridge regression. This is typically done by dividing each predictor by
its standard deviation.
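A minimal sketch of ridge regression with standardized predictors,
assuming scikit-learn is available. Note that scikit-learn calls the
tuning parameter λ "alpha"; the data and the alpha value below are
purely illustrative:

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np

# Hypothetical data with a nearly collinear pair of predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)   # almost a copy of column 0
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 1.0]) + rng.normal(0, 0.1, size=100)

# Standardize the predictors, then fit ridge with shrinkage parameter alpha (λ)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)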
Lasso
Lasso is another variation, in which the quantity minimized is the RSS
plus the penalty λ sum j=1 to p |βj|. It is clear that this variation
differs from ridge regression only in how it penalizes high
coefficients: it uses |βj| (the modulus) instead of the square of βj as
its penalty. In statistics, this is known as the L1 norm.
A standard least squares model tends to have some variance in it, i.e.
this model won’t generalize well for a data set different than its
training data. Regularization, significantly reduces the variance of
the model, without substantial increase in its bias. So the tuning
parameter λ, used in the regularization techniques described above,
controls the impact on bias and variance. As the value of λ rises, it
reduces the value of the coefficients and thus reduces the variance. Up
to a point, this increase in λ is beneficial, as it only reduces the
variance (hence avoiding overfitting) without losing any important
properties in the data. But after a certain value, the model starts
losing important properties, giving rise to bias in the model and thus
underfitting. Therefore, the value of λ should be carefully selected.
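A minimal sketch of lasso driving irrelevant coefficients to zero,
assuming scikit-learn is available; the data and the alpha (i.e. λ)
value are illustrative:

from sklearn.linear_model import Lasso
import numpy as np

# Hypothetical data: only the first two of five predictors matter
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# The L1 penalty (strength set by alpha, i.e. λ) shrinks coefficients;
# those of the irrelevant predictors end up at or very near zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)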
Elastic Net Regression
Linear regression refers to a model that assumes a linear relationship
between input variables and the target variable.
With a single input variable, this relationship is a line, and with higher
dimensions, this relationship can be thought of as a hyperplane that
connects the input variables to the target variable. The coefficients of
the model are found via an optimization process that seeks to
minimize the sum squared error between the predictions (yhat) and
the expected target values (y).
loss = sum i=1 to n (y_i – yhat_i)^2
A problem with linear regression is that estimated coefficients of the
model can become large, making the model sensitive to inputs and
possibly unstable. This is particularly true for problems with few
observations (samples) or with more input predictors (p) than samples
(n), so-called p >> n problems.
One approach to addressing the stability of regression models is to
change the loss function to include additional costs for a model that
has large coefficients. Linear regression models that use these
modified loss functions during training are referred to collectively as
penalized linear regression.
One popular penalty is to penalize a model based on the sum of the
squared coefficient values. This is called an L2 penalty. An L2
penalty minimizes the size of all coefficients, although it prevents any
coefficients from being removed from the model.
l2_penalty = sum j=1 to p beta_j^2
Another popular penalty is to penalize a model based on the sum of
the absolute coefficient values. This is called the L1 penalty. An L1
penalty minimizes the size of all coefficients and allows some
coefficients to be minimized to the value zero, which removes the
predictor from the model.
l1_penalty = sum j=1 to p abs(beta_j)
Elastic net is a penalized linear regression model that includes both
the L1 and L2 penalties during training.
A hyperparameter “alpha” is provided to assign how much weight is
given to each of the L1 and L2 penalties. Alpha is a value between 0
and 1 and is used to weight the contribution of the L1 penalty and one
minus the alpha value is used to weight the L2 penalty.
elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)
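A minimal sketch with scikit-learn (assumed available). Note that
scikit-learn's ElasticNet uses "alpha" for the overall penalty strength
and "l1_ratio" for the L1/L2 mixing weight that the text above calls
alpha; the data and parameter values are illustrative:

from sklearn.linear_model import ElasticNet
import numpy as np

# Hypothetical data: two informative predictors out of ten
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=200)

# alpha = overall penalty strength; l1_ratio = weight of the L1 part of the penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)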
Principle of Naive Bayes Classifier:
A Naive Bayes classifier is a probabilistic machine learning model that
is used for classification tasks. The crux of the classifier is based on
the Bayes theorem.
Bayes Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes theorem, we can find the probability of A happening,
given that B has occurred. Here, B is the evidence and A is the
hypothesis. The assumption made here is that the predictors/features
are independent; that is, the presence of one particular feature does
not affect another. Hence it is called naive.
Example:
Let us take an example to get some better intuition. Consider the
problem of deciding whether to play golf. We classify whether the day
is suitable for playing golf, given the features of the day: the
columns of the dataset represent these features (outlook, temperature,
humidity, windy) and the rows represent individual days. The first row
of such a dataset might record that a day is not suitable for playing
golf when the outlook is rainy, the temperature is hot, the humidity is
high and it is not windy. We make two assumptions here. One, as stated
above, is that these predictors are independent; that is, if the
temperature is hot, it does
not necessarily mean that the humidity is high. Another assumption
made here is that all the predictors have an equal effect on the
outcome. That is, the day being windy does not have more importance
in deciding to play golf or not.
According to this example, Bayes theorem can be rewritten as:
P(y|X) = P(X|y) * P(y) / P(X)
The variable y is the class variable (play golf), which represents
whether it is suitable to play golf or not given the conditions. The
variable X represents the parameters/features and is given as:
X = (x1, x2, …, xn)
Here x1, x2, …, xn represent the features, i.e. they can be mapped to
outlook, temperature, humidity and windy. By substituting for X and
expanding using the chain rule (together with the independence
assumption) we get:
P(y | x1, …, xn) = P(x1|y) * P(x2|y) * … * P(xn|y) * P(y) /
(P(x1) * P(x2) * … * P(xn))
Now, you can obtain the values for each term by looking at the dataset
and substituting them into the equation. For all entries in the
dataset, the denominator does not change; it remains static. Therefore,
the denominator can be removed and a proportionality introduced:
P(y | x1, …, xn) ∝ P(y) * P(x1|y) * P(x2|y) * … * P(xn|y)
In our case, the class variable (y) has only two outcomes, yes or no,
but there could be cases where the classification is multiclass.
Therefore, we need to find the class y with the maximum probability:
y = argmax over y of P(y) * P(x1|y) * P(x2|y) * … * P(xn|y)
Using the above function, we can obtain the class, given the
predictors.
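A minimal sketch of this idea using scikit-learn's CategoricalNB
(assumed available); the tiny play-golf style rows below are invented
purely for illustration:

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows: outlook, temperature, humidity, windy -> play golf?
X_raw = [["rainy", "hot", "high", "no"],
         ["sunny", "mild", "normal", "no"],
         ["overcast", "cool", "normal", "yes"],
         ["sunny", "hot", "high", "yes"],
         ["rainy", "cool", "normal", "no"],
         ["overcast", "mild", "high", "no"]]
y = ["no", "yes", "yes", "no", "yes", "yes"]

# Encode the categorical features as integers, then fit the classifier
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

model = CategoricalNB()
model.fit(X, y)

# Predict the class with the maximum posterior probability for a new day
new_day = encoder.transform([["sunny", "cool", "high", "no"]])
print(model.predict(new_day), model.predict_proba(new_day))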
K-Nearest Neighbor(KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and
the available cases and puts the new case into the category that
is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new
data point based on similarity. This means that when new data
appears, it can be easily classified into a well-suited category
by using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification
problems.
o K-NN is a non-parametric algorithm, which means it does not
make any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not
learn from the training set immediately instead it stores the
dataset and at the time of classification, it performs an action on
the dataset.
o The KNN algorithm at the training phase just stores the dataset,
and when it gets new data, it classifies that data into the
category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks
similar to both a cat and a dog, but we want to know whether it is
a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will
compare the features of the new image with those of the cat and
dog images and, based on the most similar features, it will put it
in either the cat or the dog category.
Suppose there are two categories, i.e., Category A and Category B, and
we have a new data point x1; in which of these categories will this
data point lie? To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily identify the category
or class of a particular data point. Consider the below diagram:
The K-NN working can be explained on the basis of the below
algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of
neighbors
o Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data
points in each category.
o Step-5: Assign the new data points to that category for which the
number of the neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will
choose the k=5.
o Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance is the distance between two points,
which we have already studied in geometry. For two points (x1, y1)
and (x2, y2) it can be calculated as:
d = sqrt((x2 – x1)^2 + (y2 – y1)^2)
o By calculating the Euclidean distance we got the nearest
neighbors, as three nearest neighbors in category A and two
nearest neighbors in category B. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence
this new data point must belong to category A.
Below are some points to remember while selecting the value of K in
the K-NN algorithm:
o There is no particular way to determine the best value for "K", so
we need to try some values to find the best out of them. The most
preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and
lead to the effects of outliers in the model.
o Large values for K reduce the effect of noise, but a value that is
too large may cause the model to miss local patterns in the data.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which may be complex
at times.
o The computation cost is high because the distance between the new
data point and all the training samples must be calculated.
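A minimal sketch of K-NN classification with scikit-learn (assumed
available); the 2-D points and their categories are made up for
illustration:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Hypothetical 2-D points from two categories, A (0) and B (1)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 5 neighbours, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Classify a new data point by majority vote among its 5 nearest neighbours
print(knn.predict([[4, 4]]))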
Logistic Regression
Logistic regression is a classification algorithm, used when the value
of the target variable is categorical in nature. Logistic regression is
most commonly used when the data in question has binary output, so
when it belongs to one class or another, or is either a 0 or 1.
Remember that classification tasks have discrete categories, unlike
regressions tasks.
Given the idea of using a regression model to solve the classification
problem, a natural question is whether we can draw a hypothesis
function that fits a binary dataset. For simplicity, we only consider
the binary classification problem here.
The answer is that you will have to use a type of function different
from linear functions, called a logistic function, or a sigmoid
function.
Sigmoid Function
The sigmoid function/logistic function is a function that resembles an
"S"-shaped curve when plotted on a graph. It takes any real-valued
input and "squishes" it into the range between 0 and 1, pushing values
towards the margins at the top and bottom. The equation for the sigmoid
function is:
sigmoid(z) = 1 / (1 + e^(-z))
Let's see how the sigmoid function represents the given data.
This gives a value that is extremely close to 0 if z is a large
negative value and close to 1 if z is a large positive value. In
logistic regression, the output of a typical linear function of the
inputs is passed through the sigmoid, which squeezes it towards 0 or 1,
so that the inputs can be put into distinct categories.
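A minimal sketch of the sigmoid function in Python:

import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-6), sigmoid(0), sigmoid(6))  # ~0.0025, 0.5, ~0.9975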
It is important to understand that logistic regression should only be
used when the target variables fall into discrete categories and that if
there’s a range of continuous values the target value might be, logistic
regression should not be used. Examples of situations you might use
logistic regression in include:
Predicting if an email is spam or not spam
Whether a tumor is malignant or benign
Whether a mushroom is poisonous or edible.
When using logistic regression, a threshold is usually specified that
indicates at what value the example will be put into one class vs. the
other class. In the spam classification task, a threshold of 0.5 might be
set, which would cause an email with a 50% or greater probability of
being spam to be classified as “spam” and any email with probability
less than 50% classified as “not spam”.
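A minimal sketch of fitting a logistic regression and applying a 0.5
threshold to the predicted probabilities (scikit-learn assumed; the
single spam-word-count feature and its labels are invented for
illustration):

from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical spam data: one feature (e.g. a spam-word count) and labels
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 1 = spam, 0 = not spam

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the probability of each class; apply a 0.5 threshold
probs = clf.predict_proba([[2], [5]])[:, 1]
print(probs, probs >= 0.5)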
Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised
machine learning algorithms which are used both for classification and
regression. But generally, they are used in classification problems.
SVMs were first introduced in the 1960s and later refined in the 1990s.
SVMs have their unique way of implementation as compared to other
machine learning algorithms. Lately, they are extremely popular
because of their ability to handle multiple continuous and categorical
variables.
Working of SVM
An SVM model is basically a representation of different classes in a
hyperplane in multidimensional space. The hyperplane will be
generated in an iterative manner by SVM so that the error can be
minimized. The goal of SVM is to divide the datasets into classes to
find a maximum marginal hyperplane (MMH).
The followings are important concepts in SVM −
Support Vectors − Data points that are closest to the hyperplane
are called support vectors. The separating line will be defined
with the help of these data points.
Hyperplane − It is a decision plane or space that separates a set
of objects having different classes.
Margin − It may be defined as the gap between the two lines on the
closest data points of different classes. It can be calculated as
the perpendicular distance from the line to the support vectors. A
large margin is considered a good margin and a small margin is
considered a bad margin.
The main goal of SVM is to divide the datasets into classes to find a
maximum marginal hyperplane (MMH) and it can be done in the
following two steps −
First, SVM will generate hyperplanes iteratively that segregate
the classes in the best way.
Then, it will choose the hyperplane that separates the classes
correctly.
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that
transforms the input data space into the required form. SVM uses a
technique called the kernel trick, in which the kernel takes a
low-dimensional input space and transforms it into a higher-dimensional
space. In simple words, the kernel converts a non-separable problem
into a separable problem by adding more dimensions to it. This makes
SVM more powerful, flexible and accurate. The following are some of the
types of kernels used by SVM.
Linear Kernel
It can be used as a dot product between any two observations. The
formula of the linear kernel is as below −
K(x, xi) = sum(x * xi)
From the above formula, we can see that the product between two
vectors x and xi is the sum of the multiplication of each pair of input
values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish
curved or nonlinear input spaces. Following is the formula for the
polynomial kernel −
k(X, Xi) = (1 + sum(X * Xi))^d
Here d is the degree of the polynomial, which we need to specify
manually in the learning algorithm.
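A minimal sketch of fitting SVMs with a linear and a polynomial kernel
using scikit-learn's SVC (assumed available); the data are synthetic
and the degree value is an illustrative choice:

from sklearn.svm import SVC
import numpy as np

# Hypothetical two-class data that is not linearly separable
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# A linear kernel and a polynomial kernel of degree d = 3
linear_svm = SVC(kernel="linear").fit(X, y)
poly_svm = SVC(kernel="poly", degree=3).fit(X, y)

print(linear_svm.score(X, y), poly_svm.score(X, y))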
Pros and Cons of SVM Classifiers
Pros of SVM classifiers
SVM classifiers offer great accuracy and work well in high-dimensional
spaces. SVM classifiers basically use a subset of the training points
and hence use very little memory.
Cons of SVM classifiers
They have a high training time and hence, in practice, are not suitable
for large datasets. Another disadvantage is that SVM classifiers do not
work well with overlapping classes.
Decision Tree:
A decision tree is a flowchart-like structure in which each internal
node represents a test on a feature (e.g. whether a coin flip comes up
heads or tails), each leaf node represents a class label (the decision
taken after computing all features) and branches represent conjunctions
of features that lead to those class labels. The paths from root to
leaf represent classification rules. The diagram below illustrates the
basic flow of a decision tree for decision making with labels
(Rain (Yes), No Rain (No)).
Decision Tree for Rain Forecasting
Decision tree is one of the predictive modelling approaches used
in statistics, data mining and machine learning.
Decision trees are constructed via an algorithmic approach that
identifies ways to split a data set based on different conditions. It is
one of the most widely used and practical methods for supervised
learning. Decision Trees are a non-parametric supervised
learning method used for both classification and regression tasks.
Tree models where the target variable can take a discrete set of values
are called classification trees. Decision trees where the target variable
can take continuous values (typically real numbers) are
called regression trees. Classification And Regression Tree (CART)
is a general term for this.
Throughout this section, I will try to explain using examples.
Data Format
Data comes in records of the form:
(x, Y) = (x1, x2, x3, …, xk, Y)
The dependent variable, Y, is the target variable that we are trying to
understand, classify or generalize. The vector x is composed of the
features, x1, x2, x3 etc., that are used for that task.
Approach to make decision tree
While making a decision tree, at each node of the tree we ask different
types of questions. Based on the question asked, we calculate the
information gain corresponding to it.
Information Gain
Information gain is used to decide which feature to split on at each
step in building the tree. Simplicity is best, so we want to keep our tree
small. To do so, at each step we should choose the split that results in
the purest daughter nodes. A commonly used measure of purity is
called information. For each node of the tree, the information
value measures how much information a feature gives us about the
class. The split with the highest information gain will be taken as
the first split and the process will continue until all children nodes
are pure, or until the information gain is 0.
Algorithms for constructing decision trees usually work top-down, by
choosing at each step the variable that best splits the set of items.
Different algorithms use different metrics for measuring the best
split.
Gini Impurity
First let’s understand the meaning of Pure and Impure.
Pure
Pure means that, in a selected sample of the dataset, all the data
belongs to the same class (PURE).
Impure
Impure means that the data is a mixture of different classes.
Definition of Gini Impurity
Gini Impurity is a measurement of the likelihood of an incorrect
classification of a new instance of a random variable, if that new
instance were randomly classified according to the distribution of class
labels from the data set.
If our dataset is pure, then the likelihood of incorrect classification
is 0. If our sample is a mixture of different classes, then the
likelihood of incorrect classification will be high.
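A minimal sketch of the Gini impurity calculation for a sample of class
labels:

from collections import Counter

def gini_impurity(labels):
    # Probability that a randomly chosen item is misclassified when labelled
    # at random according to the class distribution of the sample
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

print(gini_impurity(["yes", "yes", "yes"]))        # 0.0 -> pure node
print(gini_impurity(["yes", "no", "yes", "no"]))   # 0.5 -> maximally mixed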
Steps for Making a decision tree
Get the list of rows (the dataset) which are taken into consideration
for making the decision tree (recursively at each node).
Calculate the uncertainty of our dataset, i.e. its Gini impurity, or
how mixed up our data is.
Generate a list of all questions which need to be asked at that node.
Partition the rows into True rows and False rows based on each
question asked.
Calculate the information gain based on the Gini impurity and the
partition of data from the previous step.
Update the highest information gain based on each question asked.
Update the best question based on the information gain (higher
information gain).
Divide the node on the best question. Repeat from step 1 until we get
pure nodes (leaf nodes).
Advantage of Decision Tree
Easy to use and understand.
Can handle both categorical and numerical data.
Resistant to outliers, hence require little data preprocessing.
Disadvantage of Decision Tree
Prone to overfitting.
Require some kind of measurement as to how well they are doing.
Need to be careful with parameter tuning.
Can create biased learned trees if some classes dominate.
Overfitting the Decision tree model
Overfitting is one of the major problems for every model in machine
learning. If a model is overfitted, it will generalize poorly to new
samples. To prevent a decision tree from overfitting, we remove the
branches that make use of features having low importance. This method
is called pruning or post-pruning. This way we reduce the complexity of
the tree, and hence improve predictive accuracy by reducing
overfitting.
Pruning should reduce the size of a learning tree without reducing
predictive accuracy as measured by a cross-validation set. There are 2
major Pruning techniques.
Minimum Error: The tree is pruned back to the point where the
cross-validated error is a minimum.
Smallest Tree: The tree is pruned back slightly further than the
minimum error. Technically the pruning creates a decision tree
with cross-validation error within 1 standard error of the minimum
error.
Early Stop or Pre-pruning
An alternative method to prevent overfitting is to try and stop the tree-
building process early, before it produces leaves with very small
samples. This heuristic is known as early stopping but is also
sometimes known as pre-pruning decision trees.
At each stage of splitting the tree, we check the cross-validation error.
If the error does not decrease significantly enough then we stop. Early
stopping may underfit by stopping too early: the current split may be
of little benefit, but, having made it, subsequent splits might reduce
the error more significantly.
Early stopping and pruning can be used together, separately, or not at
all. Post pruning decision trees is more mathematically rigorous,
finding a tree at least as good as early stopping. Early stopping is a
quick fix heuristic. If used together with pruning, early stopping may
save time. After all, why build a tree only to prune it back again?
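A minimal sketch comparing pre-pruning (a depth limit) with post-pruning
via scikit-learn's cost-complexity pruning parameter ccp_alpha
(scikit-learn and its bundled iris dataset are assumed; the parameter
values are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Early stopping (pre-pruning) via a depth limit, and post-pruning via
# cost-complexity pruning (ccp_alpha)
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))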
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to
the supervised learning technique. It can be used for both Classification
and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve
a complex problem and to improve the performance of the model.
As the name suggests, “Random Forest is a classifier that contains a
number of decision trees on various subsets of the given dataset and
takes the average to improve the predictive accuracy of that
dataset”. Instead of relying on one decision tree, the random forest
takes the prediction from each tree and, based on the majority vote of
the predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy
and helps prevent the problem of overfitting.
The below diagram explains the working of the Random Forest
algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of
the dataset, it is possible that some decision trees may predict the
correct output, while others may not. But together, all the trees predict
the correct output. Therefore, below are two assumptions for a better
Random forest classifier:
o There should be some actual values in the feature variable of the
dataset so that the classifier can predict accurate results rather
than a guessed result.
o The predictions from each tree must have very low correlations.
Random Forest works in two phases: the first is to create the random
forest by combining N decision trees, and the second is to make
predictions with each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points
(Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree,
and assign the new data points to the category that wins the majority
votes.
The working of the algorithm can be better understood by the below
example:
Example: Suppose there is a dataset that contains multiple fruit
images. So, this dataset is given to the Random forest classifier. The
dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the
Random Forest classifier predicts the final decision. Consider the
below image:
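A minimal sketch of this majority-vote behaviour using scikit-learn's
RandomForestClassifier (scikit-learn and its bundled iris dataset are
assumed; the settings are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of decision trees in the forest
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Each tree votes; the majority vote becomes the prediction
print(forest.predict(X_test[:5]), forest.score(X_test, y_test))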
Advantages of Random Forest
o Random Forest is capable of performing both Classification and
Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting
issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and
regression tasks, it is less suitable for regression tasks.
Model Performance Evaluation Metrics
Regression
There are many metrics for determining model performance for
regression problems, but the most commonly used metric is known as
the mean squared error (MSE), or a variation called the root mean
squared error (RMSE), which is calculated by taking the square root of
the mean squared error.
The error in this case is the difference in value between a given model
prediction and its actual value for an out-of-sample observation. The
mean squared error is therefore the average of all of the squared errors
across all new observations, which is the same as adding all of the
squared errors (sum of squares) and dividing by the number of
observations.
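As a quick illustration, here is a minimal Python sketch of the calculation just described; the prediction and actual values are made up for the example:

# A short sketch of MSE and RMSE on hypothetical out-of-sample predictions.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # actual values (illustrative)
y_pred = np.array([2.8, 5.4, 2.9, 6.1])   # model predictions (illustrative)

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # average of the squared errors
rmse = np.sqrt(mse)             # square root of the MSE
print(mse, rmse)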
In addition to being used as a stand-alone performance metric, mean
squared error (or RMSE) can also be used for model selection,
controlling model complexity, and model tuning. Often many models
are created and evaluated (e.g., cross-validation), and then MSE (or
similar metric) is plotted on the y-axis, with the tuning or validation
parameter given on the x-axis.
The tuning or validation parameter is changed in each model creation
and evaluation step, and the plot described above can help determine
the ideal tuning parameter value. The number of predictors is a great
example of a potential tuning parameter in this case.
Before moving on to classification, it is worth mentioning R2 briefly.
R2 is often thought of as a measure of model performance, but it’s
actually not. R2 is a measure of the amount of variance explained by
the model, and is given as a number between 0 and 1. A value of 1 means the model explains the data perfectly, but a very high R2 computed on training data is more an indication of potential overfitting than of high predictive performance.
As discussed earlier, the more complex the model, the better it tends to fit the training data, and the more it may overfit or contribute additional model variance. Given this, adjusted R2 is a more robust and reliable metric in that it penalizes increases in model complexity (e.g., adding more predictors), so that one can better gauge underlying model improvement in light of the increased complexity.
Classification
Recall the different results from a binary classifier, which are true
positives, true negatives, false positives, and false negatives. These
are often shown in a confusion matrix. A confusion matrix is
conceptually the basis of many classification performance metrics as
shown. We will discuss a few of the more popular ones associated
with machine learning here.
Accuracy is a key measure of performance, and is more specifically
the rate at which the model is able to predict the correct value
(classification or regression) for a given data point or observation. In
other words, accuracy is the proportion of correct predictions out of
all predictions made.
The other two metrics from the confusion matrix worth discussing are precision and recall. Precision (positive predictive value) is the ratio of true positives to the total number of positive predictions made (i.e., true positives plus false positives). Said another way, precision measures the proportion of accurate positive predictions out of all positive predictions made.
Recall, on the other hand, or the true positive rate, is the ratio of true positives to the total number of actual positives, whether predicted correctly or not. So in other words, recall measures the proportion of actual positive observations that are correctly identified.
A metric that is associated with precision and recall is called the F-
score (also called F1 score), which combines them mathematically,
and somewhat like a weighted average, in order to produce a single
measure of performance based on the simultaneous values of both. Its
values range from 0 (worst) to 1 (best).
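For illustration, these metrics can be computed with scikit-learn; the labels and predictions below are invented purely for the example:

# A sketch of accuracy, precision, recall and F1 with scikit-learn, using
# made-up labels and predictions purely for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct / all predictions
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of the two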
Another important concept to know about is the receiver operating
characteristic, which when plotted, results in what’s known as
an ROC curve.
An ROC curve is a two-dimensional plot of sensitivity (recall, or true positive rate) versus 1 - specificity (the false positive rate). The area under the
curve is referred to as the AUC, and is a numeric metric used to
represent the quality and performance of the classifier (model). An
AUC of 0.5 is essentially the same as random guessing without a
model, whereas an AUC of 1.0 is considered a perfect classifier.
Generally, the higher the AUC value the better, and an AUC above
0.8 is considered quite good.
The higher the AUC value, the closer the curve gets to the upper left
corner of the plot. One can easily see from the ROC curves then that
the goal is to find and tune a model that maximises the true positive
rate, while simultaneously minimising the false positive rate. Said
another way, the goal as shown by the ROC curve is to correctly
predict as many of the actual positives as possible, while also
predicting as many of the actual negatives as possible, and therefore
minimise errors for both.
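As a rough sketch, the ROC curve and AUC can be obtained with scikit-learn; the scores below are hypothetical predicted probabilities, not from any case discussed in the text:

# A sketch of an ROC curve and AUC using scikit-learn; the scores are
# hypothetical predicted probabilities for the positive class.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # false/true positive rate at each threshold
auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
print("AUC:", auc)   # 0.5 ~ random guessing, 1.0 ~ perfect classifier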
Error Analysis and Tradeoffs
There are multiple types of errors associated with machine learning
and predictive analytics. The primary types are in-sample and out-of-
sample errors. In-sample errors are the error rate found from the
training data, i.e., the data used to build predictive models.
Out-of-sample errors are the error rates found on a new data set, and
are the most important since they represent the potential performance
of a given predictive model on new and unseen data.
In-sample error rates may be very low and seem to be indicative of a
high-performing model, but one must be careful, as this may be due to
overfitting as mentioned, which would result in a model that is unable
to generalise well to new data.
Training and validation data is used to build, validate, and tune a
model, but test data is used to evaluate model performance and
generalisation capability. One very important point to note is that
prediction performance and error analysis should only be done on test
data, when evaluating a model for use on non-training or new data.
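For illustration, here is a minimal sketch of keeping a held-out test set for error analysis; the dataset (Iris), the model choice, and the split sizes are assumptions made only for the example:

# A minimal sketch of separating in-sample and out-of-sample evaluation;
# the dataset, model and split sizes are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Build and tune on the training portion; report errors only on the test portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("In-sample accuracy    :", accuracy_score(y_train, model.predict(X_train)))
print("Out-of-sample accuracy:", accuracy_score(y_test, model.predict(X_test)))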
Generally speaking, model performance on training data tends to be optimistic, and therefore training errors will be lower than those measured on test data. There are tradeoffs between the types of errors that a
machine learning practitioner must consider and often choose to
accept.
For binary classification problems, there are two primary types of errors: Type 1 errors (false positives) and Type 2 errors (false negatives). It’s often possible through model selection and tuning to
increase one while decreasing the other, and often one must choose
which error type is more acceptable. This can be a major tradeoff
consideration depending on the situation.
A typical example of this tradeoff dilemma involves cancer diagnosis,
where the positive diagnosis of having cancer is based on some test.
In this case, a false positive means that someone is told that they have cancer when they do not. Conversely, the false negative case is when someone is told that they do not have cancer when they actually do.
Telling someone they have cancer when they don’t can result in
tremendous emotional distress, stress, additional tests and medical
costs, and so on. On the other hand, failing to detect cancer in
someone that actually has it can mean the difference between life and
death.
In the spam or ham case, neither error type is nearly as serious as the
cancer case, but typically email vendors err slightly more on the side
of letting some spam get into your inbox as opposed to you missing a
very important email because the spam classifier is too aggressive.
1. Confusion Matrix
A confusion matrix is an N X N matrix, where N is the number of
classes being predicted. For the problem in hand, we have N=2, and
hence we get a 2 X 2 matrix. Here are a few definitions you need to remember for a confusion matrix:
Accuracy : the proportion of the total number of predictions that
were correct.
Positive Predictive Value or Precision : the proportion of predicted positive cases that were correct.
Negative Predictive Value : the proportion of predicted negative cases that were correct.
Sensitivity or Recall : the proportion of actual positive cases
which are correctly identified.
Specificity : the proportion of actual negative cases which are
correctly identified.
The accuracy for the problem in hand comes out to be 88%. As you
can see from the above two tables, the Positive predictive Value is high,
but negative predictive value is quite low. Same holds for Sensitivity
and Specificity. This is primarily driven by the threshold value we have
chosen. If we decrease our threshold value, the two pairs of starkly
different numbers will come closer.
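As an illustration of the definitions above, here is a small sketch that derives all five quantities from a confusion matrix; the labels and predictions are invented and do not correspond to the 88%-accuracy example in the text:

# A sketch of deriving the quantities above from a confusion matrix;
# labels and predictions are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)          # true negative rate
print(accuracy, precision, npv, sensitivity, specificity)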
2. F1 Score
F1-Score is the harmonic mean of the precision and recall values for a classification problem. The formula for the F1-Score is as follows:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Now, an obvious question that comes to mind is why we take a harmonic mean and not an arithmetic mean. This is because the harmonic mean punishes extreme values more. Let us understand this with an
example. We have a binary classification model with the following
results:
Precision: 0, Recall: 1
Here, if we take the arithmetic mean, we get 0.5. It is clear that the above result comes from a dumb classifier which ignores the input and simply predicts one of the classes as output. Now, if we were to take the harmonic mean, we would get 0, which is accurate, as this model is useless for all purposes.
This seems simple. There are situations, however, in which a data scientist would like to give more importance/weight to either precision or recall. Altering the above expression a bit so that we can include an adjustable parameter beta for this purpose, we get:
Fbeta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
Fbeta measures the effectiveness of a model with respect to a user who attaches β times as much importance to recall as to precision.
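As a sketch of the effect of beta, scikit-learn's f1_score and fbeta_score can be compared on invented labels; beta = 2 weights recall more heavily, while beta = 0.5 favours precision:

# A sketch of the F1 and F-beta scores on invented labels and predictions.
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("F1  :", f1_score(y_true, y_pred))
print("F2  :", fbeta_score(y_true, y_pred, beta=2))    # recall-oriented
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))  # precision-oriented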
3. Area Under the ROC curve (AUC – ROC)
This is again one of the popular metrics used in the industry. The biggest advantage of using the ROC curve is that it is independent of changes in the proportion of responders. This statement will become clearer in the following sections.
Let’s first try to understand what an ROC (Receiver Operating Characteristic) curve is. For a probabilistic model, each choice of classification threshold produces a different confusion matrix, and hence a different value for each metric. So, for each sensitivity, we get a different specificity. The two vary as follows:
The ROC curve is the plot between sensitivity and (1- specificity). (1-
specificity) is also known as false positive rate and sensitivity is also
known as True Positive rate. Following is the ROC curve for the case
in hand.
Let’s take an example of threshold = 0.5 (refer to confusion matrix).
Here is the confusion matrix :
As you can see, the sensitivity at this threshold is 99.6% and the (1 - specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).
Note that the area of the entire square is 1*1 = 1. Hence the AUC is simply the ratio of the area under the curve to the total area. For the case in hand, we get an AUC ROC of 96.4%. Following are a few thumb rules:
.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)
We see that we fall under the excellent band for the current model. But this might simply be over-fitting. In such cases it becomes very important to do in-time and out-of-time validations.
4. Log Loss
AUC ROC considers the predicted probabilities for determining our
model’s performance. However, there is an issue with AUC ROC: it only takes into account the order of the probabilities, and hence it does not take into account the model’s capability to predict a higher probability for samples that are more likely to be positive. In that case, we could use log loss, which is nothing but the negative average of the log of the corrected predicted probabilities for each instance:
Log Loss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ]
where:
p(yi) is the predicted probability of the positive class
1 - p(yi) is the predicted probability of the negative class
yi = 1 for the positive class and 0 for the negative class (actual values)
Let us calculate log loss for a few random values to get the gist of the
above mathematical function:
Logloss(1, 0.1) = 2.303
Logloss(1, 0.5) = 0.693
Logloss(1, 0.9) = 0.105
If we plot this relationship, we will get a curve as follows:
It’s apparent from the gentle downward slope towards the right that
the Log Loss gradually declines as the predicted probability improves.
Moving in the opposite direction though, the Log Loss ramps up very
rapidly as the predicted probability approaches 0.
So, the lower the log loss, the better the model. However, there is no absolute measure of a good log loss; it is use-case/application dependent.
Whereas the AUC is computed with regards to binary classification
with a varying decision threshold, log loss actually takes “certainty”
of classification into account.
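For illustration, here is a minimal sketch that computes log loss both by hand from the formula above and with scikit-learn's log_loss; the labels and probabilities are invented:

# A sketch of log loss on hypothetical probabilities, computed by hand and
# with scikit-learn's log_loss for comparison.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 1, 0, 0])
p_pos  = np.array([0.9, 0.5, 0.1, 0.2, 0.7])   # predicted P(class = 1)

# Negative average log of the probability assigned to the true class.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
print(manual, log_loss(y_true, p_pos))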
5. Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression
problems. It follows the assumption that errors are unbiased and follow a normal distribution. Here are the key points to consider on RMSE:
1. The squaring of errors makes this metric emphasize large deviations, while the square root returns the result to the original units of the target.
2. The ‘squared’ nature of this metric helps to deliver more robust
results which prevents cancelling the positive and negative error
values. In other words, this metric aptly displays the plausible
magnitude of error term.
3. It avoids the use of absolute error values which is highly
undesirable in mathematical calculations.
4. When we have more samples, reconstructing the error
distribution using RMSE is considered to be more reliable.
5. RMSE is highly affected by outlier values. Hence, make sure
you’ve removed outliers from your data set prior to using this
metric.
6. As compared to mean absolute error, RMSE gives higher
weightage and punishes large errors.
The RMSE metric is given by:
RMSE = sqrt( (1/N) * Σ (Predicted_i - Actual_i)^2 )
where N is the total number of observations.
6. Root Mean Squared Logarithmic Error
In the case of root mean squared logarithmic error, we take the log of the predictions and the actual values, so essentially what changes is the scale on which the error is measured. RMSLE is usually used when we don’t want to penalize huge differences between the predicted and the actual values when both the predicted and true values are huge numbers.
1. If both predicted and actual values are small: RMSE and
RMSLE are same.
2. If either predicted or the actual value is big: RMSE > RMSLE
3. If both predicted and actual values are big: RMSE > RMSLE
(RMSLE becomes almost negligible)
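As a rough sketch of this behaviour, the following compares RMSE and RMSLE on invented large-valued targets; note it uses the common log(1 + value) form of RMSLE, which is an assumption not spelled out in the text:

# A sketch of RMSE vs RMSLE on hypothetical large-valued predictions; RMSLE
# works on log(1 + value), so big absolute gaps matter much less.
import numpy as np

y_true = np.array([100.0, 1000.0, 10000.0])
y_pred = np.array([ 90.0, 1200.0,  8000.0])

rmse  = np.sqrt(np.mean((y_pred - y_true) ** 2))
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
print(rmse, rmsle)   # RMSE is large, RMSLE stays small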
7. R-Squared/Adjusted R-Squared
In the case of a classification problem, if the model has an accuracy of
0.8, we could gauge how good our model is against a random model,
which has an accuracy of 0.5. So the random model can be treated as
a benchmark. But when we talk about the RMSE metric, we do not have a benchmark to compare against.
This is where we can use the R-Squared metric. The formula for R-Squared is as follows:
R-Squared = 1 - MSE(model) / MSE(baseline)
MSE(model): Mean Squared Error of the predictions against the
actual values
MSE(baseline): Mean Squared Error of mean prediction against the
actual values
In other words, it measures how good our regression model is compared to a very simple baseline model that just predicts the mean value of the target from the training set as its prediction.
Adjusted R-Squared
A model performing equal to the baseline would give an R-Squared of 0. The better the model, the higher the R-Squared value. The best model, with all correct predictions, would give an R-Squared of 1. However, on adding new features to the model, the R-Squared value either increases or remains the same; R-Squared does not penalize adding features that add no value to the model. So an improved version of R-Squared is the adjusted R-Squared. The formula for adjusted R-Squared is given by:
Adjusted R-Squared = 1 - [ (1 - R-Squared) * (n - 1) / (n - (k + 1)) ]
k: number of features
n: number of samples
As you can see, this metric takes the number of features into account. When we add more features, the term n - (k + 1) in the denominator decreases, so the fraction being subtracted increases. If R-Squared does not increase enough to compensate, that means the added feature isn’t valuable for our model; we end up subtracting a greater value from 1, and the adjusted R-Squared, in turn, decreases.
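For illustration, a small sketch that computes R-Squared with scikit-learn and then applies the adjusted R-Squared formula above; the data and the feature count k are invented:

# A sketch of R-squared and adjusted R-squared on invented data.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2, 6.1])
y_pred = np.array([2.8, 5.4, 2.9, 6.1, 4.0, 5.8])

n, k = len(y_true), 2                      # n samples, k features in the model
r2 = r2_score(y_true, y_pred)              # 1 - MSE(model) / MSE(baseline)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - (k + 1))
print(r2, adj_r2)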
8. Cross Validation
Cross Validation is one of the most important concepts in any type of
data modelling. It simply says, try to leave a sample on which you do
not train the model and test the model on this sample before finalizing
the model.
The above diagram shows how to validate a model with an in-time sample. We simply divide the population into 2 samples and build the model on one sample; the rest of the population is used for in-time validation.
We make a 50:50 split of the training population, train on the first 50% and validate on the remaining 50%. Then we train on the other 50% and test on the first 50%. This way we eventually train the model on the entire population, though only on 50% in one go. This reduces bias due to sample selection to some extent, but gives a smaller sample to train the model on. This approach is known as 2-fold cross validation.
k-fold Cross validation:
Let’s extrapolate the last example from 2-fold to k-fold cross validation. Now, we will try to visualize how k-fold validation works.
This is a 7-fold cross validation.
Here’s what goes on behind the scenes: we divide the entire population into 7 equal samples. We train models on 6 samples (green boxes) and validate on 1 sample (grey box). Then, at the second iteration, we train the model with a different sample held out as validation. In 7 iterations, we have built a model with each sample held out as validation exactly once. This is a way to reduce the selection bias and reduce the variance in prediction power. Once we have all 7 models, we take the average of the error terms to judge how well this modelling approach performs.
k-fold cross validation is widely used to check whether a model is an overfit or not. If the performance metrics from each of the k rounds of modelling are close to each other, and the mean of the metric is high, the model is unlikely to be overfitting.
For a small k, we have a higher selection bias but low variance in the
performances.
For a large k, we have a small selection bias but high variance in the
performances.
Think of extreme cases :
k = 2 : We have only 2 samples, similar to our 50:50 example. Here we build the model on only 50% of the population each time. But as the validation set is a significant portion of the population, the variance of the validation performance is minimal.
k = number of observations (n) : This is also known as “Leave one out”. We have n samples, and the modelling is repeated n times, each time leaving only one observation out for cross validation. Hence, the selection bias is minimal, but the variance of the validation performance is very large.
Generally, a value of k = 10 is recommended for most purposes.
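As a closing sketch, k-fold cross validation with k = 10 can be run with scikit-learn; the dataset (Iris) and the model choice are illustrative assumptions:

# A sketch of 10-fold cross validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one score per held-out fold

# Similar fold scores with a high mean suggest the model is not overfitting.
print(scores.mean(), scores.std())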