
Machine learning

MODULE – II
LINEAR MODELS: Linear classification – univariate linear regression – multivariate
linear regression – regularized regression – logistic regression – perceptrons –
multilayer neural networks – learning neural network structures – support vector
machines – soft margin SVM – going beyond linearity – generalization and overfitting –
regularization – validation.
------------------------------------------------------------------------------------------------
Supervised Learning:
Supervised learning is a type of machine learning algorithm that learns from labeled
training data to make predictions or decisions without human intervention. Here
“labeled” means that each training example is paired with an output label.

The primary goal of supervised learning is to learn a mapping from the input data (features) to the
output labels. This learned function can then be used to predict the outputs for unseen
data. Supervised learning can be categorized into two main types: classification and
regression.
I. Classification: This involves predicting a discrete label or class, such as
identifying an email as spam or not spam. The main objective of classification is to
build a model that can accurately assign a label or category to a new observation based
on its features.
Classification Types
There are two main classification types in machine learning:
A. Linear Classifier: Linear classifiers make predictions based on a linear
combination of the input features. When data is linearly separable (meaning classes can
be separated with a straight line or hyperplane), linear classifiers can perform
exceptionally well.
- Their decision boundary in a two-dimensional feature space is a straight line, in three
dimensions it is a plane, and in higher dimensions it is a hyperplane.
- Examples of linear classifiers include:


 Logistic Regression (a linear model used for classification)
 Linear Support Vector Machines (SVM)
 Perceptron
B. Non-Linear Classifier: Nonlinear classifiers can model more complex
relationships between features and can produce curved (nonlinear) decision boundaries.
- Examples of nonlinear classifiers include:
 Kernel Support Vector Machines
 Neural Networks
 Decision Trees and Random Forests
 K-Nearest Neighbors (KNN)
 Naive Bayes (under certain conditions)
II. Regression: Regression is a supervised learning technique that models the
relationship between input features and a continuous target variable, using statistical
methods to predict the target variable based on new input data, like forecasting the price
of a house based on its features.
Types of Regression:

1) Linear Regression: Linear regression models the linear relationship between the
independent variable (X-axis) and the dependent variable (Y-axis), hence the name
linear regression.
- The mathematical equation for linear regression is: Y = aX + b


Here, Y is the dependent variable (target variable), X is the independent variable
(predictor variable), and a and b are the linear coefficients (slope and intercept).
-Some popular applications of linear regression are:
o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic
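As a small illustration of the Y = aX + b relationship, the sketch below fits a univariate linear regression with scikit-learn; the library is assumed to be available and the data values are invented purely for demonstration.

```python
# A minimal sketch of univariate linear regression (Y = aX + b), assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (X) vs. salary in thousands (Y).
X = np.array([[1], [2], [3], [4], [5]])
Y = np.array([30, 35, 42, 48, 55])

model = LinearRegression()
model.fit(X, Y)

print("a (slope):", model.coef_[0])        # coefficient a in Y = aX + b
print("b (intercept):", model.intercept_)  # intercept b
print("Prediction for X = 6:", model.predict([[6]])[0])
```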
i) Multiple Linear Regression: Multiple regression is a method used to measure the
degree to which two or more independent variables (predictors) are linearly related to
a single dependent variable (response).

- For multiple linear regression, the form of the model is


Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn
Here, • Y is a dependent variable
• X1, X2, …., Xn are independent variables
• β0, β1,…, βn are the regression coefficients
• βj (1<=j<=n) is the slope or weight that specifies the factor by which Xj has an
impact on Y
- Example: Create a model that predicts blood pressure based on variables such as age,
gender, weight, diet, exercise, and medication.
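A minimal sketch of the multiple regression form above with scikit-learn; the feature values and blood-pressure numbers below are hypothetical, chosen only to show the shape of the data.

```python
# Multiple linear regression: Y = β0 + β1X1 + ... + βnXn (sketch with made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features per row: [age, weight_kg, exercise_hours_per_week]
X = np.array([
    [25, 60, 5],
    [40, 80, 2],
    [35, 72, 3],
    [50, 90, 1],
    [30, 65, 4],
])
y = np.array([115, 130, 122, 140, 118])  # hypothetical blood pressure values

model = LinearRegression().fit(X, y)
print("Intercept (β0):", model.intercept_)
print("Coefficients (β1..βn):", model.coef_)
print("Predicted BP for age 45, 85 kg, 2 h/week:", model.predict([[45, 85, 2]])[0])
```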
ii) Polynomial Regression: Polynomial regression is a regression algorithm that
models the relationship between the dependent variable (y) and the independent
variable (x) as an nth-degree polynomial. The polynomial regression equation is given below:
y = b0 + b1x + b2x^2 + b3x^3 + …… + bnx^n
- It is a linear model with a modified (polynomial) feature set, used in order to increase the accuracy.
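The sketch below shows the usual way polynomial regression is implemented in practice: expand x into polynomial features and fit an ordinary linear model on them. scikit-learn is assumed, and the data is synthetic (a cubic trend with noise).

```python
# Polynomial regression as a linear model on expanded features (sketch).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=1.0, size=30)  # cubic trend + noise

# Degree-3 polynomial: y = b0 + b1*x + b2*x^2 + b3*x^3
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("Prediction at x = 2.5:", model.predict([[2.5]])[0])
```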


iii) Regularized Regression: Regularization is one of the most important concepts of
machine learning. It is a technique to prevent the model from overfitting by adding
extra information to it.
- Sometimes the machine learning model performs well with the training data but does
not perform well with the test data.
- This means the model is not able to predict the output when dealing with unseen data,
because it has fitted the noise in the training data; such a model is called overfitted.
This problem can be dealt with using a regularization technique.
- Techniques of Regularization
• Ridge Regression
• Lasso Regression

a) Ridge Regression: Ridge regression is one of the types of linear regression in which a
small amount of bias is introduced so that we can get better long-term predictions.
- Ridge regression is a regularization technique used to reduce the complexity of the
model. It is also called L2 regularization.
- In this technique, the cost function is altered by adding a penalty term to it.
- The amount of bias added to the model is called the ridge regression penalty. It is
calculated by multiplying lambda (λ) by the squared weight of each individual feature.
- The equation for the cost function in ridge regression is:
Cost = ∑(yi − ŷi)^2 + λ ∑ βj^2
- In the above equation, the penalty term regularizes the coefficients of the model;
hence, ridge regression shrinks the magnitudes of the coefficients, which decreases the
complexity of the model.
- If the value of λ tends to zero, the equation becomes the cost function of the ordinary
linear regression model. Hence, for very small values of λ, the model will resemble the
linear regression model.
- A general linear or polynomial regression will fail if there is high collinearity between
the independent variables, so to solve such problems, ridge regression can be used.
- It also helps to solve problems where we have more parameters than samples.
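A minimal sketch of ridge (L2) regression with scikit-learn, where the alpha argument plays the role of λ in the cost function above; the data is synthetic and deliberately made nearly collinear to show the stabilizing effect of the penalty.

```python
# Ridge regression (L2): cost = RSS + alpha * sum(beta_j^2)  (sketch).
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)           # alpha is the λ penalty strength

print("OLS coefficients  :", ols.coef_)      # typically large and unstable under collinearity
print("Ridge coefficients:", ridge.coef_)    # shrunk toward zero, more stable
```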
b) Lasso Regression: Lasso regression is another regularization technique used to reduce the
complexity of the model. Lasso stands for Least Absolute Shrinkage and Selection Operator.
- It is similar to ridge regression except that the penalty term contains the absolute
values of the weights instead of the squares of the weights.
- Since it takes absolute values, it can shrink a coefficient exactly to 0, whereas ridge
regression can only shrink it close to 0.
- It is also called L1 regularization. The equation for the cost function of lasso
regression is:
Cost = ∑(yi − ŷi)^2 + λ ∑ |βj|
- Lasso regression can help us to reduce overfitting in the model and also performs feature
selection.
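A short sketch showing lasso's (L1) ability to drive some coefficients exactly to zero, which is what makes it useful for feature selection; scikit-learn is assumed and the data is synthetic, with only two of the five features actually informative.

```python
# Lasso regression (L1): cost = RSS + alpha * sum(|beta_j|)  (sketch).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter; the rest are noise.
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)   # irrelevant features typically shrink to exactly 0
```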
Difference between Ridge Regression and Lasso Regression

 Ridge regression is mostly used to reduce the overfitting in the model, and it includes
all the features present in the model. It reduces the complexity of the model by
shrinking the coefficients.
 Lasso regression helps to reduce the overfitting in the model as well as feature
selection.
2) Logistic Regression: Logistic regression is used for predicting a categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False,
etc. However, instead of giving exact values of 0 and 1, it gives probabilistic values
which lie between 0 and 1.
• Logistic regression is very similar to linear regression except for how it is used.
Linear regression is used for solving regression problems, whereas logistic regression
is used for solving classification problems.
• In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of an event, such as
whether a cell is cancerous or not.
• Logistic regression is a significant machine learning algorithm because it has the
ability to provide probabilities and to classify new data using both continuous and
discrete datasets.
• Logistic regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification.
The logistic (sigmoid) function, σ(z) = 1 / (1 + e^(−z)), has the S-shaped curve shown in the figure below:

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation.
The mathematical steps to get the logistic regression equation are given below:
- Start with the linear regression equation:
y = b0 + b1x1 + b2x2 + … + bnxn
- In logistic regression, y can only be between 0 and 1, so divide y by (1 − y) to form the odds:
y / (1 − y), which is 0 for y = 0 and infinity for y = 1
- But we need a range from −∞ to +∞, so take the logarithm of the equation, and it becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + … + bnxn
The above equation is the final equation for logistic regression.
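A minimal sketch of logistic regression on a toy binary problem; predict_proba returns the probabilistic values between 0 and 1 described above. scikit-learn is assumed and the data (hours studied vs. pass/fail) is invented.

```python
# Logistic regression: the model outputs probabilities via the sigmoid of a linear score.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature: hours studied; label: pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("P(fail), P(pass) for 4.5 hours:", clf.predict_proba([[4.5]])[0])
print("Predicted class for 4.5 hours :", clf.predict([[4.5]])[0])
```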


Types of Logistic Regression:
On the basis of the categories, logistic regression can be classified into three types:
 Binomial: In binomial logistic regression, there can be only two possible types of
the dependent variable, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
 Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered
types of the dependent variable, such as "low", "medium", or "high".

 Perceptron: A perceptron is an artificial neuron. It is the simplest possible neural
network, and neural networks are the building blocks of machine learning. It can be
considered a single-layer neural network with four main parameters, i.e., input values,
weights and bias, net sum, and an activation function.
The perceptron model is a binary classifier which contains three main components. These
are as follows:

 Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.
 Weight and Bias:
The weight parameter represents the strength of the connection between units. This is
another important parameter of the perceptron components. A weight is directly
proportional to the strength of the associated input neuron in deciding the output.
Further, the bias can be considered as the intercept term in a linear equation.
 Activation Function:
These are the final and important components that help to determine whether the
neuron will fire or not. Activation Function can be considered primarily as a step
function.
Types of Activation functions:
- Sign function
- Step function, and
- Sigmoid function


The data scientist chooses the activation function based on the problem statement and
the desired outputs. The activation function (e.g., sign, step, or sigmoid) may differ
between perceptron models, for example depending on whether the learning process is
slow or suffers from vanishing gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function.
Step- 1:
The perceptron model begins with the multiplication of all input values and their
weights, then adds these values together to create the weighted sum. Then this weighted
sum is applied to the activation function 'f' to obtain the desired output. This activation
function is also known as the step function and is represented by 'f'.
Add a special term called bias 'b' to this weighted sum to improve the model's
performance.
∑wi*xi + b
Step-2:
In the second step, an activation function is applied with the above-mentioned weighted
sum, which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
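The two steps above can be written directly in a few lines of NumPy; this is a sketch of a single forward pass with a step activation, using made-up inputs, weights, and bias rather than learned values.

```python
# Perceptron forward pass: weighted sum plus bias, then a step activation (sketch).
import numpy as np

def step(z):
    """Step activation: fire (1) if the net input is non-negative, else 0."""
    return 1 if z >= 0 else 0

x = np.array([1.0, 0.0, 1.0])    # input values (hypothetical)
w = np.array([0.5, -0.6, 0.2])   # weights (hypothetical)
b = -0.3                         # bias term

net_sum = np.dot(w, x) + b       # Step 1: sum(w_i * x_i) + b
y = step(net_sum)                # Step 2: Y = f(sum(w_i * x_i) + b)
print("net sum:", net_sum, "-> output:", y)
```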

The step function or activation function plays a vital role in ensuring that the output is
mapped between the required values (0, 1) or (-1, 1). It is important to note that the
weight of an input is indicative of the strength of a node. Similarly, an input's bias
value gives the ability to shift the activation function curve up or down.
Explanation:
In the first step, multiply all input values with the corresponding weight values and
then add them to determine the weighted sum. Mathematically, the weighted sum is
calculated as:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Types of Perceptron Models
Based on the number of layers, perceptron models are divided into two types. These are as
follows:
- Single-layer Perceptron Model
- Multi-layer Perceptron Model
Single-Layer Perceptron Model:
This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and also includes a threshold
transfer function inside the model. The main objective of the single-layer perceptron
model is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no prior recorded data, so it
begins with randomly allocated values for the weight parameters. It then sums up all
the weighted inputs. If the total sum is more than a pre-determined threshold, the model
gets activated and shows the output value as +1.
Multi-Layer Perceptron Model:
A multi-layer perceptron model has the same basic structure as the single-layer model
but has a greater number of (hidden) layers.
The multi-layer perceptron is trained with the backpropagation algorithm, which
executes in two stages, as shown in the sketch after this list:
 Forward Stage: Activations are propagated from the input layer through the hidden
layers to the output layer.
 Backward Stage: Weight and bias values are modified as per the model's
requirement. In this stage, the error between the actual and desired output is propagated
backward from the output layer to the input layer.
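As a sketch of the forward/backward training loop described above, scikit-learn's MLPClassifier hides backpropagation behind a single fit call; the XOR data below is a classic example that a single-layer perceptron cannot solve but a multi-layer one can. The hidden-layer size, activation, and solver choices are illustrative assumptions.

```python
# Multi-layer perceptron trained with backpropagation (sketch using scikit-learn).
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR problem: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)                       # forward + backward passes happen inside fit()
print("Predictions:", mlp.predict(X))
```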
 Multilayer neural networks
- A multilayer perceptron is a densely connected neural network from the input layer to
the output layer. It has at least 3 layers: an input layer, one or more hidden layers, and
an output layer. There are multiple nodes in each layer, and the nodes of adjacent
layers are interconnected with each other.
- The multi-layer perceptron defines the most complex architecture of artificial neural
networks. It is essentially formed from multiple layers of perceptrons.
The pictorial representation of multi-layer perceptron learning is as shown


 Learning Neural Network Structures
A neural network (NN) is a computational system consisting of many interconnected
units called artificial neurons. A connection between artificial neurons can transmit a
signal from one neuron to another.
There are multiple possibilities for connecting the neurons, and the choice determines
the architecture we adopt for a specific solution. Some of the permutations and
combinations are as follows:
- There may be just two layers of neurons in the network – the input and output layers.
- There can be one or more intermediate 'hidden' layers of neurons.
- The neurons may be connected to all neurons in the next layer, and so on.

So let’s start talking about the various possible architectures:

1) Single-layer Feed Forward Network:

It is the simplest and most basic architecture of ANNs. It consists of only two layers:
the input layer and the output layer. The input layer consists of 'm' input neurons
connected to each of the 'n' output neurons.
The connections carry weights w11, w12, and so on. The input layer neurons do not
perform any processing; they pass the input signals on to the output neurons.
The computations are performed in the output layer. So, although the network has 2
layers of neurons, only one layer performs computation. This is the reason why the
network is known as a SINGLE-layer network. Also, the signals always flow from the
input layer to the output layer; hence, the network is known as FEED FORWARD.
2) Multi-layer Feed Forward Network:
The multi-layer feed-forward network is quite similar to the single-layer feed-forward
network, except for the fact that there are one or more intermediate layers of neurons
between the input and output layer. Hence, the network is termed as multi-layer.
Each of the layers may have a varying number of neurons. For example, the one shown
in the above diagram has ‘m’ neurons in the input layer and ‘r’ neurons in the output
layer and there is only one hidden layer with ‘n’ neurons.

3) Competitive Network:
It is the same as the single-layer feed-forward network in structure. The only difference
is that the output neurons are connected with each other (either partially or fully). Below
is the diagram for this type of network.

According to the diagram, a few of the output neurons are interconnected with each
other. For a given input, the output neurons compete among themselves to represent
the input. This represents a form of unsupervised learning in ANNs that is suitable for
finding clusters in a data set.
4)Recurrent Network:
In feed-forward networks, the signal always flows from the input layer towards the
output layer (in one direction only). In the case of recurrent neural networks, there is a
feedback loop (from the neurons in the output layer to the input layer neurons). There
can be self-loops too.


Learning Process In ANN:
The learning process in an ANN mainly depends on four factors:
1. The number of layers in the network (Single-layered or multi-layered)
2. Direction of signal flow (Feedforward or recurrent)
3. Number of nodes in layers: The number of nodes in the input layer is equal to the
number of features of the input data set. The number of output nodes depends on the
possible outcomes, i.e., the number of classes in the case of supervised learning. The
number of nodes in the hidden layers is to be chosen by the user. A larger number of
nodes in the hidden layer generally gives higher performance, but too many nodes may
result in overfitting as well as increased computational expense.
4. Weight of Interconnected Nodes: Deciding the values of the weights attached to each
interconnection between neurons, so that a specific learning problem can be solved
correctly, is quite a difficult problem by itself.
 Support Vector Machine(SVM):
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
- The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
Example:


Suppose we see a strange cat that also has some features of a dog. If we want a model
that can accurately identify whether it is a cat or a dog, such a model can be created by
using the SVM algorithm. We first train our model with lots of images of cats and
dogs so that it can learn the different features of cats and dogs, and then we test it
with this strange creature. The SVM creates a decision boundary between these two
classes (cat and dog) and chooses the extreme cases (support vectors), so it will look at
the extreme cases of cats and dogs. On the basis of the support vectors, it will classify
the new example as a cat.
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.
Types of SVM:
 Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called a linear SVM
classifier.
 Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a non-linear SVM classifier.
Working of Linear SVM:
 Suppose we have a dataset that has two tags (green and blue), and the dataset has
two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of
coordinates as either green or blue.
 Since it is a 2-D space, by just using a straight line we can easily separate these two
classes. But there can be multiple lines that separate these classes. Consider the
below image:
 Hence, the SVM algorithm helps to find the best line or decision boundary; this
best boundary or region is called a hyperplane. The SVM algorithm finds the closest
points of the two classes to the line. These points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and the
goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
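A minimal sketch of a linear SVM on a two-feature toy dataset; the support_vectors_ attribute exposes the extreme points that define the maximum-margin hyperplane. scikit-learn is assumed and the coordinates are invented.

```python
# Linear SVM: find the maximum-margin hyperplane between two classes (sketch).
import numpy as np
from sklearn.svm import SVC

# Hypothetical (x1, x2) points for the "blue" (0) and "green" (1) classes.
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print("Support vectors:\n", clf.support_vectors_)
print("Prediction for (3, 2):", clf.predict([[3, 2]])[0])
```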

Non-Linear SVM:
 If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
 So to separate these data points, we need to add one more dimension. For linear
data we have used two dimensions, x and y, so for non-linear data we will add a third
dimension z. It can be calculated as: z = x^2 + y^2
 By adding the third dimension, the sample space becomes separable by a plane, and
SVM divides the datasets into classes using that plane.
 Since we are in 3-D space, the boundary looks like a plane parallel to the x-y plane.
If we convert it back into 2-D space with z = 1, it becomes a circle of radius 1 around
the origin.
 Hence, we get a circular decision boundary of radius 1 in the case of this non-linear
data.
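The z = x^2 + y^2 trick above is essentially what a kernel does implicitly; the sketch below classifies points inside vs. outside a circle with an RBF-kernel SVM, so no explicit third dimension has to be added by hand. scikit-learn is assumed and the data is synthetic.

```python
# Non-linear SVM: an RBF kernel separates a class enclosed by a circular boundary (sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)  # 1 inside the unit circle, 0 outside

clf = SVC(kernel="rbf").fit(X, y)
print("Training accuracy:", clf.score(X, y))
print("Point (0.2, 0.3) -> class", clf.predict([[0.2, 0.3]])[0])   # inside the circle
print("Point (1.5, 1.5) -> class", clf.predict([[1.5, 1.5]])[0])   # outside the circle
```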
Soft Margin: Soft margin SVM allows some misclassification to happen by relaxing
the hard constraints of the support vector machine. Soft margin SVM is implemented
with the help of the regularization parameter (C), which tells us how much
misclassification we want to avoid.
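A sketch of the regularization parameter C in action: a small C tolerates more misclassification in exchange for a wider margin, while a large C penalizes mistakes heavily. scikit-learn is assumed, the two overlapping blobs are synthetic, and the specific C values are arbitrary illustrations.

```python
# Soft margin SVM: the parameter C controls how much misclassification is tolerated (sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=-1.0, scale=0.8, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.8, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)   # two overlapping classes (not perfectly separable)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```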

Linear Inseparability

Figure 1: Data representation where the two classes are not linearly separable

In the 2D figure, it is evident that there’s no specific linear decision boundary that can
perfectly separate the data, i.e. the data is linearly inseparable. We can have a similar
situation in higher-dimensional representations as well. This can be attributed to the
fact that usually, the features we derive from the data don’t contain sufficient
information so that we can clearly separate the two classes.
Soft Margin Formulation
This idea is based on a simple premise: allow SVM to make a certain number of
mistakes and keep margin as wide as possible so that other points can still be classified
correctly. This can be done simply by modifying the objective of SVM.
Motivation
Let us briefly go over the motivation for having this kind of formulation.

 As mentioned earlier, almost all real-world applications have data that is linearly
inseparable.
 In rare cases where the data is linearly separable, we might not want to choose a
decision boundary that perfectly separates the data to avoid overfitting. For example,
consider the following diagram:
15
Page

Prof. Rojalin Dash , Prof Mousumi Acharya


Figure 2: Which decision boundary is better? Red or Green?

Here the red decision boundary perfectly separates all the training points. However, is it
really a good idea to have a decision boundary with such a small margin? Do you think
such a decision boundary will generalize well on unseen data? The answer is no.
The green decision boundary has a wider margin, which would allow it to generalize well
on unseen data. In that sense, the soft margin formulation also helps in avoiding the
overfitting problem.

Going beyond linearity


Linearity
Linearity refers to the property of a system or model where the output is directly
proportional to the input, while nonlinearity implies that the relationship between input
and output is more complex and cannot be expressed as a simple linear function.
We can move beyond linearity through methods such as polynomial regression, step
functions, splines, local regression, and GAMs.
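As one example of moving beyond linearity without a full polynomial fit, the sketch below builds a crude step-function (piecewise-constant) basis by binning x and fitting an ordinary linear model on the indicator features; the bin edges and data are arbitrary choices made purely for illustration.

```python
# Step functions as a basis expansion: bin x and fit a linear model on the indicators (sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.2, size=100)     # clearly non-linear target

bins = np.array([2.5, 5.0, 7.5])                     # arbitrary cut points -> 4 regions
# One-hot indicators for which bin each x falls into (piecewise-constant features).
X_steps = np.column_stack([(np.digitize(x, bins) == k).astype(float) for k in range(4)])

model = LinearRegression(fit_intercept=False).fit(X_steps, y)
print("Per-region fitted levels:", model.coef_)      # roughly the mean of y in each bin
```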

Generalization and overfitting


 Overfitting and Underfitting in Machine Learning
Overfitting and underfitting are the two main problems that occur in machine learning
and degrade the performance of machine learning models.
The main goal of every machine learning model is to generalize well. Here,
generalization defines the ability of an ML model to provide a suitable output when
given previously unseen inputs. It means that after being trained on the dataset, the
model can produce reliable and accurate output on new data. Hence, underfitting and
overfitting are the two terms that need to be checked to judge the performance of the
model and whether it is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of
the model.
o Bias: Bias is a prediction error that is introduced in the model due to
oversimplifying the machine learning algorithms. Or it is the difference between the
predicted values and the actual values.
o Variance: If the machine learning model performs well with the training dataset but
does not perform well with the test dataset, the model is said to have high variance.
 Overfitting: Overfitting occurs when our ML model tries to cover all the data points
(or more than the required data points) present in the given dataset. Because of this,
the model starts capturing the noise and inaccurate values present in the dataset, and all
these factors reduce the efficiency and accuracy of the model. The overfitted model has
low bias and high variance. The chance of overfitting increases the more we train our
model on the same data.
Overfitting is the main problem that occurs in supervised learning.
Example: The concept of overfitting can be understood from the graph of a linear
regression output below:

As we can see from the graph, the model tries to cover all the data points present in the
scatter plot. It may look efficient, but in reality it is not. The goal of the regression
model is to find the best-fit line, but here we have not found a good general fit, so the
model will generate prediction errors on unseen data.

How to avoid Overfitting in a Model


o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
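A short sketch of how overfitting typically shows up in practice: a very flexible model scores almost perfectly on the training data but noticeably worse on held-out data, which is the gap the remedies in the list above aim to close. scikit-learn is assumed, the data is synthetic, and the unconstrained decision tree is just one convenient example of an over-flexible model.

```python
# Detecting overfitting: compare training and test performance of a very flexible model (sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

deep_tree = DecisionTreeRegressor(max_depth=None).fit(X_train, y_train)   # unconstrained depth
print("Train R^2:", deep_tree.score(X_train, y_train))   # close to 1.0 (memorizes the noise)
print("Test  R^2:", deep_tree.score(X_test, y_test))     # noticeably lower: overfitting
```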
 Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data can be
stopped at an early stage, due to which the model may not learn enough from the
training data. As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training data,
and hence its accuracy is reduced and it produces unreliable predictions.
An underfitted model has high bias and low variance.
Example: We can understand underfitting using the output of the linear regression
model below:

As we can see from the diagram, the model is unable to capture the trend of the data
points present in the plot.

How to avoid underfitting:


o By increasing the training time of the model.
o By increasing the number of features

Regularization



Regularization refers to techniques that are used to calibrate machine learning models
in order to minimize the adjusted loss function and prevent overfitting or underfitting.

What is Regularization?
Regularization is one of the most important concepts of machine learning. It is a
technique to prevent the model from overfitting by adding extra information to it.
Sometimes the ML model performs well with the training data but does not perform
well with the test data. This means the model is not able to predict the output for
unseen data, because it has fitted the noise in the training data; such a model is called
overfitted. This problem can be dealt with using a regularization technique.
This technique allows us to keep all the variables or features in the model while
reducing their magnitudes. Hence, it maintains accuracy as well as the generalization
of the model.
It mainly regularizes or shrinks the coefficients of the features toward zero. In simple
words, "in a regularization technique, we reduce the magnitude of the features by
keeping the same number of features."
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model.
Let's consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + … + βnxn + b

In the above equation, Y represents the value to be predicted, X1, X2, …, Xn are the
features for Y, β0, β1, ….., βn are the weights or coefficients attached to the features,
and b represents the intercept (bias) of the model.
Linear regression models try to optimize the coefficients and the intercept to minimize
the cost function. The loss function for linear regression is called the RSS, or residual
sum of squares:
RSS = ∑ (yi − (b + β1xi1 + β2xi2 + … + βnxin))^2
Regularization now adds a penalty term to this loss function, and the parameters are
optimized so that the model can predict accurate values of Y.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Validation:
Validation of a model refers to the process where a trained model is evaluated with a
testing data set. The testing data set is a separate portion of the same data set from
which the training set is derived.
Validating the machine learning model's outputs is important to ensure its accuracy.
When a machine learning model is trained, a huge amount of training data is used, and
checking model validation gives machine learning engineers an opportunity to improve
the quality and quantity of the data.
 Cross-Validation
Cross-validation is a technique for validating model efficiency by training the model on a
subset of the input data and testing it on a previously unseen subset of the input data. We
can also say that it is a technique to check how a statistical model generalizes to an
independent dataset.
In machine learning, there is always a need to test the stability of the model; we cannot
judge the model based only on the training dataset. For this purpose, we reserve a
particular sample of the dataset which is not part of the training dataset. After that, we
test our model on that sample before deployment, and this complete process comes
under cross-validation. This is something different from the general train-test split.
The basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Evaluate the model's performance using the validation set. If the model performs
well with the validation set, proceed to the next step; otherwise, check for issues.
Methods used for Cross-Validation



There are some common methods that are used for cross-validation. These methods are
given below:
 Validation Set Approach
 Leave-P-out cross-validation
 Leave one out cross-validation
 K-fold cross-validation
 Stratified k-fold cross-validation
 Holdout Method
1) Validation Set Approach
In the validation set approach, we divide the input dataset into a training set and a test
(validation) set, with each subset getting 50% of the data.
Its big disadvantage is that we use only 50% of the dataset to train the model, so the
model may fail to capture important information in the data. It also tends to give an
underfitted model.
2) Leave-P-out cross-validation
In this approach, p data points are left out of the training data. That is, if there are
n data points in the original input dataset, then n − p data points are used as the
training set and the p data points form the validation set. This complete process is
repeated for all possible combinations of samples, and the average error is calculated to
judge the effectiveness of the model.
A disadvantage of this technique is that it can be computationally expensive for
large p.
3) Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p, we take only
1 data point out of the training set. That is, in this approach, for each learning set, only
one data point is reserved, and the remaining data is used to train the model. This
process repeats for each data point. Hence, for n samples, we get n different training
sets and n test sets. It has the following features:
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as
we iteratively check against one data point at a time.



4) K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into K groups of samples
of equal size. These samples are called folds. For each learning round, the model is
trained on k − 1 folds, and the remaining fold is used as the test set. This is a very
popular CV approach because it is easy to understand, and the output is less biased
than with other methods.
The steps for k-fold cross-validation are:
o Split the input dataset into K groups
o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the model using
the test set.
Let's take an example of 5-fold cross-validation: the dataset is grouped into 5 folds. In
the 1st iteration, the first fold is reserved for testing the model, and the rest are used to
train the model. In the 2nd iteration, the second fold is used to test the model, and the
rest are used to train the model. This process continues until each fold has been used as
the test fold. Consider the below diagram:
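A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score, which performs the split/train/evaluate loop described above and returns one score per fold; the bundled iris dataset and logistic regression model are just convenient stand-ins.

```python
# 5-fold cross-validation: each fold is used once as the test set (sketch).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X, y, cv=5)   # 5 folds -> 5 accuracy scores
print("Per-fold accuracy:", scores)
print("Mean accuracy    :", scores.mean())
```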

5) Stratified k-fold cross-validation


This technique is similar to k-fold cross-validation, with a few small changes. It works
on the concept of stratification, which is the process of rearranging the data to ensure
that each fold or group is a good representative of the complete dataset. It is one of the
best approaches for dealing with bias and variance.


It can be understood with an example of housing prices: the prices of some houses can
be much higher than those of other houses. To tackle such situations, a stratified k-fold
cross-validation technique is useful.
6) Holdout Method
This method is the simplest cross-validation technique of all. In this method, we
remove a subset of the data and use it to get prediction results from a model trained on
the remaining part of the dataset.
The error that occurs in this process tells us how well our model will perform on an
unknown dataset. Although this approach is simple to perform, it still faces the issue of
high variance, and it also sometimes produces misleading results.
Comparison of Cross-validation to train/test split in Machine Learning
o Train/test split: The input data is divided into two parts, a training set and a test
set, in a ratio such as 70:30 or 80:20. Its biggest disadvantage is that the resulting
performance estimate has high variance.
o Training Data: The training data is used to train the model, and the dependent
variable is known.
o Test Data: The test data is used to make predictions from the model that has
already been trained on the training data. It has the same features as the training data but
is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split
by splitting the dataset into several groups of train/test splits and averaging the results. It
can be used when we want to optimize a model that has been trained on the training dataset
for the best performance. It is more efficient than a single train/test split because every
observation is used for both training and testing.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output. But for inconsistent data, it
may produce drastically poor results. This is one of the big disadvantages of cross-validation,
as there is no certainty about the kind of data encountered in machine learning.
o In predictive modeling, the data evolves over time, which may cause differences
between the training and validation sets. For example, if we create a model for the
prediction of stock market values and the model is trained on the previous 5 years of
stock values, the realistic future values for the next 5 years may be drastically different,
so it is difficult to expect correct output in such situations.



Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive
modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by data scientists in
the field of medical statistics.
