Unit IV Machine Learning Notes

This document provides an overview of neural networks, including their structure, properties, and learning methods. It discusses the biological inspiration behind artificial neural networks (ANNs), the architecture of basic neural networks, and various learning algorithms such as supervised and unsupervised learning. Additionally, it covers the perceptron model for pattern classification and its application in solving complex problems through adaptive learning.

Uploaded by

mansoorkhan.a006
Copyright
© © All Rights Reserved

UNIT V NEURAL NETWORKS 9

Perceptron – Multilayer perceptron, activation functions, network training –
gradient descent optimization – stochastic gradient descent, error backpropagation, from
shallow networks to deep networks – Unit saturation (aka the vanishing gradient problem) –
ReLU, hyperparameter tuning, batch normalization, regularization, dropout.

NEURAL NETWORK- INTRODUCTION

• Neural networks, also known as artificial neural networks (ANNs) or simulated
neural networks (SNNs), are a subset of machine learning and are at the heart of
deep learning algorithms. Their name and structure are inspired by the human
brain, mimicking the way that biological neurons signal to one another.
• A neural network is a method in artificial intelligence that teaches computers to
process data in a way that is inspired by the human brain.
• It is a type of machine learning process, called deep learning, that uses
interconnected nodes or neurons in a layered structure that resembles the human
brain.
• It creates an adaptive system that computers use to learn from their mistakes and
improve continuously.
• Thus, artificial neural networks attempt to solve complicated problems, like
summarizing documents or recognizing faces, with greater accuracy.
Artificial neural networks (ANNs) provide a general, practical method for learning
real-valued, discrete-valued, and vector-valued target functions.

Biological Motivation

The study of artificial neural networks (ANNs) has been inspired by the
observation that biological learning systems are built of very complex webs of
interconnected neurons.
The human information processing system is built from neurons in the brain. The
neuron is the basic building-block cell that communicates information to and from
various parts of the body.
Facts of Human Neurobiology

Number of neurons ~ 10^11
Connections per neuron ~ 10^4 to 10^5
Neuron switching time ~ 0.001 second (10^-3 s)
Scene recognition time ~ 0.1 second
At roughly 0.001 second per switch, 0.1 second allows only about 100 sequential
inference steps, which does not seem like enough
This suggests highly parallel computation based on distributed representation

Properties of Neural Networks

Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
Input is high-dimensional and discrete or real-valued (e.g., sensor input)

NEURAL NETWORK REPRESENTATIONS

A prototypical example of ANN learning is provided by the system ALVINN, which
uses a learned ANN to steer an autonomous vehicle driving at normal speeds on
public highways.
The input to the neural network is a 30x32 grid of pixel intensities obtained from a
forward-pointed camera mounted on the vehicle. The network output is the direction
in which the vehicle is steered.

Neural network learning to steer an autonomous vehicle.

The network is shown on the left side of the figure, with the input camera image
depicted below it.
Each node (i.e., circle) in the network diagram corresponds to the output of a
single network unit, and the lines entering the node from below are its inputs.
There are four units that receive inputs directly from all of the 30 x 32 pixels in
the image. These are called "hidden" units because their output is available only
within the network and is not available as part of the global network output. Each
of these four hidden units computes a single real-valued output based on a
weighted combination of its 960 inputs.
These hidden unit outputs are then used as inputs to a second layer of 30 "output" units.
Each output unit corresponds to a particular steering direction, and the output
values of these units determine which steering direction is recommended most
strongly.

APPROPRIATE PROBLEMS FOR NEURAL NETWORK LEARNING

ANN learning is well-suited to problems in which the training data corresponds to
noisy, complex sensor data, such as inputs from cameras and microphones.

ANN is appropriate for problems with the following characteristics:

1. Instances are represented by many attribute-value pairs.
2. The target function output may be discrete-valued, real-valued, or a vector of
several real- or discrete-valued attributes.
3. The training examples may contain errors.
4. Long training times are acceptable.
5. Fast evaluation of the learned target function may be required.
6. The ability of humans to understand the learned target function is not important.
Neural computing is an information processing paradigm, inspired by biological
systems, composed of a large number of highly interconnected processing
elements (neurons) working in unison to solve specific problems.
Dendrites are branching fibres that extend from the cell body, or soma. The soma, or cell
body, of a neuron contains the nucleus and other structures that support chemical
processing and the production of neurotransmitters.
The axon is a single fiber that carries information away from the soma to the synaptic sites
of other neurons (dendrites and somas), muscles, or glands.
The myelin sheath consists of fat-containing cells that insulate the axon from electrical
activity. This insulation acts to increase the rate of transmission of signals. A gap exists
between each myelin sheath cell along the axon. Since fat inhibits the propagation of
electricity, the signals jump from one gap to the next.
Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells. Since fat
serves as a good insulator, the myelin sheaths speed the rate of transmission of an
electrical impulse along the axon.
A synapse is the point of connection between two neurons or between a neuron and a muscle or a gland.
Electrochemical communication between neurons takes place at these junctions.
Terminal buttons of a neuron are the small knobs at the end of an axon that release
chemicals called neurotransmitters.
The axon hillock is the site of summation for incoming information. At any moment,
the collective influence of all neurons that conduct impulses to a given neuron will
determine whether or not an action potential will be initiated at the axon hillock and
propagated along the axon.

Artificial neuron model


Simple neural network architecture
A basic neural network has interconnected artificial neurons in three layers:

• Input Layer
Information from the outside world enters the artificial neural network from the input
layer. Input nodes process the data, analyze or categorize it, and pass it on to the next
layer.

• Hidden Layer
Hidden layers take their input from the input layer or other hidden layers. Artificial
neural networks can have a large number of hidden layers. Each hidden layer analyzes
the output from the previous layer, processes it further, and passes it on to the next
layer.

• Output Layer
The output layer gives the final result of all the data processing by the artificial neural
network. It can have single or multiple nodes. For instance, if we have a binary
(yes/no) classification problem, the output layer will have one output node, which will
give the result as 1 or 0. However, if we have a multi-class classification problem, the
output layer might consist of more than one output node.
An artificial neuron is a mathematical function conceived as a simple model of a
real (biological) neuron.
• The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic
Unit.
• A set of input connections brings in activations from other neurons.
• A processing unit sums the inputs and then applies a non-linear
activation function (i.e. squashing/transfer/threshold function).
• An output line transmits the result to other neurons.
Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds, and a single
activation function. An artificial neural network (ANN) model based on
biological neural systems is shown in the figure.
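
The three components listed above (weights, a threshold, and an activation function) can be illustrated with a minimal sketch of a threshold logic unit. The function name `threshold_unit` and the example weights are assumptions made for illustration, as is the choice to fire when the sum reaches (rather than strictly exceeds) the threshold.

```python
def threshold_unit(inputs, weights, threshold):
    """Return 1 when the weighted input sum reaches the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= threshold else 0


# Hypothetical example: two inputs with equal weights and a threshold of 2,
# so the unit fires only when both inputs are active.
print(threshold_unit([1, 1], [1, 1], threshold=2))  # 1
print(threshold_unit([1, 0], [1, 1], threshold=2))  # 0
```

Changing only the threshold (for example to 1) makes the same unit fire when either input is active, which shows how much of the behaviour is carried by these few numbers.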

Different Learning Rules


A brief classification of different learning algorithms is depicted in figure 3.

Training: It is the process in which the network is taught to change its weights
and biases.
Learning: It is the internal process of training in which the artificial neural system learns
to update/adapt the weights and biases.
The different training/learning procedures available in ANN are
• Supervised learning, Unsupervised learning
• Reinforced learning, Hebbian learning
• Gradient descent learning, Competitive learning, Stochastic learning

Requirements of Learning Laws:


• The learning law should lead to convergence of the weights
• Learning or training time should be short for capturing the information from the
training pairs
• Learning should use local information
• The learning process should be able to capture the complex non-linear mapping
between the input and output pairs
• Learning should be able to capture as many patterns as possible
• Storage capacity for the pattern information gathered during learning should be high
for the given network

Different Training methods of Artificial Neural Network


Supervised learning:
Every input pattern that is used to train the network is associated with an output pattern,
which is the target or desired pattern.
A teacher is assumed to be present during the training process, when a comparison is
made between the network’s computed output and the correct expected output to
determine the error. The error can then be used to change network parameters, which
results in an improvement in performance.
Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if
there is no teacher to present the desired patterns, and hence the system learns on its
own by discovering and adapting to structural features in the input patterns.
Reinforced learning:
In this method a teacher, though available, does not present the expected answer but
only indicates whether the computed output is correct or incorrect. The information provided
helps the network in the learning process.
PERCEPTRON Model

Simple Perceptron for Pattern Classification


A perceptron network is capable of performing pattern classification into two or more
categories. The perceptron is trained using the perceptron learning rule. We will first
consider classification into two categories and then general multiclass classification.
One type of ANN system is based on a unit called a perceptron. A perceptron is
a single-layer neural network.

Figure: A perceptron

A perceptron takes a vector of real-valued inputs, calculates a linear
combination of these inputs, then outputs a 1 if the result is greater than some
threshold and -1 otherwise.
Given inputs x1 through xn, the output O(x1, . . . , xn) computed by the perceptron is

\[ O(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases} \]

where each wi is a real-valued constant, or weight, that determines the
contribution of input xi to the perceptron output.
-w0 is a threshold that the weighted combination of inputs w1x1 + . . . + wnxn
must surpass in order for the perceptron to output a 1.
Sometimes, with an additional constant input x0 = 1, the perceptron function is written as

\[ O(\vec{x}) = \operatorname{sgn}(\vec{w} \cdot \vec{x}), \qquad \operatorname{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{otherwise} \end{cases} \]

Learning a perceptron involves choosing values for the weights w0 , . . . , wn . Therefore,
the space H of candidate hypotheses considered in perceptron learning is the set of all
possible real-valued weight vectors.
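
As a minimal sketch of the perceptron output described above, the function below prepends a constant input x0 = 1 so that w0 acts as the (negative) threshold. The name `perceptron_output` and the weight values are illustrative assumptions, not part of the notes.

```python
import numpy as np


def perceptron_output(w, x):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1."""
    x_aug = np.concatenate(([1.0], x))  # prepend the constant input x0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1


# Illustrative weights (w0, w1, w2): the threshold -w0 is 0.8.
w = np.array([-0.8, 0.5, 0.5])
print(perceptron_output(w, np.array([1.0, 1.0])))  # 1, since 0.5 + 0.5 exceeds 0.8
print(perceptron_output(w, np.array([1.0, 0.0])))  # -1, since 0.5 does not exceed 0.8
```

Each distinct weight vector `w` is one hypothesis in the space H mentioned above, so learning amounts to searching over such vectors.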

Representational Power of Perceptrons

The perceptron can be viewed as representing a hyperplane decision surface
in the n-dimensional space of instances (i.e., points).
The perceptron outputs a 1 for instances lying on one side of the hyperplane and
outputs a -1 for instances lying on the other side, as illustrated in the figure below.

Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND (~AND),
and NOR (~OR).
Some Boolean functions cannot be represented by a single perceptron, such as
the XOR function, whose value is 1 if and only if x1 ≠ x2.
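
The claims above can be checked with a small sketch. The weight triples in `GATES` are illustrative choices (many others work), and the grid search at the end is only a coarse numerical illustration that no weight setting in the searched range reproduces XOR; it is not a proof.

```python
from itertools import product


def perceptron(w0, w1, w2, x1, x2):
    """Two-input perceptron with outputs +1 / -1."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1


# Illustrative weights (w0, w1, w2) for each primitive Boolean function.
GATES = {
    "AND": (-0.8, 0.5, 0.5),
    "OR": (-0.3, 0.5, 0.5),
    "NAND": (0.8, -0.5, -0.5),
    "NOR": (0.3, -0.5, -0.5),
}


def truth_table(weights):
    """Outputs (mapped to 0/1) for the inputs (0,0), (0,1), (1,0), (1,1)."""
    return [int(perceptron(*weights, x1, x2) == 1)
            for x1, x2 in product([0, 1], repeat=2)]


for name, weights in GATES.items():
    print(name, truth_table(weights))

# Coarse grid of candidate weights from -3.0 to 3.0 in steps of 0.5.
grid = [i / 2 for i in range(-6, 7)]
xor_found = any(truth_table(w) == [0, 1, 1, 0] for w in product(grid, repeat=3))
print("XOR representable on this grid:", xor_found)
```

XOR fails because its positive and negative examples cannot be separated by a single line in the (x1, x2) plane, which is exactly the hyperplane view given above.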

How does a perceptron work?

Example: Representation of the AND function, using a weight of 0.6 on each input and a threshold of 1:

If A=0 & B=0 → 0*0.6 + 0*0.6 = 0.
This is not greater than the threshold of 1, so the output = 0.
If A=0 & B=1 → 0*0.6 + 1*0.6 = 0.6.
This is not greater than the threshold, so the output = 0.
If A=1 & B=0 → 1*0.6 + 0*0.6 = 0.6.
This is not greater than the threshold, so the output = 0.
If A=1 & B=1 → 1*0.6 + 1*0.6 = 1.2.
This exceeds the threshold, so the output = 1.
Supportive problem
Suppose that we are going to work on the AND gate problem. The gate returns 1 if and only if both
inputs are true.

X1   X2   Y
0    0    0
0    1    0
1    0    0
1    1    1

We are going to set the weights randomly. Let’s say that w1 = 0.9 and w2 = 0.9.
Round 1
We will apply the 1st instance to the perceptron: x1 = 0 and x2 = 0.

The sum unit will be 0, as calculated below:

Σ = x1 * w1 + x2 * w2 = 0 * 0.9 + 0 * 0.9 = 0

The activation unit checks whether the sum unit is greater than a threshold. If so, the unit
fires and returns 1; otherwise it returns 0. Note that modern neural network
architectures do not use this kind of step function as an activation.

The activation threshold will be 0.5.

The sum unit was 0 for the 1st instance, so the activation unit returns 0 because the sum is less than
0.5. The expected output is 0 as well, so we will not update the weights because there is no
error in this case.

Let’s focus on the 2nd instance. x1 = 0 and x2 = 1.

Sum unit: Σ = x1 * w1 + x2 * w2 = 0 * 0.9 + 1 * 0.9 = 0.9

What about errors?


The activation unit will return 1 because the sum unit is greater than 0.5. However, the output of this
instance should be 0. This instance is not predicted correctly, so we will update the
weights based on the error.

ε = actual – prediction = 0 – 1 = -1

We will add the error times the learning rate to the weights. The learning rate will be 0.5;
it is usually set to a value between 0 and 1. Note that this walkthrough applies the same correction
to every weight, whereas the standard perceptron learning rule also multiplies the correction by
the corresponding input (wi = wi + α * ε * xi).

w1 = w1 + α * ε = 0.9 + 0.5 * (-1) = 0.9 – 0.5 = 0.4

w2 = w2 + α * ε = 0.9 + 0.5 * (-1) = 0.9 – 0.5 = 0.4

Focus on the 3rd instance: x1 = 1 and x2 = 0.

Sum unit: Σ = x1 * w1 + x2 * w2 = 1 * 0.4 + 0 * 0.4 = 0.4

The activation unit will return 0 this time because the output of the sum unit is 0.4, which is less than
0.5. We will not update the weights.

Consider the 4th instance: x1 = 1 and x2 = 1.

Sum unit: Σ = x1 * w1 + x2 * w2 = 1 * 0.4 + 1 * 0.4 = 0.8

The activation unit will return 1 because the output of the sum unit is 0.8, which is greater than the
threshold value 0.5. Its actual value should be 1 as well. This means that the 4th instance is predicted
correctly. We will not update anything.

Round 2
In the previous round, the 1st instance was classified using the old weight values.
Let’s apply feed forward again with the new weight values.

Remember the 1st instance: x1 = 0 and x2 = 0.

Sum unit: Σ = x1 * w1 + x2 * w2 = 0 * 0.4 + 0 * 0.4 = 0

The activation unit will return 0 because the sum unit is 0, which is less than the threshold value 0.5.
The output of the 1st instance should be 0 as well. This means that the instance is classified
correctly. We will not update the weights.

Feed forward for the 2nd instance. x1 = 0 and x2 = 1.

Sum unit: Σ = x1 * w1 + x2 * w2 = 0 * 0.4 + 1 * 0.4 = 0.4

Activation unit will return 0 because sum unit is less than the threshold 0.5. Its output should
be 0 as well. This means that it is classified correctly and we will not update weights.

We’ve applied feed forward calculation for 3rd and 4th instances already for the current weight
values in the previous round. They were classified correctly.
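
The two rounds above can be reproduced with a short sketch. It follows the walkthrough's simplified update (add learning rate times error to every weight); the `scale_by_input` flag is a hypothetical addition that switches to the standard perceptron rule, which also multiplies the correction by the input. Function and variable names are assumptions made for illustration.

```python
def step(total, threshold=0.5):
    """Threshold activation: fire only when the sum exceeds the threshold."""
    return 1 if total > threshold else 0


def train_and_gate(w1=0.9, w2=0.9, learning_rate=0.5, max_rounds=10,
                   scale_by_input=False):
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    for round_no in range(1, max_rounds + 1):
        misclassified = 0
        for (x1, x2), target in data:
            prediction = step(x1 * w1 + x2 * w2)
            error = target - prediction
            if error != 0:
                misclassified += 1
                # Walkthrough rule: same correction for every weight.
                # Standard rule (scale_by_input=True): correction times input.
                w1 += learning_rate * error * (x1 if scale_by_input else 1)
                w2 += learning_rate * error * (x2 if scale_by_input else 1)
        print(f"round {round_no}: w1={w1:.1f}, w2={w2:.1f}, "
              f"misclassified={misclassified}")
        if misclassified == 0:  # a full pass without errors: training is done
            break
    return w1, w2


final_w1, final_w2 = train_and_gate()
```

With the starting values used in the notes, both update variants end at w1 = w2 = 0.4 after the second round, although the standard rule takes two corrections in round 1 instead of one.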

Perceptron for AND Gate


Multi-Layer Perceptron Model:
A multi-layer perceptron network has, between the input and output layers, one or more
additional layers known as hidden layers.

The multilayer perceptron falls under the category of feedforward algorithms, because inputs are
combined with the initial weights in a weighted sum and subjected to the activation function,
just like in the perceptron. The difference is that the result of each layer is propagated to
the next layer.
Each layer feeds the next one with the result of its computation, its internal
representation of the data. This goes all the way through the hidden layers to the output layer.
But there is more to it.
If the algorithm only computed the weighted sums in each neuron, propagated results to the
output layer, and stopped there, it would not be able to learn the weights that minimize the cost
function. If the algorithm only computed one iteration, there would be no actual learning.

Multi-layer Perceptron neural architecture

Structure of MLPs

A multi-layer perceptron (MLP) is composed of multiple layers of interconnected neurons.
In the student analogy used in these notes, each neuron is like a student in the group, and each
neuron is only able to perform simple arithmetic operations.

• In a typical MLP network, the input units (Xi) are fully connected to all hidden
layer units (Yj) and the hidden layer units are fully connected to all output layer
units (Zk). Each connection from an input unit to a hidden unit, and from a hidden unit to an
output unit, has an associated weight (Wij or Wjk).

• The hidden and output layer units also derive their bias values (bj or bk) from
weighted connections to units whose outputs are always 1 (true neurons).
The multilayer perceptron was developed to tackle the main limitation of the single perceptron,
which can only represent linearly separable functions. It is a neural network where
the mapping between inputs and output is non-linear. A multilayer perceptron has input and
output layers, and one or more hidden layers with many neurons stacked together. And while
in the perceptron the neuron must have an activation function that imposes a threshold,
neurons in a multilayer perceptron can use any arbitrary activation function, such as ReLU or sigmoid.

Multilayer Perceptron.
The structure of an MLP can be broken down into three main parts: the input layer, the hidden
layers, and the output layer.

• The input layer is like the teacher giving out the math problem to the students. It
receives the input data, in this case the expression 5 x 3 + 2 x 4 + 8 x 2, and passes it on
to the next layer.
• The hidden layers are like the students working together to solve the problem. Each
hidden layer contains a set of interconnected neurons, which process and analyze the
input data passed on from the previous layer. In this example, the hidden layer can have
three neurons, each one solving a specific part of the expression: "5 x 3", "2 x 4", and
"8 x 2".
• The output layer is like the student who is putting together the final solution. It
receives the output from the previous layers, combines them, and produces the final
output, which is the solution to the problem. In this example, the output neuron computes
"15 + 8 = 23" and then "23 + 16 = 39" to get the final result of 39.
• The structure of an MLP is shown here:
Fig 3 (MLP structure)

• Note that the number of neurons in the input layer must match the number of features in each
training instance, and the number of neurons in the output layer must match the number of output
labels. The hidden part of the network, however, can have any number of layers and neurons,
according to need. In general, more hidden neurons allow the network to model more complex
problems, at the cost of more training effort and a greater risk of overfitting.
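
The sizing rule above can be made concrete with a minimal NumPy sketch of one forward pass through a network with a single hidden layer. The layer sizes, the random initial weights, and the sigmoid activation are illustrative assumptions; the variable names follow the Xi / Yj / Zk and Wij / Wjk notation used in these notes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# Illustrative sizes: 3 input features, 4 hidden neurons, 2 output labels.
n_inputs, n_hidden, n_outputs = 3, 4, 2
W_ij = rng.normal(scale=0.5, size=(n_inputs, n_hidden))   # input -> hidden weights
b_j = np.zeros(n_hidden)                                  # hidden biases
W_jk = rng.normal(scale=0.5, size=(n_hidden, n_outputs))  # hidden -> output weights
b_k = np.zeros(n_outputs)                                 # output biases


def forward(x):
    """Propagate one instance through the hidden layer and the output layer."""
    y = sigmoid(b_j + x @ W_ij)  # hidden activations Y_j
    z = sigmoid(b_k + y @ W_jk)  # output activations Z_k
    return y, z


x = np.array([0.2, -1.0, 0.5])  # one instance with 3 features
hidden, output = forward(x)
print(hidden.shape, output.shape)  # (4,) (2,)
```

Only `n_hidden` is a free design choice here; the other two sizes are fixed by the data, as the note above states.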

MLP training algorithm


A Multi-Layer Perceptron (MLP) neural network trained using the Backpropagation
learning algorithm is one of the most powerful forms of supervised neural network
system.
The training of such a network involves three stages:
• feedforward of the input training pattern,
• calculation and backpropagation of the associated error
• adjustment of the weights
This procedure is repeated for each pattern over several complete passes (epochs)
through the training set. After training, application of the net only involves the
computations of the feedforward phase.
Multi-layer perceptrons working

Let's start with an example. Imagine a group of 7-year-old students who are working on a
math problem, and imagine that each of them can only do arithmetic with two numbers. If you
give them an expression like 5 x 3 + 2 x 4 + 8 x 2, how can they solve it?
To solve this problem, we can break it down into smaller parts and give them to each of the
students. One student can solve the first part of the expression, "5 x 3 = 15", and another student
can solve the second part, "2 x 4 = 8". The third student can solve the third part,
"8 x 2 = 16".
Finally, we can simplify it to 15 + 8 + 16. In the same way, one of the students in the group can solve
"15 + 8 = 23" and another one can solve "23 + 16 = 39", and that's the answer. So here we are
breaking down the large math problem into different sections and giving them to
students who each do only very simple calculations, but as a result of the teamwork they can
solve the problem efficiently.

(Example for working of MLP)


Just like how we broke down the expression into smaller parts and gave each student a specific
section to solve, in an MLP the input data is passed through different layers of interconnected
neurons, each layer solving a specific part of the problem. And just like how the students
combined their answers to get the final solution, the output of each neuron is passed on to the
next neuron, until the final output is produced, which is the solution to the complex problem.
This is just a simple example to help you visualize how neural networks work.
Neural networks are versatile and can solve many kinds of problems, not just math problems.

Applications of Multi-layer Perceptron

Multi-layer perceptrons have been used in a wide variety of applications. Some of the most
common applications of MLPs include:

• Image recognition: MLPs can be trained to recognize patterns in images and classify
them into different categories. This is useful in applications such as facial recognition,
object detection, and image segmentation.
• Natural Language Processing (NLP): MLPs can be used to understand and generate
human language. This is useful in applications such as text-to-speech, machine
translation, and sentiment analysis.
• Predictive modeling: MLPs can be used to make predictions based on past data. This is
useful in applications such as stock market prediction, weather forecasting, and fraud
detection.
• Medical diagnosis: MLPs can be used to diagnose diseases or interpret medical images by
recognizing patterns in the data.
Backpropagation Learning Algorithm

• The BACKPROPAGATION algorithm learns the
weights for a multilayer network, given a network with a fixed set of units and
interconnections. It employs gradient descent to attempt to minimize the squared
error between the network output values and the target values for these outputs.
• In the BACKPROPAGATION algorithm, we consider networks with multiple
output units rather than single units as before, so we redefine E to sum the errors
over all of the network output units:

\[ E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - O_{kd})^2 \]

where,
• outputs - is the set of output units in the network
• tkd and Okd - the target and output values associated with the kth output unit and training example d
• d - a training example in the set D of training examples
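Feed Forward phase:
• \( X_i = \text{input}[i] \)
• \( Y_j = f\left(b_j + \sum_i X_i W_{ij}\right) \)
• \( Z_k = f\left(b_k + \sum_j Y_j W_{jk}\right) \)
Backpropagation of errors (for a sigmoid activation f):
• \( \delta_k = Z_k (1 - Z_k)(d_k - Z_k) \)
• \( \delta_j = Y_j (1 - Y_j) \sum_k \delta_k W_{jk} \)
Weight updating:
• \( W_{jk}(t+1) = W_{jk}(t) + \eta \delta_k Y_j + \alpha [W_{jk}(t) - W_{jk}(t-1)] \)
• \( b_k(t+1) = b_k(t) + \eta \delta_k Y_{tn} + \alpha [b_k(t) - b_k(t-1)] \)
• \( W_{ij}(t+1) = W_{ij}(t) + \eta \delta_j X_i + \alpha [W_{ij}(t) - W_{ij}(t-1)] \)
• \( b_j(t+1) = b_j(t) + \eta \delta_j X_{tn} + \alpha [b_j(t) - b_j(t-1)] \)
Here η is the learning rate, α is the momentum coefficient, dk is the desired output of unit k,
and Xtn = Ytn = 1 are the outputs of the always-on "true neurons" that feed the biases.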
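
The three phases above can be sketched for a single training pattern as follows. This is a minimal NumPy illustration assuming a sigmoid activation, a learning rate `eta`, and a momentum coefficient `alpha`; the function name `train_pattern`, the layer sizes, and the parameter values are assumptions, not part of the notes.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_pattern(params, prev_changes, x, d, eta=0.5, alpha=0.5):
    """One feedforward / backpropagation / weight-update cycle for one pattern."""
    W_ij, b_j, W_jk, b_k = params
    # Feedforward phase
    y = sigmoid(b_j + x @ W_ij)
    z = sigmoid(b_k + y @ W_jk)
    # Backpropagation of errors (the sigmoid derivative is f * (1 - f))
    delta_k = z * (1 - z) * (d - z)
    delta_j = y * (1 - y) * (W_jk @ delta_k)
    # Weight updating: gradient step plus momentum times the previous change.
    # The bias terms use the "true neuron" output of 1.
    steps = [np.outer(x, delta_j), delta_j, np.outer(y, delta_k), delta_k]
    changes = [eta * s + alpha * p for s, p in zip(steps, prev_changes)]
    new_params = [p + c for p, c in zip(params, changes)]
    return new_params, changes, float(np.sum((d - z) ** 2))


rng = np.random.default_rng(0)
params = [rng.normal(scale=0.5, size=(2, 3)), np.zeros(3),
          rng.normal(scale=0.5, size=(3, 1)), np.zeros(1)]
prev_changes = [np.zeros_like(p) for p in params]
x, d = np.array([1.0, 0.0]), np.array([1.0])  # one made-up pattern and target

errors = []
for _ in range(50):
    params, prev_changes, err = train_pattern(params, prev_changes, x, d)
    errors.append(err)
print(f"squared error: first={errors[0]:.4f}, last={errors[-1]:.4f}")
```

Storing the previous change of each parameter is what implements the bracketed momentum term, since W(t) - W(t-1) is exactly the change applied in the previous step.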


Test stopping condition
After each epoch of training, the root mean square (RMS) error of the network for all of the patterns
in a separate validation set is calculated, summing over all patterns and all output units:

\[ E_{RMS} = \sqrt{\frac{\sum (d_k - Z_k)^2}{n \cdot k}} \]

• n is the number of patterns in the set
• k is the number of neuron units in the output layer
Training is terminated when the ERMS value for the validation set either starts to increase or
remains constant over several epochs.
This prevents the network from being overtrained (i.e. memorising the training set) and
helps the network keep its ability to generalise (i.e. correctly classify non-trained
patterns).
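
A minimal sketch of this stopping test is shown below. The helper names and the `patience` parameter (how many epochs without improvement count as "several") are assumptions made for illustration, and the validation history is made-up data.

```python
import math


def rms_error(targets, outputs):
    """E_RMS over n patterns and k output units."""
    n, k = len(targets), len(targets[0])
    total = sum((d - z) ** 2
                for t_row, o_row in zip(targets, outputs)
                for d, z in zip(t_row, o_row))
    return math.sqrt(total / (n * k))


def stopping_epoch(validation_errors, patience=3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation error has failed to improve on its
    best value (it rose or stayed constant) for `patience` epochs in a row.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(validation_errors):
        if err < best:
            best, epochs_without_improvement = err, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


print(rms_error([[1, 0], [0, 1]], [[1, 0], [0, 0]]))  # 0.5

history = [0.50, 0.40, 0.33, 0.30, 0.31, 0.30, 0.32, 0.35]  # made-up E_RMS values
print(stopping_epoch(history))  # 6
```

In practice the weights from the best epoch (index 3 in this made-up history) would usually be the ones kept.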
Need for Backpropagation:

Backpropagation is short for “backpropagation of errors” and is very useful for training neural
networks. It is fast, simple, and easy to implement. Apart from the network structure itself,
backpropagation has few parameters to set, chiefly the learning rate (and the momentum term, if one
is used). Backpropagation is a flexible method because no prior knowledge of the network is required.

Types of Backpropagation:

There are two types of backpropagation networks.

• Static backpropagation: Static backpropagation is a network designed to map
static inputs to static outputs. These types of networks are capable of solving
static classification problems such as OCR (Optical Character Recognition).
• Recurrent backpropagation: Recurrent backpropagation is another network type,
used for fixed-point learning. Activation in recurrent backpropagation is fed
forward until a fixed value is reached. Static backpropagation provides an instant
mapping, while recurrent backpropagation does not.

Advantages:

• It is simple, fast, and easy to program.
• Apart from the network structure, few parameters need tuning.
• It is flexible and efficient.
• No need for users to learn any special functions.

Disadvantages:

• It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate
results.
• Performance is highly dependent on the input data.
• Training can take a long time.
• An efficient implementation needs a matrix-based formulation rather than processing the
examples of a mini-batch one at a time.

Regularization in Machine Learning


Overfitting is a phenomenon that occurs when a machine learning model fits the
training set too closely and is not able to perform well on unseen data.
Regularization is a technique used to reduce errors on unseen data by fitting the function
appropriately on the given training set and avoiding overfitting.
The commonly used regularization techniques are:

1. L1 regularization
2. L2 regularization
3. Dropout regularization

A regression model that uses the L1 regularization technique is called LASSO (Least
Absolute Shrinkage and Selection Operator) regression.
A regression model that uses the L2 regularization technique is called Ridge regression.

Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the
loss function (L).

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss
function (L).

NOTE that during regularization the output function (y_hat) does not change. The change
is only in the loss function.
The output function (σ is the logistic sigmoid):

\[ \hat{y} = \sigma(wx + b) \]

We define the loss function in logistic regression as:

\[ L(\hat{y}, y) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \]

Loss function with no regularization:

\[ L = -\left[ y \log \sigma(wx + b) + (1 - y) \log\left(1 - \sigma(wx + b)\right) \right] \]

Let's say the model overfits the data with the above function.
Loss function with L1 regularization:

\[ L = -\left[ y \log \sigma(wx + b) + (1 - y) \log\left(1 - \sigma(wx + b)\right) \right] + \lambda \lVert w \rVert_1 \]

Loss function with L2 regularization:

\[ L = -\left[ y \log \sigma(wx + b) + (1 - y) \log\left(1 - \sigma(wx + b)\right) \right] + \lambda \lVert w \rVert_2^2 \]

lambda (λ) is a hyperparameter known as the regularization constant, and it is greater than zero:

\[ \lambda > 0 \]
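
To make the effect of the penalty terms concrete, here is a minimal NumPy sketch of the logistic-regression loss with an optional L1 or L2 penalty. The function name `regularized_loss`, the averaging over examples, and the data values are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def regularized_loss(w, b, X, y, lam=0.0, penalty=None):
    """Mean cross-entropy loss plus an optional L1 or L2 penalty on w."""
    y_hat = sigmoid(X @ w + b)          # the output function is unchanged
    eps = 1e-12                         # guards against log(0)
    data_loss = -np.mean(y * np.log(y_hat + eps)
                         + (1 - y) * np.log(1 - y_hat + eps))
    if penalty == "l1":
        return data_loss + lam * np.sum(np.abs(w))
    if penalty == "l2":
        return data_loss + lam * np.sum(w ** 2)
    return data_loss


X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])  # made-up data
y = np.array([1, 0, 1])
w, b = np.array([2.0, -3.0]), 0.1

base = regularized_loss(w, b, X, y)
l1 = regularized_loss(w, b, X, y, lam=0.1, penalty="l1")  # adds 0.1 * (2 + 3)
l2 = regularized_loss(w, b, X, y, lam=0.1, penalty="l2")  # adds 0.1 * (4 + 9)
print(base, l1, l2)
```

Because the penalty grows with the size of the weights, minimizing the regularized loss pushes the optimizer toward smaller weights, and the L1 form tends to push some of them to exactly zero.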
Dropout Regularization: Dropout regularization is a technique that randomly drops a number
of neurons in a neural network during model training. This means the contribution of the
dropped neurons is temporarily removed, and they have no effect on that training
step. The image below shows how dropout regularization works:

In the image above, the neural network on the left is the original network, in which all
neurons are active.
On the right, the red neurons have been removed from the neural network. Therefore, the red
neurons will not be considered during model training.

We will implement this concept practically using TensorFlow.

How will dropout help with overfitting?

Dropout regularization will ensure the following:

• The neurons can’t rely on one input because it might be dropped out at random. This
reduces over-reliance on any single input, which is a common cause of overfitting.
• Neurons will not learn redundant details of the inputs. This ensures only important
information is stored by the neurons. This enables the neural network to gain useful
knowledge which it uses to make predictions.
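
The random dropping described above can be sketched in plain NumPy (this is an illustration, not the TensorFlow implementation the notes refer to). It assumes the common "inverted dropout" variant, in which surviving activations are rescaled so their expected value is unchanged; the function name and the 25% rate are illustrative.

```python
import numpy as np


def dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations              # no units are dropped at prediction time
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)


rng = np.random.default_rng(0)
hidden = np.ones(10_000)                # made-up layer activations
dropped = dropout(hidden, drop_prob=0.25, rng=rng)

print((dropped == 0).mean())            # fraction of dropped units, close to 0.25
print(dropped.mean())                   # close to 1.0 thanks to the rescaling
```

A fresh mask is drawn on every call, so each training step sees a different thinned network, which is what prevents a neuron from relying on any one input.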

An unregularized network quickly overfits the training dataset. Take note of how the
validation loss for the no-dropout run diverges dramatically after only a few epochs. This
explains why the generalization error has grown.
Overfitting is avoided by training with two dropout layers and a dropout probability of 25%.
However, this affects training accuracy, so the regularised network needs to be trained over
a longer period.
Dropout improves model generalisation. Although the training accuracy is lower than that of
the unregularized network, the total validation accuracy has improved. This explains why the
generalization error has decreased.

2. Hyperparameter tuning

A Machine Learning model is defined as a mathematical model with a number of parameters
that need to be learned from the data. By training a model with existing data, we are able to
fit the model parameters.
However, there is another kind of parameter, known as hyperparameters, that cannot be
directly learned from the regular training process. They are usually fixed before the actual
training process begins. These parameters express important properties of the model such as
its complexity or how fast it should learn.
Some examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. The learning rate for training a neural network.
3. The C and sigma hyperparameters for support vector machines.
4. The k in k-nearest neighbors.

Models can have many hyperparameters, and finding the best combination of hyperparameters can
be treated as a search problem. Two widely used strategies for hyperparameter tuning are:
• GridSearchCV
• RandomizedSearchCV

GridSearchCV
In the GridSearchCV approach, the machine learning model is evaluated for a range of
hyperparameter values. This approach is called GridSearchCV because it searches for the
best set of hyperparameters from a grid of hyperparameter values.
For example, suppose we want to set two hyperparameters, C and Alpha, of the Logistic Regression
Classifier model, each with a different set of values. The grid search technique will construct many
versions of the model with all possible combinations of hyperparameters and will return the
best one.
As in the image, take C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4]. For the
combination C=0.3 and Alpha=0.2, the performance score comes out to
be 0.726 (the highest), therefore it is selected.

Drawback: GridSearchCV goes through all the combinations of
hyperparameters, which makes grid search computationally very expensive.
RandomizedSearchCV
RandomizedSearchCV addresses this drawback of GridSearchCV, as it goes through only a
fixed number of hyperparameter settings. It samples settings at random to find
the best set of hyperparameters. This approach reduces unnecessary computation.
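
The difference in cost between the two strategies can be illustrated with a small standard-library sketch that only counts candidate settings, using the C and Alpha values from the example above. The budget `n_iter = 6` is a hypothetical choice, and sampling from the enumerated grid is a simplification (scikit-learn's RandomizedSearchCV can also sample from continuous distributions).

```python
import random
from itertools import product

C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.1, 0.2, 0.3, 0.4]

# Grid search evaluates every combination: 5 * 4 = 20 candidate settings.
grid_candidates = list(product(C_values, alpha_values))
print(len(grid_candidates))  # 20

# Randomized search evaluates only a fixed number of randomly chosen settings.
random.seed(0)
n_iter = 6  # hypothetical search budget
random_candidates = random.sample(grid_candidates, n_iter)
print(len(random_candidates))  # 6
```

Each candidate is additionally trained once per cross-validation fold, so with 5 folds the grid above costs 100 model fits against 30 for the randomized budget.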

The following code illustrates how to use GridSearchCV

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X (feature matrix) and y (labels) are assumed to be loaded already

# Creating the hyperparameter grid: 15 values of C from 1e-5 to 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiating logistic regression classifier
logreg = LogisticRegression()

# Instantiating the GridSearchCV object with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Output:
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381}
Best score is 0.7708333333333334
The following code illustrates how to use RandomizedSearchCV

from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# X (feature matrix) and y (labels) are assumed to be loaded already

# Creating the hyperparameter distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiating Decision Tree classifier
tree = DecisionTreeClassifier()

# Instantiating RandomizedSearchCV object with 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Output:
Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3, 'max_features': 5, 'criterion': 'gini'}
Best score is 0.7265625
********************************************************************
