Neural Networks
Syllabus
Perceptron - multilayer perceptron, activation functions, network training - gradient descent optimization - stochastic gradient descent, error backpropagation, from shallow networks to deep networks - Unit saturation (aka the vanishing gradient problem) - ReLU, hyperparameter tuning, batch normalization, regularization, dropout.
Contents
4.1 Perceptron
4.2 Activation Functions
4.3 Gradient Descent Optimization
4.4 Error Backpropagation
4.5 Shallow Networks
4.6 Deep Network
4.7 Vanishing Gradient Problem
4.8 ReLU
4.9 Hyperparameter Tuning
4.10 Normalization
4.11 Regularization
4.12 Two Marks Questions with Answers
4.1 Perceptron
4.1.1 Single Layer Perceptron
* The perceptron is a feed-forward network with one output neuron that learns a separating hyper-plane in a pattern space. The "n" linear Fx neurons feed forward to one threshold output Fy neuron. The perceptron separates linearly separable sets of patterns.
* SLP is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target (1, 0).
* We can connect any number of McCulloch-Pitts neurons together in any way we like. An arrangement of one input layer of McCulloch-Pitts neurons feeding forward to one output layer of McCulloch-Pitts neurons is known as a Perceptron.
* A single layer feed-forward network consists of one or more output neurons, each of which is connected with a weighting factor W_i to all of the inputs X_i.
* The Perceptron is a kind of single-layer artificial network with only one neuron. The Perceptron is a network in which the neuron unit calculates the linear combination of its real-valued or boolean inputs and passes it through a threshold activation function. Fig. 4.1.1 shows the Perceptron.
Fig. 4.1.1 Perceptron (Input 1 to Input N, threshold θ, sigmoid unit, output)
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
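* In the simplest case the network has only two inputs and a single output. The output of the neuron is :
$$y = f\left(\sum_{i=1}^{2} w_i x_i + b\right)$$
Suppose that the activation function is a threshold acting on the weighted sum s; then
$$f(s) = \begin{cases} 1 & \text{if } s > 0 \\ -1 & \text{if } s \le 0 \end{cases}$$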
* The Perceptron can represent most of the primitive boolean functions : AND, OR, NAND and NOR, but it cannot represent XOR.
* In a single layer perceptron, initial weight values are assigned randomly because it does not have previous knowledge. It sums all the weighted inputs. If the sum is greater than the threshold value θ then it is activated, i.e. output = 1.
$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n > \theta \;\Rightarrow\; \text{output} = 1$$
$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \le \theta \;\Rightarrow\; \text{output} = 0$$
* The input values are presented to the perceptron, and if the predicted output is the same as the desired output, then the performance is considered satisfactory and no changes to the weights are made.
* If the output does not match the desired output, then the weights need to be changed to reduce the error.
* The weight adjustment is done as follows :
$$\Delta W = \eta \times d \times x$$
Where x = Input data
d = Difference between the desired output and the predicted output
η = Learning rate
* If the output of the perceptron is correct then we do not take any action. If the output is incorrect then the weight vector is updated as W → W + ΔW.
* The process of weight adaptation is called learning.
* Perceptron Learning Algorithm :
1. Select random sample from training set as input.
2. If classification is correct, do nothing.
3. If classification is incorrect, modify the weight vector W using
$$W_i = W_i + \eta \, d(n) \, X_i(n)$$
Repeat this procedure until the entire training set is classified correctly.
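The learning algorithm above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's own program: the names `predict` and `train_perceptron`, the learning rate of 0.1, the 0/1 targets and the trainable bias term are all assumptions, and the samples are swept in order rather than drawn at random.

```python
import random

def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum exceeds 0, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train_perceptron(samples, eta=0.1, max_epochs=1000, seed=0):
    """Apply W <- W + eta * d * x, where d = desired - predicted output."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Initial weights are random because there is no previous knowledge.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        mistakes = 0
        for x, desired in samples:
            d = desired - predict(weights, bias, x)
            if d != 0:  # update only when the classification is wrong
                weights = [w + eta * d * xi for w, xi in zip(weights, x)]
                bias += eta * d
                mistakes += 1
        if mistakes == 0:  # the whole training set is classified correctly
            break
    return weights, bias

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

and_w, and_b = train_perceptron(AND_DATA)
or_w, or_b = train_perceptron(OR_DATA)
print([predict(and_w, and_b, x) for x, _ in AND_DATA])
print([predict(or_w, or_b, x) for x, _ in OR_DATA])
```

Because AND and OR are linearly separable, the loop stops once an epoch produces no mistakes. On XOR data the same loop would simply run until `max_epochs` is exhausted.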
4.1.2 Multilayer Perceptron
* A MultiLayer Perceptron (MLP) has the same structure as a single layer perceptron, with one or more hidden layers. An MLP is a network of simple neurons called perceptrons.
* A typical multilayer perceptron network consists of a set of source nodes forming the input layer, one or more hidden layers of computation nodes and an output layer of nodes.
* It is not possible to find weights which enable single layer perceptrons to deal with non-linearly separable problems like XOR : See Fig. 4.1.2.
Fig. 4.1.2
4.1.3 Limitation of Learning in Perceptron : Linear Separability
* Consider two-input patterns (X1, X2) being classified into two classes as shown in Fig. 4.1.3. Each point with either symbol of x or O represents a pattern with a set of values (X1, X2).
* Each pattern is classified into one of two classes. Notice that these classes can be separated with a single line L.
* Linear separability refers to the fact that classes of patterns with n-dimensional vector X = (X1, X2, ..., Xn) can be separated with a single decision surface. In the case above, the line L represents the decision surface.
* If two classes of patterns can be separated by a decision boundary represented by a linear equation, then they are said to be linearly separable. The simple network can correctly classify any such patterns.
* The decision boundary (i.e., W, b or θ) of linearly separable classes can be determined either by some learning procedure or by solving linear equation systems based on representative patterns of each class.
* If such a decision boundary does not exist, then the two classes are said to be linearly inseparable.
* Linearly inseparable problems cannot be solved by the simple network; a more sophisticated architecture is needed.
* Examples of linearly separable classes
1. Logical AND function
Patterns (bipolar)
x1 x2 y
-1 -1 -1
-1 1 -1
1 -1 -1
1 1 1
Decision boundary
w1 = 1
w2 = 1
b = -1
θ = 0
-1 + x1 + x2 = 0
X : Class I (y = 1)
O : Class II (y = -1)
2. Logical OR function
Patterns (bipolar)
x1 x2 y
-1 -1 -1
-1 1 1
1 -1 1
1 1 1
Decision boundary
w1 = 1
w2 = 1
b = 1
θ = 0
1 + x1 + x2 = 0
X : Class I (y = 1)
O : Class II (y = -1)
Fig. 4.1.5
* Examples of linearly inseparable classes
1. Logical XOR (exclusive OR) function
Patterns (bipolar)
x1 x2 y
-1 -1 -1
-1 1 1
1 -1 1
1 1 -1
X : Class I (y = 1)
O : Class II (y = -1)
* No line can separate these two classes, as can be seen from the fact that the following linear inequality system has no solution.
$$b - w_1 - w_2 < 0 \quad (1)$$
$$b - w_1 + w_2 \ge 0 \quad (2)$$
$$b + w_1 - w_2 \ge 0 \quad (3)$$
$$b + w_1 + w_2 < 0 \quad (4)$$
This is because we have b < 0 from (1) + (4), and b ≥ 0 from (2) + (3), which is a contradiction.
4.2 Activation Functions
* An activation function, also known as a transfer function, is used to map input nodes to output nodes in a certain fashion.
* The activation function is the most important factor in a neural network : it decides whether or not a neuron will be activated and its output transferred to the next layer.
* Activation functions help in normalizing the output between 0 to 1 or - 1 to 1. They help in the process of backpropagation due to their differentiable property. During backpropagation the weights are updated to reduce the loss, and a differentiable activation function lets gradient descent move towards a local minimum of the loss.
* In any neural network, the activation function basically decides whether a given input or received information is relevant or irrelevant.
* Activation functions give the multilayer network greater representational power than a single layer network only when non-linearity is introduced.
* The input to the activation function is sum, which is defined by the following equation.
$$\text{sum} = I_1 W_1 + I_2 W_2 + \cdots + I_n W_n + b = \sum_{j=1}^{n} I_j W_j + b$$
* Activation Function : Logistic Function
$$f(\text{sum}) = \frac{1}{1 + e^{-s \times \text{sum}}} = \left(1 + e^{-s \times \text{sum}}\right)^{-1}$$
Fig. 4.2.1 (panel : first derivative of logistic function)
* The logistic function monotonically increases from a lower limit (0 or - 1) to an upper limit (+ 1) as sum increases. Here the values vary between 0 and 1, with a value of 0.5 when sum is zero.
* Activation Function : Arc Tangent
$$f(\text{sum}) = \frac{2}{\pi} \tan^{-1}(s \times \text{sum})$$
* Activation Function : Hyperbolic Tangent
$$f(\text{sum}) = \tanh(s \times \text{sum})$$
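The three activation functions above can be compared numerically with a short Python sketch. The function names are hypothetical, and `s` is assumed here to be a slope (steepness) parameter with a default of 1, since the text does not define it.

```python
import math

def logistic(total, s=1.0):
    """Logistic function: 1 / (1 + exp(-s * sum)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s * total))

def arc_tangent(total, s=1.0):
    """Arc tangent activation: (2 / pi) * atan(s * sum), with values in (-1, 1)."""
    return (2.0 / math.pi) * math.atan(s * total)

def hyperbolic_tangent(total, s=1.0):
    """Hyperbolic tangent activation: tanh(s * sum), with values in (-1, 1)."""
    return math.tanh(s * total)

# Print each function at a few values of the weighted sum.
for total in (-2.0, 0.0, 2.0):
    print(total, logistic(total), arc_tangent(total), hyperbolic_tangent(total))
```

All three are differentiable and increase monotonically with the weighted sum, which is what backpropagation requires of an activation function.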
4.2.1 Identity or Linear Activation Function
* A linear activation is a mathematical equation used for obtaining output vectors with specific properties.
* It is a simple straight line activation function where our function is directly proportional to the weighted sum of neurons or input.
* Linear activation functions are better in giving a wide range of activations, and a line of a positive slope may increase the firing rate as the input rate increases.
* Fig. 4.2.4 shows identity function.
Fig. 4.2.4
* The equation for linear activation function is :
$$f(x) = ax$$
When a = 1 then f(x) = x and this is a special case known as identity.
* Properties :
1. Range is - infinity to + infinity.
2. Provides a convex error surface so optimisation can be achieved faster.
3. df(x)/dx = a which is constant. So cannot be optimised with gradient descent.
* Limitations :
1. Since the derivative is constant, the gradient has no relation with input.
2. Back propagation is constant as the change is delta x.
3. Activation function does not work in neural networks in practice.
4.2.2 Sigmoid
* A sigmoid function produces a curve with an "S" shape. The example sigmoid function shown in Fig. 4.2.5 is a special case of the logistic function, which models the growth of some set.
$$\text{sig}(t) = \frac{1}{1 + e^{-t}}$$
Fig. 4.2.5
* In general, a sigmoid function is real-valued and differentiable, with a first derivative that is non-negative everywhere (or non-positive everywhere) and exactly one inflection point.
* The logistic sigmoid function is related to the hyperbolic tangent as follows :
$$1 - 2\,\text{sig}(x) = 1 - \frac{2}{1 + e^{-x}} = -\tanh\frac{x}{2}$$
* Sigmoid functions are often used in artificial neural networks to introduce nonlinearity in the model. Each neuron computes a linear combination of its inputs and applies a sigmoid function to the result.
* A reason for its popularity in neural networks is that the sigmoid function satisfies a property between the derivative and itself such that it is computationally easy to perform :
$$\frac{d}{dt}\,\text{sig}(t) = \text{sig}(t)\,(1 - \text{sig}(t))$$
* Derivatives of the sigmoid function are usually employed in learning algorithms.
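Both sigmoid identities can be checked numerically. The following Python sketch compares the closed-form derivative sig(t)(1 - sig(t)) with a central finite difference; the helper names and the step size `h` are assumptions chosen for illustration.

```python
import math

def sig(t):
    """Logistic sigmoid: 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def sig_derivative(t):
    """Closed-form derivative, reusing the value of sig(t) itself."""
    y = sig(t)
    return y * (1.0 - y)

def numerical_derivative(f, t, h=1e-6):
    """Central finite-difference approximation of df/dt."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

for t in (-3.0, 0.0, 1.5):
    print(t, sig_derivative(t), numerical_derivative(sig, t))
    # Relation to the hyperbolic tangent: 1 - 2 sig(t) = -tanh(t / 2).
    print(t, 1.0 - 2.0 * sig(t), -math.tanh(t / 2.0))
```

The practical benefit is that a learning algorithm that has already computed sig(t) in the forward pass gets the derivative with one extra multiplication.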
4.3 Gradient Descent Optimization
* Gradient descent is an optimization algorithm in machine learning used to minimize a function by iteratively moving towards the minimum value of the function.
* We essentially use this algorithm when we have to find the least possible values that can satisfy a given cost function. In machine learning, more often than not we try to minimize loss functions (like Mean Squared Error). By minimizing the loss function, we can improve our model, and gradient descent is one of the most popular algorithms used for this purpose.
Fig. 4.3.1 Gradient descent algorithm (plot of the cost function)
* The graph above shows how exactly a gradient descent algorithm works.
* We first take a point on the cost function and begin moving in steps towards the minimum point. The size of that step, or how quickly we converge to the minimum point, is defined by the learning rate. We can cover more area with a higher learning rate, but at the risk of overshooting the minima. On the other hand, small steps (smaller learning rates) will consume a lot of time to reach the lowest point.
* Now, the direction in which the algorithm has to move (closer to the minimum) is also important. We calculate this by using derivatives. You need to be familiar with derivatives from calculus. A derivative is basically calculated as the slope of the graph at any particular point. We get that by finding the tangent line to the graph at that point. A steeper tangent suggests that more steps are needed to reach the minimum point; a less steep tangent suggests that fewer steps are required.
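The role of the learning rate can be seen on a one-dimensional example. The sketch below is illustrative only: the cost function (x - 3)^2, the starting point and the function names are assumptions, not taken from the text.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative: x <- x - learning_rate * grad(x)."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def cost(x):
    """Example cost function with its minimum at x = 3."""
    return (x - 3.0) ** 2

def cost_grad(x):
    """Derivative (slope of the tangent) of the example cost function."""
    return 2.0 * (x - 3.0)

# A moderate learning rate converges towards the minimum at x = 3.
good = gradient_descent(cost_grad, start=10.0, learning_rate=0.1)
# A learning rate that is too large overshoots further on every step.
bad = gradient_descent(cost_grad, start=10.0, learning_rate=1.1, steps=20)
print(good, cost(good))
print(bad, cost(bad))
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; with 1.1 it grows by a factor of 1.2 per step, which is the overshooting described above.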
4.3.1 Stochastic Gradient Descent
* The word 'stochastic' means a system or a process that is linked with a random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration.
* Stochastic Gradient Descent (SGD) is a type of gradient descent that runs one training example per iteration. It processes a training epoch for each example within a dataset and updates each training example's parameters one at a time.
* As it requires only one training example at a time, it is easier to store in allocated memory. However, it shows some computational efficiency losses in comparison to batch gradient systems, as it shows frequent updates that require more detail and speed.
* Further, due to frequent updates, it is also treated as a noisy gradient. However, sometimes it can be helpful in finding the global minimum and also escaping the local minimum.
* Advantages of Stochastic gradient descent :
a) It is easier to allocate in desired memory.
b) It is relatively fast to compute than batch gradient descent.
c) It is more efficient for large datasets.
* Disadvantages of Stochastic Gradient Descent :
a) SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.
b) SGD is sensitive to feature scaling.
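A minimal sketch of the one-example-per-update idea, assuming a straight-line model y = w x + b with squared error loss. The data set, the function name `sgd_linear_fit` and the hyperparameter values are made up for illustration.

```python
import random

def sgd_linear_fit(data, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b, updating the parameters after every single example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the examples in a random order each epoch
        for x, y in samples:
            error = (w * x + b) - y  # prediction error on this one example
            # Gradient of 0.5 * error^2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b

# Noise-free samples of the line y = 2x + 1 on the interval [0, 1].
DATA = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(11)]
w_fit, b_fit = sgd_linear_fit(DATA)
print(w_fit, b_fit)
```

Only one example is held in the update at a time, which is the memory advantage listed above. The number of epochs and the learning rate are exactly the kind of hyperparameters the disadvantages refer to.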
4.4 Error Backpropagation
* Backpropagation is a training method used for a multi-layer neural network. It is also called the generalized delta rule. It is a gradient descent method which minimizes the total squared error of the output computed by the net.
* The backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem.
* Backpropagation is a systematic method for training multi-layer artificial neural networks. It is a generalization of the Widrow-Hoff error correction rule. 80 % of ANN applications use backpropagation.
* Fig. 4.4.1 shows backpropagation network.
* Consider a simple neuron :
a. Neuron has a summing junction and activation function.
b. Any non-linear function which is differentiable everywhere and increases everywhere with sum can be used as activation function.
c. Examples : Logistic function, Arc tangent function, Hyperbolic tangent activation function.
Fig. 4.4.1 Backpropagation network (synaptic weights, threshold, summing junction, activation function, output y)
* These activation functions give the multilayer network greater representational power than a single layer network only when non-linearity is introduced.
Need of hidden layers :
1. A network with only two layers (input and output) can only represent the input with whatever representation already exists in the input data.
2. If the data is discontinuous or non-linearly separable, the innate representation is inconsistent and the mapping cannot be learned using two layers (input and output).
3. Therefore, hidden layer(s) are used between input and output layers.
* Weights connect a unit (neuron) in one layer only to those in the next higher layer. The output of the unit is scaled by the value of the connecting weight and it is fed forward to provide a portion of the activation for the units in the next higher layer.
* Backpropagation can be applied to an artificial neural network with any number of hidden layers. The training objective is to adjust the weights so that the application of a set of inputs produces the desired outputs.
* Training procedure : The network is usually trained with a large number of input - output pairs.
1. Generate weights randomly to small random values (both positive and negative) to ensure that the network is not saturated by large values of weights.
2. Choose a training pair from the training set.
3. Apply the input vector to network input.
4. Calculate the network output.
5. Calculate the error, the difference between the network output and the desired output.
6. Adjust the weights of the network in a way that minimizes this error.
7. Repeat steps 2 - 6 for each pair of input-output in the training set until the error for the entire system is acceptably low.
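The training procedure can be sketched for a small network with one hidden layer of two logistic units. Everything specific here is an assumption made for illustration: the 2-2-1 architecture, the learning rate of 0.5, the OR training set, the fixed number of epochs (instead of an error threshold) and all function names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    """Step 1: small random weights, both positive and negative."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_out = rng.uniform(-0.5, 0.5)
    return w_hidden, b_hidden, w_out, b_out

def forward(params, x):
    """Steps 3-4: apply the input and compute hidden and output activations."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def train_step(params, x, target, eta=0.5):
    """Steps 5-6: one gradient step on the error 0.5 * (output - target)^2."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden, output = forward(params, x)
    # Backward pass: the output delta first, then the hidden-layer deltas.
    delta_out = (output - target) * output * (1.0 - output)
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - eta * delta_out * h for w, h in zip(w_out, hidden)]
    new_b_out = b_out - eta * delta_out
    new_w_hidden = [[w - eta * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta_hidden)]
    new_b_hidden = [b - eta * d for b, d in zip(b_hidden, delta_hidden)]
    return new_w_hidden, new_b_hidden, new_w_out, new_b_out

def total_error(params, pairs):
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in pairs)

# Training pairs for logical OR (an assumed toy data set).
PAIRS = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
params = init_params(2, 2)
error_before = total_error(params, PAIRS)
for _ in range(2000):           # step 7: repeat over the training set
    for x, t in PAIRS:          # step 2: choose a training pair
        params = train_step(params, x, t)
error_after = total_error(params, PAIRS)
print(error_before, error_after)
```

The hidden deltas are computed from the output delta and the old output weights, which is how the error signal reaches units that have no target value of their own.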
Forward pass and backward pass :
* Backpropagation neural network training involves two passes.
1. In the forward pass, the input signals move forward from the network input to the output.
2. In the backward pass, the calculated error signals propagate backward through the network, where they are used to adjust the weights.
3. In the forward pass, the calculation of the output is carried out, layer by layer, in the forward direction. The output of one layer is the input to the next layer.
* In the reverse pass,
a. The weights of the output neuron layer are adjusted first, since the target value of each output neuron is available to guide the adjustment of the associated weights, using the delta rule.
b. Next, we adjust the weights of the middle layers. As the middle layer neurons have no target values, it makes the problem complex.
* Selection of number of hidden units
* The number of hidden units depends on the number of input units.
1. Never choose h to be more than twice the number of input units.
2. You can load p patterns of I elements into log2 p hidden units.
3. Ensure that we must have at least 1/e times as many training examples.
4. Feature extraction requires fewer hidden units than inputs.
5. Learning many examples of disjointed inputs requires more hidden units than inputs.
6. The number of hidden units required for a classification task increases with the number of classes in the task. Large networks require longer training times.
Factors influencing Backpropagation training
* The training time can be reduced by using :
1. Bias : Networks with biases can represent relationships between inputs and outputs more easily than networks without biases. Adding a bias to each neuron is usually desirable to offset the origin of the activation function. The weight of the bias is trainable similar to weight except that the input is always + 1.
2. Momentum : The use of momentum enhances the stability of the training process. Momentum is used to keep the training process going in the same general direction, analogous to the way that momentum of a moving object behaves. In backpropagation with momentum, the weight change is a combination of the current gradient and the previous gradient.
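The momentum idea can be sketched on a single weight. In this illustrative Python sketch the previous weight change is stored in `velocity`, so each new change blends the current gradient with the accumulated earlier ones. The quadratic error function, the momentum factor of 0.9 and the function name are assumptions.

```python
def momentum_descent(grad, start, learning_rate=0.1, momentum=0.9, steps=500):
    """Gradient descent where each weight change reuses part of the previous change."""
    w = start
    velocity = 0.0  # previous weight change (zero before the first step)
    for _ in range(steps):
        # New change = momentum * previous change - learning_rate * current gradient.
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w

def error_grad(w):
    """Gradient of the example error (w - 3)^2, which is smallest at w = 3."""
    return 2.0 * (w - 3.0)

result = momentum_descent(error_grad, start=10.0)
print(result)
```

Setting `momentum=0.0` reduces the update to plain gradient descent, which makes it easy to compare the two behaviours on the same error function.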
4.4.1 Advantages and Disadvantages
* Advantages of backpropagation :
1. It is simple, fast and easy to program.
2. Only numbers of the input are tuned and not any other parameter.
3. No need to have prior knowledge about the network.
4. It is flexible.
5. A standard approach and works efficiently.
6. It does not require the user to learn special functions.
* Disadvantages of backpropagation :
1. Backpropagation can be sensitive to noisy data and irregularity.
2. The performance of this is highly reliant on the input data.
3. Needs excessive time for training.
4. The need for a matrix-based method for backpropagation instead of mini-batch.
4.5 Shallow Networks
* The terms shallow and deep refer to the number of layers in a neural network. A shallow neural network has a small number of layers, usually regarded as a single hidden layer, and a deep neural network has multiple hidden layers. Each type of network performs certain tasks better than the other, and selecting the right network depth is important for creating a successful model.
* In a shallow neural network, the values of the feature vector of the data to be classified (the input layer) are passed to a hidden layer of nodes (neurons), each of which generates a response according to some activation function, g, acting on the weighted sum of those values, z.
* The response of each unit in the hidden layer is then passed to a final, output layer (which may consist of a single unit), whose activation produces the classification prediction output.
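The forward computation of a shallow network can be sketched as follows. The weights, the three-feature input and the choice of the logistic function for g are arbitrary assumptions used only to make the example concrete.

```python
import math

def g(z):
    """Activation function acting on the weighted sum z (logistic, by assumption)."""
    return 1.0 / (1.0 + math.exp(-z))

def shallow_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Pass a feature vector through one hidden layer and a single output unit."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias  # weighted sum
        hidden.append(g(z))                                  # response of this unit
    z_out = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return g(z_out)  # classification prediction output

prediction = shallow_forward(
    x=[0.5, -1.0, 2.0],
    hidden_weights=[[0.1, 0.2, 0.3], [-0.3, 0.1, 0.2]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.0, -1.0],
    output_bias=0.0,
)
print(prediction)
```

A deep network repeats the hidden-layer loop several times, feeding the responses of one hidden layer into the next before the output unit is reached.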
4.6 Deep Network
* Deep learning is a new area of machine learning research, which has been introduced with the objective of moving machine learning closer to one of its goals. Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound and text.
* 'Deep learning' means using a neural network with several layers of nodes between input and output. It is generally better than other methods on image, speech and certain other types of data because the series of layers between input and output do feature identification and processing in a series of stages, just as our brains seem to.
* Deep learning emphasizes the network architecture of today's most successful machine learning approaches. These methods are based on "deep" multi-layer neural networks with many hidden layers.
4.6.1 TensorFlow
* TensorFlow is one of the most popular frameworks used to build deep learning models. The framework is developed by the Google Brain team.
* Languages like C++, R and Python are supported by the framework to create the models as well as the libraries. This framework can be accessed from both desktop and mobile.
* The translator used by Google is the best example of TensorFlow. In this, the model is created by adding the functionalities of text classification, natural language processing, speech or handwriting recognition, image recognition, etc.
* The framework has its own visualization toolkit, named TensorBoard, which helps in powerful data visualization of the network along with its performance.
* One more tool added in TensorFlow, TensorFlow Serving, can be used for quick and easy deployment of the newly developed algorithms without introducing any change in the existing API or architecture.
* TensorFlow framework comes along with a detailed documentation for the users to adapt it quickly and easily, making it the most preferred framework to model deep learning algorithms.
* Some of the characteristics of TensorFlow are :
o Multiple GPU supported.
o One can visualize graphs and queues easily using TensorBoard.
o Powerful documentation and larger support from community.
4.6.2 Keras
* If you are comfortable in programming with Python, then learning Keras will not prove hard for you. It is the most recommended framework for creating deep learning models for those with a sound knowledge of Python.
* Keras is built purely on Python and can run on top of TensorFlow. Due to its complexity and use of low-level libraries, TensorFlow can be comparatively harder to adapt to for new users than Keras. Beginners in deep learning who find models difficult to understand in TensorFlow generally prefer Keras, because its higher-level interface makes models quicker to define.
* Keras has been developed keeping in mind the complexities in the deep learning models, and hence it can run quickly to get the results in minimum time. Convolutional as well as Recurrent Neural networks are supported in Keras. The framework can run easily on CPU and GPU.
* The models in Keras can be classified into 2 categories :
1. Sequential model :
The layers in the deep learning model are defined in a sequential manner. Hence the implementation of the layers in this model will also be done sequentially.
2. Keras functional API :
Deep learning models that have multiple outputs or shared layers, i.e. more complex models, can be implemented in the Keras functional API.
4.6.3 Difference between Deep Network and Shallow Network
| Sr. No. | Deep network | Shallow network |
| 1. | Deep network contains many hidden layers. | Shallow network contains only one hidden layer. |
| 2. | Deep network can compactly express highly complex functions over input space. | Shallow networks with one hidden layer cannot place complex functions over the input space. |
| 3. | Training in DN is easy and no issue of local minima in DN. | Training in a shallow network is more difficult with our current algorithms. |
| 4. | DN can fit functions better with less parameters than a shallow network. | Shallow nets need more parameters to have a better fit. |
4.7 Vanishing Gradient Problem
* The vanishing gradient problem is a problem that users face when training neural networks by using gradient-based methods like backpropagation. This problem makes it difficult to learn and tune the parameters of the earlier layers in the network.
* The vanishing gradient problem is essentially a situation in which a deep multilayer feed-forward network or a Recurrent Neural Network (RNN) does not have the ability to propagate useful gradient information from the output end of the model back to the layers near the input end of the model.
* It results in models with many layers being rendered unable to learn on a specific dataset. It could even cause models with many layers to prematurely converge to a substandard solution.
* When the backpropagation algorithm advances downwards or backward, going from the output layer to the input layer, the gradients tend to shrink, becoming smaller and smaller till they approach zero. This ends up leaving the weights of the initial or lower layers practically unchanged. In this situation, the gradient descent does not ever end up converging to the optimum.
* Vanishing gradient does not necessarily imply that the gradient vector is all zero. It implies that the gradients are minuscule, which would cause the learning to be very slow.
* The most important solution to the vanishing gradient problem is a specific type of neural network called Long Short-Term Memory Networks (LSTMs).
* Indication of vanishing gradient problem :
a) The parameters of the higher layers change significantly, whereas the parameters of lower layers barely change.
b) The model weights could become 0 during training.
c) The model learns at a particularly slow pace and the training could stagnate at a very early phase after only a few iterations.
* Some methods that are proposed to overcome the vanishing gradient problem :
a) Residual neural networks (ResNets)
b) Multi-level hierarchy
c) Long Short Term Memory (LSTM)
d) Faster hardware
e) ReLU
f) Batch normalization
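The shrinking of gradients can be illustrated with a simplified calculation. The sketch below assumes a chain of sigmoid units with one weight per layer and the same pre-activation value at every layer; it is a toy model of the effect, not a full network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it never exceeds 0.25 (reached at z = 0)."""
    y = sigmoid(z)
    return y * (1.0 - y)

def backpropagated_gradient(num_layers, z=0.0, weight=1.0):
    """Product of the per-layer factors weight * sigmoid'(z) along a chain of layers."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= weight * sigmoid_grad(z)
    return gradient

# The gradient reaching the first layer shrinks geometrically with depth.
for depth in (1, 5, 10, 20):
    print(depth, backpropagated_gradient(depth))
```

Even in the best case for the sigmoid (z = 0, weight 1) each layer multiplies the gradient by 0.25, so ten layers already scale it by about one millionth.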
4.8 ReLU
* Rectified Linear Unit (ReLU) helps to solve the vanishing gradient problem. ReLU is a non-linear function, or piecewise linear function, that will output the input directly if it is positive; otherwise, it will output zero.
* It is the most commonly used activation function in neural networks, especially in Convolutional Neural Networks (CNNs) and multilayer perceptrons.
* Mathematically, it is expressed as
$$f(x) = \max(0, x)$$
where x : input to neuron
* Fig. 4.8.1 shows ReLU function.
Fig. 4.8.1 ReLU function
* The derivative of an activation function is required when updating the weights during the back-propagation of the error. The slope of ReLU is 1 for positive values and 0 for negative values. It is non-differentiable when the input x is zero, but the derivative there can be safely assumed to be zero, and this causes no problem in practice.
* ReLU is used in the hidden layers instead of Sigmoid or tanh. The ReLU function solves the problem of computational complexity of the Logistic Sigmoid and Tanh functions.
* A ReLU activation unit is known to be less likely to create a vanishing gradient problem because its derivative is always 1 for positive values of the argument.
* Advantages of ReLU function :
a) ReLU is simple to compute and has a predictable gradient for the backpropagation of the error.
b) Easy to implement and very fast.
c) The calculation speed is very fast. The ReLU function has only a direct relationship.
d) It can be used for deep network training.
* Disadvantages of ReLU function :
a) When the input is negative, ReLU outputs zero and its gradient is zero, so a neuron whose inputs stay negative stops updating, i.e. it dies. This problem is also known as the Dead Neurons problem.
b) ReLU function can only be used within hidden layers of a Neural Network Model.
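ReLU and the derivative used during backpropagation take only a few lines. In this sketch the derivative at exactly zero is taken to be 0, following the convention stated in the text; the function names are hypothetical.

```python
def relu(x):
    """ReLU: pass positive inputs through unchanged, output zero otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (0 assumed at x = 0)."""
    return 1.0 if x > 0 else 0.0

for x in (-2.0, 0.0, 3.5):
    print(x, relu(x), relu_grad(x))

# A neuron whose weighted sum stays negative gets a zero gradient on every step,
# so its weights stop changing: this is the dead neuron problem.
print(relu_grad(-0.7))
```

For positive inputs the slope is exactly 1, so a chain of active ReLU units passes the gradient back without shrinking it.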
4.8.1 LReLU and EReLU
1. LReLU
* The Leaky ReLU is one of the most well-known activation functions. It is the same as ReLU for positive numbers. But instead of being 0 for all negative values, it has a constant slope (less than 1).
* Leaky ReLU is a type of activation function that helps to prevent the function from becoming saturated at 0. It has a small slope for negative inputs, whereas the standard ReLU has zero slope there.
* Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Fig. 4.8.2 shows LReLU function.
Fig. 4.8.2 LReLU function
* The leak helps to increase the range of the ReLU function. Usually, the value of the slope a is 0.01 or so.
* The motivation for using LReLU instead of ReLU is that constant zero gradients can also result in slow learning, as when a saturated neuron uses a sigmoid activation function.
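A minimal sketch of Leaky ReLU and its slope, assuming the leak coefficient is called `alpha` and defaults to 0.01 as suggested above. The function names are hypothetical.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, a small slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.0, 5.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the slope for negative inputs is small but not zero, a neuron with a negative weighted sum still receives a gradient and can recover, which is how the leak addresses the dying ReLU problem.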
2. EReLU
* An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution during the training for the positive inputs to control the amount of non-linearity.
* The EReLU is defined as : EReLU(x) = max(Rx, 0) in the output range of [0, ∞), where R is a random number.
* At the test time, the EReLU becomes the identity function for positive inputs.
4.9 Hyperparameter Tuning
* Hyperparameters are parameters whose values control the learning process and determine the values of model parameters that a learning algorithm ends up learning.
* While designing a machine learning model, one always has multiple choices for the architectural design of the model. This creates confusion about which design to choose for the model based on its optimality. And due to this, there are always trials for defining a perfect machine learning model.
* The parameters that are used to define these machine learning models are known as the hyperparameters, and the rigorous search for these parameters to build an optimized model is known as hyperparameter tuning.
* Hyperparameters are not model parameters, which can be directly trained from data. Model parameters usually specify the way to transform the input into the required output, whereas hyperparameters define the actual structure of the model that gives the required data.
4.9.1 Layer Size
* Layer size is defined by the number of neurons in a given layer. Input and output layers are relatively easy to figure out because they correspond directly to how our modeling problem handles input and output.
* For the input layer, this will match up to the number of features in the input vector. For the output layer, this will either be a single output neuron or a number of neurons matching the number of classes we are trying to predict.
* A neural network with 3 layers will often give better performance than one with 2 layers. Increasing beyond 3 layers usually does not help that much in fully connected neural networks. In the case of CNN, an increasing number of layers often makes the model better.