University Of Khartoum
Department Of Electronics & Electrical
Engineering
Software & Control Engineering
EEE52511: NEURAL NETWORKS
& FUZZY SYSTEMS
By: Dr. Hiba Hassan Sayed
Lecture 4
30/1/2023 U of K: Dr. Hiba Hassan 2
GRADIENT DESCENT LEARNING
Gradient Descent Learning in NN
• The gradient is the rate of change of f(x) at a particular value of x.
• For a function of one variable it is the derivative df/dx; for a function
of several variables it is the vector of partial derivatives.
• Gradient Descent Learning builds on this idea: its aim is to find the
minimum error by computing the derivative of the error function
with respect to each weight and moving the weight in the opposite
direction. It is sometimes called Gradient Descent Minimization.
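As a minimal sketch of the idea, the snippet below minimizes a one-variable function by repeatedly stepping against its derivative. The function f(x) = (x - 3)^2, the step size `eta` and the helper name `gradient_descent` are illustrative assumptions, not taken from the slides.

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # move downhill along the negative gradient
    return x


# Hypothetical example: f(x) = (x - 3)^2, so df/dx = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With these settings `x_min` ends up very close to 3, the minimizer of the assumed function.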
Finding the minimum of a function: gradient descent
Cont.
Cont.
• For a target (t) & an actual output (o), the error is given by the
following mean square error cost function,
• Where D is the set of training examples.
• There are 2 types of gradient descent based on this cost function;
they are stated next.
Where,
Batch Training
• Batch Training: In batch mode the weights and biases of the
network are updated only after the entire training set has been
applied to the network. The gradients calculated at each training
example are added together to determine the change in the weights
and biases.
• Batch Gradient Descent: In the batch steepest descent training
function the weights and biases are updated in the direction of the
negative gradient of the performance function.
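The batch mode described above can be sketched for a single linear unit as follows. The cost is assumed to be the usual sum-of-squares form E = 1/2 Σ_{d∈D} (t_d - o_d)^2 (the slide's own equation did not survive extraction), and the toy data, `eta` and the function name `batch_epoch` are hypothetical.

```python
def batch_epoch(weights, bias, samples, eta):
    """One batch update: sum the gradients over all samples, then step once."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, t in samples:
        o = sum(w * xi for w, xi in zip(weights, x)) + bias  # linear output
        err = t - o
        for i, xi in enumerate(x):
            grad_w[i] += err * xi  # equals -dE/dw_i for E = 1/2 * sum (t - o)^2
        grad_b += err
    # Weights change only after the whole training set has been seen.
    new_w = [w + eta * g for w, g in zip(weights, grad_w)]
    return new_w, bias + eta * grad_b


# Hypothetical training set generated from t = 2x + 1.
samples = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w, b = [0.0], 0.0
for _ in range(2000):
    w, b = batch_epoch(w, b, samples, eta=0.05)
```

After training, `w[0]` and `b` approach 2 and 1, the values used to generate the assumed data.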
Batch Gradient Descent with Momentum
• This algorithm often provides faster convergence.
• Momentum allows a network to respond not only to the local
gradient, but also to recent trends in the error surface.
• Acting like a low-pass filter, momentum allows the network to ignore
small features in the error surface.
• Without momentum a network may get stuck in a shallow local
minimum, such as shown in the next figure.
Cont.
[Figures: local and global minima; effect of adding momentum]
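A minimal sketch of the momentum update, assuming the common form Δw(t) = -η·gradient + α·Δw(t-1). It compares plain gradient descent (α = 0) with momentum (α = 0.9) on the hypothetical bowl f(x) = 0.5 x^2; it illustrates the faster convergence claim only, not the escape from a local minimum, and all names and constants are assumptions.

```python
def descend(grad, x0, eta, alpha, steps):
    """Gradient descent with a momentum term alpha (alpha = 0 gives plain descent)."""
    x, dx_prev = x0, 0.0
    for _ in range(steps):
        dx = -eta * grad(x) + alpha * dx_prev  # local gradient plus recent trend
        x += dx
        dx_prev = dx
    return x


def grad(x):
    return x  # derivative of the assumed function f(x) = 0.5 * x^2


x_plain = descend(grad, x0=1.0, eta=0.01, alpha=0.0, steps=100)
x_momentum = descend(grad, x0=1.0, eta=0.01, alpha=0.9, steps=100)
```

With the same small learning rate and the same number of steps, `x_momentum` ends much closer to the minimum at 0 than `x_plain`.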
Incremental Mode Gradient Descent
• When we use the gradient with respect to one training example at
a time, gradient descent becomes the Widrow-Hoff delta rule, which
is given by,
$\Delta w_i = \eta (t - o) x_i$
• Also called the Least Mean Square, LMS, method.
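The incremental (LMS) mode can be sketched as below: the weights change after every single example rather than once per pass. The linear unit, the toy data and the name `lms_train` are illustrative assumptions.

```python
def lms_train(samples, eta=0.05, epochs=500):
    """Incremental delta rule: update the weights after each training example."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, t in samples:
            o = sum(wi * xi for wi, xi in zip(w, x)) + b  # unthresholded output
            err = t - o
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]  # delta rule step
            b += eta * err
    return w, b


# Hypothetical training set generated from t = 2x + 1.
samples = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w_lms, b_lms = lms_train(samples)
```

On this noise-free data the incremental updates settle on the same solution as the batch version, w ≈ 2 and b ≈ 1.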
LMS Learning Rule
Mean Square Error:
• Like the perceptron learning rule, the least mean square (LMS) algorithm,
i.e. the delta rule, is an example of supervised training, in which the learning
rule is provided with a set of examples of desired network behavior:
$\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$
• We want to minimize the average of the squared errors
between target & actual network output:
$mse = \frac{1}{Q} \sum_{k=1}^{Q} e(k)^2 = \frac{1}{Q} \sum_{k=1}^{Q} (t(k) - a(k))^2$
LMS Algorithm/ Widrow-Hoff rule
• The LMS algorithm was presented by Widrow and Hoff; hence it is
also called the Widrow-Hoff learning algorithm.
• As seen before, it is based on an approximate steepest descent
procedure.
• Widrow and Hoff's insight was that the mean square error can be
estimated by the squared error at each iteration.
Comparing Perceptron & Delta Rules
• Perceptron rule
   • Thresholded output.
   • Converges after a finite number of iterations to a hypothesis that perfectly
   classifies the training data, provided the training examples are linearly
   separable.
   • Requires linearly separable data.
• Delta rule
   • Unthresholded output.
   • Converges toward the error minimum, possibly requiring unbounded time,
   but converges regardless of whether the training data are linearly
   separable or not.
   • Also handles linearly non-separable data.
Adaptive Linear Neuron Network Architecture (ADALINE)
• The ADALINE network is a single layer neural network with multiple
nodes, where each node accepts multiple inputs to generate one output.
• ADALINE networks are similar to the perceptron, but their transfer
function is linear rather than hard-limiting. This allows their outputs to
take on any value, whereas the perceptron output is limited to either 0 or
1.
• Both the ADALINE and the perceptron can only solve linearly separable
problems.
Cont.
• An adaptive linear system responds to changes in its environment
as it is operating.
• These networks are often used in error cancellation, signal
processing, and control systems. For example, they are used by
many long distance phone lines for echo cancellation.
• The pioneering work in this field was done by Widrow and Hoff,
who gave the name ADALINE to adaptive linear elements.
The ADALINE Neural Network
Cont.
• Multiple layer ADALINE is called MADALINE.
• The Widrow-Hoff rule can only train single-layer linear networks.
• This is not much of a disadvantage; single-layer linear networks are
just as capable as multilayer linear networks.
• For every multilayer linear network, there is an equivalent single-
layer linear network.
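The equivalence stated above can be checked numerically: with linear transfer functions, a two-layer network with weight matrices W1 and W2 computes W2(W1 x), which equals (W2 W1) x. The matrices, the input and the helper names below are hypothetical values chosen only for illustration, and biases are left out for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden linear units
W2 = [[1.0, 0.0, 2.0]]                      # 3 hidden units -> 1 linear output
x = [0.5, -1.0]

two_layer_output = matvec(W2, matvec(W1, x))  # pass through both layers
W_equivalent = matmul(W2, W1)                 # collapse into a single layer
single_layer_output = matvec(W_equivalent, x)
```

Both outputs are identical, which is why adding linear layers does not add representational power.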
BACKPROPAGATION ALGORITHM
BackPropagation Algorithm
• The backpropagation algorithm was made popular by Rumelhart,
Hinton and Williams in 1986 ["Learning Internal Representations by
Error Propagation", in Rumelhart, David E.; McClelland, James
L. (eds.), Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, Vol. 1: Foundations. Cambridge: MIT
Press. ISBN 0-262-18120-7].
• The researchers used semi-linear neurons with differentiable activation
functions in the hidden neurons (logistic activation functions or
sigmoids).
Cont.
• The error between the target and actual output is calculated at
every iteration and is back propagated through the layers of the
ANN to adapt the weights.
• The weights are adapted such that the error is minimized.
• Once the error has fallen to an acceptably small value, the training
is stopped.
• Among the first applications of the BP algorithm was NETtalk, a speech
synthesis system developed by Terrence Sejnowski
[Sejnowski & Rosenberg, 1987, “Parallel Networks that Learn to
Pronounce English Text”, Complex Systems 1, 145-168].
Cont.
• The configuration for training a neural network using the BP
algorithm is shown in the figure below.
The Generalized Delta Rule (G.D.R.)
• In the BP algorithm, as in other learning algorithms, the goal is to find
the change in each adaptive weight (Δw); the rule that gives this change
is known as the G.D.R.
• Consider the following ANN model:
Cont.
• We need to obtain the following algorithm to adapt the weights
between the output (k) and hidden (j) layers:
• Where the weights are adapted as follows:
• Here t is the iteration number and δk is the error signal between the
output and hidden layers, given by:
Cont.
• Adaptation between input (i) and hidden (j) layers :
• The new weight is thus:
• and the error signal through layer j is:
• Where,
• And,
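The equations on these two slides did not survive extraction. For reference, the generalized delta rule in its commonly stated form for sigmoid units is written out below; this form is an assumption, chosen because it matches the η, α, "previous change in weight" and hidden output used in the worked XOR example. Here t_k denotes the target of output neuron k, (t) the iteration number, and δ_k, δ_j the error signals.

```latex
\begin{align*}
\Delta w_{kj}(t+1) &= \eta\,\delta_k\,O_j + \alpha\,\Delta w_{kj}(t) \\
w_{kj}(t+1) &= w_{kj}(t) + \Delta w_{kj}(t+1) \\
\delta_k &= O_k\,(1 - O_k)\,(t_k - O_k) \\
\Delta w_{ji}(t+1) &= \eta\,\delta_j\,O_i + \alpha\,\Delta w_{ji}(t) \\
w_{ji}(t+1) &= w_{ji}(t) + \Delta w_{ji}(t+1) \\
\delta_j &= O_j\,(1 - O_j)\sum_k \delta_k\,w_{kj} \\
net_j &= \sum_i w_{ji}\,O_i + \theta_j, \qquad O_j = \frac{1}{1 + e^{-net_j}}
\end{align*}
```

The factors O(1 - O) are the derivative of the logistic function, which is why a differentiable activation is required.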
Backpropagation Algorithm
• The following ANN model is used to derive the backpropagation
algorithm:
BP (cont.)
• Backpropagation has two steps:
   • Forward propagation, and
   • Backward propagation.
• Our ANN model has the following assumptions:
   • A multilayer NN model with two layers of weights, i.e. with 1 set of
   hidden neurons.
   • Neurons in layer i are fully connected to layer j and neurons in
   layer j are fully connected to layer k.
   • Input layer neurons have linear activation functions and hidden
   and output layer neurons have logistic activation functions
   (sigmoids).
Note: Sigmoid Function
• Sigmoids have a parameter c that controls their slope, referred to here
as the firing angle.
Cont.
• When c is large, the sigmoid becomes like a threshold function and
when c is small, the sigmoid becomes more like a straight line
(linear).
• When c is large, learning is much faster but a lot of information is
lost; when c is small, learning is very slow but information
is retained.
• Since this function is differentiable, it enables the B.P. algorithm to
adapt the lower layers of weights in a multilayer neural network.
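A short sketch of the effect of c, assuming the usual logistic form f(x) = 1 / (1 + e^(-c x)); the slide's own formula was lost in extraction, and the function names are illustrative.

```python
import math


def sigmoid(x, c=1.0):
    """Logistic function with slope parameter c (assumed form)."""
    return 1.0 / (1.0 + math.exp(-c * x))


def sigmoid_derivative(x, c=1.0):
    """Derivative of the logistic function, c * f(x) * (1 - f(x))."""
    s = sigmoid(x, c)
    return c * s * (1.0 - s)


steep = sigmoid(1.0, c=50.0)    # large c: output is nearly a hard threshold
shallow = sigmoid(1.0, c=0.01)  # small c: output stays close to 0.5 + c * x / 4
```

The derivative is what backpropagation uses to pass the error signal through a neuron; for c = 1 it peaks at 0.25 when x = 0.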
Cont.
• The firing angle used here is c=1.
• Bias weights are used with bias signals of 1 for hidden (j) and output
layer (k) neurons.
• In many ANN models, bias weights (θ) with bias signals of 1 are used to
speed up the convergence process.
• The learning parameter is given by the symbol η and is usually fixed at a
value between 0 and 1; however, in many applications nowadays an
adaptive η is used.
• Usually η is set large in the initial stage of learning and reduced to a
small value at the final stage of learning.
• A momentum term α is also used in the G.D.R. to help avoid local minima.
Steps of BP Algorithm
• Step 1: Obtain a set of training patterns.
• Step 2: Set up neural network model: No. of Input neurons, Hidden
neurons, and Output Neurons.
• Step 3: Set learning rate η and momentum rate α
• Step 4: Initialize all connection weights Wji, Wkj and bias weights θj, θk to
random values.
• Step 5: Set minimum error, Emin
• Step 6: Start training by applying input patterns one at a time and
propagate through the layers then calculate total error.
Cont.
• Step 7: Backpropagate error through output and hidden layer and
adapt weights.
• Step 8: Backpropagate error through hidden and input layer and
adapt weights.
• Step 9: Check if Error < Emin
• If not, repeat Steps 6-9. If yes, stop training.
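The nine steps can be sketched as a small training loop. This is a sketch under stated assumptions: the error signals use the standard sigmoid form of the generalized delta rule, the toy task is the OR function (a hypothetical choice that converges easily, not the XOR example that follows), and `train_bp`, the seed, the weight range and the hyperparameter values are all illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_bp(patterns, n_hidden=2, eta=0.3, alpha=0.8, e_min=0.01,
             max_epochs=20000, seed=0):
    """Train a one-hidden-layer, single-output network; return the error per epoch."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # Step 4: random weights; the last entry of each row is the bias weight.
    w_ji = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_kj = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    dw_ji = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_kj = [0.0] * (n_hidden + 1)
    history = []
    for _ in range(max_epochs):
        total_error = 0.0
        for x, t in patterns:
            # Step 6: forward propagation (bias signal of 1 appended).
            o_i = list(x) + [1.0]
            o_j = [sigmoid(sum(w * v for w, v in zip(row, o_i))) for row in w_ji] + [1.0]
            o_k = sigmoid(sum(w * v for w, v in zip(w_kj, o_j)))
            total_error += 0.5 * (t - o_k) ** 2
            # Error signals, computed before any weight is changed.
            delta_k = o_k * (1.0 - o_k) * (t - o_k)
            delta_j = [o_j[j] * (1.0 - o_j[j]) * delta_k * w_kj[j] for j in range(n_hidden)]
            # Step 7: adapt hidden-to-output weights (with momentum).
            for j in range(n_hidden + 1):
                dw_kj[j] = eta * delta_k * o_j[j] + alpha * dw_kj[j]
                w_kj[j] += dw_kj[j]
            # Step 8: adapt input-to-hidden weights (with momentum).
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_ji[j][i] = eta * delta_j[j] * o_i[i] + alpha * dw_ji[j][i]
                    w_ji[j][i] += dw_ji[j][i]
        history.append(total_error)
        if total_error < e_min:  # Step 9: stop once Error < Emin
            break
    return history


# Steps 1-3 and 5: hypothetical OR training set and hyperparameters.
or_patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
history = train_bp(or_patterns)
```

`history[-1]` holds the total error of the last epoch; training stops early as soon as it drops below `e_min`.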
Solving an XOR Problem
• In this example we use the BP algorithm to solve a 2-bit XOR problem.
• The training patterns of this ANN are the XOR examples given in the next
table.
• For simplicity, the ANN model has only 4 neurons (2 inputs, 1 hidden and
1 output) and has no bias weights.
• The input neurons have linear functions and the hidden and output
neurons have sigmoid functions.
• The weights are initialized randomly.
• We train the ANN by providing the patterns #1 to #4 through an iteration
process until the error is minimized.
Cont.
• The training patterns of this ANN are the XOR examples given in
the following table:
Pattern   Input 1   Input 2   Target
#1        0         0         0
#2        0         1         1
#3        1         0         1
#4        1         1         0
Cont.
• The ANN model and its initial weights,
• Training begins when pattern #1 and its target are provided to the
ANN.
• 1st pattern: 0, 0; target: 0
Compute the error by comparing this value to the target,
Cont.
• This error is now backpropagated through the layers following the
error signal equations given as follows:
• Between output (k) and hidden (j) layer
• Thus
• Between hidden (j) and input (i) layer: error signal = -0.0035
Cont.
• Now we have calculated the error signal between layers (k) and (j)
• Suppose the learning rate and momentum term are chosen as follows:
• η = 0.1 and α = 0.9
• and the previous change in weight is 0 and Ojo= 0.5
• Then,
ΔWkj = -0.0064
Cont.
• This is the increment of the weight after the first iteration for the
weight between layers k and j.
• Now this change in weight is added to the actual weight as follows
• and thus the weight between layers k and j has been adapted.
Cont.
• Similarly for the weights between layers j and i, the adaptation follows
• Now this change in weight is added to the actual weight as follows:
• and this is the adapted weight between layers j and i after pattern#1 is
seen by the ANN in the first iteration.
• The whole calculation is then repeated for the next pattern (pattern#2 =
[0, 1]) with tk=1.
• After all the 4 patterns have been completed the whole process is
repeated for pattern#1 again.
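The first training step can be traced in code. The initial weights of the example were lost with the figure, so the values of `w_ji` and `w_kj` below are hypothetical; the sketch also assumes the 2-1-1 network has no direct input-to-output connections and uses the standard sigmoid error-signal formulas. Only η = 0.1, α = 0.9, a previous weight change of 0, pattern #1 and its target come from the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


ETA, ALPHA = 0.1, 0.9      # learning rate and momentum from the example
x = (0.0, 0.0)             # pattern #1
t_k = 0.0                  # its target
w_ji = [0.2, -0.3]         # hypothetical input-to-hidden weights
w_kj = 0.11                # hypothetical hidden-to-output weight

# Forward propagation (no bias weights in this example).
o_j = sigmoid(w_ji[0] * x[0] + w_ji[1] * x[1])  # 0.5, because both inputs are 0
o_k = sigmoid(w_kj * o_j)

# Backward propagation of the error signals.
delta_k = o_k * (1.0 - o_k) * (t_k - o_k)
delta_j = o_j * (1.0 - o_j) * delta_k * w_kj

# Weight changes; the previous change in weight is 0 on the first step.
dw_kj = ETA * delta_k * o_j + ALPHA * 0.0
dw_ji = [ETA * delta_j * xi + ALPHA * 0.0 for xi in x]
w_kj_new = w_kj + dw_kj
w_ji_new = [w + dw for w, dw in zip(w_ji, dw_ji)]
```

With the assumed `w_kj = 0.11` the results land close to the -0.0035 and -0.0064 quoted in the example, but that weight is a guess rather than the slide's value. Because both inputs of pattern #1 are 0, the input-to-hidden weights do not change on this step.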
UNSUPERVISED LEARNING
Unsupervised Learning
• Unsupervised learning is the process of finding structure, patterns
or correlation in the given data.
• This type of learning often depends on associative learning
procedures.
• We focus on two main approaches:
   • Unsupervised Hebbian learning
      • Principal component analysis
   • Unsupervised competitive learning
      • Clustering
Types of Analysis used in
Unsupervised Learning
• Correlational analysis
   • Identifying the correlations among features.
   • Accomplished via Hebbian learning.
• Cluster analysis
   • Identifying the relational structure of the data.
   • Accomplished via competitive learning.
• Cluster analysis is a form of categorization, whereas
correlational analysis is a form of simplification.
Hebbian Learning
• An association principle was proposed by Hebb in 1949 in the
context of biological neurons.
• Hebb’s principle
When a neuron repeatedly excites another neuron, then the
threshold of the latter neuron is decreased, or the synaptic
weight between the neurons is increased, in effect increasing
the likelihood of the second neuron to be excited by the first.
Hebbian Learning as Correlation Learning
• Hebbian learning is a form of associative learning: it associates things that
occur together.
• Thus Hebbian learning can be thought of as learning the auto-
correlation of the input space.
• Example: a child recognizes a banana by its shape & wants to eat it.
He then smells it, and after a couple of such exposures he starts
drooling as soon as he smells it, even without seeing it.
• Conclusion: the child has associated the smell with the banana &
produces a response (hunger effect) even without seeing its shape.
Cont.
• Brilliant idea by Hebb (1949): cells that fire together, wire
together
[Figure: banana-smell neuron connected to hungry neuron]
Hebbian Learning Neural Network
[Figure: Hebbian learning network with input signals, neurons i and j, and output signals]
Banana Associator Example
Example (cont.)
• The inputs are defined as follows:
• If we want the network to associate the response to the shape of
the banana & not its smell, w0 is assigned a value greater than –b,
while w is assigned a value less than –b.
• Hence we choose w0 = 1 & w = 0.
• The output of the network reduces to;
a = hardlim(p0 - 0.5)
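A minimal sketch of the associator learning the smell. It assumes the neuron computes a = hardlim(w0·p0 + w·p + b) with b = -0.5 (consistent with the reduced output above), that p0 = 1 when the shape is detected and p = 1 when the smell is detected (the slide's input definitions were lost), and that w is adapted with the unsupervised Hebb rule at a learning rate of 1. The function names and the stimulus sequence are hypothetical.

```python
def hardlim(n):
    """Hard-limit transfer function: 1 if n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def banana_associator(stimuli, w0=1.0, w=0.0, b=-0.5, rate=1.0):
    """Present (shape, smell) pairs and adapt the smell weight w by Hebb's rule."""
    responses = []
    for p0, p in stimuli:
        a = hardlim(w0 * p0 + w * p + b)
        w += rate * a * p  # strengthen w only when smell and response coincide
        responses.append(a)
    return responses, w


# Hypothetical sequence: smell only, then shape and smell, then smell only.
stimuli = [(0, 1), (1, 1), (0, 1)]
responses, w_final = banana_associator(stimuli)
```

The first smell-only presentation gives no response, but after one joint presentation of shape and smell the network responds to the smell alone.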
Hebbian Learning
• Hebbian learning rule: $\Delta w_{ji} = \eta y_j x_i$
• Consider the update of a single weight w,
$w(n+1) = w(n) + \eta y(n) x(n)$
• For a linear activation function, $y(n) = w(n) x(n)$, so
$w(n+1) = w(n) [1 + \eta x^2(n)]$
• The weights increase without bound. If the initial weight is negative, it
grows in the negative range; if it is positive, it grows in the
positive range.
• Hebbian learning is naturally unstable.
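The instability is easy to reproduce. The sketch below applies the plain Hebbian update to one linear unit with a constant input; the input value, `eta`, the starting weights and the name `hebb_trace` are illustrative assumptions.

```python
def hebb_trace(w0, inputs, eta):
    """Apply w <- w + eta * y * x with a linear unit y = w * x; return every weight."""
    w = w0
    trace = [w]
    for x in inputs:
        y = w * x         # linear activation
        w += eta * y * x  # equivalent to w * (1 + eta * x^2)
        trace.append(w)
    return trace


inputs = [1.0] * 50
trace_pos = hebb_trace(0.5, inputs, eta=0.1)   # positive start grows positive
trace_neg = hebb_trace(-0.5, inputs, eta=0.1)  # negative start grows negative
```

With x = 1 and η = 0.1 the weight is multiplied by 1.1 at every step, so after 50 steps its magnitude has grown from 0.5 to more than 50.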
Oja’s Learning Rule
• To solve the problem of the simple Hebbian rule, which causes the
weights to increase (or decrease) without bound, the weights need to
be normalized to unit length as follows,
$w_{ji}(n+1) = \frac{w_{ji}(n) + \eta y_j(n) x_i(n)}{\sqrt{\sum_i [w_{ji}(n) + \eta y_j(n) x_i(n)]^2}}$
• This equation effectively imposes a constraint on the weights.
• Oja approximated the normalization (for small η) as:
Oja’s Rule (continued)
$w_{ji}(n+1) = w_{ji}(n) + \eta y_j(n) [x_i(n) - y_j(n) w_{ji}(n)]$
• This rule is also known as the generalized Hebbian rule.
• The 2nd term, $-\eta y_j^2(n) w_{ji}(n)$, is called a weight decay term or a ‘forgetting term’.
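A small sketch of Oja's rule keeping the weights bounded. The two input vectors, the starting weights, `eta` and the name `oja_train` are hypothetical; the inputs are chosen so that the direction of largest variance is the first axis.

```python
import math


def oja_train(w, inputs, eta=0.05, cycles=1000):
    """Cycle through the inputs, applying w <- w + eta * y * (x - y * w)."""
    w = list(w)
    for _ in range(cycles):
        for x in inputs:
            y = sum(wi * xi for wi, xi in zip(w, x))  # linear output
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


inputs = [(2.0, 0.0), (0.0, 1.0)]  # larger variance along the first axis
w_oja = oja_train([0.3, 0.4], inputs)
norm = math.sqrt(sum(wi * wi for wi in w_oja))
```

Instead of growing without bound, the weight vector settles at unit length and points along the first axis, the principal direction of this assumed data.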