Part 3
ARTIFICIAL NEURAL NETWORKS: AN INTRODUCTION
DEFINITION OF NEURAL NETWORKS
According to the DARPA Neural Network Study (1988, AFCEA
International Press, p. 60):
• ... a neural network is a system composed of many simple processing
elements operating in parallel whose function is determined by network
structure, connection strengths, and the processing performed at
computing elements or nodes.
According to Haykin (1994):
A neural network is a massively parallel distributed processor that has a
natural propensity for storing experiential knowledge and making it
available for use. It resembles the brain in two respects:
• Knowledge is acquired by the network through a learning process.
• Interneuron connection strengths known as synaptic weights are
used to store the knowledge.
BRAIN COMPUTATION
The human brain contains about 10 billion nerve cells, or
neurons. On average, each neuron is connected to other
neurons through approximately 10,000 synapses.
BIOLOGICAL (MOTOR) NEURON
ARTIFICIAL NEURAL NET
Information-processing system.
Neurons process the information.
The signals are transmitted by means of connection links.
The links possess an associated weight.
The output signal is obtained by applying an activation function to the net
input.
MOTIVATION FOR NEURAL NET
Scientists are challenged to use machines more effectively for
tasks currently solved by humans.
Symbolic rules don't reflect processes actually used by humans.
Traditional computing excels in many areas, but not in others.
The major areas in which neural nets offer advantages are:
Massive parallelism
Distributed representation and computation
Learning ability
Generalization ability
Adaptivity
Inherent contextual information processing
Fault tolerance
Low energy consumption.
ARTIFICIAL NEURAL NET
[Figure: X1 and X2 feed Y through the weights W1 and W2]
The figure shows a simple artificial neural net with two input neurons
(X1, X2) and one output neuron (Y). The interconnection weights are
given by W1 and W2.
ASSOCIATION OF BIOLOGICAL NET
WITH ARTIFICIAL NET
PROCESSING OF AN ARTIFICIAL NET
The neuron is the basic information processing unit of a NN. It consists
of:
1. A set of links, describing the neuron inputs, with weights W1, W2,
…, Wm.
2. An adder function (linear combiner) for computing the weighted
sum of the inputs (real numbers):
$u = \sum_{j=1}^{m} W_j X_j$
3. Activation function for limiting the amplitude of the neuron output.
y = ϕ (u + b)
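The three parts listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the sample inputs, and the choice of the binary sigmoid for ϕ are assumptions made for the example.

```python
import math

def binary_sigmoid(v):
    """One possible choice of activation function (assumed for this sketch)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias, activation):
    """Compute y = phi(u + b), where u is the weighted sum of the inputs."""
    u = sum(w * x for w, x in zip(weights, inputs))  # adder (linear combiner)
    return activation(u + bias)                      # activation limits the output

# Hypothetical values: two inputs, two weights, and a bias.
y = neuron_output([1.0, 0.5], [0.4, -0.2], 0.1, binary_sigmoid)
```

Passing a different `activation` (for example the identity) changes only the last step; the weighted sum stays the same.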
BIAS OF AN ARTIFICIAL NEURON
The bias value is added to the weighted sum
∑wixi so that the decision boundary can be shifted away from the origin.
Yin = ∑wixi + b, where b is the bias
[Figure: the parallel lines x1 - x2 = -1, x1 - x2 = 0 and x1 - x2 = 1 drawn in the (x1, x2) plane]
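As a small check of how the bias moves the decision line, the sketch below assumes the weights w1 = 1 and w2 = -1 (suggested by the lines x1 - x2 = constant in the figure). With these weights, Yin = 0 on the line x1 - x2 = -b.

```python
def net_input(x, w, b):
    """Yin = sum(w_i * x_i) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = (1.0, -1.0)  # assumed weights, so that Yin = x1 - x2 + b

# Each bias value maps to the constant c of its decision line x1 - x2 = c.
decision_lines = {b: -b for b in (1, 0, -1)}
```

Only b = 0 gives a line through the origin; the other two bias values shift it to either side.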
MULTI LAYER ARTIFICIAL NEURAL NET
INPUT: records without the class attribute, with normalized attribute
values.
INPUT VECTOR: X = {x1, x2, …, xn}, where n is the number of
(non-class) attributes.
INPUT LAYER: there are as many nodes as non-class attributes, i.e.
as the length of the input vector.
HIDDEN LAYER: the number of nodes in the hidden layer and the
number of hidden layers depend on the implementation.
OPERATION OF A NEURAL NET
[Figure: the input vector x = (x0, x1, …, xn) and the weight vector w = (w0j, w1j, …, wnj) are combined into a weighted sum ∑, the bias is applied, and the activation function f produces the output y]
WEIGHT AND BIAS UPDATING
Per Sample Updating
• updating weights and biases after the presentation of each sample.
Per Training Set Updating (Epoch or Iteration)
• weight and bias increments could be accumulated in variables and
the weights and biases updated after all the samples of the
training set have been presented.
STOPPING CONDITION
All changes in weights (∆wij) in the previous epoch are below some
threshold, or
The percentage of samples misclassified in the previous epoch is
below some threshold, or
A pre-specified number of epochs has expired.
In practice, several hundreds of thousands of epochs may be
required before the weights will converge.
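The three stopping conditions can be combined into a single check at the end of each epoch. The sketch below is illustrative only: the function name and the default thresholds are assumptions, not values given in these slides.

```python
def should_stop(weight_changes, misclassified_fraction, epoch,
                change_threshold=1e-4, error_threshold=0.01, max_epochs=1000):
    """Return True if any of the three stopping conditions holds."""
    # 1. All weight changes in the previous epoch are below a threshold.
    small_changes = all(abs(dw) < change_threshold for dw in weight_changes)
    # 2. The fraction of misclassified samples is below a threshold.
    few_errors = misclassified_fraction < error_threshold
    # 3. A pre-specified number of epochs has expired.
    out_of_epochs = epoch >= max_epochs
    return small_changes or few_errors or out_of_epochs
```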
NEURAL NETWORKS
A neural network learns by adjusting its weights so that it can
correctly classify the training data and hence, after the testing phase,
classify unknown data.
A neural network needs a long time for training.
A neural network has a high tolerance to noisy and incomplete
data.
BUILDING BLOCKS OF ARTIFICIAL NEURAL NET
Network Architecture (Connection between Neurons)
Setting the Weights (Training)
Activation Function
LAYER PROPERTIES
Input Layer: Each input unit may be designated by an attribute
value possessed by the instance.
Hidden Layer: Not directly observable, provides nonlinearities for
the network.
Output Layer: Encodes possible values.
TRAINING METHODS
Supervised Training - Providing the network with a series of
sample inputs and comparing the output with the expected
responses.
Unsupervised Training - Similar input vectors are assigned to
the same output unit, without any target output being given.
Reinforcement Training - The right answer is not provided, but an
indication of whether the output is ‘right’ or ‘wrong’ is provided.
ACTIVATION FUNCTION
ACTIVATION LEVEL – DISCRETE OR CONTINUOUS
HARD LIMIT FUNCTION (DISCRETE)
• Binary Activation function
• Bipolar activation function
• Identity function (linear; continuous rather than discrete)
SIGMOIDAL ACTIVATION FUNCTION (CONTINUOUS)
• Binary Sigmoidal activation function
• Bipolar Sigmoidal activation function
ACTIVATION FUNCTION
Activation functions:
(A) Identity
(B) Binary step
(C) Bipolar step
(D) Binary sigmoidal
(E) Bipolar sigmoidal
(F) Ramp
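The six functions (A) to (F) can be written down directly. The forms below are the commonly used textbook definitions; the threshold `theta`, the steepness `lam`, and the [0, 1] limits of the ramp are assumptions, since the slides name the functions without giving formulas for all of them.

```python
import math

def identity(x):
    return x

def binary_step(x, theta=0.0):
    """Output 1 if x reaches the threshold, else 0."""
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    """Output 1 if x reaches the threshold, else -1."""
    return 1 if x >= theta else -1

def binary_sigmoid(x, lam=1.0):
    """Smooth function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-lam * x))

def bipolar_sigmoid(x, lam=1.0):
    """Smooth function with outputs in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0

def ramp(x):
    """Linear between 0 and 1, clipped outside that range."""
    return max(0.0, min(1.0, x))
```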
CONSTRUCTING ANN
Determine the network properties:
• Network topology
• Types of connectivity
• Order of connections
• Weight range
Determine the node properties:
• Activation range
Determine the system dynamics:
• Weight initialization scheme
• Activation-calculating formula
• Learning rule
PROBLEM SOLVING
Select a suitable NN model based on the nature of the problem.
Construct a NN according to the characteristics of the application
domain.
Train the neural network with the learning procedure of the
selected model.
Use the trained network for making inference or solving problems.
SALIENT FEATURES OF ANN
Adaptive learning
Self-organization
Real-time operation
Fault tolerance via redundant information coding
Massive parallelism
Learning and generalizing ability
Distributed representation
McCULLOCH–PITTS NEURON
Neurons are sparsely and randomly connected
Firing state is binary (1 = firing, 0 = not firing)
All but one neuron are excitatory (tend to increase voltage of other
cells)
• One inhibitory neuron connects to all other neurons
• It functions to regulate network activity (prevent too many
firings)
LINEAR SEPARABILITY
Linear separability is the concept wherein the input space is
separated into regions according to whether the network response
is positive or negative.
Consider a network that has a positive response in the first
quadrant and a negative response in all other quadrants (the AND
function), with either binary or bipolar data. The decision
line is then drawn to separate the positive response region from
the negative response region.
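To make the decision line concrete, the sketch below checks one candidate line for the AND function with binary data and one with bipolar data. The weights and biases are assumptions chosen by hand for illustration; many other lines separate the same points.

```python
def response(x, w, b):
    """Network response w1*x1 + w2*x2 + b for a two-input unit."""
    return w[0] * x[0] + w[1] * x[1] + b

def separates(data, w, b):
    """True if the line response = 0 puts every positive target on the
    positive side and every other target on the non-positive side."""
    return all((response(x, w, b) > 0) == (t > 0) for x, t in data)

binary_and = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
bipolar_and = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]

binary_ok = separates(binary_and, (1.0, 1.0), -1.5)    # line x1 + x2 = 1.5
bipolar_ok = separates(bipolar_and, (1.0, 1.0), -1.0)  # line x1 + x2 = 1
```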
HEBB NETWORK
Donald Hebb stated in 1949 that, in the brain, learning is performed
by changes in the synaptic gap. Hebb explained it:
“When an axon of cell A is near enough to excite a cell B and repeatedly
or persistently takes part in firing it, some growth process or
metabolic change takes place in one or both cells such that A’s
efficiency, as one of the cells firing B, is increased.”
HEBB LEARNING
The weights between neurons whose activities are positively
correlated are increased:
$\frac{dw_{ij}}{dt} \sim \mathrm{correlation}(x_i, x_j)$
Associative memory is produced automatically
The Hebb rule can be used for pattern association, pattern
categorization, pattern classification and over a range of other
areas.
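A common discrete form of the Hebb rule, assumed here because the slides give only the continuous relation, is w_i(new) = w_i(old) + x_i·y and b(new) = b(old) + y. The sketch applies it once to each pattern of the AND function with bipolar inputs and targets; the function and variable names are illustrative.

```python
def hebb_train(samples):
    """One pass of the discrete Hebb rule, starting from zero weights."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for x, y in samples:
        for i in range(n):
            weights[i] += x[i] * y  # strengthen when input and output agree
        bias += y
    return weights, bias

# AND function with bipolar inputs and targets.
and_bipolar = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias = hebb_train(and_bipolar)  # weights == [2.0, 2.0], bias == -2.0
```

The resulting line 2·x1 + 2·x2 - 2 = 0 gives a positive net input only for the pattern (1, 1).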
DEFINITION OF SUPERVISED LEARNING NETWORKS
Training and test data sets
Training set: inputs and targets are specified
PERCEPTRON NETWORKS
Linear threshold unit (LTU)
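[Figure: linear threshold unit with inputs x1, …, xn, weights w1, …, wn and bias weight w0 feeding a summation node Σ that produces the output o]
$o = f(x) = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$, with $x_0 = 1$ so that $w_0$ acts as the bias.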
PERCEPTRON LEARNING
wi = wi + ∆wi
∆wi = η (t - o) xi
where
t = c(x) is the target value,
o is the perceptron output,
η is a small constant (e.g., 0.1) called the learning rate.
If the output is correct (t = o) the weights wi are not changed
If the output is incorrect (t ≠ o) the weights wi are changed such
that the output of the perceptron for the new weights is closer to t.
The algorithm converges to the correct classification if
• the training data is linearly separable, and
• η is sufficiently small
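The update rule ∆wi = η (t - o) xi can be tried on the AND function, which is linearly separable. This is a minimal sketch: starting from zero weights, the bipolar targets, the fixed input x0 = 1 for the bias weight, and the cap on epochs are all assumptions of the example.

```python
def predict(weights, x):
    """Linear threshold unit: 1 if the weighted sum is positive, else -1."""
    net = sum(w * xi for w, xi in zip(weights, (1.0,) + tuple(x)))  # x0 = 1
    return 1 if net > 0 else -1

def train_perceptron(samples, eta=0.1, max_epochs=100):
    weights = [0.0] * (len(samples[0][0]) + 1)  # w0 is the bias weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            o = predict(weights, x)
            if o != t:  # weights change only when the output is wrong
                errors += 1
                for i, xi in enumerate((1.0,) + tuple(x)):
                    weights[i] += eta * (t - o) * xi
        if errors == 0:  # a full epoch without mistakes: converged
            break
    return weights

and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights = train_perceptron(and_samples)
```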
LEARNING ALGORITHM
Epoch : Presentation of the entire training set to the neural
network.
In the case of the AND function, an epoch consists of four sets of
inputs being presented to the network (i.e. [0,0], [0,1], [1,0],
[1,1]).
Error: The error value is the amount by which the value output by
the network differs from the target value. For example, if we
required the network to output 0 and it outputs 1, then Error = T - O = -1.
Target Value, T : When we are training a network we not only
present it with the input but also with a value that we require the
network to produce. For example, if we present the network with
[1,1] for the AND function, the target value will be 1.
Output , O : The output value from the neuron.
Ij : Inputs being presented to the neuron.
Wj : Weight from input neuron (Ij) to the output neuron.
LR : The learning rate. This dictates how quickly the network
converges. It is set by experimentation and is typically
0.1.
TRAINING ALGORITHM
Adjust neural network weights to map inputs to outputs.
Use a set of sample patterns where the desired output (given the
inputs presented) is known.
The purpose is to learn to
• Recognize features which are common to good and bad
exemplars
MULTILAYER PERCEPTRON
[Figure: input signals enter the input layer, pass through adjustable weights, and reach the output layer, which produces the output values]
LAYERS IN NEURAL NETWORK
The input layer:
• Introduces input values into the network.
• No activation function or other processing.
The hidden layer(s):
• Performs classification of features.
• In theory, two hidden layers are sufficient to represent any
input-output mapping.
• Complex features imply that more layers may be better.
The output layer:
• Functionally is just like the hidden layers.
• Outputs are passed on to the world outside the neural
network.
Backpropagation is a training procedure which allows multilayer feed forward Neural
Networks to be trained.
Such networks can theoretically perform “any” input-output mapping.
They can learn to solve linearly inseparable problems.
MULTILAYER FEEDFORWARD NETWORK
[Figure: feedforward network with inputs I0 to I3, hidden units h0 to h2, and outputs o0 and o1]
MULTILAYER FEEDFORWARD NETWORK:
ACTIVATION AND TRAINING
For feed forward networks:
• A continuous activation function can be differentiated, allowing
gradient descent.
• Back propagation is an example of a gradient-descent technique.
• It uses a sigmoid (binary or bipolar) activation function.
In multilayer networks, the activation function is
usually more complex than just a threshold function,
such as the binary sigmoid 1/[1+exp(-x)] or the bipolar sigmoid
2/[1+exp(-x)] - 1, whose negative outputs allow for inhibition.
GRADIENT DESCENT
Gradient-Descent(training_examples, η)
Each training example is a pair of the form <(x1,…,xn), t>, where
(x1,…,xn) is the vector of input values, t is the target output
value, and η is the learning rate (e.g. 0.1).
• Initialize each wi to some small random value
• Until the termination condition is met, Do
    • Initialize each ∆wi to zero
    • For each <(x1,…,xn), t> in training_examples, Do
        • Input the instance (x1,…,xn) to the linear unit and compute
          the output o
        • For each linear unit weight wi, Do
            ∆wi = ∆wi + η (t-o) xi
    • For each linear unit weight wi, Do
        wi = wi + ∆wi
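The procedure above translates almost line for line into Python. In this sketch the termination condition is simply a fixed number of epochs, and the data set is a made-up linear target t = 0.5 + 2·x1 - x2 with x0 = 1 as the bias input; both are assumptions chosen so that the result is easy to check.

```python
import random

def gradient_descent(training_examples, eta=0.1, epochs=1000, seed=0):
    """Batch gradient descent for a single linear unit."""
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]  # small random weights
    for _ in range(epochs):                 # assumed termination condition
        delta = [0.0] * n                   # initialize each delta w_i to zero
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # linear unit output
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]      # accumulate increments
        for i in range(n):
            w[i] += delta[i]                # update once per epoch
    return w

# Hypothetical data: x = (x0, x1, x2) with x0 = 1, target 0.5 + 2*x1 - x2.
examples = [((1.0, x1, x2), 0.5 + 2.0 * x1 - x2)
            for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
w = gradient_descent(examples)
```

For this data the learned weights approach (0.5, 2, -1), the coefficients used to generate the targets.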
MODES OF GRADIENT DESCENT
Batch mode: gradient descent over the entire data set D
$w \leftarrow w - \eta \nabla E_D[w]$
$E_D[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
Incremental mode: gradient descent over individual training examples d
$w \leftarrow w - \eta \nabla E_d[w]$
$E_d[w] = \frac{1}{2} (t_d - o_d)^2$
Incremental Gradient Descent can approximate Batch Gradient
Descent arbitrarily closely if η is small enough.
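In incremental mode the weights move after every training example instead of once per pass over D. The sketch below shows this for a single linear unit; the fixed epoch count and the made-up data set (target 0.5 + 2·x1 - x2, with x0 = 1) are assumptions for illustration.

```python
import random

def incremental_gradient_descent(training_examples, eta=0.1, epochs=1000, seed=0):
    """Incremental (per-example) gradient descent for a single linear unit."""
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(epochs):
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                # Step along the negative gradient of E_d = 1/2 (t_d - o_d)^2.
                w[i] += eta * (t - o) * x[i]
    return w

# Hypothetical data: x = (x0, x1, x2) with x0 = 1.
examples = [((1.0, x1, x2), 0.5 + 2.0 * x1 - x2)
            for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
w_incremental = incremental_gradient_descent(examples)
```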
SIGMOID ACTIVATION FUNCTION
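[Figure: sigmoid unit with inputs x0 = 1, x1, …, xn and weights w0, w1, …, wn feeding a summation node Σ that produces the output o]
$net = \sum_{i=0}^{n} w_i x_i$
$o = \sigma(net) = \frac{1}{1 + e^{-net}}$
σ(x) is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
Derive gradient descent rules to train:
• One sigmoid unit:
$\frac{\partial E}{\partial w_i} = -\sum_{d} (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}$, where $x_{i,d}$ is input i of training example d
• Multilayer networks of sigmoid units:
backpropagation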
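The gradient formula for one sigmoid unit can be checked numerically against the error E = ½ Σd (td - od)². The helper names below are illustrative, and the check itself is only a sanity test under the squared-error definition used in these slides.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(w, x):
    """o = sigma(net), with net the weighted sum of the inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, examples):
    """E = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - unit_output(w, x)) ** 2 for x, t in examples)

def gradient(w, examples):
    """dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
    grad = [0.0] * len(w)
    for x, t in examples:
        o = unit_output(w, x)
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * o * (1.0 - o) * xi
    return grad
```

Comparing `gradient` with a finite-difference estimate of `error` is a quick way to catch sign or index mistakes before using the rule for training.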
BACKPROPAGATION TRAINING ALGORITHM
Initialize each weight wi,j to some small random value.
Until the termination condition is met, Do
• For each training example <(x1,…,xn), t>, Do
    • Input the instance (x1,…,xn) to the network and compute the
      network outputs ok
    • For each output unit k
        δk = ok (1-ok) (tk-ok)
    • For each hidden unit h
        δh = oh (1-oh) Σk wh,k δk
    • For each network weight wi,j, Do
        wi,j = wi,j + ∆wi,j, where
        ∆wi,j = η δj xi,j
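One way to sketch the algorithm for a network with a single hidden layer is shown below. The data layout (a dictionary with one weight row per unit, index 0 holding the bias weight), the function names, the initial weight range, and the XOR training loop at the end are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    def layer(n_units, n_inputs):
        # One weight row per unit; index 0 is the bias weight (input fixed at 1).
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_units)]
    return {"hidden": layer(n_hidden, n_in), "output": layer(n_out, n_hidden)}

def forward(net, x):
    xh = [1.0] + list(x)                       # inputs to the hidden layer
    h = [sigmoid(sum(w * v for w, v in zip(row, xh))) for row in net["hidden"]]
    xo = [1.0] + h                             # inputs to the output layer
    o = [sigmoid(sum(w * v for w, v in zip(row, xo))) for row in net["output"]]
    return xh, xo, o

def train_step(net, x, t, eta=0.1):
    xh, xo, o = forward(net, x)
    h = xo[1:]
    # delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit k
    delta_k = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit h
    delta_h = [oh * (1 - oh) * sum(net["output"][k][j + 1] * delta_k[k]
                                   for k in range(len(delta_k)))
               for j, oh in enumerate(h)]
    # w_{i,j} = w_{i,j} + eta * delta_j * x_{i,j}
    for k, row in enumerate(net["output"]):
        for i in range(len(row)):
            row[i] += eta * delta_k[k] * xo[i]
    for j, row in enumerate(net["hidden"]):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * xh[i]

def squared_error(net, x, t):
    _, _, o = forward(net, x)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

# Usage sketch: XOR, a linearly inseparable problem.
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
net = init_network(2, 2, 1)
for _ in range(5000):
    for x, t in xor:
        train_step(net, x, t, eta=0.5)
```

Both sets of δ values are computed before any weight is changed, so each call to `train_step` is one gradient step on the error of a single example. Whether the XOR loop reaches a good solution depends on the initial weights.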
BACKPROPAGATION
Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum. In practice it
often works well (can be invoked multiple times with different initial
weights)
Often includes a weight momentum term:
∆wi,j(t) = η δj xi,j + α ∆wi,j(t-1)
Minimizes error over the training examples
Will it generalize well to unseen instances (over-fitting)?
Training can be slow, typically 1000-10000 iterations (Levenberg-Marquardt
can be used instead of gradient descent)
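The momentum rule can be sketched for a single weight. The function name and the default values of η and α are assumptions; `gradient_term` stands for the product δj·xi,j from the backpropagation update.

```python
def momentum_update(weight, gradient_term, previous_delta, eta=0.1, alpha=0.9):
    """Apply delta(t) = eta * gradient_term + alpha * delta(t-1)."""
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta

# Two consecutive steps with the same gradient term: the second step is
# larger because part of the first step is carried over.
w1, d1 = momentum_update(0.0, 1.0, 0.0)
w2, d2 = momentum_update(w1, 1.0, d1)
```

With α = 0 the rule reduces to the plain update ∆wi,j = η δj xi,j.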
APPLICATIONS OF BACKPROPAGATION
NETWORK
Load forecasting problems in power systems.
Image processing.
Fault diagnosis and fault detection.
Gesture recognition, speech recognition.
Signature verification.
Bioinformatics.
Structural engineering design (civil).