NNFL Unit II For ECE & EEE
Learning Process
LAYERS IN ANN:
An ANN is made up of three kinds of layers, namely an input layer, an output layer, and one or more hidden layers. There must be connections from the nodes in the input layer to the nodes in the hidden layer, and from each hidden-layer node to the nodes of the output layer. The input layer receives the data presented to the network.
As shown in the figure, there are three layers: an input layer, hidden layers, and an output layer. Inputs are presented to the input layer, and each node produces an output value via an activation function. The outputs of the input layer are used as inputs to the next hidden layer.
B. Feedback Network: As the name suggests, a feedback network has feedback paths, which means the signal can flow in both directions using loops. This makes it a nonlinear dynamic system, which changes continuously until it reaches a state of equilibrium. It may be divided into the following types:
Recurrent networks: They are feedback networks with closed loops. Following are
the two types of recurrent networks.
Fully recurrent network: It is the simplest neural network architecture because all
nodes are connected to all other nodes and each node works as both input and output.
Jordan network: It is a closed-loop network in which the output is fed back to the input as feedback, as shown in the following diagram.
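To make the feedback idea concrete, here is a minimal NumPy sketch of one time step of a Jordan-style network, in which the previous output is fed back as context alongside the external input. All sizes, weights, and names are illustrative assumptions, not a prescribed architecture.

```python
import numpy as np

# Minimal sketch of a Jordan-network time step (illustrative sizes/weights):
# the previous output is fed back alongside the external input.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

W_ih = rng.normal(size=(n_hidden, n_in))    # input -> hidden
W_ch = rng.normal(size=(n_hidden, n_out))   # context (fed-back output) -> hidden
W_ho = rng.normal(size=(n_out, n_hidden))   # hidden -> output

def step(x, y_prev):
    """One forward pass: the hidden layer sees the input and the previous output."""
    h = np.tanh(W_ih @ x + W_ch @ y_prev)
    return np.tanh(W_ho @ h)

y = np.zeros(n_out)                         # context starts at zero
for x in rng.normal(size=(5, n_in)):        # a short input sequence
    y = step(x, y)                          # output loops back as context
print(y)
```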
Supervised Learning: As the name suggests, this type of learning is done under the
supervision of a teacher. This learning process is dependent. During the training of ANN
under supervised learning, the input vector is presented to the network, which will give an
output vector. This output vector is compared with the desired output vector. An error signal is generated if there is a difference between the actual output vector and the desired output vector. On the basis of this error signal, the weights are adjusted until the actual output matches the desired output.
Unsupervised Learning: As the name suggests, this type of learning is done without the
supervision of a teacher. This learning process is independent. During the training of ANN
under unsupervised learning, the input vectors of similar type are combined to form clusters.
When a new input pattern is applied, then the neural network gives an output response
indicating the class to which the input pattern belongs. There is no feedback from the environment as to what the desired output should be or whether it is correct or incorrect. Hence, in this type of learning, the network itself must discover the patterns and features in the input data, and the relationship between the input data and the output.
Reinforcement Learning: As the name suggests, this type of learning is used to reinforce or strengthen the network using critic information. This learning process is similar to supervised learning, but far less information is available. During the training of the network under reinforcement learning, the network receives some feedback from the environment. This makes it somewhat similar to supervised learning. However, the feedback obtained here is evaluative, not instructive, which means there is no teacher as in supervised learning. After receiving the feedback, the network adjusts its weights so as to obtain better critic information in the future.
3. Activation Functions: An activation function is a mathematical equation that determines the output of each element (perceptron or neuron) in the neural network. It takes the input to each neuron and transforms it into an output, usually in the range 0 to 1 or -1 to +1. It may be thought of as the extra force or effort applied over the input to obtain an exact output. The following are some activation functions of interest:
i) Linear Activation Function: It is also called the identity function as it performs no input
editing. It can be defined as: F(x)=x
ii) Sigmoid Activation Function: It is of two types, as follows:
Binary sigmoidal function: This activation function performs input editing between 0 and 1. It is positive in nature. It is always bounded, which means its output cannot be less than 0 or more than 1. It is also strictly increasing in nature, which means the higher the input, the higher the output. It can be defined as
F(x) = sigm(x) = 1 / (1 + exp(−x))
Bipolar sigmoidal function: This activation function performs input editing between -1 and 1. It can be positive or negative in nature. It is always bounded, which means its output cannot be less than -1 or more than 1. It is also strictly increasing in nature, like the binary sigmoid function. It can be defined as
F(x) = sigm(x) = 2 / (1 + exp(−x)) − 1 = (1 − exp(−x)) / (1 + exp(−x))
1. The sigmoid function has a smooth gradient and outputs values between zero and one. For very high or very low values of the input, the gradient becomes nearly flat and the network can be very slow to learn; this is called the vanishing gradient problem.
2. The TanH function is zero-centered, making it easier to model inputs that are strongly negative, strongly positive, or neutral.
3. The ReLU function is highly computationally efficient, but its output (and gradient) is zero for negative inputs, so it cannot learn from them.
4. The Leaky ReLU function has a small positive slope in its negative area, enabling it to process zero or negative values.
5. The Parametric ReLU function allows the negative slope to be learned, using backpropagation to find the most effective slope for zero and negative input values.
6. Softmax is a special activation function used for output neurons. It normalizes the outputs for each class to between 0 and 1, and returns the probability that the input belongs to a specific class.
7. Swish is a newer activation function proposed by Google researchers. It performs better than ReLU with a similar level of computational efficiency.
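To make the definitions above concrete, here is a minimal NumPy sketch of these activation functions. The function names are our own and the snippet is illustrative, not a library implementation.

```python
import numpy as np

# Sketches of the activation functions discussed above (vectorised NumPy).
def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def bipolar_sigmoid(x):
    return 2.0 / (1.0 + np.exp(-x)) - 1.0  # output in (-1, 1); equals tanh(x/2)

def relu(x):
    return np.maximum(0.0, x)              # zero for negative inputs

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope on the negative side

def swish(x):
    return x * binary_sigmoid(x)           # x * sigmoid(x)

def softmax(x):
    e = np.exp(x - np.max(x))              # shift for numerical stability
    return e / e.sum()                     # normalised class probabilities

x = np.array([-2.0, 0.0, 2.0])
print(binary_sigmoid(x), bipolar_sigmoid(x), softmax(x))
```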
Learning Neural Networks and Learning Rules
In this article we will discuss about:
1. Introduction to Learning Neural Networks
2. Learning Rules of Neurons in Neural Networks.
Introduction to Learning Neural Networks:
The property which is of primary significance for a neural network is the ability of the
network to learn from its environment, and to improve its performance through learning.
The improvement in performance takes place over time in accordance with some
prescribed measure.
A neural network learns about its environment through an inter-active process of
adjustments applied to its synaptic weights and bias levels. Ideally, the network becomes
more knowledgeable about its environment after each iteration of the learning process.
Many different activities are associated with the notion of learning. Moreover, the process of learning is a matter of viewpoint, which makes it all the more difficult to agree on a precise definition of the term. For example, learning as viewed by a psychologist is quite different from learning in a classroom sense. Recognising that our particular interest is in neural networks, we use a definition of learning which is adapted from Mendel and McClaren (1970).
We define learning in the context of neural networks as:
Learning is a process by which the free parameters of a neural network are adapted
through a process of stimulation by the environment in which the network is embedded.
The type of learning is determined by the manner in which the parameter changes take
place.
This definition of the learning process implies the following sequence of events:
1. The neural network is stimulated by an environment.
2. The neural network undergoes changes in its free parameters as a result of this
stimulation.
3. The neural network responds in a new way to the environment because of the changes
which have occurred in its internal structure.
A prescribed set of well-defined rules for the solution of a learning problem is called a
learning algorithm. There is no unique learning algorithm for the design of neural
networks. Rather, we have a kit of tools represented by a diverse variety of learning
algorithms, each of which offers advantages of its own. Basically, learning algorithms
differ from each other in the way in which the adjustment to a synaptic weight of a
neuron is formulated.
Another factor to be considered is the manner in which a neural network (learning
machine), made up of a set of interconnected neurons, reacts to its environment. In this
latter context we speak of a learning paradigm which refers to a model of the
environment in which the neural network operates.
The five learning rules basic to the design of neural networks are:
1. Error-correction learning,
2. Memory-based learning,
3. Hebbian learning,
4. Competitive learning, and
5. Boltzmann learning.
Some of these algorithms require the use of a teacher and some do not; they are called supervised and unsupervised learning, respectively.
In the study of supervised learning, a key provision is a ‘teacher’ capable of supplying exact corrections to the network outputs when an error occurs. Such a method is not possible in biological organisms, which have neither the exact reciprocal nervous connections needed for the back propagation of error corrections nor the nervous means for the imposition of behaviour from outside.
Nevertheless, supervised learning has established itself as a powerful paradigm for the design of artificial neural networks. In contrast, self-organised (unsupervised) learning is motivated by neurobiological considerations.
Learning Rules of Neurons in Neural Networks:
Five basic learning rules of neurons are:
1. Error-correction learning,
2. Memory-based learning,
3. Hebbian learning,
4. Competitive learning and
5. Boltzmann learning.
Error-correction learning is rooted in optimum filtering, while memory-based learning and competitive learning are both inspired by neurobiological considerations. Boltzmann learning is different and is based on ideas borrowed from statistical mechanics. Two learning paradigms, learning with a teacher and learning without a teacher, together with the credit-assignment problem so basic to the learning process, are also discussed.
1. Error-Correction Learning:
To illustrate our first learning rule, consider the simple case of a neuron k constituting the only computational node in the output layer of a feedforward neural network, as depicted in Fig. 11.21. Neuron k is driven by a signal vector x(n) produced by one or more layers of hidden neurons, which are themselves driven by an input vector (stimulus) applied to the source nodes (i.e., input layer) of the neural network.
The argument n denotes discrete time, or more precisely, the time step of an iterative process involved in adjusting the synaptic weights of neuron k. The output signal of neuron k is denoted by y_k(n). This output signal, representing the only output of the neural network, is compared to a desired response or target output, denoted by d_k(n). Consequently, an error signal, denoted by e_k(n), is produced. By definition, we thus have
e_k(n) = d_k(n) − y_k(n)
The error signal e_k(n) actuates a control mechanism, the purpose of which is to apply a sequence of corrective adjustments to the synaptic weights of neuron k. The corrective adjustments are designed to make the output signal y_k(n) come closer to the desired response d_k(n) in a step-by-step manner.
This objective is achieved by minimizing a cost function or index of performance ε(n), defined in terms of the error signal e_k(n) as:
ε(n) = (1/2) e_k²(n)
That is, ε(n) is the instantaneous value of the error energy. The step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state (i.e., the synaptic weights are essentially stabilized). At that point the learning process is terminated.
The learning process described herein is obviously referred to as error-correction learning. In particular, minimisation of the cost function ε(n) leads to a learning rule commonly referred to as the delta rule or Widrow-Hoff rule, named in honor of its originators. Let ω_kj(n) denote the value of synaptic weight ω_kj of neuron k excited by element x_j(n) of the signal vector x(n) at time step n. According to the delta rule, the adjustment Δω_kj(n) applied to the synaptic weight ω_kj at time step n is defined by
Δω_kj(n) = η e_k(n) x_j(n)
where η is a positive constant which determines the rate of learning as we proceed from one step in the learning process to another. It is therefore natural that we refer to η as the learning-rate parameter.
In other words, the delta rule may be stated as:
The adjustment made to a synaptic weight of a neuron is proportional to the product of
the error signal and the input signal of the synapse in question.
The delta rule, as stated herein, presumes that the error signal is directly measurable. For
this measurement to be feasible we clearly need a supply of desired response from some
external source, which is directly accessible to neuron k.
In other words, neuron k is visible to the outside world, as depicted in Fig. 11.21(a).
From this figure we also observe that error-correction learning is in fact local in nature.
This amounts to saying that the synaptic adjustments made by the delta rule are localised
around neuron k.
Having computed the synaptic adjustment Δω_kj(n), the updated value of synaptic weight ω_kj is given by equation 11.26:
ω_kj(n + 1) = ω_kj(n) + Δω_kj(n) …(11.26)
Effectively, ω_kj(n) and ω_kj(n + 1) may be viewed as the old and new values of synaptic weight ω_kj, respectively.
In computational terms we may also write:
ω_kj(n) = z⁻¹[ω_kj(n + 1)]
where z⁻¹ is the unit-delay operator. That is, z⁻¹ represents a storage element.
Fig. 11.21(b) shows a signal-flow graph representation of the error-correction learning process, with regard to neuron k. The input signal x_j and the induced local field v_k of neuron k are referred to as the presynaptic and postsynaptic signals of the j-th synapse of neuron k, respectively. The figure also shows that error-correction learning is an example of a closed-loop feedback system.
From control theory we know that the stability of such a system is determined by the parameters which constitute its feedback loops. In this case there is a single feedback loop, and the parameter of interest is η, the learning rate. To ensure the stability and convergence of the iterative learning process, η must therefore be selected judiciously.
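A minimal sketch of delta-rule training for a single linear output neuron may help fix ideas. The toy data, learning rate, and variable names below are illustrative assumptions, not part of the formal derivation.

```python
import numpy as np

# Minimal sketch of delta-rule (Widrow-Hoff) training of a single linear
# output neuron k; the data and learning rate eta are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # signal vectors x(n)
w_true = np.array([0.5, -1.0, 2.0])    # hypothetical target weights
d = X @ w_true                         # desired responses d_k(n)

w = np.zeros(3)                        # synaptic weights of neuron k
eta = 0.05                             # learning-rate parameter
for n in range(len(X)):
    y = w @ X[n]                       # actual output y_k(n)
    e = d[n] - y                       # error signal e_k(n) = d_k(n) - y_k(n)
    w += eta * e * X[n]                # delta rule: dw_kj = eta * e_k(n) * x_j(n)
print(w)                               # approaches w_true as training proceeds
```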
2. Memory-Based Learning:
In memory-based learning, all (or most) of the past experiences are explicitly stored in a large memory of correctly classified input-output examples {(x_i, d_i)}, i = 1, …, N, where x_i denotes an input vector and d_i denotes the corresponding desired response. Without loss of generality, we have restricted the desired response to be a scalar.
For example, in a binary pattern classification problem there are two classes of hypotheses, denoted by ε_1 and ε_2, to be considered. In this example, the desired response d_i takes the value 0 (or -1) for class ε_1 and the value 1 for class ε_2. When classification of a test vector x_test (not seen before) is required, the algorithm responds by retrieving and analysing the training data in a “local neighbourhood” of x_test.
All memory-based learning algorithms involve two essential ingredients:
a. Criterion used for defining the local neighbourhood of the test vector x_test.
b. Learning rule applied to the training examples in the local neighbourhood of x_test.
The algorithms differ from each other in the way in which these two ingredients are
defined.
In a simple yet effective type of memory-based learning known as the nearest neighbour rule, the local neighbourhood is defined as the training example which lies in the immediate neighbourhood of the test vector x_test. In particular, the vector x'_N ∈ {x_1, x_2, …, x_N} is said to be the nearest neighbour of x_test if
min_i d(x_i, x_test) = d(x'_N, x_test)
where d(x_i, x_test) is the Euclidean distance between the vectors x_i and x_test. The class associated with the minimum distance, that is, with vector x'_N, is reported as the classification of x_test. This rule is independent of the underlying distribution responsible for generating the training examples.
Cover and Hart (1967) have formally studied the nearest neighbour rule as a tool for
pattern classification.
The analysis is based on two assumptions:
a. The classified examples (x i, di) are independently and identically distributed (iid),
according to the joint probability distribution of the example (x, d).
b. The sample size N is infinitely large.
Under these two assumptions, it is shown that the probability of classification error incurred by the nearest neighbour rule is bounded above by twice the Bayes probability of error, that is, the minimum probability of error over all decision rules. In this sense, it may be said that half the classification information in a training set of infinite size is contained in the nearest neighbour, which is a surprising result.
A variant of the nearest neighbour classifier is the k-nearest neighbour classifier, which proceeds as follows:
a. Identify the k classified patterns which lie nearest to the test vector x_test, for some integer k.
b. Assign x_test to the class (hypothesis) which is most frequently represented in the k nearest neighbours of x_test (i.e., use a majority vote to make the classification).
Thus, the k-nearest neighbour classifier acts like an averaging device.
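The nearest-neighbour and k-nearest-neighbour rules described above can be sketched in a few lines of NumPy; the toy training set below is invented for illustration.

```python
import numpy as np

# Sketch of the k-nearest-neighbour rule described above:
# Euclidean distances d(x_i, x_test), then a majority vote.
def knn_classify(X_train, d_train, x_test, k=1):
    dists = np.linalg.norm(X_train - x_test, axis=1)  # d(x_i, x_test)
    nearest = np.argsort(dists)[:k]                   # k smallest distances
    labels, counts = np.unique(d_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
d_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, d_train, np.array([0.8, 0.9]), k=3))  # -> 1
```

With k = 1 this reduces to the plain nearest-neighbour rule; larger k gives the averaging behaviour noted above.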
3. Hebbian Learning:
Hebb’s postulate of learning is the oldest and the most famous of all learning rules; it is
named in honor of the neuropsychologist Hebb (1949).
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
Hebb proposed this change as a basis of associative learning (at the cellular level), which
would result in an enduring modification in the activity pattern of a spatially distributed
“assembly of nerve cells”.
This statement is made in a neurobiological context. We may expand and rephrase
it as a two-part rule:
a. If two neurons on either side of a synapse are activated simultaneously (i.e.,
synchronously), then the strength of that synapse is selectively increased.
b. If two neurons on either side of a synapse are activated asynchronously, then that
synapse is selectively weakened or eliminated.
Such a synapse is called Hebbian synapse. More precisely, we define a Hebbian synapse
as a synapse which uses a time-dependent, highly local, and strongly interactive
mechanism to increase synaptic efficiency as a function of the correlation between the
presynaptic and postsynaptic activities.
From this definition we may deduce the following four key properties which
characterise a Hebbian synapse:
i. Time-Dependent Mechanism:
This mechanism refers to the fact that the modifications in a Hebbian synapse depend on the exact time of occurrence of the presynaptic and postsynaptic signals.
ii. Local Mechanism:
By its very nature, a synapse is the transmission site where information-bearing signals (representing ongoing activity in the presynaptic and postsynaptic units) are in spatiotemporal contiguity. This locally available information is used by a Hebbian synapse to produce a local synaptic modification which is input-specific.
iii. Interactive Mechanism:
The occurrence of a change in a Hebbian synapse depends on signals on both sides of the
synapse. That is, a Hebbian form of learning depends on a “true interaction” between
presynaptic and postsynaptic signals in the sense that we cannot make a prediction from
either one of these two activities by itself.
iv. Conjunctional or Correlational Mechanism:
One interpretation of Hebb's postulate of learning is that the condition for a change in synaptic efficiency is the conjunction of presynaptic and postsynaptic signals. Thus, according to this interpretation, the co-occurrence of presynaptic and postsynaptic signals (within a short interval of time) is sufficient to produce the synaptic modification. It is for this reason that a Hebbian synapse is sometimes referred to as a conjunctional synapse or correlational synapse.
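A minimal sketch of the Hebbian update Δω_kj = η y_k x_j is shown below. Since the pure Hebbian rule grows the weights without bound, the sketch adds a simple normalisation step; that step and the correlated toy inputs are our own assumptions for illustration, not part of Hebb's postulate.

```python
import numpy as np

# Sketch of the Hebbian update dw_kj = eta * y_k * x_j: a synapse is
# strengthened when presynaptic and postsynaptic activities are correlated.
rng = np.random.default_rng(0)
eta = 0.01
w = 0.1 * rng.normal(size=2)           # small random initial weights
for _ in range(1000):
    x = rng.normal(size=2)
    x[1] = x[0] + 0.1 * rng.normal()   # the two inputs are strongly correlated
    y = w @ x                          # postsynaptic activity
    w += eta * y * x                   # Hebb: strengthen co-active synapses
    w /= max(np.linalg.norm(w), 1.0)   # normalisation keeps weights bounded
print(w)                               # aligns with the correlated direction
```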
4. Competitive Learning (Unsupervised Learning):
In competitive learning, as the name implies, the output neurons of a neural network
compete among themselves to become active. Whereas in a neural network based on
Hebbian learning several output neurons may be active simultaneously, in competitive
learning only a single output neuron is active at any one time. It is this feature which
makes competitive learning highly suited to discover statistically salient features which
may be used to classify a set of input patterns.
There are three basic elements to a competitive learning rule:
i. A set of neurons which are all the same except for some randomly distributed synaptic weights, and which therefore respond differently to a given set of input patterns.
ii. A limit imposed on the ‘strength’ of each neuron.
iii. A mechanism which permits the neurons to compete for the right to respond to a given subset of inputs, such that only one output neuron, or only one neuron per group, is active (i.e., ‘on’) at a time. The neuron which wins the competition is called a winner-takes-all neuron.
Accordingly, the individual neurons of the network learn to specialise on ensembles of
similar patterns; in so doing they become feature detectors for different classes of input
patterns.
In the simplest form of competitive learning, the neural network has a single layer of output neurons, each of which is fully connected to the input nodes. The network may include feedback connections among the neurons, as indicated in Fig. 11.22. In the network architecture described herein, the feedback connections perform lateral inhibition, with each neuron tending to inhibit the neuron to which it is laterally connected. In contrast, the feedforward synaptic connections in the network of Fig. 11.15 are all excitatory.
For a neuron k to be the winning neuron, its induced local field v_k for a specified input pattern x must be the largest among all the neurons in the network. The output signal y_k of the winning neuron k is set equal to one; the output signals of all the neurons which lose the competition are set equal to zero.
We thus write:
y_k = 1 if v_k > v_j for all j, j ≠ k; otherwise y_k = 0
where the induced local field v_k represents the combined action of all the forward and feedback inputs to neuron k.
Let ω_kj denote the synaptic weight connecting input node j to neuron k. Suppose that each neuron is allotted a fixed amount of synaptic weight (i.e., all synaptic weights are positive), which is distributed among its input nodes; that is, for all k
Σ_j ω_kj = 1
A neuron then learns by shifting synaptic weights from its inactive to active input nodes.
If a neuron does not respond to a particular input pattern, no learning takes place in that
neuron.
If a particular neuron wins the competition, each input node of that neuron relinquishes some proportion of its synaptic weight, and the weight relinquished is then distributed equally among the active input nodes. According to the standard competitive learning rule, the change Δω_kj applied to synaptic weight ω_kj is defined by
Δω_kj = η(x_j − ω_kj) if neuron k wins the competition
Δω_kj = 0 if neuron k loses the competition
where η is the learning-rate parameter. This rule has the overall effect of moving the synaptic weight vector ω_k of the winning neuron k towards the input pattern x.
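The standard competitive rule can be sketched as follows. The cluster data and learning rate are illustrative, and in this simplified version the fixed weight budget is only imposed at initialisation rather than maintained by the update.

```python
import numpy as np

# Sketch of the winner-takes-all competitive rule:
# dw_kj = eta * (x_j - w_kj) for the winner, 0 for the losers.
rng = np.random.default_rng(0)
eta = 0.1
W = rng.uniform(size=(3, 2))                  # 3 output neurons, 2 inputs
W /= W.sum(axis=1, keepdims=True)             # fixed weight budget per neuron

# Two well-separated input clusters (illustrative data).
X = np.vstack([rng.normal([0, 1], 0.05, (50, 2)),
               rng.normal([1, 0], 0.05, (50, 2))])
rng.shuffle(X)

for x in X:
    k = np.argmax(W @ x)                      # winner: largest local field v_k
    W[k] += eta * (x - W[k])                  # move only the winner towards x
print(W)                                      # winning rows settle near cluster centres
```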
5. Boltzmann Learning:
The Boltzmann learning rule, named in honor of Ludwig Boltzmann, is a stochastic learning algorithm derived from ideas rooted in statistical mechanics. A neural network designed on the basis of the Boltzmann learning rule is called a Boltzmann machine.
In a Boltzmann machine the neurons constitute a recurrent structure, and they operate in a binary manner; for example, they are either in an ‘on’ state denoted by +1 or in an ‘off’ state denoted by -1. The machine is characterised by an energy function E, the value of which is determined by the particular states occupied by the individual neurons of the machine, as shown by
E = −(1/2) Σ_j Σ_k ω_kj x_k x_j, j ≠ k
where x_j is the state of neuron j and ω_kj is the synaptic weight connecting neuron j to neuron k. The fact that j ≠ k means simply that none of the neurons in the machine has self-feedback. The machine operates by choosing a neuron at random, say neuron k, at some step of the learning process, and then flipping the state of neuron k from state x_k to state −x_k at some temperature T with probability
P(x_k → −x_k) = 1 / (1 + exp(−ΔE_k / T))
where ΔE_k is the energy change (i.e., the change in the energy function of the machine) resulting from such a flip. We may note that T is not a physical temperature, but rather a pseudo-temperature, as discussed under the stochastic model of a neuron. If this rule is applied repeatedly, the machine will reach thermal equilibrium.
The neurons of a Boltzmann machine are partitioned into two functional groups:
a. Visible and
b. Hidden.
The visible neurons provide an interface between the network and the environment in
which it operates, whereas the hidden neurons always operate freely.
There are two modes of operation to be considered:
I. Clamped condition, in which the visible neurons are all clamped onto specific states
determined by the environment.
II. Free-running condition, in which all the neurons (visible and hidden) are allowed to
operate freely.
Let ρ⁺_kj denote the correlation between the states of neurons j and k with the network in its clamped condition, and ρ⁻_kj the correlation between the states of neurons j and k with the network in its free-running condition. Both correlations are averaged over all possible states of the machine when it is in thermal equilibrium.
Then, according to the Boltzmann learning rule, the change Δω_kj applied to the synaptic weight ω_kj from neuron j to neuron k is defined by:
Δω_kj = η (ρ⁺_kj − ρ⁻_kj), j ≠ k …(11.35)
where η is the learning-rate parameter. Moreover, both ρ⁺_kj and ρ⁻_kj range in value from -1 to +1.
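Below is a minimal sketch of the stochastic state-flipping step of a Boltzmann machine at temperature T. The sign convention ΔE = E_after − E_before and the Glauber-style acceptance rule used here are assumptions made for this sketch, and the random symmetric weights are purely illustrative.

```python
import numpy as np

# Sketch of the stochastic state update in a Boltzmann machine:
# symmetric weights, no self-feedback, binary states in {-1, +1}.
rng = np.random.default_rng(0)
n, T = 8, 1.0
W = rng.normal(size=(n, n))
W = (W + W.T) / 2                              # symmetric weights
np.fill_diagonal(W, 0.0)                       # j != k: no self-feedback
x = rng.choice([-1.0, 1.0], size=n)            # binary neuron states

def energy(x):
    return -0.5 * x @ W @ x                    # E = -1/2 sum_j sum_k w_kj x_k x_j

for _ in range(1000):
    k = rng.integers(n)                        # choose a neuron at random
    dE = 2.0 * x[k] * (W[k] @ x)               # energy change if x_k flips sign
    if rng.random() < 1.0 / (1.0 + np.exp(dE / T)):
        x[k] = -x[k]                           # accept the flip
print(x, energy(x))                            # approaches thermal equilibrium
```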
For example, suppose a machine is trained under supervision with a basket of fruits whose labels depend on shape and colour:
If the shape of the object is rounded with a depression at the top, and it is red in colour, then it will be labelled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow colour, then it will be labelled as Banana.
Now suppose that, after training on this data, the machine is given a new fruit, say a banana, from the basket and is asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket containing fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k-nearest neighbours)
Decision Trees
Support Vector Machine
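As a quick illustration of supervised learning on labelled data, a classifier can be trained in a few lines using scikit-learn (assumed to be installed); the tiny fruit-like feature encoding below is invented for illustration.

```python
# Minimal supervised-classification sketch with scikit-learn (assumed
# installed); the [rounded?, long?] feature codes are invented toy data.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 0], [1, 0], [0, 1], [0, 1]]   # labelled training examples
y_train = ["apple", "apple", "banana", "banana"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)                    # learn from labelled examples
print(clf.predict([[0, 1]]))                 # -> ['banana']
```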
Advantages:
Supervised learning allows collecting data and produces data output from previous
experiences.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
It performs classification and regression tasks.
It allows estimating or mapping the result to a new sample.
We have complete control over choosing the number of classes we want in the training
data.
Disadvantages:
Classifying big data can be challenging.
Training for supervised learning needs a lot of computation time.
Supervised learning cannot handle all complex tasks in Machine Learning.
It requires a labelled data set.
It requires a training process.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore, the machine must find the hidden structure in unlabelled data by itself.
For instance, suppose the machine is given an image containing both dogs and cats, which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot categorize the image as ‘dogs and cats’. But it can categorize them according to their similarities, patterns, and differences; i.e., it can easily divide the picture into two parts. The first part may contain all the pictures having dogs in them, and the second part may contain all the pictures having cats in them. No learning has taken place beforehand, which means there are no training data or examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behaviour.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
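A minimal clustering sketch using scikit-learn's KMeans (assumed installed) illustrates the unsupervised grouping idea; the two toy clusters are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans   # scikit-learn assumed installed

# Minimal unsupervised-clustering sketch: k-means groups unlabelled
# points by similarity, with no desired outputs provided.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)           # near (0, 0) and (3, 3)
print(km.labels_[:5])                # cluster index assigned to each point
```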
Supervised vs. Unsupervised Machine Learning:
Input Data: Supervised learning algorithms are trained using labelled data; unsupervised learning algorithms are used against data that is not labelled (unlabelled data).
Computational Complexity: Supervised learning is the simpler method; unsupervised learning is computationally complex.
Training Data: Supervised learning uses training data to infer a model; unsupervised learning uses no training data.
Complex Model: With supervised learning it is not possible to learn larger and more complex models than with unsupervised learning; with unsupervised learning it is possible to learn larger and more complex models.