Unit-2 ML Notes
UNIT II
Artificial Neural Networks - Introduction
An Artificial Neural Network (ANN) is a learning algorithm that emerged and evolved from the idea of the biological neural networks of the human brain. The attempt to simulate the workings of the human brain culminated in the emergence of ANNs. An ANN works in a way very similar to a biological neural network, but does not exactly replicate its workings.
A basic ANN accepts only numeric, structured data as input. To handle unstructured and non-numeric data formats such as images, text, and speech, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are used respectively. In these notes, we concentrate only on Artificial Neural Networks.
1. The Input Layer — Represents the input variables plus the bias term. Hence if there
are n input variables, the size of the input layer is n + 1, where + 1 is the bias term
2. The Hidden Layer/ Layers — These signify neurons where all mathematical
calculations are done. Note a given neural network can have more than one neuron in
a hidden layer or multiple hidden layers as well
3. The Activation Function — Converts the output of a given layer before passing the information on to the next layer. Activation functions are mathematical equations that determine the output of a given neural network. It is part of each neuron in the hidden layers and determines the output relevant for prediction
4. The Output Layer — The final “output prediction” of the network
5. Forward Propagation — Calculating the output of each iteration from the input
layer to the output layer
6. Backward Propagation — Calculates revised weights (w1, w2, w3, and b1) after
each forward propagation by analyzing the derivative of the cost function used to
optimize the model output
7. Learning Rate — Determines the proportion of the computed change applied to each weight and bias term after every backward propagation, i.e. it controls the speed at which the model learns from the data (a minimal sketch of these steps follows this list)
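A minimal NumPy sketch (assumed, not from the notes) tying these terms together: a network with two inputs, one hidden layer of two sigmoid neurons, and one output neuron, showing one forward propagation, one backward propagation, and a learning-rate-scaled weight update. The names W1, b1, W2, b2, lr and all values are illustrative only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one training example with 2 input variables and a target output.
x = np.array([0.5, -1.2])
t = np.array([1.0])

# Randomly initialised weights: W1/b1 feed the hidden layer, W2/b2 feed the output layer.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
lr = 0.1  # learning rate

# Forward propagation: input layer -> hidden layer -> output layer.
h = sigmoid(W1 @ x + b1)
o = sigmoid(W2 @ h + b2)

# Backward propagation: error terms from the derivative of the squared-error cost.
delta_o = (o - t) * o * (1 - o)           # output-layer error term
delta_h = (W2.T @ delta_o) * h * (1 - h)  # hidden-layer error term

# Weight and bias updates, scaled by the learning rate.
W2 -= lr * np.outer(delta_o, h); b2 -= lr * delta_o
W1 -= lr * np.outer(delta_h, x); b1 -= lr * delta_h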
[Figure: two biological neurons, showing the dendrites, soma, axon, and the synapses connecting them]
1. Adaptive learning: An ANN has the ability to learn how to perform tasks based on the data given for training or from initial experience.
2. Self-organization: An ANN can create its own organization or representation of the information it receives during learning.
3. Real-time operation: ANN computations may be carried out in parallel. Special hardware devices are being designed and manufactured to take advantage of this capability of ANNs.
6. Animal behaviour, predator/prey relationships and population cycles may be suitable for
analysis by neural networks.
8. Betting on horse races, stock markets, sporting events, etc. could be based on neural
network predictions.
9. Criminal sentencing could be predicted using a large sample of crime details as input and
the resulting sentences as output.
10. Complex physical and chemical processes that may involve the interaction of numerous
(possibly unknown) mathematical formulas could be modelled heuristically using a neural
network.
11. Data mining, cleaning and validation could be achieved by determining which records
suspiciously diverge from the pattern of their peers.
12. Direct mail advertisers could use neural network analysis of their databases to decide which customers should be targeted, and avoid wasting money on unlikely targets.
13. Echo patterns from sonar, radar, seismic and magnetic instruments could be used to
predict their targets.
14. Econometric modelling based on neural networks should be more realistic than older
models based on classical statistics.
15. Employee hiring could be optimized if the neural networks were able to predict which
job applicant would show the best job performance.
16. Expert consultants could package their intuitive expertise into a neural network to automate their services.
17. Fraud detection regarding credit cards, insurance or taxes could be automated using a neural network analysis of past incidents.
18. Handwriting and typewriting could be recognized by imposing a grid over the writing,
and then each square of the grid becomes an input to the neural network. This is called
"Optical Character Recognition."
19. Lake water levels could be predicted based upon precipitation patterns and river/dam
flows.
3. The training examples may contain errors. ANN learning methods are quite robust to noise in the training data.
Perceptron
A perceptron is a single-layer neural network; a multi-layer perceptron is called a neural network. The perceptron is a linear (binary) classifier used in supervised learning. It helps to classify the given input data.
The Perceptron receives multiple input signals, and if the sum of the input signals exceeds a
certain threshold, it either outputs a signal or does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
Perceptron Function
A perceptron is a function that maps its input "x", which is multiplied by the learned weight coefficients, to an output value "f(x)".
A Boolean output is produced from inputs such as salaried, married, age, past credit profile, etc. It has only two values: Yes and No, or True and False. The summation function "∑" multiplies each input "x" by its weight "w" and adds them up; the output is 1 (Yes) if this weighted sum plus the bias exceeds the threshold of zero, and 0 (No) otherwise:
f(x) = 1 if ∑i wi xi + b > 0, else 0
Step function gets triggered above a certain value of the neuron output; else it outputs
zero.
Sign Function outputs +1 or -1 depending on whether neuron output is greater than
zero or not.
Sigmoid is the S-curve and outputs a value between 0 and 1.
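A minimal Python sketch of this perceptron function; the weights, bias, and loan-style inputs below are assumed for illustration and are not taken from the notes.

import numpy as np

def perceptron(x, w, b):
    # Summation: multiply each input by its weight and add the bias.
    z = np.dot(w, x) + b
    # Step activation: output 1 ("Yes") if the sum exceeds the threshold of 0, else 0 ("No").
    return 1 if z > 0 else 0

# Example: hypothetical inputs (salaried, married, good past credit profile).
x = np.array([1, 0, 1])
w = np.array([0.4, 0.2, 0.7])   # learned weight coefficients (made-up values)
b = -0.5                        # bias term
print(perceptron(x, w, b))      # -> 1, i.e. "Yes"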
Problems
1. For the network shown in Figure 1, calculate the net input to the output neuron.
Solution:
The given neural net consists of three input neurons and one output neuron
2. Calculate the net input for the network shown in Figure 2 with bias
included in the network
Solution:
3. Obtain the output of the neuron Y for the network shown in Figure 3
using activation functions as: (i) binary sigmoidal and (ii) bipolar
sigmoidal.
Solution:
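The figures for Problems 1-3 are not reproduced here, so the actual weights and inputs cannot be shown. As an illustration only, the sketch below uses assumed inputs, weights, and bias to show how the net input and the binary and bipolar sigmoidal activations would be computed.

import math

x = [0.8, 0.6, 0.4]     # assumed inputs to the three input neurons
w = [0.1, 0.3, -0.2]    # assumed weights on the connections to the output neuron Y
b = 0.35                # assumed bias term

# Net input: bias plus the weighted sum of the inputs.
y_in = b + sum(xi * wi for xi, wi in zip(x, w))

# (i) binary sigmoidal and (ii) bipolar sigmoidal activation functions.
y_binary = 1 / (1 + math.exp(-y_in))
y_bipolar = 2 / (1 + math.exp(-y_in)) - 1
print(y_in, y_binary, y_bipolar)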
Multilayer networks and the Back-propagation algorithm.
A basic multilayer network has three layers, including one hidden layer. If it has more than one hidden layer, it is called a deep ANN.
The weight adjustment during training is done via back-propagation. Deeper neural networks are better at processing data; however, deeper layers can lead to the vanishing gradient problem, and special algorithms are required to address this issue.
Input or Visible Layer
The bottom layer that takes input from your dataset is called the visible layer, because it is the exposed part of the network. Often a neural network is drawn with a visible layer containing one neuron per input value or column in your dataset. These are not neurons as described above, but simply pass the input value through to the next layer.
Hidden Layers
Layers after the input layer are called hidden layers because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value.
Given increases in computing power and efficient libraries, very deep neural networks can be
constructed. Deep learning can refer to having many hidden layers in your neural network.
They are deep because they would have been unimaginably slow to train historically, but may
take seconds or minutes to train using modern techniques and hardware.
Output Layer
The final layer is called the output layer and it is responsible for outputting a value or vector of values in the format required for the problem.
The choice of activation function in the output layer is strongly constrained by the type of problem that you are modeling. For example (a short sketch follows this list):
A regression problem may have a single output neuron and the neuron may have no
activation function.
A binary classification problem may have a single output neuron and use a sigmoid activation function to output a value between 0 and 1, representing the probability of predicting class 1. This can be turned into a crisp class value by using a threshold of 0.5, snapping values below the threshold to 0 and values at or above it to 1.
A multi-class classification problem may have multiple neurons in the output layer, one for each class. In this case a softmax activation function may be used to output a probability of the network predicting each of the class values. Selecting the output with the highest probability can then be used to produce a crisp class value.
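A short sketch of these output-layer choices; the pre-activation values below are made up, and sigmoid and softmax are written out in their standard forms.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Binary classification: one output neuron with a sigmoid activation.
p = sigmoid(0.7)                 # probability of class 1
crisp = 1 if p >= 0.5 else 0     # threshold at 0.5 to get a crisp class value

# Multi-class classification: one output neuron per class with a softmax activation.
z = np.array([1.2, 0.3, -0.5])            # made-up pre-activations for 3 classes
probs = softmax(z)                        # probability for each class
predicted_class = int(np.argmax(probs))   # class with the highest probability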
Forward and backward passes
The forward and backward phases are repeated for a number of epochs. In each epoch, the following occurs:
1. The inputs are propagated from the input to the output layer.
2. The network error is calculated.
3. The error is propagated from the output layer to the input layer.
Back-propagation Algorithm
1. Inputs X arrive through the preconnected paths.
Input the instance (x1, …, xn) to the network and compute the network outputs ok.
2. The input is modelled using real weights W. The weights are usually selected randomly.
For each output unit k: δk = ok(1 − ok)(tk − ok)
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
For each hidden unit h: δh = oh(1 − oh) Σk wh,k δk
4. Calculate the error in the outputs.
Travel back from the output layer to the hidden layers to adjust the weights so that the error decreases (each weight is updated as wj,i ← wj,i + η δj xj,i, where η is the learning rate).
5. Keep repeating the process until the desired output is achieved.
1. The first step is to calculate the error for each output neuron; this gives us the error signal to propagate backwards through the network.
2. The error for a given neuron can be calculated as follows:
3. error = (expected - output) * transfer_derivative(output)
4. Where expected is the expected output value for the neuron, output is the output value for the neuron, and transfer_derivative() calculates the slope of the neuron's output value (the derivative of its activation function).
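A small sketch of this error calculation, assuming a sigmoid transfer function; the helper names transfer_derivative and output_layer_errors are illustrative rather than taken from the notes.

def transfer_derivative(output):
    # Slope of the sigmoid, expressed in terms of its output value.
    return output * (1.0 - output)

def output_layer_errors(expected, outputs):
    # error = (expected - output) * transfer_derivative(output) for each output neuron.
    return [(e - o) * transfer_derivative(o) for e, o in zip(expected, outputs)]

print(output_layer_errors([0.0, 1.0], [0.2, 0.7]))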
Step 5: Predict
1. We have already seen how to forward-propagate an input pattern to get an output.
2. We can use the output values themselves directly as the probability of a pattern
belonging to each output class.
3. It may be more useful to turn this output back into a crisp class prediction.
4. We can do this by selecting the class value with the larger probability. This is also
called the arg max function.
5. Below is a function named predict() that implements this procedure.
6. It returns the index in the network output that has the largest probability.
7. It assumes that class values have been converted to integers starting at 0.
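The predict() code itself does not appear in these notes. A minimal sketch consistent with the description, assuming a forward_propagate(network, row) helper that returns the list of output-layer activations, might look like this:

def predict(network, row):
    # Forward-propagate the input pattern, then return the index of the output
    # neuron with the largest value (the arg max), used as the integer class label.
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))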
Remarks on the Back-Propagation algorithm
1. Boolean functions:
Every Boolean function can be represented by a network with two layers of units, although the number of hidden units required can grow exponentially with the number of inputs.
2. Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units.
3. Arbitrary functions:
Any function can be approximated to arbitrary accuracy by a network with three layers of units.
4. Hypothesis space search
Every possible assignment of network weights represents a syntactically distinct hypothesis.
This hypothesis space is continuous, in contrast to that of decision tree learning.
5. Inductive bias
One can roughly characterize it as smooth interpolation between data points.
One interesting property of back-propagation is its ability to discover useful intermediate representations at the hidden unit layers inside the network.
Because training examples constrain only the network inputs and outputs, the weight-tuning procedure is free to set weights that define whatever hidden unit representation is most effective at minimizing the squared error E.
An illustrative example: face recognition model
The learning task here involves classifying camera images of the faces of various people in various poses. Images of 20 different people were collected, with approximately 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction in which they were looking (left, right, straight ahead, up), and whether or not they were wearing sunglasses.
As can be seen from the example images, there is also variation in the background
behind the person, the clothing worn by the person, and the position of the person's
face within the image.
In total, 624 greyscale images were collected, each with a resolution of 120 x 128,
with each image pixel described by a greyscale intensity value between 0 (black) and
255 (white).
Output encoding:
The ANN must output one of four values indicating the direction in which the
person is looking (left, right, up, or straight).
Note we could encode this four-way classification using a single output unit,
assigning outputs of, say, 0.2, 0.4, 0.6, and 0.8 to encode these four possible
values. Instead, we use four distinct output units, each representing one of the four
possible face directions, with the highest-valued output taken as the network
prediction.
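A small sketch of this 1-of-4 output encoding and the corresponding arg-max decoding; the function names are illustrative.

import numpy as np

directions = ["left", "right", "up", "straight"]

def encode_target(direction):
    # One output unit per class; the true class gets the high target value.
    target = np.zeros(len(directions))
    target[directions.index(direction)] = 1.0
    return target

def decode_prediction(outputs):
    # The highest-valued output unit gives the predicted face direction.
    return directions[int(np.argmax(outputs))]

print(encode_target("right"))                   # [0. 1. 0. 0.]
print(decode_prediction([0.1, 0.2, 0.7, 0.3]))  # "up"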
Network graph structure
1. Another design choice we face is how many units to include in the network and how to interconnect them.
2. The most common network structure is a layered network with feed forward
connections from every unit in one layer to every unit in the next.
3. In the current design we chose this standard structure, using two layers of sigmoid
units (one hidden layer and one output layer)
4. It is common to use one or two layers of sigmoid units and, occasionally, three
layers.
5. It is not common to use more layers than this, because training times become very long and because networks with three layers of sigmoid units can already represent a very rich class of target functions.
6. Given our choice of a layered feed-forward network with one hidden layer, how many hidden units should we include?
7. In the results reported in Figure, only three hidden units were used, yielding a test
set accuracy of 90%.
8. In other experiments 30 hidden units were used, yielding a test set accuracy one
to two percent higher.
9. Although the generalization accuracy varied only a small amount between these
two experiments, the second experiment required significantly more training time.
10. Using 260 training images, the training time was approximately 1 hour on a Sun
Sparc5 workstation for the 30 hidden unit network, compared to approximately
5 minutes for the 3 hidden unit network
Adding a penalty term for weight magnitude. We can add to E a term that increases with the magnitude of the weight vector. This causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting. One way to do this is to redefine E as
E(w) = 1/2 Σd Σk (tkd − okd)² + γ Σj,i (wji)²
which yields a weight update rule identical to the standard Back-propagation rule, except that each weight is multiplied by the constant (1 − 2γη) upon each iteration.
Adding a term for errors in the slope, or derivative, of the target function. In some cases, training information may be available regarding the desired derivatives of the target function as well as its desired values.
In both of these systems the error function is modified to add a term measuring the
discrepancy between these training derivatives and the actual derivatives of the learned
network.
Minimizing the cross entropy of the network with respect to the target values. Consider
learning a probabilistic function, such as predicting whether a loan applicant will pay back a
loan based on attributes such as the applicant's age and bank balance. Although the training
examples exhibit only boolean target
values (either a 1 or 0, depending on whether this applicant paid back the loan), the
underlying target function might be best modeled by outputting the probability that the given
applicant will repay the loan, rather than attempting to output the actual 1 and 0 value for
each input instance
Given such situations in which we wish the network to output probability estimates, it can be shown that the best (i.e., maximum likelihood) probability estimates are given by the network that minimizes the cross entropy, defined as
−Σd [ td log od + (1 − td) log(1 − od) ]
where od is the probability estimate output by the network for training example d, and td is the 1 or 0 target value for d.
Recurrent Networks
Recurrent networks are artificial neural networks that apply to time series data and that use
outputs of network units at time t as the input to other units at time t + 1. In this way, they
support a form of directed cycles in the network. To illustrate, consider the time series
prediction task of predicting the next day's stock market average y(t + 1 ) based on the current
day's economic indicators x(t). Given a time series of such data, one obvious approach is to
train a feedforward network to predict y(t + 1 ) as its output, based on the input values x(t)
One limitation of such a network is that the prediction of y(t + 1) depends only on x(t) and cannot capture possible dependencies of y(t + 1) on earlier values of x. This might be necessary, for example, if tomorrow's stock market average y(t + 1) depends on the difference between today's economic indicator values x(t) and yesterday's values x(t − 1).
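A minimal sketch, with made-up weights and data, of the recurrent idea described above: the hidden state computed at time t is fed back as an extra input at time t + 1, so the prediction can depend on earlier values of x rather than on x(t) alone.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 2))   # input -> hidden weights (2 indicators, 3 hidden units)
W_h = rng.normal(size=(3, 3))   # hidden(t-1) -> hidden(t) recurrent weights
W_y = rng.normal(size=(1, 3))   # hidden -> output weights

x_series = rng.normal(size=(5, 2))   # x(t): five days of two economic indicators
h = np.zeros(3)                      # initial hidden state

for x_t in x_series:
    h = sigmoid(W_x @ x_t + W_h @ h)   # hidden state carries information forward in time
    y_next = W_y @ h                   # prediction of the next day's average y(t + 1)
print(y_next)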