Unit IV Machine Learning Notes
Biological Motivation
The study of artificial neural networks (ANNs) has been inspired by the
observation that biological learning systems are built of very complex webs of
interconnected neurons.
The human information processing system consists of the brain, whose basic building
block is the neuron: a cell that communicates information to and from various parts of the body.
Facts of Human Neurobiology
As an example of what such networks can learn, consider a network trained to steer an
autonomous vehicle using images from a forward-facing camera. The network is shown on the
left side of the figure, with the input camera image depicted below it.
Each node (i.e., circle) in the network diagram corresponds to the output of a
single network unit, and the lines entering the node from below are its inputs.
There are four units that receive inputs directly from all of the 30 x 32 pixels in
the image. These are called "hidden" units because their output is available only
within the network and is not available as part of the global network output. Each
of these four hidden units computes a single real-valued output based on a
weighted combination of its 960 inputs.
These hidden unit outputs are then used as inputs to a second layer of 30 "output" units.
Each output unit corresponds to a particular steering direction, and the output
values of these units determine which steering direction is recommended most
strongly.
• Input Layer
Information from the outside world enters the artificial neural network from the input
layer. Input nodes process the data, analyze or categorize it, and pass it on to the next
layer.
• Hidden Layer
Hidden layers take their input from the input layer or other hidden layers. Artificial
neural networks can have a large number of hidden layers. Each hidden layer analyzes
the output from the previous layer, processes it further, and passes it on to the next
layer.
• Output Layer
The output layer gives the final result of all the data processing by the artificial neural
network. It can have single or multiple nodes. For instance, if we have a binary
(yes/no) classification problem, the output layer will have one output node, which will
give the result as 1 or 0. However, if we have a multi-class classification problem, the
output layer might consist of more than one output node.
An artificial neuron is a mathematical function conceived as a simple model of a
real (biological) neuron.
The McCulloch-Pitts Neuron
This is a simplified model of real neurons, known as a Threshold Logic
Unit.
A set of input connections brings in activations from other neurons.
A processing unit sums the inputs, and then applies a non-linear
activation function (i.e. a squashing/transfer/threshold function).
An output line transmits the result to other neurons.
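To make this concrete, here is a minimal Python sketch of such a threshold logic unit (the weights, threshold, and example inputs are illustrative, not taken from the notes):

import numpy as np

def threshold_unit(inputs, weights, threshold):
    # Processing unit: sum the weighted input activations
    total = np.dot(inputs, weights)
    # Threshold (step) activation: fire only if the sum reaches the threshold
    return 1 if total >= threshold else 0

# Example: a two-input unit with both weights 1 and threshold 2 computes logical AND
print(threshold_unit([1, 1], [1, 1], threshold=2))  # -> 1
print(threshold_unit([1, 0], [1, 1], threshold=2))  # -> 0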
Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds, and a single
activation function. An artificial neural network (ANN) model based on
biological neural systems is shown in the figure.
Training: It is the process in which the network is taught to change its weights
and biases.
Learning: It is the internal process of training where the artificial neural system learns
to update/adapt the weights and biases.
The different training/learning procedures available in ANN are:
• Supervised learning
• Unsupervised learning
• Reinforced learning
• Hebbian learning
• Gradient descent learning
• Competitive learning
• Stochastic learning
Desirable properties of the learning process:
• Learning or training time should be short, so that the information in the training pairs is captured quickly
• Learning should use only local information
• The learning process should be able to capture the complex non-linear mapping between the input and output pairs
• Learning should be able to capture as many patterns as possible
Supervised learning:
Every input pattern that is used to train the network is associated with an output pattern,
which is the target or the desired pattern.
A teacher is assumed to be present during the training process, when a comparison is
made between the network's computed output and the correct expected output to
determine the error. The error can then be used to change network parameters, which
results in an improvement in performance.
Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if
there is no teacher to present the desired patterns, and hence the system learns on its
own by discovering and adapting to structural features in the input patterns.
Reinforced learning:
In this method a teacher, though available, does not present the expected answer, but
only indicates whether the computed output is correct or incorrect. This information
helps the network in the learning process.
PERCEPTRON Model
Figure: A perceptron
Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND
(¬AND), and NOR (¬OR).
Some Boolean functions cannot be represented by a single perceptron, such as
the XOR function, whose value is 1 if and only if x1 ≠ x2.
As a worked example, consider training a single perceptron to learn the AND function
(which, unlike XOR, is linearly separable). Its truth table is:
X1  X2  Y
0   0   0
0   1   0
1   0   0
1   1   1
We are going to set the weights randomly. Let's say that w1 = 0.9 and w2 = 0.9. The
threshold is 0.5 and the learning rate is 0.5 (the learning rate is usually set between 0 and 1).
Round 1
We apply the 1st instance to the perceptron: x1 = 0 and x2 = 0.
Σ = x1 * w1 + x2 * w2 = 0 * 0.9 + 0 * 0.9 = 0
The activation unit checks whether the sum is greater than the threshold. If this rule is
satisfied, the unit fires and returns 1; otherwise it returns 0. (Modern neural network
architectures no longer use this kind of step function as the activation.)
The sum unit was 0 for the 1st instance, so the activation unit returns 0 because 0 is less
than 0.5. The target output is 0 as well, so there is no error and we do not update the weights.
We apply the 2nd instance: x1 = 0 and x2 = 1.
Σ = 0 * 0.9 + 1 * 0.9 = 0.9
This exceeds the threshold 0.5, so the unit fires and returns 1, but the target output is 0:
ε = actual - prediction = 0 - 1 = -1
We add the error times the learning rate to the weights:
w1 = 0.9 + 0.5 * (-1) = 0.4
w2 = 0.9 + 0.5 * (-1) = 0.4
Re-applying the 2nd instance, the activation unit now returns 0 because the output of the
sum unit is 0 * 0.4 + 1 * 0.4 = 0.4, which is less than 0.5. This matches the target, so we
do not update the weights.
The 3rd instance (x1 = 1, x2 = 0) gives Σ = 0.4, so the unit returns 0, matching the target 0.
The 4th instance (x1 = 1, x2 = 1) gives Σ = 0.8, so the unit fires and returns 1, matching the
target 1. No further updates are needed.
Round 2
In the previous round we used the old weight values for the 1st instance, so let's apply
feed forward with the new weight values.
For the 1st instance, the activation unit returns 0 because the sum unit gives 0, which is
less than the threshold 0.5. The target output is 0 as well, so the instance is classified
correctly and we do not update the weights.
For the 2nd instance, the activation unit returns 0 because the sum unit gives 0.4, which is
less than the threshold 0.5. Its target is 0 as well, so it is classified correctly and we do
not update the weights.
We already applied the feed forward calculation for the 3rd and 4th instances with the
current weight values in the previous round; they were classified correctly. Training is
therefore complete.
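The same walkthrough can be written as a short Python loop. This is a sketch under the assumptions used above (step activation with threshold 0.5, learning rate 0.5, and the simplified rule of adding ε times the learning rate to every weight); unlike the walkthrough it does not re-check an instance immediately after an update, but it converges to the same weights here:

# AND truth table: ((x1, x2), target)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.9, 0.9]                  # randomly chosen initial weights
alpha, threshold = 0.5, 0.5     # learning rate and activation threshold

for _ in range(2):              # two passes over the data, as in the rounds above
    for (x1, x2), target in data:
        s = x1 * w[0] + x2 * w[1]                # sum unit
        prediction = 1 if s > threshold else 0   # step activation unit
        error = target - prediction
        if error != 0:                           # update only on misclassification
            w = [wi + alpha * error for wi in w]

print(w)  # -> [0.4, 0.4], the weights found in the walkthrough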
The Multilayer Perceptron falls under the category of feedforward algorithms, because inputs
are combined with the initial weights in a weighted sum and subjected to the activation
function, just like in the Perceptron. The difference is that each linear combination is
propagated to the next layer.
Each layer is feeding the next one with the result of their computation, their internal
representation of the data. This goes all the way through the hidden layers to the output layer.
But there is more to it than that.
If the algorithm only computed the weighted sums in each neuron, propagated results to the
output layer, and stopped there, it wouldn’t be able to learn the weights that minimize the cost
function. If the algorithm only computed one iteration, there would be no actual learning.
Structure of MLPs
• In a typical MLP network, the input units (Xi) are fully connected to all hidden
layer units (Yj) and the hidden layer units are fully connected to all output layer
units (Zk). Each of the connections between the input to hidden and hidden to
output layer units has an associated weight attached to it (Wij or Wjk)
• The hidden and output layer units also derive their bias values (bj or bk) from
weighted connections to units whose outputs are always 1 (true neurons)
The Multilayer Perceptron was developed to tackle this limitation. It is a neural network where
the mapping between inputs and output is non-linear. A Multilayer Perceptron has input and
output layers, and one or more hidden layers with many neurons stacked together. And while
in the Perceptron the neuron must use an activation function that imposes a hard threshold,
neurons in a Multilayer Perceptron can use any arbitrary activation function, such as
ReLU or sigmoid.
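As an illustration of this point, here is a small sketch using scikit-learn's MLPClassifier (the layer size, activation, and solver are illustrative choices, not from the notes) showing that one hidden layer is enough to learn XOR, which a single perceptron cannot represent:

from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR: not linearly separable

# A single hidden layer provides the non-linear mapping a perceptron lacks
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=1)
clf.fit(X, y)
print(clf.predict(X))  # expected [0 1 1 0] (may vary with initialization)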
Multilayer Perceptron
The structure of an MLP can be broken down into three main parts: the input layer, the hidden
layers, and the output layer.
The input layer is like the teacher giving out the math problem to the students. It
receives the input data, in this case, the equation 5 x 3 + 2 x 4 + 8 x 2, and passes it on
to the next layer.
The hidden layers are like the students working together to solve the problem. Each
hidden layer contains a set of interconnected neurons, which process and analyze the
input data passed on from the previous layer. In this example, the hidden layer can have
three neurons, each one solving a specific part of the equation "5 x 3", "2 x 4" and "8 x
2".
The output layer is like the student who is putting together the final solution. It
receives the output from the previous layers, combines them, and produces the final
output which is the solution to the problem. In this example, the output neuron can be
calculated as "15 + 8" and "23 + 16" to get the final result of 39.
The structure of MLP is shown here:
Fig 3 (MLP structure)
Note that the number of neurons in the input layer must match the size of the training
instances, and the number in the output layer must match the number of output labels.
However, there can be any number of neurons or layers in the hidden part of the network,
according to the needs; broadly, the more neurons in the hidden layers, the more complex
the problems the network can solve.
Ok, let's start with an example. Imagine a group of 7-year-old students who are working on a
math problem, and imagine that each of them can only do arithmetic with two numbers. If you
give them an equation like 5 x 3 + 2 x 4 + 8 x 2, how can they solve it?
To solve this problem, we can break it down into smaller parts and give them to each of the
students. One student can solve the first part of the equation "5 x 3 = 15" and another student
can solve the second part of the equation "2 x 4 = 8". The third student can solve the third part
"8 x 2 = 16".
Finally, we can simplify it to 15 + 8 + 16. In the same way, one of the students in the group
can solve "15 + 8 = 23" and another one can solve "23 + 16 = 39", and that's the answer. So
here we are breaking down the large math problem into different sections and giving them to
each of the students, who are just doing really simple calculations; but as a result of the
teamwork, they can solve the problem efficiently.
Multi-layer perceptrons have been used in a wide variety of applications. Some of the most
common applications of MLPs include:
Image recognition: MLPs can be trained to recognize patterns in images and classify
them into different categories. This is useful in applications such as facial recognition,
object detection, and image segmentation.
Natural Language Processing (NLP): MLPs can be used to understand and generate
human language. This is useful in applications such as text-to-speech, machine
translation, and sentiment analysis.
Predictive modeling: MLPs can be used to make predictions based on past data. This is
useful in applications such as stock market prediction, weather forecasting, and fraud
detection.
Medical diagnosis: MLPs can be used to diagnose diseases or interpret medical images by
recognizing patterns in the data.
Backpropagation Learning Algorithm
The weights are trained to minimize the squared error between the network outputs and the
target values, summed over all output units and all training examples:
E(w) = 1/2 Σd Σk∈outputs (tkd - okd)²
where,
outputs - the set of output units in the network
tkd and okd - the target and output values associated with the kth output unit
d - a training example
Feed Forward phase:
• Xi = input[i]
• Yj = f( bj + Σi Xi Wij )
• Zk = f( bk + Σj Yj Wjk )
Backpropagation of errors:
• δk = Zk (1 - Zk) (dk - Zk)
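A compact NumPy sketch of one training step following these formulas. The sigmoid activation f and the hidden-layer delta rule are standard assumptions that the notes do not spell out; x, d, and the weight arrays are NumPy arrays:

import numpy as np

def f(a):
    return 1.0 / (1.0 + np.exp(-a))  # sigmoid activation

def train_step(x, d, W_ij, b_j, W_jk, b_k, lr=0.5):
    # Feed Forward phase
    y = f(b_j + x @ W_ij)            # Yj = f(bj + Σi Xi Wij)
    z = f(b_k + y @ W_jk)            # Zk = f(bk + Σj Yj Wjk)
    # Backpropagation of errors
    delta_k = z * (1 - z) * (d - z)              # output deltas δk, as in the notes
    delta_j = y * (1 - y) * (delta_k @ W_jk.T)   # hidden-layer deltas (standard rule)
    # Weight and bias updates proportional to the deltas
    W_jk += lr * np.outer(y, delta_k); b_k += lr * delta_k
    W_ij += lr * np.outer(x, delta_j); b_j += lr * delta_j
    return z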
Types of Backpropagation:
• Static backpropagation - maps a static input to a static output, as in feedforward networks.
• Recurrent backpropagation - used in networks with feedback, where activations are
propagated forward until a fixed value is reached before the error is computed.
Advantages:
• It is simple, fast, and easy to program.
• Apart from the number of inputs, it has no parameters to tune.
• It is a flexible method that does not require prior knowledge about the network.
Disadvantages:
• It is sensitive to noisy data and irregularities; noisy data can lead to inaccurate results.
• Performance is highly dependent on the input data.
• Training can take a long time.
• A matrix-based approach is preferred over a mini-batch approach.
1. Regularization
Common regularization techniques are:
1. L1 regularization
2. L2 regularization
3. Dropout regularization
L1 regularization (Lasso Regression) adds the "absolute value of magnitude" of the
coefficients as a penalty term to the loss function (L):
Loss = L + lambda * Σ |wi|
L2 regularization (Ridge Regression) adds the "squared magnitude" of the coefficients as a
penalty term to the loss function (L):
Loss = L + lambda * Σ wi²
where lambda > 0 controls the strength of the penalty.
NOTE that during regularization the output function (y_hat = w1x1 + w2x2 + ... + wnxn + b)
does not change. The change is only in the loss function.
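A brief scikit-learn sketch contrasting the two penalties (the synthetic dataset and the alpha value are illustrative; alpha plays the role of lambda above):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only features 0 and 3 actually matter in this synthetic data
y = X @ np.array([1.5, 0.0, 0.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives some coefficients exactly to zero
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks all coefficients toward zero
print(lasso.coef_)
print(ridge.coef_)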
Dropout Regularization: Dropout regularization is a technique that randomly drops a number
of neurons in a neural network during model training. This means the contribution of the
dropped neurons is temporarily removed and they do not have an impact on the model's
output. The image below shows how dropout regularization works:
In the image above, the neural network on the left shows an original neural network where all
neurons are activated and working.
On the right, the red neurons have been removed from the neural network. Therefore, red
neurons will not be considered during model training.
The neurons can't rely on any one input, because it might be dropped out at random. This
reduces over-reliance on particular inputs, which is a major cause of overfitting.
Neurons will not learn redundant details of the inputs. This ensures only important
information is stored by the neurons, enabling the network to gain useful knowledge
that it uses to make predictions.
An unregularized network overfits quickly on the training dataset: note how the validation
loss for the no-dropout run diverges dramatically after only a few epochs. This overfitting
is avoided by training with two dropout layers and a dropout probability of 25%. However,
this lowers the training accuracy, so a regularised network needs to be trained for a
longer period.
Dropout improves model generalisation. Although the training accuracy is lower than that of
the unregularized network, the total validation accuracy has improved; the regularised
network generalises better to unseen data.
2. Hyperparameter tuning
Models can have many hyperparameters and finding the best combination of parameters can
be treated as a search problem. The two best strategies for Hyperparameter tuning are:
GridSearchCV
RandomizedSearchCV
GridSearchCV
In GridSearchCV approach, the machine learning model is evaluated for a range of
hyperparameter values. This approach is called GridSearchCV, because it searches for the
best set of hyperparameters from a grid of hyperparameters values.
For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic
Regression Classifier model, each with a different set of candidate values. The grid search
technique will construct many versions of the model with all possible combinations of the
hyperparameters and will return the best one.
As shown in the image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4],
the grid is evaluated over all combinations. For the combination C = 0.3 and Alpha = 0.2,
the performance score comes out to be 0.726 (the highest), so that combination is selected.
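The setup code for this example is omitted in the notes; a plausible sketch that would produce output of this shape is given below (the parameter grid is an assumption, and X and y are taken to be an already-loaded feature matrix and label vector; the reported output corresponds to printing logreg_cv.best_params_ and logreg_cv.best_score_ after the fit on the next line):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.logspace(-5, 8, 15)}  # hypothetical grid of C values
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)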
logreg_cv.fit(X, y)
Output:
Tuned Logistic Regression Parameters: {'C': 3.7275937203149381} Best score is
0.7708333333333334
The following code illustrates how to use RandomizedSearchCV
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
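The parameter distribution and estimator setup are likewise omitted in the notes; a plausible sketch consistent with the output below (the search space is an assumption, and X and y are an already-loaded dataset) is:

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
tree = DecisionTreeClassifier()
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)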
tree_cv.fit(X, y)
Output:
Tuned Decision Tree Parameters: {'min_samples_leaf': 5, 'max_depth': 3, 'max_features':
5, 'criterion': 'gini'} Best score is 0.7265625
********************************************************************