UNIT–II: SINGLE LAYER NETWORKS
What is Linear Discriminant Analysis (LDA)?
Although the logistic regression algorithm is limited to two-class problems,
linear discriminant analysis is applicable to classification problems with
more than two classes.
Linear discriminant analysis is one of the most popular dimensionality
reduction techniques used for supervised classification problems in
machine learning. It is also used as a pre-processing step for modeling
class differences in ML and in pattern classification applications.
Whenever there is a requirement to separate two or more classes having
multiple features efficiently, the linear discriminant analysis model is
the most common technique for solving such classification problems. For
example, suppose we have two classes with multiple features and need to
separate them efficiently. If we classify them using a single feature, the
classes may overlap.
To overcome this overlapping in the classification process, we keep
increasing the number of features.
Example:
Let's assume we have to classify two different classes, each with its own set
of data points, in a 2-dimensional plane as shown in the image below:
It may be impossible to draw a straight line in the 2-D plane that separates
these data points efficiently, but using linear discriminant analysis we can
reduce the 2-D plane to a 1-D line. Using this technique, we can also
maximize the separability between multiple classes.
How does Linear Discriminant Analysis (LDA) work?
Linear discriminant analysis is used as a dimensionality reduction technique
in machine learning, with which we can easily transform a 2-D or 3-D
graph into a 1-dimensional line.
Let's consider an example where we have two classes in a 2-D plane with
an X-Y axis, and we need to classify them efficiently. As we saw in the
example above, LDA enables us to draw a straight line that can completely
separate the two classes of data points. Here, LDA uses the X-Y axes to
create a new axis, separating the classes with a straight line and
projecting the data onto that new axis.
Hence, we can maximize the separation between these classes and reduce
the 2-D plane to 1-D.
To create a new axis, Linear Discriminant Analysis uses the following
criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
Using the above two conditions, LDA generates a new axis in such a way that
it maximizes the distance between the means of the two classes and
minimizes the variation within each class.
In other words, the new axis increases the separation between the data
points of the two classes when they are projected onto it.
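As a rough illustration of these two criteria, here is a minimal sketch using scikit-learn's LinearDiscriminantAnalysis to project invented 2-D, two-class data onto such a new 1-D axis (the toy data values are made up for this example):

```python
# A minimal sketch of LDA as dimensionality reduction, using scikit-learn.
# The toy dataset below is invented purely for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes in a 2-D feature space.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # class 0
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# n_components=1 projects the 2-D data onto a single new axis that
# maximizes between-class mean distance and minimizes within-class variance.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)
print(X_1d.ravel())  # the 1-D projections of the six points
```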
Why LDA?
o Logistic regression is one of the most popular classification
algorithms and performs well for binary classification, but it falls short
for multi-class classification problems with well-separated classes. LDA
handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of
features, just as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is
used to extract useful data from different faces. Coupled with
eigenfaces, it produces effective results.
Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification
problems with two or more classes, which is not possible using logistic
regression in machine learning, LDA fails in cases where the means of the
class distributions are shared. In this case, LDA cannot create a new axis
that makes both classes linearly separable.
To overcome such problems, we use non-linear Discriminant analysis in
machine learning.
Difference between Linear Discriminant Analysis and PCA
Below are some basic differences between LDA and PCA:
o PCA is an unsupervised algorithm that does not care about classes and
labels and only aims to find the principal components to maximize the
variance in the given dataset. At the same time, LDA is a supervised
algorithm that aims to find the linear discriminants to represent the
axes that maximize separation between different classes of data.
o LDA is much more suitable for multi-class classification tasks than
PCA. However, PCA is assumed to perform better when the sample
size is comparatively small.
o Both LDA and PCA are used as dimensionality reduction techniques,
with PCA often applied first, followed by LDA.
2. Linearly Separable Classes
o The concept of separability applies to binary classification problems.
In them, we have two classes: one positive and the other negative. We
say they’re separable if there’s a classifier whose decision boundary
separates the positive objects from the negative ones.
o If such a decision boundary is a linear function of the features, we
say that the classes are linearly separable.
o Since we deal with labelled data, the objects in a dataset will be linearly
separable if the classes in the feature space are too.
2.1. Linearly Separable 2D Data
o We say a two-dimensional dataset is linearly separable if we can
separate the positive from the negative objects with a straight
line.
o It doesn’t matter if more than one such line exists. For linear
separability, it’s sufficient to find only one:
o Conversely, no line can separate linearly inseparable 2D data:
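One practical (though informal) way to check whether a 2-D dataset is linearly separable is to train a perceptron and see whether it reaches perfect training accuracy; a minimal sketch on invented data:

```python
# Sketch: testing whether a 2-D dataset looks linearly separable by
# training a perceptron and checking for zero training error.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])
y = np.array([0, 0, 1, 1])  # labels: negative and positive class

clf = Perceptron(max_iter=1000, tol=None)
clf.fit(X, y)

# If a separating line exists, the perceptron convergence theorem says
# the algorithm will eventually find one (100% training accuracy).
print("training accuracy:", clf.score(X, y))
```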
Generalized linear discriminants
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique
commonly used for supervised classification problems. The goal of LDA is to
project the dataset onto a lower-dimensional space while maximizing the
class separability.
LDA is very similar to Principal Component Analysis (PCA), but there are
some important differences. PCA is an unsupervised algorithm, meaning it
doesn't need class labels. PCA's goal is to find the principal components that
maximize the variance in a dataset. LDA, on the other hand, is a supervised
algorithm, which uses both the input data and the class labels to find linear
discriminants that maximize the separation between multiple classes.
LDA can be performed in 5 steps:
1. Compute the mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class
scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the
scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose k
eigenvectors with the largest eigenvalues.
5. Use this eigenvector matrix to transform the samples onto the new
subspace.
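These five steps can be sketched directly with NumPy; the following is an illustrative outline on an invented two-class toy dataset, not a production implementation:

```python
import numpy as np

# Toy data: rows are samples, columns are features; two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(y)

# Step 1: mean vector per class.
means = {c: X[y == c].mean(axis=0) for c in classes}
overall_mean = X.mean(axis=0)

# Step 2: within-class (S_W) and between-class (S_B) scatter matrices.
d = X.shape[1]
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    Xc = X[y == c]
    diff = Xc - means[c]
    S_W += diff.T @ diff
    m_diff = (means[c] - overall_mean).reshape(d, 1)
    S_B += Xc.shape[0] * (m_diff @ m_diff.T)

# Step 3: eigenvectors/eigenvalues of S_W^-1 S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Step 4: sort eigenvectors by decreasing eigenvalue and keep the top k.
order = np.argsort(eigvals.real)[::-1]
k = 1
W = eigvecs[:, order[:k]].real

# Step 5: project the samples onto the new subspace.
X_lda = X @ W
print(X_lda.ravel())
```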
The perceptron:
What is a Neural Network?
The term “neural” comes from “neuron”, which is the term used for a
single nerve cell. That’s right – a neural network essentially means a network
of neurons that performs the simple and complex actions of our daily lives.
Example:
• Children memorizing what an apple looks like
• An animal recognizing its mother or owner
• Perceiving whether something is hot or cold
Our neural networks perform these complicated computations.
Humans have now been able to build a computation system that can perform
in a manner similar to our nervous system. These are called artificial neural
networks (ANNs).
An artificial neuron or Perceptron Model:
Perceptron is a supervised learning algorithm that classifies the data into
two categories, thus it is a binary classifier.
The Linear Unit:
So let's begin with the fundamental component of a neural network: the
individual neuron. As a diagram, a neuron (or unit) with one input looks like:
The Linear Unit: y=wx+b
• The input is x. Its connection to the neuron has a weight which is w.
Whenever a value flows through a connection, you multiply the value
by the connection's weight. For the input x, what reaches the neuron
is w * x.
• A neural network "learns" by modifying its weights.
• The b is a special kind of weight we call the bias. The bias doesn't have
any input data associated with it; instead, we put a 1 in the diagram so
that the value that reaches the neuron is just b (since 1 * b = b).
• The bias enables the neuron to modify the output independently of its
inputs.
• The y is the value the neuron ultimately outputs. To get the output, the
neuron sums up all the values it receives through its connections. This
neuron's activation is y = w * x + b, or as a formula y=wx+b.
• Does the formula y=wx+b look familiar?
It's an equation of a line! It's the slope-intercept equation, where w is
the slope and b is the y-intercept.
Example - The Linear Unit as a Model:
Though individual neurons will usually only function as part of a larger
network, it's often useful to start with a single neuron model as a baseline.
Single neuron models are linear models.
Let's think about how this might work on a dataset like 80 Cereals. Training
a model with 'sugars' (grams of sugars per serving) as input
and 'calories' (calories per serving) as output, we might find the bias
is b=90 and the weight is w=2.5. We could estimate the calorie content of a
cereal with 5 grams of sugar per serving like this:
Computing with the linear unit.
And, checking against our formula, we have calories=2.5×5+90=102.5, just
like we expect.
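The same computation as a tiny sketch (the weight and bias values come from the example above):

```python
# The linear unit from the cereal example: y = w * x + b.
w, b = 2.5, 90.0          # weight and bias found by training (per the text)
sugars = 5.0              # grams of sugar per serving
calories = w * sugars + b
print(calories)           # 102.5
```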
Multiple Inputs
The 80 Cereals dataset has many more features than just 'sugars'. What if we
wanted to expand our model to include things like fiber or protein content?
That's easy enough. We can just add more input connections to the neuron,
one for each additional feature. To find the output, we multiply each
input by its connection weight and then add them all together.
A linear unit with three inputs.
The formula for this neuron would be y=w0x0+w1x1+w2x2+b. A linear unit
with two inputs will fit a plane, and a unit with more inputs than that will fit
a hyperplane.
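In code, this multi-input linear unit is just a dot product plus a bias; a minimal sketch with invented weights:

```python
import numpy as np

# A linear unit with three inputs: y = w0*x0 + w1*x1 + w2*x2 + b.
w = np.array([2.5, -1.0, 0.5])   # one weight per feature (illustrative values)
b = 90.0
x = np.array([5.0, 2.0, 3.0])    # e.g. sugars, fiber, protein per serving
y = np.dot(w, x) + b
print(y)
```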
What is an Artificial Neural Network Model?
A multi-layer, fully-connected neural network containing an input layer,
hidden layers, and an output layer is called an artificial neural network or
ANN.
The image below represents an ANN.
If you look carefully, you will notice that each node in one layer is connected
to every node in the layer next to it.
As you increase the number of hidden layers, the network becomes deeper.
Let’s see what an individual node in the output or hidden layer looks like.
As you can see, the node gets many inputs. It sums all of its weighted inputs
and passes the result on as output, via a non-linear activation function.
This output of the node becomes the input of the node in the next layer.
An important thing to note here is that the signal will always move from left
to right. Once all the nodes have followed the procedure, the final output will
be given.
Here’s what the equation of a node looks like:
y = f(w1x1 + w2x2 + … + wnxn + b)
In the above equation, b is the bias and f is the non-linear activation
function. The bias is an input to all nodes and always carries the value 1.
Bias helps to shift the activation function result to the left or right.
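Putting this together, here is a minimal sketch of a single node that sums its weighted inputs plus the bias and applies a sigmoid activation (the input and weight values are invented):

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation squashing the sum into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def node_output(x, w, b):
    # y = f(w1*x1 + ... + wn*xn + b), the node equation from the text.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.8, 0.1, -0.4])   # illustrative weights
b = 0.2                          # bias (its input is fixed at 1)
print(node_output(x, w, b))
```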
Glossary of Artificial Neural Network Model
Let’s look at the basic terms you should know when it comes to an artificial
neural network model.
Inputs
The data first fed into the neural network from the source is called the input.
Its goal is to give the network data to make a decision or prediction about
the information fed into it. The neural network model usually accepts real-
valued sets of inputs, each of which is fed into a neuron in the input layer.
Training set
The inputs for which you already know the correct outputs are called the
training set. It is used to train the neural network so that it learns the
correct result for a given input set.
Outputs
Every neural network generates an output as a prediction or decision about
the data fed into it. This output is a set of real values or a Boolean decision.
Only the neurons in the output layer generate the output value.
Neuron
Also known as a perceptron, a neuron is the basic unit of a neural network.
It accepts an input value and generates an output based on it.
As discussed before, every neuron receives a part of the input and passes it
through the non-linear activation function to the node in the next layer.
These activation functions can be tanh, sigmoid, or ReLU. The non-linearity
of these functions is what helps in training the network.
Weight space
Every neuron has a numeric weight. When it delivers input to another node,
that input is multiplied by the weight and totalled with the others to
generate an output. Neural networks are trained by making small changes to
these weights. This fine-tuning helps determine the correct set of weights
and biases that would generate the best outcome. This is where
backpropagation comes in.
Least Square method:
Least squares is a commonly used method in regression analysis for estimating
the unknown parameters by creating a model which will minimize the sum of
squared errors between the observed data and the predicted data.
Basically, it is one of the most widely used curve-fitting methods, and it
works by making the sum of squared errors as small as possible. It helps you
draw a line of best fit through your data points.
Finding the Line of Best Fit Using Least Square Regression
Given any collection of pairs of numbers and the corresponding scatter graph,
the line of best fit is the straight line that you can draw through the scatter
points to best represent the relationship between them. So, back to our
equation of the straight line, we have:
Y=mX+c
Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of slope, y-intercept, and substitute them
in the equation along with the values of independent variable X, to determine
the values of dependent variable Y. Let’s assume that we have ‘n’ data points;
then we can calculate the slope using the scary-looking formula below:
m = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²)
Then, the y-intercept is calculated using the formula:
c = (ΣY − m·ΣX) / n
Lastly, we substitute these values in the final equation Y = mX + c.
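A short sketch of these formulas on an invented set of (X, Y) pairs:

```python
import numpy as np

# Made-up data points for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
n = len(X)

# Slope and intercept from the least-squares formulas above.
m = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X**2) - np.sum(X)**2)
c = (np.sum(Y) - m * np.sum(X)) / n
print(f"line of best fit: Y = {m:.3f} * X + {c:.3f}")
```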
Gradient based Strategies:
Learning Rate Decay
Learning rate decay is a technique for training modern neural networks. It
starts training the network with a large learning rate and then slowly
reduces (decays) it until a local minimum is reached. It is empirically
observed to help both optimization and generalization.
How does it work?
Algorithm converging with a constant learning rate (noisy, shown in blue).
Algorithm converging with a learning rate that decays over time (less noisy,
shown in green).
In the first image, where we have a constant learning rate, the steps taken
by our algorithm while iterating towards the minimum are so noisy that after
a certain number of iterations it seems to wander around the minimum
without actually converging.
In the second image, where the learning rate decreases over time (the green
line), learning is still relatively fast initially because the learning rate
is large, but as we approach the minimum the learning rate gets smaller and
smaller, so we end up oscillating in a tighter region around the minimum
rather than wandering far away from it.
Learning rate decay (common method):
“ α=(1/(1+decayRate×epochNumber))*α0 ”
1 epoch : 1 pass through data
α : learning rate (current iteration)
α0 : Initial learning rate
decayRate : hyper-parameter for the method
Let’s take a rough example for the above method for better intuition :
Suppose we have α0 = 0.2 and decayRate = 1; then for each epoch we can
examine the fall in the learning rate α:
Epoch 1: alpha 0.1
Epoch 2: alpha 0.067
Epoch 3: alpha 0.05
Epoch 4: alpha 0.04
This is a typical (commonly used) method of applying learning rate decay
when training neural networks.
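The same schedule in code, reproducing the epoch-by-epoch values above:

```python
# Learning rate decay: alpha = alpha0 / (1 + decayRate * epochNumber).
alpha0 = 0.2
decay_rate = 1.0

for epoch in range(1, 5):
    alpha = alpha0 / (1.0 + decay_rate * epoch)
    print(f"Epoch {epoch}: alpha {alpha:.3f}")
# Epoch 1: alpha 0.100, Epoch 2: alpha 0.067, Epoch 3: alpha 0.050, ...
```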
Gradient Descent
Gradient descent is an optimization algorithm.
It is technically referred to as a first-order optimization algorithm as it
explicitly makes use of the first-order derivative of the target objective
function.
First-order methods rely on gradient information to help direct the search for
a minimum …
The first-order derivative, or simply the “derivative,” is the rate of change or
slope of the target function at a specific point, e.g. for a specific input.
If the target function takes multiple input variables, it is referred to as a
multivariate function and the input variables can be thought of as a vector.
In turn, the derivative of a multivariate target function may also be taken as
a vector and is referred to generally as the “gradient.”
• Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent
of the target function for a specific input.
Gradient descent refers to a minimization optimization algorithm that
follows the negative of the gradient downhill of the target function to locate
the minimum of the function.
The gradient descent algorithm requires a target function that is being
optimized and the derivative function for the objective function. The target
function f() returns a score for a given set of inputs, and the derivative
function f'() gives the derivative of the target function for a given set of
inputs.
The gradient descent algorithm requires a starting point (x) in the problem,
such as a randomly selected point in the input space.
The derivative is then calculated and a step is taken in the input space that
is expected to result in a downhill movement in the target function, assuming
we are minimizing the target function.
A downhill movement is made by first calculating how far to move in the
input space, calculated as the step size (called alpha or the learning rate)
multiplied by the gradient. This is then subtracted from the current point,
ensuring we move against the gradient, or down the target function.
• x = x − step_size * f'(x)
The steeper the objective function at a given point, the larger the magnitude
of the gradient and, in turn, the larger the step taken in the search space. The
size of the step taken is scaled using a step size hyperparameter.
• Step Size (alpha): Hyperparameter that controls how far to move in the
search space against the gradient each iteration of the algorithm, also called
the learning rate.
If the step size is too small, the movement in the search space will be small
and the search will take a long time. If the step size is too large, the search
may bounce around the search space and skip over the optima.
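As a concrete sketch, here is gradient descent on the simple target function f(x) = x², whose derivative is f'(x) = 2x (the starting point and step size are chosen arbitrarily):

```python
# Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x.
def f(x):
    return x ** 2

def f_prime(x):
    return 2.0 * x

x = 3.0            # randomly selected starting point
step_size = 0.1    # learning rate (alpha)

for i in range(50):
    x = x - step_size * f_prime(x)   # move against the gradient

print(x, f(x))     # x approaches the minimum at 0
```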
Now that we are familiar with the gradient descent optimization algorithm,
let’s take a look at momentum.
Momentum-Based Learning
Gradient descent with momentum usually works much faster than standard
gradient descent. The basic idea of gradient descent with momentum is to
calculate an exponentially weighted average of your gradients and then use
that average, rather than the raw gradient, to update your weights.
How does it work?
Consider an example where we are trying to optimize a cost function that has
contours like the one below and the red dot denotes the local optima
(minimum) location.
We start gradient descent from point ‘A’ and end up at point ‘B’, on the
other side of the ellipse, after one iteration. Another step of gradient
descent may then end at point ‘C’. With each iteration of gradient descent,
oscillating up and down, we step towards the local optimum. If we use a
higher learning rate, the frequency of this vertical oscillation is greater.
The vertical oscillation therefore slows our gradient descent and prevents
us from using a much higher learning rate.
By taking exponentially weighted averages of the dW and db values, we
average the oscillations in the vertical direction towards zero, since they
point in both (positive and negative) directions. In the horizontal
direction, however, all the derivatives point to the right, so the average in
the horizontal direction will still be quite large. This enables our
algorithm to take a more direct path to the local optimum and to damp out
the vertical oscillations. Because of this, the algorithm reaches the local
optimum in fewer iterations.
Implementation
We use dW and db to update our parameters W and b during the backward
propagation as follows:
W = W − learning rate * dW
b = b − learning rate * db
In momentum we take the exponentially weighted averages of dW and db,
instead of using dW and db independently for each epoch.
VdW = β * VdW + (1 − β) * dW
Vdb = β * Vdb + (1 − β) * db
Where beta ‘β’ is a different hyperparameter called momentum, ranging
from 0 to 1. To calculate the new weighted average, it sets the weight
between the average of previous values and the current value.
We’ll update our parameters after calculating the exponentially weighted
averages.
W = W − learning rate * VdW
b = b − learning rate * Vdb
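Here is a minimal sketch of this update rule for one parameter vector; the gradient dW is a stand-in for what backpropagation would supply:

```python
import numpy as np

def momentum_step(W, VdW, dW, beta=0.9, learning_rate=0.01):
    # Exponentially weighted average of the gradients ...
    VdW = beta * VdW + (1.0 - beta) * dW
    # ... then update the weights with the averaged gradient.
    W = W - learning_rate * VdW
    return W, VdW

W = np.array([0.5, -0.3])
VdW = np.zeros_like(W)        # velocity starts at zero
dW = np.array([0.1, -0.2])    # stand-in gradient from backpropagation
W, VdW = momentum_step(W, VdW, dW)
print(W, VdW)
```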
How to choose Beta?
• The momentum (beta) must be higher to smooth out the update
because we give more weight to the past gradients.
• Using the default value β = 0.9 is suggested, but it can be tuned
between 0.8 and 0.999 if needed.
• Momentum takes past gradients into account so as to smooth out the
gradient steps. It can be implemented with batch gradient descent,
mini-batch gradient descent, or stochastic gradient descent.
Parameter-Specific Learning Rates (Or) Adagrad
Adagrad stands for Adaptive Gradient Optimizer. Earlier optimizers like
gradient descent, stochastic gradient descent, and mini-batch SGD were all
used to reduce the loss function with respect to the weights. The weight
updating formula is as follows:
w(new) = w(old) − η × (∂L/∂w(old))
Based on iterations, this formula can be written as:
w(t) = w(t−1) − η × (∂L/∂w(t−1))
where
w(t) = value of w at the current iteration, w(t−1) = value of w at the
previous iteration, and η = learning rate.
In SGD and mini-batch SGD, the value of η used to be the same for each
weight, or say for each parameter; typically, η = 0.01. But in the Adagrad
optimizer the core idea is that each weight has a different learning rate.
This modification has great importance: in real-world datasets, some
features are sparse (for example, in Bag of Words most of the features are
zero, so it’s sparse) and some are dense (most of the features are non-zero),
so keeping the same learning rate for all the weights is not good for
optimization. The weight updating formula for Adagrad looks like:
w(t) = w(t−1) − η′(t) × (∂L/∂w(t−1))
with η′(t) = η / √(α(t) + ε), where α(t) = Σ gᵢ² is the sum of the squares of
that weight’s past gradients gᵢ = ∂L/∂w(i−1) up to iteration t.
Here, η′(t) gives a different learning rate for each weight at each
iteration, η is a constant, and epsilon is a small positive number that
avoids a divide-by-zero error in case α(t) is 0 (before any gradient has
accumulated). Note also that as α(t) keeps growing, η′(t) shrinks towards
zero, which in turn makes w(old) ≈ w(new) and leads to very slow
convergence.
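Following these formulas, here is a minimal per-parameter Adagrad sketch (the gradient values are stand-ins for what backpropagation would supply):

```python
import numpy as np

def adagrad_step(w, alpha, grad, eta=0.01, eps=1e-8):
    # alpha(t): per-weight running sum of squared gradients.
    alpha = alpha + grad ** 2
    # eta'(t) = eta / sqrt(alpha(t) + eps): a different rate per weight.
    eta_eff = eta / np.sqrt(alpha + eps)
    w = w - eta_eff * grad
    return w, alpha

w = np.array([0.5, -0.3])
alpha = np.zeros_like(w)               # nothing accumulated yet
for grad in [np.array([0.1, 0.0]),     # stand-in gradients; the second
             np.array([0.2, -0.1])]:   # weight is "sparser"
    w, alpha = adagrad_step(w, alpha, grad)
print(w, alpha)
```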
Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
What is Gradient Clipping and how does it work?
Gradient clipping involves capping the error derivatives before
propagating them back through the network. The capped gradients are
used to update the weights hence resulting in smaller weights. The
gradients are capped by scaling and clipping. Gradient clipping involves
forcing the gradients to a certain number when they go above or
below a defined threshold.
Types of Clipping techniques
Gradient clipping can be applied in two common ways:
• Clipping by value
• Clipping by norm
Let’s look at the differences between the two.
Gradient Clipping-by-value
Clipping the gradient by value involves defining a minimum and a
maximum threshold. If the gradient goes above the maximum value it is
capped to the defined maximum. Similarly, if the gradient goes below the
minimum it is capped to the stated minimum value.
Gradient Clipping-by-norm
Clipping the gradient by norm ensures that the gradient of every weight is
clipped such that its norm won’t be above the specified value.
“…a norm is a function that accepts as input a vector from our vector space V
and spits out a real number that tells us how big that vector is. ”
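Both clipping techniques in a short NumPy sketch (the thresholds are arbitrary illustration values):

```python
import numpy as np

def clip_by_value(grad, min_val=-1.0, max_val=1.0):
    # Cap every gradient component to the [min_val, max_val] range.
    return np.clip(grad, min_val, max_val)

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the whole gradient vector so its norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])            # norm 5.0
print(clip_by_value(g))              # [ 1. -1.]
print(clip_by_norm(g, max_norm=1.0)) # [ 0.6 -0.8], norm 1.0
```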
Second-Order Derivatives:
The second order derivative is used to get an idea of the shape of the graph
of a given function: whether it curves upwards or downwards. This is based
on concavity. The concavity of a function's graph is classified into two
types: concave up and concave down.
The second derivative is the first derivative of the first derivative. In general
the n+1st derivative (with respect to x) of a (differentiable) function of x, is
the first derivative of the nth derivative. This goes all the way ‘down’ to the
zeroth derivative which is the function itself.
The second derivative is the slope of the first derivative.
Normally, the second derivative of a given function relates to the concavity
of its graph. If the second-order derivative value is positive, then the
graph of the function is concave up. If the second-order derivative value is
negative, then the graph of the function is concave down.
We already know that the second derivative of a function helps determine
local maximum (maxima), local minimum (minima), and inflexion point values.
At a critical point x (where f’(x) = 0), these can be identified with the
help of the conditions below:
• If f”(x) < 0, then the function f(x) has a local maximum at x.
• If f”(x) > 0, then the function f(x) has a local minimum at x.
• If f”(x) = 0, then it is not possible to conclude anything about the point x,
a possible inflexion point.
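As a quick check of these conditions, here is a small SymPy sketch; the function f(x) = x³ − 3x is just an invented example:

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                 # example function

f1 = sp.diff(f, x)             # first derivative: 3x^2 - 3
f2 = sp.diff(f, x, 2)          # second derivative: 6x

for point in sp.solve(f1, x):  # critical points: x = -1, x = 1
    curvature = f2.subs(x, point)
    kind = "maximum" if curvature < 0 else "minimum"
    print(f"x = {point}: f'' = {curvature} -> local {kind}")
```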
Relation between first and second derivative
The relation between 1st and 2nd derivative is the same as between a
function and its 1st derivative. The first derivative is the rate of change. The
second derivative is the rate of change of the rate of change.
A common practical application is if you measure distance/time, the
first derivative will be velocity (speed) and the second the acceleration.
You can test whether an extremum found from the first derivative is a
maximum or a minimum by plugging its x value into the second derivative. If
the resulting value is less than 0, you have a maximum; if it is greater than
0, you have a minimum. The zeros of the second derivative mark candidate
inflection points of the concavity (whether the graph faces up or down). You
can test these points by plugging in a value slightly higher and slightly
lower than the zero; if there is a sign change, it is an inflection point.