UNIT–II: SINGLE LAYER NETWORKS
What is Linear Discriminant Analysis (LDA)?
Although the logistic regression algorithm is limited to two-class problems,
linear discriminant analysis is applicable to classification problems with
more than two classes.
Linear discriminant analysis is one of the most popular dimensionality
reduction techniques used for supervised classification problems in
machine learning. It is also used as a pre-processing step for modeling
class differences in ML and in pattern classification applications.
Whenever there is a requirement to separate two or more classes having
multiple features efficiently, the linear discriminant analysis model is
the most common technique for solving such classification problems. For
example, suppose we have two classes with multiple features and need to
separate them efficiently. If we classify them using a single feature, the
classes may overlap.
To overcome this overlapping in the classification process, we keep
increasing the number of features.
Example:
Let's assume we have to classify two different classes, each with its own set
of data points, in a 2-dimensional plane as shown in the image below:
It may be impossible to draw a straight line in the 2-D plane that separates
these data points efficiently, but using linear discriminant analysis we can
reduce the 2-D plane to a 1-D line. Using this technique, we can also
maximize the separability between multiple classes.
How does Linear Discriminant Analysis (LDA) work?
Linear discriminant analysis is used as a dimensionality reduction technique
in machine learning, with which we can easily transform a 2-D or 3-D
graph into a 1-dimensional line.
Let's consider an example where we have two classes in a 2-D plane with
an X-Y axis, and we need to classify them efficiently. As we saw in the
example above, LDA enables us to draw a straight line that can completely
separate the two classes of data points. Here, LDA uses the X-Y axes to
create a new axis, separating the classes with a straight line and
projecting the data onto that new axis.
Hence, we can maximize the separation between these classes and reduce
the 2-D plane to 1-D.
To create a new axis, Linear Discriminant Analysis uses the following
criteria:
o It maximizes the distance between means of two classes.
o It minimizes the variance within the individual class.
Using the above two conditions, LDA generates a new axis in such a way that
it maximizes the distance between the means of the two classes and
minimizes the variation within each class.
In other words, the new axis increases the separation between the data
points of the two classes when they are projected onto it.
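As a rough illustration of these two criteria, here is a minimal sketch using scikit-learn's LinearDiscriminantAnalysis to project invented 2-D, two-class data onto such a new 1-D axis (the toy data values are made up for this example):

```python
# A minimal sketch of LDA as dimensionality reduction, using scikit-learn.
# The toy dataset below is invented purely for illustration.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two classes in a 2-D feature space.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # class 0
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# n_components=1 projects the 2-D data onto a single new axis that
# maximizes between-class mean distance and minimizes within-class variance.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)
print(X_1d.ravel())  # the 1-D projections of the six points
```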
Why LDA?
o Logistic regression is one of the most popular classification
algorithms and performs well for binary classification, but it falls short
for multi-class classification problems with well-separated classes. LDA
handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of
features, just as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is
used to extract useful data from different faces. Coupled with
eigenfaces, it produces effective results.
Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification
problems with two or more classes, which is not possible using logistic
regression in machine learning, LDA fails in cases where the means of the
class distributions are shared. In this case, LDA cannot create a new axis
that makes both classes linearly separable.
To overcome such problems, we use non-linear Discriminant analysis in
machine learning.
Difference between Linear Discriminant Analysis and PCA
Below are some basic differences between LDA and PCA:
o PCA is an unsupervised algorithm that does not care about classes and
labels and only aims to find the principal components to maximize the
variance in the given dataset. At the same time, LDA is a supervised
algorithm that aims to find the linear discriminants to represent the
axes that maximize separation between different classes of data.
o LDA is much more suitable for multi-class classification tasks than
PCA. However, PCA is assumed to perform better when the sample
size is comparatively small.
o Both LDA and PCA are used as dimensionality reduction techniques,
with PCA often applied first, followed by LDA.
2. Linearly Separable Classes
o The concept of separability applies to binary classification problems.
In them, we have two classes: one positive and the other negative. We
say they’re separable if there’s a classifier whose decision boundary
separates the positive objects from the negative ones.
o If such a decision boundary is a linear function of the features, we
say that the classes are linearly separable.
o Since we deal with labelled data, the objects in a dataset will be linearly
separable if the classes in the feature space are too.
2.1. Linearly Separable 2D Data
o We say a two-dimensional dataset is linearly separable if we can
separate the positive from the negative objects with a straight
line.
o It doesn’t matter if more than one such line exists. For linear
separability, it’s sufficient to find only one:
o Conversely, no line can separate linearly inseparable 2D data:
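One practical (though informal) way to check whether a 2-D dataset is linearly separable is to train a perceptron and see whether it reaches perfect training accuracy; a minimal sketch on invented data:

```python
# Sketch: testing whether a 2-D dataset looks linearly separable by
# training a perceptron and checking for zero training error.
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])
y = np.array([0, 0, 1, 1])  # labels: negative and positive class

clf = Perceptron(max_iter=1000, tol=None)
clf.fit(X, y)

# If a separating line exists, the perceptron convergence theorem says
# the algorithm will eventually find one (100% training accuracy).
print("training accuracy:", clf.score(X, y))
```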
Generalized linear discriminants
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique
commonly used for supervised classification problems. The goal of LDA is to
project the dataset onto a lower-dimensional space while maximizing the
class separability.
LDA is very similar to Principal Component Analysis (PCA), but there are
some important differences. PCA is an unsupervised algorithm, meaning it
doesn't need class labels. PCA's goal is to find the principal components that
maximize the variance in a dataset. LDA, on the other hand, is a supervised
algorithm, which uses both the input data and the class labels to find linear
discriminants that maximize the separation between multiple classes.
LDA can be performed in 5 steps:
1. Compute the mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class
scatter matrices).
3. Compute the eigenvectors and corresponding eigenvalues for the
scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalues and choose k
eigenvectors with the largest eigenvalues.
5. Use this eigenvector matrix to transform the samples onto the new
subspace.
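These five steps can be sketched directly with NumPy; the following is an illustrative outline on an invented two-class toy dataset, not a production implementation:

```python
import numpy as np

# Toy data: rows are samples, columns are features; two classes.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(y)

# Step 1: mean vector per class.
means = {c: X[y == c].mean(axis=0) for c in classes}
overall_mean = X.mean(axis=0)

# Step 2: within-class (S_W) and between-class (S_B) scatter matrices.
d = X.shape[1]
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    Xc = X[y == c]
    diff = Xc - means[c]
    S_W += diff.T @ diff
    m_diff = (means[c] - overall_mean).reshape(d, 1)
    S_B += Xc.shape[0] * (m_diff @ m_diff.T)

# Step 3: eigenvectors/eigenvalues of S_W^-1 S_B.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)

# Step 4: sort eigenvectors by decreasing eigenvalue and keep the top k.
order = np.argsort(eigvals.real)[::-1]
k = 1
W = eigvecs[:, order[:k]].real

# Step 5: project the samples onto the new subspace.
X_lda = X @ W
print(X_lda.ravel())
```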
The perceptron:
What is a Neural Network?
The term “neural” comes from “neuron”, which is the term used for a
single nerve cell. That’s right – a neural network essentially means a network
of neurons that performs the simple and complex actions of our daily lives.
Example:
• Children memorizing what an apple looks like
• An animal recognizing its mother or owner
• Perceiving whether something is hot or cold
Our neural networks perform these complicated computations.
Humans have now been able to build a computation system that can perform
in a manner similar to our nervous system. These are called artificial neural
networks (ANNs).
An artificial neuron or Perceptron Model:
Perceptron is a supervised learning algorithm that classifies the data into
two categories, thus it is a binary classifier.
The Linear Unit:
So let's begin with the fundamental component of a neural network: the
individual neuron. As a diagram, a neuron (or unit) with one input looks like:
The Linear Unit: y=wx+b
• The input is x. Its connection to the neuron has a weight which is w.
Whenever a value flows through a connection, you multiply the value
by the connection's weight. For the input x, what reaches the neuron
is w * x.
• A neural network "learns" by modifying its weights.
• The b is a special kind of weight we call the bias. The bias doesn't have
any input data associated with it; instead, we put a 1 in the diagram so
that the value that reaches the neuron is just b (since 1 * b = b).
• The bias enables the neuron to modify the output independently of its
inputs.
• The y is the value the neuron ultimately outputs. To get the output, the
neuron sums up all the values it receives through its connections. This
neuron's activation is y = w * x + b, or as a formula y=wx+b.
• Does the formula y=wx+b look familiar?
It's an equation of a line! It's the slope-intercept equation, where w is
the slope and b is the y-intercept.
Example - The Linear Unit as a Model:
Though individual neurons will usually only function as part of a larger
network, it's often useful to start with a single neuron model as a baseline.
Single neuron models are linear models.
Let's think about how this might work on a dataset like 80 Cereals. Training
a model with 'sugars' (grams of sugars per serving) as input
and 'calories' (calories per serving) as output, we might find the bias
is b=90 and the weight is w=2.5. We could estimate the calorie content of a
cereal with 5 grams of sugar per serving like this:
Computing with the linear unit.
And, checking against our formula, we have calories=2.5×5+90=102.5, just
like we expect.
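The same computation as a tiny sketch (the weight and bias values come from the example above):

```python
# The linear unit from the cereal example: y = w * x + b.
w, b = 2.5, 90.0          # weight and bias found by training (per the text)
sugars = 5.0              # grams of sugar per serving
calories = w * sugars + b
print(calories)           # 102.5
```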
Multiple Inputs
The 80 Cereals dataset has many more features than just 'sugars'. What if we
wanted to expand our model to include things like fiber or protein content?
That's easy enough. We can just add more input connections to the neuron,
one for each additional feature. To find the output, we multiply each
input by its connection weight and then add them all together.
A linear unit with three inputs.
The formula for this neuron would be y=w0x0+w1x1+w2x2+b. A linear unit
with two inputs will fit a plane, and a unit with more inputs than that will fit
a hyperplane.
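In code, this multi-input linear unit is just a dot product plus a bias; a minimal sketch with invented weights:

```python
import numpy as np

# A linear unit with three inputs: y = w0*x0 + w1*x1 + w2*x2 + b.
w = np.array([2.5, -1.0, 0.5])   # one weight per feature (illustrative values)
b = 90.0
x = np.array([5.0, 2.0, 3.0])    # e.g. sugars, fiber, protein per serving
y = np.dot(w, x) + b
print(y)
```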
What is an Artificial Neural Network Model?
A multi-layer, fully-connected neural network containing an input layer,
hidden layers, and an output layer is called an artificial neural network or
ANN.
The image below represents an ANN.
If you look carefully, you will notice that each node in one layer is connected
to every node in the layer next to it.
As you increase the number of hidden layers, the network becomes deeper.
Let’s see what an individual node in the output or hidden layer looks like.
As you can see, the node gets many inputs. It sums all of its weighted inputs
and passes the result on as output, via a non-linear activation function.
This output of the node becomes the input of the node in the next layer.
An important thing to note here is that the signal will always move from left
to right. Once all the nodes have followed the procedure, the final output will
be given.
Here’s what the equation of a node looks like:
y = f(w1x1 + w2x2 + … + wnxn + b)
In the above equation, b is the bias and f is the non-linear activation
function. The bias is an input to all nodes and always carries the value 1.
Bias helps to shift the activation function result to the left or right.
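Putting this together, here is a minimal sketch of a single node that sums its weighted inputs plus the bias and applies a sigmoid activation (the input and weight values are invented):

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation squashing the sum into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def node_output(x, w, b):
    # y = f(w1*x1 + ... + wn*xn + b), the node equation from the text.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.8, 0.1, -0.4])   # illustrative weights
b = 0.2                          # bias (its input is fixed at 1)
print(node_output(x, w, b))
```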
Glossary of Artificial Neural Network Model
Let’s look at the basic terms you should know when it comes to an artificial
neural network model.
Inputs
The data first fed into the neural network from the source is called the input.
Its goal is to give the network data to make a decision or prediction about
the information fed into it. The neural network model usually accepts real-
valued sets of inputs, each of which is fed into a neuron in the input layer.
Training set
The inputs for which you already know the correct outputs are called the
training set. It is used to train the neural network so that it learns the
correct result for a given input set.
Outputs
Every neural network generates an output as a prediction or decision about
the data fed into it. This output is a set of real values or a Boolean decision.
Only the neurons in the output layer generate the output value.
Neuron
Also known as a perceptron, a neuron is the basic unit of a neural network.
It accepts an input value and generates an output based on it.
As discussed before, every neuron receives a part of the input and passes it
through the non-linear activation function to the node in the next layer.
These activation functions can be tanh, sigmoid, or ReLU. The non-linearity
of these functions is what helps in training the network.
Weight space
Every neuron has a numeric weight. When it delivers input to another node,
that input is multiplied by the weight and totalled with the others to
generate an output. Neural networks are trained by making small changes to
these weights. This fine-tuning helps determine the correct set of weights
and biases that would generate the best outcome. This is where
backpropagation comes in.
Least Square method:
Least squares is a commonly used method in regression analysis for estimating
the unknown parameters by creating a model which will minimize the sum of
squared errors between the observed data and the predicted data.
Basically, it is one of the most widely used curve-fitting methods, and it
works by making the sum of squared errors as small as possible. It helps you
draw a line of best fit through your data points.
Finding the Line of Best Fit Using Least Square Regression
Given any collection of pairs of numbers and the corresponding scatter graph,
the line of best fit is the straight line that you can draw through the scatter
points to best represent the relationship between them. So, back to our
equation of the straight line, we have:
Y=mX+c
Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of slope, y-intercept, and substitute them
in the equation along with the values of independent variable X, to determine
the values of dependent variable Y. Let’s assume that we have ‘n’ data points;
then we can calculate the slope using the scary-looking formula below:
m = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²)
Then, the y-intercept is calculated using the formula:
c = (ΣY − m·ΣX) / n
Lastly, we substitute these values in the final equation Y = mX + c.
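A short sketch of these formulas on an invented set of (X, Y) pairs:

```python
import numpy as np

# Made-up data points for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
n = len(X)

# Slope and intercept from the least-squares formulas above.
m = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X**2) - np.sum(X)**2)
c = (np.sum(Y) - m * np.sum(X)) / n
print(f"line of best fit: Y = {m:.3f} * X + {c:.3f}")
```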
Gradient based Strategies:
Learning Rate Decay
Learning rate decay is a technique for training modern neural networks. It
starts training the network with a large learning rate and then slowly
reduces (decays) it until a local minimum is reached. It is empirically
observed to help both optimization and generalization.
How does it work?
Algorithm converging with a constant learning rate (noisy, shown in blue).
Algorithm converging with a learning rate that decays over time (less noisy,
shown in green).
In the first image, where we have a constant learning rate, the steps taken
by our algorithm while iterating towards the minimum are so noisy that after
a certain number of iterations it seems to wander around the minimum
without actually converging.
In the second image, where the learning rate decreases over time (the green
line), learning is still relatively fast initially because the learning rate
is large, but as we approach the minimum the learning rate gets smaller and
smaller, so we end up oscillating in a tighter region around the minimum
rather than wandering far away from it.
Learning rate decay (common method):
“ α=(1/(1+decayRate×epochNumber))*α0 ”
1 epoch : 1 pass through data
α : learning rate (current iteration)
α0 : Initial learning rate
decayRate : hyper-parameter for the method
Let’s take a rough example for the above method for better intuition :
Suppose we have α0 = 0.2 and decayRate = 1; then for each epoch we can
examine the fall in the learning rate α:
Epoch 1: alpha 0.1
Epoch 2: alpha 0.067
Epoch 3: alpha 0.05
Epoch 4: alpha 0.04
This is a typical (commonly used) method of applying learning rate decay
when training neural networks.
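The same schedule in code, reproducing the epoch-by-epoch values above:

```python
# Learning rate decay: alpha = alpha0 / (1 + decayRate * epochNumber).
alpha0 = 0.2
decay_rate = 1.0

for epoch in range(1, 5):
    alpha = alpha0 / (1.0 + decay_rate * epoch)
    print(f"Epoch {epoch}: alpha {alpha:.3f}")
# Epoch 1: alpha 0.100, Epoch 2: alpha 0.067, Epoch 3: alpha 0.050, ...
```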
Gradient Descent
Gradient descent is an optimization algorithm.
It is technically referred to as a first-order optimization algorithm as it
explicitly makes use of the first-order derivative of the target objective
function.
First-order methods rely on gradient information to help direct the search for
a minimum …
The first-order derivative, or simply the “derivative,” is the rate of change or
slope of the target function at a specific point, e.g. for a specific input.
If the target function takes multiple input variables, it is referred to as a
multivariate function and the input variables can be thought of as a vector.
In turn, the derivative of a multivariate target function may also be taken as
a vector and is referred to generally as the “gradient.”
• Gradient: First-order derivative for a multivariate objective function.
The derivative or the gradient points in the direction of the steepest ascent
of the target function for a specific input.
Gradient descent refers to a minimization optimization algorithm that
follows the negative of the gradient downhill of the target function to locate
the minimum of the function.
The gradient descent algorithm requires a target function that is being
optimized and the derivative function for the objective function. The target
function f() returns a score for a given set of inputs, and the derivative
function f'() gives the derivative of the target function for a given set of
inputs.
The gradient descent algorithm requires a starting point (x) in the problem,
such as a randomly selected point in the input space.
The derivative is then calculated and a step is taken in the input space that
is expected to result in a downhill movement in the target function, assuming
we are minimizing the target function.
A downhill movement is made by first calculating how far to move in the
input space, calculated as the step size (called alpha or the learning rate)
multiplied by the gradient. This is then subtracted from the current point,
ensuring we move against the gradient, or down the target function.
• x = x − step_size * f'(x)
The steeper the objective function at a given point, the larger the magnitude
of the gradient and, in turn, the larger the step taken in the search space. The
size of the step taken is scaled using a step size hyperparameter.
• Step Size (alpha): Hyperparameter that controls how far to move in the
search space against the gradient each iteration of the algorithm, also called
the learning rate.
If the step size is too small, the movement in the search space will be small
and the search will take a long time. If the step size is too large, the search
may bounce around the search space and skip over the optima.
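As a concrete sketch, here is gradient descent on the simple target function f(x) = x², whose derivative is f'(x) = 2x (the starting point and step size are chosen arbitrarily):

```python
# Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x.
def f(x):
    return x ** 2

def f_prime(x):
    return 2.0 * x

x = 3.0            # randomly selected starting point
step_size = 0.1    # learning rate (alpha)

for i in range(50):
    x = x - step_size * f_prime(x)   # move against the gradient

print(x, f(x))     # x approaches the minimum at 0
```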
Now that we are familiar with the gradient descent optimization algorithm,
let’s take a look at momentum.
Momentum-Based Learning
Gradient descent with momentum usually works much faster than standard
gradient descent. The basic idea of gradient descent with momentum is to
calculate an exponentially weighted average of your gradients and then use
that average, rather than the raw gradient, to update your weights.
How does it work?
Consider an example where we are trying to optimize a cost function that has
contours like the one below and the red dot denotes the local optima
(minimum) location.
We start gradient descent from point ‘A’ and end up at point ‘B’, on the
other side of the ellipse, after one iteration. Another step of gradient
descent may then end at point ‘C’. With each iteration of gradient descent,
oscillating up and down, we step towards the local optimum. If we use a
higher learning rate, the frequency of this vertical oscillation is greater.
The vertical oscillation therefore slows our gradient descent and prevents
us from using a much higher learning rate.
By taking exponentially weighted averages of the dW and db values, we
average the oscillations in the vertical direction towards zero, since they
point in both (positive and negative) directions. In the horizontal
direction, however, all the derivatives point to the right, so the average in
the horizontal direction will still be quite large. This enables our
algorithm to take a more direct path to the local optimum and to damp out
the vertical oscillations. Because of this, the algorithm reaches the local
optimum in fewer iterations.
Implementation
We use dW and db to update our parameters W and b during the backward
propagation as follows:
W = W − learning rate * dW
b = b − learning rate * db
In momentum we take the exponentially weighted averages of dW and db,
instead of using dW and db independently for each epoch.
VdW = β * VdW + (1 − β) * dW
Vdb = β * Vdb + (1 − β) * db
Where beta ‘β’ is a different hyperparameter called momentum, ranging
from 0 to 1. To calculate the new weighted average, it sets the weight
between the average of previous values and the current value.
We’ll update our parameters after calculating the exponentially weighted
averages.
W = W − learning rate * VdW
b = b − learning rate * Vdb
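Here is a minimal sketch of this update rule for one parameter vector; the gradient dW is a stand-in for what backpropagation would supply:

```python
import numpy as np

def momentum_step(W, VdW, dW, beta=0.9, learning_rate=0.01):
    # Exponentially weighted average of the gradients ...
    VdW = beta * VdW + (1.0 - beta) * dW
    # ... then update the weights with the averaged gradient.
    W = W - learning_rate * VdW
    return W, VdW

W = np.array([0.5, -0.3])
VdW = np.zeros_like(W)        # velocity starts at zero
dW = np.array([0.1, -0.2])    # stand-in gradient from backpropagation
W, VdW = momentum_step(W, VdW, dW)
print(W, VdW)
```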
How to choose Beta?
• The momentum (beta) must be higher to smooth out the update
because we give more weight to the past gradients.
• Using the default value β = 0.9 is suggested, but it can be tuned
between 0.8 and 0.999 if needed.
• Momentum takes past gradients into account so as to smooth out the
gradient steps. It can be implemented with batch gradient descent,
mini-batch gradient descent, or stochastic gradient descent.
Parameter-Specific Learning Rates (Or) Adagrad
Adagrad stands for Adaptive Gradient Optimizer. Earlier optimizers like
gradient descent, stochastic gradient descent, and mini-batch SGD were all
used to reduce the loss function with respect to the weights. The weight
updating formula is as follows:
w(new) = w(old) − η × (∂L/∂w(old))
Based on iterations, this formula can be written as:
w(t) = w(t−1) − η × (∂L/∂w(t−1))
where
w(t) = value of w at the current iteration, w(t−1) = value of w at the
previous iteration, and η = learning rate.
In SGD and mini-batch SGD, the value of η used to be the same for each
weight, or say for each parameter; typically, η = 0.01. But in the Adagrad
optimizer the core idea is that each weight has a different learning rate.
This modification has great importance: in real-world datasets, some
features are sparse (for example, in Bag of Words most of the features are
zero, so it’s sparse) and some are dense (most of the features are non-zero),
so keeping the same learning rate for all the weights is not good for
optimization. The weight updating formula for Adagrad looks like:
w(t) = w(t−1) − η′(t) × (∂L/∂w(t−1))
with η′(t) = η / √(α(t) + ε), where α(t) = Σ gᵢ² is the sum of the squares of
that weight’s past gradients gᵢ = ∂L/∂w(i−1) up to iteration t.
Here, η′(t) gives a different learning rate for each weight at each
iteration, η is a constant, and epsilon is a small positive number that
avoids a divide-by-zero error in case α(t) is 0 (before any gradient has
accumulated). Note also that as α(t) keeps growing, η′(t) shrinks towards
zero, which in turn makes w(old) ≈ w(new) and leads to very slow
convergence.
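Following these formulas, here is a minimal per-parameter Adagrad sketch (the gradient values are stand-ins for what backpropagation would supply):

```python
import numpy as np

def adagrad_step(w, alpha, grad, eta=0.01, eps=1e-8):
    # alpha(t): per-weight running sum of squared gradients.
    alpha = alpha + grad ** 2
    # eta'(t) = eta / sqrt(alpha(t) + eps): a different rate per weight.
    eta_eff = eta / np.sqrt(alpha + eps)
    w = w - eta_eff * grad
    return w, alpha

w = np.array([0.5, -0.3])
alpha = np.zeros_like(w)               # nothing accumulated yet
for grad in [np.array([0.1, 0.0]),     # stand-in gradients; the second
             np.array([0.2, -0.1])]:   # weight is "sparser"
    w, alpha = adagrad_step(w, alpha, grad)
print(w, alpha)
```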
Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
What is Gradient Clipping and how does it work?
Gradient clipping involves capping the error derivatives before
propagating them back through the network. The capped gradients are
used to update the weights hence resulting in smaller weights. The
gradients are capped by scaling and clipping. Gradient clipping involves
forcing the gradients to a certain number when they go above or
below a defined threshold.
Types of Clipping techniques
Gradient clipping can be applied in two common ways:
• Clipping by value
• Clipping by norm
Let’s look at the differences between the two.
Gradient Clipping-by-value
Clipping the gradient by value involves defining a minimum and a
maximum threshold. If the gradient goes above the maximum value it is
capped to the defined maximum. Similarly, if the gradient goes below the
minimum it is capped to the stated minimum value.
Gradient Clipping-by-norm
Clipping the gradient by norm ensures that the gradient of every weight is
clipped such that its norm won’t be above the specified value.
“…a norm is a function that accepts as input a vector from our vector space V
and spits out a real number that tells us how big that vector is. ”
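Both clipping techniques in a short NumPy sketch (the thresholds are arbitrary illustration values):

```python
import numpy as np

def clip_by_value(grad, min_val=-1.0, max_val=1.0):
    # Cap every gradient component to the [min_val, max_val] range.
    return np.clip(grad, min_val, max_val)

def clip_by_norm(grad, max_norm=1.0):
    # Rescale the whole gradient vector so its norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])            # norm 5.0
print(clip_by_value(g))              # [ 1. -1.]
print(clip_by_norm(g, max_norm=1.0)) # [ 0.6 -0.8], norm 1.0
```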
Second-Order Derivatives:
The second order derivative is used to get an idea of the shape of the graph
of a given function: whether it curves upwards or downwards. This is based
on concavity. The concavity of a function's graph is classified into two
types: concave up and concave down.
The second derivative is the first derivative of the first derivative. In general
the n+1st derivative (with respect to x) of a (differentiable) function of x, is
the first derivative of the nth derivative. This goes all the way ‘down’ to the
zeroth derivative which is the function itself.
The second derivative is the slope of the first derivative.
Normally, the second derivative of a given function relates to the concavity
of its graph. If the second-order derivative value is positive, then the
graph of the function is concave up. If the second-order derivative value is
negative, then the graph of the function is concave down.
We already know that the second derivative of a function helps determine
local maximum (maxima), local minimum (minima), and inflexion point values.
At a critical point x (where f’(x) = 0), these can be identified with the
help of the conditions below:
• If f”(x) < 0, then the function f(x) has a local maximum at x.
• If f”(x) > 0, then the function f(x) has a local minimum at x.
• If f”(x) = 0, then it is not possible to conclude anything about the point x,
a possible inflexion point.
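As a quick check of these conditions, here is a small SymPy sketch; the function f(x) = x³ − 3x is just an invented example:

```python
import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                 # example function

f1 = sp.diff(f, x)             # first derivative: 3x^2 - 3
f2 = sp.diff(f, x, 2)          # second derivative: 6x

for point in sp.solve(f1, x):  # critical points: x = -1, x = 1
    curvature = f2.subs(x, point)
    kind = "maximum" if curvature < 0 else "minimum"
    print(f"x = {point}: f'' = {curvature} -> local {kind}")
```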
Relation between first and second derivative
The relation between 1st and 2nd derivative is the same as between a
function and its 1st derivative. The first derivative is the rate of change. The
second derivative is the rate of change of the rate of change.
A common practical application is if you measure distance/time, the
first derivative will be velocity (speed) and the second the acceleration.
You can test whether an extremum found from the first derivative is a
maximum or a minimum by plugging its x value into the second derivative. If
the resulting value is less than 0, you have a maximum; if it is greater than
0, you have a minimum. The zeros of the second derivative mark candidate
inflection points of the concavity (whether the graph faces up or down). You
can test these points by plugging in a value slightly higher and slightly
lower than the zero; if there is a sign change, it is an inflection point.