482 LectureNotes Chapter 5

Chapter 5 discusses key machine learning methodologies, focusing on the Bayes Classifier, Support Vector Machines (SVM), and their applications in pattern recognition and classification tasks. It explains the principles of Bayes' Rule for calculating posterior probabilities, the concept of maximizing margins in SVMs for optimal classification, and the importance of structural risk minimization in learning systems. The chapter also highlights the challenges of high-dimensional data and the significance of generalization performance in machine learning models.


Chapter 5

Bayes Classifier – Support Vector Machine – Principal Component Analysis

Introduction
Machine learning (ML) methodology is, in general, an artificial intelligence approach for
building and training a model to recognize the pattern or underlying mapping of a system
from a set of training examples consisting of input and output patterns. Unlike classical
statistics, where inference is drawn from small datasets, machine learning draws inference
from amounts of data far too large to be examined by hand.

In machine learning, pattern recognition is the assignment of some sort of output value (or
label) to a given input value (or instance), according to some specific algorithm. The
approach of using examples to produce the output labels is known as learning
methodology. When an underlying function from inputs to outputs exists, it is referred to
as the target function. The estimate of the target function that is learned, or output, by the
learning algorithm is known as the solution of the learning problem. In the case of
classification, this function is referred to as the decision function.

In the broadest sense, any method that incorporates information from training samples in
the design of a classifier employs learning. Learning tasks can be classified along different
dimensions. One important dimension is the distinction between supervised and
unsupervised learning. In supervised learning a category label for each pattern in the
training set is provided, and the trained system then generalizes to new data samples. In
unsupervised learning, on the other hand, the training data are not labeled; the system
forms clusters or natural groupings of the input patterns based on some measure of
similarity, and the result can then be used to assign output values to new data instances.

The first category is known as pattern classification and the second one as clustering.
Pattern classification is the main focus in this course.
Bayes Classifier
The principle of the Bayes Classifier is to calculate the posterior probability of a given
object from its prior probability via Bayes' Rule, and then assign the object to the class with
the largest posterior probability [1].

First recall Bayes' Rule, in the format

P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}

where

P(Y|X) : posterior, the probability of Y given X

P(X|Y) : likelihood, the probability of X being generated by Y

P(Y) : prior, the probability of Y being selected

P(X) : marginal, the probability of obtaining X
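As a quick numeric illustration of how the rule combines these quantities for a binary label Y in {0, 1} (the priors and likelihoods below are made-up illustrative values, not from the notes):

% A minimal numeric sketch of Bayes' rule and the Bayes classifier.
prior      = [0.7 0.3];                          % P(Y=0), P(Y=1)
likelihood = [0.2 0.9];                          % P(X=x|Y=0), P(X=x|Y=1) for the observed x
marginal   = sum(likelihood .* prior);           % P(X=x)
posterior  = (likelihood .* prior) / marginal;   % P(Y=0|X=x), P(Y=1|X=x)
[~, idx]   = max(posterior);                     % assign to the class with the largest posterior
yhat       = idx - 1                             % predicted label, here 1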

We will start with the simplest case: binary classification, where Y takes values in {0, 1}.

Bayes' rule can be approached by computing either one of the following:

1) The posterior: P(Y=0 | X=x) and P(Y=1 | X=x), or

2) The likelihoods: P(X=x | Y=0) and P(X=x | Y=1), together with the priors P(Y=0) and P(Y=1).

The former reflects a Bayesian approach. The Bayesian approach uses previous beliefs and
observed data (e.g., the random variable X) to determine the probability distribution of
the parameter of interest (e.g., the random variable Y). The probability, according to
Bayesians, is a degree of belief in the parameter of interest taking on a particular value
(e.g., Y = 1), given a particular observation (e.g., X = x). Historically, the difficulty in
this approach lay in determining the posterior distribution. However, more recent
methods such as Markov Chain Monte Carlo (MCMC) allow the Bayesian approach to be
implemented [2].

The latter reflects a Frequentist approach. The Frequentist approach assumes that the
probability distribution (including the mean, variance, etc.) is fixed for the parameter of
interest (e.g., the variable Y, which is not random). The observed data (e.g., the random
variable X) are simply a sampling of a far larger population of possible observations. Thus, a
certain repeatability or frequency is expected in the observed data. If it were possible to
make an infinite number of observations, then the true probability distribution of the
parameter of interest could be found. In general, Frequentists use a technique
called hypothesis testing to compare a null hypothesis (e.g., an assumption that the mean of
the probability distribution is some value \mu_0) to an alternative hypothesis (e.g., that the
mean of the probability distribution is larger than \mu_0) [2]. For more information on
hypothesis testing see [3].

There was some class discussion on which approach should be used, touching on both the
ease of computation and the validity of each approach. A main point brought up in class is
that Frequentists consider X to be a random variable but do not consider Y to be one,
because Y has to take on one of the values from a fixed set (in the above case either 0 or 1,
and there is only one correct label for a given value X = x). Thus, from a Frequentist's
perspective it does not make sense to talk about the probability of Y. This is actually a grey
area, and Bayesians and Frequentists sometimes use each other's approaches, so using
Bayes' rule does not necessarily make you a Bayesian. Overall, the question remains
unresolved.

Three Main Approaches


1. Empirical Risk Minimization: Choose a set of classifiers H (e.g., linear classifiers, neural
networks) and find the h in H that minimizes (some estimate of) the true error, L(h).

2. Regression: Find an estimate \hat{r}(x) of the regression function r(x) = E[Y \mid X = x] = P(Y = 1 \mid X = x) and define

h(x) = 1 if \hat{r}(x) > 1/2, and h(x) = 0 otherwise.

The 1/2 in the expression above is a threshold for the regression prediction output.

In general, regression refers to finding a continuous, real-valued y. The problem here is
more difficult because the output is restricted to a set of discrete label values.

3. Density Estimation: Estimate P(X = x | Y = 0) from the X_i's for which Y_i = 0, estimate
P(X = x | Y = 1) from the X_i's for which Y_i = 1, and let

\hat{P}(Y = 1) = \frac{1}{n} \sum_{i=1}^{n} Y_i.

Define, via Bayes' rule,

\hat{r}(x) = \hat{P}(Y = 1 \mid X = x), and h(x) = 1 if \hat{r}(x) > 1/2, h(x) = 0 otherwise.
It is possible that there may not be enough data to use density estimation, but the main
problem lies with high dimensional spaces, as the estimation results may have a high error
rate and sometimes estimation may be infeasible. The term curse of dimensionality was
coined by Bellman [4] to describe this problem.

As the dimension of the space goes up, the number of data points required for learning
increases exponentially.

To learn more about methods for handling high-dimensional data, see [5].

The third approach is the simplest.
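The sketch below illustrates this third approach in one dimension, assuming Gaussian class-conditional densities; the toy data, variable names and threshold are illustrative, not taken from the notes.

% Sketch of the density-estimation approach (approach 3) in 1-D,
% assuming Gaussian class-conditional densities.
X = [randn(1,50)-2, randn(1,50)+2];        % toy 1-D inputs
Y = [zeros(1,50),   ones(1,50)];           % labels 0 and 1

gauss = @(x,mu,s) exp(-(x-mu).^2 ./ (2*s^2)) ./ (s*sqrt(2*pi));
mu0 = mean(X(Y==0));  s0 = std(X(Y==0));   % estimate P(X=x|Y=0)
mu1 = mean(X(Y==1));  s1 = std(X(Y==1));   % estimate P(X=x|Y=1)
p1  = mean(Y);        p0 = 1 - p1;         % estimated priors

x    = 0.5;                                % a new point to classify
f0   = gauss(x, mu0, s0);  f1 = gauss(x, mu1, s1);
rhat = f1*p1 / (f1*p1 + f0*p0);            % estimated P(Y=1|X=x) via Bayes' rule
h    = double(rhat > 1/2)                  % classify with the 1/2 threshold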


Multi-Class Classification
Generalize to the case where Y takes on k > 2 values.

Theorem: The optimal rule is

h^*(x) = \arg\max_k P(Y = k \mid X = x),

where, by Bayes' rule,

P(Y = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{r} f_r(x)\,\pi_r},

with f_k(x) = P(X = x \mid Y = k) the class-conditional distribution and \pi_k = P(Y = k) the prior of class k.
Examples of Classification

 Face detection in images.


 Medical diagnosis.
 Detecting credit card fraud (fraudulent or legitimate).
 Speech recognition.
 Handwriting recognition.
There are also some interesting reads on Bayes Classification:

 http://esto.nasa.gov/conferences/estc2004/papers/b8p4.pdf (NASA)
 http://www.cmla.ens-cachan.fr/fileadmin/Membres/vachier/Garcia6812.pdf (application to medical images)
 http://www.springerlink.com/content/g221vh5m6744362r/ (Journal of Medical Systems)
Support Vector Machines

The Support Vector Machine is a popular linear classifier. Suppose that we have a data set
with two classes that can be separated by a hyperplane. The Support Vector Machine (SVM)
is a method that gives us the "best" such hyperplane. There are other classifiers that find a
separating hyperplane, such as the Perceptron. However, the output of the Perceptron and
many other algorithms depends on the input parameters, so every run of the Perceptron can
give a different result. The SVM, on the other hand, seeks the separating hyperplane that is
farthest from the data points. This is also known as the max-margin hyperplane.

A series of linear classifiers, in which H2 represents an SVM: the SVM attempts to maximize the
margin, the distance between the closest point in each class and the linear classifier.

Support vector machines (SVMs), also referred to as max-margin classifiers, are learning
systems that use a hypothesis space of linear functions in a high dimensional feature space,
trained with a learning algorithm from optimization theory that implements a learning bias
derived from statistical learning theory. SVMs are kernel machines based on the principle
of structural risk minimization, which are used in applications of regression and
classification; however, they are mostly used as binary classifiers. Although the subject can
be said to have started in the late seventies (Vapnik, 1979), it has recently been receiving
increasing attention from researchers. It is such a powerful method that, in the few years
since its introduction, it has outperformed most other systems in a wide variety of
applications, especially in pattern recognition.

The current standard incarnation of SVM is known as "soft margin" and was proposed by
Corinna Cortes and Vladimir Vapnik [9]. In practice the data is not usually linearly
separable. Although theoretically we can make the data linearly separable by mapping it
into higher dimensions, the issues of how to obtain the mapping and how to avoid
overfitting are still of concern. A more practical approach to classifying non-linearly
separable data is to add some error tolerance to the separating hyperplane between the
two classes, meaning that a data point in class A can cross the separating hyperplane into
class B by a certain specified distance. This more general version of SVM is the so-called
"soft margin" support vector machine, and in practice today it is generally accepted as the
standard form of SVM over the hard-margin case. [10]

Support Vector Machines are motivated by the idea of training linear machines with
margins. This involves preprocessing the data to represent patterns in a high-dimensional
space (generally much higher than the original feature space). Note that, with a suitable
non-linear mapping to a sufficiently high-dimensional space, the data will always be
separable. (p. 263) [62]

A suitable way to describe the interest in SVM can be seen in the following quote. "The
problem which drove the initial development of SVMs occurs in several guises - the bias
variance tradeoff (Geman, Bienenstock and Doursat, 1992), capacity control (Guyon et al.,
1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly
speaking, for a given learning task, with a given finite amount of training data, the best
generalization performance will be achieved if the right balance is struck between the
accuracy attained on that particular training set, and the “capacity” of the machine, that is,
the ability of the machine to learn any training set without error. A machine with too much
capacity is like a botanist with a photographic memory who, when presented with a new
tree, concludes that it is not a tree because it has a different number of leaves from
anything she has seen before; a machine with too little capacity is like the botanist’s lazy
brother, who declares that if it's green, it's a tree. Neither can generalize well. The
exploration and formalization of these concepts has resulted in one of the shining peaks of
the theory of statistical learning (Vapnik, 1979)." (from "A Tutorial on Support Vector
Machines for Pattern Recognition")
The Support Vector Method for Solving Real-World Problems

No matter whether the training data are linearly-separable or not, the linear boundary
produced by any of the versions of SVM is calculated using only a small fraction of the
training data rather than using all of the training data points. This is much like the
difference between the median and the mean.
SVM can also be considered a special case of Tikhonov regularization. A special property of
SVMs is that they simultaneously minimize the empirical classification error and maximize
the geometric margin; hence they are also known as maximum-margin classifiers. The key
features of SVM are the use of kernels, the absence of local minima, the sparseness of the
solution (i.e., few training data points are needed to construct the linear decision boundary)
and the capacity control obtained by optimizing the margin (Shawe-Taylor and Cristianini,
2004).

Another key feature of SVM, as discussed below, is the use of slack variables to control the
amount of tolerable misclassification on the training data; this forms the soft-margin SVM
and can serve to improve the generalization of SVM to new data. SVM has been used
successfully in many real-world problems:

- Pattern Recognition, such as Face Detection, Face Verification, Object Recognition,
Handwritten Character/Digit Recognition, Speaker/Speech Recognition, Image Retrieval,
Prediction;

- Text and Hypertext categorization;

- Image classification;

- Bioinformatics, such as Protein classification, Cancer classification;

Please refer to here for more applications.


Structural Risk Minimization and VC Dimension

Linear learning machines are the fundamental formulations of SVMs. The objective of the
linear learning machine is to find the linear function that minimizes the generalization
error from a set of functions which can approximate the underlying mapping between the
input and output data. Consider a learning machine that implements linear functions in the
plane as decision rules,

f(x, w, b) = \mathrm{sign}(w^\top x + b),

with n given training data points (x_i, y_i), i = 1, \dots, n, with input values x_i and output
values y_i \in \{-1, +1\}. The empirical error is defined as

R_{emp}(w, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\,\bigl| y_i - f(x_i, w, b) \bigr|,

where each term in the sum is 0 for a correctly classified point and 1 for a misclassified one.

The generalization error can be expressed as

R(w, b) = \int \frac{1}{2}\,\bigl| y - f(x, w, b) \bigr| \, dP(x, y),

which measures the error over all input/output patterns that are generated from the
underlying generator of the data, characterized by the probability distribution P(x, y),
which is considered to be unknown. According to statistical learning theory, the
generalization (test) error can be upper bounded in terms of the training error and a
confidence term: with probability at least 1 - \eta,

R \le R_{emp} + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}.

The term on the left-hand side is the generalization error. The first term on the right-hand
side is the empirical error calculated from the training data, and the second term is called
the VC confidence, which is associated with the VC dimension h of the learning machine. The
VC dimension describes the complexity of the learning system. The relationship between
these three quantities is illustrated in the figure below:

The relation between expected risk, empirical risk and VC confidence in SVMs.
Thus, even though we do not know the underlying distribution from which the data points
are generated, it is possible to minimize the upper bound on the generalization error in
place of the generalization error itself; that is, one can minimize the expression on the
right-hand side of the inequality above.

Unlike the principle of Empirical Risk Minimization (ERM) applied in neural networks,
which aims to minimize the training error, SVMs implement Structural Risk Minimization
(SRM) in their formulations. The SRM principle takes both the training error and the
complexity of the model into account, and seeks the minimum of the sum of these two terms
as a trade-off solution (as shown in the figure above) by searching over a nested set of
functions of increasing complexity.


Infinitely many Perceptron solutions: out of many, how do we choose?

With the Perceptron, there can be infinitely many separating hyperplanes for which the
training error is zero. The question is: among all these possible solutions, which one is the
best? Intuitively, a good separation is achieved by the hyperplane that has the largest
distance to the nearest training data points of any class (the so-called functional margin),
since in general the larger the margin, the lower the generalization error of the classifier.
This makes sense because at test time more points will be observed, and they may lie closer
to the other class, so the safest choice of hyperplane is the one farthest from both classes.

One of the great things about SVM is that not only does it have solid theoretical guarantees,
but it also works very well in practice.

To summarize

What we mean by margin is the distance between the hyperplane and the closest point in a
class.

If the data are linearly separable, then there exist infinitely many separating hyperplanes.
Among all of these, the best choice is the hyperplane that is furthest from both classes, that
is, the hyperplane with maximum margin. Our goal is therefore to find, among all possible
separating hyperplanes, the one with maximum margin. If such a hyperplane exists, it is
known as the maximum-margin hyperplane, and the linear classifier it defines is known as a
maximum-margin classifier or, equivalently, the perceptron of optimal stability.
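As an illustrative sketch of these ideas (not the formulation developed in these notes), a linear soft-margin SVM can be trained by sub-gradient descent on the regularized hinge loss; the toy data, variable names and step-size schedule below are assumptions.

% Minimal sketch: train a linear soft-margin SVM by sub-gradient descent on
% (lambda/2)*||w||^2 + (1/n)*sum_i max(0, 1 - y_i*(w'*x_i + b)).
n = 100;  d = 2;
X = [randn(d,n/2)-1, randn(d,n/2)+1];         % toy 2-D data, one point per column
y = [-ones(1,n/2), ones(1,n/2)];              % labels in {-1,+1}

lambda = 0.01;  T = 1000;
w = zeros(d,1);  b = 0;
for t = 1:T
    eta  = 1/(lambda*t);                      % decreasing step size
    marg = y .* (w'*X + b);                   % functional margins, 1-by-n
    viol = marg < 1;                          % margin violators
    gw   = lambda*w - (X(:,viol)*y(viol)')/n; % sub-gradient w.r.t. w
    gb   = -sum(y(viol))/n;                   % sub-gradient w.r.t. b
    w    = w - eta*gw;
    b    = b - eta*gb;
end
yhat = sign(w'*X + b);                        % predicted labels on the training data

Roughly speaking, the training points that still sit on or inside the margin at the end of training play the role of the support vectors: only they contribute to the sub-gradient updates of w and b.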
Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method of dimensionality reduction/feature
extraction that transforms the data from a D-dimensional space into a new coordinate
system of dimension d, where d <= D (the worst case would be d = D). The goal is to
preserve as much of the variance in the original data as possible when switching the
coordinate systems. Given data on D variables, the hope is that the data points will lie mainly
in a linear subspace of dimension lower than D. In practice, the data will usually not lie
precisely in some lower-dimensional subspace.

The new variables that form the new coordinate system are called principal
components (PCs), denoted here by u_1, u_2, \dots, u_D. The principal components form a
basis for the data. Since the PCs are orthogonal linear transformations of the original
variables, there are at most D PCs. Normally, not all D PCs are used; rather, a subset of d
PCs, u_1, u_2, \dots, u_d, is used to approximate the space spanned by the original data
points x_1, x_2, \dots, x_n. We can choose d based on what percentage of the variance
of the original data we would like to maintain.

The first PC, u_1, is called the first principal component and has the maximum variance, so it
accounts for the most significant variation in the data. The second PC, u_2, is called the second
principal component and has the second-highest variance, and so on, down to the last PC, u_D,
which has the minimum variance.

Let u = w^\top x be the projection of the data point x on the direction w, where w is of length
one.

Then \mathrm{Var}(u) = \mathrm{Var}(w^\top x) = w^\top S w, where

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^\top

is the sample covariance matrix.

We would like to find the w which gives us maximum variation:

\max_{w} \; w^\top S w \quad \text{subject to} \quad w^\top w = 1.

Note: we require the constraint w^\top w = 1 because if there were no constraint on the
length of w, there would be no upper bound (the variance could be made arbitrarily large by
scaling w). With the constraint, it is the direction, and not the length, that maximizes the
variance.

Lagrange Multiplier

Before we proceed, we should review Lagrange multipliers.

"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point
where the red line tangentially touches a blue contour is our solution." [Lagrange
Multipliers, Wikipedia]

Lagrange multipliers are used to find the maximum or minimum of a
function f(x, y) subject to a constraint g(x, y) = c.

We define a new constant \lambda, called a Lagrange multiplier, and we form the Lagrangian

L(x, y, \lambda) = f(x, y) - \lambda\,(g(x, y) - c).

If (x^*, y^*) is a maximum of f subject to the constraint, there exists \lambda^* such that
(x^*, y^*, \lambda^*) is a stationary point of L (all partial derivatives are 0).

In addition, (x^*, y^*) is a point at which the contours of f and g touch but do not cross. At this
point, the tangents of f and g are parallel, or equivalently the gradients of f and g are parallel,
such that

\nabla_{x,y} f = \lambda\, \nabla_{x,y} g,

where

\nabla_{x,y} f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right), the gradient of f,

\nabla_{x,y} g = \left( \frac{\partial g}{\partial x}, \frac{\partial g}{\partial y} \right), the gradient of g.

Example:

Suppose we want to maximize a function f(x, y) subject to a constraint g(x, y) = c. We can
apply the Lagrange multiplier method to find the maximum value of f: form the Lagrangian,
set its partial derivatives equal to zero, and solve the resulting system of equations. Solving
the system generally yields more than one stationary point; to determine which one is the
maximum, we substitute each stationary point into f(x, y) and keep the one that gives the
biggest value.
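As a concrete illustration (a standard textbook example, not necessarily the one used in class), consider maximizing f(x, y) = x + y subject to the constraint g(x, y) = x^2 + y^2 = 1:

L(x, y, \lambda) = x + y - \lambda\,(x^2 + y^2 - 1)

\frac{\partial L}{\partial x} = 1 - 2\lambda x = 0, \qquad
\frac{\partial L}{\partial y} = 1 - 2\lambda y = 0, \qquad
\frac{\partial L}{\partial \lambda} = -(x^2 + y^2 - 1) = 0.

The first two equations give x = y = 1/(2\lambda); substituting into the constraint gives the two stationary points (1/\sqrt{2}, 1/\sqrt{2}) and (-1/\sqrt{2}, -1/\sqrt{2}). Substituting each into f shows the maximum is at (1/\sqrt{2}, 1/\sqrt{2}), where f = \sqrt{2}.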
Determining w:

Use the Lagrange multiplier method to obtain

L(w, \lambda) = w^\top S w - \lambda\,(w^\top w - 1),

where \lambda is a constant (the Lagrange multiplier).

Take the derivative with respect to w and set it to zero:

\frac{\partial L}{\partial w} = 2 S w - 2 \lambda w = 0,

to obtain

S w = \lambda w.

Rearranged, this says that w is an eigenvector of S and \lambda is the corresponding
eigenvalue. Since S w = \lambda w and w^\top w = 1, we can write

\mathrm{Var}(w^\top x) = w^\top S w = \lambda\, w^\top w = \lambda.

Note that the PCs decompose the total variance in the data in the following way:

\sum_{i=1}^{D} \mathrm{Var}(u_i) = \sum_{i=1}^{D} \lambda_i = \mathrm{trace}(S) (S is a covariance matrix, and therefore it is symmetric).

As can be seen from the above expressions, \mathrm{Var}(w^\top x) = \lambda, where \lambda
is an eigenvalue of the sample covariance matrix S and w is its corresponding
eigenvector. So \mathrm{Var}(w^\top x) is maximized when \lambda is the maximum eigenvalue of S, and the first
principal component (PC) is the corresponding eigenvector. Each successive PC can be
generated in the above manner by taking the eigenvectors of S [11] that correspond to the
eigenvalues

\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D,

such that \mathrm{Var}(u_1) \ge \mathrm{Var}(u_2) \ge \dots \ge \mathrm{Var}(u_D).

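The derivation above can be checked numerically. The sketch below (illustrative variable names, toy data) computes the principal components as eigenvectors of the sample covariance matrix and confirms that the variance of the first projection equals the largest eigenvalue, and that the eigenvalues sum to the total variance:

% Sketch: PCA via the eigendecomposition of the sample covariance matrix.
n = 500;  D = 5;
X = randn(D, n);                           % toy data, one point per column
X = X - repmat(mean(X,2), 1, n);           % center the data
S = (X*X')/n;                              % sample covariance matrix, D-by-D

[W, Lam]     = eig(S);                     % columns of W are eigenvectors of S
[lam, order] = sort(diag(Lam), 'descend'); % eigenvalues, largest first
W            = W(:, order);                % reorder the eigenvectors to match

w1     = W(:,1);                           % first principal direction
var_u1 = w1' * S * w1;                     % equals lam(1), the largest eigenvalue
total  = trace(S);                         % equals sum(lam), the total variance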
Alternative Derivation
Another way of looking at PCA is to consider it as a projection from a higher, D-dimensional
space to a lower, d-dimensional subspace that minimizes the squared reconstruction error.
The squared reconstruction error is the difference between the original data set X and the
new data set \hat{X} obtained by first projecting the original data into the lower d-dimensional
subspace and then projecting it back into the original, higher D-dimensional space. Since
information is (normally) lost by compressing the original data into a lower d-dimensional
subspace, the new data set will (normally) differ from the original data even though both live
in the higher D-dimensional space. The reconstruction error is computed as shown below.
Reconstruction Error

e = \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2

Minimize Reconstruction Error

Suppose \bar{x} = 0, where \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i (i.e., the data have been centered).

Let \hat{x}_i = U y_i, where U is a D by d matrix with d orthogonal unit vectors as columns
(U^\top U = I_d) and y_i is the d-dimensional code of x_i.

Fit the model to the data and minimize the reconstruction error:

\min_{U,\, y_1, \dots, y_n} \; \sum_{i=1}^{n} \| x_i - U y_i \|^2.

Differentiate with respect to y_i:

\frac{\partial e}{\partial y_i} = -2\, U^\top (x_i - U y_i) = 0.

Since U y_i is a linear combination of the columns of U, which are independent (orthogonal
to each other), we can conclude that

U^\top U y_i = U^\top x_i, or equivalently, y_i = U^\top x_i,

so we can rewrite the reconstruction error as

\sum_{i=1}^{n} \| x_i - U U^\top x_i \|^2.

Find the orthogonal matrix U:

\min_{U} \; \sum_{i=1}^{n} \| x_i - U U^\top x_i \|^2 \quad \text{subject to} \quad U^\top U = I_d.

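A small numeric check of the reconstruction-error view (illustrative names and toy data):

% Sketch: rank-d PCA reconstruction and its squared error.
n = 200;  D = 5;  d = 2;
X = randn(D, n);  X = X - repmat(mean(X,2), 1, n);   % toy, centered data

[U, ~, ~] = svd(X, 'econ');
Ud   = U(:, 1:d);                    % D-by-d matrix with orthonormal columns
Y    = Ud' * X;                      % encodings y_i = U^T x_i
Xhat = Ud * Y;                       % reconstructions U U^T x_i
err  = sum(sum((X - Xhat).^2));      % squared reconstruction error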

PCA Implementation Using Singular Value Decomposition

A unique solution can be obtained by finding the Singular Value Decomposition (SVD) of X:

X = U \Sigma V^\top.

For each rank d, U_d consists of the first d columns of U. Also, the covariance matrix can be
expressed as

S = \frac{1}{n} X X^\top = \frac{1}{n} U \Sigma^2 U^\top

(assuming the data have been centered), so the columns of U are the eigenvectors of the
sample covariance matrix.

Simply put, by subtracting the mean of each of the data point features and then applying
SVD, one can find the principal components:

\tilde{X} = X - M, \qquad \tilde{X} = U \Sigma V^\top,

where X is a D by n matrix of data points and the features of each data point form a
column in X, and M is a D by n matrix with identical columns, each equal to the mean \bar{x}
of the x_i's. Note that this arrangement of the data points is a convention; indeed, in Matlab
and in conventional statistics, the transpose of the matrices in the above formulae is used.

As the matrix \Sigma from the SVD has the singular values arranged from largest to smallest, the
corresponding columns of U will be such that the first column of U is the first principal
component, the second column is the second principal component, and so on.
Examples
Note that in the Matlab code in the examples below, the mean was not subtracted from the
datapoints before performing SVD. This is what was shown in class. However, to properly
perform PCA, the mean should be subtracted from the datapoints.
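A minimal sketch of the proper procedure (assuming X is the data matrix loaded in the examples below and d is the number of PCs to keep): subtract the column mean before the SVD, so that the principal directions describe variation around the mean rather than the mean itself.

% Sketch: centre the data, then take the SVD to get the principal components.
mu = mean(X, 2);                        % D-by-1 mean over the data points
Xc = X - repmat(mu, 1, size(X,2));      % subtract the mean from every column
[U S V] = svd(Xc);                      % principal directions are the columns of U
Y  = U(:,1:d)' * Xc;                    % d-dimensional encodings of the data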
Example 1

Consider a matrix of data points with the dimensions 560 by 1965. 560 is the number of
elements in each column. Each column is a vector representation of a 20x28 grayscale pixel
image of a face (see image below) and there is a total of 1965 different images of faces.
Each of the images is corrupted by noise, but the noise can be removed by projecting the
data onto the first few principal components and then back to the original space, using as
many dimensions as one likes (e.g., 2, 3, 4 or 5). The corresponding Matlab commands are
shown below:
An example of the face images used in Example 1 with noise removed. Source: [12]

>> % start with a 560 by 1965 matrix X that contains the data points
>> load('noisy.mat');
>>
>> % set the colors to grayscale
>> colormap gray
>>
>> % show image in column 10 by reshaping column 10 into a 20 by 28 matrix
>> imagesc(reshape(X(:,10),20,28)')
>>
>> % perform SVD; if the matrix X is full rank, we obtain 560 PCs
>> [U S V] = svd(X);
>>
>> % project X onto the first ten principal components
>> Y_pca = U(:, 1:10)'*X;
>>
>> % reconstruct X (project back into the original space) from the first ten PCs
>> X_hat = U(:, 1:10)*Y_pca;
>>
>> % show image in column 10 of X_hat which is now a 560 by 1965 matrix
>> imagesc(reshape(X_hat(:,10),20,28)')

The reason why the noise is removed in the reconstructed image is that the noise does not
create a major variation in any single direction of the original data. Hence, the first ten PCs
taken from the matrix U are not in the direction of the noise, so reconstructing the image
using only the first ten PCs removes the noise.
Example 2

Consider a matrix of data points with the dimensions 64 by 400. 64 is the number of
elements in each column. Each column is a vector representation of an 8x8 grayscale pixel
image of either a handwritten number 2 or a handwritten number 3 (see image below), and
there are a total of 400 different images, where the first 200 images show a handwritten
number 2 and the last 200 images show a handwritten number 3.

An example of the handwritten number images used in Example 2. Source: [13]

The corresponding Matlab commands for performing PCA on the data points are shown
below:

>> % start with a 64 by 400 matrix X that contains the data points
>> load 2_3.mat;
>>
>> % set the colors to grayscale
>> colormap gray
>>
>> % show image in column 2 by reshaping column 2 into a 8 by 8 matrix
>> imagesc(reshape(X(:,2),8,8))
>>
>> % perform SVD; if the matrix X is full rank, we obtain 64 PCs
>> [U S V] = svd(X);
>>
>> % project data down onto the first two PCs
>> Y = U(:,1:2)'*X;
>>
>> % show Y as an image (can see the change in the first PC at column 200,
>> % when the handwritten number changes from 2 to 3)
>> imagesc(Y)
>>
>> % perform PCA using the Matlab built-in function (do not use for the assignment)
>> % also note that due to the Matlab convention, the transpose of X is used
>> [COEFF, Y] = princomp(X');
>>
>> % again, use the first two PCs
>> Y = Y(:,1:2);
>>
>> % use plotdigits to show the distribution of images on the first two PCs
>> images = reshape(X, 8, 8, 400);
>> plotdigits(images, Y, .1, 1);

Using the plotdigits function in Matlab clearly illustrates that the first PC captured the
differences between the numbers 2 and 3, as they are projected onto different regions of the
axis for the first PC. Also, the second PC captured the tilt of the handwritten numbers, as
numbers tilted to the left or right were projected onto different regions of the axis for the
second PC.
Example 3

(Not discussed in class) In the news recently was a story that captures some of the ideas
behind PCA. Over the past two years, Scott Golder and Michael Macy, researchers from
Cornell University, collected 509 million Twitter messages from 2.4 million users in 84
different countries. The data they used were words collected at various times of day and
they classified the data into two different categories: positive emotion words and negative
emotion words. Then, they were able to study this new data to evaluate subjects' moods at
different times of day, while the subjects were in different parts of the world. They found
that the subjects generally exhibited positive emotions in the mornings and late evenings,
and negative emotions mid-day. They were able to "project their data onto a smaller
dimensional space" using PCA. Their paper, "Diurnal and Seasonal Mood Vary with Work,
Sleep, and Daylength Across Diverse Cultures," is available in the journal Science. [14]

Assumptions underlying Principal Component Analysis can be found here [15].


Example 4

(Not discussed in class) A somewhat well-known learning rule in the field of neural
networks, called Oja's rule, can be used to train networks of neurons to compute the
principal component directions of data sets. [16] This rule is formulated as

\Delta w = \eta\, y\,(x - y\,w),

where \Delta w is the neuron weight change, \eta is the learning rate, y is the neuron output
given the current input, x is the current input and w is the current neuron weight. This
learning rule shares some similarities with another method for calculating principal
components: power iteration. The basic algorithm for power iteration (adapted from
Wikipedia: [17]) is shown below.

b ← a random vector
do c times:
    tmp ← A b        (a vector of length m)
    b ← tmp / ||tmp||        (normalize)
return b

Comparing this with the neuron learning rule, we can see that the term \eta\, y\, x is very
similar to the multiplication step tmp ← A b in the power iteration method, and identical if
the neuron model is assumed to be linear (y = w^\top x) and the learning rate is set to 1.
Additionally, the term -\eta\, y^2 w performs the normalization, serving the same function
as the step b ← tmp / ||tmp|| in the power iteration method.
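A direct MATLAB transcription of the loop above (illustrative; here A is a toy square matrix, standing in for something like a sample covariance matrix):

% Sketch: power iteration for the dominant eigenvector of a square matrix A.
A = [2 1; 1 3];                  % toy symmetric matrix
b = rand(size(A,1), 1);          % a random starting vector
c = 50;                          % number of iterations
for t = 1:c
    tmp = A*b;                   % the multiplication step (a vector of length m)
    b   = tmp / norm(tmp);       % the normalization step
end
% b now approximates the eigenvector of A belonging to the largest eigenvalue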
Observations
Some observations about the PCA were brought up in class:

 PCA assumes that data is on a linear subspace or close to a linear subspace. For non-
linear dimensionality reduction, other techniques are used. Amongst the first proposed
techniques for non-linear dimensionality reduction are Locally Linear Embedding
(LLE) and Isomap. More recent techniques include Maximum Variance Unfolding
(MVU) and t-Distributed Stochastic Neighbor Embedding (t-SNE). Kernel PCAs may also
be used, but they depend on the type of kernel used and generally do not work well in
practice. (Kernels will be covered in more detail later in the course.)

 Finding the number of PCs to use is not straightforward. It requires knowledge about
the intrinsic dimensionality of the data. In practice, a heuristic approach is often
adopted by looking at the eigenvalues ordered from largest to smallest. If there is a
"dip" in the magnitude of the eigenvalues, the "dip" is used as a cut-off point and only
the large eigenvalues before the "dip" are used. Otherwise, it is possible to add up the
eigenvalues from largest to smallest until a certain percentage value is reached. This
percentage represents the percentage of variance that is preserved when projecting
onto the PCs corresponding to the eigenvalues that have been added together (see the
sketch after this list).
 It is a good idea to normalize the variance of the data before applying PCA. This will
avoid PCA finding PCs in certain directions due to the scaling of the data, rather than the
real variance of the data.

 PCA can be considered an unsupervised approach, since the main direction of
variation is not known beforehand, i.e. it is not completely certain which dimension the
first PC will capture. The PCs found may not correspond to the desired labels for the
data set. There are, however, alternate methods for performing supervised
dimensionality reduction.

 (Not in class) Even though the traditional PCA method does not work well on data set
sets that lie on a non-linear manifold, a revised PCA method, called c-PCA, has been
introduced to improve the stability and convergence of intrinsic dimension estimation.
The approach first finds a minimal cover of the data set (a cover of a set X is a collection
of sets whose union contains X as a subset [18]). Since set covering is an NP-hard
problem, the approach only finds an approximation of the minimal cover, to reduce the
run-time complexity. In each subset of the minimal cover it applies PCA and filters out
the noise in the data. Finally, the global intrinsic dimension can be determined from the
variance results from all the subsets. The algorithm produces robust results. [19]

 (Not in class) While PCA finds the mathematically optimal method (as in minimizing the
squared error), it is sensitive to outliers in the data, which produce large errors that PCA
tries to avoid. It is therefore common practice to remove outliers before computing PCA.
However, in some contexts outliers can be difficult to identify. For example, in data
mining algorithms like correlation clustering, the assignment of points to clusters and
outliers is not known beforehand. A recently proposed generalization of PCA based on
a Weighted PCA increases robustness by assigning different weights to data objects
based on their estimated relevancy. [20]

 (Not in class) Comparison between PCA and LDA: Principal Component Analysis
(PCA) and Linear Discriminant Analysis (LDA) are two commonly used techniques for
data classification and dimensionality reduction. "Linear Discriminant Analysis easily
handles the case where the within-class frequencies are unequal and their
performance has been examined on randomly generated test data. This method
maximizes the ratio of between-class variance to within-class variance in any
particular data set, thereby guaranteeing maximal separability. ... The prime difference
between LDA and PCA is that PCA does more of feature classification and LDA does data
classification. In PCA, the shape and location of the original data sets change when
transformed to a different space, whereas LDA doesn't change the location but only tries
to provide more class separability and draw a decision region between the given
classes. This method also helps to better understand the distribution of the feature
data." [21]
Summary
The PCA algorithm can be summarized in the following steps:

1. Recover the basis: compute \frac{1}{n} X X^\top (for centered data X) and let U = [u_1, \dots, u_d] contain the eigenvectors corresponding to the top d eigenvalues.

2. Encode the training data: Y = U^\top X, where Y is the d by n matrix of encodings.

3. Reconstruct the training data: \hat{X} = U Y = U U^\top X.

4. Encode a test example: y = U^\top x, where y is the d-dimensional encoding of x.

5. Reconstruct a test example: \hat{x} = U y = U U^\top x.