Module 06
Dimensionality Reduction
Introduction
• The Curse of Dimensionality describes the explosive growth in difficulty as the number of data dimensions increases: the computational effort required to process and/or analyse the data grows exponentially with the dimensionality.
• The term was first introduced by Richard E. Bellman, in the context of dynamic programming, to describe the increase in the volume of Euclidean space associated with adding extra dimensions.
• This phenomenon does not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
• Today, this phenomenon is observed in domains such as numerical analysis, machine learning, data analysis, and data mining.
• An increase in dimensionality means an increase in the number of features used to describe the data.
• In theory, adding dimensions can add more information to the data and thereby improve its quality; in practice, it increases the noise and redundancy encountered during analysis.
Problems
• High-dimensional data is difficult to work with because
✓ Adding more features can increase the noise, and hence the human error
✓ Good estimates cannot be obtained if there are not enough observations
• This causes
✓ Increased running time and hence reduced efficiency
✓ Overfitting, which reduces classification performance
✓ A large number of samples to be recorded, which means more storage cost, more computation cost, and more measurement cost
✓ Complexity in interpretation/modelling
Behavior of Machine Learning Algorithms: Need for Data Points and Accuracy of the Model
• Each feature represents a dimension, and a group of dimensions creates a data point.
• Together these form a feature vector that defines the data point to be used by a machine learning algorithm.
• As the dimensionality increases, the number of data points required for good
performance of any machine learning algorithm increases exponentially.
Let’s say that for a model to perform well, we need at least 10 data points for each
combination of feature values.
✓ If we assume that we have one binary feature, then for its 2¹ unique values (0 and 1) we would need 2¹ × 10 = 20 data points.
✓ For 2 binary features, we would have 2² unique values and need 2² × 10 = 40 data points.
✓ Thus, for k binary features we would need 2ᵏ × 10 data points.
In general, if n samples (observations) are dense enough in one dimension, then in d dimensions we need n^d samples, and n^d grows very quickly as a function of d.
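As a quick illustration of this growth, here is a minimal Python sketch; the "10 data points per combination" figure is simply the assumption from the example above, not a general rule.

# Samples needed if we want 10 data points per combination of
# k binary feature values (2**k combinations in total).
POINTS_PER_COMBINATION = 10

for k in range(1, 11):
    combinations = 2 ** k
    print(f"k = {k:2d} binary features -> {combinations * POINTS_PER_COMBINATION:6d} data points")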
IN GENERAL …
• If we can't solve a problem with a few features, adding more features seems like a good idea.
• But as the dimensionality increases, the number of data samples required for good performance of any machine learning algorithm increases exponentially.
• However, the number of samples available usually stays the same.
• As a result, the method with more features is likely to perform worse rather than better, contrary to expectation.
OVERFITTING
The curse of dimensionality is also known as the Hughes phenomenon or Hughes effect.
• Hughes (1968) concluded in his study that, with a fixed number of training samples, the predictive power of any classifier first increases as the number of dimensions increases, but beyond a certain number of dimensions the performance deteriorates.
[Figure: classification error/accuracy plotted against the number of dimensions, showing the Hughes rise and subsequent deterioration]
• Thus, dimensionality can be considered a "curse" in many algorithms.
• Paradox: if n < d, we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong, because we are then likely to avoid overfitting.
Effect of the Curse of Dimensionality on Distance Functions
• For any point A, let's assume distₘᵢₙ(A) is the minimum distance between A and its nearest neighbor and distₘₐₓ(A) is the maximum distance between A and its farthest neighbor.
• In a d-dimensional space with n random points, as d grows large, distₘᵢₙ(A) ≈ distₘₐₓ(A); in other words, any given pair of points becomes nearly equidistant.
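This effect can be checked empirically with a short sketch like the one below (the sample sizes and dimensions are arbitrary illustrative choices): draw random points, pick a reference point A, and compare its nearest and farthest Euclidean distances as the dimensionality grows.

import numpy as np

rng = np.random.default_rng(0)

# Compare nearest vs. farthest Euclidean distance from one reference point
# as the dimensionality d grows; the ratio approaches 1 in high dimensions.
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))     # 1000 random points in [0, 1]^d
    a = rng.random(d)                  # reference point A
    dists = np.linalg.norm(points - a, axis=1)
    print(f"d = {d:4d}  dist_min/dist_max = {dists.min() / dists.max():.3f}")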
Solutions to Curse of Dimensionality
• One way to reduce the impact of high dimensionality is to use a different distance measure in the vector space. One could explore cosine similarity as a replacement for Euclidean distance, since it can be less affected by higher dimensions. However, whether such a measure is appropriate depends on the problem being solved.
• Other methods involve reducing the number of dimensions. Some of the techniques that can be used are:
1) Forward feature selection: this method involves picking the most useful subset of features from all given features (see the sketch after this list).
2) PCA/t-SNE: although these methods help reduce the number of features, they do not necessarily preserve the class labels and can therefore make the interpretation of results difficult.
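A minimal sketch of forward feature selection, assuming scikit-learn is available; the breast cancer dataset, the logistic regression estimator, and the choice of five features are illustrative assumptions rather than anything prescribed by the slides.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Greedily add features one at a time, keeping the subset that most
# improves cross-validated performance (direction="forward").
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))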
Feature Selection and Feature Extraction
• Every machine learning process depends on feature engineering, which mainly consists of two processes: Feature Selection and Feature Extraction.
• Both feature selection and feature extraction are used for dimensionality reduction, which is key to reducing model complexity and overfitting.
Feature Selection
• "It is a process of automatically or manually identifying and selecting the subset
of most appropriate and relevant features to be used in model building."
• The goal of feature selection is to improve model performance by reducing the number of irrelevant or redundant features that may introduce noise or bias into the model.
• This can result in faster training times, improved accuracy, reduced generalization error caused by the noise of irrelevant features, and more robust models that are less prone to overfitting.
• We need to do feature selection when the dataset contains a large number of features, or when the features are highly correlated, redundant, or irrelevant.
Feature selection techniques:
1. Regularization techniques such as L1 norm regularization
• L1 norm regularization, also known as Lasso regularization, is a common
regularization technique used in feature selection.
• It works by adding a penalty term that encourages the model to select only
the most important features, while reducing the weights of irrelevant or
redundant features to zero.
• L1 norm regularization introduces sparsity into the feature weights,
meaning that only a subset of the features have non-zero weights.
• The other features are effectively ignored by the model, resulting in a form
of automatic feature selection.
• L1 norm (Lasso) regularization can be particularly useful in cases where the dataset contains many features, and some of them are irrelevant or redundant.
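A minimal sketch of L1 (Lasso) feature selection with scikit-learn; the diabetes dataset and the regularization strength alpha are illustrative assumptions, with larger alpha values driving more coefficients to exactly zero.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)      # put all features on a comparable scale

# The L1 penalty shrinks the weights of uninformative features toward zero,
# which acts as an automatic form of feature selection.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Coefficients:", lasso.coef_.round(2))
print("Selected feature indices:", np.flatnonzero(lasso.coef_))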
2. Feature importance technique
• Feature importance techniques use an estimator such as the Random Forest algorithm to fit a model and select features based on the value of an attribute such as feature_importances_.
• The feature_importances_ attribute of the Random Forest estimator provides a score for each feature in the dataset, indicating how important that feature is for making predictions, and can therefore be used to obtain the relative importance of each feature.
• These scores are calculated based on the reduction in impurity (e.g.,
Gini impurity or entropy) achieved by splitting the data on that feature.
• The feature with the highest score is considered the most important,
while features with low scores can be considered less important or even
irrelevant.
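A minimal sketch of this technique using scikit-learn's RandomForestClassifier; the dataset and the number of trees are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

# Fit a forest and rank features by their impurity-based importance scores.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_

ranked = sorted(zip(data.feature_names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name:25s} {score:.3f}")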
[Figure: choosing a feature-selection statistical measure depending on whether the input and output variables are both numerical, both categorical, or one numerical and one categorical]
Feature Extraction
• This method is used to extract, from datasets in formats such as text and images, features in a format supported by machine learning algorithms.
• Feature Extraction transforms arbitrary data, such as text or images, into numerical features that machine learning algorithms can understand.
• It is intended to modify the features and generate new ones (reducing the number of features) by combining the raw/provided features and then discarding the original features.
• Feature Extraction projects a data set with higher dimensionality onto a
smaller number of dimensions. As such it is useful for data visualization,
since a complex data set can be effectively visualized when it is reduced to
two or three dimensions.
Feature Extraction continued…
• A feature extraction pipeline varies a lot depending on the primary data and the algorithm to be used, which makes it difficult to describe abstractly.
• Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Multidimensional Scaling are the most often used methodologies for Feature Extraction.
Conclusion
• Feature selection is a very complicated and vast field of machine learning, and many studies have already been conducted to discover the best methods.
• There is no fixed rule for the best feature selection method. Choosing a method depends on the machine learning engineer, who can combine and innovate approaches to find the best one for a specific problem.
• One should try a variety of model fits on different subsets of features selected through different statistical measures. There are no limits to the ways of creating features.
• Deriving these meaningful features often requires a great deal of extensive exploration that involves expertise, creativity, intuition and time, lots of time. This is the reason why, although automatic feature selection is already here, there has not been as much development in feature extraction.
What Is Principal Component Analysis?
• Principal component analysis, or PCA, is a dimensionality reduction
method that is often used to reduce the dimensionality of large data
sets, by transforming a large set of variables into a smaller one that
still contains most of the information in the large set.
• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and they make analyzing data points much easier and faster for machine learning algorithms, since there are no extraneous variables to process.
• So, to sum up, the idea of PCA is simple — reduce the number of
variables of a data set, while preserving as much information as
possible.
HOW DO YOU DO A
PRINCIPAL COMPONENT ANALYSIS?
1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance
matrix to identify the principal components
4. Create a feature vector to decide which principal components
to keep
5. Recast the data along the principal components axes
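Before stepping through these five stages by hand, here is a minimal end-to-end sketch using scikit-learn's PCA class; the iris dataset and the choice of two components are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)   # Step 1: standardization
pca = PCA(n_components=2).fit(X_std)        # Steps 2-4: covariance, eigenvectors, component selection
X_pca = pca.transform(X_std)                # Step 5: recast the data onto the PC axes

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_pca.shape)        # (150, 2)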
STEP 1: STANDARDIZATION
• The aim of this step is to standardize the range of the continuous initial variables so that
each one of them contributes equally to the analysis.
• More specifically, the reason it is critical to perform standardization prior to PCA is that, if there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results.
• So, transforming the data to comparable scales can prevent this problem.
• Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.
• Once the standardization is done, all the variables will be transformed to the same scale.
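A minimal NumPy sketch of this standardization step (the small data matrix is an illustrative assumption): each variable is transformed as z = (value - mean) / standard deviation.

import numpy as np

# Illustrative data: 5 observations of 3 variables with very different ranges.
X = np.array([[10., 0.2, 1000.],
              [12., 0.5,  800.],
              [ 9., 0.1, 1200.],
              [11., 0.4,  950.],
              [13., 0.3, 1100.]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each variable (column)

print("Column means after standardization:", X_std.mean(axis=0).round(6))
print("Column std devs after standardization:", X_std.std(axis=0).round(6))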
STEP 2: COVARIANCE MATRIX COMPUTATION
• The aim of this step is to understand how the variables of the
input data set are varying from the mean with respect to each
other, or in other words, to see if there is any relationship
between them.
• Because sometimes, variables are highly correlated in such a
way that they contain redundant information. So, in order to
identify these correlations, we compute the covariance matrix.
• e.g. For a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
  Cov(x,x)  Cov(x,y)  Cov(x,z)
  Cov(y,x)  Cov(y,y)  Cov(y,z)
  Cov(z,x)  Cov(z,y)  Cov(z,z)
• Since the covariance of a variable with itself is its variance
(Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom
right) we actually have the variances of each initial variable.
• And since the covariance is commutative (Cov(a,b)=Cov(b,a)),
the entries of the covariance matrix are symmetric with respect
to the main diagonal, which means that the upper and the lower
triangular portions are equal.
• It’s actually the sign of the covariance that matters:
✓ If positive then: the two variables increase or decrease together (correlated)
✓ If negative then: one increases when the other decreases (Inversely correlated)
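A minimal sketch of the covariance computation in NumPy (the random data are an illustrative assumption); rowvar=False tells np.cov that each column is a variable.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # illustrative data: 100 samples, 3 variables
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # Step 1: standardization

cov = np.cov(X_std, rowvar=False)             # 3x3 covariance matrix
print(cov.round(3))
# The diagonal holds the variances; the matrix is symmetric, since Cov(a,b) = Cov(b,a).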
STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS
• Eigenvectors and eigenvalues are the linear algebra concepts that we
need to compute from the covariance matrix in order to determine
the principal components of the data.
• Principal components are new variables that are constructed as
linear combinations or mixtures of the initial variables. These
combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information
within the initial variables is squeezed or compressed into the first
components.
• So, the idea is that 10-dimensional data gives 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum of the remaining information in the second, and so on, until we have something like what is shown in the scree plot on the next slide.
• Organizing information in principal components allows us to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as the new variables.
• Principal components are less
interpretable and don’t have
any real meaning since they
are constructed as linear
combinations of the initial
variables.
• Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of
variance, that is to say, the lines that capture most information
of the data.
• The relationship between variance and information here, is that,
the larger the variance carried by a line, the larger the
dispersion of the data points along it, and the larger the
dispersion along a line, the more information it has.
• To put all this simply, just think of principal components as new
axes that provide the best angle to see and evaluate the data, so
that the differences between the observations are better visible.
How PCA Constructs the Principal Components
[Figure: the first principal component is approximately the line that matches the purple marks]
• The first principal component accounts for the largest possible variance in the data set. Mathematically speaking, it is the line that maximizes the variance, i.e., the average of the squared distances from the projected points to the origin.
• The second principal component is calculated in the same way,
with the condition that it is uncorrelated with (i.e.,
perpendicular to) the first principal component and that it
accounts for the next highest variance.
• This continues until a total of p principal components have been
calculated, equal to the original number of variables.
• It is the eigenvectors and eigenvalues that are behind all the magic explained above: the eigenvectors of the covariance matrix are the directions of the axes where there is the most variance (most information), and these are what we call the Principal Components. The eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of variance carried by each Principal Component.
• By ranking your eigenvectors in order of their eigenvalues, highest to
lowest, you get the principal components in order of significance.
• After having the principal components, to compute the percentage of
variance (information) accounted for by each component, we divide
the eigenvalue of each component by the sum of eigenvalues.
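A minimal NumPy sketch of this step on illustrative data: eigendecompose the covariance matrix with np.linalg.eigh (suitable for symmetric matrices), rank the eigenvectors by their eigenvalues, and compute the percentage of variance each component carries.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # illustrative, correlated data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)            # eigh: for symmetric matrices

order = np.argsort(eigenvalues)[::-1]                      # rank by eigenvalue, largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()                # fraction of variance per component
print("Explained variance ratio:", explained.round(3))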
STEP 4: FEATURE VECTOR
• In this step, what we do is, to choose whether to keep all these
components or discard those of lesser significance (of low
eigenvalues), and form with the remaining ones a matrix of
vectors that we call Feature vector.
• So, the feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep.
• This makes it the first step towards dimensionality reduction,
because if we choose to keep only p eigenvectors (components)
out of n, the final data set will have only p dimensions.
STEP 5: RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES
• In the previous steps, apart from standardization, you do not make any
changes on the data, you just select the principal components and form the
feature vector, but the input data set remains always in terms of the
original axes (i.e., in terms of the initial variables).
• In this last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set: FinalDataSet = FeatureVectorᵀ × StandardizedOriginalDataSetᵀ.
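A minimal NumPy sketch covering steps 4 and 5 together on illustrative data: keep the top-2 eigenvectors as the feature vector and project the standardized data onto the new axes (X_std @ feature_vector is equivalent to the transposed product written above).

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # illustrative correlated data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# Step 4: feature vector = matrix whose columns are the kept eigenvectors.
feature_vector = eigenvectors[:, :2]

# Step 5: recast the data onto the principal component axes.
# Equivalent to (feature_vector.T @ X_std.T).T
X_projected = X_std @ feature_vector
print("Projected data shape:", X_projected.shape)          # (200, 2)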
PCA
• Intuition: find the axis that shows the greatest variation, and project all points onto this axis.
[Figure: data points in the (f1, f2) feature plane with the principal axes e1 and e2 overlaid]
SVD: The mathematical formulation
• Normalize the dataset by moving the origin to the center of the dataset
• Find the eigenvectors of the data (or covariance) matrix
• These define the new space
• Sort the eigenvalues in “goodness” order
[Figure: the same (f1, f2) data after centering, with the new axes e1 and e2]
SVD Cont’d
• Advantages:
• Optimal dimensionality reduction (for linear projections)
• Disadvantages:
• Computationally hard, but this can be improved with random sampling
• Sensitive to outliers and non-linearities
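A minimal NumPy sketch of the SVD formulation on illustrative data: center the dataset, take the SVD, and read the principal directions from the right singular vectors; the squared singular values divided by (n - 1) equal the eigenvalues of the covariance matrix.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # illustrative data

X_centered = X - X.mean(axis=0)                            # move the origin to the center

# Economy-size SVD: rows of Vt are the principal directions (eigenvectors),
# already sorted by decreasing singular value ("goodness" order).
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

eigenvalues = S**2 / (len(X_centered) - 1)                 # eigenvalues of the covariance matrix
print("Principal directions (rows):\n", Vt.round(3))
print("Variance along each direction:", eigenvalues.round(3))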
Applications of Principal Component Analysis
• PCA is mainly used as the dimensionality reduction technique in
various AI applications such as computer vision, image compression,
etc.
• It can also be used for finding hidden patterns if data has high
dimensions. Some fields where PCA is used are Finance, data mining,
Psychology, etc.
Example
• Final covariance matrix
• Finding the final eigenvectors (principal components)
[Figure: worked numerical example showing the covariance matrix and its eigenvectors]