Principal component analysis
High dimensional data in data analysis?
Word embeddings in NLP
Brain activity
Challenges?
Visualize
Group in relevant clusters
Difficult with high dimensional data!
A classical dimension reduction approach: Principal Component Analysis
Dimension reduction
Dimension reduction without loss of information?
Scientific question
How can we reduce the dimension while still separating the observations?
Possible answer: Principal Component Analysis
Dimension reduction
Main features of Principal Component Analysis (PCA)
It preserves the global structure of the data.
It maps all the clusters as a whole.
Potential applications: noise filtering, feature extraction, stock market prediction, and gene data analysis.
Refresher on Linear Algebra
Vectors of R^p
R^p is the set of vectors with p components.
For example, X = (1, −3, 4)^T is a 3-component vector; we can also say that X belongs to R^3.
Concept of basis
The family (X1, ..., Xp) is a basis of R^p if each vector of R^p can be expressed in a unique way as a linear combination of X1, ..., Xp.
Example 1
((1, 0)^T, (0, 1)^T) is a basis of R^2.
Indeed, every X = (x1, x2)^T can be expressed in a unique way as
X = x1 · (1, 0)^T + x2 · (0, 1)^T
Example with X = (2, 3)^T: X = 2 · (1, 0)^T + 3 · (0, 1)^T.
Example 2
((1, 1)^T, (1, −1)^T) is a basis of R^2.
Indeed, every X = (x1, x2)^T can be expressed in a unique way as
X = ((x1 + x2)/2) · (1, 1)^T + ((x1 − x2)/2) · (1, −1)^T
Example with X = (3, 2)^T:
X = 2.5 · (1, 1)^T + 0.5 · (1, −1)^T
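The coordinates in the new basis can also be recovered numerically; a minimal NumPy sketch for the basis and vector of Example 2:

```python
import numpy as np

# Columns of B are the basis vectors (1, 1)^T and (1, -1)^T
B = np.array([[1.0,  1.0],
              [1.0, -1.0]])
X = np.array([3.0, 2.0])

# Solve B c = X for the coordinates c of X in this basis
c = np.linalg.solve(B, X)
print(c)   # [2.5 0.5], matching X = 2.5 * (1, 1)^T + 0.5 * (1, -1)^T
```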
Matrices
A matrix with p rows and p columns is a square array of real numbers.
Such a matrix maps vectors of R^p to vectors of R^p.
In the case p = 2, it can be interpreted as a linear transformation of the plane.
Given a matrix M, the associated transformation may not be so simple to identify!
What is the transformation associated with M = [[2, 1], [1, 2]]?
Y = M · X, with X = (x1, x2)^T and Y = (y1, y2)^T, means
y1 = 2x1 + x2
y2 = x1 + 2x2
The transformation is explicit, but its geometric interpretation is not so clear.
Simpler with a smart change of coordinates?
Eigenvalues and eigenvectors
Let A ∈ M_d(R).
The vector X ∈ R^d \ {0} is said to be an eigenvector of the matrix A associated with the eigenvalue λ if AX = λX.
Example 3
Let A = [[2, 1], [1, 2]] and X = (1, 1)^T.
Since AX = (3, 3)^T = 3X, X is an eigenvector of A with associated eigenvalue 3.
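The same eigenpair can be checked numerically with NumPy; a minimal sketch (note that np.linalg.eig returns unit-norm eigenvectors, so (1, 1)^T appears rescaled):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
X = np.array([1.0, 1.0])

print(A @ X)                   # [3. 3.] = 3 * X, so X is an eigenvector for eigenvalue 3

vals, vecs = np.linalg.eig(A)  # all eigenpairs; columns of vecs are eigenvectors
print(vals)                    # 3 and 1 (order not guaranteed)
```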
Diagonalizable matrices
The square matrix A with p rows and p columns is said to be diagonalizable if there exists a family (X1, ..., Xp) such that
Condition 1: (X1, ..., Xp) is a basis of R^p
Condition 2: for each i, Xi is an eigenvector of A
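For a diagonalizable matrix, stacking such eigenvectors as the columns of a matrix P gives A = P D P^{-1} with D diagonal; a small numerical check for the matrix A of Example 3:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, P = np.linalg.eig(A)   # columns of P form a basis of eigenvectors of A
D = np.diag(vals)            # diagonal matrix of the eigenvalues

# A is recovered as P D P^{-1}, i.e. A is diagonalizable
print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True
```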
Principal Component Analysis
Principle
How can we perform dimension reduction in a linear way?
Mathematical tool: linear projection onto a low-dimensional space.
How can we find the low-dimensional space H?
Principal component analysis: how does it work?
The k-dimensional space H that we are looking for is spanned by the k eigenvectors u_α associated with the k largest eigenvalues λ_α of the matrix X^T X.
We have several possible choices for the matrix X:
General PCA: X is the raw data matrix.
Centered PCA: X is the centered data matrix; X^T X is then the matrix of empirical covariances.
Normed PCA: X is the centered and normed data matrix; X^T X is then the matrix of empirical correlations.
Projection of the observation o_i on the axis α: its coordinate along the eigenvector u_α.
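A minimal sketch of centered PCA along these lines, assuming the rows of a NumPy array X are the n observations and its columns the d variables (the function name pca_via_eig is illustrative):

```python
import numpy as np

def pca_via_eig(X, k):
    """Centered PCA: project the observations onto the k top eigenvectors of X^T X."""
    Xc = X - X.mean(axis=0)             # center each variable (column)
    C = Xc.T @ Xc                       # d x d matrix whose eigenvectors define the axes
    vals, vecs = np.linalg.eigh(C)      # eigh: symmetric matrix, eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]  # indices of the k largest eigenvalues
    U = vecs[:, order]                  # eigenvectors u_alpha spanning the space H
    return Xc @ U                       # coordinates of each observation on the k axes
```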
In general n ≫ d (number of observations ≫ number of initial variables).
This is why we work with the matrix X^T X, of dimension d × d, rather than XX^T, of dimension n × n.
There are, however, links between these two analyses.
PCA with Python
An example
To illustrate PCA, we consider a dataset containing gene expression profiles for 105 breast tumour samples measured using Swegene Human 27K RAP UniGene188 arrays.
Within this population, one can focus on the expression of GATA3 and XBP1, whose expression is known to correlate with estrogen receptor status.
(Breast cancer cells may be estrogen receptor positive, ER+, or negative, ER−, indicating their capacity to respond to estrogen signalling, which can therefore influence treatment.)
We plot the expression levels of GATA3 and XBP1 against one
another to visualise the data in the two-dimensional space
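A minimal sketch of this plot, using randomly generated placeholder values in place of the real GATA3 and XBP1 expression vectors (the array names gata3 and xbp1 are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the measured expression levels of the two genes
rng = np.random.default_rng(0)
gata3 = rng.normal(size=105)
xbp1 = 0.8 * gata3 + rng.normal(scale=0.5, size=105)

plt.scatter(gata3, xbp1)
plt.xlabel("GATA3 expression")
plt.ylabel("XBP1 expression")
plt.show()
```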
We perform PCA and visualise the result by plotting the original data side-by-side with the transformed data.
Figure: original data versus transformed data.
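A minimal sketch of this step with scikit-learn, continuing the placeholder arrays from the previous sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

data = np.column_stack([gata3, xbp1])   # (n_samples, 2) matrix of the two expression levels
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)   # data expressed in the (PC1, PC2) coordinate system

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(data[:, 0], data[:, 1])
ax1.set_title("Original data")
ax2.scatter(transformed[:, 0], transformed[:, 1])
ax2.set_title("Transformed data")
plt.show()
```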
We have simply rotated the original data, so that the direction of greatest variance aligns with the x-axis, and so forth.
We can find out how much of the variance each of the principal components explains.
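With scikit-learn the explained variance is exposed directly; continuing the previous sketch:

```python
# Fraction of the total variance carried by each principal component
print(pca.explained_variance_ratio_)   # two values summing to 1; PC1 dominates here
```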
PC1 explains the vast majority of the variance in the observations
The dimensionality reduction step of PCA occurs when we
choose to discard the later PCs.
We visualise the data using only PC1.
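Keeping only PC1 amounts to fitting with n_components=1 (or to keeping only the first column of the transformed data); a minimal sketch, still on the same placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca1 = PCA(n_components=1)
scores = pca1.fit_transform(data)   # (n_samples, 1): coordinate of each sample on PC1

# One-dimensional view of the samples along PC1
plt.scatter(scores[:, 0], np.zeros_like(scores[:, 0]))
plt.xlabel("PC1")
plt.yticks([])
plt.show()
```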
Limits of PCA
An example
Principal component analysis is not always appropriate for complex datasets, particularly when dealing with nonlinearities.
To illustrate this, let's consider a simulated expression set containing 8 genes, with 10 timepoints/conditions.
The data can be separated out along a single direction:
The data from time/condition 1 through to time/condition 10 can be ordered.
Intuitively, the data can therefore be represented by a single dimension.
We run PCA as we would normally and visualise the result, plotting the first two PCs (a sketch follows below).
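A minimal sketch of such an experiment, using smooth, time-shifted bumps as a stand-in for the simulated 8-gene, 10-timepoint expression set (the exact simulated profiles of the example are not reproduced):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

n_genes, n_times = 8, 10
t = np.arange(n_times)

# Each column (gene) peaks at a different time: intrinsically one-dimensional, but nonlinear
X = np.exp(-0.5 * ((t[:, None] - np.linspace(0, n_times - 1, n_genes)[None, :]) / 1.5) ** 2)

pcs = PCA(n_components=2).fit_transform(X)   # rows of X are the 10 time points/conditions
plt.scatter(pcs[:, 0], pcs[:, 1])
for i in range(n_times):
    plt.annotate(str(i + 1), (pcs[i, 0], pcs[i, 1]))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```

With profiles of this kind, the first two PCs typically arrange the ten time points along a curved, horseshoe-like arc rather than a straight line.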
We see that the PCA plot has placed the datapoints in a horseshoe shape, with condition/time point 1 very close to condition/time point 10.
From the earlier plots of the gene expression profiles we can see that the relationships between the various genes are not entirely straightforward.
For example, gene 1 is initially correlated with gene 2, then negatively correlated, and finally uncorrelated, whilst no correlation exists between gene 1 and genes 5–8.
These nonlinearities make things difficult for PCA, which, in general, attempts to preserve large pairwise distances, leading to the well-known horseshoe effect.
Pros and cons of PCA
Main advantages of PCA
Simple to implement, no tuning required.
Highly interpretable: we can decide how much variance to preserve using the eigenvalues.
Main drawbacks of PCA
It is a global transform, which may not preserve local structure (clusters).
It is sensitive to outliers.