
DD2434 Machine Learning, Advanced Course

Module 7 : high-dimensional data and dimensionality reduction

Aristides Gionis
argioni@[Link]
KTH Royal Institute of Technology
overview of module 7

I high-dimensional data in machine learning

I basic principles of dimensionality reduction

I singular value decomposition (SVD) and eigen-decomposition

I principal component analysis (PCA)

I multidimensional scaling (MDS)

I isomap

2
reading material

I Lee and Verleysen. Nonlinear dimensionality reduction. Springer, 2007

chapters 1, 2, and 4

available online via the KTH library website

3
machine learning today
I at the heart of an ongoing technological revolution and societal transformation

consider:
I autonomous vehicles
I healthcare (medical imaging, diagnostics, drug design, treatment, etc.)
I recommendation systems (netflix, amazon, etc.)
I searching for information on the web or in social media
I intelligent interfaces (personal assistants, face recognition, speech recognition, etc.)
I . . . and many more . . .
4
data is at the center of machine learning

I data is collected everywhere nowadays


– in human activities (mobile devices, traffic, sensors, surveillance data, etc.)
– in science (climate, geological sensing, tracking of animals, biological data, etc.)
– in industrial processes (monitoring of processes, products, materials)
I data can be used to
– visualize and gain insights
– acquire knowledge and make new discoveries
– validate scientific hypotheses
– build models, make predictions, optimize processes

5
data types

I many different types of data


– relational data
– text data
– time series
– sequential data
– spatio-temporal data
– graphs / networks

6
a very common data representation
I data is a set of observations; each observation consists of values over a set of attributes
– observations are also called points, or samples
– attributes are also called dimensions, or variables, or features
I such data is represented by a matrix Y ∈ Rn×d ,
– where n is the number of points and d the number of dimensions

Y = ( y_11  y_12  . . .  y_1d )
    ( y_21  y_22  . . .  y_2d )
    (  ...    ...   . . .   ...  )
    ( y_n1  y_n2  . . .  y_nd )

I sometimes attributes correspond to rows and observations to columns; we make it explicit


I very often data values are not numerical, e.g., categorical
– it often pays off to transform them to numerical
I in some cases the data values are binary; such data are called transactional
7
examples of data that can be represented as a matrix
I text data : points = documents, dimensions = words
– y_ij = {number of times that document i contains word j}
I movie-ratings data : points = users, dimensions = movies
– y_ij = {the rating of user i for movie j}
I purchase data : points = customers, dimensions = products
– y_ij = 1, if customer i has purchased product j, 0 otherwise (binary)
I stock-market time-series data : points = traded commodities, dimensions = time
– y_ij = {the price of commodity i at time j}
– (note that time imposes a natural order on the dimensions)
I social networks : points = individuals, dimensions = individuals
– y_ij = 1, if individuals i and j are friends, 0 otherwise (binary)
– (note that points and dimensions are the same set of individuals)
8
data dimensionality can be very large

I dimensionality = the number of dimensions of the data

I consider the previous examples


– tens/hundreds of thousands, or millions, of words, movies, products, individuals, ...
I working with high-dimensional data can be very challenging

I the curse of dimensionality


– phenomena/challenges that appear when working with high-dimensional data

9
the curse of dimensionality

a term that refers collectively to a number of different phenomena and challenges

I the volume of the “host space” of the data increases exponentially with the dimension
– as a result, high-dimensional data tend to be very sparse
– also, optimization by exhaustive enumeration is not possible
I certain computational tasks become very challenging
– e.g., for nearest-neighbor search, it is difficult to do better than exhaustive search
I a large number of dimensions may be irrelevant
– e.g., may not have any predictive power in a classification task
I visualizing high-dimensional data is difficult

I the geometry of high-dimensional data becomes counter-intuitive

10
surprising properties of high-dimensional data

1. “spiky” (but convex) hypercubes

I volume of d-dimensional sphere: V_s(r) = π^{d/2} r^d / Γ(1 + d/2)

I volume of d-dimensional cube: V_c(r) = (2r)^d
I it follows that

lim_{d→∞} V_s(r) / V_c(r) = 0

I an exponential number of cube corners “go out” from the sphere

these corners occupy all the available volume
I e.g., for r = 1/2, V_c(1/2) = 1, but lim_{d→∞} V_s(1/2) = 0
so, the volume of the sphere vanishes when dimensionality increases

11
surprising properties of high-dimensional data

2. volume of a thin spherical shell

I consider the relative hyper-volume of a thin spherical shell, between radii (1 − ε)r and r

( V_s(r) − V_s((1 − ε)r) ) / V_s(r) = 1 − (1 − ε)^d  −→ 1,  as d → ∞

so, when dimensionality increases, the thin shell contains almost all the volume

12
surprising properties of high-dimensional data

3. diagonal of a hypercube

I consider a diagonal of the hypercube v = (±1, . . . , ±1)T

I and the i-th coordinate axis vector e_i = (0, . . . , 0, 1, 0, . . . , 0)T
I the cosine of their angle is

cos⟨v, e_i⟩ = vT e_i / (‖v‖ ‖e_i‖) = ±1/√d  −→  0,  as d → ∞

so, diagonals are nearly orthogonal to all coordinate axes

I we can also show : a random vector is nearly orthogonal to all coordinate axes

13
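The first three properties above are easy to check numerically. The following is a minimal Python sketch (an illustration, not from the slides), assuming numpy and scipy are available; it evaluates the sphere-to-cube volume ratio, the relative volume of a thin shell, and the diagonal-to-axis cosine for increasing d.

import numpy as np
from scipy.special import gamma

d = np.array([1, 2, 5, 10, 20, 50, 100], dtype=float)

# 1. unit-radius sphere vs. enclosing cube of side 2r = 2: the ratio vanishes
V_sphere = np.pi ** (d / 2) / gamma(1 + d / 2)   # V_s(1) = pi^{d/2} / Gamma(1 + d/2)
V_cube = 2.0 ** d                                # V_c(1) = (2r)^d with r = 1
print(V_sphere / V_cube)

# 2. relative volume of a thin spherical shell of width eps*r: tends to 1
eps = 0.01
print(1 - (1 - eps) ** d)

# 3. cosine between a diagonal (+-1, ..., +-1) and a coordinate axis: tends to 0
print(1.0 / np.sqrt(d))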
surprising properties of high-dimensional data

4. properties of the unit sphere

I most vectors lie on the surface of the sphere


I most vectors lie on the “equator”
I most vectors are nearly orthogonal
I most pairs of vectors are at almost the same distance from each other

14
how to cope with high-dimensional data?

I quantify relevance of dimensions (variables), and possibly eliminate irrelevant dimensions


– a.k.a., feature selection
– the idea is applied mainly to supervised learning, where we can measure
the correlation of variables (or subsets of variables) with the target
I explore correlations between dimensions
– often pairs (or larger sets) of dimensions are correlated
– e.g., a subset of dimensions can be explained by a single phenomenon,
which can be quantified by a latent dimension
– in such a case we may want to keep only one of the correlated dimensions,
or we may want to transform the correlated dimensions into a new (smaller) set
of dimensions, which reveal the latent dimensions
– implementation of these ideas is known as dimensionality reduction

15
data may reside in lower-dimensional manifolds

(figures: examples of data lying on linear manifolds and on non-linear manifolds)

16
data may reside in lower-dimensional manifolds

I dimensionality reduction can be used to reveal hidden manifolds,


and represent the data in the latent manifold dimensionality

I motivated by the idea that lower-dimensional manifolds can be either linear or non-linear,
we have developed methods for linear dimensionality reduction
or non-linear dimensionality reduction

17
objectives of dimensionality reduction

I dealing with high-dimensional data is challenging

thus, we need to develop methods to


I embed data in order to reduce their dimensionality
I embed data in order to recover (a potentially smaller number of) latent variables
I estimate the number of latent variables

18
a high-level view of dimensionality reduction

I a data point is represented by a vector y = (y^1 , . . . , y^d )T ∈ Rd

the data is a set of data points Y = {y1 , . . . , yn }
equivalently, the data is represented by a matrix Y = (y1 , . . . , yn ) ∈ Rd×n
(notice that rows represent dimensions and columns represent points)
I the goal of dimensionality reduction is to map the data from Rd to Rk , with k ≪ d
– thus, we want to find a mapping

cod : Rd → Rk ,  y ↦ x = cod(y)

I the inverse mapping should bring the mapped data close to the original data
– to reason about this, we use the inverse mapping

dec : Rk → Rd ,  x ↦ y = dec(x)

19
linear vs. non-linear dimensionality reduction

I a linear dimensionality reduction method uses linear mappings cod and dec

I a non-linear dimensionality reduction method uses non-linear mappings cod and dec

20
dimensionality reduction criterion

I to formalize the dimensionality-reduction problem, we need to define an optimization


criterion, which captures our intuition about the quality of the dimensionality-reduction
mapping
I a commonly-used criterion is mean square error

E_cod,dec = Ey[ ‖y − dec(cod(y))‖₂² ]

I other criteria exist, e.g.,


– maximize the variance of the data on the embedded space
– make the latent variables independent
– ...

21
dimensionality reduction — general scheme

a dimensionality-reduction method has the following components

I a model to specify our data type and the mappings cod and dec
I a criterion to be optimized
I an algorithm to be used to optimize the criterion

the goal is to find optimal mappings cod and dec, within the specified model,
which optimize the desired criterion

22
categorization of dimensionality-reduction methods

I linear vs. non-linear model


I continuous vs. discrete model
I integrated vs. external estimation of the dimensionality
I layered vs. standalone embedding
I batch vs. online algorithm
I exact vs. approximate optimization

23
matrix decompositions

I we wish to decompose a matrix A by writing it as a product of two or more matrices:

Am×n = Bm×k Ck×n   or   Am×n = Bm×k Ck×r Dr×n

ideally k and r are much smaller than m and n

I such a decomposition can yield insights about the nature of matrix A


or, it can be useful to solve some problem at hand

24
the singular value decomposition (SVD)

I theorem : any m × n matrix A, with m ≥ n, can be factorized into


 
A = U [ Σ ; 0 ] VT    (where [ Σ ; 0 ] denotes Σ stacked on top of an (m − n) × n zero block)

where U ∈ Rm×m and V ∈ Rn×n are orthonormal (i.e., UT U = Im×m and VT V = In×n )
and Σ ∈ Rn×n is diagonal

Σ = diag(σ1 , . . . , σn ) , where σ1 ≥ . . . ≥ σn ≥ 0

I “skinny” version of SVD : A = U1 ΣVT , where U1 ∈ Rm×n

25
singular values and singular vectors

I the diagonal elements σj of Σ are the singular values of the matrix A

I the columns of U and V are the left singular vectors and right singular vectors, resp.

I equivalent form of SVD

A vj = σj uj and AT uj = σj vj

I outer-product form

A = U1 Σ VT = (u1 · · · un) diag(σ1 , . . . , σn) (v1 · · · vn)T = ∑_{j=1}^{n} σ_j u_j v_jT

which is a sum of rank-one matrices (each term σ_j u_j v_jT is a rank-one matrix)

26
matrix approximation using the singular-value decomposition

I theorem : let A = U Σ VT be the singular-value decomposition of A


let Uk = (u1 . . . uk ), Vk = (v1 . . . vk ), Σk = diag(σ1 . . . σk ), and define

Ak = Uk Σk VkT

then,

min_{rank(B)≤k} ‖A − B‖2 = ‖A − Ak‖2 = σ_{k+1}

I in other words, Ak is the best rank-k approximation of the matrix A

I useful for
– compression
– noise reduction
– and as we will see, dimensionality-reduction

27
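As a small illustration of this theorem (not part of the slides), the numpy sketch below builds Ak from the top-k singular triplets and checks that the spectral-norm error equals the (k+1)-th singular value; the function name best_rank_k and the random test matrix are my own choices.

import numpy as np

def best_rank_k(A, k):
    # A_k = U_k Sigma_k V_k^T, built from the k largest singular values/vectors
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.default_rng(0).normal(size=(8, 5))
k = 2
Ak = best_rank_k(A, k)
s = np.linalg.svd(A, compute_uv=False)
# Eckart-Young: the spectral-norm error equals sigma_{k+1}
assert np.isclose(np.linalg.svd(A - Ak, compute_uv=False)[0], s[k])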
singular-value decomposition and matrix pseudo-inverse

I the Moore–Penrose inverse, or pseudo-inverse, of a matrix A ∈ Rm×n , is a


matrix A+ ∈ Rn×m that generalizes the notion of inverse
I if A has linearly independent columns (and thus, the matrix AT A is invertible),
then A+ is computed by A+ = (AT A)−1 AT
in this case the pseudo-inverse is called left inverse and A+ A = I
I if A = U Σ VT is the singular-value decomposition of A, then

A+ = V Σ+ UT

where, Σ+ is formed from Σ by taking the reciprocal of all the non-zero elements,
leaving all the zeros in place, and making the matrix the right shape

28
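A minimal sketch of computing A+ through the SVD, assuming numpy; it is compared against numpy's built-in pinv, and the left-inverse property is checked on a random tall matrix (which has linearly independent columns almost surely). The function name pinv_svd is illustrative.

import numpy as np

def pinv_svd(A, tol=1e-12):
    # A+ = V Sigma+ U^T, where Sigma+ takes reciprocals of the non-zero singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > tol, 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

A = np.random.default_rng(1).normal(size=(6, 3))   # tall matrix with independent columns
assert np.allclose(pinv_svd(A), np.linalg.pinv(A))
assert np.allclose(pinv_svd(A) @ A, np.eye(3))     # left inverse: A+ A = I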
eigen-decomposition

I let A ∈ Rn×n be a square matrix


I for a non-zero vector v, the pair (λ, v) is an eigenvalue-eigenvector pair, if

Av = λv

I an n × n matrix A has n eigenvalue-eigenvector pairs

I the eigen-decomposition of matrix A is

A = Q Λ Q−1

where Q is an n × n matrix whose i-th column is the eigenvector vi of A, and


Λ is a diagonal matrix whose diagonal elements are the eigenvalues, Λii = λi
I a symmetric matrix has real eigenvalues and orthogonal eigenvectors (Q−1 = QT )
in addition, a symmetric positive semidefinite matrix has non-negative eigenvalues

29
singular value decomposition and eigen-decomposition

I from the singular value decomposition we can write

AT A = V Σ UT U Σ VT = V Σ2 VT

A AT = U Σ VT V Σ UT = U Σ2 UT

I it follows :
– the singular values of A are the nonnegative square roots of the eigenvalues of AT A
– the columns of V (right singular vectors of A) are the eigenvectors of AT A
– the columns of U (left singular vectors of A) are the eigenvectors of A AT

30
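These identities are easy to verify numerically; the following is a small numpy check of my own (the random matrix and tolerances are illustrative).

import numpy as np

A = np.random.default_rng(2).normal(size=(7, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals, evecs = np.linalg.eigh(A.T @ A)             # eigh returns eigenvalues in ascending order

# singular values of A = non-negative square roots of the eigenvalues of A^T A
assert np.allclose(np.sort(s ** 2), evals)

# right singular vectors of A = eigenvectors of A^T A (up to sign)
for j in range(4):
    v = Vt[j]                                      # j-th right singular vector
    w = evecs[:, -(j + 1)]                         # eigenvector of the j-th largest eigenvalue
    assert np.isclose(abs(v @ w), 1.0)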
principal component analysis (PCA)

I one of the most common dimensionality-reduction techniques

I one of the simplest and most robust

I one of the oldest methods, dating back to 1901 [Pearson, 1901]


I has been rediscovered many times

I also known as Karhunen-Loève transformation, or Hotelling transformation

31
recall our general dimensionality-reduction scheme

we need to specify, and for PCA we use:

I a model to define mappings between Rd and Rk


– a linear mapping
I a criterion to be optimized
– mean square error
I an algorithm to be used to optimize the criterion
– singular-value decomposition
note: employing SVD is not a free choice, it is a consequence of the other two choices

32
data model of PCA

I a data point is represented by a vector y = (y^1 , . . . , y^d )T ∈ Rd


the data is a set of data points Y = {y1 , . . . , yn }
equivalently, the data is represented by a matrix Y = (y1 , . . . , yn ) ∈ Rd×n
(notice that rows represent dimensions and columns represent points)
I we assume that each vector y = (y^1 , . . . , y^d )T is the result of applying a linear
transformation W ∈ Rd×k to a latent vector x = (x^1 , . . . , x^k )T

y = Wx

I additionally, we assume that the columns of W are orthonormal, i.e., WT W = Ik×k


that is, W is full rank and its columns are a basis of the space that it spans
(but W WT is not necessarily equal to Id×d )

34
data centering

I we assume that the observed points y and the latent points x are “centered”

Ey [y] = 0d and Ex [x] = 0k

this can be achieved by removing the expectation Ey [y] from each point yi

yi ← yi − Ey [y]

in practice, the expectation Ey [y] is computed by the sample mean

Ey [y] ≈ (1/n) ∑_{i=1}^{n} yi = (1/n) Y 1n

I data centering in matrix notation

Y ← Y − (1/n) Y 1n 1nT

35
PCA linear mapping

I the linear transformation y = W x provides the mapping dec from Rk to Rd

dec : Rk → Rd ,  x ↦ y = dec(x) = W x

I for the mapping from Rd to Rk we use the pseudo inverse W+ = (WT W)−1 WT = WT

cod : Rd → Rk ,  y ↦ x = cod(y) = W+ y = WT y

36
PCA optimization criterion

I mean square error = reconstruction error


EW = Ey[ ‖y − dec(cod(y))‖₂² ] = Ey[ ‖y − W WT y‖₂² ]

recall from previous slide that cod(y) = WT y and dec(x) = W x

recall that while WT W = Ik×k , matrix W WT is not necessarily equal to Id×d

37
optimizing the PCA reconstruction-error criterion
EW = Ey[ ‖y − W WT y‖₂² ]
   = Ey[ (y − W WT y)T (y − W WT y) ]
   = Ey[ yT y − 2 yT W WT y + yT W WT W WT y ]
   = Ey[ yT y − yT W WT y ]
   = Ey[ yT y ] − Ey[ yT W WT y ]

Ey[ yT y ] is constant, so minimizing EW is equivalent to maximizing Ey[ yT W WT y ]

I we use again the sample mean

Ey[ yT W WT y ] ≈ (1/n) ∑_{i=1}^{n} yiT W WT yi = (1/n) tr(YT W WT Y)
38
optimizing the PCA reconstruction-error criterion

I we want to maximize Ey[ yT W WT y ] ≈ (1/n) tr(YT W WT Y)

I consider the singular value decomposition of the data matrix Y = U Σ VT

tr(YT W WT Y) = tr(V Σ UT W WT U Σ VT)

I we can observe that the above expression reaches its maximum when the k columns of W
are collinear with the columns of U that are associated with the k largest singular values
in Σ, i.e.,

W = Uk

I finally, we obtain the projection of point y to the latent point x = WT y = UkT y

39
putting everything together

the PCA method

I input data Y = (y1 , . . . , yn ) ∈ Rd×n

I data centering Y ← Y − (1/n) Y 1n 1nT

I singular value decomposition Y = U Σ VT

I take Uk to be the k columns of U that correspond to the k largest singular values

I map data to k-dimensional space by X = UkT Y

40
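As a later slide notes, this is only a few lines of code in Python. The following numpy sketch of the steps above is an illustration; the function name pca and the random test data are my own choices, not from the course material.

import numpy as np

def pca(Y, k):
    # Y is d x n, columns are data points (as in the slides)
    Y = Y - Y.mean(axis=1, keepdims=True)          # data centering: Y <- Y - (1/n) Y 1 1^T
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    Uk = U[:, :k]                                  # k left singular vectors of the largest sigmas
    X = Uk.T @ Y                                   # k x n latent representation
    return X, Uk, s

rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 200))                     # d = 50 dimensions, n = 200 points
X, Uk, s = pca(Y, k=2)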
explained variance and reconstruction error

I SVD gives a nice connection of explained variance and reconstruction error


through singular values
I the variance of the data explained via the first k principal components is

VW = ∑_{i=1}^{k} σi²

note that the variance of the whole data is VY = ∑_{i=1}^{d} σi²

I on the other hand, the reconstruction error is

EW = ∑_{i=k+1}^{d} σi²

and thus, the reconstruction error is 0, when k = d and W = U

41
selecting the number of latent variables (k)
I selecting the number of latent variables is a mix of science and art
I we can use the concept of explained variance

1. rule of thumb : select k so as to explain at least 85% of the data variance


– select the smallest k for which

VW / VY = ( ∑_{i=1}^{k} σi² ) / ( ∑_{i=1}^{d} σi² ) ≥ 0.85

– (or is it 95%?)

2. plot the singular values and look for a “gap”

– it is often more revealing to plot the singular values in log scale

42
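A sketch of the 85% rule of thumb in Python, assuming the singular values come from the centered data matrix (for example, the s returned by the pca sketch above); the helper name choose_k is hypothetical.

import numpy as np

def choose_k(singular_values, threshold=0.85):
    # smallest k with (sum_{i<=k} sigma_i^2) / (sum_i sigma_i^2) >= threshold
    var = np.asarray(singular_values, dtype=float) ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# example: k = choose_k(s)   where s comes from the PCA sketch on the previous slide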
example : spatial / linguistic data analysis

I data : 9000 dialect words, 500 counties


– points = words, dimensions = counties
– data matrix Y, so that y_ij = 1 if word i appears in county j, and y_ij = 0 otherwise
I apply PCA to this data

I obtain principal component matrix W ∈ Rd×k


– the i-th column of W (the i-th principal component), i = 1, . . . , k, indicates the counties
that explain a significant part of the variation left in the data

example credited to Saara Hyvönen

43
example : spatial / linguistic data analysis
visualizing the first three components

geographical structure of principal components is apparent


note : PCA knows nothing about geography
44
conclusion and discussion

I PCA is a simple and robust dimensionality-reduction method


– a few lines of code in python, Matlab, R
– popular and widely used
I closed-form optimal solution via SVD

I it comes with some rules of thumb to select the number of latent variables

I “layered” method : adds optimal principal components one by one

I main assumption, which can be a limitation in some cases : linear model

45
distance preservation as a means for dimensionality reduction

I in some cases, data can lie in a non-linear manifold inside a high-dimensional space
– manifold not known, but we may be able to compute distances on the manifold
I in other cases, data points have no vector representation, but we can compute distances
– e.g., string edit distance
I in both cases, we can compute distances between pairs of data points

I we want to embed points in a geometric space so that distances are preserved


– distance between the embedded images of two points ≈ their original distance
I if distances are preserved in the low-dimensional embedding, intuitively, the embedding
has captured the structure of the manifold, and the dimensionality reduction is successful

46
distance functions

I given a space Y, a distance d is a function d : Y × Y → R

I if a distance function d satisfies the following properties

– d(a, b) ≥ 0, for all a, b ∈ Y non-negativity


– d(a, b) = 0 if and only if a = b isolation
– d(a, b) = d(b, a), for all a, b ∈ Y symmetry
– d(a, b) ≤ d(a, c) + d(c, b), for all a, b, c ∈ Y triangle inequality

we say that the distance d is a metric


we also say that (Y, d) (the space Y equipped with distance d) is a metric space

47
Minkowski distances

I the Minkowski distances is a family of distance functions in the vector space Rd

I for any two points (vectors) a = (a1 , . . . , ad ) and b = (b1 , . . . , bd ) in Rd the
Minkowski distance Lp between a and b is defined as

d(a, b) = ‖a − b‖p = ( ∑_{i=1}^{d} |a_i − b_i|^p )^{1/p}

I the following special cases of Lp are most commonly used


– p = 1 : city-block or Manhattan distance
– p = 2 : Euclidean distance
– p = ∞ : maximum distance
I the Minkowski distance is a metric, for all p ≥ 1

48
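A small Python sketch of the Lp family (the function name minkowski and the example vectors are illustrative), with p = ∞ handled as the maximum distance.

import numpy as np

def minkowski(a, b, p):
    # L_p distance; p = np.inf gives the maximum distance
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(diff.max()) if np.isinf(p) else float((diff ** p).sum() ** (1.0 / p))

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, 1))        # 5.0   (Manhattan)
print(minkowski(a, b, 2))        # ~3.61 (Euclidean)
print(minkowski(a, b, np.inf))   # 3.0   (maximum)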
similarity functions
I distances are measures of dissimilarity

I often we work with similarity functions, which are measures of similarity

I similarity functions map pairs of points to reals, i.e., sim : Y × Y → R

I commonly-used similarity functions are the dot-product, the cosine similarity,


the Jaccard coefficient, and more . . .
I similarity functions often take values in the range [0, 1], where
– value 1 indicates that two points are identical, and
– value 0 indicates that they are totally dissimilar
I on the other hand, for distance functions
– value 0 indicates that two points are identical, and
– a large value indicates that they are totally dissimilar
– sometimes we normalize distances to be in the interval [0, 1]
49
multidimensional scaling (MDS)

I intuitively : given a matrix of pair-wise distances, can we map the data into a
low-dimensional space so that distances are preserved?

(figure: an example distance matrix =⇒ its mapping to 2D)

example credited to Stephen Borgatti, 1997

50
color perception

[Ekman, 1954]
I study color perception in human vision

I consider 14 colors differing only in their hue

I 31 people rate differences between all pairs of colors, on a five-point scale

I average ratings and convert to distances


– 0 indicates that two colors are identical, 1 indicates that they are totally different

51
color perception

pairwise distance matrix

52
color perception
MDS reproduces the well-known 2-dimensional color circle

53

image-retrieval interface with MDS

example from Rubner et al., 1997

54
2D MDS map of images

example from Rubner et al., 1997


55
multidimensional scaling (MDS)

a family of related techniques

I scaling refers to constructing a configuration of points in a target metric space from


interpoint distance information
I MDS is scaling when the target metric space is the Euclidean space

I classical MDS : the basic MDS formulation

I metric MDS : a generalization of classical MDS, with interpoint distance information,


and optimizing a general stress function
I non-metric MDS : a generalization of classical MDS, with rank information

56
classical MDS with a similarity matrix
I consider dataset Y = {y1 , . . . , yn }, represented as matrix Y = (y1 , . . . , yn ) ∈ Rd×n
I assume that data is not known, instead pairwise similarities are known
I we know the similarity matrix S, such that [S]ij = s_ij = yiT yj (dot-product similarity)
– then, S = YT Y
I as with PCA, for point i we assume a latent vector xi , so that yi = W xi ,
where W defines a linear transformation
– we have Y = W X
– we assume again that the columns of W are orthonormal, i.e., WT W = Ik×k
I we have
S = YT Y = (W X)T (W X) = XT (WT W) X = XT X

and note again that both Y and X are unknown


I the matrix S is called the Gram matrix
57
classical MDS with a similarity matrix

I recap : given similarity matrix S ∈ Rn×n , we want to find latent matrix X ∈ Rk×n ,
such that S = XT X

I the similarity matrix S is symmetric positive semi-definite, so its eigen-decomposition is

S = U Λ U−1 = U Λ UT = (Λ1/2 UT )T (Λ1/2 UT )

the non-negative eigenvalues of S are sorted in descending order in the diagonal of Λ

I it follows that we can take the k × n matrix X to be

X = Ik×n Λ1/2 UT

I the matrix X provides a k-dimensional embedding for each data point


(point i is taken as the i-th column of X)

58
classical MDS with a distance matrix
I in many cases, instead of the similarity matrix S, where [S]ij = s_ij = yiT yj,
we are given as input the pairwise distance matrix D, where [D]ij = d_ij = ‖yi − yj‖2
I we want to perform MDS on the distance matrix D

I we can use the fact


d²_ij = s_ii + s_jj − 2 s_ij   ⇒   s_ij = −(1/2) (d²_ij − s_ii − s_jj)

but note that s_ii = yiT yi and s_jj = yjT yj are unknown

I however, we can estimate s_ij by a “double centering” trick:

– subtract from each entry of the matrix of squared distances [d²_ij] the mean of the
corresponding row and the mean of the corresponding column, and add back the mean of all entries
I in matrix form

S = −(1/2) ( D − (1/n) D 1n 1nT − (1/n) 1n 1nT D + (1/n²) 1n 1nT D 1n 1nT )

where D here denotes the matrix of squared distances [d²_ij]
I so, we can compute S from D, and apply MDS with similarities
59
classical MDS — putting everything together

1. if the similarity matrix S is available go to step 4


2. if the data matrix Y is available, compute S = YT Y, and go to step 4
3. if the distance matrix D is available, compute S by the “double centering” trick
4. compute the eigen-decomposition S = U Λ UT
5. a k-dimensional representation of the data is obtained by X = Ik×n Λ1/2 UT

60
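A minimal numpy sketch of these steps, starting from a distance matrix D; as discussed on the previous slide, the double centering is applied to the squared distances. The function name classical_mds is my own choice.

import numpy as np

def classical_mds(D, k=2):
    # classical MDS from an n x n pairwise distance matrix D
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix I - (1/n) 1 1^T
    S = -0.5 * J @ (D ** 2) @ J                    # Gram matrix via double centering
    eigvals, eigvecs = np.linalg.eigh(S)           # S is symmetric (PSD up to noise)
    order = np.argsort(eigvals)[::-1][:k]          # k largest eigenvalues
    L = np.sqrt(np.maximum(eigvals[order], 0.0))   # clip tiny negative eigenvalues
    return (eigvecs[:, order] * L).T               # k x n embedding, X = Lambda^{1/2} U^T

If the data matrix Y (d x n) is available instead, one can skip the centering trick and eigen-decompose S = Y.T @ Y directly, as in steps 2 and 4 of the slide.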
metric MDS

I a generalization of classical MDS, where we ask to optimize the stress function
(reconstruction error)

EmMDS = ∑_{i<j} w_ij (d_ij − d̂_ij)²

where d_ij are the input distances, and d̂_ij the distances of the reconstructed points

I when w_ij = 1/d_ij , the method is called Sammon’s nonlinear mapping


– the intuition is to give less importance to errors made on large distances

I for this MDS variant we cannot derive a closed-form optimal solution,


instead we search for a solution via a general optimization method,
such as gradient descent

61
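Since there is no closed-form solution here, the sketch below (my own illustration, assuming numpy) minimizes the stress by plain gradient descent; the learning rate and iteration count are arbitrary choices, and passing W with entries 1/d_ij gives Sammon's mapping.

import numpy as np

def metric_mds(D, k=2, W=None, lr=0.01, n_iter=1000, seed=0):
    # gradient descent on  E = sum_{i<j} w_ij (d_ij - dhat_ij)^2
    n = D.shape[0]
    W = np.ones_like(D) if W is None else W
    X = np.random.default_rng(seed).normal(scale=1e-2, size=(n, k))
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]       # pairwise differences, shape (n, n, k)
        dhat = np.linalg.norm(diff, axis=2)        # reconstructed distances
        np.fill_diagonal(dhat, 1.0)                # avoid dividing by zero on the diagonal
        R = W * (D - dhat) / dhat                  # weighted residuals used by the gradient
        np.fill_diagonal(R, 0.0)
        grad = -2.0 * (R[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X                                       # n x k configuration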
non-metric MDS
I in many cases ordinal information is given instead of quantitative distances
– e.g., in a psychology study people may be able to rank a set of concepts, or
– indicate relative similarity between groups of objects, without being able to
– assign quantitative scores
I non-metric MDS asks to optimize a stress function of the form

EnmMDS = ( ∑_{i<j} w_ij ( f(δ_ij) − d̂_ij )² )^{1/2}

where δ_ij are the ordinal proximities

– f is a monotone transformation of the proximities, such that f(δ_ij) ≈ d_ij
– d_ij is the Euclidean distance between the unknown data points yi and yj , and
– d̂_ij the distances of the reconstructed points
I as with metric MDS, we rely on general optimization methods, such as gradient descent
62
conclusion and discussion

I like PCA, classical MDS is a simple and robust dimensionality-reduction method

I it also assumes a linear model

I optimal solution can be derived and efficiently computed via SVD

I MDS is more flexible than PCA, as it only assumes a similarity or distance matrix, and
it is applicable when no vector representation of data is available
I on the other hand, MDS is computationally more intensive when working with matrices
of size n × n
I metric and non-metric MDS are more general methods but they come with the cost
of having to solve a more difficult optimization problem

63
nonlinear dimensionality reduction — motivation

many high-dimensional points lie in lower-dimensional manifolds

example

– consider a set of images, of resolution 128 × 128, depicting an object from
different positions, and different lighting
– an image can be considered as a point in a 16384-dimensional space
– arguably, there are 3 degrees of freedom (specifying position and lighting)
– thus, the set of images lies in a 3-dimensional nonlinear manifold within
a 16384-dimensional space

(figure: Isomap embedding of face images with varying pose and lighting direction; example from Tenenbaum et al., 2000)

64
nonlinear dimensionality reduction

objective

– given high-dimensional points {yi }, possibly lying on a non-linear manifold,

– reconstruct low-dimensional coordinates of the data {xi }, which describe where
the points lie on the manifold

65
isomap

intuition

– Euclidean distance in the original space
may be a poor measure of dissimilarity
between points
– instead use geodesic distance, the distance
between points on the manifold
– geodesic distance more accurately captures
the neighborhood relationships that should
be preserved

(figure: the “Swiss roll” data set; the Euclidean distance between two points in the high-dimensional input space may not reflect their intrinsic similarity, as measured by geodesic distance along the low-dimensional manifold; example from Tenenbaum et al., 2000)

66
calculating geodesic distance

I without knowing the manifold, calculating geodesic distance is impossible

I for nearby points, geodesic distance ≈ Euclidean distance

I for faraway points, approximate geodesic distance by a sequence of “short hops”
between neighboring points

(figure: the “Swiss roll” data set, illustrating how Isomap exploits geodesic paths; the neighborhood graph over the data allows the true geodesic path to be approximated by the shortest path in the graph; example from Tenenbaum et al., 2000)

67
isomap algorithm

given data Y = {yi }

1. compute distances δ_ij = ‖yi − yj‖2 in the original space

2. construct a graph G , where vertices represent points, and each point is connected to
its p nearest points, according to distances δij

3. for each pair of points i and j compute the shortest path distance d_ij from i to j on
the graph G (e.g., using the Floyd-Warshall algorithm)

4. use MDS on the shortest-path distances {d_ij} to compute the low-dimensional
embedding {xi }

68
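A compact numpy sketch of these four steps (an illustration; parameter names are mine, and it reuses the classical_mds function sketched after slide 60).

import numpy as np

def isomap(Y, p=7, k=2):
    # Y is d x n (columns are points); returns a k x n embedding
    n = Y.shape[1]
    delta = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)   # step 1: pairwise distances
    G = np.full((n, n), np.inf)                                     # step 2: p-nearest-neighbour graph
    nn = np.argsort(delta, axis=1)[:, 1:p + 1]
    for i in range(n):
        G[i, nn[i]] = delta[i, nn[i]]
    G = np.minimum(G, G.T)                                          # symmetrize the graph
    np.fill_diagonal(G, 0.0)
    for m in range(n):                                              # step 3: Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    if np.isinf(G).any():
        raise ValueError("neighbourhood graph is disconnected; increase p")
    return classical_mds(G, k)                                      # step 4: MDS on geodesic distances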
conclusion and discussion

I isomap is an algorithm designed to recover low-dimensional non-linear manifolds


within high-dimensional spaces
I it works by constructing a graph, which approximates geodesic distances on the manifold,
and then uses MDS
I the method can be unstable, for example
– sensitive to noise
– sensitive to the number of nearest neighbors used to construct the graph G
– if the graph G is disconnected, the algorithm will fail

69
