Module 7 Slides
Aristides Gionis
argioni@[Link]
KTH Royal Institute of Technology
overview of module 7
I the curse of dimensionality
I linear dimensionality reduction : singular value decomposition (SVD) and principal component analysis (PCA)
I multidimensional scaling (MDS)
I isomap
2
reading material
chapters 1, 2, and 4
3
machine learning today
I at the heart of an ongoing technological revolution and societal transformation
consider:
I autonomous vehicles
I healthcare (medical imaging, diagnostics, drug design, treatment, etc.)
I recommendation systems (netflix, amazon, etc.)
I searching for information on the web or in social media
I intelligent interfaces (personal assistants, face recognition, speech recognition, etc.)
I . . . and many more . . .
4
data is at the center of machine learning
5
data types
6
a very common data representation
I data is a set of observations; each observation consists of values over a set of attributes
– observations are also called points, or samples
– attributes are also called dimensions, or variables, or features
I such data is represented by a matrix Y ∈ R^{n×d},
– where n is the number of points and d the number of dimensions

Y = \begin{pmatrix}
y_{11} & y_{12} & \cdots & y_{1d} \\
y_{21} & y_{22} & \cdots & y_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{nd}
\end{pmatrix}
9
the curse of dimensionality
I the volume of the “host space” of the data grows exponentially with the dimension
– as a result, high-dimensional data tend to be very sparse
– also, optimization by exhaustive enumeration becomes infeasible
I certain computational tasks become very challenging
– e.g., for nearest-neighbor search, it is difficult to do better than exhaustive search
I a large number of dimensions may be irrelevant
– e.g., may not have any predictive power in a classification task
I visualizing high-dimensional data is difficult
10
surprising properties of high-dimensional data
11
surprising properties of high-dimensional data
\frac{V_s(r) - V_s(r-\epsilon)}{V_s(r)} = 1 - \left(1 - \frac{\epsilon}{r}\right)^d \xrightarrow{\,d\to\infty\,} 1

so, when the dimensionality increases, a thin shell of width ε contains almost all the volume of a sphere of radius r
12
surprising properties of high-dimensional data
3. diagonal of a hypercube
I let v = (±1, . . . , ±1) be a diagonal of the d-dimensional hypercube, and e_i a coordinate axis; then

\cos\langle v, e_i \rangle = \frac{v^T e_i}{\|v\|\,\|e_i\|} = \frac{\pm 1}{\sqrt{d}} \xrightarrow{\,d\to\infty\,} 0

so the diagonals become nearly orthogonal to all coordinate axes
I we can also show : a random vector is nearly orthogonal to all coordinate axes
13
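A small numerical sketch (not part of the original slides), assuming numpy; the vector names are illustrative only. It checks the two claims above: the hypercube diagonal has cosine 1/√d with every axis, and a random vector is nearly orthogonal to all axes when d is large.

```python
import numpy as np

# Numerical sketch: in high dimensions a random vector is nearly orthogonal
# to every coordinate axis, and the hypercube diagonal v = (+1, ..., +1)
# has cosine 1/sqrt(d) with each axis e_i.
rng = np.random.default_rng(0)

for d in [2, 10, 100, 10_000]:
    v = rng.standard_normal(d)                  # a random direction in R^d
    cos_axes = np.abs(v) / np.linalg.norm(v)    # |cos(angle)| with each e_i
    diag = np.ones(d)                           # one diagonal of the hypercube
    cos_diag = diag[0] / np.linalg.norm(diag)   # = 1 / sqrt(d)
    print(f"d={d:6d}  max |cos(v, e_i)| = {cos_axes.max():.4f}  "
          f"cos(diag, e_1) = {cos_diag:.4f}")
```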
surprising properties of high-dimensional data
14
how to cope with high-dimensional data?
15
data may reside in lower-dimensional manifolds
(figures: examples of a linear manifold and a non-linear manifold)
16
data may reside in lower-dimensional manifolds
I motivated by the idea that lower-dimensional manifolds can be either linear or non-linear,
methods have been developed for both linear dimensionality reduction
and non-linear dimensionality reduction
17
objectives of dimensionality reduction
18
a high-level view of dimensionality reduction
I each data point y ∈ R^d is mapped to a lower-dimensional representation x ∈ R^k by the mapping

cod : R^d → R^k ,  y ↦ x = cod(y)

I the inverse mapping should bring the mapped data close to the original data
– to reason about this, we use the inverse mapping

dec : R^k → R^d ,  x ↦ y = dec(x)
19
linear vs. non-linear dimensionality reduction
I a linear dimensionality reduction method uses linear mappings cod and dec
I a non-linear dimensionality reduction method uses non-linear mappings cod and dec
20
dimensionality reduction criterion
E_{cod,dec} = E_y\left[ \| y - dec(cod(y)) \|_2^2 \right]
21
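To make the criterion concrete, here is a minimal sketch (assuming numpy; the variable names are illustrative, not the slides' notation) that estimates E_y ‖y − dec(cod(y))‖² on a sample, with cod(y) = Wᵀy and dec(x) = Wx for an arbitrary matrix W with orthonormal columns.

```python
import numpy as np

# Minimal sketch of the criterion: cod maps R^d -> R^k, dec maps R^k -> R^d,
# and we estimate E_y ||y - dec(cod(y))||_2^2 by an average over a sample.
# Here W is an arbitrary d x k matrix with orthonormal columns; PCA (later)
# chooses the best such W.
rng = np.random.default_rng(0)
d, k, n = 10, 3, 500

W, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns
cod = lambda y: W.T @ y                            # encoder  R^d -> R^k
dec = lambda x: W @ x                              # decoder  R^k -> R^d

Y = rng.standard_normal((d, n))                    # sample points as columns
errors = [np.sum((y - dec(cod(y))) ** 2) for y in Y.T]
print("estimated reconstruction error:", np.mean(errors))
```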
dimensionality reduction — general scheme
I a model to specify our data type and the mappings cod and dec
I a criterion to be optimized
I an algorithm to be used to optimize the criterion
the goal is to find, within the specified model, the mappings cod and dec
that optimize the desired criterion
22
categorization of dimensionality-reduction methods
23
matrix decompositions
24
the singular value decomposition (SVD)
I any matrix A ∈ R^{m×n} can be decomposed as

A = U Σ V^T

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthonormal (i.e., U^T U = I_{m×m} and V^T V = I_{n×n})
and Σ ∈ R^{m×n} is diagonal,

Σ = diag(σ_1 , . . . , σ_p) , where σ_1 ≥ . . . ≥ σ_p ≥ 0 and p = min(m, n)
25
singular values and singular vectors
I the columns of U and V are the left singular vectors and right singular vectors, resp.
A vj = σj uj and AT uj = σj vj
I outer-product form

A = U Σ V^T = (u_1 \cdots u_p)\, diag(σ_1 , . . . , σ_p)\, (v_1 \cdots v_p)^T = \sum_{j=1}^{p} σ_j\, u_j v_j^T ,  with p = min(m, n)
26
matrix approximation using the singular-value decomposition
I let A_k = U_k Σ_k V_k^T , where U_k and V_k contain the first k left and right singular vectors,
and Σ_k = diag(σ_1 , . . . , σ_k)

then,

\min_{rank(B) \le k} \| A - B \|_2 = \| A - A_k \|_2 = σ_{k+1}
I useful for
– compression
– noise reduction
– and as we will see, dimensionality-reduction
27
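A short sketch (assuming numpy; the matrix here is a random placeholder) of the rank-k truncation A_k, with a numerical check that the spectral-norm error ‖A − A_k‖₂ equals σ_{k+1}.

```python
import numpy as np

# Sketch: build the rank-k truncation A_k = U_k S_k V_k^T from the SVD and
# check numerically that the spectral-norm error ||A - A_k||_2 equals s[k],
# i.e., sigma_{k+1}.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) Vt
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

spectral_err = np.linalg.norm(A - A_k, ord=2)      # spectral (operator) norm
print(spectral_err, s[k])                          # the two values coincide
```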
singular-value decomposition and matrix pseudo-inverse
A+ = V Σ+ UT
where Σ^+ is formed from Σ by taking the reciprocal of every non-zero singular value,
leaving the zeros in place, and transposing the result (so that Σ^+ ∈ R^{n×m})
28
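A quick sketch (assuming numpy) that builds A⁺ = V Σ⁺ Uᵀ from the SVD and compares it with numpy's built-in pseudo-inverse.

```python
import numpy as np

# Sketch: A+ = V S+ U^T, where S+ holds the reciprocals of the non-zero
# singular values; compare against numpy's built-in pseudo-inverse.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_plus = np.array([1.0 / x if x > 1e-12 else 0.0 for x in s])
A_plus = Vt.T @ np.diag(s_plus) @ U.T              # pseudo-inverse via SVD

print(np.allclose(A_plus, np.linalg.pinv(A)))      # True
```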
eigen-decomposition
I an eigenvector v and eigenvalue λ of a square matrix A satisfy

A v = λ v

I if A is diagonalizable, then

A = Q Λ Q^{-1}

where the columns of Q are the eigenvectors of A and Λ is the diagonal matrix of the corresponding eigenvalues
29
singular value decomposition and eigen-decomposition
A^T A = V Σ^T U^T U Σ V^T = V (Σ^T Σ) V^T
A A^T = U Σ V^T V Σ^T U^T = U (Σ Σ^T) U^T
I it follows :
– the singular values of A are the nonnegative square roots of the eigenvalues of AT A
– the columns of V (right singular vectors of A) are the eigenvectors of AT A
– the columns of U (left singular vectors of A) are the eigenvectors of A AT
30
principal component analysis (PCA)
31
recall our general dimensionality-reduction scheme
32
recall our general dimensionality-reduction scheme
33
data model of PCA
I each observed point y ∈ R^d is modeled as a linear function of a latent point x ∈ R^k

y = W x

where W ∈ R^{d×k} is a matrix with orthonormal columns
34
data centering
I we assume that the observed points y and the latent points x are “centered”
this can be achieved by removing the expectation Ey [y] from each point yi
yi ← yi − Ey [y]
or, in matrix form,

Y ← Y − \frac{1}{n}\, Y\, \mathbf{1}_n \mathbf{1}_n^T
35
PCA linear mapping
dec : R^k → R^d ,  x ↦ y = dec(x) = W x

I for the mapping from R^d to R^k we use the pseudo-inverse W^+ = (W^T W)^{-1} W^T = W^T

cod : R^d → R^k ,  y ↦ x = cod(y) = W^+ y = W^T y
36
PCA optimization criterion
37
optimizing the PCA reconstruction-error criterion
E_W = E_y\left[ \| y - W W^T y \|_2^2 \right]
    = E_y\left[ (y - W W^T y)^T (y - W W^T y) \right]
    = E_y\left[ y^T y - 2\, y^T W W^T y + y^T W W^T W W^T y \right]
    = E_y\left[ y^T y - y^T W W^T y \right]    (using W^T W = I_{k×k})
    = E_y\left[ y^T y \right] - E_y\left[ y^T W W^T y \right]

I so, we want to maximize E_y\left[ y^T W W^T y \right] = \frac{1}{n}\, tr(Y^T W W^T Y)

I writing the SVD of the centered data matrix as Y = U Σ V^T , we have

tr(Y^T W W^T Y) = tr(V Σ U^T W W^T U Σ V^T)

I we can observe that the above expression reaches its maximum when the k columns of W
are collinear with the columns of U that are associated with the k largest singular values
in Σ, i.e.,

W = U_k
39
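A compact sketch of the PCA recipe just derived, assuming numpy: center the data, compute the SVD of the centered data matrix, and take W = U_k. The synthetic data and variable names are illustrative.

```python
import numpy as np

# Sketch of the PCA recipe: center the data, take the SVD of the centered
# data matrix (points as columns), and set W = U_k, the k leading left
# singular vectors; then cod(y) = W^T y and dec(x) = W x.
rng = np.random.default_rng(0)
d, k, n = 20, 3, 1000

# synthetic data that lies near a k-dimensional linear subspace (illustrative)
basis = np.linalg.qr(rng.standard_normal((d, k)))[0]
Y = basis @ rng.standard_normal((k, n)) + 0.05 * rng.standard_normal((d, n))

mean = Y.mean(axis=1, keepdims=True)
Yc = Y - mean                                      # centering: remove E[y]

U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
W = U[:, :k]                                       # principal directions (d x k)

X = W.T @ Yc                                       # latent coordinates (k x n)
Y_hat = W @ X + mean                               # reconstructed points

print("mean squared reconstruction error:",
      np.mean(np.sum((Y - Y_hat) ** 2, axis=0)))
```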
putting everything together
40
explained variance and reconstruction error
I the variance explained by the k latent dimensions (principal components) is

V_W = \sum_{i=1}^{k} σ_i^2

note that the variance of the whole data is V_Y = \sum_{i=1}^{d} σ_i^2
41
selecting the number of latent variables (k)
I selecting the number of latent variables is a mix of science and art
I we can use the concept of explained variance
– e.g., pick the smallest k for which the explained variance reaches 90 % of the total variance
(or is it 95 %?)
42
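A hedged sketch (assuming numpy) of one common rule of thumb: pick the smallest k whose cumulative explained variance exceeds a chosen threshold. The helper name choose_k and the threshold are illustrative, not a standard API.

```python
import numpy as np

# Sketch: choose k as the smallest value whose cumulative explained variance
# sum_{i<=k} sigma_i^2 reaches a target fraction of the total variance.
def choose_k(Y_centered, target=0.90):
    s = np.linalg.svd(Y_centered, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, target)) + 1, explained

rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 300))  # ~rank-5 data
Yc = Y - Y.mean(axis=1, keepdims=True)

k, explained = choose_k(Yc, target=0.90)
print("chosen k:", k)
print("cumulative explained-variance ratios:", np.round(explained[:8], 3))
```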
example : spatial / linguistic data analysis
43
example : spatial / linguistic data analysis
visualizing the first three components
I PCA comes with some rules of thumb to select the number of latent variables
45
distance preservation as a means for dimensionality reduction
I in some cases, data can lie in a non-linear manifold inside a high-dimensional space
– manifold not known, but we may be able to compute distances on the manifold
I in other cases, data points have no vector representation, but we can compute distances
– e.g., string edit distance
I in both cases, we can compute distances between pairs of data points
46
distance functions
47
Minkowski distances
d(a, b) = \| a - b \|_p = \left( \sum_{i=1}^{d} | a_i - b_i |^p \right)^{1/p}
48
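A minimal implementation sketch of the Minkowski distance (assuming numpy), with sanity checks for p = 1, p = 2, and the p → ∞ limit.

```python
import numpy as np

# Sketch of the Minkowski (L_p) distance: p = 2 gives the Euclidean distance,
# p = 1 the Manhattan distance, and p -> infinity the max (Chebyshev) distance.
def minkowski(a, b, p=2.0):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = [0.0, 0.0, 0.0], [1.0, 2.0, 2.0]
print(minkowski(a, b, p=1))                       # 5.0  (Manhattan)
print(minkowski(a, b, p=2))                       # 3.0  (Euclidean)
print(max(abs(x - y) for x, y in zip(a, b)))      # 2.0  (limit p -> infinity)
```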
similarity functions
I distances are measures of dissimilarity
distance matrix =⇒ mapping to 2D
50
color perception
I a study of color perception in human vision [Ekman, 1954]
51
color perception
52
color perception
MDS reproduces the well-known 2-dimensional color circle
53
54
2D MDS map of images
56
classical MDS with a similarity matrix
I consider dataset Y = {y1 , . . . , yn }, represented as matrix Y = (y1 , . . . , yn ) ∈ Rd×n
I assume that the data vectors are not known; instead, pairwise similarities are known
I we know the similarity matrix S, such that [S]_{ij} = s_{ij} = y_i^T y_j (dot-product similarity)
– then, S = Y^T Y
I as with PCA, for point i we assume a latent vector xi , so that yi = W xi ,
where W defines a linear transformation
– we have Y = W X
– we assume again that the columns of W are orthonormal, i.e., WT W = Ik×k
I we have
S = YT Y = (W X)T (W X) = XT (WT W) X = XT X
I recap : given similarity matrix S ∈ R^{n×n} , we want to find latent matrix X ∈ R^{k×n} ,
such that S = X^T X
I taking the eigen-decomposition S = U Λ U^T , with the eigenvalues sorted in decreasing order,
a solution is given by

X = I_{k×n} Λ^{1/2} U^T

i.e., we keep the k largest eigenvalues and the corresponding eigenvectors
58
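A sketch (assuming numpy) of classical MDS from a dot-product similarity matrix, following the recipe above: eigendecompose S and keep the k largest eigenvalues.

```python
import numpy as np

# Sketch of classical MDS from a dot-product similarity matrix S = Y^T Y:
# eigendecompose S = U Lambda U^T and keep the k largest eigenvalues,
# giving X = I_{k x n} Lambda^{1/2} U^T.
def classical_mds_from_similarity(S, k):
    eigvals, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]            # indices of k largest eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None)         # guard against tiny negatives
    return np.diag(np.sqrt(lam)) @ eigvecs[:, idx].T   # latent matrix X (k x n)

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 8))                    # 8 points in R^3 (as columns)
S = Y.T @ Y
X = classical_mds_from_similarity(S, k=3)
print(np.allclose(X.T @ X, S, atol=1e-8))          # True: S is recovered as X^T X
```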
classical MDS with a distance matrix
I in many cases, instead of the similarity matrix S, where [S]_{ij} = s_{ij} = y_i^T y_j ,
we are given as input the pairwise distance matrix D, where [D]_{ij} = d_{ij} = \| y_i - y_j \|_2
I we want to perform MDS on the distance matrix D
60
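The slides above do not spell out how to get from D back to a similarity matrix; one standard route (stated here as an assumption, not as the slides' method) is double centering of the squared distances, after which classical MDS applies as before. A sketch assuming numpy:

```python
import numpy as np

# Sketch (an assumption, not spelled out on the slide): convert squared
# distances to dot products by double centering,
#   S = -1/2 * J D2 J,  J = I - (1/n) 1 1^T,  [D2]_ij = d_ij^2,
# then apply classical MDS to S as before.
def mds_from_distances(D, k):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    S = -0.5 * J @ (D ** 2) @ J                    # Gram (similarity) matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    idx = np.argsort(eigvals)[::-1][:k]
    lam = np.clip(eigvals[idx], 0.0, None)
    return np.diag(np.sqrt(lam)) @ eigvecs[:, idx].T   # embedding (k x n)

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 6))                    # 6 points in R^2 (as columns)
D = np.linalg.norm(Y[:, :, None] - Y[:, None, :], axis=0)  # pairwise distances
X = mds_from_distances(D, k=2)

D_hat = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
print(np.allclose(D, D_hat, atol=1e-8))            # True: distances are preserved
```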
metric MDS
I metric MDS asks to optimize a stress function of the form

E_{mMDS} = \sum_{i<j} w_{ij} \left( d_{ij} - \hat{d}_{ij} \right)^2

where d_ij are the input distances, and d̂_ij the distances of the reconstructed points
61
non-metric MDS
I in many cases ordinal information is given instead of quantitative distances
– e.g., in a psychology study, people may be able to rank a set of concepts, or indicate
relative similarity between groups of objects, without being able to assign quantitative scores
I non-metric MDS asks to optimize a stress function of the form

E_{nmMDS} = \sum_{i<j} w_{ij} \left( f(δ_{ij}) - \hat{d}_{ij} \right)^2

where f is a monotone (order-preserving) transformation of the ordinal dissimilarities δ_ij
I MDS is more flexible than PCA, as it only assumes a similarity or distance matrix, and
it is applicable when no vector representation of data is available
I on the other hand, MDS is computationally more intensive when working with matrices
of size n × n
I metric and non-metric MDS are more general methods but they come with the cost
of having to solve a more difficult optimization problem
63
nonlinear dimensionality reduction — motivation
I many high-dimensional points lie in lower-dimensional manifolds
64
nonlinear dimensionality reduction
objective : find a low-dimensional representation that preserves distances along the manifold,
rather than Euclidean distances in the ambient space
65
isomap
intuition
(figure: the “Swiss roll” data set, illustrating how Isomap exploits geodesic paths for nonlinear
dimensionality reduction; (A) Euclidean vs. geodesic distance between two points on the manifold,
(B) the neighborhood graph G, (C) the two-dimensional embedding recovered by Isomap
[Tenenbaum et al., Science, 2000])
I without knowing the manifold, calculating geodesic distances is impossible
I for nearby points, geodesic distance ≈ Euclidean distance
I for faraway points, approximate the geodesic distance by a sequence of “short hops”
between neighboring points
67
isomap algorithm
1. compute all pairwise distances δ_ij between the input points (e.g., Euclidean distances)
2. construct a graph G, where vertices represent points, and each point is connected to
its p nearest points, according to the distances δ_ij
3. for each pair of points i and j compute the shortest-path distance d_ij from i to j on
the graph G (e.g., using the Floyd-Warshall algorithm)
4. apply MDS on the matrix of shortest-path distances (d_ij) to obtain a low-dimensional embedding
68
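A sketch of the four Isomap steps listed above, assuming numpy and scipy (shortest_path with the Floyd-Warshall method); the Swiss-roll data and parameter values are illustrative, and the neighborhood graph is assumed to be connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

# Sketch of the Isomap steps listed above: pairwise distances -> p-nearest-
# neighbor graph -> shortest-path (approximate geodesic) distances -> MDS.
# Assumes the neighborhood graph is connected.
def isomap(Y, k=2, p=7):
    delta = squareform(pdist(Y))                   # step 1: pairwise distances
    n = delta.shape[0]

    G = np.full((n, n), np.inf)                    # step 2: p-nearest-neighbor graph
    for i in range(n):
        nn = np.argsort(delta[i])[1:p + 1]         # skip the point itself
        G[i, nn] = delta[i, nn]
    G = np.minimum(G, G.T)                         # symmetrize (inf = no edge)

    D = shortest_path(G, method="FW", directed=False)  # step 3: Floyd-Warshall

    J = np.eye(n) - np.ones((n, n)) / n            # step 4: classical MDS
    S = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(S)
    idx = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0.0, None))  # n x k

# a tiny "Swiss roll"-like example (rows are points)
rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1 + 2 * rng.random(300))
Y = np.column_stack([t * np.cos(t), 10 * rng.random(300), t * np.sin(t)])
X = isomap(Y, k=2, p=7)
print(X.shape)                                     # (300, 2)
```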
conclusion and discussion
69