High-Dimensional Data Mining Techniques

The document discusses several topics related to high-dimensional space, including dimensionality reduction techniques like feature selection and feature projection. It also covers concepts like the law of large numbers, volume in high-dimensional objects, and how most of the volume in a high-dimensional unit ball is concentrated near its equator. Random projection and the Johnson-Lindenstrauss lemma are also summarized.

High-Dimensional Space

Prof. Asim Tewari


IIT Bombay

Asim Tewari, IIT Bombay ME 781: Engineering Data Mining and Applications
Dimensionality reduction
• Feature selection: Feature selection approaches try to find a subset of the original variables (also called features or attributes).
– Filter strategy (e.g. information gain)
– Wrapper strategy (e.g. search guided by accuracy)
– Embedded strategy (features are added or removed while building the model, based on the prediction errors)

• Feature projection: Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. For multidimensional data, tensor representations can be used in dimensionality reduction through multilinear subspace learning.
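As an illustration of linear feature projection, here is a minimal PCA sketch using numpy; the toy data matrix X and the helper name pca_project are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 5 dimensions; most of the variance lives in
# the first two coordinates.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # coordinates in the top-k subspace

Z = pca_project(X, 2)
print(Z.shape)  # (200, 2)
```

The singular vectors come out sorted by variance, so the first projected coordinate carries at least as much variance as the second.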

Markov’s inequality
• Let x be a nonnegative random variable. Then for a > 0,

Prob(x ≥ a) ≤ E(x)/a.

Proof: For a continuous nonnegative random variable x with probability density p,

E(x) = ∫_0^∞ x p(x) dx ≥ ∫_a^∞ x p(x) dx ≥ a ∫_a^∞ p(x) dx = a Prob(x ≥ a).

Thus, Prob(x ≥ a) ≤ E(x)/a.

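The inequality is easy to check numerically; a quick Monte Carlo sketch (the distribution, threshold a, and sample size are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonnegative samples from an exponential distribution with E(x) = 1.
x = rng.exponential(scale=1.0, size=100_000)

a = 3.0
empirical = np.mean(x >= a)    # estimate of Prob(x >= a), roughly e^(-3)
markov_bound = x.mean() / a    # E(x)/a

print(empirical, markov_bound)
```

The bound is loose here (about 0.33 versus a true tail near 0.05), which is typical: Markov’s inequality uses only the mean, nothing about the shape of the distribution.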
Chebyshev’s inequality
• Let x be a random variable with finite variance. Then for c > 0,

Prob(|x − E(x)| ≥ c) ≤ Var(x)/c².

(This follows by applying Markov’s inequality to the nonnegative random variable (x − E(x))².)

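A numerical check of Chebyshev’s inequality along the same lines (distribution and c chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(loc=2.0, scale=1.5, size=100_000)

c = 3.0
empirical = np.mean(np.abs(x - x.mean()) >= c)  # Prob(|x - E(x)| >= c)
chebyshev_bound = x.var() / c**2                # Var(x)/c^2

print(empirical, chebyshev_bound)
```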
Law of Large Numbers
• Let x1, x2, . . . , xn be n independent samples of a random variable x. Then for any ε > 0,

Prob( |(x1 + x2 + · · · + xn)/n − E(x)| ≥ ε ) ≤ Var(x)/(nε²).

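A short simulation shows the sample mean tightening around E(x) as n grows (Uniform(0, 1) is an arbitrary choice, with E(x) = 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deviation of the sample mean from E(x) = 0.5 shrinks as n grows.
deviation = {}
for n in (10, 1_000, 100_000):
    samples = rng.uniform(0.0, 1.0, size=n)
    deviation[n] = abs(samples.mean() - 0.5)
    print(n, deviation[n])
```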
Law of Large Numbers
Implications of the Law of Large Numbers:
• If we draw a point x from a d-dimensional Gaussian with unit variance in each coordinate, then it will lie close to a sphere of radius √d. This is because |x|² = x1² + x2² + · · · + xd² is the sum of d independent samples of a random variable (each xi²) with expectation 1, so by the Law of Large Numbers

|x|² ≈ d.

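This concentration is easy to observe directly; a one-draw sketch with d chosen large:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10_000
x = rng.normal(size=d)   # one point from a d-dimensional unit-variance Gaussian

ratio = np.linalg.norm(x) / np.sqrt(d)
print(ratio)             # close to 1
```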
Law of Large Numbers
Implications of the Law of Large Numbers:
• If we draw two points y and z from a d-dimensional Gaussian with unit variance in each coordinate, then they will be approximately orthogonal. This is because for all i,

E(yi zi) = E(yi) E(zi) = 0,

so by the Law of Large Numbers y · z = y1 z1 + · · · + yd zd ≈ 0. Therefore |y − z|² ≈ |y|² + |z|², and thus by the Pythagorean theorem the two points y and z must be approximately orthogonal.
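Both the near-zero dot product and the Pythagorean relation can be checked with one pair of draws:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10_000
y = rng.normal(size=d)
z = rng.normal(size=d)

# Cosine of the angle between y and z: concentrated around 0 for large d.
cosine = (y @ z) / (np.linalg.norm(y) * np.linalg.norm(z))

# Pythagorean check: |y - z|^2 versus |y|^2 + |z|^2.
lhs = np.linalg.norm(y - z) ** 2
rhs = np.linalg.norm(y) ** 2 + np.linalg.norm(z) ** 2

print(cosine, lhs / rhs)
```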
Volume in objects of High Dimensions

Volume of the Unit Ball

The volume V(d) of the unit ball in d dimensions is

V(d) = π^(d/2) / Γ(d/2 + 1).

As d → ∞, V(d) → 0: the volume of the unit ball goes to zero even though its radius stays fixed at 1.
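The formula above can be evaluated directly with the standard-library gamma function; note how the volume peaks around d = 5 and then collapses:

```python
import math

def unit_ball_volume(d):
    """V(d) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in (1, 2, 3, 5, 10, 20, 50):
    print(d, unit_ball_volume(d))
```

Sanity checks: V(2) = π (the unit disk) and V(3) = 4π/3 (the unit sphere).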
Volume near the Equator
• An interesting fact about the unit ball in high dimensions is that most of its volume is concentrated near its “equator.” In particular, for any unit-length vector v defining “north,” most of the volume of the unit ball lies in the thin slab of points whose dot product with v has magnitude O(1/√d).
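This can be verified by Monte Carlo; the sketch below samples uniformly from the ball (Gaussian direction, radius scaled by U^(1/d)) and measures the slab of half-width 3/√d, with the constants chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 100, 20_000

# Sample n points uniformly from the d-dimensional unit ball.
g = rng.normal(size=(n, d))
directions = g / np.linalg.norm(g, axis=1, keepdims=True)
radii = rng.uniform(size=(n, 1)) ** (1.0 / d)
points = directions * radii

# Fraction of points in the slab |x . v| <= 3/sqrt(d), taking v = e1 as "north".
in_slab = np.mean(np.abs(points[:, 0]) <= 3.0 / np.sqrt(d))
print(in_slab)
```

Even though the slab has width only O(1/√d), it captures the overwhelming majority of the ball’s volume.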

Volume near the Equator

Let A denote the portion of the ball with x1 ≥ c/√(d − 1), and let H denote the upper hemisphere (the portion with x1 ≥ 0). We can then show that the ratio of the volume of A to the volume of H goes to zero, by calculating an upper bound on volume(A) and a lower bound on volume(H) and proving that the ratio of the upper bound to the lower bound goes to zero as d → ∞.

Volume near the Equator

Thus, for c ≥ 1 and d ≥ 3:

volume(A)/volume(H) ≤ (2/c) e^(−c²/2),

which goes to zero rapidly as c grows: all but a (2/c)e^(−c²/2) fraction of the volume of the upper hemisphere lies within distance c/√(d − 1) of the equator.
Random Projection
Johnson-Lindenstrauss Lemma
• Define the projection f : R^d → R^k as follows. Pick k Gaussian vectors u1, u2, . . . , uk in R^d with unit-variance coordinates. For any vector v, define the projection f(v) by:
f(v) = (u1 · v, u2 · v, . . . , uk · v).
(The projection f(v) is the vector of dot products of v with the ui.) Then the Johnson-Lindenstrauss Lemma states that:
• With high probability, |f(v)| ≈ √k |v|.
• Since f is linear, for any two vectors v1 and v2,
f(v1 − v2) = f(v1) − f(v2),
so f approximately preserves pairwise distances, up to the scale factor √k.
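The construction above can be sketched in a few lines; d and k are chosen arbitrarily here:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1_000, 200

# k Gaussian vectors u1..uk with unit-variance coordinates, as the rows of U.
U = rng.normal(size=(k, d))

def f(v):
    """The random projection f(v) = (u1 . v, ..., uk . v)."""
    return U @ v

v = rng.normal(size=d)
ratio = np.linalg.norm(f(v)) / (np.sqrt(k) * np.linalg.norm(v))

# Linearity holds exactly (up to floating point), so distances are preserved
# whenever norms are: f(v1 - v2) = f(v1) - f(v2).
v1, v2 = rng.normal(size=d), rng.normal(size=d)
linear = np.allclose(f(v1 - v2), f(v1) - f(v2))

print(ratio, linear)
```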

