Low-Dimensional Analysis
By: Preethi P Palankar
MCA23905
Agenda
• Introduction to Low-Dimensional Data
• Types of Low-Dimensional Analysis
• Techniques of Low-Dimensional Analysis
• Applications of Low-Dimensional Analysis
Introduction to Low-Dimensional Data
• Also known as dimensionality-reduced data.
• Dimensionality reduction is the process of reducing the number of input variables or features in a dataset while preserving its essential patterns and structures.
• It aims to eliminate irrelevant, noisy, or redundant data and convert high-dimensional data into a more manageable and interpretable form.
Types of Dimensionality Reduction
1. Feature Selection
Selects a subset of the most relevant original features without altering them.
Example: removing unnecessary columns such as ID numbers or constant-valued features.
2. Feature Extraction
Transforms the data into a new feature space, often combining multiple original features into a smaller number of informative ones.
Example: in a housing dataset, square footage and number of rooms are related features, so they can be combined into a single "house size" feature (see the sketch after this list).
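A minimal sketch of both types on a toy housing table, assuming pandas; the column names (id, sqft, rooms) and the "house size" formula are illustrative assumptions, not a standard recipe.

import pandas as pd

# Toy housing data (hypothetical columns).
df = pd.DataFrame({"id": [1, 2, 3],
                   "sqft": [1200, 1500, 900],
                   "rooms": [3, 4, 2]})

# Feature selection: drop an irrelevant column, keep the rest unchanged.
selected = df.drop(columns=["id"])

# Feature extraction: combine two related features into one new feature
# (here, a simple sum of max-scaled values; the scaling choice is arbitrary).
selected["house_size"] = (selected["sqft"] / selected["sqft"].max()
                          + selected["rooms"] / selected["rooms"].max())
print(selected)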
Techniques of Low-Dimensional Analysis
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Principal Component Analysis (PCA)
• PCA is an unsupervised linear transformation technique that projects data onto a lower-dimensional space by identifying directions (principal components) that maximize variance.
• Preserves maximum information with fewer dimensions.
• Used in exploratory data analysis and pre-processing.
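A minimal sketch with scikit-learn's PCA on the Iris dataset; keeping 2 components is an illustrative choice, not a rule.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 samples, 4 features

pca = PCA(n_components=2)                # keep the 2 directions of maximum variance
X_2d = pca.fit_transform(X)              # project onto the principal components

print(X_2d.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance each component preserves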
Linear Discriminant Analysis (LDA)
• LDA is a supervised technique that projects data in a way that maximizes the separation between classes.
• Effective for classification tasks.
• Considers both within-class and between-class variance.
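A minimal sketch with scikit-learn's LinearDiscriminantAnalysis on Iris; unlike PCA it needs the class labels, and with 3 classes it can produce at most 2 discriminant components.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # supervised: uses the labels y

print(X_lda.shape)                       # (150, 2)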
t-Distributed Stochastic Neighbor Embedding (t-SNE)
• t-SNE is a non-linear technique primarily used for visualizing high-dimensional data by reducing it to two or three dimensions while preserving local structure.
• Ideal for visualizing clusters and complex relationships.
• Commonly used on image and text data.
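A minimal sketch with scikit-learn's TSNE on the 64-dimensional digits dataset; perplexity=30 is the library default, quoted here as a starting point rather than a tuned value.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)      # 1797 samples, 64 features

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)             # non-linear embedding into 2-D

print(X_2d.shape)                        # (1797, 2)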
Criterion Functions for Clustering

Agenda
• Introduction to Clustering
• What Are Criterion Functions?
• Types of Criterion Functions
• Real-World Use Cases
Introduction to Clustering
• Clustering is an unsupervised machine learning technique that involves partitioning a dataset into groups, or clusters, such that data points within the same cluster are more similar to each other than to those in other clusters.
• A company uses clustering to group customers based on age, income, and buying habits.
• For example: one group may be frequent buyers, another occasional buyers.
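A minimal sketch of that customer-segmentation idea with k-means, assuming scikit-learn; the toy rows and the choice of 2 clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Toy customers: age, annual income (thousands), purchases per month.
X = np.array([[25, 30, 12], [27, 32, 10], [30, 35, 11],
              [58, 75, 3], [60, 80, 2], [62, 78, 1]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. frequent buyers vs. occasional buyers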
What Are Criterion Functions?
• Criterion functions are mathematical measures used to evaluate the quality of clusters formed by a clustering algorithm.
• These functions help to optimize the clustering process by measuring:
• Intra-cluster similarity: how close data points in a cluster are to each other.
• Inter-cluster dissimilarity: how different one cluster is from another.
Types of Criterion Functions
• Within-Cluster Sum of Squares (WCSS)
• Between-Cluster Sum of Squares (BCSS)
• Silhouette Score
• Davies–Bouldin Index (DBI)
Within-Cluster Sum of Squares (WCSS)
➤ WCSS measures how close the data points in a cluster are to the centroid (mean point) of that cluster.
➤ A lower WCSS means that points are tightly packed (compact), which is ideal in clustering.
Formula: WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, where \mu_k is the centroid of cluster C_k.
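A minimal sketch that computes WCSS directly from the formula on a toy 2-D dataset; for k-means, scikit-learn exposes the same quantity as the fitted model's inertia_ attribute.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sum of squared distances of each point to its own cluster centroid.
wcss = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
           for k in range(km.n_clusters))
print(wcss, km.inertia_)   # the two values agree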
Between-Cluster Sum of Squares (BCSS)
➤ BCSS measures how far apart the clusters are from each other. It looks at the distance between each cluster centroid and the overall dataset centroid.
➤ A higher BCSS is better because it shows that clusters are well-separated.
Formula: BCSS = \sum_{k=1}^{K} n_k \lVert \mu_k - \mu \rVert^2, where n_k is the size of cluster C_k, \mu_k its centroid, and \mu the overall centroid.
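A minimal sketch computing BCSS on the same toy data; it also checks the identity TSS = WCSS + BCSS, where TSS is the total sum of squares around the overall mean.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

mu = X.mean(axis=0)                       # overall dataset centroid
sizes = np.bincount(km.labels_)           # points per cluster (n_k)
bcss = sum(sizes[k] * np.sum((km.cluster_centers_[k] - mu) ** 2)
           for k in range(km.n_clusters))

tss = np.sum((X - mu) ** 2)
print(bcss, tss - km.inertia_)            # the two values agree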
Silhouette Score
➤ The silhouette score checks how well each point fits in its own cluster compared to other clusters.
➤ Values range from -1 to 1:
• Close to 1: good clustering
• Close to 0: borderline
• Below 0: likely assigned to the wrong cluster
Formula: s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, where a(i) is the mean distance from point i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest other cluster.
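A minimal sketch using scikit-learn's silhouette_score, which averages s(i) over all points; on this well-separated toy data the score is close to 1.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))   # near 1 for compact, well-separated clusters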
Davies–Bouldin Index (DBI)
➤ DBI measures the similarity between clusters, based on the ratio of intra-cluster distances to inter-cluster separation.
➤ Lower DBI = better clusters
➤ Higher DBI = clusters are too similar/overlapping
Formula: DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(\mu_i, \mu_j)}, where \sigma_i is the average distance of points in cluster i to its centroid \mu_i and d(\mu_i, \mu_j) is the distance between centroids i and j.
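A minimal sketch using scikit-learn's davies_bouldin_score on the same toy data; lower is better.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(davies_bouldin_score(X, labels))   # lower values mean better-separated clusters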
Real-World Use Cases
• Customer segmentation
• Image segmentation
• Document or topic clustering
• Bioinformatics (e.g., gene clustering)
Thank you