UNIT 4 ML Notes
Machine Learning (Dr. A.P.J. Abdul Kalam Technical University)

Unit IV
1. Clustering, K-Means
2. Limits of K-Means
3. Using Clustering for Image Segmentation
4. Using Clustering for Preprocessing
5. Using Clustering for Semi-Supervised Learning
6. DBSCAN, Gaussian Mixtures.
7. Dimensionality Reduction
8. The Curse of Dimensionality
9. Main Approaches for Dimensionality Reduction,
10. PCA
11. Using Scikit-Learn,
12. Randomized PCA, Kernel PCA

1. Clustering
Clustering is basically a type of unsupervised learning method, i.e. a method in which we draw inferences from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups. It is basically a grouping of objects on the basis of the similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be classified into a single group; in such a plot we may be able to distinguish, say, 3 separate clusters by eye.

Clustering is used in a wide variety of applications, including:

a) Customer segmentation: you can cluster your customers based on their purchases, their activity on your website, and so on. This is useful to understand who your customers are and what they need, so you can adapt your products and marketing campaigns to each segment. For example, this can be useful in recommender systems to suggest content that other users in the same cluster enjoyed.

b) Data analysis: when analyzing a new dataset, it is often useful to first discover clusters of similar instances, as it is often easier to analyze clusters separately.

c) Dimensionality reduction: once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster (affinity is any measure of how well an instance fits into a cluster). Each instance's feature vector x can then be replaced with the vector of its cluster affinities. If there are k clusters, then this vector is k-dimensional. This is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing.

d) Anomaly detection (also called outlier detection): any instance that has a low affinity to all the clusters is likely to be an anomaly. For example, if you have clustered the users of your website based on their behaviour, you can detect users with unusual behaviour, such as an unusual number of requests per second. Anomaly detection is particularly useful for detecting defects in manufacturing, or for fraud detection.

e) Semi-supervised learning: if you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster. This can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance.

f) Search engines: for example, some search engines let you search for images that are similar to a reference image. To build such a system, you would first apply a clustering algorithm to all the images in your database: similar images would end up in the same cluster. Then when a user provides a reference image, all you need to do is find this image's cluster using the trained clustering model, and you can then simply return all the images from this cluster.

g) Image segmentation: by clustering pixels according to their color, then replacing each pixel's color with the mean color of its cluster, it is possible to reduce the number of different colors in the image considerably. This technique is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object.

2. K-means clustering
 The k-means clustering algorithm is one of the simplest unsupervised learning algorithms for solving the clustering problem.
 Suppose we are required to classify a given data set into a certain number of clusters, say k clusters.
 We start by choosing k points arbitrarily as the "centres" of the clusters, one for each cluster. We then associate each of the given data points with the nearest centre.
 We then take the average of the data points associated with each centre and replace the centre with that average; this is done for each of the centres.
 We repeat the process until the centres converge to some fixed points. The data points nearest to the centres form the various clusters in the dataset. Each cluster is represented by its associated centre.
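
The same procedure can be carried out with Scikit-Learn's KMeans class. A minimal sketch, in which the blob dataset and the choice k = 3 are illustrative assumptions, not part of the original notes:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# illustrative data: 300 points generated around 3 centres
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index assigned to each data point
centres = kmeans.cluster_centers_     # the final (converged) cluster centres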


Example: refer to the worked example from the class running notes.


Disadvantages of k-means
1. Choosing k manually.
Use the "Loss vs. Clusters" plot to find the optimal k (see the sketch after this list).
2. Being dependent on initial values.
For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result. As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding).
3. Clustering data of varying sizes and density.
k-means has trouble clustering data where clusters are of varying sizes and density. To cluster such data, you need to generalize k-means, for example by moving to Gaussian mixture models.
4. Clustering outliers.
Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
5. Scaling with the number of dimensions.
As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Reduce dimensionality either by using PCA on the feature data, or by using spectral clustering to modify the clustering algorithm.
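
For point 1, the "loss" is usually taken to be the inertia (the sum of squared distances from each point to its closest centroid). A minimal sketch of such a plot, reusing the illustrative X and the KMeans import from the sketch above:

import matplotlib.pyplot as plt

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)      # total squared distance to the closest centroid

plt.plot(range(1, 10), inertias, "o-")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia (loss)")
plt.show()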

3. Using Clustering for Image Segmentation


Image segmentation is the task of partitioning an image into multiple segments.
In semantic segmentation, all pixels that are part of the same object type get assigned to the same segment. For example, in a self-driving car's vision system, all pixels that are part of a pedestrian's image might be assigned to the "pedestrian" segment (there would be just one segment containing all the pedestrians).

In instance segmentation, all pixels that are part of the same individual object are assigned to the same segment.

K-means clustering is a very popular clustering algorithm which is applied when we have a dataset with unknown labels. The goal is to find certain groups based on some kind of similarity in the data, with the number of groups represented by K. This algorithm is generally used in areas like market segmentation, customer segmentation, etc. But it can also be used to segment different objects in an image on the basis of the pixel values.
The algorithm for image segmentation works as follows:
1. First, select the value of K for K-means clustering.
2. Select a feature vector for every pixel (color values such as the RGB value, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure the similarity between any two pixels.
4. Apply the K-means algorithm to obtain the cluster centers.
5. Apply a connected-components algorithm.
6. Combine any component of size less than the threshold with an adjacent component that is similar to it, until no more components can be combined.


Example:

from scipy import ndimage
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import random

# PARAMS
n_clusters = 10

# fake training data: 100 images, each 100x100 pixels, RGB
original_train = np.random.random((100, 100, 100, 3))
n, x, y, c = original_train.shape
flat_train = original_train.reshape((n, x * y * c))

kmeans = KMeans(n_clusters, random_state=0)
clusters = kmeans.fit_predict(flat_train)
centers = kmeans.cluster_centers_

# visualize centers:
for ci in centers:
    plt.imshow(ci.reshape(x, y, c))
    plt.show()

# visualize other members
for cluster in np.arange(n_clusters):
    cluster_member_indices = np.where(clusters == cluster)[0]
    print("There are %s members in cluster %s" % (len(cluster_member_indices), cluster))
    # pick a random member
    random_member = random.choice(cluster_member_indices)
    plt.imshow(original_train[random_member, :, :, :])
    plt.show()
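
The example above clusters whole images. For segmenting a single image by pixel color, as described in the numbered steps earlier, a minimal sketch (the random image is a stand-in for a real one, and the choice of 8 clusters is an assumption):

from sklearn.cluster import KMeans
import numpy as np

image = np.random.random((100, 100, 3))               # stand-in for a real RGB image in [0, 1]
X = image.reshape(-1, 3)                              # one (R, G, B) feature vector per pixel

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_]   # replace each pixel by its cluster's mean color
segmented_img = segmented.reshape(image.shape)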


4. Using Clustering for Preprocessing

Clustering can be an efficient approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm. For example, let's tackle the digits dataset, which is a simple MNIST-like dataset containing 1,797 grayscale 8×8 images representing digits 0 to 9. First, let us load the dataset.
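
The loading step itself is not shown in these notes; a minimal sketch, assuming the standard scikit-learn digits loader and an ordinary train/test split (the actual split used is not specified):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)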
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

Output:
0.9644444444444444
How about that? We almost divided the error rate by a factor of 2! But we chose the number of clusters k completely arbitrarily; we can surely do better. Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for k is much simpler than earlier: there is no need to perform silhouette analysis or minimize the inertia, because the best value of k is simply the one that results in the best classification performance during cross-validation.
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=90)),
    ("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

Output:
0.9688888888888889

5. Using Clustering for Semi-Supervised Learning


Another use case for clustering is in semi-supervised learning, when we have plenty of unlabeled
instances and very few labeled instances.
Let us train a logistic regression model on a sample of 50 labeled instances from the digits dataset:

n_labeled = 50
log_reg = LogisticRegression()
log_reg.fit(X_train[:n_labeled], y_train[:n_labeled])

What is the performance of this model on the test set?


>>> log_reg.score(X_test, y_test)
0.8266666666666667


The accuracy is just 82.7%: it should come as no surprise that this is much lower than earlier, when we trained the model on the full training set. Let us see how we can do better. First, let us cluster the training set into 50 clusters, then for each cluster find the image closest to the centroid. We will call these images the representative images:

import numpy as np

k = 50
kmeans = KMeans(n_clusters=k)
X_digits_dist = kmeans.fit_transform(X_train)
representative_digit_idx = np.argmin(X_digits_dist, axis=0)
X_representative_digits = X_train[representative_digit_idx]

Fig: Fifty representative digit images (one per cluster)

Now let's look at each image and manually label it:
y_representative_digits = np.array([4, 8, 0, 6, 8, 3, ..., 7, 6, 2, 3, 1, 1])
Now we have a dataset with just 50 labeled instances, but instead of being
completely random instances, each of them is a representative image of its
cluster. Let’s see if the performance is any better:
>>> log_reg = LogisticRegression()
>>> log_reg.昀椀t(X_representative_digits, y_representative_digits)
>>> log_reg.score(X_test, y_test)
0.9244444444444444
With this approach we jumped from 82.7% accuracy to 92.4%, although we are still only training the model on 50 instances. Since it is often costly and painful to label instances, especially when it has to be done manually by experts, it is a good idea to label representative instances rather than just random instances.
But perhaps we can go one step further: what if we propagated the labels to all the other instances in the same cluster? This is called label propagation.
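
A minimal sketch of label propagation, assuming the kmeans model, X_train, y_representative_digits and k from above:

import numpy as np

y_train_propagated = np.empty(len(X_train), dtype=np.int64)
for i in range(k):
    # give every instance in cluster i the label of that cluster's representative image
    y_train_propagated[kmeans.labels_ == i] = y_representative_digits[i]

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train_propagated)
log_reg.score(X_test, y_test)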


6. DBSCAN
This algorithm defines clusters as continuous regions of high density. It is actually quite simple:

For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance's ε-neighborhood.

If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.

All instances in the neighborhood of a core instance belong to the same cluster. This may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.

Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.

This algorithm works well if all the clusters are dense enough and they are well separated by low-density regions. The DBSCAN class in Scikit-Learn is as simple to use as you might expect. Let's test it on the moons dataset:
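
A minimal sketch (the moons dataset and the parameter values eps=0.05, min_samples=5 are illustrative assumptions):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05)
dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)

print(dbscan.labels_[:10])                 # cluster index per instance (-1 means anomaly)
print(len(dbscan.core_sample_indices_))    # number of core instances found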


7. Gaussian Mixtures
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size, density and orientation.
When you observe an instance, you know it was generated from one of the Gaussian distributions, but you are not told which one, and you do not know what the parameters of these distributions are.

(Figure: a few Gaussian distributions differing in mean (μ) and variance (σ²).) Remember that the higher the σ value, the greater the spread.
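
A minimal sketch of fitting a Gaussian mixture with Scikit-Learn; the blob dataset and the choice of 3 components are illustrative assumptions:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # illustrative data

gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)

print(gm.weights_)             # estimated mixture weights
print(gm.means_)               # estimated mean of each Gaussian
print(gm.covariances_)         # estimated covariance matrix of each Gaussian
y_pred = gm.predict(X)         # hard cluster assignment for each instance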


Chapter-II
1. Dimensionality Reduction:
2. The Curse of Dimensionality
3. Main Approaches for Dimensionality Reduction
4. PCA, Using Scikit-Learn
5. Randomized PCA
6. Kernel PCA

Dimensionality Reduction

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space, so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

1. The Curse of Dimensionality


The Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset corresponds to the number of attributes/features that exist in the dataset.
A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high-dimensional data.
Some of the difficulties that come with high-dimensional data manifest while analyzing or visualizing the data to identify patterns, and some manifest while training machine learning models. The difficulties related to training machine learning models on high-dimensional data are referred to as the 'Curse of Dimensionality'.
Popular aspects of the curse of dimensionality are 'data sparsity' and 'distance concentration'.
Solutions to the Curse of Dimensionality:
One way to reduce the impact of high dimensionality is to use a different measure of distance in the vector space. For instance, one could use cosine similarity in place of Euclidean distance, since cosine similarity is less affected by high dimensionality. However, the choice of such a measure is also specific to the problem being solved.
Other methods:
Other methods involve reducing the number of dimensions. Some of the techniques that can be used are:
1. Forward feature selection: This method involves picking the most useful subset of features from all given features.


2. PCA/t-SNE: Though these methods help reduce the number of features, they do not necessarily preserve the class labels, which can make interpretation of the results a tough task.

2. Main Approaches for Dimensionality Reduction

The two main approaches to reducing dimensionality are projection and manifold learning.
a) Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let us look at an example.

In the figure below you can see a 3D dataset represented by circles.

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. If we now project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown below.


However, projection is not always the best approach to dimensionality reduction.

b) Manifold Learning
 Manifold learning is a type of non-linear dimensionality reduction process.
 It is believed that many datasets have an artificially high dimensionality.
 Suppose your data is 3-dimensional. When you visualize it, you may see that the data is simply a 2D plane that has been twisted into 3D space.
 Manifold learning tries to unwrap these folds.


Higher-dimensional data is usually harder to learn from, so manifold learning is simply a process to make that kind of data easier to use.
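
A minimal sketch of manifold learning on the classic Swiss-roll dataset, using Locally Linear Embedding; the dataset and the hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)   # a 2D sheet rolled up in 3D
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_unrolled = lle.fit_transform(X)                                    # the roll "unwrapped" into 2D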

3. Principal Components Analysis

This method was introduced by Karl Pearson. It works on the condition that while the data in a higher-dimensional space is mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximum. It involves the following steps:
 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.


Hence, we are left with a smaller number of eigenvectors, and there might have been some data loss in the process. However, the most important variances should be retained by the remaining eigenvectors.

Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduces the required storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes undesirable.
 PCA fails in cases where mean and covariance are not enough to define the dataset.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

Steps in PCA


PCA provides a mechanism to recognize this geometric similarity through algebraic means. The covariance matrix S is a symmetric matrix, and according to the Spectral Theorem (spectral decomposition) it can be decomposed in terms of its eigenvectors and eigenvalues:

A v⃗i = λi v⃗i

Here we call v⃗i an eigenvector, λi the corresponding eigenvalue, and A the covariance matrix.

Step 4: Inferring the principal components from the eigenvalues of the covariance matrix
From the Spectral Theorem we infer that the most significant principal component is the eigenvector corresponding to the largest eigenvalue.

Step 5: Projecting the data using the principal components
The projection matrix is formed from the k selected eigenvectors (k < d). The original dataset is transformed via the projection matrix to obtain a reduced k-dimensional subspace of the original dataset.
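
A minimal NumPy sketch of these steps; the random 3-dimensional dataset and the choice k = 2 are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))              # illustrative data: 200 samples, d = 3 features

X_centered = X - X.mean(axis=0)            # centre the data
S = np.cov(X_centered, rowvar=False)       # covariance matrix (d x d)
eigvals, eigvecs = np.linalg.eigh(S)       # eigendecomposition of the symmetric matrix S

order = np.argsort(eigvals)[::-1]          # sort eigenvalues, largest first
k = 2
W = eigvecs[:, order[:k]]                  # projection matrix from the top-k eigenvectors
X_reduced = X_centered @ W                 # project onto the k-dimensional subspace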


4. Using Scikit-Learn
Scikit-Learn's PCA class implements PCA using SVD decomposition. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):
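
A minimal sketch; the random 3-dimensional dataset stands in for whatever data you are reducing:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)                 # illustrative 3-dimensional dataset
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)                 # the data projected onto the first two components
print(pca.explained_variance_ratio_)       # proportion of variance carried by each component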

5. Randomized PCA
If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n:
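
A minimal sketch; the illustrative high-dimensional data and the choice of 20 components are assumptions:

import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(1000, 200)        # illustrative data: m = 1000 instances, n = 200 features
rnd_pca = PCA(n_components=20, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)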

By default, svd_solver is actually set to "auto": Scikit-Learn automatically uses the randomized
PCA algorithm if m or n is greater than 500 and d is less than 80% of m or n, or else it uses the
full SVD approach. If you want to force Scikit-Learn to use full SVD, you can set the svd_solver
hyperparameter to "full"

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a
lower dimensional space. The input data is centered but not scaled for each feature before
applying the SVD.
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the
method of Halko et al. 2009, depending on the shape of the input data and the number of
components to extract.
It can also use the scipy.sparse.linalg ARPACK implementation of the truncated SVD.


6. Kernel PCA
The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines.
A linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, and can sometimes even unroll datasets that lie close to a twisted manifold.
For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel:
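
A minimal sketch; the Swiss-roll dataset and the hyperparameter values are illustrative assumptions:

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, random_state=42)    # illustrative nonlinear dataset
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)                       # nonlinear 2D projection of the data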
