Week-9 Unsupervised Learning Algorithms
Unsupervised learning is a type of machine learning in which models are trained on unlabelled
datasets and are allowed to act on that data without any supervision.
As we can see from the above figure, when raw input is given to the machine learning model, it
is able to group the items according to hidden patterns and structure. Here the model can group
the fruits as apples, oranges, and avocados based on their respective characteristics.
• Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyse and cluster unlabelled datasets.
• The goal of unsupervised learning is to find the underlying structure of a dataset, group the data
according to similarities, and represent the dataset in a compressed format.
Suppose the unsupervised learning algorithm is given an input dataset containing images of different
types of cats and dogs. The algorithm is never trained upon the given dataset, which means it does
not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is
to identify the image features on its own. The unsupervised learning algorithm will perform this task
by clustering the image dataset into groups according to the similarities between the images.
Types of Unsupervised Learning
Unsupervised learning can be classified into two categories:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the
most similarities remain in the same group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
Association: An association rule is an unsupervised learning method which is used for finding
relationships between variables in a large database. It determines the sets of items that occur
together in the dataset.
Association rules make a marketing strategy more effective. For example, people who buy item X
(say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule
learning is Market Basket Analysis.
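To make the idea concrete, here is a minimal sketch in plain Python with a small made-up list of transactions (the items and numbers are illustrative, not from the notes); it counts how often bread and butter occur together and derives the support and confidence of the rule bread → butter:
import itertools  # not strictly needed here, shown only for larger itemset counting

# Hypothetical transactions; item names are illustrative only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = both / n          # bread and butter appear together in 3 of 5 baskets = 0.6
confidence = both / bread   # butter appears in 3 of the 4 baskets containing bread = 0.75
print(support, confidence)
Algorithms such as Apriori automate exactly this kind of counting over large databases of transactions.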
Some popular unsupervised learning techniques are:
• Hierarchical clustering.
• Anomaly detection.
• Neural networks.
• Apriori algorithm.
• Clustering automatically divides the dataset into groups based on their similarities.
• Anomaly detection can discover unusual text or data points in your dataset. It is useful for
finding fraudulent transactions.
• Association mining identifies sets of items that often occur together in your datapoints/dataset.
• Latent variable models are broadly used for data pre-processing, such as reducing the number of
features in a dataset or decomposing the dataset into multiple components.
• The result of an unsupervised learning algorithm might be less accurate because the input data is not
labelled, and the algorithm does not know the exact output in advance.
Typical applications of unsupervised learning include:
• Customer Segmentation
• Similarity Detection
• Recommendation Systems
1. Clustering
• Clustering is the process of grouping the given data into different clusters or groups. Unsupervised
learning can be used to do clustering when we do not know exactly the information about the
clusters.
• Elements in a group or cluster should be as similar as possible, and points in different groups
should be as dissimilar as possible.
• It is used for analysing and grouping data, which does not include pre-labelled classes or class
attributes. Clustering can be helpful for businesses to manage their data in a better way.
• For example, you can go to Walmart or a supermarket and see how different items are grouped
and arranged there.
• Also, e-commerce websites like Amazon use clustering algorithms to implement a user-specific
recommendation system.
• Here is another example. Let us say we have a YouTube channel. We may have a lot of data about
the subscribers of our channel. If we want to detect groups of similar subscribers, then we may
need to run a clustering algorithm. We do not need to tell the algorithm which group a subscriber
belongs to; the algorithm can find those connections without our help. For example, it may tell
you that 35% of your subscribers are from Canada, while 20% of them are from the United States.
2. Visualization
• Visualization is the process of creating diagrams, images, graphs, charts, etc., to communicate
some information. This method can be applied using unsupervised machine learning.
• For example, let us say you are a football coach, and you have some data about your team’s
performance in a tournament. You may want to find all the statistics about the matches quickly.
• You can feed the complex and unlabelled data to some visualization algorithm.
• These algorithms will output a two-dimensional or three-dimensional representation of your data
that can easily be plotted. So, by seeing the plotted graphs, you can easily get a lot of information.
• This information will help you to maintain your winning formula, correct your previous mistakes,
and win the ultimate trophy.
3. Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of random variables under
consideration by getting a set of principal variables.
• In dimensionality reduction, the objective is to simplify the data without losing too much
information. There can be a lot of similar information in your data.
• One method to do dimensionality reduction is to merge all those correlated features into one. This
method is also called feature extraction.
These are some of the most common dimensionality reduction algorithms in machine learning:
✓ Principal Component Analysis (PCA)
✓ Kernel PCA
✓ Locally-Linear Embedding
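A minimal sketch of how these three reducers can be called in scikit-learn, using random placeholder data (the data and the parameter values are assumptions, not from the notes):
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(100, 10)   # 100 samples with 10 features (placeholder data)

X_pca = PCA(n_components=2).fit_transform(X)                                      # Principal Component Analysis
X_kpca = KernelPCA(n_components=2, kernel='rbf').fit_transform(X)                 # Kernel PCA
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(X)   # Locally-Linear Embedding
print(X_pca.shape, X_kpca.shape, X_lle.shape)                                     # each is (100, 2)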
4. Association Rule Learning
• This is the process of finding associations between different parameters in the available data. It
discovers the probability of the co-occurrence of items in a collection, such as people that buy X
also tend to buy Y.
• In association rule learning, the algorithm will deep dive into large amounts of data and find some
interesting relationships between attributes.
• For example, when you go to Amazon and buy some items, they will show you products similar
to those in advertisements, even when you are not on their website.
• This is a kind of association rule learning. Amazon can find associations between different
products and customers. They know that if they show a particular advertisement to a particular
customer, chances are high that he will buy the product.
• Thus, by using this method, they can significantly increase their sales and revenue. This leads to a
more customized customer approach and is a pillar of customer satisfaction as well as retention.
5. Anomaly Detection
• Anomaly detection is the identification of rare items, events, or observations which raise
suspicion by differing significantly from the normal data.
• In this case, the system is trained with a lot of normal instances. So, when it sees an unusual
instance, it can detect whether it is an anomaly or not.
• One important example of this is credit card fraud detection. You might have heard about a lot
of events related to credit card fraud.
• This problem is now solved using anomaly detection techniques in machine learning. The
system detects unusual credit card transactions to prevent fraud.
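As a hedged illustration of the idea (not the actual fraud-detection system), scikit-learn's IsolationForest can be trained on mostly normal data and then flag unusual points; the values below are made up:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal = rng.normal(loc=50, scale=5, size=(200, 2))     # many "normal" transactions (amount, time-of-day, say)
unusual = np.array([[120.0, 130.0]])                    # one very different transaction

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal)                                       # train on normal instances
print(model.predict(unusual))                           # [-1] means the point is flagged as an anomaly
print(model.predict(normal[:3]))                        # mostly [1], i.e. normal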
K-means Clustering
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.
Where K defines the number of pre-defined clusters that need to be created in the process, as
if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a
way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabelled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and
repeats the process until the cluster assignments no longer change, i.e., until it finds the best
clusters. The value of k should be predetermined in this algorithm.
It assigns each data point to its closest k-center; the data points that are near a particular
k-center create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The following steps (illustrated in the diagram below) explain the working of the K-means
clustering algorithm:
Step-1: Select the number K to decide the number of clusters to be created.
Step-2: Select K random points or centroids (they may or may not be points from the input dataset).
Step-3: Assign each data point, based on its distance from the randomly selected points
(centroids), to the nearest/closest centroid, which will form the predefined clusters.
Step-4: Calculate a new centroid for each cluster, i.e., the centre of gravity of the data points
assigned to it.
Step-5: Repeat Step-3, which reassigns each data point to the new closest centroid of each
cluster.
Step-6: If any data point changed its cluster, go back to Step-4; otherwise, the clusters are stable.
Step-7: FINISH
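Scikit-learn's KMeans runs essentially this loop internally; a minimal sketch on made-up 2-D points (the data and k = 2 are illustrative, not from the notes):
import numpy as np
from sklearn.cluster import KMeans

# Two visually separated blobs of made-up points
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [8, 9]], dtype=float)

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(points)      # Steps 1-7 above happen inside fit_predict
print(labels)                            # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
print(kmeans.cluster_centers_)           # final centroids of the two clusters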
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given
below:
Let us take number k of clusters, i.e., K=2, to identify the dataset and to put them into different
clusters. It means here we will try to group these datasets into two different clusters.
We need to choose some random k points or centroid to form the cluster. These points can be
either the points from the dataset or any other point.
So, here we are selecting the below two points as k points, which are not the part of our dataset.
Consider the below image.
Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will
compute it by applying some mathematics that we have studied to calculate the distance between two
points. So, we will draw a median between both the centroids. Consider the below image:
As we need to find the closest cluster, we will repeat the process by choosing new centroids. To
choose the new centroids, we will compute the centre of gravity of the data points in each cluster, and
will find the new centroids as below:
We will repeat the process by finding the centre of gravity of the clusters, so the new centroids will be
as shown in the below image:
As we can see in the above image, there are no dissimilar data points on either side of the line, which
means our model is formed. Consider the below image:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the value
of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
Here ∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point
and its centroid within cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
• It executes the K-means clustering on a given dataset for different K values (ranges from 1-
10).
• Plots a curve between calculated WCSS values and the number of clusters K.
• The sharp point of bend in the plot (the point where the curve looks like an arm, i.e., the elbow)
is considered the best value of K.
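The scatter-plot code below assumes a K-Means model has already been trained with K = 5 on two customer features (Annual Income and Spending Score). A hedged sketch of that training step, with an assumed Mall_Customers-style CSV file and assumed column positions:
import pandas as pd
import matplotlib.pyplot as mtp
from sklearn.cluster import KMeans

dataset = pd.read_csv('Mall_Customers.csv')        # assumed file name
x = dataset.iloc[:, [3, 4]].values                 # assumed columns: Annual Income, Spending Score
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)                  # cluster index (0-4) for every customer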
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
mtp.legend()
mtp.show()
The output image clearly shows the five different clusters in different colours. The clusters are
formed between two parameters of the dataset: Annual Income of the customer and Spending Score. We
can change the colours and labels as per the requirement or choice. We can also observe some points
from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize
these customers as standard.
o Cluster2 shows the customer has a high income but low spending, so we can categorize them
as careful.
o Cluster3 shows the low income and low spending so they can be categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so they can be
categorized as careless.
o Cluster5 shows the customers with high income and high spending so they can be
categorized as target, and these customers can be the most profitable customers for the mall
owner.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
df = pd.read_csv("C:/Users/Shilpa/Desktop/dataset/Iris.csv")
x = df.iloc[:,1:5].values
print(x)
from sklearn.cluster import KMeans
wcss_list = []  # list to hold the WCSS (inertia) value for each k
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1,11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see from the above figure, the elbow occurs at k = 3, so the Iris dataset can be grouped into
three categories: Iris-setosa, Iris-versicolor, and Iris-virginica.
Inertia = ∑i=1..N distance(xi, Ck(i))², where N is the number of samples within the data set and
Ck(i) is the centre of the cluster that sample xi belongs to. So the Inertia simply computes the
squared distance of each sample in a cluster to its cluster centre and sums them up.
This process is done for each cluster and all samples within that data set. The smaller the Inertia
value, the more coherent the different clusters are. When as many clusters are added as there
are samples in the data set, the Inertia value would be zero.
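Continuing the Iris snippet above, the Inertia value for the chosen k = 3 can be read directly from the fitted model (a small sketch; kmeans3 is an assumed variable name):
kmeans3 = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans3.fit(x)                 # x is the Iris feature matrix loaded earlier
print(kmeans3.inertia_)        # sum of squared distances of samples to their closest cluster centre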
A point is represented using x and y co-ordinates, hence the Euclidean distance between two points
(x1, y1) and (x2, y2) can be calculated as d = √((x2 − x1)² + (y2 − y1)²).
Similarly, Euclidean distance is calculated for every data point. After the formation of clusters, the
next step is to update or recompute the centroid values. This is done by taking the Euclidean average
or mean value of the data points in a particular cluster.
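A tiny numeric sketch of both operations (the points are made up):
import numpy as np

p = np.array([2.0, 3.0])
c = np.array([5.0, 7.0])
dist = np.sqrt(((p - c) ** 2).sum())         # sqrt((5-2)^2 + (7-3)^2) = sqrt(9 + 16) = 5.0
print(dist)

cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
new_centroid = cluster_points.mean(axis=0)   # centroid update = mean of the points = [3.0, 4.0]
print(new_centroid)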
Stop criteria
1. We can stop our training when even after many iterations our centroids are stable, that is they are
the same and fixed.
2. We can stop our training when even after many iterations our data points remain in the same
cluster. That they do not change their cluster anymore.
3. We can stop our training when a fixed or maximum number of iterations is reached. Note, however,
that an insufficient number of iterations might give us poor results and hence unstable clusters.
2. DUNN INDEX
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio
between the minimal inter-cluster distance to maximal intra-cluster distance.
Intra-cluster: The distance between two similar data points belonging to the same cluster.
Inter-cluster: The distance between two dissimilar data points belonging to different clusters.
The main objective of any good clustering algorithm is to reduce the intra-cluster distance and
maximize the inter-cluster distance. One of the main performance metrics that is used for clustering
is the Dunn Index parameter.
The Dunn Index (DI) is one of the clustering algorithms evaluation measures. It is most commonly
used to evaluate the goodness of split by a K-Means clustering algorithm for a given number of
clusters.
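Written as a formula (a standard formulation consistent with the definition above, where δ(Ci, Cj) denotes the inter-cluster distance between clusters Ci and Cj, and Δ(Cm) the maximal intra-cluster distance, i.e. the diameter, of cluster Cm), the Dunn Index for k clusters is:

DI = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le m \le k} \Delta(C_m)}

A larger DI value indicates a better split, since it corresponds to compact clusters that are far apart from each other.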
Dimensionality reduction
Dimensionality reduction simply refers to the process of reducing the number of attributes
in a dataset while keeping as much of the variation in the original dataset as possible. It
is a data preprocessing step meaning that we perform dimensionality reduction before training
the model.
It becomes easier to visualize the data when reduced to very low dimensions such as 2D
or 3D.
It removes irrelevant features from the data, because having irrelevant features in the
data can decrease the accuracy of the models and make your model learn based on
irrelevant features.
Not all variables in your data are independent. Some input variables may correlate with the
other input variables in the dataset. This is referred to as multicollinearity which can negatively
affect the performance of your regression and classification models.
Dimensionality reduction can be used for image compression
Dimensionality reduction reduces the size of your dataset while keeping as much of the
variability in the original data as possible. A similar kind of approach can be used for image
compression. So, in image compression, we reduce the number of pixels of an image while
keeping as much of the quality in the original image as possible.
Dimensionality reduction improves the accuracy of models.
Dimensionality reduction can be used to compress neural network architectures.
One such neural-network-based approach is the autoencoder, which consists of two parts:
• Encoder: This is a non-linear function that transforms the input data into a lower-
dimensional form called the latent vector.
• Decoder: This is also a non-linear function that takes the latent vector as its input and
constructs another output that is very similar to the original input. The goal is to
minimize the reconstruction error.
The encoder takes the input X (784-dim) and transforms it into a lower-dimensional latent
vector (184-dim), which is taken by the decoder as its input. The output of the decoder is
very similar to the original input X, but not exactly the same. The dissimilarity between the
input and the output is measured by the reconstruction error (the loss function), which should
be kept as small as possible.
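A minimal Keras sketch of this encoder/decoder pair, assuming the 784-dim input and 184-dim latent vector mentioned above (the activations and optimizer are assumptions):
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, latent_dim = 784, 184

inputs = tf.keras.Input(shape=(input_dim,))
latent = layers.Dense(latent_dim, activation='relu')(inputs)       # encoder: non-linear map to the latent vector
outputs = layers.Dense(input_dim, activation='sigmoid')(latent)    # decoder: reconstruct something close to the input
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')                  # minimize the reconstruction error
# autoencoder.fit(X, X, epochs=10)  # trained to reproduce its own input X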
• The main purpose of Principal Component Analysis (PCA) is to simplify the complexity in
high-dimensional data while retaining trends and patterns.
• PCA can help us improve performance at a very low cost of model accuracy. Other
benefits of PCA include reduction of noise in the data, feature selection (to a certain
extent), and the ability to produce independent, uncorrelated features of the data.
Uses of PCA
• PCA is used to visualize multidimensional data.
As we can see, the Iris dataset has four features: sepal length, sepal width, petal length, and petal
width. These features help in classifying which Iris category a sample belongs to.
When a scatter plot is drawn of (petal length vs. petal width) and of (sepal length vs. sepal
width), in both cases we can easily distinguish the three types of Iris. This is important, because
it tells us that both the sepal features and the petal features independently contain explanatory
information relating to the type of Iris, i.e., the target class.
If we could only choose petals or sepals to attempt classifying, then of course the petal features
would yield much better results, but not perfect:
The Iris virginica and Iris versicolor clusters blend a bit, and are not linearly separable. In
some cases, the data may even be linearly separable after applying a transformation to the data,
but no such pattern seems to exist here.
Thus, we apply PCA! Even though the sepal features seem even worse for classifying the
target (i.e. they are even less linearly separable by Iris type), they nevertheless contain
important information, or statistical variance, which may reveal more linear separability in
different dimensions while also reducing the amount of features! This is why PCA is useful.
• Because the features in the Iris dataset are on totally different scales (e.g. the sepal
lengths are much larger than the petal widths), we need to scale (standardize) them before
applying PCA, so that the principal components, which are computed via singular value
decomposition, treat all features equally.
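A hedged sketch of that standardization step (the code snippet that follows in the notes applies PCA directly and omits it):
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)   # each feature now has mean 0 and unit variance
print(x_scaled.mean(axis=0).round(2), x_scaled.std(axis=0).round(2))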
# Imports assumed for the snippet below
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

iris = datasets.load_iris()
x = iris.data
y = iris.target
print(x.shape)   # (150, 4)
print(y.shape)   # (150,)

pca = PCA(n_components=2)           # keep the two principal components
pca.fit(x)
print(pca.components_)              # directions of maximum variance
x = pca.transform(x)                # project the data onto the two components
print(x.shape)                      # (150, 2)

plt.scatter(x[:, 0], x[:, 1], c=y)  # visualize the data in the reduced 2-D space
plt.show()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
res = DecisionTreeClassifier()
res.fit(x_train, y_train)
y_predict = res.predict(x_test)
print(accuracy_score(y_test, y_predict))
MLOps
• A Machine Learning pipeline is a process of automating the workflow of a complete machine
learning task.
• A typical pipeline includes raw data input, features, outputs, model parameters, ML models,
and Predictions.
• Moreover, an ML Pipeline contains multiple sequential steps that perform everything ranging
from data extraction and pre-processing to model training and deployment in Machine learning
in a modular approach.
• It means that in the pipeline, each step is designed as an independent module, and all these
modules are tied together to get the final result.
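As an illustrative example of this modular idea (the choice of steps and of scikit-learn here is an assumption, not something prescribed by the notes), a Pipeline ties independent steps together:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),       # pre-processing module
    ('reduce', PCA(n_components=2)),   # feature-extraction module
    ('model', LogisticRegression()),   # model-training module
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)  # the modules run in sequence to give the final result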
What is MLOps?
MLOps stands for Machine Learning Operations for production.
MLOps engineers build and maintain a platform to enable the development and deployment
of machine learning models.
MLOps, also known as Machine Learning Operations for Production, is a set of standardized
practices that can be utilized to build, deploy, and govern the lifecycle of ML models.
This setup helps to ease the interaction among cross-functional teams and provides an
automated platform to keep track of everything required for the complete cycle of ML models.
MLOps practices also result in increased scalability, security, and reliability of the ML
systems, leading to shorter development cycles and escalated profits from the ML projects.
Each stage of the data pipeline passes processed data to the next step, i.e., it gives the output
of one phase as input data into the next phase
Why MLOps?
There are many goals enterprises want to achieve through MLOps. Some of the common ones are:
• Automation
• Scalability
• Reproducibility
• Monitoring
• Governance
MLOps Lifecycle
1. ML Development: This is the basic step that involves creating a complete pipeline beginning
from data processing to model training and evaluation codes.
2. Model Training: Once the setup is ready, the next logical step is to train the model. Here,
continuous training functionality is also needed to adapt to new data or address specific
changes.
3. Model Evaluation: Performing inference over the trained model and checking the
accuracy/correctness of the output results.
4. Model Deployment: When the proof of concept stage is accomplished, the other part is to
deploy the model according to the industry requirements to face the real-life data.
5. Prediction Serving: After deployment, the model is now ready to serve predictions over the
incoming data.
6. Model Monitoring: Over time, problems such as concept drift can make the results inaccurate
hence continuous monitoring of the model is essential to ensure proper functioning.
Data and Model Management: It is a part of the central system that manages the data and models.
It includes maintaining storage, keeping track of different versions, ease of accessibility, security, and
configuration across various cross-functional teams.
Versioning of Machine learning models
A developer should be able to answer questions such as: which dataset was used to train the model,
which hyperparameters were used, which pipeline was used to create the model, and which version
of the model was last deployed.
This calls for the application of version control to machine learning models. The accuracy of the
model varies when you update and tinker with different parts of the model. With versioning,
developers can scope out the best model and its trade-offs.
A machine learning model can fall flat for several reasons, for example after adding more
data or incorporating performance-improvement measures. In case of such failures, model
versioning helps in quickly reverting to the previous working version.
Machine learning models can be very complex. Factors such as datasets, training and testing,
frameworks, among others, account for a model’s success. Version control helps in keeping
dependency tracking.
Major updates to machine learning models are not usually rolled out at once. To ensure better
performance and failure tolerance, the ML models are released in phases. Versioning allows
the deployment of the right versions at the right time.
Git: Git is the standard versioning protocol used across the board to monitor and version-control
software development and deployment. Git tracks changes made to the code and helps
in implementing, storing, and merging changes.
Git also comes with a few drawbacks. It is a challenge to keep all the folders in sync in Git.
The model checkpoints and data size occupy the bulk of the space. Many users alternatively
store the datasets in cloud storage such as Amazon S3, keep reproducible code in Git, and generate
models on the fly. But working with multiple data sets breeds confusion. Further, improper
documentation of data changes and upgrades can result in the model losing the context.
DVC: Data Version Control is a Git extension. It is a streamlined version of combining Git
with ML-specific functionality for data management. DVC can run on top of any Git repository
and is compatible with the Git server or provider. DVC also offers all the advantages of the
distributed version control system, such as lock-free, local branching, and versioning.
Pachyderm: It delivers robust data versioning and data lineage to the machine learning loop.
It also provides a flexible pipeline system that can use any tool or framework in the
transformation steps. Pachyderm uses containers to execute different pipeline steps and solves
data provenance issues by tracking data commits and optimizing the pipeline.
Machine Learning Metadata (MLMD): It is a recently introduced library from the TensorFlow team to
track the entire ML workflow’s full lineage. The complete lineage includes steps such as data
ingestion, preprocessing, validation, training, and deployment. MLMD can be used to trace bad
models back to the datasets.
Machine learning model registry
A model registry is a repository used to store and version trained machine learning (ML)
models. Model registries greatly simplify the task of tracking models as they move through the
ML lifecycle, from training to production deployments and ultimately retirement.
MLflow Model Registry is a centralized model repository and a UI and set of APIs that enable
you to manage the full lifecycle of MLflow Models. Model Registry provides: Chronological
model lineage (which MLflow experiment and run produced the model at a given time).
MLflow is an MLOps tool that enables data scientists to quickly productionize their Machine
Learning projects. To achieve this, MLflow has four major components: Tracking, Projects,
Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package
them into reproducible steps.
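A hedged sketch of logging and registering a model with the MLflow Model Registry (the chosen model, the registry name "iris_kmeans", and the presence of a configured tracking server are all assumptions):
import mlflow
import mlflow.sklearn
from sklearn import datasets
from sklearn.cluster import KMeans

x = datasets.load_iris().data
with mlflow.start_run() as run:
    model = KMeans(n_clusters=3, random_state=42).fit(x)
    mlflow.sklearn.log_model(model, artifact_path="model")                      # Tracking: store the trained model
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "iris_kmeans")      # Registry: create/advance a version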