Anomaly Detection is the technique of identifying rare events or observations which raise suspicion by being statistically different from the rest of the observations. Such "anomalous" behavior typically translates to some kind of problem, such as credit card fraud, a failing machine in a server, or a cyber-attack.
K-Medoids: K-medoids clustering is a variant of K-means that is more robust to noise and outliers. Instead of using the mean point as the center of a cluster, K-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, with the minimum sum of distances to the other points.
Self-organizing map applications: Lending - identifying clusters of borrowers at risk of defaulting on re-payments. Customer segmentation - customers with similar characteristics can be clustered together for further analysis of churn rate, loyalty, promotions, etc.
A Gaussian mixture model (GMM) can be used for clustering, which is the task of grouping a set of data points into clusters. GMMs can find clusters in data sets where the clusters may not be clearly defined. Additionally, GMMs can estimate the probability that a new data point belongs to each cluster.
Hard clustering is a method of grouping the data items such that each item is assigned to exactly one cluster; K-Means is one example. Soft clustering is a method of grouping the data items such that an item can belong to multiple clusters; Fuzzy C-Means (FCM) is an example.
To develop and manage a production-ready model, you must work through the following stages:
•Source and prepare your data.
•Develop your model.
•Train an ML model on your data:
 •Train the model
 •Evaluate model accuracy
 •Tune hyperparameters
•Deploy your trained model.
•Send prediction requests to your model:
 •Online prediction
 •Batch prediction
•Monitor the predictions on an ongoing basis.
•Manage your models and model versions.
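The hard/soft distinction can be illustrated with a minimal plain-Python sketch: a K-means-style nearest-centroid assignment versus a GMM-style membership probability. The two one-dimensional Gaussian components, their shared standard deviation, and the sample point are made-up illustration values, not taken from the text; equal mixing weights are assumed.

```python
import math

# Two made-up 1-D Gaussian components (means, shared std) and K-means centroids.
means = [0.0, 5.0]
std = 1.5
centroids = [0.0, 5.0]
x = 2.0  # a new data point

# Hard assignment (K-means style): the point belongs to exactly one cluster.
hard = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))

# Soft assignment (GMM style): a probability of membership in each cluster,
# assuming equal mixing weights for both components.
def gauss_pdf(v, mu, sigma):
    return math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

likelihoods = [gauss_pdf(x, mu, std) for mu in means]
total = sum(likelihoods)
soft = [p / total for p in likelihoods]

print("hard cluster:", hard)      # a single cluster index
print("soft memberships:", soft)  # probabilities summing to 1
```

The hard assignment discards how close the call was; the soft memberships keep that information, which is why GMMs can express uncertainty for points that sit between clusters.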
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k must be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
•Determines the best value for the K center points or centroids by an iterative process.
•Assigns each data point to its closest k-center. The data points near a particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.

Gradient descent                         | Normal equation
Iterative                                | Closed-form
It may converge gradually                | It converges directly
It is effective for large datasets       | It is inefficient for large datasets
It is slower for complex models          | It is quicker
The learning rate must be carefully chosen | No learning rate is needed
It is applicable to different models     | It is restricted to linear regression
It may get stuck in local optima         | It is stable for most cases
It is appropriate for large datasets     | It is constrained by matrix inversion for large datasets
It supports regularization methods       | It requires alteration for regularization
It may require feature scaling           | It is not influenced by feature scaling
How does the K-Means Algorithm Work?
1: Select the number K to decide the number of clusters.
2: Select K random points as centroids. (They may be points other than those in the input dataset.)
3: Assign each data point to its closest centroid, which will form the predefined K clusters.
4: Calculate the variance and place a new centroid in each cluster.
5: Repeat the third step: reassign each data point to the new closest centroid of each cluster.
6: If any reassignment occurs, go to step 4; otherwise go to FINISH.
7: The model is ready.

Classification                                    | Regression
Classification gives out discrete values.         | Regression gives continuous values.
Given a group of data, this method helps group the data into different groups. | It uses the mapping function to map values to continuous output.
In classification, the nature of the predicted data is unordered. | Regression has ordered predicted data.
The mapping function is used to map values to pre-defined classes. | It attempts to find a best-fit line and extrapolates it to find/predict values.
Examples include decision trees and logistic regression. | Examples include regression trees (random forest) and linear regression.
Classification is evaluated by measuring accuracy. | Regression is evaluated using the root mean square error.
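Steps 1-7 of the K-means procedure above can be sketched as a short plain-Python implementation. The `kmeans` helper and the two sample blobs are illustrative, not part of the text; initial centroids are drawn from the data points themselves (the text notes they could also lie outside the dataset).

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain-Python K-means on 2-D points, following steps 1-7 above."""
    rng = random.Random(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(iters):
        # Steps 3 and 5: assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Step 6: if no reassignment occurred, the model is ready (step 7).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assignment

# Two made-up blobs around (0, 0) and (10, 10).
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids, labels = kmeans(pts, k=2)
print(centroids)  # one centroid near each blob
```

Because K-means only stops when no point changes cluster (step 6), the loop is guaranteed to terminate, but as the gradient-descent table noted for iterative methods generally, it can settle in a local optimum, so practical implementations rerun it with several random initializations.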