
Unit-4

PREDICTION
To find a numerical output, prediction is used. The training dataset contains the inputs and
numerical output values. According to the training dataset, the algorithm generates a model or
predictor. When fresh data is provided, the model should find a numerical output. This
approach, unlike classification, does not have a class label. A continuous-valued function or
ordered value is predicted by the model.
In most cases, regression is utilized to make predictions. For example: predicting the worth of a home based on features like the number of rooms, total area, and so on.
Consider the following scenario: a marketing manager needs to forecast how much a specific consumer will spend during a sale. Here we are required to forecast a numerical value, so a model or predictor that forecasts a continuous or ordered value function will be built.

Prediction Issues:

Preparing the data for prediction is the most pressing challenge. The following activities are
involved in data preparation:
 Data Cleaning: Data cleaning includes reducing noise and handling missing values. Smoothing techniques remove noise, and the problem of missing values is solved by replacing a missing value with the most frequently occurring value for that attribute.
 Relevance Analysis: Irrelevant attributes may also be present in the database. Correlation analysis is used to determine whether two attributes are related.
 Data Transformation and Reduction: Any of the methods listed below can be used to
transform the data.
o Normalization: Normalization is the process of scaling all values for a given attribute so that they lie within a narrow range. When neural networks or methods involving distance measurements are used in the learning process, normalization is performed.
o Generalization: The data can also be transformed by generalizing it to higher-level concepts. We can use concept hierarchies for this.
Other data reduction techniques include wavelet processing, binning, histogram analysis, and
clustering.

Linear Regression
Linear regression defines the relationship between a dependent variable and one or more independent variables using a linear equation that best describes the linear trend in the observed data. The simplest case is called simple linear regression, which has one independent variable and is written as

y = β0 + β1x + ε

where:
 y is the dependent variable,
 x is the independent variable,
 β0 is the intercept of the linear model,
 β1 is the slope, and
 ε is the error term.
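As an illustration, the following short Python sketch fits a simple linear regression to a few made-up house-area/price pairs using scikit-learn; the feature, the values, and the library choice are only assumptions for demonstration.

# A minimal sketch of simple linear regression, e.g. predicting house price
# from total area; the sample values are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: total area (sq. ft.) vs. price (in thousands)
X = np.array([[800], [1000], [1200], [1500], [1800]])   # independent variable x
y = np.array([120, 150, 178, 220, 262])                 # dependent variable y

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1 (slope):", model.coef_[0])
print("predicted price for 1,400 sq. ft.:", model.predict([[1400]])[0])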

Nonlinear Regression

Nonlinear regression models relate the independent and dependent variables through a nonlinear equation. They are helpful when no linear relationship exists between the independent and dependent variables. Many different forms of nonlinear regression model are possible, and the choice depends on the kind of curve that will fit the data best. An example of a nonlinear model is:

f(x, β) = β1x / (β2 + x)
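The sketch below is one possible way to fit such a nonlinear model, f(x, β) = β1x / (β2 + x), with SciPy's curve_fit; the data values and the initial guess p0 are illustrative assumptions.

# A minimal sketch of nonlinear regression with scipy's iterative least squares.
import numpy as np
from scipy.optimize import curve_fit

def f(x, b1, b2):
    # nonlinear in the parameter b2: the response saturates as x grows
    return b1 * x / (b2 + x)

x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([0.9, 1.6, 2.4, 3.1, 3.6, 3.8])

# an initial guess p0 is usually needed for the iterative fit
params, _ = curve_fit(f, x, y, p0=[4.0, 2.0])
print("estimated beta1, beta2:", params)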
How is Linear Regression different from Nonlinear Regression Models?
Linear regression assumes a linear relationship between the predictor(s) and the response
variable, represented by an equation like y=β0 +β1 x. It is linear in parameters, meaning the
change in the response variable is directly proportional to the change in predictor variables. The
parameters are estimated using the least squares method, which minimizes the sum of squared
differences between observed and predicted values, making it computationally simpler and
easier to interpret. In contrast, nonlinear regression models can capture more complex
relationships and are not linear in parameters, necessitating iterative methods for parameter
estimation. These models provide flexibility to fit data better but are more complex and
computationally intensive, with coefficients that are harder to interpret directly.
Brief Explanation of Features

1. Linear Relationship: Non-linear regression allows for a curve and a more complex pattern,
while linear regression assumes a straight-line relationship between the independent and
dependent variables.
2. Equation Form: The linear regression equation is that of a straight line. Nonlinear regression equations vary significantly in form, depending on the curve that fits the data.
3. Model Complexity: Linear regression models are easy to understand, while nonlinear models may require more complex fitting and interpretation.
4. Interpretability: Linear models are generally easier to interpret because of their simplicity. Non-linear models may achieve better accuracy, but often at the expense of easy interpretability.
5. Computation: Linear regression generally reduces to solving a set of linear equations, so it is not computationally demanding. Nonlinear regression often requires iteration, which significantly increases the computation needed.
6. Convergence: Convergence is guaranteed in linear regression through ordinary least squares (OLS). Nonlinear regression is not guaranteed to converge and is highly dependent on initial values and other parameters.
7. Flexibility: A linear model can capture only a linear relationship, whereas nonlinear models can capture a much wider range of data patterns.
8. Sensitivity to Outliers: Linear regression is highly sensitive to outliers, while the effect of outliers in nonlinear regression depends heavily on the particular model specification used.
Advantages and Disadvantages of Linear Regression and Nonlinear Regression
Linear Regression:
 Advantages:
o Simple and easy to apply.
o Computationally efficient.
o Well suited to data with an approximately linear relationship.
 Disadvantages:
o Limited to linear relationships between variables.
o Very sensitive to outliers.
Non-linear Regression:
 Advantages:
o Can model complex relationships.
o Flexible in handling several kinds of data patterns.
 Disadvantages:
o Computationally intensive.
o May require iterative techniques, possibly with careful initialization.
o Can be hard to interpret.
Applications of Linear Regression and Nonlinear Regression Models
Linear Regression:
 Predicting sales, stock prices, and other economic indicators from past data.
 Estimating relationships between physical quantities in scientific research.
 Simple risk assessment in finance and insurance.
Nonlinear Regression:
 Modeling population growth in biology, which typically follows a non-linear pattern.
 Fitting pharmacokinetic models in medical research.
 Studying complex behavior in engineering, such as stress-strain relationships.

Evaluating the accuracy of classifiers


Data Mining can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. In this section, we will see techniques to evaluate the accuracy of classifiers.

HoldOut

In the holdout method, the given dataset is randomly divided into three subsets:

 A training set is a subset of the dataset used to build predictive models.
 The validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.
 Test sets, or unseen examples, are the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Basically, two-thirds of the data are allocated to the training set and the remaining one-third is allocated to the test set.
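A minimal sketch of the holdout split, assuming scikit-learn, synthetic data, and a decision tree as a stand-in classifier:

# Hold out one-third of the data for testing and train on the rest.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)

# roughly two-thirds for training, one-third held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, clf.predict(X_test)))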
Random Subsampling
 Random subsampling is a variation of the holdout method in which the holdout method is repeated K times.
 Each repetition involves randomly splitting the data into a training set and a test set.
 The model is trained on the training set, and the mean square error (MSE) is obtained from the predictions on the test set.
 Because the MSE depends on the particular split, relying on a single split is not recommended; a new split can give a new MSE.
 The overall error is calculated as E = (1/K) Σ (i = 1 to K) E_i.

Cross-Validation

 K-fold cross-validation is used when there is only a limited amount of data available, to achieve an unbiased estimate of the model's performance.
 Here, we divide the data into K subsets of equal size.
 We build the model K times, each time leaving out one of the subsets from training and using it as the test set.
 If K equals the sample size, this is called Leave-One-Out cross-validation.
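A brief sketch of 10-fold cross-validation, again assuming scikit-learn, synthetic data, and an arbitrary classifier:

# Evaluate a classifier with K = 10 folds; K = n would give leave-one-out.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())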
Bootstrapping

 Bootstrapping is a technique used to make estimations from the data by taking the average of estimates obtained from smaller data samples.
 The bootstrapping method involves iteratively resampling a dataset with replacement.
 Rather than estimating a statistic only once on the complete data, resampling lets us estimate it many times.
 Repeating this multiple times yields a vector of estimates.
 From these estimates, bootstrapping can compute the variance, expected value, and other relevant statistics.
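A small sketch of bootstrapping the mean of a synthetic sample; the statistic, the sample, and the number of resamples are arbitrary choices:

# Resample with replacement many times and collect an estimate from each resample.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # illustrative sample

estimates = []
for _ in range(1000):
    resample = rng.choice(data, size=len(data), replace=True)
    estimates.append(resample.mean())

estimates = np.array(estimates)
print("bootstrap estimate of the mean:", estimates.mean())
print("bootstrap estimate of its variance:", estimates.var())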

Cluster Analysis
The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering. A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be
treated collectively as one group and so may be considered as a form of data compression. Cluster
analysis tools based on k-means, k-medoids, and several other methods have also been built into many
statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.
Applications: Cluster analysis has been widely used in numerous applications, including market
research, pattern recognition, data analysis, and image processing. In business, clustering can help
marketers discover distinct groups in their customer bases and characterize customer groups based
on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar
functionality, and gain insight into structures inherent in populations. Clustering may also help in
the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
Typical Requirements of Clustering in Data Mining:
 Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
 Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
 Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
 High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
 Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.
Major Clustering Methods:
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
Partitioning Methods: A partitioning method constructs k partitions of the data, where each
partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together
satisfy the following requirements: Each group must contain at least one object, and Each object
must belong to exactly one group. A partitioning method creates an initial partitioning. It then uses
an iterative relocation technique that attempts to improve the partitioning by moving objects from
one group to another. The general criterion of a good partitioning is that objects in the same
cluster are close or related to each other, whereas objects of different clusters are far apart or very
different.
Hierarchical Methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.
 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. There are two approaches to improving the quality of hierarchical clustering:
 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.
Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
Grid-Based Methods: Grid-based methods quantize the object space into a finite number of cells
that form a grid structure. All of the clustering operations are performed on the grid structure, i.e.,
on the quantized space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number of cells in
each dimension in the quantized space. STING is a typical example of a grid-based method. Wave
Cluster applies wavelet transformation for clustering analysis and is both grid-based and density-
based.
Model-Based Methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.
Tasks in Data Mining:
 Clustering High-Dimensional Data
 Constraint-Based Clustering
Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications require the analysis
of objects containing a large number of features or dimensions. For example, text documents may
contain thousands of terms or keywords as features, and DNA micro array data may provide
information on the expression levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. Therefore, a different clustering methodology needs to be developed for high-dimensional data. CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space. Frequent pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.
Constraint-Based Clustering: It is a clustering approach that performs clustering by
incorporation of user-specified or application-oriented constraints. A constraint expresses a user’s
expectation or describes properties of the desired clustering results, and provides an effective
means for communicating with the clustering process. Various kinds of constraints can be
specified, either by a user or as per application requirements. Examples include spatial clustering in the presence of obstacles, clustering under user-specified constraints, and semi-supervised clustering, which uses pairwise constraints to improve the quality of the resulting clustering.
Classical Partitioning Methods:
The most well-known and commonly used partitioning methods are:
 The k-Means Method
 The k-Medoids Method
Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters
so that the resulting intracluster similarity is high but the inter-cluster similarity is low. Cluster
similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed
as the cluster’s centroid or center of gravity. The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster mean or
center. For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean. It then computes the
new mean for each cluster. This process iterates until the criterion function converges. Typically,
the square-error criterion is used, defined as

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
The k-means partitioning algorithm:
The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
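For illustration, the following sketch runs k-means with scikit-learn on made-up data; the library's inertia_ attribute corresponds to the square-error criterion E defined above, and the data and choice of k are assumptions.

# Cluster two synthetic blobs with k-means and report the square-error criterion.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print("cluster means (centroids):\n", km.cluster_centers_)
print("square-error criterion E:", km.inertia_)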

The k-Medoids Method:


The k-means algorithm is sensitive to outliers because an object with an extremely large value
may substantially distort the distribution of data. This effect is particularly exacerbated due to the
use of the square-error function. Instead of taking the mean value of the objects in a cluster as a
reference point, we can pick actual objects to represent the clusters, using one representative
object per cluster. Each remaining object is clustered with the representative object to which it is
the most similar. The partitioning method is then performed based on the principle of minimizing
the sum of the dissimilarities between each object and its corresponding reference point. That is,
an absolute-error criterion is used, defined as

E = Σ (j = 1 to k) Σ (p ∈ Cj) |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects by non representative objects continues as long as the quality of the
resulting clustering is improved. This quality is estimated using a cost function that measures the
average dissimilarity between an object and the representative object of its cluster. To determine whether a nonrepresentative object, orandom, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.
Case 1: p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.
Case 2: p currently belongs to representative object oj. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.
Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is still closest to oi, then the assignment does not change.
Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by orandom as a representative object and p is closest to orandom, then p is reassigned to orandom.

The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
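The simplified sketch below illustrates the medoid-swap idea: it computes the absolute-error criterion E and accepts a random swap of a representative object only when E decreases. It is not the full PAM algorithm, and the data and parameters are illustrative assumptions.

# Evaluate the absolute-error criterion for the current medoids and a candidate swap.
import numpy as np

def absolute_error(X, medoid_idx):
    # E = sum over objects of the distance to the nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

medoids = list(rng.choice(len(X), size=2, replace=False))  # arbitrary start
for _ in range(100):                                       # try random swaps
    o_j = rng.integers(len(medoids))                       # medoid to replace
    o_random = rng.integers(len(X))                        # candidate object
    if o_random in medoids:
        continue
    trial = medoids.copy()
    trial[o_j] = o_random
    if absolute_error(X, trial) < absolute_error(X, medoids):
        medoids = trial                                     # keep the better set
print("chosen medoids:\n", X[medoids], "\nE =", absolute_error(X, medoids))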
CLARANS (Clustering Large Applications based upon RANdomized Search) − The CLARANS algorithm combines both PAM and CLARA by searching only a subset of the dataset, and it does not confine itself to a single sample at any given time. While CLARA has a constant sample at each phase of the search, CLARANS draws a sample with some randomness in every phase of the search. The clustering process can be presented as searching a graph where each node is a possible solution, i.e., a set of k medoids. The clustering obtained after replacing a single medoid is called the neighbor of the current clustering.

CLARANS is an efficient medoid-based clustering algorithm. The k-medoids algorithm is an


adaptation of the k-means algorithm. Rather than calculate the mean of the items in each cluster, a
representative item, or medoid, is chosen for each cluster at each iteration. In CLARANS, the
process of finding k medoids from n objects is viewed abstractly as searching through a certain
graph. In the graph, a node is represented by a set of k objects as selected medoids. Two nodes are
neighbors if their sets differ by only one object. In each iteration, CLARANS considers a set of
randomly chosen neighbor nodes as candidates for the new medoids. We move to the neighbor node if the neighbor is a better choice for the medoids. Otherwise, a local optimum is discovered. The entire process is repeated multiple times to find better local optima.
CLARANS has two parameters: the maximum number of neighbors examined (maxNeighbor) and the number of local minima obtained (numLocal). The higher the value of maxNeighbor, the closer CLARANS is to PAM and the longer each search for a local minimum takes, but the quality of such a local minimum is higher and fewer local minima need to be obtained.

Parameters:
 Data: the data set.
 Distance: the distance/dissimilarity measure.
 K: the number of clusters.
 MaxNeighbor: the maximum number of neighbors examined during a random search of local minima.
 NumLocal: the number of local minima to search for.
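A condensed, illustrative CLARANS-style search is sketched below: it treats a set of k medoids as a graph node, examines up to maxNeighbor random neighbors before declaring a local minimum, and restarts numLocal times. It is a simplification of the published algorithm, with made-up data and parameter values.

# Randomized medoid search: nodes are medoid sets, neighbors differ in one medoid.
import numpy as np

def cost(X, medoids):
    d = np.linalg.norm(X[:, None, :] - X[list(medoids)][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(X, k=2, num_local=3, max_neighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(num_local):                      # numLocal restarts
        current = set(rng.choice(len(X), size=k, replace=False).tolist())
        tried = 0
        while tried < max_neighbor:                 # random neighbor search
            out = rng.choice(list(current))         # medoid to swap out
            cand = rng.integers(len(X))             # object to swap in
            if cand in current:
                continue
            neighbor = (current - {out}) | {cand}
            if cost(X, neighbor) < cost(X, current):
                current, tried = neighbor, 0        # move and reset the counter
            else:
                tried += 1
        if cost(X, current) < best_cost:            # keep the best local minimum
            best, best_cost = current, cost(X, current)
    return sorted(best), best_cost

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
print(clarans(X))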
Hierarchical Clustering Methods:
A hierarchical clustering method works by grouping data objects into a tree of clusters. The
quality of a pure hierarchical clustering method suffers from its inability to perform adjustment
once a merge or split decision has been executed. That is, if a particular merge or split decision
later turns out to have been a poor choice, the method cannot backtrack and correct it. Hierarchical
clustering methods can be further classified as either agglomerative or divisive, depending on
whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges these
atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied. Most hierarchical clustering methods belong to this
category. They differ only in their definition of intercluster similarity.
Divisive hierarchical clustering:
This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.
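As a quick illustration of the agglomerative (bottom-up) strategy, the sketch below uses SciPy's linkage function and then cuts the resulting tree into a chosen number of clusters; the linkage method, data, and cluster count are arbitrary assumptions.

# Merge the closest clusters repeatedly, then cut the tree at 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average")                  # intercluster similarity: average link
labels = fcluster(Z, t=2, criterion="maxclust")   # stop when 2 clusters remain
print(labels)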

Chameleon: A Hierarchical Clustering Algorithm Using Dynamic modelling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the
similarity between pairs of clusters. It was derived based on the observed weaknesses of two
hierarchical clustering algorithms: ROCK and CURE. ROCK and related schemes emphasize
cluster interconnectivity while ignoring information regarding cluster proximity. CURE and
related schemes consider cluster proximity yet ignore cluster interconnectivity.

In Chameleon, cluster similarity is assessed based on how well-connected objects are within a
cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity
is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied
model and can automatically adapt to the internal characteristics of the clusters being merged. The
merge process facilitates the discovery of natural and homogeneous clusters and applies to all
types of data as long as a similarity function can be specified.

“How does Chameleon work?” The main approach of Chameleon is illustrated in Figure 7.9.
Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each
vertex of the graph represents a data object, and there exists an edge between two vertices
(objects) if one object is among the k-most-similar objects of the other. The edges are weighted to
reflect the similarity between objects. Chameleon uses a graph partitioning algorithm to partition
the k-nearest-neighbor graph into a large number of relatively small sub clusters. It then uses an
agglomerative hierarchical clustering algorithm that repeatedly merges sub clusters based on their
similarity. To determine the pairs of most similar sub clusters, it takes into account both the
interconnectivity as well as the closeness of the clusters. We will give a mathematical definition
for these criteria shortly.

Note that the k-nearest-neighbor graph captures the concept of neighborhood dynamically: the
neighborhood radius of an object is determined by the density of the region in which the object
resides. In a dense region, the neighborhood is defined narrowly; in a sparse region, it is defined
more widely. This tends to result in more natural clusters, in comparison with density-based
methods like DBSCAN (described in Section 7.6.1) that instead use a global neighborhood.
Moreover, the density of the region is recorded as the weight of the edges. That is, the edges of a
dense region tend to weigh more than that of a sparse region.

The graph-partitioning algorithm partitions the k-nearest-neighbor graph such that it minimizes
the edge cut. That is, a cluster C is partitioned into sub clusters Ci and Cj so as to minimize the
weight of the edges that would be cut should C be bisected into Ci and Cj. Edge cut is denoted
EC{Ci, Cj} and assesses the absolute interconnectivity between clusters Ci and Cj.

Chameleon determines the similarity between each pair of clusters Ci and Cj according to their
relative interconnectivity, RI (Ci, Cj), and their relative closeness, RC(Ci, Cj):

The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined as the
absolute interconnectivity between Ci and Cj, normalized with respect to the internal
interconnectivity of the two clusters, Ci and Cj. That is,

RI(Ci, Cj) = |EC{Ci, Cj}| / ((|ECCi| + |ECCj|) / 2)
where EC{Ci, Cj} is the edge cut, defined as above, for a cluster containing both Ci and Cj. Similarly,
ECCi (or ECCj ) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly
equal parts.

The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the absolute closeness
between Ci and Cj, normalized with respect to the internal closeness of the two clusters, Ci and Cj.
It is defined as

RC(Ci, Cj) = SEC{Ci, Cj} / ((|Ci| / (|Ci| + |Cj|)) SECCi + (|Cj| / (|Ci| + |Cj|)) SECCj)

where SEC{Ci, Cj} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and SECCi (or SECCj) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj). Chameleon has been shown to have greater power at discovering arbitrarily shaped clusters of high quality than several well-known algorithms such as BIRCH and density-based DBSCAN. However, the processing cost for high-dimensional data may require O(n²) time for n objects in the worst case.

DBSCAN is a density-based clustering algorithm that groups data points that are closely
packed together and marks outliers as noise based on their density in the feature space. It
identifies clusters as dense regions in the data space, separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact and spherical,
DBSCAN excels in handling real-world data irregularities such as:
 Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
 Noise and Outliers: It effectively identifies and handles noise points without assigning them
to any cluster.
Compared with K-Means and hierarchical clustering, which handle compact, spherical clusters with varying noise tolerance, DBSCAN manages arbitrary-shaped clusters and excels at noise handling.
Key Parameters in DBSCAN
 1. eps: This defines the radius of the neighborhood around a data point.
If the distance between two points is less than or equal to eps, they are considered neighbors.
Choosing the right eps is crucial:
 If eps is too small, most points will be classified as noise.
 If eps is too large, clusters may merge, and the algorithm may fail to distinguish between
them.
A common method to determine eps is by analyzing the k-distance graph.
 2. MinPts: This is the minimum number of points required within the eps radius to form a
dense region.
A general rule of thumb is to set MinPts >= D+1, where D is the number of dimensions in the
dataset. For most cases, a minimum value of MinPts = 3 is recommended.
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. core points, which have a sufficient number of neighbors within a specified radius (epsilon)
2. border points, which are near core points but lack enough neighbors to be core points
themselves
3. noise points, which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting density-reachable points,
DBSCAN forms clusters without relying on rigid assumptions about their shape or size.
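A minimal DBSCAN sketch with scikit-learn, where eps and min_samples play the roles of eps and MinPts described above and the label -1 marks noise; the data are synthetic and the parameter values are assumptions.

# Cluster two dense blobs plus scattered points; noise is labeled -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])      # a few scattered outliers

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("cluster labels found:", set(db.labels_))   # -1 marks noise points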

Grid-Based Method in Data Mining

The grid-based clustering method uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure, on which all the clustering operations are performed. The main appeal of this method is its fast processing time, which is generally independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.

Examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform approach; and CLIQUE, which uses a grid- and density-based approach for clustering in high-dimensional data space.
Basics of Grid-Based Methods

When we deal with datasets that have multidimensional characteristics, a grid-based approach is helpful. Such data includes spatial data such as geographical information, image data, or datasets with multiple attributes. Dividing this data space into a grid provides several advantages, described below.

1. Data Partitioning
This is a clustering step that classifies the data into many groups based on the characteristics and similarity of the data. With the help of data analysis, we can specify the number of clusters to be generated. With the partitioning method, the data is organized into user-specified (K) partitions, in which each partition represents a cluster and a particular region. Many algorithms are based on the data partitioning method, such as K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications).
2. Data Reduction
This technique is used in data mining to reduce the size of a dataset while still preserving the most important information. We use data reduction when the dataset is too large to be processed efficiently or when it contains a large amount of irrelevant or redundant information.
3. Local Pattern Discovery
With the help of the grid-based method, we can identify local patterns or trends within the data. By analyzing the data within individual cells, patterns and relationships that remain hidden when looking at the entire dataset can be uncovered. This is especially valuable for finding localized phenomena within data.

1. Scalability
This method is known for its scalability. It can handle large datasets, making it particularly useful when dealing with high-dimensional data. The partitioning of space inherently reduces dimensionality, simplifying analysis.
2. Density Estimation
Density-based clustering is one of the most popular unsupervised learning methodologies used in model building and machine learning algorithms. Data points in the low-density region separating two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighbourhood of the object. If the ε-neighbourhood of the object contains at least a minimum number MinPts of objects, it is called a core object.
3. Clustering and Classification
The grid-based mining method divides the space of instances into grid cells. Clustering techniques are then applied using the cells of the grid, instead of individual data points, as the base units. The biggest advantage of this method is that it improves processing time.
4. Grid-Based Indexing
We can use grid-based indexing, which enables efficient access and retrieval of data. These structures organize the data based on the grid partitions, enhancing query performance and retrieval.

Popular Grid-Based Methods

Several popular methods are based on the grid-based approach. Each has its own strengths and applications. Some of these methods are described below.

1. K-Means Clustering Algorithm


It is an unsupervised learning algorithm used to solve the clustering problem in data mining. Here, the term K defines the number of predefined clusters to create: if the value of K is 2, there will be two clusters; if the value of K is 3, there will be three clusters, and so on. With its help, we can cluster the data into different groups, providing a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is also a centroid-based algorithm in which each cluster is associated with a centroid. The main
purpose of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

This algorithm takes the unlabeled dataset, divides the data into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

This algorithm mainly performs two tasks:

o The first is to determine the best values for the K centre points or centroids by an iterative process.
o The second is to assign each data point to its closest k-centre. The data points near a particular k-centre form a cluster.
How does the K-Means Algorithm Work?

This algorithm follows these steps:

o Step 1: Select the number K to decide the number of clusters.
o Step 2: Select K random points or centroids (they can be points other than those in the input dataset).
o Step 3: Assign each data point to its closest centroid, which will form the predefined K clusters.
o Step 4: Calculate the variance and place a new centroid for each cluster.
o Step 5: Repeat Step 3, reassigning each data point to the new closest centroid of each cluster.
o Step 6: If any reassignment occurs, go to Step 4; otherwise, go to FINISH.
o Step 7: The model is ready.
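The steps above can be mirrored in a short from-scratch sketch; it is for illustration only, not a production implementation, and the data and k are made up.

# Random centroids, assign to the closest centroid, recompute means, repeat
# until the assignments stop changing.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Step 2
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)                          # Step 3
        if np.array_equal(new_labels, labels):                 # Step 6: no change
            break
        labels = new_labels
        for j in range(k):                                     # Steps 4-5
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print("centroids:\n", centroids)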

2. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

It is one of the most popular unsupervised methods for model building and machine learning algorithms. In this method, data points in the low-density region separating two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighbourhood of the object. If the ε-neighbourhood of the object contains at least a minimum number MinPts of objects, it is called a core object.

Density-Based Clustering - Background

Two main parameters are used in density-based clustering.

o EPS: It is mainly considered as the maximum radius of the neighbourhood.


o MinPts: MinPts refers to the minimum number of points in an Eps neighbourhood of that
point.
o NEps(i): {k belongs to D and dist(i, k) <= Eps}
o Directly density reachable: A point i is directly density-reachable from a point k with respect to Eps and MinPts if i belongs to NEps(k) and k satisfies the core point condition below.
o Core point condition:
|NEps(k)| >= MinPts

3. STING (Statistical Information Grid)

It is also a grid-based clustering technique. It uses a multidimensional grid data structure that quantizes the space into a finite number of cells. The main factor of this technique is the value space surrounding the data points. The spatial area of STING is divided into rectangular cells, with several levels of cells at different resolution levels; each high-level cell is further divided into several low-level cells.

The STING contains all the data related to the attributes in each cell, such as mean, maximum, and
minimum values, which are precomputed and stored as statistical parameters. These statistical
parameters are useful for query processing and other data analysis tasks.

How STING Work:

The working procedure of STING follows several steps. These are as follows.

o Step 1: First, we have to Determine a layer to begin the process.


o Step 2: For each cell, we have to calculate the confidence interval or estimated
probability range that this cell is relevant to the query.
o Step 3: Then, we must label the cell as relevant or irrelevant based on the calculated interval.
o Step 4: If the layer is the bottom layer, go to point 6; otherwise, go to point 5.
o Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells
that form the relevant cell of the high-level layer.
o Step 6: If the specification required for the query is met, then we have to go to point 8;
otherwise, go to point 7.
o Step 7: We must retrieve data that fall into the relevant cells and do further processing.
Return the result that meets the requirement of the query. Go to point 9.
o Step 8: Find the regions of relevant cells. Return those regions that meet the query's
requirements. Go to point 9.
o Step 9: Stop or terminate.
Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which generally identifies groups of like objects, conceptual clustering goes one step further by also discovering characteristic descriptions for each group, where each group defines a concept or class.

Therefore, conceptual clustering is a two-step process: clustering is performed first, followed by characterization. Thus, clustering quality is not solely a function of the individual objects. Most techniques of conceptual clustering adopt a statistical approach that uses probability measurements in deciding the concepts or clusters.

Probabilistic descriptions are generally used to define each derived concept. COBWEB is a
famous and simple method of incremental conceptual clustering. Its input objects are defined by
categorical attribute-value pairs. COBWEB makes a hierarchical clustering in the form of a
classification tree.

A classification tree differs from a decision tree. Each node in a classification tree defines a concept and includes a probabilistic description of that concept, which summarizes the objects classified under the node. The probabilistic description contains the probability of the concept and conditional probabilities of the form P(Ai = vij | Ck), where Ai = vij is an attribute-value pair (the ith attribute takes its jth possible value) and Ck is the concept class.

COBWEB uses a heuristic evaluation measure known as category utility to guide the construction of the tree. Category utility (CU) is defined as

CU = (1/n) Σ (k = 1 to n) P(Ck) [ Σi Σj P(Ai = vij | Ck)² − Σi Σj P(Ai = vij)² ]

where n is the number of nodes, concepts, or "categories" forming a partition, {C1, C2, ..., Cn}, at the given level of the tree. In other terms, category utility is the increase in the expected number of attribute values that can be correctly guessed given a partition (where this expected number corresponds to the term P(Ck) Σi Σj P(Ai = vij | Ck)²) over the expected number of correct guesses with no such knowledge (corresponding to the term Σi Σj P(Ai = vij)²). Although we do not have room to show the derivation here, category utility rewards intraclass similarity and interclass dissimilarity, where:

Intraclass similarity − It is the probability P(Ai = vij | Ck). The higher this value is, the higher the proportion of class members that share this attribute-value pair and the more predictable the pair is of class members.

Interclass dissimilarity − It is the probability P(Ck | Ai = vij). The higher this value is, the fewer the objects in contrasting classes that share this attribute-value pair and the more predictive the pair is of the class.

COBWEB descends the tree along a suitable path, updating counts along the way, in search of the “best host” or node at which to classify the object. This decision depends on temporarily
locating the object in each node and evaluating the category utility of the resulting partition. The
placement that results in the highest category utility should be the best host for the object.

Constraint-Based Cluster Analysis:


Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.
Depending on the nature of the constraints, constraint-based clustering may adopt rather different
approaches. There are a few categories of constraints.
 Constraints on individual objects: We can specify constraints on the objects to be clustered. In a
real estate application, for example, one may like to spatially cluster only those luxury mansions
worth over a million dollars. This constraint confines the set of objects to be clustered. It can
easily be handled by preprocessing after which the problem reduces to an instance of
unconstrained clustering.
 Constraints on the selection of clustering parameters: A user may like to set a desired range for
each clustering parameter. Clustering parameters are usually quite specific to the given clustering
algorithm. Examples of parameters include k, the desired number of clusters in a k-means
algorithm; or ε, the radius, and MinPts, the minimum number of points, in the DBSCAN algorithm.
Although such user-specified parameters may strongly influence the clustering results, they are
usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not
considered a form of constraint-based clustering.
 Constraints on distance or similarity functions: We can specify different distance or similarity
functions for specific attributes of the objects to be clustered, or different distance measures for
specific pairs of objects. When clustering sportsmen, for example, we may use different weighting
schemes for height, body weight, age, and skill level. Although this will likely change the mining
results, it may not alter the clustering process per se. However, in some cases, such changes may
make the evaluation of the distance function nontrivial, especially when it is tightly intertwined
with the clustering process.
 User-specified constraints on the properties of individual clusters: A user may like to specify
desired characteristics of the resulting clusters, which may strongly influence the clustering
process.
 Semi-supervised clustering based on partial supervision: The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the
form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different
cluster). Such a constrained clustering process is called semi-supervised clustering.
Outlier Analysis: There exist data objects that do not comply with the general behavior or model
of the data. Such data objects, which are grossly different from or inconsistent with the remaining
set of data, are called outliers. Many data mining algorithms try to minimize the influence of
outliers or eliminate them altogether. This, however, could result in the loss of important hidden
information because one person’s noise could be another person’s signal. In other words, the
outliers may be of particular interest, such as in the case of fraud detection, where outliers may
indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining. It can be used in fraud detection, for example, by detecting unusual
usage of credit cards or telecommunication services. In addition, it is useful in customized
marketing for identifying the spending behavior of customers with extremely low or extremely
high incomes, or in medical analysis for finding unusual responses to various medical treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or
inconsistent with respect to the remaining data. The outlier mining problem can be viewed as two
subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
 Statistical Distribution-Based Outlier Detection
 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection
Statistical Distribution-Based Outlier Detection: The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers. A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H : oi ∈ F, where i = 1, 2, ..., n.

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A
discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the
distribution F. Different test statistics have been proposed for use as a discordancy test, depending
on the available knowledge of the data. Assuming that some statistic, T, has been chosen for
discordancy testing, and the value of the statistic for object oi is vi, then the distribution of T is
constructed. Significance probability, SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently
small, then oi is discordant and the working hypothesis is rejected. An alternative hypothesis, H,
which states that oi comes from another distribution model, G, is adopted. The result is very much
dependent on which model F is chosen because oi may be an outlier under one model and a
perfectly valid value under another. The
alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier. There are different kinds of alternative distributions.
 Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G: H : oi ∈ G, where i = 1, 2, ..., n. F and G may be different distributions or differ only in parameters of the same distribution. There are constraints on the form of the G distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.
 Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is that each object comes from a mixture of the initial distribution F and the contaminating distribution G.
 Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
 Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
 Consecutive procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
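As a simple illustration of a discordancy test, the sketch below assumes a normal working distribution F, standardizes each value, and flags objects whose significance probability SP(vi) = Prob(T > vi) is very small; the data, the statistic, and the threshold are made-up assumptions.

# Flag objects whose standardized value is extremely unlikely under a normal model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = np.append(rng.normal(100, 5, 200), [135.0])   # one planted outlier

mean, std = data.mean(), data.std(ddof=1)
t = np.abs(data - mean) / std                 # test statistic T per object
sp = stats.norm.sf(t)                         # SP(v_i) = Prob(T > v_i)
print("discordant objects:", data[sp < 0.001])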
Distance-Based Outlier Detection: The notion of distance-based outliers was introduced to counter
the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a
fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas
behind discordancy testing for various standard distributions. Distance-based outlier detection
avoids the excessive computation that can be associated with fitting the observed distribution into
some standard distribution and in selecting discordancy tests. For many discordancy tests, it can
be shown that if an object, o, is an outlier according to the given test, then o is also a DB(pct,
dmin)-outlier for some suitably defined pct and dmin. For example, if objects that lie three or
more standard deviations from the mean are considered to be outliers, assuming a normal
distribution, then this definition can be generalized to a DB(0.9988, 0.13σ)-outlier. Several
efficient algorithms for mining distance-based outliers have been developed.
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
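A naive nested-loop-style sketch of DB(pct, dmin) outlier detection; the parameter values and data are illustrative assumptions.

# An object is a DB(pct, dmin) outlier if at least a fraction pct of the other
# objects lie farther than dmin from it.
import numpy as np

def db_outliers(X, pct=0.95, dmin=3.0):
    out = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        far = np.sum(d > dmin) / (len(X) - 1)   # fraction of distant objects
        if far >= pct:
            out.append(i)
    return out

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[9.0, 9.0]]])  # planted outlier
print("DB(pct, dmin) outlier indices:", db_outliers(X))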
Density-Based Local Outlier Detection: Statistical and distance-based outlier detection both depend on the overall or global distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions. To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density. These are defined as follows: The k-distance of an object p is the maximal distance that p gets from its k nearest neighbors. This distance is denoted as k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D, such that for at least k objects, o' ∈ D, it holds that d(p, o') ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer to p than o, and for at most k−1 objects, o'' ∈ D, it holds that d(p, o'') < d(p, o).
Deviation-Based Outlier Detection: Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In this section, we study two techniques for deviation-based outlier detection. The first sequentially compares objects in a set, while the second employs an OLAP data cube approach.
Sequential Exception Technique.
