Unit- IV
Partitional Algorithms
Nonhierarchical or partitional clustering creates the clusters in one step as opposed to
several steps. Only one set of clusters is created, and the user must input the desired number k
of clusters.
In addition, some metric or criterion function is used to determine the goodness of any
proposed solution. This measure of quality could be the average distance between
clusters or some other metric. The solution with the best value for the criterion function is
the clustering solution used.
One common measure is a squared error metric, which measures the squared distance
from each point to the centroid of its associated cluster:
seK = Σ_{j=1}^{k} Σ_{t ∈ Kj} ‖ t − CKj ‖²
where CKj denotes the centroid of cluster Kj.
Partitional algorithms suffer from a combinatorial explosion due to the number of
possible solutions. Thus, most algorithms examine only a small subset of all possible clusterings.
The most well-known algorithms are:
1. Minimum Spanning Tree
2. Squared Error Clustering Algorithm
3. K-Means Clustering
4. Nearest Neighbor Algorithm
5. PAM Algorithm
6. Bond Energy Algorithm
7. Clustering with Genetic Algorithms
8. Clustering with Neural Networks
Minimum Spanning Tree
The MST procedure produces a minimum spanning tree given an adjacency (distance) matrix as input.
Clusters are then formed by removing the k − 1 edges with the largest distances from the tree, so
that k connected components remain.
A partitional MST algorithm is a very simplistic approach, but it illustrates how partitional
algorithms work.
Since the clustering problem is to define a mapping, the output of this algorithm shows the
clusters as a set of ordered pairs (ti, j), where f(ti) = Kj.
The time complexity of this algorithm is again dominated by the MST procedure, which is
O(n²). At most k − 1 edges will be removed, so the last three steps of the algorithm, assuming
each step takes constant time, take only O(k − 1) time.
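As a rough Python sketch of this idea (not the textbook's pseudocode), the code below builds an MST from a distance matrix with Prim's algorithm, removes the k − 1 largest MST edges, and labels the resulting connected components as clusters; the function names and the toy distance matrix are illustrative.

```python
# Illustrative partitional MST clustering: build a minimum spanning tree from a
# distance (adjacency) matrix, delete the k-1 largest MST edges, and treat the
# remaining connected components as the k clusters.
import heapq

def mst_edges(dist):
    """Prim's algorithm: return MST edges (d, i, j) for a full distance matrix."""
    n = len(dist)
    visited = [False] * n
    edges, heap = [], [(0.0, 0, 0)]
    while heap and len(edges) < n - 1:
        d, u, v = heapq.heappop(heap)
        if visited[v]:
            continue
        visited[v] = True
        if u != v:
            edges.append((d, u, v))
        for w in range(n):
            if not visited[w]:
                heapq.heappush(heap, (dist[v][w], v, w))
    return edges

def mst_clusters(dist, k):
    """Remove the k-1 largest MST edges and label the connected components."""
    n = len(dist)
    keep = sorted(mst_edges(dist))[: n - k]          # drop the k-1 largest edges
    adj = {i: [] for i in range(n)}
    for _, u, v in keep:
        adj[u].append(v)
        adj[v].append(u)
    labels, cluster = [-1] * n, 0
    for start in range(n):                           # depth-first label each component
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = cluster
                stack.extend(adj[node])
        cluster += 1
    return labels                                    # labels[i] = cluster of item ti

dist = [[0, 1, 9, 8], [1, 0, 8, 9], [9, 8, 0, 2], [8, 9, 2, 0]]
print(mst_clusters(dist, k=2))                       # e.g. [0, 0, 1, 1]
```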
Squared Error Clustering Algorithm
The squared error for a cluster is the sum of the squared Euclidean distances between each
element in the cluster and the cluster centroid, CKi. Given a cluster Ki, let the set of items mapped
to that cluster be {ti1, ti2, . . . , tim}. The squared error is defined as
seKi = Σ_{j=1}^{m} ‖ tij − CKi ‖²
For each iteration in the squared error algorithm, each tuple is assigned to the cluster with the
closest center. Since there are k clusters and n items, this is an O(kn) operation.
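As a small illustration (not from the text itself), the following sketch computes the per-cluster squared error and the total squared error for a clustering; the helper names are hypothetical.

```python
# Illustrative computation of the squared error measure: the sum of squared
# Euclidean distances from each item in a cluster to that cluster's centroid.
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def squared_error(points, center):
    """seKi = sum of ||tij - CKi||^2 over all items in one cluster."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def total_squared_error(clusters):
    """Sum the per-cluster squared errors over an entire clustering."""
    return sum(squared_error(pts, centroid(pts)) for pts in clusters)

clusters = [[(1.0, 1.0), (2.0, 1.0)], [(8.0, 8.0), (9.0, 9.0)]]
print(total_squared_error(clusters))   # 0.5 + 1.0 = 1.5
```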
K-Means Clustering
K-means is an iterative clustering algorithm in which items are moved among sets of clusters
until the desired set is reached.
The cluster mean of Ki = {ti1, ti2, . . . , tim} is defined as
mi = (1/m) Σ_{j=1}^{m} tij
This definition assumes that each tuple has only one numeric value as opposed to a tuple with
many attribute values.
This algorithm assumes that the desired number of clusters, k, is an input parameter.
The time complexity of K-means is O(tkn) where t is the number of iterations.
Although the K-means algorithm often produces good results, it is not time-efficient and does
not scale well.
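A minimal K-means sketch under the assumptions of Euclidean distance and convergence when no item changes cluster might look like the following; it is illustrative rather than an optimized implementation, and the names are arbitrary.

```python
# Minimal K-means sketch: repeat the assignment and update steps until no item
# changes cluster or max_iters is reached.
import random

def kmeans(items, k, max_iters=100, seed=0):
    random.seed(seed)
    means = [list(m) for m in random.sample(items, k)]    # start from k random items
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each item goes to the cluster with the closest mean.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((x - m) ** 2 for x, m in zip(p, means[j])))
            for p in items
        ]
        if new_assignment == assignment:                   # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: recompute each cluster mean from its current members.
        for j in range(k):
            members = [p for p, a in zip(items, assignment) if a == j]
            if members:
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

items = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
labels, means = kmeans(items, k=2)
print(labels, means)
```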
Nearest Neighbor Algorithm
An algorithm similar to the single link technique is called the nearest neighbor algorithm. With
this serial algorithm, items are iteratively merged into the existing clusters that are closest. In this
algorithm a threshold, t, is used to determine if items will be added to existing clusters or if a
new cluster is created.
The complexity of the nearest neighbor algorithm actually depends on the number of items. For
each loop, each item must be compared to each item already in a cluster. Obviously, this is n in
the worst case. Thus, the time complexity is O(n²).
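A simple sketch of this serial procedure, assuming Euclidean distance and a user-supplied threshold t, could look like the following; the function names are illustrative.

```python
# Nearest neighbor clustering sketch: each new item joins the cluster of its
# closest already-clustered item if that distance is within the threshold t;
# otherwise it starts a new cluster.
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def nearest_neighbor_clustering(items, t):
    labels = [0]                                     # first item starts cluster 0
    next_label = 1
    for i in range(1, len(items)):
        # Find the closest item among those already placed in a cluster.
        j = min(range(i), key=lambda j: euclidean(items[i], items[j]))
        if euclidean(items[i], items[j]) <= t:
            labels.append(labels[j])                 # join that item's cluster
        else:
            labels.append(next_label)                # too far away: new cluster
            next_label += 1
    return labels

items = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1)]
print(nearest_neighbor_clustering(items, t=1.0))     # [0, 0, 1, 1]
```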
PAM Algorithm
The PAM (partitioning around medoids) algorithm, also called the K-medoids algorithm,
represents a cluster by a medoid.
The PAM algorithm works as follows.
Initially, a random set of k items is taken to be the set of medoids. Then at each step, all items
from the input dataset that are not currently medoids are examined one by one to see if they
should be medoids. That is, the algorithm determines whether there is an item that should replace
one of the existing medoids.
Let Cjih be the cost change for an item tj associated with swapping medoid ti with non-medoid th.
The cost is the change to the sum of all distances from items to their cluster medoids.
There are four cases that must be examined when calculating this cost:
1. tj currently belongs to the cluster of ti and, after the swap, is closest to another existing medoid tm; then Cjih = d(tj, tm) − d(tj, ti).
2. tj currently belongs to the cluster of ti and, after the swap, is closest to the new medoid th; then Cjih = d(tj, th) − d(tj, ti).
3. tj belongs to the cluster of another medoid tm and remains closest to tm; then Cjih = 0.
4. tj belongs to the cluster of another medoid tm but becomes closer to th; then Cjih = d(tj, th) − d(tj, tm).
The total impact on quality of a medoid change, TCih, is then given by
TCih = Σ_{j=1}^{n} Cjih
PAM does not scale well to large datasets because of its computational complexity. For each
iteration, we have k(n − k) pairs of objects i, h for which a cost, TCih, should be determined.
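The sketch below illustrates the swapping idea; for readability it recomputes the total cost for every candidate swap instead of applying the four-case incremental formula for Cjih, so it is an approximation of PAM rather than an efficient implementation, and all names are illustrative.

```python
# Rough PAM-style sketch: start from random medoids, then repeatedly try
# swapping a medoid ti with a non-medoid th, keeping the swap with the most
# negative total cost change TCih (the swap that most reduces the sum of
# distances from items to their closest medoid).
import random

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def total_cost(items, medoids):
    """Sum of each item's distance to its closest medoid (medoids are indices)."""
    return sum(min(euclidean(p, items[m]) for m in medoids) for p in items)

def pam(items, k, seed=0):
    random.seed(seed)
    medoids = set(random.sample(range(len(items)), k))
    while True:
        base = total_cost(items, medoids)
        best_swap, best_tc = None, 0.0
        for i in medoids:
            for h in set(range(len(items))) - medoids:
                candidate = (medoids - {i}) | {h}
                tc_ih = total_cost(items, candidate) - base   # TCih for this swap
                if tc_ih < best_tc:
                    best_swap, best_tc = (i, h), tc_ih
        if best_swap is None:                                 # no improving swap left
            return medoids
        i, h = best_swap
        medoids = (medoids - {i}) | {h}

items = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 5.2)]
print(sorted(pam(items, k=2)))                                # indices of the medoids
```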
CLARA (Clustering LARge Applications) improves on the time complexity of PAM by using
samples of the dataset. The basic idea is that it applies PAM to a sample of the underlying
database and then uses the medoids found as the medoids for the complete clustering.
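A sketch of CLARA's sampling idea is shown below, continuing the previous PAM sketch (the pam(), total_cost(), and euclidean() helpers and the items list are assumed to be defined there); the sample size and parameter names are illustrative choices, not CLARA's published settings.

```python
# CLARA-style sampling sketch: run PAM on several small random samples and keep
# the medoid set that gives the lowest total cost over the *complete* dataset.
import random

def clara(items, k, num_samples=5, sample_size=40):
    best_medoids, best_cost = None, float("inf")
    for s in range(num_samples):
        random.seed(s)                                   # different sample each round
        sample = random.sample(items, min(sample_size, len(items)))
        sample_medoid_idx = pam(sample, k, seed=s)       # PAM on the sample only
        medoid_points = [sample[m] for m in sample_medoid_idx]
        # Evaluate these medoids against the full dataset, not just the sample.
        cost = sum(min(euclidean(p, m) for m in medoid_points) for p in items)
        if cost < best_cost:
            best_medoids, best_cost = medoid_points, cost
    return best_medoids

print(clara(items, k=2))    # reuses the toy items list from the PAM sketch
```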
CLARANS (clustering large applications based upon randomized search) improves on CLARA
by using multiple different samples. In addition to the normal input to PAM, CLARANS
requires two additional parameters: maxneighbor and numlocal. Maxneighbor is the number of
neighbors of a node to which any specific node can be compared, and numlocal is the number of
samples (local searches) to be taken.
Bond Energy Algorithm
The bond energy algorithm (BEA) was developed and has been used in the database
design area to determine how to group data and how to physically place data on disk.
It can be used to cluster attributes based on usage and then perform logical or physical
design accordingly.
With BEA, the affinity (bond) between database attributes is based on common usage.
This bond is used by the clustering algorithm as a similarity measure.
The actual measure counts the number of times the two attributes are used together during a
given period of time. To find this, all common queries must be identified.
In a distributed database, each resulting cluster is called a vertical fragment and may be
stored at a different site from the other fragments.
The basic steps of this clustering algorithm are:
1. Create an attribute affinity matrix in which each entry indicates the affinity between
the two associated attributes. The entries in the similarity matrix are based on the
frequency of common usage of attribute pairs.
2. The BEA then converts this similarity matrix to a BOND matrix in which the entries
represent a type of nearest neighbor bonding based on the probability of co-access. The BEA
algorithm rearranges rows and columns so that similar attributes appear close together in
the matrix.
3. Finally, the designer draws boxes around regions in the matrix with high similarity.
Two attributes Ai and Aj have a high affinity if they are frequently used together in database
applications. At the heart of the BEA algorithm is the global affinity measure. Suppose that a
database schema consists of n attributes {A1, A2, . . . , An}. The global affinity measure, AM, is
defined as
AM = Σ_{i=1}^{n} Σ_{j=1}^{n} aff(Ai, Aj) [ aff(Ai, Aj−1) + aff(Ai, Aj+1) ]
where aff(Ai, Aj) is the affinity between attributes Ai and Aj, and the affinity involving the
nonexistent boundary attributes A0 and An+1 is taken to be zero.
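As an illustration of the measure only (the actual BEA places columns one at a time, greedily, rather than by exhaustive search), the sketch below evaluates AM for a given attribute ordering and brute-forces a tiny, hypothetical affinity matrix to find the ordering that maximizes AM.

```python
# Evaluate the global affinity measure AM for an ordering of attributes, with
# the affinity of the nonexistent neighbours at the two ends taken as zero.
from itertools import permutations

def global_affinity(aff, order):
    """AM for the attribute ordering 'order' over the affinity matrix 'aff'."""
    n = len(order)
    def neighbour_aff(p, q):
        # Affinity between the attributes at positions p and q; 0 if q is off the ends.
        return aff[order[p]][order[q]] if 0 <= q < n else 0
    return sum(
        aff[order[p]][order[q]] * (neighbour_aff(p, q - 1) + neighbour_aff(p, q + 1))
        for p in range(n)
        for q in range(n)
    )

# Hypothetical attribute affinity matrix for four attributes A1..A4.
aff = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]
best = max(permutations(range(4)), key=lambda order: global_affinity(aff, order))
print(best, global_affinity(aff, best))   # ordering that places similar attributes together
```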
Clustering with Genetic Algorithms
With genetic algorithms, the first decision is how to represent each cluster. One simple approach would be
to use a bit-map representation for each possible cluster.
One possible iterative refinement technique for clustering uses a genetic algorithm. The
approach is similar to the squared error approach in that an initial random
solution is given and successive changes to this converge on a local optimum. A new solution is
generated from the previous solution using crossover and mutation operations.
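A toy sketch of this idea for k = 2 is given below, using a bit-map representation (bit i encodes the cluster of item i), fitness equal to the negative total squared error, single-point crossover, and a one-bit mutation; the population size, mutation scheme, and all names are illustrative choices rather than a prescribed algorithm.

```python
# Toy genetic-algorithm clustering for k = 2: each solution is a bit string,
# fitness is the negative squared error, and new solutions come from
# single-point crossover plus a one-bit mutation.
import random

def fitness(bits, items):
    """Negative total squared error of the 2-cluster solution encoded by bits."""
    total = 0.0
    for label in (0, 1):
        members = [p for p, b in zip(items, bits) if b == label]
        if not members:
            continue
        center = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in members)
    return -total

def genetic_clustering(items, pop_size=20, generations=50, seed=0):
    random.seed(seed)
    n = len(items)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda b: fitness(b, items), reverse=True)
        survivors = pop[: pop_size // 2]                    # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randint(1, n - 1)
            child = p1[:cut] + p2[cut:]                     # single-point crossover
            i = random.randrange(n)
            child[i] = 1 - child[i]                         # mutation: flip one bit
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda b: fitness(b, items))

items = [(0.0, 0.0), (0.3, 0.1), (5.0, 5.0), (5.2, 4.9)]
print(genetic_clustering(items))                            # e.g. [0, 0, 1, 1] or [1, 1, 0, 0]
```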
Self-Organizing Feature Maps
A self-organizing feature map (SOFM) or self-organizing map (SOM) is an NN approach that
uses competitive unsupervised learning. Learning is based on the concept that the behavior of a
node should impact only those nodes and arcs near it. Weights are initially assigned randomly
and adjusted during the learning process to produce better results. During this learning process,
hidden features or patterns in the data are uncovered and the weights are adjusted accordingly.
SOFMs were developed by observing how neurons work in the brain and in ANNs. That is:
• The firing of a neuron impacts the firing of other neurons that are near it.
• Neurons that are far apart seem to inhibit each other.
• Neurons seem to have specific nonoverlapping tasks.
The term self-organizing indicates the ability of these NNs to organize the nodes into clusters
based on the similarity between them.
An example of a SOFM is the Kohonen self-organizing map, which is used extensively in
commercial data mining products to perform clustering.
In a Kohonen map, the competitive layer is viewed as a grid, and each input node is connected
to each node in this grid. Propagation occurs by sending the input
value for each input node to each node in the competitive layer. As with regular NNs, each arc
has an associated weight and each node in the competitive layer has an activation function.
A common approach is to initialize the weights on the input arcs to the competitive layer with
normalized values. The similarity between output nodes and input vectors is then determined by
the dot product of the two vectors. Given an input tuple X = (x1, . . . , xh) and weights w1i, . . . , whi
on the arcs input to a competitive node i, the similarity between X and i can be calculated by
sim(X, i) = Σ_{k=1}^{h} xk wki
The learning process adjusts the weights of the winning node (and its neighbors) toward the input using
wki = wki + c (xk − wki)
In this formula, c indicates the learning rate and may actually vary based on the node rather than
being a constant.
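The sketch below illustrates this competitive learning loop on a small one-dimensional competitive layer; as a simplification it picks the winning node by Euclidean distance rather than the dot product and updates the winner and its immediate grid neighbors, so it is an illustrative toy rather than a full Kohonen map, and all names and parameter values are assumptions.

```python
# Rough SOFM training sketch: the winner is the node whose weight vector is
# closest to the input, and its weights (plus its immediate neighbours') are
# nudged toward the input with wki = wki + c * (xk - wki).
import random

def train_sofm(data, num_nodes=4, epochs=50, c=0.3, seed=0):
    random.seed(seed)
    dim = len(data[0])
    # Randomly initialised weight vector for each competitive-layer node.
    weights = [[random.random() for _ in range(dim)] for _ in range(num_nodes)]
    for _ in range(epochs):
        for x in data:
            # Competition: the winner is the node closest to the input tuple x.
            winner = min(range(num_nodes),
                         key=lambda i: sum((xk - wk) ** 2 for xk, wk in zip(x, weights[i])))
            # Update the winner and its grid neighbours (winner +/- 1 on a 1-D grid).
            for i in (winner - 1, winner, winner + 1):
                if 0 <= i < num_nodes:
                    weights[i] = [wk + c * (xk - wk) for xk, wk in zip(x, weights[i])]
    return weights

data = [(0.1, 0.1), (0.2, 0.0), (0.9, 0.9), (1.0, 0.8)]
for i, w in enumerate(train_sofm(data)):
    print(i, [round(v, 2) for v in w])    # nodes organise toward the two groups of inputs
```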