
Non-Redundant Multi-View Clustering Via Orthogonalization

Ying Cui (cui.yi@neu.edu), ECE Department, Northeastern University, Boston, MA 02115
Xiaoli Z. Fern (xfern@eecs.oregonstate.edu), School of EECS, Oregon State University, Corvallis, OR 97339
Jennifer G. Dy (jdy@ece.neu.edu), ECE Department, Northeastern University, Boston, MA 02115

Abstract

Typical clustering algorithms output a single clustering of the data. However, in real world applications, data can often be interpreted in many different ways; data can have different groupings that are reasonable and interesting from different perspectives. This is especially true for high-dimensional data, where different feature subspaces may reveal different structures of the data. Why commit to one clustering solution when all of these alternative clustering views might be interesting to the user? In this paper, we propose a new clustering paradigm for explorative data analysis: find all non-redundant clustering views of the data, where data points of one cluster can belong to different clusters in other views. We present a framework to solve this problem and suggest two approaches within this framework: (1) orthogonal clustering, and (2) clustering in orthogonal subspaces. In essence, both approaches find alternative ways to partition the data by projecting it to a space that is orthogonal to our current solution. The first approach seeks orthogonality in the cluster space, while the second approach seeks orthogonality in the feature space. We test our framework on both synthetic and high-dimensional benchmark data sets, and the results show that our approaches were indeed able to discover varied solutions that are interesting and meaningful.

keywords: multi-view clustering, non-redundant clustering, orthogonalization

1 Introduction

Many applications are characterized by data in high dimensions. Examples include text, image, and gene data. Automatically extracting interesting structure from such data has been heavily studied in a number of different areas, including data mining, machine learning, and statistical data analysis. One approach for extracting information from unlabeled data is clustering. Given a data set, typical clustering algorithms group similar objects together based on some fixed notion of similarity (distance) and output a single clustering solution. However, in real world applications, data can often be interpreted in many different ways, and there may exist multiple groupings of the data that are all reasonable from some perspective.

This problem is often more prominent for high-dimensional data, where each object is described by a large number of features. In such cases, different feature subspaces can often warrant different ways to partition the data, each presenting the user a different view of the data's structure. Figure 1 illustrates one such scenario. In particular, Figure 1a shows a scatter plot of the data in feature subspace {F1, F2}, and Figure 1b shows how the data looks in feature subspace {F3, F4}. Note that each subspace leads to a different clustering structure. When faced with such a situation, which features should we select (i.e., which clustering solution is better)? Why do we have to choose? Why not keep both solutions? In fact, both clustering solutions could be important and provide different interpretations of the same data. For example, for the same medical data, what is interesting to physicians might be different from what is interesting to insurance companies.

Figure 1. This is a scatter plot of the data in (a) subspace {F1, F2} and (b) subspace {F3, F4}. Note that the two subspaces lead to different clustering structures.
Hierarchical clustering [18] presents a hierarchical grouping of the objects. However, the different clustering solutions obtained at different hierarchical levels differ only in their granularity: objects belonging to the same cluster at fine resolutions remain in the same cluster at the coarser levels. On the other hand, cluster ensemble methods [22, 10] do create a set of cluster solutions, but the final goal is still to find a single clustering solution that is most consistent throughout the set.

The goal of exploratory data analysis is to find structures in data, which may be multi-faceted by nature. Traditional clustering methods seek to find a unified clustering solution and are thus inherently limited in achieving this goal. In this paper, we suggest a new exploratory clustering paradigm: the goal is to find a set of non-redundant clustering views from data, where data points belonging to the same cluster in one view can belong to different clusters in another view.

Toward this goal, we present a framework that extracts multiple clustering views of high-dimensional data that are orthogonal to each other. Note that there are k^N possible partitionings of N data points into k clusters, and not all of them are meaningful. We wish to find good clustering solutions based on a clustering objective function. Meanwhile, we would like to minimize the redundancy among the obtained solutions. Thus, we include an orthogonality constraint in our search for new clustering views to avoid providing the user with redundant clustering results. The proposed framework works iteratively, at each step adding one clustering view by searching for solutions in a space that is orthogonal to the space of the existing solutions. Within this framework, we develop two general approaches: the first seeks orthogonality in the cluster space, while the second seeks orthogonality in the feature subspace. We present all of the multiple clustering views to the user.

2 Literature Review

In this section, we briefly review the literature related to our research in its different aspects.

First, our research can be considered as performing non-redundant clustering [13, 5]. In non-redundant clustering, we are typically given a set of data objects together with an existing clustering solution, and the goal is to learn an alternative clustering that captures new information about the structure of the data. Existing non-redundant clustering techniques include the conditional information bottleneck approach [5, 14, 15], the conditional ensemble based approach [16] and the constrained model based approach [17]. Compared to existing non-redundant clustering techniques, the critical differences of our proposed research are:

(1) Focusing on searching for orthogonal clusterings of high-dimensional data, our research combines dimensionality reduction and orthogonal clustering into a unifying framework and seeks a lower dimensional representation of the data that reveals non-redundant information about the data.

(2) Existing non-redundant clustering techniques are limited to finding one alternative structure given a known structure. Our framework works successively to reveal a sequence of different clusterings of the data.

Our framework produces a set of different clustering solutions, which is similar to cluster ensembles [22, 10]. A clear distinction of our work from cluster ensembles is that we intentionally search for orthogonal clusterings and do not seek to find a consensus clustering as our end product.

Similar to cluster ensembles, multi-view clustering [3] seeks a single final clustering solution by learning from multiple views, whereas our non-redundant multi-view clustering paradigm looks for multiple clustering solutions.

An integral part of our framework is to search for a clustering structure in a high-dimensional space and find the corresponding subspace that best reveals that clustering structure. This topic has been widely studied, and one closely related work is [8]. In that work, after a clustering solution is obtained, a subspace is computed to best capture the clustering, and the clustering is then refined using the data projected onto the new subspace. Interestingly, our framework works in the opposite direction: we look at a subspace that is orthogonal to the space in which the original clustering is embedded in order to search for non-redundant clustering solutions.

Finally, while we search for different subspaces in our framework, this is different from the concept of subspace clustering [1, 21]. In subspace clustering, the goal is still to learn a single clustering, where each cluster can be embedded in its own subspace. In contrast, our method searches for multiple clustering solutions, each revealed in a different subspace.

3 Multi-View Orthogonal Clustering

Given data X ∈ R^{d×N}, with N instances and d features, our goal is to learn a set of orthogonal clustering views from the data.

There are several ways to find different clustering views [22, 10]. One can apply different objective functions, utilize different similarity measures, or use different density models. In general, one can apply different clustering algorithms, or utilize one algorithm on randomly sampled (either in instance space or feature space) data from X. Note that such methods produce each individual clustering independently from all the other clustering views. While the differences in the objective functions, similarity measures, density models, or data samples may lead to clustering results that differ from one another, it is common to see redundancy in the obtained multiple clustering views. We present a framework for successively generating multiple clustering views that are orthogonal to one another and thus contain limited redundancy. Below, we describe our framework for generating different clustering views by orthogonalization.

Figure 2. The general framework for generating multiple orthogonal clustering views. (Block diagram: starting from X, each stage performs clustering/dimensionality reduction to produce a view, followed by orthogonalization into the residue space, yielding views 1 through n.)

Figure 2 shows the general framework of our approach. We first cluster the data (this can include dimensionality reduction followed by clustering); then we orthogonalize the data to a space that is not covered by the existing clustering solutions. We repeat the process until we cover most of the data space or no structure can be found in the remaining space.

We developed two general approaches within this framework: (1) orthogonal clustering, and (2) clustering in orthogonal subspaces. These two approaches differ primarily in how they represent the existing clustering solutions and, consequently, how they orthogonalize the data based on the existing solutions. Specifically, the first approach represents a clustering solution using its k cluster centroids; the second represents a clustering solution using the feature subspace that best captures the clustering result. In the next two subsections, we describe these two representations in detail and explain how to obtain orthogonal clustering solutions based on each.

3.1 Orthogonal Clustering

Clustering can be viewed as a way of compressing the data X. For example, in k-means [11, 20], the objective is to minimize the sum-squared-error (SSE) criterion:

SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2

where x_i ∈ R^d is a data point assigned to cluster C_j, and µ_j is the mean of C_j. We represent x_i and µ_j as column vectors. The outputs of k-means clustering are the cluster means and the cluster membership of each data point x_i. One can consider k-means clustering as a compression of the data X to the k cluster means µ_j.

Following the compression viewpoint, each data point x_i is represented by its cluster mean µ_j. Given k µ_j's for representing X, what is not captured by these µ_j's? Consider the space spanned by x_i, i = 1...N; we refer to this as the original data space. In contrast, the subspace spanned by the mean vectors µ_j, j = 1...k, is considered the compressed data space. Assigning data points to their corresponding cluster means can essentially be considered as projecting the data points from the original data space onto the compressed data space. What is not covered by the current clustering solution (i.e., its compressed data space) is simply its residue space. In this paper, we define the residue space as the data projected onto the space orthogonal to our current representation. Thus, to find alternative clustering solutions not covered by the current solution, we can simply perform clustering in the space that is orthogonal to the compressed data space.

Given the current data X^(t) and the clustering solution found on X^(t) (i.e., M^(t) = [µ_1^(t) µ_2^(t) ... µ_k^(t)] and the cluster assignments), we describe two variations for representing the data in the residue space X^(t+1), based on the "hard" and "soft" clustering views respectively.

Hard clustering. In hard clustering, each data point belongs to a single cluster and is thus represented using a single mean vector. For a data point x_i^(t) belonging to cluster j, we project it onto its center µ_j^(t) as its representation in the current clustering view. The residue, x_i^(t+1), is then the projection of x_i^(t) onto the subspace orthogonal to µ_j^(t). This can be formalized by the following formula:

x_i^{(t+1)} = \left( I - \frac{\mu_j^{(t)} \mu_j^{(t)T}}{\mu_j^{(t)T} \mu_j^{(t)}} \right) x_i^{(t)}

Soft clustering. In soft clustering, each data point can partially belong to all clusters. Therefore, we represent the data using all cluster centers: we project the data onto the subspace spanned by all cluster means and compute the residue X^(t+1) as the projection of X^(t) onto the subspace orthogonal to all the cluster centroids. This can be formalized by the following formula:

X^{(t+1)} = \left( I - M^{(t)} (M^{(t)T} M^{(t)})^{-1} M^{(t)T} \right) X^{(t)}

The algorithm for orthogonal clustering is summarized in Algorithm 1. The data is first centered to have zero mean. We then create the first view by clustering the original data X. Since most of the data in our experiments are high-dimensional, we apply principal component analysis [19] to reduce the dimensionality, followed by k-means.

Algorithm 1: Orthogonal Clustering

Inputs: The data matrix X ∈ R^{d×N} and the number of clusters k^(t) for each iteration.
Output: The multiple partitioning views of the data into k^(t) clusters at each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and X^(1) = X.
Step 1: Cluster X^(t). In our experiments, we performed PCA followed by k-means. The compressed solution is the set of k means µ_j^(t); each µ_j^(t) is a column vector in R^d (the original feature space).
Step 2: Project each x_i^(t) in X^(t) onto the space orthogonal to its cluster mean to form the residue space representation X^(t+1).
Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, \sum_{j=1}^{k^{(t)}} \sum_{x_i^{(t)} \in C_j} \| x_i^{(t)} - \mu_j^{(t)} \|^2, is very small.

Note that one can apply other clustering methods within our framework; we chose PCA followed by k-means because they are popular techniques. In step 2, we project the data onto the space orthogonal to the current cluster representation (the cluster centers) to obtain the residue X^(t+1). The next clustering view is then obtained by clustering in this residue space. We repeat steps 1 (clustering) and 2 (orthogonalization) until the desired number of views is obtained or the SSE is very small. A small SSE signifies that the existing views have already covered most of the data.
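For concreteness, the overall loop of Algorithm 1 might be sketched in Python as follows, using scikit-learn's PCA and KMeans as the base clustering and the hard residue of Section 3.1. The helper structure and parameter names are our assumptions, not the paper's code; the 90% variance retention mirrors the experimental setup described in Section 4.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def orthogonal_clustering(X, ks, var_kept=0.90):
    # X: d x N data matrix (columns are points); ks: number of clusters per view,
    # e.g. ks = [3, 3, 3] for three views. Returns the list of label vectors.
    X = X - X.mean(axis=1, keepdims=True)                    # pre-processing: center
    views = []
    for k in ks:
        # Step 1: PCA retaining var_kept of the variance, then k-means.
        Z = PCA(n_components=var_kept, svd_solver="full").fit_transform(X.T)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
        # Cluster means in the original feature space, as Algorithm 1 requires.
        means = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])
        views.append(labels)
        # Step 2: hard orthogonalization of each point against its own center.
        for j in range(k):
            mu = means[:, [j]]
            P = np.eye(X.shape[0]) - (mu @ mu.T) / float(mu.T @ mu)
            X[:, labels == j] = P @ X[:, labels == j]
        # Step 3: the SSE-based stopping test is omitted here for brevity.
    return views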
3.2 Clustering in Orthogonal Subspaces

In this approach, given a clustering solution with means µ_j, j = 1...k, we would like to find a feature subspace that best captures the clustering structure or, in other words, discriminates these clusters well. One well-known method for finding a reduced-dimensional space that discriminates classes (clusters here) is linear discriminant analysis (LDA) [12, 9]. Another approach is to apply principal component analysis (PCA) to the k mean vectors µ_j [8].

Below we explain the mathematical differences between these two approaches. LDA finds a linear projection Y = A^T X that maximizes

trace(S_w^{-1} S_b)

where S_w is the within-class scatter matrix and S_b is the between-class scatter matrix, defined as

S_w = \sum_{j=1}^{k} \sum_{y_i \in C_j} (y_i - \mu_j)(y_i - \mu_j)^T

S_b = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T

where the y_i are the projected data points, the µ_j are the projected cluster centers, n_j is the total number of points in cluster j, and µ is the center of the entire set of projected data. In essence, LDA finds the subspace that maximizes the scatter between the cluster means normalized by the scatter within each cluster.

Similarly, the PCA approach in [8] seeks a linear projection Y = A^T X, but maximizes a different objective function, trace(M M^T), where M = [µ_1 µ_2 ... µ_k] and the µ_j are the projected cluster centers.

For both methods, the solution can be represented as A = [φ_1 φ_2 ... φ_q], which contains the q most important eigenvectors (corresponding to the q largest eigenvalues) of S_w^{-1} S_b for LDA and of M M^T for PCA, respectively. Note that trace(S_b) = trace(M' M'^T), where M' = [\sqrt{n_1} \mu_1 \ \sqrt{n_2} \mu_2 \ \cdots \ \sqrt{n_k} \mu_k]. The difference between the two approaches is thus the normalization by the within-class scatter S_w^{-1} and the weighting of each µ_j by n_j, the number of data points in cluster C_j. But both M and M' span the same space, so in practice the two approaches give similar results. For computational reasons, we choose the PCA approach on the µ_j's and set q = k − 1, the rank of M M^T.
where Sw is the within-class-scatter matrix and Sb is the the methods on benchmark data in Section 4.2. In these ex-
between-class-scatter matrix defined as follows. periments, we chose as our base clustering – PCA followed

k  by k-means clustering. This means, we first reduce the di-
Sw = (yi − µj )(yi − µj )T mensionality with PCA, keeping a dimensionality that re-
j=1 yi ∈Cj tains at least 90% of the original variance, then follow PCA

Algorithm 2: Clustering in Orthogonal Subspaces

Inputs: The data matrix X ∈ R^{d×N} and the number of clusters k^(t) for each iteration.
Output: The multiple partitioning views of the data into k^(t) clusters, and a reduced dimensional subspace A^(t) for each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and X^(1) = X.
Step 1: Cluster X^(t). In our experiments, we performed PCA followed by k-means. Then find the PCA solution of M^(t) = [µ_1^(t) µ_2^(t) ... µ_k^(t)] and keep k^(t) − 1 dimensions to obtain the subspace A^(t) that captures the current clustering.
Step 2: Project X^(t) onto the space orthogonal to A^(t) to produce X^(t+1) = P^(t) X^(t), where the projection operator is P^{(t)} = I - A^{(t)} (A^{(t)T} A^{(t)})^{-1} A^{(t)T}.
Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, \sum_{i=1}^{N} \| x_i^{(t)} - A^{(t)} y_i^{(t)} \|^2, is very small.
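A rough sketch of one iteration of Algorithm 2, under the same assumptions as the earlier sketches (NumPy/scikit-learn, hard k-means assignments, d x N data), is given below; the subspace A is taken as the k−1 leading left singular vectors of M, i.e., the top eigenvectors of M M^T:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def orthogonal_subspace_step(X, k, var_kept=0.90):
    # One iteration of Algorithm 2 on a centered d x N matrix X.
    Z = PCA(n_components=var_kept, svd_solver="full").fit_transform(X.T)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    M = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])
    # PCA on the cluster means: the k-1 leading eigenvectors of M M^T are the
    # k-1 leading left singular vectors of M.
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k - 1]                                 # d x (k-1) basis of the subspace
    P = np.eye(X.shape[0]) - A @ A.T                 # A has orthonormal columns
    return labels, A, P @ X                          # view, subspace, residue

Feeding the returned residue back in as the next X^(t) and repeating produces the successive views.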
4 Experiments

In this section, we investigate whether our multi-view orthogonal clustering framework can provide reasonable and orthogonal clustering views of the data. We start by performing experiments on synthetic data in Section 4.1 to get a better understanding of the methods; we then test the methods on benchmark data in Section 4.2. In these experiments, we chose PCA followed by k-means as our base clustering. That is, we first reduce the dimensionality with PCA, keeping enough dimensions to retain at least 90% of the original variance, and then follow PCA with k-means clustering. Because we apply k-means clustering, we implement orthogonal clustering with the "hard" assumption. In this section, we refer to the orthogonal clustering approach as method 1 and to clustering in orthogonal subspaces as method 2.

4.1 Experiments on Synthetic Data

We would like to see whether our two methods can find diverse groupings of the data. We generate two synthetic data sets.

Data 1: We generate four-cluster data in two dimensions with N = 500 instances, as shown in Figure 3, where each cluster contains 125 data points. We test our methods by setting k = 2 for k-means. If the methods group the clusters one way in the first iteration, they should group the clusters the other way in the next iteration. This data set tests whether the methods can find orthogonal clusters.

Data 2: We generate a second synthetic data set in four dimensions, with N = 500 instances, as shown in Figure 4. We generate three Gaussian clusters in features F1 and F2 with 100, 100 and 300 data points, means µ1 = (12.5, 12.5), µ2 = (19, 10.5), and µ3 = (6, 17.5), and identity covariances. We generate another mixture of three Gaussian clusters in features F3 and F4 with 200, 200 and 100 data points, means µ1 = (2, 17), µ2 = (17.5, 9), and µ3 = (1.2, 5), and identity covariances. This data set tests whether the methods can find different clustering solutions in different subspaces.

Table 1. Confusion Matrix for Synthetic Data 1

                 Method 1        Method 2
Iteration 1      C1     C2       C1     C2
L1               125    0        125    0
L2               0      125      0      125
L3               125    0        125    0
L4               0      125      0      125
Iteration 2      C1     C2       C1     C2
L1               125    0        125    0
L2               125    0        125    0
L3               0      125      0      125
L4               0      125      0      125

4.1.1 Results for Synthetic Data 1

The confusion matrix in Table 1 shows the experimental results on synthetic data 1 for methods 1 and 2 over two iterations. We can see that in the first iteration, both methods grouped classes L1 and L3 into a single cluster C1, and classes L2 and L4 into another cluster C2. In the second iteration, the data was partitioned in a different way, grouping classes L1 and L2 into one cluster and classes L3 and L4 into another. Figure 3 shows scatter plots of the clustering results of both methods in the original 2D data space for the two iterations. Different colors are used to signify the true classes, and the ellipses show the clusters found by k-means. The figure confirms the result summarized in the confusion matrix; methods 1 and 2 produce similar results. In subfigures (a3) and (b3) of Figure 3, we plot the sum-squared-error (SSE) as a function of iteration. Note that, as expected, the SSE for both methods decreases monotonically until convergence. Moreover, the SSE at iteration 2 and beyond is zero, meaning that the first two clustering views have covered the data space completely.

Figure 3. Scatter plots of synthetic data 1. The two columns show the results of methods 1 and 2 respectively. The colors represent different class labels and the ellipses represent the clusters found. Rows 1 and 2 show the results for iterations 1 and 2 respectively; row 3 shows SSE as a function of iteration.
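The confusion matrices reported in this and the following sections are simple co-occurrence counts between the known class labels and the cluster assignments of a given view; a minimal helper (our own, assuming integer-coded labels and clusters) is:

import numpy as np

def confusion_matrix(true_labels, cluster_ids, n_classes, n_clusters):
    # Rows are true classes, columns are discovered clusters, entries are counts.
    C = np.zeros((n_classes, n_clusters), dtype=int)
    for t, c in zip(true_labels, cluster_ids):
        C[t, c] += 1
    return C

Each block of Table 1 (and of the later tables) corresponds to one such count matrix computed for one iteration/view.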

4.1.2 Results for Synthetic Data 2

Table 2 shows the confusion matrices for our clusterings under the two different labelings: labeling 1 is defined by features 1 and 2, and labeling 2 by features 3 and 4. A high number of common occurrences means that the cluster corresponds to those labels. Observe that both methods 1 and 2 found the clusters of labeling 2 (features 3 and 4) perfectly in the first iteration/view, with zero confusion in the off-diagonal elements. In the second iteration/view, methods 1 and 2 found the clusters of labeling 1 (features 1 and 2) perfectly as well, again with zero confusion. This result confirms that our multi-view approach can indeed discover multiple clustering solutions in different subspaces. Figure 4 shows scatter plots of the data. The left column ((a1), (a2), (a3)) shows the results for method 1: (a1) shows the clusters (ellipses) found by method 1 in iteration 1, where the left sub-figure shows the groupings in the original features 1 and 2 with the data points colored by true labeling 1, and the right sub-figure shows the clusterings in the original features 3 and 4 with the data points colored by true labeling 2. (a2) is the same scatter plot of the original data X with the clusters found by method 1 in iteration 2. Similarly, (b1) and (b2) show the results of method 2, and (a3) and (b3) show the SSE of the two methods at each iteration. Method 2 converges to zero much faster than method 1 here. Note that the SSE decreases monotonically with iteration and that the algorithm captures most of the information in two clustering views. From these results, in iteration 1 we found the right partition based on features 3 and 4 but grouped the clusters in features 1 and 2 incorrectly. On the other hand, iteration 2 groups the clusters based on features 1 and 2 correctly, but the partition of the clusters in features 3 and 4 is wrong. The results confirm that our multi-view approach can indeed discover multiple clustering solutions in different subspaces.

Figure 4. These are scatter plots of synthetic data 2 and the clusters found by methods 1 (a1, a2) and 2 (b1, b2). The color of the data points reflects the different class labels and the ellipses represent the clusters found. (a1), (b1) are the results for iteration 1; (a2), (b2) are the results for iteration 2; (a3) and (b3) show SSE as a function of iteration for methods 1 and 2 respectively.

Table 2. Confusion Matrix for Synthetic Data 2

                    Method 1              Method 2
Iteration 1         C1     C2     C3      C1     C2     C3
Labeling 1
L1                  41     40     19      41     40     19
L2                  44     34     22      44     34     22
L3                  115    126    59      115    126    59
Labeling 2
L1                  200    0      0       200    0      0
L2                  0      200    0       0      200    0
L3                  0      0      100     0      0      100
Iteration 2         C1     C2     C3      C1     C2     C3
Labeling 1
L1                  100    0      0       100    0      0
L2                  0      100    0       0      100    0
L3                  0      0      300     0      0      300
Labeling 2
L1                  126    34     40      126    34     40
L2                  115    44     41      115    44     41
L3                  59     22     19      59     22     19

4.2 Experiments on Benchmark Data

We have shown that our two methods work on synthetic data. Here, we investigate whether they reveal interesting and diverse clustering solutions on real benchmark data. We select data sets that are high-dimensional and that admit multiple possible partitionings.

In this section, we investigate the performance of our multi-view orthogonal clustering algorithms on four real-world data sets: the digits data set from the UCI machine learning repository [4], the face data set from the UCI KDD repository [2], and two text data sets, the mini-newsgroups data [2] and the WebKB data set [6].

The digits data set is for an optical recognition problem of handwritten digits, with ten classes, 5620 cases, and 64 attributes (all input attributes are integers from 0 to 16). The face data consists of 640 face images of 20 people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), and eyes (wearing sunglasses or not). Each person has 32 images capturing every combination of these features. The image resolution is 32 × 30. We removed the missing data and formed a 960 × 624 data matrix, where each of the 960 features represents a pixel value. The mini-newsgroups data comes from the UCI KDD repository and contains 2000 articles from 20 newsgroups. The second text data set is the CMU four-university WebKB data set as described in [6]. Both text data sets were processed following the standard procedure, including stemming and removing stopwords.

4.2.1 Results for the Digit Data

Table 3 shows the confusion matrix for the digit data for both methods 1 and 2. Because the results for methods 1 and 2 are similar, we concentrate on the results of method 1. In each iteration, we partition the data into three clusters. In iteration 1, the resulting partition clustered digits {0, 4, 6}, {1, 7, 8} and {2, 3, 5, 9} into different groups. In iteration 2, our method clustered {0, 3, 5, 7, 9}, {2, 6, 8, 1} and {4} into another set of clusters. And in iteration 3, the clusters we found are {0, 1, 8, 9}, {2, 3, 6, 7, 4} and {5}. These results show that in each iteration we can find a different way of partitioning the ten classes (digits).

Table 3. Confusion Matrix for the Digits Data

                 Method 1              Method 2
Iteration 1      C1     C2     C3      C1     C2     C3
0                550    0      4       550    0      4
4                371    197    0       369    199    0
6                555    2      1       555    2      1
1                12     477    82      12     477    82
7                0      566    0       0      566    0
8                15     321    218     15     321    218
2                5      27     525     5      27     525
3                0      41     531     0      41     531
5                35     236    287     35     236    287
9                2      152    408     2      152    408
Iteration 2      C1     C2     C3      C1     C2     C3
0                547    2      5       548    2      4
3                380    92     100     348    66     158
5                326    216    16      322    212    24
7                492    68     6       492    67     7
9                470    8      84      458    9      95
2                58     490    9       58     488    11
6                12     539    7       12     532    14
8                182    354    18      173    350    31
1                2      308    261     0      285    286
4                65     41     462     138    49     381
Iteration 3      C1     C2     C3      C1     C2     C3
0                529    4      21      545    1      8
1                478    45     48      394    77     100
8                484    46     24      317    213    24
9                381    54     127     265    143    154
2                120    397    40      522    21     14
3                106    454    12      151    393    28
6                107    407    44      47     486    25
7                16     536    14      54     500    12
4                67     275    226     187    159    222
5                27     1      530     18     5      535

In Figure 5, we present the mean image of each cluster obtained by method 1 in three iterations. Below each image we show the dominant digits contained in the cluster. For a digit to be considered as contained in a cluster, we require that at least 70% of its data points fall into that cluster. It is interesting to note that digits 4 and 5 were not well captured by any of the clusters in iteration 1. In contrast, in iteration 2 we see digit 4 well separated and captured by cluster 2, and in iteration 3 we were able to capture digit 5 nicely in a single cluster. This further demonstrates that our method is capable of discovering multiple reasonable structures in the data.

Figure 5. The average digit image for each cluster found by method 1 in iterations/views 1, 2 and 3. These clustering views correspond to different digits. (Dominant digits per cluster: iteration 1: {2,3,9}, {1,7}, {0,6}; iteration 2: {2,6}, {4}, {0,7,9}; iteration 3: {0,1,8}, {5}, {3,6,7}.)

4.2.2 Results for the Face Data

The face data is a very interesting data set because it can be grouped in several different ways (e.g., by person, pose, etc.). We designed the experiment to see if we can obtain different clustering information in different iterations.

First, we begin with the number of clusters K = 20 in the first iteration, hoping to find the 20 persons in the database.

Then, from the second iteration onward, we set K = 4 to see whether the partitions found in the remaining iterations can tell us any useful information. Figure 6 shows the average image for each cluster we find in iteration 1. We observed from this figure and from the confusion matrix (not shown due to space limitations and information clutter) that iteration 1 leads to a clustering corresponding to the different persons in the database. The number below each image is the percentage with which that person appears in the cluster; it is a confidence measure of the identification. The images clearly show different persons. The four clusters we found in the second iteration are shown in Figure 7, where each image is an average of the images within the cluster. It is clear that the clustering in iteration 2 groups the data based on different poses. This again suggests that our method can find different clustering views from the data. The more independently the important partitions lie in the data, the better the results of our method. Since method 2 gave us similar images, we only provide the results of method 1 here due to space limitations.

Figure 6. The average face image for each cluster found by method 1 in iteration 1. This clustering view corresponds to different persons.

4.2.3 Results for the Mini-Newsgroups Data

The mini-newsgroups data set originally contains 20 classes. We removed the classes under the "misc" category because they do not correspond to a clear concept class. We also pre-processed the data to remove stop words, words that appeared in fewer than 40 documents, and words that had low variance of occurrence across documents. After the pre-processing, the data contains 1700 documents from 17 classes. Each document is represented by a 500-dimensional term frequency vector.

Note that PCA followed by k-means does not work well for text data. Here, we apply the spherical k-means method [7] instead, which considers the correlation between documents rather than the Euclidean distance. Our experiments showed that this method provided a reasonable clustering of the text data sets.
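For reference, a bare-bones version of spherical k-means in the spirit of [7], which assigns documents by cosine similarity to unit-length concept vectors, can be sketched as follows (our illustration, not the implementation used in the paper):

import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    # X: d x N term-frequency matrix (columns are documents).
    rng = np.random.default_rng(seed)
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)   # unit-length docs
    centers = Xn[:, rng.choice(Xn.shape[1], size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(centers.T @ Xn, axis=0)                # cosine assignment
        for j in range(k):
            members = Xn[:, labels == j]
            if members.shape[1] == 0:
                continue                                          # keep old center if empty
            c = members.sum(axis=1)
            centers[:, j] = c / (np.linalg.norm(c) + 1e-12)
    return labels, centers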
Table 5 shows the confusion matrices obtained by method 1 over three iterations. For the first iteration, we set K = 3. The results show that cluster C1 groups together the recreation and computer categories; the ten most frequent words tell us that the documents here share information related to entertainment. Cluster C2 groups science and talk together, and the frequent words confirm that it groups science with the religion part of the talk category. Cluster C3 is a mixture of different topics.

In iteration 2, we set K = 4 to see whether we can partition the data to capture the four categories "computer", "recreation", "talk" and "science". From the confusion matrix, we see that we were able to find these high-level categories: C1 is about computers; C2 contains news about recreation; C3 groups the files related to science; and the last one, C4, contains documents from the "talk" category that are related to politics.

In iteration 3, two of the "computer" classes (graphics, [Link]) were grouped together with the "talk" category, and the remaining three "computer" classes were grouped together with the "recreation" category (autos, motorcycles and sports). This suggests that our method continued finding clustering structure that is different from the existing results.

Figure 7. The average face image for each cluster found by method 1 in iteration 2. This clustering view corresponds to different poses.

4.2.4 Results for the WebKB Text Data

This data contains 1041 html documents from four webpage topics: course, faculty, project and student. Alternatively, the webpages can also be grouped based on their regions/universities, which include four universities: Cornell University, University of Texas Austin, University of Washington and University of Wisconsin Madison. Following the same pre-processing procedure used for the mini-newsgroups data, we removed the rare words, stop words, and words with low variance. Finally, we obtained 350 words in the vocabulary; the final data matrix is of size 350 × 1041.

Table 4. Confusion Matrix for the WebKB Data

Iteration 1      C1     C2     C3     C4
course           134    12     81     17
faculty          2      78     61     12
project          1      47     28     10
student          2      68     402    86
Iteration 2      C1     C2     C3     C4
Cornell          103    86     27     10
Texas            50     87     83     32
Washington       35     77     138    5
Wisconsin        60     86     30     132

The experimental results are quite interesting. For the first iteration, our method found the partition that mostly corresponds to the different topics, as can be seen in Table 4: cluster 1 contains course webpages, cluster 2 is a mix of faculty and project pages, and cluster 3 consists of a majority of student webpages. In the second iteration, our method found a different clustering that corresponds to the universities, as shown in Table 4.

5 Conclusion

The goal of explorative data analysis is to find the underlying structure of a given set of data, which may be multi-faceted by nature. Existing work on non-redundant clustering attempts to address this problem by searching for a single alternative clustering solution that is different from an existing one. Our main contribution in this paper is that we introduced a new paradigm for explorative data clustering that seeks to extract all non-redundant clustering views from a given set of data (until there is only noise left in the data).

We presented a general framework for extracting multiple clustering views from high-dimensional data. In essence, this framework works by incorporating orthogonality constraints into a clustering algorithm. In other words, the clustering algorithm searches for a new clustering in a space that is orthogonal to what has been covered by the existing clustering solutions. We described two different methods for introducing orthogonality and conducted a variety of experiments on both synthetic data and real-world benchmark data sets to evaluate these methods. The results can be summarized as follows.

1. Using two different synthetic data sets, our proposed framework was able to find different substructures of the data, or different structures embedded in different subspaces, in different iterations.

2. On benchmark data sets, our methods not only found different clustering structures in different iterations, but also discovered clustering structures that are sensible, judging from the various evaluation criteria reported (such as confusion matrices and scatter plots). On the face data set, for example, PCA+k-means identified individuals in the first iteration, and in the second iteration our methods were able to identify a set of different clusters that correspond nicely to different poses. For the other data sets, we observed similar results in the sense that different concept classes were identified in different iterations.

Note that this paper uses k-means as the basic clustering algorithm. However, the framework is designed with no specific algorithm in mind and can work with any clustering algorithm. Future directions will be to explore the framework with other clustering methods.

Acknowledgments

This research is supported by NSF CAREER IIS-0347532.

Table 5. Confusion Matrix for the Mini-Newsgroups Data

Iteration 1 (K=3)            C1     C2     C3
comp.graphics                88     0      12
comp.os.ms                   95     0      5
comp.sys.ibm.pc.hardware     94     0      6
comp.sys.mac.hardware        88     0      12
comp.windows.x               87     0      13
rec.autos                    81     0      19
rec.motorcycles              82     0      18
rec.sport.baseball           81     0      19
rec.sport.hockey             71     2      27
sci.crypt                    0      68     32
sci.electronics              0      76     24
sci.med                      0      78     22
sci.space                    0      74     26
talk.politics.guns           0      70     30
talk.politics.mideast        0      61     39
talk.politics.misc           0      72     28
talk.religion.misc           0      77     23

Iteration 2 (K=4)            C1     C2     C3     C4
comp.graphics                98     0      2      0
comp.os.ms                   94     0      0      6
comp.sys.ibm.pc.hardware     78     15     3      4
comp.sys.mac.hardware        66     20     3      11
comp.windows.x               43     39     5      13
rec.autos                    28     51     6      15
rec.motorcycles              17     59     6      18
rec.sport.baseball           10     67     4      19
rec.sport.hockey             5      62     4      29
sci.crypt                    5      8      57     30
sci.electronics              1      9      65     25
sci.med                      1      22     61     16
sci.space                    0      16     58     26
talk.politics.guns           0      37     20     43
talk.politics.mideast        5      39     11     45
talk.politics.misc           3      45     6      46
talk.religion.misc           1      58     3      38

Iteration 3 (K=4)            C1     C2     C3     C4
comp.graphics                33     32     6      29
comp.os.ms                   42     23     10     25
comp.sys.ibm.pc.hardware     17     45     11     27
comp.sys.mac.hardware        15     41     20     24
comp.windows.x               19     40     18     23
rec.autos                    15     47     27     11
rec.motorcycles              10     54     22     14
rec.sport.baseball           7      51     33     9
rec.sport.hockey             5      66     21     8
sci.crypt                    5      15     68     12
sci.electronics              10     9      65     16
sci.med                      31     8      46     15
sci.space                    15     24     48     13
talk.politics.guns           49     19     18     14
talk.politics.mideast        45     24     16     15
talk.politics.misc           55     12     12     21
talk.religion.misc           56     8      20     16

References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pages 94-105, 1998.
[2] S. D. Bay. The UCI KDD archive, 1999.
[3] S. Bickel and T. Scheffer. Multi-view clustering. In Proc. of the IEEE Int'l Conf. on Data Mining, pages 19-26, 2004.
[4] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[5] G. Chechik and N. Tishby. Extracting relevant structures with side information. In Advances in Neural Information Processing Systems 15 (NIPS-2002), 2003.
[6] CMU. CMU 4 universities WebKB data, 1997.
[7] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143-175, 2001.
[8] C. Ding, X. He, H. Zha, and H. Simon. Adaptive dimension reduction for clustering high dimensional data. In Proc. of the IEEE Int'l Conf. on Data Mining, pages 147-154, 2002.
[9] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley & Sons, NY, 1973.
[10] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of the Int'l Conf. on Machine Learning, 2003.
[11] E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21:768, 1965.
[12] K. Fukunaga. Statistical Pattern Recognition (second edition). Academic Press, San Diego, CA, 1990.
[13] D. Gondek. Non-redundant clustering. PhD thesis, Brown University, 2005.
[14] D. Gondek and T. Hofmann. Conditional information bottleneck clustering. In The 3rd IEEE Intl. Conf. on Data Mining, Workshop on Clustering Large Data Sets, 2003.
[15] D. Gondek and T. Hofmann. Non-redundant data clustering. In Proc. of the 4th IEEE Intl. Conf. on Data Mining, 2004.
[16] D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. In Proc. of the 11th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD'05), pages 70-77, 2005.
[17] D. Gondek, S. Vaithyanathan, and A. Garg. Clustering with model-level constraints. In Proc. of the SIAM International Conference on Data Mining, 2005.
[18] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264-323, 1999.
[19] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[20] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281-297, 1967.
[21] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations Newsletter, 6(1):90-105, 2004.
[22] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, pages 583-617, 2002.
