Non-Redundant Multi-View Clustering Via Orthogonalization

cui.yi@neu.edu    xfern@eecs.oregonstate.edu    jdy@ece.neu.edu
Keywords: clustering, orthogonalization

Figure 1. Scatter plots of the data in (a) subspace {F1, F2} and (b) subspace {F3, F4}. Note that the two subspaces lead to different clustering structures.

1 Introduction
Many applications are characterized by data in high dimensions. Examples include text, image, and gene data. Automatically extracting interesting structure in such data has been heavily studied in a number of different areas including data mining, machine learning, and statistical data analysis. One approach for extracting information from unlabeled data is through clustering. Given a data set, typical clustering algorithms provide a single partitioning of the data. Hierarchical clustering [18] presents a hierarchical grouping of the objects. However, the different clustering solutions obtained at different hierarchical levels differ only in their granularity: objects belonging to the same cluster at fine resolutions remain in the same cluster at the coarser levels. On the other hand, cluster ensemble methods [22, 10] do create a set of cluster solutions, but the final goal is still to find a single clustering solution that is most consistent throughout the set.

The goal of exploratory data analysis is to find structures in data, which may be multi-faceted by nature. Traditional clustering methods seek to find a single unified clustering solution and are thus inherently limited in achieving this goal. In this paper, we suggest a new exploratory clustering paradigm: the goal is to find a set of non-redundant clustering views from data, where data points belonging to the same cluster in one view can belong to different clusters in another view.

Toward this goal, we present a framework that extracts multiple clustering views of high-dimensional data that are orthogonal to each other. Note that there are k^N possible ways to assign N data points to k disjoint clusters, and not all of them are meaningful. We wish to find good clustering solutions based on a clustering objective function. Meanwhile, we would like to minimize the redundancy among the obtained solutions. Thus, we include an orthogonality constraint in our search for new clustering views to avoid providing the user with redundant clustering results. The proposed framework works iteratively, at each step adding one clustering view by searching for solutions in a space that is orthogonal to the space of the existing solutions. Within this framework, we develop two general approaches. The first approach seeks orthogonality in the cluster space, while the second seeks orthogonality in the feature subspace. We present all of the multiple clustering views to the user.

2 Literature Review

In this section, we briefly review the literature related to our research in its different aspects.

First, our research can be considered as performing non-redundant clustering [13, 5]. In non-redundant clustering, we are typically given a set of data objects together with an existing clustering solution, and the goal is to learn an alternative clustering that captures new information about the structure of the data. Existing non-redundant clustering techniques include the conditional information bottleneck approach [5, 14, 15], the conditional ensemble based approach [16], and the constrained model based approach [17]. Compared to existing non-redundant clustering techniques, the critical differences of our proposed research are:

(1) Focusing on searching for orthogonal clusterings of high-dimensional data, our research combines dimensionality reduction and orthogonal clustering into a unifying framework and seeks lower dimensional representations of the data that reveal non-redundant information about the data.

(2) Existing non-redundant clustering techniques are limited to finding one alternative structure given a known structure. Our framework works successively to reveal a sequence of different clusterings of the data.

Our framework produces a set of different clustering solutions, which is similar to cluster ensembles [22, 10]. A clear distinction of our work from cluster ensembles is that we intentionally search for orthogonal clusterings and do not seek a consensus clustering as our end product.

Similar to cluster ensembles, multi-view clustering [3] seeks a single final clustering solution by learning from multiple views, whereas our non-redundant multi-view clustering paradigm looks for multiple clustering solutions.

An integral part of our framework is to search for a clustering structure in a high-dimensional space and find the corresponding subspace that best reveals the clustering structure. This topic has been widely studied, and one closely related work was conducted by [8]. In that work, after a clustering solution is obtained, a subspace is computed to best capture the clustering, and the clustering is then refined using the data projected onto the new subspace. Interestingly, our framework works in the opposite direction: we look at a subspace that is orthogonal to the space in which the original clustering is embedded in order to search for non-redundant clustering solutions.

Finally, while we search for different subspaces in our framework, it differs from the concept of subspace clustering in [1, 21]. In subspace clustering, the goal is still to learn a single clustering, where each cluster can be embedded in its own subspace. In contrast, our method searches for multiple clustering solutions, each revealed in a different subspace.

3 Multi-View Orthogonal Clustering

Given data X ∈ R^{d×N} with N instances and d features, our goal is to learn a set of orthogonal clustering views from the data.

There are several ways to find different clustering views [22, 10]. One can apply different objective functions, utilize different similarity measures, or use different density models. In general, one can apply different clustering algorithms, or run one algorithm on randomly sampled (either in instance space or feature space) data from X. Note that such methods produce each individual clustering independently from all the other clustering views. While the differences in the objective functions, similarity measures, density models, or data samples may lead to clustering results that differ from one another, it is common to see redundancy in the obtained multiple clustering views. We present a framework for successively generating multiple clustering views that are orthogonal to one another and thus contain limited redundancy. Below, we describe our framework for generating different clustering views by orthogonalization.
Figure 2. The general framework for generating multiple orthogonal clustering views. (The data X is clustered, possibly after dimensionality reduction, to produce View 1; orthogonalization then maps the data into a residue space, which is clustered to produce View 2, and so on up to View n.)

Figure 2 shows the general framework of our approach. We first cluster the data (this can include dimensionality reduction followed by clustering); then we orthogonalize the data to a space that is not covered by the existing clustering solutions. We repeat this process until we have covered most of the data space or no structure can be found in the remaining space.

We developed two general approaches within this framework: (1) orthogonal clustering, and (2) clustering in orthogonal subspaces. These two approaches differ primarily in how they represent the existing clustering solutions and, consequently, how they orthogonalize the data based on the existing solutions. Specifically, the first approach represents a clustering solution using its k cluster centroids. The second approach represents a clustering solution using the feature subspace that best captures the clustering result. In the next two subsections, we describe these two representations in detail and explain how to obtain orthogonal clustering solutions based on each of them.

3.1 Orthogonal Clustering

Clustering can be viewed as a way of compressing the data X. For example, in k-means [11, 20], the objective function is to minimize the sum-squared-error criterion (SSE):

SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2

where x_i ∈ R^d is a data point assigned to cluster C_j, and µ_j is the mean of C_j. We represent x_i and µ_j as column vectors. The outputs of k-means clustering are the cluster means and the cluster membership of each data point x_i. One can thus consider k-means clustering as a compression of the data X to the k cluster means µ_j.

Following the compression viewpoint, each data point x_i is represented by its cluster mean µ_j. Given k µ_j's for representing X, what is not captured by these µ_j's? Let us consider the space that is spanned by the x_i, i = 1 . . . N; we refer to this as the original data space. In contrast, the subspace that is spanned by the mean vectors µ_j, j = 1 . . . k, is considered the compressed data space. Assigning data points to their corresponding cluster means can essentially be considered as projecting the data points from the original data space onto the compressed data space. What is not covered by the current clustering solution (i.e., its compressed data space) is simply its residue space. In this paper, we define the residue space as the data projected onto the space orthogonal to our current representation. Thus, to find alternative clustering solutions not covered by the current solution, we can simply perform clustering in the space that is orthogonal to the compressed data space.

Given the current data X^{(t)} and the clustering solution we found on X^{(t)} (i.e., M^{(t)} = [µ_1^{(t)} µ_2^{(t)} · · · µ_k^{(t)}] and the cluster assignments), we describe two variations for representing the data in the residue space, X^{(t+1)}, based on the "hard" and "soft" clustering views respectively.

Hard clustering. In hard clustering, each data point belongs to a single cluster and is thus represented using a single mean vector. For a data point x_i^{(t)} belonging to cluster j, we project it onto its center µ_j^{(t)} as its representation in the current clustering view. The residue, x_i^{(t+1)}, is then the projection of x_i^{(t)} onto the subspace orthogonal to µ_j^{(t)}. This can be formalized by the following formula:

x_i^{(t+1)} = \left( I - \mu_j^{(t)} \mu_j^{(t)T} / (\mu_j^{(t)T} \mu_j^{(t)}) \right) x_i^{(t)}

Soft clustering. In soft clustering, each data point can partially belong to all clusters. Therefore, we can represent the data using all cluster centers. We achieve this by projecting the data onto the subspace spanned by all cluster means and computing the residue, X^{(t+1)}, as the projection of X^{(t)} onto the subspace orthogonal to all the cluster centroids. This can be formalized by the following formula:

X^{(t+1)} = \left( I - M^{(t)} (M^{(t)T} M^{(t)})^{-1} M^{(t)T} \right) X^{(t)}
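Both residue computations are a few lines of linear algebra. The following is a minimal NumPy sketch of the two projections, not code from the paper; the function names and the convention that X is stored as a d × N matrix of column vectors are our own assumptions:

    import numpy as np

    def hard_residue(X, means, labels):
        """Hard clustering: project each column x_i of X (d x N) onto the
        space orthogonal to its own cluster mean mu_j."""
        d = X.shape[0]
        X_next = np.empty_like(X, dtype=float)
        for j in range(means.shape[1]):
            mu = means[:, [j]]                            # d x 1 column vector
            P = np.eye(d) - (mu @ mu.T) / (mu.T @ mu)     # rank-one deflation
            X_next[:, labels == j] = P @ X[:, labels == j]
        return X_next

    def soft_residue(X, means):
        """Soft clustering: project all of X onto the space orthogonal to the
        span of all k cluster means M (d x k)."""
        d = X.shape[0]
        M = means
        # pinv guards against rank-deficient means; for full-rank M it equals
        # the (M^T M)^{-1} term in the formula above.
        P = np.eye(d) - M @ np.linalg.pinv(M.T @ M) @ M.T
        return P @ X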
The algorithm for orthogonal clustering is summarized in Algorithm 1. The data is first centered to have zero mean. We then create the first view by clustering the original data X. Since most of the data in our experiments are high-dimensional, we apply principal components analysis [19] to reduce the dimensionality, followed by k-means.

Algorithm 1 Orthogonal Clustering.
Inputs: The data matrix X ∈ R^{d×N}, and the number of clusters k^{(t)} for each iteration.
Output: The multiple partitioning views of the data into k^{(t)} clusters at each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and X^{(1)} = X.
Step 1: Cluster X^{(t)}. In our experiments, we performed PCA followed by k-means. The compressed solution is the set of k means, µ_j^{(t)}. Each µ_j^{(t)} is a column vector in R^d (the original feature space).
Step 2: Project each x_i^{(t)} in X^{(t)} to the space orthogonal to its cluster mean to form the residue space representation, X^{(t+1)}.
Step 3: Set t = t + 1 and repeat steps 1 and 2 until the desired number of views is reached or until the sum-squared-error, \sum_{j=1}^{k^{(t)}} \sum_{x_i^{(t)} \in C_j} \| x_i^{(t)} - \mu_j^{(t)} \|^2, is very small.

Note that one can apply other clustering methods within our framework. We chose PCA followed by k-means because they are popular techniques. In step 2, we project the data to the space orthogonal to the current cluster representation (using the cluster centers) to obtain our residue X^{(t+1)}. The next clustering view is then obtained by clustering in this residue space. We repeat steps 1 (clustering) and 2 (orthogonalization) until the desired number of views is obtained or the SSE is very small. A small SSE signifies that the existing views have already covered most of the data.
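For concreteness, the whole of Algorithm 1 can be sketched as the loop below (our illustration, not the authors' code). It uses scikit-learn's PCA and KMeans for Step 1, the hard-clustering orthogonalization for Step 2, and an SSE check for Step 3; the number of views, the number of clusters k, the 90% variance level, and the 1e-8 threshold are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def orthogonal_clustering(X, n_views=3, k=3, var_kept=0.90):
        """Sketch of Algorithm 1 (method 1). X is d x N and already zero-centered."""
        X_t = X.astype(float).copy()
        views = []
        for t in range(n_views):
            # Step 1: cluster the current data, PCA followed by k-means.
            Z = PCA(n_components=var_kept).fit_transform(X_t.T)   # N x q scores
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
            # Cluster means in the original d-dimensional space (column vectors);
            # assumes no cluster came back empty.
            means = np.column_stack([X_t[:, labels == j].mean(axis=1) for j in range(k)])
            views.append(labels)
            # Step 3 criterion: sum-squared-error of the current clustering.
            sse = sum(np.sum((X_t[:, labels == j] - means[:, [j]]) ** 2) for j in range(k))
            if sse < 1e-8:
                break
            # Step 2: hard orthogonalization, project each point onto the space
            # orthogonal to its own cluster mean (the residue space).
            for j in range(k):
                mu = means[:, [j]]
                P = np.eye(X_t.shape[0]) - (mu @ mu.T) / (mu.T @ mu)
                X_t[:, labels == j] = P @ X_t[:, labels == j]
        return views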
3.2 Clustering in Orthogonal Subspaces

In this approach, given a clustering solution with means µ_j, j = 1 . . . k, we would like to find a feature subspace that best captures the clustering structure, or, in other words, discriminates these clusters well. One well-known method for finding a reduced dimensional space that discriminates classes (clusters here) is linear discriminant analysis (LDA) [12, 9]. Another approach is to apply principal component analysis (PCA) to the k mean vectors µ_j [8].

Below we explain the mathematical differences between these two approaches. LDA finds a linear projection Y = A^T X that maximizes

trace(S_w^{-1} S_b)

where S_w is the within-class-scatter matrix and S_b is the between-class-scatter matrix, defined as follows:

S_w = \sum_{j=1}^{k} \sum_{y_i \in C_j} (y_i - \mu_j)(y_i - \mu_j)^T

S_b = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T

where the y_i's are projected data points, the µ_j's are projected cluster centers, n_j is the total number of points in cluster j, and µ is the center of the entire set of projected data. In essence, LDA finds the subspace that maximizes the scatter between the cluster means normalized by the scatter within each cluster.

Similarly, the PCA approach in [8] seeks a linear projection Y = A^T X, but maximizes a different objective function, trace(M M^T), where M = [µ_1 µ_2 · · · µ_k] and the µ_j's are the projected cluster centers.

For both methods, the solution can be represented as A = [φ_1 φ_2 · · · φ_q], which contains the q most important eigenvectors (corresponding to the q largest eigenvalues) of S_w^{-1} S_b for LDA and of M M^T for PCA, respectively.

Note that trace(S_b) = trace(\tilde{M} \tilde{M}^T), where \tilde{M} = [\sqrt{n_1} \mu_1 \ \sqrt{n_2} \mu_2 \ \cdots \ \sqrt{n_k} \mu_k]. The difference between the two approaches is thus the normalization by the within-class scatter S_w^{-1} and the weighting of each µ_j by n_j, the number of data points in cluster C_j. But both M and \tilde{M} span the same space; thus, in practice, the two approaches give similar results. For computational purposes, we choose the PCA approach on the µ_j's and set q = k − 1, the rank of M M^T.
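The trace identity quoted above can be verified in one line once the projected data are centered so that µ = 0 (which the zero-mean pre-processing guarantees); this short derivation is ours and is not spelled out in the text:

    S_b = \sum_{j=1}^{k} n_j (\mu_j - \mu)(\mu_j - \mu)^T
        \;\overset{\mu = 0}{=}\; \sum_{j=1}^{k} (\sqrt{n_j}\,\mu_j)(\sqrt{n_j}\,\mu_j)^T
        = \tilde{M}\tilde{M}^T,
    \qquad \tilde{M} = [\sqrt{n_1}\,\mu_1 \;\cdots\; \sqrt{n_k}\,\mu_k],

so in particular trace(S_b) = trace(\tilde{M}\tilde{M}^T) = \sum_{j=1}^{k} n_j \|\mu_j\|^2.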
Once we have obtained a feature subspace A = [φ_1 φ_2 · · · φ_{k−1}] that captures the clustering structure well, we project X^{(t)} to the subspace orthogonal to A to obtain the residue X^{(t+1)} = P^{(t)} X^{(t)}. The orthogonal projection operator P^{(t)} is:

P^{(t)} = I - A^{(t)} (A^{(t)T} A^{(t)})^{-1} A^{(t)T}

Algorithm 2 presents the pseudo-code for clustering in orthogonal subspaces. In step 1, we apply a clustering algorithm (PCA followed by k-means in our experiments). We then represent this clustering solution using the subspace that best separates these clusters. In step 2, we project the data to the space orthogonal to the computed subspace representation. We repeat steps 1 and 2 until the desired number of views is obtained or the SSE is very small.

Algorithm 2 Clustering in Orthogonal Subspaces.
Inputs: The data matrix X ∈ R^{d×N}, and the number of clusters k^{(t)} for each iteration.
Output: The multiple partitioning views of the data into k^{(t)} clusters, and a reduced dimensional subspace A^{(t)} for each iteration.
Pre-processing: Center the data to have zero mean.
Initialization: Set the iteration number t = 1 and X^{(1)} = X.
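The orthogonalization step of this second approach (compute A by PCA on the cluster means with q = k − 1, then apply the projection operator P) can be sketched as follows. The function and variable names are our own, and the SVD of the mean matrix is used as an equivalent way to obtain the top eigenvectors of M M^T:

    import numpy as np

    def orthogonal_subspace_step(X, means):
        """One orthogonalization step of method 2: given X (d x N, zero-centered)
        and the d x k cluster means of the current view, return the residue."""
        d, k = means.shape
        # PCA on the cluster means: the top q = k - 1 eigenvectors of M M^T are
        # the leading left singular vectors of M; they span the subspace A that
        # best captures the current clustering.
        U, _, _ = np.linalg.svd(means, full_matrices=False)
        A = U[:, :k - 1]                                   # d x (k - 1)
        # Orthogonal projection operator P = I - A (A^T A)^{-1} A^T.
        P = np.eye(d) - A @ np.linalg.inv(A.T @ A) @ A.T
        return P @ X

In the full Algorithm 2, this step alternates with the same PCA-followed-by-k-means clustering used in method 1.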
4 Experiments

In this section, we investigate whether our multi-view orthogonal clustering framework can provide us with reasonable and orthogonal clustering views of the data. We start by performing experiments on synthetic data in Section 4.1 to get a better understanding of the methods, and then we test the methods on benchmark data in Section 4.2. In these experiments, we chose PCA followed by k-means clustering as our base clustering. That is, we first reduce the dimensionality with PCA, keeping a dimensionality that retains at least 90% of the original variance, and then follow PCA with k-means.
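This base clustering can be reproduced with standard tooling; the scikit-learn pipeline below is our sketch, not the authors' implementation, and expects the data in the usual samples-by-features orientation (i.e., X^T in the paper's notation):

    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline

    def base_clustering(X_samples, k):
        """PCA keeping >= 90% of the variance, followed by k-means.
        X_samples has shape (n_samples, n_features)."""
        pipeline = make_pipeline(PCA(n_components=0.90),
                                 KMeans(n_clusters=k, n_init=10))
        return pipeline.fit_predict(X_samples)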
4.1 Experiments on Synthetic Data

… three clusters with 200, 200 and 100 data points and means µ1 = (2, 17), µ2 = (17.5, 9), and µ3 = (1.2, 5), and identity covariances. This data tests whether the methods can find different clustering solutions in different subspaces.

Table 1. Confusion Matrix for Synthetic Data 1
Figure 3. Scatter plots of synthetic data 1. The two columns show the results of methods 1 and 2 respectively. The colors represent different class labels and the ellipses represent the clusters found. Rows 1 and 2 show the results for iterations 1 and 2 respectively; Row 3 shows SSE as a function of iteration.

In iteration 2, both methods found the clusters in labeling 1 (features 1 and 2) perfectly, also with zero confusion. This result confirms that indeed our multi-view approach can discover multiple clustering solutions in different subspaces. Figure 4 shows scatter plots of the data. The left column ((a1), (a2), (a3)) is the plot for method 1. (a1) shows the clustering, in ellipses, found by method 1 in iteration 1. The left sub-figure shows the groupings in the original features 1 and 2, and the data points are colored based on true labeling 1. The right sub-figure shows the clusterings in the original features 3 and 4, and the color of the data points is based on true labeling 2. (a2) is the same scatter plot of the original data X with the clusters found by method 1 in iteration 2, as shown by the ellipses. Similarly, (b1) and (b2) show the results of method 2. (a3) and (b3) are the SSE for the two methods in each iteration. Method 2 converges to zero much faster than method 1 here. Note that SSE monotonically decreases with iteration and that the algorithm captures most of the information in two clustering views. From these results, in iteration 1 we found the right partition based on features 3 and 4, but grouped the clusters in features 1 and 2 incorrectly. On the other hand, iteration 2 groups the clusters based on features 1 and 2 correctly, but the partition for the clusters in features 3 and 4 is mixed.

Figure 4. Scatter plots of synthetic data 2 and the clusters found by methods 1 (a1, a2) and 2 (b1, b2). The colors of the data points reflect different class labels and the ellipses represent the clusters found. (a1), (b1) are the results for iteration 1; (a2), (b2) are the results for iteration 2; (a3) and (b3) show SSE as a function of iteration for methods 1 and 2 respectively.

Confusion matrices for synthetic data 2 (true labelings against the clusters C1-C3 found by methods 1 and 2):

SYNTHETIC DATA 2            METHOD 1           METHOD 2
ITERATION 1
  LABELLING 1               C1   C2   C3       C1   C2   C3
  L1                        41   40   19       41   40   19
  L2                        44   34   22       44   34   22
  L3                        115  126  59       115  126  59
  LABELLING 2               C1   C2   C3       C1   C2   C3
  L1                        200  0    0        200  0    0
  L2                        0    200  0        0    200  0
  L3                        0    0    100      0    0    100
ITERATION 2
  LABELLING 1               C1   C2   C3       C1   C2   C3
  L1                        100  0    0        100  0    0
  L2                        0    100  0        0    100  0
  L3                        0    0    300      0    0    300
  LABELLING 2               C1   C2   C3       C1   C2   C3
  L1                        126  34   40       126  34   40
  L2                        115  44   41       115  44   41
  L3                        59   22   19       59   22   19

4.2 Experiments on Benchmark Data

We have shown that our two methods work on synthetic data. Here, we investigate whether they reveal interesting and diverse clustering solutions on real benchmark data. We select data sets that have high dimensionality and that admit multiple possible partitionings.

In this section, we investigate the performance of our multi-view orthogonal clustering algorithms on four real-world data sets: the digits data set from the UCI machine learning repository [4], the face data set from the UCI KDD repository [2], and two text data sets, the mini-newsgroups data [2] and the WebKB data set [6].

The digits data is a data set for an optical recognition problem of handwritten digits with ten classes, 5620 cases, and 64 attributes (all input attributes are integers from 0 . . . 16). The face data consists of 640 face images of 20 people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), and eyes (wearing sunglasses or not). Each person has 32 images capturing every combination of these features. The image resolution is 32 × 30. We removed the missing data and formed a 960 × 624 data matrix, where each of the 960 features represents a pixel value. The mini-newsgroups data comes from the UCI KDD repository and contains 2000 articles from 20 newsgroups.
The second text data set is the CMU four-university WebKB data set as described in [6]. Both text data sets were processed following the standard procedure, including stemming and removing stopwords.

4.2.1 Results for the Digit Data

Table 3 shows the confusion matrix for the digit data for both methods 1 and 2. Because the results for methods 1 and 2 are similar, we will concentrate on the results from method 1. In each iteration, we partition the data into three clusters. In iteration 1, the resulting partition clustered digits {0, 4, 6}, {1, 7, 8} and {2, 3, 5, 9} into different groups. In iteration 2, our method clustered {0, 3, 5, 7, 9}, {1, 2, 6, 8} and {4} into another set of clusters. And, in iteration 3, the clusters we found are {0, 1, 8, 9}, {2, 3, 4, 6, 7} and {5}. These results show that in each iteration we can find a different way of partitioning the ten classes (digits).

In Figure 5, we present the mean image of each cluster obtained by method 1 in three iterations. Below each image we show the dominant digits contained in the cluster. For a digit to be considered as contained in a cluster, we require that at least 70% of its data points fall into the cluster. It is interesting to note that digits 4 and 5 were not well captured by any of the clusters in iteration 1. In contrast, in iteration 2, we see digit 4 well separated and captured by cluster 2. In iteration 3, we were able to capture digit 5 nicely in a single cluster. This further demonstrated that our method is capable of discovering multiple reasonable structures from the data.

Table 3. Confusion Matrix for the Digits Data

DIGIT    METHOD 1 (C1  C2  C3)    METHOD 2 (C1  C2  C3)
IT1
0        550  0    4             550  0    4
4        371  197  0             369  199  0
6        555  2    1             555  2    1
1        12   477  82            12   477  82
7        0    566  0             0    566  0
8        15   321  218           15   321  218
2        5    27   525           5    27   525
3        0    41   531           0    41   531
5        35   236  287           35   236  287
9        2    152  408           2    152  408
IT2
0        547  2    5             548  2    4
3        380  92   100           348  66   158
5        326  216  16            322  212  24
7        492  68   6             492  67   7
9        470  8    84            458  9    95
2        58   490  9             58   488  11
6        12   539  7             12   532  14
8        182  354  18            173  350  31
1        2    308  261           0    285  286
4        65   41   462           138  49   381
IT3
0        529  4    21            545  1    8
1        478  45   48            394  77   100
8        484  46   24            317  213  24
9        381  54   127           265  143  154
2        120  397  40            522  21   14
3        106  454  12            151  393  28
6        107  407  44            47   486  25
7        16   536  14            54   500  12
4        67   275  226           187  159  222
5        27   1    530           18   5    535

4.2.2 Results for the Face Data

The face data is a very interesting data set because it can be grouped in several different ways (e.g., by person, pose, etc.). We designed the experiment to see whether we can get different clustering information in different iterations. First, we begin with the number of clusters K = 20 in the first iteration, hoping to find the 20 persons in the data.
Figure 5. The mean image of each cluster obtained by method 1 in iterations 1, 2, and 3, with the dominant digits listed below each image (iteration 1: "2","3","9"; "1","7"; "0","6").
… the method found a different clustering that corresponds to the universities, as shown in Table 4.

5 Conclusion
Table 5. Confusion Matrix for the Mini-Newsgroups Data

ITERATION 1 (K=3)             C1   C2   C3
comp.graphics                 88   0    12
comp.os.ms                    95   0    5
comp.sys.ibm.pc.hardware      94   0    6
comp.sys.mac.hardware         88   0    12
comp.windows.x                87   0    13
rec.autos                     81   0    19
rec.motorcycles               82   0    18
rec.sport.baseball            81   0    19
rec.sport.hockey              71   2    27
sci.crypt                     0    68   32
sci.electronics               0    76   24
sci.med                       0    78   22
sci.space                     0    74   26
talk.politics.guns            0    70   30
talk.politics.mideast         0    61   39
talk.politics.misc            0    72   28
talk.religion.misc            0    77   23

ITERATION 2 (K=4)             C1   C2   C3   C4
comp.graphics                 98   0    2    0
comp.os.ms                    94   0    0    6
comp.sys.ibm.pc.hardware      78   15   3    4
comp.sys.mac.hardware         66   20   3    11
comp.windows.x                43   39   5    13
rec.autos                     28   51   6    15
rec.motorcycles               17   59   6    18
rec.sport.baseball            10   67   4    19
rec.sport.hockey              5    62   4    29
sci.crypt                     5    8    57   30
sci.electronics               1    9    65   25
sci.med                       1    22   61   16
sci.space                     0    16   58   26
talk.politics.guns            0    37   20   43
talk.politics.mideast         5    39   11   45
talk.politics.misc            3    45   6    46
talk.religion.misc            1    58   3    38

ITERATION 3 (K=4)             C1   C2   C3   C4
comp.graphics                 33   32   6    29
comp.os.ms                    42   23   10   25
comp.sys.ibm.pc.hardware      17   45   11   27
comp.sys.mac.hardware         15   41   20   24
comp.windows.x                19   40   18   23
rec.autos                     15   47   27   11
rec.motorcycles               10   54   22   14
rec.sport.baseball            7    51   33   9
rec.sport.hockey              5    66   21   8
sci.crypt                     5    15   68   12
sci.electronics               10   9    65   16
sci.med                       31   8    46   15
sci.space                     15   24   48   13
talk.politics.guns            49   19   18   14
talk.politics.mideast         45   24   16   15
talk.politics.misc            55   12   12   21
talk.religion.misc            56   8    20   16

References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pages 94–105, 1998.
[2] S. D. Bay. The UCI KDD archive, 1999.
[3] S. Bickel and T. Scheffer. Multi-view clustering. In Proc. of the IEEE Int'l Conf. on Data Mining, pages 19–26, 2004.
[4] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[5] G. Chechik and N. Tishby. Extracting relevant structures with side information. In Advances in Neural Information Processing Systems 15 (NIPS-2002), 2003.
[6] CMU. CMU 4 universities WebKB data, 1997.
[7] I. S. Dhillon and D. M. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.
[8] C. Ding, X. He, H. Zha, and H. Simon. Adaptive dimension reduction for clustering high dimensional data. In Proc. of the IEEE Int'l Conf. on Data Mining, pages 147–154, 2002.
[9] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley & Sons, NY, 1973.
[10] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. of the Int'l Conf. on Machine Learning, 2003.
[11] E. Forgy. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21:768, 1965.
[12] K. Fukunaga. Statistical Pattern Recognition (second edition). Academic Press, San Diego, CA, 1990.
[13] D. Gondek. Non-redundant clustering. PhD thesis, Brown University, 2005.
[14] D. Gondek and T. Hofmann. Conditional information bottleneck clustering. In The 3rd IEEE Intl. Conf. on Data Mining, Workshop on Clustering Large Data Sets, 2003.
[15] D. Gondek and T. Hofmann. Non-redundant data clustering. In Proc. of the 4th Intl. Conf. on Data Mining, 2004.
[16] D. Gondek and T. Hofmann. Non-redundant clustering with conditional ensembles. In Proc. of the 11th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (KDD'05), pages 70–77, 2005.
[17] D. Gondek, S. Vaithyanathan, and A. Garg. Clustering with model-level constraints. In Proc. of the SIAM International Conference on Data Mining, 2005.
[18] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, 1999.
[19] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[20] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
[21] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations Newsletter, 6(1):90–105, 2004.
[22] A. Strehl and J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.