
Approximation Algorithms for Tensor Clustering

Stefanie Jegelka¹, Suvrit Sra¹, and Arindam Banerjee²

¹ Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany
  {jegelka,suvrit}@tuebingen.mpg.de
² Univ. of Minnesota, Twin Cities, Minneapolis, MN, USA
  [email protected]

Abstract. We present the first (to our knowledge) approximation algorithm for tensor clustering, a powerful generalization of basic 1D clustering. Tensors are increasingly common in modern applications dealing with complex heterogeneous data, and clustering them is a fundamental tool for data analysis and pattern discovery. Akin to their 1D cousins, common tensor clustering formulations are NP-hard to optimize. But, unlike the 1D case, no approximation algorithms seem to be known. We address this imbalance and build on recent co-clustering work to derive a tensor clustering algorithm with approximation guarantees, allowing metrics and divergences (e.g., Bregman) as objective functions. Therewith, we answer two open questions of Anagnostopoulos et al. (2008). Our analysis yields a constant approximation factor independent of data size; a worst-case example shows this factor to be tight for Euclidean co-clustering. Empirically, however, the approximation factor is observed to be conservative, so our method can also be used in practice.

1 Introduction

Tensor clustering is a recent generalization of the basic one-dimensional clustering problem: it seeks to partition an order-m input tensor into coherent sub-tensors while minimizing some cluster quality measure [1, 2]. For example, in co-clustering, which is the special case of tensor clustering with m = 2, one simultaneously partitions rows and columns of an input matrix to obtain coherent submatrices, often while minimizing a Bregman divergence [3, 4].

Being generalizations of the 1D case, common tensor clustering formulations are also NP-hard to optimize. But despite the existence of a vast body of research on approximation algorithms for 1D clustering problems (e.g., [5–10]), there seem to be no published approximation algorithms for tensor clustering. Even for (2D) co-clustering, there are only two recent attempts, [11] and [12] (from 2008). Both prove an approximation factor of 2α1 for Euclidean co-clustering given an α1-approximation for k-means, and show constant approximation factors for ℓ1-norm ([12] only for binary matrices) and ℓp-norm [11] based variants.

Tensor clustering is a basic data analysis task with growing importance; several domains now deal frequently with tensor data, e.g., data mining [13], computer graphics [14], and computer vision [2]. We refer the reader to [15] for a recent survey of tensors and their applications. The simplest tensor clustering scenario, namely co-clustering (also known as bi-clustering), is more established [12, 4, 16–18]. Tensor clustering is less well known, though several researchers have considered it before [1, 2, 19–21].
1.1 Contributions
The main contribution of this paper is the analysis of an approximation algorithm for tensor clustering that achieves an approximation ratio of O(p(m)α), where m is the order of the tensor, p(m) = m or p(m) = m^{log₂ 3}, and α is the approximation factor of a corresponding 1D clustering algorithm. Our results apply to a fairly broad class of objective functions, including metrics such as ℓp norms, Hilbertian metrics [22, 23], and divergence functions such as Bregman divergences [24] (with some assumptions). As corollaries, our results solve two open problems posed by [12], viz., whether their methods for Euclidean co-clustering can be extended to Bregman co-clustering, and whether the approximation guarantees can be extended to tensor clustering. The bound also gives insight into properties of the tensor clustering problem. We give an example showing the tightness of our bound for the squared Euclidean distance, and provide an experimental validation of the theoretical claims, which forms an additional contribution.

2 Background & Problem

Traditionally, “center”-based clustering algorithms seek partitions of the columns of an input matrix X = [x1, . . . , xn] into clusters C = {C1, . . . , CK}, and find “centers” µk that minimize the objective

    J(C) = Σ_{k=1}^{K} Σ_{x ∈ C_k} d(x, µ_k),    (2.1)

where the function d(x, y) measures cluster quality. The “center” µk of cluster Ck is given by the mean of the points in Ck when d(x, y) is a Bregman divergence [25]. Co-clustering extends (2.1) to seek simultaneous partitions (and centers µIJ) of the rows and columns of X, so that the objective function

    J(C) = Σ_{I,J} Σ_{i ∈ I, j ∈ J} d(x_{ij}, µ_{IJ})    (2.2)

is minimized; µIJ denotes the (scalar) “center” of the cluster described by the row and column index sets I and J. We generalize formulation (2.2) to tensors in Section 2.2 after introducing some background on tensors.
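For concreteness, the following minimal sketch (our illustration, assuming NumPy, hard assignments given as integer label arrays, and the squared Euclidean divergence; the function name is ours, not from the paper) computes the block-average co-clustering objective (2.2):

```python
import numpy as np

def coclustering_loss(X, row_labels, col_labels, k_rows, k_cols):
    """Block-average co-clustering objective (2.2) for d(x, y) = (x - y)^2.

    X          : (n_rows, n_cols) data matrix
    row_labels : integer cluster index per row    (values in 0..k_rows-1)
    col_labels : integer cluster index per column (values in 0..k_cols-1)
    """
    loss = 0.0
    for I in range(k_rows):
        for J in range(k_cols):
            block = X[np.ix_(row_labels == I, col_labels == J)]
            if block.size:                 # skip empty blocks
                mu_IJ = block.mean()       # optimal scalar "center" for squared error
                loss += ((block - mu_IJ) ** 2).sum()
    return loss
```

For a Bregman divergence the block mean remains the optimal center [25]; for a general metric, the inner mean and squared error would be replaced by the corresponding center and divergence.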

2.1 Tensors
An order-m tensor A may be viewed as an element of the vector space R^{n1×···×nm}. An individual entry of A is given by the multiply-indexed value a_{i1 i2 ... im}, where i_j ∈ {1, . . . , n_j} for 1 ≤ j ≤ m. For us, the most important tensor operation is multilinear matrix multiplication, which generalizes matrix multiplication [26]. Matrices act on other matrices by either left or right multiplication. Similarly, for an order-m tensor, there are m dimensions on which a matrix may act. For A ∈ R^{n1×n2×···×nm} and matrices P1 ∈ R^{p1×n1}, . . ., Pm ∈ R^{pm×nm}, multilinear multiplication is defined by the action of the P_i on the different dimensions of A, and is denoted by A′ = (P1, . . . , Pm)·A ∈ R^{p1×···×pm}. The individual components of A′ are given by

    a′_{i1 i2 ... im} = Σ_{j1,...,jm=1}^{n1,...,nm} p^{(1)}_{i1 j1} · · · p^{(m)}_{im jm} a_{j1 ... jm},

where p^{(k)}_{ij} denotes the ij-th entry of matrix Pk. The inner product between two tensors A and B is defined as

    ⟨A, B⟩ = Σ_{i1,...,im} a_{i1...im} b_{i1...im},    (2.3)

and this inner product satisfies the following natural property (which generalizes the familiar ⟨Ax, By⟩ = ⟨x, AᵀBy⟩):

    ⟨(P1, . . . , Pm)·A, (Q1, . . . , Qm)·B⟩ = ⟨A, (P1ᵀQ1, . . . , PmᵀQm)·B⟩.    (2.4)

Moreover, the Frobenius norm satisfies ‖A‖² = ⟨A, A⟩. Finally, we define an arbitrary divergence function d(X, Y) as an elementwise sum of individual divergences, i.e.,

    d(X, Y) = Σ_{i1,...,im} d(x_{i1,...,im}, y_{i1,...,im}),    (2.5)

and we will define the scalar divergence d(x, y) as the need arises.
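For readers who prefer code, here is a minimal NumPy sketch of the multilinear multiplication (P1, . . . , Pm)·A; it applies each matrix along its mode via tensordot. The function name and argument layout are our own illustration, not from the paper.

```python
import numpy as np

def multilinear_multiply(A, matrices):
    """Compute (P1, ..., Pm) . A, i.e., apply matrices[j] along dimension j of A.

    A        : ndarray of shape (n1, ..., nm)
    matrices : list of m arrays, matrices[j] has shape (pj, nj)
    Returns an ndarray of shape (p1, ..., pm).
    """
    out = A
    for j, P in enumerate(matrices):
        # Contract mode j of `out` with the columns of P; tensordot puts the new
        # axis first, so move it back to position j.
        out = np.moveaxis(np.tensordot(P, out, axes=([1], [j])), 0, j)
    return out
```

For m = 2 this reduces to the familiar matrix product P1 @ A @ P2.T.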
2.2 Problem Formulation
Let A ∈ R^{n1×···×nm} be an order-m tensor that we wish to partition into coherent sub-tensors (or clusters). In 3D, we divide a cube into smaller cubes by cutting orthogonal to (i.e., along) each dimension (Fig. 1). A basic approach is to minimize the sum of the divergences between the individual (scalar) elements in each cluster and their corresponding (scalar) cluster “centers”. Readers familiar with [4] will recognize this as the “block-average” variant of tensor clustering. Assume that each dimension j (1 ≤ j ≤ m) is partitioned into kj clusters. Let Cj ∈ {0, 1}^{nj×kj} be the cluster indicator matrix for dimension j, where the ik-th entry of such a matrix is one if and only if index i belongs to the k-th cluster (1 ≤ k ≤ kj) for dimension j. Then, the tensor clustering problem is (cf. (2.2)):

    minimize_{C1,...,Cm, M}  d(A, (C1, . . . , Cm)·M),   s.t. Cj ∈ {0, 1}^{nj×kj},    (2.6)

where the tensor M collects all the cluster “centers.”

3 Algorithm and Analysis

Given formulation (2.6), our algorithm, which we name Combination Tensor Clustering (CoTeC), follows this simple outline:
1. Cluster along each dimension j, using an approximation algorithm to obtain clustering Cj; let C = (C1, . . . , Cm).
2. Compute M = argmin_{X ∈ R^{k1×···×km}} d(A, C·X).
3. Return the tensor clustering (C1, . . . , Cm) (with representatives M).
Remark 1: Instead of clustering one dimension at a time in Step 1, we can also cluster along t dimensions simultaneously. In such a t-dimensional clustering of an order-m tensor, we form groups of order-(m − t) tensors.
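The outline above translates directly into code. The following minimal sketch is our illustration for the squared Euclidean case with t = 1, using scikit-learn's KMeans as the 1D subroutine; it is not the authors' implementation, and the function name and interface are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cotec(A, num_clusters, seed=0):
    """CoTeC sketch for d(x, y) = (x - y)^2 and t = 1.

    A            : ndarray of shape (n1, ..., nm)
    num_clusters : list [k1, ..., km], number of clusters per dimension
    Returns (labels, M): one label array per dimension and the means tensor M.
    """
    m = A.ndim
    labels = []
    # Step 1: 1D clustering along each dimension (rows of the mode-j unfolding).
    for j in range(m):
        unfolding = np.moveaxis(A, j, 0).reshape(A.shape[j], -1)
        km = KMeans(n_clusters=num_clusters[j], n_init=10, random_state=seed)
        labels.append(km.fit_predict(unfolding))
    # Step 2: M = argmin_X ||A - C . X||^2, i.e., the block means of the induced clusters.
    M = np.zeros(num_clusters)
    for block in np.ndindex(*num_clusters):
        mask = np.ix_(*[labels[j] == block[j] for j in range(m)])
        entries = A[mask]
        if entries.size:
            M[block] = entries.mean()
    return labels, M
```

Step 1 is where the approximation guarantee enters: replacing KMeans by distance-specific seeding alone already yields an expected O(log K) factor [7], and the algorithms of [5, 6] give (1 + ε)-type factors, as listed in Table 1.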
Fig. 1. (Illustration omitted.) CoTeC: cluster along dimensions one (C1), two (C2), and three (C3) separately and combine the results; µ_{3,1,3} is the mean of sub-tensor (cluster) (3,1,3). The various clusters in the final tensor clustering are color coded to indicate the combination of contributions from clusters along each dimension.

Our algorithm might be counterintuitive to some readers, as merely clustering along individual dimensions and then combining the results goes against the idea of “co”-clustering, where one simultaneously clusters along different dimensions. However, our analysis shows that dimension-wise clustering suffices to obtain strong approximation guarantees for tensor clustering, a fact often observed empirically too. It is also easy to see that CoTeC runs in time O((m/t)T(t)) if the subroutine for dimension-wise clustering takes T(t) time.
The main contribution of this paper is the following approximation guarantee for CoTeC, which we prove in the remainder of this section.
Theorem 1 (Approximation). Let A be an order-m tensor and let Cj denote its clustering along the j-th subset of t dimensions (1 ≤ j ≤ m/t), as obtained from a multiway clustering algorithm³ with guarantee αt. Let C = (C1, . . . , Cm/t) denote the induced tensor clustering, and JOPT(m) the objective of the best m-dimensional clustering. Then,

    J(C) ≤ p(m/t) ρ_d α_t JOPT(m),    (3.1)

with
1. ρ_d = 1 and p(m/t) = 2^{log₂(m/t)} = m/t if d(x, y) = (x − y)², and
2. ρ_d = 1 and p(m/t) = 3^{log₂(m/t)} = (m/t)^{log₂ 3} if d(x, y) is a metric⁴.

³ We say an approximation algorithm has guarantee α if it yields a solution that achieves an objective value within a factor O(α) of the optimum.
⁴ The results extend trivially to λ-relaxed metrics that satisfy d(x, y) ≤ λ(d(x, z) + d(z, y)); the corresponding approximation factor is simply scaled by λ.

Thm. 1 is quite general, and it can be combined with some natural assumptions (see §3.3) to yield results for tensor clustering with general divergence functions (though ρ_d may then be greater than 1). For particular choices of d one can perhaps derive tighter bounds, though for squared Euclidean distances we provide an explicit example (Fig. 2) that shows the bound to be tight in 2D.

3.1 Analysis: Theorem 1, Euclidean Case
We begin our proof with the Euclidean case, i.e., d(x, y) = (x − y)². Our proof is inspired by the techniques of [12]. We establish that, given a clustering algorithm which clusters along t of the m dimensions at a time⁵ with an approximation factor of αt, CoTeC achieves an objective within a factor O(⌈m/t⌉ αt) of the optimum. For example, for t = 1 we can use the seeding methods of [8, 9] or the stronger approximation algorithms of [5]. We assume without loss of generality (wlog) that m = 2^h t for an integer h (otherwise, pad with empty dimensions).

⁵ One could also consider clustering differently sized subsets of the dimensions, say {t1, . . . , tr}, where t1 + · · · + tr = m. However, this requires unilluminating notational jugglery, which we skip for simplicity of exposition.
Since for the squared Frobenius norm each cluster “center” is given by the mean, we can recast Problem (2.6) into a more convenient form. To that end, note that the individual entries of the means tensor M are given by (cf. (2.2))

    M_{I1...Im} = (1 / (|I1| · · · |Im|)) Σ_{i1 ∈ I1, ..., im ∈ Im} a_{i1...im},    (3.2)

with index sets Ij for 1 ≤ j ≤ m. Let C̄j be the normalized cluster indicator matrix obtained by normalizing the columns of Cj, so that C̄jᵀC̄j = I_{kj}. Then, we can rewrite (2.6) in terms of projection matrices Pj as

    minimize_{C = (C1,...,Cm)}  J(C) = ‖A − (P1, . . . , Pm)·A‖²,   s.t. Pj = C̄jC̄jᵀ.    (3.3)
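A tiny numeric check (our illustration, not from the paper) of why (3.3) is equivalent to (2.6)/(3.2): applying the projections Pj = C̄jC̄jᵀ along each mode replaces every entry by the mean of its block.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                      # toy order-2 tensor
labels = [np.array([0, 0, 1, 1, 2, 2]),          # row clustering (3 clusters)
          np.array([0, 1, 1, 0])]                # column clustering (2 clusters)

projections = []
for lab, k in zip(labels, (3, 2)):
    C = np.eye(k)[lab]                           # 0/1 indicator matrix, shape (n_j, k_j)
    Cbar = C / np.sqrt(C.sum(axis=0))            # column-normalized indicators
    projections.append(Cbar @ Cbar.T)            # projection onto block-constant vectors

P1, P2 = projections
blockwise = P1 @ A @ P2.T                        # (P1, P2) . A for an order-2 tensor
# Every entry of `blockwise` equals the mean of its (row-cluster, column-cluster)
# block, so ||A - (P1, P2) . A||^2 is exactly the block-average objective (2.6).
print(np.round(blockwise, 3))
```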

Lemma 1 (Pythagorean). Let P = (P1, . . . , Pt) and P⊥ = (I − P1, . . . , I − Pt) be collections of projection matrices Pj, and let S and R be arbitrary collections of m − t projection matrices. Then,

    ‖(P, S)·A + (P⊥, R)·B‖² = ‖(P, S)·A‖² + ‖(P⊥, R)·B‖².

Proof. Using ‖A‖² = ⟨A, A⟩ we can rewrite the left-hand side as

    ‖(P, S)·A + (P⊥, R)·B‖² = ‖(P, S)·A‖² + ‖(P⊥, R)·B‖² + 2⟨(P, S)·A, (P⊥, R)·B⟩,

and the last term is immediately seen to be zero using Property (2.4) and the fact that Pjᵀ(I − Pj) = Pj(I − Pj) = 0. □
Some more notation: since we cluster along t dimensions at a time, we recursively partition the initial set of all m dimensions until (after log(m/t) + 1 steps) the sets of dimensions have length t. Let l denote the level of recursion, starting at l = log(m/t) = h and going down to l = 0. At level l, the sets of dimensions have length 2^l t (so that for l = 0 we have t dimensions). We represent each clustering along a subset of 2^l t dimensions by its corresponding 2^l t projection matrices. We gather these projection matrices into the collection P_i^l (boldface denotes a collection), where the index i ranges from 1 to 2^{h−l}.

We also need some notation to represent a complete tensor clustering along all m dimensions in which only a subset of 2^l t dimensions is clustered. We pad the collection P_i^l with m − 2^l t identity matrices for the non-clustered dimensions, and call this padded collection Q_i^l. With the recursive partitioning of the dimensions, Q_i^l subsumes Q_j^0 for 2^l(i − 1) < j ≤ 2^l i, i.e.,

    Q_i^l = Π_{j = 2^l(i−1)+1}^{2^l i} Q_j^0.

At level 0, the algorithm yields the collections Q_i^0 and P_i^0. The remaining clusterings are simply combinations, i.e., products, of these level-0 clusterings. We denote the collection of m − 2^l t identity matrices (of appropriate size) by I^l, so that Q_1^l = (P_1^l, I^l). Accoutered with our notation, we now prove the main lemma that relates the combined clustering to its sub-clusterings.
Lemma 2. Let A be an order-m tensor and m ≥ 2^l t. The objective function for any 2^l t-dimensional clustering P_i^l = (P_{2^l(i−1)+1}^0, . . . , P_{2^l i}^0) can be bounded via the sub-clusterings along only one set of dimensions of size t as

    ‖A − Q_i^l·A‖² ≤ 2^l max_{2^l(i−1) < j ≤ 2^l i} ‖A − Q_j^0·A‖².    (3.4)

We can always (wlog) permute dimensions so that any set of 2^l clustered dimensions maps to the first 2^l ones. Hence, it suffices to prove the lemma for i = 1, i.e., the first 2^l dimensions.
Proof. We prove the lemma for i = 1 by induction on l.
Base case: Let l = 0. Then Q_1^l = Q_1^0, and (3.4) holds trivially.
Induction step: Assume the claim holds for l ≥ 0. Consider a clustering P_1^{l+1} = (P_1^l, P_2^l), or equivalently Q_1^{l+1} = Q_1^l Q_2^l. Using P + P⊥ = I, we decompose A as

    A = (P_1^{l+1} + P_1^{l+1,⊥}, I^{l+1})·A = (P_1^l + P_1^{l,⊥}, P_2^l + P_2^{l,⊥}, I^{l+1})·A
      = (P_1^l, P_2^l, I^{l+1})·A + (P_1^l, P_2^{l,⊥}, I^{l+1})·A + (P_1^{l,⊥}, P_2^l, I^{l+1})·A + (P_1^{l,⊥}, P_2^{l,⊥}, I^{l+1})·A
      = Q_1^l Q_2^l·A + Q_1^l Q_2^{l,⊥}·A + Q_1^{l,⊥} Q_2^l·A + Q_1^{l,⊥} Q_2^{l,⊥}·A,

where Q_1^{l,⊥} = (P_1^{l,⊥}, I^l). Since Q_1^{l+1} = Q_1^l Q_2^l, the Pythagorean property (Lemma 1) yields

    ‖A − Q_1^{l+1}·A‖² = ‖Q_1^l Q_2^{l,⊥}·A‖² + ‖Q_1^{l,⊥} Q_2^l·A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥}·A‖².

Combining the above equalities with the assumption (wlog) ‖Q_1^{l,⊥} Q_2^l·A‖² ≥ ‖Q_1^l Q_2^{l,⊥}·A‖², we obtain the inequalities

    ‖A − Q_1^l Q_2^l·A‖² ≤ 2(‖Q_1^{l,⊥} Q_2^l·A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥}·A‖²)
                        = 2‖Q_1^{l,⊥} Q_2^l·A + Q_1^{l,⊥} Q_2^{l,⊥}·A‖² = 2‖Q_1^{l,⊥}(Q_2^l + Q_2^{l,⊥})·A‖²
                        = 2‖Q_1^{l,⊥}·A‖² = 2‖A − Q_1^l·A‖²
                        ≤ 2 max_{j ∈ {1,2}} ‖A − Q_j^l·A‖² ≤ 2 · 2^l max_{1 ≤ j ≤ 2^{l+1}} ‖A − Q_j^0·A‖²,

where the last step follows from the induction hypothesis (3.4), and the two norm terms in the first line are combined using the Pythagorean property. □
Proof (Thm. 1, Case 1). Let m = 2^h t. Using an algorithm with guarantee αt, we cluster each subset (indexed by i) of t dimensions to obtain Q_i^0. Let S_i be the optimal sub-clustering of subset i, i.e., the result that Q_i^0 would be if αt were 1. We bound the objective for the collection of all m/t sub-clusterings P_1^h = Q_1^h as

    ‖A − Q_1^h·A‖² ≤ 2^h max_j ‖A − Q_j^0·A‖² ≤ 2^h αt max_j ‖A − S_j·A‖².    (3.5)

The first inequality follows from Lemma 2, while the last inequality follows from the αt approximation factor that we used to obtain each sub-clustering Q_j^0.

So far we have related our approximation to an optimal sub-clustering along a set of dimensions. Let us hence look at the relation between such an optimal sub-clustering S of the first t dimensions (via permutation, these dimensions correspond to an arbitrary subset of size t) and the optimal tensor clustering F along all m = 2^h t dimensions. Recall that a clustering can be expressed either by the projection matrices collected in Q_1^l, or by cluster indicator matrices Ci together with the mean tensor M, so that

    (C1, . . . , C_{2^l t}, I^l)·M = Q_1^l·A.

Let C_j^S and C_j^F be the dimension-wise cluster indicator matrices for S and F, respectively. By definition, S solves

    min_{C1,...,Ct, M} ‖A − (C1, . . . , Ct, I^0)·M‖²,   s.t. Cj ∈ {0, 1}^{nj×kj},

which makes S even better than the sub-clustering (C_1^F, . . . , C_t^F) induced by the optimal m-dimensional clustering F. Thus,

    ‖A − S·A‖² ≤ min_M ‖A − (C_1^F, . . . , C_t^F, I^0)·M‖²
              ≤ ‖A − (C_1^F, . . . , C_t^F, I^0)(I, . . . , I, C_{t+1}^F, . . . , C_m^F)·M^F‖²
              = ‖A − F·A‖²,    (3.6)

where M^F is the tensor of means for the optimal m-dimensional clustering. Combining (3.5) with (3.6) yields the final bound for the combined clustering C = Q_1^h,

    J_m(C) = ‖A − Q_1^h·A‖² ≤ 2^h αt ‖A − F·A‖² = 2^h αt JOPT(m),

which completes the proof of the theorem. □

Tightness of Bound: How tight is the bound for CoTeC implied by Thm. 1? The following example shows that for Euclidean co-clustering, i.e., m = 2, the bound is tight. Specifically, for every 0.25 > γ > 0, there exists a matrix for which the approximation is as bad as J(C) = (m − γ)JOPT(m).

Let ε be such that γ = 4ε(1 + ε)^{−2}, and consider the matrix in Fig. 2:

          a        b        c        d
    1    −ε       −1        ε        1
    2     1        ε       −1       −ε
    3   10−ε       9      10+ε      11
    4    11      10+ε       9      10−ε

Fig. 2. Matrix with co-clustering approximation factor 2 − 4ε(1 + ε)^{−2}.

The optimal 1D row clustering C1 for the matrix in Figure 2 groups rows {1, 2} and {3, 4} together, and the optimal column clustering is C2 = ({a, b}, {c, d}). The co-clustering loss for the combination is J2(C1, C2) = 8 + 8ε². The optimal co-clustering, grouping columns {a, d} and {b, c} (and rows as in C1), achieves an objective of JOPT(2) = 4(1 + ε)². Relating these results, we get J2(C1, C2) = (2 − γ)JOPT(m). However, this example is a worst-case scenario; the average factor is much better in practice, as revealed by our experiments (§4). This, combined with the structure of the negative example, suggests that with some assumptions on the data one can probably obtain tighter bounds. Also note that the bound holds for a CoTeC-like scheme treating dimensions separately, but not necessarily for all approximation algorithms.
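A short numeric check of this worst-case example (our sketch, with a standalone loss helper; nothing here is from the authors' code):

```python
import numpy as np

def block_loss(X, row_labels, col_labels):
    """Squared-error co-clustering loss with block means as centers."""
    loss = 0.0
    for I in np.unique(row_labels):
        for J in np.unique(col_labels):
            block = X[np.ix_(row_labels == I, col_labels == J)]
            loss += ((block - block.mean()) ** 2).sum()
    return loss

eps = 0.01
A = np.array([[-eps, -1.0,  eps,  1.0],
              [ 1.0,  eps, -1.0, -eps],
              [10 - eps, 9.0, 10 + eps, 11.0],
              [11.0, 10 + eps, 9.0, 10 - eps]])

rows = np.array([0, 0, 1, 1])          # optimal 1D row clustering C1: {1,2}, {3,4}
cols_C2 = np.array([0, 0, 1, 1])       # optimal 1D column clustering C2: {a,b}, {c,d}
cols_opt = np.array([0, 1, 1, 0])      # columns of the optimal co-clustering: {a,d}, {b,c}

J_comb = block_loss(A, rows, cols_C2)  # equals 8 + 8*eps**2
J_opt = block_loss(A, rows, cols_opt)  # equals 4*(1 + eps)**2
print(J_comb / J_opt)                  # approaches 2 as eps -> 0
```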

3.2 Analysis: Theorem 1, Metric Case

Now we present our proof of Thm. 1 for the case where d(x, y) is a metric. For this case, recall that the tensor clustering problem is

    minimize_{(C1,...,Cm), M}  J(C) = d(A, (C1, . . . , Cm)·M),   s.t. Cj ∈ {0, 1}^{nj×kj}.    (3.7)

Since in general the best representative M is not the mean tensor, we cannot use the shorthand P·A (as in the Euclidean case), so the proof differs from the Euclidean one. The following lemma is the basis of the induction for this case of Thm. 1.

Lemma 3. Let A be of order m = 2^h t, and let R_i^l be the clustering of the i-th subset of 2^l t dimensions (for l < h) with an approximation guarantee of α_{2^l t}; R_i^l combines the Cj in a manner analogous to how Q_i^l combines projection matrices. Then the combination R^{l+1} = R_i^l R_j^l, i ≠ j, satisfies

    min_M d(A, R^{l+1}·M) ≤ 3 α_{2^l t} min_M d(A, F^{l+1}·M),

where F^{l+1} is the optimal joint clustering of the dimensions covered by R^{l+1} (as before, we always assume that R_i^l and R_j^l cover disjoint subsets of dimensions).

Proof. Without loss of generality, we prove the lemma for R_1^{l+1} = R_1^l R_2^l. Let M_i^l = argmin_X d(A, R_i^l·X) be the associated representatives for i = 1, 2, and S_i^l the optimal 2^l-dimensional clusterings. Further, let F_1^{l+1} = F_1^l F_2^l be the optimal 2^{l+1}-dimensional clustering. The following step is vital in relating the objective values of R_1^{l+1} and S_i^l; the optimal sub-clusterings will eventually be bounded by the objective of the optimal F_1^{l+1}. Let L = 2^{l+1}, and

    M̂ = argmin_{X ∈ R^{k1×···×kL×n_{L+1}×···×nm}} d(R_1^l·M_1^l, R_1^l R_2^l·X).

Let i, j be multi-indices running over dimensions 1 to 2^l and 2^l + 1 to 2^{l+1}, respectively, and let r be the multi-index covering the remaining m − L dimensions. The multi-indices of the clusters defined by R_1^l and R_2^l are I and J, respectively. Since M̂ is the element-wise minimum, we have

    d(R_1^l·M_1^l, R_1^l R_2^l·M̂) = Σ_{I,J} Σ_{i∈I, r} min_{µ_{IJr} ∈ R} Σ_{j∈J} d((µ_1^l)_{Ijr}, µ_{IJr})
                                   ≤ Σ_{I,J} Σ_{i∈I, r} Σ_{j∈J} d((µ_1^l)_{Ijr}, (µ_2^l)_{iJr}) = d(R_1^l·M_1^l, R_2^l·M_2^l).

Using this relation and the triangle inequality, we can now relate the objectives for the combined clustering and for the optimal sub-clusterings:

    min_{M^{l+1}} d(A, R_1^l R_2^l·M^{l+1}) ≤ d(A, R_1^l R_2^l·M̂)
        ≤ d(A, R_1^l·M_1^l) + d(R_1^l·M_1^l, R_1^l R_2^l·M̂)
        ≤ d(A, R_1^l·M_1^l) + d(R_1^l·M_1^l, R_2^l·M_2^l)
        ≤ 2 d(A, R_1^l·M_1^l) + d(A, R_2^l·M_2^l)
        ≤ 2 α_{2^l t} min_{X1} d(A, S_1^l·X1) + α_{2^l t} min_{X2} d(A, S_2^l·X2).    (3.8)

However, owing to the optimality of S_1^l, we have

    min_{X_1^l} d(A, S_1^l·X_1^l) ≤ min_{Y^l} d(A, F_1^l·Y^l) ≤ min_{Y^{l+1}} d(A, F_1^l F_2^l·Y^{l+1}),

and analogously for S_2^l. Plugging this inequality into (3.8), we get

    min_{M^{l+1}} d(A, R_1^l R_2^l·M^{l+1}) ≤ 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^l F_2^l·Y^{l+1}) = 3 α_{2^l t} min_{Y^{l+1}} d(A, F_1^{l+1}·Y^{l+1}). □

Proof (Thm. 1, Case 2). Given Lemma 3, the proof of Thm. 1 for the metric case follows easily by induction if we hierarchically combine the sub-clusterings and use α_{2^{l+1} t} = 3 α_{2^l t} for l ≥ 0, as stated by the lemma. □

3.3 Implications
We now mention several important implications of Theorem 1.
Clustering with Bregman divergences. Bregman divergence based clustering and co-clustering are well-studied problems [25, 4]. Here, the function d(x, y) is parametrized by a strictly convex function f [24], so that d(x, y) = B_f(x, y) = f(x) − f(y) − f′(y)(x − y). Under the assumption (see also [5, 6])

    σ_L ‖x − y‖² ≤ B_f(x, y) ≤ σ_U ‖x − y‖²    (3.9)

on the curvature of the divergence B_f(x, y), we can invoke Thm. 1 with ρ_d = σ_U/σ_L. The proofs are omitted for brevity and may be found in [27]. We would like to stress that such curvature bounds seem to be necessary to guarantee constant approximation factors for the underlying 1D clustering; this intuition is reinforced by the results of [28], who avoided such curvature assumptions and had to be content with a non-constant O(log n) approximation factor for information theoretic clustering.
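As an illustration of when assumption (3.9) holds (our example, not from the paper), the scalar mean-value form of a Bregman divergence gives explicit curvature constants on a compact domain:

```latex
% Mean-value form of a scalar Bregman divergence (Taylor expansion of f around y):
%   B_f(x, y) = f(x) - f(y) - f'(y)(x - y) = \tfrac{1}{2} f''(\xi)(x - y)^2
% for some \xi between x and y. If the data lie in [a, b] with 0 < a < b and
% f(x) = x \log x (the KL-type generator), then f''(\xi) = 1/\xi, so
\frac{1}{2b}(x - y)^2 \;\le\; B_f(x, y) \;\le\; \frac{1}{2a}(x - y)^2 ,
% i.e., (3.9) holds with \sigma_L = 1/(2b), \sigma_U = 1/(2a), and \rho_d = \sigma_U/\sigma_L = b/a.
% For f(x) = x^2 one gets B_f(x, y) = (x - y)^2, so \sigma_L = \sigma_U = 1 and \rho_d = 1
% (recovering the Euclidean case of Theorem 1).
```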
Clustering with ℓp-norms. Thm. 1 (metric case) immediately yields approximation factors for clustering with ℓp-norms. We note that for binary matrices, using t = 2 and the results of [11], we can obtain the slightly stronger guarantee

    J(C) ≤ 3^{log₂(m)−1} (1 + √2) α1 JOPT(m).
Exploiting 1D clustering results. Substituting the approximation factors α1 of existing 1D clustering algorithms into Thm. 1 (with t = 1) instantly yields specific bounds for the corresponding tensor clustering algorithms. Table 1 summarizes these results; we omit the proofs for lack of space, see [27] for details.

Table 1. Approximation guarantees for tensor clustering algorithms. K* denotes the maximum number of clusters, i.e., K* = max_j kj; c is some constant.

Problem Name               | Approx. Bound                           | Proof
Metric tensor clustering   | J(C) ≤ m(1 + ε) JOPT(m)                 | Thm. 1 + [6]
Bregman tensor clustering  | E[J(C)] ≤ 8mc(log K* + 2) JOPT(m)       | (3.9), Thm. 1 + [7]
Bregman tensor clustering  | J(C) ≤ m σU σL^{−1} (1 + ε) JOPT(m)     | (3.9), Thm. 1 + [5]
Bregman co-clustering      | Above two results with m = 2            | as above
Hilbertian metrics         | E[J(C)] ≤ 8m(log K* + 2) JOPT(m)        | See [27]

4 Experimental Results

Our bounds depend strongly on the approximation factor αt of the underlying t-dimensional clustering method. In our experiments, we study this dependence for t = 1, comparing the tensor clusterings arising from different 1D methods of varying sophistication. Keep in mind that the comparison of the 1D methods is meant to show their impact on the tensor clustering built on top of them.

Our experiments reveal that the empirical approximation factors are usually smaller than the theoretical bounds, and that these factors depend on statistical properties of the data. We also observe the linear dependence of the CoTeC objectives on the associated 1D objectives, as suggested by Thm. 1 (for the Euclidean case) and Table 1 (2nd row, for KL Divergence).

Further comparisons show that in practice, CoTeC is competitive with a greedy heuristic, SiTeC (Simultaneous Tensor Clustering), which simultaneously takes all dimensions into account but lacks theoretical guarantees. As expected, initializing SiTeC with CoTeC yields lower final objective values using fewer “simultaneous” iterations.
We focus on the Euclidean distance and the KL Divergence to test CoTeC. To study the effect of the 1D method, we use two seeding methods, uniform and distance-based (weighted farthest first) drawing. The latter ensures 1D approximation factors for E[J(C)] by [7] for Euclidean clustering and by [8, 9] for KL Divergence.

Fig. 3. Tensor clustering variants (diagram omitted): uniform or distance-specific 1D seeding, optionally followed by 1D k-means, gives the CoTeC variants; running SiTeC on top of each gives the corresponding SiTeC variants.
We use each seeding by itself and as an initialization for k-means to get four 1D methods for each divergence (see Fig. 3). We refer to the CoTeC combination of the corresponding independent 1D clusterings by the following abbreviations: (1) ‘r’: uniformly sample centers from the data points and assign each point to its closest center; (2) ‘s’: sample centers with distance-specific seeding [7–9] and assign each point to its closest center; (3) ‘rk’: initialize Euclidean or Bregman k-means with ‘r’; (4) ‘sk’: initialize Euclidean or Bregman k-means with ‘s’.
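For reference, here is a minimal sketch of the distance-specific seeding behind ‘s’ (our illustration of D²-style sampling for the squared Euclidean case, in the spirit of [7]; the Bregman variants of [8, 9] follow the same pattern with a different divergence):

```python
import numpy as np

def distance_seeding(X, k, rng=None):
    """Pick k centers: the first uniformly at random, each further center with
    probability proportional to the squared distance to its nearest chosen center."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)        # squared distance to nearest center
    for _ in range(1, k):
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.stack(centers)
```

Assigning each point to its closest sampled center then gives the 1D clustering used for ‘s’; ‘sk’ additionally runs k-means from these centers.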
[Figure 4: four plots omitted. Each panel compares the variants r, rk, rc, rkc, s, sk, sc, skc; the x axis shows the noise level σ, the y axis the empirical approximation factor.]
Fig. 4. Approximation factors for 3D clustering (left) and co-clustering (right) with increasing noise. Top row: Euclidean distances; bottom row: KL Divergence. The x axis shows σ, the y axis the empirical approximation factor.

The SiTeC method we compare to is the minimum sum-squared residue co-clustering of [29] for Euclidean distances in 2D, and a generalization of Algorithm 1 of [4] for 3D and Bregman 2D clustering. Additionally, we initialize SiTeC with the outcome of each of the four CoTeC variants, which yields four versions of SiTeC, namely rc, sc, rkc, and skc, initialized with the results of ‘r’, ‘s’, ‘rk’, and ‘sk’, respectively. These variants inherit the guarantees of CoTeC, as they monotonically decrease the objective value.

4.1 Experiments on Synthetic Data

For a controlled setting with synthetic data, we generate tensors A of size 75 × 75 × 50 and 75 × 75, for which we randomly choose a 5 × 5 × 5 tensor of means M and cluster indicator matrices Ci ∈ {0, 1}^{ni×5}. For clustering with Euclidean distances we add Gaussian noise (from N(0, σ²) with varying σ) to A, while for KL Divergence we use the sampling method of [4] with varying noise.

For each noise level to be tested, we repeat the 1D seeding 20 times on each of five generated tensors and average the resulting 100 objective values. To estimate the approximation factor αm on a tensor, we divide the achieved objective J(C) by the objective value of the “true” underlying tensor clustering. Figure 4 shows the empirical approximation factor α̂m for the Euclidean distance and the KL Divergence. Qualitatively, the plots for tensors of order 2 and 3 do not differ.
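A condensed sketch of this estimation protocol for one Euclidean setting (ours; it reuses the hypothetical cotec helper sketched in Section 3, and the printed α̂ is only an estimate because the planted clustering stands in for the unknown optimum):

```python
import numpy as np

rng = np.random.default_rng(0)
shape, k = (75, 75, 50), [5, 5, 5]

# Planted clustering: random labels per dimension and a random means tensor.
labels_true = [rng.integers(kj, size=nj) for nj, kj in zip(shape, k)]
M_true = rng.normal(size=k)
A = M_true[np.ix_(*labels_true)] + rng.normal(scale=1.0, size=shape)   # sigma = 1.0

def sq_loss(A, labels, M):
    """Squared-error objective ||A - (C1, ..., Cm) . M||^2 for label-encoded clusterings."""
    return ((A - M[np.ix_(*labels)]) ** 2).sum()

def block_means(A, labels, k):
    """Means tensor for fixed per-dimension labels (Step 2 of CoTeC)."""
    M = np.zeros(k)
    for idx in np.ndindex(*k):
        block = A[np.ix_(*[lab == i for lab, i in zip(labels, idx)])]
        if block.size:
            M[idx] = block.mean()
    return M

labels_hat, M_hat = cotec(A, k)          # hypothetical CoTeC sketch from Section 3
J_hat = sq_loss(A, labels_hat, M_hat)
J_true = sq_loss(A, labels_true, block_means(A, labels_true, k))
print("empirical approximation factor:", J_hat / J_true)
```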
Table 2. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective reference value (J2 for ‘r’) is given in the ‘x’ column of the x = r rows. (ii) Average number of SiTeC iterations.

(i) Bcell, Euclidean
k1  k2         x          xk     xc     xkc
20   3  x=r  5.75·10^5   31.66  20.05  33.05
        x=s  18.83       32.24  24.61  33.36
20   6  x=r  5.56·10^5   49.13  35.26  50.37
        x=s  34.97       50.55  43.93  51.66
50   3  x=r  5.63·10^5   31.10  14.77  31.76
        x=s  15.25       32.58  19.14  33.17
50   6  x=r  5.18·10^5   47.55  34.63  48.41
        x=s  36.22       49.83  43.77  50.55

(i) Bcell, KL
k1  k2         x           xk     xc     xkc
20   3  x=r  3.37·10^−1   17.59  22.23  23.26
        x=s  10.54        18.44  22.99  22.98
20   6  x=r  3.15·10^−1   18.62  24.51  25.43
        x=s  11.76        20.52  25.69  26.23
50   3  x=r  3.20·10^−1   15.70  20.12  21.07
        x=s   9.61        17.24  20.85  21.33
50   6  x=r  2.85·10^−1   16.38  21.61  22.57
        x=s  11.86        18.63  23.24  23.13

(ii) Bcell, Euclidean
k1  k2     rc          rkc         sc          skc
20   3   7.0 ± 1.4   2.0 ± 0.2   3.9 ± 1.0   2.2 ± 0.5
20   6  11.3 ± 2.3   2.6 ± 0.8   5.1 ± 2.0   2.7 ± 0.7
50   3   6.2 ± 1.9   2.0 ± 0.0   3.5 ± 2.0   2.0 ± 0.0
50   6   8.1 ± 2.1   2.1 ± 0.3   4.1 ± 1.6   2.0 ± 0.0

(ii) Bcell, KL
k1  k2     rc          rkc         sc          skc
20   3  10.6 ± 2.8   7.5 ± 2.0   7.4 ± 1.8   7.0 ± 2.2
20   6  12.6 ± 3.4   8.8 ± 2.9   8.4 ± 2.1   8.1 ± 2.0
50   3   9.1 ± 2.3   6.2 ± 1.3   6.9 ± 1.8   6.0 ± 1.3
50   6  10.5 ± 1.8   7.7 ± 2.1   8.1 ± 2.3   6.9 ± 1.0

In all settings, the empirical factor remains below the theoretical factor. The reason for the decreasing approximation factors with higher noise could be the lower accuracy of the estimates of JOPT on the one hand, and more similar objective values for all clusterings on the other. With low noise, distance-specific seeding (s) yields better results than uniform seeding (r), and adding k-means on top (rk, sk) improves the results of both. With Euclidean distances, CoTeC with well-initialized 1D k-means (sk) competes with SiTeC. For KL Divergence, though, SiTeC still improves on sk, and at high noise levels 1D k-means does not help: both rk and sk are only as good as their seeding-only counterparts r and s.

4.2 Experiments on Biological Data

We further assess the behavior of our method with gene expression data⁶ from multiple sources [30–32]. For brevity, we introduce only two of the data sets here, for which we present more detailed results; more datasets and experiments are described in [27].

The matrix Bcell [30] is a (1332 × 62) lymphoma microarray dataset of chronic lymphocytic leukemia, diffuse large B-cell lymphoma, and follicular lymphoma. The order-3 tensor Interferon consists of gene expression levels from MS patients treated with recombinant human interferon beta [32]. After removal of missing values, a complete 6 × 21 × 66 tensor remained. For experiments with KL Divergence, we normalized all tensors to have their entries sum to one. Since our analysis concerns the objective function J(C) alone, we disregard the “true” labels, which are available for only one of the dimensions.

For each data set, we repeat the sampling of centers 30 times and average the resulting objective values. Panel (i) of Table 2 (order-2) and of Table 3 (order-3) shows the objective value for the simplest CoTeC variant ‘r’ as a baseline, and the relative improvements achieved by the other methods. The methods are encoded as x, xk, xc, xkc, where x stands for r or s, depending on the row in the table.

⁶ We thank Hyuk Cho for kindly providing us his preprocessed 2D data sets.

Table 3. (i) Improvement of CoTeC and SiTeC variants upon ‘r’ in %; the respective reference value (J3 for ‘r’) is given in the ‘x’ column of the x = r rows.

(i) Interferon, KL
k1  k2  k3         x            xk     xc     xkc
 2   2   2  x=r  9.71·10^−1   38.58  42.46  43.53
            x=s  25.07        36.67  43.53  43.74
 2   2   3  x=r  8.17·10^−1   41.31  46.06  46.31
            x=s  33.63        43.90  46.82  47.16
 2   2   4  x=r  7.11·10^−1   39.79  44.05  45.62
            x=s  38.01        46.09  51.30  51.35

Figure 5 summarizes the average improvements for all five order-2 data sets studied in [27]. Groups indicate methods, and colors indicate seeding techniques. On average, a better seeding improves the results for all methods: the gray bars are higher than their black counterparts in all groups. Just as for synthetic data, 1D k-means improves the CoTeC results here too. SiTeC (groups 3 and 4) is better than CoTeC with mere seeding (r, s; group 1). Notably, for Euclidean distances, combining good 1D clusters obtained by k-means (rk, sk; group 2) is on average better than SiTeC initialized with simple seeding (rc, sc; group 3). For KL Divergence, on the other hand, SiTeC still outperforms all CoTeC variations.

Given the limitation to single dimensions, CoTeC performs surprisingly well in comparison to SiTeC. Additionally, SiTeC initialized with CoTeC converges faster to better solutions, further underscoring the utility of CoTeC.

[Figure 5: bar plots omitted. Panels: (i) Eucl. Dist., (i) KL, (ii) Eucl. Dist., (ii) KL; bars grouped by method (x, xk, xc, xkc), with separate bars for uniform (x = r) and distance-specific (x = s) seeding.]
Fig. 5. (i) % improvement of the objective J2(C) with respect to uniform 1D seeding (r), averaged over all order-2 data sets and parameter settings (details in [27]). (ii) Average number of SiTeC iterations, in % with respect to initialization by r.

Relation to 1D Clusterings. Our experiments support the theoretical results and the intuitive expectation that better 1D clusterings yield better CoTeC solutions. Can we quantify this relation?

Theorem 1 suggests a linear dependence of the order-m factor αm on α1. These factors are difficult to check empirically when the optimal clusterings are unknown. However, on a fixed matrix JOPT(2)/JOPT(1) is constant, so if the approximation factors are tight (up to a constant factor), the ratio

    J2(C1, C2)/J1(Ci) ≈ (α2/α1) · JOPT(2)/JOPT(1),   i = 1, 2,

depends only on α2/α1. Setting α2 = 2α1ρd, Thm. 1 predicts J2/J1 to be independent of the 1D method, i.e., of α1, and constant on one matrix.

The empirical ratios J2/J1 in Figure 6 support this prediction, which suggests that for CoTeC the quality of the multi-dimensional clustering directly depends on the quality of its 1D components, both in theory and in practice.

[Figure 6: bar plots omitted.]
Fig. 6. Left: average improvement of 1D clusterings (components) with respect to ‘r’. Right: average ratio J2/J1, both for the same clusterings as in Figure 5.
5 Conclusion
In this paper we presented CoTeC, a simple and, to our knowledge, the first approximation algorithm for tensor clustering, which yields approximation results for Bregman co-clustering and tensor clustering as special cases. We proved an approximation factor that grows linearly with the order of the tensor, and showed tightness of the factor for the 2D Euclidean case (Fig. 2), though empirically the observed factors are usually smaller than the theory suggests.

Our worst-case example also illustrates the limitation of CoTeC, namely that it ignores the interaction between clusterings along multiple dimensions. Thm. 1 thus gives hints about how much information maximally lies in this interaction. Analyzing this interplay could potentially lead to better approximation factors, e.g., by developing a co-clustering-specific seeding technique. Using such an algorithm as a subroutine in CoTeC would yield a hybrid that combines CoTeC's simplicity with better approximation guarantees.

Acknowledgment. AB was supported in part by NSF grant IIS-0812183.


References
1. Banerjee, A., Basu, S., Merugu, S.: Multi-way Clustering on Relation Graphs. In:
SIAM Conf. Data Mining (SDM). (2007)
2. Shashua, A., Zass, R., Hazan, T.: Multi-way Clustering Using Super-Symmetric
Non-negative Tensor Factorization. LNCS 3954 (2006) 595–608
3. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In:
KDD. (2003) 89–98
4. Banerjee, A., Dhillon, I.S., Ghosh, J., Merugu, S., Modha, D.S.: A Generalized
Maximum Entropy Approach to Bregman Co-clustering and Matrix Approxima-
tion. JMLR 8 (2007) 1919–1986
5. Ackermann, M.R., Blömer, J.: Coresets and Approximate Clustering for Bregman
Divergences. In: ACM-SIAM Symp. on Disc. Alg. (SODA). (2009)
6. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and non-metric
distance measures. In: ACM-SIAM Symp. on Disc. Alg. (SODA). (April 2008)
7. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In:
ACM-SIAM Symp. on Discrete Algorithms (SODA). (2007) 1027–1035
8. Nock, R., Luosto, P., Kivinen, J.: Mixed Bregman clustering with approximation
guarantees. In: Eur. Conf. on Mach. Learning (ECML). LNAI 5212 (2008)
9. Sra, S., Jegelka, S., Banerjee, A.: Approximation algorithms for Bregman cluster-
ing, co-clustering and tensor clustering. Technical Report 177, MPI for Biological
Cybernetics (2008)
10. Ben-David, S.: A framework for statistical clustering with constant time approx-
imation algorithms for K-median and K-means clustering. Mach. Learn. 66(2-3)
(2007) 243–257
11. Puolamäki, K., Hanhijärvi, S., Garriga, G.C.: An approximation ratio for biclus-
tering. Inf. Process. Letters 108(2) (2008) 45–49
12. Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation algorithms for
co-clustering. In: Symp. on Principles of Database Systems (PODS). (2008)
13. Zha, H., Ding, C., Li, T., Zhu, S.: Workshop on Data Mining using Matrices and
Tensors. KDD (2008)
14. Hasan, M., Velazquez-Armendariz, E., Pellacini, F., Bala, K.: Tensor Clustering for
Rendering Many-Light Animations. Eurographics Symp. on Rendering 27 (2008)
15. Kolda, T.G., Bader, B.W.: Tensor Decompositions and Applications. SIAM Review
51(3) (2009) to appear.
16. Hartigan, J.A.: Direct clustering of a data matrix. J. of the Am. Stat. Assoc.
67(337) (March 1972) 123–129
17. Cheng, Y., Church, G.: Biclustering of expression data. In: Proc. ISMB, AAAI
Press (2000) 93–103
18. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph
partitioning. In: KDD. (2001) 269–274
19. Bekkerman, R., El-Yaniv, R., McCallum, A.: Multi-way distributional clustering
via pairwise interactions. In: ICML. (2005)
20. Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S.:
Beyond pairwise clustering. In: IEEE CVPR. (2005)
21. Govindu, V.M.: A tensor decomposition for geometric grouping and segmentation.
In: IEEE CVPR. (2005)
22. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press (2001)
23. Hein, M., Bousquet, O.: Hilbertian metrics and positive definite kernels on proba-
bility measures. In: AISTATS. (2005)
24. Censor, Y., Zenios, S.A.: Parallel Optimization: Theory, Algorithms, and Appli-
cations. Oxford University Press (1997)
25. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman Di-
vergences. JMLR 6(6) (October 2005) 1705–1749
26. de Silva, V., Lim, L.H.: Tensor Rank and the Ill-Posedness of the Best Low-Rank
Approximation Problem. SIAM J. Matrix Anal. & Appl. 30(3) (2008) 1084–1127
27. Jegelka, S., Sra, S., Banerjee, A.: Approximation algorithms for Bregman co-
clustering and tensor clustering. In: arXiv: cs.DS/0812.0389. (v3,2009)
28. Chaudhuri, K., McGregor, A.: Finding metric structure in information theoretic
clustering. In: Conf. on Learning Theory, COLT. (July 2008)
29. Cho, H., Dhillon, I.S., Guan, Y., Sra, S.: Minimum Sum Squared Residue based
Co-clustering of Gene Expression data. In: SDM. (2004) 114–125
30. Kluger, Y., Basri, R., Chang, J.T.: Spectral biclustering of microarray data: Co-
clustering genes and conditions. Genome Research 13 (2003) 703–716
31. Cho, H., Dhillon, I.: Coclustering of human cancer microarrays using minimum
sum-squared residue coclustering. IEEE/ACM Tran. Comput. Biol. Bioinf. 5(3)
(2008) 385–400
32. Baranzini, S.E., et al.: Transcription-based prediction of response to IFNβ using
supervised computational methods. PLoS Biology 3(1) (2004)
