Approximation Algorithms For Tensor Clustering
1 Introduction
where the function d(x, y) measures cluster quality. The “center” µ_k of cluster
C_k is given by the mean of the points in C_k when d(x, y) is a Bregman di-
vergence [25]. Co-clustering extends (2.1) to seek simultaneous partitions (and
centers µ_{IJ}) of rows and columns of X, so that the objective function

    J(C) = \sum_{I,J} \sum_{i ∈ I, j ∈ J} d(x_{ij}, µ_{IJ}),    (2.2)

is minimized; µ_{IJ} denotes the (scalar) “center” of the cluster described by the
row and column index sets, viz., I and J. We generalize formulation (2.2) to
tensors in Section 2.2 after introducing some background on tensors.
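To make (2.2) concrete, here is a minimal Python/NumPy sketch (our own illustration; the helper name coclustering_objective and the choice of squared Euclidean distance for d are ours, not from the paper) that evaluates the block-average co-clustering objective for given row and column partitions.

```python
import numpy as np

def coclustering_objective(X, row_labels, col_labels, k_rows, k_cols):
    """Block-average co-clustering objective (2.2) with d(x, y) = (x - y)^2;
    each block (I, J) is represented by its mean mu_IJ."""
    J = 0.0
    for I in range(k_rows):
        for Jc in range(k_cols):
            block = X[np.ix_(row_labels == I, col_labels == Jc)]
            if block.size:                        # skip empty blocks
                J += ((block - block.mean()) ** 2).sum()
    return J

# toy usage: a 4 x 4 matrix with two row clusters and two column clusters
X = np.arange(16, dtype=float).reshape(4, 4)
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 1])
print(coclustering_objective(X, rows, cols, 2, 2))
```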
2.1 Tensors
An order-m tensor A may be viewed as an element of the vector space R^{n_1 × ... × n_m}.
An individual entry of A is given by the multiply-indexed value a_{i_1 i_2 ... i_m}, where
i_j ∈ {1, . . . , n_j} for 1 ≤ j ≤ m. For us, the most important tensor operation
is multilinear matrix multiplication, which generalizes matrix multiplica-
tion [26]. Matrices act on other matrices by either left or right multiplication.
Similarly, for an order-m tensor, there are m dimensions on which a matrix may
act. For A ∈ R^{n_1 × n_2 × ··· × n_m} and matrices P_1 ∈ R^{p_1 × n_1}, . . ., P_m ∈ R^{p_m × n_m},
multilinear multiplication is defined by the action of the P_i on the different di-
mensions of A, and is denoted by A′ = (P_1, . . . , P_m) · A ∈ R^{p_1 × ··· × p_m}. The individual
components of A′ are given by

    a′_{i_1 i_2 ... i_m} = \sum_{j_1,...,j_m=1}^{n_1,...,n_m} p^{(1)}_{i_1 j_1} ··· p^{(m)}_{i_m j_m} a_{j_1 ... j_m},

where p^{(k)}_{ij} denotes the ij-th entry of matrix P_k. The inner product between two
tensors A and B is defined as
    ⟨A, B⟩ = \sum_{i_1,...,i_m} a_{i_1...i_m} b_{i_1...i_m},    (2.3)
and this inner product satisfies the following natural property (which generalizes
the familiar ⟨Ax, By⟩ = ⟨x, A^⊤ B y⟩):

    ⟨(P_1, . . . , P_m) · A, (Q_1, . . . , Q_m) · B⟩ = ⟨A, (P_1^⊤ Q_1, . . . , P_m^⊤ Q_m) · B⟩.    (2.4)
Moreover, the Frobenius norm satisfies ‖A‖² = ⟨A, A⟩. Finally, we define an arbitrary
divergence function d(X, Y) as an elementwise sum of individual divergences, i.e.,
    d(X, Y) = \sum_{i_1,...,i_m} d(x_{i_1,...,i_m}, y_{i_1,...,i_m}),    (2.5)
and we will define the scalar divergence d(x, y) as the need arises.
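These operations are easy to check numerically. The following sketch (our own, using NumPy's einsum; not code from the paper) implements (P_1, . . . , P_m) · A for an order-3 tensor and verifies Property (2.4) on random data.

```python
import numpy as np

def multilinear(mats, A):
    """Multilinear matrix multiplication (P1, P2, P3) . A for an order-3 tensor."""
    P1, P2, P3 = mats
    # contract each matrix against the corresponding mode of A
    return np.einsum('ia,jb,kc,abc->ijk', P1, P2, P3, A)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((3, 4, 5))
Ps = [rng.standard_normal((2, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 5))]
Qs = [rng.standard_normal((2, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 5))]

lhs = np.sum(multilinear(Ps, A) * multilinear(Qs, B))                 # <(P.A), (Q.B)> via (2.3)
rhs = np.sum(A * multilinear([P.T @ Q for P, Q in zip(Ps, Qs)], B))   # <A, (P^T Q).B>
print(np.allclose(lhs, rhs))                                          # Property (2.4)
```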
2.2 Problem Formulation
Let A ∈ R^{n_1 × ··· × n_m} be an order-m tensor that we wish to partition into
coherent sub-tensors (or clusters). In 3D, we divide a cube into smaller cubes by
cutting orthogonal to (i.e., along) each dimension (Fig. 1). A basic approach is
to minimize the sum of the divergences between individual (scalar) elements in
each cluster to their corresponding (scalar) cluster “centers”. Readers familiar
with [4] will recognize this to be a “block-average” variant of tensor clustering.
Assume that each dimension j (1 ≤ j ≤ m) is partitioned into k_j clusters. Let
C_j ∈ {0, 1}^{n_j × k_j} be the cluster indicator matrix for dimension j, where the ik-th
entry of such a matrix is one if and only if index i belongs to the k-th cluster
(1 ≤ k ≤ k_j) for dimension j. Then, the tensor clustering problem is (cf. (2.2)):
    minimize_{C_1,...,C_m, M}  d(A, (C_1, . . . , C_m) · M),   s.t.  C_j ∈ {0, 1}^{n_j × k_j},    (2.6)
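For the squared Euclidean divergence, the inner minimization over M in (2.6) is attained by the tensor of block means (for general divergences the best M need not be the mean tensor; see Section 3). The sketch below (our own illustrative code, with hypothetical helper names) evaluates the resulting objective for given per-dimension label vectors.

```python
import numpy as np

def tensor_clustering_objective(A, labels, ks):
    """Objective (2.6) for squared Euclidean d: labels[j] assigns each index of
    dimension j to one of ks[j] clusters, and M is the tensor of block means."""
    loss = 0.0
    for block in np.ndindex(*ks):                                    # one cluster index per dimension
        sel = np.ix_(*[labels[j] == block[j] for j in range(A.ndim)])
        sub = A[sel]
        if sub.size:
            loss += ((sub - sub.mean()) ** 2).sum()
    return loss

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))
labels = [rng.integers(0, 2, size=n) for n in A.shape]               # a random 2-clustering per dimension
print(tensor_clustering_objective(A, labels, (2, 2, 2)))
```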
from which the last term is immediately seen to be zero using Property (2.4)
and the fact that P_j^⊤ P_j^⊥ = P_j(I − P_j) = 0. ⊓⊔
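The argument relies on each P_j being an orthogonal projection, so that P_j^⊤ P_j^⊥ = 0. The matrices themselves are not spelled out in this excerpt; a minimal sketch, assuming the standard choice P_j = C_j (C_j^⊤ C_j)^{−1} C_j^⊤ for a cluster indicator matrix C_j (our assumption), checks the identities used above.

```python
import numpy as np

def projector_from_indicator(C):
    """Orthogonal projector built from a cluster indicator matrix C; applying it
    to a data matrix replaces each row by the mean of the rows in its cluster."""
    return C @ np.linalg.inv(C.T @ C) @ C.T

# indicator for 5 points in 2 clusters: {0, 1, 2} and {3, 4}
C = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
P = projector_from_indicator(C)
I = np.eye(5)

print(np.allclose(P, P.T))            # symmetric
print(np.allclose(P @ P, P))          # idempotent
print(np.allclose(P.T @ (I - P), 0))  # P^T P_perp = 0, as used in the proof above
```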
Some more notation: Since we cluster along t dimensions at a time, we
recursively partition the initial set of all m dimensions until (after log(m/t) + 1
steps) the sets of dimensions have length t. Let l denote the level of recursion,
starting at l = log(m/t) = h and going down to l = 0. At level l, the sets of
dimensions will have length 2^l t (so that for l = 0 we have t dimensions). We
represent each clustering along a subset of 2^l t dimensions by its corresponding
2^l t projection matrices. We gather these projection matrices into the collection
P_i^l (note boldface), where the index i ranges from 1 to 2^{h−l}.
We also need some notation to represent a complete tensor clustering along
all m dimensions, where only a subset of 2^l t dimensions are clustered. We pad
the collection P_i^l with m − 2^l t identity matrices for the non-clustered dimensions,
and call this padded collection Q_i^l. With recursive partitioning of the dimensions,
Q_i^l subsumes Q_j^0 for 2^l(i − 1) < j ≤ 2^l i, i.e.,

    Q_i^l = \prod_{j=2^l(i−1)+1}^{2^l i} Q_j^0.
⁵ One could also consider clustering differently sized subsets of the dimensions, say
{t_1, . . . , t_r}, where t_1 + ··· + t_r = m. However, this requires unilluminating notational
jugglery, which we can skip for simplicity of exposition.
At level 0, the algorithm yields the collections Q_i^0 and P_i^0. The remaining clus-
terings are simply combinations, i.e., products of these level-0 clusterings. We
denote the collection of m − 2^l t identity matrices (of appropriate size) by I^l,
so that Q_1^l = (P_1^l, I^l). Accoutered with our notation, we now prove the main
lemma that relates the combined clustering to its sub-clusterings.
Lemma 2. Let A be an order-m tensor and m ≥ 2^l t. The objective function for
any 2^l t-dimensional clustering P_i^l = (P^0_{2^l(i−1)+1}, . . . , P^0_{2^l i}) can be bounded via the
sub-clusterings along only one set of dimensions of size t as

    ‖A − Q_i^l · A‖² ≤ 2^l \max_{2^l(i−1) < j ≤ 2^l i} ‖A − Q_j^0 · A‖².    (3.4)
We can always (wlog) permute dimensions so that any set of 2^l t clustered dimen-
sions maps to the first 2^l t ones. Hence, it suffices to prove the lemma for i = 1,
i.e., the first 2^l t dimensions.
Proof. We prove the lemma for i = 1 by induction on l.
Base: Let l = 0. Then Q_1^l = Q_1^0, and (3.4) holds trivially.
Induction: Assume the claim holds for l ≥ 0. Consider a clustering P_1^{l+1} =
(P_1^l, P_2^l), or equivalently Q_1^{l+1} = Q_1^l Q_2^l. Using P + P^⊥ = I, we decompose A as

    A = (P_1^{l+1} + P_1^{l+1,⊥}, I^{l+1}) · A = (P_1^l + P_1^{l,⊥}, P_2^l + P_2^{l,⊥}, I^{l+1}) · A
      = (P_1^l, P_2^l, I^{l+1}) · A + (P_1^l, P_2^{l,⊥}, I^{l+1}) · A + (P_1^{l,⊥}, P_2^l, I^{l+1}) · A + (P_1^{l,⊥}, P_2^{l,⊥}, I^{l+1}) · A
      = Q_1^l Q_2^l · A + Q_1^l Q_2^{l,⊥} · A + Q_1^{l,⊥} Q_2^l · A + Q_1^{l,⊥} Q_2^{l,⊥} · A,

where Q_1^{l,⊥} = (P_1^{l,⊥}, I^l). Since Q_1^{l+1} = Q_1^l Q_2^l, the Pythagorean Property 1 yields

    ‖A − Q_1^{l+1} · A‖² = ‖Q_1^l Q_2^{l,⊥} · A‖² + ‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖².

Combining the above equalities with the assumption (wlog) ‖Q_1^{l,⊥} Q_2^l · A‖² ≥
‖Q_1^l Q_2^{l,⊥} · A‖², we obtain the inequalities

    ‖A − Q_1^l Q_2^l · A‖² ≤ 2 (‖Q_1^{l,⊥} Q_2^l · A‖² + ‖Q_1^{l,⊥} Q_2^{l,⊥} · A‖²)
                           = 2 ‖Q_1^{l,⊥} Q_2^l · A + Q_1^{l,⊥} Q_2^{l,⊥} · A‖² = 2 ‖Q_1^{l,⊥}(Q_2^l + Q_2^{l,⊥}) · A‖²
                           = 2 ‖Q_1^{l,⊥} · A‖² = 2 ‖A − Q_1^l · A‖²
                           ≤ 2 \max_{1 ≤ j ≤ 2} ‖A − Q_j^l · A‖² ≤ 2 · 2^l \max_{1 ≤ j ≤ 2^{l+1}} ‖A − Q_j^0 · A‖²,

where the last step follows from the induction hypothesis (3.4), and the two
norm terms in the first line are combined using the Pythagorean Property. ⊓⊔
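Bound (3.4) is easy to sanity-check numerically. The sketch below (our own test harness, not from the paper) takes t = 1, l = 1 and an order-2 tensor, so that Q_1^0 = (P_1, I), Q_2^0 = (I, P_2) and Q_1^1 = (P_1, P_2), and compares both sides for fixed clusterings.

```python
import numpy as np

def projector(labels, k):
    """Orthogonal projector of the clustering given by integer labels (averages within clusters)."""
    C = np.eye(k)[labels]                                  # cluster indicator matrix
    return C @ np.linalg.inv(C.T @ C) @ C.T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 7))
P1 = projector(np.array([0, 0, 1, 1, 0, 1]), 2)            # clustering of dimension 1 (rows)
P2 = projector(np.array([0, 1, 2, 0, 1, 2, 0]), 3)         # clustering of dimension 2 (columns)

lhs = np.linalg.norm(A - P1 @ A @ P2.T) ** 2               # ||A - Q_1^1 . A||^2
rhs = 2 * max(np.linalg.norm(A - P1 @ A) ** 2,             # ||A - Q_1^0 . A||^2
              np.linalg.norm(A - A @ P2.T) ** 2)           # ||A - Q_2^0 . A||^2
print(lhs <= rhs + 1e-9)                                   # bound (3.4) with l = 1, t = 1
```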
Proof. (Thm. 1, Case 1). Let m = 2^h t. Using an algorithm with guarantee α_t,
we cluster each subset (indexed by i) of t dimensions to obtain Q_i^0. Let S_i be the
optimal sub-clustering of subset i, i.e., the result that Q_i^0 would be if α_t were 1.
We bound the objective for the collection of all m sub-clusterings P_1^h = Q_1^h as
Let C_j^S and C_j^F be the dimension-wise cluster indicator matrices for S and
F, respectively. By definition, S solves

    \min_{C_1,...,C_t, M} ‖A − (C_1, . . . , C_t, I^0) · M‖²,   s.t.  C_j ∈ {0, 1}^{n_j × k_j},
where M_F is the tensor of means for the optimal m-dimensional clustering. Com-
bining (3.5) with (3.6) yields the final bound for the combined clustering C = Q_1^h,
Tightness of Bound: How tight is the bound for CoTeC implied by Thm. 1?
The following example shows that for Euclidean co-clustering, i.e., m = 2, the
bound is tight. Specifically, for every γ with 0 < γ < 0.25, there exists a matrix for
which the approximation is as bad as J(C) = (m − γ) J_OPT(m).
Let ε be such that γ = 2(1 + ε)^{−2}. The optimal 1D row clustering C_1 for the
matrix in Figure 2 groups rows {1, 2} and {3, 4} together, and the optimal column
clustering is C_2 = ({a, b}, {c, d}). The co-clustering loss for the combination is
J_2(C_1, C_2) = 8 + 8ε². The optimal co-clustering, grouping columns {a, d} and
{b, c} (and rows as in C_1), achieves an objective of J_OPT(2) = 4(1 + ε)². Relating
these results, we get J_2(C_1, C_2) = (2 − γ) J_OPT(m). However, this example is a
worst-case scenario; the average factor is much better in practice, as revealed by
our experiments (§4). The latter, combined with the structure of this negative
example, suggests that with some assumptions on the data one can probably
obtain tighter bounds. Also note that the bound holds for a CoTeC-like scheme
treating dimensions separately, but not necessarily for all approximation algorithms.

Fig. 2. Matrix with co-clustering approximation factor 2 − 2(1 + ε)^{−2}:

          a        b        c        d
    1     ε       −ε       −1        1
    2     1       −1       −ε        ε
    3   10−ε       9      10+ε      11
    4    11      10+ε       9      10−ε
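To see how the combination of independently optimal 1D clusterings behaves on other small instances, one can brute-force the comparison. The sketch below (our own experiment scaffolding, with hypothetical helper names) enumerates all 2-clusterings of the rows and columns of a random 4 × 4 matrix and reports the ratio between the CoTeC-style combination and the jointly optimal co-clustering.

```python
import numpy as np
from itertools import product

def block_loss(X, row_labels, col_labels):
    """Block-average co-clustering loss with squared Euclidean divergence."""
    loss = 0.0
    for r, c in product(set(row_labels.tolist()), set(col_labels.tolist())):
        block = X[np.ix_(row_labels == r, col_labels == c)]
        loss += ((block - block.mean()) ** 2).sum()
    return loss

def row_loss(X, row_labels):
    """1D clustering loss: squared distance of each row to its cluster's mean row."""
    loss = 0.0
    for r in set(row_labels.tolist()):
        rows = X[row_labels == r]
        loss += ((rows - rows.mean(axis=0)) ** 2).sum()
    return loss

def two_clusterings(n):
    """All assignments of n items to two non-empty clusters."""
    for bits in product([0, 1], repeat=n):
        if 0 < sum(bits) < n:
            yield np.array(bits)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))

best_rows = min(two_clusterings(4), key=lambda lab: row_loss(X, lab))      # optimal 1D row clustering
best_cols = min(two_clusterings(4), key=lambda lab: row_loss(X.T, lab))    # optimal 1D column clustering
cotec = block_loss(X, best_rows, best_cols)                                # CoTeC-style combination
joint_opt = min(block_loss(X, r, c)                                        # jointly optimal co-clustering
                for r in two_clusterings(4) for c in two_clusterings(4))
print(cotec / joint_opt)   # empirical factor; with exact 1D clusterings and m = 2 it is at most 2
```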
Since in general the best representative M is not the mean tensor, we cannot use
the shorthand P · A for M, so the proof is different from the Euclidean case.
The following lemma is the basis of the induction for this case of Thm. 1.
Lemma 3. Let A be of order m = 2^h t, and let R_i^l be the clustering of the i-th subset
of 2^l t dimensions (for l < h) with an approximation guarantee of α_{2^l t}; R_i^l
combines the C_j in a manner analogous to how Q_i^l combines projection matrices.
Then the combination R^{l+1} = R_i^l R_j^l, i ≠ j, satisfies

    \min_M d(A, R^{l+1} · M) ≤ 3 α_{2^l t} \min_Y d(A, F^{l+1} · Y),

where F^{l+1} is the optimal joint clustering of the dimensions covered by R^{l+1} (as
before, we always assume that R_i^l and R_j^l cover disjoint subsets of dimensions).
Proof. Without loss of generality, we prove the lemma for R_1^{l+1} = R_1^l R_2^l. Let
M_i^l = argmin_X d(A, R_i^l · X) be the associated representatives for i = 1, 2, and S_i^l
the optimal 2^l t-dimensional clusterings. Further let F_1^{l+1} = F_1^l F_2^l be the optimal
2^{l+1} t-dimensional clustering. The following step is vital in relating objective val-
ues of R_1^{l+1} and S_i^l. The optimal sub-clusterings will eventually be bounded by
the objective of the optimal F_1^{l+1}. Let L = 2^{l+1} t, and

    M̂ = argmin_X d(R_1^l M_1^l, R_1^l R_2^l · X),   X ∈ R^{k_1 × ... × k_L × n_{L+1} × ... × n_m},
and analogously for S_2^l. Plugging this inequality into (3.8), we get

    \min_{M^{l+1}} d(A, R_1^l R_2^l · M^{l+1}) ≤ 3 α_{2^l t} \min_{Y^{l+1}} d(A, F_1^l F_2^l · Y^{l+1}) = 3 α_{2^l t} \min_{Y^{l+1}} d(A, F_1^{l+1} · Y^{l+1}). ⊓⊔
Proof. (Thm. 1, Case 2). Given Lemma 3, the proof of Thm. 1 for the metric
case follows easily by induction if we hierarchically combine the sub-clusterings
and use α_{2^{l+1} t} = 3 α_{2^l t}, for l ≥ 0, as stated by the lemma. ⊓⊔
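For intuition, unrolling this recursion over the h = log_2(m/t) levels (a short derivation added here for readability; it restates only the recursion above, not the exact constant of Thm. 1) gives

    α_{2^h t} = 3 α_{2^{h−1} t} = ··· = 3^h α_t = 3^{log_2(m/t)} α_t = (m/t)^{log_2 3} α_t.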
3.3 Implications
We now mention several important implications of Theorem 1.
Clustering with Bregman divergences. Bregman divergence based cluster-
ing and co-clustering are well-studied problems [25, 4]. Here, the function d(x, y)
is parametrized by a strictly convex function f [24], so that d(x, y) = B_f(x, y) =
f(x) − f(y) − f′(y)(x − y). Under the assumption (also see [5, 6])

    σ_L ‖x − y‖² ≤ B_f(x, y) ≤ σ_U ‖x − y‖²,    (3.9)

on the curvature of the divergence B_f(x, y), we can invoke Thm. 1 with ρ_d =
σ_U/σ_L. The proofs are omitted for brevity, and may be found in [27]. We would
like to stress that such curvature bounds seem to be necessary to guarantee
constant approximation factors for the underlying 1D clustering—this intuition
is reinforced by the results of [28], who avoided such curvature assumptions
and had to be content with a non-constant O(log n) approximation factor for
information theoretic clustering.
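As an illustration of (3.9), the sketch below (our own; the helper bregman and the choice f(x) = x log x are ours) implements the scalar Bregman divergence and empirically checks the curvature bounds on a bounded interval [a, b], where f''(x) = 1/x gives σ_L = 1/(2b) and σ_U = 1/(2a).

```python
import numpy as np

def bregman(f, fprime, x, y):
    """Scalar Bregman divergence B_f(x, y) = f(x) - f(y) - f'(y)(x - y)."""
    return f(x) - f(y) - fprime(y) * (x - y)

# f(x) = x log x (generalized KL); on [a, b] with a > 0 its curvature is bounded.
f = lambda x: x * np.log(x)
fprime = lambda x: np.log(x) + 1.0

a, b = 0.5, 4.0
rng = np.random.default_rng(0)
x = rng.uniform(a, b, 10000)
y = rng.uniform(a, b, 10000)
keep = np.abs(x - y) > 1e-6                                 # avoid 0/0 for (near-)equal pairs
ratio = bregman(f, fprime, x[keep], y[keep]) / (x[keep] - y[keep]) ** 2

print(ratio.min(), ratio.max())                             # empirical curvature range
print(1 / (2 * b), 1 / (2 * a))                             # theoretical sigma_L, sigma_U
```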
Clustering with ℓ_p-norms. Thm. 1 (metric case) immediately yields approx-
imation factors for clustering with ℓ_p-norms. We note that for binary matrices,
using t = 2 and the results of [11], we can obtain the slightly stronger guarantee

    J(C) ≤ 3^{log_2(m) − 1} (1 + √2) α_1 J_OPT(m).
Exploiting 1D clustering results. Substituting the approximation factors α_1
of existing 1D clustering algorithms in Thm. 1 (with t = 1) instantly yields spe-
cific bounds for corresponding tensor clustering algorithms. Table 1 summarizes
these results; however, we omit proofs for lack of space; see [27] for details.
Table 1. Approximation guarantees for Tensor Clustering Algorithms. K* denotes the
maximum number of clusters, i.e., K* = \max_j k_j; c is some constant.
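To make the t = 1 recipe concrete, the following sketch (our own illustration, not the authors' implementation; it uses scikit-learn's KMeans as the 1D subroutine, which is an arbitrary choice) clusters each dimension of an order-3 tensor independently on the corresponding unfolding. The resulting per-dimension labels can be combined and scored with block means as in the sketch after (2.6); the experiments in §4 additionally refine such a solution with SiTeC.

```python
import numpy as np
from sklearn.cluster import KMeans   # our choice of 1D subroutine; any algorithm with known alpha_1 works

def cotec(A, ks, seed=0):
    """CoTeC with t = 1: cluster each dimension of A independently by running
    k-means on the corresponding unfolding; returns one label vector per dimension."""
    labels = []
    for j, k in enumerate(ks):
        unfolding = np.moveaxis(A, j, 0).reshape(A.shape[j], -1)   # rows = slices along dimension j
        labels.append(KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(unfolding))
    return labels

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 12, 8))
labels = cotec(A, (3, 2, 2))
print([lab[:5] for lab in labels])   # per-dimension cluster assignments (first few indices)
```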
4 Experimental Results
[Figure 4 appears here: plots of the empirical approximation factor versus the noise level σ, with curves labeled r, rk, rc, rkc, s, sk, sc, skc.]

Fig. 4. Approximation factors for 3D clustering (left) and co-clustering (right) with
increasing noise. Top row: Euclidean distances, bottom row: KL Divergence. The x
axis shows σ, the y axis the empirical approximation factor.
In all settings, the empirical factor remains below the theoretical factor. The
reason for decreasing approximation factors with higher noise could be lower
accuracy of the estimates of J_OPT on the one hand, and more similar objective
values for all clusterings on the other hand. With low noise, distance-specific
seeding s yields better results than uniform seeding r, and adding k-means on top
(rk, sk) improves the results of both. With Euclidean distances, CoTeC with well-
initialized 1D k-means (sk) competes with SiTeC. For KL Divergence, though,
SiTeC still improves on sk, and with high noise levels, 1D k-means does not
help: both rk and sk are as good as their seeding-only counterparts r and s.
Figure 5 summarizes the average improvements for all five order-2 data sets
studied in [27]. Groups indicate methods, and colors indicate seeding techniques.
On average, a better seeding improves the results for all methods: the gray bars
are higher than their black counterparts in all groups. Just as for synthetic data,
1D k-means improves the CoTeC results here too. SiTeC (groups 3 and 4) is
better than CoTeC with mere seeding (r,s, group 1). Notably, for Euclidean
distances, combining good 1D clusters obtained by k-means (rk,sk, group 2) is
on average better than SiTeC initialized with simple seeding (rc,sc, group 3). For
KL Divergences, on the other hand, SiTeC still outperforms all CoTeC variations.
Given the limitation to single dimensions, CoTeC performs surprisingly well
in comparison to SiTeC. Additionally, SiTeC initialized with CoTeC converges
faster to better solutions, further underscoring the utility of CoTeC.
[Figure 5 appears here: bar charts over the methods x, xk, xc, xkc, grouped into CoTeC and SiTeC, with legend "unif: x = r" and "dist.: x = s".]

Fig. 5. (i) % improvement of the objective J_2(C) with respect to uniform 1D seeding
(r), averaged over all order-2 data sets and parameter settings (details in [27]). (ii)
Average number of SiTeC iterations, in % with respect to initialization by r.
5 Conclusion
In this paper we presented CoTeC, a simple and, to our knowledge, the first ap-
proximation algorithm for tensor clustering, which yields approximation results
for Bregman co-clustering and tensor clustering as special cases. We proved an
approximation factor that grows linearly with the order of the tensor, and showed
tightness of the factor for the 2D Euclidean case (Fig. 2), though empirically the
observed factors are usually smaller than suggested by the theory.
Our worst-case example also illustrates the limitation of CoTeC, namely that it
ignores the interaction between clusterings along multiple dimensions. Thm. 1 thus
gives hints as to how much information maximally lies in this interaction. Analyzing
this interplay could potentially lead to better approximation factors, e.g., by
developing a co-clustering-specific seeding technique. Using such an algorithm
as a subroutine in CoTeC would yield a hybrid that combines CoTeC's simplicity
with better approximation guarantees.