Dynamic Tensor Clustering

Will Wei Sun and Lexin Li

University of Miami and University of California at Berkeley


arXiv:1708.07259v2, 14 Sep 2018

Abstract
Dynamic tensor data are becoming prevalent in numerous applications. Existing
tensor clustering methods either fail to account for the dynamic nature of the data,
or are inapplicable to a general-order tensor. There is also a gap between statistical
guarantee and computational efficiency for existing tensor clustering solutions. In
this article, we propose a new dynamic tensor clustering method that works for a
general-order dynamic tensor, and enjoys both strong statistical guarantee and high
computational efficiency. Our proposal is based on a new structured tensor factorization
that encourages both sparsity and smoothness in parameters along the specified tensor
modes. Computationally, we develop a highly efficient optimization algorithm that
benefits from substantial dimension reduction. Theoretically, we first establish a non-
asymptotic error bound for the estimator from the structured tensor factorization.
Built upon this error bound, we then derive the rate of convergence of the estimated
cluster centers, and show that the estimated clusters recover the true cluster structures
with high probability. Moreover, our proposed method can be naturally extended
to co-clustering of multiple modes of the tensor data. The efficacy of our method is
illustrated through simulations and a brain dynamic functional connectivity analysis
from an Autism spectrum disorder study.

Key Words: Cluster analysis; Multidimensional array; Non-convex optimization; Tensor decomposition; Variable selection.

Will Wei Sun is Assistant Professor, Department of Management Science, University of Miami Business
School, Miami, FL 33146. Email: wsun@[Link]. Sun’s work was partially supported by Provost’s
research award from University of Miami. Lexin Li is Professor, Division of Biostatistics, University of
California, Berkeley, Berkeley, CA 94720-3370. Email: lexinli@[Link]. Li’s work was partially supported
by NSF grant DMS-1613137 and NIH grant AG034570. Both authors thank the editor, one associate editor
and three reviewers for their helpful comments and suggestions which led to a much improved presentation.

1 Introduction
Data in the form of multidimensional array, or tensor, are now frequently arising in a diverse
range of scientific and business applications (Zhu et al., 2007; Liu et al., 2013; Zhou et al.,
2013; Ding and Cook, 2015). Particularly, for a large class of tensor data, time is one of the
tensor modes, and this class is often termed as dynamic tensor. Examples of dynamic tensor
data are becoming ubiquitous, and their analysis is receiving increasing attention. For instance,
in online advertising, Bruce et al. (2016) studied consumer engagement on advertisements over
time to capture the temporal dynamics of users' behavior. The resulting data is a three-mode
tensor of user by advertisement by time. In molecular biology, Seigal et al. (2016) studied
time-course measurements of the activation levels of multiple pathways from genetically
diverse breast cancer cell lines after exposure to numerous growth factors at different doses.
The data is a five-mode tensor of cell line by growth factor by pathway by dose by time. In
neurogenomics, Liu et al. (2017) modeled spatial temporal patterns of gene expression during
brain development, and the data is a three-mode tensor of gene by brain region by time. In
Section 6, we illustrate our method on a brain dynamic connectivity analysis, where the goal
is to understand interaction of distinct brain regions and their dynamic pattern over time.
One form of the data in this context is a three-mode tensor of region by region by time.
Clustering has proven to be a useful tool to reveal important underlying data structures
(Yuan and Kendziorski, 2006; Ma and Zhong, 2008; Shen et al., 2012a; Wang et al., 2013).
Directly applying a clustering algorithm to the vectorized tensor data is a simple solution, but
it often suffers from poor clustering accuracy, and induces heavy and sometimes intractable
computation. There have been a number of proposals for clustering of tensor data, or the
two-mode special case, matrix data. One class of such methods focused on biclustering
that simultaneously group rows (observations) and columns (features) of the data matrix
(Huang et al., 2009; Lee et al., 2010; Chi and Lange, 2015; Chi and Allen, 2017). The
proposed solutions were based on sparse singular value decomposition, or a reformulation
of biclustering as a penalized regression with some convex penalties. However, the dynamic
property remained largely untapped in those solutions. The second class of approaches
directly targeted biclustering of time-varying matrix data (Hocking et al., 2011; Ji et al., 2012;
Li et al., 2015). Nevertheless, those solutions were designed specifically for matrix-valued
data. Extension to a general-order tensor is far from trivial. The third class tackled clustering
of tensor data through some forms of ℓ1 penalization (Cao et al., 2013; Wu et al., 2016).
But none of those approaches incorporated the dynamic information in the data and would
inevitably lead to loss in clustering accuracy. Moreover, there is no statistical guarantee
provided for the performance of these tensor clustering algorithms. Tensor decomposition is a
crucial component in handling tensor-valued data, and is to play a central role in our proposed
tensor clustering solution as well. There have been a number of recent proposals of convex
relaxation of tensor decomposition through various norms (Romera-Paredes and Pontil, 2013;
Yuan and Zhang, 2017, 2016; Zhang, 2018). However, all those methods focused on low-rank
tensor recovery, and none incorporated any sparsity or fusion structure in the decomposition.
Recently, Sun et al. (2017) proposed a low-rank decomposition with a truncation operator for
hard thresholding. Although sparsity was considered, they did not consider fusion structure
for a dynamic tensor. Ignoring this structure, as we show later, would induce large estimation
errors in both tensor decomposition and subsequent clustering. In summary, there exists no
clustering solution with statistical guarantee that handles a general-order dynamic tensor
and incorporates both sparsity and fusion structures.
In this article, we aim to bridge this gap by proposing a dynamic tensor clustering method,
which takes into account both sparsity and fusion structures, and enjoys strong statistical
guarantee as well as high computational efficiency. Our proposal makes multiple contributions
to the clustering literature. First, our clustering method is built upon a newly proposed
structured tensor factorization approach, which encourages both sparsity and smoothness
in the decomposed components, and in turn captures the dynamic nature of the tensor
data. We show how structured tensor factorization can be used to infer the clustering
structure. Interestingly, we find that tensor Gaussian mixture model can be viewed as a
special case of our clustering method. Second, our proposal is computationally efficient. This
is partly achieved by substantial dimension reduction resulting from the imposed structure;
in the illustrative example in Section 6, the number of free parameters was reduced from
about two million to about one thousand. Our optimization algorithm can be decomposed into an
unconstrained tensor decomposition step followed by a constrained optimization step. We
show that, the overall computational complexity of our constrained solution is comparable to
that of the unconstrained one. Third, and probably most importantly, we establish rigorous
theoretical guarantee for our proposed dynamic tensor clustering solution. Specifically, we
first establish a non-asymptotic error bound for the estimator from the proposed structured
tensor factorization. Based on this error bound, we then obtain the rate of convergence
of the estimated cluster centers from our dynamic tensor clustering solution, and prove
that the estimated clusters recover the true cluster structures with high probability. It is
also noteworthy that we allow the number of clusters to grow with the sample size. Such
consistency results are new in the tensor clustering literature. From a technical perspective,
the fusion structure we consider introduces some additional challenges in the theoretical
analysis, since the resulting truncated fusion operator is non-convex and the observed tensor
is usually noisy with an unknown error distribution. To address such challenges, we develop
a set of non-asymptotic techniques to carefully evaluate the estimation error in each iteration
of our alternating updating algorithm. We also utilize a series of large deviation bounds to
show that our estimation error relies on the error term only through its sparse spectral
norm, which largely relieves the uncertainty from the unknown error distribution. Last but
not least, although our algorithm mostly focuses on clustering along a single mode of tensor
data, the same approach can be easily applied to co-clustering along multiple tensor modes.
This is different from classical clustering methods, where an extension from clustering to
bi-clustering generally requires different optimization formulations (see, e.g., Chi and Allen,
2017). In contrast, our clustering method naturally incorporates single and multi-mode
clustering without requiring any additional modification.
The rest of the article is organized as follows. Section 2 introduces the proposed dynamic
tensor clustering method, and Section 3 presents its solution through structured tensor
factorization. Section 4 establishes the estimation error bound of the structured tensor
factorization and the consistency properties of dynamic tensor clustering. Section 5 presents
the simulations, and Section 6 illustrates with a brain dynamic functional connectivity
analysis. The Supplementary Materials collect all technical proofs and additional simulations.

2 Model
2.1 Clustering via tensor factorization
Given N copies of m-way tensors, X1 , . . . , XN ∈ Rd1 ×···×dm , our goal is to uncover the
underlying cluster structures of these N samples. That is, we seek the true cluster assignment,

$$(\underbrace{1, \ldots, 1}_{l \text{ samples}}, \underbrace{2, \ldots, 2}_{l \text{ samples}}, \ldots, \underbrace{K, \ldots, K}_{l \text{ samples}}),$$

where K is the number of clusters and l = N/K. Here, for ease of presentation, we assume
an equal number of l samples per cluster. In Section S.9 of the Supplementary Materials, we
report some numerical results with unequal cluster sizes.
To cluster those tensor samples, we first stack them into an (m + 1)-way tensor, T ∈
Rd1 ×···×dm ×N . We comment that, in principle, one can cluster along any single or multiple
modes of T . Without loss of generality, we focus our discussion on clustering along the last
mode of T , and only briefly comment on the scenario that clusters along multiple modes.
This formulation covers a variety of scenarios encountered in our illustrative example of brain
dynamic connectivity analysis. For instance, in one scenario, T ∈ Rp×p×t×n , N = n, and
the goal is to cluster n individuals, each with a p × p × t tensor that represents the brain
connectivity pattern among p brain regions over t sliding time windows. In another scenario,
T ∈ Rp×p×t , N = t, and the goal is to cluster t sliding windows for a single subject. In the
third scenario, T ∈ Rp×p×ng ×t , N = t, where ng is the number of subjects in group g, and
the goal becomes clustering t moving windows for all subjects in that group.
Our key idea is to consider a structured decomposition of T , then apply a usual clustering
algorithm, e.g., K-means, to the matrix from the decomposition that corresponds to the last
mode to obtain the cluster assignment. Assume that the tensor T is observed with noise,

T = T ∗ + E, (1)

where E is an error tensor, and T ∗ is the true tensor with a rank-R CANDECOMP/PARAFAC
(CP) decomposition structure (Kolda and Bader, 2009),
$$\mathcal{T}^* = \sum_{r=1}^{R} w_r^* \, \beta_{1,r}^* \circ \cdots \circ \beta_{m+1,r}^*, \qquad (2)$$
where $\beta_{j,r}^* \in \mathbb{R}^{d_j}$, $\|\beta_{j,r}^*\|_2 = 1$, $w_r^* > 0$, $j = 1, \ldots, m+1$, $r = 1, \ldots, R$, $\|\cdot\|_2$ denotes the vector $\ell_2$ norm, and $\circ$ is the vector outer product. For ease of notation, we define $d_{m+1} := N$. We
have chosen the CP decomposition due to its relatively simple formulation and its competitive
empirical performance. It has been widely used for link prediction (Dunlavy et al., 2011),
community detection (Anandkumar et al., 2014a), recommendation systems (Bi et al., 2017),
and convolutional neural network speeding-up (Lebedev et al., 2015).
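To make the rank-$R$ structure in (2) concrete, here is a minimal numpy sketch (our own toy code; the function name cp_tensor and the toy dimensions are illustrative, not from the paper) that assembles a tensor from weights $w_r$ and unit-norm factor columns via outer products.

```python
import numpy as np

def cp_tensor(weights, factors):
    """Build sum_r w_r * beta_{1,r} o ... o beta_{m+1,r} from a list of factor matrices.

    weights : array of shape (R,)
    factors : list of m+1 arrays, the j-th of shape (d_j, R), with unit-norm columns.
    """
    T = np.zeros([F.shape[0] for F in factors])
    for r, w in enumerate(weights):
        comp = factors[0][:, r]
        for F in factors[1:]:
            comp = np.multiply.outer(comp, F[:, r])   # outer product, one mode at a time
        T += w * comp
    return T

# toy example: a rank-2 tensor with three modes (m + 1 = 3)
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, 2)) for d in (4, 5, 6)]
factors = [F / np.linalg.norm(F, axis=0) for F in factors]   # normalize columns
T = cp_tensor(np.array([2.0, 1.0]), factors)
print(T.shape)   # (4, 5, 6)
```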
Given the structure in (2), it is straightforward to see that, the cluster structure of
samples along the last mode of the tensor T is fully determined by the matrix that stacks
the decomposition components, $\beta_{m+1,1}^*, \ldots, \beta_{m+1,R}^*$. We denote this matrix as $\mathbf{B}_{m+1}^*$, which can be written as
$$\mathbf{B}_{m+1}^* := \big( \beta_{m+1,1}^*, \ldots, \beta_{m+1,R}^* \big) = \big( \underbrace{\mu_1^{*}, \ldots, \mu_1^{*}}_{l \text{ samples}}, \ldots, \underbrace{\mu_K^{*}, \ldots, \mu_K^{*}}_{l \text{ samples}} \big)^\top \in \mathbb{R}^{N \times R}, \qquad (3)$$

Figure 1: A schematic illustration of the proposed tensor clustering method. It stacks
multiple tensor samples to a higher-order tensor, carries out structured tensor factorization,
then applies a classical clustering algorithm, e.g, K-means, to the data matrix from the
decomposition that corresponds to the last mode of the stacked tensor.

where $\mu_k^* := (\mu_{k,1}^*, \ldots, \mu_{k,R}^*) \in \mathbb{R}^R$, $k = 1, \ldots, K$, indicates the cluster assignment. Figure 1 shows a schematic illustration of our tensor clustering proposal. We comment that, if the
goal is to cluster along more than one tensor mode, one only needs to apply a clustering
algorithm to the matrix formed by each of those modes separately.
Accordingly, the true cluster means of the tensor samples X1 , . . . , XN can be written as,
$$\mathcal{M}_1 := \sum_{r=1}^{R} w_r^* \, \beta_{1,r}^* \circ \cdots \circ \beta_{m,r}^* \, \mu_{1,r}^*, \quad \ldots, \quad \mathcal{M}_K := \sum_{r=1}^{R} w_r^* \, \beta_{1,r}^* \circ \cdots \circ \beta_{m,r}^* \, \mu_{K,r}^*. \qquad (4)$$

The structure in (4) reveals the key underlying assumption of our tensor clustering solution.
That is, we assume each cluster mean is a linear combination of the outer product of R
rank-1 basis tensors, and all the cluster means share the same R basis tensors. We recognize
that this assumption introduces an additional constraint. However, it leads to substantial
dimension reduction, which in turn enables efficient estimation and inference in subsequent
analysis. As we show in Section 2.3, the tensor Gaussian mixture model can be viewed as a
special case of our clustering structure. Moreover, as our numerical study has found, this
imposed structure provides a reasonable approximation in real data applications.
For comparison, we consider the alternative solution that applies clustering directly on the
vectorized version of the tensor data. It does not require (4), and the corresponding number of free parameters is in the order of $K \prod_j d_j$. In the example in Section 6, $d_1 = d_2 = 116$, $d_3 = 80$, $K = 2$, and that amounts to 2,152,960 parameters. Imposing (4), however, would reduce the number of free parameters to $R(\sum_j d_j + K + 1)$; again, for the aforementioned example, $R = 5$ and that amounts to 1,175 parameters. Such a substantial reduction in dimensionality is crucial for both computation and inference in tensor data analysis. Moreover, we use a simple simulation example to demonstrate that our clustering method, which assumes (4) and thus exploits the underlying structure of the tensor data, can not only reduce the dimensionality and the computational cost, but also improve the clustering accuracy. Specifically, we follow the example in Section 5.2 to generate $N = 100$ tensor samples of dimension $d_1 = d_2 = d_3 = 20$ from 4 clusters, with samples 1 to 25, 26 to 50, 51 to 75, and 76 to 100 belonging to clusters 1 to 4, respectively. Figure 2 shows the heatmap of the vectorized data (left panel), which is of dimension 100 × 8000, and the heatmap of the data with reduced rank $R = 2$ (right panel), which is of dimension 100 × 2. It is clearly seen from this plot that our clustering method based on the reduced data under (4) is able to fully recover the four underlying clusters, while the clustering method based on the vectorized data cannot.

Figure 2: Heatmap of the vectorized data, with dimension 100 × 8000, is shown in the left panel, and heatmap of the reduced data from our tensor factorization, with dimension 100 × 2, is shown on the right. The true cluster structure can be fully recovered by any reasonable clustering method based on the reduced data, but not based on the vectorized data.

2.2 Sparsity and fusion structures


Motivated from the brain dynamic functional connectivity analysis, in addition to the CP
low-rank structure (2), we also impose the sparsity and smoothness fusion structures in
tensor decomposition to capture the sparsity and dynamic properties of the tensor samples.

Specifically, we impose the following structures in the parameter space,
$$\mathcal{S}(d, s_0) := \Big\{ \beta \in \mathbb{R}^d : \sum_{j=1}^{d} 1\{\beta_j \neq 0\} \le s_0 \Big\},$$
$$\mathcal{F}(d, f_0) := \Big\{ \beta \in \mathbb{R}^d : \sum_{j=2}^{d} |\beta_j - \beta_{j-1}| \le f_0 \Big\} = \Big\{ \beta \in \mathbb{R}^d : \|D\beta\|_1 \le f_0 \Big\},$$
where $\beta = (\beta_1, \ldots, \beta_d)^\top$, $\|\cdot\|_1$ denotes the vector $\ell_1$ norm, and $D \in \mathbb{R}^{(d-1) \times d}$, whose $j$th row
has −1 and 1 on its jth and (j + 1)th positions and zero elsewhere. Combining these two
structures with the CP decomposition of T ∗ in (2), we consider

$$\beta_{j,r} \in \mathcal{S}(d_j, s_{0,j}) \cap \mathcal{F}(d_j, f_{0,j}), \quad \text{for any } j = 1, \ldots, m+1, \; r = 1, \ldots, R.$$

Here, for simplicity, we assume the sparsity parameter $s_{0,j}$ and the fusion parameter $f_{0,j}$ are the same across different ranks $r = 1, \ldots, R$, and we denote $s_0 := \max_j s_{0,j}$
and f0 := maxj f0,j . This can be easily extended to a more general case where these
parameters vary with r. To encourage such sparse and fused components, we propose to solve
the following penalized optimization problem,
$$\min_{w_r, \beta_{1,r}, \ldots, \beta_{m+1,r}} \; \Big\| \mathcal{T} - \sum_{r=1}^{R} w_r \, \beta_{1,r} \circ \cdots \circ \beta_{m+1,r} \Big\|_F^2 + \lambda \sum_{j=1}^{m+1} \sum_{r=1}^{R} \|D\beta_{j,r}\|_1, \qquad (5)$$
$$\text{s.t. } \|\beta_{j,r}\|_2 = 1 \ \text{ and } \ \|\beta_{j,r}\|_0 \le s_j, \quad j = 1, \ldots, m+1, \; r = 1, \ldots, R,$$

for some cardinality parameters $s_1, \ldots, s_{m+1}$. Here $\|\cdot\|_0$ denotes the vector $\ell_0$ norm, i.e., the number of nonzero entries, and $\|\cdot\|_F$ denotes the tensor Frobenius norm, which is defined as $\|\mathcal{A}\|_F := \sqrt{\sum_{i_1,\ldots,i_m} \mathcal{A}_{i_1,\ldots,i_m}^2}$ for a tensor $\mathcal{A} \in \mathbb{R}^{d_1 \times \cdots \times d_m}$. The optimization formulation (5) encourages sparsity in the individual components via a direct $\ell_0$ constraint, and encourages
smoothness via a fused lasso penalty. We make a few remarks. First, one can easily
choose which mode to impose which constraints, by modifying the penalty functions in (5)
accordingly. See Section 3.2 for some specific examples in brain dynamic connectivity analysis
where different constraints along different tensor modes are imposed. Second, our problem
(5) is related to a recently proposed tensor decomposition method with generalized lasso
penalties in Madrid-Padilla and Scott (2017). However, the two proposals differ in several
ways. We use the ℓ0 truncation to achieve sparsity, whereas they used the lasso penalty. It is known that the former yields an unbiased estimator, while the latter leads to a biased one in high-dimensional estimation (Shen et al., 2012b; Zhu et al., 2014). Moreover, our rate of convergence of the parameter estimators is established under a general error tensor, whereas
theirs required the error tensor to be Gaussian. Third, our method also differs from the
structured sparse principal components analysis of Jenatton et al. (2010). The latter seeks
the factors that maximally explain the variance of the data while respecting some structural
constraints, and it assumes the group structure is known a priori. By contrast, our method
aims to identify a low-rank representation of the tensor data, and does not require any prior
structure information but learns the group structure adaptively given the data.

2.3 A special case: tensor Gaussian mixture model


In model (1), no distributional assumption is imposed on the error tensor E. In this section,
we show that, if one further assumes that E is a standard Gaussian tensor, then our method
reduces to a tensor version of Gaussian mixture model.
An m-way tensor X ∈ Rd1 ×d2 ×···×dm is said to follow a tensor normal distribution (Kolda
and Bader, 2009) with mean M and covariance matrices, Σ1 , . . . , Σm , denoted as X ∼
$\mathrm{TN}(\mathcal{M}; \Sigma_1, \ldots, \Sigma_m)$, if and only if $\mathrm{vec}(\mathcal{X}) \sim N\big(\mathrm{vec}(\mathcal{M}), \otimes_{j=1}^{m}\Sigma_j\big)$, where $\mathrm{vec}(\mathcal{M}) \in \mathbb{R}^{\prod_j d_j}$,

vec denotes the tensor vectorization, and ⊗ denotes the matrix Kronecker product. Following
the usual definition of Gaussian mixture model, we say X is drawn from a tensor Gaussian
mixture model, if its density function is of the form $f(\mathcal{X}) = \sum_{k=1}^{K} \pi_k \, \phi_k(\mathcal{M}_k; \Sigma_{k,1}, \ldots, \Sigma_{k,m})$,
where πk is the mixture weight, and φk (Mk ; Σk,1 , . . . , Σk,m ) is the probability density function
of a tensor Gaussian distribution TN(Mk ; Σk,1 , . . . , Σk,m ). We next consider two special
cases of our general tensor clustering model (1).
First, when the error tensor E in (1) is a standard Gaussian tensor, our clustering model
is equivalent to assuming the tensor-valued samples Xi , i = 1, . . . , N , follow the above tensor
Gaussian mixture model. That is,

X1 , . . . , Xl ∼ T N (M∗1 ; Id1 , . . . , Idm ), . . . , XN −l , . . . , XN ∼ T N (M∗K ; Id1 , . . . , Idm ),

where Id is a d × d identity matrix. Thus the tensor samples X1 , . . . , XN are drawn from a
tensor Gaussian mixture model with the kth prior probability πk = 1/K, k = 1, . . . , K.
Second, when the error tensor E in (1) is a general Gaussian tensor, our model is equivalent
to assuming the samples follow

X1 , . . . , Xl ∼ T N (M∗1 ; Σ1,1 , . . . , Σ1,m ), . . . , XN −l , . . . , XN ∼ T N (M∗K ; ΣK,1 , . . . , ΣK,m ), (6)

where Σk,j is a general covariance matrix, k = 1, . . . , K, j = 1, . . . , m.
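As a small illustration of the first special case (identity covariances), the following self-contained simulation draws $N$ tensor samples $\mathcal{X}_i = \mathcal{M}_{z_i} + \mathcal{E}_i$ with rank-one cluster means and a standard Gaussian error tensor; the dimensions, weights, and means are our own toy choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
dims, K, N = (10, 10, 8), 3, 60                    # m = 3 modes, K clusters, N samples

# rank-one cluster means M_k = mu_k * beta_1 o beta_2 o beta_3 (toy parameter choice)
betas = [rng.standard_normal(d) for d in dims]
betas = [b / np.linalg.norm(b) for b in betas]
base = np.multiply.outer(np.multiply.outer(betas[0], betas[1]), betas[2])
mu = np.array([-4.0, 0.0, 4.0])                    # scalar cluster weights mu_k
means = [m * base for m in mu]

labels = np.repeat(np.arange(K), N // K)           # equal cluster sizes l = N / K
samples = np.stack([means[z] + rng.standard_normal(dims) for z in labels])
print(samples.shape)                               # (60, 10, 10, 8)
```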

In our tensor clustering solution, we have chosen not to estimate those covariance matrices.
This may lose some estimation efficiency, but it greatly simplifies both the computation
and the theoretical analysis. Meanwhile, the methodology we develop can be extended to
incorporate general covariances Σk,j in a straightforward fashion, where the tensor cluster
means and the covariance matrices can be estimated using a high-dimensional EM algorithm
(Hao et al., 2018), in which the M-step solves a penalized weighted least squares. We do not
pursue this line of research in this article.

3 Estimation
3.1 Optimization algorithm
We first introduce some operators for achieving the sparsity and fusion structures of a given
dense vector. We then present our optimization algorithm.
The first operator is a truncation operator to obtain the sparse structure. For a vector
$v \in \mathbb{R}^d$ and a scalar $\tau \le d$, we define $\mathrm{Truncate}(v, \tau)$ as
$$[\mathrm{Truncate}(v, \tau)]_j = \begin{cases} v_j & \text{if } j \in \mathrm{supp}(v, \tau), \\ 0 & \text{otherwise}, \end{cases}$$

where supp(v, τ ) refers to the set of indices of v corresponding to its largest τ absolute values.
The second is a fusion operator. For a vector v ∈ Rd and a fusion parameter λ > 0, the fused
vector Fuse(v, λ) is obtained via the fused lasso (Tibshirani et al., 2005); i.e.,
$$\mathrm{Fuse}(v, \lambda) := \arg\min_{u \in \mathbb{R}^d} \Big\{ \sum_{i=1}^{d} (u_i - v_i)^2 + \lambda \|Du\|_1 \Big\}.$$

An efficient ADMM-based algorithm for this fused lasso has been developed in Zhu (2017).
The third operator is a combination of the truncation and fusion operators. For v ∈ Rd and
parameters τ , λ, we define Truncatefuse(v, τ, λ) as
 
$$\mathrm{Truncatefuse}(v, \tau, \lambda) = \mathrm{Truncate}\big(\mathrm{Fuse}(v, \lambda), \tau\big).$$

Lastly, we denote $\mathrm{Norm}(v) = v/\|v\|_2$ as the normalization operator on a vector $v$.
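The four operators admit a compact numpy implementation. The sketch below is illustrative only: the paper points to the ADMM-based fused-lasso solver of Zhu (2017), whereas the fuse() routine here is a basic scaled-form ADMM of our own, and all function names are our own choices.

```python
import numpy as np

def truncate(v, tau):
    """Keep the tau entries of v that are largest in absolute value; zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-tau:]
    out[keep] = v[keep]
    return out

def fuse(v, lam, rho=1.0, n_iter=200):
    """Fused-lasso signal approximator argmin_u ||u - v||_2^2 + lam * ||D u||_1,
    solved with a basic scaled-form ADMM; D is the first-difference matrix."""
    d = len(v)
    D = np.diff(np.eye(d), axis=0)                 # (d-1) x d rows of (-1, 1)
    A = 2.0 * np.eye(d) + rho * D.T @ D            # fixed left-hand side of the u-update
    u, z, w = v.copy(), D @ v, np.zeros(d - 1)
    for _ in range(n_iter):
        u = np.linalg.solve(A, 2.0 * v + rho * D.T @ (z - w))
        Du = D @ u
        z = np.sign(Du + w) * np.maximum(np.abs(Du + w) - lam / rho, 0.0)  # soft threshold
        w = w + Du - z
    return u

def truncatefuse(v, tau, lam):
    """Fuse first, then truncate, as in the Truncatefuse operator."""
    return truncate(fuse(v, lam), tau)

def norm(v):
    """Rescale v to unit Euclidean norm."""
    return v / np.linalg.norm(v)

# example: a noisy piecewise-constant vector is smoothed, sparsified, and normalized
v = np.concatenate([np.full(10, 2.0), np.zeros(10)])
v += 0.3 * np.random.default_rng(2).standard_normal(20)
print(norm(truncatefuse(v, tau=12, lam=2.0)))
```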


We propose a structured tensor factorization procedure in Algorithm 1. It consists of
three major steps. Specifically, the step in (7) essentially obtains an unconstrained tensor
decomposition, and is done through the classical tensor power method (Anandkumar et al.,

Algorithm 1 Structured tensor factorization for optimization of (5).
1: Input: tensor $\mathcal{T}$, rank $R$, cardinalities $(s_1, \ldots, s_{m+1})$, fusion parameters $(\lambda_1, \ldots, \lambda_{m+1})$.
2: For $r = 1$ to $R$
3:   Initialize unit-norm vectors $\hat{\beta}_{j,r}^{(0)}$ randomly for $j = 1, \ldots, m+1$.
4:   Repeat
5:     For $j = 1$ to $m+1$
6:       Update $\hat{\beta}_{j,r}^{(\kappa+1)}$ as
         $\tilde{\beta}_{j,r}^{(\kappa+1)} = \mathrm{Norm}\big( \mathcal{T} \times_1 \hat{\beta}_{1,r}^{(\kappa+1)} \cdots \times_{j-1} \hat{\beta}_{j-1,r}^{(\kappa+1)} \times_{j+1} \hat{\beta}_{j+1,r}^{(\kappa)} \cdots \times_{m+1} \hat{\beta}_{m+1,r}^{(\kappa)} \big)$,  (7)
         $\check{\beta}_{j,r}^{(\kappa+1)} = \mathrm{Truncatefuse}\big( \tilde{\beta}_{j,r}^{(\kappa+1)}, s_j, \lambda_j \big)$,  (8)
         $\hat{\beta}_{j,r}^{(\kappa+1)} = \mathrm{Norm}\big( \check{\beta}_{j,r}^{(\kappa+1)} \big)$.  (9)
7:     End For
8:   Until the termination condition is met.
9:   Compute $\hat{w}_r = \mathcal{T} \times_1 \hat{\beta}_{1,r}^{(\kappa_{\max})} \times_2 \cdots \times_{m+1} \hat{\beta}_{m+1,r}^{(\kappa_{\max})}$, where $\kappa_{\max}$ is the terminated iteration.
10:  Update the tensor $\mathcal{T} = \mathcal{T} - \hat{w}_r \, \hat{\beta}_{1,r}^{(\kappa_{\max})} \circ \cdots \circ \hat{\beta}_{m+1,r}^{(\kappa_{\max})}$.
11: End For
12: Output: $\hat{w}_r$ and $\hat{\beta}_{j,r}^{(\kappa_{\max})}$ for $j = 1, \ldots, m+1$ and $r = 1, \ldots, R$.

2014b). Here, for a tensor $\mathcal{A} \in \mathbb{R}^{d_1 \times \cdots \times d_m}$ and a set of vectors $a_j \in \mathbb{R}^{d_j}$, $j = 1, \ldots, m$, the multilinear combination of the tensor entries is defined as $\mathcal{A} \times_1 a_1 \times_2 \cdots \times_m a_m := \sum_{i_1 \in [d_1]} \cdots \sum_{i_m \in [d_m]} (a_1)_{i_1} \cdots (a_m)_{i_m} \mathcal{A}_{i_1, \ldots, i_m} \in \mathbb{R}$. It is then followed by our new Truncatefuse step in (8) to generate a sparse and fused component. Finally, the step in (9) normalizes the component to ensure the unit norm. These three steps form a full update cycle of one decomposition component. The algorithm then updates all components in an alternating fashion, until some termination condition is satisfied. In our implementation, the algorithm terminates if the total number of iterations exceeds 20, or $\sum_{j=1}^{m+1} \|\hat{\beta}_{j,r}^{(\kappa+1)} - \hat{\beta}_{j,r}^{(\kappa)}\|_2^2 \le 10^{-4}$.
In terms of computational complexity, we note that the complexity of the operation in (7) is $O(\prod_{j=1}^{m+1} d_j)$, while the complexity in (8) consists of $O(d_j \log d_j)$ for the truncation operation and $O(d_j^3)$ for the fusion operation (Tibshirani and Taylor, 2011). Therefore, the total complexity of Algorithm 1 is $O\big(R \kappa_{\max} \max\{m \prod_{j=1}^{m+1} d_j, \sum_{j=1}^{m+1} d_j^3\}\big)$, by noting that $\sum_{j=1}^{m+1} \max\{\prod_{j'=1}^{m+1} d_{j'}, d_j^3\} = O\big(\max\{m \prod_{j=1}^{m+1} d_j, \sum_{j=1}^{m+1} d_j^3\}\big)$. It is interesting to note that, when the tensor order $m > 2$ and the dimension $d_j$ along each tensor mode is of a similar value, the complexity of the sparsity and fusion operation in (8) is dominated by that in (7). Consequently, our addition of the sparsity and fusion structures does not substantially increase the overall complexity of the tensor decomposition.
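For readers who prefer code, here is a minimal numpy sketch of the alternating updates (7)-(9) with rank-wise deflation. It reuses the truncatefuse() and norm() helpers from the operator sketch above, uses random initialization, and is meant only to convey the flow of Algorithm 1 under our own simplifications; it is not the authors' implementation.

```python
import numpy as np
# assumes truncatefuse() and norm() from the operator sketch in Section 3.1

def contract_all_but(T, vecs, skip):
    """T x_1 v_1 ... x_{m+1} v_{m+1} with mode `skip` left out; returns a d_skip vector."""
    out = T
    for k in range(len(vecs) - 1, -1, -1):         # contract from the last mode backwards
        if k != skip:
            out = np.tensordot(out, vecs[k], axes=([k], [0]))
    return out

def structured_tensor_factorization(T, R, card, lam, max_iter=20, tol=1e-4, seed=0):
    """Sketch of Algorithm 1: alternating power updates (7)-(9) plus deflation over r.
    card and lam are per-mode lists of length T.ndim (= m + 1)."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    weights, factors = [], []
    for r in range(R):
        beta = [norm(rng.standard_normal(d)) for d in T.shape]   # random unit-norm init
        for _ in range(max_iter):
            change = 0.0
            for j in range(T.ndim):
                new = norm(contract_all_but(T, beta, skip=j))    # power update (7)
                new = norm(truncatefuse(new, card[j], lam[j]))   # sparsify + fuse (8), renormalize (9)
                change += np.sum((new - beta[j]) ** 2)
                beta[j] = new
            if change <= tol:                                    # termination condition
                break
        w = float(contract_all_but(T, beta, skip=0) @ beta[0])   # weight w_r
        rank1 = beta[0]
        for b in beta[1:]:
            rank1 = np.multiply.outer(rank1, b)
        T = T - w * rank1                                        # deflate before the next rank
        weights.append(w)
        factors.append(beta)
    return np.array(weights), factors
```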

Algorithm 2 Dynamic tensor clustering procedure
1: Input: tensor samples $\mathcal{X}_1, \ldots, \mathcal{X}_N \in \mathbb{R}^{d_1 \times \cdots \times d_m}$, number of clusters $K$, cardinalities $(s_1, \ldots, s_{m+1})$, fusion parameters $(\lambda_1, \ldots, \lambda_{m+1})$, rank $R$.
2: Step 1: Stack the tensor samples into a higher-order tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_m \times N}$, where the $i$th slice in the last mode is $\mathcal{X}_i$.
3: Step 2: Apply Algorithm 1 to $\mathcal{T}$ with rank $R$, cardinalities $(s_1, \ldots, s_{m+1})$, and fusion parameters $(\lambda_1, \ldots, \lambda_{m+1})$ to obtain $\widehat{\mathbf{B}}_{m+1} = (\hat{\beta}_{m+1,1}^{(\kappa_{\max})}, \ldots, \hat{\beta}_{m+1,R}^{(\kappa_{\max})}) \in \mathbb{R}^{N \times R}$.
4: Step 3: Apply $K$-means clustering to $\widehat{\mathbf{B}}_{m+1}$, treating each row as a sample.
5: Output: cluster assignments $\widehat{A}_1, \ldots, \widehat{A}_K$ from Step 3.


Based upon the structured tensor factorization, we next summarize in Algorithm 2 our
proposed dynamic tensor clustering procedure illustrated in Figure 1. It is noted that Step 2
of Algorithm 2 utilizes the structured tensor factorization to obtain the reduced data $\widehat{\mathbf{B}}_{m+1}$,
whose columns consist of all information of the original samples that are relevant to clustering.
This avoids the curse of dimensionality through substantial dimension reduction. We also
note that Step 2 provides reduced data along each mode of the original tensor, and hence
it is straightforward to achieve co-clustering of any single or multiple tensor modes. This is
different from the classical clustering methods, where extension from clustering to bi-clustering
generally requires different optimization formulations (see, e.g., Chi and Allen, 2017).
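Putting the pieces together, a minimal sketch of the clustering pipeline in Algorithm 2 could look as follows. It assumes the structured_tensor_factorization() routine from the sketch above; scikit-learn's KMeans is our choice of K-means implementation, and card and lam are per-mode lists of length m + 1 for the stacked tensor.

```python
import numpy as np
from sklearn.cluster import KMeans
# assumes structured_tensor_factorization() from the sketch in Section 3.1

def dynamic_tensor_clustering(samples, K, R, card, lam):
    """Sketch of Algorithm 2: stack, factorize, then K-means on the last-mode factors."""
    T = np.stack(samples, axis=-1)                        # shape (d_1, ..., d_m, N)
    _, factors = structured_tensor_factorization(T, R, card, lam)
    B = np.column_stack([beta[-1] for beta in factors])   # N x R reduced data
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(B)
```

Co-clustering along another mode only requires running K-means on the corresponding factor matrix instead of the last-mode one.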

3.2 Application example: brain dynamic connectivity analysis


Our proposed dynamic tensor clustering approach applies to many different applications
involving dynamic tensor. Here we consider one specific application, brain dynamic connec-
tivity analysis. Brain functional connectivity describes interaction and synchronization of
distinct brain regions, and is characterized by a network, with nodes representing regions,
and links measuring pairwise dependency between regions. This dependency is frequently
quantified by Pearson correlation coefficient, and the resulting connectivity network is a
region by region correlation matrix (Fornito et al., 2013). Traditionally, it is assumed that
connectivity networks do not change over time when a subject is resting during the scan.
However, there is growing evidence suggesting the contrary that networks are not static but
dynamic over the time course of scan (Hutchison et al., 2013). To capture such dynamic
changes, a common practice is to introduce sliding and overlapping time windows, compute
the correlation network within each window, then apply a clustering method to identify
distinct states of connectivity patterns over time (Allen et al., 2014).
There are several scenarios within this context, which share some similar characteristics.
The first scenario is when we aim to cluster n individual subjects, each represented by a
p × p × t tensor that describes the brain connectivity pattern among p brain regions over
t sliding and overlapping time windows. Here n denotes the sample size, p the number
of brain regions, and t the total number of moving windows. In this case, T ∈ Rp×p×t×n
and N = n. It is natural to encourage sparsity in the first two modes of T , and thus to
improve interpretability and identification of connectivities among important brain regions.
Meanwhile, it is equally intuitive to encourage smoothness along the time mode of T , since
the connectivity patterns among the adjacent and overlapping time windows are indeed
highly correlated and similar. Toward that end, we propose the following structured tensor
factorization based clustering solution, by considering the minimization problem,
$$\min_{w_r, \beta_{1,r}, \beta_{2,r}, \beta_{3,r}, \beta_{4,r}} \; \Big\| \mathcal{T} - \sum_{r=1}^{R} w_r \, \beta_{1,r} \circ \beta_{2,r} \circ \beta_{3,r} \circ \beta_{4,r} \Big\|_F^2 + \lambda \sum_{r=1}^{R} \| D \beta_{3,r} \|_1,$$
$$\text{s.t. } \|\beta_{j,r}\|_2 = 1, \; j = 1, \ldots, 4, \quad \beta_{1,r} = \beta_{2,r}, \quad \|\beta_{1,r}\|_0 \le s, \quad r = 1, \ldots, R,$$

where s and λ are the cardinality and fusion parameters, respectively. This optimization
problem can be solved via Algorithm 1. Given that the connectivity matrix is symmetric, i.e.,
T is symmetric in its first two modes, we can easily incorporate such a symmetry constraint
by setting $\hat{\beta}_{2,r}^{(t+1)} = \hat{\beta}_{1,r}^{(t+1)}$ in Algorithm 1.

Another scenario is to cluster the t sliding windows for a single subject, so as to examine whether the
connectivity pattern is dynamic or not over time. Here T ∈ Rp×p×t and N = t. In this case,
we consider the following minimization problem,
$$\min_{w_r, \beta_{1,r}, \beta_{2,r}, \beta_{3,r}} \; \Big\| \mathcal{T} - \sum_{r=1}^{R} w_r \, \beta_{1,r} \circ \beta_{2,r} \circ \beta_{3,r} \Big\|_F^2 + \lambda \sum_{r=1}^{R} \| D \beta_{3,r} \|_1,$$
$$\text{s.t. } \|\beta_{j,r}\|_2 = 1, \; j = 1, \ldots, 3, \quad \beta_{1,r} = \beta_{2,r}, \quad \|\beta_{1,r}\|_0 \le s, \quad r = 1, \ldots, R.$$

If one is to examine the dynamic connectivity pattern for ng subjects simultaneously, then
T ∈ Rp×p×ng ×t . Both problems can be solved by dynamic tensor clustering in Algorithm 2.
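As a small illustration of how the dynamic connectivity tensor for a single subject can be assembled before clustering, the sketch below stacks Pearson-correlation matrices over sliding windows into a $p \times p \times t$ tensor; the helper name, window width, step size, and toy data are our own and are not tied to the paper's preprocessing.

```python
import numpy as np

def sliding_window_connectivity(ts, width, step):
    """Stack Pearson-correlation matrices over sliding windows into a p x p x t tensor.

    ts : array of shape (p, T_len), one row of fMRI time series per brain region.
    """
    p, T_len = ts.shape
    starts = range(0, T_len - width + 1, step)
    mats = [np.corrcoef(ts[:, s:s + width]) for s in starts]
    return np.stack(mats, axis=-1)                 # shape (p, p, number of windows)

# toy example: 116 regions, 236 time points, width-20 overlapping windows
ts = np.random.default_rng(3).standard_normal((116, 236))
T = sliding_window_connectivity(ts, width=20, step=3)
print(T.shape)
```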

3.3 Tuning
Our clustering algorithm involves a number of tuning parameters. To facilitate the compu-
tation, we propose a two-step tuning procedure. We first choose the rank R, the sparsity
parameters $s_1, \ldots, s_{m+1}$, and the fusion parameters $\lambda_1, \ldots, \lambda_{m+1}$, by minimizing
$$\log\left( \frac{\big\| \mathcal{T} - \sum_{r \in [R]} \hat{w}_r \, \hat{\beta}_{1,r} \circ \cdots \circ \hat{\beta}_{m+1,r} \big\|_F^2}{\prod_{j=1}^{m+1} d_j} \right) + p_e \times \frac{\sum_{j=1}^{m+1} \log d_j}{\prod_{j=1}^{m+1} d_j}, \qquad (10)$$
where $p_e$ refers to the total degrees of freedom, and is estimated by $\sum_{j=1}^{m+1} \sum_{r \in [R]} p_e(\hat{\beta}_{j,r})$, in which the degrees of freedom of an individual component $p_e(\hat{\beta}_{j,r})$ is defined as the number of unique non-zero elements in $\hat{\beta}_{j,r}$. For simplicity, we set $s = s_1 = \cdots = s_{m+1}$
and λ = λ1 = . . . = λm+1 . The selection criterion (10) balances model fitting and model
complexity, and the criterion of a similar form has been widely used in model selection (Wang,
2015). Next, we tune the number of clusters K by employing a well established gap statistic
(Tibshirani et al., 2001). The gap statistic selects the best K as the minimal one such that
gap(k) ≥ gap(k + 1) − se(k + 1), where se(k + 1) is the standard error corresponding to the
(k + 1)th gap statistic. In the literature, stability-based methods (Wang, 2010; Wang and
Fang, 2013) have also been proposed that can consistently select the number of clusters. We
choose the gap statistic due to its computational simplicity.
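A direct transcription of the selection criterion (10) is given below, assuming a fitted factorization stored as a list of weights and per-mode factor vectors (the data layout is our own); the degrees of freedom are counted as the number of unique nonzero values in each estimated component, as described above.

```python
import numpy as np

def selection_criterion(T, weights, factors):
    """Criterion (10): log(relative fit) + p_e * (sum_j log d_j) / (prod_j d_j).

    factors : list over r of lists of per-mode vectors (same layout as the earlier sketch).
    """
    fit = np.zeros(T.shape)
    for w, beta in zip(weights, factors):
        rank1 = beta[0]
        for b in beta[1:]:
            rank1 = np.multiply.outer(rank1, b)
        fit += w * rank1
    dims = np.array(T.shape, dtype=float)
    p_e = sum(len(np.unique(b[b != 0])) for beta in factors for b in beta)
    return np.log(np.sum((T - fit) ** 2) / dims.prod()) + p_e * np.log(dims).sum() / dims.prod()
```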

4 Theory
4.1 Theory with R = 1
We first develop the theory for the rank R = 1 case in this section, then for the general
rank case in the next section. We begin with the derivation of the rate of convergence of
the general structured tensor factorization estimator from Algorithm 1. We then establish
the clustering consistency of our proposed dynamic tensor clustering in Algorithm 2. The
difficulties in the technical analysis lie in the non-convexity of the tensor decomposition and
the incorporation of the sparsity and fusion structures.
Recall that we observe the tensor T with noise as specified in (1). To quantify the noise
level of the error tensor, we define the sparse spectral norm of $\mathcal{E} \in \mathbb{R}^{d_1 \times \cdots \times d_{m+1}}$ as
$$\eta(\mathcal{E}; s_1^*, \ldots, s_{m+1}^*) := \sup_{\substack{\|u_1\|_2 = \cdots = \|u_{m+1}\|_2 = 1, \\ \|u_1\|_0 \le s_1^*, \ldots, \|u_{m+1}\|_0 \le s_{m+1}^*}} \mathcal{E} \times_1 u_1 \times_2 \cdots \times_{m+1} u_{m+1},$$

where s∗j ≤ dj , j = 1, . . . , m + 1. It quantifies the perturbation error in a sparse scenario.

Assumption 1. Consider the structure in (2). Assume the true decomposition components are sparse and smooth in that $\|D\beta_{j,r}^*\|_1 \le f_{0,j}$ and $\|\beta_{j,r}^*\|_0 \le s_{0j}$, for $j = 1, \ldots, m+1$. Furthermore, assume the initialization satisfies $\|\hat{\beta}_{j,r}^{(0)} - \beta_{j,r}^*\|_2 \le \epsilon_0$, $j = 1, \ldots, m$, with $\epsilon_0 < 1$.

We recognize that the initialization error bound satisfies $0 \le \epsilon_0 \le 1$, since the components are normalized to have a unit norm. This condition on the initial error is very weak, by noting that we only require the initialization to be slightly better than a naive estimator whose entries are all zeros, so as to avoid $\epsilon_0 = 1$. As such, the proposed random initialization in our
algorithm satisfies this condition with high probability. The next theorem establishes the
convergence rate of the structured tensor factorization under the rank R = 1.

Theorem 1. Assume $R = 1$ and Assumption 1 holds. Define $M := \max_j \|[D^\dagger]_j\|_2$, where $D^\dagger$ is the pseudoinverse of $D$ and $[D^\dagger]_j$ refers to its $j$th column. If $s_j \ge s_{0j}$, and the error tensor satisfies $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) < w^*(1 - 2\epsilon_0)$, then $\hat{\beta}_{m+1}^{(1)}$ of Algorithm 1 with $\lambda_{m+1} \ge 2M\eta(\mathcal{E}; s_1, \ldots, s_{m+1})/[w^*(1 - 2\epsilon_0) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})]$ satisfies, with high probability,
$$\|\hat{\beta}_{m+1}^{(1)} - \beta_{m+1}^*\|_2^2 \le \left( \frac{2\,\eta(\mathcal{E}; s_1, \ldots, s_{m+1})}{w^*(1 - 2\epsilon_0) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})} \right)^2 + \frac{8M f_{0,m+1}\, \eta(\mathcal{E}; s_1, \ldots, s_{m+1})}{w^*(1 - 2\epsilon_0) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})}. \qquad (11)$$

With $\epsilon_0$ being a constant strictly less than 1, we see a clear tradeoff between the signal level $w^*$ and the error tensor $\mathcal{E}$ according to the condition $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) < w^*(1 - 2\epsilon_0)$. Besides, the derived upper bound in (11) reveals an interesting interaction of the initial error $\epsilon_0$, the signal level $w^*$, and the error tensor $\mathcal{E}$. Apparently, the error bound can be reduced by lowering $\epsilon_0$ or $\mathcal{E}$, or increasing $w^*$. For a fixed signal level $w^*$, more noisy samples, i.e., a
larger E, would require a more accurate initialization in order to obtain the same error bound.
Moreover, this error bound is a monotonic function of the smoothness parameter f0,m+1 . In
contrast to the non-smoothed tensor factorization with f0,m+1 = dm+1 , our structured tensor
factorization is able to greatly reduce the error bound, since f0,m+1 is usually much smaller
than dm+1 in the dynamic setup.
Based on the general rate derived in Theorem 1, the next result shows that Algorithm 1
generates a contracted estimator. It is thus guaranteed that the estimator converges to the
truth as the number of iterations increases.

Corollary 1. Assume the conditions in Theorem 1, and the error tensor satisfies
$$\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) \le \min\left\{ \frac{w^* \epsilon_0 (1 - \epsilon_0)}{9}, \; \frac{2 w^* \epsilon_0 (1 - \epsilon_0)}{64 M f_{0,m+1} + 1} \right\}.$$
If $\|\hat{\beta}_j^{(0)} - \beta_j^*\|_2 \le \epsilon_0$, $j = 1, \ldots, m$, then the update $\hat{\beta}_{m+1}^{(1)}$ in our algorithm satisfies $\|\hat{\beta}_{m+1}^{(1)} - \beta_{m+1}^*\|_2 \le \epsilon_0/2$ with high probability.

Corollary 1 implies that the $\ell_2$ distance of the estimator to the truth from each iteration is at most half of that from the previous iteration. Simple algebra implies that the estimator in the $\kappa$th iteration of the algorithm satisfies $\|\hat{\beta}_{m+1}^{(\kappa)} - \beta_{m+1}^*\|_2 \le 2^{-\kappa}\epsilon_0$, which converges to zero as $\kappa$ increases. Here the assumption on $\eta(\mathcal{E}; s_1, \ldots, s_{m+1})$ is imposed to ensure that the contraction rate of the estimator is $1/2$, and it can be relaxed for a slower contraction rate. It is also noteworthy that the above results do not require specification of the error tensor distribution. Next we derive the explicit form of the estimation error when $\mathcal{E}$ is a Gaussian tensor. In the following, $a_n \gg b_n$ means $b_n/a_n \to 0$, and $a_n = \widetilde{O}_p(b_n)$ means $a_n, b_n$ are of the same order up to a logarithm term, i.e., $a_n = O_p(b_n (\log n)^c)$ for some constant $c > 0$.

Corollary 2. Assume the conditions in Theorem 1, and assume $\mathcal{E} \in \mathbb{R}^{d_1 \times \cdots \times d_{m+1}}$ is a Gaussian tensor. Then we have $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) = C \sqrt{\prod_{j=1}^{m+1} s_j \sum_{j=1}^{m+1} \log(d_j)}$ for some constant $C$. In addition, if the signal strength satisfies $w^* \gg \sqrt{\prod_{j=1}^{m+1} s_j \sum_{j=1}^{m+1} \log(d_j)}$, then the update $\hat{\beta}_{m+1}^{(1)}$ in one iteration of our algorithm satisfies
$$\|\hat{\beta}_{m+1}^{(1)} - \beta_{m+1}^*\|_2^2 = \widetilde{O}_p\left( \max\left\{ \frac{\prod_{j=1}^{m+1} s_j}{w^{*2}}, \; \frac{f_{0,m+1} \sqrt{\prod_{j=1}^{m+1} s_j}}{w^*} \right\} \right).$$

Next we establish the consistency of our dynamic tensor clustering Algorithm 2 under the tensor Gaussian mixture model. Consider a collection of $m$-way tensor samples, $\mathcal{X}_1, \ldots, \mathcal{X}_N \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_m}$, from a tensor Gaussian mixture model, with $K$ rank-1 centers, $\mathcal{M}_1 := \mu_1^* w^* \beta_1^* \circ \cdots \circ \beta_m^*, \ldots, \mathcal{M}_K := \mu_K^* w^* \beta_1^* \circ \cdots \circ \beta_m^*$, and equal prior probability $\pi_k = 1/K$, for $k = 1, \ldots, K$. As before, we assume an equal number of $l = N/K$ samples in each cluster. We denote the component from the last mode, $\beta_{m+1}$, as
$$\beta_{m+1} = (\underbrace{\mu_1^*, \ldots, \mu_1^*}_{l \text{ samples}}, \ldots, \underbrace{\mu_K^*, \ldots, \mu_K^*}_{l \text{ samples}}) \in \mathbb{R}^{N \times 1}. \qquad (12)$$

Define the true cluster assignments as A∗1 := {1, . . . , l}, . . . , A∗K := {N −l, . . . , N }. Recall that
the estimated cluster assignments Ab1 , . . . , AbK are obtained from Algorithm 2. Then we show
that our proposed dynamic tensor clustering estimator is consistent, in that the estimated
cluster centers from our dynamic tensor clustering converge to the truth consistently, and that
the estimated cluster assignments recover the true cluster structures with high probability.

Theorem 2. Assume $R = 1$ and Assumption 1 holds. If $s_j \ge s_{0j}$, and the signal strength satisfies $w^* \gg \sqrt{\prod_{j=1}^{m} s_j N^2 \log(\prod_{j=1}^{m} d_j N)}\,/K$, then the estimator $\hat{\beta}_{m+1}$ satisfies
$$\|\hat{\beta}_{m+1} - \beta_{m+1}^*\|_2 = O_p\left( \frac{K}{\sqrt{N}} \right).$$
Moreover, if $\min_{i,j} |\mu_i^* - \mu_j^*| > C_1 K/\sqrt{N}$ for some constant $C_1$, we have $\widehat{A}_k = A_k^*$ for any $k = 1, \ldots, K$ with high probability.
Compared to Corollary 2 for the general structured tensor factorization, Theorem 2 requires a
stronger condition on the signal strength w∗ in order to ensure the estimation error of cluster
centers converges to zero at a desirable rate. Given this rate and an additional condition on
the minimal gap between clusters, we are able to ensure that the estimated clusters recover
the true clusters with high probability. It is also worth mentioning that our theory allows
the number of clusters K to diverge polynomially with the sample size N . Besides, we allow
dj and sj (< dj ) to diverge to infinity, as long as the signal strength condition is satisfied.

4.2 Theory with a general rank


We next extend the theory of structured tensor factorization and dynamic clustering to the
general rank R case. For this case, we need some additional assumptions on the structure of
tensor factorization, the initialization, and the noise level.
We first introduce the concept of incoherence, which is to quantify the correlation between
the decomposed components.
$$\xi := \max_{j=1,\ldots,m+1} \max_{r \neq r'} |\langle \beta_{j,r}^*, \beta_{j,r'}^* \rangle|. \qquad (13)$$
Denote the initialization error $\max_j \|\hat{\beta}_{j,r}^{(0)} - \beta_{j,r}^*\|_2 \le \epsilon_0$ for some $r \in \{1, \ldots, R\}$. Denote $w_{\max} = \max_r w_r^*$ and $w_{\min} = \min_r w_r^*$. Define $g(\epsilon_0, \xi, R) := 2\epsilon_0 C_1 + 2\epsilon_0 \xi(R-1) + \xi^2(R-1)$.
The next theorem establishes the convergence rate of the structured tensor factorization
under a general rank R.
Theorem 3. Assume a general rank $R \ge 1$ and Assumption 1 holds. Assume $\|\mathcal{T}^*\| \le C_1 w_{\max}$, and the error tensor satisfies $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) < w_{\min}(1 - 2\epsilon_0) - w_{\max} g(\epsilon_0, \xi, R)$. If $s_j \ge s_{0j}$, then $\hat{\beta}_{m+1,r}^{(1)}$ of Algorithm 1 with $\lambda_{m+1} \ge 2M[w_{\max} g(\epsilon_0, \xi, R) + \eta(\mathcal{E}; s_1, \ldots, s_{m+1})]/[w_{\min}(1 - 2\epsilon_0) - w_{\max} g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})]$ satisfies, with high probability,
$$\|\hat{\beta}_{m+1,r}^{(1)} - \beta_{m+1,r}^*\|_2^2 \le \left( \frac{2 w_{\max} g(\epsilon_0, \xi, R) + 2\eta(\mathcal{E}; s_1, \ldots, s_{m+1})}{w_{\min}(1 - 2\epsilon_0) - w_{\max} g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})} \right)^2 + \frac{8M f_{0,m+1} w_{\max} g(\epsilon_0, \xi, R) + 8M f_{0,m+1} \eta(\mathcal{E}; s_1, \ldots, s_{m+1})}{w_{\min}(1 - 2\epsilon_0) - w_{\max} g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, \ldots, s_{m+1})}. \qquad (14)$$

Compared to the error bound of the R = 1 case in Theorem 1, the error for the general rank
case is slightly larger. To see this, setting $R = 1$ in (14), we have $g(\epsilon_0, \xi, 1) = 2C_1\epsilon_0$, and
the bound in (14) is strictly larger than that in (11). This is due to an inevitable triangle
inequality used in the general rank case. The derivation for the general R case is more
complicated, and involves a different set of techniques than the R = 1 case.
Based on the general rate in Theorem 3, we then derive the local convergence which
ensures that the final estimator is to converge to the true parameter at a geometric rate. For
that purpose, we introduce some additional assumptions on the initialization and noise level.
Assumption 2. For the initialization error $\max_{j,r} \|\hat{\beta}_{j,r}^{(0)} - \beta_{j,r}^*\|_2 \le \epsilon_0$, we assume that
$$\epsilon_0 \le \min\left\{ \frac{w_{\min}}{8C_1 w_{\max}} - \frac{2\xi(R-1)}{C_1}, \; \frac{w_{\min}}{6 w_{\max}} - \xi^2(R-1) \right\}.$$

Assumption 3. Assume the error tensor satisfies $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) < w_{\min}/6$.

Assumption 4. Denote $f_0 = \max_j f_{0,j}$. Assume the fusion parameter satisfies
$$f_0 \le \frac{w_{\max}\, g(\epsilon_0, \xi, R) + \eta(\mathcal{E}; s_1, \ldots, s_{m+1})}{2M w_{\min}(1 - 2\epsilon_0)}.$$

Note that Assumptions 2-3 ensure that the condition $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) < w_{\min}(1 - 2\epsilon_0) - w_{\max} g(\epsilon_0, \xi, R)$ stated in Theorem 3 is satisfied. We also define a statistical error $\epsilon_S$ as
$$\epsilon_S := \frac{8 w_{\max}}{w_{\min}} \xi^2(R-1) + \frac{8}{w_{\min}} \eta(\mathcal{E}; s_1, \ldots, s_{m+1}).$$
(1)
Corollary 3. Assume Assumptions 1-4 hold. If sj ≥ s0j , then βbm+1,r of Algorithm
1 with λm+1 ≥ 2M [wmax g(0 , ξ, R) + η(E; s1 , . . . , sm+1 )]/[wmin (1 − 20 ) − wmax g(0 , ξ, R) −
η(E; s1 , . . . , sm+1 )] satisfies, with high probability,

(1) ∗
kβbm+1,r − βm+1,r k2 ≤ q0 + S , (15)

where q := 8wmax [C1 0 +2ξ(R−1)]/wmin ∈ (0, 1). Therefore, after running T = Ω{log(0 /S )}
iterations in Algorithm 1 , we have

(T ) ∗
max kβbj,r − βj,r k2 ≤ Op (S ). (16)
j,r

The error bound (15) ensures that the estimator in one iteration contracts at a geometric rate.
It reveals an interesting interaction between the statistical error rate $\epsilon_S$ and the contraction term $q\epsilon_0$. As the iteration step $t$ increases, the computational error $q^t \epsilon_0$ decreases while the statistical error $\epsilon_S$ is fixed. After a sufficient number of iterations, the final error is dominated by the statistical error, as shown in (16).
Next we establish the consistency of our dynamic tensor clustering Algorithm 2 under
the tensor Gaussian mixture model with a general rank. Recall that our clustering analysis
is conducted along the (m + 1)th mode of the tensor, and the true parameters are defined
in (3). Analogously, we denote our dynamic tensor clustering estimator after T iterations,
as $\widehat{\mathbf{B}}_{m+1} = (\hat{\beta}_{m+1,1}, \ldots, \hat{\beta}_{m+1,R}) \in \mathbb{R}^{N \times R}$, and denote the $i$th row of $\widehat{\mathbf{B}}_{m+1}$ as $\hat{\mu}_i \in \mathbb{R}^R$, $i = 1, \ldots, N$. We quantify the clustering error via the distance between $\hat{\mu}_i$ and $\mu_i^*$.
Theorem 4. Assume the conditions in Corollary 3 hold. Assume $w_{\max}/w_{\min} \le C_2$ for some constant $C_2 > 0$, the incoherence parameter satisfies $\xi^2(R-1) = O(K/\sqrt{N})$, and the minimal weight $w_{\min}$ satisfies $w_{\min} \gg \sqrt{\prod_{j=1}^{m} s_j N^2 \log(\prod_{j=1}^{m} d_j N)}\,/K$. Then we have
$$\max_i \|\hat{\mu}_i - \mu_i^*\|_2 = O_p\left( K \sqrt{\frac{R}{N}} \right).$$
Moreover, if $\min_{i,j} \|\mu_i^* - \mu_j^*\|_2 > C_3 K \sqrt{R/N}$ for some constant $C_3$, we have $\widehat{A}_k = A_k^*$ for any $k = 1, \ldots, K$, with high probability.
The clustering error rate allows the true rank R to increase with the sample size N . The
clustering consistency holds as long as $K\sqrt{R/N} \to 0$. Compared to Theorem 2 in the
R = 1 case, Theorem 4 requires an upper bound on the incoherence parameter. A similar
incoherence condition has also been imposed in Anandkumar et al. (2014b) and Sun et al.
(2017) to guarantee the performance of the general-rank tensor factorization.
We also remark that, Theorem 4 assumes that the true rank R is known. However,
when the estimated rank $\widehat{R}$ exceeds the true rank $R$, those additional $\widehat{R} - R$ features can be
viewed as noise. As long as their magnitudes are well controlled, it is still possible to obtain
correct cluster assignments. Sun et al. (2012) showed that a regularized K-means clustering
algorithm can asymptotically achieve a consistent clustering assignment in a high-dimensional
setting, where the number of features is large and many of them may contain no information
about the true cluster structure. Nevertheless, their result required the features to follow a
Gaussian or sub-Gaussian distribution. In our setup, the features are constructed from the
low-rank tensor decomposition, and thus may not satisfy those distributional assumptions.
We leave a full theoretical investigation of the rank selection consistency and the clustering
consistency under an estimated rank as our future research.

5 Simulations
5.1 Setup and evaluation
We consider two simulated experiments: clustering of two-dimensional matrix samples, and
clustering of three-dimensional tensor samples. We assess the performance in two ways: the
tensor recovery error and the clustering error. The former is defined as
$$\text{tensor recovery error} = \frac{\big\| \sum_{r=1}^{R} \hat{w}_r \hat{\beta}_{1,r} \circ \cdots \circ \hat{\beta}_{m+1,r} - \sum_{r=1}^{R} w_r^* \beta_{1,r}^* \circ \cdots \circ \beta_{m+1,r}^* \big\|_F}{\big\| \sum_{r=1}^{R} w_r^* \beta_{1,r}^* \circ \cdots \circ \beta_{m+1,r}^* \big\|_F}.$$

The latter is defined as the estimated distance between an estimated cluster assignment ψb
and the true assignment $\psi$ of the sample data $\mathcal{X}_1, \ldots, \mathcal{X}_N$, i.e.,
$$\text{clustering error} = \binom{N}{2}^{-1} \Big| \big\{ (i, j) : 1\{\hat{\psi}(\mathcal{X}_i) = \hat{\psi}(\mathcal{X}_j)\} \neq 1\{\psi(\mathcal{X}_i) = \psi(\mathcal{X}_j)\};\; i < j \big\} \Big|,$$
where |A| is the cardinality of the set A. This clustering criterion has been commonly used
in the clustering literature (Wang, 2010).
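The pair-counting clustering error above is straightforward to compute; the helper below (our own code, with a toy example) counts the sample pairs on which the estimated and true assignments disagree about whether two samples belong to the same cluster.

```python
import numpy as np
from itertools import combinations

def clustering_error(est, truth):
    """Fraction of pairs (i, j), i < j, where the estimated and true assignments
    disagree about co-membership of samples i and j."""
    est, truth = np.asarray(est), np.asarray(truth)
    pairs = list(combinations(range(len(est)), 2))
    disagree = sum((est[i] == est[j]) != (truth[i] == truth[j]) for i, j in pairs)
    return disagree / len(pairs)

# example: one mislabeled sample out of ten
print(clustering_error([0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
                       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))
```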
We compare our proposed method with some alternative solutions. For tensor de-
composition, we compare our structured tensor factorization method with two alternative
decomposition solutions, the tensor truncated power method of Sun et al. (2017), and the
generalized lasso penalized tensor decomposition method of Madrid-Padilla and Scott (2017).
For clustering, our solution is to apply K-means to the last component from our structured
tensor factorization. Therefore, we compare with the solutions of applying K-means to the
component from the decomposition method of Sun et al. (2017) and Madrid-Padilla and Scott
(2017), respectively, and to the vectorized tensor data without any tensor decomposition.
The solution of clustering the vectorized data is often used in brain dynamic functional
connectivity analysis (Allen et al., 2014).

5.2 Clustering of 2D matrix data


We simulate the matrix observations, Xi ∈ Rd1 ×d2 , i = 1, . . . , N , based on the decomposition

model in (2) with a true rank $R = 2$. The unnormalized components $\tilde{\beta}_{j,r}^*$ are generated as
$$\tilde{\beta}_{1,1}^* = \tilde{\beta}_{2,1}^* = (\mu, -\mu, 0.5\mu, -0.5\mu, 0, \ldots, 0), \qquad \tilde{\beta}_{3,1}^* = (\underbrace{\mu, \ldots, \mu}_{\lfloor N/2 \rfloor}, \underbrace{-\mu, \ldots, -\mu}_{\text{the rest}});$$
$$\tilde{\beta}_{1,2}^* = \tilde{\beta}_{2,2}^* = (0, 0, 0, 0, \mu, -\mu, 0.5\mu, -0.5\mu, 0, \ldots, 0), \qquad \tilde{\beta}_{3,2}^* = (\underbrace{-\mu, \ldots, -\mu}_{\lfloor N/4 \rfloor}, \underbrace{\mu, \ldots, \mu}_{\lfloor N/2 \rfloor}, \underbrace{-\mu, \ldots, -\mu}_{\text{the rest}}).$$

We then obtain the true components $\beta_{j,r}^* = \mathrm{Norm}(\tilde{\beta}_{j,r}^*)$, and the corresponding norm is absorbed into the tensor weight $w^*$. We set $d_1 = d_2$ and vary $d_1 = \{20, 40\}$, the sample size $N = \{50, 100\}$, and the signal level $\mu = \{1, 1.2\}$. The components $\tilde{\beta}_{3,1}^*$ and $\tilde{\beta}_{3,2}^*$ determine the cluster structures of the matrix samples, resulting in four clusters, with $\mathcal{X}_1, \ldots, \mathcal{X}_{\lfloor N/4 \rfloor} \in A_1$, $\mathcal{X}_{\lfloor N/4 \rfloor+1}, \ldots, \mathcal{X}_{\lfloor N/2 \rfloor} \in A_2$, $\mathcal{X}_{\lfloor N/2 \rfloor+1}, \ldots, \mathcal{X}_{\lfloor 3N/4 \rfloor} \in A_3$, and the remaining samples belonging to $A_4$.
Table 1 reports the average and standard error (in parentheses) of tensor recovery
error, clustering error, and computational time based on 50 data replications. For tensor
recovery accuracy, it is clearly seen that our method outperforms the competitors in all
scenarios. When the dimension d1 increases from 20 to 40, the recovery error increases, as
the decomposition becomes more challenging. When the signal µ increases from 1 to 1.2, i.e.,
the tensor decomposition weight w∗ increases, the tensor recovery error reduces dramatically.
For clustering accuracy, our approach again performs the best, which demonstrates the
usefulness of incorporation of both sparsity and fusion structures in tensor decomposition.
For computational time, the three clustering methods based on different decompositions are
comparable to each other, and are much faster than the one based on the vectorized data. It
reflects the substantial computational gain of tensor factorization in this context.

5.3 Clustering of 3D tensor data


We simulate the three-dimensional tensor samples, Xi ∈ Rd1 ×d2 ×d3 , i = 1, . . . , N . We vary the
sample size N = {50, 100}, the signal level µ = {0.6, 0.8}, and set d1 = d2 = d3 = 20. We did
not consider d1 = 40, since the computation is too expensive for the alternative solution that
applies K-means to the vectorized data. The unnormalized components are generated as,
$$\tilde{\beta}_{1,1}^* = \tilde{\beta}_{2,1}^* = \tilde{\beta}_{3,1}^* = (\underbrace{\mu, \ldots, \mu}_{5}, \underbrace{-\mu, \ldots, -\mu}_{5}, \underbrace{0, \ldots, 0}_{10}), \qquad \tilde{\beta}_{4,1}^* = (\underbrace{\mu, \ldots, \mu}_{\lfloor N/2 \rfloor}, \underbrace{-\mu, \ldots, -\mu}_{\text{the rest}});$$
$$\tilde{\beta}_{1,2}^* = \tilde{\beta}_{2,2}^* = \tilde{\beta}_{3,2}^* = (\underbrace{0, \ldots, 0}_{10}, \underbrace{\mu, \ldots, \mu}_{5}, \underbrace{-\mu, \ldots, -\mu}_{5}), \qquad \tilde{\beta}_{4,2}^* = (\underbrace{-\mu, \ldots, -\mu}_{\lfloor N/4 \rfloor}, \underbrace{\mu, \ldots, \mu}_{\lfloor N/2 \rfloor}, \underbrace{-\mu, \ldots, -\mu}_{\text{the rest}}).$$

Table 2 reports the average and standard error (in parentheses) of tensor recovery error,
clustering error, and computational time based on 50 data replications. Again we observe
similar qualitative patterns in 3D tensor clustering as in 2D matrix clustering, except that
the advantage of our method is even more prominent. It is also noted that the clustering
error of applying K-means to the vectorized data without any decomposition is at least 0.253
even when the signal strength increases. This is due to the fact that the method always

Table 1: Clustering of 2D matrices. Reported are the average and standard error (in
parentheses) of tensor recovery error, clustering error, and computational time based on 50
data replications. DTC: the proposed dynamic tensor clustering method based on structured
tensor factorization (STF); TTP: the tensor truncated power method of Sun et al. (2017);
GLTD: the generalized lasso penalized tensor decomposition method of Madrid-Padilla and
Scott (2017); vectorized: K-means clustering based on the vectorized data.

Tensor recovery error


d1 = d2 N µ STF TTP GLTD
20 50 1 0.833 (0.012) 0.879 (0.017) 0.938 (0.018)
1.2 0.305 (0.016) 0.427 (0.031) 0.474 (0.032)
100 1 0.786 (0.008) 0.826 (0.015) 0.886 (0.019)
1.2 0.284 (0.016) 0.416 (0.032) 0.505 (0.036)
40 50 1 0.883 (0.016) 0.999 (0.016) 1.040 (0.011)
1.2 0.498 (0.032) 0.696 (0.039) 0.802 (0.032)
100 1 0.867 (0.017) 0.963 (0.018) 1.009 (0.014)
1.2 0.371 (0.027) 0.552 (0.037) 0.694 (0.036)
Clustering error
d1 = d2 N µ DTC TTP GLTD vectorized
20 50 1 0.291 (0.010) 0.332 (0.015) 0.382 (0.017) 0.306 (0.006)
1.2 0.015 (0.009) 0.076 (0.017) 0.093 (0.018) 0.255 (0.000)
100 1 0.270 (0.007) 0.304 (0.013) 0.351 (0.017) 0.282 (0.002)
1.2 0.015 (0.009) 0.076 (0.017) 0.121 (0.019) 0.253 (0.000)
40 50 1 0.337 (0.015) 0.440 (0.015) 0.466 (0.013) 0.429 (0.007)
1.2 0.118 (0.019) 0.242 (0.027) 0.292 (0.025) 0.262 (0.002)
100 1 0.336 (0.016) 0.415 (0.016) 0.446 (0.014) 0.376 (0.006)
1.2 0.061 (0.015) 0.151 (0.022) 0.227 (0.024) 0.253 (0.000)
Computational time (seconds)
d1 = d2 N µ DTC TTP GLTD vectorized
20 50 1 1.612 (0.325) 1.097 (0.064) 1.662 (0.067) 43.85 (0.030)
1.2 2.581 (0.456) 1.598 (0.089) 2.669 (0.095) 43.86 (0.022)
100 1 1.971 (0.370) 1.407 (0.075) 1.995 (0.077) 100.2 (0.059)
1.2 3.811 (0.882) 2.717 (0.173) 3.962 (0.184) 123.4 (0.113)
40 50 1 2.956 (0.762) 1.446 (0.150) 3.251 (0.155) 290.7 (0.502)
1.2 4.432 (0.999) 1.890 (0.203) 4.745 (0.276) 290.2 (0.518)
100 1 3.410 (0.956) 1.954 (0.188) 3.696 (0.195) 569.4 (2.979)
1.2 5.920 (1.509) 3.155 (0.307) 6.456 (0.368) 572.5 (2.047)

mis-estimates the number of clusters K. By contrast, our method could estimate K correctly.
Figure 3 illustrates the gap statistic of both methods in one data replication with µ = 0.6,
where the true value K = 4 in this example. Moreover, the tuning time for the method
with the vectorized data is very long due to expensive computation of gap statistic in this

Table 2: Clustering of 3D tensors. The methods under comparison are the same as described
in Table 1.
Tensor recovery error
d1 = d2 = d3 N µ STF TTP GLTD
20 50 0.6 0.557 (0.054) 1.073 (0.004) 0.688 (0.060)
0.8 0.083 (0.001) 0.253 (0.059) 0.214 (0.058)
100 0.6 0.415 (0.055) 0.821 (0.044) 0.493 (0.056)
0.8 0.081 (0.001) 0.465 (0.070) 0.116 (0.031)
Clustering error
d1 = d2 = d3 N µ DTC TTP GLTD vectorized
20 50 0.6 0.154 (0.029) 0.443 (0.024) 0.302 (0.034) 0.397 (0.014)
0.8 0.000 (0.000) 0.049 (0.023) 0.051 (0.023) 0.255 (0.000)
100 0.6 0.076 (0.027) 0.332 (0.032) 0.176 (0.037) 0.314 (0.006)
0.8 0.000 (0.000) 0.148 (0.028) 0.013 (0.013) 0.253 (0.000)
Computational time (seconds)
d1 = d2 = d3 N µ DTC TTP GLTD vectorized
20 50 0.6 5.847 (0.436) 4.642 (0.323) 6.048 (0.457) 1746 (2.897)
0.8 5.974 (0.129) 5.347 (0.145) 6.623 (0.199) 1716 (1.172)
100 0.6 11.04 (0.688) 9.923 (0.592) 11.41 (0.706) 3849 (36.93)
0.8 9.774 (0.367) 9.516 (0.304) 10.23 (0.349) 3696 (35.31)

ultrahigh dimensional setting. This example suggests that our method of clustering after
structured tensor factorization has clear advantages in not only computation but also the
tuning and the subsequent clustering accuracy compared to the simple alternative solution
that directly applies clustering to the vectorized data.
In the interest of space, additional simulations with different correlation structures, large
ranks, and unequal cluster sizes are reported in Section S.9 of the Supplementary Materials.

6 Real data analysis


We illustrate our dynamic tensor clustering method through a brain dynamic connectivity
analysis based on resting-state functional magnetic resonance imaging (fMRI), while
emphasizing that the proposed method applies equally to many other dynamic tensor
applications. The data come from the Autism Brain Imaging Data Exchange (ABIDE), a
study of autism spectrum disorder (ASD) (Di Martino et al., 2014). ASD is an increasingly
prevalent neurodevelopmental disorder, characterized by symptoms such as social difficulties,
communication deficits, stereotyped behaviors, and cognitive delays (Rudie et al., 2013).

Figure 3: Gap statistics of clustering 3D tensor samples based on a single data replication.
Left panel shows the gap statistic from applying K-means to the vectorized tensor data, and
the right panel shows that after structured tensor factorization. The true number of clusters
is 4 in this example.

The data were obtained from multiple imaging sites. We focus on the fMRI data from the
University of Utah School of Medicine (USM) site only, since the sample size at USM is
relatively large, yet not so large that applying K-means clustering to the vectorized data
becomes computationally infeasible; see the discussion of the computational aspect in our
first task and Table 3. The data consist of resting-state fMRI scans of 57 subjects, of whom
22 have ASD and 35 are normal controls. The fMRI data have been preprocessed and
summarized as a 116 × 236 spatial-temporal matrix for each subject, corresponding to 116
brain regions of interest from the Anatomical Automatic Labeling (AAL) atlas, with a time
series of length 236 for each region. We consider two specific tasks that investigate dynamic
brain connectivity patterns using a sliding window approach.
Specifically, in the first task, we aim to cluster the subjects based on their dynamic
brain connectivity patterns, and then compare the estimated clusters with the subjects'
diagnosis status, which is treated as the true cluster membership in this analysis. The target
tensor is T ∈ R^{116×116×t×n}, which stacks a sequence of 116 × 116 correlation matrices,
corresponding to the 116 brain regions, over t sliding time windows; n is the total number
of subjects and equals 57 in this example. Two parameters affect the moving windows:
the width of the window and the step size of each movement. We combined these two into
a single parameter, the number of moving windows t, by fixing the width of the moving
window at 20. We varied the value of t among {1, 30, 50, 80} to examine its effect on the
Table 3: Clustering of the ABIDE data along the subject mode. Reported are clustering errors
with different sliding windows. The methods under comparison are the same as described in
Table 1.
Windows DTC TTP GLTD vectorized
1 26/57 = 0.456 26/57 = 0.456 27/57 = 0.474 27/57 = 0.474
30 22/57 = 0.386 25/57 = 0.439 25/57 = 0.439 27/57 = 0.474
50 18/57 = 0.316 21/57 = 0.368 21/57 = 0.368 28/57 = 0.491
80 15/57 = 0.263 15/57 = 0.263 15/57 = 0.263 21/57 = 0.368

subsequent clustering performance. When t = 1, the analysis reduces to the usual static
connectivity analysis, where the correlation matrix is computed from the entire spatial-temporal
matrix. We clustered along the last mode of T, i.e., the mode of individual subjects. Meanwhile,
we imposed fusion smoothness along the time mode, since by construction the connectivity
patterns of adjacent, overlapping sliding windows are very similar. We did not impose
any sparsity constraint, since our focus in this task is on the overall dynamic connectivity
behavior rather than on individual connections. To compare with the diagnosis status, we fixed
the number of clusters at K = 2. We report the clustering error of our method in Table
3, along with the alternative methods in Section 5. We make a few observations. First,
the clustering error based on dynamic connectivity (t > 1) is consistently lower than that
based on static connectivity (t = 1). This suggests that the underlying connectivity
pattern is likely dynamic rather than static for this data set. Second, the clustering accuracy of
our method is considerably better than that of applying K-means to the vectorized data.
Moreover, our method is computationally much more efficient. In fact, directly applying
K-means to the vectorized data ran out of memory on a personal laptop in the case of t = 80,
since the space complexity of Lloyd's algorithm for K-means is O((K + n)d) (Hartigan and
Wong, 1979), where the number of clusters is K = 2, the sample size is n = 57, and the
dimension of the vectorized tensor is d = 116 × 116 × 80 ≈ 10^6. We therefore conducted this
analysis on a computing cluster. Third, the number of moving windows does affect the
clustering accuracy, as one would naturally expect. In practice, we recommend experimenting
with multiple values of t; alternatively, t can be treated as another tuning parameter.
To mitigate temporal inconsistency, we have also conducted the analysis in the frequency
domain, following Ahn et al. (2015) and Calhoun et al. (2003); we report the results
in Section S.9 of the Supplementary Materials.
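To make the sliding-window construction described above concrete, the following sketch (our own illustration, assuming numpy; the array shapes mirror the ABIDE example, but the function and variable names are hypothetical) builds the 116 × 116 × t × n dynamic connectivity tensor from the subjects' spatial-temporal matrices.

```python
import numpy as np

def dynamic_connectivity_tensor(X, n_windows, width=20):
    """X: array of shape (n_subjects, n_regions, n_timepoints).
    Returns a tensor of shape (n_regions, n_regions, n_windows, n_subjects),
    stacking one regional correlation matrix per sliding window per subject."""
    n_sub, n_reg, n_time = X.shape
    if n_windows == 1:
        width = n_time  # static connectivity: use the entire time series
    # evenly spaced window start points; the step size is implied by n_windows
    starts = np.linspace(0, n_time - width, n_windows).astype(int)
    T = np.empty((n_reg, n_reg, n_windows, n_sub))
    for i in range(n_sub):
        for w, s in enumerate(starts):
            T[:, :, w, i] = np.corrcoef(X[i, :, s:s + width])
    return T

# e.g., with 57 subjects, 116 regions and 236 time points:
# T = dynamic_connectivity_tensor(X, n_windows=80, width=20)
```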
In the second task, we aim to uncover the dynamic behavior of the brain connectivity of

Figure 4: Clustering of the ABIDE data along the time mode. The gap statistic is shown as
a function of the number of clusters for the ASD group and the control group; it selects
K = 7 for the ASD group and K = 12 for the control group.

each diagnosis group and to compare the ASD group with the normal controls. The
target tensor is T ∈ R^{116×116×n_g×t}, where n_g denotes the number of subjects in each
diagnosis group; in our example, n_1 = 22 and n_0 = 35. In light of the results from the
first task, we fixed the number of moving time windows at t = 80. We clustered along
the last mode of T, i.e., the time mode, to examine the dynamic behavior, if any, of the
functional connectivity patterns. We imposed the sparsity constraint on the first two modes
to improve interpretation and to identify important connections among brain regions, and
we imposed the fusion constraint on the time mode to capture the smoothness of the
connectivities across adjacent sliding windows. We carried out two clustering analyses, one
for each diagnostic group. We employed the selection criterion (10) to select the rank, the
sparsity parameter, and the fusion parameter, and then tuned the number of clusters K
using the gap statistic. For both the ASD group and the normal control group, the selected
rank is 5, the selected sparsity parameter is 0.9, and the fusion parameter is 0.5, suggesting
that the estimated decomposition components contain 10% zero entries and about 40 unique
values along the time mode. Figure 4 further shows the gap statistic as a function of the
number of clusters for both diagnostic groups. Accordingly, it selects K = 7 for the ASD
group and K = 12 for the normal controls. Moreover, we observed that the clustering
membership changed 10 times along the sliding windows for the ASD group, whereas it
changed 14 times for the controls. On the one hand, this suggests that both groups of
subjects exhibit dynamic connectivity changes over time. More importantly, the state of
connectivity changes less frequently for the ASD group than for the controls. This finding
agrees with the literature, in which ASD subjects are typically found to be less active in
their brain connectivity patterns (Solomon et al., 2009; Gotts et al., 2012; Rudie et al., 2013).

References
Ahn, M., Shen, H., Lin, W. and Zhu, H. (2015). A sparse reduced rank framework for
group analysis of functional neuroimaging data. Statistica Sinica 25 295–312.
Allen, E. A., Damaraju, E., Plis, S. M., Erhardt, E. B., Eichele, T. and Calhoun,
V. D. (2014). Tracking whole-brain connectivity dynamics in the resting state. Cerebral
Cortex 24 663.
Anandkumar, A., Ge, R., Hsu, D. and Kakade, S. M. (2014a). A tensor approach to
learning mixed membership community models. Journal of Machine Learning Research 15
2239–2312.
Anandkumar, A., Ge, R. and Janzamin, M. (2014b). Guaranteed non-orthogonal tensor
decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180 .
Bi, X., Qu, A. and Shen, X. (2017). Multilayer tensor factorization with applications to
recommender systems. Annals of Statistics To Appear.
Bruce, N., Murthi, B. and Rao, R. C. (2016). A dynamic model for digital advertising:
The effects of creative format, message content, and targeting on engagement. Journal of
Marketing Research To Appear.
Calhoun, V., Adali, T., Pekar, J. and Pearlson, G. (2003). Latency (in)sensitive ICA:
Group independent component analysis of fMRI data in the temporal frequency domain.
NeuroImage 20 1661–1669.
Cao, X., Wei, X., Han, Y., Yang, Y. and Lin, D. (2013). Robust tensor clustering
with non-greedy maximization. In Proceedings of the Twenty-Third International Joint
Conference on Artificial Intelligence.
Chi, E. C. and Allen, G. I. (2017). Convex biclustering. Biometrics 73 10–19.
Chi, E. C. and Lange, K. (2015). Splitting methods for convex clustering. Journal of
Computational and Graphical Statistics 994–1013.
Di Martino, A., Yan, C.-G., Li, Q., Denio, E., Castellanos, F. X., Alaerts,
K., Anderson, J. S., Assaf, M., Bookheimer, S. Y., Dapretto, M., Deen, B.,
Delmonte, S., Dinstein, I., Ertl-Wagner, B., Fair, D. A., Gallagher, L.,
Kennedy, D. P., Keown, C. L., Keysers, C., Lainhart, J. E., Lord, C., Luna,
B., Menon, V., Minshew, N. J., Monk, C. S., Mueller, S., Muller, R.-A.,
Nebel, M. B., Nigg, J. T., O’Hearn, K., Pelphrey, K. A., Peltier, S. J., Rudie,
J. D., Sunaert, S., Thioux, M., Tyszka, J. M., Uddin, L. Q., Verhoeven, J. S.,

Wenderoth, N., Wiggins, J. L., Mostofsky, S. H. and Milham, M. P. (2014).
The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic
brain architecture in autism. Molecular Psychiatry 19 659–667.
Ding, S. and Cook, R. D. (2015). Tensor sliced inverse regression. Journal of Multivariate
Analysis 133 216–231.
Dunlavy, D., Kolda, T. and Acar, E. (2011). Temporal link prediction using matrix
and tensor factorizations. ACM Transactions on Knowledge Discovery from Data 5 1–27.
Fornito, A., Zalesky, A. and Breakspear, M. (2013). Graph analysis of the human
connectome: Promise, progress, and pitfalls. NeuroImage 80 426–444.
Gotts, S., Simmons, W., Milbury, L., Wallace, G., Cox, R. and Martin, A. (2012).
Fractionation of social brain circuits in autism spectrum disorders. Brain 2711–2725.
Hao, B., Sun, W. W., Liu, Y. and Cheng, G. (2018). Simultaneous clustering and
estimation of heterogeneous graphical models. The Journal of Machine Learning Research
To Appear.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering
algorithm. Journal of the Royal Statistical Society, Series C 28 100–108.
Hocking, T., Vert, J.-P., Bach, F. and Joulin, A. (2011). Clusterpath: An algorithm
for clustering using convex fusion penalties. Proceedings of the Twenty Eighth International
Conference on Machine Learning .
Huang, J., Shen, H. and Buja, A. (2009). The analysis of two-way functional data using
two-way regularized singular value decompositions. Journal of the American Statistical
Association 104 1609–1620.
Hutchison, R. M., Womelsdorf, T., Allen, E. A., Bandettini, P. A., Cal-
houn, V. D., Corbetta, M., Penna, S. D., Duyn, J. H., Glover, G. H.,
Gonzalez-Castillo, J., Handwerker, D. A., Keilholz, S., Kiviniemi, V.,
Leopold, D. A., de Pasquale, F., Sporns, O., Walter, M. and Chang, C.
(2013). Dynamic functional connectivity: Promise, issues, and interpretations. NeuroImage
80 10.1016/[Link].2013.05.079.
Jenatton, R., Obozinski, G. and Bach, F. (2010). Structured sparse principal component
analysis. In International Conference on Artificial Intelligence and Statistics.
Ji, S., Zhang, W. and Liu, J. (2012). A sparsity-inducing formulation for evolutionary
co-clustering. International conference on knowledge discovery and data mining .
Kolda, T. and Bader, B. (2009). Tensor decompositions and applications. SIAM Review
51 455–500.
Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I. and Lempitsky, V. (2015).
Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In Interna-
tional Conference on Learning Representations.

Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010). Biclustering via sparse
singular value decomposition. Biometrics 66 1087–1095.
Li, R., Zhang, W., Zhao, Y., Zhu, Z. and Ji, S. (2015). Sparsity learning formulations
for mining time-varying data. IEEE Transactions on Knowledge and Data Engineering 27
1411–1423.
Liu, J., Musialski, P., Wonka, P. and Ye, J. (2013). Tensor completion for estimating
missing values in visual data. IEEE Transactions on Pattern Analysis and Machine
Intelligence 35 208–220.
Liu, T., Yuan, M. and Zhao, H. (2017). Characterizing spatiotemporal transcriptome of
human brain via low rank tensor decomposition. arXiv preprint arXiv:1702.07449 .
Ma, P. and Zhong, W. (2008). Penalized clustering of large scale functional data with
multiple covariates. Journal of the American Statistical Association 103 625–636.
Madrid-Padilla, O.-H. and Scott, J. G. (2017). Tensor decomposition with generalized
lasso penalties. Journal of Computational and Graphical Statistics 26 537–546.
Massart, P. (2003). Concentration inequalities and model selection. École d'Été de Proba-
bilités de Saint-Flour 23.
Raskutti, G., Yuan, M. and Chen, H. (2017). Convex regularization for high-dimensional
multi-response tensor regression. arXiv .
Romera-Paredes, B. and Pontil, M. (2013). A new convex relaxation for tensor
completion. In Advances in Neural Information Processing Systems.
Rudie, J., Brown, J., Beck-Pancer, D., Hernandez, L., Dennis, E., Thompson,
P., Bookheimer, S. and Dapretto, M. (2013). Altered functional and structural brain
network organization in autism. NeuroImage : Clinical 2 79–94.
Seigal, A., Beguerisse-Diaz, M., Schoeberl, B., Niepel, M. and Harrington,
H. A. (2016). Tensors and algebra give interpretable groups for crosstalk mechanisms in
breast cancer. arXiv preprint arXiv:1612.08116 .
Shen, X., Huang, H. and Pan, W. (2012a). Simultaneous supervised clustering and
feature selection over a graph. Biometrika 99 899–914.
Shen, X., Pan, W. and Zhu, Y. (2012b). Likelihood-based selection and sharp parameter
estimation. Journal of the American Statistical Association 107 223–232.
Solomon, M., Ozonoff, S., Ursu, S., Ravizza, S., Cummings, N., Ly, S. and Carter,
C. (2009). The neural substrates of cognitive control deficits in autism spectrum disorders.
Neuropsychologia 47 2515–2526.
Sun, W., Lu, J., Liu, H. and Cheng, G. (2017). Provable sparse tensor decomposition.
Journal of the Royal Statistical Society, Series B 79 899–916.
Sun, W., Wang, J. and Fang, Y. (2012). Regularized k-means clustering of high-
dimensional data and its asymptotic consistency. Electronic Journal of Statistics 6 148–167.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity
and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67
91–108.
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters
in a dataset via the gap statistic. Journal of the Royal Statistical Society: Series B 63
411–423.
Tibshirani, R. J. and Taylor, J. (2011). The solution path of the generalized lasso.
Annals of Statistics 39 1335–1371.
Wang, J. (2010). Consistent selection of the number of clusters via cross validation.
Biometrika 97 893–904.
Wang, J. (2015). Joint estimation of sparse multivariate regression and conditional graphical
models. Statistica Sinica 25 831–851.
Wang, J. and Fang, Y. (2013). Analysis of presence-only data via semisupervised learning
approaches. Computational Statistics and Data Analysis 59 134–143.
Wang, Y., Sharpnack, J., Smola, A. and Tibshirani, R. (2016). Trend filtering on
graphs. Journal of Machine Learning Research 17 1–41.
Wang, Y., Xu, H. and Leng, C. (2013). Provable subspace clustering: When LRR meets
SSC. Advances in Neural Information Processing Systems.
Wu, T., Benson, A. R. and Gleich, D. F. (2016). General tensor spectral co-clustering
for higher-order data. Advances in Neural Information Processing Systems .
Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering
and differential expression identification. Biometrics 62 1089–1098.
Yuan, M. and Zhang, C. (2016). On tensor completion via nuclear norm minimization.
Foundations of Computational Mathematics 16 1031–1068.
Yuan, M. and Zhang, C. (2017). Incoherent tensor norms and their applications in higher
order tensor completion. IEEE Transactions on Information Theory 63 6753–6766.
Zhang, A. (2018). Cross: Efficient low-rank tensor completion. Annals of Statistics To
Appear.
Zhou, H., Li, L. and Zhu, H. (2013). Tensor regression with applications in neuroimaging
data analysis. Journal of the American Statistical Association 108 540–552.
Zhu, H., Zhang, H., Ibrahim, J. and Peterson, B. (2007). Statistical analysis of
diffusion tensors in diffusion-weighted magnetic resonance image data. Journal of the
American Statistical Association 102 1081–1110.
Zhu, Y. (2017). An augmented admm algorithm with application to the generalized lasso
problem. Journal of Computational and Graphical Statistics 26 195–204.
Zhu, Y., Shen, X. and Pan, W. (2014). Structural pursuit over multiple undirected graphs.
Journal of the American Statistical Association 109 1683–1696.

Supplementary Materials for
“Dynamic Tensor Clustering”
This supplementary note collects auxiliary lemmas, detailed proofs for the theorems and
corollaries in Section 4, and additional numerical analyses.

S.1 Auxiliary lemmas


Lemma 1 provides the error bound of a fused lasso estimator in trend filtering.

Lemma 1. Consider the model $y = \beta^* + \epsilon$ with true parameter $\beta^* \in \mathbb{R}^d$ and any noise $\epsilon$. Denote the fused lasso estimator as $\hat{\beta} := \arg\min_{\beta} \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1$, and denote $M := \max_j \|[D^{\dagger}]_j\|_2$. If $\lambda \ge M\|\epsilon\|_\infty$, then we have
$$\frac{1}{d}\|\hat{\beta} - \beta^*\|_2^2 \le \frac{\|\epsilon\|_\infty^2}{d} + \frac{4\lambda\|D\beta^*\|_1}{d}. \tag{S1}$$
Its proof follows from similar arguments to the proof of Theorem 3 in Wang et al. (2016),
and hence is omitted. The difference is that here we consider a general error $\epsilon$, while they
considered a Gaussian error. When $\epsilon$ is Gaussian, Lemma 1 reduces to their Theorem 3, by
applying the standard Gaussian tail inequality to bound $\|\epsilon\|_\infty$.
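For reference, the fused lasso estimator in Lemma 1 can be computed with a generic convex solver; the sketch below is our own illustration, assuming the cvxpy package, and is not the implementation used in the paper.

```python
import cvxpy as cp
import numpy as np

def fused_lasso(y, lam):
    """Solve argmin_beta 0.5 * ||y - beta||_2^2 + lam * ||D beta||_1,
    where D is the first-difference operator."""
    beta = cp.Variable(len(y))
    objective = cp.Minimize(0.5 * cp.sum_squares(y - beta)
                            + lam * cp.norm1(cp.diff(beta)))
    cp.Problem(objective).solve()
    return beta.value

# e.g., denoise a noisy piecewise-constant signal:
# y = np.r_[np.zeros(50), np.ones(50)] + 0.1 * np.random.randn(100)
# beta_hat = fused_lasso(y, lam=1.0)
```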
Lemma 2 provides a concentration of Lipschitz functions of Gaussian random variables
(Massart, 2003).

Lemma 2. (Massart, 2003, Theorem 3.4) Let $v \in \mathbb{R}^d$ be a Gaussian random variable such that $v \sim N(0, I_d)$. Assume $g(v) \in \mathbb{R}$ is a Lipschitz function such that $|g(v_1) - g(v_2)| \le L\|v_1 - v_2\|_2$ for any $v_1, v_2 \in \mathbb{R}^d$. Then we have, for each $t > 0$,
$$P\big(|g(v) - E[g(v)]| \ge t\big) \le 2\exp\left(-\frac{t^2}{2L^2}\right).$$
Lemma 3 provides an upper bound of the Gaussian width of the unit ball for the sparsity
regularizer (Raskutti et al., 2017).

Lemma 3. For a tensor $\mathcal{T} \in \mathbb{R}^{d_1\times d_2\times d_3}$, denote its regularizer $R(\mathcal{T}) = \sum_{j_1}\sum_{j_2}\sum_{j_3}|\mathcal{T}_{j_1,j_2,j_3}|$. Define the unit ball of this regularizer as $B_R(1) := \{\mathcal{T} \in \mathbb{R}^{d_1\times d_2\times d_3} \mid R(\mathcal{T}) \le 1\}$. For a tensor $\mathcal{G} \in \mathbb{R}^{d_1\times d_2\times d_3}$ whose entries are independent standard normal random variables, we have
$$E\left[\sup_{\mathcal{T}\in B_R(1)} \langle\mathcal{T}, \mathcal{G}\rangle\right] \le c\sqrt{\log(d_1 d_2 d_3)},$$
for some bounded constant $c > 0$.

Lemma 4 links the hard thresholding sparsity and the $L_1$-penalized sparsity.

Lemma 4. For any vectors $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$, $w \in \mathbb{R}^{d_3}$ satisfying $\|u\|_2 = \|v\|_2 = \|w\|_2 = 1$, $\|u\|_0 \le s_1$, $\|v\|_0 \le s_2$, and $\|w\|_0 \le s_3$, denoting $A := u \circ v \circ w$, we have
$$\|A\|_1 := \sum_{j_1}\sum_{j_2}\sum_{j_3}|A_{j_1 j_2 j_3}| \le \sqrt{s_1 s_2 s_3}.$$
Proof: According to the Cauchy-Schwarz inequality, we have $\|u\|_1 \le \sqrt{s_1}\|u\|_2 = \sqrt{s_1}$, and similarly $\|v\|_1 \le \sqrt{s_2}$, $\|w\|_1 \le \sqrt{s_3}$. Therefore, $\|A\|_1 = \|u\circ v\circ w\|_1 \le \|u\|_1\cdot\|v\|_1\cdot\|w\|_1 \le \sqrt{s_1 s_2 s_3}$. □
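A quick numerical sanity check of Lemma 4 (our own illustration, assuming numpy): for randomly generated unit-norm sparse vectors, the entrywise $L_1$ norm of the rank-1 outer product indeed never exceeds the square root of the product of the sparsity levels.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3, s1, s2, s3 = 30, 40, 50, 3, 5, 7

def sparse_unit(d, s):
    v = np.zeros(d)
    v[rng.choice(d, size=s, replace=False)] = rng.standard_normal(s)
    return v / np.linalg.norm(v)

for _ in range(1000):
    A = np.einsum('i,j,k->ijk',
                  sparse_unit(d1, s1), sparse_unit(d2, s2), sparse_unit(d3, s3))
    assert np.abs(A).sum() <= np.sqrt(s1 * s2 * s3) + 1e-10
```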

Lemma 5 connects the non-sparse entries in a tensor to the non-sparse entries in the
low-rank factorization components.

Lemma 5. (Sun et al., 2017, Lemma S.6.2) For any tensor $\mathcal{T} \in \mathbb{R}^{d_1\times d_2\times d_3}$ and an index set $F = F_1 \circ F_2 \circ F_3$ with $F_i \subseteq \{1, \ldots, d_i\}$, if $\mathcal{T} = \sum_{i\in[R]} w_i\, a_i \circ b_i \circ c_i$, we have
$$\mathcal{T}_F = \sum_{i\in[R]} w_i\, \mathrm{Truncate}(a_i, F_1) \circ \mathrm{Truncate}(b_i, F_2) \circ \mathrm{Truncate}(c_i, F_3).$$

S.2 Proof of Theorem 1


Throughout the Supplementary Materials, we prove the case with $m = 2$; the extension
to a general $m$ follows immediately. Also, for notational simplicity, we denote the one-step
estimator $\hat{\beta}_3^{(1)}$ as $\hat{\beta}_3$. Note that in the update of $\hat{\beta}_3$ in our algorithm, we first obtain an
unconstrained estimator $\tilde{\beta}_3$ in (7), then apply the truncation and fusion operator to $\tilde{\beta}_3$.
Therefore, our estimator $\hat{\beta}_3$ is also a solution to the following problem,
$$\hat{\beta}_3 := \arg\min_{\beta \in S(d_3, s_3)} \frac{1}{2}\|\tilde{\beta}_3 - \beta\|_2^2 + \lambda_3\|D\beta\|_1.$$

The underlying true model corresponding to the above problem is assumed to be
$$\tilde{\beta}_3 = \beta_3^* + \epsilon,$$
where $\epsilon$ is some error term. Note that the distribution of $\epsilon$ is unknown. Therefore, in order to
derive the rate of $\|\hat{\beta}_3 - \beta_3^*\|_2$, we utilize the result of the fused lasso problem with a general
error as shown in Lemma 1.
According to the fusion assumption, we have $\|D\beta_3^*\|_1 \le f_{0,3}$. To derive the explicit form
of the estimation error, Lemma 1 suggests calculating $\|\epsilon\|_\infty$. By the definition of $\tilde{\beta}_3$, we have
$$\epsilon = \mathrm{Norm}\big(\mathcal{T} \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\big) - \beta_3^*.$$
Denote $F_1 := \mathrm{supp}(\beta_1^*) \cup \mathrm{supp}(\hat{\beta}_1)$, $F_2 := \mathrm{supp}(\beta_2^*) \cup \mathrm{supp}(\hat{\beta}_2)$, and $F_3 := \mathrm{supp}(\beta_3^*) \cup \mathrm{supp}(\hat{\beta}_3)$, where $\mathrm{supp}(v)$ refers to the set of indices of $v$ that are nonzero. Let $F := F_1 \circ F_2 \circ F_3$. Consider the following update,
$$\tilde{\beta}_3' = \mathrm{Norm}\big(\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\big),$$
where $\mathcal{T}_F$ denotes the restriction of the tensor $\mathcal{T}$ on the three modes indexed by $F_1$, $F_2$ and $F_3$. Note that replacing $\tilde{\beta}_3$ with $\tilde{\beta}_3'$ in our algorithm does not affect the iteration of $\hat{\beta}_3$, due to the sparsity restriction of $\mathcal{T}_F$ and the scaling-invariant truncation operation. Therefore, in the sequel, we assume that $\tilde{\beta}_3$ is replaced by $\tilde{\beta}_3'$.
Plugging the underlying model $\mathcal{T} = \mathcal{T}^* + \mathcal{E}$, with the rank-1 true tensor $\mathcal{T}^* = w^*\,\beta_1^* \circ \beta_2^* \circ \beta_3^*$, into the above expression, we have
$$\epsilon = \frac{\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2} - \beta_3^* + \frac{\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2}.$$

To bound $\|\epsilon\|_\infty$, it is sufficient to bound the following two terms, (I) and (II), where
$$\|\epsilon\|_\infty \le \underbrace{\left\|\frac{\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2} - \beta_3^*\right\|_\infty}_{(\mathrm{I})} + \underbrace{\frac{\|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_\infty}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2}}_{(\mathrm{II})}.$$

Before we compute the upper bounds of (I) and (II), respectively, we first derive the lower bound of $\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2$. By the model assumption $\mathcal{T} = \mathcal{T}^* + \mathcal{E}$, we have
$$\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 \ge \|\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 - \|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 \tag{S2}$$
$$\ge w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3), \tag{S3}$$
where the second inequality is due to the following two facts. First, $\|\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 = \|\mathcal{T}^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 = \|w^*\langle\hat{\beta}_1, \beta_1^*\rangle\langle\hat{\beta}_2, \beta_2^*\rangle\beta_3^*\|_2 = w^*|\langle\hat{\beta}_1, \beta_1^*\rangle|\,|\langle\hat{\beta}_2, \beta_2^*\rangle| \ge w^*(1 - \epsilon_0^2)$, where the last inequality holds because of the initial conditions on $\hat{\beta}_1, \hat{\beta}_2$, as well as the fact that $\langle u, v\rangle^2 \ge 1 - \|u - v\|_2^2$ for unit vectors $u, v$. Second, by the definition of the tensor norm and the property $\|\hat{\beta}_1\| = \|\hat{\beta}_2\| = 1$, we have
$$\eta(\mathcal{E}; s_1, s_2, s_3) = \sup_{\substack{\|u\|=\|v\|=\|w\|=1 \\ \|u\|_0\le s_1,\|v\|_0\le s_2,\|w\|_0\le s_3}} \mathcal{E} \times_1 u \times_2 v \times_3 w \;\ge\; \sup_{\|u\|=\|v\|=\|w\|=1} \mathcal{E}_F \times_1 u \times_2 v \times_3 w \;\ge\; \|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2, \tag{S4}$$
where the last inequality is due to the Cauchy-Schwarz inequality. The lower bound in (S3)
is positive according to the assumption on the initialization and error tensor.
Next we bound the two terms (I) and (II). To bound (I), we have that
$$\frac{\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2} - \beta_3^* \le \frac{\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2 - \|\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2\,\beta_3^* + \|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2\,\beta_3^*}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2} \tag{S5}$$
$$= \frac{\|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2\,\beta_3^*}{\|\mathcal{T}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2} \tag{S6}$$
$$\le \frac{\eta(\mathcal{E}; s_1, s_2, s_3)\,\beta_3^*}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)}, \tag{S7}$$
where the inequality in (S5) is due to (S2) and the condition that (S3) is positive, and the equality in (S6) is due to the above argument that $\|\mathcal{T}_F^* \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_2 = w^*|\langle\hat{\beta}_1, \beta_1^*\rangle|\,|\langle\hat{\beta}_2, \beta_2^*\rangle| = w^*\langle\hat{\beta}_1, \beta_1^*\rangle\langle\hat{\beta}_2, \beta_2^*\rangle$. Here $\langle\hat{\beta}_i, \beta_i^*\rangle > 0$ is due to the initialization condition on $\|\hat{\beta}_i - \beta_i^*\|$ for $i = 1, 2$. Finally, the last inequality (S7) is due to (S3) and (S4). Therefore, this fact, together with $\|\beta_3^*\|_\infty \le 1$, implies that
$$(\mathrm{I}) \le \frac{\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$

In addition, according to (S4), we have $\|\mathcal{E}_F \times_1 \hat{\beta}_1 \times_2 \hat{\beta}_2\|_\infty \le \eta(\mathcal{E}; s_1, s_2, s_3)$, and hence
$$(\mathrm{II}) \le \frac{\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
Therefore, we have the final bound for $\|\epsilon\|_\infty$,
$$\|\epsilon\|_\infty \le \frac{2\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
The rest of the theorem follows from (S1) in Lemma 1 and the assumption $\|D\beta_3^*\|_1 \le f_{0,3}$.
This completes the proof of Theorem 1. □

S.3 Proof of Corollary 1


According to Theorem 1, it remains to show that the upper bound in (11) is bounded by $\epsilon_0^2/4$. Therefore, it suffices to show
$$\left[\frac{2\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)}\right]^2 \le \frac{\epsilon_0^2}{8}, \quad\text{and}\quad \frac{8Mf_{0,3}\,\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)} \le \frac{\epsilon_0^2}{8}. \tag{S8}$$
Note that when $\eta(\mathcal{E}; s_1, s_2, s_3) \le w^*\epsilon_0(1 - \epsilon_0)/9$, we have $\eta(\mathcal{E}; s_1, s_2, s_3) \le w^*\epsilon_0(1 - \epsilon_0^2)/(8 + \epsilon_0)$, due to the fact that $\epsilon_0 \le 1$. Therefore,
$$\frac{2\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)} \le \frac{2w^*\epsilon_0(1 - \epsilon_0^2)/(8 + \epsilon_0)}{w^*(1 - \epsilon_0^2) - w^*\epsilon_0(1 - \epsilon_0^2)/(8 + \epsilon_0)} = \epsilon_0/4,$$
which implies directly that the first argument in (S8) holds.
Moreover, when $\eta(\mathcal{E}; s_1, s_2, s_3) \le w^*\epsilon_0^2(1 - \epsilon_0)/(64Mf_{0,3} + 1)$, we have
$$\eta(\mathcal{E}; s_1, s_2, s_3) \le \frac{w^*\epsilon_0^2(1 - \epsilon_0)(1 + \epsilon_0)}{64Mf_{0,3} + 1} \le \frac{w^*\epsilon_0^2(1 - \epsilon_0^2)}{64Mf_{0,3} + \epsilon_0^2}.$$
Therefore, we have
$$\frac{8Mf_{0,3}\,\eta(\mathcal{E}; s_1, s_2, s_3)}{w^*(1 - \epsilon_0^2) - \eta(\mathcal{E}; s_1, s_2, s_3)} \le \frac{8Mf_{0,3}\,w^*\epsilon_0^2(1 - \epsilon_0^2)/(64Mf_{0,3} + \epsilon_0^2)}{w^*(1 - \epsilon_0^2) - w^*\epsilon_0^2(1 - \epsilon_0^2)/(64Mf_{0,3} + \epsilon_0^2)} = \epsilon_0^2/8,$$
which validates the second argument in (S8). This completes the proof of Corollary 1. □

S.4 Proof of Corollary 2


The derivation in the Gaussian error tensor scenario consists of three steps. In Step 1, we show via the large deviation bound inequality that, for some Gaussian tensor $\mathcal{G}$,
$$P\big(|\eta(\mathcal{G}; s_1, s_2, s_3) - E[\eta(\mathcal{G}; s_1, s_2, s_3)]| \ge t\big) \le 2\exp\left(-\frac{t^2}{2L^2}\right)$$
with some Lipschitz constant $L$. In Step 2, by incorporating Lemma 3, and exploring the sparsity constraint, we establish that
$$E[\eta(\mathcal{G}; s_1, s_2, s_3)] \le C\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)},$$
for some constant $C > 0$. In Step 3, we derive the final rate by incorporating the above results into Theorem 1.
Step 1: We show that the function $\eta(\cdot\,; s_1, s_2, s_3)$ is a Lipschitz function in its first argument. For any two tensors $\mathcal{G}_1, \mathcal{G}_2 \in \mathbb{R}^{d_1\times d_2\times d_3}$, denote $A^* := \arg\sup_A \langle\mathcal{G}_1, A\rangle$. We have
$$\sup_A\langle\mathcal{G}_1, A\rangle - \sup_A\langle\mathcal{G}_2, A\rangle \le \langle\mathcal{G}_1, A^*\rangle - \sup_A\langle\mathcal{G}_2, A\rangle \le \langle\mathcal{G}_1, A^*\rangle - \langle\mathcal{G}_2, A^*\rangle \le \langle\mathcal{G}_1 - \mathcal{G}_2, A^*\rangle.$$
Therefore, by the definition of $\eta(\cdot\,; s_1, s_2, s_3)$, we have
$$|\eta(\mathcal{G}_1; s_1, s_2, s_3) - \eta(\mathcal{G}_2; s_1, s_2, s_3)| \le \sup_{\substack{\|u\|=\|v\|=\|w\|=1 \\ \|u\|_0\le s_1,\|v\|_0\le s_2,\|w\|_0\le s_3}} \big\langle\mathcal{G}_1 - \mathcal{G}_2,\, u \circ v \circ w\big\rangle \le \sup_{\|u\|=\|v\|=\|w\|=1}\|u \circ v \circ w\|_F \cdot \|\mathcal{G}_1 - \mathcal{G}_2\|_F \le \|\mathcal{G}_1 - \mathcal{G}_2\|_F,$$
where the second inequality is due to the fact that $\langle A, B\rangle \le \|A\|_F\|B\|_F$, and the third inequality is due to $\|u\circ v\circ w\|_F = \|u\circ v\circ w\|_2 \le \|u\|_2\|v\|_2\|w\|_2 = 1$ for any unit-norm vectors $u, v, w$.
Applying the concentration result for Lipschitz functions of Gaussian random variables in Lemma 2 with $L = 1$, for the Gaussian tensor $\mathcal{G}$, we have
$$P\big(|\eta(\mathcal{G}; s_1, s_2, s_3) - E[\eta(\mathcal{G}; s_1, s_2, s_3)]| \ge t\big) \le 2\exp\left(-\frac{t^2}{2}\right). \tag{S9}$$
Step 2: We aim to bound $E[\eta(\mathcal{G}; s_1, s_2, s_3)]$. For a tensor $\mathcal{T} \in \mathbb{R}^{d_1\times d_2\times d_3}$, denote its $L_1$-norm regularizer as $R(\mathcal{T}) := \sum_{j_1}\sum_{j_2}\sum_{j_3}|\mathcal{T}_{j_1,j_2,j_3}|$. Define the ball of this regularizer as $B_R(\delta) := \{\mathcal{T} \in \mathbb{R}^{d_1\times d_2\times d_3} \mid R(\mathcal{T}) \le \delta\}$. For any vectors $u \in \mathbb{R}^{d_1}$, $v \in \mathbb{R}^{d_2}$, $w \in \mathbb{R}^{d_3}$ satisfying $\|u\|_2 = \|v\|_2 = \|w\|_2 = 1$, $\|u\|_0 \le s_1$, $\|v\|_0 \le s_2$, and $\|w\|_0 \le s_3$, denote $A := u \circ v \circ w$. Lemma 4 implies that $R(A) \le \sqrt{s_1 s_2 s_3}$. Therefore, we have
$$E[\eta(\mathcal{G}; s_1, s_2, s_3)] = E\Bigg[\sup_{\substack{\|u\|=\|v\|=\|w\|=1 \\ \|u\|_0\le s_1,\|v\|_0\le s_2,\|w\|_0\le s_3}} \langle\mathcal{G}, u\circ v\circ w\rangle\Bigg] \le E\Bigg[\sup_{A\in B_R(\sqrt{s_1 s_2 s_3})}\langle\mathcal{G}, A\rangle\Bigg] = \sqrt{s_1 s_2 s_3}\; E\Bigg[\sup_{A\in B_R(1)}\langle\mathcal{G}, A\rangle\Bigg].$$
This result, together with Lemma 3, implies that
$$E[\eta(\mathcal{G}; s_1, s_2, s_3)] \le C\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}. \tag{S10}$$
Finally, combining (S9) and (S10), and setting $t = \sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}$, we have, with probability $1 - 2\exp(-s_1 s_2 s_3 \log(d_1 d_2 d_3)/2)$,
$$|\eta(\mathcal{G}; s_1, s_2, s_3) - E[\eta(\mathcal{G}; s_1, s_2, s_3)]| \le \sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}.$$
Henceforth,
$$\eta(\mathcal{G}; s_1, s_2, s_3) \le (C + 1)\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}.$$
Step 3: Finally, we derive the rate of $\hat{\beta}_3$ by incorporating the bound on $\eta(\mathcal{G}; s_1, s_2, s_3)$ from Step 2 into the general rate in Theorem 1. Here, for notational simplicity, we denote the one-step estimator $\hat{\beta}_3^{(1)}$ in our algorithm as $\hat{\beta}_3$. In particular, when $w^* \gtrsim \sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}$, there exists a constant $0 < c < 1$ such that $w^*(1 - \epsilon_0^2) - (C + 1)\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)} \ge c\,w^*(1 - \epsilon_0^2)$. Therefore,
$$\|\hat{\beta}_3 - \beta_3^*\|_2^2 \le \left[\frac{2(C+1)\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}}{c\,w^*(1 - \epsilon_0^2)}\right]^2 + \frac{8Mf_{0,3}(C+1)\sqrt{s_1 s_2 s_3 \log(d_1 d_2 d_3)}}{c\,w^*(1 - \epsilon_0^2)},$$
and henceforth, up to a logarithmic term, we have
$$\|\hat{\beta}_3 - \beta_3^*\|_2^2 = \widetilde{O}_p\left(\max\left\{\frac{s_1 s_2 s_3}{w^{*2}},\; \frac{f_{0,3}\sqrt{s_1 s_2 s_3}}{w^*}\right\}\right).$$
This completes the proof of Corollary 2. □

S.5 Proof of Theorem 2


As in Theorem 1, for notational simplicity, we denote the one-step estimator $\hat{\beta}_3^{(1)}$ in our
algorithm as $\hat{\beta}_3$. Our proof consists of three steps. In Step 1, we derive the upper bound
of $\|D\beta_3^*\|_1$ given the clustering assumption in (12). In Step 2, we use the general theory
developed in Corollary 2 to obtain the final rate. In Step 3, we derive the clustering consistency
via the rate of convergence obtained in Step 2.
Step 1: Recall that in (12), we assume the component $\beta_3^* \in \mathbb{R}^N$ has the following clustering structure,
$$\beta_3^* = (\underbrace{\mu_1^*, \ldots, \mu_1^*}_{l \text{ elements}}, \underbrace{\mu_2^*, \ldots, \mu_2^*}_{l \text{ elements}}, \ldots, \underbrace{\mu_K^*, \ldots, \mu_K^*}_{l \text{ elements}}).$$
To facilitate the derivation, we denote $\beta := \beta_3^*$ and denote its $j$th entry as $\beta_j$. Then we have
$$\|D\beta\|_1 = \sum_{j=2}^{N}|\beta_j - \beta_{j-1}| = \sum_{j=2}^{K}|\mu_j^* - \mu_{j-1}^*| \le 2\sum_{j=1}^{K}|\mu_j^*| \le 2\sqrt{K}\sqrt{\sum_{j=1}^{K}\mu_j^{*2}},$$
where the second equality holds because of the above structural assumption on $\beta_3^*$, and the last inequality holds due to the Cauchy-Schwarz inequality. Moreover, due to the unit-norm condition, we have $\|\beta\|_2 = 1$; that is, $\sum_{j=1}^{K}\mu_j^{*2}\,N/K = 1$. Therefore, we have $\sum_{j=1}^{K}\mu_j^{*2} = K/N$, and
$$\|D\beta_3^*\|_1 = \|D\beta\|_1 \le \frac{2K}{\sqrt{N}}. \tag{S11}$$
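A quick numerical check of (S11) (our own illustration, assuming numpy): for a unit-norm vector with K equally sized blocks of constant values, the total variation $\|D\beta\|_1$ is indeed bounded by $2K/\sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, l = 4, 25          # K clusters, each of size l
N = K * l
for _ in range(1000):
    beta = np.repeat(rng.standard_normal(K), l)
    beta /= np.linalg.norm(beta)           # enforce the unit-norm condition
    tv = np.abs(np.diff(beta)).sum()       # ||D beta||_1
    assert tv <= 2 * K / np.sqrt(N) + 1e-10
```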
Step 2: Note that $\beta_3^*$ is not necessarily sparse, and hence $\|\beta_3^*\|_0 \le s_3 = N$. According to the general theory developed in Corollary 2, we have
$$\|\hat{\beta}_3 - \beta_3^*\|_2^2 \le \left[\frac{C_1\sqrt{s_1 s_2 N\log(d_1 d_2 N)}}{w^*}\right]^2 + \frac{C_2 f_{0,3}\sqrt{s_1 s_2 N\log(d_1 d_2 N)}}{w^*},$$
for some constants $C_1$ and $C_2$. Here $f_{0,3}$ is the upper bound of $\|D\beta_3^*\|_1$, and is bounded by $2K/\sqrt{N}$ according to (S11). Therefore, when $w^* \gtrsim \sqrt{s_1 s_2 N^2\log(d_1 d_2 N)}/K$, we have
$$\left[\frac{C_1\sqrt{s_1 s_2 N\log(d_1 d_2 N)}}{w^*}\right]^2 = O\left(\frac{K^2}{N}\right), \quad\text{and}\quad \frac{C_2 f_{0,3}\sqrt{s_1 s_2 N\log(d_1 d_2 N)}}{w^*} = O\left(\frac{K^2}{N}\right),$$
and henceforth,
$$\|\hat{\beta}_3 - \beta_3^*\|_2^2 = O\left(\frac{K^2}{N}\right).$$

Step 3: Based on the rate in Step 2, we have $\|\hat{\beta}_3 - \beta_3^*\|_\infty \le \|\hat{\beta}_3 - \beta_3^*\|_2 = O(K/\sqrt{N})$, where $\|\beta\|_\infty = \max_j|\beta_j|$. Therefore, when the minimal gap between two clusters is lower bounded such that $\min_{i,j}|\mu_i^* - \mu_j^*| > C_1 K/\sqrt{N}$, it is guaranteed to recover all cluster structures. That is, with high probability, we have $\hat{A}_k = A_k^*$ for any $k$. This completes the proof of Theorem 2. □

S.6 Proof of Theorem 3


For notational simplicity, we denote the one-step estimator $\hat{\beta}_{3,r}^{(1)}$ as $\hat{\beta}_{3,r}$, and the initial estimators $\hat{\beta}_{1,r}^{(0)}, \hat{\beta}_{2,r}^{(0)}$ as $\hat{\beta}_{1,r}, \hat{\beta}_{2,r}$, respectively. Denote $F_1 := \mathrm{supp}(\beta_{1,r}^*) \cup \mathrm{supp}(\hat{\beta}_{1,r})$, $F_2 := \mathrm{supp}(\beta_{2,r}^*) \cup \mathrm{supp}(\hat{\beta}_{2,r})$, and $F_3 := \mathrm{supp}(\beta_{3,r}^*) \cup \mathrm{supp}(\hat{\beta}_{3,r})$, and let $F := F_1 \circ F_2 \circ F_3$. Following similar arguments as in the proof of Theorem 1, the key step is to compute $\|\epsilon\|_\infty$, where
$$\epsilon = \mathrm{Norm}\big(\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\big) - \beta_{3,r}^* = \underbrace{\frac{\mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}}{\|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2} - \beta_{3,r}^*}_{(\mathrm{I})} + \underbrace{\frac{\mathcal{E}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}}{\|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2}}_{(\mathrm{II})}.$$
Next, to bound $\|\epsilon\|_\infty$, we divide the procedure into three steps. In Step 1, we decompose and bound the numerator $\mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}$ in the term (I). In Step 2, we seek the lower bound of the denominator $\|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2$ in (I). In Step 3, we bound $\|(\mathrm{I})\|_\infty$ and $\|(\mathrm{II})\|_\infty$, respectively, and eventually bound $\|\epsilon\|_\infty$.
Step 1: For the numerator $\mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}$ in (I), consider $\hat{\beta}_{1,r} = \hat{\beta}_{1,r} - \beta_{1,r}^* + \beta_{1,r}^*$ and $\hat{\beta}_{2,r} = \hat{\beta}_{2,r} - \beta_{2,r}^* + \beta_{2,r}^*$. We have
$$\mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r} = \underbrace{\mathcal{T}_F^* \times_1 (\hat{\beta}_{1,r} - \beta_{1,r}^*) \times_2 (\hat{\beta}_{2,r} - \beta_{2,r}^*)}_{I_1} + \underbrace{\mathcal{T}_F^* \times_1 (\hat{\beta}_{1,r} - \beta_{1,r}^*) \times_2 \beta_{2,r}^*}_{I_2} + \underbrace{\mathcal{T}_F^* \times_1 \beta_{1,r}^* \times_2 (\hat{\beta}_{2,r} - \beta_{2,r}^*)}_{I_3} + \underbrace{\mathcal{T}_F^* \times_1 \beta_{1,r}^* \times_2 \beta_{2,r}^*}_{I_4}. \tag{S12}$$
According to Lemma 5 and the CP decomposition in (2), we have
$$\mathcal{T}_F^* = \sum_{j\in[R]} w_j^*\, \mathrm{Truncate}(\beta_{1,j}^*, F_1) \circ \mathrm{Truncate}(\beta_{2,j}^*, F_2) \circ \mathrm{Truncate}(\beta_{3,j}^*, F_3).$$
Denote $\bar{\beta}_{i,j}^* = \mathrm{Truncate}(\beta_{i,j}^*, F_i)$ for $i = 1, 2, 3$ and $j = 1, \ldots, R$. We have
$$I_1 = \sum_{j\in[R]} w_j^*\,\langle\bar{\beta}_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\langle\bar{\beta}_{2,j}^*, \hat{\beta}_{2,r} - \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*.$$
This, together with the Cauchy-Schwarz inequality $|\langle\beta_1, \beta_2\rangle| \le \|\beta_1\|_2\|\beta_2\|_2$, implies that
$$\|I_1\|_\infty \le \sum_{j\in[R]} w_j^*\,|\langle\bar{\beta}_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle|\,|\langle\bar{\beta}_{2,j}^*, \hat{\beta}_{2,r} - \beta_{2,r}^*\rangle|\,\|\bar{\beta}_{3,j}^*\|_\infty \le \|\hat{\beta}_{1,r} - \beta_{1,r}^*\|_2\|\hat{\beta}_{2,r} - \beta_{2,r}^*\|_2 \sum_{j\in[R]} w_j^*\,\|\bar{\beta}_{1,j}^*\|_2\|\bar{\beta}_{2,j}^*\|_2\|\bar{\beta}_{3,j}^*\|_\infty \le \epsilon_0^2\sum_{j\in[R]} w_j^* \le \epsilon_0^2\|\mathcal{T}^*\| \le C_1 w_{\max}\epsilon_0^2, \tag{S13}$$
where the third inequality is due to the fact that $\|\bar{\beta}_{i,j}^*\|_2 \le \|\beta_{i,j}^*\|_2 = 1$ for $i = 1, 2$, and $\|\bar{\beta}_{3,j}^*\|_\infty = \|\beta_{3,j}^*\|_\infty \le 1$. The fourth inequality is due to the property of the spectral norm of the true tensor; in particular, $\|\mathcal{T}^*\| = \sup_{\|u\|=\|v\|=\|w\|=1}|\mathcal{T}^* \times_1 u \times_2 v \times_3 w| = \sup_{\|u\|=\|v\|=\|w\|=1}|\sum_{j\in[R]} w_j^*\langle u, \beta_{1,j}^*\rangle\langle v, \beta_{2,j}^*\rangle\langle w, \beta_{3,j}^*\rangle| \ge \sum_{j\in[R]} w_j^*$ by letting $u = \beta_{1,j}^*$, $v = \beta_{2,j}^*$, $w = \beta_{3,j}^*$. Finally, the last inequality is due to Assumption 1.
Next, we have
$$I_2 = \sum_{j\in[R]} w_j^*\,\langle\bar{\beta}_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^* = \underbrace{w_r^*\,\langle\beta_{1,r}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\,\beta_{3,r}^*}_{I_{21}} + \underbrace{\sum_{j\ne r} w_j^*\,\langle\bar{\beta}_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*}_{I_{22}},$$
where the second equality is due to the fact that $\langle\bar{\beta}_{1,r}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle = \langle\beta_{1,r}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle$, since $\mathrm{supp}(\beta_{1,r}^*) \subseteq F_1$ and $\mathrm{supp}(\beta_{1,r}^* - \hat{\beta}_{1,r}) \subseteq F_1$, as well as the fact that $\langle\bar{\beta}_{2,r}^*, \beta_{2,r}^*\rangle = \langle\beta_{2,r}^*, \beta_{2,r}^*\rangle = 1$. By the Cauchy-Schwarz inequality, we have $\|I_{21}\|_\infty \le w_r^*\,\|\beta_{1,r}^*\|_2\|\hat{\beta}_{1,r} - \beta_{1,r}^*\|_2\|\beta_{3,r}^*\|_\infty \le w_r^*\epsilon_0$. Moreover,
$$\|I_{22}\|_\infty \le \sum_{j\ne r} w_j^*\,\|\bar{\beta}_{1,j}^*\|_2\|\hat{\beta}_{1,r} - \beta_{1,r}^*\|_2\,\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle\,\|\bar{\beta}_{3,j}^*\|_\infty \le \sum_{j\ne r} w_j^*\,\epsilon_0\xi \le \epsilon_0\xi(R-1)w_{\max}, \tag{S14}$$
where the second inequality is due to the facts that $\|\bar{\beta}_{1,j}^*\|_2 \le 1$, $\|\bar{\beta}_{3,j}^*\|_\infty \le 1$, and $\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle \le \langle\beta_{2,j}^*, \beta_{2,r}^*\rangle \le \xi$, with the incoherence parameter $\xi$ defined in (13). Therefore, we have
$$\|I_2\|_\infty \le \|I_{21}\|_\infty + \|I_{22}\|_\infty \le w_r^*\epsilon_0 + \epsilon_0\xi(R-1)w_{\max}.$$
By a similar argument, we also have
$$\|I_3\|_\infty \le w_r^*\epsilon_0 + \epsilon_0\xi(R-1)w_{\max}.$$

Finally, for $I_4$, we have
$$I_4 = \sum_{j\in[R]} w_j^*\,\langle\bar{\beta}_{1,j}^*, \beta_{1,r}^*\rangle\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^* = w_r^*\beta_{3,r}^* + \underbrace{\sum_{j\ne r} w_j^*\,\langle\bar{\beta}_{1,j}^*, \beta_{1,r}^*\rangle\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*}_{I_{42}}. \tag{S15}$$
According to the Cauchy-Schwarz inequality, and similar arguments to those used in the bound of $\|I_{22}\|_\infty$, we have
$$\|I_{42}\|_\infty \le \sum_{j\ne r} w_j^*\,|\langle\bar{\beta}_{1,j}^*, \beta_{1,r}^*\rangle|\,|\langle\bar{\beta}_{2,j}^*, \beta_{2,r}^*\rangle|\,\|\bar{\beta}_{3,j}^*\|_\infty \le \xi^2(R-1)w_{\max}. \tag{S16}$$

Step 2: Denote the denominator in the term (I) as $\alpha := \|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2$. We next derive the lower bound of $\alpha$. According to (2), and the facts $\mathrm{supp}(\hat{\beta}_{1,r}) \subseteq F_1$, $\mathrm{supp}(\hat{\beta}_{2,r}) \subseteq F_2$, and $\mathrm{supp}(\hat{\beta}_{3,r}) \subseteq F_3$, we have
$$\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r} = \mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r} + \mathcal{E}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r} = \underbrace{w_r^*\,\langle\beta_{1,r}^*, \hat{\beta}_{1,r}\rangle\langle\beta_{2,r}^*, \hat{\beta}_{2,r}\rangle\,\beta_{3,r}^*}_{I_5} + \underbrace{\sum_{j\ne r} w_j^*\,\langle\beta_{1,j}^*, \hat{\beta}_{1,r}\rangle\langle\beta_{2,j}^*, \hat{\beta}_{2,r}\rangle\,\bar{\beta}_{3,j}^*}_{I_6} + \underbrace{\mathcal{E}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}}_{I_7}.$$
This implies that
$$\alpha = \|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2 \ge \|I_5\|_2 - \|I_6\|_2 - \|I_7\|_2.$$
Therefore, in order to find the lower bound of $\alpha$, it is sufficient to find the lower bound of $\|I_5\|_2$ and upper bounds of $\|I_6\|_2$ and $\|I_7\|_2$. The initial conditions imply that $\|I_5\|_2 \ge w_r^*(1 - \epsilon_0^2)$. Moreover, according to the argument in (S4), we have $\|I_7\|_2 = \|\mathcal{E}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2 \le \|\mathcal{E}_F\| \le \eta(\mathcal{E}; s_1, s_2, s_3)$. Furthermore, considering $\hat{\beta}_{1,r} = \hat{\beta}_{1,r} - \beta_{1,r}^* + \beta_{1,r}^*$ and $\hat{\beta}_{2,r} = \hat{\beta}_{2,r} - \beta_{2,r}^* + \beta_{2,r}^*$, we have

$$\|I_6\|_2 \le \underbrace{\Big\|\sum_{j\ne r} w_j^*\,\langle\beta_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\langle\beta_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*\Big\|_2}_{I_{61}} + \underbrace{\Big\|\sum_{j\ne r} w_j^*\,\langle\beta_{1,j}^*, \beta_{1,r}^*\rangle\langle\beta_{2,j}^*, \hat{\beta}_{2,r} - \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*\Big\|_2}_{I_{62}} + \underbrace{\Big\|\sum_{j\ne r} w_j^*\,\langle\beta_{1,j}^*, \hat{\beta}_{1,r} - \beta_{1,r}^*\rangle\langle\beta_{2,j}^*, \hat{\beta}_{2,r} - \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*\Big\|_2}_{I_{63}} + \underbrace{\Big\|\sum_{j\ne r} w_j^*\,\langle\beta_{1,j}^*, \beta_{1,r}^*\rangle\langle\beta_{2,j}^*, \beta_{2,r}^*\rangle\,\bar{\beta}_{3,j}^*\Big\|_2}_{I_{64}}.$$
Using similar arguments as those in (S13), (S14), and (S16), we have
$$I_{61} \le \epsilon_0\xi(R-1)w_{\max}, \quad I_{62} \le \epsilon_0\xi(R-1)w_{\max}, \quad I_{63} \le C_1\epsilon_0^2 w_{\max}, \quad I_{64} \le \xi^2(R-1)w_{\max},$$
and hence the denominator $\alpha := \|\mathcal{T}_F \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r}\|_2$ satisfies
$$\alpha \ge w_r^*(1 - \epsilon_0^2) - w_{\max}\underbrace{\big[2\epsilon_0\xi(R-1) + C_1\epsilon_0^2 + \xi^2(R-1)\big]}_{g(\epsilon_0, \xi, R)} - \eta(\mathcal{E}; s_1, s_2, s_3). \tag{S17}$$

Step 3: Combining the results in (S12), (S15) and (S17), we have
$$(\mathrm{I}) = \frac{\mathcal{T}_F^* \times_1 \hat{\beta}_{1,r} \times_2 \hat{\beta}_{2,r} - \alpha\beta_{3,r}^*}{\alpha} = \frac{I_1 + I_2 + I_3 + w_r^*\beta_{3,r}^* + I_{42} - \alpha\beta_{3,r}^*}{\alpha} \le \frac{I_1 + I_2 + I_3 + I_{42} + w_r^*\beta_{3,r}^* - \big[w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)\big]\beta_{3,r}^*}{w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
This, together with the bounds on $I_1$, $I_2$, $I_3$, and $I_{42}$, implies that
$$\|(\mathrm{I})\|_\infty \le \frac{2w_{\max}g(\epsilon_0, \xi, R) + \eta(\mathcal{E}; s_1, s_2, s_3)}{w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
Following similar arguments as those in (S4) and (S17), we have
$$\|(\mathrm{II})\|_\infty \le \frac{\eta(\mathcal{E}; s_1, s_2, s_3)}{w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
Therefore, we have
$$\|\epsilon\|_\infty \le \frac{2w_{\max}g(\epsilon_0, \xi, R) + 2\eta(\mathcal{E}; s_1, s_2, s_3)}{w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)}.$$
The rest of the theorem follows from (S1) in Lemma 1 and the conditions on $\lambda_3$ and $\|D\beta_{3,r}\|_1$.
This completes the proof of Theorem 3. □

S.7 Proof of Corollary 3


Based on the results in Theorem 3, we prove Corollary 3 in two steps. In Step 1, we find the lower bound of the denominator $w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3)$ in (14). In Step 2, we simplify its numerator and find its upper bound.
Step 1: According to the first condition in Assumption 2, we have $\tilde{q} := C_1\epsilon_0 + 2\xi(R-1) < w_{\min}/(8w_{\max}) < 1$, which, together with the second condition in Assumption 2, implies that
$$g(\epsilon_0, \xi, R) = \epsilon_0^2 C_1 + 2\epsilon_0\xi(R-1) + \xi^2(R-1) = \xi^2(R-1) + \tilde{q}\epsilon_0 \le \frac{w_{\min}}{6w_{\max}}.$$
Moreover, the second condition in Assumption 2 implies that $\epsilon_0 \le w_{\min}/(6w_{\max})$, and henceforth $\epsilon_0^2 w_{\max}/w_{\min} \le 1/6$. These two facts, together with Assumption 3, lead to the lower bound
$$w_r^*(1 - \epsilon_0^2) - w_{\max}g(\epsilon_0, \xi, R) - \eta(\mathcal{E}; s_1, s_2, s_3) \ge w_{\min}\left(1 - \frac{w_{\max}}{w_{\min}}\epsilon_0^2 - \frac{w_{\max}}{w_{\min}}g(\epsilon_0, \xi, R) - \frac{\eta(\mathcal{E}; s_1, s_2, s_3)}{w_{\min}}\right) > w_{\min}\left(1 - \frac{1}{6} - \frac{1}{6} - \frac{1}{6}\right) = \frac{w_{\min}}{2}. \tag{S18}$$
Step 2: According to Theorem 3, (S18) and Assumption 4, we have
$$\|\hat{\beta}_{3,r}^{(1)} - \beta_{3,r}^*\|_2 \le \frac{8w_{\max}}{w_{\min}}g(\epsilon_0, \xi, R) + \frac{8}{w_{\min}}\eta(\mathcal{E}; s_1, s_2, s_3) \le \frac{8w_{\max}}{w_{\min}}\big[\xi^2(R-1) + \tilde{q}\epsilon_0\big] + \frac{8}{w_{\min}}\eta(\mathcal{E}; s_1, s_2, s_3) \le \frac{8w_{\max}\tilde{q}}{w_{\min}}\epsilon_0 + \frac{8w_{\max}}{w_{\min}}\xi^2(R-1) + \frac{8}{w_{\min}}\eta(\mathcal{E}; s_1, s_2, s_3) = q\epsilon_0 + \epsilon_S,$$
where the contraction coefficient $q := 8w_{\max}\tilde{q}/w_{\min} < 1$ by the first condition in Assumption 2, and the statistical error $\epsilon_S$ is independent of the initial error $\epsilon_0$. The rest of Corollary 3 follows by iteratively applying the above inequality. This completes the proof of Corollary 3. □

S.8 Proof of Theorem 4


Based on the results in Corollary 3, the estimator $\hat{\beta}_{m+1,r}$ after $T$ iterations satisfies
$$\max_{r=1,\ldots,R}\|\hat{\beta}_{m+1,r} - \beta_{m+1,r}^*\|_2 \le O_p(\epsilon_S).$$
According to Corollary 2, when the error tensor is a Gaussian tensor, we have $\eta(\mathcal{E}; s_1, \ldots, s_{m+1}) = C\sqrt{\prod_{j=1}^{m+1}s_j\sum_{j=1}^{m+1}\log(d_j)}$ for some constant $C$. Note that here $s_{m+1} = d_{m+1} = N$. Therefore,
$$\epsilon_S \sim \frac{w_{\max}}{w_{\min}}\xi^2(R-1) + \frac{\sqrt{N\prod_{j=1}^{m}s_j\,\big(\log(N) + \sum_{j=1}^{m}\log(d_j)\big)}}{w_{\min}},$$
where $a \sim b$ means that $a$ and $b$ are of the same order. Therefore, when $w_{\max}/w_{\min} \le C_2$ for some constant $C_2 > 0$, $\xi^2(R-1) = O(K/\sqrt{N})$, and $w_{\min} \gtrsim \sqrt{\prod_{j=1}^{m}s_j\,N^2\log\big(\prod_{j=1}^{m}d_j\,N\big)}/K$, we have $\epsilon_S = O(K/\sqrt{N})$. Based on this result, we obtain the estimation error in clustering; i.e.,
$$\max_i\|\hat{\mu}_i - \mu_i^*\|_2 \le \sqrt{R}\max_{r=1,\ldots,R}\|\hat{\beta}_{m+1,r} - \beta_{m+1,r}^*\|_2 \le O_p\left(K\sqrt{\frac{R}{N}}\right) < \widetilde{C}K\sqrt{R/N},$$
for some constant $\widetilde{C}$.
p
Finally, if mini,j kµ∗i − µ∗j k2 > C3 K R/N for some constant C3 > 4C, e we have, for any
two samples µ b j from different clusters A∗i and A∗j , respectively,
bi, µ


bi − µ b i − µ∗i + µ∗i − µ∗j + µ∗j − µ
b j k2 = kµ b j k2
≥ kµ∗i − µ∗j k2 − kµ
b i − µ∗i k2 − kµ∗j − µ
b j k2
p
> 2CK
e R/N ,

Similarly, for any two samples µ b i0 from the same cluster A∗i ,
bi, µ


bi − µ b i − µ∗i + µ∗i − µ
b i0 k2 = kµ b i0 k2
b i − µ∗i k2 − kµ∗i − µ
≤ kµ b i0 k 2
p
≤ 2CK
e R/N .

This implies that the within-cluster distance is always smaller than the between-cluster dis-
tance, and henceforth, any Euclidean-distance based clustering algorithm is able to correctly
b i , i = 1, . . . , N . That is, we have Abk = A∗k for any k = 1, . . . , K,
assign the clusters of all µ
with high probability. This completes the proof of Theorem 4. 

S.9 Additional numerical analyses


In this section, we present additional simulations with different correlation structures, with
large ranks, and with unequal cluster sizes. We also report the clustering analysis in the first
task of our real data analysis but in the frequency domain instead of the time domain.
First, we investigate the impact of different correlation structures on our method. We
adopt the same simulation setting as in Section 5.2, but replace the original identity covariance
matrix with a covariance matrix under an autoregressive (AR) structure or an exchangeable
structure. That is, we generate the data from (6), with Σk,j , k = 1, . . . , K, j = 1, . . . , m, of
the form,

Table S4: Clustering of 2D matrices with different correlation structures. The methods under
comparison are the same as described in Table 1.

Tensor recovery error


Correlation N ρ STF TTP GLTD
AR 50 0.1 0.317 (0.024) 0.446 (0.034) 0.550 (0.040)
0.2 0.448 (0.035) 0.516 (0.035) 0.645 (0.036)
0.5 0.699 (0.038) 0.710 (0.039) 0.972 (0.037)
100 0.1 0.277 (0.022) 0.405 (0.035) 0.479 (0.038)
0.2 0.334 (0.033) 0.413 (0.036) 0.520 (0.040)
0.5 0.655 (0.028) 0.678 (0.026) 0.923 (0.032)
Exchangeable 50 0.1 0.464 (0.035) 0.512 (0.036) 0.814 (0.034)
0.2 0.909 (0.041) 0.883 (0.04) 1.010 (0.032)
0.5 1.946 (0.034) 1.913 (0.037) 1.878 (0.025)
100 0.1 0.356 (0.029) 0.407 (0.035) 0.721 (0.043)
0.2 0.880 (0.046) 0.898 (0.044) 1.061 (0.023)
0.5 1.998 (0.031) 1.966 (0.032) 1.890 (0.023)
Clustering error
Correlation N ρ DTC TTP GLTD vectorized
AR 50 0.1 0.036 (0.013) 0.092 (0.017) 0.148 (0.023) 0.255 (0.000)
0.2 0.117 (0.018) 0.148 (0.018) 0.201 (0.021) 0.255 (0.000)
0.5 0.237 (0.02) 0.250 (0.023) 0.381 (0.023) 0.256 (0.001)
100 0.1 0.025 (0.011) 0.086 (0.017) 0.113 (0.019) 0.253 (0.000)
0.2 0.066 (0.016) 0.092 (0.017) 0.146 (0.021) 0.253 (0.000)
0.5 0.214 (0.015) 0.227 (0.013) 0.363 (0.019) 0.253 (0.000)
Exchangeable 50 0.1 0.124 (0.020) 0.142 (0.020) 0.270 (0.021) 0.255 (0.000)
0.2 0.248 (0.024) 0.236 (0.023) 0.265 (0.023) 0.309 (0.012)
0.5 0.272 (0.013) 0.281 (0.017) 0.290 (0.018) 0.491 (0.005)
100 0.1 0.063 (0.015) 0.086 (0.017) 0.217 (0.022) 0.253 (0.000)
0.2 0.226 (0.024) 0.238 (0.023) 0.283 (0.018) 0.288 (0.012)
0.5 0.275 (0.014) 0.267 (0.014) 0.264 (0.016) 0.486 (0.005)

   
$$\Sigma_{k,j}^{\mathrm{ar}} = \begin{pmatrix} 1 & \rho & \cdots & \rho^{d_j-2} & \rho^{d_j-1} \\ \rho & 1 & \cdots & \rho^{d_j-3} & \rho^{d_j-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \rho^{d_j-1} & \rho^{d_j-2} & \cdots & \rho & 1 \end{pmatrix} \in \mathbb{R}^{d_j\times d_j}, \qquad \Sigma_{k,j}^{\mathrm{exch}} = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{d_j\times d_j}.$$

We consider the correlation coefficient ρ ∈ {0.1, 0.2, 0.5}, where a larger value of ρ indicates a
more severe misspecification for our method. We vary the sample size N ∈ {50, 100}, and fix
the other parameters µ = 1.2 and d1 = d2 = 20. Table S4 reports the tensor recovery error
and clustering error out of 50 data replications for all methods under the two correlation

Table S5: Clustering of 2D matrices with different ranks. The methods under comparison
are the same as described in Table 1.
Tensor recovery error
d1 = d2 N µ Rank STF TTP GLTD
20 50 1 5 0.764 (0.032) 0.848 (0.036) 0.865 (0.036)
8 0.808 (0.021) 0.862 (0.021) 0.903 (0.023)
1.2 5 0.509 (0.007) 0.589 (0.030) 0.609 (0.031)
8 0.668 (0.006) 0.689 (0.013) 0.728 (0.022)
100 1 5 0.712 (0.032) 0.748 (0.034) 0.831 (0.034)
8 0.804 (0.018) 0.849 (0.021) 0.930 (0.022)
1.2 5 0.496 (0.004) 0.572 (0.028) 0.659 (0.034)
8 0.640 (0.011) 0.665 (0.020) 0.678 (0.017)
Clustering error
d1 = d2 N µ Rank DTC TTP GLTD vectorized
20 50 1 5 0.183 (0.029) 0.259 (0.035) 0.245 (0.036) 0.291 (0.007)
8 0.147 (0.031) 0.222 (0.031) 0.247 (0.033) 0.280 (0.005)
1.2 5 0.013 (0.013) 0.047 (0.019) 0.064 (0.024) 0.256 (0.001)
8 0.000 (0.000) 0.033 (0.019) 0.066 (0.026) 0.255 (0.000)
100 1 5 0.139 (0.032) 0.156 (0.034) 0.224 (0.032) 0.278 (0.004)
8 0.165 (0.027) 0.200 (0.028) 0.314 (0.032) 0.266 (0.003)
1.2 5 0.000 (0.000) 0.040 (0.017) 0.115 (0.027) 0.234 (0.014)
8 0.013 (0.013) 0.025 (0.023) 0.050 (0.021) 0.038 (0.021)

structures. It is seen that, when the correlation is moderate, i.e., ρ = 0.1 or 0.2, our method
clearly outperforms the alternative solutions in terms of both tensor recovery and clustering
accuracy. When the correlation increases to 0.5, for which our model is severely misspecified,
our method still works reasonably well. In addition, the computational time under the AR or
the exchangeable correlation structure is very similar to that reported in Table 1 under the
independence structure, and we omit the results.
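For reference, the two covariance structures above can be generated as follows; this is a minimal sketch of our own, assuming numpy, and is not the simulation code used in the paper.

```python
import numpy as np

def ar_cov(d, rho):
    # autoregressive structure: the (i, j) entry equals rho^|i - j|
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def exch_cov(d, rho):
    # exchangeable structure: 1 on the diagonal, rho everywhere else
    return np.full((d, d), rho) + (1.0 - rho) * np.eye(d)

# e.g., draw one d_j-dimensional Gaussian factor under the AR structure with rho = 0.2:
# z = np.random.default_rng(0).multivariate_normal(np.zeros(20), ar_cov(20, 0.2))
```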
Next we investigate the impact of a large rank R on our method. We continue to adopt
the simulation setting as in Section 5.2, but increase the true rank value R ∈ {5, 8}. We
vary the sample size N ∈ {50, 100}, µ ∈ {1, 1.2}, and fix d1 = d2 = 20. Table S5 reports the
tensor recovery error and clustering error out of 50 data replications for all methods under
the new rank values. We continue to observe that our method performs competitively with a
larger value of R.
Next we consider a simulation example where the cluster sizes are unequal. We adopt
the simulation setting in Section 5.3, but set the four cluster sizes l1 , . . . , l4 satisfying that
l1 : l2 : l3 : l4 = 1 : 2 : 3 : 4. Table S6 reports the tensor recovery error and clustering error

Table S6: Clustering of 3D tensors with unequal cluster sizes. The methods under comparison
are the same as described in Table 1.
Tensor recovery error
d1 = d2 = d3 N µ STF TTP GLTD
20 50 0.6 0.052 (0.003) 0.699 (0.049) 0.346 (0.081)
0.8 0.012 (0.001) 0.335 (0.070) 0.119 (0.048)
100 0.6 0.038 (0.003) 0.797 (0.047) 0.316 (0.053)
0.8 0.009 (0.001) 0.237 (0.059) 0.066 (0.037)
Clustering error
d1 = d2 = d3 N µ DTC TTP GLTD vectorized
20 50 0.6 0.177 (0.028) 0.400 (0.028) 0.286 (0.035) 0.406 (0.017)
0.8 0.000 (0.000) 0.146 (0.034) 0.058 (0.026) 0.204 (0.000)
100 0.6 0.069 (0.025) 0.408 (0.031) 0.191 (0.032) 0.287 (0.008)
0.8 0.000 (0.000) 0.107 (0.031) 0.029 (0.020) 0.202 (0.000)

out of 50 data replications. We observe a similar pattern of performance when the cluster
sizes are different. In this article, the assumption of equal cluster sizes is imposed only to
simplify the presentation. It can be easily relaxed to the general case of unequal cluster sizes.
Finally, we add an analysis of the ABIDE fMRI data in the frequency domain to mitigate
the potential issue of temporal inconsistency. Specifically, we repeat the first task in Section
6, but transform the data to the frequency domain via the fast Fourier transform (FFT). That
is, following Calhoun et al. (2003) and Ahn et al. (2015), we calculate the temporal amplitude
spectrum for each voxel, $X_{ik} = |\mathrm{FFT}(x_{ik})|$, where $x_{ik}$ represents the original measurement in the
time domain at voxel $i$ and time point $k$. The rest of the analysis is the same as in the
time domain. Table S7 reports the clustering error of all methods in the frequency domain.
Our proposed method and the method of Madrid-Padilla and Scott (2017) perform similarly
and are the best across all sliding windows. The smallest clustering error in the frequency
domain is 21/57, which is larger than the smallest clustering error of 15/57 obtained in the
time domain in Table 3.
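The frequency-domain transformation above amounts to a single call per voxel time series; the sketch below is our own illustration, assuming numpy, where x denotes one voxel's length-236 time series.

```python
import numpy as np

def amplitude_spectrum(x):
    # temporal amplitude spectrum X = |FFT(x)| of one voxel's time series x
    return np.abs(np.fft.fft(np.asarray(x)))
```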

Table S7: Clustering of the ABIDE data along the subject mode in the frequency domain.
The setup is the same as Table 3.

Windows DTC TTP GLTD vectorized


1 21/57 26/57 21/57 21/57
30 21/57 25/57 21/57 21/57
50 22/57 24/57 22/57 22/57
80 21/57 22/57 21/57 24/57
