PARAFAC Algorithms for Large Tensors

Neurocomputing
journal homepage: [Link]/locate/neucom

Available online 19 February 2011

Keywords: Canonical polyadic decomposition (CP); Tensor factorization; PARAFAC; Large-scale dataset; Multiway classification; Parallel computing; Alternating least squares; Hierarchical ALS

Abstract: Parallel factor analysis (PARAFAC) is a tensor (multiway array) factorization method which allows one to find hidden factors (component matrices) in multidimensional data. Most of the existing algorithms for PARAFAC, especially the alternating least squares (ALS) algorithm, need to compute Khatri–Rao products of tall factors and multiplications of large matrices; they therefore have a high computational cost, require large amounts of memory, and are not suitable for very large-scale problems. Hence, PARAFAC for large-scale data tensors is still a challenging problem. In this paper, we propose a new approach based on a modified ALS algorithm which computes Hadamard products instead of Khatri–Rao products and employs relatively small matrices. The new algorithms are able to process extremely large-scale tensors with billions of entries. Extensive experiments confirm the validity and high performance of the developed algorithms in comparison with other well-known algorithms.
1. Introduction and problem statement

Tensor factorizations are important techniques for data mining, dimensionality reduction, pattern recognition, object detection, classification, gene clustering, sparse nonnegative representation and coding, and blind source separation (BSS) [1,2]. For example, PARAFAC, also known as the canonical polyadic decomposition (CP), has already found a wide spectrum of applications in positron emission tomography (PET), spectroscopy, chemometrics and environmental science [1].

In this paper, we present a PARAFAC factorization suitable for large-scale problems and fast parallel implementation. The proposed model and algorithms solve a previously intractable problem for tensors of arbitrarily high dimension and scale. We divide a given data tensor into a grid of multiple sub-tensors (see Fig. 2). We factorize all the sub-tensors independently, using efficient PARAFAC algorithms in parallel mode, and then integrate the partial results to estimate the desired factors of the whole tensor.

The developed model is referred to here as the grid-tensor factorization (gTF). The simplest case of the gTF is the factorization of a large-scale tensor partitioned into only two sub-tensors. If one sub-tensor is considered as newly arriving data that expands the current (actual) data along a single mode (usually time), the problem is referred to as dynamic tensor analysis (DTA) or dynamic tensor factorization [3,4]. This problem also often arises in the factorization of a training dataset.

Tensor factorizations can be used as dimensionality reduction methods in multiway classification, where the factors are bases onto which a data tensor is projected to obtain its features. In practice, the training data is often augmented with new samples, and hence the basis factors for the expanded training tensor need to be updated. A simple way is to factorize the whole new training tensor again; however, this approach demands a high computational cost. A more convenient way is to update the old bases with the factors of the newly arriving data.

The proposed algorithm calculates Hadamard products and multiplications of small matrices, and avoids Khatri–Rao products. In particular, this new algorithm opens new perspectives for finding nonnegative factors of very large-scale nonnegative tensors, which is potentially useful in many applications, such as those in neuroscience and data mining. For specific data, such as spectral data, images or chemical concentrations, sparsity and nonnegativity constraints are often imposed on the hidden factors in order to provide a meaningful representation and a clear physical interpretation; the extracted basis components are then part-based representations of the nonnegative data. However, the constrained nonnegative PARAFAC, also called Nonnegative Tensor Factorization (NTF), is not guaranteed to converge to a stationary point. Hence, the NTF often needs to be run with many trials and different regularization values in order to choose the desired solution, which demands a high computational cost due to frequent access to the input data. In practice, when a large data tensor can be well explained by PARAFAC factors, it is more efficient to store the hidden factors and process them instead of the whole data tensor.
The ALS PARAFAC algorithm is a "work-horse" algorithm which is able to factorize a data tensor with a high fit [2,5,6]. However, it is not suitable for very large-scale problems. By using the factors estimated by the modified ALS, we develop a very fast algorithm to retrieve the nonnegative factors hidden in a data tensor.

An important issue for large-scale factorization is the evaluation of stopping criteria to control convergence. In this paper, we use the cost function value or the fit error. However, for a large-scale tensor, an explicit computation of the cost function value is impossible due to the excessive memory required to build up the approximate tensor $\hat{\underline{\mathbf{Y}}}$. We derive a simple, low-complexity formula for the stopping criteria applied to grid-tensor factorizations.

The paper is organized as follows. After a brief overview of the notation and PARAFAC tensor factorizations, dynamic tensor factorization is presented for a three-dimensional tensor in Section 2. Section 3 is devoted to generalizing the dynamic tensor factorization to the grid PARAFAC. We present the ALS algorithm for grid PARAFAC and its pseudo-code, which can be implemented in a parallel processing system. In Section 4, a conversion of the PARAFAC model to its restricted model with nonnegative constraints is introduced. The new stopping criterion is introduced in Section 5. Communication cost and the practical implementation of the grid algorithms are discussed in Section 6. The grid PARAFAC is extended to complex-valued tensors in Section 7. Experiments with large synthetic and real-world datasets confirm the validity and efficiency of the proposed algorithms in Section 8. Finally, Section 9 concludes the paper.

1.1. Tensor notations and basic models

We shall denote a tensor by underlined bold capital letters, e.g. $\underline{\mathbf{Y}}$, matrices by bold capital letters, e.g. $\mathbf{A}$, and vectors by bold italic lowercase letters, e.g. $\mathbf{d}$. For a three-dimensional tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I \times J \times K}$, its frontal, lateral and horizontal slices are denoted, respectively, by $\mathbf{Y}_k = \mathbf{Y}_{::k}$, $\mathbf{Y}_{:j:}$ and $\mathbf{Y}_{i::}$. A tube (vector) at position $(i,j)$ along mode-3 is denoted by $\mathbf{y}_{ij:}$, and the corresponding tubes along mode-2 and mode-1 are $\mathbf{y}_{i:k}$ and $\mathbf{y}_{:jk}$. The mode-$n$ matricized version of a tensor $\underline{\mathbf{Y}}$ is denoted by $\mathbf{Y}_{(n)}$, and is intuitively defined for a 3-D tensor as follows:

$$\mathbf{Y}_{(1)} = [\mathbf{Y}_1\ \ \mathbf{Y}_2\ \ \cdots\ \ \mathbf{Y}_K] \in \mathbb{R}^{I \times (JK)}. \qquad (1)$$
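The mode-1 matricization in (1) simply places the frontal slices side by side. As a quick illustration (our own sketch, not code from the paper), the following NumPy snippet builds $\mathbf{Y}_{(1)}$ both by explicit concatenation and by an equivalent Fortran-order reshape:

```python
import numpy as np

I, J, K = 3, 4, 5
Y = np.random.rand(I, J, K)

# Eq. (1): Y_(1) = [Y_1  Y_2  ...  Y_K], frontal slices placed side by side.
Y1 = np.concatenate([Y[:, :, k] for k in range(K)], axis=1)      # I x (J*K)

# Equivalent one-liner: a Fortran-order reshape keeps j as the fast index.
assert np.allclose(Y1, Y.reshape(I, J * K, order="F"))
print(Y1.shape)
```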
Fig. 2. Illustration of the standard PARAFAC (dashed arrow) and the grid PARAFAC for large-scale tensors (solid arrows) in two stages: PARAFAC for the sub-tensors (fast learning rules, parallel computing), followed by reconstruction of the full factors.
Multiplication of a tensor by matrices in all modes but mode $n$ is denoted as

$$\underline{\mathbf{G}} \times_{-n} \{\mathbf{A}\} = \underline{\mathbf{G}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_{n-1} \mathbf{A}^{(n-1)} \times_{n+1} \mathbf{A}^{(n+1)} \cdots \times_N \mathbf{A}^{(N)}. \qquad (7)$$

The above notation is adopted from [2]. Mode-$n$ multiplication of a tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ by a vector $\mathbf{a} \in \mathbb{R}^{I_n}$ is denoted by

$$\underline{\mathbf{Z}} = \underline{\mathbf{Y}} \,\bar{\times}_n\, \mathbf{a} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N}, \qquad (8)$$

and the tensor-vector product of a tensor $\underline{\mathbf{Y}}$ with a set of $N$ column vectors $\{\mathbf{a}\} = [\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \ldots, \mathbf{a}^{(N)}]$ is given by

$$\underline{\mathbf{Y}} \,\bar{\times}\, \{\mathbf{a}\} = \underline{\mathbf{Y}} \,\bar{\times}_1\, \mathbf{a}^{(1)} \,\bar{\times}_2\, \mathbf{a}^{(2)} \cdots \bar{\times}_N\, \mathbf{a}^{(N)}. \qquad (9)$$

PARAFAC [7] can be formulated as follows: "Factorize a given $N$-th order data tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ into a set of $N$ component matrices (factors) $\mathbf{A}^{(n)} = [\mathbf{a}^{(n)}_1, \mathbf{a}^{(n)}_2, \ldots, \mathbf{a}^{(n)}_J] \in \mathbb{R}^{I_n \times J}$, $n = 1, 2, \ldots, N$, representing the common (loading) factors", that is,

$$\underline{\mathbf{Y}} \approx \sum_{j=1}^{J} \mathbf{a}^{(1)}_j \circ \mathbf{a}^{(2)}_j \circ \cdots \circ \mathbf{a}^{(N)}_j = [\![ \mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \ldots, \mathbf{A}^{(N)} ]\!] = [\![ \{\mathbf{A}\} ]\!] = \hat{\underline{\mathbf{Y}}}, \qquad (10)$$

where the symbol "$\circ$" denotes the outer product, and we assume unit-length components $\|\mathbf{a}^{(n)}_j\|_p = 1$ for $n = 1, 2, \ldots, N-1$, $j = 1, 2, \ldots, J$, and $p = 1, 2$ (see Fig. 2). The tensor $\hat{\underline{\mathbf{Y}}}$ is an approximation of the data tensor $\underline{\mathbf{Y}}$. Alternatively, the PARAFAC model can be written in tensor notation as

$$\underline{\mathbf{Y}} = \underline{\mathbf{I}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}, \qquad (11)$$

where $\underline{\mathbf{I}}$ is an identity tensor. The mode-$n$ matricization of the tensor (11) can be represented by a set of matrix factorizations $\mathbf{Y}_{(n)} \approx \mathbf{A}^{(n)} (\{\mathbf{A}\}^{\odot_{-n}})^T$, whose least-squares solution gives the standard ALS learning rule

$$\mathbf{A}^{(n)} \leftarrow \mathbf{Y}_{(n)} \{\mathbf{A}\}^{\odot_{-n}} \big(\{\mathbf{A}^T\mathbf{A}\}^{\circledast_{-n}}\big)^{-1}, \qquad (14)$$

where $\mathbf{Y}_{(n)}$ is the mode-$n$ matricized version of the tensor $\underline{\mathbf{Y}}$, and the Khatri–Rao product of all factors but $\mathbf{A}^{(n)}$ is given by

$$\{\mathbf{A}\}^{\odot_{-n}} = \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)}.$$

For a large-scale tensor, the mode-$n$ matricized version $\mathbf{Y}_{(n)}$ of size $I_n \times (\prod_{k \neq n} I_k)$ is often a huge matrix, and the Khatri–Rao product $\{\mathbf{A}\}^{\odot_{-n}}$ returns a tall matrix of size $(\prod_{k \neq n} I_k) \times J$ in each iteration step. Hence, this learning rule demands a high computational cost and, of course, a large amount of memory for large-scale data. The ALS algorithm cannot decompose a dense data tensor having billions of entries. To deal with such large datasets, instead of computing with the full tensor in (14), we can reduce the number of columns of $\mathbf{Y}_{(n)}$ to some sampled vectors (tubes) along each mode $n$ which satisfy specific criteria [1,8]. Recently, the CUR decomposition [9] provided a fast approximation of a raw matrix based on some sampled rows, columns, and their intersection entries. This method was also extended to multiway arrays to select tubes along each mode. However, both the block-wise and the CUR approaches have not completely resolved the problem for very large data, since they still have to compute Khatri–Rao products.

2. Dynamic tensor factorization

For simplicity of explanation of our basic concept, we shall first consider the grid PARAFAC for a three-dimensional tensor partitioned into only two sub-tensors. This model can also be considered as Dynamic Tensor Analysis (Factorization), in which data is augmented along a single mode.

Problem 1 (Dynamic tensor factorization). Consider two data tensors $\underline{\mathbf{Y}}^{(1)} \in \mathbb{R}^{I \times J \times K_1}$, $\underline{\mathbf{Y}}^{(2)} \in \mathbb{R}^{I \times J \times K_2}$ with their approximate PARAFAC models (see Fig. 1):

$$\underline{\mathbf{Y}}^{(k)} \approx \underline{\mathbf{I}} \times_1 \mathbf{A}^{(k)} \times_2 \mathbf{B}^{(k)} \times_3 \mathbf{C}^{(k)}, \quad k = 1, 2, \qquad (15)$$

where $\mathbf{A}^{(k)} \in \mathbb{R}^{I \times R_k}$, $\mathbf{B}^{(k)} \in \mathbb{R}^{J \times R_k}$ and $\mathbf{C}^{(k)} \in \mathbb{R}^{K_k \times R_k}$. The problem is to find the PARAFAC factors of the concatenated tensor $\underline{\mathbf{Y}} \in \mathbb{R}^{I \times J \times K}$ (with $K = K_1 + K_2$),

$$\underline{\mathbf{Y}} \approx \underline{\mathbf{I}} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}, \qquad (16)$$

without repeating all the computation from the beginning, but using the already known partial results (factors).

The easiest way is to factorize the concatenated tensor again. This method provides accurate results; however, for large-scale data it is not a suitable solution due to its high computational cost. In this section, we present three methods to deal with this problem. The final learning rules of all the methods are identical:

1. By minimizing the cost function for the concatenated tensor.
2. By adopting or modifying the ALS algorithm for the concatenated tensor represented by block matrices.
3. By concatenating factors of sub-tensors and then reducing the number of components.

2.1. Approach 1: By minimizing the cost function

Generally, a tensor factorization can be obtained by minimizing a cost function of the concatenated tensor and its approximation, that is,

$$D = \frac{1}{2}\big\|\underline{\mathbf{Y}} - \underline{\mathbf{I}} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}\big\|_F^2 = \frac{1}{2}\sum_{k=1}^{2}\big\|\underline{\mathbf{Y}}^{(k)} - \underline{\mathbf{I}} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}^{(k)}\big\|_F^2 = \frac{1}{2}\sum_{k=1}^{2}\big\|\mathbf{Y}^{(k)}_{(1)} - \mathbf{A}(\mathbf{C}^{(k)} \odot \mathbf{B})^T\big\|_F^2. \qquad (19)$$

In order to find the learning rule for the factor $\mathbf{A}$, we formulate the cost function for the mode-1 matricization (19) and compute its gradient with respect to $\mathbf{A}$:

$$\nabla_{\mathbf{A}} D = \sum_{k=1}^{2}\Big({-\mathbf{Y}^{(k)}_{(1)}}(\mathbf{C}^{(k)} \odot \mathbf{B}) + \mathbf{A}(\mathbf{C}^{(k)} \odot \mathbf{B})^T(\mathbf{C}^{(k)} \odot \mathbf{B})\Big) = \sum_{k=1}^{2}\Big({-\mathbf{Y}^{(k)}_{(1)}}(\mathbf{C}^{(k)} \odot \mathbf{B}) + \mathbf{A}\,(\mathbf{C}^{(k)T}\mathbf{C}^{(k)}) \circledast (\mathbf{B}^T\mathbf{B})\Big) = -\sum_{k=1}^{2}\mathbf{Y}^{(k)}_{(1)}(\mathbf{C}^{(k)} \odot \mathbf{B}) + \mathbf{A}\,(\mathbf{C}^T\mathbf{C}) \circledast (\mathbf{B}^T\mathbf{B}). \qquad (20)$$
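Setting the gradient (20) to zero gives the least-squares update for $\mathbf{A}$, in which the Khatri–Rao products only multiply the sub-tensor unfoldings while the normal matrix is the small $R \times R$ Hadamard product of Gram matrices. The sketch below (our own NumPy illustration with hypothetical helper names, not the authors' implementation) spells this out for a tensor concatenated along mode 3:

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker (Khatri-Rao) product of C (K x R) and B (J x R)."""
    K, R = C.shape
    J, _ = B.shape
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def update_A(Y1_blocks, C_blocks, B):
    """Least-squares update of A from sub-tensors (cf. Eqs. (19)-(20)).

    Y1_blocks[k]: mode-1 unfolding of sub-tensor k, shape I x (J*K_k).
    C_blocks[k]:  block of C associated with sub-tensor k, shape K_k x R.
    """
    I = Y1_blocks[0].shape[0]
    R = B.shape[1]
    numer = np.zeros((I, R))
    for Y1k, Ck in zip(Y1_blocks, C_blocks):
        numer += Y1k @ khatri_rao(Ck, B)          # sum_k Y^(k)_(1) (C^(k) kr B)
    C = np.vstack(C_blocks)
    gram = (C.T @ C) * (B.T @ B)                  # (C^T C) had (B^T B), only R x R
    return np.linalg.solve(gram, numer.T).T       # A; gram is symmetric
```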
Setting the gradient (20) to zero and grouping the terms per sub-tensor yields the update rule for $\mathbf{A}$,

$$\mathbf{A} \leftarrow \left(\sum_{k=1}^{2} \mathbf{A}^{(k)}\big(\mathbf{P}^{(k)} \oslash (\mathbf{A}^{(k)T}\mathbf{A})\big)\right)\big(\mathbf{Q} \oslash (\mathbf{A}^T\mathbf{A})\big)^{-1}, \qquad (21)$$

where

$$\mathbf{P}^{(k)} = (\mathbf{A}^{(k)T}\mathbf{A}) \circledast (\mathbf{B}^{(k)T}\mathbf{B}) \circledast (\mathbf{C}^{(k)T}\mathbf{C}), \qquad (22)$$

$$\mathbf{Q} = (\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{C}^T\mathbf{C}). \qquad (23)$$

Similarly, the update rule for $\mathbf{B}$ is given by

$$\mathbf{B} \leftarrow \left(\sum_{k=1}^{2} \mathbf{B}^{(k)}\big(\mathbf{P}^{(k)} \oslash (\mathbf{B}^{(k)T}\mathbf{B})\big)\right)\big(\mathbf{Q} \oslash (\mathbf{B}^T\mathbf{B})\big)^{-1}. \qquad (24)$$

The factor $\mathbf{C}$ is estimated by partially updating $\mathbf{C}^{(1)}$ and $\mathbf{C}^{(2)}$. The ALS update rule for $\mathbf{C}^{(k)}$ from the tensor $\underline{\mathbf{Y}}^{(k)}$ is given by

$$\mathbf{C}^{(k)} \leftarrow \mathbf{Y}^{(k)}_{(3)}(\mathbf{B} \odot \mathbf{A})\big((\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{A}^T\mathbf{A})\big)^{-1} = \mathbf{C}^{(k)}(\mathbf{B}^{(k)} \odot \mathbf{A}^{(k)})^T(\mathbf{B} \odot \mathbf{A})\big((\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{A}^T\mathbf{A})\big)^{-1} = \mathbf{C}^{(k)}\big((\mathbf{B}^{(k)T}\mathbf{B}) \circledast (\mathbf{A}^{(k)T}\mathbf{A})\big)\big((\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{A}^T\mathbf{A})\big)^{-1} = \mathbf{C}^{(k)}\big(\mathbf{P}^{(k)} \oslash (\mathbf{C}^{(k)T}\mathbf{C})\big)\big(\mathbf{Q} \oslash (\mathbf{C}^T\mathbf{C})\big)^{-1}. \qquad (25)$$

We note that the matrices $\mathbf{P}^{(k)} \in \mathbb{R}^{R_k \times R}$ and $\mathbf{Q} \in \mathbb{R}^{R \times R}$ are of relatively small size. Hence, the learning rules (21), (24) and (25) have a low computational cost.

2.2. Approach 2: By modifying the ALS learning rule

The factors $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ can be estimated by applying the ALS rule (14) to the concatenated tensor $\underline{\mathbf{Y}}$:

$$\mathbf{A} \leftarrow \mathbf{Y}_{(1)}(\mathbf{C} \odot \mathbf{B})\big((\mathbf{C}^T\mathbf{C}) \circledast (\mathbf{B}^T\mathbf{B})\big)^{-1}. \qquad (26)$$

By using the following relation, we can convert this ALS learning rule into the learning rule given in (21):

$$\mathbf{Y}_{(1)}(\mathbf{C} \odot \mathbf{B}) = [\mathbf{Y}^{(1)}_{(1)}\ \mathbf{Y}^{(2)}_{(1)}]\left(\begin{bmatrix}\mathbf{C}^{(1)}\\ \mathbf{C}^{(2)}\end{bmatrix} \odot \mathbf{B}\right) = [\mathbf{Y}^{(1)}_{(1)}\ \mathbf{Y}^{(2)}_{(1)}]\begin{bmatrix}\mathbf{C}^{(1)} \odot \mathbf{B}\\ \mathbf{C}^{(2)} \odot \mathbf{B}\end{bmatrix} = \sum_{k=1}^{2}\mathbf{Y}^{(k)}_{(1)}(\mathbf{C}^{(k)} \odot \mathbf{B}) = \sum_{k=1}^{2}\mathbf{A}^{(k)}(\mathbf{C}^{(k)} \odot \mathbf{B}^{(k)})^T(\mathbf{C}^{(k)} \odot \mathbf{B}) = \sum_{k=1}^{2}\mathbf{A}^{(k)}\big((\mathbf{C}^{(k)T}\mathbf{C}^{(k)}) \circledast (\mathbf{B}^{(k)T}\mathbf{B})\big). \qquad (27)$$

2.3. Approach 3: By concatenating sub-factors

Another method of finding the factors of the concatenated tensor $\underline{\mathbf{Y}}$ is to build up a factorization from the concatenated factors. The following lemma shows this factorization.

Lemma 1 (Concatenation of factors). The tensor $\underline{\mathbf{Y}}$ is factorized by factors with $\tilde{R} = R_1 + R_2$ components,

$$\tilde{\mathbf{A}} = [\mathbf{A}^{(1)}\ \ \mathbf{A}^{(2)}], \qquad (28)$$

$$\tilde{\mathbf{B}} = [\mathbf{B}^{(1)}\ \ \mathbf{B}^{(2)}], \qquad (29)$$

$$\tilde{\mathbf{C}} = \begin{bmatrix}\mathbf{C}^{(1)} & \mathbf{0}\\ \mathbf{0} & \mathbf{C}^{(2)}\end{bmatrix}. \qquad (30)$$

This leads to a tensor factorization with $\tilde{\mathbf{A}}$, $\tilde{\mathbf{B}}$ and $\tilde{\mathbf{C}}$ for the tensor $\underline{\mathbf{Y}}$, given by

$$\underline{\mathbf{Y}} = \underline{\mathbf{I}} \times_1 \tilde{\mathbf{A}} \times_2 \tilde{\mathbf{B}} \times_3 \tilde{\mathbf{C}}. \qquad (33)$$

This completes the proof. $\square$

Lemma 1 ensures that any large-scale tensor can be factorized by concatenating the factors of its sub-tensors. However, the combined factors often have a higher rank than we need. The next update rule helps us to find $R$ componential factors from the factors $\tilde{\mathbf{A}}$, $\tilde{\mathbf{B}}$, $\tilde{\mathbf{C}}$.

The factor $\mathbf{A}$ can be updated by using the ALS learning rule (14) for the tensor $\underline{\mathbf{Y}}$ as

$$\mathbf{A} \leftarrow \mathbf{Y}_{(1)}(\mathbf{C} \odot \mathbf{B})\big((\mathbf{C}^T\mathbf{C}) \circledast (\mathbf{B}^T\mathbf{B})\big)^{-1} = \tilde{\mathbf{A}}(\tilde{\mathbf{C}} \odot \tilde{\mathbf{B}})^T(\mathbf{C} \odot \mathbf{B})\big((\mathbf{C}^T\mathbf{C}) \circledast (\mathbf{B}^T\mathbf{B})\big)^{-1} = \tilde{\mathbf{A}}\big((\tilde{\mathbf{C}}^T\mathbf{C}) \circledast (\tilde{\mathbf{B}}^T\mathbf{B})\big)\big((\mathbf{C}^T\mathbf{C}) \circledast (\mathbf{B}^T\mathbf{B})\big)^{-1} = \tilde{\mathbf{A}}\big(\mathbf{P} \oslash (\tilde{\mathbf{A}}^T\mathbf{A})\big)\big(\mathbf{Q} \oslash (\mathbf{A}^T\mathbf{A})\big)^{-1}, \qquad (34)$$

where

$$\mathbf{P} = (\tilde{\mathbf{A}}^T\mathbf{A}) \circledast (\tilde{\mathbf{B}}^T\mathbf{B}) \circledast (\tilde{\mathbf{C}}^T\mathbf{C}). \qquad (35)$$

We note that the learning rules (21) and (34) are identical due to the following relations:

$$\tilde{\mathbf{C}}^T\mathbf{C} = \begin{bmatrix}\mathbf{C}^{(1)} & \mathbf{0}\\ \mathbf{0} & \mathbf{C}^{(2)}\end{bmatrix}^T\begin{bmatrix}\mathbf{C}^{(1)}\\ \mathbf{C}^{(2)}\end{bmatrix} = \begin{bmatrix}\mathbf{C}^{(1)T}\mathbf{C}^{(1)}\\ \mathbf{C}^{(2)T}\mathbf{C}^{(2)}\end{bmatrix}, \qquad \tilde{\mathbf{B}}^T\mathbf{B} = [\mathbf{B}^{(1)}\ \ \mathbf{B}^{(2)}]^T\mathbf{B} = \begin{bmatrix}\mathbf{B}^{(1)T}\mathbf{B}\\ \mathbf{B}^{(2)T}\mathbf{B}\end{bmatrix},$$

and

$$\tilde{\mathbf{A}}\big((\tilde{\mathbf{C}}^T\mathbf{C}) \circledast (\tilde{\mathbf{B}}^T\mathbf{B})\big) = \sum_{k=1}^{2}\mathbf{A}^{(k)}\big((\mathbf{C}^{(k)T}\mathbf{C}^{(k)}) \circledast (\mathbf{B}^{(k)T}\mathbf{B})\big).$$

3. Grid PARAFAC

We consider a tensor $\underline{\mathbf{Y}}$ divided into a grid of multiple sub-tensors $\underline{\mathbf{Y}}^{(k)}$ of size $I_{k_1} \times I_{k_2} \times \cdots \times I_{k_N}$, with $\sum_{k_n=1}^{K_n} I_{k_n} = I_n$, where the vector $k = [k_1, k_2, \ldots, k_N]$ indicates the sub-tensor index, $1 \le k_n \le K_n$, and $K_n$ is the number of sub-tensors along mode-$n$ (see Fig. 2). We factorize all the sub-tensors into PARAFAC sub-factors $\mathbf{U}^{(n)}_{(k)}$ in parallel mode. Finally, the full factors $\mathbf{A}^{(n)}$ of the whole tensor are estimated from these sub-factors with fast learning rules and parallel computing.

3.1. ALS algorithm for grid PARAFAC

Assuming that the tensor $\underline{\mathbf{Y}}$ can be approximated by $N$ factors $\mathbf{A}^{(n)}$, we split each factor $\mathbf{A}^{(n)}$ into $K_n$ parts,

$$\mathbf{A}^{(n)} = \big[\mathbf{A}^{(n)T}_{(1)}\ \ \mathbf{A}^{(n)T}_{(2)}\ \ \cdots\ \ \mathbf{A}^{(n)T}_{(K_n)}\big]^T, \qquad (36)$$

where the sub-factor $\mathbf{A}^{(n)}_{(k_n)} \in \mathbb{R}^{I_{k_n} \times J}$, $k_n = 1, 2, \ldots, K_n$.
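As a small numerical illustration of the reduction step (34)-(35), the sketch below (our own NumPy code with hypothetical names; it assumes the concatenated factors of Lemma 1 are available) refines the $R$-component factor $\mathbf{A}$ using only the small matrices $\mathbf{P}$ and $\mathbf{Q}$, with no Khatri–Rao product of tall matrices. The same pattern of Gram-Hadamard products and element-wise divisions reappears in the grid update derived next.

```python
import numpy as np

def refine_A(A_tilde, B_tilde, C_tilde, A, B, C):
    """One pass of Eq. (34): A <- A~ (P ./ (A~^T A)) (Q ./ (A^T A))^{-1}.

    A_tilde, B_tilde, C_tilde: concatenated factors (Lemma 1), R~ components.
    A, B, C: current estimates of the reduced factors, R components.
    """
    P = (A_tilde.T @ A) * (B_tilde.T @ B) * (C_tilde.T @ C)   # Eq. (35), R~ x R
    Q = (A.T @ A) * (B.T @ B) * (C.T @ C)                     # Eq. (23), R  x R
    T = A_tilde @ (P / (A_tilde.T @ A))                       # I x R
    S = Q / (A.T @ A)                                         # R x R
    return np.linalg.solve(S.T, T.T).T                        # T S^{-1}
```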
Lemma 2. A sub-tensor $\underline{\mathbf{Y}}^{(k)}$, $k = [k_1, k_2, \ldots, k_N]$, can be factorized by a set of $N$ sub-factors $\{\mathbf{A}_{(k)}\} = \{\mathbf{A}^{(1)}_{(k_1)}, \mathbf{A}^{(2)}_{(k_2)}, \ldots, \mathbf{A}^{(N)}_{(k_N)}\}$:

$$\underline{\mathbf{Y}}^{(k)} = \underline{\mathbf{I}} \times_1 \mathbf{A}^{(1)}_{(k_1)} \times_2 \mathbf{A}^{(2)}_{(k_2)} \cdots \times_N \mathbf{A}^{(N)}_{(k_N)} = [\![ \{\mathbf{A}_{(k)}\} ]\!]. \qquad (37)$$

Proof. The proof follows directly from the definition of the PARAFAC model for each entry $y_{i}$, $i = [i_1, i_2, \ldots, i_N]$ with $\sum_{l=1}^{k_n-1} I_{k_l} < i_n \le \sum_{l=1}^{k_n} I_{k_l}$:

$$y_{i} = \sum_{j} a^{(1)}_{i_1 j}\, a^{(2)}_{i_2 j} \cdots a^{(N)}_{i_N j}. \qquad \square \qquad (38)$$

The cost function for the grid factorization is

$$D = \frac{1}{2}\big\|\underline{\mathbf{Y}} - \underline{\mathbf{I}} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)}\big\|_F^2 \qquad (39)$$

$$= \frac{1}{2}\sum_{k_1=1}^{K_1}\cdots\sum_{k_N=1}^{K_N}\big\|\underline{\mathbf{Y}}^{(k)} - [\![ \{\mathbf{A}_{(k)}\} ]\!]\big\|_F^2 = \frac{1}{2}\sum_{k}\big\|\mathbf{Y}^{(k)}_{(n)} - \mathbf{A}^{(n)}_{(k_n)}\big(\{\mathbf{A}_{(k)}\}^{\odot_{-n}}\big)^T\big\|_F^2, \qquad (40)$$

whose gradient with respect to the sub-factor $\mathbf{A}^{(n)}_{(k_n)}$ is given by

$$\nabla_{\mathbf{A}^{(n)}_{(k_n)}} D = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\Big({-\mathbf{Y}^{(k)}_{(n)}}\{\mathbf{A}_{(k)}\}^{\odot_{-n}} + \mathbf{A}^{(n)}_{(k_n)}\big(\{\mathbf{A}_{(k)}\}^{\odot_{-n}}\big)^T\{\mathbf{A}_{(k)}\}^{\odot_{-n}}\Big) = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\Big({-\mathbf{Y}^{(k)}_{(n)}}\{\mathbf{A}_{(k)}\}^{\odot_{-n}} + \mathbf{A}^{(n)}_{(k_n)}\{\mathbf{A}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast_{-n}}\Big), \qquad (41)$$

where the summations run over all sub-tensor indices $k$ whose $n$-th entry equals $k_n$. Using the sub-tensor factorizations $\underline{\mathbf{Y}}^{(k)} \approx [\![ \{\mathbf{U}_{(k)}\} ]\!]$ and the matrices $\mathbf{P}_{(k)} = \{\mathbf{U}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast} \in \mathbb{R}^{J_k \times J}$ and $\mathbf{Q}_{(k)} = \{\mathbf{A}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast} \in \mathbb{R}^{J \times J}$, the first term can be expressed through Hadamard products and divisions of small matrices,

$$\mathbf{Y}^{(k)}_{(n)}\{\mathbf{A}_{(k)}\}^{\odot_{-n}} \approx \mathbf{U}^{(n)}_{(k)}\big(\{\mathbf{U}_{(k)}\}^{\odot_{-n}}\big)^T\{\mathbf{A}_{(k)}\}^{\odot_{-n}} = \mathbf{U}^{(n)}_{(k)}\{\mathbf{U}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast_{-n}} = \mathbf{U}^{(n)}_{(k)}\big(\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big). \qquad (44)$$

Setting the gradient (41) to zero then yields the ALS update rule for the sub-factor $\mathbf{A}^{(n)}_{(k_n)}$:

$$\mathbf{A}^{(n)}_{(k_n)} \leftarrow \left(\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{U}^{(n)}_{(k)}\big(\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big)\right)\left(\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)})\right)^{-1} \qquad (45)$$

$$= \mathbf{T}\,\mathbf{S}^{-1}, \qquad (46)$$

where the matrices $\mathbf{T}$ and $\mathbf{S}$ are computed for the specific sub-factor $\mathbf{A}^{(n)}_{(k_n)}$:

$$\mathbf{T} = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{U}^{(n)}_{(k)}\big(\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big), \qquad (47)$$

$$\mathbf{S} = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)}). \qquad (48)$$

Each term of the two summations in (47) and (48) calculates Hadamard divisions and performs all operations on small-sized matrices, instead of Khatri–Rao products of tall matrices. The matrices $\mathbf{P}_{(k)} \in \mathbb{R}^{J_k \times J}$ and $\mathbf{Q}_{(k)} \in \mathbb{R}^{J \times J}$ need to be calculated only once, and can be quickly updated after estimating the sub-factors $\mathbf{A}^{(n)}_{(k_n)}$. Moreover, the matrices $\mathbf{T}$ and $\mathbf{S}$ can be calculated through a parallel loop.

The pseudo-code of the new ALS algorithm is given in Algorithm 1. The normalization of the components to unit-length vectors is not explicitly displayed in Algorithm 1. The parallel FOR loop, denoted by "parfor", is available with the MATLAB Parallel Computing Toolbox.
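The per-block step (45)-(48) of Algorithm 1 can be sketched as follows. This is a minimal single-block NumPy illustration under our own assumed data layout (it is not the authors' MATLAB pseudo-code): `sub_factors` maps each grid index tuple to the PARAFAC factors of that sub-tensor, and `A_blocks[m][i]` holds the $i$-th block of the current full factor $\mathbf{A}^{(m)}$.

```python
import numpy as np

def hadamard_gram(Us, As):
    """{U^T A} with Hadamard products over all modes (the P_(k) / Q_(k) matrices)."""
    G = Us[0].T @ As[0]
    for U, A in zip(Us[1:], As[1:]):
        G = G * (U.T @ A)
    return G

def grid_als_block_update(sub_factors, A_blocks, n, kn):
    """ALS step (45)-(48) for the block A^(n)_(kn); only sub-tensors whose
    n-th grid index equals kn contribute to T and S."""
    A_kn = A_blocks[n][kn]
    T = np.zeros_like(A_kn)
    S = np.zeros((A_kn.shape[1], A_kn.shape[1]))
    for k, Us in sub_factors.items():                    # k = (k_1, ..., k_N)
        if k[n] != kn:
            continue
        As = [A_blocks[m][k[m]] for m in range(len(k))]  # factor blocks matching k
        P_k = hadamard_gram(Us, As)                      # {U_k^T A_k}, J_k x J
        Q_k = hadamard_gram(As, As)                      # {A_k^T A_k}, J  x J
        T += Us[n] @ (P_k / (Us[n].T @ A_kn))            # Eq. (47)
        S += Q_k / (A_kn.T @ A_kn)                       # Eq. (48)
    return np.linalg.solve(S.T, T.T).T                   # Eq. (46): T S^{-1}
```

The loop over sub-tensors is exactly the part that can run as a "parfor" loop, since each term of T and S depends only on one sub-tensor's factors.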
4. Grid PARAFAC with nonnegative constraints

Factorization of nonnegative tensors such as spectral data, images, or data from social networks and email surveillance requires nonnegative factors in order to provide meaningful components and a physical interpretation. Many efficient algorithms have been proposed or generalized from algorithms for nonnegative matrix factorization (NMF). The ALS [1], HALS [8,10] or multiplicative LS (least squares) algorithms [1,11] can be directly derived from the Frobenius cost function (40). However, most efficient algorithms for nonnegative tensor factorization cannot deal with very large-scale data. In this section, we extend the grid PARAFAC to nonnegative factors and propose algorithms based on minimizing the same cost function (40). Sub-tensors are factorized by PARAFAC models, and nonnegative factors for the full tensor are then estimated from the factors of the sub-tensors.

4.1. ALS algorithm for grid NTF

A simple extension is to enforce nonnegativity (or, strictly speaking, positivity) constraints on the factors updated by the ALS (45):

$$\mathbf{A}^{(n)}_{(k_n)} \leftarrow \left[\left(\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{U}^{(n)}_{(k)}\big(\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big)\right)\left(\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)})\right)^{-1}\right]_{+}, \qquad (49)$$

where $[x]_{+} = \max\{\varepsilon, x\}$ is a half-wave rectifying nonlinear projection and $\varepsilon$ is a small positive constant. The ALS updates (42), (49) exploit not only gradient but also Hessian information, which gives some insight into why the ALS algorithm has a good convergence rate. Unfortunately, the standard ALS algorithm suffers from unstable convergence properties, demands a high computational cost due to the matrix inversion, and often returns suboptimal solutions. However, for real-world data, a proper selection of the regularization parameter $\lambda$ decides the performance of the final result.

4.2. Multiplicative algorithm for grid NTF

From the gradients (41), we apply a method similar to the derivation of the multiplicative LS algorithms for NMF [12-16] to obtain the update rule

$$\mathbf{A}^{(n)}_{(k_n)} \leftarrow \mathbf{A}^{(n)}_{(k_n)} \circledast \left[\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{U}^{(n)}_{(k)}\big(\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big)\right]_{+} \oslash \left(\mathbf{A}^{(n)}_{(k_n)}\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)})\right), \qquad (51)$$

where the matrices $\mathbf{T}$ and $\mathbf{S}$ are defined in (47) and (48) and are computed through a parallel loop in Steps 13 and 15 of Algorithm 1; in compact form the update reads $\mathbf{A}^{(n)}_{(k_n)} \leftarrow \mathbf{A}^{(n)}_{(k_n)} \circledast [\mathbf{T}]_{+} \oslash (\mathbf{A}^{(n)}_{(k_n)}\mathbf{S})$ (52). Replacing Step 16 by the expression in (52) gives the pseudo-code of the multiplicative algorithm for grid NTF. The multiplicative algorithms have a relatively low complexity, but they are characterized by rather slow convergence and they sometimes converge to spurious local minima.

4.3. Hierarchical ALS algorithm for grid NTF

If the tensor $\underline{\mathbf{Y}}$ in (42) is approximated by a rank-one tensor, that is, the factors $\mathbf{A}^{(n)}$ have only one component, the matrix inversion simplifies into a division by a scalar, and the ALS (42) to update a part $k_n$ of a factor $n$ is given by

$$\mathbf{a}^{(n)}_{(k_n)} \leftarrow \frac{\displaystyle\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\underline{\mathbf{Y}}^{(k)}\,\bar{\times}_{-n}\,\{\mathbf{a}_{(k)}\}}{\displaystyle\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\{\mathbf{a}^T_{(k)}\mathbf{a}_{(k)}\}^{\circledast_{-n}}}. \qquad (53)$$

Therefore, if we formulate an ALS update rule for a single component of the sub-factor $\mathbf{A}^{(n)}_{(k_n)}$, the learning rule does not require matrix inversion. Hence, it reduces the computational cost and is more stable.

Assume that we estimate the $r$-th component of the sub-factor $\mathbf{A}^{(n)}_{(k_n)}$, denoted by $[\mathbf{A}^{(n)}_{(k_n)}]_r = \mathbf{a}^{(k_n)}_r$, $n = 1, 2, \ldots, N$, $k_n = 1, 2, \ldots, K_n$, $r = 1, 2, \ldots, R$. This component only appears in the factorizations of the sub-tensors $\underline{\mathbf{Y}}^{(k)}$ whose $n$-th index entry equals $k_n$, i.e., $k = [j_1, \ldots, j_{n-1}, k_n, j_{n+1}, \ldots, j_N]$, $j_m = 1, \ldots, K_m$. We split the approximation of each such sub-tensor into two parts: one consists of all rank-one tensors in which the specific component $\mathbf{a}^{(k_n)}_r$ is not involved,

$$\underline{\mathbf{Y}}^{(k)}_{(-r)} = \sum_{j \ne r}\mathbf{a}^{(k_1)}_j \circ \mathbf{a}^{(k_2)}_j \circ \cdots \circ \mathbf{a}^{(k_N)}_j, \qquad (54)$$

and the other is a rank-one tensor $\underline{\mathbf{Y}}^{(k)}_{(+r)}$ built up from $\mathbf{a}^{(k_n)}_r$. To exploit the learning rule (53), we define a new residual tensor which is approximated by the rank-one tensor $\underline{\mathbf{Y}}^{(k)}_{(+r)}$:

$$\tilde{\underline{\mathbf{Y}}}^{(k)}_{(+r)} = \underline{\mathbf{Y}}^{(k)} - \underline{\mathbf{Y}}^{(k)}_{(-r)} = \mathbf{a}^{(k_1)}_r \circ \mathbf{a}^{(k_2)}_r \circ \cdots \circ \mathbf{a}^{(k_N)}_r + \underline{\mathbf{E}}. \qquad (57)$$

The learning rule to update the component $\mathbf{a}^{(k_n)}_r$ is given by

$$\mathbf{a}^{(k_n)}_r \leftarrow \frac{\displaystyle\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\tilde{\underline{\mathbf{Y}}}^{(k)}_{(+r)}\,\bar{\times}_{-n}\,\{\mathbf{a}^{(k)}_r\}}{\displaystyle\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\{\mathbf{a}^{(k)T}_r\mathbf{a}^{(k)}_r\}^{\circledast_{-n}}}, \qquad (58)$$

where $\{\mathbf{a}^{(k)}_r\} = \{\mathbf{a}^{(k_1)}_r, \mathbf{a}^{(k_2)}_r, \ldots, \mathbf{a}^{(k_N)}_r\}$. Each term in the numerator of (58) is rewritten as

$$\tilde{\underline{\mathbf{Y}}}^{(k)}_{(+r)}\,\bar{\times}_{-n}\,\{\mathbf{a}^{(k)}_r\} = \big(\underline{\mathbf{Y}}^{(k)} - \hat{\underline{\mathbf{Y}}}^{(k)} + \underline{\mathbf{Y}}^{(k)}_{(+r)}\big)\,\bar{\times}_{-n}\,\{\mathbf{a}^{(k)}_r\} = \mathbf{U}^{(n)}_{(k)}\big[\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big]_r - \mathbf{A}^{(n)}_{(k_n)}\big[\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)})\big]_r + \mathbf{a}^{(k_n)}_r\{\mathbf{a}^{(k)T}_r\mathbf{a}^{(k)}_r\}^{\circledast_{-n}}, \qquad (59)$$

where $[\mathbf{A}]_r = \mathbf{a}_r$ denotes the $r$-th column vector of the matrix $\mathbf{A}$.
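Assuming the matrices $\mathbf{T}$ and $\mathbf{S}$ of (47)-(48) have already been accumulated as in the earlier sketch, the nonnegative variants only change the last step of the block update. The snippet below is our own hedged illustration of the rectified ALS (49), the multiplicative rule (51)-(52), and the component-wise rule obtained by substituting (59) into (58); the names and the explicit nonnegativity projection in the component-wise step are ours.

```python
import numpy as np

EPS = 1e-12   # small positive constant, cf. [x]_+ = max(eps, x)

def als_nonneg_step(T, S):
    """Rectified ALS (49): project the unconstrained solution T S^{-1}."""
    return np.maximum(np.linalg.solve(S.T, T.T).T, EPS)

def multiplicative_step(A_kn, T, S):
    """Multiplicative grid-NTF rule (cf. (51)-(52)): A <- A * [T]_+ / (A S)."""
    return A_kn * np.maximum(T, EPS) / np.maximum(A_kn @ S, EPS)

def hals_step(A_kn, T, S, w):
    """Component-wise (HALS-type) rule from (58)-(59); w[r] is the scalar
    weight given by the denominator of (58) for column r."""
    A = A_kn.copy()
    for r in range(A.shape[1]):
        a_r = A[:, r] + (T[:, r] - A @ S[:, r]) / w[r]
        A[:, r] = np.maximum(a_r, EPS)     # keep the component nonnegative
    return A
```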
Learning rule (58) can be expressed in the compact form

$$\mathbf{a}^{(k_n)}_r \leftarrow \mathbf{a}^{(k_n)}_r + \frac{1}{w_{k_n}}\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{U}^{(n)}_{(k)}\big[\mathbf{P}_{(k)} \oslash (\mathbf{U}^{(n)T}_{(k)}\mathbf{A}^{(n)}_{(k_n)})\big]_r - \frac{1}{w_{k_n}}\sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{A}^{(n)}_{(k_n)}\big[\mathbf{Q}_{(k)} \oslash (\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)})\big]_r = \mathbf{a}^{(k_n)}_r + \frac{1}{w_{k_n}}\mathbf{t}_r - \frac{1}{w_{k_n}}\mathbf{A}^{(n)}_{(k_n)}\mathbf{s}_r, \qquad (60)$$

where the scalar weight is $w_{k_n} = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\{\mathbf{a}^{(k)T}_r\mathbf{a}^{(k)}_r\}^{\circledast_{-n}}$, and $\mathbf{T}$ and $\mathbf{S}$ are defined in (47) and (48), respectively.

Finally, (60) is the update rule for a component of a sub-factor $\mathbf{A}^{(n)}_{(k_n)}$. By replacing Step 16 in Algorithm 1 with (60), we obtain the pseudo-code of the Hierarchical ALS algorithm for grid NTF.

5. Stopping criterion

The stopping criterion plays an important role in identifying the convergence of a factorization. For simplicity, the cost function value (13) is usually used as a stopping criterion. The FIT rate, $\mathrm{FIT}(\%) = 1 - \|\underline{\mathbf{Y}} - \hat{\underline{\mathbf{Y}}}\|_F^2 / \|\underline{\mathbf{Y}}\|_F^2$, can also be used; since its computation is similar, we only discuss the Frobenius norm. For a large tensor, an explicit computation of the cost function value (13) is impossible due to the excessive memory required to build up the approximate tensor $\hat{\underline{\mathbf{Y}}}$. In this section, we derive a fast computation of the stopping criterion applied to the grid PARAFAC. The Frobenius norm of a raw sub-tensor and its approximation is given by

$$D^{(k)} = \big\|\underline{\mathbf{Y}}^{(k)} - \hat{\underline{\mathbf{Y}}}^{(k)}\big\|_F^2 = \big\|\underline{\mathbf{Y}}^{(k)}\big\|_F^2 + \big\|\hat{\underline{\mathbf{Y}}}^{(k)}\big\|_F^2 - 2\big\langle\underline{\mathbf{Y}}^{(k)}, \hat{\underline{\mathbf{Y}}}^{(k)}\big\rangle, \qquad (61)$$

where $\langle\underline{\mathbf{Y}}, \hat{\underline{\mathbf{Y}}}\rangle$ is the inner product of two same-sized tensors,

$$\big\langle\underline{\mathbf{Y}}^{(k)}, \hat{\underline{\mathbf{Y}}}^{(k)}\big\rangle = \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N} y^{(k)}_{i_1 \ldots i_N}\,\hat{y}^{(k)}_{i_1 \ldots i_N} = \mathrm{vec}\big(\mathbf{Y}^{(k)}_{(N)}\big)^T\mathrm{vec}\big(\hat{\mathbf{Y}}^{(k)}_{(N)}\big). \qquad (62)$$

Each term in the expression (61) can be computed as follows:

$$\big\|\hat{\underline{\mathbf{Y}}}^{(k)}\big\|_F^2 = \big\|\mathrm{vec}\big(\hat{\underline{\mathbf{Y}}}^{(k)}\big)\big\|_2^2 = \big\|\{\mathbf{A}_{(k)}\}^{\odot}\,\mathbf{1}\big\|_2^2 = \mathbf{1}^T\{\mathbf{A}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast}\,\mathbf{1} = \mathbf{1}^T\mathbf{Q}_{(k)}\mathbf{1}, \qquad (63)$$

$$\big\langle\underline{\mathbf{Y}}^{(k)}, \hat{\underline{\mathbf{Y}}}^{(k)}\big\rangle = \mathbf{1}^T\big(\{\mathbf{U}_{(k)}\}^{\odot}\big)^T\{\mathbf{A}_{(k)}\}^{\odot}\,\mathbf{1} = \mathbf{1}^T\{\mathbf{U}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast}\,\mathbf{1} = \mathbf{1}^T\mathbf{P}_{(k)}\mathbf{1}. \qquad (64)$$

From (61), (63) and (64), we obtain a convenient and fast computation of the cost function,

$$D = \frac{1}{2}\sum_{k} D^{(k)} = \frac{1}{2}\sum_{k}\Big(\big\|\underline{\mathbf{Y}}^{(k)}\big\|_F^2 + \mathbf{1}^T\mathbf{Q}_{(k)}\mathbf{1} - 2\,\mathbf{1}^T\mathbf{P}_{(k)}\mathbf{1}\Big) = \frac{1}{2}\big\|\underline{\mathbf{Y}}\big\|_F^2 + \frac{1}{2}\sum_{k}\big(\mathbf{1}^T\mathbf{Q}_{(k)}\mathbf{1} - 2\,\mathbf{1}^T\mathbf{P}_{(k)}\mathbf{1}\big). \qquad (65)$$

The first term $\|\underline{\mathbf{Y}}\|_F^2$ is constant and can hence be neglected. The remaining terms are sums of all the entries of the matrices $\mathbf{Q}_{(k)}$ and $\mathbf{P}_{(k)}$.

6. Communication cost and practical considerations

This section analyzes the communication cost of the grid algorithms and discusses some practical issues in the implementation of the grid factorization. Techniques are introduced to deal with small sub-tensors, or with a large number of sub-tensors.

6.1. Communication cost

The communication cost of the algorithms for grid PARAFAC and grid NTF can be evaluated as the total data transferred between the labs and the server in a parallel system during the factorization. For illustration, we compute the communication cost of the ALS algorithm. The value can be roughly measured as the total data received and sent in all steps of Algorithm 1.

For all the grid algorithms, the communication cost is mostly concentrated in computing the matrices $\mathbf{T}$ and $\mathbf{S}$ to update the sub-factors $\mathbf{A}^{(n)}_{(k_n)}$. At Step 12, to update a matrix $\mathbf{P}_{(k)}$, a lab requires the $J_k \times J$ matrix $\mathbf{P}_{(k)}$, the $I_{k_n} \times J_k$ sub-factor $\mathbf{U}^{(n)}_{(k)}$ and the $I_{k_n} \times J$ sub-factor $\mathbf{A}^{(n)}_{(k_n)}$, and then sends $J_k \times J$ data back to the server. Through Steps 12-15, the total transferred data is

$$I_{k_n}(J_k + 3J) + 2J_k J + 4J^2,$$

or

$$4 I_{k_n} J + 6 J^2$$

under the assumption that the sub-tensors are approximated with the same number of components as the full tensor, that is, $J_k = J$. For all the sub-tensors, the total data transferred in the "parfor" loop in Step 11 is

$$\left(4J\sum_{n}\frac{I_n}{K_n} + 6J^2\right)\prod_{n} K_n,$$

or

$$4JK^{N-1}\sum_{n} I_n + 6J^2 K^N$$

with the assumption that $K_1 = K_2 = \cdots = K_N = K$. Hence, the total data transferred inside the "repeat" loop (Step 7) is

$$6JK^{N-1}\sum_{n} I_n + 10 J^2 K^N.$$

For a specific data tensor, the communication cost of the ALS (45) increases as the number of sub-tensors $K^N$ increases. The smaller the number of sub-tensors, the lower the communication cost. However, the sub-tensor size is limited by the configuration of its working lab, such as memory and operating system; for a 32-bit system, the tensor size should not exceed 3 GB. Relatively small sub-tensors can be easily factorized with a high fit. When the numbers of sub-tensors $K_n$ along the modes are exactly the tensor dimensions, that is $K_n = I_n$ for $n = 3, \ldots, N$, the sub-tensors simplify into matrices. Factorizations of these sub-tensors become singular value decompositions, which can always explain 100% of the data. However, the reconstruction of the full factors then demands a very high communication cost due to the large number of sub-tensors. As a consequence, a large-scale tensor should be divided so as to have a minimum number of sub-tensors.

Dividing the large-scale tensor into small sub-tensors can also increase the risk of nearly collinear factors, which cannot be approximated by the standard ALS algorithm (14). In order to cope with this problem, we can use modifications such as compression [17] and line search [18,19]. The compression technique compresses the data into a smaller tensor using the TUCKER model; next, a PARAFAC model is fitted to the core tensor of the TUCKER model, and the PARAFAC factors of the original data are represented as products of the estimated factors.

6.2. Multistage reconstruction for grid decomposition

Reconstruction of the full factors from all sub-factors would demand a high communication cost, of the order $O(K^N)$. This section is devoted to speeding up the reconstruction. A multistage reconstruction is recommended for this problem. At each stage, we construct factors for concatenated tensors which are built up from neighboring sub-tensors; this reduces the number of sub-tensors for the next stage.
$$\mathbf{S} = \sum_{k_1,\ldots,k_{n-1},k_{n+1},\ldots,k_N}\mathbf{Q}_{(k)} \oslash \big(\mathbf{A}^{(n)T}_{(k_n)}\mathbf{A}^{(n)}_{(k_n)}\big), \qquad (69)$$

based on the matrices

$$\mathbf{P}_{(k)} = \{\mathbf{U}^T_{(k)}\mathbf{A}_{(k)}\}^{\circledast} \in \mathbb{R}^{J_k \times J}. \qquad (70)$$
8. Experiments
in Table 1. The results were averaged over 100 runs. The performance almost did not change when varying the grid size.

This tensor was next degraded by additive Gaussian noise at 10 dB. The grid PARAFAC achieved good performance, with SIR of about 35 dB, for different grid sizes K = 2, 4, 5, 8, 10, while the grid NTF achieved SIR of about 46 dB. We could not see any effect of the grid size on the accuracy.

The noise intensity was then increased so that SNR = 0 dB. Both the grid PARAFAC and the grid NTF again obtained good performance with small grid sizes K = 2, 4. However, the performance decreased as the sub-tensor size decreased: the algorithms achieved only 20 dB for the grid size K = 20, i.e., a sub-tensor size of 50 x 50 x 50. Small sub-tensors could not capture the hidden components because their structures were distorted by the heavy noise. The comparison of SIR indices for this kind of data is illustrated in Fig. 4(b). We also analyzed the multistage reconstructions for the grid PARAFAC and grid NTF, and the conversion model from PARAFAC to NTF factors. Multistage reconstruction achieved slightly lower performance. For example, with a grid size of K = 5, the grid PARAFAC achieved 29.39 dB and its multistage factorization achieved 26.85 dB; the grid NTF and its multistage version obtained 35.11 and 32.97 dB, respectively. The performance of the multistage paradigm could be degraded by up to 2 dB, but this technique speeds up the processing. The comparison of processing speed between the models is given in Table 2. With the same grid size, the multistage reconstructions were always much faster than the grid models. For example, the grid PARAFAC approximated the full factors from 512 = 8^3 sub-tensors in 1,131 s and achieved SIR = 25.08 dB, whereas the reconstruction with three stages took 172 s and achieved 23.95 dB.

As analyzed in Section 6.1, the communication cost of the reconstruction strongly depends on the total number of sub-tensors, i.e., the grid size. The larger the number of sub-tensors, the slower the processing speed. The grid PARAFAC completed the reconstruction from 4 sub-tensors in 18.65 s, in 249 s for 125 sub-tensors (K = 5), and in 29,310 s for 8000 sub-tensors (K = 20). A similar effect of the grid size on the processing speed was observed for the grid NTF and the multistage grid factorizations.

We replicated this experiment for a 4-D synthetic tensor X in R^{1000 x 1000 x 1000 x 1000} composed of similar components. The tensor was also degraded by Gaussian noise with SNR = 10 dB. The noisy dense tensor consisted of 10^12 entries and would consume 4 TB of RAM in single-precision format. The grid PARAFAC and grid NTF factorized the noisy tensor with a 7 x 7 x 7 x 7 grid of sub-tensors, that is, 2401 sub-tensors in total. Factorizations of all the sub-tensors took 19,156 s in a parallel system consisting of 4 nodes and 16 cores; the approximation time will decrease when using more nodes. The reconstructions of the factors took 842 s using the grid PARAFAC and 3,911 s using the grid NTF. By using the (N-2)-stage paradigm presented in Section 6.2 and illustrated in Fig. 3(b), the grid PARAFAC and grid NTF completed the reconstructions in 167 and 165 s, respectively. The performance of this experiment is shown in Table 3. All the algorithms achieved very good performance, above 40 dB.

8.2. Synthetic benchmark
Fig. 5. Illustration for Example 2 with a noisy dense tensor of size 5000 x 5000 x 5000 (only 200 samples for each dimension are shown). (a) 15 components. (b) Noisy tensor. (c) Reconstructed tensor.
Fig. 6. Noisy slice and its reconstruction for Example 2 (only 200 samples for each dimension are shown). (a) Noisy slice. (b) Reconstructed slice.
From the estimated PARAFAC factors, we can quickly retrieve the nonnegative factors underlying the data by applying the NTF algorithms. This step took only 2.78 s. The 15 estimated factors achieved high SIR indices, in the range of [43.64, 54.97] dB, and are depicted in Fig. 5. Fig. 6(b) shows the reconstruction of the noisy slice in Fig. 6(a).
Fig. 7. Illustration of six nonnegative components estimated from 10 PARAFAC components for the Graz benchmark. (a) Spectral components. (b) Temporal components.
Table 5. Comparison of grid NTF and NTF for the digit dataset.

Fig. 10. Feature extraction for a digit 6 using PARAFAC features: (a) training digit; (b)-(d) samples reconstructed by using 10, 15 and 20 features with the corresponding bases in (e).

Fig. 12. Illustration of the frequency responses of the 20 x 20 MIMO system for Example 6. (a)-(c) magnitude responses of the original and estimated impulse functions obtained by the 5-D PARAFAC, the grid PARAFAC and the method in [23]; (d)-(f) phase responses of the original and estimated impulse functions for the different methods. For each plot, the x-axis denotes the frequency points and the y-axis represents the intensity. Only some responses for the first three inputs and outputs are displayed.
A large number of sub-tensors generated from the original tensor increases the communication cost of the algorithms. To deal with this, we can estimate the full factors in multiple stages, as illustrated in Examples 1, 2 and 5.

Appendix A. Basic statistics for a synthetic tensor

This section presents some basic statistics for a synthetic tensor $\underline{\mathbf{Y}}$. In Example 1, we built up a dense and very large-scale tensor, which was then degraded by additive Gaussian noise. To generate the random noise, we needed the standard deviation of the vector $\mathrm{vec}(\underline{\mathbf{Y}})$. The following lemma helps us to compute this statistic quickly, instead of manipulating the samples $y_{i_1 \ldots i_N}$.

Lemma 3. For an $N$-dimensional tensor $\underline{\mathbf{Y}}$ built up from $N$ factors $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R}$, the sum of the squared entries can be computed from small Gram matrices as

$$\sum_{i_1, i_2, \ldots, i_N} y^2_{i_1 i_2 \ldots i_N} = \mathbf{1}^T\{\mathbf{A}^T\mathbf{A}\}^{\circledast}\,\mathbf{1}. \qquad \square \qquad (A.5)$$

Appendix B. Outer, Kronecker, Khatri–Rao and Hadamard products

Several special matrix products are important for the representation of tensor factorizations and decompositions.

B.1. Outer product

It should be mentioned that, in general, the outer product of vectors yields a tensor, whereas the Kronecker product gives a vector. For example, for three vectors $\mathbf{a} \in \mathbb{R}^{I}$, $\mathbf{b} \in \mathbb{R}^{T}$, $\mathbf{c} \in \mathbb{R}^{Q}$, their three-way outer product $\underline{\mathbf{Y}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times T \times Q}$ is a third-order tensor with the entries $y_{itq} = a_i b_t c_q$, while the three-way Kronecker product of the same vectors is the vector $\mathrm{vec}(\underline{\mathbf{Y}}) = \mathbf{c} \otimes \mathbf{b} \otimes \mathbf{a} \in \mathbb{R}^{ITQ}$.

The Kronecker product of two matrices $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_J] \in \mathbb{R}^{I \times J}$ and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R]$, denoted by $\otimes$ (cf. the MATLAB function kron), is defined as

$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1J}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2J}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{I1}\mathbf{B} & a_{I2}\mathbf{B} & \cdots & a_{IJ}\mathbf{B} \end{bmatrix} \qquad (B.6)$$

$$= [\mathbf{a}_1 \otimes \mathbf{b}_1\ \ \mathbf{a}_1 \otimes \mathbf{b}_2\ \ \cdots\ \ \mathbf{a}_J \otimes \mathbf{b}_{R-1}\ \ \mathbf{a}_J \otimes \mathbf{b}_R]. \qquad (B.7)$$

For two matrices $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_J] \in \mathbb{R}^{I \times J}$ and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_J] \in \mathbb{R}^{T \times J}$ with the same number of columns $J$, their Khatri–Rao product, denoted by $\odot$, performs the following operation:

$$\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1\ \ \mathbf{a}_2 \otimes \mathbf{b}_2\ \ \cdots\ \ \mathbf{a}_J \otimes \mathbf{b}_J] \qquad (B.9)$$

$$= [\mathrm{vec}(\mathbf{b}_1\mathbf{a}_1^T)\ \ \mathrm{vec}(\mathbf{b}_2\mathbf{a}_2^T)\ \ \cdots\ \ \mathrm{vec}(\mathbf{b}_J\mathbf{a}_J^T)] \in \mathbb{R}^{IT \times J}. \qquad (B.10)$$

The following properties of the Khatri–Rao product are often employed in the paper:
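Because these products are plain matrix operations, they are easy to check numerically. The following short NumPy sketch (ours; the paper itself only points to the MATLAB function kron) illustrates the outer, Kronecker and Khatri–Rao products of (B.6)-(B.10) and the vectorization identity quoted above:

```python
import numpy as np

a, b, c = np.random.rand(3), np.random.rand(4), np.random.rand(5)

# Outer product: a third-order tensor with entries y_itq = a_i * b_t * c_q.
Y = np.einsum("i,t,q->itq", a, b, c)

# Three-way Kronecker product: a vector, and vec(Y) = kron(c, kron(b, a)).
vecY = np.kron(np.kron(c, b), a)
assert np.allclose(vecY, Y.reshape(-1, order="F"))

# Khatri-Rao product (B.9): column-wise Kronecker product of A (I x J), B (T x J).
A, B = np.random.rand(3, 4), np.random.rand(5, 4)
A_kr_B = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])
print(A_kr_B.shape)   # (I*T, J)
```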
References

[8] A. Cichocki, A.-H. Phan, Fast local algorithms for large scale nonnegative matrix and tensor factorizations, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E92-A (3) (2009) 708–721 (invited paper).
[9] S. Goreinov, E. Tyrtyshnikov, N. Zamarashkin, A theory of pseudo-skeleton approximations, Linear Algebra and its Applications 261 (1997) 1–21.
[10] A.-H. Phan, A. Cichocki, Multi-way nonnegative tensor factorization using fast hierarchical alternating least squares algorithm (HALS), in: Proceedings of the 2008 International Symposium on Nonlinear Theory and its Applications, Budapest, Hungary, 2008.
[11] M. Mørup, L. Hansen, J. Parnas, S. Arnfred, Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization, Technical Report, 2006. URL: [Link].
[12] M. Daube-Witherspoon, G. Muehllehner, An iterative image space reconstruction algorithm suitable for volume ECT, IEEE Transactions on Medical Imaging 5 (1986) 61–66.
[13] A. De Pierro, On the relation between the ISRA and the EM algorithm for positron emission tomography, IEEE Transactions on Medical Imaging 12 (2) (1993) 328–333.
[14] H. Lantéri, R. Soummer, C. Aime, Comparison between ISRA and RLA algorithms: use of a Wiener filter based stopping criterion, Astronomy and Astrophysics Supplement Series 140 (1999) 235–246.
[15] P. Eggermont, V. LaRiccia, Maximum smoothed likelihood density estimation for inverse problems, Annals of Statistics 23 (1) (1995) 199–220.
[16] D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[17] C. Andersson, R. Bro, Improving the speed of multi-way algorithms: Part I. Tucker3, Chemometrics and Intelligent Laboratory Systems 42 (1998) 93–103.
[18] R. Bro, Multi-way analysis in the food industry: models, algorithms, and applications, Ph.D. Thesis, University of Amsterdam, Holland, 1998.
[19] M. Rajih, P. Comon, Enhanced line search: a novel method to accelerate PARAFAC, SIAM Journal on Matrix Analysis and Applications 30 (3) (2008) 1128–1147.
[20] N. Sidiropoulos, G. Giannakis, R. Bro, Blind PARAFAC receivers for DS-CDMA systems, IEEE Transactions on Signal Processing 48 (3) (2000) 810–823.
[21] D. Nion, L. De Lathauwer, A tensor-based blind DS-CDMA receiver using simultaneous matrix diagonalization, in: Proceedings of SPAWC 2007, IEEE Workshop on Signal Processing Advances in Wireless Communications, Helsinki, Finland, 2007.
[22] E. Acar, C. Bingol, H. Bingol, R. Bro, B. Yener, Computational analysis of epileptic focus localization, in: Proceedings of the Fourth IASTED International Conference on Biomedical Engineering (BioMED 2006), Innsbruck, Austria, 2006, pp. 317–322.
[23] Y. Yu, A.P. Petropulu, PARAFAC-based blind estimation of possibly underdetermined convolutive MIMO systems, IEEE Transactions on Signal Processing 56 (2007) 111–124.
[24] C. Brunner, R. Leeb, G.R. Müller-Putz, A. Schlögl, BCI Competition 2008: Graz data set A, 2008.
[25] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[26] B. Savas, L. Eldén, Handwritten digit classification using higher order singular value decomposition, Pattern Recognition 40 (3) (2007) 993–1003.
[27] P. Paatero, A weighted non-negative least squares algorithm for three-way PARAFAC factor analysis, Chemometrics and Intelligent Laboratory Systems 38 (2) (1997) 223–242.
[28] P. Tichavský, Z. Koldovský, Simultaneous search for all modes in multilinear models, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2010), 2010.
[29] A.-H. Phan, A. Cichocki, Tensor decompositions for feature extraction and classification of high dimensional datasets, Nonlinear Theory and Its Applications, IEICE 1 (1) (2010) 37–68 (invited paper).

Anh Huy Phan received the B.E. and master's degrees in Electronic Engineering from the Ho Chi Minh City University of Technology. He worked as Deputy Head of the Research and Development Department, Broadcast Research and Application Center, Vietnam Television, and he is doing research towards his Ph.D. degree under the supervision of Professor Cichocki. He is the co-author of the book Nonnegative Matrix and Tensor Factorizations (J. Wiley, 2009). His interest is in tensor decompositions and their applications.

Andrzej Cichocki was born in Poland. He received the M.Sc. (with honors), Ph.D. and Habilitate Doctorate (Dr.Sc.) degrees, all in electrical engineering, from the Warsaw University of Technology (Poland). He is the co-author of four international books and monographs (two of them translated into Chinese): Nonnegative Matrix and Tensor Factorizations (J. Wiley, 2009), Adaptive Blind Signal and Image Processing (J. Wiley, 2002), MOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer-Verlag, 1989) and Neural Networks for Optimization and Signal Processing (J. Wiley and Teubner Verlag, 1993/94), and the author or co-author of more than two hundred papers. He is Editor-in-Chief of the journal Computational Intelligence and Neuroscience. Currently, he is the head of the Laboratory for Advanced Brain Signal Processing at the Brain Science Institute, RIKEN, Japan.