An Efficient Approach to Informative Feature Extraction from Multimodal Data

Lichen Wang1*, Jiaxiang Wu2, Shao-Lun Huang1, Lizhong Zheng3, Xiangxiang Xu4, Lin Zhang1, Junzhou Huang5

1 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
2 Tencent AI Lab
3 Department of EECS, Massachusetts Institute of Technology
4 Department of Electronic Engineering, Tsinghua University
5 Department of CSE, The University of Texas at Arlington

Email: wlc16@[Link], jonathanwu@[Link], [Link]@[Link], lizhong@[Link], xuxx14@[Link], linzhang@[Link], jzhuang@[Link]

arXiv:1811.08979v1 [[Link]] 22 Nov 2018

* This work was done when Lichen Wang was an intern at Tencent AI Lab.
Copyright © 2019, Association for the Advancement of Artificial Intelligence ([Link]). All rights reserved.
Abstract

One primary focus in multimodal feature extraction is to find the representations of individual modalities that are maximally correlated. As a well-known measure of dependence, the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation becomes an appealing objective because of its operational meaning and desirable properties. However, the strict whitening constraints formalized in the HGR maximal correlation limit its application. To address this problem, this paper proposes Soft-HGR, a novel framework to extract informative features from multiple data modalities. Specifically, our framework avoids the "hard" whitening constraints while preserving the same feature geometry as in the HGR maximal correlation. The objective of Soft-HGR is straightforward, involving only two inner products, which guarantees efficiency and stability in optimization. We further generalize the framework to handle more than two modalities and missing modalities. When labels are partially available, we enhance the discriminative power of the feature representations through a semi-supervised adaptation. Empirical evaluation shows that our approach learns more informative feature mappings and is more efficient to optimize.

Introduction

Human perception is typically more accurate when objects are presented in multiple modalities, as information from one sense often augments information from another. This observation has raised recent interest in developing learning machines that can extract correlation across modalities through the perception of equivalence, dependence, or association. However, compared to the ease of human perception, identifying the relationship among multiple sources is much harder for machines. The reason lies in the fact that the varying statistical properties carried by data from each source obscure the correlation among modalities, which can be vital for learning effective feature representations (Baltrušaitis, Ahuja, and Morency 2018; Sohn, Shang, and Lee 2014). Existing methods approach this problem through Canonical Correlation Analysis (CCA) (Hotelling 1936; Akaho 2006; Andrew et al. 2013), Euclidean distance minimization (Frome et al. 2013), enforcing partial order (Vendrov et al. 2015), etc.

In statistics, the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation (Hirschfeld 1935; Gebelein 1941; Rényi 1959), as a generalization of Pearson's correlation (Pearson 1895), is well known for its legitimacy as a measure of dependence. This notion is appealing to multimodal feature extraction for several reasons. For example, maximizing the HGR maximal correlation enables us to determine the nonlinear transformations of two variables that are maximally correlated (Feizi et al. 2017). From the perspective of information theory, the HGR transformation carries the maximum amount of information of X about Y, and vice versa (Huang et al. 2017). As for generality, CCA (Hotelling 1936) and its variants (Bach and Jordan 2002; Akaho 2006; Andrew et al. 2013) can be regarded as realizations of the HGR maximal correlation with different designs of transformation functions.

However, the HGR maximal correlation suffers from two limitations. Firstly, the HGR maximal correlation involves whitening constraints which require the features to be strictly uncorrelated. Most commonly, this orthogonal geometry is preserved by a whitening process (Andrew et al. 2013; Wang et al. 2015b), which relies on the computation of matrix inversion or decomposition. These operations are of high complexity and may have numerical stability issues for large feature dimensions. Secondly, discriminativeness is not explicitly formulated in the objective of the HGR maximal correlation. In fact, it leads to desirable performance in downstream supervised tasks only if all the discriminative information "accidentally" lies in the common subspace of the different modalities. Such an assumption may not hold when the input modalities are weakly correlated and do not share much common information. In this case, the underlying discriminative information is more likely to be discarded by the feature mapping, which leads to performance degradation.

To address these problems, we propose Soft-HGR, a novel framework to learn correlated representations across modalities without hard whitening constraints. The objective of Soft-HGR consists of two inner products, one between the feature mappings and the other between the feature covariances.
While the formulation rules out the whitening constraints, our model is still able to preserve the same feature geometry as in the original HGR formulation. Therefore, no additional decorrelation process is required in optimization, which promises scalability and stability of the algorithm. Besides, the simple formulation of Soft-HGR makes the framework easy to generalize: Soft-HGR can be readily extended to handle more than two modalities and missing modalities. In the semi-supervised setting, we adapt the model to extract information that not only captures the dependence between different modalities, but also has good predictive power for the labels. Empirically, our method demonstrates superior efficiency, stability, and discriminative performance on real data.
In summary, our main contributions are as follows:
• We propose Soft-HGR, based on the HGR maximal correlation, to extract informative features from multimodal data. The objective is simple and easy to implement;
• We propose an alternative strategy to learn the HGR transformations without explicit whitening constraints. The optimization is more efficient and reliable;
• We generalize our framework to handle more than two modalities and missing modalities, and to incorporate discriminative information for semi-supervised tasks.

Figure 1: Architecture of Soft-HGR.

Background: The HGR Maximal Correlation

The HGR maximal correlation (Hirschfeld 1935; Gebelein 1941; Rényi 1959) generalizes the well-known Pearson's correlation (Pearson 1895) as a general measure of dependence. While it was originally defined on one feature, the multi-feature extension is straightforward. For jointly distributed random variables X and Y with ranges \mathcal{X} and \mathcal{Y}, the HGR maximal correlation with k features is defined by:

\rho^{(k)}(X, Y) = \sup_{f: \mathcal{X} \to \mathbb{R}^k,\ \mathbb{E}[f]=0,\ \mathrm{Cov}(f)=I;\ \ g: \mathcal{Y} \to \mathbb{R}^k,\ \mathbb{E}[g]=0,\ \mathrm{Cov}(g)=I} \mathbb{E}\big[f^T(X)\, g(Y)\big]   (1)

where f = [f_1, f_2, ..., f_k]^T, g = [g_1, g_2, ..., g_k]^T, and the supremum is taken over all sets of Borel measurable functions with zero mean and identity covariance. As a legitimate measure of dependence, the HGR maximal correlation satisfies several fundamental properties that are rarely provided by other measures. For example, the correlation coefficient is bounded by 0 and 1, corresponding to the cases where the two random variables are independent, or where there exists a deterministic relationship between X and Y (Rényi 1959).

There are many reasons why the HGR maximal correlation is appealing to multimodal feature extraction. For example, finding the HGR maximal correlation also gives us the nonlinear transformations f and g. These transformations are the most "informative" ones from the viewpoint of information theory, as f(X) carries the maximum amount of information towards Y and vice versa (Huang et al. 2017).

Connections to CCA Based Models

One strand of research on correlation extraction is based on the work of Hotelling on CCA (Hotelling 1936), which was later extended to Kernel CCA (Bach and Jordan 2002; Akaho 2006) and Deep CCA (Andrew et al. 2013). In fact, CCA based models share a very similar objective with the HGR maximal correlation, except that their transformation functions are restricted to certain forms. More specifically, CCA and Kernel CCA find optimal feature mappings in linear and reproducing kernel Hilbert spaces, respectively. Deep CCA takes a different approach, in which f and g are implemented as deep neural networks. Assuming infinite expressive power of the neural structure, f and g have the capability to approximate the HGR transformations.

Limitations

One impediment to the HGR maximal correlation is that the whitening constraints bring high computational complexity to the optimization. Existing models introduce a decorrelation step which forces the covariance to be an identity matrix. The decorrelation process is not scalable, since it relies on the computation of matrix inversion and decomposition, whose time complexity is O(k^3). Besides, the optimization in practice often encounters gradient explosion for large k, because the covariance matrices become ill-posed. Some works have been proposed to address this problem. Soft CCA (Chang, Xiang, and Hospedales 2018) introduces a decorrelation regularizer based on the l1 penalty to replace the hard whitening constraints. Correlational Neural Network (Chandar et al. 2016), inspired by the autoencoder, introduces an additional reconstruction loss to replace the whitening constraints. However, both methods break the original feature geometry of the HGR maximal correlation.

Besides, the features extracted from the HGR maximal correlation are not necessarily suitable for downstream discriminative tasks. Without loss of generality, let us assume X \in \mathbb{R}^{D_X} and Y \in \mathbb{R}^{D_Y} are multivariate random variables with D_X > D_Y. Then the dimension k of the feature transformation should be smaller than D_Y; otherwise the feature covariance cov(f(X)) becomes singular. Therefore, some information about X is discarded during the transformation f: \mathbb{R}^{D_X} \to \mathbb{R}^k. This is acceptable if the primary goal is to model the correlation between modalities. However, if f is utilized for later discriminative tasks, we may expect some performance loss.
Soft-HGR

In this section, we detail our framework for Soft-HGR. We commence by deriving the optimal solution of the HGR maximal correlation with the whitening constraints. Then we propose an alternative strategy, the low-rank approximation, to approach the HGR problem. We show that our proposed objective escapes the whitening constraints but still arrives at an equivalent optimum. Finally, we generalize Soft-HGR to handle more than two data modalities and missing modalities, and to incorporate supervised information.

The Optimal Feature Transformations

To simplify the discussion, we assume that X and Y are discrete random variables with ranges \mathcal{X} = {1, 2, ..., |\mathcal{X}|} and \mathcal{Y} = {1, 2, ..., |\mathcal{Y}|}, respectively. However, the discussion is still valid when X and Y are multivariate and continuous in nature.

We first introduce the matrix B \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{Y}|} as a function of the joint distribution P_{XY} (Huang et al. 2017). Its (x, y)-th entry is defined as:

B_{x,y} = \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\, \sqrt{P_Y(y)}}   (2)

As a summarization of the data, B has the following property:
Lemma 1. The largest singular value of B is 1, with the corresponding left and right singular vectors given by:

u_0 = \big[\sqrt{P_X(1)}, \sqrt{P_X(2)}, \ldots, \sqrt{P_X(|\mathcal{X}|)}\big]^T, \quad v_0 = \big[\sqrt{P_Y(1)}, \sqrt{P_Y(2)}, \ldots, \sqrt{P_Y(|\mathcal{Y}|)}\big]^T   (3)

Proof. For any \psi = \big[\sqrt{P_Y(y)}\, g(y),\ y = 1, 2, \ldots, |\mathcal{Y}|\big]^T that satisfies \|\psi\|_2 = 1, we have

\|B\psi\|_2^2 = \sum_x \Big( \sum_y \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}}\, \sqrt{P_Y(y)}\, g(y) \Big)^2
 = \sum_x P_X(x) \Big( \sum_y \frac{P_{XY}(x, y)}{P_X(x)}\, g(y) \Big)^2
 = \sum_x P_X(x)\, \mathbb{E}^2[g(Y) \mid X = x]
 \le \sum_x P_X(x)\, \mathbb{E}[g^2(Y) \mid X = x]
 = \mathbb{E}[g^2(Y)] = \|\psi\|_2^2 = 1   (4)

Therefore the largest singular value \sigma_0 = \sup_\psi \|B\psi\|_2 \le 1. The equality holds only when g(Y) is the constant 1 and \psi = v_0. The derivation for u_0 is similar.
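As a quick numerical illustration of Lemma 1 (ours, not part of the original derivation), the following NumPy sketch draws a random joint pmf, builds B as in (2), and checks that the largest singular value is 1 with singular vectors u_0 and v_0; the alphabet sizes |X| = 8 and |Y| = 5 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random joint pmf P_XY over |X| = 8 and |Y| = 5 outcomes.
P = rng.random((8, 5))
P /= P.sum()
Px, Py = P.sum(axis=1), P.sum(axis=0)             # marginals P_X, P_Y

# Eq. (2): B_{x,y} = P_XY(x, y) / (sqrt(P_X(x)) * sqrt(P_Y(y)))
B = P / np.sqrt(np.outer(Px, Py))

U, S, Vt = np.linalg.svd(B)
print(np.isclose(S[0], 1.0))                       # Lemma 1: largest singular value is 1
print(np.allclose(np.abs(U[:, 0]), np.sqrt(Px)))   # u_0 = sqrt(P_X), up to the SVD sign ambiguity
print(np.allclose(np.abs(Vt[0]), np.sqrt(Py)))     # v_0 = sqrt(P_Y), up to the SVD sign ambiguity
```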
Below, we show that finding the most correlated feature transformations for the HGR maximal correlation is equivalent to solving the SVD of \tilde{B} = B - u_0 v_0^T.

Theorem 1. (Huang et al. 2017) Given the SVD of B = U\Sigma V^T = \sum_{i=0}^{K} \sigma_i u_i v_i^T with 1 = \sigma_0 \ge \sigma_1 \ge \ldots \ge \sigma_K, the optimal feature transformations for the HGR maximal correlation are given by:

f_i^*(x) = U_{x,i} / \sqrt{P_X(x)}, \quad i = 1, \ldots, k,\ x \in \mathcal{X}
g_i^*(y) = V_{y,i} / \sqrt{P_Y(y)}, \quad i = 1, \ldots, k,\ y \in \mathcal{Y}   (5)

Proof.

\mathbb{E}\big[f^T(X) g(Y)\big] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{XY}(x, y)\, f^T(x)\, g(y)
 = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sqrt{P_X(x)}\, f^T(x)\, \frac{P_{XY}(x, y)}{\sqrt{P_X(x)}\sqrt{P_Y(y)}}\, \sqrt{P_Y(y)}\, g(y)
 = \mathrm{tr}(\Phi^T B \Psi)   (6)

In (6) we introduce new variables \Phi \in \mathbb{R}^{|\mathcal{X}| \times k} and \Psi \in \mathbb{R}^{|\mathcal{Y}| \times k}, which are connected to f and g by:

\Phi = \big[\sqrt{P_X(1)}\, f(1), \ldots, \sqrt{P_X(|\mathcal{X}|)}\, f(|\mathcal{X}|)\big]^T
\Psi = \big[\sqrt{P_Y(1)}\, g(1), \ldots, \sqrt{P_Y(|\mathcal{Y}|)}\, g(|\mathcal{Y}|)\big]^T   (7)

Following the variable substitution, the objective of the HGR maximal correlation can be reformulated as follows:

\rho^{(k)}(X, Y) = \max_{f: \mathbb{E}[f]=0,\ \mathrm{Cov}(f)=I;\ g: \mathbb{E}[g]=0,\ \mathrm{Cov}(g)=I} \mathbb{E}\big[f^T(X) g(Y)\big]   (8)
 = \max_{\Phi: \Phi^T u_0 = 0,\ \Phi^T \Phi = I;\ \Psi: \Psi^T v_0 = 0,\ \Psi^T \Psi = I} \mathrm{tr}(\Phi^T B \Psi)   (9)
 = \max_{\Phi^T \Phi = I,\ \Psi^T \Psi = I} \mathrm{tr}(\Phi^T \tilde{B} \Psi)   (10)

For the optimization problem in (10), the optimal \Phi^* and \Psi^* should align with the left and right singular vectors of \tilde{B}, respectively. Substituting {\Phi^*, \Psi^*} back into {f, g} leads to the solution in (5).

For the maximization problem in (10), the whitening constraints over \Phi^T\Phi and \Psi^T\Psi are inevitable, as they ensure that the selected features are mutually orthogonal in the functional space. In the next subsection, we show an alternative formulation of this problem.
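In the discrete case, Theorem 1 translates into a few lines of linear algebra. The sketch below is our illustration under the same assumptions as in the previous sketch: it forms \tilde{B} = B - u_0 v_0^T, takes its SVD, reads off k feature transformations as lookup tables following (5), and verifies that they are zero-mean and whitened under the marginals.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((8, 5))
P /= P.sum()
Px, Py = P.sum(axis=1), P.sum(axis=0)

B = P / np.sqrt(np.outer(Px, Py))
B_tilde = B - np.outer(np.sqrt(Px), np.sqrt(Py))   # remove the trivial top singular component

k = 3
U, S, Vt = np.linalg.svd(B_tilde)
# Eq. (5): f_i(x) = U_{x,i} / sqrt(P_X(x)),  g_i(y) = V_{y,i} / sqrt(P_Y(y))
f_table = U[:, :k] / np.sqrt(Px)[:, None]          # shape (|X|, k), f_i(x) = f_table[x, i]
g_table = Vt[:k].T / np.sqrt(Py)[:, None]          # shape (|Y|, k)

# Sanity checks: zero mean and identity covariance under the marginals.
print(np.allclose(Px @ f_table, 0.0, atol=1e-8))
print(np.allclose((f_table * Px[:, None]).T @ f_table, np.eye(k)))
```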
Alternative: The Low-rank Approximation

Instead of solving the SVD, we approach this problem by finding the low-rank approximation of \tilde{B}, in which all the cross-modal interactions lie. Recalling the variable equivalence in (7), we approximate \tilde{B} by:

\min_{f,\, g}\ \frac{1}{2} \big\|\tilde{B} - \Phi \Psi^T\big\|_F^2 \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0.   (11)

Note that we do not impose constraints on cov(f(X)) or cov(g(Y)). We will soon argue that this formulation leads to the same feature geometry as the one in (10). In order to solve this problem, we introduce the following theorem:

Theorem 2. (Eckart-Young-Mirsky Theorem) (Eckart and Young 1936) Suppose A = U\Sigma V^T; then A_r = U_r \Sigma_r V_r^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T is the optimal solution to the following low-rank approximation problem:

\min_{A_r}\ \|A - A_r\|_F^2 \quad \text{s.t. } \mathrm{rank}(A_r) \le r.   (12)

Therefore, the optimal \Phi^* and \Psi^* should satisfy:

\Phi^* \Psi^{*T} = \sum_{i=1}^{k} \sigma_i u_i v_i^T = U_{1:k}\, \Sigma_{1:k}\, V_{1:k}^T   (13)

The \Phi and \Psi are not unique. Given any constant decomposition \Sigma_{1:k} = H_1 H_2^T, there is an associated solution \Phi^* = U_{1:k} H_1, \Psi^* = V_{1:k} H_2. The equivalent expression for f and g is:

f_i^*(x) = [U_{1:k} H_1]_{x,i} / \sqrt{P_X(x)}, \quad i = 1, \ldots, k,\ x \in \mathcal{X}
g_i^*(y) = [V_{1:k} H_2]_{y,i} / \sqrt{P_Y(y)}, \quad i = 1, \ldots, k,\ y \in \mathcal{Y}   (14)

Since H_1 and H_2 are invertible, one can conclude that the optimal feature transformations for Soft-HGR (14) and for the HGR maximal correlation (5) are linearly transformable from one to the other. Namely, they span the same feature space, i.e. span{f_1, f_2, ..., f_k} = span{f_1^*, f_2^*, ..., f_k^*} (resp. for g), and therefore describe the same amount of information. One way to understand this equivalence is to imagine that the HGR features are fed into a linear dense layer that outputs Soft-HGR features with the same dimensions.

The Soft-HGR Objective

Thus far, we have proved that the low-rank approximation of \tilde{B} also leads to the optimal feature transformations. Based on this idea, we now develop the operational objective of Soft-HGR. By expanding (11), we have:

\frac{1}{2} \big\|\tilde{B} - \Phi \Psi^T\big\|_F^2   (15)
 = \frac{1}{2} \|\tilde{B}\|_F^2 - \mathrm{tr}(\Phi^T \tilde{B} \Psi) + \frac{1}{2} \mathrm{tr}(\Phi^T \Phi\, \Psi^T \Psi)   (16)

where the norm of \tilde{B} is a constant given the data. Minimizing the last two terms with respect to f and g leads us to the Soft-HGR objective:

\max_{f,\, g}\ \mathbb{E}\big[f^T(X) g(Y)\big] - \frac{1}{2} \mathrm{tr}\big(\mathrm{cov}(f(X))\, \mathrm{cov}(g(Y))\big) \quad \text{s.t. } \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0.   (17)

The proposed Soft-HGR consists of two inner products, one between the feature mappings and the other between the feature covariances. The first term in (17) is consistent with the objective of the HGR maximal correlation, and the second term acts as a soft regularizer that replaces the whitening constraints.

Following the practice of Deep CCA, we design the transformation functions f and g as parametric neural networks. As long as the reachable functional space of the neural structures covers the optimal feature transformations, Soft-HGR and the HGR maximal correlation will always lead us to equivalent solutions.

Optimization

In practice, we do not usually have access to the joint probability distribution P_{XY}, but rather to paired multimodal samples (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}) drawn from this distribution. As is common practice, we embrace SGD techniques that operate on mini-batches of data to optimize Soft-HGR. The prominent concern here is how to estimate the sample covariance from only partially seen mini-batches. In fact, we find that simply using the batch covariance as a replacement gives the best performance. This implies that Soft-HGR actually decomposes the empirical \tilde{B} over every mini-batch. Only in this way is the empirical P_{XY} always consistent with the marginal distributions P_X and P_Y on which the covariance is evaluated. The detailed procedure to calculate the Soft-HGR objective is summarized in Algorithm 1. The overall complexity of Soft-HGR is O(mk^2), which is significantly less than the O(mk^2 + k^3) of a standard HGR implementation such as Deep CCA. It is also worth noting that our method does not impose an upper bound on the feature dimension k; the optimization remains stable even for very large k.

Algorithm 1 Evaluate Soft-HGR on a mini-batch
Input: paired data samples of two modalities in a mini-batch of size m: (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}); two branches of parameterized neural networks with k output units: f and g.
Output: the objective value of Soft-HGR.
1: Subtract the mean of the features:
   f(x^{(i)}) ← f(x^{(i)}) − (1/m) Σ_{j=1}^{m} f(x^{(j)}),  i = 1, ..., m
   g(y^{(i)}) ← g(y^{(i)}) − (1/m) Σ_{j=1}^{m} g(y^{(j)}),  i = 1, ..., m
2: Compute the empirical covariances:
   cov(f) ← (1/(m−1)) Σ_{i=1}^{m} f(x^{(i)}) f(x^{(i)})^T
   cov(g) ← (1/(m−1)) Σ_{i=1}^{m} g(y^{(i)}) g(y^{(i)})^T
3: Compute the empirical Soft-HGR objective:
   (1/(m−1)) Σ_{i=1}^{m} f(x^{(i)})^T g(y^{(i)}) − (1/2) tr(cov(f) cov(g))
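A minimal PyTorch rendering of Algorithm 1 is sketched below. It is our reading of the procedure, not the authors' released code; f_net and g_net are placeholders for whatever two network branches the modalities call for, and the returned objective is maximized by minimizing its negative.

```python
import torch

def soft_hgr_objective(fx: torch.Tensor, gy: torch.Tensor) -> torch.Tensor:
    """Empirical Soft-HGR objective of Algorithm 1 for one mini-batch.

    fx, gy: (m, k) outputs of the two network branches on paired samples.
    Returns a scalar that should be maximized.
    """
    m = fx.size(0)
    fx = fx - fx.mean(dim=0, keepdim=True)            # step 1: subtract the feature means
    gy = gy - gy.mean(dim=0, keepdim=True)
    cov_f = fx.t() @ fx / (m - 1)                     # step 2: empirical covariances
    cov_g = gy.t() @ gy / (m - 1)
    inner = (fx * gy).sum() / (m - 1)                 # step 3: (1/(m-1)) sum_i f(x_i)^T g(y_i)
    return inner - 0.5 * torch.trace(cov_f @ cov_g)

# Inside an SGD loop with two branches f_net and g_net (placeholders):
#   loss = -soft_hgr_objective(f_net(x_batch), g_net(y_batch))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```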
Extension to More or Missing Modalities

The HGR maximal correlation is originally defined on two random variables. In contrast to reconstruction models (Srivastava and Salakhutdinov 2012; Zhao, Hu, and Wang 2015), the multimodal extension of correlation based models is not straightforward: new modalities bring additional whitening constraints, and the computational complexity scales up. In Soft-HGR, however, the "soft" formulation provides more flexibility. Recall that the core idea behind Soft-HGR is to find an approximation of the \tilde{B} matrix defined on two modalities. In order to handle more than two modalities, the multimodal Soft-HGR should learn feature transformations which recover all pairwise \tilde{B} simultaneously. Following this idea, let X_1, ..., X_d be d different modalities and f^{(1)}, ..., f^{(d)} be their corresponding transformation functions; the multimodal Soft-HGR is defined as:

\max_{f^{(1)}, \ldots, f^{(d)}}\ \sum_{i \ne j} \mathbb{E}\big[f^{(i)T}(X_i)\, f^{(j)}(X_j)\big] - \frac{1}{2} \sum_{i \ne j} \mathrm{tr}\big(\mathrm{cov}(f^{(i)}(X_i))\, \mathrm{cov}(f^{(j)}(X_j))\big)
\text{s.t. } f^{(i)}: \mathcal{X}_i \to \mathbb{R}^k;\ \mathbb{E}\big[f^{(i)}(X_i)\big] = 0;\ i, j = 1, 2, \ldots, d.   (18)

When d = 3, Figure 1 provides an illustration of (18) with a neural network implementation. The DNN structure of each neural branch may vary, depending on the statistical properties of the inputs. The overall model extracts features from every neural branch and maximizes their pairwise Soft-HGR objectives in an additive manner. From an information-theoretic perspective, maximizing (18) is equivalent to extracting the common information from multiple random variables.

Note that this generalization also provides a solution for data with partially missing modalities. To see this, the first term in (18) can be applied only to the modalities present in each training sample, while the second term is always measurable, as it only depends on the marginal distributions of the individual modalities.
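The pairwise sum in (18) is a direct loop over modality pairs. The sketch below is our rendering under the same assumptions as the two-modality code above; each unordered pair is counted once, which differs from the ordered-pair sum in (18) only by a constant factor of 2.

```python
import torch
from itertools import combinations

def multimodal_soft_hgr(features):
    """Multimodal Soft-HGR objective of Eq. (18) (to be maximized).

    features: list of (m, k) feature matrices, one per modality, on paired samples.
    """
    m = features[0].size(0)
    feats = [f - f.mean(dim=0, keepdim=True) for f in features]     # zero-mean per modality
    covs = [f.t() @ f / (m - 1) for f in feats]                     # per-modality covariances
    total = 0.0
    for i, j in combinations(range(len(feats)), 2):                 # each unordered pair once
        inner = (feats[i] * feats[j]).sum() / (m - 1)
        total = total + inner - 0.5 * torch.trace(covs[i] @ covs[j])
    return total
```

Handling partially missing modalities, as described above, would amount to restricting the inner-product term of each pair to the samples where both modalities are present, while each covariance can still be estimated from all available samples of that modality.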
Incorporating Supervised Information

The primary goal of the above framework is to extract the correlation between modalities. Therefore, any information that is private to an individual modality is eliminated, regardless of its discriminative power. The intuition behind the supervised/semi-supervised adaptation is that feature extraction should be conducted under the guidance of supervised labels, even if they are insufficient.

Assume that a subset of the bi-modal data is associated with discrete labels Z with range \mathcal{Z} = {1, 2, ..., |\mathcal{Z}|}. In order to receive the supervised information from the labels, we feed the joint representation, i.e. the concatenation of the individual feature mappings, into a softmax classifier. The cross-entropy loss is added to the overall objective, with a hyper-parameter \lambda \in [0, 1] to trade off the strength of the unsupervised component:

\mathcal{L} = (\lambda - 1) \cdot \mathbb{E}\big[\log Q_{Z|XY}\big] - \lambda\, \mathbb{E}\big[f^T(X) g(Y)\big] + \frac{\lambda}{2} \mathrm{tr}\big(\mathrm{cov}(f(X))\, \mathrm{cov}(g(Y))\big)   (19)

where

Q_{Z=j|XY} = \frac{\exp\big([f^T(X), g^T(Y)]\, \theta_j\big)}{\sum_{i=1}^{|\mathcal{Z}|} \exp\big([f^T(X), g^T(Y)]\, \theta_i\big)}   (20)

In semi-supervised settings, the supervised softmax loss, the first term in (19), is only effective when labels are presented. The last two terms of (19) correspond to the Soft-HGR loss, which is evaluated independently of the labels. The gradients from the label Z are first backpropagated to the individual feature mappings and then affect the feature selection.
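A sketch of the semi-supervised loss (19) in code form, under our assumptions about one reasonable implementation: the classifier logits come from a softmax head on the concatenated features, cross-entropy is applied only to the labeled samples, and lam plays the role of \lambda.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(fx, gy, logits, labels, labeled_mask, lam=0.3):
    """Loss of Eq. (19), to be minimized.

    fx, gy: (m, k) features; logits: (m, |Z|) from a softmax head on [f(X), g(Y)];
    labels: (m,) class indices (ignored where unlabeled); labeled_mask: (m,) bool;
    lam: the trade-off lambda in [0, 1].
    """
    m = fx.size(0)
    fx = fx - fx.mean(dim=0, keepdim=True)
    gy = gy - gy.mean(dim=0, keepdim=True)
    cov_f = fx.t() @ fx / (m - 1)
    cov_g = gy.t() @ gy / (m - 1)
    soft_hgr = (fx * gy).sum() / (m - 1) - 0.5 * torch.trace(cov_f @ cov_g)

    if labeled_mask.any():
        # (lambda - 1) * E[log Q_{Z|XY}] equals (1 - lambda) * cross-entropy on labeled samples
        ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    else:
        ce = torch.zeros((), device=fx.device)
    return (1.0 - lam) * ce - lam * soft_hgr
```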
Experiments

In this section, we evaluate Soft-HGR in the following aspects:
• To verify that the relationship between the HGR features and the Soft-HGR features is linear;
• To compare the efficiency and numerical stability of CCA based models and Soft-HGR;
• To demonstrate the power of semi-supervised Soft-HGR on discriminative tasks with limited labels;
• To show the performance of Soft-HGR on more than two modalities and missing modalities.

Comparing Soft-HGR with HGR

The formulations of Soft-HGR and the original HGR maximal correlation are equivalent except for the way they control whitening. In this section, we compare the two methods in terms of linearity, efficiency and stability.

Linearity Check. Based on the theory, the HGR and Soft-HGR transformations should span the same feature space. To verify this, we randomly generated 100K data samples (x_i, y_i) from a randomly chosen joint distribution P_{XY}, where X, Y \in {1, ..., 50} are both discrete random variables. The HGR features {f_HGR, g_HGR} are obtained by directly solving the SVD of \tilde{B}, which is calculated from the empirical joint distribution \tilde{P}_{XY}. In order to retrieve the Soft-HGR features {f_SHGR, g_SHGR}, we first turn the data into one-hot form X, Y \in \mathbb{R}^{100K \times 50}, then feed them into a two-branch one-layer neural network optimized with the Soft-HGR objective. Note that when the data are one-hot encoded, all possible functions can be captured by linear operations. Finally, we apply all learned functions to the data and run linear CCA between every two feature transformations. Recall that the HGR and Soft-HGR features are linearly transformable from one to the other. Therefore, the linear correlation between {f_SHGR(X), f_HGR(X)} and between {g_SHGR(Y), g_HGR(Y)} should, in the ideal case, reach the upper bound.

Table 1: The linear correlation between features extracted from Soft-HGR and the HGR maximal correlation.

                                 Feature dimensions
Linear correlation               10       20       40
Upper bound                      10       20       40
f_HGR(X) and g_HGR(Y)            1.36     2.37     3.40
f_SHGR(X) and g_SHGR(Y)          1.36     2.37     3.40
f_SHGR(X) and f_HGR(X)           9.99     20.00    39.99
g_SHGR(Y) and g_HGR(Y)           10.00    20.00    39.99

Table 1 summarizes the simulation results. HGR and Soft-HGR extract exactly the same linear correlation between X and Y for different choices of k. Besides, the correlation between corresponding features from the two models is almost identical to the upper bound, which provides empirical evidence for our theory.
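We read the "linear correlation" in Table 1 as the sum of canonical correlations between two feature matrices, so that its upper bound is k. A small helper of the kind we have in mind for this check, using the QR-based formulation of linear CCA, is sketched below; this is our reconstruction of the evaluation, not the authors' code.

```python
import numpy as np

def total_canonical_correlation(A: np.ndarray, B: np.ndarray) -> float:
    """Sum of canonical correlations between feature matrices A (n, ka) and B (n, kb)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)                 # orthonormal basis of each centered column space
    Qb, _ = np.linalg.qr(B)
    sv = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.sum(np.clip(sv, 0.0, 1.0)))

# If f_SHGR(X) and f_HGR(X) span the same k-dimensional subspace,
# this returns a value close to k, matching the last two rows of Table 1.
```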
Efficiency and Stability. In this subsection, we focus on the efficiency and stability provided by the two methods in optimization. In particular, we compare the execution time and the maximally reachable feature dimension by applying both models to the MNIST handwritten image dataset (LeCun et al. 1998), which consists of 60K/10K gray-scale digit images of size 28 x 28 as training/testing sets. We follow the experimental setting in (Andrew et al. 2013) and treat the left and right halves of the digit images as the two modalities X and Y. In order to highlight the efficiency difference brought by the objectives, we restrict all feature transformations to take the linear form; therefore, the HGR maximal correlation degrades to linear CCA. Both optimizations are executed on an Nvidia Tesla K80 GPU with mini-batch SGD and a batch size of 5K.

Figure 2: Execution time of SGD on CCA and linear Soft-HGR for one training epoch on MNIST data. When k is larger than 350, CCA experiences numerical issues.

Figure 2 compares the execution time for one training epoch with different feature dimensions k. As expected, Soft-HGR is faster than the CCA method by orders of magnitude. In addition, the execution time of the CCA method grows quickly with the feature dimension. This is undesirable in real-world settings where k could be very large. It is also worth noting that CCA experiences numerical issues when the feature dimension exceeds 350. The instability arises because the empirical covariance matrices over some mini-batches become ill-posed, or even non-invertible.
Soft-HGR for Semi-supervised Learning

In this section we demonstrate how Soft-HGR can be applied to improve the performance of discriminative tasks. We evaluate our model on the University of Wisconsin X-ray Microbeam Database (XRMB) (Westbury 1994) for phonetic classification. XRMB is a bi-modal dataset consisting of articulatory and acoustic data. Following the same preprocessing and reconstruction procedures as described in (Arora and Livescu 2013; Wang et al. 2015a), we obtain a total of 160K entries of acoustic vectors X \in \mathbb{R}^{273} and articulatory vectors Y \in \mathbb{R}^{112}, corresponding to 41 classes of labels Z.

Experiment Settings. While both modalities X and Y are available in the training phase, Y is not provided at test time. Namely, the model is evaluated by the classification accuracy with only X observed. We expect that using Y during training improves the classification performance, even though it is absent in the test phase. In addition, we partially mask out some portion of the labels Z associated with the training data. These two restrictions are consistent with real-world multimodal settings where facial movement data is usually not obtainable and labels are limited.

Comparing Models. (1) Supervised DNN, which has four hidden layers [1K, 1K, 1K, 1K]. It only takes the raw feature X as input and makes predictions on Z; the supervised DNN can only deal with labeled data. (2) DNN on CCA features: the DNN structure is the same as in (1), but it accepts a pre-trained f(X) as input. f(X) is obtained from principal component analysis (PCA), CCA (Hotelling 1936), Deep CCA (Andrew et al. 2013), Soft CCA (Chang, Xiang, and Hospedales 2018), Correlational Neural Network (CorrNet) (Chandar et al. 2016) and our model. Except for PCA, which extracts features only from X, all other methods try to find the f(X) most correlated with g(Y). For Deep CCA, Soft CCA and our model, the selected f is a DNN with two hidden layers [1K, 1K], and g is linear. For a fair comparison, the output feature dimension k is chosen to be 80 for all methods, as a higher value leads to unstable gradients in Deep CCA. (3) Semi-supervised models: we construct semi-supervised Soft-HGR as described in the subsection Incorporating Supervised Information, except that the top-layer softmax function only takes f(X) as input. In particular, we use a DNN with four hidden layers [1K, 1K, 1K, 1K] for f and a linear function for g. To see the equivalence, note that when \lambda = 0 the network becomes the supervised DNN. For a fair comparison, we also adapt a semi-supervised Soft CCA in the same manner. However, we found that Deep CCA fails under this adaptation because the training is too unstable to obtain a reliable result. In all DNNs, batch normalization (Ioffe and Szegedy 2015) is applied before the ReLU activation function to ensure better convergence. We evaluate the classification based on 5-fold cross validation. All models are optimized by the Adam optimizer (Kingma and Ba 2014) with a batch size of 1024 until convergence. The hyper-parameters are determined by their best performance on the validation set. Table 2 reports the average phonetic prediction accuracy of each method with certain percentages of labels.

Table 2: Phonetic prediction accuracy obtained by different methods on certain percentages of the labeled data in XRMB.

                      Percentages of labels
Method                10%      50%      100%
Baseline DNN          72.2%    81.2%    86.4%
PCA + DNN             71.5%    80.5%    85.2%
CCA + DNN             70.7%    79.9%    84.4%
Deep CCA + DNN        73.2%    80.1%    84.0%
Soft-HGR + DNN        73.0%    79.9%    83.7%
Soft CCA + DNN        69.4%    76.0%    78.8%
CorrNet               71.2%    79.7%    83.2%
Semi Soft-HGR         76.3%    85.0%    88.0%
Semi Soft CCA         73.6%    82.8%    85.5%

Observations. (1) Semi-supervised Soft-HGR achieves the highest accuracy among all models, and the difference becomes more apparent when labels are insufficient. (2) The discriminative performance of Deep CCA and Soft-HGR is similar, as they learn equivalent features. (3) f(X) trained by the various unsupervised models is not necessarily more discriminative than the raw feature X. In fact, it only improves classification when labels are extremely limited. In other cases, the performance is inferior to the end-to-end DNN, because valuable information may be lost as f projects the data into lower dimensions.

Soft-HGR for More or Missing Modalities

In this section we apply our method to a recommender system. In such problems, users X_u, items X_i, and context X_c are three natural modalities. The extensive success achieved by collaborative filtering techniques (Breese, Heckerman, and Kadie 1998) demonstrates that the correlations between these modalities are useful for inferring user behaviors.

Specifically, we experiment with KKBox's Music Recommendation Dataset (Chen et al. 2018). The goal is to predict the chance of a user listening to a song repetitively after the first listening event within a period of time. The labels are binary, where Y = 1 represents that the user listens to the song again, and Y = 0 means the opposite. The user features X_u and item (song) features X_i are explicitly given, and we treat source_system_tab, source_screen_name and source_type as the context features X_c. For preprocessing, the categorical features are one-hot/multi-hot encoded, and the continuous features are normalized. The features corresponding to one modality are concatenated into a single vector, resulting in X_u, X_i, and X_c being 34656-, 623691-, and 45-dimensional feature vectors, respectively. The test labels are not disclosed, so we use the last 20% of the 7M training data as the test set.¹ We test the model under two settings, where labels are insufficient or some modalities are missing. In the first setting, we conceal 20%/50%/90% of the labels in the training data. In the second scenario, we randomly mask one of the three modalities as missing in 20%/50% of both the training and testing data; the status of whether data is missing is encoded as a binary flag in the feature vector. The performance is evaluated by the area under the ROC curve (AUC).

Table 3: AUC obtained by different methods using part of the labels or part of the modalities.

                            Missing labels                  Missing modalities
Method         No Missing   20%      50%      90%           20%      50%
LR             0.6625       0.6623   0.6618   0.6534        0.6588   0.6286
FM             0.6780       0.6728   0.6723   0.6543        0.6696   0.6449
Deep FM        0.6803       0.6765   0.6756   0.6613        0.6714   0.6450
Neural FM      0.6760       0.6768   0.6746   0.6570        0.6661   0.6574
Semi Soft-HGR  0.6972       0.6935   0.6906   0.6728        0.6823   0.6682

Comparing Models. We compare our model against state-of-the-art predictive models for sparse data. These include shallow models: logistic regression (LR) and Factorization Machines (FM) (Rendle 2010); and deep models: DeepFM (Guo et al. 2017) and Neural FM (He and Chua 2017). For the models other than LR, the dimension of the feature embedding is set to 16. The DNN components of Deep FM, Neural FM and Semi Soft-HGR have the same structure [100, 100]. Semi-supervised Soft-HGR: the architecture of the unsupervised part is designed mainly according to Figure 1. However, since fully connected layers are not effective for sparse features, a Bi-Interaction layer, proposed in (He and Chua 2017), is inserted between the input and the DNN structure. The output features from the three neural network branches are forwarded to an average pooling layer, and the resulting joint representation is fed to a softmax function for prediction. The hyper-parameter \lambda controls the participation of the Soft-HGR loss. The comparison results are reported in Table 3. In order to highlight the role of the Soft-HGR loss, we plot the AUC versus \lambda when Semi Soft-HGR is trained with all labels in Figure 3.

Figure 3: The effect of the hyper-parameter \lambda on AUC.

Observations. (1) Semi-supervised Soft-HGR achieves significantly better performance than all the baselines. (2) From Figure 3 we can see that the performance decreases as \lambda is eliminated from the objective (i.e. \lambda = 0). Arguably, the performance gain comes from the introduction of the unsupervised Soft-HGR objective.

¹ The split is suggested by the 1st place solution. The last part of the data is used as the test set because the data are speculated to be chronologically ordered.
Conclusion

In this paper, we propose a multimodal feature extraction framework based on the HGR maximal correlation. Further, we replace the intrinsic whitening constraints with a "soft" regularizer, which guarantees efficiency and stability in optimization. Our model is able to cope with more than two modalities and missing modalities, and can be readily generalized to the semi-supervised setting. Extensive experiments show that our proposed model outperforms state-of-the-art multimodal feature selection methods in different scenarios.

Acknowledgement

The research of Shao-Lun Huang was funded by the Natural Science Foundation of China 61807021, and the Shenzhen Municipal Scientific Program JCYJ20170818094022586.

References

[Akaho 2006] Akaho, S. 2006. A kernel method for canonical correlation analysis. CoRR abs/cs/0609071.
[Andrew et al. 2013] Andrew, G.; Arora, R.; Bilmes, J.; and Livescu, K. 2013. Deep canonical correlation analysis. In International Conference on Machine Learning (ICML), 1247–1255.
[Arora and Livescu 2013] Arora, R., and Livescu, K. 2013. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7135–7139.
[Bach and Jordan 2002] Bach, F. R., and Jordan, M. I. 2002. Kernel independent component analysis. Journal of Machine Learning Research 3(Jul):1–48.
[Baltrušaitis, Ahuja, and Morency 2018] Baltrušaitis, T.; Ahuja, C.; and Morency, L.-P. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[Breese, Heckerman, and Kadie 1998] Breese, J. S.; Heckerman, D.; and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Conference on Uncertainty in Artificial Intelligence (UAI), 43–52.
[Chandar et al. 2016] Chandar, S.; Khapra, M. M.; Larochelle, H.; and Ravindran, B. 2016. Correlational neural networks. Neural Computation 28(2):257–285.
[Chang, Xiang, and Hospedales 2018] Chang, X.; Xiang, T.; and Hospedales, T. M. 2018. Scalable and effective deep CCA via soft decorrelation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Chen et al. 2018] Chen, Y.; Xie, X.; Lin, S.-D.; and Chiu, A. 2018. WSDM Cup 2018: Music recommendation and churn prediction. In ACM International Conference on Web Search and Data Mining (WSDM), 8–9.
[Eckart and Young 1936] Eckart, C., and Young, G. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218.
[Feizi et al. 2017] Feizi, S.; Makhdoumi, A.; Duffy, K.; Kellis, M.; and Medard, M. 2017. Network maximal correlation. IEEE Transactions on Network Science and Engineering 4(4):229–247.
[Frome et al. 2013] Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T.; et al. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NIPS), 2121–2129.
[Gebelein 1941] Gebelein, H. 1941. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 21(6):364–379.
[Guo et al. 2017] Guo, H.; Tang, R.; Ye, Y.; Li, Z.; and He, X. 2017. DeepFM: A factorization-machine based neural network for CTR prediction. In International Joint Conference on Artificial Intelligence (IJCAI), 1725–1731.
[He and Chua 2017] He, X., and Chua, T.-S. 2017. Neural factorization machines for sparse predictive analytics. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 355–364.
[Hirschfeld 1935] Hirschfeld, H. O. 1935. A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society 31(4):520–524.
[Hotelling 1936] Hotelling, H. 1936. Relations between two sets of variates. Biometrika 28(3/4):321–377.
[Huang et al. 2017] Huang, S.-L.; Makur, A.; Zheng, L.; and Wornell, G. W. 2017. An information-theoretic approach to universal feature selection in high-dimensional inference. In International Symposium on Information Theory (ISIT), 1336–1340.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 448–456.
[Kingma and Ba 2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[Pearson 1895] Pearson, K. 1895. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58:240–242.
[Rendle 2010] Rendle, S. 2010. Factorization machines. In IEEE International Conference on Data Mining (ICDM), 995–1000.
[Rényi 1959] Rényi, A. 1959. On measures of dependence. Acta Mathematica Hungarica 10(3-4):441–451.
[Sohn, Shang, and Lee 2014] Sohn, K.; Shang, W.; and Lee, H. 2014. Improved multimodal deep learning with variation of information. In Advances in Neural Information Processing Systems (NIPS), 2141–2149.
[Srivastava and Salakhutdinov 2012] Srivastava, N., and Salakhutdinov, R. R. 2012. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems (NIPS), 2222–2230.
[Vendrov et al. 2015] Vendrov, I.; Kiros, R.; Fidler, S.; and Urtasun, R. 2015. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361.
[Wang et al. 2015a] Wang, W.; Arora, R.; Livescu, K.; and Bilmes, J. A. 2015a. Unsupervised learning of acoustic features via deep canonical correlation analysis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4590–4594.
[Wang et al. 2015b] Wang, W.; Arora, R.; Livescu, K.; and Srebro, N. 2015b. Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 688–695. IEEE.
[Westbury 1994] Westbury, J. 1994. X-ray microbeam speech production database user's handbook. Waisman Center on Mental Retardation & Human Development, University of Wisconsin, Madison.
[Zhao, Hu, and Wang 2015] Zhao, L.; Hu, Q.; and Wang, W. 2015. Heterogeneous feature selection with multi-modal deep neural networks and sparse group lasso. IEEE Transactions on Multimedia 17(11):1936–1948.