Audiovisual Speech Synthesis Using Tacotron2

Ahmed H. Abdelaziz∗ and Anushree P. Kumar∗
Abstract
Keywords: Audiovisual speech, speech synthesis, Tacotron2, emotional speech synthesis, blendshape coefficients
∗ Ahmed H. Abdelaziz and Anushree P. Kumar have contributed equally.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
1 Introduction
speech quality, lip movement quality, facial expression, and emotion in speech. We use mean opinion
score (MOS) tests as well as AB tests to obtain absolute and relative scores for the two approaches.
We also evaluate the holistic perception of the talking face.
To summarize, the main contributions of this study are:
• We describe and evaluate two systems for emotion-aware audiovisual speech synthesis
based on Tacotron2: an end-to-end (AVTacotron2) and a modular approach. We show that
both systems generate close to human-like emotional audiovisual speech without requiring
post-processing.
• We apply these approaches to a generic blendshape model that allows for synthesizing a
variety of 3D face models.
• We define empirical evaluation tests that assess the quality of different aspects of the
synthesized speech as well as the overall quality. We compare the synthesis quality of the
proposed systems to the original recordings of a professional actress.
The rest of the paper is organized as follows: In Section 2, we discuss the related work. Section 3
describes the problem of the offline extraction of facial controls from video sequences. In Sec-
tion 4, we introduce the speech emotion recognition system used for extracting emotion embeddings.
Section 5 outlines the model architectures of the end-to-end approach for synthesizing audiovisual
speech from text. The modular approach is then described in Section 6. The experiments used to
evaluate the model performance and the results are described in Section 7. Finally, we conclude the
paper and give an outline of future work in Section 8.
2 Related work
Manually and Video-driven Avatars: Although sophisticated face and motion tracking systems are becoming increasingly common, production costs remain high. In many production houses today, high-fidelity speech animation is either created manually by an animator or captured from human actors using facial motion capture [4, 6, 14, 40, 42]. Both manual and performance-driven approaches require a skilled animator to precisely edit the resulting complex animation parameters.
Such approaches are extremely time consuming and expensive, especially in applications such as
computer generated movies, digital game production, and video conferencing, where talking faces
for tens of hours of dialogue are required. To solve this problem, speech- and text-driven talk-
ing faces can be used. In this study, we introduce new text-driven approaches for synthesizing
production-quality speech and facial animation with different styles of speech.
Face Representations: Face synthesis algorithms fall into two major categories:
photorealistic and parametric. In photorealistic synthesis, videos of the speakers are generated either
from text or audio inputs [20, 29, 34]. These techniques involve combining both image and speech
synthesis tasks.
In parametric face synthesis, a sequence of facial controllers is used to deform a neutral face model. Parametric approaches can operate either on 3D face models [3, 7] or on 2D images. For 2D face synthesis, active appearance models (AAMs) [9] are widely used. In this technique, a face is modeled using shape and appearance parameters. However, because non-linear variability is modeled by a linear, low-rank PCA model, details such as the regions near the mouth often appear blurry [12].
Our approach belongs to the parametric synthesis category, where a sequence of low-dimensional blendshape coefficient vectors is synthesized and applied to a wide range of 3D avatars. We use
a method similar to that in [39] for offline extraction of the ground truth blendshape coefficients,
which are used as facial controllers for the 3D face model.
Text-driven Avatars: Many approaches have been proposed for visual and audiovisual speech syn-
thesis from text. For visual speech synthesis, hidden Markov models (HMM) were used in earlier
approaches, e.g., in [13, 25, 36, 41]. In [25] and [36], HMM-based systems were used to synthesize
both acoustic and visual speech. Decision trees with HMMs have been used to create expressive
talking heads [3]. In [15], the Festival system was first used for speech synthesis, followed by a
text-to-visual-speech synthesis system and an audiovisual synchronizer. In [33], a learned phoneme-
to-viseme mapping was also used for visual speech synthesis. Recently, many deep-learning-based
approaches have been proposed for visual speech synthesis. In [23], the authors used a unit-selection-
based audiovisual speech synthesis system. In this approach, the output of a deep neural network
(DNN) was used as a likelihood of each audiovisual unit in an inventory. Dynamic programming
was then used to obtain the most likely trajectory of consecutive audiovisual units, which were later
used for synthesis. A neural-network-based approach was proposed in [32], where a fully connected feed-forward neural network was used to synthesize natural-looking speech animation that is coherent with the input text and speech. In [20], an LSTM architecture was used to convert the input text
into speech and corresponding coherent photo-realistic images of talking faces.
In contrast to these approaches, we use Tacotron2, a state-of-the-art architecture for acoustic speech
synthesis, for synthesizing coherent acoustic and visual speech. The end-to-end approach described
in this work synthesizes audiovisual speech in an end-to-end neural fashion using a single model.
Speech-driven Avatars: Voice Puppetry [5] is an early HMM-based approach for audio-driven
facial animation. This was followed by techniques based on decision trees [19], or deep bidirectional
LSTM [12] that outperformed HMM-based approaches. In [27], the authors proposed applying
an LSTM-based neural network directly to audio features. In [31], a deep neural network was
proposed to regress to a window of visual features from a sliding window of audio features. In
this work, a time-shift recurrent network trained on an unlabeled visual speech dataset produced
convincing results without having any dependencies on a speech recognition system. More recently,
convolutional neural networks (CNNs) were used in [18] to learn a mapping from speech to the
vertex coordinates of a 3D face model. Facial expressions and emotion were generated by a latent
code extracted from the network. In [43], the authors proposed VisemeNet, which consists of a three-stage LSTM network for lip sync. One stage predicts a sequence of phoneme groups given the audio, and another stage predicts the geometric locations of facial landmarks given the audio. The final stage uses the predicted phoneme groups and facial landmarks to estimate parameters and sparse speech motion curves.
Most of these systems synthesize visual controllers that deform a 3D face model that looks like
the speaker. In contrast, we predict a sequence of facial controllers that can be used across various
fictional and human-like characters. For our modular approach, we use a CNN-based speech-to-
animation model conditioned on emotion embeddings.
We represent the face using a linear blendshape model,
\[ \mathbf{b}(\mathbf{x}) = \mathbf{b}_0 + \mathbf{B}\mathbf{x}, \tag{1} \]
where b_0 is the neutral face mesh, the columns of the matrix B are additive displacements corresponding to a set of n = 51 blendshapes, and x ∈ [0, 1]^n are the weights applied to these blendshapes.
We construct a personalized model from a set of RGB-D sequences of fixed, predetermined prototypical expressions, such as neutral, mouth open, smile, anger, surprise, and disgust, captured while the head is slightly rotating (these expressions correspond to the matrix B in Eq. (1)), for the actress in our
dataset. In particular, we use an extension of the example-based facial rigging method of [21], where
we modify a generic blendshape model to best match the talent’s example facial expressions. We
improve registration accuracy over [21] by adding positional constraints to a set of 2D landmarks
detected around the main facial features in the RGB images using a CNN similarly to [16].
Given the personalized model, we automatically generate labels for the head motion and blendshape
coefficients (BSCs) by tracking every video frame of the subject. To do so, we rigidly align the
model to the depth maps using iterative closest point (ICP) with point-plane constraints. Then, we
solve for the BSCs, which modify the model non-rigidly to best explain the input data. In particular,
we use a point-to-plane fitting term on the depth map:
\[ D_i(\mathbf{x}) = \left( \mathbf{n}_i^{\top} \big( \mathbf{v}_i(\mathbf{x}) - \tilde{\mathbf{v}}_i \big) \right)^2, \tag{2} \]
where v_i(x) is the i-th vertex of the mesh, ṽ_i is its projection onto the depth map, and n_i is the corresponding surface normal. Additionally, we use a set of point-to-point fitting terms on the detected 2D
facial landmarks:
\[ L_j(\mathbf{x}) = \left\| \pi\big(\mathbf{v}_j(\mathbf{x})\big) - \mathbf{u}_j \right\|^2, \tag{3} \]
where u_j is the position of a detected landmark and π(v_j(x)) is the corresponding mesh vertex projected into camera space. The terms D_i and L_j are combined in the following optimization problem:
\[ \tilde{\mathbf{x}} = \operatorname*{arg\,min}_{\mathbf{x}} \; w_d \sum_i D_i(\mathbf{x}) + w_l \sum_j L_j(\mathbf{x}) + w_r \|\mathbf{x}\|_1, \tag{4} \]
where w_d, w_l, and w_r denote the respective weights given to the depth term, the landmark term, and an L1 regularization term that encourages a sparse solution. The minimization is carried
out using a solver based on the Gauss-Seidel method. The estimated blendshape coefficients x̃ in
(4) are used as ground-truth facial controls for training the text-to-animation and audio-to-animation
models in the subsequent sections.
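For illustration, the following NumPy sketch evaluates the objective in (4) for a given coefficient vector; the array layouts, the camera projection function, and all names are assumptions made for the example, not our implementation, which minimizes the objective with the Gauss-Seidel-based solver described above.

```python
# Illustrative evaluation of the fitting objective in (4); shapes, names, and
# the projection function are assumptions, not the Gauss-Seidel solver we use.
import numpy as np

def fitting_objective(x, b0, B, normals, depth_points, lm_idx, lm_uv,
                      project, w_d=1.0, w_l=1.0, w_r=0.1):
    """x: (n,) blendshape coefficients in [0, 1]; b0: (3V,); B: (3V, n)."""
    verts = (b0 + B @ x).reshape(-1, 3)              # deformed mesh, Eq. (1)

    # Point-to-plane depth terms D_i(x), Eq. (2)
    d = np.einsum('ij,ij->i', normals, verts - depth_points)
    depth_term = np.sum(d ** 2)

    # Point-to-point 2D landmark terms L_j(x), Eq. (3)
    lm_term = np.sum((project(verts[lm_idx]) - lm_uv) ** 2)

    # L1 regularization promotes sparse coefficient vectors
    return w_d * depth_term + w_l * lm_term + w_r * np.abs(x).sum()
```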
Figure 1: Speech emotion recognition neural network. The network is trained to classify neutral,
happy, and sad emotions using input acoustic features. Once the network is trained, the output of
the average pooling layer is used as a speech emotion representation.
[Figure: Conv1D → LSTM → FC → average pooling → softmax; the emotion embeddings are taken from the average-pooling output.]
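A minimal sketch of a classifier of the form shown in Figure 1 is given below; the 64-dimensional embedding size follows the text, whereas the convolution and LSTM widths, the mel-filterbank input, and all names are assumptions.

```python
# Hedged sketch of the speech emotion classifier in Figure 1
# (Conv1D -> LSTM -> FC -> average pool -> softmax); widths other than the
# 64-d embedding are assumptions.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=64, n_emotions=3):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 256, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(256, 128, batch_first=True)
        self.fc = nn.Linear(128, emb_dim)
        self.classifier = nn.Linear(emb_dim, n_emotions)

    def forward(self, mels):                 # mels: (batch, frames, n_mels)
        h = torch.relu(self.conv(mels.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)
        h = torch.relu(self.fc(h))
        embedding = h.mean(dim=1)            # average pool over time -> 64-d
        logits = self.classifier(embedding)  # softmax applied inside the CE loss
        return embedding, logits
```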
Figure 2: Tacotron2 architecture for emotional audiovisual speech synthesis. The network’s encoder
converts an input sequence of phonemes and a compact representation of a predefined emotion into
a sequence of abstract vector representations of the same length. An attention mechanism is used to
focus on a subset of the encoded vectors at each time step. The concatenation of the weighted sum
of the encoder outputs (the context vector) and a pre-processed regressor output vector is used as the input to the decoder. The decoder output is used to regress to visual and acoustic features that are finally post-processed to give cleaner final features.
[Figure: input phonemes (e.g., HH AH L OW W ER L D) and emotion embeddings → Encoder → Attention → Decoder (with Pre-net feedback) → Regressor → Postnet; outputs are mel-scaled filterbanks, blendshape coefficients, and an endpointer signal. “=” denotes concatenation.]
Figure 2 shows the entire architecture of AVTacotron2. Although the network is trained end-to-end,
it can be divided into four major functional blocks: a phoneme encoder, an attention block, a decoder,
and a regressor. In the following sections, these four blocks are described in detail.
The function of the phoneme encoder in the Tacotron2 architecture is to transform a sequence of
phonemes into an intermediate representation, from which acoustic and visual features can be gener-
ated. As shown in Figure 3, the input to the encoder is a sequence of M phoneme indices, which are
transformed into 512-dimensional feature vectors using an embedding layer. As described in [26],
Figure 3: Emotion-conditioned phoneme encoder. The phoneme embeddings are processed by a set
of convolution and bi-directional LSTM layers. The emotion embedding is repeated and concatenated
with the processed phoneme embeddings to give the encoder outputs.
[Figure: phoneme sequence (HH AH L OW W ER L D) → embedding layer → encoder CNN → BLSTM → concatenation with the repeated emotion embedding → encoder output.]
the M × 512 output matrix of the embedding layer is passed through 3 convolutional layers, each
of which has 512 feature maps with kernel size of 5 × 1 and ReLU activations. Each convolutional
layer is followed by a drop-out layer with drop-out probability of 0.5 and batch normalization. The
convolutional layers model the N-gram dependencies between consecutive phonemes [26]. For the
longer term dependencies, the output of the final convolutional layer is processed by a bi-directional
long short-term memory (BLSTM) layer with 512 units in each direction. The concatenation of the forward and backward LSTM outputs gives the M × 1024 intermediate representation of the M input phonemes. The 64-dimensional emotion embedding is repeated and concatenated with the processed phoneme embeddings to give the encoder outputs. During training, the emotion embedding is extracted from the target utterance itself. During inference, it is extracted from a reference utterance that has the required level of emotion. The emotionally conditioned phoneme representation is used
to generate the audio-visual outputs in the later stages of the network, as will be described in the
following sections.
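The emotion-conditioned encoder described above can be sketched in PyTorch as follows; the hyper-parameters follow the text, while the module and variable names are illustrative and input masking/padding is omitted.

```python
# Minimal sketch of the emotion-conditioned phoneme encoder: embedding layer,
# three 1-D convolutions, a BLSTM, and concatenation with a repeated 64-d
# emotion embedding. Names are assumptions.
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes, emb_dim=512, emo_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, emb_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv1d(emb_dim, 512, kernel_size=5, padding=2),
                          nn.ReLU(), nn.Dropout(0.5), nn.BatchNorm1d(512))
            for _ in range(3)])
        self.blstm = nn.LSTM(512, 512, batch_first=True, bidirectional=True)

    def forward(self, phonemes, emotion):
        # phonemes: (batch, M) indices; emotion: (batch, 64)
        h = self.embedding(phonemes).transpose(1, 2)   # (batch, 512, M)
        for conv in self.convs:
            h = conv(h)
        h, _ = self.blstm(h.transpose(1, 2))           # (batch, M, 1024)
        emo = emotion.unsqueeze(1).expand(-1, h.size(1), -1)
        return torch.cat([h, emo], dim=-1)             # (batch, M, 1088)
```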
5.3 Decoder
The decoder stage of the Tacotron2 architecture is composed of two 1024-dimensional uni-
directional LSTM layers. The decoder iteratively generates an embedding, from which all audio,
video, and auxiliary information can be estimated. The input to the decoder at time-frame t is the concatenation of the context vector at time-frame t and a pre-processed version of the regressor output from the previous time-frame t − 1. Pre-processing of the regressor outputs is done by the so-called Pre-net, which
is composed of two 256-dimensional fully-connected layers with ReLU activations. Since the de-
coder generates embeddings that are asynchronous with the encoder output, the number of output embeddings N_decoder is independent of the encoder output length M.
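A single decoder step can be sketched as follows; the attention mechanism is omitted, and the dimensionalities of the fed-back regressor output and of the context vector (here matching the 1088-dimensional encoder outputs of the sketch above) are assumptions.

```python
# Sketch of one decoder step; feedback and context dimensions are assumptions.
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, prev_dim=264, ctx_dim=1088, prenet_dim=256, lstm_dim=1024):
        super().__init__()
        # Pre-net: two 256-d fully connected layers with ReLU activations
        self.prenet = nn.Sequential(nn.Linear(prev_dim, prenet_dim), nn.ReLU(),
                                    nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        # Two 1024-d uni-directional LSTM layers
        self.lstm1 = nn.LSTMCell(prenet_dim + ctx_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)

    def forward(self, prev_output, context, state1, state2):
        x = torch.cat([self.prenet(prev_output), context], dim=-1)
        h1, c1 = self.lstm1(x, state1)
        h2, c2 = self.lstm2(h1, state2)
        return h2, (h1, c1), (h2, c2)   # h2: decoder embedding for this step
```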
5.4 Regressor
Each embedding extracted from the decoder encodes all the information needed to synthesize one or more (n) acoustic and visual frames. We use n = 2 in this study, which means that the length of the synthesized acoustic and visual feature sequences is 2N_decoder. The regressor is simply a linear projection
of the embeddings from the decoder output onto the acoustic, the visual, and the auxiliary feature
spaces. The acoustic feature space is the 80-dimensional MFBs. The visual feature space is the 51-
dimensional BSCs. The auxiliary feature is a one-dimensional end-pointing feature that determines
the end of decoding.
The output MFB features go through a post-processing network (post-net) that is composed of five
1D convolutional layers followed by a linear fully-connected layer. The convolutional layers have
512 filters of kernel size 5. The first four convolutional layers use (tanh) non-linear activations,
whereas the final one has no non-linearity. A drop-out layer with drop-out probability of 0.5 is
added after every convolutional layer. The final fully connected layer has 80 units, which is the
same dimension as the regression acoustic feature space. A skip connection is added from the input
of the post-net to its output, which means the post-net output represents the estimation residual. The
residual is estimated with access to future frames, which makes the overall post-net output more
accurate than the linearly projected MFB features.
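The regressor and post-net can be sketched as follows; the layer sizes follow the text, whereas the ordering of the concatenated outputs and the names are assumptions.

```python
# Sketch of the regressor (linear projection) and the residual post-net;
# output ordering and module names are assumptions.
import torch
import torch.nn as nn

N_FRAMES, N_MEL, N_BSC = 2, 80, 51   # frames per decoder step, MFBs, BSCs

class Regressor(nn.Module):
    def __init__(self, dec_dim=1024):
        super().__init__()
        self.proj = nn.Linear(dec_dim, N_FRAMES * (N_MEL + N_BSC + 1))

    def forward(self, dec_emb):                       # (batch, T, dec_dim)
        y = self.proj(dec_emb).view(dec_emb.size(0), -1, N_MEL + N_BSC + 1)
        mel, bsc, end = y[..., :N_MEL], y[..., N_MEL:-1], y[..., -1]
        return mel, bsc, end                          # each over 2T frames

class PostNet(nn.Module):
    def __init__(self):
        super().__init__()
        layers, in_ch = [], N_MEL
        for i in range(5):                            # five 1-D convolutions
            layers.append(nn.Conv1d(in_ch, 512, kernel_size=5, padding=2))
            if i < 4:
                layers.append(nn.Tanh())              # tanh on the first four
            layers.append(nn.Dropout(0.5))
            in_ch = 512
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Linear(512, N_MEL)               # back to 80 dimensions

    def forward(self, mel):                           # (batch, 2T, N_MEL)
        h = self.convs(mel.transpose(1, 2)).transpose(1, 2)
        return mel + self.fc(h)                       # skip connection: residual
```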
The network is trained to minimize a weighted sum of four loss functions.
\[ L_{\mathrm{total}} = \omega_{\mathrm{mel}} L_{\mathrm{mel}} + \omega_{\mathrm{post}} L_{\mathrm{post}} + \omega_{\mathrm{end}} L_{\mathrm{end}} + \omega_{\mathrm{bsc}} L_{\mathrm{bsc}}. \tag{5} \]
Lmel is the L1 loss (mean absolute error) that is applied to the mel-scaled filter-bank features that
are output from the linear projection layer. Lpost and Lend are the L2 (mean square error) losses that
are applied to the output of the post-processing subnetwork and the end-pointing output respectively.
For the BSCs, we use the L2 loss function L_bsc. The weights ω_mel, ω_post, ω_end, and ω_bsc are empirically tuned to give the best synthesis performance. In our experiments, we use 0.9 for ω_mel, 1 for ω_post, 0.1 for ω_end, and 0.9 for ω_bsc.
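A sketch of the weighted loss in (5), using the weights reported above, is given below; the tensor names are assumptions.

```python
# Hedged sketch of the weighted training loss in Eq. (5).
import torch.nn.functional as F

def avtacotron2_loss(mel_pred, mel_post, end_pred, bsc_pred,
                     mel_true, end_true, bsc_true,
                     w_mel=0.9, w_post=1.0, w_end=0.1, w_bsc=0.9):
    l_mel = F.l1_loss(mel_pred, mel_true)          # L1 on pre-post-net MFBs
    l_post = F.mse_loss(mel_post, mel_true)        # L2 on post-net output
    l_end = F.mse_loss(end_pred, end_true)         # L2 on end-pointer output
    l_bsc = F.mse_loss(bsc_pred, bsc_true)         # L2 on blendshape coefficients
    return w_mel * l_mel + w_post * l_post + w_end * l_end + w_bsc * l_bsc
```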
Table 1: The architecture of the audio-to-facial-animation network.
Type #Filters Kernel Stride/padd. Output Act.
Conv. 128 3x3 1x1/same 21x40x128 RELU
Conv. 64 3x3 2x2/same 11x20x64 RELU
Conv. 64 3x3 1x1/same 11x20x64 RELU
Conv. 64 3x3 2x2/same 6x10x64 RELU
Conv. 64 3x3 1x1/valid 4x8x64 RELU
Conv. 64 3x3 1x1/valid 2x6x64 RELU
FC 512 - - 1x1x512 RELU
FC+embed 128 - - 1x1x128 RELU
FC 51 - - 1x1x51 None
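A sketch of the network in Table 1 is given below; the interpretation of the 21x40 input as a window of acoustic features and the point at which the 64-dimensional emotion embedding is concatenated (the "FC+embed" row) are assumptions based on the table and the surrounding text.

```python
# PyTorch sketch of the audio-to-facial-animation network in Table 1; the input
# interpretation and the emotion-embedding concatenation point are assumptions.
import torch
import torch.nn as nn

class AudioToAnimation(nn.Module):
    def __init__(self, emo_dim=64, n_bsc=51):
        super().__init__()
        self.convs = nn.Sequential(                              # input: (B, 1, 21, 40)
            nn.Conv2d(1, 128, 3, stride=1, padding=1), nn.ReLU(),    # 21x40x128
            nn.Conv2d(128, 64, 3, stride=2, padding=1), nn.ReLU(),   # 11x20x64
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),    # 11x20x64
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),    # 6x10x64
            nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU(),    # 4x8x64
            nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU())    # 2x6x64
        self.fc1 = nn.Linear(2 * 6 * 64, 512)
        self.fc2 = nn.Linear(512 + emo_dim, 128)                 # "FC+embed" row
        self.out = nn.Linear(128, n_bsc)                         # 51 BSCs, no activation

    def forward(self, feature_window, emotion):
        h = self.convs(feature_window).flatten(1)
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(torch.cat([h, emotion], dim=-1)))
        return self.out(h)
```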
Table 2: The phoneme-viseme mapping used in our work and the viseme importance used to weight
the loss of the audio-to-facial-animation network. The weights are the normalized accuracy of an
in-house video-based viseme classifier. The silence weight is heuristically set to zero.
In contrast to the end-to-end system, the audio-to-facial-animation network does not have access
to the entire phoneme sequence. In order to compensate for that, the BSC loss Lbsc in the audio-
to-facial-animation network is modified. For each entry in a batch, the loss function is weighted
differently. The weights of each entry are proportional to the importance of the corresponding
spoken viseme (visual phoneme). With these weights, the network is penalized more for inference errors related to important visemes, such as /p/ and /w/, and penalized less for errors related to less important visemes, such as /t/ and /g/. Table 2 shows the viseme importance weights, which are the classification accuracies of a video-based viseme classifier normalized by the maximum accuracy, which happens to be that of the viseme /P/. The silence weight, which actually has the highest
classification accuracy, is overridden and set to zero. Using these weights improves the performance
of the audio-to-facial-animation network. We use a hybrid loss of L1 and the cosine distance (see
[2]), which we found to be more effective than the L1 and L2 for small, sparse, and bounded labels,
such as the BSCs. We have not used the hybrid loss in Equation 5, as the L2 was easier to balance
with the other losses.
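A sketch of this viseme-weighted hybrid loss is given below; the per-entry weighting granularity and the equal balance between the L1 and cosine-distance terms are assumptions.

```python
# Hedged sketch of the viseme-weighted hybrid (L1 + cosine-distance) BSC loss.
import torch
import torch.nn.functional as F

def viseme_weighted_bsc_loss(bsc_pred, bsc_true, viseme_weights, alpha=1.0):
    """bsc_pred/bsc_true: (batch, 51); viseme_weights: (batch,) importance of
    the viseme spoken in each entry (zero for silence)."""
    l1 = torch.abs(bsc_pred - bsc_true).mean(dim=-1)              # per-entry L1
    cos = 1.0 - F.cosine_similarity(bsc_pred, bsc_true, dim=-1)   # cosine distance
    return (viseme_weights * (l1 + alpha * cos)).mean()
```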
Table 3: Emotion classification performance from the Emotion Recognition System.
A high quality multimodal corpus was collected for training all of our neural networks. The dataset
consists of approximately 10 hours of multimodal data captured from a professional actress: 7
hours of neutral speech (∼7270 sentences), 1.5 hours (∼1430 sentences) of acted happy speech, and 1.5 hours (∼1430 sentences) of acted sad speech. The dataset also contains 74 minutes (∼300 sentences) of common phrases spoken in neutral, happy, and sad emotional states. The text corpus was balanced to maximize phonetic coverage of US English phoneme pairs. For each utterance in the corpus, single-channel 44.1 kHz audio, 60 frames per second (fps) 1920x1080 RGB video, and 30 fps depth signals were captured.
Figure 4: t-SNE projection of the validation set emotion embeddings extracted using the speech emotion recognition system.
[Figure: three clusters corresponding to happy, sad, and neutral embeddings.]
Figure 5: Grading template for the overall experience of the talking face
[Template contents: graders are asked to release the task if they lack headphones, there is background noise, they have a hearing impairment, or they cannot play the samples; the instructions emphasize that the systems, not the graders, are being evaluated. For each video, labeled with its intended emotion, graders rate how well the facial expression matches the emotion and how natural the talking face is overall on five-point scales, and may leave comments.]
The graders provided five mean opinion scores (MOSs) for the five questions described above. The
scores range from (1) bad (no match) to (5) excellent (perfect match). Figure 5 shows the grading
template used for evaluation.
We used 30 videos, half of which are synthesized with happy emotion and the other half with sad
emotions. The MOS tests were conducted on each of the following sets of data: videos generated by the modular approach, videos generated by the end-to-end approach, and the original (ground truth) recordings. We also used the acoustic speech synthesized by the end-to-end approach to drive the audio-to-facial-animation module in the modular approach to ensure a fair comparison. In this way, the graders hear the same speech with different lip movements from the two systems.
For all tests, the graders gave high scores (between good and excellent) as shown in Figures 6-10.
All results are significant according to the Mann-Whitney significance test. Figures 6-10 show that
Figure 6: Mean opinion scores comparing the quality of the lip movements synthesized by the end-
to-end (E2E) and modular approaches to the ground truth (GT).
[MOS: E2E 4.3, Modular 4.1, GT 4.3.]
Figure 7: Mean opinion scores comparing the quality of the facial expressions synthesized by the
end-to-end (E2E) and modular approaches to the ground truth (GT).
[MOS: E2E 4.3, Modular 4.1, GT 4.3.]
the end-to-end approach gives better and more consistent performance (i.e., smaller variance) than the modular approach in all aspects. Furthermore, the performance of the end-to-end approach is almost as good as the ground truth. Figures 6-7 show the results of grading the visual aspects of the synthesized talking face, i.e., lip movements and facial expressions. The ground truth results
depicted in Figures 6-7 are also an indication of the performance of the offline blendshape estimation
algorithms discussed in Section 3. The ground truth BSCs are estimated from RGB+depth images
and the quality of these inputs relies on many factors, such as the head pose and lighting conditions.
The outliers in the ground truth results in Figures 6, 7 and 12 are explained by cases where the BSC
estimation may have failed when such conditions were not optimal. Neural networks, on the other hand, when trained on these data, learn to filter out such outliers and infer the conditional mean.
Therefore, the end-to-end approach shows a more consistent performance in Figures 6-7. In Figure
7, the larger variance of the modular approach synthesis is explained by the audio-to-animation
model falling short in synthesizing reasonable emotional facial expressions from speech.
Figure 8: Mean opinion scores (MOS) comparing the synthesis quality of acoustic speech synthe-
sized by the end-to-end (E2E) approach to the ground truth (GT). Visual aspects affect the MOS of the modular approach although the same synthesized speech is used for both tests.
[MOS: E2E 4.4, Modular 4.3, GT 4.4.]
Figure 9: Mean opinion scores comparing the quality of the emotions in speech synthesized by the end-to-end (E2E) approach to the ground truth (GT). Visual aspects affect the MOS of the modular approach although the same synthesized speech is used for both tests. Graders preferred the synthesized speech emotions, which were milder than the exaggerated acted emotions in the original recordings.
[MOS: E2E 4.6, Modular 4.3, GT 4.4.]
An interesting outcome from this study was that although the graders were instructed to completely
ignore visual aspects when they grade the acoustic speech synthesis, the graders seemed to get
unconsciously affected by the other aspects. This effect can be seen in the evaluation of the acoustic
speech synthesis in Figures 8 and 9. Although the speech synthesized by the end-to-end approach was the same speech used as input to the audio-to-animation model, the graders gave better speech synthesis and speech emotion scores to the talking faces generated using the end-to-end approach.
Another interesting result is shown in Figure 9, where graders gave higher scores for the emotional
speech synthesized from the end-to-end approach over the original recordings (GT). The reason for
this is that in the original recordings, the actress exaggerated happy and sad emotions in certain cases.
We noticed that the emotions synthesized from the end-to-end approach are less exaggerated than the
Figure 10: Mean opinion scores comparing the quality of the holistic (overall) experience of the
emotional talking faces synthesized by the end-to-end (E2E) and modular (Mod) approaches to the
ground truth (GT).
[MOS: E2E 4.1, Modular 3.9, GT 4.1.]
Figure 11: Results of the AB test comparing the quality of the avatar's lip movements driven by the end-to-end and the modular approaches in an isolated setup.
[Grader preference: end-to-end 44%, modular 23%, no preference 33%.]
original recordings. This effect is discussed more in the upcoming sections. The graders preferred
less exaggerated happy and sad emotions in comparison to the overacted ones in the original dataset.
In order to evaluate the performance of the end-to-end and modular approaches in terms of acoustic
and visual speech synthesis in a more independent manner, we introduce two additional subjective
tests described in the subsequent sections that evaluate visual and acoustic speech synthesis perfor-
mance isolated from all other aspects.
Visual Speech Synthesis We have conducted an AB test to compare the performance of the end-to-
end approach and the modular approach in terms of the synthesized lip movements (visual speech),
i.e. how naturally the avatar’s lip movements follow acoustic speech. In order for the graders to focus
only on the quality of the synthesized visual speech, we have used a neutral emotion embedding to
synthesize neutral speech. We have also eliminated (zeroed out) the controls of the upper half of the face, such as eyebrow movements and blinking. Human graders were given 50 pairs of videos that correspond to sentences that were not used during training. The graders were asked which video matches the speech more naturally. Graders were advised to focus only on the lip movements and
to provide comments as to why they prefer their chosen video. To prevent display ordering effects,
the order of the videos in a given pair is randomized. In total, 30 graders evaluated 50 videos.
Figure 11 shows that the graders preferred the performance of the end-to-end system over that of the modular one. This result is significant according to the Mann-Whitney significance test. The binary result of the AB test does not assess the absolute quality of the visual speech synthesized by either approach. To get an absolute quality score for each of the two systems, we have conducted two additional five-point-scale MOS tests, one for each approach, using a setup similar to that of the AB test. The MOS test results depicted in Figure 12 show that while the end-to-end approach is better than the modular approach, both systems generate very good lip movements.
Figure 12: Mean opinion scores of the visual speech synthesis of the end-to-end (E2E) and the
modular (Mod) approaches in an isolated setup.
[MOS: E2E 4.3, Modular 4.2, GT 4.4.]
Figure 13: Mean opinion scores of the acoustic speech synthesis of the audiovisual (AV) and audio-
only (A) synthesis systems that use Tacotron2 architecture.
[MOS: AV 3.9.]
7.2.2.3 Acoustic Speech Synthesis
It is important to assess whether the quality of the synthesized acoustic speech is affected by adding the facial control estimation task to the Tacotron2 model. To
assess this, we have conducted audio-only MOS tests for neutral speech, where graders were asked
to primarily focus on assessing the quality of the synthesis. The graders were given 50 pairs of audio
signals. Each pair consisted of a reference recording from the actor and a synthesized audio signal
of the same phrase. The graders were asked to evaluate the synthesis quality given the reference
signals on a five point scale, where one corresponds to poor quality and five to excellent quality.
The MOS test was conducted one time with the audiovisual Tacotron2 and a second time with the
conventional audio-only Tacotron2.
The results of the two MOS tests are shown in Figure 13. These results are significant according to the Mann-Whitney significance test. Although the graders were explicitly instructed to focus only on the synthesis quality and not on the style of speech in the synthesized signal, the results and subsequent comments from the graders indicated that they were unconsciously affected by the speaking style. The lower MOS of AVTacotron2 in Figure 13 is heavily influenced by the fact that the prosody of the synthesized speech does not always match the prosody of the reference signal. As shown in Figure 13, this was not the case for the traditional audio-only Tacotron2, where the prosody of the synthesis matches that of the reference signals, resulting in a higher MOS. The imperfectly matching prosody is also the reason why the graders preferred the milder emotional speech synthesized by AVTacotron2 over the overacted original recordings.
8 Conclusion
In this paper, we have described and compared two systems for audiovisual speech synthesis us-
ing Tacotron2. The first system is the AVTacotron2, which is an end-to-end system that takes an
input text and synthesizes acoustic and visual speech. The second modular system uses the con-
ventional audio-only Tacotron2 to synthesize acoustic speech from text. The synthesized speech
is used as input to a speech-to-animation module that generates the corresponding facial controls.
Emotion embeddings from a speech emotion classifier are used to allow both systems to synthe-
size emotional audiovisual speech. Both approaches are evaluated using a comprehensive subjective
evaluation scheme. The results show that both systems produce synthesized talking faces that are
almost as natural as the ground truth ones. Both approaches generate production-quality speech and
facial animation that can be applied to a variety of fictional and human-like face models without
post-processing. The end-to-end approach outperforms the modular approach in almost all aspects.
However, the additional task of the facial control estimation in the end-to-end approach can some-
times lead to a mismatch between the prosody of the synthesized acoustic speech and the original
recordings.
In order to enhance the experience of the talking faces, our ongoing work involves estimating head pose in addition to lip movements and facial expressions. We are currently investigating
the impact of the hybrid loss function on AVTacotron2 and estimating optimal weighting factors for
audio and visual losses. We are also working on improving the prosody of the synthesized acous-
tic speech from AVTacotron2. Finally, we are investigating video-based and audiovisual-based emotion embeddings as alternatives to the audio-only embeddings used here.
9 Acknowledgments
The authors are grateful to Oggi Rudovic, Barry-John Theobald, and Javier Latorre Martinez for
their valuable comments.
References
[1] A. Abdelaziz, B. Theobald, J. Binder, G. Fanelli, P. Dixon, N. Apostoloff, T. Weise, and S. Ka-
jareker. Speaker-independent speech-driven visual speech synthesis using domain-adapted
acoustic models. In International Conference on Multimodal Interaction, pages 220–225,
2019.
[2] Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Ka-
jarekar, Devang Naik, and Ahmed Hussen Abdelaziz. Self-supervised learning of visual speech
features with audiovisual speech enhancement. arXiv preprint arXiv:2004.12031, 2020.
[3] Robert Anderson, Björn Stenger, Vincent Wan, and Roberto Cipolla. An expressive text-driven
3d talking head. In ACM SIGGRAPH 2013 Posters, pages 1–1. 2013.
[4] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W Sumner, and Markus Gross. High-quality passive facial performance capture using
anchor frames. In ACM SIGGRAPH 2011 papers, pages 1–10. 2011.
[5] M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics
and interactive techniques, pages 21–28. ACM Press/Addison-Wesley Publishing Co., 1999.
[6] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial perfor-
mance capture. ACM Transactions on Graphics (ToG), 34(4):1–9, 2015.
[7] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation
with image-based dynamic avatars. ACM Transactions on Graphics, 35(4), 2016.
[8] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In Advances in neural information processing
systems, pages 577–585, 2015.
[9] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models.
IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.
[10] Jack Chilton Cotton. Normal visual hearing. Science, 1935.
[11] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of
Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[12] B. Fan, L. Wang, F. Soong, and L. Xie. Photo-real talking head with deep bidirectional LSTM.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 4884–4888. IEEE, 2015.
[13] S. Fu, R. Gutierrez-Osuna, A. Esposito, P. Kakumanu, and O. Garcia. Audio/visual mapping
with cross-modal hidden Markov models. IEEE Transactions on Multimedia, 7(2):243–252,
2005.
[14] Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, Paul Graham, Koki Nagano,
Jay Busch, and Paul Debevec. Driving high-resolution facial blendshapes with video perfor-
mance capture. In ACM SIGGRAPH 2013 Talks, pages 1–1. 2013.
[15] Udit Kumar Goyal, Ashish Kapoor, and Prem Kalra. Text-to-audiovisual speech synthesizer.
In International Conference on Virtual Worlds, pages 256–269. Springer, 2000.
[16] Zhenliang H., Meina K., Jie J., Xilin C., and Shiguang S. A fully end-to-end cascaded CNN
for facial landmark detection. In IEEE International Conference on Automatic Face & Gesture
Recognition, pages 200–207, 2017.
[17] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward
Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu.
Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
[18] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint
end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94,
2017.
[19] T. Kim, Y. Yue, S. Taylor, and I. Matthews. A decision tree framework for spatiotemporal
sequence prediction. In Proceedings of the 21th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 577–586. ACM, 2015.
[20] Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Brébisson, and Yoshua Bengio.
Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442, 2017.
[21] H. Li, T. Weise, and M. Pauly. Example-based facial rigging. In ACM SIGGRAPH, pages
32:1–32:6. ACM, 2010.
[22] Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746–
748, 1976.
[23] Jonathan Parker, Ranniery Maia, Yannis Stylianou, and Roberto Cipolla. Expressive visual text
to speech and expression adaptation using deep neural networks. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4920–4924. IEEE,
2017.
[24] John Sabini and Maury Silver. Ekman’s basic emotions: Why not love and jealousy? Cognition
& Emotion, 19(5):693–712, 2005.
[25] Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura.
HMM-based text-to-audio-visual speech synthesis. In Sixth International Conference on Spo-
ken Language Processing, 2000.
[26] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang,
Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
[27] T. Shimba, R. Sakurai, H. Yamazoe, and J. Lee. Talking heads synthesis from audio with deep
neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII),
pages 100–105. IEEE, 2015.
[28] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J
Weiss, Rob Clark, and Rif A Saurous. Towards end-to-end prosody transfer for expressive
speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047, 2018.
[29] Linsen Song, Wayne Wu, Chen Qian, and Chen Change Loy. Everybody's Talkin': Let me talk as you want. arXiv preprint, 2020.
[30] Lorenzo Tarantino, Philip N Garner, and Alexandros Lazaridis. Self-attention for speech emo-
tion recognition. Proc. Interspeech 2019, pages 2578–2582, 2019.
[31] S. Taylor, A. Kato, B. Milner, and I. Matthews. Audio-to-visual speech conversion using deep
neural networks. In Interspeech. International Speech Communication Association, 2016.
[32] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. Rodriguez, J. Hodgins, and I. Matthews.
A deep learning approach for generalized speech animation. ACM Transactions on Graphics
(TOG), 36(4):93, 2017.
[33] S. Taylor, M. Mahler, B. Theobald, and I. Matthews. Dynamic units of visual speech. In
Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages
275–284. Eurographics Association, 2012.
[34] Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner.
Neural voice puppetry: Audio-driven facial reenactment. arXiv preprint, 2019.
[35] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[36] Lijuan Wang, Xiaojun Qian, Lei Ma, Yao Qian, Yining Chen, and Frank K Soong. A real-time
text to audio-visual speech synthesis system. In Ninth Annual Conference of the International
Speech Communication Association, 2008.
[37] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end
speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
[38] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying
Xiao, Fei Ren, Ye Jia, and Rif A Saurous. Style tokens: Unsupervised style modeling, control
and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.
[39] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In
ACM transactions on graphics (TOG), volume 30, page 77. ACM, 2011.
[40] Yanlin Weng, Chen Cao, Qiming Hou, and Kun Zhou. Real-time facial animation on mobile
devices. Graphical Models, 76(3):172–179, 2014.
[41] L. Xie and Z. Liu. Realistic mouth-synching for speech-driven talking face using articulatory
modelling. IEEE Transactions on Multimedia, 9(3):500–510, 2007.
[42] Li Zhang, Noah Snavely, Brian Curless, and Steven M Seitz. Spacetime faces: High-resolution
capture for modeling and animation. In Data-Driven 3D Facial Animation, pages 248–276.
Springer, 2008.
[43] Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan
Singh. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on
Graphics (TOG), 37(4):1–10, 2018.