Audiovisual Speech Synthesis using Tacotron2

Ahmed Hussen Abdelaziz* (Apple, Cupertino, CA)
Anushree Prasanna Kumar* (Apple, Cupertino, CA)
Chloe Seivwright (Apple, London)
Gabriele Fanelli (Apple, Zurich)
Justin Binder (Apple, Cupertino, CA)
Yannis Stylianou (Apple, London)
Sachin Kajarekar (Apple, Cupertino, CA)

arXiv:2008.00620v1 [eess.AS] 3 Aug 2020

Abstract

Audiovisual speech synthesis is the problem of synthesizing a talking face while maximizing the coherency of the acoustic and visual speech. In this paper, we propose and compare two audiovisual speech synthesis systems for 3D face models. The first system is the AVTacotron2, which is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture. AVTacotron2 converts a sequence of phonemes representing the sentence to synthesize into a sequence of acoustic features and the corresponding controllers of a face model. The output acoustic features are used to condition a WaveRNN to reconstruct the speech waveform, and the output facial controllers are used to generate the corresponding video of the talking face. The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2. The reconstructed acoustic speech signal is then used to drive the facial controls of the face model using an independently trained audio-to-facial-animation neural network. We further condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody to generate emotional audiovisual speech. We analyze the performance of the two systems and compare them to the ground truth videos using subjective evaluation tests. The end-to-end and modular systems are able to synthesize close to human-like audiovisual speech with mean opinion scores (MOS) of 4.1 and 3.9, respectively, compared to a MOS of 4.1 for the ground truth generated from professionally recorded videos. While the end-to-end system gives a better overall quality, the modular approach is more flexible, and the quality of acoustic speech and visual speech synthesis is almost independent of each other.

Keywords: Audiovisual speech, speech synthesis, Tacotron2, emotional speech synthesis, blendshape coefficients

Ahmed H. Abdelaziz and Anushree P. Kumar have contributed equally

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without
notice, after which this version may no longer be accessible.
1 Introduction

Human perception of speech is inherently bimodal (audiovisual). In face-to-face conversations, vision improves the intelligibility of speech through reading the lips of the talker. Such forms of visual hearing are exploited not only by hearing-impaired people, but also by all individuals [10]. The presumed biological link between perception and production makes bimodal speech production essential in generating realistic talking characters. Thus, audiovisual speech synthesis involves synthesizing visual controllers for a face model that are coherent with the synthesized acoustic speech units. This is extremely challenging, as we humans are highly sensitive to any subtle alteration or poor synchronization between lip movements (visual speech) and the accompanying acoustic speech. Poor visual speech synthesis can negatively influence the intelligibility of what is being spoken [22], even if the accompanying synthesized acoustic speech is relatively natural. Additionally, since the same speech units can be uttered in different ways depending on the underlying facial expressions, it becomes essential to synthesize plausible facial expressions along with acoustic speech and lip movements to increase the naturalness of visual speech synthesis.
Recent advances in neural text-to-speech (TTS) systems have led to synthesizing close-to-human-like speech. One of the widely used neural TTS architectures is Tacotron2 [26]. In this paper, we describe two approaches for audiovisual speech synthesis using this model. The first one, which we call AVTacotron2, is an end-to-end approach, where facial controls along with the speech's spectral features are synthesized directly from text. The resulting spectral features are then used to reconstruct the speech signal using WaveRNN [17]. The facial controllers generated by AVTacotron2 are used to animate a 3D face model. Finally, the video of the talking face is generated by combining the synthesized acoustic and visual information. The second approach is modular, where we use the Tacotron2 system in a regular fashion to synthesize acoustic speech from text. Then, we use an independent speech-to-animation module that generates the corresponding facial controls from the synthesized speech. Speech signal reconstruction and video generation are done in the same way as in the end-to-end approach.
Both the end-to-end and modular systems preserve the synergy between the synthesized acoustic and visual speech. Also, both approaches synthesize talking faces that can be directly deployed in computer graphics applications, such as computer-generated movies, video games, video conferencing, and virtual agents in virtual and augmented reality, without any manual post-processing. The modular approach provides more flexibility, as the audio-to-animation module can be easily trained to generalize across many speakers [1]. It can then be conveniently used with any voice and/or language synthesized by any text-to-speech system. Also, the quality of the acoustic and visual speech synthesis is almost decoupled. This means each of the modules can be optimized separately without compromising the quality of the other module. On the other hand, the end-to-end solution consists of a single neural network that automatically captures the correlation between the audio and visual cues. Our results show that it also gives better overall synthesis quality compared to the modular system.
In this work, we represent the space of facial motion using generic blendshapes. The benefit of using
such a facial representation is twofold: 1) It provides a semantic meaning of the facial controls,
i.e., the blendshape coefficients, which makes them easy to analyze and post-process for further
improvements. 2) The blendshape coefficients can be applied to a variety of fictional and human-
like face models.
Synthesizing human-like audiovisual speech is more complicated than just producing raw acoustic and visual speech. Modulating the synthesized speech with the corresponding prosody [28] and facial expressions is a crucial quality factor for audiovisual speech synthesis. Recently, many significant breakthroughs in style modeling have been proposed in the speech synthesis realm [38]. In this study, we address the synthesis of easy-to-define styles of speech, namely emotions. We present a supervised approach for emotion-aware audiovisual speech synthesis. In particular, we condition both the end-to-end and modular approaches on emotion embeddings that encode the required prosody of audiovisual emotional speech. The emotion embeddings are extracted from a standalone speech emotion classifier.
Evaluating audio-only and audiovisual speech synthesis systems is challenging, as their quality needs to be assessed using human perception. In this paper, we evaluate the two proposed systems using various subjective tests. Different tests focus on different aspects of the synthesis, such as acoustic speech quality, lip movement quality, facial expression, and emotion in speech. We use mean opinion score (MOS) tests as well as AB tests to obtain absolute and relative scores for the two approaches. We also evaluate the holistic perception of the talking face.
To summarize, the main contributions of this study are:
• We describe and evaluate two systems for emotion-aware audiovisual speech synthesis
based on Tacotron2: an end-to-end (AVTacotron2) and a modular approach. We show that
both systems generate close to human-like emotional audiovisual speech without requiring
post-processing.
• We apply these approaches to a generic blendshape model that allows for synthesizing a
variety of 3D face models.
• We define empirical evaluation tests that assess the quality of different aspects of the synthesized speech as well as the overall quality, and we compare the synthesis quality of the proposed systems to the original recordings of a professional actress.
The rest of the paper is organized as follows: In Section 2, we discuss the related work. Section 3 describes the problem of the offline extraction of facial controls from video sequences. In Section 4, we introduce the speech emotion recognition system used for extracting emotion embeddings. Section 5 outlines the model architecture of the end-to-end approach for synthesizing audiovisual speech from text. The modular approach is then described in Section 6. The experiments used to evaluate the model performance and the results are described in Section 7. Finally, we conclude the paper and give an outline of future work in Section 8.

2 Related work
Manually and Video-driven Avatars: Although there has been an increasing trend toward sophisticated face and motion tracking systems, the cost of production still remains high. In many production houses today, high-fidelity speech animation is either created manually by an animator or by using facial motion capture of human actors [4, 6, 14, 40, 42]. Both manual and performance-driven approaches require a skilled animator to precisely edit the resulting complex animation parameters. Such approaches are extremely time consuming and expensive, especially in applications such as computer-generated movies, digital game production, and video conferencing, where talking faces for tens of hours of dialogue are required. To solve this problem, speech- and text-driven talking faces can be used. In this study, we introduce new text-driven approaches for synthesizing production-quality speech and facial animation with different styles of speech.
Face Representations: Face synthesis algorithms can be divided into two major categories: photorealistic and parametric. In photorealistic synthesis, videos of the speakers are generated either from text or audio inputs [20, 29, 34]. These techniques combine both image and speech synthesis tasks.
In parametric face synthesis, a sequence of facial controllers is used to deform a neutral face model. The parametric approaches can either operate on 3D face models [3, 7] or on 2D images. For 2D face synthesis, active appearance models (AAMs) [9] are widely used. In this technique, a face is modeled using shape and appearance parameters. However, due to modeling non-linear variability with a linear PCA model and the low-rank approximations, details such as the regions near the mouth often appear blurry [12].
Our approach belongs to the parametric synthesis category, where a sequence of low-dimensional blendshape coefficient vectors is synthesized and applied to a wide range of 3D avatars. We use a method similar to that in [39] for offline extraction of the ground truth blendshape coefficients, which are used as facial controllers for the 3D face model.
Text-driven Avatars: Many approaches have been proposed for visual and audiovisual speech synthesis from text. For visual speech synthesis, hidden Markov models (HMMs) were used in earlier approaches, e.g., in [13, 25, 36, 41]. In [25] and [36], HMM-based systems were used to synthesize both acoustic and visual speech. Decision trees with HMMs have been used to create expressive talking heads [3]. In [15], the Festival system was first used for speech synthesis, followed by a text-to-visual-speech synthesis system and an audiovisual synchronizer. In [33], a learned phoneme-to-viseme mapping was also used for visual speech synthesis. Recently, many deep-learning-based approaches have been proposed for visual speech synthesis. In [23], the authors used a unit-selection-based audiovisual speech synthesis system. In this approach, the output of a deep neural network (DNN) was used as a likelihood of each audiovisual unit in an inventory. Dynamic programming was then used to obtain the most likely trajectory of consecutive audiovisual units, which were later used for synthesis. A more neural approach was proposed in [32], where a fully connected feed-forward neural network was used to synthesize natural-looking speech animation that is coherent with the input text and speech. In [20], an LSTM architecture was used to convert the input text into speech and corresponding coherent photo-realistic images of talking faces.
In contrast to these approaches, we use Tacotron2, a state-of-the-art architecture for acoustic speech
synthesis, for synthesizing coherent acoustic and visual speech. The end-to-end approach described
in this work synthesizes audiovisual speech in an end-to-end neural fashion using a single model.
Speech-driven Avatars: Voice Puppetry [5] is an early HMM-based approach for audio-driven facial animation. This was followed by techniques based on decision trees [19] or deep bidirectional LSTMs [12] that outperformed HMM-based approaches. In [27], the authors proposed applying an LSTM-based neural network directly to audio features. In [31], a deep neural network was proposed to regress to a window of visual features from a sliding window of audio features. In this work, a time-shift recurrent network trained on an unlabeled visual speech dataset produced convincing results without any dependency on a speech recognition system. More recently, convolutional neural networks (CNNs) were used in [18] to learn a mapping from speech to the vertex coordinates of a 3D face model. Facial expressions and emotion were generated by a latent code extracted from the network. In [43], the authors proposed VisemeNet, which consists of a three-stage LSTM network for lip sync. One stage predicts a sequence of phoneme groups given the audio, and another stage predicts the geometric locations of facial landmarks given the audio. The final stage learns to use the predicted phoneme groups and facial landmarks to estimate the parameters and sparse speech motion curves.
Most of these systems synthesize visual controllers that deform a 3D face model that looks like
the speaker. In contrast, we predict a sequence of facial controllers that can be used across various
fictional and human-like characters. For our modular approach, we use a CNN-based speech-to-
animation model conditioned on emotion embeddings.

3 Estimation of the Ground Truth Blendshape Coefficients


One of the most challenging problems in facial animation is finding the ground-truth controllers of the face model. Manually labeling the facial controls for each time frame in a dataset large enough to train a neural network is impractical, as annotation is subjective, time consuming, and expensive. Thus, ground truth facial controls are usually estimated algorithmically. In this study, we use an extension of the method described in [39] to automatically estimate the facial controls from RGB-D video streams.
We represent the space of facial expressions, including those caused by speech, using a low-dimensional generic blendshape model inspired by Ekman's Facial Action Coding System [11]. Such a model generates a mesh corresponding to a specific facial expression as

v(x) = b0 + Bx, (1)

where b0 is the neutral face mesh, the columns of the matrix B are additive displacements corresponding to a set of n = 51 blendshapes, and x ∈ [0, 1]^n is the vector of weights applied to these blendshapes.
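As a concrete illustration, the following minimal NumPy sketch evaluates the blendshape model in Equation (1) on a toy mesh; the array shapes and the helper name blendshape_mesh are our own and are not taken from the paper.

```python
import numpy as np

def blendshape_mesh(b0, B, x):
    """Evaluate v(x) = b0 + B x for a generic blendshape model.

    b0: (3V,) stacked xyz coordinates of the neutral face mesh.
    B:  (3V, n) additive displacements, one column per blendshape.
    x:  (n,) blendshape coefficients, each clipped to [0, 1].
    """
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, 1.0)
    return b0 + B @ x

# Toy example with V = 4 vertices and n = 51 blendshapes.
rng = np.random.default_rng(0)
b0 = rng.standard_normal(12)           # neutral mesh (4 vertices * 3 coordinates)
B = 0.01 * rng.standard_normal((12, 51))
x = np.zeros(51)
x[0] = 0.8                             # activate a single blendshape at 80%
v = blendshape_mesh(b0, B, x)
print(v.shape)                         # (12,)
```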
We construct a personalized model from a set of RGB-D sequences of fixed, predetermined prototypical expressions, such as neutral, mouth open, smile, anger, surprise, and disgust, captured while the head is slightly rotating (these expressions correspond to the columns of the matrix B above), for the actress in our dataset. In particular, we use an extension of the example-based facial rigging method of [21], where we modify a generic blendshape model to best match the talent's example facial expressions. We improve registration accuracy over [21] by adding positional constraints on a set of 2D landmarks detected around the main facial features in the RGB images using a CNN, similar to [16].
Given the personalized model, we automatically generate labels for the head motion and blendshape
coefficients (BSCs) by tracking every video frame of the subject. To do so, we rigidly align the

model to the depth maps using iterative closest point (ICP) with point-plane constraints. Then, we
solve for the BSCs, which modify the model non-rigidly to best explain the input data. In particular,
we use a point-to-plane fitting term on the depth map:
Di(x) = (ni⊤ (vi(x) − v̄i))^2, (2)

where vi(x) is the i-th vertex of the mesh, v̄i is its projection onto the depth map, and ni is the corresponding surface normal. Additionally, we use a set of point-to-point fitting terms on the detected 2D
Lj(x) = ||π(vj(x)) − uj||^2, (3)

where uj is the position of a detected landmark and π(vj(x)) is the corresponding mesh vertex projected into camera space. The terms Di and Lj are combined in the following optimization
problem:

x̃ = argmin_x ( wd Σi Di(x) + wl Σj Lj(x) + wr ||x||1 ), (4)
where wd, wl, and wr are the respective weights given to the depth term, the landmark term, and an L1 regularization term that promotes a sparse solution. The minimization is carried out using a solver based on the Gauss-Seidel method. The estimated blendshape coefficients x̃ in (4) are used as ground-truth facial controls for training the text-to-animation and audio-to-animation models in the subsequent sections.
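The sketch below only illustrates the structure of the objective in Equation (4) under simplifying assumptions: the correspondences, the projection function proj, and all variable names are placeholders, and a generic bound-constrained L-BFGS-B solver stands in for the Gauss-Seidel solver used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_blendshapes(b0, B, normals, depth_targets, proj, landmarks, lm_vertex_ids,
                    wd=1.0, wl=1.0, wr=0.1):
    """Minimize wd*sum_i Di(x) + wl*sum_j Lj(x) + wr*||x||_1 over x in [0, 1]^n.

    b0: (V, 3) neutral mesh vertices; B: (3V, n) blendshape displacements.
    normals, depth_targets: per-vertex normals and depth-map correspondences, (V, 3).
    proj: function mapping a 3D vertex to 2D camera/image coordinates.
    landmarks: detected 2D landmarks (J, 2); lm_vertex_ids: their mesh vertex indices.
    """
    V = b0.shape[0]
    n = B.shape[1]

    def objective(x):
        verts = b0 + (B @ x).reshape(V, 3)                      # v(x), Eq. (1)
        d = np.sum(normals * (verts - depth_targets), axis=1)   # point-to-plane residuals
        depth_term = np.sum(d ** 2)                             # sum_i Di(x), Eq. (2)
        lm_res = np.array([proj(verts[i]) for i in lm_vertex_ids]) - landmarks
        lm_term = np.sum(lm_res ** 2)                           # sum_j Lj(x), Eq. (3)
        sparsity = np.sum(x)                                    # ||x||_1 for x >= 0
        return wd * depth_term + wl * lm_term + wr * sparsity

    x0 = np.zeros(n)
    res = minimize(objective, x0, method="L-BFGS-B", bounds=[(0.0, 1.0)] * n)
    return res.x
```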

4 Speech Emotion Features


Emotions are increasingly required in visual speech synthesis systems, since they enhance the user experience and make the interaction more natural. Adding emotions to the synthesized acoustic and visual speech requires conditioning the neural network on some representation of the required emotion. In contrast to the style transfer approaches in [38] and [28], which have recently been successful at learning the speech style/emotion in an unsupervised manner, we chose to exploit emotion labels to train a separate neural network in a supervised way to classify emotions from speech. We consider the penultimate layer of the speech emotion classifier as a fine-grained vector representation of the emotion. Three basic emotion categories [24] are investigated in this paper: neutral, positive (happy), and negative (sad).
Figure 1 shows the architecture of the speech emotion classifier. In this approach, 40-dimensional mel-scaled filter bank speech features are first passed through a 1D convolutional layer, which contains 100 filters with a kernel size of 3 × 1. The output of this convolutional layer is then fed into a single long short-term memory (LSTM) layer with 64 cells, followed by a linear layer of the same dimension. A mean pooling layer is then used to summarize the LSTM outputs across time. A softmax layer is finally applied to the average pooling output to predict the emotion class.
After the model is trained, the output of the penultimate layer, i.e., the layer before the softmax layer, is used as the speech emotion embedding. These embeddings are used to condition the audiovisual speech synthesis systems during training, as described in the upcoming sections. During inference, we use the speech emotion classifier to extract the emotion embedding from a reference utterance that carries the desired emotion for the synthesized talking face. The audiovisual speech synthesizer is conditioned on that emotion embedding to synthesize speech and facial movements that match the emotion.
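The following PyTorch sketch mirrors the classifier described above (Conv1D, LSTM, linear layer, average pooling, and a softmax classification head) and shows how the mean-pooled activation could serve as the 64-dimensional emotion embedding; the activation choices and the explicit linear classification head are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class SpeechEmotionClassifier(nn.Module):
    """Sketch of the emotion classifier: Conv1D -> LSTM -> FC -> mean pool -> softmax.

    Input: (batch, frames, 40) mel-scaled filter bank features.
    The mean-pooled 64-dimensional activation is used as the emotion embedding.
    """

    def __init__(self, n_mels=40, n_emotions=3):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 100, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(100, 64, batch_first=True)
        self.fc = nn.Linear(64, 64)
        self.classifier = nn.Linear(64, n_emotions)   # softmax is applied in the loss

    def forward(self, feats):
        x = torch.relu(self.conv(feats.transpose(1, 2))).transpose(1, 2)  # (B, T, 100)
        x, _ = self.lstm(x)                                               # (B, T, 64)
        x = self.fc(x)                                                    # (B, T, 64)
        embedding = x.mean(dim=1)                                         # average pool over time
        logits = self.classifier(embedding)
        return logits, embedding

# Usage: extract an emotion embedding from a reference utterance.
model = SpeechEmotionClassifier()
mel = torch.randn(1, 200, 40)          # 200 frames of 40-dim filter banks
_, emotion_embedding = model(mel)      # (1, 64) conditioning vector
```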

5 End-to-end Audiovisual Speech Synthesis


Deep neural networks, in particular sequence-to-sequence architectures [26, 37], have led to speech synthesizers producing almost human-like speech. Tacotron2 [26] is one of the most successful sequence-to-sequence architectures for speech synthesis. The Tacotron2 architecture takes as input a phoneme sequence of length M that represents the pronunciation of the sentence to be synthesized. As output, Tacotron2 generates a sequence of 80-dimensional mel-scaled filter bank (MFB) features. The MFB features are finally used as input to another sequence-to-sequence model, e.g., WaveRNN [17], to generate the samples of the synthesized speech signal.
In this paper, we adopt the Tacotron2 architecture for synthesizing not only the audio features, but
also all 51 BSCs that control speech- and non-speech-related facial expressions.

Figure 1: Speech emotion recognition neural network. The network is trained to classify neutral,
happy, and sad emotions using input acoustic features. Once the network is trained, the output of
the average pooling layer is used as a speech emotion representation.

[Figure 1 diagram: Conv1D → LSTM → FC → Average Pool → Softmax, with the emotion embeddings taken from the average pooling output.]

Figure 2: Tacotron2 architecture for emotional audiovisual speech synthesis. The network's encoder converts an input sequence of phonemes and a compact representation of a predefined emotion into a sequence of abstract vector representations of the same length. An attention mechanism is used to focus on a subset of the encoded vectors at each time step. The concatenation of the weighted sum of the encoder outputs (the context vector) and a pre-processed regressor output vector is used as the input to the decoder. The decoder output is used to regress to visual and acoustic features that are finally post-processed to give cleaner final features.
[Figure 2 diagram: phoneme sequence (HH AH L OW W ER L D) and emotion embeddings → Encoder → Attention → Decoder (with Pre-net feedback from the regressor) → Regressor → Postnet, producing blendshape coefficients, mel-scaled filterbanks, and an endpointer output.]

Figure 2 shows the entire architecture of AVTacotron2. Although the network is trained end-to-end,
it can be divided into four major functional blocks: a phoneme encoder, an attention block, a decoder,
and a regressor. In the following sections, these four blocks are described in detail.

5.1 Text Encoder

The function of the phoneme encoder in the Tacotron2 architecture is to transform a sequence of phonemes into an intermediate representation, from which acoustic and visual features can be generated. As shown in Figure 3, the input to the encoder is a sequence of M phoneme indices, which are transformed into 512-dimensional feature vectors using an embedding layer. As described in [26],
Figure 3: Emotion-conditioned phoneme encoder. The phoneme embeddings are processed by a set of convolutional and bi-directional LSTM layers. The emotion embedding is repeated and cascaded with the processed phoneme embeddings to give the encoder outputs.

Encoder
output
=
= Emotion
= embedding

BLSTM

Encoder

CNN

Embedding layer

HH AH L OW W ER L D
the M × 512 output matrix of the embedding layer is passed through 3 convolutional layers, each of which has 512 feature maps with a kernel size of 5 × 1 and ReLU activations. Each convolutional layer is followed by a drop-out layer with drop-out probability of 0.5 and batch normalization. The convolutional layers model the N-gram dependencies between consecutive phonemes [26]. For longer-term dependencies, the output of the final convolutional layer is processed by a bi-directional long short-term memory (BLSTM) layer with 512 units in each direction. The concatenation of the bi-directional LSTM outputs gives the M × 1024 intermediate representation of the M input phonemes. The 64-dimensional emotion embedding is repeated and cascaded with the processed phoneme embeddings to give the encoder outputs. During training, the emotion embedding is extracted from the target utterance itself. During inference, the emotion embedding is extracted from a reference utterance that has the required level of emotion. The emotionally conditioned phoneme representation is used to generate the audiovisual outputs in the later stages of the network, as described in the following sections.
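A minimal PyTorch sketch of the emotion-conditioned encoder described above is given below; the ordering of batch normalization and dropout, and the phoneme vocabulary size, are assumptions.

```python
import torch
import torch.nn as nn

class EmotionConditionedEncoder(nn.Module):
    """Sketch of the AVTacotron2 phoneme encoder: embedding -> 3 conv layers ->
    BLSTM (512 units per direction) -> concatenation with the repeated 64-dim
    emotion embedding, giving an M x 1088 encoder output."""

    def __init__(self, n_phonemes=100, emb_dim=512, emotion_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, emb_dim)
        convs = []
        for _ in range(3):
            convs += [nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                      nn.BatchNorm1d(emb_dim), nn.ReLU(), nn.Dropout(0.5)]
        self.convs = nn.Sequential(*convs)
        self.blstm = nn.LSTM(emb_dim, 512, batch_first=True, bidirectional=True)

    def forward(self, phonemes, emotion_embedding):
        # phonemes: (B, M) int64 indices, emotion_embedding: (B, 64)
        x = self.embedding(phonemes)                        # (B, M, 512)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)   # (B, M, 512)
        x, _ = self.blstm(x)                                # (B, M, 1024)
        e = emotion_embedding.unsqueeze(1).expand(-1, x.size(1), -1)
        return torch.cat([x, e], dim=-1)                    # (B, M, 1088)

encoder = EmotionConditionedEncoder()
out = encoder(torch.randint(0, 100, (1, 8)), torch.randn(1, 64))
print(out.shape)  # torch.Size([1, 8, 1088])
```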

5.2 Attention mechanism

A location-sensitive attention [8] is applied to the M × 1088 phoneme intermediate representation that is output by the phoneme encoder. The output of the attention mechanism is called the context vector and is estimated as a weighted sum of the M intermediate representations. The attention and decoding blocks are applied iteratively. Therefore, the attention weights (alignments) at time frame t are determined by how strongly the intermediate representations correlate with the decoder output of time frame t − 1 and with a processed version of the alignments estimated at time frame t − 1. The latter gives rise to the name location-sensitive attention, where the processed alignments from the previous time frames help the model keep track of its previous location and hence encourage it to process future frames without unreasonable loops or skips. The alignments of the previous time step are processed by a 1D convolutional layer with 32 feature maps and a kernel size of 31. The correlation between the encoder output feature vectors, the decoder output of the previous time frame, and the processed alignments of the previous time frame is estimated using a weighted fully connected layer with tanh activations, followed by a softmax layer for normalization, as described in [26]. The context vector has 1088 dimensions and is used as the input to the decoder, which is the next stage of the network.
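The following sketch shows one common way to implement such location-sensitive attention; the attention dimensionality (128 here) and the layer names are assumptions, while the 32 location filters of kernel size 31 follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Minimal sketch of location-sensitive attention over the M x 1088 encoder outputs.

    The previous alignments are filtered by a 1D convolution (32 filters, kernel
    size 31) and combined with the processed query and memory; a softmax gives
    the new alignments, and the context is their weighted sum over the memory.
    """

    def __init__(self, enc_dim=1088, query_dim=1024, attn_dim=128,
                 n_filters=32, kernel_size=31):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim, bias=False)
        self.memory_layer = nn.Linear(enc_dim, attn_dim, bias=False)
        self.location_conv = nn.Conv1d(1, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_layer = nn.Linear(n_filters, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, memory, prev_alignments):
        # query: (B, query_dim) decoder state, memory: (B, M, enc_dim),
        # prev_alignments: (B, M) attention weights from time frame t - 1.
        loc = self.location_conv(prev_alignments.unsqueeze(1)).transpose(1, 2)  # (B, M, 32)
        energies = self.v(torch.tanh(self.query_layer(query).unsqueeze(1)
                                     + self.memory_layer(memory)
                                     + self.location_layer(loc))).squeeze(-1)   # (B, M)
        alignments = F.softmax(energies, dim=-1)
        context = torch.bmm(alignments.unsqueeze(1), memory).squeeze(1)         # (B, enc_dim)
        return context, alignments
```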

5.3 Decoder

The decoder stage of the Tacotron2 architecture is composed of two 1024-dimensional uni-directional LSTM layers. The decoder iteratively generates an embedding, from which all audio, video, and auxiliary information can be estimated. The input to the decoder at time frame t is the concatenation of the context vector at time frame t and a pre-processed version of the regressor output of the previous time frame t − 1. Preprocessing of the regressor outputs is done by the so-called Pre-net, which is composed of two 256-dimensional fully connected layers with ReLU activations. Since the decoder generates embeddings that are asynchronous with the encoder output, the number of output embeddings Ndecoder is independent of the encoder output length M.
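A sketch of a single decoder iteration is shown below; it assumes the Pre-net consumes the previous mel-scaled output frame (the exact slice of the regressor output that is fed back is not specified above) and that the context vector is concatenated at the input of the first LSTM cell.

```python
import torch
import torch.nn as nn

class Prenet(nn.Module):
    """Two 256-dimensional fully connected layers with ReLU, as described above."""
    def __init__(self, in_dim, sizes=(256, 256)):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

class DecoderStep(nn.Module):
    """One decoder iteration: Pre-net(previous output) + context -> two 1024-d LSTM cells."""
    def __init__(self, prev_out_dim=80, context_dim=1088, hidden=1024):
        super().__init__()
        self.prenet = Prenet(prev_out_dim)
        self.lstm1 = nn.LSTMCell(256 + context_dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden, hidden)

    def forward(self, prev_output, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_output), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        return h2, ((h1, c1), (h2, c2))   # h2 is the decoder embedding for this step
```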

5.4 Regressor

The embeddings extracted from the decoder encode all the information needed to synthesize n acoustic and visual frames per decoder step. We use n = 2 in this study, which means that the length of the synthesized acoustic and visual feature sequences is 2Ndecoder. The regressor is simply a linear projection of the embeddings from the decoder output onto the acoustic, visual, and auxiliary feature spaces. The acoustic feature space is the 80-dimensional MFBs. The visual feature space is the 51-dimensional BSCs. The auxiliary feature is a one-dimensional end-pointing feature that determines the end of decoding.
The output MFB features go through a post-processing network (post-net) that is composed of five 1D convolutional layers followed by a linear fully connected layer. The convolutional layers have 512 filters with a kernel size of 5. The first four convolutional layers use tanh activations, whereas the final one has no non-linearity. A drop-out layer with drop-out probability of 0.5 is added after every convolutional layer. The final fully connected layer has 80 units, which is the same dimension as the acoustic feature space. A skip connection is added from the input of the post-net to its output, which means the post-net output represents an estimation residual. The residual is estimated with access to future frames, which makes the overall post-net output more accurate than the linearly projected MFB features.
The network is trained to minimize a weighted sum of four loss functions:
Ltotal = ωmel Lmel + ωpost Lpost + ωend Lend + ωbsc Lbsc. (5)
Lmel is the L1 loss (mean absolute error) applied to the mel-scaled filter bank features that are output from the linear projection layer. Lpost and Lend are the L2 (mean squared error) losses applied to the output of the post-processing subnetwork and the end-pointing output, respectively. For the BSCs, we use the L2 loss function Lbsc. The weights ωmel, ωpost, ωend, and ωbsc are empirically tuned to give the best synthesis performance. In our experiments, we use 0.9 for ωmel, 1 for ωpost, 0.1 for ωend, and 0.9 for ωbsc.
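The total loss in Equation (5) can be written compactly as follows; the dictionary keys and tensor shapes are illustrative, and the default weights are the values reported above.

```python
import torch
import torch.nn.functional as F

def avtacotron2_loss(pred, target, w_mel=0.9, w_post=1.0, w_end=0.1, w_bsc=0.9):
    """Weighted sum of the four losses in Equation (5).

    pred/target are dicts holding the linearly projected mels ('mel'), the
    post-net refined mels ('mel_post'), the end-pointer ('end'), and the
    blendshape coefficients ('bsc'); key names are placeholders.
    """
    l_mel = F.l1_loss(pred["mel"], target["mel"])            # L1 on projected mels
    l_post = F.mse_loss(pred["mel_post"], target["mel"])     # L2 on post-net output
    l_end = F.mse_loss(pred["end"], target["end"])           # L2 on end-pointer
    l_bsc = F.mse_loss(pred["bsc"], target["bsc"])           # L2 on blendshape coefficients
    return w_mel * l_mel + w_post * l_post + w_end * l_end + w_bsc * l_bsc
```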

6 Modular Audiovisual Speech Synthesis


We adopt the Tacotron2 architecture in its standard audio-only version and feed the synthesized
audio after reconstructing it from the predicted spectral features, e.g., using WaveRNN, into an
audio-to-facial-animation module. On one hand, the AVTacotron2 system could give more accurate
estimates of the lip movements since the embeddings extracted from the decoder are rich in acoustic
and lexical information. On the other hand, the modular system is more flexible, as the audio-to-
facial-animation network can be trained in a speaker-independent manner and used with any pre-
trained text-to-speech voice. This would reduce the need for collecting hours of multimodal data for
different speakers, which is necessary to train the AVTacotron2 model.
We have trained a CNN-based speaker-dependent audio-to-facial-animation neural network. The network architecture is shown in Table 1. The input acoustic features to the audio-to-facial-animation network are 40-dimensional filter bank features with a context window of 21 frames (10 past, the current, and 10 future frames). The acoustic features are extracted from the wav files. The acoustic speech synthesis is completely decoupled from the visual speech synthesis, so any text-to-speech system can be used. The ground truth BSCs described in Section 3 are used to train the network. The input to the penultimate fully connected layer is the concatenation of the output of the preceding fully connected (FC) layer, which encodes the audio features, with the emotion embedding.

Table 1: The architecture of the audio-to-facial-animation network.
Type      #Filters  Kernel  Stride/padding  Output     Activation
Conv.     128       3x3     1x1/same        21x40x128  ReLU
Conv.     64        3x3     2x2/same        11x20x64   ReLU
Conv.     64        3x3     1x1/same        11x20x64   ReLU
Conv.     64        3x3     2x2/same        6x10x64    ReLU
Conv.     64        3x3     1x1/valid       4x8x64     ReLU
Conv.     64        3x3     1x1/valid       2x6x64     ReLU
FC        512       -       -               1x1x512    ReLU
FC+embed  128       -       -               1x1x128    ReLU
FC        51        -       -               1x1x51     None
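A PyTorch sketch of the Table 1 architecture is shown below; the padding values are chosen to reproduce the listed output shapes, and the 64-dimensional emotion embedding size follows Section 4.

```python
import torch
import torch.nn as nn

class AudioToFacialAnimation(nn.Module):
    """Sketch of the Table 1 CNN: a 21x40 filter-bank window -> 51 blendshape coefficients.

    The 64-dimensional emotion embedding is concatenated with the 512-d FC
    output before the 128-d FC layer, as described in Section 6.
    """

    def __init__(self, emotion_dim=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 128, 3, stride=1, padding=1), nn.ReLU(),   # 21x40x128
            nn.Conv2d(128, 64, 3, stride=2, padding=1), nn.ReLU(),  # 11x20x64
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(),   # 11x20x64
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),   # 6x10x64
            nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU(),   # 4x8x64
            nn.Conv2d(64, 64, 3, stride=1, padding=0), nn.ReLU(),   # 2x6x64
        )
        self.fc1 = nn.Linear(2 * 6 * 64, 512)
        self.fc2 = nn.Linear(512 + emotion_dim, 128)
        self.out = nn.Linear(128, 51)

    def forward(self, feats, emotion_embedding):
        # feats: (B, 21, 40) window of filter banks, emotion_embedding: (B, 64)
        x = self.convs(feats.unsqueeze(1)).flatten(1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(torch.cat([x, emotion_embedding], dim=-1)))
        return self.out(x)   # 51 blendshape coefficients, no output activation

bsc = AudioToFacialAnimation()(torch.randn(2, 21, 40), torch.randn(2, 64))
print(bsc.shape)  # torch.Size([2, 51])
```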

Table 2: The phoneme-viseme mapping used in our work and the viseme importance used to weight the loss of the audio-to-facial-animation network. The weights are the normalized accuracies of an in-house video-based viseme classifier. The silence weight is heuristically set to zero.

Viseme cluster                Viseme  Phonemes                        Weight
Bilabial                      /P/     /p/ /b/ /m/                     1.0
Labio-dental                  /F/     /f/ /v/                         0.97
Palato-alveolar               /SH/    /sh/ /zh/ /ch/ /jh/             0.75
Dental                        /TH/    /th/ /dh/                       0.66
Alveolar fricative            /Z/     /z/ /s/                         0.66
Lip-rounded vowels level 2    /V2/    /uw/ /uh/ /ow/ /w/              0.6
Lip-rounded vowels level 1    /V1/    /aa/ /ah/ /ao/ /aw/ /er/ /oy/   0.59
Lip-stretched vowels level 1  /V3/    /ae/ /eh/ /ey/ /ay/ /y/         0.58
Alveolar semivowels           /L/     /l/ /el/ /r/                    0.5
Lip-stretched vowels level 2  /V4/    /ih/ /iy/                       0.48
Velar                         /G/     /g/ /ng/ /k/ /hh/               0.46
Alveolar                      /T/     /t/ /d/ /n/ /en/                0.36
Silence                       /SIL/   /sil/ /sp/                      0.0

In contrast to the end-to-end system, the audio-to-facial-animation network does not have access to the entire phoneme sequence. To compensate for that, the BSC loss Lbsc in the audio-to-facial-animation network is modified. For each entry in a batch, the loss function is weighted differently. The weights of each entry are proportional to the importance of the corresponding spoken viseme (visual phoneme). With these weights, the network is penalized more for inference errors related to important visemes, such as /p/ and /w/, and penalized less for errors related to less important visemes, such as /t/ and /g/. Table 2 shows the viseme importance/weights used, which are the classification accuracies of a video-based viseme classifier normalized by the maximum accuracy, which happens to be that of the viseme /P/. The silence weight, which actually has the highest classification accuracy, is overridden and set to zero. Using these weights improves the performance of the audio-to-facial-animation network. We use a hybrid loss of L1 and the cosine distance (see [2]), which we found to be more effective than the L1 and L2 for small, sparse, and bounded labels such as the BSCs. We did not use the hybrid loss in Equation 5, as the L2 was easier to balance with the other losses.
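A sketch of the viseme-weighted hybrid loss is shown below; how the L1 and cosine-distance terms are balanced (alpha and beta here) is not specified above and is therefore an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_hybrid_bsc_loss(pred_bsc, target_bsc, viseme_weights, alpha=1.0, beta=1.0):
    """Sketch of the viseme-weighted hybrid loss for the modular system's BSCs.

    pred_bsc/target_bsc: (B, 51) blendshape coefficients per frame.
    viseme_weights: (B,) per-frame weights from Table 2 (e.g. 1.0 for /P/, 0.0 for silence).
    alpha/beta balance the L1 and cosine-distance terms and are assumptions here.
    """
    l1 = torch.abs(pred_bsc - target_bsc).mean(dim=-1)                 # per-frame L1
    cos = 1.0 - F.cosine_similarity(pred_bsc, target_bsc, dim=-1)      # per-frame cosine distance
    per_frame = alpha * l1 + beta * cos
    return (viseme_weights * per_frame).mean()
```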

Table 3: Emotion classification performance from the Emotion Recognition System.

Emotion Precision (%) Recall (%) F1-Score (%)


happy 95.0 96.0 96.0
sad 99.0 100.0 99.0
neutral 99.0 99.0 99.0
Accuracy (%) 97.8

7 Experiments and Results


7.1 Dataset

A high-quality multimodal corpus was collected for training all of our neural networks. The dataset consists of approximately 10 hours of multimodal data captured from a professional actress: 7 hours of neutral speech (∼7270 sentences), 1.5 hours (∼1430 sentences) of acted happy speech, and 1.5 hours (∼1430 sentences) of acted sad speech. The dataset also contains 74 minutes of ∼300 common phrases spoken in neutral, happy, and sad emotional states. The text corpus was balanced to maximize the phonetic coverage of US English phoneme pairs. For each utterance in the corpus, single-channel 44.1 kHz audio, 60 frames per second (fps) 1920x1080 RGB video, and 30 fps depth signals were captured.

7.2 Experimental Setup and Results

7.2.1 Speech Emotion Recognition


For training the speech emotion recognition neural network, the corpus is randomly split into emotion-stratified partitions using an 80/20 rule. The resulting splits consist of 8116 utterances for training, 2029 utterances for validation, and 50 utterances for testing. As described in Section 4, 40-dimensional mel-scaled filter banks are used as input acoustic features. The network is trained to optimize a cross-entropy loss using the Adam optimizer. We use a learning rate of 0.001 for a maximum of 20 epochs. We monitor the validation performance during training and apply early stopping when the validation loss converges. Since the data is class-imbalanced, we penalize the classification errors of each class in inverse proportion to the class probability.
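One way to realize this inverse-class-probability weighting is a weighted cross-entropy, sketched below; the exact weighting formula, the approximate class counts, and the reuse of the SpeechEmotionClassifier sketch from Section 4 are assumptions.

```python
import torch
import torch.nn as nn

# Per-class weights inversely proportional to the class priors (neutral is the
# majority class in the ~10 h corpus; happy/sad are ~1.5 h each).
class_counts = torch.tensor([7270.0, 1430.0, 1430.0])   # neutral, happy, sad (approximate)
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)
model = SpeechEmotionClassifier()                        # classifier sketch from Section 4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a random mini-batch.
mel, labels = torch.randn(8, 200, 40), torch.randint(0, 3, (8,))
optimizer.zero_grad()
logits, _ = model(mel)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```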
Figure 4 shows a t-SNE projection [35] of the emotion embeddings for the test set, clustered based on the emotion labels. As shown, the emotion embeddings are well separated, which is reflected in the high classification accuracy shown in Table 3. Although the performance of state-of-the-art speech emotion recognition systems in the wild does not reach such high accuracies [30], the high classification performance shown in Table 3 was expected, as the emotions are acted. In contrast to simply using a one-hot vector to represent the emotion state, the distance between emotion embeddings within the same cluster allows for fine-grained control of the emotion level that needs to be synthesized.
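A minimal scikit-learn sketch of such a projection is given below; the perplexity value and the random embeddings standing in for real ones are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, 64) emotion embeddings, labels: (N,) in {"neutral", "happy", "sad"}.
embeddings = np.random.randn(300, 64)                       # placeholder for real embeddings
labels = np.random.choice(["neutral", "happy", "sad"], 300)

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
for emotion in ("neutral", "happy", "sad"):
    mask = labels == emotion
    plt.scatter(points[mask, 0], points[mask, 1], label=emotion, s=10)
plt.legend()
plt.title("t-SNE projection of speech emotion embeddings")
plt.show()
```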

7.2.2 Audiovisual Speech Synthesis


One of the most challenging tasks in machine learning problems involving synthesis is how to best compare models. Objective measures, such as the loss functions that the network is trained to optimize, do not usually reflect the naturalness and quality of the network output. Human ratings of subjective quality are usually more reliable. In this study, we have conducted subjective tests to assess the quality of three different aspects of the synthesized talking face. The graders are gender balanced (15 male and 15 female), and all are native US English speakers in the age range 21-50.
Overall Experience of the Emotional Audiovisual Synthesis To assess the overall experience of the audiovisual emotional speech synthesis, we have conducted a subjective test, where graders are asked to evaluate videos from acoustic, visual, and overall naturalness perspectives. From the audio perspective, the graders are asked to evaluate the speech synthesis quality and emotion appropriateness given an emotion tag. To grade the visual aspects, the graders are asked to evaluate how well the visual synthesis (the synthesized lip movements and facial expressions) matches the synthesized speech. Finally, the graders are asked to evaluate the overall naturalness of the talking face. Thirty
Figure 4: TSNE projection of the validation set emotion embeddings extracted using the speech
emotion recognition system.

[Figure 4: scatter plot with three clusters labeled Happy, Sad, and Neutral.]

Figure 5: Grading template for the overall experience of the talking face

IMPORTANT: Please release the task if you do not have headphones, there is background noise, you have a hearing impairment, or you can't hear/watch the audio/video samples for any reason.
Instructions: It is important to understand that it is NOT you who is being tested - your results only provide information regarding the systems we are evaluating!
Video Synthesis Evaluation: We need your help evaluating video samples from an audio-visual speech synthesis system.
Video - Label: happy
1. Speech synthesis quality: Excellent / Good / Fair / Poor / Bad
2. Speech emotion matches the given label: Perfectly / Most of the time / Sometimes / Doesn't match most of the time / Doesn't match at all
3. Lip movements follow speech: Perfectly / Most of the time / Sometimes / Doesn't follow most of the time / Doesn't follow at all
4. Facial expression matches the emotion: Perfectly / Most of the time / Sometimes / Doesn't match most of the time / Doesn't match at all
5. How natural is the talking face overall?: Perfectly natural / Natural most of the time / Natural sometimes / Not natural most of the time / Not natural at all
Comments

graders provided five mean opinion scores (MOSs) for the five questions described above. The
scores range from (1) bad (no match) to (5) excellent (perfect match). Figure 5 shows the grading
template used for evaluation.
We used 30 videos, half of which are synthesized with happy emotion and the other half with sad emotion. The MOS tests were conducted on each of the following sets of data: videos generated by the modular approach, videos generated by the end-to-end approach, and the original (ground truth) recordings. We also used the acoustic speech synthesized from the end-to-end approach to drive the audio-to-facial-animation module in the modular approach to ensure a fair comparison. In this way, the graders hear the same speech with different lip movements from the two systems.
For all tests, the graders gave high scores (between good and excellent), as shown in Figures 6-10. All results are significant according to the Mann-Whitney significance test. Figures 6-10 show that

Figure 6: Mean opinion scores comparing the quality of the lip movements synthesized by the end-
to-end (E2E) and modular approaches to the ground truth (GT).

[Figure 6: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.3, Modular 4.1, GT 4.3.]

Figure 7: Mean opinion scores comparing the quality of the facial expressions synthesized by the
end-to-end (E2E) and modular approaches to the ground truth (GT).

[Figure 7: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.3, Modular 4.1, GT 4.3.]

the end-to-end approach gives better and more consistent performance, i.e., smaller variance, than the modular approach in all aspects. Furthermore, the performance of the end-to-end approach is almost as good as the ground truth results. Figures 6-7 show the results of grading the visual aspects of the synthesized talking face, i.e., lip movements and facial expressions. The ground truth results depicted in Figures 6-7 are also an indication of the performance of the offline blendshape estimation algorithm discussed in Section 3. The ground truth BSCs are estimated from RGB+depth images, and the quality of these inputs relies on many factors, such as the head pose and lighting conditions. The outliers in the ground truth results in Figures 6, 7, and 12 are explained by cases where the BSC estimation may have failed because such conditions were not optimal. Neural networks, on the other hand, when trained on these data, learn to filter out such outliers and infer the conditional mean. Therefore, the end-to-end approach shows more consistent performance in Figures 6-7. In Figure 7, the larger variance of the modular approach is explained by the audio-to-animation model falling short in synthesizing reasonable emotional facial expressions from speech.

Figure 8: Mean opinion scores (MOS) comparing the synthesis quality of acoustic speech synthesized by the end-to-end (E2E) approach to the ground truth (GT). Visual aspects affect the MOS of the modular approach, although the same synthesized speech is used for both tests.

[Figure 8: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.4, Modular 4.3, GT 4.4.]

Figure 9: Mean opinion scores comparing the quality of the emotions in speech synthesized by the end-to-end (E2E) approach to the ground truth (GT). Visual aspects affect the MOS of the modular approach, although the same synthesized speech is used for both tests. Graders preferred the synthesized speech emotions, which were milder than the exaggerated acted emotions in the original recordings.

[Figure 9: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.6, Modular 4.3, GT 4.4.]

An interesting outcome of this study was that although the graders were instructed to completely ignore visual aspects when grading the acoustic speech synthesis, they seemed to be unconsciously affected by the other aspects. This effect can be seen in the evaluation of the acoustic speech synthesis in Figures 8 and 9. Although the speech synthesized by the end-to-end approach was the same speech used as input to the audio-to-animation model, the graders gave better speech synthesis and speech emotion scores to the talking faces generated using the end-to-end approach.
Another interesting result is shown in Figure 9, where graders gave higher scores to the emotional speech synthesized by the end-to-end approach than to the original recordings (GT). The reason for this is that in the original recordings, the actress exaggerated happy and sad emotions in certain cases. We noticed that the emotions synthesized by the end-to-end approach are less exaggerated than the

Figure 10: Mean opinion scores comparing the quality of the holistic (overall) experience of the emotional talking faces synthesized by the end-to-end (E2E) and modular (Mod) approaches to the ground truth (GT).

[Figure 10: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.1, Modular 3.9, GT 4.1.]

Figure 11: Results of the AB test comparing the quality of the avatar's lip movements driven by the end-to-end and modular approaches in an isolated setup.
[Figure 11: grader preference: end-to-end 44%, modular 23%, no preference 33%.]
original recordings. This effect is discussed more in the upcoming sections. The graders preferred
less exaggerated happy and sad emotions in comparison to the overacted ones in the original dataset.
In order to evaluate the performance of the end-to-end and modular approaches in terms of acoustic
and visual speech synthesis in a more independent manner, we introduce two additional subjective
tests described in the subsequent sections that evaluate visual and acoustic speech synthesis perfor-
mance isolated from all other aspects.
Visual Speech Synthesis We have conducted an AB test to compare the performance of the end-to-end approach and the modular approach in terms of the synthesized lip movements (visual speech), i.e., how naturally the avatar's lip movements follow the acoustic speech. In order for the graders to focus only on the quality of the synthesized visual speech, we used a neutral emotion embedding to synthesize neutral speech. We also eliminated (zeroed out) the controls of the upper half of the face, such as eyebrow movements and blinking. Human graders were given 50 pairs of videos that correspond to sentences that were not used during training. The graders were asked which video matches the speech more naturally. Graders were advised to focus only on the lip movements and to provide comments as to why they preferred their chosen video. To prevent display ordering effects, the order of the videos in a given pair was randomized. In total, 30 graders evaluated 50 videos.
Figure 11 shows that the graders preferred the end-to-end system over the modular one. This result is significant according to the Mann-Whitney significance test. The binary result of the AB test does not assess the absolute quality of the visual speech synthesized by either of the two approaches. To get an absolute quality score for each of the two systems, we conducted two additional five-point-scale MOS tests for each approach individually, using a setup similar to that of the AB test. The MOS test results depicted in Figure 12 show that while the end-to-end approach is better than the modular approach, both systems generate very good lip movements.

Figure 12: Mean opinion scores of the visual speech synthesis of the end-to-end (E2E) and the
modular (Mod) approaches in an isolated setup.

[Figure 12: MOS on a Bad-Poor-Fair-Good-Excellent scale; E2E 4.3, Modular 4.2, GT 4.4.]

Figure 13: Mean opinion scores of the acoustic speech synthesis of the audiovisual (AV) and audio-only (A) synthesis systems that use the Tacotron2 architecture.

[Figure 13: MOS on a Bad-Poor-Fair-Good-Excellent scale; audio-only (A) 4.4, audiovisual (AV) 3.9.]

Acoustic Speech Synthesis It is important to assess whether the quality of the synthesized acoustic speech is affected by adding the facial control estimation task to the Tacotron2 model. To assess this, we conducted audio-only MOS tests for neutral speech, where graders were asked to focus primarily on assessing the quality of the synthesis. The graders were given 50 pairs of audio signals. Each pair consisted of a reference recording from the actress and a synthesized audio signal of the same phrase. The graders were asked to evaluate the synthesis quality given the reference signals on a five-point scale, where one corresponds to poor quality and five to excellent quality. The MOS test was conducted once with the audiovisual Tacotron2 and a second time with the conventional audio-only Tacotron2.
The results of the two MOS tests are shown in Figure 13. These results are significant according to the Mann-Whitney significance test. Although the graders were explicitly instructed to focus only on the synthesis quality and not on the style of speech in the synthesized signal, the results and subsequent comments from the graders indicated that they were unconsciously affected by the speaking style. The lower MOS of AVTacotron2 in Figure 13 is heavily influenced by the fact that the prosody of the synthesized speech does not always match the prosody of the reference signal. As shown in Figure 13, this was not the case for the traditional audio-only Tacotron2, where the prosody of the synthesis matches that of the reference signals and results in a higher MOS. The imperfectly matching prosody is also the reason why the graders preferred the milder emotional speech synthesized by AVTacotron2 over the overacted original recordings.

8 Conclusion
In this paper, we have described and compared two systems for audiovisual speech synthesis using Tacotron2. The first system is AVTacotron2, which is an end-to-end system that takes an input text and synthesizes acoustic and visual speech. The second, modular system uses the conventional audio-only Tacotron2 to synthesize acoustic speech from text. The synthesized speech is used as input to a speech-to-animation module that generates the corresponding facial controls. Emotion embeddings from a speech emotion classifier are used to allow both systems to synthesize emotional audiovisual speech. Both approaches are evaluated using a comprehensive subjective evaluation scheme. The results show that both systems produce synthesized talking faces that are almost as natural as the ground truth ones. Both approaches generate production-quality speech and facial animation that can be applied to a variety of fictional and human-like face models without post-processing. The end-to-end approach outperforms the modular approach in almost all aspects. However, the additional task of facial control estimation in the end-to-end approach can sometimes lead to a mismatch between the prosody of the synthesized acoustic speech and the original recordings.
In order to enhance the experience of the talking faces, our ongoing work involves estimating head pose in addition to lip movements and facial expressions. We are currently investigating the impact of the hybrid loss function on AVTacotron2 and estimating optimal weighting factors for the audio and visual losses. We are also working on improving the prosody of the acoustic speech synthesized by AVTacotron2. Finally, we are investigating video-based and audiovisual emotion embeddings as alternatives to the audio-only embeddings used here.

9 Acknowledgments
The authors are grateful to Oggi Rudovic, Barry-John Theobald, and Javier Latorre Martinez for
their valuable comments.

References
[1] A. Abdelaziz, B. Theobald, J. Binder, G. Fanelli, P. Dixon, N. Apostoloff, T. Weise, and S. Kajarekar. Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models. In International Conference on Multimodal Interaction, pages 220–225, 2019.
[2] Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, and Ahmed Hussen Abdelaziz. Self-supervised learning of visual speech features with audiovisual speech enhancement. arXiv preprint arXiv:2004.12031, 2020.
[3] Robert Anderson, Björn Stenger, Vincent Wan, and Roberto Cipolla. An expressive text-driven
3d talking head. In ACM SIGGRAPH 2013 Posters, pages 1–1. 2013.
[4] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W Sumner, and Markus Gross. High-quality passive facial performance capture using
anchor frames. In ACM SIGGRAPH 2011 papers, pages 1–10. 2011.
[5] M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics
and interactive techniques, pages 21–28. ACM Press/Addison-Wesley Publishing Co., 1999.
[6] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial perfor-
mance capture. ACM Transactions on Graphics (ToG), 34(4):1–9, 2015.
[7] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation
with image-based dynamic avatars. ACM Transactions on Graphics, 35(4), 2016.
[8] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio.
Attention-based models for speech recognition. In Advances in neural information processing
systems, pages 577–585, 2015.
[9] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models.
IEEE Transactions on pattern analysis and machine intelligence, 23(6):681–685, 2001.

[10] Jack Chilton Cotton. Normal visual hearing. Science, 1935.
[11] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of
Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
[12] B. Fan, L. Wang, F. Soong, and L. Xie. Photo-real talking head with deep bidirectional LSTM.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pages 4884–4888. IEEE, 2015.
[13] S. Fu, R. Gutierrez-Osuna, A. Esposito, P. Kakumanu, and O. Garcia. Audio/visual mapping
with cross-modal hidden Markov models. IEEE Transactions on Multimedia, 7(2):243–252,
2005.
[14] Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, Paul Graham, Koki Nagano,
Jay Busch, and Paul Debevec. Driving high-resolution facial blendshapes with video perfor-
mance capture. In ACM SIGGRAPH 2013 Talks, pages 1–1. 2013.
[15] Udit Kumar Goyal, Ashish Kapoor, and Prem Kalra. Text-to-audiovisual speech synthesizer.
In International Conference on Virtual Worlds, pages 256–269. Springer, 2000.
[16] Zhenliang H., Meina K., Jie J., Xilin C., and Shiguang S. A fully end-to-end cascaded CNN
for facial landmark detection. In IEEE International Conference on Automatic Face & Gesture
Recognition, pages 200–207, 2017.
[17] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward
Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu.
Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.
[18] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint
end-to-end learning of pose and emotion. ACM Transactions on Graphics (TOG), 36(4):94,
2017.
[19] T. Kim, Y. Yue, S. Taylor, and I. Matthews. A decision tree framework for spatiotemporal
sequence prediction. In Proceedings of the 21th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 577–586. ACM, 2015.
[20] Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Brébisson, and Yoshua Bengio.
Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442, 2017.
[21] H. Li, T. Weise, and M. Pauly. Example-based facial rigging. In ACM SIGGRAPH, pages
32:1–32:6. ACM, 2010.
[22] Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746–
748, 1976.
[23] Jonathan Parker, Ranniery Maia, Yannis Stylianou, and Roberto Cipolla. Expressive visual text
to speech and expression adaptation using deep neural networks. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4920–4924. IEEE,
2017.
[24] John Sabini and Maury Silver. Ekman’s basic emotions: Why not love and jealousy? Cognition
& Emotion, 19(5):693–712, 2005.
[25] Shinji Sako, Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura.
HMM-based text-to-audio-visual speech synthesis. In Sixth International Conference on Spo-
ken Language Processing, 2000.
[26] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
[27] T. Shimba, R. Sakurai, H. Yamazoe, and J. Lee. Talking heads synthesis from audio with deep
neural networks. In 2015 IEEE/SICE International Symposium on System Integration (SII),
pages 100–105. IEEE, 2015.

[28] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J
Weiss, Rob Clark, and Rif A Saurous. Towards end-to-end prosody transfer for expressive
speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047, 2018.
[29] Linsen Song, Wayne Wu, Chen Qian, and Chen Change Loy. Everybody's Talkin': Let me talk as you want. arXiv preprint, 2020.
[30] Lorenzo Tarantino, Philip N Garner, and Alexandros Lazaridis. Self-attention for speech emo-
tion recognition. Proc. Interspeech 2019, pages 2578–2582, 2019.
[31] S. Taylor, A. Kato, B. Milner, and I. Matthews. Audio-to-visual speech conversion using deep
neural networks. In Interspeech. International Speech Communication Association, 2016.
[32] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. Rodriguez, J. Hodgins, and I. Matthews.
A deep learning approach for generalized speech animation. ACM Transactions on Graphics
(TOG), 36(4):93, 2017.
[33] S. Taylor, M. Mahler, B. Theobald, and I. Matthews. Dynamic units of visual speech. In
Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages
275–284. Eurographics Association, 2012.
[34] Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner.
Neural voice puppetry: Audio-driven facial reenactment. arXiv 2019, 2019.
[35] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[36] Lijuan Wang, Xiaojun Qian, Lei Ma, Yao Qian, Yining Chen, and Frank K Soong. A real-time
text to audio-visual speech synthesis system. In Ninth Annual Conference of the International
Speech Communication Association, 2008.
[37] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly,
Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end
speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
[38] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying
Xiao, Fei Ren, Ye Jia, and Rif A Saurous. Style tokens: Unsupervised style modeling, control
and transfer in end-to-end speech synthesis. arXiv preprint arXiv:1803.09017, 2018.
[39] T. Weise, S. Bouaziz, H. Li, and M. Pauly. Realtime performance-based facial animation. In
ACM transactions on graphics (TOG), volume 30, page 77. ACM, 2011.
[40] Yanlin Weng, Chen Cao, Qiming Hou, and Kun Zhou. Real-time facial animation on mobile
devices. Graphical Models, 76(3):172–179, 2014.
[41] L. Xie and Z. Liu. Realistic mouth-synching for speech-driven talking face using articulatory
modelling. IEEE Transactions on Multimedia, 9(3):500–510, 2007.
[42] Li Zhang, Noah Snavely, Brian Curless, and Steven M Seitz. Spacetime faces: High-resolution capture for modeling and animation. In Data-Driven 3D Facial Animation, pages 248–276. Springer, 2008.
[43] Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan
Singh. Visemenet: Audio-driven animator-centric speech animation. ACM Transactions on
Graphics (TOG), 37(4):1–10, 2018.
