
A Facial Expression Recognition Method Using Deep Convolutional Neural Networks Based on Edge Computing

An Chen 1, Hang Xing 2,*, and Feiyu Wang 3

1 Department of Experiment Teaching, Guangdong University of Technology, Guangzhou, Guangdong, 510006, China
2 College of Engineering, South China Agricultural University, Guangzhou, Guangdong, 510642, China
3 School of Transportation Management, Xinjiang Vocational & Technical College of Communications, Urumchi, Xinjiang, 831401, China

Corresponding author: Hang Xing (e-mail: hangsail@[Link]).


This work was supported by the Science and Technology Program of Guangdong Province (No.2015A020209106).

ABSTRACT The imbalanced class sizes and high similarity of samples in facial expression databases can lead to overfitting in facial expression recognition neural networks. To address this problem, a facial expression recognition method using deep convolutional neural networks based on edge computing is proposed. To overcome the limitation that the cycle-consistent adversarial network model can only learn one-to-one mappings, we construct a constrained cycle-consistent generative adversarial network by adding class constraint information; the discriminator and classifier in this network share network parameters. In addition, to counter the unstable training and frequent model collapse of the original GAN, this paper introduces a gradient penalty rule into the discriminator's loss function to impose a normative constraint on gradient changes. The network not only generates sample data for minority classes in the training set of an expression database but also performs effective expression classification. Compared with other methods, the improved discriminative classifier network structure enhances sample diversity and achieves a higher expression recognition rate. Even when other expression feature extraction methods are used, a higher recognition rate is still obtained after applying the proposed data augmentation framework.

INDEX TERMS facial expression recognition; generative adversarial network; deep learning; edge
computing; class constraint information; gradient penalty rule

I. INTRODUCTION
With the rapid development of intelligent information technology and the wide use of computers, people can hand complicated work over to computers, which not only changes traditional ways of life but also provides great convenience for human beings. However, given today's diversified human-computer interaction needs, the traditional single, fixed input-output mode can no longer meet the needs of real life and market applications. We hope that computers can understand our intentions more intelligently, so as to serve us better. Facial expressions are an essential way to convey emotions in nonverbal communication. With the development of computing, facial expression recognition plays an important role in many applications, such as human-computer interaction and medical care, and it has become a research hotspot in artificial intelligence, computer vision, and even the Internet of Things [1-3]. However, facial expression recognition is a complex task for computers. The recognition process can be divided into three steps: image preprocessing, feature extraction, and facial expression classification. How to effectively extract the features of facial expression images is a critical step in facial expression recognition.
Early facial expression recognition methods extracted facial expression features manually through hand-designed feature extraction algorithms. These include active appearance model algorithms for face modeling based on feature point location, and extraction algorithms based on local features such as the Gabor wavelet, the Weber Local Descriptor (WLD), the Local Binary Pattern (LBP), and multi-feature fusion. Such hand-crafted feature extraction methods may lose some feature information of the original image and are not robust to image scale and lighting conditions.


Compared to manual feature extraction, deep neural networks can learn features automatically and achieve a high recognition rate in facial expression recognition. To extract more facial expression features, the number of layers in these networks has been increasing gradually. However, such networks tend to overfit as they deepen and their parameter counts grow; the smaller the data set, the more severe the overfitting. Most facial expression data sets suffer from insufficient data and high sample similarity, and imbalanced samples can also lead to unsatisfactory recognition performance [4,5].
Data augmentation is an important means of resolving sample shortages and imbalances. Reference [6] applied traditional rotation and crop augmentation to expand the training samples. However, most of the resulting images carry duplicate information and are close to simple copies of the originals. In terms of information content this is still far from having the same number of genuinely different samples, and such transformations do not change the identity information of the images, so the problem of high sample similarity remains unsolved. Unlike simple geometric transformations and cropping, GANs introduce an adversarial loss function and learn facial expression images with the same distribution as the target data set, which can solve the high-similarity problem of generated samples. However, GANs map random vectors into the target data set, and the lack of constraints often results in uneven quality.
To address the uneven number of facial expression samples, such as the relatively small amounts of disgust and sad expression data, this paper introduces CycleGAN into facial expression data augmentation, enabling the mapping of neutral expressions to multi-category expressions. At the same time, CycleGAN learns a one-to-one mapping. Therefore, when a one-to-many mapping is required (such as a neutral expression to several expressions like happy, sad, and surprised), the model has to be trained multiple times, which incurs a huge time cost. To address this problem, this paper improves CycleGAN and proposes a constrained cycle-consistent generative adversarial network for facial expression recognition. The network introduces class constraint conditions and gradient penalty rules, and it implements one-to-many mapping in a single model, reducing training overhead while obtaining higher-quality generated images.
Compared with the cycle-consistent generative adversarial network, this network has the following major improvements:
1) An auxiliary expression classifier is added on top of the discriminator. The newly added discriminative classifier replaces the two discriminators of the cycle-consistent generative adversarial network, and it can judge the authenticity of input images and classify expressions.
2) To counter the unstable training and frequent model collapse of the original GAN, this paper introduces a gradient penalty rule into the discriminator's loss function, which imposes a normative constraint on gradient changes.

II. GANS-BASED EXPRESSION RECOGNITION APPLICATIONS
Facial expression image editing is a special and important research topic. Because human vision is sensitive to facial irregularities and deformation, it is not easy to edit realistic facial expression images. In this regard, GANs can edit facial expression images with high-quality detailed textures. Moreover, expression recognition on the edited expression images can still achieve good results.
Facial expression editing is a challenging task because it requires advanced semantic understanding of the input facial images. Traditional methods either need paired training data or synthesize face images at very low resolution. Reference [7] proposed the Conditional Adversarial Auto Encoder (CAAE) to learn a face manifold and then realized smooth age-expression image regression on it. In CAAE, the face image is first mapped to a latent vector by a convolutional encoder, and the vector is projected onto an age-conditioned face manifold by a deconvolutional generator. The latent vector retains the subject's facial features, and the age condition controls the regression. Applying adversarial learning to the encoder and generator makes the generated images more realistic. Experimental results show that the framework has good performance and flexibility and that the quality of the generated images is high. Reference [8] proposed an Expression Generative Adversarial Network (ExprGAN) based on CAAE, which can edit the facial expression intensity of real images. In addition to the encoder and decoder networks, ExprGAN includes an expression intensity control network designed specifically for learning the expression intensity of generated images. This novel structure allows the intensity of generated expression images to be adjusted from low to high.
Reference [9] proposed a Conditional Difference Adversarial Autoencoder (CDAAE) for facial expression synthesis based on AU tags. Given an unseen face image and a target expression label or facial action unit (AU) label, it generates that person's facial expression image. CDAAE adds a feedforward path to the autoencoder structure, connecting low-level features of the encoder with the corresponding-level features of the decoder. By learning to distinguish the differences between low-level image features of different facial expressions for the same person, it disentangles changes due to identity from changes due to facial expression. Experimental results show that CDAAE preserves the facial expression information of unseen subjects more accurately than prior methods.


However, the resolution of the facial images generated by CDAAE is only 32 × 32, and facial images with AU labels cannot be evaluated well quantitatively. Reference [10] combined the geometric model of 3DMM [11] with a generative model; the former can separate expression attributes from other facial attributes and generate 3DMM expression attributes from a target AU label. In this way, high-resolution facial expression images can be generated, with the target expression determined by the AU label. Reference [12] proposed a method for augmenting facial expression data based on CycleGAN, converting neutral expressions into other expressions to expand expression categories with little data, such as disgust and sad expressions. Consequently, the classification accuracy improved by 5%-10% after data augmentation.
Image restoration is a traditional graphics problem. It refers to restoring the missing part of an image from the information remaining in it, so that human eyes cannot distinguish which part was restored; image restoration draws on statistics, probabilistic models, and related tools [13-15]. Its application in facial expression recognition is very common. During recognition, key parts of the face may be occluded: for example, some faces wear sunglasses and some wear scarves, which block the eyes or mouth, and the effect of these occlusions on expression recognition is significant. Therefore, the occluded part can be restored by a dedicated algorithm before recognition is performed. Traditional methods restore an image by copying pixels from the original image or patches from an image library, whereas GANs provide a new approach to image restoration.
Reference [16] proposed the context encoder, the first GAN-based image restoration method. The network is based on an encoder-decoder architecture; its input is a 128 × 128 image with missing blocks, and its output is either the 64 × 64 missing content (when the missing block is in the middle of the original image) or the fully restored 128 × 128 image (when the missing block may be anywhere in the original image). The objective function includes an adversarial loss and a content loss, and experimental results show that restoration works better when the missing block is in the middle of the original image. Reference [17] proposed an iterative GAN-based semantic image restoration method: a GAN is pre-trained so that its generator maps a hidden variable z into an image; given an image x0 with missing information, x0 is encoded into z* by minimizing an objective function consisting of an adversarial loss and a content loss. The content loss computes a weighted L1 pixel distance between the generated image G(z*) and the undamaged area of x0, with higher weights for pixels near the missing region. Finally, the objective function is optimized by back-propagation iteration.
Reference [18] observed that GAN-based image restoration models are sensitive to the initial solution of their non-convex optimization criteria and built an end-to-end trainable parametric network. Starting from a good initial solution, they obtained more realistic image reconstructions with significantly faster optimization, and they learned to use a recurrent neural network to optimize the time window of the initial solution. In the iterative optimization process, a temporal smoothness loss is applied to respect the redundancy of the sequence's time dimension. Experimental results show that this method is significantly better than other methods in image reconstruction quality. Reference [19] designed a facial image generative network based on Wasserstein GANs, which generates context-complete images for occluded areas, together with an expression recognition network; it extracts expression features, infers expression categories, and achieves a high recognition rate on the CK+ database. All of these mechanisms achieve good recognition results, but the quality of images generated by GANs is uneven due to the lack of constraints, and they cannot realize flexible expression mapping when the number of facial expression samples is unbalanced.
Reference [20] proposed a facial restoration algorithm based on a deep generative model. Unlike well-studied background completion, the facial restoration task is more challenging because it usually requires generating semantically new pixels for key missing parts with large appearance variation, such as the eyes and mouth. Reference [21] proposed a novel cascaded backbone-branches fully convolutional network (BB-FCN), used to locate facial landmarks quickly and accurately in unconstrained and cluttered environments. BB-FCN needs no preprocessing and generates facial landmark response maps directly from the original image. It follows a coarse-to-fine cascade pipeline composed of a backbone network, which roughly detects the locations of all facial landmarks, and a branch network for each type of detected landmark, which further refines its location. These mechanisms achieve good recognition results; however, they require multiple trained models, which incurs a huge time cost.
In addition to the above two classes of GAN-based applications in expression recognition, there are others; however, GANs in expression recognition are mostly used for data augmentation.

III. THE PRINCIPLE OF GAN
The GAN model contains two networks: a generator G and a discriminator D. The basic structure and calculation flow are shown in Figure 1.


FIGURE 1. Basic structure and calculation flow of GAN: a random variable z is fed to the generator G to produce G(z); the discriminator D receives either a real sample x or the generated G(z) and outputs true/false.


The generator G is a nonlinear, differentiable deep neural network that can be trained end to end. The objective of generator G is to make the data distribution of G(z) close to the data distribution of the training samples x, while a log-likelihood criterion is used to judge whether data come from the training samples or from the generator. In short, the input of generator G is a random coding vector and its output is a complex sample; the input of discriminator D is a complex sample and its output is a probability indicating whether that input is a real sample or a false sample produced by the generator. The purpose of the discriminator is to distinguish whether input samples come from the real samples or from the generator, while the purpose of the generator is to make the discriminator unable to tell them apart. From this point of view, the goals of generator and discriminator are opposed, and the two are adversarial. Generator G and discriminator D are trained alternately under a min-max rule: through continuous iteration, the parameters of generator G and discriminator D are updated until a balance is reached. At that point there is no difference between the data distribution of G(z) and that of the real samples x, and discriminator D can no longer distinguish real data from the output of generator G.
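For reference, the alternating min-max training described above corresponds to the standard GAN value function (the well-known formulation from the original GAN work, not restated in this paper):

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log(1 - D(G(z)))\big]$$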
The characteristics of generative adversarial networks are as follows:
(1) The complexity of generating data is linearly related to its dimension. To generate a larger image, the parameter computation does not grow exponentially as it does in traditional generative models.
(2) There are few prior hypotheses. GAN makes no explicit parametric assumption about the data distribution; that is, it does not need to know or assume the distribution of the training data in advance, and it only requires that the discriminant network and the generative network be differentiable.
(3) It can generate higher-quality images. The generation process does not depend on a Markov chain, and the contest between the generative network and the discriminative network can be realized through the back-propagation algorithm. It does not need to generate images pixel by pixel like PixelRNN but generates them directly, so generation is faster than in other methods.
Generally speaking, a traditional discriminative model is an optimization function, and a convex optimization problem must have an optimal solution. However, a GAN needs to account for the training of both the discriminative network and the generative network. It is difficult to find the Nash equilibrium point between the two networks, and it is difficult to reach this equilibrium through a simple gradient descent algorithm.
GANs are a milestone in the development of generative models. As a new generation method, GANs need not define the probability distribution of the generative model in advance. Instead, they can generate artificial samples that match the input samples through the adversarial learning of generators and discriminators, which effectively solves the problem of high-dimensional data generation. The network structures of the generator and discriminator place no limit on the generation dimension, widening the scope of generated samples. Compared with other generative models, GANs have the following advantages:
(1) GANs can sample the generated samples in parallel. The generator of a GAN is a simple feedforward network that maps the hidden variable z to a sample x; it can generate data in one pass, greatly speeding up sampling.
(2) The training of GANs requires no approximate inference, no inefficient Markov chain methods, and no computation of a complex variational lower bound, which greatly reduces the training difficulty and improves the training efficiency of the generative model.
(3) The adversarial training of generator and discriminator increases the diversity of generated samples. Compared with the experimental results of other generative models, GANs can generate higher-quality and clearer samples, offering a possible route to generating samples that are meaningful to humans.
(4) GANs not only make a great contribution to generative modeling but also provide inspiration for unsupervised learning. In semi-supervised learning, we can first train a GAN with unlabeled samples and then use a small amount of labeled data to train the discriminator for traditional classification and regression tasks.


IV. PROPOSED IMPROVED CYCLEGAN
The proposed improved CycleGAN achieves image style conversion from a source domain to a target domain. For the expression recognition task, data augmentation mainly focuses on converting neutral expressions into multiple expressions (anger, disgust, fear, sadness, surprise, and joy), for which a plain cycle-consistent generative adversarial network would have to train multiple models. This section therefore constructs a constrained cycle-consistent generative adversarial network, shown in Fig. 2, with two changes: 1) class constraint information is added to CycleGAN to achieve multi-category style conversion in one model; 2) an auxiliary expression classifier C is added to the discriminator. The newly added discriminative classifier replaces the two discriminators of the cycle-consistent generative adversarial network and can judge the authenticity of input images and classify expressions.
FIGURE 2. Structure of the constrained cycle-consistent GAN: the generator G (encoder Genc and decoder Gdec, conditioned on class information ei) maps Input-A to Generate-Bi and back to Cyclic-A under the cycle-consistency loss; the discriminative classifier Dcs with auxiliary classifier C outputs a fake/real decision and a class prediction trained with cross-entropy(ei, êi).


In Fig. 2, the sample training set is divided into two parts: neutral expressions and the other expression classes. Neutral expressions form the source domain A, and the other expression classes form the target domains, together defining a mapping network from neutral expressions to the other expression classes. Genc is responsible for encoding the original picture; after implicit coding, the latent code is concatenated with the target-domain class constraint information (such as: happy) and then decoded, yielding an expression image of the target-domain category. The discriminative classifier is responsible for classifying and authenticating the generated images, forming a many-to-one mapping network from multi-type expressions back to neutral expressions. The loss function of the network has an additional classification loss in addition to the adversarial loss and the cycle-consistency loss:
1) Adversarial loss $L_{adv}$
This loss pushes the generator $G_{cs}$ to transform neutral expression pictures from the source domain $A$ into pictures that closely approximate the true multi-type expression distribution of the multi-target domains $\{B_i\}_{i=1}^{S}$, where $S$ represents the total number of expression categories:

$$L_{LSGAN}(G_{cs}, D_{cs}, A, B_i) = \mathbb{E}_{b \sim P_{data}(B_i)}\big[(D_{cs}(b) - 1)^2\big] + \mathbb{E}_{a \sim P_{data}(A)}\big[D_{cs}(G_{cs}(a, e_i))^2\big] \quad (1)$$


Equation (1) represents the loss function optimized for the mapping $A \to \{B_i\}_{i=1}^{S}$, where $e_i$ represents the category information.

$$L_{LSGAN}(G', D_{cs}, A, B_i) = \mathbb{E}_{a \sim P_{data}(A)}\big[(D_{cs}(a) - 1)^2\big] + \mathbb{E}_{b \sim P_{data}(B_i)}\big[D_{cs}(G'(b))^2\big] \quad (2)$$

Equation (2) represents the loss function optimized for the reverse mapping $\{B_i\}_{i=1}^{S} \to A$, where $G'$ denotes the reverse generator.

$$L_{adv} = L_{LSGAN}(G_{cs}, D_{cs}, A, B_i) + L_{LSGAN}(G', D_{cs}, A, B_i) \quad (3)$$

Equation (3) represents the overall adversarial loss function.
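As an illustration, the least-squares adversarial terms of Eqs. (1)-(3) can be sketched in TensorFlow as follows (a minimal sketch; `d_real` and `d_fake` are assumed to be discriminator outputs on real and generated batches, and the function names are illustrative, not from the paper):

```python
import tensorflow as tf

def lsgan_discriminator_loss(d_real, d_fake):
    # Eqs. (1)/(2): push D toward 1 on real samples and toward 0 on generated ones
    return tf.reduce_mean(tf.square(d_real - 1.0)) + tf.reduce_mean(tf.square(d_fake))

def lsgan_generator_loss(d_fake):
    # Generator counterpart: push D's score on generated samples toward 1
    return tf.reduce_mean(tf.square(d_fake - 1.0))

# Eq. (3): the overall adversarial loss sums the two mapping directions, e.g.
# l_adv = lsgan_discriminator_loss(d_b, d_gen_b) + lsgan_discriminator_loss(d_a, d_gen_a)
```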
2) Cycle-consistency loss:

$$L_{cyc} = \mathbb{E}_{a \sim P_{data}(A)}\big[\|G'(G_{cs}(a, e_i)) - a\|_1\big] + \mathbb{E}_{b \sim P_{data}(B_i)}\big[\|G_{cs}(G'(b), e_i) - b\|_1\big] \quad (4)$$

3) Classification loss
As noted for the original GAN, training is unstable and the model collapses easily. Researchers in this area have proposed that this is caused by trying to minimize a strong divergence during network training. To solve this problem, we introduce a gradient penalty rule into the discriminator's loss function, which regulates the gradient change. The gradient penalty used is shown in formula (5):

$$\lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2\big] \quad (5)$$

The gradient penalty is not applied over the entire input space; it applies to the region of the real samples, the region of the generated samples, and the region between them. Thus, we first randomly sample a pair of true and false samples, together with a random number in the range [0, 1]:

$$x_r \sim P_r, \quad x_g \sim P_g, \quad \varepsilon \sim \mathrm{Uniform}[0, 1] \quad (6)$$
In formula (6), $x_r \sim P_r$ denotes sampling from the real sample distribution and $x_g \sim P_g$ denotes sampling from the generated sample distribution, while $\varepsilon$ is a random number in the interval [0, 1]. We then perform random interpolation sampling on the line between $x_r$ and $x_g$:

$$\hat{x} = \varepsilon x_r + (1 - \varepsilon) x_g \quad (7)$$

The distribution satisfied by the $\hat{x}$ obtained by this sampling process is $P_{\hat{x}}$, and the discriminator's gradient penalty loss is:

$$\mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2\big] \quad (8)$$
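Equations (5)-(8) translate directly into code. The following is a minimal TensorFlow sketch of the gradient penalty (assuming 4-D image batches; the `discriminator` argument and the choice of the L2 norm are illustrative assumptions):

```python
import tensorflow as tf

def gradient_penalty(discriminator, x_real, x_fake):
    """Sample x_hat on the line between real and generated batches
    (Eqs. (6)-(7)) and penalize gradient norms deviating from 1 (Eq. (8))."""
    batch = tf.shape(x_real)[0]
    eps = tf.random.uniform([batch, 1, 1, 1], 0.0, 1.0)   # Eq. (6)
    x_hat = eps * x_real + (1.0 - eps) * x_fake           # Eq. (7)
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = discriminator(x_hat, training=True)
    grads = tape.gradient(d_hat, x_hat)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return tf.reduce_mean(tf.square(norm - 1.0))          # Eq. (8)
```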
The generative path mainly trains generator G and discriminator D. The input to the generator is not only random noise z but also a given view angle label v. The purpose of generator G is to generate a new image G(v, z) in view v, and the role of discriminator D is to distinguish real data x from generated data G(v, z). The loss function of generator G is shown in formula (9):

$$L_G^{vzx} = \mathbb{E}_{z \sim P_z}\big[D_s(G(v, z))\big] + \lambda_3\, \mathbb{E}_{z \sim P_z}\big[P(D_v(G(v, z)) = v)\big] \quad (9)$$

The discriminator D removes the view angle label from its input and adds outputs for the classified angle and the score, so its loss function is as shown in formula (10), where the third term is the gradient penalty loss and the fourth term is the ACGAN-style cross-entropy loss:

$$L_D^{xvs} = \mathbb{E}_{z \sim P_z}\big[D_s(G(v, z))\big] - \mathbb{E}_{x \sim P_x}\big[D_s(x)\big] + \lambda_1\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] - \lambda_2\, \mathbb{E}_{x \sim P_x}\big[P(D_v(x) = v)\big] \quad (10)$$

In the reconstruction path, the encoder E and the decoder are mainly trained; encoder E attempts to reconstruct the training samples. Cross-reconstruction is used in encoder E to separate angle information from identity information, ensuring that images of multiple views share the same identity. Specifically, samples $(x_i, x_j)$ with the same identity but different angles are reconstructed from $x_i$ to $x_j$: $x_i$ is fed to encoder E, which outputs a view estimate $\bar{v}$ and an identity-preserving representation $\bar{z}$, that is, $(\bar{v}, \bar{z}) = (E_v(x_i), E_z(x_i)) = E(x_i)$. The resulting $\bar{z}$ and $v_j$ are then input into generator G together; guided by angle $v_j$, G generates the corresponding $\hat{x}_j$, so that $\hat{x}_j$ is reconstructed from $x_i$. Finally, the discriminator D tries to distinguish the real $x_j$ from the generated $\hat{x}_j$ and outputs the corresponding score and angle information. In this network, the loss function of encoder E is shown in formula (11):

$$L_E = \mathbb{E}_{x_i, x_j \sim P_x}\big[D_s(\hat{x}_j) + \lambda_3\, P(D_v(\hat{x}_j) = v_j) - \lambda_4\, L_1(\hat{x}_j, x_j) - \lambda_5\, L_v(E_v(x_i), v_i)\big] \quad (11)$$

The $L_1$ term in $L_E$ constrains $\hat{x}_j$ to match $x_j$ reconstructed from $x_i$, and the $L_v$ loss is the cross-entropy between the estimated view and the real view. The loss function of discriminator D is:

$$L_D^{xvs} = \mathbb{E}_{x_i, x_j \sim P_x}\big[D_s(\hat{x}_j) - D_s(x_j)\big] + \lambda_1\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] - \lambda_2\, \mathbb{E}_{x_j \sim P_x}\big[P(D_v(x_j) = v_j)\big] \quad (12)$$

Combining the above three parts of the loss function, the objective to be optimized for the generators $G_{cs}$ and $G'$ is:

$$\min_{G_{cs}, G'} L_{G_{cs}, G'} = \lambda_1 L_{cyc} + \lambda_2 L_G^{vzx} - L_{adv} \quad (13)$$

For the discriminative classifier, the objective to be optimized is:

$$\min_{D_{cs}, C} L_{D_{cs}, C} = \lambda_3 L_D^{xvs} + L_{adv} \quad (14)$$

In the experiments, $\lambda_1$, $\lambda_2$, and $\lambda_3$ in formulas (13) and (14) are set to 100, 10, and 1, respectively.
improved Cycle GAN. The edge cloud computing
Combining the above three parts of loss function, for the framework is shown in Figure 3. In this system, the Internet
generators G and G , the loss function that needs to be of Things obtains facial expression signals from users
optimized is: through multi secret sharing technology, and then distributes
min LGcs ,G = 1Lcyc + 2 LGvzx − Ladv (13) them to different edge clouds to ensure the privacy of users.
Gcs ,G
For discriminative classifier, the loss function that needs
to be optimized is:

FIGURE 3. Structure of the edge computing framework (edge computing nodes and data flow).


Fig. 4 is a schematic diagram of the network structure of the generator. The input facial expression images are 128 × 128. The coding structure of the generator is a stack of five convolutional layers with the same kernel size and stride. First, the input image is reshaped into a (128, 128, 1) three-dimensional tensor. Five convolutional layers with kernel size 4 and stride 2 are then applied, doubling the number of output channels at each layer. After five layers, the output is a (4, 4, 1024) three-dimensional tensor, equivalent to 1024 feature maps of size 4 × 4 that serve as the image encoding.

FIGURE 4. Structure of the generator: encoder Genc stacks Conv(64,4,2), Conv(128,4,2), Conv(256,4,2), Conv(512,4,2), Conv(1024,4,2); the class code eh is injected between encoder and decoder; decoder Gdec stacks DeConv(1024,4,2), DeConv(512,4,2), DeConv(256,4,2), DeConv(128,4,2), DeConv(6,4,2), mapping Input-A to Generate-B.
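The encoder-decoder stack of Fig. 4 can be sketched in Keras as follows (a minimal sketch: the activations, padding, and the way the class code eh is concatenated are assumptions, since the text does not specify them):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_genc():
    """Genc of Fig. 4: five 4x4 stride-2 convolutions, channels doubling
    from 64 to 1024, mapping (128, 128, 1) to a (4, 4, 1024) encoding."""
    inp = layers.Input(shape=(128, 128, 1))
    x = inp
    for ch in (64, 128, 256, 512, 1024):
        x = layers.Conv2D(ch, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)          # activation is an assumption
    return tf.keras.Model(inp, x, name="Genc")

def build_gdec():
    """Gdec of Fig. 4: five 4x4 stride-2 transposed convolutions,
    mapping the (4, 4, 1024) encoding back to a 128x128 output."""
    inp = layers.Input(shape=(4, 4, 1024))
    x = inp
    for ch in (1024, 512, 256, 128):
        x = layers.Conv2DTranspose(ch, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)               # activation is an assumption
    out = layers.Conv2DTranspose(6, 4, strides=2, padding="same")(x)  # DeConv(6,4,2)
    return tf.keras.Model(inp, out, name="Gdec")
```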


Fig. 5 is a schematic diagram of the network structure of the discriminator. An input sample image of shape (128, 128, 1) first passes through five convolutional layers with kernel size 4 and stride 2, and then through two fully connected layers, producing a one-dimensional discrimination result.

FIGURE 5. Structure of the discriminator: Conv(64,4,2), Conv(128,4,2), Conv(256,4,2), Conv(512,4,2), Conv(1024,4,2), followed by FC(1024) and FC(1), outputting a fake/real score for Input-A or Generate-B.


Fig. 6 is a schematic diagram of the network structure of the classifier. The classifier's structure is similar to that of the discriminator: it shares network parameters with the discriminator, and the weights differ only in the last two fully connected layers.

FIGURE 6. Structure of the classifier: the shared convolutional trunk Conv(64,4,2) through Conv(1024,4,2), followed by FC(1024) and FC(7) with a Sigmoid activation, outputting the expression class (Angry/Disgust/Fear/Happy/Sad/Surprise).
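A sketch of the parameter sharing between Figs. 5 and 6: one convolutional trunk feeding two heads that differ only in their last two fully connected layers (a minimal sketch; activations are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dcs_and_classifier():
    """Shared trunk of Figs. 5-6: five 4x4 stride-2 convolutions feed a
    discriminator head and a classifier head with separate FC layers."""
    inp = layers.Input(shape=(128, 128, 1))
    x = inp
    for ch in (64, 128, 256, 512, 1024):
        x = layers.Conv2D(ch, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU()(x)          # activation is an assumption
    feat = layers.Flatten()(x)

    # Discriminator head (Fig. 5): FC(1024) -> FC(1), fake/real score
    d = layers.Dense(1)(layers.Dense(1024)(feat))

    # Classifier head (Fig. 6): FC(1024) -> FC(7) with sigmoid
    c = layers.Dense(7, activation="sigmoid")(layers.Dense(1024)(feat))

    return tf.keras.Model(inp, [d, c], name="Dcs_C")
```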


VI. EXPERIMENT

A. EXPERIMENTAL RESULTS AND ANALYSIS OF THE JAFFE DATABASE
In this section, experiments are performed on the JAFFE dataset [22] and the CK+ and FER2013 datasets [23, 24]. The experimental hardware environment is an Intel(R) Core(TM) i7-7700K CPU @ 4.20 GHz, 16 GB of RAM, and an NVIDIA GeForce GTX 1080Ti GPU. The deep learning framework is TensorFlow.
The JAFFE dataset contains 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models, with each image rated on 6 emotion adjectives by 60 Japanese subjects. It contains grayscale images of seven expression types: anger, disgust, fear, neutrality, happiness, sadness, and surprise. Since the adversarial network generates output images of size 128 × 128, this section uniformly resizes the image samples to 48 × 48 when using generated samples.
The experiments use the structures of Figs. 4, 5, and 6 as the generator, discriminator, and classifier of the adversarial network, respectively. Likewise, the model structure at the expression recognition stage is the same as the classifier of Fig. 6, and all parameters are initialized randomly. To prevent overfitting during training, a Dropout layer with probability p = 0.9 is added after the first fully connected layer of the discriminator and classifier. In addition, the parameters of all convolutional layers and all fully connected layers of the discriminator are L2-regularized. Each layer of the generator and discriminator uses Batch Normalization, which normalizes the inputs of hidden layers to prevent the gradient from vanishing during training. The model optimizer is Adam, with the learning rate set to 0.0001, the momentum to 0.5, and the batch size fixed at 32.
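The training configuration described above can be set up as follows (a minimal sketch; the L2 coefficient is not given in the text and is an assumption):

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5)  # lr 0.0001, momentum 0.5
d_opt = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5)
BATCH_SIZE = 32

# Dropout after the first fully connected layer of discriminator/classifier;
# the paper states p = 0.9 (note that Keras drops units with probability
# `rate`, so the intended keep/drop convention should be checked).
dropout = tf.keras.layers.Dropout(rate=0.9)

# L2 regularization for conv and dense weights (coefficient assumed)
l2_reg = tf.keras.regularizers.l2(1e-4)
```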
The cycle-consistent generative adversarial network and the constrained cycle-consistent generative adversarial network were each trained for 200,000 iterations, after which a sample comparison chart was generated. Since the cycle-consistent generative adversarial network cannot realize a one-to-many mapping, implementing the mapping from the neutral expression to multiple expressions requires training multiple models, and the entire training takes about 4-5 times as long as the method proposed in this paper. In addition, because the proposed improved CycleGAN introduces a classification loss, the generated samples are more natural in terms of emotional expression.
The amount of disgusted and surprised expressions in the JAFFE database is far less than that of happy, sad, and angry expressions. This stems from the objectively biased way humans express emotion, and the severely uneven distribution of data samples is an important factor limiting improvements in the facial expression recognition rate. Besides happy and angry expressions, neutral expressions also account for a large proportion of the experimental data, so mapping neutral expressions to disgust can alleviate the sample imbalance. The experiment therefore uses the proposed improved CycleGAN to augment the JAFFE database.


To test the reliability of the samples generated by the network and their gain for facial expression recognition, the neutral expression is used as the source domain and the remaining expression classes as the target domains, and samples are generated through the constrained cycle-consistent generative adversarial network. Since the number of disgust expressions in the JAFFE dataset is relatively small, more disgust expressions are generated to enhance the original data set, and a small number of samples is added to the surprise expression category. Consequently, the final sample distribution is basically balanced; the samples in the test set were not augmented.
Fig. 7 shows the experimental results of expression classification on the JAFFE dataset. It can be seen from Fig. 7 that after augmenting the original data set, not only is the recognition rate of disgust expressions improved, but the recognition rates of the other expressions improve as well. This is because, as the number of training images increases, the differences between expressions become more pronounced: the more features obtained during training, the lower the false recognition rate, and the average recognition rate improves accordingly.

FIGURE 7. Recognition results on the JAFFE dataset (recognition rate, %):

Expression   Original data set   Enhanced dataset   Gain
Angry        75.23               78.21              2.98
Disgust      54.26               68.48              14.22
Fear         65.34               67.87              2.53
Neutral      75.67               77.88              2.21
Happy        87.43               88.98              1.55
Sad          70.43               73.56              3.13
Surprise     82.32               83.76              1.44
Average      72.95               76.96              4.01


B. EXPERIMENTAL RESULTS AND ANALYSIS OF THE FER2013 DATASET
To further verify the effectiveness of the proposed method, we carried out experiments on the FER2013 database. The FER2013 facial expression data set consists of 35886 facial expression pictures: 28708 training pictures, 3589 public test pictures, and 3589 private test pictures. Each picture is a grayscale image with a fixed size of 48 × 48. There are 7 expressions, corresponding to the digital labels 0-6. The expression number distribution of the FER2013 database is shown in Table 1.

TABLE 1. NUMBER DISTRIBUTION OF SEVEN EXPRESSIONS IN THE FER2013 DATABASE

Expression Classification   Training set   Test set
angry                       3962           991
aversion                    438            109
scared                      4097           1024
happy                       7191           1798
sad                         4862           1215
surprised                   3202           800
neutral                     4598           1240

From Table 1, we can see that while most expression categories in the FER2013 database have many samples, the data for disgust and surprise expressions is far less plentiful than that for happiness, sadness, and anger. This is also due to the objectively biased way humans express emotion, and the seriously unbalanced distribution of data samples is an important factor limiting improvements in the facial expression recognition rate. All expression classes are enhanced using the constrained cycle-consistent generative adversarial network. Table 2 shows the expression number distribution of the FER2013 library after adding disgust expressions, together with the sizes of the training and test sets for facial expression classification. Because the number of disgust expressions in the FER2013 data set is relatively small, more disgust expressions are generated to enhance the original data set, and a small number of samples is added to the surprise expression category. The final sample distribution is basically balanced; the samples of the test set were not enhanced.

TABLE 2. NUMBER DISTRIBUTION OF SEVEN EXPRESSIONS IN THE FER2013 DATABASE AFTER DATA AUGMENTATION


Expression Classification   Training set   Test set
angry                       3962           991
aversion                    438(+4000)     109
scared                      4097           1024
happy                       7191           1798
sad                         4862           1215
surprised                   3202(+800)     800
neutral                     4598           1240

Table 3 shows the experimental results of expression classification on the FER2013 dataset. It can be seen from Table 3 that after enhancement of the original data set, not only is the recognition rate of disgust expressions improved, but the recognition rates of the other expressions improve as well. This is because the increased number of training images brings out more differences between expressions: the more features obtained during training, the lower the false recognition rate, and the average recognition rate increases correspondingly.

TABLE 3. RECOGNITION RESULTS ON THE FER2013 DATABASE (RECOGNITION RATE, %)

Expression Classification   Raw data   Enhanced dataset   Gain
angry                       75.23      77.43              2.20
aversion                    56.87      73.98              17.12
scared                      65.32      69.56              4.24
happy                       87.43      88.67              1.24
sad                         73.21      77.45              4.24
surprised                   82.48      84.62              2.14
neutral                     75.45      78.55              3.10
Average                     73.71      78.61              4.90

C. EXPERIMENTAL RESULTS AND ANALYSIS OF THE CK+ DATASET
To further verify the effectiveness of the proposed method, we conducted experiments on the CK+ database. The CK+ database extends the Cohn-Kanade dataset and was released in 2010. It is much larger than JAFFE and is also freely available, including both expression labels and action unit labels. The CK+ database includes 123 subjects and 593 image sequences; the last frame of each image sequence is labeled with action units, and among these 593 image sequences, 327 carry an emotion label. It is a popular database in facial expression recognition, and many articles use it for testing. In addition to the seven expressions considered here, the CK+ database also includes contempt expressions. To be consistent with the JAFFE database, we removed contempt expressions from the database in our experiments. A total of 1,249 facial expression images were extracted from the remaining seven types. The distribution of the number of expressions in the CK+ library in the experiment is shown in Table 4.

TABLE 4. NUMBER DISTRIBUTION OF SEVEN EXPRESSIONS IN THE CK+ DATABASE

Expression Classification   Training set   Test set
angry                       113            28
aversion                    131            37
scared                      71             16
happy                       183            45
sad                         84             21
surprised                   195            39
neutral                     235            59

From Table 4, we can see that each expression category in the CK+ database has few samples. The constrained cycle-consistent generative adversarial network is used for data augmentation of all expression classes. Before that, a mirror-flip operation was performed on the neutral expression images to obtain 470 neutral expression images. The expression distribution of the CK+ library after data augmentation is shown in Table 5.

TABLE 5. NUMBER DISTRIBUTION OF SEVEN EXPRESSIONS IN THE CK+ DATABASE AFTER DATA AUGMENTATION

Expression Classification   Training set   Test set
angry                       113(+470)      28
aversion                    131(+470)      37
scared                      71(+470)       16
happy                       183(+470)      45
sad                         84(+470)       21
surprised                   195(+470)      39
neutral                     235(+470)      59

The cycle-consistent generative adversarial network and the constrained cycle-consistent generative adversarial network were each trained for 100,000 iterations, after which a sample comparison chart was generated. To verify the effectiveness of the proposed network for expression classifier training, the recognition rate on the CK+ dataset is compared with the methods in other literature. Table 6 shows the comparison of recognition rates of different recognition methods on the CK+ dataset. Reference [19] considered the individual differences of facial expressions during facial expression recognition and increased the amount of feature extraction. Reference [25] made up for the lack of training samples by combining low-level features with high-level features. Reference [26] considered the problem of low recognition rates caused by factors such as lighting, pose, expression, occlusion, and noise in face recognition, and proposed a face recognition method (IE(w)ATR-LBP) combining weighted information entropy (IEw) with the adaptive threshold circular local binary pattern (ATRLBP) operator.
The training classification network proposed in this paper, based on the constrained cycle-consistent generative adversarial network, benefits from optimizing both the adversarial loss and the classification loss. Its recognition rate on CK+ reaches 97.23% before enhancement and 98.46% after enhancement, the highest among the compared methods after enhancement. In addition, the methods of references [19], [25], and [26] were also trained using the augmented data. As Table 6 shows, the enhanced training set improves the recognition rate of all methods on the test set, bringing an average gain of not less than 1.01%. For reference [19], the gain is less obvious.

This may be because reference [19] suppresses identity information in its loss function, while the facial expression samples mapped from neutral expressions retain the original identity information. For reference [25], since it is enhanced with geometric operations such as rotation and cropping, the gain it obtains is obvious.
TABLE 6. RECOGNITION RESULTS OF DIFFERENT METHODS ON THE CK+ DATABASE

Method            Feature                                            Recognition rate before enhancement (%)   Recognition rate after enhancement (%)   Gain (%)
Reference [19]    Wasserstein GANs                                   94.35                                     95.56                                    1.21
Reference [25]    LeNet-5 cross connected                            84.34                                     86.45                                    2.11
Reference [26]    combine IEw and ATRLBP operator                    97.34                                     98.35                                    1.01
Proposed method   Cycle GAN + class constraint + gradient penalty    97.23                                     98.46                                    1.23
VII. CONCLUSION
Facial expression recognition is an important research topic in computer vision and artificial intelligence and is widely used in security, autonomous driving, business, and other fields. Facial expression databases are the data foundation of facial expression recognition and play an important role in the development of the technology. In this paper, the shortcomings of traditional data augmentation methods are analyzed and summarized. Aiming at the class imbalance in existing facial expression databases, the paper improves CycleGAN, proposes a facial expression recognition method based on a constrained cycle-consistent generative adversarial network, and introduces a class constraint condition and a gradient penalty rule. The experimental results show that the improved generative model can better learn the detailed texture information of face images and that the quality of the generated images is high. The improved discriminator network achieves better classification and recognition on the augmented facial expression images.
This paper studies facial expression recognition and expression image data augmentation. Although some achievements have been made, there are still deficiencies that need further research and improvement. First, the expression recognition and data augmentation in this paper are based on static images, while emotional changes in real life have a temporal dimension that a static image cannot capture; future work will focus on data augmentation for video sequences. Second, in the augmentation process, neutral expression images are used as the source domain and other expression images as the target domain, but in real scenes a person's expression can change arbitrarily. How to augment the data without restricting the expression state of the input image is also a direction for future improvement.

REFERENCES
[1] M. Zhang et al., "Emotional context modulates micro-expression processing as reflected in event-related potentials," PsyCh Journal, vol. 7, no. 1, pp. 13-24, 2018.
[2] J. Yu et al., "Hierarchical deep click feature prediction for fine-grained image recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2019.2932058.
[3] H. Gao et al., "Transformation-based processing of typed resources for multimedia sources in the IoT environment," Wireless Networks, 2019, to be published. DOI: [Link]
[4] X. Ma et al., "An IoT-based task scheduling optimization scheme considering the deadline and cost-aware scientific workflow for cloud computing," EURASIP Journal on Wireless Communications and Networking, 2019: 249. DOI: [Link]
[5] H. Gao et al., "Context-aware QoS prediction with neural collaborative filtering for Internet-of-Things services," IEEE Internet of Things Journal, 2019, to be published. DOI: [Link]
[6] A. T. Lopes, E. D. Aguiar, and T. O. D. Santos, "A facial expression recognition system using convolutional networks," in Proc. SIBGRAPI, Salvador, Bahia, Brazil, 2015, pp. 273-280.
[7] Z. Zhang, Y. Song, and H. Qi, "Age progression/regression by conditional adversarial autoencoder," in Proc. CVPR, 2017, pp. 5810-5818.
[8] D. Hui, S. Kumar, and C. Rama, "ExprGAN: Facial expression editing with controllable expression intensity," in Proc. Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, 2018, pp. 6781-6788.
[9] Y. Zhou and B. E. Shi, "Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder," in Proc. ACII, San Antonio, TX, USA, 2017, pp. 370-376.
[10] Z. Liu et al., "Conditional adversarial synthesis of 3D facial action units," Neurocomputing, to be published. DOI: 10.1016/[Link].2019.05.003.
[11] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proc. SIGGRAPH, Los Angeles, CA, USA, 1999, pp. 187-194.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Neural Information Processing Systems, vol. 25, no. 2, pp. 1097-1105, Jan. 2012.
[13] H. Gao et al., "Applying probabilistic model checking to path planning in an intelligent transportation system using mobility trajectories and their statistical data," Autosoft, vol. 25, no. 3, pp. 547-559, 2019.
[14] K. Xia et al., "Liver semantic segmentation algorithm based on improved deep adversarial networks in combination of weighted loss function on abdominal CT images," IEEE Access, vol. 5, pp. 96349-96358, Dec. 2019.
[15] H. Gao et al., "Research on cost-driven services composition in an uncertain environment," JIT, vol. 20, no. 3, pp. 755-769, 2019.
[16] D. Pathak et al., "Context encoders: Feature learning by inpainting," in Proc. CVPR, Las Vegas, NV, USA, 2016, pp. 2536-2544.
[17] R. A. Yeh et al., "Semantic image inpainting with deep generative models," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2017, pp. 5485-5493.
[18] D. A. Pitaloka et al., "Enhancing CNN with preprocessing stage in automatic emotion recognition," Procedia Computer Science, vol. 116, pp. 523-529, Oct. 2017.
[19] N. M. Yao et al., "Robust facial expression recognition with generative adversarial networks," Acta Automatica Sinica, vol. 44, no. 5, pp. 865-877, 2018.
[20] Y. Li et al., "Generative face completion," in Proc. CVPR, Honolulu, HI, USA, 2017, pp. 3911-3919.


[21] L. Liu, G. Li, Y. Xie, et al., "Facial landmark machines: A backbone-branches architecture with progressive representation learning," IEEE Transactions on Multimedia, vol. 21, no. 9, pp. 2248-2262, 2019.
[22] A. V. Anusha et al., "Facial expression recognition and gender classification using facial patches," in Proc. ComNet, 2016. DOI: 10.1109/CSN.2016.7824014.
[23] P. Lucey et al., "The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression," in Proc. CVPRW, 2010. DOI: 10.1109/CVPRW.2010.5543262.
[24] Z. Meng et al., "Identity-aware convolutional neural network for facial expression recognition," in Proc. FG 2017, Washington, DC, USA, 2017, pp. 558-565.
[25] Y. Li, X. Z. Liu, and M. Y. Jiang, "Facial expression recognition with cross-connect LeNet-5 network," Acta Automatica Sinica, vol. 44, no. 1, pp. 176-182, Jan. 2018.
[26] L. Ding et al., "Face recognition combining weighted information entropy with enhanced local binary pattern," Journal of Computer Applications, vol. 39, no. 8, pp. 2210-2216, 2019.

AN CHEN received the M.S. degree in Control Theory and Control Engineering from Wuhan University of Technology in 2006. He is a Lecturer in the Department of Experiment Teaching, Guangdong University of Technology. His research interests include image recognition and analysis, and electronic science and technology.

HANG XING received the Ph.D. degree in Agricultural Electrification and Automation from South China Agricultural University in 2014, where she works as a Lecturer. Her research interests include image recognition and analysis, and control theory and engineering.

FEIYU WANG received the M.S. degree in Software Engineering from Beijing University of Technology. She is an Associate Professor at Xinjiang Vocational & Technical College of Communications. Her research interests include intelligent algorithms, software engineering, and e-commerce.
