To cite this article: Jungang Xu, Hui Li & Shilong Zhou (2014): An Overview of Deep Generative Models, IETE Technical Review, DOI: 10.1080/02564602.2014.987328
Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained
in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no
representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the
Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and
are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and
should be independently verified with primary sources of information. Taylor and Francis shall not be liable for
any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever
or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of
the Content.
This article may be used for research, teaching, and private study purposes. Any substantial or systematic
reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any
form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://
www.tandfonline.com/page/terms-and-conditions
An Overview of Deep Generative Models
Jungang Xu, Hui Li and Shilong Zhou
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing 101408, China
ABSTRACT
As an important category of deep models, deep generative models have attracted more and more attention since the proposal of Deep Belief Networks (DBNs) and the fast greedy training algorithm based on restricted Boltzmann machines (RBMs). In the past few years, many different deep generative models have been proposed and used in the area of Artificial Intelligence. In this paper, three important deep generative models, namely DBNs, the deep autoencoder, and the deep Boltzmann machine, are reviewed. In addition, some successful applications of deep generative models in image processing, speech recognition, and information retrieval are also introduced and analysed.
Keywords:
Deep Autoencoder, Deep Belief Networks, Deep Boltzmann Machine, Deep Generative Model, Restricted
Boltzmann Machine
corresponding learning algorithm created a new situation in the field of AI and stepped closer to the final objective of AI.

In this paper, we review deep generative models, including their history, architectures, and applications. The rest of the paper is organized as follows. Section 2 introduces the historical context of deep generative models. Section 3 describes the architecture of three typical deep generative models, namely the DBN, the deep autoencoder, and the deep Boltzmann machine (DBM). Section 4 introduces and analyses some typical applications in Artificial Intelligence. Section 5 presents the discussions and perspectives.

2. THE HISTORICAL CONTEXT OF DEEP MODELS

2.1 Deep Generative Model

A deep generative model is usually represented as a graphical model [15]. The sigmoid belief network is a kind of deep generative model that was proposed and studied before 2006 and trained using variational approximations [16-19]. However, calculating the multi-layer joint distribution with this model is barely tractable [5]. At the beginning of the twenty-first century, Hinton et al. proposed a kind of deep generative model called deep belief networks, based on sigmoid belief networks [12]. Different from sigmoid belief networks, the top two layers of a DBN form a restricted Boltzmann machine (RBM) [20-23], which enables a fast training method. The fast unsupervised learning algorithm of the DBN greedily trains one layer at a time and finally obtains a multi-layer probabilistic model. More deep generative models similar to the DBN have been proposed and ...
... tuning. The DBM is an undirected graphical model, and its training algorithm is more complicated than those of the other two models.

3.1 Deep Belief Network

The DBN is similar to the sigmoid belief network, but its top two layers construct an RBM, which is where it differs from the sigmoid belief network. The DBN in Figure 2 has four layers, including a visible layer x and three hidden layers h1, h2, and h3. Different from the sigmoid belief network, which has a factorized prior probability P(h3) on the top layer, the top two layers of the DBN form the distribution of an RBM, an undirected graphical model with probability P(h2, h3). Therefore, the joint distribution of the DBN is defined as Eq. (1):

P(x, h^1, \ldots, h^l) = P(h^{l-1}, h^l) \left( \prod_{k=1}^{l-2} P(h^k \mid h^{k+1}) \right) P(x \mid h^1)    (1)

... trained as an RBM. To approximate the posterior probability of each layer, we perform the algorithm as follows: (1) sample h1 ~ Q(h1 | x) from the first RBM, where Q(h1 | x) is the approximating distribution of h1; (2) calculate h2 with the sample h1 as the input of the second RBM; and (3) repeat these two steps until the top layer is reached; see Figure 3.

As a building block for training the DBN, the RBM plays a very important role in deep learning. The RBM is a restricted type of Boltzmann machine (BM), which was introduced as a bidirectionally connected network of stochastic processing units [22]. A BM can be used to learn important aspects of an unknown probability distribution based on samples from this distribution. However, there are practical limitations in using BMs because of their difficult and time-consuming learning process. The RBM was proposed to alleviate this problem by imposing restrictions on the network topology [30].
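To make the greedy layer-wise procedure concrete, the following sketch trains a stack of binary RBMs with one step of contrastive divergence (CD-1) and then propagates sampled hidden activities upward as the input of the next RBM, in the spirit of steps (1)-(3) above. This is our own illustrative Python/NumPy code, not code from the cited works; the class and function names (`RBM`, `train_dbn`) and all hyper-parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    # Draw binary states from element-wise Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(np.float64)

class RBM:
    """Binary-binary restricted Boltzmann machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.c = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)      # Q(h = 1 | v)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)    # P(v = 1 | h)

    def cd1_update(self, v0):
        # Positive phase: hidden probabilities and a sample given the data.
        ph0 = self.hidden_probs(v0)
        h0 = sample(ph0)
        # Negative phase: one Gibbs step v0 -> h0 -> v1 -> h1.
        v1 = sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        # CD-1 approximation to the log-likelihood gradient.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

def train_dbn(x, layer_sizes, epochs=10):
    """Greedy layer-wise pre-training: train one RBM, then feed its sampled
    hidden activities to the next RBM, repeating up to the top layer."""
    rbms, data = [], x
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden)
        for _ in range(epochs):
            rbm.cd1_update(data)
        rbms.append(rbm)
        # Steps (1)-(3): h_k ~ Q(h_k | h_{k-1}) becomes the next RBM's input.
        data = sample(rbm.hidden_probs(data))
    return rbms

# Toy usage on random sparse binary "images".
x = (rng.random((100, 784)) < 0.1).astype(np.float64)
dbn = train_dbn(x, layer_sizes=[256, 64, 16])
```

In practice the stack would be trained on mini-batches for many epochs and then fine-tuned, but the structure above is enough to show how one RBM's approximate posterior becomes the next RBM's training data.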
4.1 Selected Applications in Image Processing

Traditional image recognition technologies include wavelet transformation, Gabor filters, Bayesian network decision, etc. For example, a novel approach to recognizing facial expressions was proposed in reference [51], where the facial feature is represented by a hybrid of the Gabor wavelet transform of an image and a local transitional pattern code. However, the effectiveness and efficiency of traditional image recognition technologies are still not very satisfactory. The DBN was proposed and tested on a simple image recognition task on the MNIST data-set of handwritten digits, which is a common data-set for machine learning and pattern recognition experiments [5,52-54]. The DBN showed promising results and outperformed most of the existing models. At the same time, the deep autoencoder was developed and demonstrated successfully on a dimensionality reduction task [27]. The parameters of the deep autoencoder are initialized by stacking multiple RBMs and training each RBM greedily, which allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
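To illustrate how the stacked-RBM initialization yields a low-dimensional code, the sketch below runs the deterministic encoder up-pass and the mirror-image decoder down-pass of an unrolled deep autoencoder. It is our own simplified illustration: the weights here are random stand-ins for greedily pre-trained RBM parameters, and the backpropagation fine-tuning used in [25,27] is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, weights, hidden_biases, visible_biases):
    """Unrolled deep autoencoder: encode with the pre-trained weights,
    then decode with the transposed weights to reconstruct the input."""
    h = x
    for W, c in zip(weights, hidden_biases):     # encoder: 784 -> ... -> code
        h = sigmoid(h @ W + c)
    code = h
    v = code
    for W, b in zip(reversed(weights), reversed(visible_biases)):  # decoder
        v = sigmoid(v @ W.T + b)
    return code, v

# Random stand-ins for weights learned by greedy RBM pre-training (Section 3.1).
rng = np.random.default_rng(1)
sizes = [784, 256, 64, 16]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes, sizes[1:])]
hidden_biases = [np.zeros(n) for n in sizes[1:]]
visible_biases = [np.zeros(m) for m in sizes[:-1]]

x = (rng.random((5, 784)) < 0.1).astype(float)
code, recon = autoencode(x, weights, hidden_biases, visible_biases)
print(code.shape, recon.shape)   # (5, 16) codes, (5, 784) reconstructions
```

The 16-dimensional code plays the same role as the leading principal components in PCA, but because the mapping is deep and non-linear it can capture structure that a linear projection cannot.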
A modified DBN has been developed in which the top-layer model uses a third-order Boltzmann machine [55]. This type of DBN was applied to the NORB database, a three-dimensional object recognition task. Subsequently, two strategies to improve the robustness of the DBN were developed [56]. First, sparse connections in the first layer of the DBN are used as a way to regularize the model. Second, a probabilistic de-noising algorithm is developed. Both techniques are shown to be effective in improving robustness against occlusion and random noise in a noisy image recognition task. The DBN has also been successfully applied to create compact but meaningful representations of images for ...

4.2 Selected Applications in Speech Recognition

State-of-the-art hidden Markov model (HMM) systems, with observation probabilities approximated by Gaussian mixture models (GMMs), have been used in speech recognition for a long time, while traditional neural networks were barely used because of their low performance.

A few years ago, a five-layer DBN was used to replace the Gaussian mixture component of the GMM-HMM, and the monophone state was used as the modeling unit to model phone data [60]. Although monophones are generally accepted as a weaker phonetic representation than triphones, the DBN-HMM approach with monophones was shown to achieve higher phone recognition accuracy than state-of-the-art triphone GMM-HMM systems [61]. In more recent work, one popular type of sequence classification criterion, maximum mutual information, was successfully applied to learn DBN weights for the Texas Instruments and Massachusetts Institute of Technology (TIMIT) phone recognition task [62-64].

The DBN-HMM was then extended from the monophone phonetic representation to the triphone, or context-dependent, counterpart and from phone recognition to large vocabulary speech recognition [65-71]. Experiments on the Bing mobile voice search data-set, collected under a real usage scenario, demonstrate that the triphone DBN-HMM significantly outperforms the state-of-the-art HMM system [60]. Three factors in addition to the DBN contribute to this success: the use of triphones as the DBN modeling units, the use of the best available triphone GMM-HMM to generate the alignment with each state in the triphones, and the tuning of the transition probabilities. The experiments also indicated that the decoding time of a five-layer DBN-HMM is almost the same as that of the state-of-the-art triphone GMM-HMM [68,69].
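In such hybrid systems the network replaces only the GMM observation model: at decoding time the per-frame state posteriors produced by the network are divided by the state priors to obtain scaled likelihoods for the HMM. The fragment below is a generic, minimal illustration of that conversion; the function name and the toy numbers are ours, not taken from the cited systems.

```python
import numpy as np

def scaled_log_likelihoods(state_posteriors, state_priors, eps=1e-10):
    """Convert per-frame posteriors p(state | frame) from the network into
    scaled log-likelihoods log p(state | frame) - log p(state), which the
    HMM decoder uses in place of GMM log-likelihoods."""
    return np.log(state_posteriors + eps) - np.log(state_priors + eps)

# Toy example: 3 frames, 4 HMM states (e.g. triphone senones).
posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.5, 0.2, 0.1],
                       [0.1, 0.1, 0.2, 0.6]])
priors = np.array([0.4, 0.3, 0.2, 0.1])   # typically estimated from alignments
print(scaled_log_likelihoods(posteriors, priors))
```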
4.3 Selected Applications in Information Retrieval

Semantic hashing was the first method used to map documents to high-level features with deep generative models [72,73]. Based on word-count features, the hidden variables in the final layer of a DBN give a much better representation of each document than the widely used latent semantic analysis and the traditional term frequency-inverse document frequency (TF-IDF) approach for information retrieval. Documents are mapped to a space of memory addresses in which semantically similar text documents are located at nearby addresses, so as to facilitate rapid document retrieval.
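The retrieval step behind semantic hashing can be sketched as follows: each document's top-layer activities are binarised into a short address code, and candidate documents are those whose codes lie within a small Hamming ball of the query's code. The code below is our own simplified illustration; the thresholding and lookup details are assumptions rather than the exact procedure of [72,73].

```python
import numpy as np

def to_codes(top_layer_activities, threshold=0.5):
    """Binarise the DBN's top-layer activities into compact address codes."""
    return (top_layer_activities > threshold).astype(np.uint8)

def hamming_neighbours(query_code, codes, radius=2):
    """Indices of documents whose codes differ from the query code in at
    most `radius` bits -- the 'nearby memory addresses'."""
    dists = (codes != query_code).sum(axis=1)
    return np.flatnonzero(dists <= radius)

# Toy example: 6 documents with 8-bit codes from a (hypothetical) trained DBN.
rng = np.random.default_rng(2)
codes = to_codes(rng.random((6, 8)))
print(hamming_neighbours(codes[0], codes, radius=2))
```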
While pre-training, a constrained conditional Poisson ...

5. DISCUSSIONS AND PERSPECTIVES

... information retrieval are also described respectively. Although various deep learning models and their applications have been proposed, there is still a lot of work to do in the future. First, improved deep generative models are needed, with architectures closer to the human brain and simpler training theories. Second, after DistBelief was proposed by Google as a distributed large-scale deep network, distributed and parallel training algorithms for deep generative models have become a hot research area, and in these algorithms the map/reduce programming model will be used [75]. These large-scale deep networks are promising for processing big data. Third, the application of deep generative models in information retrieval is worth developing further; existing deep models are suitable for processing sensory data with multiple layers, such as image data and speech data, but they are too complex to deal with plain data like text data.
REFERENCES

11. Y. Bengio, and Y. LeCun, "Scaling learning algorithms towards AI," Large-Scale Kernel Machines, Vol. 34, pp. 1-41, Sept. 2007.
12. G. E. Hinton, S. Osindero, and Y. W. Teh, "A learning algorithm for deep belief nets," Neural Computation, Vol. 18, no. 7, pp. 1527-54, Jul. 2006.
13. B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proceedings of Conference on Uncertainty in Artificial Intelligence, Alberta, 2002, pp. 485-92.
14. J. A. Lasserre, C. M. Bishop, and T. P. Minka, "Principled hybrids of generative and discriminative models," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, 2006, pp. 87-94.
15. M. I. Jordan, Learning in Graphical Models. Dordrecht: Kluwer, 1998.
16. P. Dayan, G. E. Hinton, R. Neal, and R. Zemel, "The Helmholtz machine," Neural Computation, Vol. 7, no. 5, pp. 889-904, Sept. 1995.
17. G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The "wake-sleep" algorithm for unsupervised neural networks," Science, Vol. 268, no. 5214, pp. 1158-61, May 1995.
18. L. K. Saul, T. Jaakkola, and M. I. Jordan, "Mean field theory for sigmoid belief networks," Journal of Artificial Intelligence Research, Vol. 4, no. 1, pp. 61-76, Jan. 1996.
19. I. Titov, and J. Henderson, "Constituent parsing with incremental sigmoid belief networks," in Proceedings of Meeting of Association for Computational Linguistics, Prague, 2007, pp. 632-9.
20. P. Smolensky, "Information processing in dynamical systems: foundations of harmony theory," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 194-281, Feb. 1986.
21. Y. Freund, and D. Haussler, "Unsupervised learning of distributions on binary vectors using two layer networks," in Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. Denver, CO: Morgan Kaufmann, 1991, pp. 912-9.
22. G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, Vol. 14, no. 8, pp. 1771-800, Aug. 2002.
23. M. Welling, M. Rosen-Zvi, and G. E. Hinton, "Exponential family harmoniums with an application to information retrieval," in Advances in Neural Information Processing Systems, Vol. 17, L. K. Saul, Y. Weiss and L. Bottou, Eds. Cambridge, MA: MIT Press, 2004, pp. 1481-8.
24. R. Salakhutdinov, and G. E. Hinton, "Deep Boltzmann machines," in Proceedings of International Conference on Artificial Intelligence and Statistics, Florida, 2009, pp. 448-55.
25. G. E. Hinton, and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, Vol. 313, no. 5786, pp. 504-7, May 2006.
26. R. Collobert, and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of International Conference on Machine Learning, Helsinki, 2008, pp. 160-7.
27. M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, Vol. 19, B. Schölkopf, J. C. Platt and T. Hoffman, Eds. Cambridge: MIT Press, 2006, pp. 1137-44.
28. P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in Proceedings of the 7th International Conference on Document Analysis and Recognition, Washington DC, 2003, pp. 958-63.
29. L. Deng, and D. Yu, "Deep learning for signal and information processing," Microsoft Research Report, Redmond, 2013.
30. K. H. Cho, T. Raiko, and A. Ilin, "Parallel tempering is efficient for learning restricted Boltzmann machines," in Proceedings of the 2010 International Joint Conference on Neural Networks, Thessaloniki, 2010, pp. 1-8.
31. N. Le Roux, and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, Vol. 20, no. 6, pp. 1631-49, Jun. 2008.
32. A. Fischer, and C. Igel, "Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines," in Proceedings of the 20th International Conference on Artificial Neural Networks, Thessaloniki, 2010, pp. 208-17.
33. G. E. Hinton, "Products of experts," in Proceedings of the 9th International Conference on Artificial Neural Networks, London, 1999, pp. 1-6.
34. G. E. Hinton, "Learning multiple layers of representation," Trends in Cognitive Sciences, Vol. 11, no. 10, pp. 428-34, Oct. 2007.
35. T. Tieleman, "Training restricted Boltzmann machines using approximations to the likelihood gradient," in Proceedings of the 25th International Conference on Machine Learning, New York, 2008, pp. 1064-71.
36. T. Tieleman, and G. Hinton, "Using fast weights to improve persistent contrastive divergence," in Proceedings of the 26th Annual International Conference on Machine Learning, New York, 2009, pp. 1033-40.
37. Y. Bengio, and O. Delalleau, "Justifying and generalizing contrastive divergence," Neural Computation, Vol. 21, no. 6, pp. 1601-21, Jun. 2009.
38. A. Fischer, and C. Igel, "Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines," in Proceedings of the 20th International Conference on Artificial Neural Networks, Thessaloniki, 2010, pp. 208-17.
39. D. J. Earl, and M. W. Deem, "Parallel tempering: theory, applications, and new perspectives," Physical Chemistry Chemical Physics, Vol. 7, pp. 3910-6, Aug. 2005.
40. G. Desjardins, A. Courville, and Y. Bengio, "Parallel tempering for training of restricted Boltzmann machines," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, New York, 2010, pp. 145-52.
41. R. M. Neal, "Sampling from multimodal distributions using tempered transitions," Statistics and Computing, Vol. 6, no. 4, pp. 353-66, Dec. 1996.
42. Y. Iba, "Extended ensemble Monte Carlo," International Journal of Modern Physics, Vol. 12, no. 5, pp. 623-56, Jun. 2001.
43. J. Xu, H. Li, and S. Zhou, "Improving mixing rate with tempered transition for learning restricted Boltzmann machines," Neurocomputing, Vol. 139, pp. 328-35, Sept. 2014.
44. D. C. Plaut, and G. E. Hinton, "Learning sets of filters using back-propagation," Computer, Speech and Language, Vol. 2, no. 1, pp. 35-61, Mar. 1987.
45. D. DeMers, and G. Cottrell, "Non-linear dimension reduction," in Advances in Neural Information Processing Systems, Vol. 5, S. J. Hanson, J. D. Cowan and C. L. Giles, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 580-7.
46. R. Hecht-Nielsen, "Replicator neural networks for universal optimal source coding," Science, Vol. 269, no. 5232, pp. 1860-3, Sept. 1995.
47. N. Kambhatla, and T. K. Leen, "Dimension reduction by local principal component analysis," Neural Computation, Vol. 9, no. 7, pp. 1493-516, Oct. 1997.
48. R. Salakhutdinov, and G. E. Hinton, "Deep Boltzmann machines," in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater Beach, 2009, pp. 448-55.
49. R. Salakhutdinov, and G. Hinton, "A better way to pretrain deep Boltzmann machines," in Advances in Neural Information Processing Systems, Vol. 25, F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, Eds. Cambridge, MA: MIT Press, 2012, pp. 1-9.
50. R. Salakhutdinov, "Learning deep generative models," Ph.D. Dissertation, Graduate Department of Computer Science, Univ. Toronto, Toronto, 2009.
51. A. Tanveer, J. Taskeed, and C. Ui-Pil, "Facial expression recognition using local transitional pattern on Gabor filtered facial images," IETE Technical Review, Vol. 30, no. 1, pp. 47-52, Jan. 2013.
52. G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, Vol. 18, no. 7, pp. 1527-54, Jul. 2006.
53. J. Luo, and A. Brodsky, "An EM-based multi-step piecewise surface regression learning algorithm," in Proceedings of the 7th International Conference on Data Mining, Las Vegas, 2011, pp. 286-92.
54. J. Luo, A. Brodsky, and Y. Li, "An EM-based ensemble learning algorithm on piecewise surface regression problem," International Journal of Applied Mathematics and Statistics, Vol. 28, no. 4, pp. 59-74, Aug. 2012.
55. V. Nair, and G. Hinton, "3-d object recognition with deep belief ...
63. ... in Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing, Kyoto, 2012, pp. 4273-76.
64. A. Mohamed, D. Yu, and L. Deng, "Investigation of full-sequence training of deep belief networks for speech recognition," in Proceedings of the 11th Annual Conference of the International Speech Communication Association, Makuhari, 2010, pp. 2846-9.
65. D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing, Kyoto, 2012, pp. 4409-12.
66. D. Yu, S. Wang, Z. Karam, and L. Deng, "Language recognition using deep-structured conditional random fields," in Proceedings of the 35th International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 5030-3.
67. F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Hawaii, 2011, pp. 24-9.
68. G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent ...
Authors
Jungang Xu is an associate professor at the School of Computer and Control Engineering, University of Chinese Academy of Sciences. He received the PhD degree in computer applied technology from the Graduate University of Chinese Academy of Sciences in 2003. During 2003-2005, he was a postdoctoral researcher at Tsinghua University. His current research interests include deep learning, parallel computing, big data management, etc.

Email: [email protected].

Shilong Zhou is an MS student at the School of Computer and Control Engineering, University of Chinese Academy of Sciences. He received the BS degree in software engineering from Northeast University in 2012. His current research interests include deep learning and information retrieval.

Email: [email protected].