Introduction to Gaussian Process Models
Cesar Lincoln Cavalcante Mattos
Federal University of Ceara (UFC)
Department of Teleinformatics Engineering (DETI)
Graduate Program in Teleinformatics Engineering (PPGETI)
November 2015
Agenda
1 Introduction to GPs
- Why GPs?
- Basic Definitions
- GPs for Regression
- Covariance Function and Hyperparameters Optimization
- From Feature Space to GPs
2 Dynamical System Identification
3 Advanced Topics
- Sparse Models
- Classification
- Robust Learning
- Unsupervised Learning
- Deep Models
- More Nonlinear Dynamical Models
4 Conclusion
Where I come from
Fortaleza, Ceará, Brazil
- 2.55 million inhabitants.
- 5th largest city in Brazil.
- 34 km of beaches.
- Around 25-30 °C all year.
- 2nd most visited tourist destination in Brazil.
Beira Mar Avenue. Iracema Guerreira statue. Jangada at sunset. Dragão do Mar Center of Art and Culture.
Federal University of Ceara (UFC)
Graduate Program in Teleinformatics Engineering (PPGETI)
UFC
- 8 campi.
- 114 undergraduate courses.
- 146 graduate courses.
- 2,150 professors.
- 26,800 undergraduate students.
- 6,000 graduate students.

PPGETI
- 200 master's dissertations.
- 75 PhD theses.
CENTAURO - Center of Reference for Automation and Robotics
- Automation, Robotics, Electromagnetic Compatibility, Industrial Processes and Machine Learning.
- 20 collaborators, around 1/3 working with ML, led by Prof. Guilherme Barreto.
- International collaborations with Portugal, Germany, Finland, and England.
Why Gaussian Processes?
- Parametric models act as a bottleneck between the training set and the predictions.
- The complexity of the model should grow with the amount of data available.
- Balance between data fit and regularization.
- Principled way to find (few) hyperparameters.
- Models uncertainty with probability distributions (Bayesian treatment).
- Most of the theoretical background comes from the useful properties of the Gaussian distribution.
Multivariate Gaussian Distribution
Definition
If a vector of random variables $f \in \mathbb{R}^N$ follows a multivariate Gaussian distribution, we can express it by
$$p(f \mid \mu, K) = \frac{1}{(2\pi)^{N/2} |K|^{1/2}} \exp\left(-\frac{1}{2}(f - \mu)^\top K^{-1} (f - \mu)\right), \quad (1)$$
where the distribution is completely defined by its mean vector $\mu \in \mathbb{R}^N$ and its covariance matrix $K \in \mathbb{R}^{N \times N}$.
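Not part of the original slides: a minimal NumPy sketch of how one typically draws samples from Eq. (1), via a Cholesky factor of $K$ (the jitter term is an assumption added for numerical stability).

```python
import numpy as np

def sample_mvn(mu, K, n_samples=5, jitter=1e-8):
    """Draw samples from N(mu, K) using the Cholesky factor of K."""
    N = mu.shape[0]
    # Small diagonal jitter keeps the factorization numerically stable.
    L = np.linalg.cholesky(K + jitter * np.eye(N))
    z = np.random.randn(N, n_samples)   # z ~ N(0, I)
    return mu[:, None] + L @ z          # mu + L z  ~  N(mu, K)
```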
Two Important Properties
Consider the following collection of random variables:
$$f = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \sim \mathcal{N}(\mu, K), \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad K = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}. \quad (2)$$

Marginalization
The observation of a larger collection of variables does not affect the distribution of smaller subsets, which implies that $f_1 \sim \mathcal{N}(\mu_1, K_{11})$ and $f_2 \sim \mathcal{N}(\mu_2, K_{22})$.

Conditioning
Conditioning on Gaussians results in a new Gaussian distribution, given by
$$p(f_1 \mid f_2 = y) = \mathcal{N}(f_1 \mid \mu_1 + K_{12} K_{22}^{-1} (y - \mu_2),\; K_{11} - K_{12} K_{22}^{-1} K_{21}). \quad (3)$$
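As an illustration (not in the original slides), Eq. (3) maps directly to a few lines of NumPy; the function name is hypothetical, and linear systems are solved instead of forming $K_{22}^{-1}$ explicitly.

```python
import numpy as np

def condition_gaussian(mu1, mu2, K11, K12, K22, y):
    """p(f1 | f2 = y) for jointly Gaussian (f1, f2), following Eq. (3)."""
    mu_cond = mu1 + K12 @ np.linalg.solve(K22, y - mu2)
    K_cond = K11 - K12 @ np.linalg.solve(K22, K12.T)  # K21 = K12^T
    return mu_cond, K_cond
```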
Gaussian Process
Definition
A GP defines a distribution over functions
$$f : \mathcal{X} \to \mathbb{R}, \quad (4)$$
such that, for any finite subset $\{x_1, x_2, \ldots, x_N\} \subset \mathcal{X}$ of the domain, the vector $f = [f(x_1), f(x_2), \ldots, f(x_N)]^\top$ follows a multivariate Gaussian distribution:
$$f \sim \mathcal{N}(\mu, K). \quad (5)$$

- In the infinite case, we have a GP prior for the function $f(\cdot)$.
- By the marginalization property, we can analyze the infinite object $f(\cdot)$ by analyzing any finite subset $f$.
- The vector of evaluations $f$ is a single sample from a GP.
Samples from a GP
Figure 1: Samples from the GP prior.
Nonlinear Regression
Let a dataset be given by
$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} = (X, y), \quad (6)$$
where $x_i \in \mathbb{R}^D$ are the inputs, $X \in \mathbb{R}^{N \times D}$, and $y \in \mathbb{R}^N$ are the observed outputs.
A general nonlinear task can be modeled as
$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2). \quad (7)$$
The values $f_i = f(x_i)$ are not observed directly and $f(\cdot)$ is unknown.
Standard GP modeling
Choose a multivariate Gaussian prior for the unknown function:
$$f = f(X) \sim \mathcal{N}(f \mid 0, K), \quad (8)$$
$$K_{ij} = k(x_i, x_j), \quad (9)$$
where $K \in \mathbb{R}^{N \times N}$ is the covariance matrix, obtained with a kernel function $k(\cdot, \cdot)$.
A common choice is the squared exponential function:
$$k(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} w_d^2 (x_{id} - x_{jd})^2\right), \quad (10)$$
where the vector $\theta = [\sigma_f^2, w_1^2, \ldots, w_D^2]^\top$ is comprised of the hyperparameters which characterize the covariance of the model.
The hyperparameters $w_1^2, \ldots, w_D^2$ are responsible for the so-called automatic relevance determination (ARD) of the input dimensions.
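A possible NumPy implementation of the SE-ARD kernel of Eq. (10); the broadcasting-based vectorization is a design choice of this sketch, not something prescribed by the slides.

```python
import numpy as np

def kernel_se_ard(X1, X2, sigma_f2, w2):
    """Squared exponential kernel with ARD weights, Eq. (10).
    X1: (N, D), X2: (M, D), w2: (D,) array of ARD weights w_d^2."""
    diff = X1[:, None, :] - X2[None, :, :]     # pairwise differences, (N, M, D)
    sqdist = np.sum(w2 * diff**2, axis=-1)     # weighted squared distances
    return sigma_f2 * np.exp(-0.5 * sqdist)
```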
Standard GP modeling
Likelihood
Considering the observation of a Gaussian noisy version of $f$, we have
$$p(y \mid f) = \mathcal{N}(y \mid f, \sigma_n^2 I), \quad (11)$$
where $I \in \mathbb{R}^{N \times N}$ is the identity matrix.

Marginal likelihood
The marginal distribution of $y$ is calculated by integrating out $f$:
$$p(y \mid X) = \int p(y \mid f)\, p(f \mid X)\, \mathrm{d}f = \int \mathcal{N}(y \mid f, \sigma_n^2 I)\, \mathcal{N}(f \mid 0, K)\, \mathrm{d}f \quad (12)$$
$$= \mathcal{N}(y \mid 0, K + \sigma_n^2 I). \quad (13)$$
Standard GP modeling
Inference
Inference for $f_*$, given a new input $x_*$, is obtained by conditioning:
$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K + \sigma_n^2 I & k_{f*} \\ k_{*f} & k_{**} \end{bmatrix}\right), \quad (14)$$
$$p(f_* \mid x_*, X, y) = \mathcal{N}\big(k_{*f} (K + \sigma_n^2 I)^{-1} y,\; k_{**} - k_{*f} (K + \sigma_n^2 I)^{-1} k_{f*}\big), \quad (15)$$
where
$$K = K(X, X), \quad (16)$$
$$k_{*f} = [k(x_*, x_1), \ldots, k(x_*, x_N)], \quad (17)$$
$$k_{f*} = k_{*f}^\top, \quad (18)$$
$$k_{**} = k(x_*, x_*). \quad (19)$$
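Eqs. (14)-(19) translate almost line by line into NumPy. A sketch, assuming `kernel` is a covariance function such as the `kernel_se_ard` above; linear systems are solved rather than inverting $K + \sigma_n^2 I$.

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, sigma_n2):
    """Posterior predictive mean and covariance, Eq. (15)."""
    K = kernel(X, X)                               # Eq. (16)
    K_star = kernel(X_star, X)                     # rows k_{*f}, Eq. (17)
    K_ss = kernel(X_star, X_star)                  # k_{**}, Eq. (19)
    Ky = K + sigma_n2 * np.eye(len(y))
    mean = K_star @ np.linalg.solve(Ky, y)
    cov = K_ss - K_star @ np.linalg.solve(Ky, K_star.T)
    return mean, cov
```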
Standard GP modeling
Figure 2: Posterior predictive distribution of a GP.
Samples from a GP
Figure 3: Samples from the GP prior, with $\sigma_f^2 = 1$, $w = 1$ and $\sigma_n^2 = 0$.
Samples from a GP
Figure 4: Samples from the posterior (after the observation of $y$), without noise ($\sigma_n^2 = 0$).
Samples from a GP
Figure 5: Samples from the posterior (after the observation of $y$), with noise ($\sigma_n^2 = 0.01$).
Samples from a GP
Figure 6: Samples with $\sigma_f^2 = 0.1$, $w = 1$ and $\sigma_n^2 = 0$. Figure 7: Samples with $\sigma_f^2 = 2$, $w = 1$ and $\sigma_n^2 = 0$.
Samples from a GP
Figure 8: Samples with $\sigma_f^2 = 1$, $w = 2$ and $\sigma_n^2 = 0$. Figure 9: Samples with $\sigma_f^2 = 1$, $w = 0.5$ and $\sigma_n^2 = 0$.
Hyperparameters Optimization
The noise variance $\sigma_n^2$ is included in the vector of hyperparameters $\theta$, which is optimized by maximizing the marginal log-likelihood $\mathcal{L}(\theta) = \log p(y \mid X, \theta)$, the so-called evidence of the model:
$$\mathcal{L}(\theta) = \underbrace{-\frac{1}{2} \log |K + \sigma_n^2 I|}_{\text{model capacity}} \;\underbrace{-\;\frac{1}{2} y^\top (K + \sigma_n^2 I)^{-1} y}_{\text{data fitting}} \;-\; \frac{N}{2} \log(2\pi). \quad (20)$$
Figure 10: Bayesian model selection.
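For illustration, Eq. (20) can be evaluated stably through a Cholesky factor; a minimal sketch (the helper name is hypothetical).

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n2):
    """Evidence L(theta) of Eq. (20), via a Cholesky factorization."""
    N = len(y)
    L = np.linalg.cholesky(K + sigma_n2 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))      # log |K + sigma_n2 I|
    return -0.5 * logdet - 0.5 * y @ alpha - 0.5 * N * np.log(2 * np.pi)
```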
Hyperparameters Optimization
The hyperparameters are optimized with the help of analytical gradients:
$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta_i} = -\frac{1}{2} \mathrm{Tr}\left((K + \sigma_n^2 I)^{-1} \frac{\partial (K + \sigma_n^2 I)}{\partial \theta_i}\right) + \frac{1}{2} y^\top (K + \sigma_n^2 I)^{-1} \frac{\partial (K + \sigma_n^2 I)}{\partial \theta_i} (K + \sigma_n^2 I)^{-1} y. \quad (21)$$
No need for cross-validation to perform model selection.
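A sketch of how the optimization might be set up with `scipy.optimize.minimize`, reusing the `kernel_se_ard` and `log_marginal_likelihood` helpers sketched earlier. The log-space parameterization is an assumption to keep the variances positive; for brevity, the gradient of Eq. (21) is left to finite differences instead of being supplied analytically.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(X, y, theta0):
    """Maximize L(theta) by minimizing its negative in log space.
    theta0 = [sigma_f2, sigma_n2, w_1^2, ..., w_D^2] (all positive)."""
    def neg_evidence(log_theta):
        sigma_f2, sigma_n2 = np.exp(log_theta[:2])
        w2 = np.exp(log_theta[2:])
        K = kernel_se_ard(X, X, sigma_f2, w2)
        return -log_marginal_likelihood(K, y, sigma_n2)
    res = minimize(neg_evidence, np.log(theta0), method='L-BFGS-B')
    return np.exp(res.x)
```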
Kernel Function
- The choice of the kernel function directly affects the characteristics of the model.
- The squared exponential, for example, imposes a certain degree of smoothness.
- Any function that generates a positive semidefinite covariance matrix is acceptable.
- New kernel functions can be created by linear combination of valid kernels, increasing the expressiveness of the model (see the sketch below).
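To make the last point concrete, here is one possible periodic kernel together with a generic combinator; the specific functional form is an assumption of this sketch, since the slides do not fix one.

```python
import numpy as np

def kernel_periodic(X1, X2, sigma_f2, ell2, period):
    """A standard periodic kernel for 1-D inputs."""
    d = X1.ravel()[:, None] - X2.ravel()[None, :]
    return sigma_f2 * np.exp(-2.0 * np.sin(np.pi * d / period)**2 / ell2)

def kernel_sum(k1, k2):
    """A sum of valid kernels is itself a valid kernel."""
    return lambda X1, X2: k1(X1, X2) + k2(X1, X2)
```

A sum like `kernel_sum(se, periodic)` yields a model with a smooth trend plus a seasonal component, in the spirit of the SE+Periodic kernel used for the Artificial-5 dataset later on.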
Samples from different kernel functions
Figure panels: squared exponential, linear, Matérn 3/2, and periodic.
From Feature Space to GPs
Let $\phi : \mathbb{R}^D \to \mathbb{R}^Q$ be a mapping function and $w \in \mathbb{R}^Q$ a vector of weights:
$$y_i = w^\top \phi(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2). \quad (22)$$
If we consider the prior $p(w) = \mathcal{N}(w \mid 0, \Sigma_w)$ and let $\Phi = \phi(X)$ be the matrix where each row is given by $\phi(x_i)^\top$, we have the posterior
$$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} = \mathcal{N}\left(w \,\Big|\, \frac{1}{\sigma_n^2} A^{-1} \Phi^\top y,\; A^{-1}\right), \quad (23)$$
where $A = \frac{1}{\sigma_n^2} \Phi^\top \Phi + \Sigma_w^{-1} \in \mathbb{R}^{Q \times Q}$.
Prediction is performed by averaging over the weights:
$$p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid y, X)\, \mathrm{d}w \quad (24)$$
$$= \mathcal{N}\left(f_* \,\Big|\, \frac{1}{\sigma_n^2} \phi(x_*)^\top A^{-1} \Phi^\top y,\; \phi(x_*)^\top A^{-1} \phi(x_*)\right). \quad (25)$$
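A direct NumPy transcription of Eqs. (23)-(25), kept deliberately close to the formulas (the explicit inverses mirror the equations; in practice one would factorize $A$).

```python
import numpy as np

def blr_predict(Phi, y, phi_star, Sigma_w, sigma_n2):
    """Predictive mean and variance of f* in the feature-space view."""
    A = Phi.T @ Phi / sigma_n2 + np.linalg.inv(Sigma_w)   # Eq. (23), (Q, Q)
    A_inv = np.linalg.inv(A)
    mean = phi_star @ A_inv @ Phi.T @ y / sigma_n2        # Eq. (25)
    var = phi_star @ A_inv @ phi_star
    return mean, var
```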
From Feature Space to GPs
We can rewrite the predictive distribution:
$$p(f_* \mid x_*, X, y) = \mathcal{N}\big(f_* \mid \phi(x_*)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma_n^2 I)^{-1} y,\; \phi(x_*)^\top \Sigma_w \phi(x_*) - \phi(x_*)^\top \Sigma_w \Phi^\top (\Phi \Sigma_w \Phi^\top + \sigma_n^2 I)^{-1} \Phi \Sigma_w \phi(x_*)\big). \quad (26)$$
Now we apply the kernel trick:
$$\Phi \Sigma_w \Phi^\top = (\Phi \Sigma_w^{1/2})(\Sigma_w^{1/2} \Phi^\top) = \Psi \Psi^\top = k(X, X) = K, \quad (27)$$
$$\phi(x_*)^\top \Sigma_w \Phi^\top = k(x_*, X) = k_{*f}, \quad (28)$$
$$\Phi \Sigma_w \phi(x_*) = k(X, x_*) = k_{f*}, \quad (29)$$
$$\phi(x_*)^\top \Sigma_w \phi(x_*) = k(x_*, x_*) = k_{**}. \quad (30)$$
Finally, we get the standard GP prediction expression:
$$p(f_* \mid x_*, X, y) = \mathcal{N}\big(k_{*f} (K + \sigma_n^2 I)^{-1} y,\; k_{**} - k_{*f} (K + \sigma_n^2 I)^{-1} k_{f*}\big). \quad (31)$$
Dynamical System Identification
Dynamical Systems
A process whose states present a temporal dependency, i.e., its outputs are a function of its past.

Black-box modeling
The model is obtained only from the system's inputs and outputs.

Figure 11: Model $\mathcal{M}$ obtained after the identification of the system $\mathcal{P}$.
Dynamical System Identification
NARX (nonlinear autoregressive with exogenous inputs) Models
Given a dynamical system with input $u_i$ and output $y_i$, we have:
$$x_i = [y_{i-1}, y_{i-2}, \ldots, y_{i-L_y}, u_{i-1}, u_{i-2}, \ldots, u_{i-L_u}]^\top, \quad (32)$$
$$y_i = f(x_i) + \epsilon_i, \quad (33)$$
where $x_i$ is the regressor vector (or state), $L_y$ and $L_u$ are some specified lags, $f(\cdot)$ is the transition function and $\epsilon_i$ is Gaussian noise.
Figure 12: Structure of an autoregressive model.
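A sketch of how the regressors of Eq. (32) might be assembled from an input/output record; the function name and stacking convention are illustrative.

```python
import numpy as np

def build_narx_regressors(u, y, Ly, Lu):
    """Stack states x_i = [y_{i-1},...,y_{i-Ly}, u_{i-1},...,u_{i-Lu}]
    and targets y_i from input u and output y sequences, per Eq. (32)."""
    start = max(Ly, Lu)
    X, targets = [], []
    for i in range(start, len(y)):
        past_y = y[i - Ly:i][::-1]     # y_{i-1}, ..., y_{i-Ly}
        past_u = u[i - Lu:i][::-1]     # u_{i-1}, ..., u_{i-Lu}
        X.append(np.concatenate([past_y, past_u]))
        targets.append(y[i])
    return np.array(X), np.array(targets)
```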
GP-NARX
Figure 13: GP-NARX model for system identification.
Experiments
Validation
- 1-step ahead prediction: the prediction is performed based on past inputs and known observed outputs.
- Infinite-step ahead prediction (free simulation): the prediction is performed based on past inputs and past predictions (a naive mean-propagation loop is sketched below).
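A sketch of free simulation, assuming a `predict` callable that returns the predictive mean for a single state; feeding back only the mean ignores the predictive variance, so this is the simplest possible scheme.

```python
import numpy as np

def free_simulation(u, y_init, Ly, Lu, predict):
    """Simulate outputs by feeding predictions back into the regressor.
    y_init must contain at least max(Ly, Lu) seed outputs."""
    y_sim = list(y_init)
    for i in range(len(y_init), len(u)):
        past_y = np.array(y_sim[i - Ly:i][::-1])   # past predictions
        past_u = np.array(u[i - Lu:i][::-1])       # known past inputs
        y_sim.append(predict(np.concatenate([past_y, past_u])))
    return np.array(y_sim)
```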
Metrics
- Root Mean Square Error: $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu_i)^2}$.
- Negative Log Density Error: $\mathrm{NLD} = \frac{1}{2N} \sum_{i=1}^{N} \left[\log(2\pi) + \log(\sigma_i^2) + \frac{(y_i - \mu_i)^2}{\sigma_i^2}\right]$.

(Both metrics are implemented in the sketch below.)
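Both metrics in NumPy, following the formulas above; `mu` and `var` hold the predictive means and variances.

```python
import numpy as np

def rmse(y, mu):
    return np.sqrt(np.mean((y - mu)**2))

def nld(y, mu, var):
    """Average negative log density of y under N(mu_i, var_i)."""
    return 0.5 * np.mean(np.log(2 * np.pi) + np.log(var) + (y - mu)**2 / var)
```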
Artificial datasets¹

1. $y_i = \dfrac{y_{i-1}\, y_{i-2}\, (y_{i-1} + 2.5)}{1 + y_{i-1}^2 + y_{i-2}^2} + u_{i-1}$
   Estimation: $u_i \sim \mathcal{U}(-2, 2)$, 300 samples. Test: $u_i = \sin(2\pi i/25)$, 100 samples. Noise: $\mathcal{N}(0, 0.29)$.
2. $y_i = \dfrac{y_{i-1}}{1 + y_{i-1}^2} + u_{i-1}^3$
   Estimation: $u_i \sim \mathcal{U}(-2, 2)$, 300 samples. Test: $u_i = \sin(2\pi i/25) + \sin(2\pi i/10)$, 100 samples. Noise: $\mathcal{N}(0, 0.65)$.
3. $y_i = 0.8 y_{i-1} + (u_{i-1} - 0.8)\, u_{i-1}\, (u_{i-1} + 0.5)$
   Estimation: $u_i \sim \mathcal{U}(-1, 1)$, 300 samples. Test: $u_i = \sin(2\pi i/25)$, 100 samples. Noise: $\mathcal{N}(0, 0.07)$.
4. $y_i = y_{i-1} - 0.5 \tanh(y_{i-1} + u_{i-1}^3)$
   Estimation: $u_i \sim \mathcal{N}(u_i \mid 0, 1)$, $-1 \leq u_i \leq 1$, 150 samples. Test: $u_i \sim \mathcal{N}(u_i \mid 0, 1)$, $-1 \leq u_i \leq 1$, 150 samples. Noise: $\mathcal{N}(0, 0.0025)$.
5. $y_i = 0.3 y_{i-1} + 0.6 y_{i-2} + 0.3 \sin(3\pi u_{i-1}) + 0.1 \sin(5\pi u_{i-1})$
   Estimation: $u_i \sim \mathcal{U}(-1, 1)$, 500 samples. Test: $u_i = \sin(2\pi i/250)$, 500 samples. Noise: $\mathcal{N}(0, 0.18)$.

¹ Narendra, K. S. and Parthasarathy, K., Identification and control of dynamical systems using neural networks, 1990; Kocijan, J. et al., Dynamic systems identification with Gaussian processes, 2005.
Artificial-1 dataset
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.
Artificial-2 dataset
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.
Artificial-3 dataset
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.
Artificial-4 dataset
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.
Artificial-5 dataset (SE kernel)
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX SE - 1-step ahead prediction. GP-NARX SE - Free simulation.
Artificial-5 dataset (SE+Periodic kernel)
Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.
GP-NARX SE+Periodic - 1-step ahead prediction. GP-NARX SE+Periodic - Free simulation.
Mackey-Glass Time Series (100 training samples)
GP-NAR - SE+Linear kernel - 1-step ahead prediction.
GP-NAR - SE+Linear kernel - Free simulation.
GP for Learning in Robotics and Control²
Probabilistic Inference for Learning Control (PILCO)
Model-based policy search for autonomous learning.
Considers model uncertainty with the use of GPs.
Figure 14: Examples of systems controlled with PILCO. Videos available at http://www.youtube.com/user/PilcoLearner

² Deisenroth, M. P.; Fox, D.; Rasmussen, C. E., Gaussian processes for data-efficient learning in robotics and control, 2015.
Sparse GP Approximations
- In order to use GPs for large datasets, we need to avoid their $\mathcal{O}(N^3)$ complexity and $\mathcal{O}(N^2)$ storage demands.
- Most of the sparse approximation schemes consist of replacing the full kernel matrix $K_N$ by $K_{NM} K_M^{-1} K_{MN}$, obtaining $\mathcal{O}(M^2 N)$ complexity and $\mathcal{O}(M^2)$ storage demands, where $M \ll N$.
Figure 15: Visualization of the sparse approximation.
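A sketch of the low-rank factors behind this approximation; `Z` holds the $M$ inducing inputs, which are simply assumed given here (choosing them is a separate problem).

```python
import numpy as np

def nystrom_factors(X, Z, kernel, jitter=1e-8):
    """Factors of the surrogate K ~ K_NM K_M^{-1} K_MN."""
    K_M = kernel(Z, Z) + jitter * np.eye(len(Z))   # (M, M)
    K_NM = kernel(X, Z)                            # (N, M)
    return K_NM, K_M

# For illustration only; in practice one works with the factors to keep
# the reduced cost, never forming the full N x N product:
# K_NM, K_M = nystrom_factors(X, Z, k)
# K_approx = K_NM @ np.linalg.solve(K_M, K_NM.T)
```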
GPs for Binary Classification
The observed output (a class probability) is related to the latent function $f(\cdot)$ by a non-Gaussian likelihood $p(y_i = 1 \mid f(x_i))$:
Figure 16: Squashing a latent function into a class probability.
Need to apply approximate inference methods:
- Sampling methods (e.g. MCMC).
- Laplace approximation.
- Expectation Propagation (EP).
- Variational Bayes (VB).
Training in the Presence of Outliers
- We can use heavy-tailed distributions to account for non-Gaussian noise in the form of outliers.
- Approximate inference is also necessary.
Figure 17: Comparison between the Gaussian and heavy-tailed distributions.
Robust Latent NARX GP (GP-RLARX)³
- Performs autoregression with latent (non-observed) variables, instead of the outputs.
- Chooses a Student-t likelihood, expressed as a scale mixture of Gaussians.
- Uses a variational approximation for inference.

$$x_i = f(x_{i-1}, \ldots, x_{i-L_x}, u_{i-1}, \ldots, u_{i-L_u}) + \epsilon_i^{(x)}, \quad \epsilon_i^{(x)} \sim \mathcal{N}(\epsilon_i^{(x)} \mid 0, \sigma_x^2), \quad (34)$$
$$y_i = x_i + \epsilon_i^{(y)}, \quad \epsilon_i^{(y)} \sim \mathcal{N}(\epsilon_i^{(y)} \mid 0, \tau_i^{-1}), \quad \tau_i \sim \Gamma(\tau_i \mid \alpha, \beta). \quad (35)$$
³ Mattos, C. L. C. et al., Latent autoregressive Gaussian process models for robust system identification, submitted to DYCOPS 2016.
GP-RLARX for System Identification
Artificial 1. Artificial 2. Artificial 3.
Artificial 4. Artificial 5.
RMSE values for free simulation with different levels of contamination by outliers.
GPs for Unsupervised Learning
- We have observed (noisy) data but do not know the latent inputs which generated it.
- Related to the problem of nonlinear dimensionality reduction.

Gaussian Process Latent Variable Model (GPLVM)⁴
- GP prior over the unknown mapping $f(\cdot)$ to the outputs.
- Prior over the latent input space, e.g. $p(X) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid 0, \sigma_x^2 I)$.
- Propagation of the uncertainty over a nonlinear function is intractable.
- Usually applies variational approximations⁵.
⁴ Lawrence, N., Gaussian process latent variable models for visualisation of high dimensional data, 2004.
⁵ Titsias, M. and Lawrence, N., Bayesian Gaussian process latent variable model, 2010.
Deep GPs⁶
- If we consider a GPLVM with a GP prior on the inputs, we have a Deep GP model with two layers.
- Multiple hidden layers provide a powerful hierarchical structure for both deep unsupervised and supervised learning.
- Inference is usually performed with a variational approximation.
Figure 18: Deep GP hierarchical structure.
⁶ Damianou, A. and Lawrence, N., Deep Gaussian Processes, 2013.
VGPDS (Variational GP Dynamical Systems)⁷
Figure 19: Dynamical latent variables GP model.
⁷ Damianou, A. et al., Variational Gaussian Process Dynamical Systems, 2011.
GP-SSM (GP State-Space Models)⁸
Figure 20: State-space model with GP transition.
⁸ Frigola, R. et al., Variational Gaussian Process State-Space Models, 2014.
RGP (Recurrent Gaussian Process)⁹
A Deep GP model with a latent autoregressive structure and a specific variational approximation:
$$x_i^{(h)} = f^{(h)}\big(\hat{x}_i^{(h)}\big) + \epsilon_i^{(h)}, \quad f^{(h)} \sim \mathcal{N}\big(0, K_f^{(h)}\big), \quad 1 \leq h \leq H,$$
$$y_i = f^{(H+1)}\big(\hat{x}_i^{(H+1)}\big) + \epsilon_i^{(H+1)}, \quad f^{(H+1)} \sim \mathcal{N}\big(0, K_f^{(H+1)}\big),$$
where
$$\hat{x}_i^{(1)} = \big[x_{i-1}^{(1)}, \ldots, x_{i-L}^{(1)}, u_{i-1}, \ldots, u_{i-L_u}\big]^\top, \quad \text{if } h = 1,$$
$$\hat{x}_i^{(h)} = \big[x_{i-1}^{(h)}, \ldots, x_{i-L}^{(h)}, x_i^{(h-1)}, \ldots, x_{i-L+1}^{(h-1)}\big]^\top, \quad \text{if } 1 < h \leq H,$$
$$\hat{x}_i^{(H+1)} = \big[x_i^{(H)}, \ldots, x_{i-L+1}^{(H)}\big]^\top, \quad \text{if } h = H + 1.$$
⁹ Work in progress!
RGP for System Identification (free simulation)
GP-NARX. RGP with 2 hidden layers.
GP-NARX. RGP with 2 hidden layers.
Final Remarks
- GP is a powerful nonparametric learning framework.
- Versatile method: regression, classification, unsupervised learning, etc.
- Can be made even more flexible by the introduction of new kernel functions and architectures.
- Optimization of hyperparameters directly from the marginal likelihood.
- Computational cost is usually high, but can be alleviated with sparse approximations.
- Output is a fully defined distribution.
References
1 NARENDRA, K. S.; PARTHASARATHY, K. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, v. 1, n. 1, p. 4-27, 1990.
2 KOCIJAN, J.; GIRARD, A.; BANKO, B.; MURRAY-SMITH, R. Dynamic systems identification with Gaussian processes. Mathematical and Computer Modelling of Dynamical Systems, v. 11, n. 4, p. 411-424, 2005.
3 RASMUSSEN, C. E.; WILLIAMS, C. K. I. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
4 TITSIAS, M. K.; LAWRENCE, N. D. Bayesian Gaussian process latent variable model. International Conference on Artificial Intelligence and Statistics, p. 207-215, 2010.
5 DAMIANOU, A.; LAWRENCE, N. Deep Gaussian processes. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, p. 207-215, 2013.
6 FRIGOLA, R.; CHEN, Y.; RASMUSSEN, C. Variational Gaussian process state-space models. Advances in Neural Information Processing Systems, p. 3680-3688, 2014.
Thank you for your attention!
Questions?
Cesar Lincoln Cavalcante Mattos
[email protected] 54 / 54