...predictors for the weather variables that are trained using historical data. A variety of off-the-shelf machine learning procedures can be applied to the recorded data to build these individual predictors. The second component refines the inferences produced by the separate predictors by constraining the output to be spatially smooth and aligned with constraints imposed by physical laws. The interplay of these constraints is dynamic; hence, we develop a data-centric approach. The third component consists of a deep belief network, which leads to a preference for solutions that respect the expected joint statistics of the weather variables. We describe the three key components in detail below and conclude this section with an integrated graphical model of our framework.

3.1 Base-Level Predictors
The base-level predictors are individual regression functions that are trained using historical data at different temporal granularities. The intuition is that long-term historical records of weather should provide insights about the weather at particular locations, given sets of observations in the immediate past. In general, weather conditions change gradually over time and also exhibit cyclicity through seasons, enabling some success in predicting the signals. We train a different predictor for each station and for each range of altitudes considered, as weather conditions change significantly across the vertical profile.

The performance of the local regressions depends critically on evidential features. We consider features over short- and long-term spans of time. For the short-term features, we consider the values of the weather variables over the last seven days. As observations arrive at twelve-hour intervals, we further segment the short-term features by time of day. Such segmentation can be useful because winds, temperature, and other weather variables may differ significantly between day and night due to the influence of solar heating; we therefore consider separate short-term features for day and night rather than averaging over the daily variation. Features spanning longer periods of time incorporate average seasonal data on the weather variables. The long-term features are computed over several years in the past to reduce the influence of atypical weather phenomena. Given the set of engineered features, we use an ensemble of boosted decision-tree learners to make predictions.
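As an illustration of this setup, the sketch below assembles short- and long-term features and fits one boosted-tree regressor; all function names, the feature layout, and the hyperparameters are our own placeholders rather than the paper's exact configuration.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_features(history, seasonal_avg):
    """history: (14,) array = last 7 days of one variable at 12-hour intervals.
    seasonal_avg: long-term seasonal mean for the current date, averaged
    over several past years. Day and night readings are kept separate."""
    day_obs = history[0::2]    # readings at one time of day
    night_obs = history[1::2]  # readings twelve hours later
    return np.concatenate([day_obs, night_obs, [seasonal_avg]])

rng = np.random.default_rng(0)
print(build_features(rng.normal(size=14), seasonal_avg=3.2).shape)

# One model per (station, variable, altitude band); hyperparameters are placeholders.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)

# X: rows of engineered features from historical data; y: the variable's value
# at the forecast horizon (e.g., 12 hours ahead). Synthetic stand-in data here.
X = rng.normal(size=(500, 15))
y = X[:, :14].mean(axis=1) + rng.normal(scale=0.1, size=500)
model.fit(X, y)
print(model.predict(X[:1]))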
3.2 Data-Centric Kernel for Spatial Interpolation
The individual predictors provide predictions only for particular locations (the weather stations), and we need to interpolate the results across larger spatial regions. To extend predictions in a smooth manner beyond the weather stations, we rely on smoothness constraints induced via the GP prior.

The covariance or kernel matrix K captures the notion of similarity among data points that are close in space and time, and is key in determining the accuracy of spatial interpolation. While static Radial Basis Function (RBF) kernels based on distance give reasonable estimates, they fail to capture the dynamics of the system. For instance, predictions about wind velocity at a location s∗ are not necessarily influenced similarly by weather at equidistant stations, owing to factors such as regional turbulence. We need the ability to capture a preferential bias towards classes of functions that respect certain physical constraints among the weather variables. These physical constraints include long-range spatial dependencies, such as wind vectors aligning with isobars¹, and the direct relationship between pressure, temperature, and dew point due to natural gas laws.

We use a novel kernel to define the GP prior. For any pair of locations i and j, if the current pressure, temperature, and wind direction are denoted as p, t, and θ respectively, then we define our kernel as

$$K_{i,j} = K^D_{i,j} \cdot K^\theta_{i,j} \cdot \left(\epsilon K^p_{i,j} + (1-\epsilon) K^t_{i,j}\right). \qquad (1)$$

Here, $K^D_{i,j}$, $K^\theta_{i,j}$, $K^p_{i,j}$ and $K^t_{i,j}$ are RBF kernels over geographic distance, the angle of the wind, pressure, and temperature respectively, and $\epsilon$ is a tunable parameter such that $0 \le \epsilon \le 1$. The resulting kernel matrix is positive semi-definite, as the proposed kernel function is a linear combination and Hadamard product of kernels.

¹ Since we are not doing surface-level predictions, the effect of friction is negligible.
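A direct implementation of Eq. (1) might look as follows; the length scales, the value of ε, and the angular treatment (a plain RBF on the angle, ignoring wrap-around) are our own placeholder choices, not values from the paper.

import numpy as np

def rbf(a, b, length_scale):
    # Squared-exponential similarity between two scalar or vector features.
    d2 = np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def hybrid_kernel(si, sj, eps=0.5, scales=None):
    """Eq. (1): K = K^D * K^theta * (eps*K^p + (1-eps)*K^t).
    si, sj: dicts with keys 'loc' (x, y in km), 'theta' (wind angle, radians),
    'p' (pressure), 't' (temperature). A production version would handle the
    periodicity of theta."""
    s = scales or {'D': 100.0, 'theta': 0.5, 'p': 10.0, 't': 5.0}
    kD = rbf(si['loc'], sj['loc'], s['D'])
    kTheta = rbf(si['theta'], sj['theta'], s['theta'])
    kP = rbf(si['p'], sj['p'], s['p'])
    kT = rbf(si['t'], sj['t'], s['t'])
    return kD * kTheta * (eps * kP + (1.0 - eps) * kT)

a = {'loc': (0.0, 0.0), 'theta': 0.3, 'p': 1012.0, 't': 15.0}
b = {'loc': (80.0, 60.0), 'theta': 0.4, 'p': 1010.0, 't': 14.0}
print(hybrid_kernel(a, b))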
Multiple kernels are commonly used to integrate similarity notions from different sources [6]. In our case, the similarity of wind directions plays a critical role in inducing long-range dependencies. As an example, consider two stations A and B with wind vectors $[a_x, a_y]$ and $[b_x, b_y]$, respectively. We perform interpolation separately in the X and Y directions. Hence, for any station, e.g., station A, we can assume independence in the two directions, such that a neighboring station B can only induce an air-flow change in $a_x$ through $b_x$ and, similarly, $a_y$ is only influenced by $b_y$. $K^\theta$ captures this intuition by defining an RBF over the angles made by the wind vectors with the corresponding axis for which the kernel matrix is defined. The balance between the pressure-gradient force and the Coriolis effect (geostrophic balance) causes the winds to follow isobars. This implies that stations in close vicinity having similar pressure will have winds aligned in the same direction, justifying the contribution of $K^p$ in computing the similarity.
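For reference, the interpolation itself is standard GP regression once the kernel entries are in hand; a minimal sketch follows, with toy kernel values standing in for entries computed via Eq. (1), and a noise level chosen arbitrarily. The X and Y wind components would each be interpolated with their own GP, per the independence assumption above.

import numpy as np

def gp_interpolate(K_train, K_cross, k_star, y, noise=1e-2):
    """Standard GP regression: posterior mean and variance of one wind
    component at a test site, given the kernel among stations (K_train),
    between test site and stations (K_cross), and at the test site (k_star)."""
    A = K_train + noise * np.eye(len(y))
    alpha = np.linalg.solve(A, y)
    mean = K_cross @ alpha
    var = k_star - K_cross @ np.linalg.solve(A, K_cross.T)
    return mean, var

# Toy example: three stations, X wind component in knots.
K_train = np.array([[1.0, 0.6, 0.2],
                    [0.6, 1.0, 0.3],
                    [0.2, 0.3, 1.0]])
K_cross = np.array([0.8, 0.5, 0.25])
y = np.array([12.0, 9.5, -3.0])
mean, var = gp_interpolate(K_train, K_cross, 1.0, y)
print(mean, var)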
3.3 Joint Modeling of Weather Variables
Weather variables are influenced heavily by the interaction of several factors. At the most fundamental level, these dependencies are rooted in the natural laws of thermodynamics. Approaches to inference about weather that rely on numerical simulation seek to characterize these dependencies analytically. However, these interdependencies are complex and unpredictable, which explains the limited success of analytical techniques. At the same time, discriminative statistical analysis beyond the temporal and spatial techniques described above does not generalize well for domains with the dynamism of weather phenomena. For weather, it is natural to consider architectures that can automatically learn rich representations from raw data. Hence, we model the joint distribution between weather variables through a deep belief network (DBN).

The DBN consists of layers of stacked Restricted Boltzmann Machines (RBMs), where the connections between any two layers of an RBM form a bipartite graph. The top layer of the DBN consists of five units corresponding to the normalized values of the latent weather variables (two units representing the 2D wind vector). We assume a Gaussian prior over these variables, such that $W_i \sim N(m_i, d_i)$, with each unit having a bias $a_i$. The primary-level interactions between the variables give rise to a secondary set of features represented as the layer-2 hidden units, H. Similarly, we can have another RBM below to capture the interactions between H and the layer-3 units, G. The hidden units follow a Bernoulli distribution and have biases $b_j$ and $c_k$, respectively. The weights between the first two layers are $U = [u_{ij}]$ and between the next two layers are $V = [v_{jk}]$. Several structural and tunable design parameters are involved, which we discuss in the next section.
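For concreteness, a sketch of this two-RBM stack is given below; the hidden-layer sizes and initialization scales are placeholders, since the text leaves them as tunable design parameters.

import numpy as np

rng = np.random.default_rng(0)

class WeatherDBN:
    """Sketch of the stack described above: a Gaussian top layer of five
    weather units (wind x, wind y, pressure, temperature, dew point)
    connected to Bernoulli layer H via U, and H to Bernoulli layer G via V."""
    def __init__(self, n_hidden1=20, n_hidden2=10):
        self.a = np.zeros(5)                                      # visible biases a_i
        self.b = np.zeros(n_hidden1)                              # layer-2 biases b_j
        self.c = np.zeros(n_hidden2)                              # layer-3 biases c_k
        self.U = 0.01 * rng.normal(size=(5, n_hidden1))           # layer 1-2 weights
        self.V = 0.01 * rng.normal(size=(n_hidden1, n_hidden2))   # layer 2-3 weights

    @staticmethod
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def up_pass(self, w):
        # Bernoulli activation probabilities given normalized weather values w.
        h = self.sigmoid(self.b + w @ self.U)   # P(H_j = 1 | W)
        g = self.sigmoid(self.c + h @ self.V)   # P(G_k = 1 | H), using mean-field h
        return h, g

dbn = WeatherDBN()
w = np.array([0.2, -0.1, 0.05, 0.3, 0.1])       # normalized weather variables
h, g = dbn.up_pass(w)
print(h.shape, g.shape)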
3.4 Probabilistic Graphical Model
The graphical model for the proposed approach is shown in Figure 2. The matrix W is the collection of all the weather variables $w_i$ denoting the true value at each location i. The observations $z_i = \{v_i^x, v_i^y, p_i, t_i, d_i\}$ recorded at any of the sites are simply noisier versions of these true values. We use plate notation to show observations at $N_s$ weather stations. Each weather station has random variables for wind velocity, dew point, pressure, and temperature. Each of these random variables is constrained via (a) an individual predictor that is trained on the historical data (ψ(·)), (b) a Gaussian process prior (GP(·)) and the Gaussian likelihood function φ(·), which use a data-dependent prior to impose spatial and functional smoothness, and (c) a deep belief network that encourages solutions that respect the joint statistics observed in historical data and that are also aligned with physical laws.

[Figure 2: Deep hybrid model. Probabilistic graphical model for weather prediction where weather stations, denoted by S, induce a Gaussian process (GP) prior over the true values of the weather variables W. Only noisier versions (z_i) of the true values are observed at all the sites and are related via φ. The forecasts given by the pre-trained predictor are related to the future observations via the potential ψ(·). The joint distribution of the true weather variables is further constrained via a deep belief network (η(·)). All potentials arise at a test site s∗, except that there is no pre-trained predictor.]

Similarly, at a test site s∗, we use the GP prior and the likelihood to first interpolate observations made at the weather stations. These interpolations are then further constrained via the deep belief network to impose the joint statistical distribution of the weather variables. Formally, we have the following distribution corresponding to the graphical model:

$$p(W, Z \mid S) \propto GP(W; S) \prod_{i \in L \cup *} \phi(w_i, z_i)\, \eta(z_i) \prod_{i \in L} \psi(z_i).$$

Here, Z is the collection of random variables representing the observations at any of the locations. The terms GP(·) and φ(·) enforce the smoothness constraints as defined by the data-dependent kernel (described in Section 3.2). The potential term η(·) arises from the deep belief network component (Section 3.3). Finally, the term ψ(·) applies only to the weather-station sites and enforces consistency with the predictions of the pre-trained regression functions. In particular, the observations are related to the output of an individual predictor via a simple Gaussian function: e.g., $\psi(p) = N(\mu_p, \kappa^2)$, where $\mu_p$ is the individual prediction for the pressure variable.
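To make the factorization concrete, here is a sketch of evaluating the unnormalized log-density for a single interpolated variable; the noise scales and the zero-mean GP are placeholder assumptions, not the paper's settings.

import numpy as np
from scipy.stats import multivariate_normal, norm

def log_joint(w, z, K, psi_means, kappa=1.0, phi_sigma=0.5, eta=None):
    """Unnormalized log of the factorization above, for one weather variable:
    GP prior over true values w, Gaussian likelihoods phi(w_i, z_i), the DBN
    potential eta(z_i) passed in as a callable, and Gaussian potentials psi
    tying station observations to their pre-trained predictions."""
    lp = multivariate_normal.logpdf(w, mean=np.zeros(len(w)), cov=K)  # GP(W; S)
    lp += norm.logpdf(z, loc=w, scale=phi_sigma).sum()                # phi terms
    if eta is not None:
        lp += sum(eta(zi) for zi in z)                                # eta terms
    # psi applies only at the station sites (the first len(psi_means) entries).
    lp += norm.logpdf(z[:len(psi_means)], loc=psi_means, scale=kappa).sum()
    return lp

K = np.array([[1.0, 0.5], [0.5, 1.0]])
print(log_joint(np.array([0.1, -0.2]), np.array([0.0, -0.1]), K,
                psi_means=np.array([0.05, -0.15])))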
Algorithm 1 Deep hybrid model learning.
procedure TrainWeatherModels
  ▷ Boosted Decision Trees for Every Location, Variable
  for all x ∈ {v, p, t, d} do
    for all s ∈ S do
      trainData ← getTrainData(x, getHistData(s))
      param ← getBestParam(trainData)
      BstDecTree[x, s] ← TrainBDTree(param)
    end for
  end for

Algorithm 2 Deep hybrid model inference.
procedure ForecastWeatherVariable(x, s∗, Z)
  ▷ Prediction Variable: x, Test Site: s∗, Observations: Z
  if s∗ ∈ S then
    tmp∗ ← getBDTreePred(x, s∗, Z)
    ▷ uses corresponding BstDecTree model from Alg. 1
  else
    for all si ∈ S do
      tmpi ← getBDTreePred(x, si, Z)
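A compact Python rendering of the two procedures is sketched below; the helper callables stand in for the paper's getTrainData/getHistData/getBestParam, and the demo data are synthetic.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

VARIABLES = ['v', 'p', 't', 'd']   # wind, pressure, temperature, dew point

def train_weather_models(stations, get_hist_data, get_train_data, best_params):
    # Algorithm 1: one boosted-tree model per (variable, station) pair.
    models = {}
    for x in VARIABLES:
        for s in stations:
            X, y = get_train_data(x, get_hist_data(s))
            models[(x, s)] = GradientBoostingRegressor(**best_params(X, y)).fit(X, y)
    return models

def forecast(models, x, s_star, stations, featurize):
    # Algorithm 2 (first branch): if the test site is a station, query its own
    # model; otherwise collect per-station predictions for GP interpolation.
    if s_star in stations:
        return models[(x, s_star)].predict(featurize(s_star))
    return {s: models[(x, s)].predict(featurize(s)) for s in stations}

rng = np.random.default_rng(0)
stations = ['KSEA', 'KGEG']
hist = {s: rng.normal(size=(50, 15)) for s in stations}
models = train_weather_models(stations,
                              lambda s: hist[s],
                              lambda x, H: (H[:, :14], H[:, 14]),
                              lambda X, y: {'n_estimators': 50, 'max_depth': 2})
print(forecast(models, 'v', 'KSEA', stations, lambda s: hist[s][:1, :14]))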
Additionally, we approximate the posteriors over the latent variables as $q(H_j) \sim \mathrm{Bern}(\gamma_j)$ and $q(G_k) \sim \mathrm{Bern}(\beta_k)$ for the two hidden layers, respectively. The following variational updates are then used to estimate the parameters of the distribution (for the l-th component):

Layer 2 to 3:
$$\frac{1}{\beta_l^{t+1}} = 1 + e^{-\left[c_l + \sum_j \gamma_j'^{\,t+1} v_{jl}\right]}, \quad \text{where} \quad \frac{1}{\gamma_l'^{\,t+1}} = 1 + e^{-\left[b_l + \sum_i \mu_i^{t} u_{il}\right]}$$

Layer 3 to 2:
$$\frac{1}{\gamma_l^{t+1}} = 1 + \frac{1 - \gamma_l'^{\,t+1}}{\gamma_l'^{\,t+1}}\; e^{-\sum_k \beta_k^{t+1} v_{lk}}$$

Layer 2 to 1:
$$\mu_l^{t+1} = \frac{m_l + d_l^2\, a_l + d_l^2 \sum_j \gamma_j^{t+1} u_{lj}}{1 + d_l^2}, \qquad \sigma_l^{t+1} = \frac{d_l}{\sqrt{1 + d_l^2}}$$

The mean parameters $\mu_l$ are initialized to the estimates $m_l$, while γ and β are initialized randomly. The parameter $d_l$ corresponds to the variance in the initial estimates and signifies our confidence in those predictions. We set these variances via a cross-validation procedure over historical data. The derivation of the above update equations follows from the application of prior work in variational inference [1] to deep belief networks.
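Read operationally, one sweep of these updates is a bottom-up/top-down pass over the three layers. The following is a minimal sketch of that sweep under our reconstruction of the (garbled) equations above; layer sizes, initial weights, and data are toy placeholders, not values from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_sweep(mu, m, a, d, b, c, U, V):
    # Bottom-up: gamma' from the current Gaussian means (layer 1 -> 2).
    gamma_p = sigmoid(b + mu @ U)
    # Layer 2 to 3: update beta from gamma'.
    beta = sigmoid(c + gamma_p @ V)
    # Layer 3 to 2: combine the bottom-up odds with the top-down message.
    odds = (1.0 - gamma_p) / np.clip(gamma_p, 1e-9, None)
    gamma = 1.0 / (1.0 + odds * np.exp(-(beta @ V.T)))
    # Layer 2 to 1: posterior mean/variance of the Gaussian weather units.
    mu_new = (m + d**2 * a + d**2 * (gamma @ U.T)) / (1.0 + d**2)
    sigma = d / np.sqrt(1.0 + d**2)
    return mu_new, sigma, gamma, beta

rng = np.random.default_rng(0)
m = rng.normal(size=5); mu = m.copy()   # mu initialized to the estimates m
a = np.zeros(5); d = 0.3 * np.ones(5)   # visible biases and confidence variances
b = np.zeros(8); c = np.zeros(4)        # hidden biases
U = 0.1 * rng.normal(size=(5, 8)); V = 0.1 * rng.normal(size=(8, 4))
for _ in range(10):
    mu, sigma, gamma, beta = mean_field_sweep(mu, m, a, d, b, c, U, V)
print(mu, sigma[0])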
5. EXPERIMENTAL EVALUATION
We performed a set of experiments to evaluate the proposed methodology. In the experiments, we explored four main questions. First, we compare and highlight the advantage of the spatial interpolation procedure that relies on a data-centric dynamic kernel matrix over the more commonly used static kernel matrix. Second, we compare the proposed model with a baseline approach. Third, we explore the importance of modeling the joint statistics of the predictive variables via the deep belief network. Finally, we compare the wind forecast results with those of state-of-the-art systems.

The experiments were based on five years of historical data, from 2009 to present, extracted from the IGRA dataset. The data consists of balloon observations recorded at 60 locations across the continental US.

5.1 Interpolation in a Hybrid Field
To illustrate the efficacy of a hybrid kernel in handling long-range spatial dependencies among weather variables, we interpolated the winds at each of the stations, considering the wind measurements from the rest of the U.S. stations. Thus, each station served as an independent test point, whose value is interpolated using a GPR model and compared against the true winds shown in Fig. 1 (a). Fig. 1 (b) shows the interpolated wind vectors when a static kernel matrix is used. Here, an entry $K_{i,j}$ of the matrix is simply a decreasing exponential in the geographical distance between two stations i and j. In contrast, the hybrid approach, as illustrated in Fig. 1 (c), captures the similarity between each pair of stations such that every entry $K_{i,j}$ of the matrix is computed dynamically at training time using the formula given in Eq. 1. The pressure, temperature, and angle θ between the wind vectors are the known values for the current time step.

[Figure 3: True versus interpolated wind plots (axes in knots) for the static and hybrid kernels. Static interpolation shows high deviations from true winds. Atmospheric dynamics are more effectively captured with use of a hybrid data-centric kernel.]

Now consider the following stations: Topeka (Kansas), Omaha (Nebraska), Springfield (Missouri) and Norman (Oklahoma), referred to in Fig. 1 as stations P, A, B and C, respectively. For a static interpolation of winds at P, a higher contribution would come from B than C, as B is geographically closer. However, the temperature and pressure con- […] $K_{P,A}$ is maximum in both cases. Hence, the hybrid kernel does not ignore distance as a similarity criterion. However, in cases involving a tradeoff between distance and other weather variables, their combined contribution might alter the relative importance of a particular neighboring station, as in the aforementioned case. The quantitative gains in prediction accuracy are displayed in the RMS plots in Fig. 3 (a, b).
5.2 Dynamic Prediction and Deep Learning
In another experiment, we evaluate the performance gains due to the final refinement step of the deep belief network. The percentage reductions in error for wind forecasts for three time steps into the future are shown in Table 1. We see that the DBN leads to an additional 1-2% error reduction; clearly, modeling the joint statistics of the weather variables helps in making better predictions. We observed a performance improvement of similar magnitude for the other weather variables as well.

              RMS Error Reduction (in %)
Time Step     X       Y       Overall
6 hours       2.17    2.05    2.11
12 hours      1.05    1.01    1.03
24 hours      1.05    0.97    1.01

Table 1: Improvement in performance obtained using the deep belief network. The final step of refinement uses the DBN results to further improve prediction accuracy.

After establishing the superiority of the data-centric kernel and the DBN independently, we evaluate the prediction accuracy of the full deep hybrid model for each weather variable², aggregated over all stations in the continental U.S. where current and historical data are available.

² The IGRA dataset provides the geopotential height and dew point at roughly constant pressures. These quantities, under reasonable assumptions, serve as proxies for pressure and specific humidity, respectively.

[Figure 4: Results on predicting weather variables for different approaches (bar charts of RMS error for winds, dew point, and the remaining variables at 6-, 12- and 24-hour horizons, for approaches A, B, and C). The temporal predictors that employ a hybrid data-centric scheme for interpolation and use DBNs for modeling the joint relationship among weather variables show significant improvements over the baselines.]
The accuracy of the proposed model is compared with two baseline models in Fig. 4 for the same three future time steps as before. The first baseline, marked as A in Fig. 4, uses the current values as estimates for the future; the intuition is that weather conditions typically do not change greatly over a day. For the second baseline (marked B), we construct a typical spatiotemporal prediction model, where baseline boosted decision-tree predictors are augmented with a static interpolation scheme. We observe that the deep hybrid model (marked C), comprising a dynamic data-driven interpolation scheme and a DBN in addition to the boosted decision-tree predictors used in B, significantly outperforms both baselines. In a couple of cases involving short-term temperature forecasts, B marginally outperformed the DHM, suggesting limited interdependences between temperature and the other variables for short-term predictions.

5.3 Comparison with the State of the Art
Apart from winds, forecasts for other variables across the vertical atmospheric profile are not available for comparative analyses. We therefore compare the wind predictions of the proposed model against two forecast systems. The first, proposed by [9], makes predictions using a static GPR interpolation scheme, coupled with relative velocity data obtained through airplanes. Our second set of comparisons is with the Winds Aloft forecast, released by NOAA every six hours, for three time steps into the future: 6, 12 and 24 hours. Table 2 shows the accuracy of the forecast systems for the Seattle station. The results summarize the predictions made for the weather stations in the state of Washington for a period of one month. We observe that, while Kapoor et al. 2014 achieves better performance than NOAA, the proposed method shows significantly better performance than both competitors.

                                     RMS Error (in knots)
Time Step   Model                    X       Y       Overall
6 hours     Deep Hybrid Model        2.29    1.33    1.81
            Kapoor et al. 2014       3.94    2.16    3.05
            NOAA                     3.18    3.44    3.31
12 hours    Deep Hybrid Model        4.44    2.59    3.56
            Kapoor et al. 2014       5.03    3.93    4.48
            NOAA                     5.13    4.34    4.88
24 hours    Deep Hybrid Model        6.57    3.82    5.19
            Kapoor et al. 2014       8.93    5.24    7.08
            NOAA                     8.79    6.37    7.58

Table 2: Comparison of the proposed methodology with the state of the art in wind prediction. Results summarized here are for weather stations in Washington for a period of one month. We observe that the new model results in significantly lower errors than competitive models.
6. CONCLUSION AND FUTURE WORK
We presented a weather forecasting model that makes predictions via considerations of the joint influence of key weather variables. We introduced a data-centric kernel and showed how using GPR with such a kernel can effectively interpolate over space, taking into account weather phenomena such as turbulence. We performed temporal analysis using short- and longer-term features within a gradient-tree based learner. We augmented the system with a deep belief network and tuned the parameters to model the dependencies among weather variables. A set of experiments on real-world data shows that the new methodology can provide better results than NOAA benchmarks, as well as recent research that had demonstrated improvements over those benchmarks.

Future work includes projecting weather predictions to more distant times in the future. We are also interested in exploring the use of computations of the value of information to guide sensing at weather stations. We note that airplanes in flight can serve as sensors of wind speeds, as explored in [9]. We wish to investigate the boosts in predictive power that might be achieved via integrating such additional data into the hybrid model.

Acknowledgments
We are grateful to Imke Durre for answering our queries concerning the IGRA dataset. The first author would like to thank Microsoft Research, Redmond for conducting the Worldwide Internship Program that made this research possible.

7. REFERENCES
[1] M. J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.
[2] L. Chen and X. Lai. Comparison between ARIMA and ANN models used in short-term wind speed forecasting. In Power and Energy Engineering Conference (APPEEC), 2011 Asia-Pacific, pages 1-4. IEEE, 2011.
[3] A. S. Cofiño, R. Cano, C. Sordo, and J. M. Gutiérrez. Bayesian networks for probabilistic weather prediction. In 15th European Conference on Artificial Intelligence (ECAI), 2002.
[4] I. Durre, R. S. Vose, and D. B. Wuertz. Overview of the Integrated Global Radiosonde Archive. Journal of Climate, 19(1):53-68, 2006.
[5] I. Durre, R. S. Vose, and D. B. Wuertz. Robust automated quality assurance of radiosonde temperatures. Journal of Applied Meteorology and Climatology, 47(8):2081-2095, 2008.
[6] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211-2268, 2011.
[7] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[8] I. Horenko, R. Klein, S. Dolaptchiev, and C. Schütte. Automated generation of reduced stochastic weather models I: Simultaneous dimension and model reduction for time series analysis. Multiscale Modeling & Simulation, 6(4):1125-1145, 2008.
[9] A. Kapoor, Z. Horvitz, S. Laube, and E. Horvitz. Airplanes aloft as a sensor network for wind forecasting. In Proceedings of the 13th International Symposium on Information Processing in Sensor Networks (IPSN), pages 25-34. IEEE Press, 2014.
[10] V. M. Krasnopolsky and M. S. Fox-Rabinovitz. Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19(2):122-134, 2006.
[11] R. J. Kuligowski and A. P. Barros. Localized precipitation forecasts from a numerical weather prediction model using artificial neural networks. Weather and Forecasting, 13(4):1194-1204, 1998.
[12] G. Marchuk. Numerical Methods in Weather Prediction. Elsevier, 2012.
[13] A. McGovern, D. John Gagne, N. Troutman, R. A. Brown, J. Basara, and J. K. Williams. Using spatiotemporal relational random forests to improve our understanding of severe weather processes. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(4):407-429, 2011.
[14] A. McGovern, T. Supinie, D. Gagne II, M. Collier, R. Brown, J. Basara, and J. Williams. Understanding severe weather processes through spatiotemporal relational random forests. In 2010 NASA Conference on Intelligent Data Understanding, 2010.
[15] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1647-1655, 2014.
[16] Y. Radhika and M. Shashi. Atmospheric temperature prediction using support vector machines. International Journal of Computer Theory and Engineering, 1(1), 2009.
[17] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[18] L. F. Richardson. Weather Prediction by Numerical Process. Cambridge University Press, 2007.
[19] N. I. Sapankevych and R. Sankar. Time series prediction using support vector machines: A survey. IEEE Computational Intelligence Magazine, 4(2):24-38, 2009.
[20] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601-1608, 2009.
[21] C. Voyant, M. Muselli, C. Paoli, and M.-L. Nivet. Numerical weather prediction (NWP) and hybrid ARMA/ANN model to predict global radiation. Energy, 39(1):341-355, 2012.