
A Deep Hybrid Model for Weather Forecasting

Aditya Grover∗ Ashish Kapoor Eric Horvitz


IIT Delhi Microsoft Research Microsoft Research
[email protected] [email protected] [email protected]

ABSTRACT

Weather forecasting is a canonical predictive challenge that has depended primarily on model-based methods. We explore new directions with forecasting weather as a data-intensive challenge that involves inferences across space and time. We study specifically the power of making predictions via a hybrid approach that combines discriminatively trained predictive models with a deep neural network that models the joint statistics of a set of weather-related variables. We show how the base model can be enhanced with spatial interpolation that uses learned long-range spatial dependencies. We also derive an efficient learning and inference procedure that allows for large-scale optimization of the model parameters. We evaluate the methods with experiments on real-world meteorological data that highlight the promise of the approach.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning

General Terms

Machine Learning, Graphical Models, Weather Forecasting

Keywords

Gaussian Processes, Deep Learning

1. INTRODUCTION

Making inferences and predictions about weather has been an omnipresent challenge throughout human history. Challenges with accurate meteorological modeling bring to the fore difficulties with reasoning about the complex dynamics of Earth's atmospheric system. Methods have sought to define weather in terms of sets of fundamental quantities, and various characterizations have been proposed and employed in forecasting systems. We explore weather as a fundamental challenge for machine data mining and inference. We introduce methods that show promise for advancing the state of the art of weather forecasting systems.

In the U.S., the National Oceanic and Atmospheric Administration (NOAA) is responsible for providing publicly available weather forecasts, based on periodic observations. These measurements are logged in the Integrated Global Radiosonde Archive (IGRA) [4]. Forecasts for winds and temperature are accessible via NOAA's Winds Aloft program.

To date, the best approaches to weather modeling rely on mathematical simulations. The methodology centers on the use of a generative model to capture atmospheric dynamics, where samples are drawn from physical simulations to make predictions [18, 12]. In contrast, we take a data-centric approach. Rather than define a generative model, we discriminatively train predictive models from historical data on a core set of variables for learning and inference about weather: atmospheric pressure, temperature, dew point, and winds. We use boosted decision trees as predictors in the studies.

Several challenges must be addressed in taking a data-centric approach to weather prediction. First, we note that the set of weather variables under consideration is tightly coupled. For example, pressure and temperature follow natural gas laws (i.e., the well-known formula PV = nRT). Similarly, there is a tight relationship between relative humidity and temperature. Consequently, any model that aims to jointly predict the set of weather variables should leverage knowledge of the tight statistical couplings that are based in physics. Secondly, dependencies among the variables may have long-range influences across space and time. For instance, wind vectors across large geographic distances may follow isobaric contours. As another consideration, weather phenomena may be affected by local geography and associated natural processes (e.g., isolated thunderstorms), as well as by shifts in the large-scale structure of atmospheric phenomena (e.g., shifting of jet streams).

We aim to tackle these challenges via a representation that jointly predicts winds, temperature, pressure, and dew point across space and time. The proposed architecture combines a bottom-up predictor for each individual variable with a top-down deep belief network that models the joint statistical relationships. Another key component in the framework is a data-driven kernel, based on a similarity function that is learned automatically from the data. The kernel is used to impose long-range dependencies across space and to ensure that the inferences respect natural laws. We present an efficient procedure for combining inferences from separate predictors of local phenomena while considering constraints imposed by the deep belief network, such that the predictions respect the natural regularities expected with the large-scale phenomena.

∗Research performed during an internship at Microsoft Research.

KDD'15, August 10-13, 2015, Sydney, NSW, Australia.
© 2015 ACM. ISBN 978-1-4503-3664-2/15/08. DOI: http://dx.doi.org/10.1145/2783258.2783275
The main contributions of this work can be summarized as follows:

1. We present a novel hybrid model with discriminative and generative components for spatiotemporal inferences about weather.

2. We design and implement a data-driven kernel function that shapes predictions in accordance with physical laws.

3. We provide an efficient inference procedure that enables optimization of the predictive model in accordance with large-scale phenomena.

4. We evaluate the methods with a set of experiments that highlight the performance and value of the methodology.

The rest of the paper is structured as follows: We next discuss background and related work. In Section 3, we describe the technical details of our approach, showing the components of a comprehensive graphical model that we call a Deep Hybrid Model. The learning and inference algorithms based on this model are discussed in Section 4. In Section 5, we present the results of experiments with the model on real-world data. We conclude with a brief summary and discuss future work in Section 6.

2. BACKGROUND AND RELATED WORK

The proliferation of satellites, radar, and sensors, coupled with rapidly decreasing costs of storing and distributing information, has catalyzed an explosion in the quantities of weather data available for studies. Most work in weather forecasting to date relies on the use of generative approaches, where the weather systems are simulated via numerical methods [18, 12, 10], or relies on time-series analysis such as ARIMA models and simple classifiers based on Artificial Neural Networks [11, 10, 8, 2, 21] or Support Vector Machines [16, 19]. These statistical models often make strong assumptions, such as spatial independence to overcome the curse of dimensionality, which do not hold well in practice.

Despite the success of machine learning in a variety of tasks, applications to the problem of weather forecasting have been limited. Exceptions include the use of Bayesian Networks for precipitation forecasts [3] and temporal modeling via Restricted Boltzmann Machines (RBM) [20, 15]. A separate thread of research has also focused on efficient representation of relational spatiotemporal data in Random Forests for prediction of severe surface-level weather processes, such as droughts and tornadoes [14, 13]. More recently, large-scale wind prediction has been presented [9] using a Bayesian framework with Gaussian Processes [17].

To date, uses of machine learning for weather prediction have been limited in several ways. First, almost all methods consider only one variable at a time and do not explore the joint spatiotemporal statistics of multiple weather phenomena. Also, to our knowledge, long-range spatiotemporal dependencies have not been modeled explicitly. Thus, models have been blind to long-range phenomena based on the laws of nature, such as winds aligning by pressure as captured by the structure and dynamics of isobars.

We introduce methods that address these limitations via the introduction of a hybrid representation. With a hybrid representation, individual predictors are discriminatively trained from historical data, and local inferences from these models are combined with a deep neural network that overlays statistical constraints among key weather variables. We additionally apply a spatial interpolation scheme that respects constraints of long-range statistical dependencies. The methodology employs a covariance matrix for Gaussian Process regression constructed from a large dataset. Here, the covariance matrix, also referred to as the kernel, allows us to enforce smoothness constraints over the weather variables. By ensuring that the kernel captures the dynamics of the system as informed by the training data, we are able to align estimates according to spatial constraints imposed by natural laws.

3. THE DEEP HYBRID MODEL

We seek a prediction model that respects spatiotemporal dependencies among weather variables induced by atmospheric physics. We test the framework with data drawn from a continental-scale weather corpus composed of data captured via balloons. In particular, we consider the IGRA dataset, consisting of balloon observations made at 60 stations across the U.S. These balloons transmit observations about wind speed and direction, temperature, geopotential height, dew point, and other weather variables. The observations are released in real time by the NOAA and later by the National Climatic Data Center following preprocessing. The data is eventually integrated into the curated IGRA dataset, which is updated daily and contains historical weather data spanning decades, compiled from eleven source datasets. Any data added to the archive undergoes a cycle of quality assurance to resolve potential inconsistencies among variables [4, 5].

Formally, we consider four weather variables in the model: wind velocity, v; pressure, p; temperature, t; and dew point, d. The wind observations are represented as a two-dimensional vector, v = [v^x, v^y], while all other weather variables are scalars. We represent the weather stations (where the balloons are released) as S_L = {s_1, ..., s_{N_s}}, where N_s is the total number of weather stations. For each of these stations, we have historical weather data logged at a frequency of approximately six hours over several years.

Our approach to building the weather model was governed by the following guidelines:

1. Temporal mining: Our model should be able to identify and learn from recurring weather patterns over time.

2. Spatial interpolation: The dynamic influence of atmospheric laws on weather phenomena needs to be accounted for in our predictions.

3. Inter-variable interactions: The local interdependencies between weather variables should be captured by our model.

Accordingly, our model can be viewed as having three main components.
Figure 1: Spatial interpolation of winds in a static and hybrid field. Filled contours represent temperature and isobar lines are marked in red. In the hybrid field, the interpolated wind vectors are closely aligned with the true values. However, the static field fails to account for the long-range dependencies.

The first component is a set of individual predictors for the weather variables that are trained using historical data. A variety of off-the-shelf machine learning procedures can be applied to the recorded data to build these individual predictors. The second component works to refine the inferences produced by the separate predictors by constraining the output to be spatially smooth and aligned with constraints imposed by physical laws. The interplay of these constraints is dynamic and, hence, we develop a data-centric approach. The third component consists of a deep belief network, which leads to a preference for solutions that respect the expected joint statistics of the weather variables. We describe the three key components in detail below and conclude this section with an integrated graphical model of our framework.

3.1 Base-Level Predictors

The base-level predictors are individual regression functions that are trained using historical data at different temporal granularities. The intuition is that long-term historical records of weather should provide insights about the weather at particular locations, given sets of observations in the immediate past. In general, weather conditions change gradually over time and also exhibit cyclicity through seasons, consequently enabling some success in predicting the signals. We need to train different predictors for each station and range of altitudes considered, as weather conditions change significantly across the vertical profile.

The performance of the local regressions depends critically on evidential features. We consider features over short- and long-term spans of time. For the short-term features, we consider the values of weather variables over the last seven days. As observations come at twelve-hour intervals, we consider shorter-term features by time of the day. Such short-term segmentation can be useful because winds, temperature, and other weather variables may differ significantly over day and night due to the influence of solar heating. We consider separate short-term features for day and night rather than averaging over the daily variation. Features spanning longer periods of time incorporate average seasonal data on the weather variables. The long-term features are computed for several years in the past to reduce the influence of atypical weather phenomena. Given the set of engineered features, we use an ensemble of boosted decision-tree learners to make predictions.
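As a concrete illustration, the sketch below assembles a feature vector from the past week of observations (split by day and night) together with seasonal averages, and fits one boosted-tree regressor. It is a minimal sketch that uses scikit-learn's gradient boosting as a stand-in for the boosted decision trees described here; the feature layout, helper names, and hyperparameter values are illustrative assumptions rather than the tuned settings discussed in Section 4.1.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def build_features(history, season_avg):
    # history: (14, 5) array holding the last seven days of observations at a
    # twelve-hour cadence; columns = [vx, vy, p, t, d]. season_avg: long-term
    # seasonal averages of the same variables (assumed precomputed).
    day = history[0::2].ravel()    # daytime observations
    night = history[1::2].ravel()  # nighttime observations, kept separate
    return np.concatenate([day, night, season_avg])

# One regressor per (station, altitude band, variable); the hyperparameters
# mirror the quantities tuned in the paper (leaves, iterations, learning rate).
model = GradientBoostingRegressor(max_leaf_nodes=32, n_estimators=200,
                                  learning_rate=0.05)
# X = np.stack([build_features(h, s) for h, s in windows]); model.fit(X, y)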
3.2 Data-Centric Kernel for Spatial Interpolation

The individual predictors provide predictions only for particular locations (the weather stations), and we need to interpolate the results across larger spatial regions. To extend predictions in a smooth manner beyond the weather stations, we rely on smoothness constraints induced via the GP prior.

The covariance or kernel matrix K captures the notion of similarity among data points that are close in space and time, and it is key in determining the accuracy of spatial interpolation. While static Radial Basis Function (RBF) kernels based on distance give reasonable estimates, they fail to capture the dynamics of the system. For instance, predictions about wind velocity at a location s∗ are not necessarily influenced similarly by weather at equidistant stations, per factors such as regional turbulence. We need the ability to capture a preferential bias towards classes of functions that respect certain physical constraints among the weather variables. The physical constraints include long-range spatial dependencies, such as wind vectors aligning with isobars¹, and the direct relationship between pressure, temperature, and dew point due to natural gas laws.

We use a novel kernel to define a GP prior. For any pair of locations i and j, if the current pressure, temperature, and wind direction are denoted as p, t and θ respectively, then we define our kernel as:

$$K_{i,j} = K^{D}_{i,j} \cdot K^{\theta}_{i,j} \cdot \left(\epsilon K^{p}_{i,j} + (1-\epsilon) K^{t}_{i,j}\right). \qquad (1)$$

Here, $K^{D}_{i,j}$, $K^{\theta}_{i,j}$, $K^{p}_{i,j}$ and $K^{t}_{i,j}$ are RBF kernels over geographic distance, the angle of the wind, pressure, and temperature, respectively, and $\epsilon$ is a tunable parameter such that $0 \le \epsilon \le 1$. The resulting kernel matrix is positive semi-definite, as the proposed kernel function is a linear combination and Hadamard product of kernels.

¹Since we are not doing surface-level predictions, the effect of friction is negligible.
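To make Eq. 1 concrete, the sketch below computes the composite kernel matrix from elementwise RBF factors. It is a minimal sketch: the function and argument names are hypothetical, the default bandwidths and ε echo the cross-validated values reported in Section 4.1, and angular differences are assumed to be precomputed and wrapped to a common range.

import numpy as np

def rbf(diff, bandwidth):
    # Elementwise RBF similarity: exp(-diff^2 / (2 * bandwidth^2)).
    return np.exp(-diff ** 2 / (2.0 * bandwidth ** 2))

def hybrid_kernel(d_dist, d_theta, d_p, d_t, eps=0.2,
                  bw_dist=150.0, bw_theta=0.1, bw_p=1.0, bw_t=0.05):
    # d_dist, d_theta, d_p, d_t: pairwise absolute differences in geographic
    # distance, wind angle, pressure, and temperature (all N x N arrays).
    # Eq. 1: K = K^D * K^theta * (eps * K^p + (1 - eps) * K^t), where the
    # products on matrices denote the Hadamard (elementwise) product.
    K_D, K_th = rbf(d_dist, bw_dist), rbf(d_theta, bw_theta)
    K_p, K_t = rbf(d_p, bw_p), rbf(d_t, bw_t)
    return K_D * K_th * (eps * K_p + (1.0 - eps) * K_t)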
Multiple kernels are commonly used to integrate notions of similarity from different sources [6]. In our case, the similarity between any two sites is a function of the geographic distance as well as the similarity in the weather variables. We note that the kernel $K^{\theta}$ over the wind direction plays a critical role in inducing long-range dependencies. As an example, consider two stations A and B with wind vectors [a_x, a_y] and [b_x, b_y], respectively. We perform interpolation separately in the X and Y directions. Hence, for any station, e.g., station A, we can assume independence in the two directions, such that a neighboring station B can only induce an air flow change in a_x through b_x and, similarly, a_y is only influenced by b_y. $K^{\theta}$ captures this intuition by defining an RBF over the angles made by the wind vectors with the corresponding axis for which the kernel matrix is defined. The balance between the pressure gradient force and the Coriolis effect (geostrophic force) causes the winds to follow isobars. This implies that stations in close vicinity having similar pressure will have winds aligned in the same direction, justifying the contribution of $K^{p}$ in computing the similarity.
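For reference, a minimal sketch of how such a kernel drives Gaussian Process interpolation at a test site is given below. It assumes the hybrid_kernel sketch above, a zero-mean GP prior, and a small nugget term for numerical stability; the noise level is an assumed placeholder, not a value reported in the paper.

import numpy as np

def gp_interpolate(K_train, k_star, y, noise=1e-2):
    # Standard GP regression posterior mean under a zero-mean prior:
    # w* = k*^T (K + noise * I)^{-1} y, where K_train is the kernel among
    # the stations and k_star the kernel between the test site and stations.
    alpha = np.linalg.solve(K_train + noise * np.eye(len(y)), y)
    return k_star @ alpha

# Per the independence assumption above, the X and Y wind components are
# interpolated separately, each with its own call to gp_interpolate.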
3.3 Joint Modeling of Weather Variables

Weather variables are influenced heavily by the interaction of several factors. At the most fundamental level, these dependencies are based in the natural laws of thermodynamics. Approaches to inference about weather relying on numerical simulation seek to characterize these dependencies analytically. However, these interdependencies are complex and unpredictable, which explains the limited success of analytical techniques. At the same time, discriminative statistical analysis, beyond the temporal and spatial techniques described above, does not generalize well for domains with the dynamism of weather phenomena. For weather, it is natural to consider architectures that can automatically learn rich representations from raw data. Hence, we model the joint distribution between the weather variables through a deep belief network (DBN).

The DBN consists of layers of stacked Restricted Boltzmann Machines (RBM), where the connections between any two layers of an RBM form a bipartite graph. The top layer of the DBN consists of five units corresponding to the normalized values of the latent weather variables (two units for representing the 2D vector winds). We assume a Gaussian prior over these variables, such that $W_i \sim N(m_i, d_i)$, with each unit having a bias $a_i$. The primary-level interactions between the variables give rise to a secondary set of features represented as the layer-2 hidden units, H. Similarly, we can have another RBM below to capture the interactions between H and the layer-3 units, G. The hidden units follow a Bernoulli distribution and have biases $b_j$ and $c_k$. The weights between the first two layers are $U = [u_{ij}]$ and between the next two layers are $V = [v_{jk}]$. Several structural and tunable design parameters are involved, which we discuss in the next section.
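In terms of the quantities just defined, the layerwise conditionals take the standard Gaussian-Bernoulli RBM form. The sketch below renders them under the simplifying assumption of unit-variance Gaussian visible units; it is illustrative, not the exact parameterization used in the experiments.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h_given_w(w, U, b):
    # P(H_j = 1 | W): hidden activations of the Gaussian-Bernoulli RBM
    # with weights U = [u_ij] and hidden biases b_j.
    return sigmoid(b + w @ U)

def g_given_h(h, V, c):
    # P(G_k = 1 | H): layer-3 activations of the Bernoulli-Bernoulli RBM
    # with weights V = [v_jk] and biases c_k.
    return sigmoid(c + h @ V)

def w_given_h(h, U, a):
    # E[W | H] for unit-variance Gaussian visible units with biases a_i.
    return a + h @ U.T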
[Figure 2 (diagram): plate-notation graphical model over the N_s stations and a test position, showing the GP prior GP(W; S) = N(W; 0, K), the Gaussian likelihood φ(z, w) = exp(−||z − w||²/(2σ²)), predictor potentials ψ(·) = N(μ, κ²) over each observed variable, and two hidden DBN layers connected via η(·).]

Figure 2: Deep hybrid model. Probabilistic graphical model for weather prediction where the weather stations, denoted by S, induce a Gaussian process (GP) prior over the true values of the weather variables W. Only noisier versions (z_i) of the true values are observed at the sites and are related via φ. The forecasts given by the pre-trained predictor are related to the future observations via the potential ψ(·). The joint distribution of the true weather variables is further constrained via a deep belief network (η(·)). All potentials arise at a test site s∗, except that there is no pre-trained predictor.

3.4 Probabilistic Graphical Model

The graphical model for the proposed approach is shown in Figure 2. The matrix W is the collection of all the weather variables $w_i$ denoting the true value at each location i. The observations $z_i = \{v^x_i, v^y_i, p_i, t_i, d_i\}$ recorded at any of the sites are simply a noisier version of this true value. We use plate notation to show the observations at the $N_s$ weather stations. Each weather station has random variables for wind velocity, dew point, pressure, and temperature. Each of these random variables is constrained via (a) an individual predictor that is trained on the historical data (ψ(·)); (b) a Gaussian process prior (GP(·)) and the Gaussian likelihood function φ(·), which use a data-dependent prior to impose spatial and functional smoothness; and (c) a deep belief network that encourages solutions that respect the joint statistics observed in historical data and that are also aligned with physical laws.

Similarly, at a test site s∗, we use the GP prior and the likelihood to first interpolate the observations made at the weather stations. These interpolations are then further constrained via the deep belief network to impose the joint statistical distribution of the weather variables. Formally, we have the following distribution corresponding to the graphical model:

$$p(W, Z \mid S) \propto GP(W; S) \prod_{i \in L \cup *} \phi(w_i, z_i)\, \eta(z_i) \prod_{i \in L} \psi(z_i).$$

Here, Z is the collection of random variables representing the observations at any of the locations. The terms GP(·) and φ(·) enforce the smoothness constraints as defined by the data-dependent kernel (described in Section 3.2). The potential term η(·) arises due to the deep belief network component (Section 3.3). Finally, the term ψ(·) applies only to the weather station sites and enforces consistency with the predictions of the pre-trained regression functions. In particular, the observations are related to the output of an individual predictor via a simple Gaussian function: e.g., ψ(p) = N(μ_p, κ²), where μ_p is the individual prediction for the pressure variable.
Algorithm 1 Deep hybrid model learning.

procedure TrainWeatherModels
    ▷ Boosted decision trees for every location and variable
    for all x ∈ {v, p, t, d} do
        for all s ∈ S do
            trainData ← getTrainData(x, getHistData(s))
            param ← getBestParam(trainData)
            BstDecTree[x, s] ← TrainBDTree(param)
        end for
    end for
    ▷ GP hyperparameters for every weather variable
    for all x ∈ {v, p, t, d} do
        hyParam ← getBestHParam(x, getAllHistData())
    end for
    ▷ DBN joint model training through CD
    DBN model ← ContDivergence(getAllHistData())
end procedure

Algorithm 2 Deep hybrid model inference.

procedure ForecastWeatherVariable(x, s∗, Z)
    ▷ Prediction variable: x; test site: s∗; observations: Z
    if s∗ ∈ S then
        tmp∗ ← getBDTreePred(x, s∗, Z)  ▷ uses the corresponding BstDecTree model from Alg. 1
    else
        for all si ∈ S do
            tmpi ← getBDTreePred(x, si, Z)
            wi ← DBNinference(x, tmpi)  ▷ uses the DBN model from Alg. 1
            Append wi to w
        end for
        tmp∗ ← GPinterpolate(x, s∗, w, Z)
    end if
    w∗ ← DBNinference(x, tmp∗)
    return w∗
end procedure
4. ALGORITHMIC DETAILS

To make the Deep Hybrid Model work in practice, we need to learn several parameters pertaining to the three components and design an efficient inference procedure for testing. Here we note that, since we operate our model in batch mode, we can afford an elaborate learning procedure. Specifically, the deep belief network component has high training-time requirements. On the other hand, since our forecasts are made in real time, inference at test time needs to be extremely efficient. We now discuss the learning and inference algorithms, which achieve these objectives.

4.1 Learning

Given the historical observations at the various weather stations, we train the various components of our model in order to get the best predictive capability. We perform piecewise training of individual components, where the individual predictors, the parameters of the DBN, and the hyperparameters of the GP kernel are estimated. A simplified workflow for the training procedure is given in Algorithm 1.

In particular, we trained boosted tree-based learners using the set of short- and long-term features described previously and used the best models that were obtained for each weather station in the U.S. for a range of altitudes from 3000 feet up to 39000 feet at intervals of 3000 feet. The optimal parameters with regard to the number of leaves, the number of iterations, and the learning rate were obtained through analysis with a 10-fold cross-validation study.

Similarly, the hyperparameters of the data-driven kernel were set via 10-fold cross-validation; the final values of the kernel bandwidths were set to 150, 0.1, 0.05 and 1 for distance, wind angle, temperature and pressure, respectively, and ε was set to 0.2. Finally, the DBN component of our model was trained via a standard contrastive divergence procedure [7]. We explored the following parametric ranges: the learning rate (0.1-0.01), the number of greedy iterations of contrastive divergence (1-100), and the batch size (10-1000). The structural properties of the neural net, such as the number of hidden layers (1-3) and the number of neurons in each hidden layer (50-500), were also experimented with. The configurations yielding the best cross-validation results comprised two stacked RBMs of 50 and 150 hidden neurons, trained with a learning rate close to 0.1, a batch size of 100, and 20 greedy iterations. The limitation to this set of parameters is purely because of the high computational requirements and engineering effort involved in training deep networks; indeed, the gains could potentially be more significant if the deep belief network were trained over a richer range of parameters.
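As a point of reference, a single CD-1 update for the bottom (Gaussian-Bernoulli) RBM might look as follows. This is a minimal sketch of standard contrastive divergence [7] under the same unit-variance assumption as in Section 3.3, with an illustrative learning rate; it is not the production training code.

import numpy as np

rng = np.random.default_rng(0)

def cd1_step(W_batch, U, a, b, lr=0.1):
    # Positive phase: hidden probabilities and samples given the data.
    ph = 1.0 / (1.0 + np.exp(-(b + W_batch @ U)))
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs reconstruction of the (Gaussian) visibles,
    # using the conditional mean rather than a sample, then hidden probs.
    w_recon = a + h @ U.T
    ph2 = 1.0 / (1.0 + np.exp(-(b + w_recon @ U)))
    # CD-1 gradient estimate: data correlations minus model correlations.
    n = W_batch.shape[0]
    U += lr * (W_batch.T @ ph - w_recon.T @ ph2) / n
    a += lr * (W_batch - w_recon).mean(axis=0)
    b += lr * (ph - ph2).mean(axis=0)
    return U, a, b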
4.2 Inference

Given the trained components, we seek to determine the posterior distribution over the set of observations z∗ at the test site s∗. Exact inference in the proposed model is hard due to the presence of the potential functions η(·) induced via the deep belief network. We apply piecewise approximate inference, as illustrated in Algorithm 2. For the trivial case, when the prediction needs to be made at a weather station site, we invoke the pre-trained predictor models to provide a forecast, which is then refined using the deep belief network. If, however, we need to make a prediction at an arbitrary test site, refined estimates are first computed for all weather stations; these refined estimates are then interpolated to the test site via the Gaussian Process component. We note that, since the kernel function is data-driven, we use simple interpolated values of the weather variables at the test site in order to compute the kernel. Given the interpolated values at the test site, we then carry out a last refinement of the prediction in order to resolve the estimates with the joint statistical constraints imposed by the deep model.

At the heart of the inference scheme, we employ an iterative procedure that aligns the predicted estimates with the potential induced via the deep model. We use a variational approximation to resolve and refine the posterior distribution over the observations z. Formally, we denote the approximation of the refined posterior by $q(z_i) \sim N(\mu_i, \sigma_i)$.
Additionally, we also approximate the posterior over the latent variables, with $q(H_j) \sim \mathrm{Bern}(\gamma_j)$ and $q(G_k) \sim \mathrm{Bern}(\beta_k)$ for the two hidden layers, respectively. The following variational updates are then used to estimate the parameters of the distribution (lth component):

Layer 1 to 2: $1/\gamma_l^{t+1} = 1 + e^{-[b_l + \sum_i \mu_i^t u_{il}]}$

Layer 2 to 3: $1/\beta_l^{t+1} = 1 + e^{-[c_l + \sum_j \gamma_j^{t+1} v_{jl}]}$

Layer 3 to 2: $1/\gamma_l'^{t+1} = 1 + \dfrac{1 - \gamma_l^{t+1}}{\gamma_l^{t+1}}\, e^{-[b_l + \sum_k \beta_k^{t+1} v_{lk}]}$

Layer 2 to 1: $\mu_l^{t+1} = \left(m_l + d_l^2 a_l + d_l^2 \sum_j \gamma_j'^{t+1} u_{lj}\right)/(1 + d_l^2), \qquad \sigma_l^{t+1} = d_l/\sqrt{1 + d_l^2}$

The mean parameters $\mu_l$ are initialized to the estimates $m_l$, while γ and β are initialized randomly. The parameter $d_l$ corresponds to the variance in the initial estimates and signifies our confidence in those predictions. We set these variances via a cross-validation procedure over historical data. The derivation of the above update equations follows from the application of prior work in variational inference [1] to deep belief networks.
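A compact rendering of this fixed-point iteration is sketched below. It assumes the updates as reconstructed above, with the layer 3-to-2 step simplified to a plain downward pass (the paper's update also folds in the upward message and the confidence correction), and the iteration count is an illustrative choice.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variational_refine(m, d, a, b, c, U, V, iters=20, seed=0):
    # Mean-field refinement of initial estimates m with confidences d.
    mu = m.copy()
    gamma = np.random.default_rng(seed).random(b.shape)  # q(H_j) = Bern(gamma_j)
    for _ in range(iters):
        gamma = sigmoid(b + mu @ U)      # layer 1 -> 2
        beta = sigmoid(c + gamma @ V)    # layer 2 -> 3
        gamma = sigmoid(b + beta @ V.T)  # layer 3 -> 2 (simplified downward pass)
        mu = (m + d**2 * a + d**2 * (gamma @ U.T)) / (1.0 + d**2)  # layer 2 -> 1
    return mu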
5. EXPERIMENTAL EVALUATION

We performed a set of experiments to evaluate the proposed methodology. In the experiments, we explored four main questions. First, we compare and highlight the advantage of the spatial interpolation procedure that relies on a data-centric dynamic kernel matrix over the more commonly used static kernel matrix. Second, we seek to compare the proposed model with a baseline approach. Third, we explore the importance of modeling the joint statistics of the predictive variables via the deep belief network. Finally, we compare the wind forecast results with those of state-of-the-art systems.

The experiments were based on five years of historical data, from 2009 to present, extracted from the IGRA dataset. The data consists of balloon observations recorded at 60 locations across the continental U.S.

5.1 Interpolation in a Hybrid Field

To illustrate the efficacy of a hybrid kernel in handling long-range spatial dependencies among weather variables, we considered a cluster of stations spread across the central U.S. (states demarcated by black lines in Fig. 1) and interpolated winds via Gaussian Process regression at these stations, considering the wind measurements from the rest of the U.S. stations. Thus, each station served as an independent test point, whose value is interpolated using a GPR model and compared against the true winds shown in Fig. 1(a). Fig. 1(b) shows the interpolated wind vectors when a static kernel matrix is used. Here, an entry $K_{i,j}$ of the matrix is simply a decreasing exponential in the geographical distance between two stations i and j. In contrast, the hybrid approach, as illustrated in Fig. 1(c), captures the similarity between each pair of stations such that every entry $K_{i,j}$ of the matrix is computed dynamically at training time using the formula given in Eq. 1. The pressure, temperature, and angle θ between the wind vectors are the known values for the current time step.

Now consider the following stations: Topeka (Kansas), Omaha (Nebraska), Springfield (Missouri) and Norman (Oklahoma), referred to in Fig. 1 as stations P, A, B and C, respectively. For a static interpolation of winds at P, a higher contribution would come from B than from C, as B is geographically closer. However, the temperature and pressure conditions at C are closely aligned with those at P and end up contributing more in the hybrid approach. We observe that $K_{P,A}$ is maximum in both cases. Hence, the hybrid kernel does not ignore distance as a similarity criterion. However, in cases involving a tradeoff between distance and other weather variables, their combined contribution might alter the relative importance of a particular neighboring station, as in the aforementioned case. The quantitative gains in prediction accuracy are displayed in the RMS plots in Fig. 3(a, b).

[Figure 3 (two scatter plots of predicted versus true winds, all values in knots): static kernel RMS error 9.8274 (X direction) and 6.0501 (Y direction); hybrid kernel RMS error 4.9907 (X direction) and 3.0086 (Y direction).]

Figure 3: True versus interpolated wind plots for the static and hybrid kernels. Static interpolation shows high deviations from the true winds. Atmospheric dynamics are more effectively captured with use of a hybrid data-centric kernel.

5.2 Dynamic Prediction and Deep Learning

In another experiment, we evaluate the performance gains due to the final refinement step of the deep belief network. The percentage reductions in error for wind forecasts for three time steps into the future are shown in Table 1. We see that the DBN leads to an additional 1-2% error reduction; clearly, modeling the joint statistics of the weather variables helps in making better predictions. We observed a performance improvement of similar magnitude for the other weather variables as well.

Time step     X      Y      Overall
6 hours       2.17   2.05   2.11
12 hours      1.05   1.01   1.03
24 hours      1.05   0.97   1.01

Table 1: Improvement in performance (RMS error reduction for winds, in %) obtained using the deep belief network. The final step of refinement uses the DBN results to further improve prediction accuracy.

After establishing the superiority of the data-centric kernel and the DBN independently, we evaluate the prediction accuracy of the full deep hybrid model for each weather variable², aggregated over all stations in the continental U.S. where current and historical data is available.

²The IGRA dataset provides the geopotential height and dew point at roughly constant pressures. These quantities, under reasonable assumptions, serve as proxies for pressure and specific humidity, respectively.
Winds (RMS error in knots):
              6 hours   12 hours   24 hours
A             7.44      13.23      14.89
B             5.91      10.58      12.13
C             3.71      9.58       8.74

Dew Point (RMS error in degrees):
              6 hours   12 hours   24 hours
A             7.29      12.94      14.58
B             6.23      11.77      12.12
C             5.01      7.87       9.15

Geopotential Height (RMS error in meters):
              6 hours   12 hours   24 hours
A             64.6      171.4      129.2
B             76.76     141.96     135.11
C             16.27     39.91      25.69

Temperature (RMS error in degrees):
              6 hours   12 hours   24 hours
A             1.45      3.71       2.91
B             1.18      2.5        1.95
C             1.39      2.75       1.21

A: Baseline   B: Typical Prediction Model   C: Deep Hybrid Model

Figure 4: Results on predicting weather variables for the different approaches. The temporal predictors that employ a hybrid data-centric scheme for interpolation and use DBNs for modeling the joint relationship among weather variables show significant improvements over the baselines.

The accuracy of the proposed model is compared with two baseline models in Fig. 4 for three future time steps, as before. The baseline prediction, marked as A in Fig. 4, uses the current values as estimates for the future; the intuition in using current values to predict the future is that weather conditions typically do not change greatly over a day. For the second baseline (marked B), we construct a typical spatiotemporal prediction model, where baseline boosted decision tree predictors are augmented with a static interpolation scheme. We observe that the deep hybrid model (marked C), comprising a dynamic data-driven interpolation scheme and a DBN in addition to the boosted decision tree predictors used in B, significantly outperforms both baselines. In a couple of cases involving short-term temperature forecasts, B marginally outperformed the DHM, suggesting limited interdependencies between temperature and other variables for short-term predictions.

5.3 Comparison with the State of the Art

Apart from winds, forecasts for other variables across the vertical atmospheric profile are not available for comparative analyses. We compare the wind predictions of the proposed model against two forecast systems. The first, proposed by [9], makes predictions using a static GPR interpolation scheme, coupled with relative velocity data obtained through airplanes. Our second set of comparisons is with the Winds Aloft forecast, released by NOAA every six hours, for three time steps into the future: 6, 12 and 24 hours. Table 2 shows the accuracy of the two forecast systems for the Seattle station. The results summarize the predictions made for the weather stations in the state of Washington for a period of one month. We observe that, while Kapoor et al. 2014 achieves better performance than NOAA, the proposed method shows significantly better performance than both of the competitors.

Time Step   Model                RMS Error (in knots)
                                 X      Y      Overall
6 hours     Deep Hybrid Model    2.29   1.33   1.81
            Kapoor et al. 2014   3.94   2.16   3.05
            NOAA                 3.18   3.44   3.31
12 hours    Deep Hybrid Model    4.44   2.59   3.56
            Kapoor et al. 2014   5.03   3.93   4.48
            NOAA                 5.13   4.34   4.88
24 hours    Deep Hybrid Model    6.57   3.82   5.19
            Kapoor et al. 2014   8.93   5.24   7.08
            NOAA                 8.79   6.37   7.58

Table 2: Comparison of the proposed methodology with the state of the art in wind prediction. Results summarized here are for weather stations in Washington for a period of one month. We observe that the new model results in significantly lower errors than the competitive models.
6. CONCLUSION AND FUTURE WORK

We presented a weather forecasting model that makes predictions via considerations of the joint influence of key weather variables. We introduced a data-centric kernel and showed how using GPR with such a kernel can effectively interpolate over space, taking into account weather phenomena such as turbulence. We performed temporal analysis using short- and longer-term features within a gradient-tree based learner. We augmented the system with a deep belief network and tuned the parameters to model the dependencies among weather variables. A set of experiments on real-world data shows that the new methodology can provide better results than NOAA benchmarks, as well as recent research that had demonstrated improvements over those benchmarks.

Future work includes projecting weather predictions to more distant times in the future. We are also interested in exploring the use of computations of the value of information to guide sensing at weather stations. We note that airplanes in flight can serve as sensors of wind speeds, as explored in [9]. We wish to investigate the boosts in predictive power that might be achieved via integrating such additional data into the hybrid model.

Acknowledgments

We are grateful to Imke Durre for answering our queries concerning the IGRA dataset. The first author would like to thank Microsoft Research, Redmond for conducting the Worldwide Internship Program that made this research possible.

7. REFERENCES

[1] M. J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.
[2] L. Chen and X. Lai. Comparison between ARIMA and ANN models used in short-term wind speed forecasting. In Power and Energy Engineering Conference (APPEEC), 2011 Asia-Pacific, pages 1–4. IEEE, 2011.
[3] A. S. Cofiño, R. Cano, C. Sordo, and J. M. Gutiérrez. Bayesian networks for probabilistic weather prediction. In 15th European Conference on Artificial Intelligence (ECAI), 2002.
[4] I. Durre, R. S. Vose, and D. B. Wuertz. Overview of the Integrated Global Radiosonde Archive. Journal of Climate, 19(1):53–68, 2006.
[5] I. Durre, R. S. Vose, and D. B. Wuertz. Robust automated quality assurance of radiosonde temperatures. Journal of Applied Meteorology and Climatology, 47(8):2081–2095, 2008.
[6] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011.
[7] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[8] I. Horenko, R. Klein, S. Dolaptchiev, and C. Schütte. Automated generation of reduced stochastic weather models I: Simultaneous dimension and model reduction for time series analysis. Multiscale Modeling & Simulation, 6(4):1125–1145, 2008.
[9] A. Kapoor, Z. Horvitz, S. Laube, and E. Horvitz. Airplanes aloft as a sensor network for wind forecasting. In Proceedings of the 13th International Symposium on Information Processing in Sensor Networks (IPSN), pages 25–34. IEEE Press, 2014.
[10] V. M. Krasnopolsky and M. S. Fox-Rabinovitz. Complex hybrid models combining deterministic and machine learning components for numerical climate modeling and weather prediction. Neural Networks, 19(2):122–134, 2006.
[11] R. J. Kuligowski and A. P. Barros. Localized precipitation forecasts from a numerical weather prediction model using artificial neural networks. Weather and Forecasting, 13(4):1194–1204, 1998.
[12] G. Marchuk. Numerical Methods in Weather Prediction. Elsevier, 2012.
[13] A. McGovern, D. John Gagne, N. Troutman, R. A. Brown, J. Basara, and J. K. Williams. Using spatiotemporal relational random forests to improve our understanding of severe weather processes. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(4):407–429, 2011.
[14] A. McGovern, T. Supinie, I. Gagne, M. Collier, R. Brown, J. Basara, and J. Williams. Understanding severe weather processes through spatiotemporal relational random forests. In 2010 NASA Conference on Intelligent Data Understanding, 2010.
[15] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1647–1655, 2014.
[16] Y. Radhika and M. Shashi. Atmospheric temperature prediction using support vector machines. International Journal of Computer Theory and Engineering, 1(1):1793–8201, 2009.
[17] C. E. Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.
[18] L. F. Richardson. Weather Prediction by Numerical Process. Cambridge University Press, 2007.
[19] N. I. Sapankevych and R. Sankar. Time series prediction using support vector machines: A survey. IEEE Computational Intelligence Magazine, 4(2):24–38, 2009.
[20] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2009.
[21] C. Voyant, M. Muselli, C. Paoli, and M.-L. Nivet. Numerical weather prediction (NWP) and hybrid ARMA/ANN model to predict global radiation. Energy, 39(1):341–355, 2012.
