Deep Air Quality Forecasting Using Hybrid Deep Learning Framework

Shengdong Du, Tianrui Li, Senior Member, IEEE, Yan Yang, Member, IEEE, and Shi-Jinn Horng

Abstract—Air quality forecasting has been regarded as the key problem of air pollution early warning and control management.
In this paper, we propose a novel deep learning model for air quality (mainly PM2.5) forecasting, which learns the spatial-temporal
correlation features and interdependence of multivariate air quality related time series data by hybrid deep learning architecture.
Due to the nonlinear and dynamic characteristics of multivariate air quality time series data, the base modules of our model include
one-dimensional Convolutional Neural Networks (1D-CNNs) and Bi-directional Long Short-term Memory networks (Bi-LSTM).
The former is to extract the local trend features and spatial correlation features, and the latter is to learn spatial-temporal
dependencies. Then we design a jointly hybrid deep learning framework based on one-dimensional CNNs and Bi-LSTM for shared
representation features learning of multivariate air quality related time series data. We conduct extensive experimental evaluations
using two real-world datasets, and the results show that our model is capable of dealing with PM2.5 air pollution forecasting with
satisfactory accuracy.

Index Terms—Air quality forecasting, deep learning, convolutional neural networks, long short-term memory networks


• Shengdong Du, Tianrui Li and Yan Yang are with the School of Information Science and Technology, Southwest Jiaotong University. Emails: {sddu, trli, yyang}@[Link].
• Shi-Jinn Horng is with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, and also with the School of Computer Science, University of Technology Sydney. Email: horngsj@[Link].
• Corresponding authors: Tianrui Li and Shi-Jinn Horng.

1 INTRODUCTION

With the acceleration of industrialization and the rapid development of urbanization, the problem of urban air pollution has become more and more serious, which has badly affected our living environment and physical health. Therefore, research on air quality forecasting is very important and has always been regarded as a key issue in environmental protection. It is also an important means to guide scientific decision-making for severe air pollution warning and air pollution control. Many large cities have established air quality monitoring stations to detect the city's PM2.5 and other air pollutants in real time. Early diagnosis of air pollution occurrence and of the evolution of the PM2.5 concentration value is considered to be a key problem of the air quality forecasting task.

In recent years, some researchers have made efforts on air pollution occurrence and air quality forecasting [1][2]. However, most of these studies rely on mathematical equations or simulation techniques to describe the evolution of air pollution [3]. These traditional methods are represented by classic shallow machine learning algorithms. For example, Dong et al. presented a novel approach based on hidden semi-Markov models (HSMMs) for PM2.5 concentration value prediction [4]. Donnelly et al. proposed a model for producing real-time air quality forecasts with both high accuracy and high computational efficiency based on an integrated parametric and nonparametric regression method [5]. Because air pollution is usually affected by weather, traffic and other factors, it is difficult to represent and predict accurately with statistical methods and shallow machine learning models.

In the big data era, with the rapid development and application of the Internet of Things and sensor technology, air quality forecasting is increasingly dependent on a variety of sensors and related data acquisition equipment to collect urban air big data, e.g. PM2.5, NO2, PM10, weather condition data and traffic data. Since traditional shallow learning models still have bottlenecks in handling big data, new air quality forecasting methods need data-driven model support [7][8]. Deep learning is currently the most popular data-driven method [9], and it can extract and learn the inherent features of various air quality data automatically. Since 2012, deep learning has made great progress in research and applications of image processing, audio processing, and natural language understanding [10][11][12]. Although the air quality forecasting task usually adopts traditional shallow machine learning methods, deep learning methods for time series analysis and air quality prediction are getting more and more attention from researchers [13][14][34][36]. Since air quality forecasting is a typical multivariate time series analysis problem [35], it is useful to explore how a hybrid deep learning model can learn the various implicit features and long temporal dependencies of multivariate air quality time series data.

In this paper, we propose an end-to-end model for the air quality forecasting problem in one framework called the Deep Air Quality Forecasting Framework (DAQFF), which addresses the dynamic, spatial-temporal and nonlinear characteristics of multivariate air quality time series
data by a hybrid deep learning model. The proposed model can learn the local trend pattern and long spatial-temporal dependencies of multivariate air quality related time series data, e.g. PM2.5, wind speed and temperature. It is also shown that the proposed model DAQFF has good forecasting performance and generalization ability. Experiments indicate that our proposed method is effective in air quality prediction tasks.

The rest of the paper is organized as follows: Section 2 presents the related works. Section 3 gives an overview of the deep air quality forecasting framework, including the overall design of our model, e.g. how to expand and assemble basic deep neural network modules into our model. Section 4 describes the comparative experiments, in which the effectiveness of the proposed framework is analyzed and evaluated. We draw conclusions and give directions for future research in the last section.

2 RELATED WORKS

Air quality forecasting has a long history of study in the literature. Most of the existing works solve the problems of air quality forecasting using statistical methods and shallow machine learning models [3], including Regression [5], ARIMA [17], HMM [4], and Artificial Neural Network [16]. Zhang et al. presented a comprehensive assessment of the history, current status, major research and future directions of real-time air quality forecasting problems [1][2]. Zhou et al. proposed a probabilistic dynamic causal (PDC) model based on Lasso-Granger to uncover the dynamic temporal dependencies of PM2.5 [6]. Zhou et al. developed a hybrid model for one-day-ahead PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network method [16]. Deleawe et al. investigated the use of machine learning technologies to predict the CO2 level, which is an indicator of air quality in urban air environments [15].

In recent years, air quality forecasting based on big data analysis has become a research hotspot. Because air quality related time series data have dynamic and nonlinear characteristics, more and more researchers are trying to use data-driven models, especially in the field of urban computing [18]. A large number of air quality forecasting methods based on big data have been proposed to help air pollution warning and control [37]. Zheng et al. developed a semi-supervised learning approach for air quality forecasting which is based on a co-training framework consisting of two separate classifiers (ANN and CRF) [7]. Hsieh et al. presented a novel method which can infer the real-time and fine-grained air quality throughout a city by a semi-supervised inference model [8]. Zheng et al. also proposed a real-time air quality forecasting framework which uses data-driven models to predict fine-grained air quality [19].

More recently, deep learning has been widely applied to sequence data processing and time series problems [20][21][24]. Air quality is typical time series data. Li et al. presented a novel spatial-temporal deep learning (STDL)-based air quality prediction method which inherently considers spatial and temporal correlations [22]. Ong et al. proposed a deep recurrent neural network (DRNN) for air pollution prediction which is improved by using the auto-encoder model as a novel pre-training method [23]. Qi et al. developed a general and effective approach to solve interpolation, prediction and feature analysis in one model, which is called Deep Air Learning (DAL) [14]. Moreover, deep convolution networks can process time series features of citywide crowd big data, and Zhang et al. proposed a novel deep residual network to analyze how congestion evolves [25]. The hybrid deep learning method is based on the idea of combining various deep neural network structures and has achieved good results in many fields, e.g. face detection and video classification [26][27], but it has not yet been well applied to air quality forecasting problems.

In this paper, by comparing traditional shallow machine learning models and classic deep learning models, we propose a new end-to-end air quality forecasting framework, DAQFF, based on the hybrid deep learning method, which is motivated to address local trend features and long temporal dependency problems by utilizing the multivariate time series data and performing feature selection automatically. The proposed DAQFF can extract and learn the nonlinear spatial-temporal features of air quality related time series data under different conditions such as different weather conditions and different traffic states.

3 METHODOLOGY

3.1 Problems and Motivations

Air quality forecasting has been a key issue in early warning and control of urban air pollution. Its goal is to anticipate changes in the PM2.5 value of air pollution at observation points over time. The observation time period is usually set to one hour, which is decided by the ground-based air-quality monitoring station. Typical air pollution data, e.g. PM2.5, is shown in Fig. 1.

Fig. 1. PM2.5 values in one month (01/01/2010-01/31/2010) of Beijing air pollution data set from UCI [31].

The PM2.5 prediction problem is illustrated as follows. Given time $T$, the prediction task is to anticipate the PM2.5 concentration value $P_{i,T+1}$ at time $T+1$ or $P_{i,T+n}$ at time
$T+n$ by modeling the historical air quality related time series dataset $AQD = \{AQD_{i,t} \mid i \in O,\ t = 1, 2, \ldots, T\}$, where $AQD$ represents the historical air quality related data, $O$ denotes the set of all observation points, and $AQD$ not only includes PM2.5 itself but also includes other air quality related time series data such as pressure, temperature and wind speed. Fig. 2 shows an example.

Fig. 2. Air quality related time series data in one month (01/01/2010-01/31/2010) (including PM2.5 pollution concentration, temperature, pressure, wind speed, wind direction, snow, rain, etc.) of Beijing PM2.5 data set from UCI [31].

As shown in Fig. 2, air quality data usually contains the real-valued PM2.5 pollutant, and some other datasets also have CO2 and PM10, etc. In addition to pollutant data, air quality is closely related to meteorological observation data. For example, high wind speed will reduce the concentration of PM2.5, high humidity usually aggravates air pollution, and high atmospheric pressure usually results in good air quality [7][19]. Therefore, the above data characteristics are very important for the air quality forecasting task.

Fig. 3. The interdependences and correlations of multivariate air quality time series data (such as PM2.5, dew point, temperature, wind speed etc.).

How to process and capture the spatial-temporal features of the above air quality data items is the key point for air quality forecasting. Taking the PM2.5 data itself as an example (see Fig. 3, for one month of observation data points during 01/01/2010-01/31/2010), there is contextual information among the observation points in the PM2.5 and wind speed time series, and the historical state has some influence on the evolution of future trends. That is to say, the adjacent data points and the periodic intervals of the air quality time series data usually have a strong correlation with each other.

In addition, air quality data have sharp nonlinearities resulting from transitions from bad air pollution to good air quality and vice versa. The air quality forecasting task is challenging due to rapidly changing weather and pollutant emission conditions, and it is influenced by many factors. Moreover, these factors, such as wind speed, temperature, humidity, and the pollutants themselves, are nonlinear and dynamic (see Fig. 3). Their influences are complex and highly nonlinear, and it is hard to forecast air quality precisely for a specific time and place. Because these factors are inherently interdependent, how to deal with the interdependence and exploit it from the multivariate air quality related time series data is another key problem for air quality forecasting.

Regarding the issues above, an air quality forecasting (mainly predicting PM2.5) method based on a hybrid deep learning architecture is proposed in this paper. In general, because the statistical characteristics of air quality related time series data are different (different time series always have different representations and related structures), it is difficult to use shallow machine learning models for fusion modeling. Many researchers have studied the hybrid deep learning model, which is usually effective for improving the performance of classic deep learning models [26].

CNN is very popular for image processing and target recognition [10], and it has also been successfully applied to time series forecasting tasks [24], owing to the one-dimensional structure of a single time series and the two-dimensional structure of multivariate time series. For these reasons, research on target recognition in images can also be applied to time series modeling. Meanwhile, the recurrent neural network (RNN) model can be used for temporal representation learning of long dependency features. Because a feedback loop is created in the internal state of the RNN [12], the RNN performs better at predicting time series. LSTM is a special kind of RNN, capable of learning long-term dependencies. We use a bi-directional LSTM to process time-series information in two directions with two separate hidden layers and then feed this information to the same output layer [29][30], so that it can access both past and future contexts for each point in the time series.

3.2 Overview of the Deep Air Quality Forecasting Framework

In the following, we describe the air quality forecasting framework, DAQFF, based on the hybrid deep learning architecture. It is a combination of multiple one-dimensional CNNs and a bi-directional LSTM that takes into account the spatial-temporal dependence of air quality related time series data. Because there are correlations between the local trend features and the long dependencies of multivariate air quality time series data, the PM2.5 time series is also related to the other air quality time series, and these factors are inherently interdependent. Fig. 4 shows the graphical illustration of the deep air quality forecasting framework. As shown in Fig. 4, the overall model consists of two main components: one is the multiple convolution layers (one-dimensional CNNs) for learning local trend and spatial correlation features of time series data, and the other is the bi-directional LSTM for obtaining the long dependency temporal features from the corresponding time series data.
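
The two-component data flow described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`conv1d_relu`, `recurrent_pass`, `daqff_like_features`), the single first-difference kernel and all weights are hypothetical, and a plain tanh recurrence stands in for the LSTM cell purely to keep the example short. The sketch only shows the order of operations: one 1D convolution per station, concatenation across stations, then a forward and a backward recurrent pass whose outputs are joined per time step.

```python
import math


def conv1d_relu(series, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    return [
        max(0.0, sum(series[t + j] * kernel[j] for j in range(k)) + bias)
        for t in range(len(series) - k + 1)
    ]


def recurrent_pass(inputs, w_in, w_rec):
    """Plain tanh recurrence; a simplified stand-in for an LSTM layer."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * sum(x) + w_rec * h)
        states.append(h)
    return states


def daqff_like_features(stations, kernel, fwd=(0.5, 0.3), bwd=(0.4, 0.2)):
    """Hypothetical helper: conv per station -> concat -> bidirectional pass."""
    # 1) one 1D convolution per station extracts local trend features
    local = [conv1d_relu(s, kernel) for s in stations]
    # 2) concatenate the per-station features at each time step
    merged = [list(step) for step in zip(*local)]
    # 3) two separate recurrent passes: past-to-future and future-to-past
    forward = recurrent_pass(merged, *fwd)
    backward = recurrent_pass(merged[::-1], *bwd)[::-1]
    # 4) join both directions at each time step
    return list(zip(forward, backward))


# Toy PM2.5-like readings from two stations (made-up numbers).
stations = [[10.0, 12.0, 15.0, 14.0, 11.0], [8.0, 9.0, 13.0, 12.0, 10.0]]
features = daqff_like_features(stations, kernel=[-1.0, 1.0])
```

Each element of `features` pairs a forward state, which has only seen earlier time steps, with a backward state, which has only seen later ones. In a full model these joined states would be passed on to fusion and regression layers.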
Fig. 4. The architecture of the proposed deep air quality forecasting framework (DAQFF). The proposed model works on air quality forecasting using hierarchical feature representation learning and multi-scale spatial-temporal dependency feature fusion learning.

To exploit spatial-temporal dependency features of different air quality related time series data (see the lower right corner of Fig. 4), the first step is to train multiple one-dimensional CNNs to extract local trend features and possible spatial correlation features of the time series data of multiple stations. Unlike traditional image processing methods (which are fed with two-dimensional image pixels), the inputs to our DAQFF model are multiple one-dimensional time series. We employ an improved CNN model, which can compress the length of air quality time series. Rather than learning the features of each single time series separately, we learn all the time series data of each observation point of multiple stations.

Then, the features extracted by the one-dimensional CNNs (including the local trend features of each station's data and the possible spatial correlation features of the data of multiple stations) are concatenated and fed into bi-directional LSTMs. These LSTMs learn spatial-temporal dependency features from both past and future contexts by processing the time series in forward and backward directions simultaneously.

Given an air quality time series dataset $I_i$ of a station ($i$ denotes the station number), the process to learn the spatial-temporal dependency features of the data of multiple stations can be represented as follows:

$$\text{1D-CNN}(I_i) \rightarrow L_i \tag{1}$$
$$\mathrm{Concatenate}(L_1, \ldots, L_i, \ldots, L_n) \rightarrow LC_t \tag{2}$$
$$\mathrm{BiLSTM}(LC_t) \rightarrow S_t, T_t \tag{3}$$
$$\mathrm{Concat}(S_t, T_t) \rightarrow O_t \tag{4}$$

where $L_i$ denotes the local trend features of the single station time series data $I_i$, and $LC_t$ denotes the concatenated local trend features of all stations together with the hidden spatial correlation features between all stations. These spatial correlation features and local trend features are concatenated and learned by the Bi-LSTM model automatically. Note that $S_t$ and $T_t$ denote the spatial and temporal dependency features, respectively, which are extracted from the data of multiple stations, and $\mathrm{Concat}(S_t, T_t)$ represents the feature level fusion result. $O_t$ is the shared representation between $S_t$ and $T_t$. Next, we use a fusion layer to concatenate all the spatial-temporal shared features among different time series data together. The model is formulated as follows:

$$F\big((O_{t-l}, O_{t-l+1}, \ldots, O_t), W^{i}, b^{i}\big) \rightarrow M_{\pi}, \quad i = 1, 2, \ldots, n \tag{5}$$

where $M_{\pi}$ denotes the joint fusion representation for the different learned spatial-temporal dependency features which are extracted from multiple air quality time series data. $W^{i}$ and $b^{i}$ are weights and biases, respectively, which are learned by the fusion model with all training datasets, where $i$ indicates the $i$-th time window of the input time series data and $l$ indicates the time window size (also called lookup size). The training objective function of the DAQFF model is as follows:

$$\underset{\boldsymbol{\theta}}{\arg\min}\ C_i = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \lVert \hat{y}_{i}^{j} - y_{i}^{j} \rVert^{2} \tag{6}$$

The final model training problem is to minimize the overall error $C_i$ of the training samples for each time window time series, where $i$ indexes the time window time series inputs ($i = 1, 2, \ldots, n$), $j$ indexes the input samples of a time window ($j = 1, 2, \ldots, m$) and $\boldsymbol{\theta}$ is the parameter space including $W_l^i$ and $b_l^i$ of each layer.

Based on the above process, one-dimensional CNNs are used to extract the local trend and spatial correlation features. Bi-LSTM is used to capture and learn the spatial-temporal dependency features of the sequence and obtain the correlation pattern of the time series context. Then we fuse these learned spatial-temporal dependency features by concatenating layers. Finally, we input these joint fusion features into the linear regression layer for final prediction. In this way, DAQFF combines multiple one-dimensional CNNs and a bi-directional LSTM in one end-to-end deep learning architecture, which can simultaneously extract the local trend features and the spatial-temporal dependency features of air quality related multivariate time series data.

3.3 Multiple 1D-CNNs for local trend and spatial features learning

CNN not only has excellent performance in image processing [10], but can also be effectively applied to time series data mining. A typical CNN has three layers: convolutional layer, activation layer, and pooling layer. Unlike the classical CNN model (the traditional two-dimensional CNN used for images), we propose to use multiple one-dimensional filters (1D-CNNs) convolved over all time steps of the air quality time series data. The computing processes of the 1D-CNN layers are formulated as below:

$$c_j^{l} = \sum_{i} x_i^{l-1} * w_{ij}^{l} + b_j^{l} \tag{7}$$
$$x_j^{l} = \mathrm{ReLU}(c_j^{l}) \tag{8}$$
$$x_j^{l} = \mathrm{Flatten}(x_j^{l}) \tag{9}$$
$$x_k^{l+1} = \mathrm{FC}(w_{kj}^{l+1} x_j^{l} + b_k^{l+1}) \tag{10}$$

Note that Eq. (7) and Eq. (8) model the convolutional layer learning process, where $*$ denotes the convolution operator, and $w_{ij}^{l}$ and $b_j^{l}$ are the filters and biases, respectively. We use ReLU as the activation function. $x_i^{l-1}$ and $c_j^{l}$ represent the input and output vectors of a convolution layer, respectively. Here $l$ represents the involved layer. We use three convolution layers for local trend feature learning. Each layer learns a nonlinear representation from the previous layer, and the learned representation is then fed into the next layer to form hierarchical feature representations. After the three convolution layers, we use a flatten layer to transform the high-level representation into a feature vector and use a fully connected layer to reduce the dimension of the final output vector.

As introduced above, the multi-station input air quality time series data are processed using multiple 1D-CNNs, and are flattened into the fully connected layer. Then the final output is given by a concatenated layer, which not only captures the local trend features of single station time series data (as a one-dimensional filter is used in each convolutional layer, the local trend changes of the time series over time can be captured), but also integrates the possible spatial correlation features of multiple stations.

Moreover, the local perception and weight sharing properties of the one-dimensional CNN can reduce the number of parameters for processing multivariate time series data, thereby improving learning efficiency. Thus, our method can learn deeper representation features of air quality related data.

3.4 Bi-LSTM for long temporal dependencies learning

Although traditional statistical methods like ARIMA and shallow learning models can, similar to deep neural networks, process time series, their performance is limited because they do not take into account the long-term temporal dependence of time series data. To overcome this shortcoming, the Long Short-term Memory network (LSTM) is a good option [32]; it is a popular dynamic model for handling sequence tasks.

As shown in the upper right corner of Fig. 4, the LSTM Cell Block represents a typical LSTM diagram [33]. The memory cell of each LSTM block contains four main components. The collaboration of these components enables cells to learn and memorize long dependency features. The typical LSTM block computing process is as follows:

$$i_t = \sigma\big(U^{(i)} x_t + W^{(i)} h_{t-1} + b_i\big) \tag{11}$$
$$f_t = \sigma\big(U^{(f)} x_t + W^{(f)} h_{t-1} + b_f\big) \tag{12}$$
$$o_t = \sigma\big(U^{(o)} x_t + W^{(o)} h_{t-1} + b_o\big) \tag{13}$$
$$\tilde{s}_t = \tanh\big(U^{(c)} x_t + W^{(c)} h_{t-1} + b_c\big) \tag{14}$$
$$s_t = f_t \circ s_{t-1} + i_t \circ \tilde{s}_t \tag{15}$$
$$h_t = o_t \circ \tanh(s_t) \tag{16}$$

As shown in the above formulas, $i_t$ represents the input gate, which decides what new information enters the memory cell. $f_t$ represents the forget gate, which decides how much information should be discarded. $o_t$ indicates the output gate, which decides the amount of information that should be transferred to the next time step or to the output. $\tilde{s}_t$ is a neuron with a self-recurrent cell like an RNN. $s_t$ is the internal memory cell of the LSTM block, which is the sum of two parts. The first part is calculated by the previous internal
memory state $s_{t-1}$ and the forget gate $f_t$. The second part is calculated by element-wise multiplication of the self-recurrent state $\tilde{s}_t$ and the input gate $i_t$. $h_t$ is the hidden state of the LSTM block.

One disadvantage of traditional LSTMs is that they can only utilize the previous context of sequence data. A bi-directional LSTM can process the time series data in two directions simultaneously through two independent hidden layers [30], whose outputs are concatenated and fed forward to the output layer. In other words, the bi-directional LSTM processes the time series data in two directions iteratively (forward layer from t = 1 to T, backward layer from t = T to 1).

$$\overrightarrow{i}_t = \sigma\big(\overrightarrow{U}^{(i)} \overrightarrow{x}_t + \overrightarrow{W}^{(i)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_i\big) \tag{17}$$
$$\overrightarrow{f}_t = \sigma\big(\overrightarrow{U}^{(f)} \overrightarrow{x}_t + \overrightarrow{W}^{(f)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_f\big) \tag{18}$$
$$\overrightarrow{o}_t = \sigma\big(\overrightarrow{U}^{(o)} \overrightarrow{x}_t + \overrightarrow{W}^{(o)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_o\big) \tag{19}$$
$$\overrightarrow{\tilde{s}}_t = \tanh\big(\overrightarrow{U}^{(c)} \overrightarrow{x}_t + \overrightarrow{W}^{(c)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_c\big) \tag{20}$$
$$\overrightarrow{s}_t = \overrightarrow{f}_t \circ \overrightarrow{s}_{t-1} + \overrightarrow{i}_t \circ \overrightarrow{\tilde{s}}_t \tag{21}$$
$$\overrightarrow{h}_t = \overrightarrow{o}_t \circ \tanh(\overrightarrow{s}_t) \tag{22}$$

$$\overleftarrow{i}_t = \sigma\big(\overleftarrow{U}^{(i)} \overleftarrow{x}_t + \overleftarrow{W}^{(i)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_i\big) \tag{23}$$
$$\overleftarrow{f}_t = \sigma\big(\overleftarrow{U}^{(f)} \overleftarrow{x}_t + \overleftarrow{W}^{(f)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_f\big) \tag{24}$$
$$\overleftarrow{o}_t = \sigma\big(\overleftarrow{U}^{(o)} \overleftarrow{x}_t + \overleftarrow{W}^{(o)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_o\big) \tag{25}$$
$$\overleftarrow{\tilde{s}}_t = \tanh\big(\overleftarrow{U}^{(c)} \overleftarrow{x}_t + \overleftarrow{W}^{(c)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_c\big) \tag{26}$$
$$\overleftarrow{s}_t = \overleftarrow{f}_t \circ \overleftarrow{s}_{t-1} + \overleftarrow{i}_t \circ \overleftarrow{\tilde{s}}_t \tag{27}$$
$$\overleftarrow{h}_t = \overleftarrow{o}_t \circ \tanh(\overleftarrow{s}_t) \tag{28}$$

$$h_t = \overrightarrow{h}_t \circ \overleftarrow{h}_t \tag{29}$$

The above equations show the layer functions of Bi-LSTM, and the two arrow directions denote the forward and backward processes, respectively. $h_t$ represents the final hidden element of Bi-LSTM, which is the concatenated vector of the forward output $\overrightarrow{h}_t$ and the backward output $\overleftarrow{h}_t$ (so $\circ$ in Eq. (29) denotes concatenation rather than the element-wise multiplication used in the gate equations). Through the above process, Bi-LSTM can learn both past and future features of time series data, and the predictive output is generated from past and future contexts.

4 EXPERIMENTS

In this section, we use two real air quality data sets to conduct experiments to analyze and evaluate the performance of the proposed model. Through the comparison of classical shallow learning models, baseline deep learning models and our model DAQFF, the forecasting performance and effectiveness of the proposed model are validated.

4.1 Datasets

Our experiment uses two real-world air quality datasets. The first one is the Beijing air quality dataset from UCI [31], which includes meteorological data and PM2.5 pollution data. The dataset is collected every hour and is sourced from the data interface released by the US Embassy in Beijing [28]. As an experimental air quality UCI data set, it contains different attributes such as date, time, temperature, humidity, wind speed, wind direction, and PM2.5 values. The second dataset is the Urban Air Quality Dataset collected in the Urban Air project of Microsoft Research [19]. The details of the two experimental data sets are listed in Table 1.

TABLE 1
EXPERIMENTS DATASETS DESCRIPTION

Dataset           Beijing PM2.5 Dataset [31]    Urban Air Quality Dataset [19]
Data type         multivariable time series     multivariable time series
Intervals         60 minutes                    60 minutes
Location          Beijing                       Beijing
Time span         01/01/2010-12/31/2014         05/01/2014-04/30/2015
Variable number   8                             14
Used records      43,824                        278,023
Station number    1 station                     36 stations

Beijing PM2.5 Dataset: This hourly dataset contains the PM2.5 data of the US Embassy in Beijing. Meteorological data are also included. Data items include PM2.5 concentration, dew point, temperature, pressure, combined wind direction, cumulated wind speed (m/s), cumulated hours of snow, and cumulated hours of rain. The dataset used for experiments ranges from 01/01/2010 to 12/31/2014 and has 43,824 records.

Urban Air Quality Dataset: This hourly dataset is comprised of six parts of data over a period of one year (from 05/01/2014 to 04/30/2015), and has been used in [7][19] to infer the fine-grained air quality of current and future times. We select the data from Beijing as the experimental dataset, which contains a total of 278,023 records from 36 monitoring stations, where the data items include PM2.5, PM10, NO2, CO, O3, SO2, weather, temperature, humidity, pressure, wind speed and wind direction, etc.

4.2 Experimental Setup

This section describes the hardware and software environment of the experiment and the configuration of relevant parameters. The open source deep learning library Keras, which is based on TensorFlow, is used to build the baseline deep learning models and the DAQFF model, and Scikit-learn is used to build the shallow learning models. All experiments are conducted on a PC server configured with an Intel(R) Xeon(R) CPU E5-2623 3.00GHz, 4 NVIDIA Tesla K80C GPUs with 12 GB each, and 128 GB of memory.

Our framework is compared with two classic shallow learning models and five baseline deep learning models. They are summarized as follows.

ARIMA is one of the most common traditional statistical methods in time series prediction.

SVR (Support Vector Regression) is a kernel method of machine learning which can also be used for time series forecasting. The kernel-based SVR makes it possible to learn nonlinear trends of the training dataset. There are three SVR models with different kernels (RBF, poly and linear).

RNN (Recurrent Neural Network) is a popular deep learning method for handling sequence tasks. GRU (Gated Recurrent Units) and LSTM (Long Short-term Memory) are the most popular variants of RNN. CNN (Convolutional Neural Network) is widely used in image processing,
but one-dimensional CNN can also be used for time series prediction.

The most critical task of deep learning applications is setting hyper-parameters and optimizing them. In order to effectively model a deep neural network, a large number of hyper-parameters need to be set. In the experiments, the default parameters in Keras are used for deep neural network initialization (e.g., weight initialization). To avoid over-fitting of the deep learning models, we apply several methods, such as a dropout policy with probability 0.3, which is used widely between layers (including convolutional layers, recurrent layers, and dense layers). The default training parameters are: the batch size is 32, the number of epochs is 100, and the lookup size is 1. We also use tanh as the activation function of the RNN models (including GRU and LSTM) and ReLU as the activation function of the CNN layers. In addition, we use Adam as the optimizer. By default, the baseline model's network structure uses one hidden layer, and the number of neurons of each hidden layer is set to 128.

We use three convolution layers for local trend feature learning. Each layer has different settings for the number of filters and the kernel size, namely (64, 5), (32, 3), (16, 1). We use ReLU as the activation function. The bidirectional LSTM layer is equipped with 128 hidden neurons for temporal feature learning. We use mean square error (MSE) as the loss function of our DAQFF model. The activation function of the output layer is a linear function, which is also used for final prediction. Moreover, we apply the min-max function to normalize the air quality time series data to [0,1]. Missing features in the experimental data are filled using the average value of the column in which they are located.

Additionally, for the Beijing PM2.5 Dataset experiment, we select the first four years of data for training and validation (three years of data for training, and the remaining year of data for validation) and select the last year of data for testing

is superior to other baseline methods in terms of PM2.5 single-step forward prediction performance on both datasets. Compared to the baseline shallow and deep learning models, our model reduces RMSE to 8.20 and MAE to 6.19 on the Beijing PM2.5 Dataset, and also has the lowest error on the Urban Air Quality Dataset, which clearly improves the forecasting accuracy. In addition, the model errors of the classic deep learning models (such as LSTM, CNN, and GRU) are similar and also lower than those of the shallow models. This means that deep learning models are more effective for air quality time series forecasting than traditional shallow learning models in the single-step prediction task.

TABLE 2
THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE SINGLE-STEP PM2.5 PREDICTION TASK.

              Beijing PM2.5 Dataset    Urban Air Quality Dataset
Models        RMSE      MAE            RMSE      MAE
SVR-POLY      42.61     31.82          56.35     47.20
SVR-RBF       41.86     34.93          50.51     42.26
SVR-LINEAR    30.60     20.47          29.23     18.82
ARIMA         24.52     12.50          27.92     14.35
LSTM          13.03      9.29          24.96     22.31
GRU           11.75      8.71          23.70     21.43
CNN           12.21      9.09          20.95     16.36
RNN           10.61      8.83          13.79     11.62
DAQFF          8.20      6.19          11.81      9.96

Note: the forward-step prediction size is 1, and the model testing errors (RMSE and MAE) are the prediction errors for the next hour (h1).

Moreover, the prediction performance of the baseline deep learning methods is dramatically better than that of the classic shallow learning methods such as SVR and ARIMA (1 to 2 times the gap). Our model performs the best since DAQFF
(01/01/2014-12/31/2014). For the Urban Air Quality Da- can learn local trend features by one-dimensional CNN
taset experiment, we select the first eight-month data for and long-term dependencies feature by Bi-directional
training (05/01/2014-12/31/2014) and select the last four- LSTM of air quality multivariate time series data.
month data for testing (01/01/2015-04/30/2015). We use
RMSE and MAE as the model error evaluation indicators,
which are used to analyze the experimental results.

𝑛𝑛
1
𝑅𝑅𝑅𝑅𝑅𝑅𝑅𝑅 = � �(𝑦𝑦𝑖𝑖 − 𝑦𝑦�)
𝚤𝚤
2 (30)
𝑛𝑛
𝑖𝑖=1
𝑛𝑛
1
𝑀𝑀𝑀𝑀𝑀𝑀 = � | 𝑦𝑦𝑖𝑖 − 𝑦𝑦�𝚤𝚤 | (31)
𝑛𝑛
𝑖𝑖=1
where 𝑦𝑦�𝚤𝚤 represents the predicted PM2.5 value, 𝑦𝑦𝑖𝑖 repre-
sents ground truth value and 𝑛𝑛 is the number of test da-
taset.

4.3 Single Step Forecasting Results Analysis


The single step PM2.5 prediction quantitative results of Fig. 5. RMSE of the proposed DAQFF model versus different lookup
two real-world datasets are reported in Table 2, which give size and comparisons with baseline deep learning models in the ex-
periment on Beijing PM2.5 Dataset. Hyper-parameters settings are:
RMSE and MAE comparative analysis of ARIMA, SVR (rbf, prediction size is 1 step, batch size is 64, the epoch is 30.
linear and poly kernel), RNN, CNN, LSTM, GRU, and our
proposed model DAQFF. As shown in Table 2, our model In addition, it is found by experiments that the choice of
lookup size (also called window size, which represents the historical observation input size of the model) has an influence on single-step forecasting performance. We analyze the impact of lookup size on the baseline deep learning models and our model DAQFF on the Beijing PM2.5 Dataset. As Fig. 5 shows, compared to the baseline deep models, DAQFF has the lowest prediction error across different lookup sizes. As the lookup size increases, the errors of these models first decrease gradually. When the lookup size is around 9, the RMSE of DAQFF reaches its minimum. As the lookup size continues to increase, the model error remains stable or gradually increases, which may be caused by overfitting.

Fig. 6. RMSE of the proposed DAQFF model versus different epochs and comparisons with other baseline deep learning models in the experiment on Beijing PM2.5 Dataset. Hyper-parameter settings are: lookup and prediction size is 1, batch size is 64.

Then, we investigate the impact of epochs on different deep learning models. Fig. 6 shows the model error (RMSE) curve of the proposed DAQFF model versus different epochs, compared with the other baseline deep learning models, in the experiment on the Beijing PM2.5 Dataset. It is obvious that DAQFF always maintains higher performance than the other baseline deep models across different epochs. In addition, as the number of epochs increases, the prediction error of the deep learning models first gradually decreases. The RMSE of DAQFF reaches its lowest value when the number of epochs is about 70 and then gradually grows as the number of epochs continues to increase. Clearly, the generalization capability does not improve noticeably when the number of epochs is larger than 70. Moreover, all models seem to overfit slightly when the number of epochs exceeds 90. In other words, the more epochs, the more computational resources are consumed. Although an increase in iterations can improve the training performance of the model, it can also cause overfitting.

In order to further evaluate the single-step forecasting performance of DAQFF and the baseline models on the two real-world datasets, we analyze the PM2.5 prediction ability of DAQFF and two other baseline models over the course of one month (31 days, 24 hours a day, 744 observed PM2.5 data points in total). Figs. 7 (a) (b) (c) give a comparison of the ground truth (expected) and predicted one-step forward PM2.5 values of the SVR, LSTM, and DAQFF models in the experiment on the Beijing PM2.5 Dataset. As shown in the figures, our model performs better than the SVR and LSTM models in single-step forward prediction, especially around the peaks and troughs of the air quality time series. Figs. 8 (a) (b) (c) show a comparison of the ground truth (expected) and predicted one-step forward PM2.5 values of the SVR, LSTM, and DAQFF models in the experiment on the Urban Air Quality Dataset. Similarly, as shown in the figures, the single-step forward prediction performance of our model is also better than that of the SVR and LSTM models, at both peaks and troughs.

In addition, we also observe that the single-step prediction performance of the baseline models is sensitive to the dataset. For example, the SVR-RBF model has better forecasting performance on the Beijing PM2.5 Dataset (see Fig. 7 (a)) than on the Urban Air Quality Dataset (see Fig. 8 (a)). The single-step forecasting performance of the LSTM model is similar to that of the SVR-RBF model, and the prediction performance of LSTM on the Urban Air Quality Dataset (see Fig. 8 (b)) is worse than on the Beijing PM2.5 Dataset (see Fig. 7 (b)). In contrast, our model maintains the best single-step prediction performance on both datasets.

In summary, for single-step prediction of air quality time series under different experimental conditions, our model maintains the best performance. The baseline deep learning models also perform reasonably well, because single-step prediction of time series is relatively simple and often only needs to follow the trend of the previous step to achieve good forecasting performance. However, multi-step time series forecasting is not that simple, and it is often difficult to foresee what happens after multiple time steps. In the next section, we will analyze the performance of the multi-step forecasting models on air quality time series data.
Fig. 7. In the experiment on Beijing PM2.5 Dataset, a comparison of single step ground truth and predicted PM2.5 value during one month
(01/01/2014-01/31/2014) of different models (SVR, LSTM, and DAQFF). (a) SVR with RBF kernel model; (b) LSTM model; (c) DAQFF model.

Fig. 8. In the experiment on Urban Air Quality Dataset, a comparison of single step ground truth and predicted PM2.5 value during one month (01/01/2015-01/31/2015) at station 1001 for different models (SVR, LSTM, and DAQFF). (a) SVR with RBF kernel model; (b) LSTM model; (c) DAQFF model.

4.4 Multi-step Forecasting Results Analysis

The quantitative multi-step PM2.5 prediction results on the two real-world datasets are reported in Table 3 (the model testing error in the table is the average of the prediction errors over the next 6 hours, h1~h6), which gives a comparative RMSE and MAE analysis of SVR (RBF, linear, and polynomial kernels), RNN, CNN, LSTM, GRU, and our proposed model DAQFF. As shown in Table 3, our model is also superior to the other methods in multi-step PM2.5 prediction. Compared to the baseline shallow and deep learning models, our model reduces MAE to 27.53 on the Beijing PM2.5 Dataset and also has the lowest MAE, 25.01, on the Urban Air Quality Dataset, a clear improvement in forecasting accuracy. It is worth noting that the testing errors of the classic deep learning models (RNN, CNN, LSTM, and GRU) are similar to one another and larger than that of the SVR-LINEAR model on the Beijing PM2.5 Dataset. Does this mean that the multi-step forecasting performance of the baseline deep learning models is worse than that of some shallow models (such as SVR-LINEAR)? In fact, this is not entirely true. As shown in Table 4, for longer prediction horizons the baseline deep learning models outperform the SVR-LINEAR model as the prediction time step increases. Taking the LSTM model as an example, in the next 3 hours (h1~h3), the average prediction error of the LSTM model is larger than that of the SVR-LINEAR model. However, when the forward prediction size is greater than 3, e.g., in the h4~h6, h7~h12, or h13~h24 time period, the average prediction error of the LSTM model is lower than that of the SVR-LINEAR model. The DAQFF model does not have this problem, since its performance is better than that of the baseline models for both short and long time step prediction.

TABLE 3
THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE MULTI-STEP PM2.5 PREDICTION TASK.

              Beijing PM2.5 Dataset   Urban Air Quality Dataset
Models        RMSE        MAE         RMSE        MAE
SVR-POLY      56.62       44.94       64.02       50.82
SVR-RBF       57.66       46.32       65.11       53.59
SVR-LINEAR    49.82       36.82       53.48       36.35
LSTM          57.49       44.12       58.25       44.28
GRU           52.61       38.99       60.76       45.53
RNN           57.38       44.69       60.71       46.16
CNN           52.85       39.68       53.38       38.21
DAQFF         43.49       27.53       46.49       25.01
Note: forward multi-step prediction size is 6, and the model testing errors (RMSE and MAE) are the average of the prediction errors over the next 6 hours (h1~h6).

Next, we analyze the impact of forward prediction size on the baseline deep learning models and DAQFF. As shown in Table 4, on the Beijing PM2.5 Dataset, the performance of multi-step forward PM2.5 prediction is significantly lower than that of single-step forward prediction (see Table 2). As the forward prediction size increases, the forecasting performance of these models gradually decreases. Nevertheless, compared to the baseline methods, our DAQFF model still has the lowest prediction error (RMSE and MAE) across different forward prediction sizes.
TABLE 4
IN THE EXPERIMENT ON BEIJING PM2.5 DATASET, THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE MULTI-STEP PREDICTION OF PM2.5 VALUES IN THE NEXT 24 HOURS.

              RMSE                                MAE
Models        1h~3h    4h~6h    7h~12h   13h~24h  1h~3h    4h~6h    7h~12h   13h~24h
SVR-POLY      48.99    64.26    75.70    84.91    39.14    50.74    59.71    66.92
SVR-RBF       51.15    64.18    75.46    84.92    41.76    50.88    59.57    67.02
SVR-LINEAR    38.69    60.96    76.24    85.60    27.31    46.32    60.10    67.04
RNN           49.50    65.27    77.06    80.56    38.89    50.49    59.88    59.18
CNN           45.95    59.76    70.83    79.18    35.62    43.74    51.21    57.91
LSTM          45.88    57.51    69.52    79.15    35.31    40.32    48.76    57.17
GRU           47.32    57.90    69.57    79.23    37.01    40.98    48.77    57.25
DAQFF         34.35    52.64    66.14    77.38    20.91    34.15    44.93    54.58
Note: Hyper-parameters settings are: forward multi-step prediction size is 24, lookup size is 1, epochs is 100, batch-size is 64,
and testing error (RMSE, MAE) of t~t+n is the average of the prediction error in the next forward n hours.
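
The note above says each column reports the average prediction error over a range of forecast hours. One plausible reading, sketched here, computes RMSE and MAE (as defined in Eqs. 30 and 31) separately for each forecast hour and then averages them within a bucket such as 1h~3h. The function names, the row-per-sample data layout, and the toy numbers are assumptions made for illustration; the paper does not state whether errors are averaged per hour or pooled across hours.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def bucket_error(y_true, y_pred, first_h, last_h, metric):
    """Average a per-hour metric over forecast hours first_h..last_h (1-based).

    y_true and y_pred hold one row per test sample, where row[h - 1]
    is the value h hours ahead (an assumed layout).
    """
    per_hour = []
    for h in range(first_h, last_h + 1):
        truth_h = [row[h - 1] for row in y_true]
        pred_h = [row[h - 1] for row in y_pred]
        per_hour.append(metric(truth_h, pred_h))
    return sum(per_hour) / len(per_hour)


# Toy data: 2 test samples, 3 forecast hours each (made-up values).
y_true = [[10.0, 20.0, 30.0], [20.0, 30.0, 40.0]]
y_pred = [[12.0, 20.0, 33.0], [18.0, 30.0, 37.0]]

rmse_1_3 = bucket_error(y_true, y_pred, 1, 3, rmse)
mae_1_3 = bucket_error(y_true, y_pred, 1, 3, mae)
```

With real forecasts, calling `bucket_error` with the hour ranges 1 to 3, 4 to 6, 7 to 12, and 13 to 24 would give one value per column of the table, under the averaging assumption stated above.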

Fig. 9. In the experiment on Beijing PM2.5 Dataset, a comparison of multi-step (next t1, t3, and t6) ground truth and predicted PM2.5 value during one month (01/01/2014-01/31/2014) of different models (SVR, LSTM and DAQFF). (a) SVR-LINEAR model for next 1 hour (t1) prediction; (b) SVR-LINEAR model for the third hour of the future (t3) prediction; (c) SVR-LINEAR model for the 6th hour in the future (t6) prediction; (d) LSTM model for next 1 hour (t1) prediction; (e) LSTM model for the third hour of the future (t3) prediction; (f) LSTM model for the 6th hour in the future (t6) prediction; (g) DAQFF model for next 1 hour (t1) prediction; (h) DAQFF model for the third hour of the future (t3) prediction; (i) DAQFF model for the 6th hour in the future (t6) prediction.
TABLE 5
IN THE EXPERIMENT ON URBAN AIR QUALITY DATASET, THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE MULTI-STEP PREDICTION OF PM2.5 VALUES IN THE NEXT 48 HOURS.

              RMSE                                MAE
Models        1h~6h    7h~12h   13h~24h  25h~48h  1h~6h    7h~12h   13h~24h  25h~48h
SVR-POLY      64.02    83.72    105.73   109.98   50.82    64.01    83.96    86.83
SVR-RBF       65.11    83.96    88.81    90.38    53.59    67.52    70.32    74.23
SVR-LINEAR    53.48    83.85    91.06    93.11    36.35    68.01    73.72    72.73
RNN           60.71    86.25    100.54   115.13   46.16    67.94    81.97    99.45
GRU           60.76    95.59    107.82   111.06   45.53    72.81    82.45    90.25
LSTM          58.25    88.52    96.61    103.21   44.28    69.39    76.40    84.65
CNN           53.38    83.48    94.85    94.53    38.21    61.97    71.94    77.92
DAQFF         46.49    69.15    77.88    80.06    25.01    48.37    59.69    61.75
Note: Hyper-parameters settings are: forward multi-step prediction size is 48, lookup size is 1, epochs is 100, batch-size is 64,
and testing error (RMSE, MAE) of t~t+n is the average of the prediction error in the next forward n hours.
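
The lookup size and forward multi-step prediction size listed in the table notes determine how a series is cut into supervised training samples. The sketch below first fills missing values with the column mean and applies min-max scaling to [0, 1], both of which the experimental setup describes, and then builds (input window, target window) pairs. It handles a single univariate list for brevity; the helper names and the use of `None` to mark missing values are assumptions made for illustration, not the authors' code.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Scale values linearly to [0, 1]; a constant series maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def make_windows(series, lookup, horizon):
    """Cut a 1-D series into (inputs, targets) pairs.

    inputs  = `lookup` consecutive past observations (the lookup size)
    targets = the next `horizon` observations (the forward prediction size)
    """
    samples = []
    last_start = len(series) - lookup - horizon
    for start in range(last_start + 1):
        x = series[start:start + lookup]
        y = series[start + lookup:start + lookup + horizon]
        samples.append((x, y))
    return samples


# Toy PM2.5 series with one missing reading (made-up values).
raw = [10.0, None, 30.0, 50.0, 20.0]
filled = fill_missing_with_mean(raw)
scaled = min_max_scale(filled)
pairs = make_windows(scaled, lookup=1, horizon=2)
```

In a real experiment the scaling bounds would normally be computed on the training split only and reused for the test split; the sketch scales the whole toy series at once to stay short.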

Fig. 10. In the experiment on Urban Air Quality Dataset, forecast the PM2.5 value of one station (no.1001) based on the trained model of 36 stations data. A comparison of multi-step (t6) ground truth and predicted PM2.5 value during two months (01/01/2015-02/28/2015) of SVR-LINEAR, LSTM and DAQFF model. (a) SVR-LINEAR for the next 6th hour (t6) prediction; (b) LSTM for the next 6th hour (t6) prediction; (c) DAQFF for the next 6th hour (t6) prediction.
In order to further analyze and compare the forecasting performance of DAQFF and the other baseline models, we analyze the multi-step forecasting ability of our model in the experiment on the Beijing PM2.5 Dataset under different forward prediction time steps (t1, t3, and t6) over one month of test data (24 hours per day, 31 days, 744 time-step points in total, 01/01/2014-01/31/2014). Figs. 9 (a)-(i) give a comparison of the ground truth and the predicted PM2.5 values of different models (SVR-LINEAR, LSTM, and DAQFF) under different prediction sizes (1, 3, and 6 time steps), where the x-coordinate indicates the observation time steps and the y-coordinate indicates the PM2.5 value. As shown in these figures, the multi-step forecasting performance of our model is better than that of shallow models, represented by SVR-LINEAR, and deep learning models, represented by LSTM, under both short and long prediction sizes, especially around the peaks and troughs of the test data. Moreover, as the prediction time steps grow, the predictive performance of the SVR models decreases dramatically, while our DAQFF model maintains the best performance.

Moreover, we analyze whether the DAQFF model can maintain the same forecasting ability on different air quality datasets. In the experiment on the Urban Air Quality Dataset, we further verify the forecasting performance of the DAQFF model. As Table 5 shows, compared with the other models, the forecasting performance of the SVR-POLY, RNN, and GRU models fluctuates widely in long-term time step prediction (e.g., h13~h24, h25~h48). In addition, what is interesting about the data in Table 5 is that as the prediction time step grows, our model maintains a significant improvement over the baseline models compared with the data in Table 4. In short, our model maintains the best prediction performance under both short-term and long-term time step forecasting conditions.

Figs. 10 (a) (b) (c) give a zoomed-in comparison of the ground truth (expected) and six-step forward predicted PM2.5 values of the SVR-LINEAR, LSTM, and DAQFF models. Comparing these three figures shows that the multi-step forecasting performance of a classic deep learning model like LSTM is better than that of the shallow model SVR-LINEAR, and that the SVR-LINEAR model cannot effectively predict PM2.5 values at the troughs and peaks of the air quality time series when the prediction time step is 6. Although the prediction performance of LSTM is reasonable, it is not accurate enough to predict the trough and peak values of the PM2.5 time series. DAQFF has better performance at all observation points and can effectively predict PM2.5 values under different conditions (e.g., in the case of missing values, as shown in the red wireframe section in Fig. 10). In short, the experimental results show that the overall multi-step prediction performance of DAQFF is the best, whether on ordinary or extreme weather days, weekdays or weekends.

Finally, we have compared the multi-step forecasting performance of DAQFF and the state-of-the-art method of Zheng et al. [19], which used the same Urban Air Quality Dataset. As shown in Table 6, when the prediction size is less than 6, the forecasting performance of the DAQFF model is worse than that of the method of Zheng et al., but for long time-step forward forecasting, DAQFF performs better. It should be noted that benchmarks are not consistent across studies, because few benchmark datasets are available in the air quality forecasting field and even the released Urban Air Quality Dataset [19] has not been well pre-processed (e.g., there are many missing values, to which different processing methods can be applied).

TABLE 6
COMPARISON AMONG DIFFERENT STUDIES ON THE URBAN AIR QUALITY DATASET

                    MAE
Models              1h~6h    7h~12h   13h~24h  25h~48h
Zheng et al. [19]   23.70    52.40    63.90    69.0
DAQFF               25.01    48.37    59.69    61.75

All in all, the PM2.5 predictions of the proposed DAQFF match the ground truth well in single-step forward prediction and outperform the baseline models in multi-step forward prediction, which implies that the deep air quality forecasting framework can effectively learn the local trend and long-term temporal dependence characteristics of multivariate air quality time series data. The proposed air quality forecasting model DAQFF, which is based on a hybrid deep learning structure, can provide a useful reference for air pollution management and early warning.

5 CONCLUSION AND FUTURE WORK

In this paper, we proposed a new air quality forecasting framework (DAQFF) for single-step and multi-step forward PM2.5 prediction, which is based on a hybrid deep learning method. DAQFF consists of two deep neural networks: one-dimensional CNNs and Bi-directional LSTM. It can learn the local trend correlation features and spatial-temporal dependency patterns of multivariate air quality related time series data. Experiments showed that the proposed model performs better than classic shallow learning and deep learning models, and that it can effectively explore and learn the interdependence and nonlinear correlations of multivariate air quality related time series (e.g., temperature, humidity, wind speed, SO2, PM10, and PM2.5 itself). The main contributions of this paper are as follows:

1) We proposed a new hybrid deep learning framework which can deal with hierarchical feature representation and multi-scale spatial-temporal dependency fusion learning in an end-to-end process for air quality forecasting.

2) This study was the first attempt to combine multiple one-dimensional CNNs and bi-directional LSTM for hybrid fusion learning of air quality related multivariate time series data, which can extract spatial-temporal dependency and correlation features for air quality multi-step
forecasting modeling.

3) We demonstrated the effectiveness of our model by testing it on two real-world air quality datasets, and the experimental results indicated that our model has good forecasting ability (for both single-step and multi-step forecasting). It was also shown that the proposed model has better prediction ability than typical shallow learning and baseline deep learning models.

In future research, we believe that abrupt changes (also called outliers or anomaly points) in air pollution time series deserve further study. If we can predict sudden changes in air pollution in advance, this will greatly improve the multi-step forecasting ability of our model. In addition, the DAQFF model also needs to be studied in depth and improved under different forecasting conditions.

ACKNOWLEDGMENT

This research was partially supported by the National Natural Science Foundation of China (Nos. 61773324, 61572407), the “Center for Cyber-physical System Innovation” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan and MOST under 106-2221-E-011-149-MY2 and 108-2218-E-011-006.

REFERENCES

[1] Zhang Y., Bouquet, M., Mallet, V., Seigneur, C., and Baklanov, A., “Real-time air quality forecasting, Part I: History, techniques, and current status,” Atmospheric Environment, vol. 60, pp. 632-655, 2012.
[2] Zhang, Y., Bouquet, M., Mallet, V., Seigneur, C., and Baklanov, A., “Real-time air quality forecasting, Part II: State of the science, current research needs, and future prospects,” Atmospheric Environment, vol. 60, pp. 656-676, 2012.
[3] Vardoulakis S, Fisher BE A, Pericleous K, et al, “Modelling air quality in street canyons: a review,” Atmospheric Environment, vol. 37, no. 2, pp. 155-182, 2003.
[4] M. Dong, D. Yang, Y. Kuang, D. He, S. Erdal, and D. Kenski, “PM2.5 concentration prediction using hidden semi-Markov model-based times series data mining,” Expert Systems with Applications, vol. 36, no. 5, pp. 9046-9055, 2009.
[5] Donnelly, A., Misstear, B., and Broderick, B, “Real-Time air quality forecasting using integrated parametric and nonparametric regression techniques,” Atmospheric Environment, vol. 103, pp. 53-65, 2015.
[6] Zhou X, Huang W, Zhang N, et al, “Probabilistic dynamic causal model for temporal data,” in Proceedings of 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1-8, 2015.
[7] Y. Zheng, F. Liu, and H.-P. Hsieh, “U-air: When urban air quality inference meets big data,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1436-1444, 2013.
[8] H.-P. Hsieh, S.-D. Lin, and Y. Zheng, “Inferring air quality for station location recommendation based on urban big data,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 437-446, 2015.
[9] Schmidhuber J, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, 2015.
[10] Krizhevsky A, Sutskever I, and Hinton G E, “Imagenet classification with deep convolutional neural networks,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
[11] Karpathy A, and Li F F, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, 2015.
[12] Cho K, Van Merriënboer B, Gulcehre C, et al, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[13] Gamboa J C B, “Deep learning for time-series analysis,” arXiv preprint arXiv:1701.01887, 2017.
[14] Qi Z, Wang T, Song G, et al, “Deep air learning: interpolation, prediction, and feature analysis of fine-grained air quality,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2285-2297, 2018.
[15] Deleawe S, Kusznir J, Lamb B, et al, “Predicting air quality in smart environments,” Journal of Ambient Intelligence and Smart Environments, vol. 2, no. 2, pp. 145-154, 2010.
[16] Zhou Q, Jiang H, Wang J, et al, “A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network,” The Science of the Total Environment, vol. 496, pp. 264-274, 2014.
[17] Díaz-Robles L A, Ortega J C, Fu J S, et al, “A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile,” Atmospheric Environment, vol. 42, no. 35, pp. 8331-8340, 2008.
[18] Zheng Y, Capra L, Wolfson O, et al, “Urban computing: concepts, methodologies, and applications,” ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 3, pp. 38, 2014.
[19] Zheng Y, Yi X, Li M, et al, “Forecasting fine-grained air quality based on big data,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 2267-2276, 2015.
[20] Lipton Z C, Berkowitz J, and Elkan C, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
[21] Längkvist M, Karlsson L, and Loutfi A, “A review of unsupervised feature learning and deep learning for time-series modeling,” Pattern Recognition Letters, vol. 42, pp. 11-24, 2014.
[22] Li X, Peng L, Hu Y, et al, “Deep learning architecture for air quality predictions,” Environmental Science and Pollution Research, vol. 23, no. 22, pp. 22408-22417, 2016.
[23] Ong B T, Sugiura K, and Zettsu K, “Dynamically pre-trained deep recurrent neural networks using environmental monitoring data for predicting PM2.5,” Neural Computing and Applications, vol. 27, no. 6, pp. 1553-1566, 2016.
[24] Yang J, Nguyen MN, San P P, et al, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp. 3995-4001, 2015.
[25] Zhang J, Zheng Y, Qi D, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in Proceedings of AAAI Conference on Artificial Intelligence (AAAI), pp. 1655-1661, 2017.
[26] Sun Y, Wang X, Tang X, “Hybrid deep learning for face verification,” in Proceedings of 2013 IEEE International Conference on Computer Vision (ICCV), pp. 1489-1496, 2013.
[27] Wu Z, Wang X, Jiang Y G, et al, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in Proceedings of the 23rd ACM International Conference on Multimedia (ACM MM), pp. 461-470, 2015.
[28] Liang X, Zou T, Guo B, et al, “Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating,” in Proceedings of The Royal Society A, vol. 471, no. 2182, 2015.
[29] Graves A, Fernández S, Schmidhuber J, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Proceedings of International Conference on Artificial Neural Networks (ICANN), pp. 799-804, 2005.
[30] Graves A, and Schmidhuber J, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602-610, 2005.
[31] Beijing PM2.5 Data Set. [Online]. Available: [Link] [Link]/ml/datasets/Beijing+PM2.5+Data.
[32] Hochreiter S, and Schmidhuber J, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[33] Olah C, “Understanding LSTM networks,” 2015. [Online]. Available: [Link]
[34] Eravci B, and Ferhatosmanoglu H, “Diverse Relevance Feedback for Time Series with Autoencoder Based Summarizations,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2298-2311, 2018.
[35] Zhuang, Dennis EH, Gary CL Li, and Andrew KC Wong, “Discovery of temporal associations in multivariate time series,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, pp. 2969-2982, 2014.
[36] Wang J, and Song G, “A deep spatial-temporal ensemble model for air quality prediction,” Neurocomputing, vol. 314, pp. 198-206, 2018.
[37] Yi X, Zhang J, Wang Z, Li T, and Zheng Y, “Deep distributed fusion network for air quality prediction,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 965-973, 2018.
