Deep Air Quality Forecasting Using Hybrid Deep
Learning Framework
Shengdong Du, Tianrui Li, Senior Member, IEEE, Yan Yang, Member, IEEE, and Shi-Jinn Horng
Abstract—Air quality forecasting has been regarded as the key problem of air pollution early warning and control management.
In this paper, we propose a novel deep learning model for air quality (mainly PM2.5) forecasting, which learns the spatial-temporal
correlation features and interdependence of multivariate air quality related time series data by hybrid deep learning architecture.
Due to the nonlinear and dynamic characteristics of multivariate air quality time series data, the base modules of our model include
one-dimensional Convolutional Neural Networks (1D-CNNs) and Bi-directional Long Short-term Memory networks (Bi-LSTM).
The former is to extract the local trend features and spatial correlation features, and the latter is to learn spatial-temporal
dependencies. Then we design a jointly hybrid deep learning framework based on one-dimensional CNNs and Bi-LSTM for shared
representation features learning of multivariate air quality related time series data. We conduct extensive experimental evaluations
using two real-world datasets, and the results show that our model is capable of dealing with PM2.5 air pollution forecasting with
satisfactory accuracy.
Index Terms—Air quality forecasting, deep learning, convolutional neural networks, long short-term memory networks
1 INTRODUCTION
To exploit spatial-temporal dependency features of different air quality related time series data (see the lower right corner of Fig. 4), the first step is to train multiple one-dimensional CNNs to extract local trend features and possible spatial correlation features from the time series data of multiple stations. Unlike traditional image processing methods (which are fed with two-dimensional image pixels), the inputs to our DAQFF model are multiple one-dimensional time series. We employ an improved CNN model, which can compress the length of air quality time series. Rather than learning the features of each single time series separately, we learn all the time series data of each observation point of multiple stations.
Then, the extracted features of the one-dimensional CNNs (including the local trend features of each station's data and the possible spatial correlation features of multiple stations' data) are concatenated and fed into bi-directional LSTMs. These LSTMs learn spatial-temporal dependency features from both past and future contexts by processing the time series in forward and backward directions simultaneously.
Given an air quality time series dataset $I_i$ of a station ($i$ denotes the station number), the process to learn the spatial-temporal dependency features of multiple stations data can be represented as follows:

$$\mathrm{Convs}(I_i) \rightarrow L_i \tag{1}$$
$$\mathrm{Concatenate}(L_1 \ldots L_i \ldots L_n) \rightarrow LC_t \tag{2}$$
$$\mathrm{BiLSTM}(LC_t) \rightarrow S_t, T_t \tag{3}$$
$$\mathrm{Concat}(S_t, T_t) \rightarrow O_t \tag{4}$$

where $L_i$ denotes the local trend features of single station time series data $I_i$, and $LC_t$ denotes the concatenated local trend features of all stations and the hidden spatial correlation features between all stations. These spatial correlation features with local trend features are concatenated and learned by the Bi-LSTM model automatically. Note that $S_t$ and $T_t$ denote the spatial and temporal dependency features, respectively, which are extracted from multiple stations data, and $\mathrm{Concat}(S_t, T_t)$ represents the feature level fusion result. $O_t$ is the shared representation between $S_t$ and $T_t$. Next, we use a fusion layer to concatenate all the spatial-temporal shared features among different time series data together. The model is formulated as follows:

$$F\big((O_{t-l}, O_{t-l+1}, \ldots, O_t), W^i, b^i\big) \rightarrow M_\pi, \quad i = 1, 2, \ldots, n \tag{5}$$

where $M_\pi$ denotes the joint fusion representation for different learned spatial-temporal dependency features which are extracted from multiple air quality time series data. $W^i$ and $b^i$ are weights and biases, respectively, which are learned by the fusion model with all training datasets, where $i$ indicates the $i$th time window of input time series data, and $l$ indicates the time window size (also called lookup size). The training objective function of the DAQFF model is as follows:

$$\arg\min_{\boldsymbol{\theta}} C_i = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \left\| \hat{y}_i^j - y_i^j \right\|^2 \tag{6}$$

The final model training problem is to minimize the overall error $C_i$ of training samples for each time window time series, where $i$ indicates each time window time series input ($i = 1, 2, \ldots, n$), $j$ indicates the input sample number within a time window ($j = 1, 2, \ldots, m$), and $\boldsymbol{\theta}$ is the parameter space including $W_l^i$ and $b_l^i$ of each layer.
Based on the above process, one-dimensional CNNs are used to extract the local trend and spatial correlation features. Bi-LSTM is used to capture and learn the spatial-temporal dependency features of the sequence and to obtain the correlation pattern of the time series context. Then we fuse these learned spatial-temporal dependency features by concatenating layers. Finally, we input these joint fusion features into the linear regression layer for final prediction. In this way, DAQFF combines multiple one-dimensional CNNs and bi-directional LSTM in one end-to-end deep learning architecture, which can simultaneously extract the local trend features and the spatial-temporal dependency features of air quality related multivariate time series data.

3.3 Multiple 1D-CNNs for local trend and spatial features learning
CNN not only has excellent performance in image processing [10], but can also be effectively applied to time series data mining. A typical CNN has three layers: convolutional layer, activation layer, and pooling layer. Unlike the classical CNN model (the traditional two-dimensional CNN used for images), we propose to use multiple one-dimensional filters convolved (1D-CNNs) over all time steps of air quality time series data. The computing processes of 1D-CNN layers are formulated as below:

$$c_j^l = \sum_i x_i^{l-1} * w_{ij}^l + b_j^l \tag{7}$$
$$x_j^l = \mathrm{ReLU}(c_j^l) \tag{8}$$
$$x_j^l = \mathrm{Flatten}(x_j^l) \tag{9}$$
$$x_k^{l+1} = \mathrm{FC}(w_{kj}^{l+1} x_j^l + b_k^{l+1}) \tag{10}$$

Note that Eq. (7) and Eq. (8) model the convolutional layer learning process, where $*$ denotes a convolution operator, and $w_{ij}^l$ and $b_j^l$ are the filters and biases, respectively. We use ReLU as the activation function. $x_i^{l-1}$ and $c_j^l$ represent the input and output vectors of a convolution layer, respectively. Here $l$ represents the involved layer. We use three convolution layers for local trend feature learning. Each layer learns a non-linear representation from the previous layer, and the learned representation is then fed into the next layer to form hierarchical feature representations. After the three convolution layers, we use a flatten layer to transform the high-level representation into a feature vector and use a fully connected layer to reduce the dimension of the final output vector.
As introduced above, the multi-station input air quality time series data are processed using multiple 1D-CNNs, and are flattened into the fully connected layer. Then the final output is given by a concatenated layer, which not only captures the local trend features of single station time series data (as a one-dimensional filter is used in each convolutional layer, the local trend change features of the time series over time can be captured), but also integrates the possible spatial correlation features of multiple stations.
Moreover, the local perception and weight sharing properties of the one-dimensional CNN can reduce the number of parameters for processing multivariate time series data, thereby improving learning efficiency. Thus, our method can learn deeper representation features of air quality related data.

3.4 Bi-LSTM for long temporal dependencies learning
Although traditional statistical methods such as ARIMA and shallow learning models can, like deep neural networks, process time series, their performance is limited because they do not take into account the long-term temporal dependence of time series data. To overcome this shortcoming, the Long Short-term Memory network (LSTM) is a good option [32]; it is a popular dynamic model for handling sequence tasks.
As shown in the upper right corner of Fig. 4, the LSTM Cell Block represents a typical LSTM diagram [33]. The memory cell of each LSTM block contains four main components. The collaboration of these components enables cells to learn and memorize long dependency features. The typical LSTM block computing process is as follows:

$$i_t = \sigma(U^{(i)} x_t + W^{(i)} h_{t-1} + b_i) \tag{11}$$
$$f_t = \sigma(U^{(f)} x_t + W^{(f)} h_{t-1} + b_f) \tag{12}$$
$$o_t = \sigma(U^{(o)} x_t + W^{(o)} h_{t-1} + b_o) \tag{13}$$
$$\tilde{s}_t = \tanh(U^{(c)} x_t + W^{(c)} h_{t-1} + b_c) \tag{14}$$
$$s_t = f_t \circ s_{t-1} + i_t \circ \tilde{s}_t \tag{15}$$
$$h_t = o_t \circ \tanh(s_t) \tag{16}$$

As shown in the above formulas, $i_t$ represents the input gate, which decides what new information enters the memory cell. $f_t$ represents the forget gate, which decides how much information should be discarded. $o_t$ indicates the output gate, which decides how much information is transferred to the next time step or to the output. $\tilde{s}_t$ is the self-recurrent state, computed by a neuron with a self-recurrent connection as in a plain RNN. $s_t$ is the internal memory cell of the LSTM block, which is the sum of two parts.
The first part is calculated from the previous internal memory state $s_{t-1}$ and the forget gate $f_t$. The second part is calculated by element-wise multiplication of the self-recurrent state $\tilde{s}_t$ and the input gate $i_t$. $h_t$ is the hidden state of the LSTM block.
One disadvantage of traditional LSTMs is that they can only utilize the previous context of sequence data. A Bi-directional LSTM can process the time series data in two directions simultaneously through two independent hidden layers [30], whose outputs are concatenated and fed forward to the output layer. In other words, the Bi-directional LSTM processes the time series data in two directions iteratively (forward layer from t = 1 to T, backward layer from t = T to 1).

$$\overrightarrow{i}_t = \sigma(\overrightarrow{U}^{(i)} \overrightarrow{x}_t + \overrightarrow{W}^{(i)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_i) \tag{17}$$
$$\overrightarrow{f}_t = \sigma(\overrightarrow{U}^{(f)} \overrightarrow{x}_t + \overrightarrow{W}^{(f)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_f) \tag{18}$$
$$\overrightarrow{o}_t = \sigma(\overrightarrow{U}^{(o)} \overrightarrow{x}_t + \overrightarrow{W}^{(o)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_o) \tag{19}$$
$$\overrightarrow{\tilde{s}}_t = \tanh(\overrightarrow{U}^{(c)} \overrightarrow{x}_t + \overrightarrow{W}^{(c)} \overrightarrow{h}_{t-1} + \overrightarrow{b}_c) \tag{20}$$
$$\overrightarrow{s}_t = \overrightarrow{f}_t \circ \overrightarrow{s}_{t-1} + \overrightarrow{i}_t \circ \overrightarrow{\tilde{s}}_t \tag{21}$$
$$\overrightarrow{h}_t = \overrightarrow{o}_t \circ \tanh(\overrightarrow{s}_t) \tag{22}$$

$$\overleftarrow{i}_t = \sigma(\overleftarrow{U}^{(i)} \overleftarrow{x}_t + \overleftarrow{W}^{(i)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_i) \tag{23}$$
$$\overleftarrow{f}_t = \sigma(\overleftarrow{U}^{(f)} \overleftarrow{x}_t + \overleftarrow{W}^{(f)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_f) \tag{24}$$
$$\overleftarrow{o}_t = \sigma(\overleftarrow{U}^{(o)} \overleftarrow{x}_t + \overleftarrow{W}^{(o)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_o) \tag{25}$$
$$\overleftarrow{\tilde{s}}_t = \tanh(\overleftarrow{U}^{(c)} \overleftarrow{x}_t + \overleftarrow{W}^{(c)} \overleftarrow{h}_{t-1} + \overleftarrow{b}_c) \tag{26}$$
$$\overleftarrow{s}_t = \overleftarrow{f}_t \circ \overleftarrow{s}_{t-1} + \overleftarrow{i}_t \circ \overleftarrow{\tilde{s}}_t \tag{27}$$
$$\overleftarrow{h}_t = \overleftarrow{o}_t \circ \tanh(\overleftarrow{s}_t) \tag{28}$$

$$h_t = \overrightarrow{h}_t \circ \overleftarrow{h}_t \tag{29}$$

The above equations show the layer functions of Bi-LSTM, and the two direction arrows denote the forward and backward process, respectively. $h_t$ represents the final hidden element of Bi-LSTM, which is the concatenated vector of the forward output $\overrightarrow{h}_t$ and the backward output $\overleftarrow{h}_t$ (so $\circ$ in Eq. (29) denotes concatenation). Through the above process, Bi-LSTM can learn both past and future features of time series data, and the predictive output is generated from past and future contexts.

4 EXPERIMENTS
In this section, we use two real air quality data sets to conduct experiments to analyze and evaluate the performance of the proposed model. Through the comparison of classical shallow learning models, baseline deep learning models and our model DAQFF, the forecasting performance and effectiveness of the proposed model are validated.

4.1 Datasets
Our experiment uses two real-world air quality datasets. The first one is the Beijing air quality dataset from UCI [31], which includes meteorological data and PM2.5 pollution data. The dataset is collected every hour and is sourced from the data interface released by the US Embassy in Beijing [28]. As an experimental air quality UCI data set, it contains different attributes such as date, time, temperature, humidity, wind speed, wind direction, and PM2.5 values. The second dataset is the Urban Air Quality Dataset collected in the Urban Air project of Microsoft Research [19]. The details of the two experimental data sets are listed as follows (as shown in Table 1):

TABLE 1
EXPERIMENTS DATASETS DESCRIPTION
Dataset           Beijing PM2.5 Dataset [31]    Urban Air Quality Dataset [19]
Data type         multivariable time series     multivariable time series
Intervals         60-minutes                    60-minutes
Location          Beijing                       Beijing
Time Span         01/01/2010-12/31/2014         05/01/2014-04/30/2015
Variable number   8                             14
Used records      43,824                        278,023
Station number    1 station                     36 stations

Beijing PM2.5 Dataset: This hourly dataset contains the PM2.5 data of the US Embassy in Beijing. Meteorological data are also included. Data items include PM2.5 concentration, Dew Point, Temperature, Pressure, Combined wind direction, Cumulated wind speed (m/s), Cumulated hours of snow, and Cumulated hours of rain. The dataset used for experiments ranges from 01/01/2010 to 12/31/2014 and has 43,824 records.
Urban Air Quality Dataset: This hourly dataset is comprised of six parts of data over a period of one year (from 05/01/2014 to 04/30/2015), which has been used in [7] [19] to infer the fine-grained air quality of current and future times. We select the data from Beijing as the experimental dataset, which contains a total of 278,023 records from 36 monitoring stations, where the data items include PM2.5, PM10, NO2, CO, O3, SO2, weather, temperature, humidity, pressure, wind speed and wind direction, etc.

4.2 Experimental Setup
This section describes the hardware and software environment of the experiment and the configuration of relevant parameters. The open source deep learning library Keras, which is based on TensorFlow, is used to build the baseline deep learning models and the DAQFF model, and Scikit-learn is used to build the shallow learning models. All experiments are conducted on a PC server with an Intel(R) Xeon(R) CPU E5-2623 3.00GHz, four NVIDIA Tesla K80C GPUs with 12 GB each, and 128 GB of memory.
Our framework is compared with two classic shallow learning models and five baseline deep learning models. They are summarized as follows.
ARIMA is one of the most common traditional statistical methods in time series prediction.
SVR (Support Vector Regression) is a kernel method of machine learning which can also be used for time series forecasting. The kernel-based SVR makes it possible to learn nonlinear trends of the training dataset. There are three SVR models with different kernels (RBF, poly and linear).
RNN (Recurrent Neural Network) is a popular deep learning method for handling sequence tasks. GRU (Gated Recurrent Units) and LSTM (Long Short-term Memory) are the most popular variants of RNN. CNN (Convolutional Neural Network) is widely used in image processing,
but one-dimensional CNN can also be used for time series prediction.
The most critical task of deep learning applications is setting hyper-parameters and optimizing them. In order to effectively model a deep neural network, a large number of hyper-parameters need to be set. In experiments, the default parameters in Keras are used for deep neural network initialization (e.g., weight initialization). To avoid over-fitting of the deep learning models, we apply several methods, such as a dropout policy with probability 0.3, which is used widely between layers (including convolutional layers, recurrent layers, and dense layers). The default training parameters are: the batch size is 32, the number of epochs is 100, and the lookup size is 1. We also use tanh as the activation function of the RNN models (including GRU and LSTM) and ReLU as the activation function of the CNN layers. In addition, we use Adam as the optimizer. The baseline models use one hidden layer by default, and the number of neurons of each hidden layer is set to 128.
We use three convolution layers for local trend feature learning. Each layer has different filter size and kernel size parameter settings, say (64, 5), (32, 3), (16, 1). We use ReLU as the activation function. The bidirectional-LSTM layer is equipped with 128 hidden neurons for temporal feature learning. We use mean square error (MSE) as the loss function of our DAQFF model. The activation function of the output layer is a linear function, which is also used for final prediction. Moreover, we apply a min-max function to normalize the air quality time series data to [0,1]. Missing features in the experimental data are filled using the average value of the column in which they are located.
Additionally, for the Beijing PM2.5 Dataset experiment, we select the first four years of data for training and validation (three years for training, and the remaining year for validation) and select the last year of data for testing (01/01/2014-12/31/2014). For the Urban Air Quality Dataset experiment, we select the first eight months of data for training (05/01/2014-12/31/2014) and select the last four months of data for testing (01/01/2015-04/30/2015). We use RMSE and MAE as the model error evaluation indicators, which are used to analyze the experimental results.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{30}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{31}$$

where $\hat{y}_i$ represents the predicted PM2.5 value, $y_i$ represents the ground truth value, and $n$ is the number of test samples.

... is superior to other baseline methods in terms of PM2.5 single-step forward prediction performance on both datasets. Compared to the baseline shallow and deep learning models, our model reduces RMSE to 8.20 and MAE to 6.19 on the Beijing PM2.5 Dataset, and also has the lowest error on the Urban Air Quality Dataset, which clearly improves the forecasting accuracy. In addition, the model errors of classic deep learning models (such as LSTM, CNN, and GRU) are similar to one another and lower than those of the shallow models. This means that deep learning models are more effective for air quality time series forecasting than traditional shallow learning models in the single-step prediction task.

TABLE 2
THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE SINGLE-STEP PM2.5 PREDICTION TASK.
Models        Beijing PM2.5 Dataset     Urban Air Quality Dataset
              RMSE     MAE              RMSE     MAE
SVR-POLY      42.61    31.82            56.35    47.20
SVR-RBF       41.86    34.93            50.51    42.26
SVR-LINEAR    30.60    20.47            29.23    18.82
ARIMA         24.52    12.50            27.92    14.35
LSTM          13.03     9.29            24.96    22.31
GRU           11.75     8.71            23.70    21.43
CNN           12.21     9.09            20.95    16.36
RNN           10.61     8.83            13.79    11.62
DAQFF          8.20     6.19            11.81     9.96
Note: forward-step prediction size is 1, and model testing errors (RMSE and MAE) are the prediction errors of the next 1 hour (h1).

Moreover, the prediction performance of baseline deep learning methods is dramatically better than that of the classic shallow learning methods such as SVR and ARIMA (1 to 2 times the gap). Our model performs the best since DAQFF can learn local trend features with one-dimensional CNNs and long-term dependency features with the Bi-directional LSTM from air quality multivariate time series data.
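The RMSE and MAE indicators of Eqs. (30) and (31) can be sketched in a few lines of plain Python. This is only a minimal illustration; the function names and the sample PM2.5 values are hypothetical and are not taken from the experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, following Eq. (30)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return math.sqrt(squared / n)


def mae(y_true, y_pred):
    """Mean absolute error, following Eq. (31)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical ground truth and predictions, made up for illustration only.
y_true = [80.0, 95.0, 110.0, 70.0]
y_pred = [82.0, 91.0, 113.0, 70.0]
print("RMSE:", rmse(y_true, y_pred))
print("MAE:", mae(y_true, y_pred))
```

Because RMSE squares each residual before averaging, it penalizes large misses (such as a missed pollution peak) more heavily than MAE does, which is one reason to report both.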
Fig. 8. In the experiment on Urban Air Quality Dataset, a comparison of single step ground truth and predicted PM2.5 value during one month
(01/01/2015-01/31/2015) of 1001 station versus different models (SVR, LSTM, and DAQFF). (a) SVR with RBF kernel model; (b) LSTM
model; (c) DAQFF model.
4.4 Multi-step Forecasting Results Analysis
The multi-step PM2.5 prediction quantitative results of two real-world datasets are reported in Table 3 (model testing error in the table is the average of the prediction error value in the next forward 6 hours, h1~h6), which gives an RMSE and MAE comparative analysis of SVR (rbf, linear and poly kernel), RNN, CNN, LSTM, GRU, and our proposed model DAQFF. As shown in Table 3, our model is also superior to other methods in terms of PM2.5 multi-step prediction performance. Compared to the baseline shallow and deep learning models, our model reduces MAE to 27.53 on the Beijing PM2.5 Dataset, and also has the lowest MAE, 25.01, on the Urban Air Quality Dataset, which clearly improves the forecasting accuracy. It is worth noting that the testing errors of classic deep learning models (RNN, CNN, LSTM, and GRU) are similar to one another and larger than that of the SVR-LINEAR model on the Beijing PM2.5 Dataset. Does this mean that the multi-step forecasting performance of the baseline deep learning models is worse than that of some shallow models (such as SVR-LINEAR)? In fact, this is not entirely true. As shown in Table 4, if long-term time step prediction is performed, the prediction performance of the baseline deep learning models exceeds that of the SVR-LINEAR model as the prediction time step increases. Taking the LSTM model as an example, in the next 3 hours (h1~h3), the average prediction error of the LSTM model is larger than that of the SVR-LINEAR model. However, when the forward prediction size is greater than 3, e.g. in the next h4~h6, h7~h12, or h13~h24 time period, the average prediction error of the LSTM model is lower than that of the SVR-LINEAR model. The DAQFF model does not have this problem, since its performance is better than that of the baseline models for both short and long time step prediction.

TABLE 3
THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE MULTI-STEP PM2.5 PREDICTION TASK.
Models        Beijing PM2.5 Dataset     Urban Air Quality Dataset
              RMSE     MAE              RMSE     MAE
SVR-POLY      56.62    44.94            64.02    50.82
SVR-RBF       57.66    46.32            65.11    53.59
SVR-LINEAR    49.82    36.82            53.48    36.35
LSTM          57.49    44.12            58.25    44.28
GRU           52.61    38.99            60.76    45.53
RNN           57.38    44.69            60.71    46.16
CNN           52.85    39.68            53.38    38.21
DAQFF         43.49    27.53            46.49    25.01
Note: forward multi-step prediction size is 6, and model testing errors (RMSE and MAE) are the average of the prediction error in the next forward 6 hours (h1~h6).

Next, we analyze the impact of forward prediction size among the baseline deep learning models and DAQFF. As shown in Table 4, on the Beijing PM2.5 Dataset, the performance of PM2.5 multi-step forward prediction is significantly lower than that of single-step forward prediction (see Table 2). As the forward prediction size increases, the forecasting performance of these models gradually decreases. But we can observe that, compared to the baseline methods, our DAQFF model still has the lowest prediction error (RMSE and MAE) across the different forward prediction sizes.
TABLE 4
IN THE EXPERIMENT ON BEIJING PM2.5 DATASET, THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE MODELS FOR THE MULTI-STEP PREDICTION OF PM2.5 VALUES IN THE NEXT 24 HOURS.
Models        RMSE                                  MAE
              1h~3h   4h~6h   7h~12h  13h~24h       1h~3h   4h~6h   7h~12h  13h~24h
SVR-POLY      48.99   64.26   75.70   84.91         39.14   50.74   59.71   66.92
SVR-RBF       51.15   64.18   75.46   84.92         41.76   50.88   59.57   67.02
SVR-LINEAR    38.69   60.96   76.24   85.60         27.31   46.32   60.10   67.04
RNN           49.50   65.27   77.06   80.56         38.89   50.49   59.88   59.18
CNN           45.95   59.76   70.83   79.18         35.62   43.74   51.21   57.91
LSTM          45.88   57.51   69.52   79.15         35.31   40.32   48.76   57.17
GRU           47.32   57.90   69.57   79.23         37.01   40.98   48.77   57.25
DAQFF         34.35   52.64   66.14   77.38         20.91   34.15   44.93   54.58
Note: Hyper-parameters settings are: forward multi-step prediction size is 24, lookup size is 1, epochs is 100, batch-size is 64,
and testing error (RMSE, MAE) of t~t+n is the average of the prediction error in the next forward n hours.
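The per-range errors described in the note above (averages over 1h~3h, 4h~6h, 7h~12h, and 13h~24h) can be sketched as follows. This is a minimal illustration under the assumption that one error value is available per forecast hour and that each range reports the plain mean of those values; the source does not spell out the exact aggregation. The helper name `bucket_average` and the per-hour errors are hypothetical.

```python
def bucket_average(per_hour_error, buckets):
    """Average per-horizon errors over inclusive, 1-based hour ranges."""
    result = {}
    for start, end in buckets:
        if not (1 <= start <= end <= len(per_hour_error)):
            raise ValueError(f"invalid hour range: {start}~{end}")
        window = per_hour_error[start - 1:end]  # hours start..end inclusive
        result[f"{start}h~{end}h"] = sum(window) / len(window)
    return result


# Hypothetical per-hour MAE for h1..h24 (hour h simply has error h).
per_hour_mae = [float(h) for h in range(1, 25)]
buckets = [(1, 3), (4, 6), (7, 12), (13, 24)]
averages = bucket_average(per_hour_mae, buckets)
for label, value in averages.items():
    print(label, value)
```

Reporting ranges rather than a single 24-hour average makes it visible where a model loses accuracy, for example a model that is strong in 1h~3h but degrades quickly afterwards.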
Fig. 9. In the experiment on Beijing PM2.5 Dataset, a comparison of multi-step (next t1, t3, and t6) ground truth and predicted PM2.5 value during one month (01/01/2014-01/31/2014) of different models (SVR, LSTM and DAQFF). (a) SVR-LINEAR model for next 1 hour (t1) prediction; (b) SVR-LINEAR model for the third hour of the future (t3) prediction; (c) SVR-LINEAR model for the 6th hour in the future (t6) prediction; (d) LSTM model for next 1 hour (t1) prediction; (e) LSTM model for the third hour of the future (t3) prediction; (f) LSTM model for the 6th hour in the future (t6) prediction; (g) DAQFF model for next 1 hour (t1) prediction; (h) DAQFF model for the third hour of the future (t3) prediction; (i) DAQFF model for the 6th hour in the future (t6) prediction.
TABLE 5
IN THE EXPERIMENT ON URBAN AIR QUALITY DATASET, THE MODEL ERROR OF DAQFF AND COMPARISONS WITH OTHER BASELINE
MODELS FOR THE MULTI-STEP PREDICTION OF PM2.5 VALUES IN THE NEXT 48 HOURS.
Models        RMSE                                  MAE
              1h~6h   7h~12h  13h~24h 25h~48h       1h~6h   7h~12h  13h~24h 25h~48h
SVR-POLY      64.02   83.72   105.73  109.98        50.82   64.01   83.96   86.83
SVR-RBF       65.11   83.96   88.81   90.38         53.59   67.52   70.32   74.23
SVR-LINEAR    53.48   83.85   91.06   93.11         36.35   68.01   73.72   72.73
RNN           60.71   86.25   100.54  115.13        46.16   67.94   81.97   99.45
GRU           60.76   95.59   107.82  111.06        45.53   72.81   82.45   90.25
LSTM          58.25   88.52   96.61   103.21        44.28   69.39   76.40   84.65
CNN           53.38   83.48   94.85   94.53         38.21   61.97   71.94   77.92
DAQFF         46.49   69.15   77.88   80.06         25.01   48.37   59.69   61.75
Note: Hyper-parameters settings are: forward multi-step prediction size is 48, lookup size is 1, epochs is 100, batch-size is 64,
and testing error (RMSE, MAE) of t~t+n is the average of the prediction error in the next forward n hours.
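To make the two hyper-parameters in the note above concrete, the following sketch builds supervised samples from a single series. It assumes that the lookup size is the number of past time steps fed to the model and that the forward multi-step prediction size is the number of future steps to predict; the source does not give this procedure explicitly. The function `make_windows` is hypothetical and handles only one variable, whereas the actual input is multivariate and multi-station.

```python
def make_windows(series, lookup_size, horizon):
    """Split a sequence into (past window, future window) training pairs."""
    if lookup_size < 1 or horizon < 1:
        raise ValueError("lookup_size and horizon must be positive")
    samples = []
    last_start = len(series) - lookup_size - horizon
    for start in range(last_start + 1):
        split = start + lookup_size
        past = series[start:split]            # model input
        future = series[split:split + horizon]  # multi-step target
        samples.append((past, future))
    return samples


# Hypothetical hourly PM2.5 readings, made up for illustration only.
pm25 = [10, 20, 30, 40, 50, 60]
pairs = make_windows(pm25, lookup_size=1, horizon=3)
for past, future in pairs:
    print(past, "->", future)
```

With a lookup size of 1 and a prediction size of 48, as in Table 5, each sample would pair a single hourly observation with the following 48 hourly values.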
Fig. 10. In the experiment on Urban Air Quality Dataset, forecast the PM2.5 value of one station (no.1001) based on the trained model of 36
stations data. A comparison of multi-step (t6) ground truth and predicted PM2.5 value during two months (01/01/2015-02/28/2015) of SVR-
LINEAR, LSTM and DAQFF model (a) SVR-LINEAR for the next 6th hour (t6) prediction; (b) LSTM for the next 6th hour (t6) prediction; (c)
DAQFF for the next 6th hour (t6) prediction.
In order to further analyze and compare the forecasting performance of DAQFF and the other baseline models, we analyze the multi-step forecasting ability of our model in the experiment on the Beijing PM2.5 Dataset under different time-step (t1, t3 and t6) forward predictions over one month of test data (24 hours per day, 31 days, 744 time-step points in total, 01/01/2014-01/31/2014). Figs. 9 (a)-(i) give a comparison of the ground truth and the predicted PM2.5 value of the experiment with different models (SVR-LINEAR, LSTM, and DAQFF) under different prediction sizes (1 time-step, 3 time-step, and 6 time-step), where the x-coordinate indicates the observation time-steps and the y-coordinate indicates the PM2.5 value. As shown in these figures, the multi-step forecasting performance of our model is better than that of the shallow models represented by SVR-LINEAR and the deep learning models represented by LSTM, not only under short prediction sizes but also under long prediction sizes, especially in the periods of wave peaks and troughs of the test data. Moreover, as the prediction time-steps grow, the predictive performance of the SVR models decreases dramatically, but our DAQFF model can maintain the best performance.
Moreover, we analyze whether the DAQFF model can maintain the same forecasting ability for different air quality data sets. In the experiment on the Urban Air Quality Dataset, we further verify the forecasting performance of the DAQFF model. As Table 5 shows, compared with other models, the forecasting performances of the SVR-POLY, RNN, and GRU models have large fluctuations in long-term time step (e.g. h13~h24, h25~h48) prediction. In addition, what is interesting about the data in Table 5 is that as the prediction time step grows, our model can maintain a significant improvement over the baseline models compared with the data of Table 4. In short, our model can maintain optimal prediction performance over short-term or long-term time step forecasting conditions.
Figs. 10 (a) (b) (c) give a zoomed-in comparison of the ground truth (expected) and six-step forward predicted PM2.5 values of the SVR-LINEAR, LSTM, and DAQFF models. Through the comparison of these three figures, it is found that the multi-step forecasting performance of a classic deep learning model like LSTM is better than that of the shallow model SVR-LINEAR, and the SVR-LINEAR model cannot effectively predict PM2.5 values such as the wave valleys and wave peaks of air quality time series data when the prediction time step is 6. Although the prediction performance of LSTM is not bad, it is not accurate enough to predict the wave valley and wave peak values of PM2.5 time series. DAQFF has better performance at all observation points and can effectively predict PM2.5 values under different conditions (e.g. in the case of missing values, as shown in the red wireframe section in Fig. 10). In short, through experimental results, we find that the overall multi-step prediction performance of DAQFF is the best, whether on general weather days or extreme weather days, weekdays or weekends.
Finally, we have compared the multi-step forecasting performance of DAQFF and the state-of-the-art method, Zheng et al. [19], which used the same Urban Air Quality Dataset. As shown in Table 6, compared with the method of Zheng et al. [19], when the prediction size is less than 6, the forecasting performance of the DAQFF model is worse than that of the method of Zheng et al., but for long time-step forward forecasting, our model DAQFF performs better. It should be noted that the benchmarks are not consistent for comparison across different studies, due to the small number of benchmark data sets available in the air quality forecasting field; even the released Urban Air Quality Dataset [19] has not been well pre-processed (e.g. there are many missing values to which different processing methods can be applied).

TABLE 6
COMPARISONS AMONG DIFFERENT RESEARCHES WITH URBAN AIR QUALITY DATASET
Models               MAE
                     1h~6h   7h~12h  13h~24h 25h~48h
Zheng et al. [19]    23.70   52.40   63.90   69.0
DAQFF                25.01   48.37   59.69   61.75

All in all, for the proposed DAQFF, the PM2.5 prediction matches the ground truth well in single-step forward prediction, and DAQFF also performs better than the baseline models in multi-step forward prediction, which implies that the deep air quality forecasting framework can effectively learn the local trend and long-term temporal dependence characteristics of multivariate air quality time series data. The proposed air quality forecasting model DAQFF, which is based on a hybrid deep learning structure, can provide a useful reference for air pollution management and early warning.

5 CONCLUSION AND FUTURE WORK
In this paper, we proposed a new air quality forecasting framework (DAQFF) for PM2.5 single-step forward and multi-step forward prediction, which is based on a hybrid deep learning method. DAQFF consists of two deep neural networks: one-dimensional CNNs and Bi-directional LSTM. It can learn the correlation features of local trends and the spatial-temporal dependency patterns of multivariate air quality related time series data. Experiments showed that the proposed model has better performance than classic shallow learning and deep learning models, and that it can explore and learn the interdependence and nonlinear correlations of multivariate air quality related time series (e.g. temperature, humidity, wind speed, SO2, PM10 and PM2.5 itself) effectively. The main contributions of this paper are as follows:
1) We first proposed a new hybrid deep learning framework which can deal with hierarchical feature representation and multi-scale spatial-temporal dependency fusion learning in an end-to-end process for air quality forecasting.
2) This study was the first attempt to combine multiple one-dimensional CNNs and bi-directional LSTM for hybrid fusion learning of air quality related multivariate time series data, which can extract spatial-temporal dependency and correlation features for air quality multi-step
forecasting modeling.
3) We demonstrated the effectiveness of our model by testing it on two real-world air quality datasets, and the experimental results indicated that our model has good forecasting ability (not only single-step but also multi-step forecasting). It was also shown that the proposed model has better prediction ability than typical shallow learning and baseline deep learning models.
In future research, we believe that abrupt changes (also called outliers or anomaly points) in air pollution time series deserve further study. If we can predict sudden changes in air pollution in advance, it will greatly improve the multi-step forecasting ability of our model. In addition, the DAQFF model also needs to be studied in depth and improved under different forecasting conditions.

ACKNOWLEDGMENT
This research was partially supported by the National Natural Science Foundation of China (Nos. 61773324, 61572407), the “Center for Cyber-physical System Innovation” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan and MOST under 106-2221-E-011-149-MY2 and 108-2218-E-011-006.

REFERENCES
[1] Zhang Y, Bocquet M, Mallet V, Seigneur C, and Baklanov A, “Real-time air quality forecasting, Part I: History, techniques, and current status,” Atmospheric Environment, vol. 60, pp. 632-655, 2012.
[2] Zhang Y, Bocquet M, Mallet V, Seigneur C, and Baklanov A, “Real-time air quality forecasting, Part II: State of the science, current research needs, and future prospects,” Atmospheric Environment, vol. 60, pp. 656-676, 2012.
[3] Vardoulakis S, Fisher B E A, Pericleous K, et al, “Modelling air quality in street canyons: a review,” Atmospheric Environment, vol. 37, no. 2, pp. 155-182, 2003.
[4] Dong M, Yang D, Kuang Y, He D, Erdal S, and Kenski D, “PM2.5 concentration prediction using hidden semi-Markov model-based times series data mining,” Expert Systems with Applications, vol. 36, no. 5, pp. 9046-9055, 2009.
[5] Donnelly A, Misstear B, and Broderick B, “Real-time air quality forecasting using integrated parametric and nonparametric regression techniques,” Atmospheric Environment, vol. 103, pp. 53-65, 2015.
[6] Zhou X, Huang W, Zhang N, et al, “Probabilistic dynamic causal model for temporal data,” in Proceedings of 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1-8, 2015.
[7] Zheng Y, Liu F, and Hsieh H-P, “U-air: When urban air quality inference meets big data,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1436-1444, 2013.
[8] Hsieh H-P, Lin S-D, and Zheng Y, “Inferring air quality for station location recommendation based on urban big data,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 437-446, 2015.
[9] Schmidhuber J, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85-117, 2015.
[10] Krizhevsky A, Sutskever I, and Hinton G E, “Imagenet classification with deep convolutional neural networks,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
[11] Karpathy A, and Li F F, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128-3137, 2015.
[12] Cho K, Van Merriënboer B, Gulcehre C, et al, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[13] Gamboa J C B, “Deep learning for time-series analysis,” arXiv preprint arXiv:1701.01887, 2017.
[14] Qi Z, Wang T, Song G, et al, “Deep air learning: interpolation, prediction, and feature analysis of fine-grained air quality,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2285-2297, 2018.
[15] Deleawe S, Kusznir J, Lamb B, et al, “Predicting air quality in smart environments,” Journal of Ambient Intelligence and Smart Environments, vol. 2, no. 2, pp. 145-154, 2010.
[16] Zhou Q, Jiang H, Wang J, et al, “A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network,” The Science of the Total Environment, vol. 496, pp. 264-274, 2014.
[17] Díaz-Robles L A, Ortega J C, Fu J S, et al, “A hybrid ARIMA and artificial neural networks model to forecast particulate matter in urban areas: the case of Temuco, Chile,” Atmospheric Environment, vol. 42, no. 35, pp. 8331-8340, 2008.
[18] Zheng Y, Capra L, Wolfson O, et al, “Urban computing: concepts, methodologies, and applications,” ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 3, pp. 38, 2014.
[19] Zheng Y, Yi X, Li M, et al, “Forecasting fine-grained air quality based on big data,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 2267-2276, 2015.
[20] Lipton Z C, Berkowitz J, and Elkan C, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
[21] Längkvist M, Karlsson L, and Loutfi A, “A review of unsupervised feature learning and deep learning for time-series modeling,” Pattern Recognition Letters, vol. 42, pp. 11-24, 2014.
[22] Li X, Peng L, Hu Y, et al, “Deep learning architecture for air quality predictions,” Environmental Science and Pollution Research, vol. 23, no. 22, pp. 22408-22417, 2016.
[23] Ong B T, Sugiura K, and Zettsu K, “Dynamically pre-trained deep recurrent neural networks using environmental monitoring data for predicting PM2.5,” Neural Computing and Applications, vol. 27, no. 6, pp. 1553-1566, 2016.
[24] Yang J, Nguyen M N, San P P, et al, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pp. 3995-4001, 2015.
[25] Zhang J, Zheng Y, and Qi D, “Deep spatio-temporal residual networks for citywide crowd flows prediction,” in Proceedings of AAAI Conference on Artificial Intelligence (AAAI), pp. 1655-1661, 2017.
[26] Sun Y, Wang X, and Tang X, “Hybrid deep learning for face verification,” in Proceedings of 2013 IEEE International Conference on Computer Vision (ICCV), pp. 1489-1496, 2013.
[27] Wu Z, Wang X, Jiang Y G, et al, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in Proceedings of the 23rd ACM International Conference on Multimedia (ACM MM), pp. 461-470, 2015.
[28] Liang X, Zou T, Guo B, et al, “Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating,” Proceedings of the Royal Society A, vol. 471, no. 2182, 2015.
[29] Graves A, Fernández S, and Schmidhuber J, “Bidirectional LSTM networks for improved phoneme classification and recognition,” in Proceedings of International Conference on Artificial Neural Networks (ICANN), pp. 799-804, 2005.
[30] Graves A, and Schmidhuber J, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602-610, 2005.
[31] Beijing PM2.5 Data Set. [Online]. Available: [Link] [Link]/ml/datasets/Beijing+PM2.5+Data.
[32] Hochreiter S, and Schmidhuber J, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[33] Olah C, “Understanding LSTM networks,” 2015. [Online]. Available: [Link]
[34] Eravci B, and Ferhatosmanoglu H, “Diverse Relevance Feedback for
Time Series with Autoencoder Based Summarizations,” IEEE Transac-
tions on Knowledge and Data Engineering, vol. 30, no. 12, pp. 2298-
2311, 2018.
[35] Zhuang D E H, Li G C L, and Wong A K C, “Discovery of temporal associations in multivariate time series,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, pp. 2969-2982, 2014.
[36] Wang J, and Song G, “A deep spatial-temporal ensemble model for air quality prediction,” Neurocomputing, vol. 314, pp. 198-206, 2018.
[37] Yi X, Zhang J, Wang Z, Li T, and Zheng Y, “Deep distributed fusion network for air quality prediction,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 965-973, 2018.