

Deep Learning for Time Series Forecasting: A Survey

J. F. Torres (a,1), D. Hadjout (b,1), A. Sebaa (c,d), F. Martínez-Álvarez (a), A. Troncoso (a,*)

(a) Data Science and Big Data Lab, Pablo de Olavide University, ES-41013 Seville, Spain.
(b) Department of Commerce, SADEG Company (Sonelgaz Group), 06000 Bejaia, Algeria.
(c) LIMED Laboratory, Faculty of Exact Sciences, University of Bejaia, 06000 Bejaia, Algeria.
(d) Higher School of Sciences and Technologies of Computing and Digital, Amizour, 06000 Bejaia, Algeria.

Abstract

Time series forecasting has become a very intensive field of research, which is even increasing in recent years. Deep neural networks have proved to be powerful and are achieving high accuracy in many application fields. For these reasons, they are one of the most widely used methods of machine learning to solve problems dealing with big data nowadays. In this work, the time series forecasting problem is initially formulated along with its mathematical fundamentals. Then, the most common deep learning architectures that are currently being successfully applied to predict time series are described, highlighting their advantages and limitations. Particular attention is given to feed forward networks, recurrent neural networks (including Elman, long short-term memory, gated recurrent units and bidirectional networks) and convolutional neural networks. Practical aspects, such as the setting of values for hyper-parameters and the choice of the most suitable frameworks for the successful application of deep learning to time series, are also provided and discussed. Several fruitful research fields in which the analyzed architectures have obtained a good performance are reviewed. As a result, research gaps have been identified in the literature for several domains of application, thus expecting to inspire new and better forms of knowledge.

Keywords: time series forecasting, deep learning, big data


* Corresponding author
Email addresses: jftormal@[Link] (J. F. Torres), dhadjout@[Link] (D. Hadjout), [Link]@[Link] (A. Sebaa), fmaralv@[Link] (F. Martínez-Álvarez), atrolor@[Link] (A. Troncoso)
1 Equally contributing authors.

Preprint submitted to Big Data, December 18, 2020


1. Introduction

The interest in processing huge amounts of data has experienced a rapid increase during the last decade due to the massive deployment of smart sensors [1] and social media platforms [2], which generate data on a continuous basis [3]. However, this situation poses new challenges, such as storing these data on disk or making the required computational resources available.

Big data analytics emerges, in this context, as an essential process focused on efficiently collecting, organizing and analyzing big data with the aim of discovering patterns and extracting valuable information [4]. In most organizations, this helps to identify new opportunities and make smarter moves, which leads to more efficient operations and higher profits [5].

Among all the learning paradigms currently being used in big data, deep learning stands out because of its outstanding performance as the scale of data increases [6]. Most of the layer computations in deep learning can be done in parallel by, for instance, powerful graphics processing units (GPUs). That way, scalable distributed models are easier to build and provide better accuracy at a much higher speed. Higher depth allows for more complex non-linear functions but, in turn, comes with higher computational costs [7].

Deep learning can be applied to numerous research fields. Applications to both supervised and unsupervised problems can be abundantly found in the literature [8]. Pattern recognition and classification were the first and most relevant uses of deep learning, achieving great success in speech recognition, text mining or image analysis. Nevertheless, the application to regression problems is becoming quite popular nowadays, mainly due to the development of deep learning architectures particularly conceived to deal with data indexed over time. Such is the case of time series and, more specifically, time series forecasting [9].

A time series is a set of measures collected at even intervals of time and ordered chronologically [10]. Given this definition, it is hard to find physical or chemical phenomena without variables that evolve over time. For this reason, the proposal of time series forecasting approaches is fruitful and can be found in almost all scientific disciplines.

Statistical approaches have been used since the 1970s, especially those based on the Box-Jenkins methodology [11]. With the appearance of machine learning and its powerful regression methods [12], many models were proposed that outperformed the former, which have remained as baseline methods in most research works. However, methods based on deep learning are currently achieving superior results, and much effort is being put into developing new architectures.

For all the above, the primary motivation behind this survey is to provide a comprehensive understanding of deep learning fundamentals for researchers interested in the field of time series forecasting. Furthermore, it overviews several applications in which these techniques have been proven successful and, as a result, research gaps have been identified in the literature that are expected to inspire new and better forms of knowledge.

Although other surveys discussing deep learning properties have been published during the last years, the majority of them provided a general overview of both theory and applications to time series forecasting. Thus, Zhang et al. [13] reviewed emerging research on deep learning models, including their mathematical formulation, for big data feature learning. Another remarkable work can be found in [14], in which the authors introduced the time series classification problem and provided an open source framework with implemented algorithms and the UEA/UCR repository [15]. Recently, Mayer and Jacobsen published a survey about scalable deep learning on distributed infrastructures, in which the focus was put on techniques and tools, along with an insightful discussion of the existing challenges in this field [16].

The rest of the paper is structured as follows. The forecasting problem and the mathematical formulation for time series can be found in Section 2. Section 3 introduces the deep learning architectures typically used in the context of time series forecasting. Section 4 provides information about several practical aspects (including implementation, hyper-parameter tuning or hardware resources) that must be considered when applying deep learning to forecast time series. Section 5 overviews the most relevant papers, sorted by fields, in which deep learning has been applied to forecast time series. Finally, the lessons learned and the conclusions drawn are discussed in Section 6.

2. Problem definition

This section provides the time series definition (Section 2.1), along with a description of the main time series components (Section 2.2). The mathematical formulation of the time series forecasting problem is introduced in Section 2.3. Final remarks about the length of the time series can be found in Section 2.4.

2.1. Time series definition

A time series is defined as a sequence of values, chronologically ordered, and observed over time. While time is a variable measured on a continuous basis, the values in a time series are sampled at constant intervals (fixed sampling frequency).

This definition holds true for many applications, but not every time series can be modeled in this way, due to some of the following reasons:

1. Missing data are a very common problem in time series due to the reliability of data collection. To deal with these values, there are many strategies, but those based on imputing the missing information or on omitting the entire record are the most widely used [17].

2. Outlying data are also an issue that appears very frequently in time series. Methods based on robust statistics must be chosen in order to remove these values or, simply, to incorporate them into the model [18].

3. When data are collected at irregular time periods, they are called either unevenly spaced time series or, if big enough, data streams [3].

Some of these issues can be handled natively by the model used, but if the data are collected irregularly, this should be accounted for in the model. Time series pre-processing is out of the scope of this survey; please refer to [19] for detailed information.

2.2. Time series components

Time series are usually characterized by three components: trend, seasonality and irregular components, also known as residuals [20]. Such components are described below:

1. Trend. It is the general movement that the time series exhibits during the observation period, without considering seasonality and irregularities. In some texts, this component is also known as long-term variation. Although there are different kinds of trends in time series, the most popular are linear, exponential and parabolic ones.

2. Seasonality. This component identifies variations that occur at specific regular intervals and may provide useful information when time periods exhibit similar patterns. It integrates effects that are reasonably stable in time, magnitude and direction. Seasonality can be caused by several factors, such as climate, economic cycles or even festivities.

3. Residuals. Once the trend and cyclic oscillations have been calculated and removed, some residual values remain. These values can sometimes be high enough to mask the trend and the seasonality. In this case, the term outlier is used to refer to these residuals, and robust statistics are usually applied to cope with them [21]. These fluctuations can be of diverse origin, which makes their prediction almost impossible. However, if by any chance this origin can be detected or modeled, they can be thought of as precursors of trend changes.

A time series is an aggregate of these three components. Real-world time series present a meaningful irregular component and are not stationary (mean and variance are not constant over time), turning this component into the most challenging one to model. For this reason, making accurate predictions for them is extremely difficult, and many classical forecasting methods try to decompose the target time series into these three components and make predictions for each of them separately.

The effectiveness of one technique or another is assessed according to its capability of forecasting this particular component. It is for the analysis of this component that data mining-based techniques have been shown to be particularly powerful.

Time series can be graphically represented. In particular, the x-axis identifies the time, whereas the y-axis shows the values recorded at each time stamp (x_t). This representation allows the visual detection of the most salient features of a series, such as the amplitude of oscillations, existing seasons and cycles, or the existence of anomalous data or outliers. Figure 1 depicts a time series, x_t, built with an additive model: seasonality with constant frequency and amplitude over time, represented by the function sin(x); a linear trend, where changes over time are consistently made by the same amount, represented by the function 0.0213x; and residuals, represented by random numbers in the interval [0, 0.1].
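
As an illustration, the additive series depicted in Figure 1 can be reproduced in a few lines of Python. This is a minimal sketch, assuming NumPy is available; the sampling grid over t (0 to 50 in steps of 0.5) is an assumption, since only the components are given in the text.

import numpy as np

t = np.arange(0, 50, 0.5)                               # time index (assumed sampling grid)
seasonality = np.sin(t)                                 # seasonal component, constant amplitude
trend = 0.0213 * t                                      # linear trend
residuals = np.random.uniform(0.0, 0.1, size=t.shape)   # irregular component in [0, 0.1]

x = seasonality + trend + residuals                     # additive time series, as in Figure 1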

2.3. Mathematical formulation

Time series models can be either univariate (one time-dependent variable) or multivariate (more than one time-dependent variable). Although models may differ dramatically between a univariate and a multivariate setting, the majority of deep learning models can handle both of them indistinctly.

Figure 1: Illustrative time series, showing seasonality, trend and residuals.

On the one hand, let y = y(t − L), ..., y(t − 1), y(t), y(t + 1), ..., y(t + h) be a given univariate time series with L values in the historical data, where each y(t − i), for i = 0, ..., L, represents the recorded value of the variable y at time t − i. The forecasting process consists in estimating the value of y(t + 1), denoted by ŷ(t + 1), with the aim of minimizing the error, which is typically represented as a function of y(t + 1) − ŷ(t + 1). This prediction can also be made when the horizon of prediction, h, is greater than one, that is, when the objective is to predict the h next values after y(t), that is, y(t + i), with i = 1, ..., h. In this situation, the best prediction is reached when the function

\sum_{i=1}^{h} \left( y(t+i) - \hat{y}(t+i) \right)

is minimized.

On the other hand, multivariate time series can be expressed in matrix form as follows:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix}
y_1(t-L) & \ldots & y_1(t-1) & y_1(t) & y_1(t+1) & \ldots & y_1(t+h) \\
y_2(t-L) & \ldots & y_2(t-1) & y_2(t) & y_2(t+1) & \ldots & y_2(t+h) \\
\vdots   &        & \vdots   & \vdots & \vdots   &        & \vdots   \\
y_n(t-L) & \ldots & y_n(t-1) & y_n(t) & y_n(t+1) & \ldots & y_n(t+h)
\end{bmatrix} \quad (1)

where y_i(t − m) identifies the set of time series, with i = {1, 2, ..., n}, being m = {0, 1, ..., L} the historical data and current sample and m = {−1, −2, ..., −h} the future h values. Usually, there is one target time series (the one to be predicted) and the remaining ones are denoted as independent time series.
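
In practice, deep learning models consume this formulation as supervised samples obtained by sliding a window of L past values over the series. The following is a minimal NumPy sketch of that transformation for the univariate case; the function name and parameters are illustrative, not taken from any specific library.

import numpy as np

def make_windows(y, lags, horizon):
    """Turn a univariate series y into supervised samples: each input holds
    `lags` past values and each target the next `horizon` values."""
    X, Y = [], []
    for t in range(lags, len(y) - horizon + 1):
        X.append(y[t - lags:t])          # y(t-L), ..., y(t-1)
        Y.append(y[t:t + horizon])       # the h values to be predicted
    return np.array(X), np.array(Y)

# Example: 24 past values used to predict the next 4
y = np.sin(np.arange(500) / 10.0)
X, Y = make_windows(y, lags=24, horizon=4)
print(X.shape, Y.shape)                  # (473, 24) (473, 4)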

2.4. Short and long time series forecasting

Another key issue is the length of the time series. Depending on the number of samples, long or short time series can be defined. It is well known that Box-Jenkins models do not work well for long time series, mainly due to the time-consuming process of parameter optimization and to the inclusion of information that is no longer useful to model the current samples [9].

How to deal with these issues is highly related to the purpose of the model. Flexible non-parametric models could be used, but this still assumes that the model structure will work over the whole period of the data, which is not always true. A better approach consists in allowing the model to vary over time. This can be done either by adjusting a parametric model with time-varying parameters, or by adjusting a non-parametric model with a time-based kernel. But if the goal is only to forecast a few observations, it is simpler to fit a model with the most recent samples, transforming the long time series into a short one [22].

Although a preliminary approach using a distributed ARIMA model has recently been published [23], it remains challenging to deal with such time series with classical forecasting methods. However, a number of machine learning algorithms adapted to deal with ultra-long time series, or big data time series, have been published in recent years [24]. These models make use of clusters of machines or GPUs to overcome the limitations described in the previous paragraphs.

Deep learning models can deal with time series in a scalable way and provide accurate forecasts [25]. Ensemble learning can also be useful to forecast big data time series [26], as can methods based on well-established techniques such as nearest neighbours [27, 28] or pattern sequence similarity [29].

3. Deep learning architectures

This section provides a theoretical tour of deep learning for time series prediction in big data environments. First, a description of the most used architectures in the literature to predict time series is made. Then, a state-of-the-art analysis is carried out, where the deep learning works and frameworks to deal with big data are described.

3.1. Deep Feed Forward Neural Network

Deep Feed Forward Neural Networks (DFFNN), also called multi-layer perceptrons, arose due to the inability of single-layer neural networks to learn certain functions. The architecture of a DFFNN is composed of an input layer, an output layer and different hidden layers, as shown in Figure 2. In addition, each hidden layer has a certain number of neurons to be determined.

Figure 2: Basic architecture of a DFFNN for time series forecasting.

The relationships between the neurons of two consecutive layers are modelled by weights, which are calculated during the training phase of the network. In particular, the weights are computed by minimizing a cost function by means of gradient descent optimization methods, and the back-propagation algorithm is used to calculate the gradient of the cost function. Once the weights are computed, the values of the output neurons of the network are obtained using a feed forward process defined by the following equation:

a^l = g(W_a^l a^{l-1} + b_a^l) \quad (2)

where a^l are the activation values in the l-th layer, that is, a vector composed of the values of the neurons of the l-th layer, W_a^l and b_a^l are the weights and bias corresponding to the l-th layer, and g is the activation function. Therefore, the a^l values are computed using the activation values of the (l−1)-th layer, a^{l-1}, as input. In time series forecasting, the rectified linear unit (ReLU) is commonly used as the activation function for all layers except the output layer, which generally uses the hyperbolic tangent function (tanh) to obtain the predicted values.

For all network architectures, the values of some hyper-parameters have to be chosen in advance. Some of these hyper-parameters, such as the number of layers and the number of neurons, define the network architecture, while others, such as the learning rate, the momentum, the number of iterations or the minibatch size, have a great influence on the convergence of the gradient descent methods. The optimal choice of these hyper-parameters is important, as their values greatly influence the prediction results obtained by the network. Hyper-parameters will be discussed in more detail in Section 4.2.
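
The feed forward pass of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration assuming already-trained weights (in practice they would be obtained by gradient descent with back-propagation); the layer sizes are arbitrary.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dffnn_forward(x, weights, biases):
    """Feed forward pass of Eq. (2): a^l = g(W^l a^{l-1} + b^l),
    with ReLU in the hidden layers and tanh in the output layer."""
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        a = np.tanh(z) if l == len(weights) - 1 else relu(z)
    return a

# Toy example: 24 lagged inputs, two hidden layers, horizon of 4
rng = np.random.default_rng(0)
sizes = [24, 32, 16, 4]
weights = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
y_hat = dffnn_forward(rng.normal(size=24), weights, biases)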

3.2. Recurrent Neural Network

Recurrent Neural Networks (RNN) are specifically designed to deal with sequential data, such as sequences of words in problems related to machine translation, audio data in speech recognition or time series in forecasting problems. All these problems share a common characteristic: the data have a temporal dependency between them. Traditional feed forward neural networks cannot take these dependencies into account, and RNNs arise precisely to address this problem [30]. Therefore, the input data in the architecture of a RNN are both past and current data. There are different types of architectures depending on the number of data inputs and outputs of the network, such as one to one (one input and one output), one to many (one input and many outputs), many to one (many inputs and one output), and many to many (many inputs and outputs). The most common RNNs are many to one, for classification problems, or many to many, for machine translation or time series forecasting, for instance. In addition, in the case of a time series, the length of the input data sequence usually differs from the size of the output data sequence, which usually is the number of samples to be predicted. A basic RNN architecture to address the forecasting of time series is shown in Figure 3, where x_i and x̂_i are the actual and predicted values of the time series at time i, and h is the number of samples to be predicted, called the prediction horizon.

Figure 3: Basic architecture of a RNN for time series forecasting.
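
One simple way to realize this forecasting setup is to encode the input window with a recurrent layer and emit the h future values with a dense output layer. The following Keras (TensorFlow) sketch is only illustrative; the window length, number of variables and horizon are assumptions.

import tensorflow as tf

window, n_features, horizon = 24, 1, 4   # assumed input window, variables and horizon h

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(window, n_features)),  # recurrent encoding of the window
    tf.keras.layers.Dense(horizon)                                    # one output per predicted step
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, Y_train, epochs=50, batch_size=32)  # X_train: (samples, window, n_features)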

The most widely used RNNs for time series forecasting are briefly described below.

3.2.1. Elman Recurrent Neural Network

The Elman network (ENN) was the first RNN; it incorporates the state of the hidden units at time t in order to make predictions on data sequences [31]. The ENN consists of a classical one-layer feed-forward network, but the hidden layer is connected to a new layer, called the context layer, using fixed weights equal to one, as shown in Figure 4. The main function of the neurons of this context layer is to save a copy of the activation values of the neurons of the hidden layer. The model is then defined by:

a_t = g(W_a x_t + U_a a_{t-1} + b_a) \quad (3)

where a_t are the values of the neurons of the hidden layer at state t, x_t is the current input, a_{t-1} is the information saved in the context hidden units, W_a, U_a and b_a are the weights and the bias, and g is the activation function.

Figure 4: Architecture of an ENN network for time series forecasting.
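
Eq. (3) translates almost literally into code. The following NumPy sketch is illustrative only: the function names are arbitrary and the weights are assumed to be already trained.

import numpy as np

def elman_step(x_t, a_prev, Wa, Ua, ba, g=np.tanh):
    """One step of Eq. (3): a_t = g(Wa x_t + Ua a_{t-1} + ba)."""
    return g(Wa @ x_t + Ua @ a_prev + ba)

def elman_forward(xs, Wa, Ua, ba):
    """Run the recurrence over a whole sequence, returning the last hidden state."""
    a = np.zeros(ba.shape)
    for x_t in xs:                        # xs: iterable of input vectors
        a = elman_step(x_t, a, Wa, Ua, ba)
    return a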

3.2.2. Long Short-Term Memory

Standard basic RNNs suffer from the vanishing gradient problem, which consists in the gradient decreasing as the number of layers increases. Indeed, for deep RNNs with a high number of layers, the gradient practically becomes null, preventing the learning of the network. For this reason, these networks have a short-term memory and do not obtain good results when dealing with long sequences that require memorizing all the information contained in the complete sequence. Long Short-Term Memory (LSTM) recurrent networks emerged in order to solve the vanishing gradient problem [32]. For this purpose, LSTMs use three gates to keep longstanding relevant information and discard irrelevant information. These gates are the forget gate Γ_f, the update gate Γ_u and the output gate Γ_o. Γ_f decides what information should be thrown away or saved: a value close to 0 means that the past information is forgotten, while a value close to 1 means that it is kept. Γ_u decides what new information c̃_t to use to update the memory state c_t. Thus, c_t is updated using both Γ_f and Γ_u. Finally, Γ_o decides which output value will be the input of the next hidden unit.

The information of the previous hidden unit a_{t-1} and of the current input x_t is passed through the sigmoid activation function σ to compute all the gate values, and through the tanh activation function to compute the new information c̃_t, which will be used for the update. The equations defining an LSTM unit are:

\tilde{c}_t = \tanh(W_c [a_{t-1}, x_t] + b_c) \quad (4)
\Gamma_u = \sigma(W_u [a_{t-1}, x_t] + b_u) \quad (5)
\Gamma_f = \sigma(W_f [a_{t-1}, x_t] + b_f) \quad (6)
\Gamma_o = \sigma(W_o [a_{t-1}, x_t] + b_o) \quad (7)
c_t = \Gamma_u * \tilde{c}_t + \Gamma_f * c_{t-1} \quad (8)
a_t = \Gamma_o * \tanh(c_t) \quad (9)

where W_u, W_f and W_o, and b_u, b_f and b_o are the weights and biases that govern the behavior of the Γ_u, Γ_f and Γ_o gates, respectively, and W_c and b_c are the weights and bias of the memory cell candidate c̃_t.

Figure 5 shows how a hidden unit works in an LSTM recurrent network. The * and + operators denote element-wise vector multiplication and sum.

Figure 5: Hidden unit in an LSTM.
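
The gate equations (4)-(9) map directly to code. The following NumPy sketch implements one LSTM step exactly as formulated above; the parameter dictionaries W and b are an illustrative convention, not a library interface.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, W, b):
    """One LSTM step following Eqs. (4)-(9). W and b hold the parameters of the
    candidate ('c') and of the update/forget/output gates ('u', 'f', 'o')."""
    v = np.concatenate([a_prev, x_t])              # [a_{t-1}, x_t]
    c_tilde = np.tanh(W["c"] @ v + b["c"])         # Eq. (4), memory candidate
    gamma_u = sigmoid(W["u"] @ v + b["u"])         # Eq. (5), update gate
    gamma_f = sigmoid(W["f"] @ v + b["f"])         # Eq. (6), forget gate
    gamma_o = sigmoid(W["o"] @ v + b["o"])         # Eq. (7), output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev     # Eq. (8), new memory state
    a_t = gamma_o * np.tanh(c_t)                   # Eq. (9), new hidden state
    return a_t, c_t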

3.2.3. Gated Recurrent Units

Recurrent networks with gated recurrent units (GRU) are long-term memory networks like LSTMs, but emerged in 2014 [33, 34] as a simplification of LSTMs due to the high computational cost of LSTM networks. The GRU is one of the most commonly used versions that researchers have converged on and found to be robust and useful for many different problems. The use of gates in RNNs has made it possible to better capture very long-range dependencies, making RNNs much more effective. The LSTM is more powerful and more effective since it has three gates instead of two, but the GRU is a simpler model and is computationally faster, as it only has two gates: the update gate Γ_u and the relevance gate Γ_r. The Γ_u gate decides whether or not the memory state c_t is updated using the memory state candidate c̃_t. The Γ_r gate determines how relevant c_{t-1} is to compute the next candidate for c_t, that is, c̃_t. A GRU unit is defined by the following equations:

\Gamma_u = \sigma(W_u [c_{t-1}, x_t] + b_u) \quad (10)
\Gamma_r = \sigma(W_r [c_{t-1}, x_t] + b_r) \quad (11)
\tilde{c}_t = \tanh(W_c [\Gamma_r * c_{t-1}, x_t] + b_c) \quad (12)
c_t = \Gamma_u * \tilde{c}_t + (1 - \Gamma_u) * c_{t-1} \quad (13)
a_t = c_t \quad (14)

where W_u and W_r, and b_u and b_r are the weights and biases that govern the behavior of the Γ_u and Γ_r gates, respectively, and W_c and b_c are the weights and bias of the memory cell candidate c̃_t.

Figure 6: Hidden unit in a GRU.
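
Analogously, Eqs. (10)-(14) can be sketched as a single GRU step in NumPy; as before, the parameter dictionaries are only an illustrative convention.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, W, b):
    """One GRU step following Eqs. (10)-(14); the hidden state a_t equals c_t."""
    v = np.concatenate([c_prev, x_t])                     # [c_{t-1}, x_t]
    gamma_u = sigmoid(W["u"] @ v + b["u"])                # Eq. (10), update gate
    gamma_r = sigmoid(W["r"] @ v + b["r"])                # Eq. (11), relevance gate
    v_r = np.concatenate([gamma_r * c_prev, x_t])         # [Gamma_r * c_{t-1}, x_t]
    c_tilde = np.tanh(W["c"] @ v_r + b["c"])              # Eq. (12), memory candidate
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev    # Eq. (13), new memory state
    return c_t                                            # Eq. (14), a_t = c_t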

3.2.4. Bidirectional Recurrent Neural Network

There are some problems, in the field of natural language processing (NLP) for instance, where in order to predict a value of a data sequence at a given instant of time, information from the sequence both before and after that instant is needed. Bidirectional Recurrent Neural Networks (BRNN) address this issue to solve this kind of problems. The main disadvantage of BRNNs is that the entire data sequence is needed before the prediction can be made.

Standard networks compute the activation values for hidden units using a unidirectional feed forward process. However, in a BRNN, the prediction uses information from the past as well as information from the present and the future as input, using both forward and backward processing.

Thus, the prediction at time t, x̂_t, is obtained using an activation function g applied to the corresponding weights with both the forward and backward activations at time t. That is:

\hat{x}_t = g(W_x [a_t^f, a_t^b] + b_x) \quad (15)

where W_x and b_x are the weights and bias, a_t^f and a_t^b are the activation values of the hidden units computed by forward and backward processing, respectively, and g is an activation function.

Figure 7 presents the basic architecture of a BRNN. A BRNN can be seen as two RNNs together, where the different hidden units have two values, one computed forward and another one backward. In addition, the BRNN units can be standard RNN units or GRU or LSTM units. In fact, a BRNN with LSTM units is commonly used for many NLP problems.

Figure 7: Basic architecture of a BRNN.
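
A BRNN with LSTM units can be declared in a few lines with Keras. This is a minimal sketch for the forecasting setup used throughout this section; the dimensions are assumptions.

import tensorflow as tf

window, n_features, horizon = 24, 1, 4   # assumed dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(                        # forward and backward passes over the window
        tf.keras.layers.LSTM(32), input_shape=(window, n_features)),
    tf.keras.layers.Dense(horizon)                        # h-step forecast
])
model.compile(optimizer="adam", loss="mse")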

3.2.5. Deep Recurrent Neural Network

A Deep Recurrent Neural Network (DRNN) can be considered as a RNN with more than one layer, also called a stacked RNN. The hidden units can be standard RNN, GRU or LSTM units, and it can be unidirectional or bidirectional, as described in the previous sections. Figure 8 illustrates the architecture of a DRNN with 3 layers.

In general, a DRNN works quite well for time series forecasting, but its performance deteriorates when using very long data sequences as input. To address this issue, attention mechanisms, one of the most powerful ideas in deep learning [35], can be incorporated into the model. An attention model allows a neural network to pay attention to only part of an input data sequence while it is generating the output. This attention is modeled using weights, which are computed by a single-layer feed forward neural network [36].

Figure 8: Basic architecture of a DRNN.
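
A three-layer stacked recurrent network, as in Figure 8, can be sketched with Keras as follows; LSTM units are used here only as an example, and the intermediate layers need return_sequences=True so that the whole sequence is passed to the next layer. The dimensions are assumptions.

import tensorflow as tf

window, n_features, horizon = 24, 1, 4   # assumed dimensions

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(window, n_features)),  # layer 1
    tf.keras.layers.LSTM(32, return_sequences=True),                                    # layer 2
    tf.keras.layers.LSTM(16),                                                           # layer 3
    tf.keras.layers.Dense(horizon)
])
model.compile(optimizer="adam", loss="mse")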

3.3. Convolutional Neural Networks

Convolutional Neural Networks (CNN) were presented by Fukushima in [37] and are one of the most common architectures in image processing and computer vision [38]. CNNs have three kinds of layers: convolution, pooling and fully connected. The main task of the convolution layers is learning the features from the input data. For that, filters of a pre-defined size are applied to the data using the convolution operation between matrices, where the convolution is the sum of all element-wise products. Pooling reduces the size of the input, speeding up the computation and preventing overfitting. The most popular pooling methods are average and max pooling, which summarize the values using the mean or the maximum value, respectively. Once the features have been extracted by the convolutional layers, the forecasting is carried out using fully connected layers, also called dense layers, as in a DFFNN. The input data for these last fully connected layers are the flattened features resulting from the convolutional and pooling layers. Figure 9 depicts the overall architecture of a CNN.

Figure 9: Architecture of a CNN (input, convolution, pooling, fully connected and output layers).
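
The convolution-pooling-dense pipeline of Figure 9 can be adapted to one-dimensional time series input. The following Keras sketch is illustrative only; filter counts, kernel size and dimensions are assumptions.

import tensorflow as tf

window, n_features, horizon = 24, 1, 4   # assumed dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu",
                           input_shape=(window, n_features)),   # convolution: feature extraction
    tf.keras.layers.MaxPooling1D(pool_size=2),                  # pooling: reduces the input size
    tf.keras.layers.Flatten(),                                   # flattened features
    tf.keras.layers.Dense(50, activation="relu"),                # fully connected layer
    tf.keras.layers.Dense(horizon)                               # output layer
])
model.compile(optimizer="adam", loss="mse")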

Recently, a variant of CNN, called Temporal Convolutional Networks (TCN) [39], has emerged for data sequences, competing directly with DRNNs in terms of execution times and memory requirements.

TCNs have the same architecture as a DFFNN, but the activation values for each layer are computed using earlier values from the previous layer. Dilated convolution is used in order to select which values of the neurons from the previous layer will contribute to the values of the neurons in the next layer. Thus, this dilated convolution operation captures both local and temporal information.

The dilated convolution, F_d, is a function defined as follows:

F_d(x) = \sum_{i=0}^{K-1} f(i) \cdot x_{t - d \cdot i} \quad (16)

where d is the dilation factor and f is a filter of size K.

Figure 10 shows the architecture of a TCN when applying a dilated convolution using a filter of size 3 and dilation factors of 1, 2 and 4 for each layer, respectively.

Figure 10: Architecture of a TCN using a filter of size 3 (dilation factors d = 1, 2 and 4).

Moreover, it is necessary to use generic residual modules in addition to convolutional layers when deeper and larger TCNs are used, in order to achieve further stabilization. These generic residual blocks consist in adding the input data to the output before applying the activation function. Then, the TCN model can be defined as follows:

a_t^l = g(W_a^l F_d(a_t^{l-1}) + b_a^l + a_t^{l-1}) \quad (17)

where F_d(·) is the dilated convolution with factor d defined in Eq. (16), a_t^l is the value of the neuron of the l-th layer at time t, W_a^l and b_a^l are the weights and bias corresponding to the l-th layer, and g is the activation function.
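
Eq. (16) can be written directly as a causal dilated convolution over a sequence. The following NumPy sketch is illustrative; zero padding for positions without enough history is an assumption, and the filter values are arbitrary.

import numpy as np

def dilated_conv(x, f, d):
    """Dilated causal convolution of Eq. (16): F_d(x)_t = sum_i f(i) * x_{t - d*i}.
    x is a 1-D sequence, f a filter of size K and d the dilation factor.
    Positions with insufficient history are implicitly zero-padded."""
    K = len(f)
    out = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for i in range(K):
            j = t - d * i
            if j >= 0:
                out[t] += f[i] * x[j]
    return out

# Example: filter of size 3 with dilation factors 1, 2 and 4, as in Figure 10
x = np.arange(16, dtype=float)
h1 = dilated_conv(x, f=[0.25, 0.5, 0.25], d=1)
h2 = dilated_conv(h1, f=[0.25, 0.5, 0.25], d=2)
h3 = dilated_conv(h2, f=[0.25, 0.5, 0.25], d=4)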

4. Practical aspects

4.1. Implementation

The implementation of a multilayer perceptron is relatively simple. However, deep learning models are more complex, and their implementation requires a high level of technical expertise and a considerable time investment. For this reason, the profile of the deep learning expert has become one of the most demanded nowadays. In order to ease implementations and reduce the time needed to design and train a model, some companies have focused their work on developing frameworks that allow for the implementation, training and use of deep learning models.

The main idea of deep learning frameworks is to provide an interface that allows for the implementation of models without having to pay too much attention to the mathematical complexity behind them. There are several frameworks available in the literature. The choice of one or another will depend on several important factors, such as the type of architecture that can be implemented, support for distributed programming environments, or whether it can run on GPUs. In this sense, Table 1 summarizes the most widely used frameworks in the literature, where the term all includes the DFFNN, CNN, TCN, RNN, LSTM, GRU and BRNN architectures, and CPU stands for central processing unit.

Table 1: Deep learning frameworks.

Framework        Core lang.  Available interfaces                                Architecture      Distr.  CPU | GPU
TensorFlow [40]  C++         Python, JavaScript, C++, Java, Go, C#, Julia        All               Yes     Yes | Yes
H2O [41]         Java        Python, R, Scala, REST                              DFFNN             Yes     Yes | Yes
Dl4j [42]        Java        Python, Scala, Clojure, Kotlin, C, C++              All               Yes     Yes | Yes
PyTorch [43]     Lua         Python, C, C++                                      All               Yes     Yes | Yes
Caffe [44]       C++         Python, MATLAB                                      CNN               No      Yes | Yes
Neon [45]        Python      Python                                              All               No      Yes | Yes
Chainer [46]     Python      Python                                              All               Yes     Yes | Yes
Theano [47]      Python      Python                                              All               No      Yes | Yes
MXNet [48]       Python      Python, Scala, Julia, Clojure, Java, C++, R, Perl   All               Yes     Yes | Yes
ONNX [49]        Python      Python                                              CNN, DFFNN        No      Yes | No
PaddlePaddle     Python      Python                                              CNN               Yes     Yes | Yes
CNTK [50]        C++         Python, C++, C#                                     DFFNN, CNN, RNN   Yes     Yes | Yes

Table 1 shows that the predominant programming language for developing deep learning models is Python. In addition, most of the frameworks support distributed execution and the use of GPUs. Although the described frameworks facilitate the development of the models, some of them require too many lines of code to obtain a complete implementation. For this reason, high-level libraries based on the core of the frameworks have been developed, making programming even easier. Some examples of high-level libraries are Keras [51], Sonnet [52], Swift or Gluon [53], among others. The main advantage of using a high-level library is that the syntax can be reused for another base framework, in addition to facilitating the implementation. However, the lack of flexibility is the main disadvantage.

4.2. Hyper-parameter optimization

The combination of frameworks and high-level libraries greatly facilitates the implementation of models. However, there remains an important gap: the model optimization. This optimization will determine the quality of the model, and must be performed by adjusting its hyper-parameters. In deep learning there are two types of hyper-parameters: model parameters and optimization parameters. The model parameters must be adjusted in the model definition to obtain optimal performance. The optimization parameters are adjusted during the training phase of the model using the data set. Some of the most relevant hyper-parameters are described and categorised by network architecture in Table 2.

The number of hyper-parameters will depend on the network architecture to be used. In addition, the value of each one will be influenced by the characteristics of the problem and the data. This makes the task of optimizing a model a challenge for the research community. Moreover, taking into account the parameters described in Table 2, the immense number of possible combinations can be deduced. For this reason, various metaheuristics and optimization strategies are used. According to the literature, there are several strategies to optimize a set of hyper-parameters for deep learning models, as shown in Table 3.

Thus, the hyper-parameter optimization methods can be classified into four major blocks:

1. Trial-error. This optimization method is based on varying each of the hyper-parameters manually. It therefore implies a high time investment, while having a relatively low computational cost and a low search space, because it requires a user to modify the values manually each time a run is finished. Since in deep learning there are a large number of hyper-parameters and the values they can take are infinite, it is not advisable to use this optimization method.

2. Grid. The grid method explores the different possible combinations for a set of established hyper-parameters. This method covers a high search space, although it has a high computational cost associated with it, which makes it unviable in deep learning, let alone in big data environments.

3. Random. Random search allows a high search space to be covered, because infinite combinations of hyper-parameters can be generated (a minimal random-search sketch is given after the tables below). Within this group we can differentiate between totally random and guided search strategies, such as those based on metaheuristics. Examples of this type of search are genetic algorithms [69, 70], particle swarm optimization [71] or neuroevolution of augmenting topologies [72], among others. The wide search range, added to the medium cost involved in this search strategy, makes it one of the best methods for optimizing deep learning models. In addition, new hyper-parameter optimization metaheuristics are being published, such as the model bioinspired by the propagation of COVID-19 presented by the authors in [73].

4. Probabilistic. This optimization method keeps track of each of the evaluations. These evaluations are used to build a probabilistic model that assigns values to the different hyper-parameters. The most common algorithms to optimize hyper-parameters using probabilistic methods are those based on Bayesian approaches [74].

Table 2: Relevant hyper-parameters.

Hyper-parameter         Architectures  Description
Optimizer               All            Algorithm used to update the weights of each layer after each iteration [54].
Learning rate           All            It determines the size of the step at each iteration of the optimization method [55].
Number of epochs        All            Number of passes made over the whole training set [56].
Batch size              All            Number of sub-samples that the network uses to update the weights [57].
Hidden layers           All            It determines the depth of the neural network [58].
Activation function     All            Introduces non-linearity in the model, which allows the extraction of more complex knowledge [59].
Momentum                All            It prevents oscillations in the convergence of the method [60].
Weight initialization   All            It prevents the explosion or vanishing of the activations in the layers [61].
Dropout                 All            It eliminates certain connections between neurons in each iteration. It is used to prevent over-fitting [62].
L1/L2 regularization    All            It prevents over-fitting, penalizing weights that are too high so that the model does not depend on a single feature [63].
Units                   RNN, DFFNN     It determines the level of knowledge that is extracted by each layer. It is highly dependent on the size of the data used [58].
Kernel/filter           CNN            Matrix that moves over the input data. It allows the extraction of characteristics [64].
Stride                  CNN            The number of pixels the filter moves over the input matrix at each step [65].
Padding                 CNN            Number of null samples added to the data when they are processed by the kernel [66].
Number of channels      CNN            Depth of the matrices involved in the convolutions [67].
Pooling                 CNN            It allows reducing the number of parameters and calculations in the network [68].
nb_stacks               TCN            Number of stacks of residual blocks.
Dilations               TCN            A deep stack of dilated convolutions to capture long-range temporal patterns.

Table 3: Search strategies.

Strategy       Deep learning  Cost    Search space
Trial-error    No             Low     Low
Grid           No             High    High
Random         Yes            Medium  High
Probabilistic  Yes            Medium  Medium-Driven
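
As an illustration of the random strategy, the following sketch draws hyper-parameter configurations at random and keeps the one with the lowest validation loss. It is a minimal, library-agnostic example: the model builder, the search ranges and the data arrays (X_train, y_train, X_val, y_val) are assumptions, not part of any specific optimization library.

import random
import tensorflow as tf

def build_model(units, learning_rate, window=24, n_features=1, horizon=4):
    """Illustrative builder: one LSTM layer whose size and learning rate are searched."""
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(units, input_shape=(window, n_features)),
        tf.keras.layers.Dense(horizon)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model

best = (None, float("inf"))
for _ in range(20):                                       # number of random trials
    cfg = {"units": random.choice([16, 32, 64, 128]),
           "learning_rate": 10 ** random.uniform(-4, -2)}
    model = build_model(**cfg)
    # X_train, y_train, X_val, y_val are assumed to be prepared beforehand
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     epochs=30, batch_size=32, verbose=0)
    val_loss = min(hist.history["val_loss"])
    if val_loss < best[1]:
        best = (cfg, val_loss)
print("Best configuration:", best)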

There are many libraries for automated hyper-parameter optimization. However, very few are designed specifically for optimizing deep learning model hyper-parameters while also being compatible with the frameworks and high-level libraries described in Table 1. Table 4 summarizes a set of libraries for the optimization of hyper-parameters in deep learning models, classifying them by search strategy, support for distributed computing, programming language and compatible framework from Table 1. Note that it is not known whether HPOLib supports distributed computing or on which frameworks it works.

Table 4: Hyper-parameter optimization libraries.

Library        Search strategy        Distributed  Language   Framework
Elephas        Random, Probabilistic  Yes          Python     Keras
Hyperas        Random, Probabilistic  Yes          Python     Keras
Hyperopt [75]  Random, Probabilistic  Yes          Python     –
Dlopt [76]     Random                 No           Python     Keras
Talos [77]     Grid, Random           Yes          Python     Keras
Keras-tuner    Random                 Yes          Python     Keras
H2O [41]       Grid, Random           Yes          Python, R  H2O
BoTorch [78]   Probabilistic          Yes          Python     PyTorch
HPOLib [79]    Probabilistic          –            Python     –

4.3. Hardware performance

One of the most important decisions a researcher must make is to determine the physical resources needed to ensure that deep learning algorithms will find accurate models. Hence, this section overviews different hardware infrastructures typically used in deep learning contexts, given the increasing demand for better and more sophisticated hardware.

Although a CPU can be used to execute deep learning algorithms, the intensive computational requirements usually make the CPU physical resources insufficient (scalar architecture). For this reason, three different hardware architectures are typically used for mining information with deep learning frameworks: the GPU, the Tensor Processing Unit (TPU) and the Intelligence Processing Unit (IPU).

A GPU is a co-processor working alongside the CPU which is specifically designed to handle graphics in computing environments. GPUs can have hundreds or even thousands more cores than a CPU, but running at lower speeds. GPUs achieve high data parallelism with the single instruction, multiple data (SIMD) architecture and play an important role in the current artificial intelligence domain, with a wide variety of applications.

The first generation of TPUs was introduced in 2016, at the Google I/O Conference, and was specifically designed to run already trained neural networks. TPUs are custom application-specific integrated circuits (ASIC) built specifically for machine learning. Compared to GPUs (frequently used for the same tasks since 2016), TPUs are designed for a larger volume of reduced-precision calculation (for example, from 8 bits of precision) and lack hardware for rasterization/texture mapping. The term was coined for a specific chip designed for Google's TensorFlow framework. Generally speaking, TPUs have less accuracy compared to the computations performed on a normal CPU or GPU, but it is sufficient for the calculations they have to perform (an individual TPU can process more than 100 million pictures per day). Moreover, TPUs are highly optimized for large batches and CNNs and have the highest training throughput [80].

The IPU is completely different from today's CPU and GPU processors. It is a highly flexible, easy to use, parallel processor that has been designed from the ground up to deliver state-of-the-art performance on current machine learning models. More importantly, the IPU has been designed to allow new and emerging machine intelligence workloads to be realized. The IPU delivers much better arithmetic efficiency on small batch sizes for both training and inference, which results in faster model convergence in training, models that generalize better, the ability to parallelize over many more IPU processors to reduce training time for a given batch size, and also much higher throughput at lower latencies for inference. Another interesting feature is its lower power consumption compared to GPUs or TPUs (up to 20% less).

Table 5 summarizes the properties of the processing units explored in this section. The performance is measured in flops and the cost in USD. Note that for TPUs, cloud services are available at a price starting at 4.50 USD per hour (retrieved in March 2020).

Table 5: Processing unit properties.

Units  Architecture  Batch size  Performance (flops)  Cost (USD)
CPU    Scalar        Small       ~10^9                ~10^2
GPU    Vector        Large       ~10^12               ~10^3
TPU    ASIC          Large       ~10^12               –
IPU    Graph         Small       ~10^15               ~10^5

5. Applications

To motivate the relevance of the time series prediction problem, an analysis of the state of the art has been carried out, classifying the deep learning research works by application domain (such as energy and fuels, image and video, finance, environment, industry or health) and by the most widespread network architectures used (ENN, LSTM, GRU, BRNN, DFFNN, CNN or TCN). A summary of the works reviewed can be found in Table 6.

An overview of the works for each application domain is given in the following paragraphs, in order to highlight the goals reached for each method and field:

1. Energy and fuels. With the increasing use of renewable energies, accurate estimates are needed to improve power system planning and operation. Many techniques have been used to make predictions, including deep learning [196]. Reviewing the literature of the last few years, it can be concluded that the vast majority of deep learning architectures are suitable for this application area. For example, architectures based on LSTM [91], ENN [86, 87], GRU [95], BRNN [96] and TCN [101] have been used to predict electricity demand. LSTM [90] and CNN [100] networks have also been used to forecast photovoltaic energy load. A GRU has been used to forecast soot emission in diesel engines in [94]. An ensemble of DFFNN networks was developed by the authors in [99] to forecast general-purpose time series; after that, this strategy has also been used to forecast load demand time series [98]. In [92], the authors proposed an application of LSTM to forecast oil production. Hybrid architectures have also been used in this research field, for example, to forecast the price of carbon [103], the price of energy in electricity markets [102], energy consumption [104] or solar power generation [106].

Table 6: Summary of the works reviewed, classified by network architecture and application domain.

Energy and fuels: ENN [81, 82, 83, 84, 85, 86, 87]; LSTM [88, 89, 90, 91, 92, 93]; GRU [94, 95]; BRNN [96]; DFFNN [97, 98, 99]; CNN [100]; TCN [101]; Hybrid/Others [102, 103, 104, 105, 106, 107].
Image and video: ENN [108]; CNN [109, 110, 111, 112, 113, 114]; TCN [115, 116, 117]; Hybrid/Others [118].
Financial: LSTM [119, 120, 121, 122]; GRU [121, 122, 123]; BRNN [122]; DFFNN [124]; CNN [125, 126, 121, 127]; Hybrid/Others [128, 129, 130, 131, 132, 121, 133, 134].
Environmental: ENN [135, 136, 137, 138, 139, 140, 141, 142, 143]; LSTM [144, 145, 146, 147]; GRU [146, 148, 149, 150]; BRNN [151]; DFFNN [152]; CNN [153]; TCN [154]; Hybrid/Others [151, 155, 156].
Industry: ENN [157, 158]; LSTM [159, 160, 161]; GRU [162, 163]; BRNN [164, 165]; DFFNN [166, 165]; CNN [167]; TCN [168, 169]; Hybrid/Others [170, 171, 167].
Health: LSTM [172]; BRNN [173]; DFFNN [174]; CNN [175, 176, 114]; Hybrid/Others [177, 178, 179, 180, 181, 182].
Misc: GRU [183]; BRNN [184]; DFFNN [185]; CNN [186, 187, 188]; TCN [189, 190, 191, 192, 193]; Hybrid/Others [194, 195].

2. Image and video. Image and video analysis is a very broad area of research, with works related to many application domains. For example, Hu et al. conducted a wide study of deep learning for image-based cancer detection and diagnosis [197]. In [198], the authors summarized techniques and studies used to recognize video sequence actions from timed images. The authors presented in [108] an application of an ENN network to forecast and monitor slope displacement from photogrammetry performed by unmanned aerial vehicles. In [118], the authors combined GRU, RNN and CNN to classify satellite image time series. Although all these works offer highly competitive results, the use of convolution-based networks predominates in the literature to solve forecasting problems using image or video time series data. On the one hand, CNNs have been used to forecast combustion instability [109], temporal dependencies in satellite images [112] or the speed of large-scale traffic [110], or to detect coronary artery stenosis [114], among others. On the other hand, TCNs are booming when it comes to analyzing images and videos. For example, Yunqi et al. used a TCN to estimate density maps from videos [115]. The authors in [117] also applied a TCN to summarize generic videos. Another interesting work in which images were used can be found in [116], where a TCN model was used to dynamically detect stress through facial photographs.

3. Financial. Financial analysis has been a challenging issue for decades. Therefore, there are many research works related to this application area, as described in [199]. Additionally, various architectures such as CNN [125, 126, 127], DNN [124], GRU [123] or LSTM [120, 119] have been used. Some authors compare several of these architectures, analyzing which one offers better results [122]. Although these studies are widespread, the complexity of the problem requires the search for new methodologies and architectures [129, 121, 134, 133, 131, 132, 130].

4. Environmental. Environmental data analysis is one of the most popular areas for the scientific community. Many of these works are also based on the application of deep learning techniques to forecast time series. The authors in [151] applied CNN and LSTM to forecast wind speed or temperature using meteorological data from Beijing, China. Other authors focused on a single specific variable. For instance, TCN, GRU, ENN, BRNN and LSTM architectures were used to forecast wind-related information in [156, 148, 135, 137, 138, 147, 150]. Water quality and demand were also predicted using TCN and ENN in [154, 141]. An application of LSTM-based neural networks for correlated time series prediction was also proposed by Wan et al. in [144]. Furthermore, carbon dioxide emissions [140], floods [144] or the NH3 concentration in swine houses [200] were also predicted using deep learning techniques, in particular ENN networks.

5. Industry. In the industry sector, deep learning techniques are also being used to carry out tasks of different kinds [201]. For instance, TCN and BRNN can be used for traffic flow forecasting [168, 164]. LSTM can be used for multiple purposes, such as process planning [161], construction equipment recognition [159] or improving the performance of organizations [160, 162]. The authors in [165] used a DFFNN to forecast bath and metal height features in the electrolysis process. ENN and GRU networks have also been used, for example, to forecast the useful life or degradation of materials [158, 157, 196]. Deep learning techniques are also widely applied to architecture, as can be seen in the in-depth study conducted by the authors in [202]. It can be concluded that almost all network architectures have been used, given the wide variety of problems existing in this area.

6. Health. The use of deep learning architectures in the area of health has become common in recent years [203, 197]. However, time series prediction using deep learning models is not very widespread, as time series are generally short in this field and recurrent network training involves a high computational cost. The authors of [174] conducted a comprehensive study of time series prediction models in health care diagnosis and prognosis with a focus on cardiovascular disease. Instead, it is usual to apply convolution-based architectures or to implement hybrid models. For example, the authors used CNN to accelerate the computation of magnetic resonance fingerprinting in [176]. CNN was also used for monitoring the sleep stage in [177], for detecting premature ventricular contractions [175] or to forecast sepsis [181]. In [180], the authors used a BP network to forecast the incidence rate of pneumonia. Other network architectures, such as LSTM, can be used to forecast the status of critical patients according to their vital functions [172]. A recent study conducted by the authors in [182] uses several deep learning architectures to forecast COVID-19 cases.

7. Miscellaneous. In recent years, the TCN has been one of the most widely tested general-purpose architectures for time series forecasting [191, 193, 190, 183]. However, any of the other network architectures can be applied to time series from miscellaneous application domains not classified in Table 6. For example, CNN and RNN can be used to detect human activity [187] or hybrid models to detect anomalies [195]. Likewise, readers interested in cybersecurity can find a detailed description in [204, 185].

From the previous analysis of Table 6, two main conclusions can be drawn. First, several methods have not yet been applied to particular application fields. Second, the existence of these gaps encourages conducting research along such lines.

655 6. Conclusions

656 Deep learning has proven to be one of the most powerful machine learning

657 techniques for solving complex problems dealing with big data. Nowadays,

658 most of the data, mainly generated by smart devices, are time series, and

659 their prediction is one of the most frequent problems in almost all research

660 areas. Thus, these two topics have been jointly analyzed in this survey to

661 provide an overview of deep learning techniques applied to time series

662 forecasting. Firstly, the deep learning architectures most widely used for

663 time series data in recent years have been described, with special emphasis

664 on important practical aspects that can have a great influence on the reported

665 results. In particular, the focus has been placed on the search for hyper-

666 parameters, the frameworks for deploying the different architectures, and the

667 available hardware to alleviate the computationally demanding training of the

668 network architectures. Secondly, a study of the deep neural networks used to

669 predict time series in different application domains has been carried out, with

670 the aim of providing a good comparative framework for future works and of

671 showing which architectures have not been sufficiently tested in some

672 applications.

673 Acknowledgements

674 The authors would like to thank the Spanish Ministry of Science, Innova-

675 tion and Universities for the support under project TIN2017-88209-C2-1-R.

676 Also, this work has been partially supported by the General Directorate

677 of Scientific Research and Technological Development (DGRSDT, Algeria),

678 under the PRFU project (ref: C00L07UN060120200003).

679 References

680 [1] Plageras AP, Psannis KE, Stergiou C, et al. Efficient IoT-based sensor

681 big data collection-processing and analysis in smart buildings. Future

682 Generation Computer Systems, 82:349–357, 2018.

683 [2] Patil HP and Atique M. CDNB: CAVIAR-dragonfly optimization with

684 naive bayes for the sentiment and affect analysis in social media. Big

685 Data, 8(2):107–124, 2020.

686 [3] Gama J. Knowledge Discovery from Data Streams. Chapman &

687 Hall/CRC, 2010.

688 [4] Al-Jarrah OY, Yoo PD, Muhaidat S, et al. Efficient machine learning

689 for big data: a review. Big Data Research, 2:87–93, 2015.

690 [5] Dhar V, Sun C, and Batra P. Transforming finance into vision: concur-

691 rent financial time series as convolutional net. Big data, 7(4):276–285,

692 2019.

693 [6] Nguyen G, Dlugolinsky S, Bobák M, et al. Machine learning and deep

694 learning frameworks and libraries for large-scale data mining: a survey.

695 Artificial Intelligence Review, 52:77–124, 2019.

696 [7] Maji P and Mullins R. On the reduction of computational complexity

697 of deep convolutional neural networks. Entropy, 20(4):305, 2018.

698 [8] Schmidhuber J. Deep learning in neural networks: an overview. Neural

699 Networks, 61:85–117, 2015.

700 [9] Makridakis S, Wheelwright SC, and Hyndman RJ. Forecasting methods

701 and applications. John Wiley and Sons, 2008.

702 [10] Chatfield C. The analysis of time series: an introduction. Chapman

703 & Hall/CRC, 2003.

704 [11] Box GEP and Jenkins GM. Time series analysis: forecasting and

705 control. John Wiley and Sons, 2008.

706 [12] Martínez-Álvarez F, Troncoso A, Asencio-Cortés G, and Riquelme JC.

707 A survey on data mining techniques applied to electricity-related time

708 series forecasting. Energies, 8(11):13162–13193, 2015.

709 [13] Zhang Q, Yang LT, Chen Z, et al. A survey on deep learning for big

710 data. Information Fusion, 42:146–157, 2018.

711 [14] Fawaz HI, Forestier G, Weber J, Idoumghar L, et al. Deep learning

712 for time series classification: a review. Data Mining and Knowledge

713 Discovery, 33(4):917–963, 2019.

714 [15] Bagnall A, Lines J, Vickers W, et al. The UEA & UCR time series clas-

715 sification repository. [Link]

716 2017.

717 [16] Mayer R and Jacobsen HA. Scalable deep learning on distributed

718 infrastructures: challenges, techniques, and tools. ACM Computing

719 Surveys, 53(1):Article 3, 2020.

720 [17] Buuren S. Flexible Imputation of Missing Data. Chapman &

721 Hall/CRC, 2012.

722 [18] Maronna RA, Martin RD, and Yohai VJ. Robust Statistics: Theory

723 and Methods. Wiley, 2006.

724 [19] Fu TC. A review on time series data mining. Engineering Applications

725 of Artificial Intelligence, 24(1):164–181, 2011.

726 [20] Shumway RH and Stoffer DS. Time series analysis and its applications

727 (with R examples). Springer, 2011.

728 [21] Maronna RA, Martin RD, and Yohai VJ. Robust statistics: theory and

729 methods. Wiley, 2007.

730 [22] Hyndman RJ and Athanasopoulos G. Forecasting: principles and prac-

731 tice. OTexts, 2018.

732 [23] Wang X, Kang Y, Hyndman RJ, et al. Distributed ARIMA models for

733 ultra-long time series. arXiv e-prints, arXiv:2007.09577, 2020.

734 [24] Rakthanmanon T, Campana B, Mueen A, Batista G, et al. Addressing

735 big data time series: Mining trillions of time series subsequences under

736 dynamic time warping. ACM Transactions on Knowledge Discovery

737 from Data, 7(3):10, 2013.

738 [25] Torres JF, Galicia A, Troncoso A, et al. A scalable approach based on

739 deep learning for big data time series forecasting. Integrated Computer-

740 Aided Engineering, 25(4):335–348, 2018.

741 [26] Galicia A, Talavera-Llames RL, Troncoso A, et al. Multi-step forecast-

742 ing for big data time series based on ensemble learning. Knowledge-

743 Based Systems, 163:830–841, 2019.

744 [27] Talavera-Llames R, Pérez-Chacón R, Troncoso A, et al. Big data time

745 series forecasting based on nearest neighbors distributed computing

746 with spark. Knowledge-Based Systems, 161(1):12–25, 2018.

747 [28] Talavera-Llames R, Pérez-Chacón R, Troncoso A, and Martínez-

748 Álvarez F. MV-kWNN: A novel multivariate and multi-output weighted

749 nearest neighbors algorithm for big data time series forecasting. Neu-

750 rocomputing, 353:56–73, 2019.

751 [29] Pérez-Chacón R, Asencio-Cortés G, Martínez Álvarez F, et al. Big data

752 time series forecasting based on pattern sequence similarity and its

753 application to the electricity demand. Information Sciences, 540:160–

754 174, 2020.

755 [30] Rumelhart D, Hinton G, and Williams R. Learning representations

756 by back-propagating errors. Nature, 323:533–536, 1986.

757 [31] Elman JL. Finding structure in time. Cognitive Science, 14(2):179–211,

758 1990.

759 [32] Hochreiter S and Schmidhuber J. Long short-term memory. Neural

760 Computation, 9(8):1735–1780, 1997.

761 [33] Chung J, Gulcehre C, Cho K, et al. Empirical evaluation of gated

762 recurrent neural networks on sequence modeling. In Proceedings of the

763 Neural Information Processing Systems, pages 1–12, 2014.

764 [34] Cho K, Merrienboer BV, Bahdanau D, et al. On the properties of neu-

765 ral machine translation: encoder-decoder approaches. In Proceedings

766 of SSST-8, pages 103–111, 2014.

767 [35] Bahdanau D, Cho K, and Bengio Y. Neural machine translation by

768 jointly learning to align and translate. In Proceedings of the Interna-

769 tional Conference on Learning Representations, pages 149–158, 2015.

770 [36] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption

771 generation with visual attention. In Proceedings of the International

772 Conference on Machine Learning, pages 2048–2057, 2015.

773 [37] Fukushima K. Neocognitron: a self-organizing neural network model

774 for a mechanism of pattern recognition unaffected by shift in position.

775 Biological Cybernetics, 36(4):193–202, 1980.

776 [38] Zhang W, Hasegawa A, Matoba O, et al. Shift-invariant neural network

777 for image processing: Learning and generalization. Applications of

778 Artificial Neural Networks III, 1709(1):257–268, 1992.

779 [39] Alla S and Adari SK. Beginning anomaly detection using Python-based

780 deep learning. Apress, 2019.

781 [40] Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale ma-

782 chine learning on heterogeneous systems. [Link] 2015.

783 [41] Candel A, LeDell E, Arora A, et al. Deep learning with H2O.

784 [Link] 2015.

785 [42] Eclipse Deeplearning4j development team. DL4J: deep learning for

786 Java. [Link] 2016.

787 [43] Paszke A, Gross S, Massa F, et al. Pytorch: an imperative style,

788 high-performance deep learning library, 2019.

789 [44] Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture

790 for fast feature embedding. arXiv e-prints, arXiv:1408.5093, 2014.

791 [45] Intel Nervana systems. Neon deep learning framework, 2017.

792 [46] Tokui S, Okuta R, Akiba T, et al. Chainer: A deep learning framework

793 for accelerating the research cycle. In Proceedings of International

794 Conference on Knowledge Discovery and Data Mining, pages 2002–

795 2011, 2019.

796 [47] Theano development team. Theano: A Python framework for

797 fast computation of mathematical expressions. arXiv e-prints,

798 arXiv:1605.02688, 2016.

799 [48] Tianqi C, Li M, Li Y, et al. Mxnet: A flexible and efficient machine

800 learning library for heterogeneous distributed systems. arXiv e-prints,

801 arXiv:1512.01274, 2015.

802 [49] Bai J, Lu F, Zhang K, et al. ONNX: Open neural network exchange.

803 [Link] 2019.

804 [50] Seide F and Agarwal A. CNTK: Microsoft’s open-source deep-learning

805 toolkit. In Proceedings of the ACM SIGKDD International Conference

806 on Knowledge Discovery and Data Mining, pages 2135–2135, 2016.

807 [51] Chollet F. Keras. [Link] 2015.

808 [52] DeepMind. Sonnet, 2019.

809 [53] Guo J, He H, He T, et al. Gluoncv and gluonnlp: deep learning in

810 computer vision and natural language processing. Journal of Machine

811 Learning Research, 21:1–7, 2020.

812 [54] Choi D, Shallue CJ, Nado Z, Lee J, Maddison CJ, and Dahl GE. On

813 empirical comparisons of optimizers for deep learning. arXiv e-prints,

814 arXiv:1910.05446, 2019.

815 [55] You K, Long M, Wang J, and Jordan MI. How does learning rate decay

816 help modern neural networks? In Proceedings of the International

817 Conference on Learning Representations, pages 1–14, 2019.

818 [56] Sinha S, Singh TN, Singh V, and Verma A. Epoch determination for

819 neural network by self-organized map (SOM). Computational Geo-

820 sciences, 14:199–206, 2010.

821 [57] Masters D and Luschi C. Revisiting small batch training for deep

822 neural networks. arXiv e-prints, arXiv:1804.07612, 2018.

823 [58] Shafi I, Ahmad J, Shah SI, et al. Impact of varying neurons and hidden

824 layers in neural network architecture for a time frequency application.

825 In Proceedings of the IEEE International Multitopic Conference, pages

826 188–193, 2006.

827 [59] Ding B, Qian H, and Zhou J. Activation functions and their character-

828 istics in deep neural networks. In Proceedings of the Chinese Control

829 and Decision Conference, pages 1836–1841, 2018.

830 [60] Sutskever I, Martens J, Dahl G, et al. On the importance of initializa-

831 tion and momentum in deep learning. In Proceedings of the Interna-

832 tional Conference on Machine Learning, pages 1139–1147, 2013.

833 [61] Kumar SK. On weight initialization in deep neural networks. arXiv

834 e-prints, arXiv:1704.08863, 2017.

835 [62] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A simple way to

836 prevent neural networks from overfitting. Journal of Machine Learning

837 Research, 15(56):1929–1958, 2014.

838 [63] Ng AY. Feature selection, l1 vs. l2 regularization, and rotational invari-

839 ance. In Proceedings of the ACM International Conference on Machine

840 Learning, pages 78–85, 2004.

841 [64] Mairal J, Koniusz P, Harchaoui Z, and Schmid C. Convolutional ker-

842 nel networks. In Proceedings of the Neural Information Processing

843 Systems, pages 1–9, 2014.

844 [65] Zaniolo L and Marques O. On the use of variable stride in convolutional

845 neural networks. Multimedia Tools and Applications, 79(19):13581–

846 13598, 2020.

847 [66] Dwarampudi M and Reddy NVS. Effects of padding on LSTMs and

848 CNNs. arXiv e-prints, arXiv:1903.07288, 2019.

849 [67] Zhu H, An Z, Yang C, et al. Rethinking the number of channels for the

850 convolutional neural network. arXiv e-prints, arXiv:1909.01861, 2019.

851 [68] Scherer F, Müller A, and Behnke S. Evaluation of pooling operations

852 in convolutional architectures for object recognition. In Proceedings of

853 Artificial Neural Networks, pages 92–101, 2010.

854 [69] Ma B, Li X, Xia Y, et al. Autonomous deep learning: A genetic DCNN

855 designer for image classification. Neurocomputing, 379:152–161, 2020.

856 [70] Itano F, De-Abreu-De-Sousa MA, and Del-Moral-Hernandez E. Ex-

857 tending MLP ANN hyper-parameters optimization by using genetic

858 algorithm. In Proceedings of the IEEE International Joint Conference

859 on Neural Networks, pages 1–8, 2018.

860 [71] Kennedy J and Eberhart R. Particle swarm optimization. In Proceed-

861 ings of International Conference on Neural Networks, volume 4, pages

862 1942–1948, 1995.

863 [72] Stanley KO and Miikkulainen R. Evolving neural networks through

864 augmenting topologies. Evolutionary Computation, 10(2):99–127,

865 2002.

866 [73] Martínez-Álvarez F, Asencio-Cortés G, Torres JF, et al. Coronavirus

867 Optimization Algorithm: A Bioinspired Metaheuristic Based on the

868 COVID-19 Propagation Model. Big Data, 8(4):308–322, 2020.

869 [74] Ranjit MP, Ganapathy G, Sridhar K, et al. Efficient deep learning hy-

870 perparameter tuning using cloud infrastructure: intelligent distributed

871 hyperparameter tuning with bayesian optimization in the cloud. In

872 Proceedings of International Conference on Cloud Computing, pages

873 520–522, 2019.

874 [75] Bergstra J, Yamins D, and Cox DD. Making a science of model search:

875 hyperparameter optimization in hundreds of dimensions for vision ar-

876 chitectures. In Proceedings of the International Conference on Inter-

877 national Conference on Machine Learning, page 115–123, 2013.

878 [76] Camero A, Toutouh J, and Alba E. Dlopt: deep learning optimization

879 library, 2018.

880 [77] Autonomio. Talos. [Link] 2019.

881 [78] Balandat M, Karrer B, Jiang DR, et al. BoTorch: Programmable

882 Bayesian Optimization in PyTorch. arXiv e-prints, arXiv:1910.06403,

883 2019.

884 [79] Eggensperger K, Feurer M, Hutter F, et al. Towards an empirical

885 foundation for assessing bayesian optimization of hyperparameters. In

886 Proceedings of the Neural Information Processing Systems, pages 1–5,

887 2013.

888 [80] Wang YE, Wei GY, and Brooks D. Benchmarking TPU, GPU, and

889 CPU platforms for deep learning. arXiv e-prints, arXiv:1907.10701,

890 2019.

891 [81] Yu D, Wang Y, Liu H, et al. System identification of PEM fuel cells us-

892 ing an improved Elman neural network and a new hybrid optimization

893 algorithm. Energy Reports, 5:1365–1374, 2019.

894 [82] Zheng Y, Yao Z, Z H, et al. Power generation forecast of top gas

895 recovery turbine unit based on Elman model. In Proceedings of the

896 IEEE Chinese Control Conference, pages 7498–7501, 2018.

897 [83] Ruiz LGB, Rueda R, Cuéllar MP, et al. Energy consumption fore-

898 casting based on Elman neural networks with evolutive optimization.

899 Expert Systems with Applications, 92:380–389, 2018.

900 [84] Wang J, Lv Z, Liang Y, et al. Fouling resistance prediction based

901 on GA–Elman neural network for circulating cooling water with elec-

902 tromagnetic anti-fouling treatment. Journal of the Energy Institute,

903 92(5):1519–1526, 2019.

904 [85] Li W, Jiao Z, Du L, et al. An indirect RUL prognosis for lithium-ion

905 battery under vibration stress using Elman neural network. Interna-

906 tional Journal of Hydrogen Energy, 44(23):12270–12276, 2019.

907 [86] Yu Y, Wang X, and Bründlinger R. Improved Elman neural network

908 short-term residents load forecasting considering human comfort in-

909 dex. Journal of Electrical Engineering and Technology, 14(6):2315–

910 2322, 2019.

911 [87] Li D, Wang H, Zhang Y, et al. Power grid load state information

912 perception forecasting technology for battery energy storage system

913 based on Elman neural network. In Proceedings of information Tech-

914 nology, Networking, Electronic and Automation Control Conference,

915 pages 914–917, 2019.

916 [88] Abdel-Nasser M and Mahmoud K. Accurate photovoltaic power fore-

917 casting models using deep LSTM-RNN. Neural Computing and Appli-

918 cations, 31(7):2727–2740, 2019.

919 [89] Khodabakhsh A, Ari I, Bakır M, et al. Forecasting multivariate time-

920 series data using LSTM and mini-batches. In Proceedings of Data

921 Engineering and Communications Technologies, pages 121–129, 2020.

922 [90] Gao M, Li J, Hong F, et al. Day-ahead power forecasting in a large-

923 scale photovoltaic plant based on weather classification using LSTM.

924 Energy, 187:115838, 2019.

925 [91] Muzaffar S and Afshari A. Short-term load forecasts using LSTM

926 networks. Proceedings of Energy Procedia, 158:2922–2927, 2019.

927 [92] Song X, Liu Y, Xue L, et al. Time-series well performance predic-

928 tion based on long short-term memory (LSTM) neural network model.

929 Journal of Petroleum Science and Engineering, 186:106682, 2020.

930 [93] Wang JQ, Du Y, and Wang J. LSTM based long-term energy con-

931 sumption prediction with periodicity. Energy, 197:117197, 2020.

932 [94] Gokhan A, Yilmaz E, Unel M, et al. Estimating soot emission in diesel

933 engines using gated recurrent unit networks. IFAC-PapersOnLine,

934 52(5):544–549, 2019.

935 [95] Wu W, Liao W, Miao J, et al. Using gated recurrent unit network to

936 forecast short-term load considering impact of electricity price. Pro-

937 ceedings of Energy Procedia, 158:3369–3374, 2019.

938 [96] Tang X, Dai Y, Wang T, et al. Short-term power load forecasting based

939 on multi-layer bidirectional recurrent neural network. IET Generation,

940 Transmission and Distribution, 13(17):3847–3854, 2019.

941 [97] Shao Z, Zheng Q, Yang S, et al. Modeling and forecasting the electricity

942 clearing price: a novel BELM based pattern classification framework

943 and a comparative analytic study on multi-layer BELM and LSTM.

944 Energy Economics, 86:104648, 2020.

945 [98] Qiu X, Ren Y, Suganthan PN, et al. Empirical mode decomposition

946 based ensemble deep learning for load demand time series forecasting.

947 Applied soft Computing Journal, 54:246–255, 2017.

948 [99] Qiu X, Zhang L, Ren Y, et al. Ensemble deep learning for regression

949 and time series forecasting. In Proceedings of the IEEE Symposium

950 Series on Computational Intelligence in Ensemble Learning, pages 1–

951 6, 2014.

952 [100] Manohar M, Koley E, Ghosh S, et al. Spatio-temporal information

953 based protection scheme for PV integrated microgrid under solar irra-

954 diance intermittency using deep convolutional neural network. Inter-

955 national Journal of Electrical Power and Energy Systems, 116:105576,

956 2020.

957 [101] Mishra K, Basu S, and Maulik U. DaNSe: a dilated causal convo-

958 lutional network based model for load forecasting. Lecture Notes in

959 Computer Science, 11941:234–241, 2019.

960 [102] Qiao W and Yang Z. Forecast the electricity price of U.S. using a

961 wavelet transform-based hybrid model. Energy, 193:116704, 2020.

962 [103] Ji L, Zou Y, He K, et al. Carbon futures price forecasting based with

963 ARIMA-CNN-LSTM model. Procedia Computer Science, 162:33–38,

964 2019.

965 [104] Kim TY and Cho SB. Predicting residential energy consumption using

966 CNN-LSTM neural networks. Energy, 182:72–81, 2019.

967 [105] Shen M, Xu Q, Wang K, et al. Short-term bus load forecasting method

968 based on cnn-gru neural network. Lecture Notes in Electrical Engineer-

969 ing, 585:711–722, 2020.

970 [106] AlKandari M and Ahmad I. Solar power generation forecasting using

971 ensemble approach based on deep learning and statistical methods.

972 Applied Computing and Informatics, 6:1–20, 2020.

973 [107] Kong Z, Tang B, Deng L, et al. Condition monitoring of wind turbines

974 based on spatio-temporal fusion of SCADA data by convolutional neu-

975 ral networks and gated recurrent units. Renewable Energy, 146:760–

976 768, 2020.

977 [108] Wang S, Zhang Z, Ren Y, et al. UAV photogrammetry and AFSA-

978 Elman neural network in slopes displacement monitoring and forecast-

979 ing. KSCE Journal of Civil Engineering, 24(1):19–29, 2020.

980 [109] Sarkar S, Lore KG, Sarkar S, et al. Early detection of combustion

981 instability from hi-speed flame images via deep learning and symbolic

982 time series analysis. In Proceedings of the Annual Conference of the

983 Prognostics and Health Management Society, pages 353–362, 2015.

984 [110] Ma X, Dai Z, He Z, et al. Learning traffic as images: a deep convo-

985 lutional neural network for large-scale transportation network speed

986 prediction. Sensors, 17(4):818, 2017.

987 [111] Chen W and Shi K. A deep learning framework for time series classifi-

988 cation using relative position matrix and convolutional neural network.

989 Neurocomputing, 359:384–394, 2019.

990 [112] Ienco D, Interdonato R, Gaetano R, et al. Combining Sentinel-1 and

991 Sentinel-2 satellite image time series for land cover mapping via a

992 multi-source deep learning architecture. Journal of Photogrammetry

993 and Remote Sensing, 158:11–22, 2019.

994 [113] Martínez-Arellano G, Terrazas G, and Ratchev S. Tool wear classifica-

995 tion using time series imaging and deep learning. International Journal

996 of Advanced Manufacturing Technology, 104(9-12):3647–3662, 2019.

997 [114] Wu W, Zhang J, Xie H, et al. Automatic detection of coronary artery

998 stenosis by convolutional neural network with temporal constraint.

999 Computers in Biology and Medicine, 118:103657, 2020.

1000 [115] Miao Y, Han J, Gao Y, and Zhang B. ST-CNN: Spatial-Temporal

1001 Convolutional Neural Network for crowd counting in videos. Pattern

1002 Recognition Letters, 125:113–118, 2019.

1003 [116] Feng S. Dynamic facial stress recognition in temporal convolutional

1004 network. In Proceedings of the Communications in Computer and In-

1005 formation Science, pages 698–706, 2019.

1006 [117] Zhang Y, Kampffmeyer M, Liang X, et al. Dilated temporal relational

1007 adversarial network for generic video summarization. Multimedia Tools

1008 and Applications, 78(24):35237–35261, 2019.

1009 [118] Interdonato R, Ienco D, Gaetano R, et al. DuPLO: A DUal view

1010 Point deep Learning architecture for time series classificatiOn. ISPRS

1011 Journal of Photogrammetry and Remote Sensing, 149:91–104, 2019.

1012 [119] Yan H and Ouyang H. Financial time series prediction based on deep

1013 learning. Wireless Personal Communications, 102(2):683–700, 2018.

1014 [120] Sismanoglu G, Onde M, Kocer F, et al. Deep learning based forecasting

1015 in stock market with big data analytics. In Proceedings of the IEEE

1016 Scientific Meeting on Electrical-Electronics and Biomedical Engineer-

1017 ing and Computer Science, pages 10057–10059, 2019.

1018 [121] Jayanth BA, Harish RDS, and Nair BB. Applicability of deep learning

1019 models for stock price forecasting an empirical study on bankex data.

1020 Procedia Computer Science, 143:947–953, 2018.

1021 [122] Jiang M, Liu J, Zhang L, et al. An improved stacking framework for

1022 stock index prediction by leveraging tree-based ensemble models and

1023 deep learning algorithms. Physica A: Statistical Mechanics and its

1024 Applications, 541:122272, 2020.

1025 [123] Wu W, Wang Y, Fu J, et al. Preliminary study on interpreting stock

1026 price forecasting based on tree regularization of GRU. In Proceedings of

1027 Communications in Computer and Information Science, volume 1059,

1028 pages 476–487, 2019.

1029 [124] Orimoloye LO, Sung MC, Ma T, et al. Comparing the effectiveness of

1030 deep feedforward neural networks and shallow architectures for predict-

1031 ing stock price indices. Expert Systems with Applications, 139:112828,

1032 2020.

1033 [125] Makarenko AV. Deep learning algorithms for estimating Lyapunov

1034 exponents from observed time series in discrete dynamic systems. In

1035 Proceedings of International Conference Stability and Oscillations of

1036 Nonlinear Control Systems, pages 1–4, 2018.

1037 [126] Dingli A and Fournier KS. Financial time series forecasting - a deep

1038 learning approach. International Journal of Machine Learning and

1039 Computing, 7(5):118–122, 2017.

1040 [127] Kelotra A and Pandey P. Stock Market Prediction Using Optimized

1041 Deep-ConvLSTM Model. Big Data, 8(1):5–24, 2020.

1042 [128] Ni L, Li Y, Wang X, et al. Forecasting of Forex time series data based

1043 on deep learning. Procedia Computer Science, 147:647–652, 2019.

1044 [129] Bao W, Yue J, and Rao Y. A deep learning framework for financial

1045 time series using stacked autoencoders and long-short term memory.

1046 Plos One, 12(7):e0180944, 2017.

1047 [130] Munkhdalai L, Li M, Theera-Umpon N, et al. VAR-GRU: a hybrid

1048 model for multivariate financial time series prediction. Lecture Notes

1049 in Artificial Intelligence, 12034:322–332, 2020.

1050 [131] Chen CT, Chiang LK, Huang YC, et al. Forecasting interaction of

1051 exchange rates between fiat currencies and cryptocurrencies based on

1052 deep relation networks. In Proceedings of the IEEE International Con-

1053 ference on Agents, pages 69–72, 2019.

1054 [132] Berradi Z and Lazaar M. Integration of principal component analysis

1055 and recurrent neural network to forecast the stock price of Casablanca

1056 stock exchange. Procedia Computer Science, 148:55–61, 2019.

1057 [133] Wang Q, Xu W, Huang X, et al. Enhancing intraday stock price

1058 manipulation detection by leveraging recurrent neural networks with

1059 ensemble learning. Neurocomputing, 347:46–58, 2019.

1060 [134] Long W, Lu Z, and Cui L. Deep learning-based feature engineering for

1061 stock price movement prediction. Knowledge-Based Systems, 164:163–

1062 173, 2019.

1063 [135] Liu H, Tian HQ, Liang XF, et al. Wind speed forecasting approach

1064 using secondary decomposition algorithm and Elman neural networks.

1065 Applied Energy, 157:183–194, 2015.

1066 [136] Yu C, Li Y, Xiang H, et al. Data mining-assisted short-term wind

1067 speed forecasting by wavelet packet decomposition and Elman neural

1068 network. Journal of Wind Engineering and Industrial Aerodynamics,

1069 175:136–143, 2018.

1070 [137] Liu H, Wei MX, and Fei LY. Wind speed forecasting method based on

1071 deep learning strategy using empirical wavelet transform, long short

1072 term memory neural network and Elman neural network. Energy Con-

1073 version and Management, 156:498–514, 2018.

1074 [138] Zhang Y and Pan G. A hybrid prediction model for forecasting

1075 wind energy resources. Environmental Science and Pollution Research,

1076 27(16):19428–19446, 2020.

1077 [139] Zhang L, Xie Y, Chen A, et al. A forecasting model based on enhanced

1078 Elman neural network for air quality prediction. Lecture Notes in

1079 Electrical Engineering, 518:65–74, 2019.

1080 [140] Huang Y and Shen L. Elman neural network optimized by firefly

1081 algorithm for forecasting China’s carbon dioxide emissions. Commu-

1082 nications in Computer and Information Science, 951:36–47, 2018.

1083 [141] Xiao D, Hou S, Li WZ, et al. Hourly campus water demand forecasting

1084 using a hybrid EEMD-Elman neural network model. In Sustainable

1085 development of water resources and hydraulic engineering in China.

1086 Environmental earth sciences, pages 71–80. 2019.

1087 [142] Shen W, Fu X, Wang R, et al. A prediction model of NH3 concentration

1088 for swine house in cold region based on empirical mode decomposition

1089 and Elman neural network. Information Processing in Agriculture,

1090 6(2):297–305, 2019.

1091 [143] Wan X, Yang Q, Jiang P, et al. A hybrid model for real-time probabilis-

1092 tic flood forecasting using Elman neural network with heterogeneity of

1093 error distributions. Water Resources Management, 33(11):4027–4050,

1094 2019.

1095 [144] Wan H, Guo S, Yin K, et al. CTS-LSTM: LSTM-based neural net-

1096 works for correlated time series prediction. Knowledge-Based Systems,

1097 191:105239, 2019.

1098 [145] Freeman BS, Taylor G, Gharabaghi B, et al. Forecasting air quality

1099 time series using deep learning. Journal of the Air and Waste Man-

1100 agement Association, 68(8):866–886, 2018.

1101 [146] De-Melo GA, Sugimoto DN, Tasinaffo PM, et al. A new approach to

1102 river flow forecasting: LSTM and GRU multivariate models. IEEE

1103 Latin America Transactions, 17(12):1978–1986, 2019.

1104 [147] Chen J, Zeng GQ, Zhou W, et al. Wind speed forecasting using

1105 nonlinear-learning ensemble of deep learning time series prediction and

1106 extremal optimization. Energy Conversion and Management, 165:681–

1107 695, 2018.

1108 [148] Niu Z, Yu Z, Tang W, et al. Wind power forecasting using attention-

1109 based gated recurrent unit network. Energy, 196:117081, 2020.

1110 [149] Li W, Wu H, Zhu N, et al. Prediction of dissolved oxygen in a fishery

1111 pond based on gated recurrent unit (GRU). Information Processing in

1112 Agriculture, 2020.

1113 [150] Peng Z, Peng S, Fu L, et al. A novel deep learning ensemble model

1114 with data denoising for short-term wind speed forecasting. Energy

1115 Conversion and Management, 207:112524, 2020.

1116 [151] Jin X, Yu X, Wang X, et al. Prediction for time series with CNN and

1117 LSTM. Lecture Notes in Electrical Engineering, 582:631–641, 2020.

1118 [152] Maqsood H, Mehmood I, Maqsood M, et al. A local and global event

1119 sentiment based efficient stock exchange forecasting using deep learn-

1120 ing. International Journal of Information Management, 50:432–451,

1121 2020.

1122 [153] O’Shea TJ, Roy T, and Clancy TC. Over-the-air deep learning based

1123 radio signal classification. IEEE Journal on Selected Topics in Signal

1124 Processing, 12(1):168–179, 2018.

1125 [154] Zhang Y, Thorburn PJ, and Fitch P. Multi-task temporal convolu-

1126 tional network for predicting water quality sensor data. In Proceedings

1127 of Neural Information Processing Communications in Computer and

1128 Information Science, pages 122–130, 2019.

1129 [155] Sun Y, Zhao Z, Ma X, et al. Short-timescale gravitational microlens-

1130 ing events prediction with ARIMA-LSTM and ARIMA-GRU hybrid

1131 model. Lecture Notes in Computer Science, 11473:224–238, 2019.

1132 [156] Liu H, Mi X, Li Y, et al. Smart wind speed deep learning based multi-

1133 step forecasting model using singular spectrum analysis, convolutional

1134 gated recurrent unit network and support vector regression. Renewable

1135 Energy, 143:842–854, 2019.

1136 [157] Li X, Zhang L, Wang Z, et al. Remaining useful life prediction for

1137 lithium-ion batteries based on a hybrid model combining the long

1138 short-term memory and Elman neural networks. Journal of Energy

1139 Storage, 21:510–518, 2019.

1140 [158] Yang L, Wang F, Zhang J, et al. Remaining useful life prediction of

1141 ultrasonic motor based on Elman neural network with improved par-

1142 ticle swarm optimization. Measurement: Journal of the International

1143 Measurement Confederation, 143:27–38, 2019.

1144 [159] Rashid KM and Louis J. Times-series data augmentation and deep

1145 learning for construction equipment activity recognition. Advanced

1146 Engineering Informatics, 42:100944, 2019.

1147 [160] Huang X, Zanni-Merk C, and Crémilleux B. Enhancing deep learning

1148 with semantics: an application to manufacturing time series analysis.

1149 Procedia Computer Science, 159:437–446, 2019.

1150 [161] Mehdiyev N, Lahann J, Emrich A, et al. Time series classification using

1151 deep learning for process planning: a case from the process industry.

1152 Procedia Computer Science, 114:242–249, 2017.

1153 [162] Wang S, Chen J, Wang H, et al. Degradation evaluation of slewing

1154 bearing using HMM and improved GRU. Measurement: Journal of the

1155 International Measurement Confederation, 146:385–395, 2019.

1156 [163] Wang J, Yan J, Li C, et al. Deep heterogeneous GRU model for pre-

1157 dictive analytics in smart manufacturing: application to tool wear pre-

1158 diction. Computers in Industry, 111:1–14, 2019.

1159 [164] Bohan H and Yun B. Traffic flow prediction based on BRNN. In

1160 Proceedings of the IEEE International Conference on Electronics In-

1161 formation and Emergency Communication, pages 320–323, 2019.

1162 [165] Pasias A, Vafeiadis T, Ioannidis D, et al. Forecasting bath and metal

1163 height features in electrolysis process. In Proceedings of the Interna-

1164 tional Conference on Distributed Computing in Sensor Systems, pages

1165 312–317, 2019.

1166 [166] Jiang P, Chen C, and Liu X. Time series prediction for evolutions

1167 of complex systems: a deep learning approach. In Proceedings of rhe

1168 IEEE International Conference on Control and Robotics Engineering,

1169 pages 1–6, 2016.

1170 [167] Canizo M, Triguero I, A Conde, et al. Multi-head CNN–RNN for

1171 multi-time series anomaly detection: an industrial case study. Neuro-

1172 computing, 363:246–260, 2019.

1173 [168] Kuang L, Hua C, Wu J, et al. Traffic volume prediction based on

1174 multi-sources GPS trajectory data by temporal convolutional network.

1175 Mobile Networks and Applications, pages 1–13, 2020.

1176 [169] Wu P, Sun J, Chang X, et al. Data-driven reduced order model with

1177 temporal convolutional neural network. Computer Methods in Applied

1178 Mechanics and Engineering, 360:112766, 2020.

1179 [170] Varona B, Monteserin A, and Teyseyre A. A deep learning approach

1180 to automatic road surface monitoring and pothole detection. Personal

1181 and Ubiquitous Computing, pages 1–16, 2019.

1182 [171] Cai M, Pipattanasomporn M, and Rahman S. Day-ahead building-level

1183 load forecasts using deep learning vs. traditional time-series techniques.

1184 Applied Energy, 236:1078–1088, 2019.

1185 [172] da Silva DB, Schmidt D, da Costa CA, da Rosa Righi R, and Eskofier

1186 B. Deepsigns: A predictive model based on deep learning for the

1187 early detection of patient health deterioration. Expert Systems with

1188 Applications, 165:113905, 2021.

1189 [173] Yu W, Kim Y, and Mechefske C. Remaining useful life estimation using

1190 a bidirectional recurrent neural network based autoencoder scheme.

1191 Mechanical Systems and Signal Processing, 129:764–780, 2019.

1192 [174] Bui C, Pham N, Vo A, Tran A, Nguyen A, and Le T. Time series

1193 forecasting for healthcare diagnosis and prognostics with the focus on

1194 cardiovascular diseases. In Proceedings of the International Conference

1195 on the Development of Biomedical Engineering in Vietnam, pages 809–

1196 818, 2018.

1197 [175] Liu Y, Huang Y, Wang J, et al. Detecting premature ventricular con-

1198 traction in children with deep learning. Journal of Shanghai Jiaotong

1199 University, 23(1):66–73, 2018.

1200 [176] Hoppe E, Körzdörfer G, Würfl T, et al. Deep learning for magnetic

1201 resonance fingerprinting: A new approach for predicting quantitative

1202 parameter values from time series. In Studies in Health Technology and

1203 Informatics, volume 243, pages 202–206. IOS Press, 2017.

1204 [177] Chambon S, Galtier MN, Arnal PJ., et al. A deep learning archi-

1205 tecture for temporal sleep stage classification using multivariate and

1206 multimodal time series. IEEE Transactions on Neural Systems and

1207 Rehabilitation Engineering, 26(4):758–769, 2018.

1208 [178] Lauritsen SM, Kalør ME, Kongsgaard EL, et al. Early detection of sep-

1209 sis utilizing deep learning on electronic health record event sequences.

1210 Artificial Intelligence in Medicine, 104:101820, 2020.

1211 [179] Chen X, He J, Wu X, et al. Sleep staging by bidirectional long short-

1212 term memory convolution neural network. Future Generation Com-

1213 puter Systems, 109:188–196, 2020.

1214 [180] Liang liang M and Fu peng T. Pneumonia Incidence Rate Predictive

1215 Model of Nonlinear Time Series Based on Dynamic Learning Rate BP

1216 Neural Network. In Proceedings of the Fuzzy Information and Engi-

1217 neering, pages 739–749, 2010.

1218 [181] Sarafrazi S, Choudhari RS, Mehta C, Mehta HK, et al. Cracking the

1219 “sepsis” code: Assessing time series nature of EHR data, and using deep

1220 learning for early sepsis prediction. In Proceedings of the Computing

1221 in Cardiology, pages 1–4, 2019.

1222 [182] Zeroual A, Harrou F, Dairi A, and Sun Y. Deep learning methods for

1223 forecasting covid-19 time-series data: A comparative study. Chaos,

1224 Solitons & Fractals, 140:110121, 2020.

1225 [183] Zhang X, Shen F, Zhao J, et al. Time series forecasting using GRU

1226 neural network with multi-lag after decomposition. Lecture Notes in

1227 Computer Sciences, 10638:523–532, 2017.

1228 [184] Zhao X, Xia L, Zhang J, et al. Artificial neural network based model-

1229 ing on unidirectional and bidirectional pedestrian flow at straight corri-

1230 dors. Physica A: statistical mechanics and its applications, 547:123825,

1231 2020.

1232 [185] Imamverdiyev Y and Abdullayeva F. Deep learning method for denial

1233 of service attack detection based on restricted boltzmann machine. Big

1234 Data, 6(2):159–169, 2018.

1235 [186] Munir M, Siddiqui SA, Dengel A, et al. DeepAnT: a deep learning

1236 approach for unsupervised anomaly detection in time series. IEEE

1237 Access, 7:1991–2005, 2019.

1238 [187] Zebin T, Scully PJ, and Ozanyan KB. Human activity recognition

1239 with inertial sensors using a deep learning approach. In Proceedings of

1240 IEEE Sensors, pages 1–3, 2017.

1241 [188] Bendong Z, Huanzhang L, Shangfeng C, et al. Convolutional neural

1242 networks for time series classification. Journal of Systems Engineering

1243 and Electronics, 28(1):162–169, 2017.

1244 [189] Jiang W, Wang Y, and Tang Y. A sequence-to-sequence transformer

1245 premised temporal convolutional network for chinese word segmenta-

1246 tion. In Proceedings of Parallel Architectures, Algorithms and Pro-

1247 gramming, pages 541–552, 2020.

1248 [190] Shao J, Shen H, Cao Q, et al. Temporal convolutional networks for

1249 popularity prediction of messages on social medias. Lecture Notes in

1250 Computer Science, 11772:135–147, 2019.

1251 [191] Chen Y, Kan Y, Chen Y, et al. Probabilistic forecasting with temporal

1252 convolutional neural network. Neurocomputing, 399:491–501, 2020.

1253 [192] Xi R, Hou M, Fu M, et al. Deep dilated convolution on multi-

1254 modality time series for human activity recognition. In Proceedings of

1255 the IEEE International Joint Conference on Neural Networks, pages

1256 53381–53396, 2018.

1257 [193] Wang R, Peng C, Gao J, et al. A dilated convolution network-based

1258 LSTM model for multi-step prediction of chaotic time series. Compu-

1259 tational and Applied Mathematics, 39(1):1–22, 2020.

1260 [194] Rodrigues F, Markou I, and Pereira FC. Combining time-series and

1261 textual data for taxi demand prediction in event areas: a deep learning

1262 approach. Information Fusion, 49:120–129, 2019.

1263 [195] Kalinin MO, Lavrova DS, and Yarmak AV. Detection of threats in

1264 cyberphysical systems based on deep learning methods using multi-

1265 dimensional time series. Automatic Control and Computer Sciences,

1266 52(8):912–917, 2018.

1267 [196] Wang H, Lei Z, Zhang X, et al. A review of deep learning for renewable

1268 energy forecasting. Energy Conversion and Management, 198:111799,

1269 2019.

1270 [197] Hu Z, Tang J, Wang Z, et al. Deep learning for image-based cancer

1271 detection and diagnosis - a survey. Pattern Recognition, 83:134–149,

1272 2018.

1273 [198] Atto AM, Benoit A, and Lambert P. Timed-image based deep learn-

1274 ing for action recognition in video sequences. Pattern Recognition,

1275 104:107353, 2020.

1276 [199] Sezer OB, Gudelek M, and Ozbayoglu AM. Financial time series fore-

1277 casting with deep learning: A systematic literature review: 2005–2019.

1278 Applied Soft Computing, 90:106181, 2020.

1279 [200] Shen Z, Zhang Y, Lu J, et al. A novel time series forecasting model

1280 with deep learning. Neurocomputing, 396:302–313, 2020.

1281 [201] Wang Y, Zhang D, Liu Y, et al. Enhancing transportation systems via

1282 deep learning: a survey. Transportation Research part C: Emerging

1283 Technologies, 99:144–163, 2019.

1284 [202] Kamilaris A and Prenafeta-Boldú FX. Deep learning in agriculture: a

1285 survey. Computers and Electronics in Agriculture, 147:70–90, 2018.

1286 [203] Rui Z, Ruqiang Y, Zhenghua C, et al. Deep learning and its applica-

1287 tions to machine health monitoring. Mechanical Systems and Signal

1288 Processing, 115:213–237, 2019.

1289 [204] Mahdavifar S and Ghorbani AA. Application of deep learning to cy-

1290 bersecurity: a survey. Neurocomputing, 347:149–176, 2019.
