Abstract

Time series forecasting has become a very intensive field of research, and interest in it keeps growing. Deep neural networks have proved to be powerful and are achieving high accuracy in many application fields. For these reasons, they are one of the most widely used methods of machine learning to solve problems dealing with big data nowadays. In this work, the time series forecasting problem is initially formulated along with its mathematical fundamentals. Then, the most common deep learning architectures that are currently being successfully applied to predict time series are described, highlighting their advantages and limitations. Particular attention is given to feed-forward networks, recurrent neural networks (including Elman, long short-term memory, gated recurrent units and bidirectional networks) and convolutional neural networks. Practical aspects, such as the setting of values for hyper-parameters and the choice of the most suitable frameworks for the successful application of deep learning to time series, are also provided and discussed. Several fruitful research fields in which the analyzed architectures have obtained a good performance are reviewed. As a result, research gaps have been identified in the literature for several domains of application, thus expecting to inspire new and better forms of knowledge.

Keywords: time series forecasting, deep learning, big data
∗ Corresponding author
Email addresses: jftormal@[Link] (J. F. Torres), dhadjout@[Link] (D. Hadjout), [Link]@[Link] (A. Sebaa), fmaralv@[Link] (F. Martínez-Álvarez), atrolor@[Link] (A. Troncoso)
1 Equally contributing authors.
1. Introduction

The volume of available data has experienced a dramatic increase during the last decade due to the massive deployment of smart sensors [1] or the social media platforms [2], which generate data on a continuous basis [3]. However, this situation poses new challenges, such as storing these huge volumes of data. Much effort is currently being put on efficiently collecting, organizing and analyzing big data with the aim of extracting useful knowledge that anticipates future moves, which leads to more efficient operations and higher profits [5].
From all the learning paradigms that are currently being used in big data environments, deep learning is particularly well suited, since its performance keeps improving as the scale of data increases [6]. Most of the layer computations in deep learning can be done in parallel by, for instance, powerful graphics processing units (GPUs). In that way, scalable distributed models are easier to build and they provide better accuracy at a much higher speed. Higher depth allows for the extraction of more complex features, at the expense of higher training costs [7].
Many deep learning architectures have been proposed in the literature [8]. Pattern recognition and classification were the first and most relevant uses of deep learning, achieving great success in speech and image processing. However, the use of deep learning for regression problems is becoming quite popular nowadays, mainly due to the need of dealing with data indexed over time. Such is the case of time series and, more specifically, of time series forecasting: it is difficult to find physical, economic, biological or chemical phenomena without variables that evolve over time. For this reason, the proposal of time series forecasting approaches is fruitful and can be useful in almost any application domain.
Statistical approaches have been used from the 70s on, especially those based on autoregressive models. With the emergence of machine learning and its powerful regression methods [12], many models were proposed that outperformed the classical methods in most research works. However, methods based on deep learning are currently achieving superior results and much effort is being put into developing new architectures for time series forecasting.
For all the above mentioned, the primary motivation behind this survey is to review the deep learning architectures currently applied to time series forecasting, together with the application fields in which they have been successful and, as a result, research gaps have been identified in the literature. Although other related surveys have been published during the last years, the majority of them provided a general overview rather than focusing on time series forecasting.
Zhang et al. [13] reviewed emerging research on deep learning models, including their mathematical formulation, for big data feature learning. Another remarkable work can be found in [14], in which the authors introduced the time series classification problem and provided an open source framework. Mayer and Jacobsen published a survey about scalable deep learning on distributed infrastructures, in which the focus was put on techniques and tools, along with an insightful discussion about the existing challenges in this field [16].
The rest of the paper is structured as follows. The problem definition and mathematical formulation for time series can be found in Section 2. Sections 3 and 4 describe the deep learning architectures most commonly used to forecast time series, along with practical aspects for their successful application. Section 5 overviews the most relevant papers, sorted by fields, in which deep learning has been applied to forecast time series. Finally, the lessons learned and the conclusions drawn are discussed in Section 6.
2. Problem definition

This section provides the time series definition (Section 2.1), along with a description of the main time series components (Section 2.2). The mathematical formulation for the time series forecasting problem is introduced in Section 2.3. Final remarks about the length of the time series can be found in Section 2.4.
2.1. Time series definition

A time series is a sequence of values of a variable recorded and observed over time. While the time is a variable measured on a continuous basis, the values in a time series are sampled at constant intervals (fixed frequency). This definition holds true for many applications, but not every time series can be modeled in this way, due to some of the following reasons:
1. Missing data in time series is a very common problem due to the reliability of data collection. To deal with these values there are a lot of strategies, but those based on imputing the missing information or on omitting the entire record are the most widely used [17].
2. Outlying data is also an issue that appears very frequently in time series. The most common strategies are either to detect and remove these values or, simply, to incorporate them into the model [18].
3. When data are collected at irregular time periods, they can be either called unevenly spaced time series or, if big enough, data streams [3].
Some of these issues can be handled natively by the used model, but if the data is collected irregularly, this should be accounted for in the model. In this survey, the time series pre-processing is out of scope, but the reader is referred to [19] for further details.
2.2. Time series components

Time series are usually characterized by three components: trend, seasonality and irregular components, also known as residuals [20]. Such components are described below:
1. Trend. It is the general movement that the time series exhibits during the observation period, without considering seasonality or other irregularities. In some texts, this component is also known as long-term variation. Although there are different kinds of trends in time series, the most common ones are the linear, exponential and parabolic trends.
2. Seasonality. It groups the oscillations that are repeated at regular intervals and may provide useful information when time periods need to be compared, along with their time, magnitude and direction. Seasonality can be caused, for instance, by weather conditions or festivities.
3. Residuals. Once the trend and cyclic oscillations have been calculated and removed, some residual values remain. These values can be, sometimes, high enough to mask the trend and the seasonality. In this case, the term outlier is used to refer to these residuals, and robust statistics are usually applied to cope with them [21]. These fluctuations can be of diverse origin, which makes the prediction almost impossible. However, if by any chance, this origin can be detected or modeled, they can be incorporated into the forecasting model.
Most real-world time series present a meaningful irregular component and are not stationary (mean and variance are not constant over time), turning this component into the most challenging one to model. For this reason, making accurate predictions for them is extremely difficult, and many classical forecasting methods try to decompose the target time series into these three components and make a prediction for each of them. It is precisely in this component where data mining-based techniques have been shown to be particularly useful.
Time series are usually represented graphically, where the x-axis identifies the time, whereas the y-axis shows the values recorded at punctual time stamps (xt). This representation allows the visual detection of the most relevant features, such as the trend, seasons and cycles, or the existence of anomalous data or outliers. Figure 1 depicts a time series, xt, using an additive model with seasonality of constant frequency and amplitude over time, represented by the function sin(x); a linear trend where changes over time are consistently made by the same amount, represented by the function 0.0213x; and residuals, represented by random noise.
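Purely as an illustration of the additive decomposition just described (the component functions sin(x) and 0.0213x come from the text; the noise scale and series length are arbitrary assumptions), a series such as the one in Figure 1 can be generated as follows:

```python
import numpy as np

t = np.arange(50)
seasonality = np.sin(t)                    # periodic component with constant amplitude
trend = 0.0213 * t                         # linear long-term variation
residuals = 0.5 * np.random.rand(len(t))   # irregular component (noise)

x = seasonality + trend + residuals        # additive model x_t
```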
Time series models can be either univariate (one time-dependent variable) or multivariate (several time-dependent variables). Although classical models may dramatically differ between a univariate and a multivariate system, the majority of the deep learning models can handle both cases indistinctly.
Figure 1: Synthetic time series (bottom right panel) obtained as the sum of a seasonality component sin(x) (top left), a linear trend 0.0213x (top right) and a residual component (bottom left). The horizontal axes represent the time t and the vertical axes the values xt.

On the one hand, let y = y(t − L), . . . , y(t − 1), y(t), y(t + 1), . . . , y(t + h)
be a given univariate time series with L values in the historical data, where each y(t − i), for i = 0, . . . , L, represents the recorded value of the variable at time t − i. The goal is to predict the next value of the series, y(t + 1), denoted by ŷ(t + 1), with the aim of minimizing the error, which is typically represented as a function of y(t + 1) − ŷ(t + 1). This prediction can also be made when the horizon of prediction, h, is greater than one, that is, when the objective is to predict the h next values after y(t), that is, y(t + i), with i = 1, . . . , h. In this situation, the best prediction is reached when the error over the whole prediction horizon is minimized.
On the other hand, when n time series are considered simultaneously (multivariate case), they can be expressed in the matrix form:

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} y_1(t-L) & \dots & y_1(t-1) & y_1(t) & y_1(t+1) & \dots & y_1(t+h) \\ y_2(t-L) & \dots & y_2(t-1) & y_2(t) & y_2(t+1) & \dots & y_2(t+h) \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ y_n(t-L) & \dots & y_n(t-1) & y_n(t) & y_n(t+1) & \dots & y_n(t+h) \end{pmatrix} \qquad (1)$
where $y_i(t - m)$ identifies the i-th time series, with $i = \{1, 2, \ldots, n\}$, being $m = \{0, 1, \ldots, L\}$ the historical data and current sample and $m = \{-1, -2, \ldots, -h\}$ the future h values. Usually, there is one target time series (the one to be predicted) and the remaining ones are denoted as independent variables.
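In practice, this formulation is turned into a supervised learning problem by slicing the series with a sliding window. The following sketch is illustrative only; the helper name, window length and horizon are assumptions, not values taken from the survey:

```python
import numpy as np

def make_supervised(y, L, h):
    """Build (X, Y) pairs: L+1 past values as input, the next h values as target."""
    X, Y = [], []
    for t in range(L, len(y) - h):
        X.append(y[t - L:t + 1])       # y(t-L), ..., y(t)
        Y.append(y[t + 1:t + 1 + h])   # y(t+1), ..., y(t+h)
    return np.array(X), np.array(Y)

series = np.sin(np.arange(200) / 5.0)   # toy univariate series
X, Y = make_supervised(series, L=24, h=4)
print(X.shape, Y.shape)                 # (172, 25) (172, 4)
```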
2.4. Time series length

Another key issue is the length of the time series. Depending on the number of samples, long or short time series can be defined. It is well-known that the Box-Jenkins models do not work well for long time series, mainly due to the time consuming process of parameter optimization and to the inclusion of information which is no longer useful to model the current behavior of the series.
How to deal with these issues is highly related to the purpose of the model. Flexible non-parametric models could be used, but this still assumes that the model structure will work over the whole period of the data, which is not always true. A better approach consists in allowing the model to vary over time. This can be done by either adjusting a parametric model with time-varying parameters, or by adjusting a non-parametric model with a time-based kernel. But if the goal is only to forecast a few observations, it is simpler to fit a model using only the most recent samples.
Although distributed versions of classical models such as ARIMA have been recently published [23], it remains challenging to deal with such time series. For this reason, several machine learning algorithms adapted to deal with ultra-long time series, or big data time series, have been published in recent years [24]. These models make use of distributed and scalable computation.
Deep learning models can deal with time series in a scalable way and provide accurate forecasts [25]. Ensemble learning can also be useful to forecast big data time series [26], as can approaches based on well-established methods such as nearest neighbours [27, 28] or pattern sequence similarity [29].
3. Deep learning architectures

This section provides a theoretical tour of deep learning for time series prediction in big data environments. First, a description of the most used architectures in the literature to predict time series is made. Then, a state-of-the-art analysis is carried out, where the deep learning works and frameworks are reviewed.
3.1. Deep Feed Forward Neural Network

Deep Feed Forward Neural Networks (DFFNN), also called multi-layer perceptrons, arose due to the inability of single-layer neural networks to learn certain non-linear functions. A DFFNN is composed of an input layer, an output layer and different hidden layers, as shown in Figure 2. In addition, each hidden layer has a certain number of neurons to be determined.

The relationships between the neurons of two consecutive layers are modelled by weights, which are calculated during the training phase of the network. Once the weights are computed, the values of the output neurons of the network are obtained using a feed forward process defined by the following equation:

$a^{l} = g(W_a^{l}\, a^{l-1} + b_a^{l}) \qquad (2)$
where $a^{l}$ are the activation values in the l-th layer, that is, a vector composed of the values of the neurons of the l-th layer, $W_a^{l}$ and $b_a^{l}$ are the weights and bias corresponding to the l-th layer, and g is the activation function. Therefore, the $a^{l}$ values are computed using the activation values of the l − 1 layer, $a^{l-1}$, as input. In time series forecasting, the rectified linear unit function (ReLU) is commonly used as activation function for all layers, except for the output layer, which generally uses a linear activation to obtain the predicted values.
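A minimal sketch of such a network using the Keras API is shown below; the number of layers, neurons and training settings are illustrative assumptions, not values recommended by the survey:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

L, h = 24, 4  # illustrative window length and forecasting horizon

# DFFNN: stacked Dense layers with ReLU, linear output for the h-step forecast.
model = keras.Sequential([
    layers.Input(shape=(L + 1,)),          # y(t-L), ..., y(t)
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(h, activation="linear"),  # predicted y(t+1), ..., y(t+h)
])
model.compile(optimizer="adam", loss="mse")

# X, Y would be built with a sliding window as described in Section 2.
X, Y = np.random.rand(500, L + 1), np.random.rand(500, h)
model.fit(X, Y, epochs=10, batch_size=32, verbose=0)
```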
For all network architectures, the values of some hyper-parameters have to be set. Some of them, such as the number of hidden layers and the number of neurons, define the network architecture, and other hyper-parameters, such as the learning rate, the momentum, the number of iterations or the minibatch size, among others, have a great influence on the convergence of the gradient descent methods. The optimal choice of these hyper-parameters is a challenging task, further discussed in Section 4.
3.2. Recurrent Neural Networks

Recurrent Neural Networks (RNN) are specifically designed to deal with sequential data, such as time series, in which the data have a temporal dependency between them.
Traditional feed forward neural networks cannot take into account these dependencies, and RNNs arise precisely to address this problem [30]. Therefore, the input data in the architecture of a RNN are both past and current data. There are different types of architectures, depending on the number of data inputs and outputs in the network, such as one to one (one input and one output), one to many (one input and many outputs), many to one (many inputs and one output), and many to many (many inputs and outputs). The most common RNNs are many to one for classification problems or many to many for machine translation or time series forecasting, for instance. In addition, for the case of a time series, the length of the input data sequence is usually different from the size of the output data sequence, which usually is the number of values to be predicted.
The general scheme of a many to many architecture for the forecasting of time series is shown in Figure 3, where xi and x̂i are the actual and predicted values of the time series at time i, and h is the number of samples to be predicted.
Figure 3: Many to many RNN architecture for time series forecasting, mapping the input sequence x1, . . . , xt to the predicted values x̂t+1, . . . , x̂t+h.
The most widely used RNNs for time series forecasting are briefly described below.
3.2.1. Elman Neural Network

The Elman network (ENN) was the first RNN and incorporated the previous state of the hidden units in order to make predictions on data sequences [31]. The ENN consists of a classical one-layer feed-forward network, but the hidden layer is connected to a new layer, called context layer, using fixed weights equal to one, as shown in Figure 4. The main function of the neurons of this context layer is to save a copy of the activation values of the neurons of the hidden layer at the previous time step. The hidden activations are computed as:

$a_t = g(W_a\, a_{t-1} + U_a\, x_t + b_a) \qquad (3)$

where $a_t$ are the values of the neurons of the hidden layer at the t state, $x_t$ is the current input, $a_{t-1}$ is the information saved in the context hidden units, $W_a$, $U_a$ and $b_a$ are the weights and the bias, and g the activation function.
Figure 4: Architecture of the Elman neural network, with input data, hidden layer, context layer and output.
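As a hedged illustration (not code from the survey), Keras' SimpleRNN layer implements this kind of Elman-style recurrence; the window length, number of units and horizon below are arbitrary choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Elman-style recurrence: SimpleRNN feeds its hidden state a_t back at each step.
model = keras.Sequential([
    layers.Input(shape=(24, 1)),             # sequence of 24 univariate observations
    layers.SimpleRNN(32, activation="tanh"),
    layers.Dense(1),                         # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
```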
3.2.2. Long Short-Term Memory
Standard basic RNNs suffer from the vanishing gradient problem, which consists in the gradient decreasing as the number of layers increases. Indeed, for deep RNNs with a high number of layers, the gradient practically becomes null, preventing the learning of the network. For this reason, these networks have a short-term memory and do not obtain good results when dealing with long sequences that require memorizing all the information contained in them. Long Short-Term Memory (LSTM) networks emerged in order to solve the vanishing gradient problem [32].
For this purpose, LSTM networks use three gates to keep longstanding relevant information and discard irrelevant information. These gates are the Γf forget gate, the Γu update gate and the Γo output gate. Γf decides what information should be thrown away or saved: a value close to 0 means that the past information is forgotten, while a value close to 1 means that it remains. Γu decides what new information is added to the memory cell, whose state is updated using both Γf and Γu. Finally, Γo decides which is the output value that is passed to the next hidden state.
The information of the $a_{t-1}$ previous hidden unit and the information of the $x_t$ current input are passed through the σ sigmoid activation function to compute all the gate values, and through the tanh activation function to compute the candidate memory cell, as follows:

$\tilde{c}_t = \tanh(W_c\,[a_{t-1}, x_t] + b_c) \qquad (4)$

$\Gamma_u = \sigma(W_u\,[a_{t-1}, x_t] + b_u) \qquad (5)$

$\Gamma_f = \sigma(W_f\,[a_{t-1}, x_t] + b_f) \qquad (6)$

$\Gamma_o = \sigma(W_o\,[a_{t-1}, x_t] + b_o) \qquad (7)$

$c_t = \Gamma_u * \tilde{c}_t + \Gamma_f * c_{t-1} \qquad (8)$

$a_t = \Gamma_o * \tanh(c_t) \qquad (9)$
where $W_u$, $W_f$ and $W_o$, and $b_u$, $b_f$ and $b_o$ are the weights and biases that govern the behavior of the Γu, Γf and Γo gates, respectively, and $W_c$ and $b_c$ are the weights and bias of the candidate memory cell. Figure 5 shows how a hidden unit works in an LSTM recurrent network.
Figure 5: LSTM hidden unit, combining the forget, update and output gates with the candidate memory cell c̃t to produce ct and at.
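As a hedged illustration of how such a unit is used in practice (layer sizes, window and horizon are arbitrary assumptions), an LSTM-based forecaster can be written with Keras as:

```python
from tensorflow import keras
from tensorflow.keras import layers

L, h = 24, 4  # illustrative window and horizon

model = keras.Sequential([
    layers.Input(shape=(L, 1)),   # L past observations of a univariate series
    layers.LSTM(64),              # gated memory cell keeps long-range information
    layers.Dense(h),              # h-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
```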
3.2.3. Gated Recurrent Units
Recurrent networks with gated recurrent units (GRU) are long-term memory networks like LSTMs, but they emerged in 2014 [33, 34] as a simplification of LSTMs due to the high computational cost of the LSTM networks. GRU is one of the most commonly used versions that researchers have converged on and found to be robust and useful for many different problems. The use of gates in RNNs has made it possible to improve the capture of very long-range dependencies, making RNNs much more effective. The LSTM is more powerful and more effective since it has three gates instead of two, but GRU is a simpler model and it is computationally faster, as it only has two gates, the Γu update gate and the Γr relevance gate. The Γu gate decides whether the memory cell is updated with the candidate value. The Γr gate determines how relevant $c_{t-1}$ is to compute the next candidate $\tilde{c}_t$:

$\tilde{c}_t = \tanh(W_c\,[\Gamma_r * c_{t-1}, x_t] + b_c) \qquad (10)$

$\Gamma_u = \sigma(W_u\,[c_{t-1}, x_t] + b_u) \qquad (11)$

$\Gamma_r = \sigma(W_r\,[c_{t-1}, x_t] + b_r) \qquad (12)$

$c_t = \Gamma_u * \tilde{c}_t + (1 - \Gamma_u) * c_{t-1} \qquad (13)$

$a_t = c_t \qquad (14)$
where $W_u$ and $W_r$, and $b_u$ and $b_r$ are the weights and the bias that govern the behavior of the Γu and Γr gates, respectively, and $W_c$ and $b_c$ are the weights and bias of the $\tilde{c}_t$ memory cell candidate.
Figure 6: GRU hidden unit, combining the relevance and update gates with the candidate memory cell c̃t to produce ct.
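Since the GRU exposes the same sequence-to-vector interface as the LSTM, it can be used as a drop-in replacement; the sketch below uses illustrative settings only and mirrors the previous example with the cheaper two-gate cell:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(24, 1)),
    layers.GRU(64),      # two gates instead of three: fewer parameters, faster training
    layers.Dense(4),
])
model.compile(optimizer="adam", loss="mse")
```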
3.2.4. Bidirectional Recurrent Neural Networks

There are some problems, in the field of natural language processing (NLP) for instance, where in order to predict a value of a data sequence at a given instant of time, information from the sequence both before and after that instant is needed. Bidirectional RNNs (BRNN) arise to address this issue and solve this kind of problems. The main disadvantage of the BRNNs is that the entire data sequence is needed before the prediction can be made. Standard networks compute the activation values for hidden units using only information from the past, whereas a BRNN uses information from the past as well as information from the present and the future as input, using both forward and backward processing.
The predicted value at time t is obtained as the activation function applied to the corresponding weights with both the forward and backward activations at time t. That is:

$\hat{x}_t = g(W_x\,[a_t^{f}, a_t^{b}] + b_x) \qquad (15)$
where $W_x$ and $b_x$ are the weights and bias, and $a_t^{f}$ and $a_t^{b}$ are the activation values of the hidden units computed by forward and backward processing, respectively.
Figure 7 presents the basic architecture of a BRNN. A BRNN can be seen as two RNNs together, where the different hidden units have two values, one computed by the forward pass and another one by the backward pass. In addition, the BRNN units can be standard RNN units or GRU or LSTM units. In fact, a BRNN with LSTM units is commonly used for a lot of NLP problems.
Figure 7: Basic architecture of a bidirectional recurrent neural network, with forward and backward hidden units.
3.2.5. Deep Recurrent Neural Networks

A deep recurrent neural network (DRNN) is an RNN with more than one layer, also called stacked RNN. The hidden units can be standard RNN, GRU or LSTM units, and the network can be unidirectional or bidirectional, as described in previous sections. Figure 8 illustrates the architecture of a DRNN.
In general, a DRNN works quite well for time series forecasting, but its performance deteriorates when using very long data sequences as input. To address this issue, attention mechanisms can be incorporated into the model, being one of the most powerful ideas in deep learning [35]. An attention model allows a neural network to pay attention to only part of an input data sequence while it is generating the output. This attention is modeled using weights, which are computed by a single-layer feed forward neural network [36].
Figure 8: Architecture of a deep (stacked) recurrent neural network.
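A hedged sketch combining the two previous ideas, a stacked recurrence whose first layer is bidirectional (all sizes are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two stacked recurrent layers; the first is wrapped in Bidirectional so each
# hidden state combines a forward and a backward pass over the input window.
model = keras.Sequential([
    layers.Input(shape=(24, 1)),
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),  # keep the sequence for stacking
    layers.LSTM(16),
    layers.Dense(4),
])
model.compile(optimizer="adam", loss="mse")
```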
3.3. Convolutional Neural Networks

Convolutional Neural Networks (CNN) are one of the most common architectures in image processing and computer vision [38]. The CNNs have three kinds of layers: convolution, pooling and fully connected. The main task of the convolution layers is the learning of the features from the input data. For that, filters of a pre-defined size are applied to the data using the convolution operation between matrices; the convolution is the sum of all element-wise products. The pooling reduces the size of the input, speeding up the computation and preventing overfitting. The most popular pooling methods are average and max pooling, which summarize the values using the mean or maximum value, respectively. Once the features have been extracted by the convolutional layers, the forecasting is carried out using fully connected layers, also called dense layers, as in a DFFNN. The input data for these last fully connected layers are the flattened features resulting from the convolutional and pooling layers. Figure 9 depicts the overall architecture of a CNN.
Figure 9: Overall architecture of a CNN, with convolution, pooling and fully connected layers.
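A minimal one-dimensional CNN forecaster following this layout could be sketched as below; the filter count, kernel size and the rest of the settings are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(24, 1)),
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),  # feature extraction
    layers.MaxPooling1D(pool_size=2),                             # downsampling
    layers.Flatten(),                                             # flattened features
    layers.Dense(4),                                              # fully connected forecast head
])
model.compile(optimizer="adam", loss="mse")
```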
A special kind of CNN, the Temporal Convolutional Network (TCN) [39], has emerged for data sequences, competing directly with DRNNs. TCNs have the same architecture as a DFFNN, but the values of the activations for each layer are computed using earlier values from the previous layer. Dilated convolution is used in order to select which values of the neurons from the previous layer will contribute to the values of the neurons in the next layer. Thus, this dilated convolution operation captures both local and long-range temporal patterns, and is defined as:
$F_d(x) = \sum_{i=0}^{K-1} f(i) \cdot x_{t - d \cdot i} \qquad (16)$
where K is the size of the filter f and d is the dilation factor. Figure 10 shows an example of dilated convolution using a filter of size 3 and dilation factors of 1, 2 and 4 for each layer.
Figure 10: Dilated causal convolutions with dilation factors d = 1, 2 and 4 between the input, hidden and output layers.
Generic residual blocks are used instead of plain convolutional layers when deeper and larger TCNs are built, in order to achieve further stabilization. These residual blocks consist in adding the input data to the output before applying the activation function. Then,
the TCN model can be defined as follows:

$a_t^{l} = g\big(a_t^{l-1} + F_d(W_a^{l}\, a_t^{l-1} + b_a^{l})\big) \qquad (17)$
where $F_d(\cdot)$ is the dilated convolution with dilation factor d defined in Eq. (16), $a_t^{l}$ is the value of the neuron of the l-th layer at time t, $W_a^{l}$ and $b_a^{l}$ are the weights and bias corresponding to the l-th layer, and g is the activation function.
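A TCN-style stack can be approximated with causal, dilated 1D convolutions; the sketch below uses illustrative settings and, for brevity, omits the residual blocks described above:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Input(shape=(24, 1))])
for d in (1, 2, 4):  # growing dilation factors, as in Figure 10
    model.add(layers.Conv1D(32, kernel_size=3, dilation_rate=d,
                            padding="causal", activation="relu"))
model.add(layers.Flatten())
model.add(layers.Dense(4))
model.compile(optimizer="adam", loss="mse")
```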
4. Practical aspects

Deep learning models are more complex than classical ones, and their implementation requires a high level of technical expertise and a considerable time investment. For this reason, the profile of the deep learning expert has become one of the most demanded nowadays. In order to make implementations easier and reduce the time needed to design and train a model, some companies have focused their work on developing frameworks that allow for a faster development of deep learning models.
The main idea of the deep learning frameworks is to provide an interface that allows for the implementation of models without having to pay too much attention to the mathematical complexity behind them. There are several frameworks available in the literature. The choice of one or another will depend on several important factors, such as the type of architecture that has to be implemented, the supported programming language, or whether it can run on GPUs. In this sense, Table 1 summarizes the most widely used frameworks in the literature, where the term all includes the DFFNN, CNN, TCN, RNN, LSTM, GRU or BRNN architectures, and CPU and GPU indicate the processing units supported.
Table 1 shows that the predominant programming language for developing deep learning models is Python. In addition, most of the frameworks support distributed execution and the use of GPUs. Although the described frameworks facilitate the development of the models, some of them require too many lines of code to obtain a complete implementation. For this reason, high-level libraries based on the core of the frameworks have been developed, such as Keras [51], Sonnet [52], Swift or Gluon [53], among others. The main advantage of using a high-level library is that the syntax can be reused for different underlying frameworks.
One of the most important tasks when building a deep learning model is the model optimization. This optimization will determine the quality of the model, and must be performed based on the adjustment of its hyper-parameters. In deep learning there are two types of hyper-parameters: model hyper-parameters, which define the network architecture, and optimization hyper-parameters. The former are adjusted in the model definition to obtain optimal performance. The optimization hyper-parameters are adjusted during the training phase of the model using the data set. Some of the most relevant hyper-parameters are described in Table 2.
The relevant hyper-parameters depend on the architecture to be used. In addition, the value of each one will be influenced by the characteristics of the problem and the data. This makes the task of optimizing a model a challenge for the research community. Moreover, taking into account the parameters described in Table 2, the immense number of possible combinations can be deduced. For this reason, various metaheuristics and optimization strategies are used. According to the literature, there are several strategies to optimize a set of hyper-parameters for deep learning models, which are summarized in Table 3 and described below.
Table 2: Relevant hyper-parameters.
Hyper-parameter | Architectures | Description
Optimizer | All | Algorithm used to update the weights of each layer after each iteration [54].
Learning rate | All | It determines the size of the step at each iteration of the optimization method [55].
Number of epochs | All | Number of passes made over the whole training set [56].
Batch size | All | Number of sub-samples that the network uses to update the weights [57].
Hidden layers | All | It determines the depth of the neural network [58].
Activation function | All | Introduces non-linearity in the model, which allows the extraction of more complex knowledge [59].
Momentum | All | It prevents oscillations in the convergence of the method [60].
Weight initialization | All | It prevents the explosion or vanishing of the activations in the layers [61].
Dropout | All | It eliminates certain connections between neurons in each iteration. It is used to prevent over-fitting [62].
L1/L2 regularization | All | It prevents over-fitting, penalizing weights that are too high so that the model does not depend on a single feature [63].
Units | RNN, DFFNN | It determines the level of knowledge that is extracted by each layer. It is highly dependent on the size of the data used [58].
Kernel/filter | CNN | Matrix that moves over the input data. It allows the extraction of characteristics [64].
Stride | CNN | The number of positions that the filter moves over the input matrix [65].
Padding | CNN | Number of null samples added to the input when it is processed by the kernel [66].
Number of channels | CNN | Depth of the matrices involved in the convolutions [67].
Pooling | CNN | It allows to reduce the number of parameters and calculations in the network [68].
nb_stacks | TCN | Number of stacks of residual blocks.
Dilations | TCN | A deep stack of dilated convolutions to capture long-range temporal patterns.
Table 3: Search strategies.
Strategy | Suitable for deep learning | Cost | Search space
Trial-error | ✗ | Low | Low
Grid | ✗ | High | High
Random | ✓ | Medium | High
Probabilistic | ✓ | Medium | Driven
1. Trial and error. This strategy requires the action of a user to modify the values manually each time a run is finished, which implies a considerable time investment, although it has a relatively low computational cost and a low search space. Since in deep learning there are a large number of hyper-parameters and the values they can take are infinite, it is not advisable to use this optimization method.
2. Grid. The grid method explores all the possible combinations of a predefined set of values for each hyper-parameter. Its cost grows exponentially with the number of hyper-parameters considered, which makes this method unviable to apply in deep learning, where the search space is huge.
3. Random. Random search allows to cover a high search space, because the values are sampled at random from predefined ranges (see the sketch after this list). Within this group we can differentiate between totally random and guided searches. Examples of the latter type of searches are the genetic algorithms [69, 70] or particle swarm optimization algorithms, among others. The wide search range, added to the medium cost involved in this search strategy, makes it one of the best methods for optimizing deep learning models. In addition, new hyper-parameter optimization approaches of this kind have been proposed, such as the one by the authors in [73].
4. Probabilistic. This strategy uses the results of previous evaluations of the model. These evaluations are used to generate a probabilistic model that assigns values to the different hyper-parameters. The most common algorithms in this group are based on Bayesian optimization.
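The following sketch illustrates the random strategy with an assumed helper, build_model, and arbitrary hyper-parameter ranges; none of these values come from the survey:

```python
import random
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(units, lr, n_layers):
    """Hypothetical helper: DFFNN whose width, depth and learning rate are sampled."""
    model = keras.Sequential([layers.Input(shape=(25,))])
    for _ in range(n_layers):
        model.add(layers.Dense(units, activation="relu"))
    model.add(layers.Dense(4))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model

# Toy data standing in for a windowed time series (see Section 2).
X_train, Y_train = np.random.rand(200, 25), np.random.rand(200, 4)
X_val, Y_val = np.random.rand(50, 25), np.random.rand(50, 4)

best = None
for _ in range(10):  # sample ten random configurations
    cfg = {"units": random.choice([16, 32, 64, 128]),
           "lr": 10 ** random.uniform(-4, -2),
           "n_layers": random.randint(1, 4)}
    hist = build_model(**cfg).fit(X_train, Y_train, validation_data=(X_val, Y_val),
                                  epochs=20, batch_size=32, verbose=0)
    score = min(hist.history["val_loss"])
    if best is None or score < best[0]:
        best = (score, cfg)  # keep the configuration with the lowest validation loss
print(best)
```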
There are many libraries for the optimization of hyper-parameters in an automated way. However, very few are designed specifically for the optimization of deep learning model hyper-parameters while being also compatible with the most widely used frameworks. Table 4 summarizes these libraries, indicating their search strategy, programming language and compatible framework from Table 1.
4.3. Hardware performance

One of the most important decisions a researcher must make is to determine the physical resources needed to ensure that deep learning algorithms will find accurate models. Hence, this section overviews different hardware infrastructures typically used in deep learning contexts, given their increasing relevance.
Although a CPU can be used to execute deep learning algorithms, the intensive computational requirements usually make the CPU physical resources insufficient (scalar architecture). For this reason, three different hardware architectures are typically used for mining information with deep learning frameworks: the GPU, the Tensor Processing Unit (TPU) and the Intelligence Processing Unit (IPU).

GPUs have hundreds or even thousands of more cores than a CPU, but running at lower speeds. GPUs achieve high data parallelism with the single instruction, multiple data (SIMD) architecture and play an important role in the current artificial intelligence ecosystem.
The first generation of TPUs was introduced in 2016, at the Google I/O Conference, and they were specifically designed to run already trained neural networks, as application-specific integrated circuits (ASIC) built specifically for machine learning. Compared to GPUs (frequently used for the same tasks since 2016), TPUs are implicitly designed for a larger volume of reduced precision calculation (for example, from 8 bits of precision) and lack hardware for rasterization/texture mapping. The term was coined for a specific chip designed for Google's TensorFlow framework. Generally speaking, TPUs have less accuracy compared to the computations performed on a normal CPU or GPU, but it is sufficient for the calculations they have to perform (an individual TPU can process more than 100 million pictures per day). Moreover, TPUs are highly optimized for large batches and CNNs.
The IPU is completely different from today's CPU and GPU processors. It is a highly flexible, easy to use, parallel processor that has been designed from the ground up to deliver state of the art performance on current machine learning models. But more importantly, the IPU has been designed to allow new and emerging machine intelligence workloads to be realized. The IPU delivers much better arithmetic efficiency on small batch sizes for both training and inference, which results in faster model convergence in training, models that generalise better, the ability to parallelize over many more IPU processors to reduce training time for a given batch size, and also much higher throughput at lower latencies for inference. Another interesting feature is its lower power consumption compared to GPUs or TPUs.
Table 5 summarizes the properties of the processing units described in this section. Note that the performance is measured in FLOPS and the cost in USD, and that TPUs are offered through cloud services rather than sold as stand-alone hardware, which is why no purchase cost is reported.
Table 5: Processing units properties.
Units | Architecture | Batch size | Performance (FLOPS) | Cost (USD)
CPU | Scalar | Small | ~10^9 | ~10^2
GPU | Vector | Large | ~10^12 | ~10^3
TPU | ASIC | Large | ~10^12 | -
IPU | Graph | Small | ~10^15 | ~10^5
5. Applications

To motivate the relevance of the time series prediction problem, an analysis of the state-of-the-art has been carried out, classifying the deep learning research works by application domain (such as energy and fuels, image and video, finance, environment, industry or health) and the most widespread network architectures used (ENN, LSTM, GRU, BRNN, DFFNN, CNN or TCN), as summarized in Table 6. An overview of the items for each application domain is made in the following paragraphs, in order to highlight the goals reached for each method and domain:
1. Energy and fuels. With the increasing use of renewable energies, accurate estimates are needed to improve power system planning and operation. Many techniques have been used to make predictions, including deep learning [196]. Reviewing the literature of the last few years, it can be concluded that the vast majority of deep learning architectures are suitable for this application area. For example, architectures based on LSTM [91], ENN [86, 87], GRU [95], BRNN [96] and TCN [101]
Table 6: Summary of the works reviewed and classified into network architecture and application domain.
Energy and fuels | ENN: [81, 82, 83, 84, 85, 86, 87]; LSTM: [88, 89, 90, 91, 92, 93]; GRU: [94, 95]; BRNN: [96]; DFFNN: [97, 98, 99]; CNN: [100]; TCN: [101]; Hybrid/Others: [102, 103, 104, 105, 106, 107]
Image and video | ENN: [108]; CNN: [109, 110, 111, 112, 113, 114]; TCN: [115, 116, 117]; Hybrid/Others: [118]
Financial | LSTM: [119, 120, 121, 122]; GRU: [121, 122, 123]; BRNN: [122]; DFFNN: [124]; CNN: [125, 126, 121, 127]; Hybrid/Others: [128, 129, 130, 131, 132, 121, 133, 134]
Environmental | ENN: [135, 136, 137, 138, 139, 140, 141, 142, 143]; LSTM: [144, 145, 146, 147]; GRU: [146, 148, 149, 150]; BRNN: [151]; DFFNN: [152]; CNN: [153]; TCN: [154]; Hybrid/Others: [151, 155, 156]
Industry | ENN: [157, 158]; LSTM: [159, 160, 161]; GRU: [162, 163]; BRNN: [164, 165]; DFFNN: [166, 165]; CNN: [167]; TCN: [168, 169]; Hybrid/Others: [170, 171, 167]
Health | LSTM: [172]; BRNN: [173]; DFFNN: [174]; CNN: [175, 176, 114]; Hybrid/Others: [177, 178, 179, 180, 181, 182]
Misc | GRU: [183]; BRNN: [184]; DFFNN: [185]; CNN: [186, 187, 188]; TCN: [189, 190, 191, 192, 193]; Hybrid/Others: [194, 195]
have been used to predict electricity demand and consumption. LSTM [90] and CNN [100] networks have also been used to forecast photo-voltaic energy load. A GRU has been used to forecast soot emission in diesel engines [94]. An ensemble of deep learning networks was proposed by the authors in [99] to forecast general purpose time series and, after that, this strategy has also been used to forecast load demand time series [98]. Other works forecast oil production with LSTM networks [92]. Hybrid architectures have been also used in this research field, for example, to forecast the price of carbon [103], residential energy consumption [104], bus load [105], solar power generation [106] or the condition of wind turbines [107].
2. Image and video. Image and video analysis is a very broad area of research, and works related to almost any application domain can be found. For example, deep learning has been widely applied to image-based cancer detection and diagnosis [197]. In [198], the authors summarized some techniques and studies used to recognize actions in video sequences. The authors in [108] used an ENN network to forecast and monitor the displacement of slopes over time, and other authors combined GRU, RNN and CNN to classify satellite image time series. Although all these works offer highly competitive results, fewer proposals exist to solve forecasting problems using image or video time series data. On the one hand, CNNs have been used to forecast combustion instability from high-speed flame images [109], the speed of large-scale traffic [110] or to detect coronary artery stenosis [114], among others. On the other hand, TCNs are booming when it comes to analyzing images and videos. For example, Yunqi et al. used a TCN to estimate density maps from videos [115], and the authors in [117] also relied on this architecture. Another work in which images were used can be found in [116]: in this work, a TCN model was used to dynamically detect stress through facial photographs.
3. Financial. Financial analysis has been a challenging issue for decades. Therefore, there are many research works related to this application domain, in which architectures such as CNN [125, 126, 127], DFFNN [124], GRU [123] or LSTM [120, 119] have been used. Some authors make a comparison between some of these architectures, analyzing which one offers better results [122]. Although these studies are widespread, the complexity of the problem requires the search for new methodologies and architectures [129, 121, 134, 133].
4. Environmental. Environmental applications are among the most active areas for the scientific community. Many of these works are also based on hybrid architectures. The authors in [151] applied CNN and LSTM to forecast wind speed, although most of the authors focused on a single specific variable. For instance, TCN, GRU, ENN, BRNN and LSTM architectures have been used to forecast information related to wind in [156, 148, 135, 137, 138, 147, 150]. Water quality and demand were also predicted using TCN and ENN in [154, 141]. A model for correlated time series prediction was also proposed by Wan et al. in [144]. Air quality and the gas concentration for a swine house [200] were also predicted using deep learning models.
614 ing used to carry out tasks of different kinds [201]. For instance, TCN
615 and BRNN can be used to traffic flow forecasting [168, 164]. LSTM
616 can be used for multiple purposes, such as process planning [161], con-
619 forecast bath and metal height features in electrolysis process. ENN
620 and GRU networks have been also used, for example, to forecast the
621 useful life or degradation of the materials [158, 157, 196]. Deep learn-
622 ing techniques are also widely applied to architecture, as it can be seen
624 concluded that almost all network architectures have been used, given
6. Health. The use of deep learning architectures in the area of health has become common in the last years [203, 197]. However, time series prediction using deep learning models is not very widespread, as time series are generally short in this field, along with the high computational cost involved. The authors in [174] reviewed deep learning approaches for healthcare diagnosis and prognosis with a focus on cardiovascular disease. Instead, most of the remaining works rely on CNN or hybrid models. For example, the authors in [176] used CNN to accelerate the estimation of magnetic resonance parameters from time series. Deep learning was also used to monitor the sleep stage in [177], for detecting premature ventricular contractions [175] and for the early detection of sepsis [178, 181]. In [180], the authors used a BP network to forecast the incidence rate of pneumonia. LSTM networks have been used to forecast the status of critical patients according to their vital functions [172]. A recent study conducted by the authors in [182] uses deep learning to forecast COVID-19 time series.
7. Miscellaneous. In recent years, the TCN has been one of the most widely tested general purpose architectures for time series forecasting [191, 193, 190, 183]. However, any of the other network architectures can also be found among the works classified in Table 6. For example, CNN and RNN can be used to detect human activity [187] or hybrid models to detect anomalies.
From the previous analysis of Table 6, two main conclusions can be drawn. First, there exist several methods that have not been applied yet to particular application fields. Second, the existence of these gaps encourages the research community to test the remaining architectures in those domains.
6. Conclusions

Deep learning has proven to be one of the most powerful machine learning techniques for solving complex problems dealing with big data. Most of the data mainly generated through smart devices are time series nowadays, their prediction being one of the most frequent problems in almost all research areas. Thus, these two topics have been jointly analyzed in this survey on deep learning for time series forecasting. Firstly, the most used deep learning architectures for time series data in the last years have been described, with special emphasis on important practical aspects that can have a great influence on the reported results. In particular, the focus has been put on the search for hyper-parameters, the frameworks for deployment of the different architectures, and the existing hardware to lighten the hard training of the proposed network architectures. Secondly, a study of the deep neural networks used to predict time series in different application domains has been carried out in this survey, with the aim of identifying research gaps and showing which architectures have not been sufficiently tested in some applications.
Acknowledgements

The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under project TIN2017-88209-C2-1-R. Also, this work has been partially supported by the Algerian General Directorate of Scientific Research and Technological Development under the PRFU project (ref: C00L07UN060120200003).
References
680 [1] Plageras AP, Psannis KE, Stergiou C, et al. Efficient IoT-based sensor
684 naive bayes for the sentiment and affect analysis in social media. Big
686 [3] Gama J. Knowledge Discovery from Data Streams. Chapman &
688 [4] Al-Jarrah OY, Yoo PD, Muhaidat S, et al. Efficient machine learning
689 for big data: a review. Big Data Research, 2:87–93, 2015.
690 [5] Dhar V, Sun C, and Batra P. Transforming finance into vision: concur-
691 rent financial time series as convolutional net. Big data, 7(4):276–285,
692 2019.
693 [6] Nguyen G, Dlugolinsky S, Bobák M, et al. Machine learning and deep
694 learning frameworks and libraries for large-scale data mining: a survey.
698 [8] Schmidhuber J. Deep learning in neural networks: an overview. Neural
700 [9] Makridakis S, Wheelwright SC, and Hyndman RJ. Forecasting methods
704 [11] Box GEP and Jenkins GM. Time series analysis: forecasting and
709 [13] Zhang Q, Yang LT, Chen Z, et al. A survey on deep learning for big
711 [14] Fawaz HI, Forestier G, Weber J, Idoumghar L, et al. Deep learning
712 for time series classification: a review. Data Mining and Knowledge
714 [15] Bagnall A, Lines J, Vickers W, et al. The UEA & UCR time series clas-
716 2017.
720 [17] Buuren S. Flexible Imputation of Missing Data. Chapman &
722 [18] Maronna RA, Martin RD, and Yohai VJ. Robust Statistics: Theory
724 [19] Fu TC. A review on time series data mining. Engineering Applications
726 [20] Shumway RH and Stoffer DS. Time series analysis and its applications
728 [21] Maronna RA, Martin RD, and Yohai VJ. Robust statistics: theory and
732 [23] Wang X, Kang Y, Hyndman RJ, et al. Distributed ARIMA models for
735 big data time series: Mining trillions of time series subsequences under
738 [25] Torres JF, Galicia A, Troncoso A, et al. A scalable approach based on
739 deep learning for big data time series forecasting. Integrated Computer-
741 [26] Galicia A, Talavera-Llames RL, Troncoso A, et al. Multi-step forecast-
742 ing for big data time series based on ensemble learning. Knowledge-
749 nearest neighbors algorithm for big data time series forecasting. Neu-
752 time series forecasting based on pattern sequence similarity and its
[31] Elman JL. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
762 recurrent neural networks on sequence modeling. In Proceedings of the
764 [34] Cho K, Merrienboer BV, Bahdanau D, et al. On the properties of neu-
770 [36] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption
779 [39] Alla S and Adari SK. Beginning anomaly detection using Python-based
783 [41] Candel A, LeDell E, Arora A, et al. Deep learning with h2o.
785 [42] Eclipse Deeplearning4j development team. DL4J: deep learning for
791 [45] Intel Nervana systems. Neon deep learning framework, 2017.
792 [46] Tokui S, Okuta R, Akiba T, et al. Chainer: A deep learning framework
802 [49] Bai J, Lu F, Zhang K, et al. Onnx: Open neural network exchange.
804 [50] Seide F and Agarwal A. CNTK: Microsoft’s open-source deep-learning
812 [54] Choi D, Shallue CJ, Nado Z, Lee J, Maddison CJ, and Dahl GE. On
815 [55] You K, Long M, Wang J, and Jordan MI. How does learning rate decay
818 [56] Sinha S, Singh TN, Singh V, and Verma A. Epoch determination for
821 [57] Masters D and Luschi C. Revisiting small batch training for deep
823 [58] Shafi I, Ahmad J, Shah SI, et al. Impact of varying neurons and hidden
825 In Proceedings of the IEEE International Multitopic Conference, pages
827 [59] Ding B, Qian H, and Zhou J. Activation functions and their character-
833 [61] Kumar SK. On weight initialization in deep neural networks. arXiv
838 [63] Ng AY. Feature selection, l1 vs. l2 regularization, and rotational invari-
844 [65] Zaniolo L and Marques O. On the use of variable stride in convolutional
847 [66] Dwarampudi M and Reddy NVS. Effects of padding on LSTMs and
849 [67] Zhu H, An Z, Yang C, et al. Rethinking the number of channels for the
865 2002.
869 [74] Ranjit MP, Ganapathy G, Sridhar K, et al. Efficient deep learning hy-
874 [75] Bergstra J, Yamins D, and Cox DD. Making a science of model search:
878 [76] Camero A, Toutouh J, and Alba E. Dlopt: deep learning optimization
883 2019.
887 2013.
888 [80] Wang YE, Wei GY, and Brooks D. Benchmarking TPU, GPU, and
890 2019.
891 [81] Yu D, Wang Y, Liu H, et al. System identification of PEM fuel cells us-
892 ing an improved Elman neural network and a new hybrid optimization
894 [82] Zheng Y, Yao Z, Z H, et al. Power generation forecast of top gas
897 [83] Ruiz LGB, Rueda R, Cuéllar MP, et al. Energy consumption fore-
901 on GA–Elman neural network for circulating cooling water with elec-
905 battery under vibration stress using Elman neural network. Interna-
911 [87] Li D, Wang H, Zhang Y, et al. Power grid load state information
913 based on Elman neural network. In Proceedings of information Tech-
917 casting models using deep LSTM-RNN. Neural Computing and Appli-
925 [91] Muzaffar S and Afshari A. Short-term load forecasts using LSTM
927 [92] Song X, Liu Y, Xue L, et al. Time-series well performance predic-
928 tion based on long short-term memory (LSTM) neural network model.
930 [93] Wang JQ, Du Y, and Wang J. LSTM based long-term energy con-
932 [94] Gokhan A, Yilmaz E, Unel M, et al. Estimating soot emission in diesel
935 [95] Wu W, Liao W, Miao J, et al. Using gated recurrent unit network to
938 [96] Tang X, Dai Y, Wang T, et al. Short-term power load forecasting based
941 [97] Shao Z, Zheng Q, Yang S, et al. Modeling and forecasting the electricity
945 [98] Qiu X, Ren Y, Suganthan PN, et al. Empirical mode decomposition
946 based ensemble deep learning for load demand time series forecasting.
948 [99] Qiu X, Zhang L, Ren Y, et al. Ensemble deep learning for regression
951 6, 2014.
953 based protection scheme for PV integrated microgrid under solar irra-
956 2020.
957 [101] Mishra K, Basu S, and Maulik U. DaNSe: a dilated causal convo-
958 lutional network based model for load forecasting. Lecture Notes in
960 [102] Qiao W and Yang Z. Forecast the electricity price of U.S. using a
962 [103] Ji L, Zou Y, He K, et al. Carbon futures price forecasting based with
964 2019.
965 [104] Kim TY and Cho SB. Predicting residential energy consumption using
967 [105] Shen M, Xu Q, Wang K, et al. Short-term bus load forecasting method
970 [106] AlKandari M and Ahmad I. Solar power generation forecasting using
973 [107] Kong Z, Tang B, Deng L, et al. Condition monitoring of wind turbines
975 ral networks and gated recurrent units. Renewable Energy, 146:760–
977 [108] Wang S, Zhang Z, Ren Y, et al. UAV photogrammetry and AFSA-
978 Elman neural network in slopes displacement monitoring and forecast-
980 [109] Sarkar S, Lore KG, Sarkar S, et al. Early detection of combustion
981 instability from hi-speed flame images via deep learning and symbolic
987 [111] Chen W and Shi K. A deep learning framework for time series classifi-
988 cation using relative position matrix and convolutional neural network.
991 Sentinel-2 satellite image time series for land cover mapping via a
995 tion using time series imaging and deep learning. International Journal
1000 [115] Miao Y, Han J, Gao Y, and Zhang B. ST-CNN: Spatial-Temporal
1010 Point deep Learning architecture for time series classificatiOn. ISPRS
1012 [119] Yan H and Ouyang H. Financial time series prediction based on deep
1014 [120] Sismanoglu G, Onde M, Kocer F, et al. Deep learning based forecasting
1015 in stock market with big data analytics. In Proceedings of the IEEE
1018 [121] Jayanth BA, Harish RDS, and Nair BB. Applicability of deep learning
1019 models for stock price forecasting an empirical study on bankex data.
1021 [122] Jiang M, Liu J, Zhang L, et al. An improved stacking framework for
1022 stock index prediction by leveraging tree-based ensemble models and
1029 [124] Orimoloye LO, Sung MC, Ma T, et al. Comparing the effectiveness of
1030 deep feedforward neural networks and shallow architectures for predict-
1031 ing stock price indices. Expert Systems with Applications, 139:112828,
1032 2020.
1033 [125] Makarenko AV. Deep learning algorithms for estimating Lyapunov
1037 [126] Dingli A and Fournier KS. Financial time series forecasting - a deep
1040 [127] Kelotra A and Pandey P. Stock Market Prediction Using Optimized
1042 [128] Ni L, Li Y, Wang X, et al. Forecasting of Forex time series data based
1044 [129] Bao W, Yue J, and Rao Y. A deep learning framework for financial
1045 time series using stacked autoencoders and long-short term memory.
1048 model for multivariate financial time series prediction. Lecture Notes
1050 [131] Chen CT, Chiang LK, Huang YC, et al. Forecasting interaction of
1055 and recurrent neural network to forecast the stock price of Casablanca
1060 [134] Long W, Lu Z, and Cui L. Deep learning-based feature engineering for
1063 [135] Liu H, Tian HQ, Liang XF, et al. Wind speed forecasting approach
1066 [136] Yu C, Li Y, Xiang H, et al. Data mining-assisted short-term wind
1070 [137] Liu H, Wei MX, and Fei LY. Wind speed forecasting method based on
1071 deep learning strategy using empirical wavelet transform, long short
1072 term memory neural network and Elman neural network. Energy Con-
1074 [138] Zhang Y and Pan G. A hybrid prediction model for forecasting
1077 [139] Zhang L, Xie Y, Chen A, et al. A forecasting model based on enhanced
1078 Elman neural network for air quality prediction. Lecture Notes in
1080 [140] Huang Y and Shen L. Elman neural network optimized by firefly
1083 [141] Xiao D, Hou S, Li WZ, et al. Hourly campus water demand forecasting
1088 for swine house in cold region based on empirical mode decomposition
1089 and Elman neural network. Information Processing in Agriculture,
1091 [143] Wan X, Yang Q, Jiang P, et al. A hybrid model for real-time probabilis-
1092 tic flood forecasting using Elman neural network with heterogeneity of
1094 2019.
1095 [144] Wan H, Guo S, Yin K, et al. CTS-LSTM: LSTM-based neural net-
1098 [145] Freeman BS, Taylor G, Gharabaghi B, et al. Forecasting air quality
1099 time series using deep learning. Journal of the Air and Waste Man-
1101 [146] De-Melo GA, Sugimoto DN, Tasinaffo PM, et al. A new approach to
1102 river flow forecasting: LSTM and GRU multivariate models. IEEE
1104 [147] Chen J, Zeng GQ, Zhou W, et al. Wind speed forecasting using
1108 [148] Niu Z, Yu Z, Tang W, et al. Wind power forecasting using attention-
1111 pond based on gated recurrent unit (GRU). Information Processing in
1113 [150] Peng Z, Peng S, Fu L, et al. A novel deep learning ensemble model
1114 with data denoising for short-term wind speed forecasting. Energy
1116 [151] Jin X, Yu X, Wang X, et al. Prediction for time series with CNN and
1118 [152] Maqsood H, Mehmood I, Maqsood M, et al. A local and global event
1119 sentiment based efficient stock exchange forecasting using deep learn-
1121 2020.
1122 [153] O’Shea TJ, Roy T, and Clancy TC. Over-the-air deep learning based
1125 [154] Zhang Y, Thorburn PJ, and Fitch P. Multi-task temporal convolu-
1126 tional network for predicting water quality sensor data. In Proceedings
1132 [156] Liu H, Mi X, Li Y, et al. Smart wind speed deep learning based multi-
1133 step forecasting model using singular spectrum analysis, convolutional
1134 gated recurrent unit network and support vector regression. Renewable
1136 [157] Li X, Zhang L, Wang Z, et al. Remaining useful life prediction for
1140 [158] Yang L, Wang F, Zhang J, et al. Remaining useful life prediction of
1141 ultrasonic motor based on Elman neural network with improved par-
1144 [159] Rashid KM and Louis J. Times-series data augmentation and deep
1150 [161] Mehdiyev N, Lahann J, Emrich A, et al. Time series classification using
1151 deep learning for process planning: a case from the process industry.
1154 bearing using HMM and improved GRU. Measurement: Journal of the
1156 [163] Wang J, Yan J, Li C, et al. Deep heterogeneous GRU model for pre-
1159 [164] Bohan H and Yun B. Traffic flow prediction based on BRNN. In
1162 [165] Pasias A, Vafeiadis T, Ioannidis D, et al. Forecasting bath and metal
1166 [166] Jiang P, Chen C, and Liu X. Time series prediction for evolutions
1176 [169] Wu P, Sun J, Chang X, et al. Data-driven reduced order model with
1179 [170] Varona B, Monteserin A, and Teyseyre A. A deep learning approach
1183 load forecasts using deep learning vs. traditional time-series techniques.
1185 [172] da Silva DB, Schmidt D, da Costa CA, da Rosa Righi R, and Eskofier
1189 [173] Yu W, Kim Y, and Mechefske C. Remaining useful life estimation using
1193 forecasting for healthcare diagnosis and prognostics with the focus on
1197 [175] Liu Y, Huang Y, Wang J, et al. Detecting premature ventricular con-
1200 [176] Hoppe E, Körzdörfer G, Würfl T, et al. Deep learning for magnetic
61
1202 parameter values from time series. In Studies in Health Technology and
1204 [177] Chambon S, Galtier MN, Arnal PJ., et al. A deep learning archi-
1205 tecture for temporal sleep stage classification using multivariate and
1208 [178] Lauritsen SM, Kalør ME, Kongsgaard EL, et al. Early detection of sep-
1209 sis utilizing deep learning on electronic health record event sequences.
1214 [180] Liang liang M and Fu peng T. Pneumonia Incidence Rate Predictive
1218 [181] Sarafrazi S, Choudhari RS, Mehta C, Mehta HK, et al. Cracking the
1219 “sepsis” code: Assessing time series nature of ehr data, and using deep
1222 [182] Zeroual A, Harrou F, Dairi A, and Sun Y. Deep learning methods for
62
1225 [183] Zhang X, Shen F, Zhao J, et al. Time series forecasting using GRU
1228 [184] Zhao X, Xia L, Zhang J, et al. Artificial neural network based model-
1231 2020.
1232 [185] Imamverdiyev Y and Abdullayeva F. Deep learning method for denial
1235 [186] Munir M, Siddiqui SA, Dengel A, et al. DeepAnT: a deep learning
1238 [187] Zebin T, Scully PJ, and Ozanyan KB. Human activity recognition
63
1248 [190] Shao J, Shen H, Cao Q, et al. Temporal convolutional networks for
1251 [191] Chen Y, Kan Y, Chen Y, et al. Probabilistic forecasting with temporal
1258 LSTM model for multi-step prediction of chaotic time series. Compu-
1260 [194] Rodrigues F, Markou I, and Pereira FC. Combining time-series and
1261 textual data for taxi demand prediction in event areas: a deep learning
1263 [195] Kalinin MO, Lavrova DS, and Yarmak AV. Detection of threats in
1267 [196] Wang H, Lei Z, Zhang X, et al. A review of deep learning for renewable
1269 2019.
64
1270 [197] Hu Z, Tang J, Wang Z, et al. Deep learning for image-based cancer
1272 2018.
1273 [198] Atto AM, Benoit A, and Lambert P. Timed-image based deep learn-
1276 [199] Sezer OB, Gudelek M, and Ozbayoglu AM. Financial time series fore-
1279 [200] Shen Z, Zhang Y, Lu J, et al. A novel time series forecasting model
1281 [201] Wang Y, Zhang D, Liu Y, et al. Enhancing transportation systems via
1286 [203] Rui Z, Ruqiang Y, Zhenghua C, et al. Deep learning and its applica-
1289 [204] Mahdavifar S and Ghorbani AA. Application of deep learning to cy-