
TimeMachine: A Time Series is Worth 4 Mambas for Long-term Forecasting

Md Atik Ahamed (a,*) and Qiang Cheng (a,b,**)

(a) Department of Computer Science, (b) Institute for Biomedical Informatics
University of Kentucky

(*) Email: [email protected]
(**) Email: [email protected], Corresponding Author

Abstract. Long-term time-series forecasting remains challenging due to the difficulty in capturing long-term dependencies, achieving linear scalability, and maintaining computational efficiency. We introduce TimeMachine, an innovative model that leverages Mamba, a state-space model, to capture long-term dependencies in multivariate time series data while maintaining linear scalability and small memory footprints. TimeMachine exploits the unique properties of time series data to produce salient contextual cues at multiple scales and leverages an innovative integrated quadruple-Mamba architecture to unify the handling of channel-mixing and channel-independence situations, thus enabling effective selection of contents for prediction against global and local contexts at different scales. Experimentally, TimeMachine achieves superior performance in prediction accuracy, scalability, and memory efficiency, as extensively validated using benchmark datasets.

Code availability: https://github.com/Atik-Ahamed/TimeMachine

1 Introduction

Long-term time-series forecasting (LTSF) is essential in various tasks across diverse fields, such as weather forecasting, anomaly detection, and resource planning in energy, agriculture, industry, and defense. Although numerous approaches have been developed for LTSF, they typically can achieve only one or two desired properties, such as capturing long-term dependencies in multivariate time series (MTS), linear scalability in the amount of model parameters with respect to data, and computational efficiency or applicability in edge computing. It is still challenging to achieve these desirable properties simultaneously.

Capturing long-term dependencies, which are generally abundant in MTS data, is pivotal to LTSF. While linear models such as DLinear [34] and TiDE [6] achieve competitive performance with linear complexity and scalability, with accuracy on par with Transformer-based models, they usually rely on MLPs and linear projections that may not well capture long-range correlations [4]. Transformer-based models such as iTransformer [21], PatchTST [23], and Crossformer [36] have a strong ability to capture long-range dependencies and superior performance in LTSF accuracy, thanks to the self-attention mechanisms in Transformers [28]. However, they typically suffer from quadratic complexity [6], limiting their scalability and applicability, e.g., in edge computing.

Recently, state-space models (SSMs) [10, 11, 12, 13, 25] have emerged as powerful engines for sequence-based inference and have attracted growing research interest. These models are capable of inferring over very long sequences and exhibit distinctive properties, including the ability to capture long-range correlations with linear complexity and context-aware selectivity with hidden attention mechanisms [11, 2]. SSMs have demonstrated great potential in various domains, including genomics [11], tabular learning [1], graph data [3], and images [22], yet they remain unexplored for LTSF.

The under-utilization of SSMs in LTSF can be attributed to two main reasons. First, highly content- and context-selective SSMs have only been recently developed [11]. Second, and more importantly, effectively representing the context in time series data remains a challenge. Many Transformer-based models, such as Autoformer [29] and Informer [37], regard each observation as a token in a sequence, while more recent models like PatchTST [23] and iTransformer [21] leverage patches of the time series as tokens. However, our empirical experiments on real-world MTS data suggest that directly utilizing SSMs for LTSF by using either observations or patches as tokens could hardly achieve performance comparable to Transformer-based models. Considering the particular characteristics of MTS data, it is essential to extract more salient contextual cues tailored to SSMs.

MTS data typically have many channels, with each variate corresponding to a channel. Many models, such as Informer [37], FEDformer [38], and Autoformer [29], handle MTS data to extract useful representations in a channel-mixing way, where the MTS input is treated as a two-dimensional matrix whose size is the number of channels multiplied by the length of history. Nonetheless, recently a few works such as PatchTST [23] and TiDE [6] have shown that a channel-independence way of handling MTS may achieve state-of-the-art (SOTA) accuracy, where each channel is input to the model as a one-dimensional vector independent of the other channels. We believe that these two ways of handling LTSF need to be adopted as per the characteristics of the MTS data, rather than using a one-size-fits-all approach. When there are strong between-channel correlations, channel mixing usually can help capture such dependencies; otherwise, channel independence is a more sensible choice. Therefore, it is necessary to design a unified architecture applicable to both channel-mixing and channel-independence scenarios.

Moreover, time series data exhibit a unique property: temporal relations are largely preserved after downsampling into two sub-sequences. Few methods such as Scinet [19] have explored this property in designing their models; however, it is under-utilized in other approaches. Due to the high redundancy of MTS values at consecutive time points, directly using time points as tokens may let redundant values obscure context-based selection and, more importantly, overlook long-range dependencies.
Rather than relying on individual time points, using patches may provide contextual clues within each time window of a patch length. However, a pre-defined small patch length only provides contexts at a fixed temporal or frequency resolution, whereas long-range contexts may span different patches. To best capture long-range dependencies, it is sensible to supply multi-scale contexts and, at each scale, automatically produce global-level tokens as contexts, similar to iTransformer [21], which tokenizes the whole look-back window. Further, while models like the Transformer and the selective SSMs [11] have the ability to select sub-token contents, such ability is limited in the channel-independence case, for which local contexts need to be enhanced when leveraging SSMs for LTSF.

In this paper, we introduce a novel approach that effectively captures long-range dependencies in time series data by providing rich multi-scale contexts and particularly enhancing local contexts in the channel-independence situation. Our model, built upon a selective scan SSM called Mamba [11], serves as a core inference engine with a strong ability to capture long-range dependencies in MTS data while maintaining linear scalability and small memory footprints. The proposed model exploits the unique property of time series data in a bottom-up manner by producing contextual cues at two scales through consecutive resolution reduction or downsampling using linear mapping. The first level operates at a high resolution, while the second level works at a low resolution. At each level, we employ two Mamba modules to glean contextual cues from global perspectives for the channel-mixing case and from both global and local perspectives for the channel-independence case.

In summary, our major contributions are threefold:

• We develop an innovative model called TimeMachine that is the first to leverage purely SSM modules to capture long-term dependencies in multivariate time series data for context-aware prediction, with linear scalability and small memory footprints superior or comparable to linear models.
• Our model constitutes an innovative architecture that unifies the handling of channel-mixing and channel-independence situations with four SSM modules, exploiting potential between-channel correlations. Moreover, our model can effectively select contents for prediction against global and local contextual information at different scales in the MTS data.
• Experimentally, TimeMachine achieves superior performance in prediction accuracy, scalability, and memory efficiency. We extensively validate the model using standard benchmark datasets and perform rigorous ablation studies to demonstrate its effectiveness.

2 Related Works

Numerous methods for LTSF have been proposed, which can be grouped into three main categories: non-Transformer-based supervised approaches, Transformer-based supervised learning models, and self-supervised representation learning models.

Non-Transformer-based Supervised Approaches include classical methods like ARIMA, VARMAX, GARCH [5], and RNN [15], as well as deep learning-based methods that achieve SOTA performance using multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). MLP-based models, such as DLinear [34], TiDE [6], and RLinear [18], leverage the simplicity of linear structures to achieve linear complexity and scalability. CNN-based methods, such as TimesNet [30] and Scinet [19], utilize convolutional filters to extract valuable temporal features and model complex temporal dynamics. These approaches exhibit highly competitive performance, often comparable to or even occasionally outperforming more sophisticated Transformer-based models.

Transformer-based Supervised Learning Methods, such as iTransformer [21], PatchTST [23], Crossformer [36], FEDformer [38], Stationary [20], Flowformer [31], and Autoformer [29], have gained popularity for LTSF due to their superior accuracy. These methods convert time series to token sequences and leverage the self-attention mechanism to discover dependencies between arbitrary time steps, making them particularly effective for modeling complex temporal relationships. They may also exploit Transformers' ability to process data in parallel, enabling long-term dependency discovery sometimes with even linear scalability. Despite their distinctive advantages, these methods typically have quadratic time and memory complexity due to point-wise correlations in self-attention mechanisms.

Self-Supervised Representation Learning Models: Self-supervised learning has been leveraged to learn useful representations of MTS for downstream tasks, using non-Transformer-based models for time series [32, 9, 26, 33] and Transformer-based models such as the time series Transformer (TST) and TS-TCC [35, 7, 27]. Currently, Transformer-based self-supervised models have not yet achieved performance on par with supervised learning approaches [27]. This paper focuses on LTSF in a supervised learning setting.

3 Proposed Method

In this section, we describe each component of our proposed architecture and how we use our model to solve the LTSF problem. Assume a collection of MTS samples is given, denoted by dataset D, which comprises an input sequence x = [x_1, . . . , x_L], with each x_t ∈ R^M representing a vector of M channels at time point t. For matrices, we use bold font; for scalars and vectors, we use regular (non-bold) letters. The sequence length L is also known as the look-back window. The goal is to predict T future values, denoted by [x_{L+1}, . . . , x_{L+T}]. The architecture of our proposed model, referred to as TimeMachine, is depicted in Figure 1. The pillars of this architecture consist of four Mambas, which are employed in an integrated way to tap contextual cues from MTS. This design choice enables us to harness Mamba's robust capabilities of inferring sequential data for LTSF.
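To make the input-output convention above concrete, the following is a minimal sketch, written for this document and not taken from the TimeMachine repository, of how (look-back, horizon) pairs of length L and T can be cut from a multivariate series; the array name `series` and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def make_windows(series: np.ndarray, L: int, T: int):
    """Cut (look-back, horizon) pairs from a series of shape (time, M).

    Returns x of shape (num_samples, M, L) and y of shape (num_samples, M, T),
    matching the channels-first BML / BMT convention used in the paper.
    """
    total, M = series.shape
    xs, ys = [], []
    for start in range(total - L - T + 1):
        window = series[start:start + L]          # (L, M) look-back
        target = series[start + L:start + L + T]  # (T, M) future values
        xs.append(window.T)                       # -> (M, L)
        ys.append(target.T)                       # -> (M, T)
    return np.stack(xs), np.stack(ys)

# Example: a toy 7-channel series with L = 96 and T = 96
x, y = make_windows(np.random.randn(1000, 7), L=96, T=96)
print(x.shape, y.shape)  # (809, 7, 96) (809, 7, 96)
```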
Normalization: Before feeding the data to our model, we normalize the original MTS x into x^(0) = [x^(0)_1, · · · , x^(0)_L] ∈ R^{M×L} via x^(0) = Normalize(x). Here, Normalize(·) represents a normalization operation with two different options. The first is to use the reversible instance normalization (RevIN) [16], which is also adopted in PatchTST [23]. The second option is to employ regular Z-score normalization: x^(0)_{i,j} = (x_{i,j} − μ_j)/σ_j, where μ_j and σ_j are the mean and standard deviation for channel j, with j = 1, · · · , M. Empirically, we find that RevIN is often more helpful than Z-score. Apart from normalizing the data in the forward pass of our approach, in experiments we also follow the standardization process of the data used by the baseline methods for comparison.
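As a concrete illustration of the Z-score option above, the following is a minimal sketch under our own naming; a reversible scheme such as RevIN additionally learns affine parameters and reuses the stored statistics to de-normalize the forecast, which is only hinted at here.

```python
import torch

def zscore_normalize(x: torch.Tensor, eps: float = 1e-5):
    """Per-channel Z-score normalization of a batch of MTS.

    x: tensor of shape (B, M, L) -- batch, channels, look-back length.
    Returns the normalized tensor plus the per-channel statistics, which a
    reversible scheme such as RevIN would reuse to de-normalize the forecast.
    """
    mean = x.mean(dim=-1, keepdim=True)               # (B, M, 1)
    std = x.std(dim=-1, keepdim=True).clamp_min(eps)  # (B, M, 1)
    return (x - mean) / std, mean, std

def denormalize(y_norm: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    """Map a normalized forecast of shape (B, M, T) back to the original scale."""
    return y_norm * std + mean
```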
Figure 1: Schematic diagram of our proposed model, TimeMachine. Our method incorporates a configuration of four Mambas, with two specialized Mambas capable of processing the transposed tensor data in each branch. On the left, an example of the MTS is depicted, while the right side shows a detailed view of a Mamba's structure. Mambas are capable of accepting an input of shape BMn_i while providing an output of the same shape, where i ∈ {1, 2} in our method. (Legend symbols in the diagram: element-wise addition, concatenation, transposable path, dropout, SiLU activation, nonlinearity.)
Channel Mixing vs. Channel Independence: Our model can handle both the channel-independence and channel-mixing cases. In channel independence, each channel is processed independently by our model, while in channel mixing, the MTS sequence is processed with multiple channels combined throughout our architecture. Regardless of the case, our model accepts input of the shape BML, where B is the batch size, and produces the desired output of the shape BMT, eliminating the need for additional manual pre-processing.

Channel independence has been proven effective in reducing overfitting by PatchTST [23]. We found this strategy helpful for datasets with a smaller number of channels. However, we observe that for datasets with a number of channels comparable to the look-back length, channel mixing is more effective in capturing the correlations among channels and reaching the desired minimum loss during training.

Our architecture is robust and versatile, capable of benefiting from potentially strong inter-channel correlations in the channel-mixing case and exploiting independence in the channel-independence case. When dealing with channel independence, we reshape the input from BML to (B × M)1L after the normalization step, as sketched below. The reshaped input is then processed throughout the network and later merged to provide an output shape of BMT. In contrast, for channel mixing, no reshaping is necessary: the channels are kept together and processed throughout the network.
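The channel-independence reshaping is just a view change; a minimal PyTorch sketch of the BML ↦ (B×M)1L mapping and its inverse follows (our own illustration, with placeholder sizes and a placeholder model output).

```python
import torch

B, M, L, T = 32, 7, 96, 96
x = torch.randn(B, M, L)            # normalized input, shape BML

# Channel independence: fold the M channels into the batch dimension,
# so each channel becomes an independent univariate sequence of length L.
x_ci = x.reshape(B * M, 1, L)       # shape (B*M, 1, L)

# ... the network processes x_ci and produces per-channel forecasts ...
y_ci = torch.randn(B * M, 1, T)     # placeholder for the model output

# Merge back to the BMT layout expected downstream.
y = y_ci.reshape(B, M, T)
```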
Embedded Representations: Before processing the input sequence with Mambas, we provide two-stage embedded representations of the input sequence with length L by E1 and E2:

x^(1) = E1(x^(0)),   x^(2) = E2(DO(x^(1))),   (1)

where DO stands for the dropout operation, and the embedding operations E1: R^{M×L} → R^{M×n1} and E2: R^{M×n1} → R^{M×n2} are achieved through MLPs. Thus, for the channel-mixing case, the batch-formed tensors will have the following changes in size: BMn1 ← E1(BML), and BMn2 ← E2(BMn1). This enables us to deal with fixed-length tokens of n1 and n2 regardless of the variable input sequence length L; both n1 and n2 are configured to take values from the set {512, 256, 128, 64, 32}, satisfying n1 > n2. Since MLPs are fully connected, we introduce dropouts to reduce overfitting. Although we have the linear mappings (MLPs) before the Mambas, the performance of our model does not heavily rely on them, as demonstrated with the ablation study (see Section 5).
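A minimal sketch of this two-stage embedding follows. It is our own illustration of Eq. (1); the layer names E1 and E2 mirror the notation above, and the chosen n1, n2, and dropout values are placeholders.

```python
import torch
import torch.nn as nn

class TwoStageEmbedding(nn.Module):
    """E1: L -> n1 and E2: n1 -> n2, applied along the time dimension."""
    def __init__(self, L: int, n1: int = 256, n2: int = 128, dropout: float = 0.7):
        super().__init__()
        assert n1 > n2, "the paper configures n1 > n2"
        self.E1 = nn.Linear(L, n1)
        self.E2 = nn.Linear(n1, n2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x0: torch.Tensor):
        # x0: (B, M, L) normalized input
        x1 = self.E1(x0)                 # (B, M, n1), first stage of Eq. (1)
        x2 = self.E2(self.dropout(x1))   # (B, M, n2), second stage of Eq. (1)
        return x1, x2

emb = TwoStageEmbedding(L=96)
x1, x2 = emb(torch.randn(32, 7, 96))
print(x1.shape, x2.shape)  # torch.Size([32, 7, 256]) torch.Size([32, 7, 128])
```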
Integrated Quadruple Mambas: With the two processed embedded representations from E1 and E2, we can now learn more comprehensive representations by leveraging Mamba, a type of SSM with selective scan ability. At each embedding level, we employ a pair of Mambas to capture long-term dependencies within the look-back samples and provide sufficient local contexts. Denote the input to one of the four Mamba blocks by u, which is either DO(x^(1)) obtained after E1 and the subsequent dropout layer for the two outer Mambas, or DO(x^(2)) obtained after E2 and the subsequent dropout layer for the two inner Mambas (Figure 1). The input tensors may be reshaped per the channel-mixing or channel-independence cases as described.

Inside a Mamba block, two fully-connected layers in two branches calculate linear projections. The output of the linear mapping in the first branch passes through a 1D causal convolution and SiLU activation S(·) [8], then a structured SSM. The continuous-time SSM maps an input function or sequence u(t) to an output v(t) through a latent state h(t):

dh(t)/dt = A h(t) + B u(t),   v(t) = C h(t),   (2)

where h(t) is N-dimensional, with N also known as the state expansion factor, u(t) is D-dimensional, with D being the dimension factor for an input token, v(t) is an output of dimension D, and A, B, and C are coefficient matrices of proper sizes. This dynamic system induces a discrete SSM governing state evolution and outputs given the input token sequence through time sampling at {kΔ}. Here, Δ is a time interval for discretizing the dynamic system. In particular, Mamba makes Δ a function of the input, and hence so are the model coefficients (A, B, C) and hidden state, thereby adapting the model dynamics to the input and enhancing context selectivity. Consequently, this discrete SSM is

h_k = Ā h_{k−1} + B̄ u_k,   v_k = C h_k,   (3)

where h_k, u_k, and v_k are respectively samples of h(t), u(t), and v(t) at time kΔ, and

Ā = exp(ΔA),   B̄ = (ΔA)^{−1}(exp(ΔA) − I) ΔB.   (4)
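Equation (4) is the standard zero-order-hold discretization of Eq. (2). The derivation is not spelled out in the paper, but it follows in two steps (a sketch added here for completeness, under the standard assumption that u(t) is held constant over each sampling interval of length Δ): integrating Eq. (2) over one interval gives

h(kΔ) = exp(ΔA) h((k−1)Δ) + ( ∫_0^Δ exp(sA) ds ) B u_k,

and since ∫_0^Δ exp(sA) ds = A^{−1}(exp(ΔA) − I), we obtain Ā = exp(ΔA) and B̄ = A^{−1}(exp(ΔA) − I) B = (ΔA)^{−1}(exp(ΔA) − I) ΔB, which is exactly Eq. (4).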
For SSMs, a diagonal A is often used. Mamba makes B, C, and Δ linear time-varying functions dependent on the input. In particular, for a token u, B, C ← Linear_N(u), and Δ ← softplus(parameter + Linear_D(Linear_1(u))), where Linear_p(u) is a linear projection to a p-dimensional space and softplus is the activation function. Furthermore, Mamba also has an option to expand the model dimension factor D by a controllable dimension expansion factor E. Such coefficient matrices enable context and input selectivity properties [11] to selectively propagate or forget information along the input token sequence based on the current token. Subsequently, the SSM output is multiplicatively modulated with the output from the second branch before another fully connected projection. The second branch simply consists of a linear mapping followed by a SiLU.
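Putting the two branches together, the data flow inside one Mamba block can be sketched as follows. This is our own simplified, pseudocode-style rendering of the selective-scan block described above, not the actual Mamba implementation [11], which uses a fused parallel scan and the exact ZOH update for B̄; here a sequential loop and an Euler-style B̄ are used for clarity, and all module names and sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMambaBlock(nn.Module):
    """Illustrative selective-SSM block: two branches, causal conv, gating."""
    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 2):
        super().__init__()
        self.in_proj_x = nn.Linear(d_model, d_model)   # first branch
        self.in_proj_z = nn.Linear(d_model, d_model)   # second (gating) branch
        self.conv = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1,
                              groups=d_model)          # 1D causal convolution
        self.B_proj = nn.Linear(d_model, d_state)      # B <- Linear_N(u)
        self.C_proj = nn.Linear(d_model, d_state)      # C <- Linear_N(u)
        self.dt_proj = nn.Linear(d_model, d_model)     # Delta <- softplus(...)
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))  # diagonal A
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, u):                               # u: (B, tokens, d_model)
        bsz, n, d = u.shape
        x = F.silu(self.conv(self.in_proj_x(u).transpose(1, 2))[..., :n].transpose(1, 2))
        z = F.silu(self.in_proj_z(u))                   # gate branch: linear + SiLU
        A = -torch.exp(self.A_log)                      # negative diagonal A
        dt = F.softplus(self.dt_proj(x))                # input-dependent step size
        Bm, Cm = self.B_proj(x), self.C_proj(x)         # input-dependent B, C
        h = torch.zeros(bsz, d, A.shape[1], device=u.device)
        outs = []
        for k in range(n):                              # discrete recurrence, Eq. (3)
            dA = torch.exp(dt[:, k].unsqueeze(-1) * A)              # \bar{A}
            dB = dt[:, k].unsqueeze(-1) * Bm[:, k].unsqueeze(1)     # \bar{B} (Euler-style)
            h = dA * h + dB * x[:, k].unsqueeze(-1)
            outs.append((h * Cm[:, k].unsqueeze(1)).sum(-1))        # v_k = C h_k
        v = torch.stack(outs, dim=1)
        return self.out_proj(v * z)                     # gated, then projected

y = SimplifiedMambaBlock(d_model=64)(torch.randn(2, 7, 64))
print(y.shape)  # torch.Size([2, 7, 64])
```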
The processed embedded representation with tensor size BMn1 is transformed by the outer Mambas, while that with BMn2 is transformed by the inner Mambas, as depicted in Figure 1. For the channel-mixing case, the whole univariate sequence of each channel is used as a token with dimension factor n2 for the inner Mambas. The outputs from the left-side and right-side inner Mambas, v_{L,k}, v_{R,k} ∈ R^{n2}, are element-wise added with x^(2)_k to obtain x^(3)_k for the k-th token, k = 1, · · · , M. That is, by denoting v_L = [v_{L,1}, · · · , v_{L,M}]^T ∈ R^{M×n2} and similarly v_R ∈ R^{M×n2}, we have x^(3) = v_L ⊕ v_R ⊕ x^(2), with ⊕ being element-wise addition. Then, x^(3) is linearly mapped to x^(4) with P1: x^(3) → x^(4) ∈ R^{M×n1}. Similarly, the outputs from the outer Mambas, v*_{L,k}, v*_{R,k} ∈ R^{n1}, are element-wise added to obtain x^(5) ∈ R^{M×n1}.

For the channel-independence case, the input is reshaped, BML ↦ (B × M)1L, and the embedded representations become (B×M)1n1 and (B×M)1n2. Here, the batch size becomes B×M, and we regard each sequence of length L as independent from the others. One Mamba in each pair of outer Mambas or inner Mambas considers the input dimension as 1 and the token length as n1 or n2, while the other Mamba learns with input dimension n2 or n1 and token length 1. This design enables learning both global context and local context simultaneously. The outer and inner pairs of Mambas extract salient features and context cues at fine and coarse scales with high and low resolution, respectively.

Channel mixing is performed when the datasets contain a significantly large number of channels, in particular when the look-back L is comparable to the channel number M, taking the whole sequence as a token to better provide context cues. All four Mambas are used to capture the global context of the sequences at different scales and to learn from the channel correlations. This helps stabilize the training and reduce overfitting with large M. To switch between the channel-independence and channel-mixing cases, the input sequence is simply transposed, with one Mamba in the outer Mamba pair and one in the inner pair processing the transposed input, as demonstrated in Figure 1. These integrated Mamba blocks empower our model for content-dependent feature extraction and reasoning with long-range dependencies and feature interactions.

Output Projection: After receiving the output tokens from the Mambas, our goal is to project these tokens to generate predictions with the desired sequence length. To accomplish this task, we utilize two MLPs, P1 and P2, which output n1 and T time points, respectively, with each point having M channels. Specifically, projector P1 performs a mapping R^{M×n2} → R^{M×n1}, as discussed above for obtaining x^(4). Subsequently, projector P2 performs a mapping R^{M×2n1} → R^{M×T}, transforming the concatenated output from the Mambas into the final predictions. The use of a two-stage output projection via P1 and P2 symmetrically aligns with the two-stage embedded representation obtained through E1 and E2.

In addition to the token transformation, we also employ residual connections. One residual connection is added before P1, and another is added after P1. The effectiveness of these residual connections is verified by experimental results (see Supplementary Table 1). Residual connections are indicated by arrows and element-wise addition in our method (Figure 1).

To retain the information of both the outer and inner pairs of Mambas, we concatenate their representations before processing via P2. In summary, we concatenate the outputs of the four Mambas with a skip connection to have x^(6) = x^(5) ∥ (x^(4) ⊕ x^(1)), where ∥ denotes concatenation. Finally, the output y is obtained by applying P2 to x^(6), i.e., y = P2(x^(6)).
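The channel-mixing data flow of the integrated quadruple-Mamba branch, from the embeddings to y = P2(x^(6)), can be summarized at the level of tensor shapes as follows. This is a schematic sketch with placeholder modules written for this document; the `Mamba*` members stand in for the four blocks via nn.Identity so the sketch stays runnable, and it is not the actual TimeMachine code.

```python
import torch
import torch.nn as nn

class TimeMachineSketch(nn.Module):
    """Shape-level sketch of the channel-mixing forward pass (illustrative only)."""
    def __init__(self, L, T, n1=256, n2=128, dropout=0.7, mamba_cls=nn.Identity):
        super().__init__()
        self.E1, self.E2 = nn.Linear(L, n1), nn.Linear(n1, n2)
        self.do1, self.do2 = nn.Dropout(dropout), nn.Dropout(dropout)
        # Four Mamba blocks; nn.Identity keeps the sketch runnable without mamba_ssm.
        self.outer_L, self.outer_R = mamba_cls(), mamba_cls()
        self.inner_L, self.inner_R = mamba_cls(), mamba_cls()
        self.P1, self.P2 = nn.Linear(n2, n1), nn.Linear(2 * n1, T)

    def forward(self, x0):                      # x0: (B, M, L), normalized
        x1 = self.E1(x0)                        # (B, M, n1)
        x2 = self.E2(self.do1(x1))              # (B, M, n2)
        u1, u2 = self.do1(x1), self.do2(x2)     # inputs to outer / inner Mambas
        x3 = self.inner_L(u2) + self.inner_R(u2) + x2    # x(3) = vL (+) vR (+) x(2)
        x4 = self.P1(x3)                        # (B, M, n1)
        x5 = self.outer_L(u1) + self.outer_R(u1)         # (B, M, n1)
        x6 = torch.cat([x5, x4 + x1], dim=-1)   # x(6) = x(5) || (x(4) (+) x(1))
        return self.P2(x6)                      # (B, M, T)

y = TimeMachineSketch(L=96, T=720)(torch.randn(8, 21, 96))
print(y.shape)  # torch.Size([8, 21, 720])
```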
Table 1: Overview of the characteristics of the benchmark datasets used. Time points indicate the total length of each dataset.

Dataset (D)    Channels (M)   Time Points   Frequency
Weather        21             52696         10 Minutes
Traffic        862            17544         Hourly
Electricity    321            26304         Hourly
ETTh1          7              17420         Hourly
ETTh2          7              17420         Hourly
ETTm1          7              69680         15 Minutes
ETTm2          7              69680         15 Minutes

4 Result Analysis

In this segment, we present the main results of our experiments on widely recognized benchmark datasets for long-term MTS forecasting. We also conduct extensive ablation studies to demonstrate the effectiveness of each component of our method.

4.1 Datasets

We evaluate our model on seven benchmark datasets extensively used for LTSF: Weather, Traffic, Electricity, and four ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). Table 1 illustrates the relevant statistics of these datasets, highlighting that the Traffic and Electricity datasets are notably large, with 862 and 321 channels, respectively, and tens of thousands of temporal points in each sequence. More details on these datasets can be found in Wu et al. [29] and Zhou et al. [37]. Focusing on long-term forecasting, we exclude the ILI dataset, which has a shorter temporal horizon, similar to Das et al. [6]. We demonstrate the superiority of our model in two parts: quantitative (main results) and qualitative results. For a fair comparison, we used the code from PatchTST [23] (https://github.com/yuqinie98/PatchTST) and iTransformer [21] (https://github.com/thuml/iTransformer), including the normalization and evaluation protocol used by them, and we took the results for the baseline methods from iTransformer [21].

4.2 Experimental Environment

All experiments were conducted using the PyTorch framework [24] with 4× NVIDIA V100 GPUs (32 GB each). The model was optimized using the ADAM algorithm [17] with the L2 loss. The batch size varied depending on the dataset, but the training was consistently set to 100 epochs. We measure the prediction errors using the mean squared error (MSE) and mean absolute error (MAE) metrics, where smaller values indicate better prediction accuracy.
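For reference, the training objective and the two error metrics reduce to a few lines; the sketch below uses a placeholder model and an illustrative learning rate, since the paper only fixes the optimizer, the L2 loss, and the 100-epoch budget.

```python
import torch

def mse(pred, true):   # L2 loss / mean squared error
    return torch.mean((pred - true) ** 2)

def mae(pred, true):   # mean absolute error
    return torch.mean(torch.abs(pred - true))

model = torch.nn.Linear(96, 96)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):                             # training fixed to 100 epochs
    pred, true = model(torch.randn(8, 96)), torch.randn(8, 96)
    loss = mse(pred, true)                           # optimized with the L2 loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```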
Baseline Models: We compared our model, TimeMachine, with 11 SOTA models, including iTransformer [21], PatchTST [23], DLinear [34], RLinear [18], Autoformer [29], Crossformer [36], TiDE [6], Scinet [19], TimesNet [30], FEDformer [38], and Stationary [20]. Although another variant of SSMs, namely S4 [12], exists, we did not include it in our comparison because TiDE [6] has already demonstrated superior performance over S4. Similarly, as Flowformer [31] is not as competitive as iTransformer and other Transformer-based models, following [21], we did not include it.

4.3 Quantitative Results

We demonstrate TimeMachine's performance on supervised long-term forecasting tasks in Table 2. Following the protocol used in iTransformer [21], we fixed L = 96 and T = {96, 192, 336, 720} for all baselines, including our method. For all results achieved by our model, we utilized the training-related values mentioned in Subsection 4.2. In addition to the training hyperparameters, we set default values for all Mambas: dimension factor D = 256, local convolutional width = 2, and state expand factor N = 1. We provide an experimental justification for these parameters in Section 5 (see also the configuration sketch below). Table 2 clearly shows that our method demonstrates superior performance compared to all the strong baselines on almost all datasets. Moreover, iTransformer [21] has significantly better performance than the other baselines on the Traffic and Electricity datasets, which contain a large number of channels. Our method also demonstrates comparable or superior performance on these two datasets, outperforming the existing strong baselines by a large margin. This demonstrates the effectiveness of our method in handling LTSF tasks with varying numbers of channels and datasets.
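If the official mamba_ssm package were used as the backbone, these per-block defaults would roughly map onto its constructor as follows. This is our own illustrative mapping of the paper's symbols onto that package's d_model/d_state/d_conv/expand arguments, not the actual TimeMachine configuration; note also that Section 5.5 reports N = 256 as the default behind Tables 2 and 3, so the state size below should be treated as a tunable assumption.

```python
# Hypothetical instantiation of one of the four blocks (requires: pip install mamba-ssm).
from mamba_ssm import Mamba

block = Mamba(
    d_model=256,  # dimension factor D (Section 4.3 default)
    d_state=256,  # state expansion factor N, ablated from 8 to 256 in Section 5.5
    d_conv=2,     # local convolutional width; Section 5.4 finds 2 better than 4
    expand=1,     # dimension expansion factor E, set to 1 by default (Section 5.6)
)
```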


In addition to Table 2, we conducted experiments with TimeMachine using different look-back windows L = {192, 336, 720}. Table 3 and Supplementary Table 2 demonstrate TimeMachine's performance under these settings. An examination of these tables reveals that extended look-back windows markedly enhance the performance of our method across the majority of the datasets examined. This also demonstrates TimeMachine's capability for handling longer look-back windows while maintaining consistent performance.

Figure 2: Average MSE comparison of TimeMachine and SOTA baselines (TimeMachine, PatchTST, iTransformer, RLinear) with L = 96 across the Weather, Traffic, Electricity, ETTh1, ETTh2, ETTm1, and ETTm2 datasets. The circle center represents the maximum possible error; closer to the boundary indicates better performance.

Following iTransformer [21], Figure 2 demonstrates the normalized percentage gain of TimeMachine with respect to three other SOTA methods, indicating a clear improvement over the strong baselines. In addition to the general performance comparison using MSE and MAE metrics, we also compare the memory footprints and scalability of our method against other baselines in Figure 4. We measured the GPU memory utilization of our method and compared it against other baselines, with their results taken from the iTransformer [21] paper. To ensure a fair comparison, we also included Flowformer [31] and the vanilla Transformer [28], and set the experimental settings for our method similar to those of iTransformer. The results clearly show very small memory footprints compared to the SOTA baselines. Specifically, for Traffic, our method consumes a very similar amount of memory to the DLinear [34] method. Moreover, our method is capable of handling longer look-back windows with a relatively linear increase in the number of learnable parameters, as demonstrated in Supplementary Figure 4 for two datasets. This is due to the robustness of our method, where E1 is the only component dependent on the input sequence length L, while the rest of the network is relatively independent of L, leading to a highly scalable model.

Figure 4: Memory footprint (in GB) for (a) Traffic (with 862 channels) and (b) Weather (with 21 channels), following iTransformer [21]; methods compared include TimeMachine, DLinear, TiDE, PatchTST, iTransformer, Crossformer, Flowformer, and the vanilla Transformer.

Figure 3: Qualitative comparison between TimeMachine and the second-best method (Table 2) on a test set example with L = 96, T = 720, and a randomly selected channel, for (a) Electricity and (b) Traffic (best viewed zoomed in).

4.4 Qualitative Results

Figure 3 and Supplementary Figure 2 demonstrate TimeMachine's effectiveness in visual comparison. It is evident that TimeMachine can follow the actual trend in the predicted future time horizon for the test set. In the case of the Electricity dataset, there is a clear difference between the performance of TimeMachine and iTransformer. For the Traffic dataset, although both iTransformer's and TimeMachine's predictions align with the ground truth, in the range approximately between time points 75 and 90, TimeMachine is more closely aligned with the ground truth compared to iTransformer. For better visualization, we demonstrate a window of 100 predicted time points.
Table 2: Results in MSE and MAE (the lower the better) for the long-term forecasting task. We compare extensively with baselines under
different prediction lengths, T = {96, 192, 336, 720} following the setting of iTransformer [21]. The length of the input sequence (L) is set
to 96 for all baselines. The best results are in bold and the second best are underlined.
Methods→ TimeMachine iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear SCINet FEDformer Stationary Autoformer
D T MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather
96 0.164 0.208 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 0.173 0.223 0.266 0.336

192 0.211 0.250 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 0.245 0.285 0.307 0.367
336 0.256 0.290 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 0.321 0.338 0.359 0.395
720 0.342 0.343 0.358 0.349 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 0.414 0.410 0.419 0.428
Traffic
96 0.397 0.268 0.395 0.268 0.649 0.389 0.544 0.359 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.788 0.499 0.587 0.366 0.612 0.338 0.613 0.388

192 0.417 0.274 0.417 0.276 0.601 0.366 0.540 0.354 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.789 0.505 0.604 0.373 0.613 0.340 0.616 0.382
336 0.433 0.281 0.433 0.283 0.609 0.369 0.551 0.358 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.797 0.508 0.621 0.383 0.618 0.328 0.622 0.337
720 0.467 0.300 0.467 0.302 0.647 0.387 0.586 0.375 0.589 0.328 0.719 0.449 0.640 0.350 0.645 0.394 0.841 0.523 0.626 0.382 0.653 0.355 0.660 0.408
Electricity

96 0.142 0.236 0.148 0.240 0.201 0.281 0.195 0.285 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 0.169 0.273 0.201 0.317
192 0.158 0.250 0.162 0.253 0.201 0.283 0.199 0.289 0.231 0.322 0.236 0.330 0.184 0.289 0.196 0.285 0.257 0.355 0.201 0.315 0.182 0.286 0.222 0.334
336 0.172 0.268 0.178 0.269 0.215 0.298 0.215 0.305 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 0.200 0.304 0.231 0.338
720 0.207 0.298 0.225 0.317 0.257 0.331 0.256 0.337 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 0.222 0.321 0.254 0.361
ETTh1
96 0.364 0.387 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 0.513 0.491 0.449 0.459

192 0.415 0.416 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 0.534 0.504 0.500 0.482
336 0.429 0.421 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 0.588 0.535 0.521 0.496
720 0.458 0.453 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 0.643 0.616 0.514 0.512
ETTh2
96 0.275 0.334 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 0.476 0.458 0.346 0.388

192 0.349 0.381 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 0.512 0.493 0.456 0.452
336 0.340 0.381 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 0.552 0.551 0.482 0.486
720 0.411 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 0.562 0.560 0.515 0.511
ETTm1
96 0.317 0.355 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 0.386 0.398 0.505 0.475

192 0.357 0.378 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 0.459 0.444 0.553 0.496
336 0.379 0.399 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 0.495 0.464 0.621 0.537
720 0.445 0.436 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 0.585 0.516 0.671 0.561
ETTm2
96 0.175 0.256 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 0.192 0.274 0.255 0.339

192 0.239 0.299 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 0.280 0.339 0.281 0.340
336 0.287 0.332 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 0.334 0.361 0.339 0.372
720 0.371 0.385 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 0.417 0.413 0.433 0.432

Table 3: Results for the long-term forecasting task with varying L = {192, 336, 720} and T = {96, 192, 336, 720}.

Prediction (T)→          96           192          336          720
D            L           MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
Traffic      192         0.362 0.252  0.386 0.262  0.402 0.270  0.431 0.288
Traffic      336         0.355 0.249  0.378 0.259  0.391 0.266  0.418 0.283
Traffic      720         0.348 0.249  0.364 0.255  0.376 0.263  0.410 0.281
Electricity  192         0.135 0.230  0.167 0.258  0.176 0.269  0.213 0.302
Electricity  336         0.133 0.225  0.160 0.255  0.172 0.268  0.211 0.303
Electricity  720         0.133 0.225  0.160 0.257  0.167 0.269  0.204 0.300
ETTm2        192         0.170 0.252  0.230 0.294  0.273 0.325  0.351 0.376
ETTm2        336         0.165 0.254  0.223 0.291  0.264 0.323  0.345 0.375
ETTm2        720         0.163 0.253  0.222 0.295  0.265 0.325  0.336 0.376

5 Hyperparameter Sensitivity Analysis and Ablation Study

In this section, we conducted experiments on various hyperparameters, including training and method-specific parameters. For each parameter, we provide experimental justification based on the achieved results. While conducting an ablation experiment on a parameter, the other parameters were kept fixed at their default values, ensuring a clear justification for that specific parameter.

5.1 Effect of MLPs' Parameters (n1, n2)

As demonstrated in Figure 1, we have two stages of compression with two MLPs E1, E2 of output dimensions n1 and n2, respectively, and P1 performing an expansion by converting n2 → n1. Since several strong baseline methods, e.g., DLinear, leverage mainly MLPs, we aim at understanding the effect of MLPs on performance. To this end, we explored 10 different combinations from {512, 256, 128, 64, 32} (enumerated below) and demonstrate the performance with MSE for two datasets (ETTh1, ETTh2) in Figure 5. These figures show that our method is not heavily dependent on the MLPs. Rather, we can see more improvement with very small MLPs for T = 720 with the ETTh1 dataset and mostly stable performance on the ETTh2 dataset.
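For completeness, the 10 (n1, n2) combinations with n1 > n2 drawn from {512, 256, 128, 64, 32} are simply the pairwise combinations of that set; a two-line illustration:

```python
from itertools import combinations

pairs = list(combinations([512, 256, 128, 64, 32], 2))  # (n1, n2) with n1 > n2
print(len(pairs), pairs)
# 10 [(512, 256), (512, 128), (512, 64), (512, 32), (256, 128), (256, 64),
#     (256, 32), (128, 64), (128, 32), (64, 32)]
```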
5.2 Sensitivity of Dropouts

In our model (Figure 1), we include two dropouts after processing the signals from E1 and E2. These dropouts are necessary, especially for datasets with a small number of channels, e.g., the ETT datasets. Supplementary Figure 1 shows the effect of dropouts on both the ETTh1 and ETTh2 datasets. As expected, dropout rates that are too low or too high are not helpful. To maintain balance, we set the dropout rate to 0.7 for both datasets while tuning other variations for the rest.

5.3 Ablation of Residual Connections

Studies have shown the effectiveness of residual connections, including in models using SSMs [1] and CNNs [14]. In this section, we justify the two residual connections in our architecture: one from E2 to the output of the two inner Mambas, and the other from E1 to the output of P1. Both of them use element-wise additions and help stabilize training and reduce overfitting, especially for the smaller datasets with channel independence. Supplementary Table 1 provides experimental justification, where the Res. column indicates the presence (✓) or absence (✗) of residual connections.
We observe a clear improvement on both datasets when residual connections are used. This motivated us to include residual connections in our architecture, and all results presented in Tables 2 and 3 incorporate these connections.

Figure 5: MSE comparison for combinations of n1 and n2 with input sequence length L = 96 on the (a) ETTh1 and (b) ETTh2 datasets, for T = {96, 192, 336, 720}.

5.4 Effects of Mambas' Local Convolutional Width

In addition to experimenting with the different components of our architecture (Figure 1), we also investigated the effectiveness of Mamba parameters. For example, we tested two variations of the local convolutional kernel width (2 and 4) for the Mambas and found that a kernel width of 2 yields more promising results compared to 4. Therefore, we set the default kernel width to 2 for all datasets and Mambas.

Table 4: Ablation results on the local convolution width with L = 96.

Prediction (T)→       96           192          336          720
D        d_conv       MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
ETTh1    4            0.365 0.389  0.419 0.418  0.439 0.424  0.465 0.457
ETTh1    2            0.364 0.387  0.415 0.416  0.429 0.421  0.458 0.453
ETTh2    4            0.275 0.333  0.347 0.383  0.350 0.382  0.411 0.433
ETTh2    2            0.275 0.334  0.349 0.381  0.340 0.381  0.411 0.433

5.5 Ablation on State Expansion Factor of Mambas

The SSM state expansion factor (N) is another crucial parameter of Mamba. We ablate this parameter from a very small value of 8 up to the highest possible value of 256. Figure 6 demonstrates the effect of this expansion factor while keeping all other parameters fixed. With a higher state expansion factor, there is a certain chance of performance improvement for varying prediction lengths. Therefore, we set N = 256 as the default value for all datasets, and the results in Tables 2 and 3 contain TimeMachine's performance with this default value.

Figure 6: MSE versus the state expansion factor (N) with input sequence length L = 96, for the (a) ETTh1 and (b) ETTh2 datasets and T = {96, 192, 336, 720}.

5.6 Ablation on Mamba Dimension Expansion Factor

We also experimented with the dimension expansion factor (E) of the Mambas, which is used to expand the input dimension, with results shown in Supplementary Figure 3. Increasing the block expansion factor does not lead to consistent improvements in performance. Instead, higher expansion factors come with a heavy cost in memory and training time. Therefore, we set this value to 1 by default in all Mambas and report the results in Tables 2 and 3.

In addition to these sensitivity analyses, we also demonstrate a performance comparison between 1 and 2 levels in Supplementary Table 3. Considering a balance between performance and memory footprint, we used two levels.

6 Strengths and Limitations

TimeMachine outperforms numerous baselines, including Transformer-based methods, across benchmark datasets and additionally demonstrates memory efficiency and stable performance across varying look-back and prediction lengths. Unlike Transformer-based methods that have quadratic complexity, our method has linear complexity. While TimeMachine achieves top-ranked performance in most cases, it ranks second on the Weather dataset for small T, highlighting an area for future improvement. Moreover, as shown in Figure 3, there is potential for enhancing alignment with the ground truth.

7 Conclusion

This paper introduces TimeMachine, a novel model that captures long-term dependencies in multivariate time series data while maintaining linear scalability and small memory footprints. By leveraging an integrated quadruple-Mamba architecture to predict with rich global and local contextual cues at multiple scales, TimeMachine unifies channel-mixing and channel-independence situations, enabling accurate long-term forecasting. Extensive experiments demonstrate the model's superior performance in accuracy, scalability, and memory efficiency compared to state-of-the-art methods. Future work will explore TimeMachine's application in a self-supervised learning setting.
Acknowledgements

This research is supported in part by the NSF under Grants 2327113 and 2433190 and the NIH under Grants R21AG070909, P30AG072946, and R01HD101508-01. We would like to thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Lipscomb Compute Cluster and associated research computing resources.

References

[1] M. A. Ahamed and Q. Cheng. MambaTab: A simple yet effective approach for handling tabular data. arXiv preprint arXiv:2401.08867, 2024.
[2] A. Ali, I. Zimerman, and L. Wolf. The hidden attention of mamba models. arXiv preprint arXiv:2403.01590, 2024.
[3] A. Behrouz and F. Hashemi. Graph mamba: Towards learning on graphs with state space models. arXiv preprint arXiv:2402.08678, 2024.
[4] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[5] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[6] A. Das, W. Kong, A. Leach, S. K. Mathur, R. Sen, and R. Yu. Long-term forecasting with TiDE: Time-series dense encoder. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=pCbC3aQB5W.
[7] E. Eldele, M. Ragab, Z. Chen, M. Wu, C. K. Kwoh, X. Li, and C. Guan. Time-series representation learning via temporal and contextual contrasting. arXiv preprint arXiv:2106.14112, 2021.
[8] S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2017.12.012. URL https://www.sciencedirect.com/science/article/pii/S0893608017302976. Special issue on deep reinforcement learning.
[9] J.-Y. Franceschi, A. Dieuleveut, and M. Jaggi. Unsupervised scalable representation learning for multivariate time series. Advances in Neural Information Processing Systems, 32, 2019.
[10] D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Re. Hungry hungry hippos: Towards language modeling with state space models. In International Conference on Learning Representations, 2022.
[11] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[12] A. Gu, K. Goel, and C. Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
[13] A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and C. Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34:572–585, 2021.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] T. Kim, J. Kim, Y. Tae, C. Park, J.-H. Choi, and J. Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=cGDAkQo1C0p.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Z. Li, S. Qi, Y. Li, and Z. Xu. Revisiting long-term time series forecasting: An investigation on linear mapping. arXiv preprint arXiv:2305.10721, 2023.
[19] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, and Q. Xu. SCINet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828, 2022.
[20] Y. Liu, H. Wu, J. Wang, and M. Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 35:9881–9893, 2022.
[21] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long. iTransformer: Inverted transformers are effective for time series forecasting. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=JePfAI8fah.
[22] J. Ma, F. Li, and B. Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[23] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Jbdc0vTOcol.
[24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[25] Y. Schiff, C.-H. Kao, A. Gokaslan, T. Dao, A. Gu, and V. Kuleshov. Caduceus: Bi-directional equivariant long-range DNA sequence modeling. arXiv preprint arXiv:2403.03234, 2024.
[26] S. Tonekaboni, D. Eytan, and A. Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. arXiv preprint arXiv:2106.00750, 2021.
[27] P. Trirat, Y. Shin, J. Kang, Y. Nam, J. Na, M. Bae, J. Kim, B. Kim, and J.-G. Lee. Universal time-series representation learning: A survey. arXiv preprint arXiv:2401.03717, 2024.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[29] H. Wu, J. Xu, J. Wang, and M. Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.
[30] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In International Conference on Learning Representations, 2022.
[31] H. Wu, J. Wu, J. Xu, J. Wang, and M. Long. Flowformer: Linearizing transformers with conservation flows. In International Conference on Machine Learning, pages 24226–24242. PMLR, 2022.
[32] L. Yang and S. Hong. Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. In International Conference on Machine Learning, pages 25038–25054. PMLR, 2022.
[33] Z. Yue, Y. Wang, J. Duan, T. Yang, C. Huang, Y. Tong, and B. Xu. TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8980–8987, 2022.
[34] A. Zeng, M. Chen, L. Zhang, and Q. Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
[35] G. Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, and C. Eickhoff. A transformer-based framework for multivariate time series representation learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2114–2124, 2021.
[36] Y. Zhang and J. Yan. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In International Conference on Learning Representations, 2022.
[37] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
[38] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
Supplementary Materials
Table 1: Ablation experiment on the residual connections with input sequence length L = 96 and T = {96, 192, 336, 720}.

Prediction (T)→      96           192          336          720
D        Res.        MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
ETTh1    ✗           0.366 0.395  0.423 0.425  0.430 0.427  0.474 0.462
ETTh1    ✓           0.364 0.387  0.415 0.416  0.429 0.421  0.458 0.453
ETTh2    ✗           0.281 0.337  0.347 0.386  0.352 0.383  0.415 0.435
ETTh2    ✓           0.275 0.334  0.349 0.381  0.340 0.381  0.411 0.433

Figure 1: Performance (MSE) comparison concerning a diverse range of dropout rates (0.0 to 0.9) with input sequence length L = 96, for the (a) ETTh1 and (b) ETTh2 datasets and T = {96, 192, 336, 720}.

Figure 2: Qualitative comparison between TimeMachine and the second-best-performing methods from Table 2 (PatchTST). Visualization is provided for the test set with L = 96 and T = 720, with a randomly selected channel and a window frame of 100 time points, for (a) ETTm1 and (b) ETTm2.

Table 2: Results for the long-term forecasting task with varying input sequence length L = {192, 336, 720} and T = {96, 192, 336, 720}.

Prediction (T)→      96           192          336          720
D        L           MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
Weather  192         0.155 0.204  0.198 0.243  0.241 0.281  0.327 0.336
Weather  336         0.151 0.201  0.192 0.240  0.236 0.278  0.318 0.334
Weather  720         0.151 0.203  0.195 0.246  0.239 0.285  0.321 0.340
ETTh1    192         0.365 0.386  0.415 0.413  0.406 0.417  0.447 0.459
ETTh1    336         0.360 0.387  0.398 0.410  0.386 0.411  0.443 0.457
ETTh1    720         0.363 0.395  0.402 0.418  0.396 0.420  0.468 0.476
ETTh2    192         0.274 0.334  0.340 0.379  0.327 0.378  0.402 0.432
ETTh2    336         0.267 0.334  0.324 0.375  0.316 0.375  0.392 0.429
ETTh2    720         0.260 0.332  0.314 0.372  0.316 0.377  0.394 0.433
ETTm1    192         0.286 0.337  0.331 0.365  0.354 0.384  0.421 0.421
ETTm1    336         0.286 0.337  0.328 0.364  0.355 0.381  0.408 0.413
ETTm1    720         0.289 0.344  0.334 0.369  0.357 0.382  0.416 0.413

Figure 3: Comparative analysis for the expansion factor (E = 1 to 10) on the (a) ETTh1 and (b) ETTh2 datasets with T = {96, 192, 336, 720}.

Table 3: Ablation experiment on the level of TimeMachine with input sequence length L = 96 and T = {96, 192, 336, 720}.

Prediction (T)→         96           192          336          720
D            Level      MSE   MAE    MSE   MAE    MSE   MAE    MSE   MAE
ETTh1        1          0.367 0.393  0.420 0.418  0.437 0.424  0.460 0.455
ETTh1        2          0.364 0.387  0.415 0.416  0.429 0.421  0.458 0.453
Electricity  1          0.143 0.237  0.164 0.256  0.175 0.271  0.212 0.303
Electricity  2          0.142 0.236  0.158 0.250  0.172 0.268  0.207 0.298

Figure 4: Scalability in terms of learnable parameters with respect to the look-back window (L = 96 to 1440) for the (a) ETTh2 and (b) Weather datasets.
