Skip to content

Latest commit

 

History

History
650 lines (526 loc) · 23.1 KB

File metadata and controls

650 lines (526 loc) · 23.1 KB

A Time Series Apologia

2023-05-07

John Mount Win Vector LLC

Introduction

There is a fear that one can not be seen to be using standard regression methods for time series problems.

The motivating worry is that in time series problems the error terms are likely correlated, so one can no longer appeal to the Gauss Markov Theorem to be assured of good out of sample performance (in the style of Bounding Excess Generalization Error for Linear Regression Models). Another motivation is the absolute beauty of the germinal time series reference: Box, Jenkins, Reinsel, Time Series Analysis, Wiley 2008 (all editions are good, this is just the edition I have at hand). So there is a temptation to “leave it all to the package” and not be seen to be tinkering.

We are going to demonstrate some standard time series methods. We will call out an important, but under-taught, use of MA terms in ARIMA. We will also demonstrate one can, in some cases, safely treat time series problems as standard regression problems. The out of sample performance of regression can be quite good, even for time series problems.

This note is, unfortunately, going to be long. However I think it is very much worth the effort. I would like to encourage data scientists to feel safe trying a variety of methods on time series forecasting problems. To that end please think of this note as an attempt at a vindication in the sense of Borges’ “The Library of Babel”:

At that period there was much talk of The Vindications- books of apologia and prophecies that would vindicate for all the actions of every person in the universe and that held wondrous arcana for men’s futures.

An Example

Let’s work an example of time series forecasting in R.

First we attach our packages.

library(ggplot2)
library(Hmisc)
library(wrapr)
library(forecast)

Now let’s generate some synthetic data to define an example problem.

w <- 2 * pi / 20
d <- data.frame(
  x = seq(150)
)

d$y_ideal <- sin(w * d$x)

head(d)
##   x   y_ideal
## 1 1 0.3090170
## 2 2 0.5877853
## 3 3 0.8090170
## 4 4 0.9510565
## 5 5 1.0000000
## 6 6 0.9510565

y_ideal obeys the recurrence y_ideal[i] = (2 * cos(w)) * y_ideal[i - 1] - y_ideal[i - 2]. It is the first time series we will try to forecast.

theoretical_coefs <- c(2 * cos(w), -1)

theoretical_coefs
## [1]  1.902113 -1.000000

We can confirm the recurrence as follows.

max(abs(
  d$y_ideal - 
    (theoretical_coefs[1] * Lag(d$y_ideal, 1) 
     + theoretical_coefs[2] * Lag(d$y_ideal, 2))),
  na.rm = TRUE
)
## [1] 6.943996e-15

The above is easy to derive from the usual trigonometric additive angle formulas, and pretty much how trigonometric tables used to be constructed. The recurrence exactly matches the form in the help(arima) documentation:

X[t] = a[1] * X[t - 1] + a[2] * X[t - 2] + e[t].

This reference is using X as the observable value (which for us is y_ideal or y_observed, depending on which example we are working).

Note: we don’t always want to call the e[] term an error. Time Series Analysis more properly refers to such as “random shocks”, not errors. In section 1.2.1 “Stationary and Nonstationary Stochastic Models for Forecasting and Control” page 8 Time Series Analysis states:

The stochastic models we employ are based on the idea [ref: 317] that an obsevable time series zt in which successive values are highly dependent can frequently be regarded as generated from a series of independent “shocks” at.

This reference is using z for the observable process and a for the unobservable shocks (refs, p. 2 and p. 8).

An analogy for the above is: if you consider the contents of a tip jar over time to be a time series, you most certainly do not consider the tips (difference from the average expecte tip) to be errors! They are in fact the drivers of the jar contents. Many current references (such as help(arima)) unfortunately use the term “errors” for these shocks or drivers.

The example we have chosen is a nice process to study as it doesn’t have the “unit root” pathology, and it doesn’t require differencing or integration in the analysis (a case where we don’t need the “I” in ARIMA). We will move to trying to infer or fit the original system by time series modeling.

As a Time Series

The example is called a “c(2, 0, 0) ARIMA model” (see help(arima)). A c(p, d, q) ARIMA process has p autoregressive terms (the “AR”, expressing a linear relation among the ideal unobserved values), d integration steps (integration needed to add trends to a system), and p moving average terms (the “MA”, expressing finite memory for unobserved shocks). It can be daunting to pick the c(p, d, q) parameters when modeling. However when we have a recurrence specifying the ideal system we can read off the implied parameters, and in this case we have a c(2, 0, 0) system: we see two lags of the variable in question, and no itegrations or older shock/impulse terms.

A note on time series terminology. A lot of the standard naming of terms is historic, and does not reflect current understanding. For example our Time Series Analysis reference in section 1.2.1 “Stationary and Nonstationary Stochastic Models for Forecasting and Control” page 10 asserts:

The name “moving average” is somewhat misleading becaues the weights […] do not need to total to unity nor need they be positive. However, this nomenclature is in common use, and therefore we employ it.

Back to our problem. We will reserve some of our data for out of sample evaluation.

d$is_train <- seq(nrow(d)) <= 110

We fit the ARIMA model.

arima_include_mean <- FALSE
arima_method <- "CSS"
m_ideal <- arima(
  d$y_ideal[d$is_train], 
  order = c(2, 0, 0), 
  include.mean = arima_include_mean, 
  method = arima_method)

m_ideal
## 
## Call:
## arima(x = d$y_ideal[d$is_train], order = c(2, 0, 0), include.mean = arima_include_mean, 
##     method = arima_method)
## 
## Coefficients:
##          ar1  ar2
##       1.9021   -1
## s.e.  0.0000    0
## 
## sigma^2 estimated as 3.03e-30:  log likelihood = 3585.47,  aic = NA

Some of the diagnostics look a little concerning, but the fitter did identify the system and reproduce the theoretical coefficients.

We can plot the forecast into the future (past the training region).

plot_ts_result <- function(predictions, comparison_column, title) {
  p_frame <- 
    rbind(
      data.frame(
        x = d$x[d$is_train == FALSE],
        y = predictions,
        what = "forecast"
      ),
      data.frame(
        x = d$x,
        y = d[[comparison_column]],
        what = "ground truth"
      )
    )
  q_frame <-data.frame(
        x = d$x[d$is_train == FALSE],
        y_truth = d[[comparison_column]][d$is_train == FALSE],
        y_pred = predictions
      )
  rmse = sqrt(mean((q_frame$y_truth - q_frame$y_pred)^2))
  ggplot(
    data = p_frame,
    mapping = aes(x = x, y = y, color = what, linetype = what, shape = what)) +
    geom_line(alpha = 0.8) +
    geom_point(alpha = 0.8) + 
    scale_color_brewer(palette = "Dark2") +
    ggtitle(
      title,
      subtitle = paste0("out of sample RMSE: ", sprintf("%g", rmse), " (smaller is better)"))
}
n_ahead = sum(d$is_train == FALSE)
plot_ts_result(
  as.numeric(predict(m_ideal, n.ahead = n_ahead)$pred), 
  "y_ideal", 
  "recovered ideal system")

With Shocks or Obsevation Noise

Now consider the more common set-up where we don’t observe y_ideal, but instead y_observed.

set.seed(2023)
# add observation noise
d$y_observed <- d$y_ideal + 0.1 * rnorm(nrow(d))

We must take care. The y_observed is not a c(2, 0, 0) ARIMA process, even though y_ideal is! We can confirm the mis-fit as follows.

Doing it Wrong

m_observed_wrong <- arima(
  d$y_observed[d$is_train], 
  order = c(2, 0, 0),  # wrong for this example
  include.mean = arima_include_mean, 
  method = arima_method)

m_observed_wrong
## 
## Call:
## arima(x = d$y_observed[d$is_train], order = c(2, 0, 0), include.mean = arima_include_mean, 
##     method = arima_method)
## 
## Coefficients:
##          ar1      ar2
##       1.5321  -0.6364
## s.e.  0.0732   0.0731
## 
## sigma^2 estimated as 0.03716:  log likelihood = 25.01,  aic = NA
plot_ts_result(
  as.numeric(predict(m_observed_wrong, n.ahead = n_ahead)$pred),
  "y_observed", 
  "fit observed system (wrong)")

The above is a pathological failure of the mis-specified fitting system. The forecast procedure is vulnerable both to mis-identification of system and to mis-estimation of the state going into the forecast region.

This sort of failure can go unnoticed during modeling.

Doing it Right

The correct form is translate the earlier recurrence to an identical one over the shock-free terms Xideal := Xobs - e:

(Xobs[t] - e[t]) = a[1] * (Xobs[t - 1] - e[t - 1]) + a[2] * (Xobs[t - 2] - e[t - 2]).

This can be re-arranged to the usual ARIMA formula:

Xobs[t] = a[1] * Xobs[t - 1] + a[2] * Xobs[t - 2] + e[t] + b[1] * e[t - 1] + b[2] * e[t - 2].

The above is a strong argument that for c(p, d, q) ARIMA processes one should pick q to be at least p. The idea is: with q >= p we can appeal to a substitution of the form Xideal := Xobs - e. If there is an autoregressive recurrence (the “AR” in ARIMA, or how the state is thought to evolve in our system to be fit) in the unobserved ideal values, then adding q moving average terms (the “MA” in ARIMA) makes the ARIMA system strong enough to estimate it from the mere observations. Without the q shock terms (the e[]s) there is in fact no reason to suppose the approximate observations will obey a recurrence similar to the unobserved ideal observations.

Let’s give that a try.

m_observed <- arima(
  d$y_observed[d$is_train], 
  order = c(2, 0, 2), 
  include.mean = arima_include_mean, 
  method = arima_method)

m_observed
## 
## Call:
## arima(x = d$y_observed[d$is_train], order = c(2, 0, 2), include.mean = arima_include_mean, 
##     method = arima_method)
## 
## Coefficients:
##          ar1      ar2      ma1     ma2
##       1.9048  -1.0022  -1.6756  0.7642
## s.e.  0.0038   0.0039   0.0804  0.0810
## 
## sigma^2 estimated as 0.01235:  log likelihood = 85.59,  aic = NA
plot_ts_result(
  as.numeric(predict(m_observed, n.ahead = n_ahead)$pred),
  "y_observed", 
  "fit observed system (correct)")

“Magic” Tools With Mundane Outcomes

One can also try one of the many “auto ARIMA” systems to find c(p, d, q).

m_auto <- auto.arima(
  d$y_observed[d$is_train], 
  stepwise = FALSE,
  approximation = FALSE)

m_auto
## Series: d$y_observed[d$is_train] 
## ARIMA(5,0,0) with zero mean 
## 
## Coefficients:
##          ar1     ar2      ar3      ar4      ar5
##       0.7894  0.4619  -0.1404  -0.2015  -0.1967
## s.e.  0.0942  0.1200   0.1278   0.1210   0.0952
## 
## sigma^2 = 0.01829:  log likelihood = 63.86
## AIC=-115.71   AICc=-114.89   BIC=-99.51

The software fails to pick the correct c(p, d, q) and fully identify the system. However the excess p seems to almost work.

plot_ts_result(
  as.numeric(predict(m_auto, n.ahead = n_ahead)$pred),
  "y_observed",
  "auto.arima observed system")

The unfortunate decay in the predictor is likely a small under-estimate of the norm of the step forward in time operator. This is fine, until one applies the operator again and again to make predictions. When we fit ARIMA models we are actually trying to identify the one step forward operator, not a k-step forward operator.

I have in the past found auto ARIMA type methods to be fairly brittle and degenerate to predicting a near-constant with rapidly growing error bars for problems as simple as modeling the tide when “under supervised.” For some fun with forecasting tides, please see Lord Kelvin, Data Scientist.

ARIMA forecasting

ARIMA models make k-step out predictions by iterating 1-step forward predictions and feeding the intermediate predictions in as if they were actual observations. This may seem a bit crude, but it is explicitly so described in Time Series Analysis section 5.1.2 “Three Basic Forms for the Forecast” page 143.

  1. The zt-j (j = 0, 1, 2, …) which have already occured at origin t, are left unchanged.

  2. The zt+j (j = 1, 2, …) which have not yet occured at origin t, are replaced by their forecasts t(j) at origin t.

  3. The at-j (j = 0, 1, 2, …) which have occured, are available from zt-j - ẑt-j-1(1).

  4. The at+j (j = 1, 2, …) which have not yet occured, are replaced by zeros.

In context this is saying: assume any past difference between predictions and actuals are the (normally unobserved) driving shocks, and assume no future shocks (which is why we model shocks as being expected-value zero!).

With all said and done, the above k-step prediction scheme is just a composition of linear functions, so therefore itself a linear function of the original inputs.

Treating The Problems as a Mere Regression

We can exploit the closed under composition property to support the practice of directly fitting a k-step out predictor using traditional regression (for each possible k!). To do this we land a number of lagged copies of our variable as our explanatory variables. This brings us to the usual form of having all the explanatory variables in each row of data, and fitting a model as a function of rows.

The fitting strategy we are using throughout is: use theoretical arguments to ensure the data generating system is in the family of models being optimized over. Then the optimizer returns either this system, or one that has even smaller loss on the training data. Some of the excess performance is over-fit, so will not translate to out of sample performance. However, if the concept space isn’t too complicated in comparison with the amount of training data available: we then expect good out of sample performance. We can establish this through theoretical means (such as PAC learning or bounding in terms of Rademacher complexity), or we can just be parsimonious in more model specification and then empirically confirm out of sample performance.

Let’s set that up. First we organize our explanatory variables.

lags <- 0:5
d_train <- d[d$is_train, c("x", "y_observed"), drop = FALSE]
names(lags) <- paste0("x_", lags)
for (nm in names(lags)) {
  d_train[nm] = Lag(d_train$y_observed, lags[nm])
}

head(d_train)
##   x y_observed       x_0       x_1       x_2       x_3       x_4       x_5
## 1 1  0.3006386 0.3006386        NA        NA        NA        NA        NA
## 2 2  0.4894909 0.4894909 0.3006386        NA        NA        NA        NA
## 3 3  0.6215103 0.6215103 0.4894909 0.3006386        NA        NA        NA
## 4 4  0.9324421 0.9324421 0.6215103 0.4894909 0.3006386        NA        NA
## 5 5  0.9366514 0.9366514 0.9324421 0.6215103 0.4894909 0.3006386        NA
## 6 6  1.0601363 1.0601363 0.9366514 0.9324421 0.6215103 0.4894909 0.3006386

Now we define our brute force predict k-steps out function. To forecast in this way we need a whole family of such functions. We predict 1-step out using a function fit to forecast 1-step out, we predict 2-steps out using a function fit to forecast 2-steps out, and so on. We train many such k step out functions by fitting k steps into the future, at prediction time we use the many functions to predict many steps out from our last known state.

With this formulation we don’t need any of the lemmas that justify composing predictions, as we don’t compose predictions. That is: we don’t need to know if a good 3-step out prediction can be formed from feeding a 2-step out prediction back into a 1-step out model. We don’t need to know this as we are no longer doing that. We instead are saying, if there is a good k-step out prediction that is a linear function of its inputs we can find an approximation of that directly by fitting. This lightens the theoretical load, but increases the engineering load, as we have to perform and store multiple model fits.

fit_fn <- function(steps_forward) {
  stopifnot(steps_forward >= 1)
  d_train_y <- d_train
  # land the dependent variable
  d_train_y["y_f"] <- Lag(d_train$y_observed, -steps_forward)
  # limit to rows where we have explanatory values and dependent values
  d_train_y <- na.omit(d_train_y)
  # fit the dependent variable as a function of explanatory variables
  lm(mk_formula("y_f", names(lags)), data = d_train_y)
}

It is easy to code a data leak that makes results look too good both in the training and application phases of this formulation. We avoid this by strictly using only d_train for formation of both explanatory and outcome variables, and letting na.omit() deal with how this cuts our training data in an irregular pattern. More advice is to: return the predictions in an explicit data.frame (not as a scalar or a vector), and add an explicit time key to both inputs and predictions to make joining external regressors or joining results safe and reliable.

We store these fits in a map keye by how far out we are trying to predict.

vector_of_naheads_to_predict = seq(n_ahead)
model_map <- lapply(
  vector_of_naheads_to_predict,   # different nahead values
  fit_fn)

Our prediction functions uses this map of models to forecast.

pred_fn <- function(steps_forward, newdata) {
  stopifnot(steps_forward >= 1)
  stopifnot(nrow(newdata) == 1)
  model_f = model_map[[steps_forward]]
  # apply the model to the last "before the test period" row to get 
  # the k-steps_forward prediction
  as.numeric(predict(model_f, newdata = newdata))
}

With our prediction functions defined, we are ready to plot the results.

plot_ts_result(
  vapply(
    vector_of_naheads_to_predict,  # different nahead values
    pred_fn, 
    numeric(1), 
    newdata = d_train[nrow(d_train), , drop = FALSE]),  # last known state
  "y_observed", 
  "brute force k-lagged linear function family")

Notice we get good out of sample performance. If we are judging on quality of out of sample predictions (and not on system identification or inference) then direct fitting k-step out predictors looks like a justifiable strategy. There are other consideration in fitting, but if minimizing prediction error magnitude is your goal, then regression methods tend to dominate (as that is all they care about). If all one cares about is minimizing error, then (like it or not) error-minimizing strategies tend to dominate.

The above I call the Apps principle; as Dr. Philip Apps says:

You have to get up pretty early in the morning to beat linear regression.

The linear regression worked very well, even though it can only imitate AR terms (it doesn’t have MA terms). This is even with an example where the generative process is c(2, 0, 2), so has a non-trivial number of MA terms. Reliable prediction does not always require a fully faithful reconstruction of the underlying system. The Time Series Analysis reference makes a similar observation after working an example of a c(1, 1, 1) system (section 4.2.3 “Inverted Form of the Model” p. 113):

We mention in passing that, for statistical models representing practically occuring time series, the converging π weights usually die out rather quickly. Thus, although zt may be theoretically dependent on the remote past, the representation [… expression with only a few AR terms …] will usually show that the zt is dependent to an important extent only on recent past values zt-j of the time series.

The above was in the context that a c(p, 0, q) system (i.e. a MA or moving average process) can be written as a c(infinity, 0, 0) system, which in turn is usually very well approximated by a c(k, 0, 0) system for a small k. The theory supports systems that are not well-approximated with c(k, 0, 0), but they are not likely to be estimable from a reasonable number of observations anyway (so setting k very large doesn’t solve all problems).

This is an example of what I call the Welch principal from the following.

When Dr. Will Welch was asked why a simple model could correctly predict the strength of a biological interaction he said “maybe the molecule didn’t go to grad school.”

The direct regression approach makes the mechanics of adding external regressors and saving models for future use a bit easier. If one can form a row with correct lags or explanatory variables, one can call predict.

Conclusion

Time series tools have their own set of issues and affordances.

However, some of the terminology can be hard to understand and may not in fact align with your modeling situation. In particular watch out for thinking of the e[] as mere errors, they can also be important driving impulses or shocks. Similarly avoid thinking of the moving average terms as only computing moving averages; they can also translate relations thought to be true on the unobserved ideal values to the observed values (the “set q to p trick” we showed above). As with all mathematical objects, what something is called doesn’t actually influence what it does.

In my opinion teaching of ARIMA has drifted from the definitive treatment in Time Series Analysis. Current teaching can degenerate to “use this package to make forecasts, and direct modeling is heterodox.” Time Series Analysis spent a lot of time on semantics, consequences, and system identification as their own worthwhile goals- it did not recommend black box packages skipping forward to prediction or forecasting.

If one does only want predictions (even out of sample), one can in fact use direct prediction methods. Approaches that bring in standard regression tools (such as linear regression or even deep learning) can in fact be justified. One is allowed to not use ARIMA.

Thank you to Dr. Nina Zumel for her comments and improvements.

Time series packages of interest include sktime (a Python package), and NNS.