Skip to content

Latest commit

 

History

History
332 lines (280 loc) · 11.7 KB

File metadata and controls

332 lines (280 loc) · 11.7 KB

ts_example

R packages re-doing of our nested model time series example.

# http://fable.tidyverts.org
library(fable)
library(tsibble)
library(tsibbledata)
library(lubridate)
library(dplyr)
library(ggplot2)
library(forecast)
library(jsonlite)

The problem

We want to fit a time series with both durable and transient external regressors.

First let’s show the generating parameters we are attempting to recover.

model_specification <- fromJSON("generating_params.json")

model_specification
## $beta_auto_intercept
## [1] 2.127637
## 
## $effect_shift
## [1] -10.4
## 
## $beta_auto
## [1]  1.972723 -1.000000
## 
## $beta_durable
## [1] 9.2
## 
## $beta_transient
## [1] 19.1
## 
## $generating_lags
## [1] 1 2
## 
## $error_scale
## [1] 3.2
## 
## $n_step
## [1] 1000

What we hope is to find b_x_dur ~ b_z and b_x_imp ~ b_x.

The generating and modeling lags specify translate into ARIMA terms as p = 2, i = 0. We take our own advice from A Time Series Apologia and pick q = p. So in ARIMAX terms we try fitting a pdq(2, 0, 2) system.

We don’t have any way to specify the nature of external regressors, with act as transient effects in the “regression with ARIMA residuals” formulation favored by the fable and forecast packages. This will lead to a degradation in fit quality and an inability to properly estimate b_z (as we can’t specify the for of the effect we believe it has in the data).

Fitting with external regressors using the fable package

General Fable formulation:

(1 - φ1 B - … - φp Bp)(1 - B)d yt = c + (1 - θ1 B - … - θq Bq) εt

where B is the shift operator such that B ut = ut-1, μ is the mean of (1 - B)d yt and c = μ (1 - φ1 - … - φp). We solve for εt being small. We very much prefer calling the εt “the external shocks”, and not calling them “errors.”

In our pdq(2, 0, 2) case this specializes to:

(1 - φ1 B - φ2 B2) yt = c + (1 - θ1 B - θ2 B2) εt

And knowing this is “regression with ARIMA residuals”, when we add the external regressors this should be:

(1 - φ1 B - φ2 B2) (yt - β0 - βx xt - βz zt) = c + (1 - θ1 B - θ2 B2) εt

Removing the B shift notation and assuming mean(yt - β0 - βx xt - βz zt) is zero gives us.


(yt - β0 - βx xt - βz zt)
- φ1 (yt-1 - β0 - βx xt-1 - βz zt-1)
- φ2 (yt-2 - β0 - βx xt-2 - βz zt-2)
= εt - θ1 εt-1 - θ2 εt-2

This shows us the grace of the B shift notation. This equation block is repeated for all t (with out of range indices denoting zero values). For a T time step system we get a system of T equalities in T + 7 equations (minimizing norm ε). This also shows us that while the system is linear in the data, it is non-linear in the fit coefficients.

All of the above is a long way of saying: “given the pdq(2, 0, 2) specification and y, x, z, the modeling system then fits for φ, β, θ, ε (with ε small norm).” I.e., we don’t have to worry over the above transformations- that is the job of the solver.

Note: I agree with the Prophet authors that the user has to be involved in specifying pdq(2, 0, 2). Many auto-ARIMA systems seem to silently fail in presence of external regressors. Here is an example of “hey let’s use auto-arima” failing for tides and even its own package help example: https://github.com/WinVector/Examples/blob/main/Tides/TideR_ARIMA.md. All mean mean to say is: we can give ourselves permission to use non-ARIMA methods (more on this in “A Time Series Apologia”) in addition to trying ARIMA methods. The additional ARIMA driven concerns and checks (such as looking for unit roots, worrying about co-linear regressors) are more to do with the solution formulation (recursive equations) than the nature of forecast problems.

Normally, we don’t have to care so much about model structure- as we are protected from that by calling predict() or forecast(). However, in this case we are very concerned if model structure will or will not allow us to express what we think the external regressors actually do (i.e. model structure).

This is what we meant about the chosen package specifying the modeling recurrence equations (i.e. taking that choice out of our hands). We can specify modeling the total, but not unobserved sub-populations of the system.

Frankly ARIMAX/SARIMAX can be a false path for business modelers. Time series research didn’t actually stop at or actually consolidate on this terminology. Instead, transfer function methods and other more further developed systems are studied. Roughly: the scientific community is well served by ARIMAX. The research community moved on from ARIMAX. And, the business community wishes ARIMAX was in fact the dominant method, as it is the dominant software offered.

An odd point in this direction is: ARIMA prediction of tides. In fact tides are formed by external regressors: the gravitational attraction of the sun and moon. However just giving the ARIMA the exact periodicities of these regressors is enough for it to model the entire system without those inputs.

# https://otexts.com/fpp3/regarima.html
d_train <- read.csv('d_train.csv', stringsAsFactors = FALSE)
d_test <- read.csv('d_test.csv', stringsAsFactors = FALSE)
fable_model <- (
  d_train %>%
    tsibble(index=time_tick) %>%
    model(
      ARIMA(y ~ 
              1  # turn off must specify constant warning
            + x_durable_0   # external regressor (can also use xreg(x_durable_0))
            + x_transient_0   # external regressor (can also use xreg(x_transient_0))
            + pdq(2, 0, 2)   # AR=MA=2, I=0
            + PDQ(0, 0, 0)   # turn off seasonality (help(ARIMA))
            )
    )
)
coef(fable_model)
## # A tibble: 7 × 6
##   .model                            term  estimate std.error statistic   p.value
##   <chr>                             <chr>    <dbl>     <dbl>     <dbl>     <dbl>
## 1 "ARIMA(y ~ 1 + x_durable_0 + x_t… ar1      1.93    0.00985   196.    0        
## 2 "ARIMA(y ~ 1 + x_durable_0 + x_t… ar2     -0.965   0.00972   -99.2   0        
## 3 "ARIMA(y ~ 1 + x_durable_0 + x_t… ma1     -0.928   0.0323    -28.7   3.09e-132
## 4 "ARIMA(y ~ 1 + x_durable_0 + x_t… ma2      0.271   0.0310      8.72  1.15e- 17
## 5 "ARIMA(y ~ 1 + x_durable_0 + x_t… x_du…   -0.487   0.842      -0.579 5.63e-  1
## 6 "ARIMA(y ~ 1 + x_durable_0 + x_t… x_tr…   17.0     0.346      49.2   3.70e-267
## 7 "ARIMA(y ~ 1 + x_durable_0 + x_t… inte…   78.3     2.72       28.8   7.12e-133

Notice we recovered good estimates of the autoregressive terms b_auto (ar1, ar2), transient external effect coefficient b_x (x_transient_0). We did not get a good estimate of the durable external effect coefficient x_durable_0 (b_z), so we did not infer how changes in this variable affect results.

We would be able to forecast, as the auto-regressive terms dominate. We would not be able to plan, as we don’t have a good estimate of x_durable_0 (b_z).

model_specification
## $beta_auto_intercept
## [1] 2.127637
## 
## $effect_shift
## [1] -10.4
## 
## $beta_auto
## [1]  1.972723 -1.000000
## 
## $beta_durable
## [1] 9.2
## 
## $beta_transient
## [1] 19.1
## 
## $generating_lags
## [1] 1 2
## 
## $error_scale
## [1] 3.2
## 
## $n_step
## [1] 1000
preds <-  (
  fable_model %>%
    forecast(new_data=tsibble(d_test, index=time_tick)) 
)
d_test['fable ARIMAX prediction'] = preds['.mean']

# Rsquared
sse <- sum((d_test[['y']] - d_test[['fable ARIMAX prediction']])**2)
sey <- sum((d_test[['y']] - mean(d_test[['y']]))**2)
rsq <- 1 - sse/sey
# RMSE
rmse <- sqrt(mean((d_test[['y']] - d_test[['fable ARIMAX prediction']])**2))

(
  ggplot(
    data=d_test,
    mapping=aes(x=time_tick)
  )
  + geom_step(mapping=aes(y=`fable ARIMAX prediction`), direction='mid', color='blue')
  + geom_point(mapping=aes(y=y, shape=ext_regressors, color=ext_regressors), size=2)
  + ggtitle(paste0("fable package on held out data, rmse: ", 
    sprintf('%.2f', rmse), ', rsq: ', sprintf('%.2f', rsq)))
) 

Fitting with an external regressor using the forecast package

Forecast formulation:

(1 - φ1 B - … - φp Bp) (yt - μ) = c + (1 - θ1 B - … - θq Bq) εt

where yt = (1 - B)d yt, μ = mean(yt).

# https://otexts.com/fpp3/regarima.html
d_train <- read.csv('d_train.csv', stringsAsFactors = FALSE)
d_test <- read.csv('d_test.csv', stringsAsFactors = FALSE)
forecast_model <- Arima(
  ts(d_train[['y']], start=d_train[['time_tick']][1]), 
  order=c(2, 0, 2), 
  xreg=ts(d_train[, c('x_durable_0', 'x_transient_0')], start=d_train[['time_tick']][1])
  )

forecast_model
## Series: ts(d_train[["y"]], start = d_train[["time_tick"]][1]) 
## Regression with ARIMA(2,0,2) errors 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept  x_durable_0  x_transient_0
##       1.9335  -0.9648  -0.9282  0.2705    78.3360      -0.4873        17.0466
## s.e.  0.0099   0.0097   0.0323  0.0310     2.7168       0.8420         0.3463
## 
## sigma^2 = 60.54:  log likelihood = -3400.48
## AIC=6816.97   AICc=6817.11   BIC=6856.07

Notice the recovered durable effect coefficient is way too low.

preds <- forecast(
  forecast_model, 
  xreg=ts(d_test[, c('x_durable_0', 'x_transient_0')], 
  start=d_test[['time_tick']][1]))
d_test['forecast ARIMAX'] <- as.numeric(preds$mean)

# Rsquared
sse <- sum((d_test[['y']] - d_test[['forecast ARIMAX']])**2)
sey <- sum((d_test[['y']] - mean(d_test[['y']]))**2)
rsq <- 1 - sse/sey

# RMSE
rmse <- sqrt(mean((d_test[['y']] - d_test[['forecast ARIMAX']])**2))

(
  ggplot(
    data=d_test,
    mapping=aes(x=time_tick)
  )
  + geom_step(mapping=aes(y=`forecast ARIMAX`), direction='mid', color='blue')
  + geom_point(mapping=aes(y=y, shape=ext_regressors, color=ext_regressors), size=2)
  + ggtitle(paste0(
    "forecast package on held out data, rmse: ", 
    sprintf('%.2f', rmse), ', rsq: ', sprintf('%.2f', rsq)))
)