SRM Notes
This is known as the training MSE and fˆ is typically found by minimizing training MSE. However, the
ideal fˆ minimizes test MSE. The test MSE can be estimated by cross-validation. Usually the test MSE
follows a U-shape: as the flexibility of the learning method increases, the test MSE decreases
until it starts to increase again.
The expected test MSE can be decomposed into the following form:
E[(y_0 − f̂(x_0))²] = var(f̂(x_0)) + [Bias(f̂(x_0))]² + var(ϵ)
The ideal learning method achieves low variance and low bias. The final term var(ϵ) is said to be an
irreducible error and this term can never be lowered and serves as a lower bound to the expected test
MSE. Bias is introduced by approximating a complicated relationship by a simple model. Generally,
as the flexibility of the model increases, the bias will decrease. Variance refers to the amount by which
fˆ changes if it is given a different training data set. In general, more flexible statistical methods have
higher variance.
In the classification setting the training MSE is replaced with the training error rate defined as
(1/n) Σ_{i=1}^n 1{y_i ≠ ŷ_i}
Similarly to the regression setting, the ideal classifier minimizes test error rate. The expected test
error rate is minimized by the Bayes classifier which assigns a test observation with predictor x0 to the
class j for which P (Y = j | X = x0 ) is largest. If there are only two classes, then the Bayes classifier
corresponds to P (Y = 1 | X = x0 ) > 0.5. The set for which the conditional probability is exactly 0.5
is called the Bayes decision boundary. The Bayes classifier produces the lowest possible test error rate
called the Bayes error rate. The Bayes error rate is given by
1 − E[max_j P(Y = j | X)]
Since for real data we do not know the conditional distribution, the actual Bayes classifier cannot
be used but may be treated as the gold standard to compare other classification methods. We may
instead try to classify observations with an estimated conditional distribution and one such method is
the K-nearest neighbors (KNN) classifier.
P̂(Y = j | X = x_0) = (1/K) Σ_{i ∈ N_0} 1{y_i = j}
where N0 refers to the K points in the training data closest to x0 . KNN classifies the test observation
x0 to the class with the largest estimated conditional probability. The choice of K corresponds to the
smoothness of the classifier: a larger K gives a smoother (less flexible) fit with higher bias but lower variance, while a smaller K gives a more flexible fit with lower bias but higher variance.
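As a concrete illustration, here is a minimal NumPy sketch of the KNN classifier just described; the function name, the arrays, and the default K = 5 are hypothetical choices rather than anything prescribed by the notes.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=5):
    """Classify x0 by majority vote among its K nearest training observations."""
    # Euclidean distance from x0 to every training point
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the K closest points: the neighbourhood N0
    nbrs = np.argsort(dists)[:K]
    # Estimated conditional probabilities P(Y = j | X = x0) are the class
    # proportions within N0; predict the class with the largest proportion.
    classes, counts = np.unique(y_train[nbrs], return_counts=True)
    return classes[np.argmax(counts)]
```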
Y = β0 + β1 X + ϵ
se(β̂_0)² = σ²[1/n + x̄²/Σ_{i=1}^n (x_i − x̄)²], se(β̂_1)² = σ²/Σ_{i=1}^n (x_i − x̄)²
where σ 2 = var(ϵ). These formulas are derived assuming the errors to be uncorrelated with common
variance σ 2 . Note that se(β̂1 ) is smaller when the xi are more spread out, which intuitively follows
from there being more leverage to estimate the slope in this case. In general, σ 2 is not known but can
be estimated from the data. This estimate of σ is known as the residual standard error given by the
formula
RSE = √(RSS/(n − 2))
Standard errors can also be used to compute Wald confidence intervals. We can also perform hypothesis
tests on the coefficients. To test the null hypothesis that there is no relationship between X and Y ,
we can test
H_0: β_1 = 0, H_1: β_1 ≠ 0
We compute a t-statistic given by
t = β̂_1 / SE(β̂_1)
which follows a t_{n−2} distribution under the null hypothesis.
To quantify the extent to which the model fits the data, we can look at two related quantities: the
RSE and the R2 statistic. The RSE provides an absolute measure of lack of fit of the model to the
data. The R2 states the proportion of variance explained.
R² = (TSS − RSS)/TSS = 1 − RSS/TSS
where TSS = Σ_{i=1}^n (y_i − ȳ)² is the total sum of squares. The statistic R² always lies between 0 and
1. It is also related to the correlation. If r is the sample correlation between X and Y , in the case of
simple linear regression, R2 = r2 .
The multiple linear regression model takes the form Y = β_0 + β_1X_1 + ⋯ + β_pX_p + ϵ, where X_j represents the jth predictor and β_j quantifies the association between that variable and
the response. We interpret βj as the average effect on Y of a unit increase in Xj holding all other
predictors fixed.
Given estimators β̂0 , β̂1 , · · · , β̂p , predictions take the form
ŷ = β̂_0 + Σ_{k=1}^p β̂_k x_k
Given a full rank design matrix X, the OLS estimator is given by the equation
β̂OLS = (X T X)−1 X T Y
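A minimal NumPy sketch of the OLS estimator above. The design matrix is assumed to already include a column of ones for the intercept; a least-squares solver is used rather than forming (XᵀX)⁻¹ explicitly, which is numerically safer but equivalent for a full-rank X.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, equivalent to (X^T X)^{-1} X^T y for full-rank X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# hypothetical usage: X has a leading column of ones for the intercept
# beta_hat = ols(X, y); y_hat = X @ beta_hat
```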
To answer if there is a relationship between the response and predictors we can take the hypothesis
test with null hypothesis
H0 : β1 = β2 = · · · = βp = 0
with the alternative hypothesis being that at least one βk for k > 0 is not zero. For this purpose, we
compute the F -statistic
F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
where TSS = Σ_{i=1}^n (y_i − ȳ)². If the model assumptions are correct, then
E[RSS/(n − p − 1)] = σ 2
Hence if there is no relationship between the response and predictors, we expect the F -statistic to be
close to 1. If the alternative hypothesis holds, then E[(TSS − RSS)/p] > σ 2 so we expect F to be
greater than 1.
Sometimes we would like to instead test that a particular subset of q of the coefficients are zero. That
is,
H0 : βp−q+1 = βp−q+2 = · · · = βp = 0
Let RSS0 denote the residual sum of squares that use all variables except the last q. Then we take the
F -statistic
F = [(RSS_0 − RSS)/q] / [RSS/(n − p − 1)]
This F -statistic reports the partial effect of adding these q variables. In all of these cases, for a nested
model with p1 < p2 , we compare with the F distribution with (p2 − p1 , n − p2 − 1) degrees of freedom.
Variable selection is the task of determining which predictors are associated with the response in order
to fit a single model involving only those predictors. Various statistics can be used to judge the quality
of a model including Mallow’s Cp , Akaike information criterion (AIC), Bayesian information criterion
(BIC), and adjusted R2 . Since there are a total of 2p models that contain subsets of p variables, it
is infeasible to try every subset. More efficient approaches must be taken. There are three classical
approaches:
1. Forward selection. Begin with the null model then fit p simple linear regressions and add to the
null model the variable that results in the lowest RSS. We then add to that model the variable
that results in the lowest RSS for the new two-variable model and continue until some stopping
rule is satisfied.
2. Backward selection. We start with all variables in the model and remove the variable with the
largest p-value. The new (p − 1)-variable model is fit and the variable with the largest p-value is
removed and we continue this procedure until a stopping rule is reached.
3. Mixed selection. This is a combination of forward and backward selection. Start with no variables,
then add the variables that provide the best fit one-by-one. If at any point the p-value for one
of the variables in the model rises above a threshold, we remove that variable from the model
and we continue to perform forward and backward steps until all variables in the model have a
sufficiently low p-value.
Backward selection cannot be used if p > n and forward selection can always be used. Forward
selection is a greedy approach and can include variables that later become redundant. Mixed selection
can remedy this.
In multiple linear regression, R2 = cor(Y, Ŷ )2 . An R2 value close to 1 indicates that the model explains
a large portion of the variance in the response variable. The R2 statistic will always increase when
more variables are added to the model, even if those variables are only weakly associated with the
response since adding another variable always results in a decrease in the residual sum of squares in
the training data.
For multiple linear regression the RSE is defined as
RSE = √(RSS/(n − p − 1))
Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the
increase in p.
The inaccuracy in the coefficient estimates is related to the reducible error and one can compute
a confidence interval. Even if the true values for β0 , β1 , · · · , βp are known, the response cannot be
predicted perfectly. The error due to ϵ is called the irreducible error. Prediction intervals are always wider than confidence intervals since they incorporate both the error in estimating the population regression surface (the reducible error) and the randomness of the individual point (the irreducible error).
Residual plots can be used to identify outliers. It can be difficult to decide how large a residual needs to be before
it is considered an outlier. We can instead plot the studentized residuals, computed by dividing the
residual ei by its estimated standard error. Observations whose studentized residuals are greater than
3 in absolute value are possible outliers.
Observations with high leverage have an unusual value for xi . High leverage observations have a
substantial impact on the estimated regression fit. In order to quantify an observation’s leverage, we
compute the leverage statistic. For simple linear regression,
h_i = 1/n + (x_i − x̄)² / Σ_{i′=1}^n (x_{i′} − x̄)²
For multiple linear regression, the leverage score of the ith observation is the ith diagonal entry of the hat matrix H = X(XᵀX)⁻¹Xᵀ; that is, h_i = H_ii. Collinearity can be assessed with the variance inflation factor,
VIF(β̂_j) = 1 / (1 − R²_{X_j|X_{−j}})
where R²_{X_j|X_{−j}} is the R² from a regression of X_j onto all the other predictors.
In multinomial logistic regression, the Kth class is known as the baseline and satisfies
P(Y = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(Σ_{j=0}^p β_{lj} x_j))
An alternative coding for multinomial logistic regression is the softmax coding. Rather than selecting
a baseline class, we treat all K classes symmetrically and assume for k ∈ {1, 2, · · · , K},
P(Y = k | X = x) = exp(Σ_{j=0}^p β_{kj} x_j) / Σ_{l=1}^K exp(Σ_{j=0}^p β_{lj} x_j)
For generative classification methods with class priors π_k and class-conditional densities f_k, Bayes' theorem gives
P(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^K π_l f_l(x)
Assuming each f_k is the density of a normal distribution with class-specific mean µ_k and common variance σ², the posterior becomes
p_k(x) = π_k exp(−(x − µ_k)²/(2σ²)) / Σ_{l=1}^K π_l exp(−(x − µ_l)²/(2σ²))
Thus, given this form we should assign observations to the class for which
δ_k(x) = x·µ_k/σ² − µ_k²/(2σ²) + log(π_k)
is largest. The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging
in estimates for πk , µk and σ 2 .
µ̂_k = (1/n_k) Σ_{i: y_i = k} x_i
σ̂² = (1/(n − K)) Σ_{k=1}^K Σ_{i: y_i = k} (x_i − µ̂_k)²
where n is the total number of training observations, nk is the number of training observations in the
kth class. In the absence of any additional information, take
π̂_k = n_k / n
The LDA classifier assigns observation X = x to the class for which
δ̂_k(x) = x·µ̂_k/σ̂² − µ̂_k²/(2σ̂²) + log(π̂_k)
is largest. The name comes from the discriminant functions δ̂k (x) being linear functions of x.
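A sketch of one-predictor LDA that plugs the estimates µ̂_k, σ̂², and π̂_k defined above into the discriminant δ̂_k(x); the function names and arrays are hypothetical.

```python
import numpy as np

def lda_fit(x, y):
    """Estimate class means, pooled variance, and priors for one-dimensional LDA."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    mu = np.array([x[y == k].mean() for k in classes])          # mu_hat_k
    pi = np.array([np.mean(y == k) for k in classes])           # pi_hat_k = n_k / n
    sigma2 = sum(((x[y == k] - m) ** 2).sum()                   # pooled variance,
                 for k, m in zip(classes, mu)) / (n - K)        # denominator n - K
    return classes, mu, sigma2, pi

def lda_predict(x0, classes, mu, sigma2, pi):
    """Assign x0 to the class with the largest discriminant delta_hat_k(x0)."""
    delta = x0 * mu / sigma2 - mu ** 2 / (2 * sigma2) + np.log(pi)
    return classes[np.argmax(delta)]
```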
We can extend the LDA classifier to multiple predictors by assuming X is drawn from a multivariate
normal distribution with a class specific mean vector and common covariance matrix. We find that
δ_k(x) = xᵀΣ⁻¹µ_k − (1/2)µ_kᵀΣ⁻¹µ_k + log π_k
Similar estimates for µ1 , · · · , µK , π1 , · · · , πK and Σ can be given. As before, δ̂k (x) is a linear function
of x. For binary classification, a confusion matrix is a convenient way of displaying which types of
errors are being made. A confusion matrix has entries true positive (TP), false negative (FN), false
positive (FP), true negative (TN). Sensitivity, recall, or hit rate refers to the true positive rate (TPR).
Specificity or selectivity refers to the true negative rate (TNR). The Bayes classifier assigns an observation to the class for which the posterior probability p_k(X) is greatest. For the two-class case, the threshold
is 50%. However, if we are concerned about incorrect predictions for a particular class, we may change
the threshold. The ROC curve simultaneously displays the two types of error rates for all possible thresholds. It plots the true
positive rate (TPR) against the false positive rate (FPR). The overall performance of a classifier is
summarized by the area under the curve (AUC) statistic. An ideal ROC curve is close to the top left
corner.
Naive Bayes assumes that, within each class, the p predictors are independent. This assumption means the association between the predictors need not be estimated. The model assumes
P(Y = k | X = x) = π_k ∏_{j=1}^p f_{kj}(x_j) / Σ_{l=1}^K π_l ∏_{j=1}^p f_{lj}(x_j)
for k ∈ {1, 2, ⋯, K}. To estimate the one-dimensional density functions f_{kj}, we can assume X_j | Y = k ∼ N(µ_{jk}, σ²_{jk}). This amounts to QDA with the additional assumption that class-specific covariance
matrices are diagonal. Another option is to use a nonparametric estimate for fkj such as a kernel
density estimator. If Xj is a factor, then we can simply count the proportion of training observations
for the jth predictor corresponding to each class.
It should be noted that the naive Bayes estimator is a generalized additive model and that LDA is a
special case of Naive Bayes. Neither QDA nor naive Bayes is a special case of the other. Naive Bayes
can be flexible but is limited to an additive fit whereas QDA contains quadratic terms.
We can compare the previous methods discussed with KNN. Since KNN is completely nonparametric,
it should dominate LDA and logistic regression when the decision boundary is highly nonlinear and
when n is large while p is small. KNN requires a lot of observations relative to the number of predictors.
If n is modest or p is not very small, QDA may be preferred to KNN since it is nonlinear while still
having a parametric form. KNN does not tell which predictors are important unlike logistic regression.
The k-fold CV approach involves randomly dividing the set of observations into k groups, or folds,
of approximately equal size. The first fold is used as a validation set and the method is fit on the
remaining k − 1 folds. The mean squared error is computed on the observations in the held-out fold.
This procedure is repeated by treating other folds as validation sets. The k-fold CV estimate averages
these values:
CV_(k) = (1/k) Σ_{i=1}^k MSE_i
In practice, one usually takes k = 5 or k = 10. The obvious advantage over LOOCV is the computation
speed. It also gives more accurate estimates of the test error rate than does LOOCV due to a bias-
variance trade-off. The outputs of the fitted models are correlated with each other, with the correlation increasing as the number of folds increases, and since the mean of highly correlated quantities has higher variance, the test error estimate from LOOCV tends to have higher variance than that from k-fold CV.
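A sketch of k-fold cross-validation for a generic regression method; `fit` and `predict` are placeholder callables standing in for whatever learning procedure is being evaluated, and k = 5 is only a default.

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, k=5, seed=0):
    """Average held-out MSE over k folds, CV_(k) = (1/k) * sum of fold MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # random partition into k folds
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                  # fit on the other k - 1 folds
        pred = predict(model, X[test])                   # predict on the held-out fold
        mses.append(np.mean((y[test] - pred) ** 2))
    return np.mean(mses)
```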
On classification problems, the LOOCV error rate is
CV_(n) = (1/n) Σ_{i=1}^n Err_i
where Err_i = 1{y_i ≠ ŷ_i}.
1.4.2 Bootstrap
The bootstrap emulates the process of obtaining new sample sets by repeatedly sampling observations
from the original data set. This sampling is done with replacement. The bootstrap standard error of
α̂ is given by
SE_B(α̂) = √[ (1/(B − 1)) Σ_{r=1}^B ( α̂*_r − (1/B) Σ_{r′=1}^B α̂*_{r′} )² ]
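A sketch of the bootstrap standard error above for a generic statistic; `data` is assumed to be a NumPy array and `alpha_hat` any function of a sample, with B = 1000 an arbitrary number of resamples.

```python
import numpy as np

def bootstrap_se(data, alpha_hat, B=1000, seed=0):
    """Standard deviation of alpha_hat over B resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([alpha_hat(data[rng.integers(0, n, n)]) for _ in range(B)])
    return reps.std(ddof=1)    # denominator B - 1, matching the formula above
```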
In best subset selection, for each k one fits all (p choose k) models containing exactly k predictors and picks the best among them, calling it M_k; best is defined as the one with the smallest RSS (equivalently, the largest R²). Backward stepwise selection proceeds as follows:
1. Let Mp denote the full model.
2. For k = p, p − 1, ⋯, 1:
(a) Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 predictors.
(b) Choose the best among these k models and call it Mk−1 . Best is defined as smallest RSS
or deviance.
3. Select a single best model among M0 , · · · , Mp using cross-validated prediction error, Cp , AIC,
BIC, or adjusted R2 .
Forward and backward selection are not guaranteed to yield the best model. A hybrid approach in which both forward and backward steps are available is also possible.
Generally, training set MSE is an underestimate of the test MSE. But there are techniques for adjusting
the training error for the model size. For a fitted least squares model containing d predictors, Mallow’s
Cp estimate of test MSE is given by
C_p = (1/n)(RSS + 2dσ̂²)
Typically σ̂ 2 is estimated using the full model containing all predictors. If σ̂ 2 is an unbiased estimator
of σ 2 , then Cp is an unbiased estimate of the test MSE.
The Akaike Information Criterion (AIC) is defined for a large class of models fit by maximum likelihood.
AIC is given by
AIC = (1/n)(RSS + 2dσ̂²)
The Bayesian Information Criterion (BIC) takes the form
BIC = (1/n)(RSS + log(n)·dσ̂²)
Ridge regression estimates the coefficients by minimizing RSS(β) + λ Σ_{j=1}^p β_j². The tuning parameter λ serves to control the relative impact of the penalty. When λ = 0, there is no penalty, and as λ → ∞ the regression coefficients are shrunk all the way to zero. Ridge regression will generate a different set of coefficients for each value of λ. Note that we do not shrink the intercept.
The standard least squares coefficients are scale equivariant. Multiplying the predictors by a constant c
leads to a scaling of the coefficient estimates by c−1 . However, the ridge regression coefficient estimates
are not scale equivariant so it is best to apply ridge regression after standardizing the predictors first.
Ridge regression is advantageous due to the bias-variance trade-off. As λ increases, the flexibility of
the ridge regression fit decreases, leading to decreased variance but increased bias.
Ridge regression has the disadvantage of shrinking all coefficients towards zero but not setting any of
them exactly to zero, leading to problems with model interpretation. The lasso is an alternative to
ridge regression that sets some coefficients to zero. It has the form
β̂^L_λ = argmin_β { RSS(β) + λ Σ_{j=1}^p |β_j| }
The ℓ1 penalty forces some coefficient estimates to be exactly zero when the tuning parameter λ is
sufficiently large. In this sense, the lasso performs variable selection. We say that the lasso yields
sparse models.
Ridge regression and the lasso solve convex optimization problems that can equivalently be written in constrained form:
argmin_{β: Σ_j |β_j| ≤ s} RSS(β) (lasso), argmin_{β: Σ_j β_j² ≤ s} RSS(β) (ridge)
For every λ, there is a corresponding s that gives the same coefficient estimates. Best subset selection
may be thought as the solution to
argmin_{β: ‖β‖_0 ≤ s} RSS(β)
The ℓ1 norm is a convex approximation to the ℓ0 norm and hence the lasso may be thought of as a
convex relaxation of the best subset selection problem.
Ridge regression and the lasso require a method to select the tuning parameter. We choose a grid of λ
values and compute the cross-validation error for each value of λ then select the tuning parameter
for which the cross-validation error is smallest. Then the model is refit with the selected value of the
tuning parameter.
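A sketch of ridge regression via its closed form (XᵀX + λI)⁻¹Xᵀy on standardized predictors; centering y and standardizing X leaves the intercept unpenalized, as discussed above. The λ grid is arbitrary and would in practice be chosen by cross-validation.

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge coefficients (on the standardized scale) for each lambda in a grid."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize predictors
    yc = y - y.mean()                            # center response: intercept not shrunk
    p = Xs.shape[1]
    return np.array([np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
                     for lam in lambdas])        # one row of coefficients per lambda
```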
Dimension reduction methods transform the predictors into M < p linear combinations Z_1, ⋯, Z_M and fit the response on these new variables with least squares. If the constants defining the linear combinations are chosen wisely, then the dimension reduction approach can outperform least squares.
Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of
features. PCA is performed by maximizing the Rayleigh quotient. The first vector satisfies
ϕ_{·1} = argmax_w { wᵀXᵀXw / wᵀw }
Given the first k − 1 components, let X̂_k = X − Σ_{s=1}^{k−1} Xϕ_{·s}ϕ_{·s}ᵀ and
ϕ_{·k} = argmax_w { wᵀX̂_kᵀX̂_k w / wᵀw }
Note that even though PCR provides a simple way to perform regression using M < p predictors, it
is not a feature selection method. Generally, one should standardize the predictors before generating
the principal components.
Partial least squares (PLS) is a supervised alternative to PCR. Even though the low rank approximation
of X in PCR best accounts for X, it is not necessarily optimal for regression. The PLS1 algorithm is
as follows.
1. Columnwise center X and y and set X0 = X.
2. Repeat for i = 1, · · · , k:
(a) Set w_i = X_{i−1}ᵀy / ‖X_{i−1}ᵀy‖
(b) Set t_i = X_{i−1}w_i / ‖X_{i−1}w_i‖
(c) Set v_i = X_{i−1}ᵀt_i
(d) Set X_i = X_{i−1} − t_i v_iᵀ
3. Collect the vectors into matrices Wk , Tk , Vk .
β̂^(k)_PLS = W_k(V_kᵀW_k)⁻¹T_kᵀy
For a given value of k, the PLSE has a better predictability than the corresponding PCR estimator.
As with PCR, the number of dimensions is a tuning parameter typically chosen by cross-validation.
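A NumPy sketch of the PLS1 algorithm exactly as written above; the number of components k would be chosen by cross-validation.

```python
import numpy as np

def pls1(X, y, k):
    """PLS1 coefficients W_k (V_k^T W_k)^{-1} T_k^T y after k deflation steps."""
    X = X - X.mean(axis=0)                     # 1. columnwise center X and y
    y = y - y.mean()
    Xi = X.copy()
    W, T, V = [], [], []
    for _ in range(k):                         # 2. repeat k times
        w = Xi.T @ y
        w = w / np.linalg.norm(w)              # (a) weight vector
        t = Xi @ w
        t = t / np.linalg.norm(t)              # (b) normalized score vector
        v = Xi.T @ t                           # (c) loading vector
        Xi = Xi - np.outer(t, v)               # (d) deflate X
        W.append(w); T.append(t); V.append(v)
    W, T, V = map(np.column_stack, (W, T, V))  # 3. collect into matrices
    return W @ np.linalg.solve(V.T @ W, T.T @ y)
```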
1.6 ISLR Chapter 7
1.6.1 Basis functions
Polynomial regression allows for nonlinearity by allowing for polynomials in the predictors. Using
polynomial functions of the features imposes a global structure on the nonlinear function of X. One
can instead use step functions. We can break the range of X into bins and fit a different constant in
each bin. This converts a continuous variable into an ordered categorical variable. With indicators on
the bins, this may be fit in the same way as linear regression.
Polynomial and piecewise constant regression models are special cases of a basis function approach.
Instead of fitting a linear model in X, we can fit
y_i = β_0 + Σ_{j=1}^K β_j b_j(x_i) + ϵ_i
One flexible class of basis functions are piecewise polynomials. The points where the coefficients
change are called knots. We can also impose constraints on the piecewise polynomials by imposing
continuity, continuity of the first derivative, and continuity of the second derivative. With third degree
polynomials and the constraints, we get cubic splines. Cubic splines with K knots have K + 4 degrees
of freedom. One way to represent a cubic spline is with a truncated power basis. A truncated power
basis function is defined as
h(x, ξ) = (x − ξ)₊³, which equals (x − ξ)³ for x > ξ and zero otherwise,
where ξ is the knot. Splines can have high variance at the outer range of the predictors. A natural
spline is a regression spline with additional boundary constraints: the function is required to be linear
at the boundary.
It is common to place the knots of a spline at uniform quantiles of the data. One can again use
cross-validation to select for the degrees of freedom to be used.
We can also take a different approach to produce a spline. Note that if we wanted to find a function
g(x) that fits the observed data well, we can simply choose a g that interpolates all of the yi . We can
control this by requiring g to be smooth. We can find the function g that minimizes
Σ_{i=1}^n (y_i − g(x_i))² + λ ∫ g″(t)² dt
where λ is a nonnegative tuning parameter. The function g minimizing this objective is called a
smoothing spline. A smoothing spline can be shown to be a piecewise cubic polynomial with knots at the unique values of x_1, ⋯, x_n, with continuous first and second derivatives at each knot, and linear outside the extreme knots. The smoothing parameter λ controls the effective degrees of freedom.
Define
ĝλ = Sλ y
where ĝλ is the solution to the minimization problem for the smoothing spline. The effective degrees
of freedom is defined as
dfλ = tr(Sλ )
The LOOCV can be computed efficiently for smoothing splines.
RSS_cv(λ) = Σ_{i=1}^n [ (y_i − ĝ_λ(x_i)) / (1 − {S_λ}_{ii}) ]²
1.6.2 Local Regression
Local regression computes the fit at a target point using only the nearby training observations. Local
regression is sometimes referred to as a memory-based procedure because we need all the training data
each time we wish to compute a prediction. The most important choice in a local regression is the
span s, the proportion of points used to compute the local regression at x0 . The span controls the
flexibility of the fit.
1. Gather the fraction s = k/n of training points whose xi are closest to x0 .
2. Assign a weight K_{i0} = K(x_i, x_0) to each point in the neighbourhood so that closer points have higher weight, with all points other than these k nearest neighbours having weight zero.
3. Fit a weighted least squares regression on the yi by finding a minimum of
Σ_{i=1}^n K_{i0}(y_i − β_0 − β_1 x_i)²
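A sketch of local linear regression at a single target point x0 for one predictor. The tricube weight function used here is an assumption (a common default), since the notes leave the kernel K unspecified.

```python
import numpy as np

def local_linear(x, y, x0, s=0.5):
    """Weighted least squares line fit around x0 using the fraction s of nearest points."""
    n = len(x)
    k = max(3, int(np.ceil(s * n)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                      # the k training points closest to x0
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights K_i0 (assumed kernel)
    A = np.column_stack([np.ones(k), x[idx]])    # design matrix for beta0 + beta1 * x
    Wm = np.diag(w)
    beta = np.linalg.solve(A.T @ Wm @ A, A.T @ Wm @ y[idx])
    return beta[0] + beta[1] * x0                # fitted value at x0
```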
GAMs allow us to fit a nonlinear fj to each Xj so that we can automatically model non-linear relation-
ships that standard linear regression will miss. The nonlinear fits can potentially make more accurate
predictions for the response. Because the model is additive, we can examine the effect of each Xj on
Y while holding all other variables fixed. The smoothness of fj can be summarized by the degrees of
freedom. The main limitation of GAMs is that the model must be additive. For this purpose, we can
manually add interaction terms to GAMs or fit interaction terms using two-dimensional smoothers
such as two-dimensional splines.
To reduce the computational burden, a greedy approach called recursive binary splitting is used. This
approach starts at the root (top) of the tree and splits the predictor space with two branches. It is
greedy because the best split is made at the particular step rather than looking ahead to make a better
split in the future. Define the half-planes
We can repeat this process to make further splits in the data until a stopping criterion such as contin-
uing until no region contains more than five observations.
This process will likely overfit the data, leading to poor test set performance so in practice one grows
a large tree T0 and prunes it to obtain a subtree. Cost complexity pruning, also known as weakest link
pruning, considers trees indexed by a nonnegative tuning parameter α. Let |T | denote the number of
terminal nodes of tree T . For each α, there is a subtree T ⊂ T0 that minimizes
Σ_{m=1}^{|T|} Σ_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|
The tuning parameter α controls the subtree’s complexity. It turns out that branches get pruned from
the tree in a nested and predictable fashion. The tuning parameter α is selected via a validation set
or cross-validation.
A classification tree is similar to a regression tree but is used for classification. For a classification tree,
we predict that each observation belongs to the most commonly occurring class of training observations
in the region to which it belongs. We are interested not only in the class prediction corresponding to a terminal node region but also in the class proportions among the training observations that fall in the region. A different
objective function must be used for classification. The Gini index is defined by
G = Σ_{k=1}^K p̂_{mk}(1 − p̂_{mk})
A small value indicates that a node predominantly contains observations from a single class. An
alternative to Gini index is entropy, given by
D = −Σ_{k=1}^K p̂_{mk} log p̂_{mk}
The entropy will take a value near zero if the p̂_{mk} are all near zero or near one. Either the Gini index or the entropy is typically used to evaluate the quality of a particular split when growing the tree, but the classification error rate is preferable for pruning if prediction accuracy of the final pruned tree is the goal. Sometimes a split yields two terminal nodes with the same predicted class; the split is made because it increases node purity.
Advantages for decision trees:
• Trees are easily explained and easily interpreted by a non-expert.
• Trees can be displayed graphically.
• Trees more closely resemble human decision-making.
• Trees can handle qualitative predictors without needing to create dummy variables.
Disadvantages for decision trees:
• Trees do not have the same level of predictive accuracy as other regression and classification
approaches.
• Trees are non-robust. Small changes in the data can lead to large changes in the final estimated
tree.
An ensemble method combines many simple building block models to obtain a single more powerful
model. The building block models are known as weak learners.
Decision trees suffer from high variance. Bootstrap aggregation, or bagging, is a general procedure for reducing the variance of a statistical learning method, which can be used for decision trees.
f̂_bag(x) = (1/B) Σ_{b=1}^B f̂*ᵇ(x)
Construct B regression trees using B bootstrapped training sets, and average the resulting predictions.
These trees are grown deep and are not pruned. Hence each individual tree has high variance but low
bias and averaging these B trees reduces the variance. For classification, we take the majority rule:
the overall prediction is the most commonly occurring class among the B predictions. Note that the number of trees B is not a critical tuning parameter: a very large value of B will not lead to overfitting.
On average, each bagged tree makes use of around two-thirds of the observations. The remaining
one third of the observations not used to fit a given bagged tree are referred to as the out-of-bag
observations. We can predict the response for the ith observation using each of the trees in which
that observation was OOB. This yields around B/3 predictions for the ith observation. Average
or take the majority vote to get a single OOB prediction for the ith observation. An overall OOB
MSE or classification error can be computed this way and the resulting OOB error is a valid estimate
of the test error for the bagged model. For large enough B, the OOB error is nearly equivalent to
leave-one-out cross-validation error.
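A sketch of bagging regression trees with an out-of-bag error estimate. Using scikit-learn's DecisionTreeRegressor as the base learner is an assumption; any method could be bagged in the same way.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_oob_mse(X, y, B=200, seed=0):
    """Bag B deep regression trees and return the OOB estimate of the test MSE."""
    rng = np.random.default_rng(seed)
    n = len(y)
    oob_sum = np.zeros(n)                         # running sum of OOB predictions
    oob_cnt = np.zeros(n)                         # how many trees left each point out
    for _ in range(B):
        idx = rng.integers(0, n, n)               # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # grown deep, not pruned
        oob = np.setdiff1d(np.arange(n), idx)     # observations not in this sample
        oob_sum[oob] += tree.predict(X[oob])
        oob_cnt[oob] += 1
    used = oob_cnt > 0
    oob_pred = oob_sum[used] / oob_cnt[used]      # average OOB prediction per point
    return np.mean((y[used] - oob_pred) ** 2)
```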
Bagging results in improved accuracy but it is hard to interpret the resulting model. One can obtain
an overall summary of the importance of each predictor by recording the total amount that the RSS decreases due to splits over a given predictor, averaged over the B trees. A large value indicates an important predictor. For classification, one instead records the total amount that the Gini index decreases due to splits over a given predictor, averaged over the B trees.
Random forests improves on bagging by decorrelating the trees. As in bagging, one builds decision trees
on bootstrapped training samples but also takes a random sample of m predictors as split candidates
from the full set of p predictors. A fresh sample of m predictors is taken at each split and we typically
choose m ≈ √p. If m = p, this is simply bagging. A small value of m is helpful when there are many
correlated predictors.
Here is a boosting algorithm:
1. Set fˆ(x) = 0 and ri = yi for all i in the training set.
2. For b = 1, 2, · · · , B:
(a) Fit a tree fˆb with d splits to the training data (X, r).
(b) Update f̂ by adding a shrunken version of the new tree: f̂(x) ← f̂(x) + λf̂ᵇ(x).
(c) Update the residuals:
r_i ← r_i − λf̂ᵇ(x_i)
3. Output the boosted model, f̂(x) = Σ_{b=1}^B λf̂ᵇ(x).
The tuning parameters are B, λ, d. Unlike bagging and random forests, a high B can actually lead to
overfitting for boosting. Cross-validation is used to select B. The shrinkage parameter λ controls the
rate at which boosting learns; typical values are small positive numbers such as 0.01 or 0.001. A very small λ can require a large B. The number of splits d controls the complexity of the boosted ensemble; d is the interaction depth, controlling the interaction order of the boosted model since d splits can involve at most d variables. The choice d = 1 gives a "stump", consisting of a single split, which is equivalent to fitting an additive model. Smaller values of d have the benefit of being more interpretable.
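A sketch of the boosting algorithm above with small scikit-learn regression trees as the weak learners. Using max_depth as a stand-in for the number of splits d is an assumption; with d = 1 each tree is a stump.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, B=1000, lam=0.01, d=1):
    """Boosting for regression: repeatedly fit small trees to the residuals."""
    f_hat = np.zeros(len(y))               # 1. start with f_hat = 0 and r_i = y_i
    r = y.astype(float).copy()
    trees = []
    for _ in range(B):                     # 2. for b = 1, ..., B
        tree = DecisionTreeRegressor(max_depth=d).fit(X, r)  # (a) fit a tree to (X, r)
        update = lam * tree.predict(X)
        f_hat += update                    # (b) add a shrunken version of the new tree
        r -= update                        # (c) update the residuals
        trees.append(tree)
    return trees, f_hat                    # 3. boosted model: sum of shrunken trees

def boost_predict(trees, X, lam=0.01):
    """Prediction of the boosted model at new points."""
    return lam * sum(t.predict(X) for t in trees)
```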
Bayesian additive regression trees (BART) is another ensemble method that uses decision trees as
building blocks. Each tree for BART is constructed in a random manner as in bagging and random
forests, and each tree tries to capture signal not yet accounted for as in boosting. Let K denote the
number of regression trees, B the number of iterations for which BART is run, fˆkb (x) represents the
prediction at x for the kth regression tree used in the bth iteration. At the end of each iteration, we take f̂ᵇ(x) = Σ_{k=1}^K f̂ᵇ_k(x) for all b.
In the first iteration of BART, all trees are initialized to have a single root node, with f̂¹_k(x) = (1/(nK)) Σ_{i=1}^n y_i. Thus, f̂¹(x) = Σ_{k=1}^K f̂¹_k(x) = (1/n) Σ_{i=1}^n y_i. In subsequent iterations, BART updates each
of the K trees one at a time. In the bth iteration, to update the kth tree, subtract from each response
value the predictions from all but the kth tree to obtain a partial residual
r_i = y_i − Σ_{k′<k} f̂ᵇ_{k′}(x_i) − Σ_{k′>k} f̂ᵇ⁻¹_{k′}(x_i)
for the ith observation. Rather than fitting a fresh tree to this partial residual, BART randomly
chooses a perturbation to the tree from the previous iteration, favoring ones that improve the fit to
the partial residual.
1. We may change the structure of the tree by adding or pruning branches.
2. We may change the prediction in each terminal node of the tree.
The output of BART is a collection of prediction models,
f̂ᵇ(x) = Σ_{k=1}^K f̂ᵇ_k(x), for b = 1, ⋯, B.
We typically throw away the first few of these prediction models, called the burn-in period. Let L
denote the number of burn-in iterations. Then, we take the average after the burn-in iterations,
f̂(x) = (1/(B − L)) Σ_{b=L+1}^B f̂ᵇ(x)
However, other quantities such as percentiles of fˆL+1 (x), · · · , fˆB (x) can be taken. The BART method
can be viewed as a Bayesian approach to fitting an ensemble of trees: randomly perturbing a tree to
fit the residuals is drawing a new tree from a posterior distribution. Here is the complete algorithm
for BART:
1. Let f̂¹_1(x) = f̂¹_2(x) = ⋯ = f̂¹_K(x) = (1/(nK)) Σ_{i=1}^n y_i.
2. Compute f̂¹(x) = Σ_{k=1}^K f̂¹_k(x) = (1/n) Σ_{i=1}^n y_i.
3. For b = 2, ⋯, B, update each of the K trees one at a time, k = 1, 2, ⋯, K:
(a) For each observation, compute the partial residual r_i = y_i − Σ_{k′<k} f̂ᵇ_{k′}(x_i) − Σ_{k′>k} f̂ᵇ⁻¹_{k′}(x_i).
(b) Fit a new tree f̂ᵇ_k(x) to r_i by randomly perturbing the kth tree from the previous iteration, f̂ᵇ⁻¹_k(x); perturbations that improve the fit are preferred.
(c) Compute f̂ᵇ(x) = Σ_{k=1}^K f̂ᵇ_k(x).
4. Compute the mean after L burn-in samples:
f̂(x) = (1/(B − L)) Σ_{b=L+1}^B f̂ᵇ(x)
Summary:
• In bagging, trees are grown from bootstrapped samples. The trees can be similar to each other
and fail to thoroughly explore the model space.
• In random forests, the trees are grown from bootstrapped samples and each split is performed
with a random subset of the features, decorrelating the trees relative to bagging.
• In boosting, there is no bootstrapping and the trees are grown successively with a slow learning
rate.
• In BART, there is no bootstrapping and the trees are grown successively but the trees are
perturbed for a more thorough exploration of the model space.
Thus, ϕ1· is an eigenvector of X T X corresponding to the highest eigenvalue. The kth principal
component can be found in this way:
X̂_k = X − Σ_{i=1}^{k−1} Xϕ_{i·}ϕ_{i·}ᵀ
ϕ_{k·} = argmax_{‖ϕ‖=1} ‖X̂_kϕ‖²
Indeed, ϕ_{i·} is the ith eigenvector of XᵀX and ϕ_{i·}ᵀXᵀXϕ_{i·} is the corresponding eigenvalue. As normalized eigenvectors, the vectors {ϕ_{i·}} form an orthonormal set. ϕ_{i·} is called the ith principal component loading vector. In matrix form, we define
Z = XΦ
X is the original data set, Φ contains the loading vectors, and Z contains the principal component
scores.
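A sketch computing loadings and scores through the SVD of the centered (optionally standardized) data matrix, which is equivalent to the eigendecomposition of XᵀX described above; the function name and the scale flag are hypothetical.

```python
import numpy as np

def pca(X, M, scale=True):
    """First M loading vectors (columns of Phi), scores Z = X Phi, and their PVE."""
    X = X - X.mean(axis=0)                  # center each variable
    if scale:
        X = X / X.std(axis=0)               # optionally scale to unit standard deviation
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Phi = Vt[:M].T                          # loading vectors (eigenvectors of X^T X)
    Z = X @ Phi                             # principal component scores
    pve = s[:M] ** 2 / np.sum(s ** 2)       # proportion of variance explained
    return Phi, Z, pve
```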
Assuming that X is centered, the total variance of X is defined as
(1/n) Σ_{j=1}^p Σ_{i=1}^n x_{ij}²
In total, there are min{n − 1, p} principal components and their PVEs sums to one.
The first M principal component loading and score vectors should be interpreted as the best M -
dimensional approximation to the data in terms of residual sum of squares.
V̂ar(X) = V̂ar(Z) + MSE(X − ZΦᵀ)
Moreover, the PVE is the R2 of the approximation of X given by the first M principal components.
The results for PCA depend on whether the variables are individually scaled. If the features are on
different units, then it may be desirable to first scale each variable to have standard deviation one
before PCA. However, if the variables are measured in the same units, then one should not scale the
variables.
To decide on the number of principal components to use, we can use a scree plot. A scree plot depicts
the proportion of variance explained by each principal component. Cumulative proportion of variance
plots are also used. Typically, the scree plot has a point at which the proportion of variance explained by each subsequent principal component drops off, referred to as the elbow of the scree plot. Visual inspection of the scree plot allows us to decide the number of principal components to use. This visual analysis is ad hoc, and there is no well-accepted objective way to decide the number of principal components.
1.8.2 Clustering
K-means clustering partitions the data into K distinct, non-overlapping clusters. The main idea of
the K-means clustering is that a good clustering is one with low within-cluster variation. We would
like to solve
argmin_{C_1, ⋯, C_K} Σ_{k=1}^K (1/|C_k|) Σ_{i, i′ ∈ C_k} Σ_{j=1}^p (x_{ij} − x_{i′j})²
1. Randomly assign a number from 1 to K to each of the observations.
2. Iterate until the cluster assignments stop changing:
(a) Compute the cluster centroid for each of the K clusters.
(b) Assign each observation to the cluster whose centroid is closest (in Euclidean distance).
This algorithm is guaranteed to decrease the value of the objective at each step. When the result no
longer changes, a local optimum has been reached. The K-means algorithm depends on the initial
cluster assignment of the observations so it is important to run the algorithm multiple times from
different initial configurations then select the best solution.
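A sketch of the K-means algorithm just described (Lloyd's algorithm); in practice it would be run from several random initial assignments and the solution with the lowest within-cluster variation kept. Empty clusters are not handled in this sketch.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate centroid computation and reassignment until labels stop changing."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, len(X))        # 1. random initial assignment
    for _ in range(n_iter):                    # 2. iterate
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])   # (a)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)      # (b) assign to the closest centroid
        if np.array_equal(new_labels, labels):
            break                              # assignments unchanged: local optimum
        labels = new_labels
    return labels, centroids
```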
One disadvantage of K-means clustering is the requirement to decide beforehand the number of clusters
K. Hierarchical clustering is an alternative approach which does not require such a choice. Hierarchical
clustering results in a tree-based representation of observations, called a dendrogram.
A bottom-up or agglomerative clustering refers to a dendrogram that is built starting from the leaves and combining clusters up to the trunk. At the bottom of the dendrogram, each leaf represents one observation, and as one moves up the tree, leaves fuse into branches and branches fuse with leaves and other branches. The earlier the fusions occur, the more similar the groups of observations are to each other.
Observations that fuse later can be quite different. More precisely, the height of the fusion indicates
how different two observations are. Note that we cannot draw conclusions about the similarity of two
observations based on their proximity along the horizontal axis.
To identify clusters on the basis of a dendrogram, we make a horizontal cut across the dendrogram.
The distinct sets of observations beneath the cut are the clusters. The height of the cut controls
the number of clusters obtained and plays the same role as K in K-means clustering. This type of
clustering is hierarchical in that clusters at a lower height are necessarily nested within the clusters at
any greater height. Sometimes this assumption of hierarchical structure might be unrealistic.
Hierarchical clustering relies on a dissimilarity measure between each pair of observations, typically
Euclidean distance. Observations with the smallest dissimilarity are fused first. The concept of dis-
similarity between a pair of observations needs to be extended to pair of groups of observations to
fuse clusters. Linkage defines the dissimilarity between two groups of observations. The most common
types of linkage are complete, average, single, and centroid.
The hierarchical clustering algorithm is as follows:
1. Begin with n observations and treat each observation as its own cluster.
2. For i = n, n − 1, · · · , 2
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair
of clusters that are least dissimilar. Fuse these two clusters. The dissimilarity between these
two clusters indicates the height in the dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.
Linkage Description
Complete Largest pairwise dissimilarity.
Single Smallest pairwise dissimilarity.
Average Average pairwise dissimilarity.
Centroid Dissimilarity between centroids of two clusters.
The most popular linkages are average, complete, and single linkage. A drawback of centroid linkage is that an inversion can occur, where two clusters are fused at a height below either of the individual clusters.
Average and complete linkage generally create more balanced dendrograms.
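A sketch of agglomerative clustering with SciPy, which implements the fusing procedure above: `linkage` records the sequence of fusions and their heights, and `fcluster` cuts the dendrogram into a chosen number of clusters. The choice of complete linkage and four clusters is arbitrary.

```python
from scipy.cluster.hierarchy import linkage, fcluster

def hclust_labels(X, n_clusters=4, method="complete"):
    """Fuse the least dissimilar clusters step by step, then cut the dendrogram."""
    Z = linkage(X, method=method, metric="euclidean")  # merge history and fusion heights
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```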
Other dissimilarity measures exist: for example, a correlation-based distance considers two observations
to be similar if their features are highly correlated. In addition to selecting the dissimilarity measure,
one should consider whether to scale the variables before the dissimilarity measure is computed.
Both K-means and hierarchical clustering will assign every observation to a cluster but sometimes there
may be outliers that do not belong in any cluster. Outliers may greatly distort the resulting clusters.
Mixture models are an approach that accommodates outliers, resulting in a soft version of K-means
clustering. Clustering methods are generally not robust to perturbations of the data. To get a sense
of the robustness of the clusters, one can cluster subsets of the data. It is also important to remember
that the results of clustering depend on the decisions such as whether the data is standardized, type of
linkage, the height to cut the dendrogram, or the choice of K. Clustering results should not be taken
as absolute truth but rather the starting point of a hypothesis and further study on an independent
data set.
Tweedie distributions refer to a family of distributions where the variance is related to the mean
through a power function
var(Y ) = a[E(Y )]p
Several exponential family distributions are Tweedie distributions:
• Normal distribution, p = 0
• Poisson distribution, p = 1
• Compound Poisson-Gamma distribution, 1 < p < 2
• Gamma distribution, p = 2
• Inverse Gaussian distribution, p = 3
In a GLM, the inverse of the link function transforms the linear predictor back to the scale of the mean response. The canonical link function is related to the function b(θ) mapping θ to the natural parameter.
Distribution Canonical Link Function
Normal Identity
Binomial Logit
Poisson Natural Logarithm
Gamma Inverse
By differentiating ∫ f(y | θ) dy = 1 under the integral sign and rearranging, we get
E[Y] = −c′(θ)/b′(θ)
Since η_i = x_iᵀβ, the chain rule gives
∂µ_i/∂β_j = (∂µ_i/∂η_i) x_{ij}
Therefore the score is
U_j = Σ_{i=1}^N [(y_i − µ_i)/var(Y_i)] x_{ij} (∂µ_i/∂η_i)
The information matrix can be written as I = XᵀWX, where W is the diagonal matrix with
W_ii = (1/var(Y_i)) (∂µ_i/∂η_i)²
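As an illustration of these quantities, here is a sketch of Fisher scoring (iteratively reweighted least squares) for a Poisson GLM with the canonical log link, where ∂µ_i/∂η_i = µ_i and var(Y_i) = µ_i, so that U = Xᵀ(y − µ) and I = XᵀWX with W_ii = µ_i. The Poisson choice is an assumption made only to keep the example concrete.

```python
import numpy as np

def poisson_irls(X, y, n_iter=25, tol=1e-8):
    """Fit a Poisson log-linear model by Fisher scoring: beta <- beta + I^{-1} U."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)              # inverse of the log link
        U = X.T @ (y - mu)                 # score vector
        I = X.T @ (X * mu[:, None])        # information X^T W X, with W = diag(mu)
        step = np.linalg.solve(I, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```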
2.0.2 Inference
A goodness of fit statistic is a summary statistic to describe how well the model fits the data. It
may be based on a maximum value of the log-likelihood or the minimum value of the sum of squares
criterion or a composite statistic based on the residuals.
1. Specify a model M0 corresponding to H0 . Specify a more general model M1 .
2. Fit M0 and calculate the goodness of fit statistic G0 . Fit M1 and calculate the goodness of fit
statistic G1 .
3. Calculate the improvement in fit, G1 − G0 or G1 /G0 .
4. Use the sampling distribution of G1 − G0 to test the null hypothesis G1 = G0 .
5. If the null hypothesis is not rejected, then M0 is the preferred model. If it is rejected, then M1
is regarded as the better model.
The asymptotic sampling distribution for score statistics is
UᵀI⁻¹U →ᵈ χ²_p
for a p-dimensional vector of parameters. Taking a Taylor expansion of the score function about the maximum likelihood estimator b gives, approximately,
(b − β) ≈ I⁻¹(b)U
A model with the maximum number of parameters that can be estimated is called a saturated model, also called a maximal or full model. Define the deviance, or log-likelihood ratio statistic, as
D = 2[l(b_max; y) − l(b; y)]
where b_max and b are the MLEs under the saturated model and the model of interest. Writing
D = 2[l(b_max; y) − l(β_max; y)] − 2[l(b; y) − l(β; y)] + 2[l(β_max; y) − l(β; y)],
the first term has an approximate χ²_n distribution, the second term has an approximate χ²_p distribution, and the third term, v = 2[l(β_max; y) − l(β; y)], is a positive constant that will be near zero if the model fits the data almost as well as the saturated model. Thus, the sampling distribution of the deviance is approximately
D ∼ χ²(n − p, v)
For the Normal linear model, σ̂² = (y − Xb)ᵀ(y − Xb)/(N − p) is an unbiased estimator of σ² and can be used to estimate I and make inferences about b. Note that the MLE and least squares estimators are the same.
The deviance is given by the formula
D = (1/σ²)(y − Xb)ᵀ(y − Xb)
Thus the deviance is the residual sum of squares divided by σ², and D is sometimes called the scaled deviance.
F = [(D_0 − D_1)/(p − q)] ÷ [D_1/(N − p)] = [(b_1ᵀX_1ᵀy − b_0ᵀX_0ᵀy)/(p − q)] ÷ [(yᵀy − b_1ᵀX_1ᵀy)/(N − p)]
F ∼ F (p − q, N − p) under H0 . Otherwise F has a non-central distribution. Let S0 = y T y − bT0 X0T y
and S1 = y T y − bT1 X1T y, then
F = [(S_0 − S_1)/(p − q)] ÷ [S_1/(N − p)]
Usually inferences on a parameter for an explanatory variable depends on other explanatory variables
but an exception is made when the components are orthogonal. Suppose that
X = [X1 , · · · , Xm ]
where XjT Xk = 0, the zero matrix for j ̸= k. Then X is said to be orthogonal and X T X is a block
diagonal matrix and hence
bᵀXᵀy = Σ_{k=1}^m b_kᵀX_kᵀy
Consequently, the hypotheses
H01 : β1 = 0, · · · , H0m : βm = 0
can be tested independently. Except for some well-designed experiments, the design matrix is not
orthogonal. Tests based on adding X_jβ_j after all other terms have been included are called Type III tests. Tests that depend on the sequential order of fitting the terms are called Type I tests.
Residuals are defined as
êi = yi − xTi b = yi − µ̂i
The covariance matrix is
E[êêT ] = σ 2 [I − X(X T X)−1 X T ] = σ 2 [I − H]
where H is the projection or hat matrix. The standardized residuals are
r_i = ê_i / (σ̂√(1 − H_ii))
where σ̂ 2 is an estimate of σ 2 .
An outlier is an observation that is not well fitted by the model. An influential observation is one
which has a large effect on inferences based on the model. H_ii, the ith element on the diagonal of the hat matrix, is called the leverage of the ith observation. As a rule of thumb, if H_ii is greater than 2p/N, it may be a concern. Measures combining standardized residuals and leverage include
DFFITS_i = r_i √(H_ii/(1 − H_ii))
D_i = (1/p) r_i² H_ii/(1 − H_ii)
Di is called Cook’s distance. Cook’s distance may also be obtained by fitting a model with and without
each observation and seeing the difference this makes to estimates b.
D_i = (1/(pσ̂²))(b − b_(i))ᵀXᵀX(b − b_(i))
The basic model of multiple linear regression includes an intercept term which is represented by a
column of 1’s in the design matrix.
Define S, Ŝ, Ŝ0 to be
S = eᵀe = (Y − Xβ)ᵀ(Y − Xβ)
Ŝ = (y − Xb)ᵀ(y − Xb) = yᵀy − bᵀXᵀy
Ŝ_0 = yᵀy − Nȳ²
Then the coefficient of determination R² is defined as
R² = (Ŝ_0 − Ŝ)/Ŝ_0 = (bᵀXᵀy − Nȳ²)/(yᵀy − Nȳ²)
The reduced model is
E(Yjk ) = µ + γxjk
The term general linear model refers to Normal linear models with any combination of categorical and
continuous explanatory variables.
D ∼ χ²_{N−p}
where p is the number of parameters estimated and N the number of covariate patterns.
For small studies or situations where there are few observations for each covariate pattern, asymptotic
results are poor approximations. Software has been developed for exact methods.
Instead of MLE, we can estimate the parameters by minimizing the weighted sum of squares
S_w = Σ_{i=1}^N (y_i − n_iπ_i)² / (n_iπ_i(1 − π_i))
pseudo-R² = [l(π̃; y) − l(π̂; y)] / l(π̃; y)
where π̃ = Σ y_i / Σ n_i is the estimate from the minimal model. The pseudo-R² represents the proportional improvement
in the log-likelihood due to the terms in the model of interest.
AIC = −2l(π̂|y) + 2p
The Pearson or chi-squared residual is
X_k = (y_k − n_kπ̂_k) / √(n_kπ̂_k(1 − π̂_k))
The standardized deviance residual is
r_{Dk} = d_k / √(1 − h_k)
where d_k is the deviance residual and h_k the leverage for the kth covariate pattern.
Pearson and deviance residuals can be used for checking the adequacy of a model. They should
be plotted against each continuous explanatory variable to check the assumption of linearity and
against other explanatory variables not included in the model. They should be plotted in the order
of measurements to check for serial correlation. Normal probability plots can also be used for large
enough n since the standardized residuals should be approximately standard normal. If the data is
binary or if nk is small, then the plots may be uninformative and aggregated goodness of fit statistics
as well as other diagnostics should be used.
For binary data, there may be overdispersion which may be due to inadequate specification of the
model or more complex structure. One approach is to add an extra parameter ϕ so that var(Yi ) =
ni πi (1 − πi )ϕ, called the quasibinomial distribution.
Consider a random variable Y with J categories. Let π1 , π2 , · · · , πJ denote their respective probabili-
ties. If there are n independent observations of Y resulting in y1 , y2 , · · · , yJ outcomes in each category,
the multinomial distribution is
f(y | n) = [n! / (y_1! y_2! ⋯ y_J!)] π_1^{y_1} π_2^{y_2} ⋯ π_J^{y_J}
It can be shown that E(Yj ) = nπj , var(Yj ) = nπj (1 − πj ) and cov(Yj , Yk ) = −nπj πk . Nominal logistic
regression models are used when there is no natural order among the response categories. One category
is arbitrarily chosen as the reference category. Then,
logit(π_j) = log(π_j / π_1) = x_jᵀβ_j, for j = 2, ⋯, J.
The fitted probabilities are
π̂_1 = 1 / (1 + Σ_{j=2}^J exp(x_jᵀb_j))
π̂_j = exp(x_jᵀb_j) / (1 + Σ_{j=2}^J exp(x_jᵀb_j)), j = 2, ⋯, J.
The Pearson chi-squared residuals are
r_i = (o_i − e_i) / √e_i
where o_i and e_i denote the observed and expected (fitted) frequencies. The chi-squared statistic is
X² = Σ_{i=1}^N r_i²
The likelihood ratio chi-squared statistic and pseudo-R² are
C = 2[l(b) − l(b_min)]
pseudo-R² = [l(b_min) − l(b)] / l(b_min)
AIC = −2l(π̂ | y) + 2p
The odds ratio is
OR_j = (π_{jp}/π_{ja}) ÷ (π_{1p}/π_{1a})
where πjp and πja denote probabilities of response category according to whether exposure is present
or absent, respectively.
In some situations, there may be a continuous variable z that is difficult to measure and may be
assessed by cut points for the latent variable so that small values are classified as none, larger values
as moderate, and high values as severe. The cutpoints C1 , C2 , · · · , CJ define J ordinal categories with
associated probabilities.
The cumulative odds for the jth category are
P(z ≤ C_j) / P(z > C_j) = (π_1 + π_2 + ⋯ + π_j) / (π_{j+1} + ⋯ + π_J)
The proportional odds model is
log[(π_1 + ⋯ + π_j)/(π_{j+1} + ⋯ + π_J)] = β_{0j} + β_1x_1 + ⋯ + β_{p−1}x_{p−1}
It is based on the assumption that the effects of the covariates x_1, ⋯, x_{p−1} are the same for all
categories on the logarithmic scale. If some of the categories are amalgamated, this does not change
the parameter estimates β1 , · · · , βp−1 but the terms β0j will be affected. This is the collapsibility
property. The proportional odds model is also not affected by a reversal of labelling of the categories.
The proportional odds model is the usual form of ordinal logistic regression.
The rate ratio, RR for presence vs absence is
RR = E(Y_i | present) / E(Y_i | absent) = e^{β_j}
so that D = Σ_i d_i².
For boxplots, the lower and upper fences are LF = Q_1 − 1.5(Q_3 − Q_1) and UF = Q_3 + 1.5(Q_3 − Q_1).
Points that lie outside the fences are called potential outliers. Points on the line are called adjacent
points.
Q-Q plots plot sample quantiles against quantiles from a theoretical CDF. If the plot is mostly on the
identity line y = x, then we see the two distributions as being similar. Otherwise, we suspect that the
two distributions are not the same.
3 Time Series
3.1 Basics and Box-Jenkins
Consider a time series model that is stationary in the mean and variance. The model is second-order
stationary if the correlation between variables only depends on the number of time steps separating
them. The number of time steps between the variables is called the lag. A correlation of a variable
with itself at different times is called autocorrelation or serial correlation. If a time series model is
second-order stationary, we define the autocovariance function (acvf) as γ_k = E[(x_t − µ)(x_{t+k} − µ)] and the autocorrelation function (acf) as ρ_k = γ_k/γ_0. The acf is estimated by the sample autocorrelation r_k, and the correlogram plots r_k against the lag k together with significance lines at ±2/√n. If r_k falls outside the lines, there is evidence against the null hypothesis ρ_k = 0, though we expect 5%
of the estimates rk to fall outside the lines even under the null hypothesis.
Suppose there are time series models for variables x and y that are stationary in the mean and variance.
Their combined model is second-order stationary if correlations depend only on the lag, and we may
define the cross-covariance function (ccvf) as γ_k(x, y) = E[(x_{t+k} − µ_x)(y_t − µ_y)].
A random walk {x_t} is given by
x_t = x_{t−1} + w_t
where {wt } is a white noise series. Back substitution gives
x_t = Σ_{i=1}^t w_i
The backward shift operator B and the difference operator ∇ are defined by
Bx_t = x_{t−1}, ∇x_t = x_t − x_{t−1} = (1 − B)x_t
Differencing can be a useful filtering procedure for non-stationary time series. The first-order differences
of a random walk is a white noise series so the correlogram of the series of differences can be used to
assess whether a given series is a random walk.
A random walk model with a drift parameter δ is
xt = xt−1 + δ + wt
An autoregressive process of order p, AR(p), is x_t = α_1x_{t−1} + α_2x_{t−2} + ⋯ + α_px_{t−p} + w_t, where {w_t} is white noise and the α_i are model parameters with α_p ≠ 0 for an order p process. It can also be
expressed as
θp (B)xt = (1 − α1 B − α2 B 2 − · · · − αp B p )xt = wt
The equation θp (B) = 0 is called the characteristic equation. The roots of the characteristic equation
must all exceed unity in absolute value for the process to be stationary. To derive the second-order
properties for AR(1), note that
(1 − αB)x_t = w_t ⟹ x_t = (1 − αB)⁻¹w_t = Σ_{i=0}^∞ αⁱw_{t−i}
Hence E(x_t) = Σ_{i=0}^∞ αⁱE(w_{t−i}) = 0 and
γ_k = cov(x_t, x_{t+k}) = αᵏσ²/(1 − α²)
The autocorrelation function follows as
ρ_k = αᵏ, k ≥ 0
From this equation we see that autocorrelations are nonzero for all lags even though the underlying
model xt only depends on the previous value xt−1 . The partial autocorrelation at lag k is the correlation
that results after removing the effect of any correlations due to the terms at shorter lags.
An AR(p) model can be fitted in R using the ar function. The method is generally fit with MLE and
the order p of the process is chosen with AIC.
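Since the notes fit AR models in R, the following NumPy sketch is only illustrative: it simulates an AR(1) series and checks that the sample autocorrelations are roughly ρ_k = αᵏ. The value α = 0.7 is arbitrary.

```python
import numpy as np

def simulate_ar1(alpha=0.7, n=2000, seed=0):
    """Simulate x_t = alpha * x_{t-1} + w_t with Gaussian white noise."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = alpha * x[t - 1] + w[t]
    return x

def sample_acf(x, max_lag=5):
    """Sample autocorrelations r_k for k = 0, ..., max_lag."""
    x = x - x.mean()
    n = len(x)
    c0 = np.sum(x * x) / n
    return np.array([np.sum(x[: n - k] * x[k:]) / n / c0 for k in range(max_lag + 1)])

print(sample_acf(simulate_ar1()))   # roughly 1, 0.7, 0.49, 0.34, ...
```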
A time series model {xt } is linear if it can be expressed as
xt = α0 + α1 u1,t + · · · + αm um,t + zt
where ui,t is the value of the ith predictor at time t, zt is the error at time t, and α0 , α1 , · · · , αm are
model parameters, estimated by least squares.
Linear models for time series are non-stationary when they include functions of time. Differencing
can often transform a non-stationary series with a deterministic trend to a stationary series. If the
underlying trend is a polynomial of order m, then the mth order differencing is required to remove the
trend.
If {xt } is a stationary time series with Ext = µ, var(xt ) = σ 2 , cor(xt , xt+k ) = ρk , then the variance of
the sample mean is
var(x̄) = (σ²/n)[1 + 2 Σ_{k=1}^{n−1} (1 − k/n)ρ_k]
Thus if ρk > 0, then the variance of the sample mean is greater than the independent case and if
ρk < 0, the variance of the sample mean is smaller than the independent case. Since in time series
regression, the residual series will be autocorrelated, we should use generalized least squares (GLS)
instead for better estimates of the standard errors of the regression parameters.
Suppose a time series contains s seasons. A seasonal indicator model for a time series {xt } containing
s seasons and a trend mt is given by
xt = mt + st + zt
where st = βi when t is in the ith season and {zt } is the residual error series, which may be autocor-
related. This may be also written as
xt = mt + β1+(t−1) mod s + zt
By treating the seasonal term st as a factor, the parameters for the model can be estimated by GLS.
If seasonal effects vary smoothly over the seasons, it may be more parameter efficient to use a smooth
function instead of separate indices. The harmonic seasonal model is
x_t = m_t + Σ_{i=1}^{⌊s/2⌋} [s_i sin(2πit/s) + c_i cos(2πit/s)] + z_t
where mt is the trend with a constant term, si and ci are unknown parameters.
The logarithm can be used to transform a multiplicative model into an additive model. If a time series
is given by
xt = m′t s′t zt′
then applying the logarithm gives the additive model log x_t = log m′_t + log s′_t + log z′_t = m_t + s_t + z_t.
The process of using a transformation, fitting a model, then applying an inverse transformation in-
troduces a bias in the forecasts. For the logarithm, an empirical correction factor may be applied for
forecasting.
Writing ŷ_t for the forecast of log x_t and z_i for the residuals of the fitted model, the bias-corrected forecast of the original series is
x̂′_t = e^{ŷ_t} · (1/n) Σ_{i=1}^n e^{z_i}
A time series model {xt } is said to be strictly stationary if the joint distribution of {xt1 , · · · , xtn } is
the same as the joint distribution of {xt1 +m , · · · , xtn +m } for all t1 , t2 , · · · , tn and m.
Regression can allow us to decompose a non-stationary series to a trend, seasonal components, and
residual series. The residual series can be treated as a realization of a stationary error series.
A moving average (MA) process of order q is a linear combination of current white noise and q most
recent past white noise terms.
xt = wt + β1 wt−1 + · · · + βq wt−q
where {w_t} is white noise with variance σ²_w.
xt = (1 + β1 B + β2 B 2 + · · · + βq B q )wt = ϕq (B)wt
where ϕ_q is a polynomial of order q. The mean is clearly zero and the variance is σ²_w(1 + β_1² + ⋯ + β_q²).
The autocorrelation function is
ρ_k = Σ_{i=0}^{q−k} β_iβ_{i+k} / Σ_{i=0}^q β_i², for k ∈ [q] (with β_0 = 1),
and ρ_0 = 1, ρ_k = 0 for k > q. The MA process is invertible if it can be expressed as a convergent sum of present and past values of the series; for example, an MA(1) process x_t = (1 + βB)w_t can be inverted to w_t = (1 + βB)⁻¹x_t = x_t − βx_{t−1} + β²x_{t−2} − ⋯, provided |β| < 1. An MA(q) process is invertible when the roots of ϕ_q(B) all exceed unity in absolute value.
When AR and MA terms are added together, we get an ARMA process. An ARMA(p, q) process is θ_p(B)x_t = ϕ_q(B)w_t, that is, x_t = α_1x_{t−1} + ⋯ + α_px_{t−p} + w_t + β_1w_{t−1} + ⋯ + β_qw_{t−q}.
3.2 Smoothing
The basic running average estimate is defined by
ŝ_t = (1/k) Σ_{i=1}^k y_{t−i+1}
where k is the running average length. The choice of k depends on the smoothing desired. The larger
the k, the smoother is ŝt . To forecast the series,
ŝ_t = ŝ_{t−1} + (y_t − y_{t−k})/k
If there are no trends in the data, then the second term is zero. For a series that can be expressed as
yt = β0 + β1 t + ϵt ,
The exponentially smoothed estimate weights past observations with geometrically declining weights, ŝ_t = (1 − w) Σ_{i=0}^∞ wⁱy_{t−i} for a smoothing parameter 0 < w < 1. Because observations are not available in the infinite past, we take the truncated version
ŝ_t = (1 − w) Σ_{i=0}^t wⁱy_{t−i}
This is the exponential smoothed estimate of the series. Like running average estimates, the smoothed
estimates provide greater weights to more recent observations. Exponential smoothing has the recursive
equation
ŝt = ŝt−1 + (1 − w)(yt − ŝt−1 ) = (1 − w)yt + wŝt−1
To forecast, we should use ŷ_{T+l} = ŝ_T. To decide w, we minimize the sum of squared one-step prediction errors
SS(w) = Σ_{t=1}^T (y_t − ŝ_{t−1})²
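A sketch of simple exponential smoothing and of choosing w by minimizing the one-step prediction sum of squares SS(w) over a grid; the grid itself is arbitrary.

```python
import numpy as np

def exp_smooth(y, w):
    """Recursive smoothing s_t = (1 - w) y_t + w s_{t-1}, started at s_1 = y_1."""
    s = np.empty(len(y))
    s[0] = y[0]
    for t in range(1, len(y)):
        s[t] = (1 - w) * y[t] + w * s[t - 1]
    return s

def ss_one_step(y, w):
    """SS(w): sum of squared one-step prediction errors (y_t - s_{t-1})^2."""
    s = exp_smooth(y, w)
    return np.sum((y[1:] - s[:-1]) ** 2)

def choose_w(y, grid=np.linspace(0.05, 0.95, 19)):
    """Grid search for the smoothing parameter minimizing SS(w)."""
    return grid[np.argmin([ss_one_step(y, w) for w in grid])]
```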
Double exponential smoothing applies the smoothing operation twice:
1. Take ŝ_t^(1) = (1 − w)y_t + wŝ_{t−1}^(1).
2. Take ŝ_t^(2) = (1 − w)ŝ_t^(1) + wŝ_{t−1}^(2).
The estimate of the trend is b_{1,T} = ((1 − w)/w)(ŝ_T^(1) − ŝ_T^(2)). The forecasts are given by ŷ_{T+l} = b_{0,T} + b_{1,T}·l, where the intercept is b_{0,T} = 2ŝ_T^(1) − ŝ_T^(2).
Similar to running average smoothing, exponential smoothed estimates are WLS estimates with weights
given by wt = wT −t . Exponential smoothing estimates are also called discounted least squares esti-
mates.
The Holt–Winters method generalizes double exponential smoothing to seasonal data. The model is
y_t = β_0 + β_1t + S_t + ϵ_t
where S_t = S_{t−g} and Σ_{i=1}^g S_i = 0. Three smoothing parameters are employed, and the level, trend, and seasonal estimates are determined recursively by exponentially weighted updates.
Under the null hypothesis of a unit root (ϕ = 1), the sampling distribution of the t-statistic for ϕ − 1 is not the usual t-distribution. This test is called the Dickey-Fuller test. A criticism of the Dickey-Fuller
test is that the disturbance term is assumed to be serially uncorrelated. The augmented Dickey-Fuller
test statistic is the t-statistic associated with the yt−1 variable using OLS on the equation
y_t − y_{t−1} = β_0 + (ϕ − 1)y_{t−1} + β_1t + Σ_{j=1}^p ϕ_j(y_{t−j} − y_{t−j−1}) + ϵ_t
In this equation, the disturbance term is augmented by the autoregressive terms in the differences
yt−j −yt−j−1 . The idea is that these terms capture serial correlation in the disturbance term. Consensus
is not reached on choosing the number of lags p. Analysts provide results of the test statistic for a
number of choices of lags and hope that conclusions are similar.
We first present the autoregressive changing heteroscedasticity model of order q, ARCH(q). Assume
ϵt given Ωt−1 is normally distributed with mean zero and variance σt2 . Further assume that the
conditional variance is determined recursively by
σ_t² = w + γ_1ϵ²_{t−1} + ⋯ + γ_qϵ²_{t−q}
The term w > 0 is the long-run volatility parameter and γ_1, γ_2, ⋯, γ_q are coefficients such that γ_j ≥ 0 and γ(1) = Σ_{j=1}^q γ_j < 1. If q = 1, a large change to ϵ²_{t−1} can induce a large conditional variance σ_t².
Higher orders of q capture longer-term effects. Despite having a changing conditional variance, the
unconditional variance remains constant over time.
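A sketch simulating an ARCH(1) series with the recursion σ_t² = w + γ_1ϵ²_{t−1}; the parameter values are arbitrary, chosen only to satisfy γ_1 < 1 so that the unconditional variance exists.

```python
import numpy as np

def simulate_arch1(n=1000, w=0.2, gamma1=0.5, seed=0):
    """eps_t = sigma_t * z_t with sigma_t^2 = w + gamma1 * eps_{t-1}^2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    eps = np.zeros(n)
    sigma2 = np.full(n, w / (1 - gamma1))          # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n):
        sigma2[t] = w + gamma1 * eps[t - 1] ** 2   # conditional variance recursion
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return eps, sigma2
```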
Next, we discuss the generalized ARCH model of order p, GARCH(p, q). As with the ARCH model,
we assume the distribution of ϵt given Ωt−1 is normally distributed with mean zero and variance σt2 .
It is determined recursively by
σ_t² − δ_1σ²_{t−1} − ⋯ − δ_pσ²_{t−p} = w + γ_1ϵ²_{t−1} + ⋯ + γ_qϵ²_{t−q}