SRM Notes

dex

November 13, 2022

1 Introduction to Statistical Learning


1.1 ISLR Chapter 2
There is a prediction accuracy and model interpretability trade-off. Inflexible models are often more
interpretable than flexible models. The most commonly used measure for quality of fit is the mean
squared error
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat f(x_i)\right)^2$$

This is known as the training MSE and fˆ is typically found by minimizing training MSE. However, the
ideal fˆ minimizes test MSE. The test MSE can be estimated by cross-validation. Usually the test MSE
follows a U-shape where as the flexibility of the learning method increases, the test MSE decreases
until it starts to increase again.
The expected test MSE can be decomposed into the following form:

$$E\left[(y_0 - \hat f(x_0))^2\right] = \mathrm{var}(\hat f(x_0)) + [\mathrm{Bias}(\hat f(x_0))]^2 + \mathrm{var}(\epsilon)$$

The ideal learning method achieves low variance and low bias. The final term var(ϵ) is said to be an
irreducible error and this term can never be lowered and serves as a lower bound to the expected test
MSE. Bias is introduced by approximating a complicated relationship by a simple model. Generally,
as the flexibility of the model increases, the bias will decrease. Variance refers to the amount by which
fˆ changes if it is given a different training data set. In general, more flexible statistical methods have
higher variance.
In the classification setting the training MSE is replaced with the training error rate defined as
$$\frac{1}{n}\sum_{i=1}^{n} 1\{y_i \neq \hat y_i\}$$

Similarly to the regression setting, the ideal classifier minimizes test error rate. The expected test
error rate is minimized by the Bayes classifier which assigns a test observation with predictor x0 to the
class j for which P (Y = j | X = x0 ) is largest. If there are only two classes, then the Bayes classifier
corresponds to P (Y = 1 | X = x0 ) > 0.5. The set for which the conditional probability is exactly 0.5
is called the Bayes decision boundary. The Bayes classifier produces the lowest possible test error rate
called the Bayes error rate. The Bayes error rate is given by
$$1 - E\left[\max_j P(Y = j \mid X)\right]$$

Since for real data we do not know the conditional distribution, the actual Bayes classifier cannot
be used but may be treated as the gold standard to compare other classification methods. We may
instead try to classify observations with an estimated conditional distribution and one such method is
the K-nearest neighbors (KNN) classifier.
$$\hat P(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in N_0} 1\{y_i = j\}$$

where N0 refers to the K points in the training data closest to x0 . KNN classifies the test observation
x0 to the class with the largest estimated conditional probability. The choice of K controls the smoothness
of the classifier: a larger K gives a smoother, less flexible fit with higher bias but lower variance, while a
smaller K gives a more flexible fit with lower bias but higher variance.
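The estimated conditional probabilities above translate directly into code. Below is a minimal NumPy sketch of a KNN classifier; the function name and the toy arrays X_train, y_train, and x0 are hypothetical stand-ins for a real data set.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K):
    """Classify x0 by majority vote among the K nearest training points."""
    # Euclidean distances from x0 to every training observation
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the K closest points (the neighborhood N_0)
    nearest = np.argsort(dists)[:K]
    # Estimated conditional probability for each class, then take the largest
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# Hypothetical toy data: two predictors, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, x0=np.array([1.2, 1.9]), K=3))  # -> 0
```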

1.2 ISLR Chapter 3


1.2.1 Simple linear regression
Simple linear regression assumes the form

Y = β0 + β1 X + ϵ

Estimates of βi ’s are found by minimizing the residual sum of squares (RSS)


$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

The minimizers are


$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
where $\bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$. The squared standard errors are
$$\mathrm{se}(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right], \qquad \mathrm{se}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$

where σ 2 = var(ϵ). These formulas are derived assuming the errors to be uncorrelated with common
variance σ 2 . Note that se(β̂1 ) is smaller when the xi are more spread out, which intuitively follows
from there being more leverage to estimate the slope in this case. In general, σ 2 is not known but can
be estimated from the data. This estimate of σ is known as the residual standard error given by the
formula
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}$$
Standard errors can also be used to compute Wald confidence intervals. We can also perform hypothesis
tests on the coefficients. To test the null hypothesis that there is no relationship between X and Y ,
we can test
$$H_0: \beta_1 = 0, \qquad H_1: \beta_1 \neq 0$$
We compute a t-statistic given by
$$t = \frac{\hat\beta_1}{\mathrm{SE}(\hat\beta_1)}$$
This follows a $t_{n-2}$ distribution under the null hypothesis.
To quantify the extent to which the model fits the data, we can look at two related quantities: the
RSE and the R2 statistic. The RSE provides an absolute measure of lack of fit of the model to the
data. The R2 states the proportion of variance explained.
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar y)^2$ is the total sum of squares. The statistic R2 always lies between 0 and
1. It is also related to the correlation. If r is the sample correlation between X and Y , in the case of
simple linear regression, R2 = r2 .
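A minimal NumPy sketch of these closed-form estimators together with the RSE and R2 just defined; the x and y arrays are made-up illustrative data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# Closed-form least squares estimates
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - 2))   # residual standard error
r2 = 1 - rss / tss             # proportion of variance explained
print(beta0, beta1, rse, r2)
```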

1.2.2 Multiple Linear Regression


The multiple linear regression model takes the form
$$Y = \beta_0 + \sum_{k=1}^{p} \beta_k X_k + \epsilon$$

where Xj represents the jth predictor and βj quantifies the association between that variable and
the response. We interpret βj as the average effect on Y of a unit increase in Xj holding all other
predictors fixed.
Given estimators β̂0 , β̂1 , · · · , β̂p , predictions take the form
$$\hat y = \hat\beta_0 + \sum_{k=1}^{p} \hat\beta_k x_k$$

Multiple linear regression minimizes the sum of squared residuals


$$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat y_i)^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \sum_{k=1}^{p}\hat\beta_k x_{ik}\right)^2$$

Given a full rank design matrix X, the OLS estimator is given by the equation

$$\hat\beta_{\mathrm{OLS}} = (X^T X)^{-1} X^T Y$$
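As a sketch, the closed-form solution can be computed directly in NumPy (in practice np.linalg.lstsq or a QR decomposition is numerically preferable to forming the inverse); the X_raw and y arrays below are hypothetical.

```python
import numpy as np

X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 13.0])

# Prepend a column of ones so that beta[0] is the intercept
X = np.column_stack([np.ones(len(y)), X_raw])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) beta = X^T y
print(beta_ols)
```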

To answer if there is a relationship between the response and predictors we can take the hypothesis
test with null hypothesis
H0 : β1 = β2 = · · · = βp = 0
with the alternative hypothesis being that at least one βk for k > 0 is not zero. For this purpose, we
compute the F -statistic
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar y)^2$. If the model assumptions are correct, then

E[RSS/(n − p − 1)] = σ 2

and if H0 holds, then


E[(TSS − RSS)/p] = σ 2

Hence if there is no relationship between the response and predictors, we expect the F -statistic to be
close to 1. If the alternative hypothesis holds, then E[(TSS − RSS)/p] > σ 2 so we expect F to be
greater than 1.
Sometimes we would like to instead test that a particular subset of q of the coefficients are zero. That
is,
H0 : βp−q+1 = βp−q+2 = · · · = βp = 0
Let RSS0 denote the residual sum of squares that use all variables except the last q. Then we take the
F -statistic
$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}$$
This F -statistic reports the partial effect of adding these q variables. In all of these cases, for a nested
model with p1 < p2 , we compare with the F distribution with (p2 − p1 , n − p2 − 1) degrees of freedom.
Variable selection is the task of determining which predictors are associated with the response in order
to fit a single model involving only those predictors. Various statistics can be used to judge the quality
of a model including Mallow’s Cp , Akaike information criterion (AIC), Bayesian information criterion
(BIC), and adjusted R2 . Since there are a total of 2p models that contain subsets of p variables, it
is infeasible to try every subset. More efficient approaches must be taken. There are three classical
approaches:
1. Forward selection. Begin with the null model then fit p simple linear regressions and add to the
null model the variable that results in the lowest RSS. We then add to that model the variable
that results in the lowest RSS for the new two-variable model and continue until some stopping
rule is satisfied.
2. Backward selection. We start with all variables in the model and remove the variable with the
largest p-value. The new (p − 1)-variable model is fit and the variable with the largest p-value is
removed and we continue this procedure until a stopping rule is reached.
3. Mixed selection. This is a combination of forward and backward selection. Start with no variables,
then add the variables that provide the best fit one-by-one. If at any point the p-value for one
of the variables in the model rises above a threshold, we remove that variable from the model
and we continue to perform forward and backward steps until all variables in the model have a
sufficiently low p-value.
Backward selection cannot be used if p > n and forward selection can always be used. Forward
selection is a greedy approach and can include variables that later become redundant. Mixed selection
can remedy this.
In multiple linear regression, R2 = cor(Y, Ŷ )2 . An R2 value close to 1 indicates that the model explains
a large portion of the variance in the response variable. The R2 statistic will always increase when
more variables are added to the model, even if those variables are only weakly associated with the
response since adding another variable always results in a decrease in the residual sum of squares in
the training data.
For multiple linear regression the RSE is defined as
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}}$$

Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the
increase in p.

The inaccuracy in the coefficient estimates is related to the reducible error and one can compute
a confidence interval. Even if the true values for β0 , β1 , · · · , βp are known, the response cannot be
predicted perfectly. The error due to ϵ is called the irreducible error. Prediction intervals are always
wider than confidence intervals since they incorporate both the error in the estimate of f (the reducible
error) and the randomness of the individual point (the irreducible error).

1.2.3 Other considerations


Sometimes qualitative predictors known as factors are used. If a factor has two levels, we can create
a dummy variable or an indicator. When a qualitative predictor has more than two levels, we can
create additional dummy variables. There will always be one fewer dummy variable than the number
of levels. The level with no dummy variable is known as the baseline.
The standard linear regression model has several highly restrictive assumptions that are often violated.
It assumes that the relationship between the predictors and response are additive and linear. One way
of relaxing the additive assumption is by introducing an interaction term. Take the following model
for example:
Y = β0 + β1 X1 + β2 X2 + β3 X1 X2 + ϵ
The predictor X1 X2 is said to be an interaction term constructed by computing the product of X1
and X2 . Note that the interpretation of parameters also changes with the inclusion of the interaction
term. The hierarchical principle states that if we include an interaction in a model, we should also
include the main effects, even if the p-values associated with their coefficients are not significant.
We can also extend the linear model by accommodating nonlinear relationships. For example, we
can take polynomial regression. The following model is nonlinear in the predictors but linear in the
coefficients:
mpg = β0 + β1 × horsepower + β2 × horsepower2 + ϵ
A list of common problems associated with linear regression:
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity
If the true relationship is far from linear, then all conclusions drawn from the fit are suspect. Residual
plots are a useful graphical tool to identify non-linearity. If the residual plot indicates there are
non-linear associations, one approach is to use non-linear transformations of the predictors.
If the errors are correlated, the estimated standard errors tend to underestimate the true standard
errors. Correlations frequently occur in time series data where individual data points are not indepen-
dent. If error terms are positively correlated, there may be tracking in the residuals, that is, adjacent
residuals have similar values.
We refer to a non-constant variance in the errors as heteroscedasticity. A nonlinear transformation
may result in a reduction of heteroscedasticity. If we have a good idea of the variance of each response,
then it is possible to fit with weighted least squares instead to accommodate for heteroscedasticity.
An outlier is a point for which yi is far from the value predicted by the model. Removing an outlier
may have little effect on the fit of the model but still have a high influence on the RSE. Residual plots

can be used to identify outliers. It can be difficult to decide how large a residual needs to be before
it is considered an outlier. We can instead plot the studentized residuals, computed by dividing the
residual ei by its estimated standard error. Observations whose studentized residuals are greater than
3 in absolute value are possible outliers.
Observations with high leverage have an unusual value for xi . High leverage observations have a
substantial impact on the estimated regression fit. In order to quantify an observation’s leverage, we
compute the leverage statistic. For simple linear regression,
$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^{n}(x_{i'} - \bar x)^2}$$

For multiple linear regression, the leverage score of the ith observation is the ith diagonal entry of the
hat-matrix H = X(X T X)−1 X T . That is,

$$h_{ii} = x_i (X^T X)^{-1} x_i^T$$
Leverage lies between 0 and 1, and the sum of the leverages equals the number of parameters. If
$h_{ii} > \frac{2p}{n}$, the observation can be considered a high-leverage point.
Collinearity refers to the situation in which two or more predictor variables are closely related to one
another. It can be difficult to separate the individual effects of collinear variables on the response.
Collinearity reduces the accuracy of the estimates of the regression coefficients, which causes the
standard error of β̂j to grow. This decreases the t-statistic, which in turn reduces the power. One
simple way to detect collinearity is in the correlation matrix of the predictors. An element that is
large in absolute value indicates a pair of highly correlated variables. However, the correlation matrix
does not detect multicollinearity, that is collinearity between three or more variables. One can instead
look at the variance inflation factor (VIF). The VIF is the ratio of the variance of β̂j when fitting the
full model divided by the variance of β̂j fit on its own. A VIF value that exceeds 5 or 10 indicates a
problematic amount of multicollinearity.
$$\mathrm{VIF}(\hat\beta_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$
where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors.
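A sketch of the VIF computation, regressing each predictor on the others exactly as in the definition above. The data here are simulated; X is assumed to be an n x p design matrix without the intercept column.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        fitted = others @ coef
        r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))   # large VIFs flag x1 and x2
```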

1.3 ISLR Chapter 4


1.3.1 Logistic Regression
Logistic regression models the probability that Y belongs to a particular category. Let p(X) = P (Y =
1 | X). We take the model
$$\mathrm{logit}(p(X)) = \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \sum_{k=1}^{p} \beta_k X_k$$

To estimate the regression coefficients βk ’s, we take the MLE.


If only a few predictors are used when other predictors are relevant, then there may be confounding.
We can extend the two-class logistic regression approach to K > 2 classes, known as multinomial
logistic regression. We take the following model:
$$\log\left(\frac{P(Y = k \mid X = x)}{P(Y = K \mid X = x)}\right) = \sum_{j=0}^{p} \beta_{kj} x_j$$
The $K$th class is known as the baseline and satisfies
$$P(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp\left(\sum_{j=0}^{p} \beta_{lj} x_j\right)}$$
and for $k \in \{1, 2, \dots, K-1\}$,
$$P(Y = k \mid X = x) = \frac{\exp\left(\sum_{j=0}^{p} \beta_{kj} x_j\right)}{1 + \sum_{l=1}^{K-1} \exp\left(\sum_{j=0}^{p} \beta_{lj} x_j\right)}$$
An alternative coding for multinomial logistic regression is the softmax coding. Rather than selecting
a baseline class, we treat all $K$ classes symmetrically and assume for $k \in \{1, 2, \dots, K\}$,
$$P(Y = k \mid X = x) = \frac{\exp\left(\sum_{j=0}^{p} \beta_{kj} x_j\right)}{\sum_{l=1}^{K} \exp\left(\sum_{j=0}^{p} \beta_{lj} x_j\right)}$$
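A short sketch fitting a multinomial logistic regression with scikit-learn (assuming scikit-learn is available; the synthetic data below are purely illustrative). Recent versions of LogisticRegression fit a softmax-style multinomial model by default when there are more than two classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data: 3 classes, 2 predictors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 2, 4)])
y = np.repeat([0, 1, 2], 50)

model = LogisticRegression(max_iter=1000).fit(X, y)  # coefficients via MLE
print(model.predict_proba(X[:3]))   # estimated class probabilities
print(model.predict(X[:3]))         # predicted classes
```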

1.3.2 Linear Discriminant Analysis


We now consider alternative approaches for classification. Let fk (X) = p(X|Y = k) be a conditional
density. Let πk represent the prior probability that a randomly chosen observation comes from the
kth class. Bayes theorem states that

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

We use the abbreviation pk (x) = P (Y = k | X = x) to be the posterior probability that an observation


X = x belongs to the kth class. Generally estimating πk is easy by just taking the fraction of the
training sample that are in the kth class. Estimating fk (x) is more challenging.
Assume p = 1. Further assume fk (x) is normal.
 
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k}\exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)$$
Further assume that $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2 = \sigma^2$. Then
$$p_k(x) = \frac{\pi_k \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$$
Thus, given this form, we should assign observations to the class for which
$$\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
is largest. The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging
in estimates for πk , µk and σ 2 .
1 X
µ̂k = xi
nk
i|yi =k

K
1 X X
σ̂ 2 = (xi − µ̂k )2
n−K
k=1 i|yi =k

7
where n is the total number of training observations, nk is the number of training observations in the
kth class. In the absence of any additional information, take
nk
π̂k =
n
The LDA classifier assigns observation X = x to the class for which
µ̂k µ̂2
δ̂k (x) = x 2
− k2 + log(π̂k )
σ̂ 2σ̂
is largest. The name comes from the discriminant functions δ̂k (x) being linear functions of x.
We can extend the LDA classifier to multiple predictors by assuming X is drawn from a multivariate
normal distribution with a class specific mean vector and common covariance matrix. We find that
$$\delta_k(x) = x^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log\pi_k$$
Similar estimates for µ1 , · · · , µK , π1 , · · · , πK and Σ can be given. As before, δ̂k (x) is a linear function
of x. For binary classification, a confusion matrix is a convenient way of displaying which types of
errors are being made. A confusion matrix has entries true positive (TP), false negative (FN), false
positive (FP), true negative (TN). Sensitivity, recall, or hit rate refers to the true positive rate (TPR).
Specificity, or selectivity, refers to the true negative rate (TNR). The Bayes classifier assigns an observation
to the class for which the posterior probability $p_k(x)$ is greatest. For the two-class case, the threshold
is 50%. However, if we are concerned about incorrect predictions for a particular class, we may change
the threshold. The ROC curve displays the error rate for all possible thresholds. It plots the true
positive rate (TPR) against the false positive rate (FPR). The overall performance of a classifier is
summarized by the area under the curve (AUC) statistic. An ideal ROC curve is close to the top left
corner.
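A sketch tying LDA to the confusion matrix and ROC/AUC discussion, using scikit-learn on made-up binary data (library availability is assumed).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
post = lda.predict_proba(X)[:, 1]          # posterior probability of class 1

# Default 50% threshold versus a lower threshold that favors sensitivity
print(confusion_matrix(y, (post > 0.5).astype(int)))
print(confusion_matrix(y, (post > 0.2).astype(int)))
print(roc_auc_score(y, post))              # area under the ROC curve
```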

1.3.3 Quadratic Discriminant Analysis


Quadratic discriminant analysis also assumes that the observations for each class are drawn from a
normal distribution and plugs in estimators for the parameters into Bayes’ theorem for prediction.
Unlike LDA, QDA assumes that each class has its own covariance matrix. Then
$$\begin{aligned}\delta_k(x) &= -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log\pi_k \\ &= -\frac{1}{2}x^T \Sigma_k^{-1}x + x^T \Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k\end{aligned}$$
QDA is more flexible than LDA and has lower bias but greater variance. QDA is recommended if
the training set is very large or if the assumption of a common covariance matrix is untenable.

1.3.4 Naive Bayes


The Naive Bayes classifier assumes that within the kth class, the p predictors are independent. That
is,
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j)$$
This assumption means the association between the predictors need not be estimated. The model
assumes
$$P(Y = k \mid X = x) = \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{p} f_{lj}(x_j)}$$

for $k \in \{1, 2, \dots, K\}$. To estimate the one-dimensional density functions $f_{kj}$, we can assume
$X_j \mid Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$. This amounts to QDA with the additional assumption that the class-specific covariance
matrices are diagonal. Another option is to use a nonparametric estimate for fkj such as a kernel
density estimator. If Xj is a factor, then we can simply count the proportion of training observations
for the jth predictor corresponding to each class.
It should be noted that the naive Bayes estimator is a generalized additive model and that LDA is a
special case of Naive Bayes. Neither QDA nor naive Bayes is a special case of the other. Naive Bayes
can be flexible but is limited to an additive fit whereas QDA contains quadratic terms.
We can compare the previous methods discussed with KNN. Since KNN is completely nonparametric,
it should dominate LDA and logistic regression when the decision boundary is highly nonlinear and
when n is large while p is small. KNN requires a lot of observations relative to the number of predictors.
If n is modest or p is not very small, QDA may be preferred to KNN since it is nonlinear while still
having a parametric form. KNN does not tell which predictors are important unlike logistic regression.

1.4 ISLR Chapter 5


Resampling methods involve repeatedly drawing samples from a training set and refitting a model of
interest on each sample in order to obtain additional information about the fitted model. Resampling
approaches are typically computationally expensive due to the need to fit the same statistical method
multiple times. The most commonly used resampling methods are cross-validation and bootstrap.
Model assessment is the process of evaluating a model’s performance and model selection is the process
of selecting a proper level of flexibility for a model.

1.4.1 Cross Validation


In the absence of a designated test set, one can estimate the test error rate by holding out a subset of the
training observations from the fitting process. The validation set error rate provides an estimate of the
test error rate. This approach is conceptually simple and easy to implement but the validation estimate
of the test error rate can be highly variable, and the validation set error rate tends to overestimate the
test error rate for the model fit on the entire data set.
Leave-one-out-cross-validation (LOOCV) leaves a single observation out for the validation set and uses
the remaining n − 1 training observations to fit the statistical learning method. A prediction is made
for the excluded observation. This procedure is repeated for all other observations. The LOOCV
estimate of the test MSE is the average of these n test error estimates.
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i$$

The k-fold CV approach involves randomly dividing the set of observations into k groups, or folds,
of approximately equal size. The first fold is used as a validation set and the method is fit on the
remaining k − 1 folds. The mean squared error is computed on the observations in the held-out fold.
This procedure is repeated by treating other folds as validation sets. The k-fold CV estimate averages
these values:
$$\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i$$
In practice, one usually takes k = 5 or k = 10. The obvious advantage over LOOCV is the computation
speed. It also gives more accurate estimates of the test error rate than does LOOCV due to a bias-
variance trade-off. The outputs of the fitted models are correlated with each other, with the correlation

increasing with increased number of folds, and since the mean of highly correlated quantities has higher
variance, the test error estimate of LOOCV tends to have higher variance than that of k-fold CV.
On classification problems, the LOOCV error rate is
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}_i$$
where $\mathrm{Err}_i = 1\{y_i \neq \hat y_i\}$, and similarly for $k$-fold cross-validation.
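A sketch of the k-fold CV estimate of the test MSE using scikit-learn (the regression data are synthetic and purely illustrative).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 10-fold cross-validated estimate of the test MSE
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse_folds = -cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=cv)
print(mse_folds.mean())   # CV_(10) estimate of the test MSE
```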

1.4.2 Bootstrap
The bootstrap emulates the process of obtaining new sample sets by repeatedly sampling observations
from the original data set. This sampling is done with replacement. The bootstrap standard error of
$\hat\alpha$ is given by
$$\mathrm{SE}_B(\hat\alpha) = \sqrt{\frac{1}{B-1}\sum_{r=1}^{B}\left(\hat\alpha^{*r} - \frac{1}{B}\sum_{r'=1}^{B}\hat\alpha^{*r'}\right)^2}$$
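A sketch of the bootstrap standard error formula above, applied to the standard error of a sample median; the statistic and the simulated data are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=100)   # original sample
B = 1000

# Resample with replacement B times and recompute the statistic each time
alpha_star = np.array([np.median(rng.choice(data, size=len(data), replace=True))
                       for _ in range(B)])

# Plug the B bootstrap replicates into the SE_B formula
se_boot = np.sqrt(np.sum((alpha_star - alpha_star.mean()) ** 2) / (B - 1))
print(se_boot)
```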

1.5 ISLR Chapter 6


1.5.1 Subset Selection
Best subset selection fits a separate least squares regression for every possible combination of the p
predictors with the goal of identifying the one that is best. The algorithm is described in the following
way:
1. Let M0 denote the null model, which contains no predictors.
2. For k ∈ {1, 2, · · · , p}:
   • Fit all $\binom{p}{k}$ models that contain exactly k predictors.
   • Pick the best among these $\binom{p}{k}$ models, and call it Mk . Best is defined as the one with the
     smallest RSS or largest R2 .


3. Select a single best model among M0 , · · · , Mp using cross-validated prediction error, Cp , AIC,
BIC, or adjusted R2 .
In logistic regression, instead of ordering by RSS, we instead use deviance which is negative two times
the maximized log-likelihood. The smaller the deviance, the better the fit. Best subset selection is
infeasible for high values of p.
Forward stepwise selection is a computationally efficient alternative. Forward stepwise selection algo-
rithm is as follows:
1. Let M0 denote the null model.
2. For k ∈ {0, 1, · · · , p − 1},
(a) Consider all p − k models that augment the predictors in Mk with one additional predictor.
(b) Choose the best among these p − k models, and call it Mk+1 . Best is defined as smallest
RSS or deviance.
3. Select a single best model from among M0 , · · · , Mp using cross-validated prediction error, Cp ,
AIC, BIC, or adjusted R2 .
Backward stepwise selection is another alternative to best subset selection. The algorithm is as follows:

1. Let Mp denote the full model.
2. For k = {1, 2, · · · , p}:
(a) Consider all k models that contain all but one of the predictors in Mk , for a total of k − 1 predictors.
(b) Choose the best among these k models and call it Mk−1 . Best is defined as smallest RSS
or deviance.
3. Select a single best model among M0 , · · · , Mp using cross-validated prediction error, Cp , AIC,
BIC, or adjusted R2 .
Forward and backward selection are not guaranteed to yield the best model. A hybrid approach where
both backward and forward steps are available is also possible.
Generally, training set MSE is an underestimate of the test MSE. But there are techniques for adjusting
the training error for the model size. For a fitted least squares model containing d predictors, Mallow’s
Cp estimate of test MSE is given by
$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right)$$
Typically σ̂ 2 is estimated using the full model containing all predictors. If σ̂ 2 is an unbiased estimator
of σ 2 , then Cp is an unbiased estimate of the test MSE.
The Akaike information criterion (AIC) is defined for a large class of models fit by maximum likelihood.
AIC is given by
$$\mathrm{AIC} = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right)$$
The Bayesian Information Criterion (BIC) takes the form
1
BIC = (RSS + log(n)dσ̂ 2 )
n

The adjusted R2 statistic is calculated as


RSS/(n − d − 1)
Adjusted R2 = 1 −
TSS/(n − 1)
Alternatively, we can directly estimate the test error using the validation set and cross-validation
methods. The advantage is that it makes fewer assumptions about the underlying model. For cross-
validation, we can use the one-standard-error rule, where we first calculate the standard error of the
estimated test MSE for each model size, and then select the smallest model for which the estimated
test error is within one standard error of the lowest point on the curve. The idea is that if a set of
models appear to be more or less equally good, then we should choose the simplest model.

1.5.2 Shrinkage Methods


The OLS estimator minimizes the RSS
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$

Ridge regression minimizes the RSS plus a shrinkage penalty:
$$\hat\beta_\lambda^R = \mathop{\mathrm{argmin}}_{\beta}\ \mathrm{RSS}(\beta) + \lambda\sum_{j=1}^{p}\beta_j^2$$

The tuning parameter λ serves to control the relative impact of the penalty. When λ = 0, there is
no penalty, and as λ → ∞ the regression coefficients shrink toward zero. Ridge regression will generate a
different set of coefficients for each value of λ. Note that we do not shrink the intercept.
The standard least squares coefficients are scale equivariant. Multiplying the predictors by a constant c
leads to a scaling of the coefficient estimates by c−1 . However, the ridge regression coefficient estimates
are not scale equivariant so it is best to apply ridge regression after standardizing the predictors first.
Ridge regression is advantageous due to the bias-variance trade-off. As λ increases, the flexibility of
the ridge regression fit decreases, leading to decreased variance but increased bias.
Ridge regression has the disadvantage of shrinking all coefficients towards zero but not setting any of
them exactly to zero, leading to problems with model interpretation. The lasso is an alternative to
ridge regression that sets some coefficients to zero. It has the form
p
X
β̂λR = argmin RSS(β) + λ |βj |
β j=1

The ℓ1 penalty forces some coefficient estimates to be exactly zero when the tuning parameter λ is
sufficiently large. In this sense, the lasso performs variable selection. We say that the lasso yields
sparse models.
Ridge regression and lasso are described as solutions to a convex optimization problem, which have
the Lagrangian dual problems
$$\mathop{\mathrm{argmin}}_{\beta:\,\|\beta\|_1 \le s}\mathrm{RSS}(\beta), \qquad \mathop{\mathrm{argmin}}_{\beta:\,\|\beta\|_2 \le s}\mathrm{RSS}(\beta)$$
For every λ, there is a corresponding s that gives the same coefficient estimates. Best subset selection
may be thought of as the solution to
$$\mathop{\mathrm{argmin}}_{\beta:\,\|\beta\|_0 \le s}\mathrm{RSS}(\beta)$$

The ℓ1 norm is a convex approximation to the ℓ0 norm and hence the lasso may be thought of as a
convex relaxation of the best subset selection problem.
Ridge regression and the lasso require a method to select the tuning parameter. We choose a grid of λ
values and compute the cross-validation error for each value of λ then select the tuning parameter
for which the cross-validation error is smallest. Then the model is refit with the selected value of the
tuning parameter.
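A sketch of this grid-search-over-λ procedure using scikit-learn's RidgeCV and LassoCV (scikit-learn calls the tuning parameter alpha). The data are simulated, and the predictors are standardized first, as recommended above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
beta = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)   # ridge/lasso are not scale equivariant
alphas = np.logspace(-3, 3, 50)          # grid of lambda values

ridge = RidgeCV(alphas=alphas, cv=10).fit(Xs, y)
lasso = LassoCV(alphas=alphas, cv=10).fit(Xs, y)

print(ridge.alpha_, ridge.coef_)   # all coefficients shrunk, none exactly zero
print(lasso.alpha_, lasso.coef_)   # some coefficients set exactly to zero
```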

1.5.3 Dimension Reduction


Let Z1 , Z2 , · · · , ZM represent M < p linear combinations of the original p predictors.
$$Z_m = \sum_{j=1}^{p}\phi_{jm}X_j$$

for constants ϕjm , j ∈ [p], m ∈ [M ]. Fit the linear regression model


$$y_i = \theta_0 + \sum_{m=1}^{M}\theta_m z_{im} + \epsilon_i$$

with least squares. If the constants are chosen wisely then the dimension reduction should outperform
least squares.

Principal components analysis (PCA) is a popular approach for deriving a low-dimensional set of
features. PCA is performed by maximizing the Rayleigh quotient. The first vector satisfies
$$\phi_{\cdot 1} = \mathop{\mathrm{argmax}}_{w}\left\{\frac{w^T X^T X w}{w^T w}\right\}$$
Given the first $k-1$ components, let $\hat X_k = X - \sum_{s=1}^{k-1} X\phi_{\cdot s}\phi_{\cdot s}^T$ and
$$\phi_{\cdot k} = \mathop{\mathrm{argmax}}_{w}\left\{\frac{w^T \hat X_k^T \hat X_k w}{w^T w}\right\}$$
Note that even though PCR provides a simple way to perform regression using M < p predictors, it
is not a feature selection method. Generally, one should standardize the predictors before generating
the principal components.
Partial least squares (PLS) is a supervised alternative to PCR. Even though the low rank approximation
of X in PCR best accounts for X, it is not necessarily optimal for regression. The PLS1 algorithm is
as follows.
1. Columnwise center X and y and set X0 = X.
2. Repeat for i = 1, · · · , k:
   (a) Set $w_i = X_{i-1}^T y / \|X_{i-1}^T y\|$
   (b) Set $t_i = X_{i-1} w_i / \|X_{i-1} w_i\|$
   (c) Set $v_i = X_{i-1}^T t_i$
   (d) Set $X_i = X_{i-1} - t_i v_i^T$
3. Collect the vectors into matrices $W_k$, $T_k$, $V_k$. Then
$$\hat\beta^{(k)}_{\mathrm{PLSE}} = W_k (V_k^T W_k)^{-1} T_k^T y$$

For a given value of k, the PLSE has a better predictability than the corresponding PCR estimator.
As with PCR, the number of dimensions is a tuning parameter typically chosen by cross-validation.

1.5.4 High dimensions


The low-dimensional setting is the case in which n ≫ p. Data sets where p > n are referred to as
high-dimensional. Regularization plays a key role in high-dimensional problems and an appropriate
tuning parameter selection is crucial for good predictive performance. The curse of dimensionality
states that the test error tends to increase with dimension unless the additional features are truly
associated with the response.
In the high-dimensional setting, the multicollinearity problem is extreme. Every variable in the model
can be written as a linear combination of all other variables in the model. Thus we can never know
exactly which variables are truly predictive of the outcome, and we can never identify the best coef-
ficients for use in the regression. In the high-dimensional setting, if forward stepwise selection selects
some predictors, it would be incorrect to conclude that these predictors predict more effectively than
the other predictors not included. There are likely many sets of predictors that predict just as well as
the selected model.
In the high-dimensional setting, it is easy to obtain a useless model with zero residuals. Thus one
should not use traditional measures of model fit on training data as evidence of a good model fit in
the high-dimensional setting. One must report results on an independent test set or cross-validation
errors.

1.6 ISLR Chapter 7
1.6.1 Basis functions
Polynomial regression allows for nonlinearity by allowing for polynomials in the predictors. Using
polynomial functions of the features imposes a global structure on the nonlinear function of X. One
can instead use step functions. We can break the range of X into bins and fit a different constant in
each bin. This converts a continuous variable into an ordered categorical variable. With indicators on
the bins, this may be fit in the same way as linear regression.
Polynomial and piecewise constant regression models are special cases of a basis function approach.
Instead of fitting a linear model in X, we can fit
$$y_i = \beta_0 + \sum_{j=1}^{K}\beta_j b_j(x_i) + \epsilon_i$$

One flexible class of basis functions are piecewise polynomials. The points where the coefficients
change are called knots. We can also impose constraints on the piecewise polynomials by imposing
continuity, continuity of the first derivative, and continuity of the second derivative. With third degree
polynomials and the constraints, we get cubic splines. Cubic splines with K knots have K + 4 degrees
of freedom. One way to represent a cubic spline is with a truncated power basis. A truncated power
basis function is defined as
$$h(x, \xi) = (x - \xi)^3_+$$
where ξ is the knot. Splines can have high variance at the outer range of the predictors. A natural
spline is a regression spline with additional boundary constraints: the function is required to be linear
at the boundary.
It is common to place the knots of a spline at uniform quantiles of the data. One can again use
cross-validation to select for the degrees of freedom to be used.
We can also take a different approach to produce a spline. Note that if we wanted to find a function
g(x) that fits the observed data well, we can simply choose a g that interpolates all of the yi . We can
control this by requiring g to be smooth. We can find the function g that minimizes
$$\sum_{i=1}^{n}(y_i - g(x_i))^2 + \lambda\int g''(t)^2\,dt$$

where λ is a nonnegative tuning parameter. The function g minimizing this objective is called a
smoothing spline. A smoothing spline can be shown to be a piecewise cubic polynomial with knots at
the unique values of x1 , · · · , xn , with continuous first and second derivatives at each knot, and that is
linear outside the extreme knots. The smoothing parameter λ controls the effective degrees of freedom.
Define
ĝλ = Sλ y
where ĝλ is the solution to the minimization problem for the smoothing spline. The effective degrees
of freedom is defined as
dfλ = tr(Sλ )
The LOOCV can be computed efficiently for smoothing splines.
$$\mathrm{RSS}_{cv}(\lambda) = \sum_{i=1}^{n}\left[\frac{y_i - \hat g_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}}\right]^2$$

1.6.2 Local Regression
Local regression computes the fit at a target point using only the nearby training observations. Local
regression is sometimes referred to as a memory-based procedure because we need all the training data
each time we wish to compute a prediction. The most important choice in a local regression is the
span s, the proportion of points used to compute the local regression at x0 . The span controls the
flexibility of the fit.
1. Gather the fraction s = k/n of training points whose xi are closest to x0 .
2. Assign a weight Ki0 = K(xi , x0 ) to each point in the neighbourhood so that the closer points
have higher weight; all but these k nearest neighbors have weight zero.
3. Fit a weighted least squares regression on the yi by finding the β0 and β1 that minimize
$$\sum_{i=1}^{n} K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2$$

4. The fitted value at x0 is given by fˆ(x0 ) = β̂0 + β̂1 x0 .


A varying coefficient model such as a local regression is a useful way of adapting a model to the most
recently gathered data. However, local regression performs poorly if p is larger than 3 or 4 because
there will generally be very few training observations close to x0 .

1.6.3 Generalized Additive Models


Generalized additive models (GAM) provide a general framework for extending a standard linear model
by allowing nonlinear functions of each of the variables while maintaining additivity. Write the model
as
$$y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \epsilon_i$$

GAMs allow us to fit a nonlinear fj to each Xj so that we can automatically model non-linear relation-
ships that standard linear regression will miss. The nonlinear fits can potentially make more accurate
predictions for the response. Because the model is additive, we can examine the effect of each Xj on
Y while holding all other variables fixed. The smoothness of fj can be summarized by the degrees of
freedom. The main limitation of GAMs is that the model must be additive. For this purpose, we can
manually add interaction terms to GAMs or fit interaction terms using two-dimensional smoothers
such as two-dimensional splines.

1.7 ISLR Chapter 8


This chapter discusses tree-based methods. To start, a regression tree involves dividing the predictor
space into J distinct and non-overlapping regions R1 , R2 , · · · , RJ . For every observation in region Rj
we make the same prediction, the mean of the response values for the training observations in Rj .
While the regions can technically take any shape, they are usually hyperrectangles, or boxes. The ideal boxes
R1 , R2 , · · · , RJ minimize the SSE
$$\sum_{j=1}^{J}\sum_{i \in R_j}(y_i - \hat y_{R_j})^2$$

To reduce the computational burden, a greedy approach called recursive binary splitting is used. This
approach starts at the root (top) of the tree and splits the predictor space with two branches. It is

greedy because the best split is made at the particular step rather than looking ahead to make a better
split in the future. Define the half-planes
$$R_1(j, s) = \{X \mid X_j < s\}, \qquad R_2(j, s) = \{X \mid X_j \ge s\}$$

We seek to find the j and s that minimize
$$\sum_{i:\,x_i \in R_1(j,s)}(y_i - \hat y_{R_1})^2 + \sum_{i:\,x_i \in R_2(j,s)}(y_i - \hat y_{R_2})^2$$

We can repeat this process to make further splits in the data until a stopping criterion is reached, such
as requiring that no region contain more than five observations.
This process will likely overfit the data, leading to poor test set performance so in practice one grows
a large tree T0 and prunes it to obtain a subtree. Cost complexity pruning, also known as weakest link
pruning, considers trees indexed by a nonnegative tuning parameter α. Let |T | denote the number of
terminal nodes of tree T . For each α, there is a subtree T ⊂ T0 that minimizes
$$\sum_{m=1}^{|T|}\sum_{i:\,x_i \in R_m}(y_i - \hat y_{R_m})^2 + \alpha|T|$$

The tuning parameter α controls the subtree’s complexity. It turns out that branches get pruned from
the tree in a nested and predictable fashion. The tuning parameter α is selected via a validation set
or cross-validation.
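A sketch of cost complexity pruning with scikit-learn, where the tuning parameter α appears as ccp_alpha and is chosen by cross-validation; the regression data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Grow a large tree, then get the sequence of alphas from weakest-link pruning
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

scores = []
for a in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=a)
    mse = -cross_val_score(tree, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    scores.append(mse)

best_alpha = path.ccp_alphas[int(np.argmin(scores))]
print(best_alpha)   # alpha with the lowest cross-validated MSE
```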
A classification tree is similar to a regression tree but is used for classification. For a classification tree,
we predict that each observation belongs to the most commonly occurring class of training observations
in the region to which it belongs. We are not only interested in the class prediction to a terminal node
region but also the class proportions among the training observations that fall in the region. A different
objective function must be used for classification. The Gini index is defined by
$$G = \sum_{k=1}^{K}\hat p_{mk}(1 - \hat p_{mk})$$

A small value indicates that a node predominantly contains observations from a single class. An
alternative to Gini index is entropy, given by
$$D = -\sum_{k=1}^{K}\hat p_{mk}\log\hat p_{mk}$$

The entropy will take on a value near zero if p̂mk are all near zero or near one. Either the Gini index
or the entropy is typically used to evaluate the quality of a particular split, but the classification
error rate is preferable when pruning if prediction accuracy of the final pruned tree is the goal. Sometimes splits will
yield the same predicted value. The reason the split is made at all is because it increases node purity.
Advantages for decision trees:
• Trees are easily explained and easily interpreted by a non-expert.
• Trees can be displayed graphically.
• Trees more closely resemble human decision-making.
• Trees can handle qualitative predictors without needing to create dummy variables.

Disadvantages for decision trees:
• Trees do not have the same level of predictive accuracy as other regression and classification
approaches.
• Trees are non-robust. Small changes in the data can lead to large changes in the final estimated
tree.
An ensemble method combines many simple building block models to obtain a single more powerful
model. The building block models are known as weak learners.
Decision trees suffer from high variance. Bootstrap aggregation, or bagging, is a general-purpose procedure for
reducing the variance of a statistical learning method, which can be used for decision trees.
$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x)$$

Construct B regression trees using B bootstrapped training sets, and average the resulting predictions.
These trees are grown deep and are not pruned. Hence each individual tree has high variance but low
bias and averaging these B trees reduces the variance. For classification, we take the majority rule:
the overall prediction is the most commonly occurring class among the B predictions. Note here that
the number of trees B is not a tuning parameter and a very large value of B will not lead to overfitting.
On average, each bagged tree makes use of around two-thirds of the observations. The remaining
one third of the observations not used to fit a given bagged tree are referred to as the out-of-bag
observations. We can predict the response for the ith observation using each of the trees in which
that observation was OOB. This yields about B/3 predictions for the ith observation. Average
or take the majority vote to get a single OOB prediction for the ith observation. An overall OOB
MSE or classification error can be computed this way and the resulting OOB error is a valid estimate
of the test error for the bagged model. For large enough B, the OOB error is nearly equivalent to
leave-one-out cross-validation error.
Bagging results in improved accuracy but it is hard to interpret the resulting model. One can obtain
an overall summary of the importance of each predictor by recording the total amount that the SSE is
decreased due to splits over a given predictor, averaged over the B trees. A large value indicates an
important predictor. For classification, we record the total amount that the Gini index is decreased by
splits over a given predictor, averaged over the B trees.
Random forests improve on bagging by decorrelating the trees. As in bagging, one builds decision trees
on bootstrapped training samples but also takes a random sample of m predictors as split candidates
from the full set of p predictors. A fresh sample of m predictors is taken at each split, and we typically
choose m ≈ √p. If m = p, this is simply bagging. A small value of m is helpful when there are a large
number of correlated predictors.
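A sketch comparing bagging and a random forest in scikit-learn. Setting max_features controls m: max_features=1.0 uses all p predictors (bagging), while max_features="sqrt" corresponds to m ≈ √p. The data set is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

# All p predictors considered at each split, i.e. bagging
bagging = RandomForestRegressor(n_estimators=500, max_features=1.0,
                                oob_score=True, random_state=0).fit(X, y)
# Roughly sqrt(p) candidate predictors at each split, i.e. a random forest
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                               oob_score=True, random_state=0).fit(X, y)

print(bagging.oob_score_, forest.oob_score_)   # out-of-bag R^2 for each fit
print(forest.feature_importances_)             # variable importance summary
```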
Here is a boosting algorithm:
1. Set fˆ(x) = 0 and ri = yi for all i in the training set.
2. For b = 1, 2, · · · , B:
(a) Fit a tree fˆb with d splits to the training data (X, r).
(b) Update fˆ by adding a shrunken version of the new tree:

fˆ(x) ← fˆ(x) + λfˆb (x)

(c) Update the residuals:
ri ← ri − λfˆb (xi )

3. Output the boosted model:
$$\hat f(x) = \sum_{b=1}^{B}\lambda\hat f^b(x)$$

The tuning parameters are B, λ, d. Unlike bagging and random forests, a high B can actually lead to
overfitting for boosting. Cross-validation is used to select B. The shrinkage parameter λ controls the
rate at which boosting learns; typical values are small positive numbers such as 0.01 or 0.001. A very small λ
can require a large B. The number d of splits controls the complexity of the boosted ensemble. d is
the interaction depth, controlling the interaction order of the boosted model since d splits can involve
at most d variables. The choice d = 1 is referred to as a "stump", consisting of a single split, which is
equivalent to fitting an additive model. Smaller values of d have the benefit of being more interpretable.
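A sketch of boosted trees in scikit-learn, mapping the tuning parameters above onto n_estimators (B), learning_rate (λ), and max_depth (d). The data set is synthetic, and staged_predict is used to show how test error evolves as trees are added.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = GradientBoostingRegressor(n_estimators=2000,   # B
                                  learning_rate=0.01,  # lambda
                                  max_depth=2,         # d, the interaction depth
                                  random_state=0).fit(X_tr, y_tr)

# Trace test MSE as trees are added; too large a B eventually overfits
test_mse = [np.mean((y_te - pred) ** 2) for pred in boost.staged_predict(X_te)]
print(int(np.argmin(test_mse)) + 1, min(test_mse))   # best number of trees
```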
Bayesian additive regression trees (BART) is another ensemble method that uses decision trees as
building blocks. Each tree for BART is constructed in a random manner as in bagging and random
forests, and each tree tries to capture signal not yet accounted for as in boosting. Let K denote the
number of regression trees, B the number of iterations for which BART is run, and $\hat f_k^b(x)$ the
prediction at x for the kth regression tree used in the bth iteration. At the end of each iteration, we
take $\hat f^b(x) = \sum_{k=1}^{K}\hat f_k^b(x)$ for all b.
In the first iteration of BART, all trees are initialized to have a single root node, with
$\hat f_k^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$. Thus, $\hat f^1(x) = \sum_{k=1}^{K}\hat f_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$. In subsequent iterations, BART updates each
of the K trees one at a time. In the bth iteration, to update the kth tree, subtract from each response
value the predictions from all but the kth tree to obtain a partial residual
$$r_i = y_i - \sum_{k' < k}\hat f_{k'}^{b}(x_i) - \sum_{k' > k}\hat f_{k'}^{b-1}(x_i)$$

for the ith observation. Rather than fitting a fresh tree to this partial residual, BART randomly
chooses a perturbation to the tree from the previous iteration, favoring ones that improve the fit to
the partial residual.
1. We may change the structure of the tree by adding or pruning branches.
2. We may change the prediction in each terminal node of the tree.
The output of BART is a collection of prediction models,
$$\hat f^b(x) = \sum_{k=1}^{K}\hat f_k^b(x)$$
We typically throw away the first few of these prediction models, called the burn-in period. Let L
denote the number of burn-in iterations. Then, we take the average after the burn-in iterations,
$$\hat f(x) = \frac{1}{B - L}\sum_{b=L+1}^{B}\hat f^b(x)$$

However, other quantities such as percentiles of fˆL+1 (x), · · · , fˆB (x) can be taken. The BART method
can be viewed as a Bayesian approach to fitting an ensemble of trees: randomly perturbing a tree to
fit the residuals is drawing a new tree from a posterior distribution. Here is the complete algorithm
for BART:

1. Let $\hat f_1^1(x) = \hat f_2^1(x) = \cdots = \hat f_K^1(x) = \frac{1}{nK}\sum_{i=1}^{n} y_i$.
2. Compute $\hat f^1(x) = \sum_{k=1}^{K}\hat f_k^1(x) = \frac{1}{n}\sum_{i=1}^{n} y_i$.
3. For $b \in \{2, 3, \dots, B\}$:
   (a) For $k \in [K]$, $i \in [n]$, compute
   $$r_i = y_i - \sum_{k' < k}\hat f_{k'}^{b}(x_i) - \sum_{k' > k}\hat f_{k'}^{b-1}(x_i)$$
   (b) Fit a new tree $\hat f_k^b(x)$ to $r_i$ by randomly perturbing the kth tree from the previous iteration,
   $\hat f_k^{b-1}(x)$. Perturbations that improve the fit are preferred.
   (c) Compute $\hat f^b(x) = \sum_{k=1}^{K}\hat f_k^b(x)$.
4. Compute the mean after L burn-in samples,
$$\hat f(x) = \frac{1}{B - L}\sum_{b=L+1}^{B}\hat f^b(x)$$

Summary:
• In bagging, trees are grown from bootstrapped samples. The trees can be similar to each other
and fail to thoroughly explore the model space.
• In random forests, the trees are grown from bootstrapped samples and each split is performed
with a random subset of the features, decorrelating the trees relative to bagging.
• In boosting, there is no bootstrapping and the trees are grown successively with a slow learning
rate.
• In BART, there is no bootstrapping and the trees are grown successively but the trees are
perturbed for a more thorough exploration of the model space.

1.8 ISLR Chapter 12


Unsupervised learning is a set of statistical tools for the setting where only the features are avail-
able. The goal is to discover interesting things about the data and informative ways to visualize it.
Unsupervised learning is often done as part of an exploratory data analysis.

1.8.1 Principal Component Analysis


Principal component analysis (PCA) finds a low-dimensional representation of the data set that con-
tains as much of the variation as possible. We first center the data by subtracting the empirical mean.
Then, for a design matrix X, the first principal component is given by

$$\phi_{1\cdot} = \mathop{\mathrm{argmax}}_{\|\phi\| = 1}\|X\phi\|^2 = \mathop{\mathrm{argmax}}_{\|\phi\| = 1}\phi^T X^T X\phi$$
Thus, $\phi_{1\cdot}$ is an eigenvector of $X^T X$ corresponding to the highest eigenvalue. The kth principal
component can be found in this way:
$$\hat X_k = X - \sum_{i=1}^{k-1} X\phi_{i\cdot}\phi_{i\cdot}^T, \qquad \phi_{k\cdot} = \mathop{\mathrm{argmax}}_{\|\phi\| = 1}\|\hat X_k\phi\|^2$$

Indeed, ϕi· correspond to the ith eigenvectors of X T X and ϕTi· X T Xϕi· are eigenvalues. As normalized
eigenvectors, the vectors {ϕi· } form an orthonormal set. ϕi· is called the ith principal component
loading vector. In matrix form, we define
Z = XΦ
X is the original data set, Φ contains the loading vectors, and Z contains the principal component
scores.
Assuming that X is centered, the total variance of X is defined as
$$\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2$$

The variance explained by the mth principal component is
$$\frac{1}{n}\sum_{i=1}^{n} z_{im}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^2$$
So the proportion of variance explained (PVE) of the mth principal component is
$$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2} = \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}$$

In total, there are min{n − 1, p} principal components and their PVEs sum to one.
The first M principal component loading and score vectors can be interpreted as the best M -
dimensional approximation to the data in terms of residual sum of squares:
$$\widehat{\mathrm{Var}}(X) = \widehat{\mathrm{Var}}(Z) + \mathrm{MSE}(X - Z\Phi^T)$$
Moreover, the PVE is the R2 of the approximation of X given by the first M principal components.
The results for PCA depend on whether the variables are individually scaled. If the features are on
different units, then it may be desirable to first scale each variable to have standard deviation one
before PCA. However, if the variables are measured in the same units, then one should not scale the
variables.
To decide on the number of principal components to use, we can use a scree plot. A scree plot depicts
the proportion of variance explained by each principal component. Cumulative proportion of variance
plots are also used. Typically, the scree plot has a point where the proportion of variance explained
by each subsequent principal component drops off, referred to as the elbow of the scree plot. Visual
inspection of the scree plot allows us to decide the number of principal components to use. This visual
analysis is admittedly ad hoc, and there is no well-accepted objective way to decide the number of
principal components.
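A sketch of the PVE computation with scikit-learn's PCA; explained_variance_ratio_ gives exactly the quantities one would plot in a scree plot. The data are simulated, and the variables are standardized first since the simulated features are on different scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
latent = rng.normal(size=(200, 2))
X = np.column_stack([latent @ rng.normal(size=(2, 6)) + rng.normal(scale=0.3, size=(200, 6)),
                     rng.normal(size=(200, 2))])

Xs = StandardScaler().fit_transform(X)   # scale each variable to sd 1
pca = PCA().fit(Xs)

pve = pca.explained_variance_ratio_      # proportion of variance explained
print(np.round(pve, 3))                  # values for a scree plot
print(np.round(np.cumsum(pve), 3))       # cumulative PVE
scores = pca.transform(Xs)               # principal component scores Z
```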

1.8.2 Clustering
K-means clustering partitions the data into K distinct, non-overlapping clusters. The main idea of
the K-means clustering is that a good clustering is one with low within-cluster variation. We would
like to solve
$$\mathop{\mathrm{argmin}}_{C_1, C_2, \dots, C_K}\ \sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i, i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2$$

The K-means algorithm is as follows:

1. Randomly assign a number, from 1 to K to each of the observations.
2. Iterate until the cluster assignments do not change:
(a) Compute the cluster centroid for each K clusters.
(b) Assign each observation to the cluster whose centroid is closest.
This algorithm is guaranteed to decrease the value of the objective at each step. When the result no
longer changes, a local optimum has been reached. The K-means algorithm depends on the initial
cluster assignment of the observations so it is important to run the algorithm multiple times from
different initial configurations then select the best solution.
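A sketch of K-means with scikit-learn; n_init controls the number of random initial configurations, from which the solution with the lowest within-cluster variation is kept. The clustered data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(100, 2)) for c in (0, 4, 8)])

# Run the algorithm from 20 random initializations and keep the best solution
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)

print(km.inertia_)          # within-cluster sum of squares of the kept solution
print(km.cluster_centers_)  # cluster centroids
labels = km.labels_         # cluster assignment for each observation
```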
One disadvantage of K-means clustering is the requirement to decide beforehand the number of clusters
K. Hierarchical clustering is an alternative approach which does not require such a choice. Hierarchical
clustering results in a tree-based representation of observations, called a dendrogram.
A bottom-up or agglomerative clustering refers to a dendrogram that is built starting from the leaves
and combining up to the trunk. At the bottom of the dendrogram, each leaf represents one observation
and as one moves up the tree, leaves fuse into branches and the branches fuse with leaves and other
branches. The earlier the fusions occur, the more similar the group of observations are to each other.
Observations that fuse later can be quite different. More precisely, the height of the fusion indicates
how different two observations are. Note that we cannot draw conclusions about the similarity of two
observations based on their proximity along the horizontal axis.
To identify clusters on the basis of a dendrogram, we make a horizontal cut across the dendrogram.
The distinct sets of observations beneath the cut are the clusters. The height of the cut controls
the number of clusters obtained and plays the same role as K in K-means clustering. This type of
clustering is hierarchical in that clusters at a lower height are necessarily nested within the clusters at
any greater height. Sometimes this assumption of hierarchical structure might be unrealistic.
Hierarchical clustering relies on a dissimilarity measure between each pair of observations, typically
Euclidean distance. Observations with the smallest dissimilarity are fused first. The concept of dis-
similarity between a pair of observations needs to be extended to pair of groups of observations to
fuse clusters. Linkage defines the dissimilarity between two groups of observations. The most common
types of linkage are complete, average, single, and centroid.
The hierarchical clustering algorithm is as follows:
1. Begin with n observations and treat each observation as its own cluster.
2. For i = n, n − 1, · · · , 2
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair
of clusters that are least dissimilar. Fuse these two clusters. The dissimilarity between these
two clusters indicates the height in the dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.
Linkage Description
Complete Largest pairwise dissimilarity.
Single Smallest pairwise dissimilarity.
Average Average pairwise dissimilarity.
Centroid Dissimilarity between centroids of two clusters.
The most popular linkage are average, complete, and single linkage. A drawback of centroid linkage
is that an inversion can occur where two clusters are fused at a height below the individual clusters.
Average and complete linkage generally create more balanced dendrograms.

Other dissimilarity measures exist: for example, a correlation-based distance considers two observations
to be similar if their features are highly correlated. In addition to selecting the dissimilarity measure,
one should consider whether to scale the variables before the dissimilarity measure is computed.
Both K-means and hierarchical clustering will assign every observation to a cluster but sometimes there
may be outliers that do not belong in any cluster. Outliers may greatly distort the resulting clusters.
Mixture models are an approach that accommodates for outliers, resulting in a soft version of K-means
clustering. Clustering methods are generally not robust to perturbations of the data. To get a sense
of the robustness of the clusters, one can cluster subsets of the data. It is also important to remember
that the results of clustering depend on the decisions such as whether the data is standardized, type of
linkage, the height to cut the dendrogram, or the choice of K. Clustering results should not be taken
as absolute truth but rather the starting point of a hypothesis and further study on an independent
data set.

1.9 Exponential Families


Exponential families have densities of the form
f (y | θ) = exp[yb(θ) + c(θ) + d(y)]
The quantity b(θ) is called the natural parameter of the distribution. Parameters other than θ are
called nuisance parameters. Here is a table of exponential families
Binomial($n, p$): $\theta = p$, $b(\theta) = \log\frac{p}{1-p}$, $c(\theta) = n\log(1-p)$, $d(y) = \log\binom{n}{y}$
Negative binomial($r, \beta$): $\theta = \beta$, $b(\theta) = \log\frac{\beta}{1+\beta}$, $c(\theta) = -r\log(1+\beta)$, $d(y) = \log\binom{r+y-1}{r-1}$
Poisson($\lambda$): $\theta = \lambda$, $b(\theta) = \log\lambda$, $c(\theta) = -\lambda$, $d(y) = -\log y!$
Exponential($\theta$): $b(\theta) = -\frac{1}{\theta}$, $c(\theta) = -\log\theta$, $d(y) = 0$
Gamma($\alpha, \theta$): $b(\theta) = -\frac{1}{\theta}$, $c(\theta) = -\alpha\log\theta$, $d(y) = (\alpha-1)\log y - \log\Gamma(\alpha)$
Normal($\mu, \sigma^2$): $\theta = \mu$, $b(\theta) = \frac{\mu}{\sigma^2}$, $c(\theta) = -\frac{\mu^2}{2\sigma^2}$, $d(y) = -\frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$
Inverse Gaussian($\mu, \theta$): $b(\mu) = -\frac{\theta}{2\mu^2}$, $c(\mu) = \frac{\theta}{\mu}$, $d(y) = -\frac{\theta}{2y} + \frac{1}{2}\log\frac{\theta}{2\pi y^3}$

Tweedie distributions refer to a family of distributions where the variance is related to the mean
through a power function
var(Y ) = a[E(Y )]p
Various exponential family distributions are Tweedie distributions:
• Normal distribution, p = 0
• Poisson distribution, p = 1
• Compound Poisson-Gamma distribution, 1 < p < 2
• Gamma distribution, p = 2
• Inverse Gaussian distribution, p = 3
In a GLM, the inverse of the link function transforms the linear predictor back to the scale of the mean
of the response. The canonical link is the link function for which the linear predictor equals the natural
parameter, that is, g(µ) = b(θ).
Distribution Canonical Link Function
Normal Identity
Binomial Logit
Poisson Natural Logarithm
Gamma Inverse
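In R, these canonical links are the defaults of the corresponding family objects; the formulas and data frame below are placeholders for illustration.

fit_norm <- glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)
fit_bin  <- glm(cbind(successes, failures) ~ x1, family = binomial(link = "logit"), data = dat)
fit_pois <- glm(counts ~ x1, family = poisson(link = "log"), data = dat)
fit_gam  <- glm(y ~ x1, family = Gamma(link = "inverse"), data = dat)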

By differentiating ∫ f(y | θ) dy = 1 under the integral sign and rearranging, we get

E[Y] = −c′(θ)/b′(θ)

var[Y] = [b″(θ)c′(θ) − c″(θ)b′(θ)] / [b′(θ)]³

The score statistic takes the form

U = b′(θ)Y + c′(θ)

The score has mean zero and variance equal to the Fisher information,

I(θ) = b″(θ)c′(θ)/b′(θ) − c″(θ)
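As a check of these formulas, consider the Poisson row of the table above: b(θ) = log λ and c(θ) = −λ, so b′ = 1/λ, b″ = −1/λ², c′ = −1 and c″ = 0. Then E[Y] = −(−1)/(1/λ) = λ and var[Y] = [(−1/λ²)(−1) − 0]/(1/λ)³ = λ, recovering the familiar Poisson mean and variance.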

2 Generalized Linear Model


2.0.1 Estimation
GLMs are estimated through maximum likelihood estimation. Consider independent random variables
Y1 , · · · , YN in a GLM with E(Yi ) = µi , g(µi ) = xTi β. We wish to estimate the parameters β. For each
yi, the log-likelihood contribution is

li = yi b(θi) + c(θi) + d(yi)

Hence the score is

Uj = ∂l/∂βj = Σ_{i=1}^{N} (∂li/∂θi)(∂θi/∂µi)(∂µi/∂βj)

Consider each multiplicand separately. First,

∂li/∂θi = yi b′(θi) + c′(θi) = b′(θi)(yi − µi)

By differentiating µi = −c′(θi)/b′(θi) with respect to θi,

∂µi/∂θi = −c″(θi)/b′(θi) + c′(θi)b″(θi)/[b′(θi)]² = b′(θi) var(Yi)

and

∂µi/∂βj = (∂µi/∂ηi) xij
Therefore the score is

Uj = Σ_{i=1}^{N} [(yi − µi)/var(Yi)] xij (∂µi/∂ηi)

The information matrix is I = cov(U), which can be simplified to

Ijk = Σ_{i=1}^{N} [xij xik / var(Yi)] (∂µi/∂ηi)²

which can be rewritten as

I = X^T W X

where W is a diagonal matrix with

Wii = [1/var(Yi)] (∂µi/∂ηi)²

Fisher scoring generalizes to

b^(m) = b^(m−1) + [I^(m−1)]^{−1} U^(m−1)

where b^(m) is the mth estimate of β, and I^(m−1) and U^(m−1) are I and U evaluated at b^(m−1). The
iterative equation can be expressed as

X^T W X b^(m) = X^T W z

where z has entries

zi = Σ_{k=1}^{p} xik bk^(m−1) + (yi − µi)(∂ηi/∂µi)

with µi and ∂ηi/∂µi evaluated at b^(m−1). Note that this is the same form as the normal equations for
a weighted least squares problem. Hence, the maximum likelihood estimators in a GLM are obtained by an
iterative weighted least squares procedure.
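A minimal sketch of this iterative weighted least squares idea, written for a Poisson GLM with log link (so var(Yi) = µi and ∂µi/∂ηi = µi, giving Wii = µi); the function and variable names are illustrative, not part of any package.

irls_poisson <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (m in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)                  # inverse of the log link
    W   <- diag(mu)                  # W_ii = (dmu/deta)^2 / var(Y_i) = mu_i here
    z   <- eta + (y - mu) / mu       # working response
    beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}
# The result should agree with glm(y ~ X - 1, family = poisson()).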

2.0.2 Inference
A goodness of fit statistic is a summary statistic to describe how well the model fits the data. It
may be based on a maximum value of the log-likelihood or the minimum value of the sum of squares
criterion or a composite statistic based on the residuals.
1. Specify a model M0 corresponding to H0 . Specify a more general model M1 .
2. Fit M0 and calculate the goodness of fit statistic G0 . Fit M1 and calculate the goodness of fit
statistic G1 .
3. Calculate the improvement in fit, G1 − G0 or G1 /G0 .
4. Use the sampling distribution of G1 − G0 to test the null hypothesis G1 = G0 .
5. If the null hypothesis is not rejected, then M0 is the preferred model. If it is rejected, then M1
is regarded as the better model.
The asymptotic sampling distribution for score statistics is
U^T I^{−1} U → χ²_p (in distribution)

for a p-dimensional vector of parameters. We can take a Taylor expansion of the score statistic to get

U (β) = U (b) − I(b)(β − b)

Note that if b is the MLE, by definition U (b) = 0 so

(b − β) = I −1 (b)U

since I = E(U U T ) and I is symmetric,


(b − β)^T I(b)(b − β) → χ²_p (in distribution)

This is called the Wald statistic.


One way of assessing the adequacy of a model is to compare it with a more general model with the
maximum number of parameters that can be estimated, called a saturated model (also called a maximal
or full model). Define the deviance, or log-likelihood ratio statistic, as

D = 2[l(bmax | y) − l(b | y)]

The deviance can be decomposed as

D = 2[l(bmax | y) − l(βmax | y)] − 2[l(b | y) − l(β | y)] + 2[l(βmax | y) − l(β | y)]

If the saturated model has m parameters and the model of interest has p parameters, the first term has an
approximate χ²_m distribution, the second an approximate χ²_p distribution, and the third term, v =
2[l(βmax | y) − l(β | y)], is a positive constant that will be near zero if the model fits the data almost as
well as the saturated model. Thus, the sampling distribution of the deviance is approximately

D ∼ χ²(m − p, v)

where v is the non-centrality parameter.


Hypothesis tests on a p-dimensional parameter β can be done with a Wald statistic (β̂ − β)T I(β̂ − β) ∼
χ2p or the score statistic U T I −1 U ∼ χ2p . Alternatively, one can compare the goodness of fit. Consider
nested models where M0 assumes a q-dimensional parametric model and M1 is a more general p-
dimensional parametric model containing M0 . Consider the null hypothesis H0 : βq+1 = βq+2 = · · · =
βp = 0 with H1 : ∃βj ̸= 0, j > q. We can test H0 against H1 with the difference of the deviance
statistics
∆D = D0 − D1 = 2[l(b1 |y) − l(b0 |y)]
Provided certain independence conditions hold, ∆D ∼ χ2p−q .
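In R, this nested-model deviance test is conveniently carried out with anova on two fitted GLMs; the models below are placeholders.

m0 <- glm(counts ~ x1, family = poisson(), data = dat)
m1 <- glm(counts ~ x1 + x2 + x3, family = poisson(), data = dat)
anova(m0, m1, test = "Chisq")   # compares Delta D with a chi-squared on p - q df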

2.0.3 Normal Linear Model


The MLE is
b = (X T X)−1 X T y
provided X T X is not singular. The estimator is unbiased with a covariance matrix of σ 2 (X T X)−1 =
I −1 . σ 2 is a nuisance parameter though
σ̂² = [1/(N − p)] (y − Xb)^T (y − Xb)

is an unbiased estimator of σ 2 and can be used to estimate I and make inference about b. Note that
the MLE and least squares estimators are the same.
The deviance is given by the formula

D = (1/σ²)(y − Xb)^T (y − Xb)

so D is, in GLM terminology, the scaled deviance; the residual sum of squares

RSS = (y − Xb)^T (y − Xb) = σ²D

is the corresponding (unscaled) deviance. The deviance can be expanded to the form

D = (1/σ²)(y^T y − b^T X^T y)
To test H0 against H1 for nested normal linear models,

∆D = D0 − D1 = (1/σ²)(b1^T X1^T y − b0^T X0^T y)

F = [(D0 − D1)/(p − q)] ÷ [D1/(N − p)] = [(b1^T X1^T y − b0^T X0^T y)/(p − q)] ÷ [(y^T y − b1^T X1^T y)/(N − p)]

F ∼ F(p − q, N − p) under H0; otherwise F has a non-central distribution. Letting S0 = y^T y − b0^T X0^T y
and S1 = y^T y − b1^T X1^T y, this is

F = [(S0 − S1)/(p − q)] ÷ [S1/(N − p)]
Usually inferences about the parameter of one explanatory variable depend on which other explanatory
variables are included, but an exception occurs when the components of the design matrix are orthogonal.
Suppose that

X = [X1, · · · , Xm]

where Xj^T Xk = O, the zero matrix, for j ≠ k. Then X is said to be orthogonal, X^T X is a block
diagonal matrix, and hence

b^T X^T y = Σ_{k=1}^{m} bk^T Xk^T y

Consequently, the hypotheses

H01 : β1 = 0, · · · , H0m : βm = 0

can be tested independently. Except for some well-designed experiments, the design matrix is not
orthogonal. Tests in which all other terms are included before Xjβj is added are called Type III tests;
tests that depend on the sequential order of fitting terms are called Type I tests.
Residuals are defined as

êi = yi − xi^T b = yi − µ̂i

Their covariance matrix is

E[êê^T] = σ²[I − X(X^T X)^{−1} X^T] = σ²[I − H]

where H is the projection or hat matrix. The standardized residuals are

ri = êi / (σ̂ √(1 − Hii))

where σ̂² is an estimate of σ².
An outlier is an observation that is not well fitted by the model. An influential observation is one
which has a large effect on inferences based on the model. Hii, the ith diagonal element of the hat
matrix, is called the leverage of the ith observation. As a rule of thumb, if Hii is greater than 2p/N,
it may be a concern. Measures combining standardized residuals and leverage include

DFFITSi = ri √(Hii / (1 − Hii))

Di = (1/p) ri² [Hii / (1 − Hii)]

Di is called Cook's distance. Cook's distance may also be obtained by fitting the model with and without
each observation and measuring the difference this makes to the estimates b:

Di = [1/(p σ̂²)] (b − b(i))^T X^T X (b − b(i))
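These diagnostics are available directly for a fitted linear model in R; the model below is illustrative.

fit <- lm(y ~ x1 + x2, data = dat)
h <- hatvalues(fit)          # leverages H_ii
r <- rstandard(fit)          # standardized residuals
d <- cooks.distance(fit)     # Cook's distance
f <- dffits(fit)             # DFFITS
which(h > 2 * length(coef(fit)) / nrow(dat))   # rule-of-thumb leverage flag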
The basic model of multiple linear regression includes an intercept term which is represented by a
column of 1’s in the design matrix.
Define S, Ŝ, and Ŝ0 to be

S = e^T e = (Y − Xβ)^T (Y − Xβ)

Ŝ = (y − Xb)^T (y − Xb) = y^T y − b^T X^T y

Ŝ0 = y^T y − N ȳ²

Then the coefficient of determination R² is defined as

R² = (Ŝ0 − Ŝ)/Ŝ0 = (b^T X^T y − N ȳ²)/(y^T y − N ȳ²)

The square root of R2 is called the multiple correlation coefficient.


If some explanatory variables are highly correlated with one another, this is called collinearity or
multicollinearity. Multicollinearity implies that the condition number of the design matrix X is large
or that the estimating equation (X T X)b = X T y is ill-conditioned. Collinearity may be detected by
the variance inflation factor (VIF)
VIFj = 1 / (1 − R²_(j))

where R²_(j) is the coefficient of determination obtained from regressing the jth explanatory variable
on all of the other explanatory variables. If the jth variable is uncorrelated with the others then
VIFj = 1, and VIFj increases as the correlation increases. One should be concerned if VIFj > 5.
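A sketch of the VIF computation for one predictor, assuming a data frame dat with predictors x1, x2, x3; the car package's vif function automates this for every predictor.

r2_x1  <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
vif_x1 <- 1 / (1 - r2_x1)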

2.0.4 Analysis of Variance


Analysis of variance is used for comparing means of groups of continuous observations where the
groups are defined by the levels of factors. If experimental units are randomly allocated to groups
corresponding to J levels of a factor, this is a completely randomized experiment. Responses at the
same level have the same expected value and so are called replicates.
There are three different specifications of a model to test the hypothesis that the response means differ
among the factor levels.
1. E(Yjk ) = µj , j ∈ [J].
2. E(Yjk ) = µ + αj , j ∈ [J] with sum-to-zero constraint.
3. E(Yjk ) = µ + αj with α1 = 0. Corner point parameterization.
If there are two factors A and B that are crossed, then there are 4 models to consider.
1. Saturated model. E(Yjkl ) = µ + αj + βk + (αβ)jk
2. Additive model. E(Yjkl ) = µ + αj + βk
3. Only A. E(Yjkl ) = µ + αj
4. Only B. E(Yjkl ) = µ + βk
Because these models are overspecified, we must add constraints. We can impose sum-to-zero con-
straints or corner point constraints.
If the data is balanced, that is, there are equal number of observations in each subgroup, then it is
possible to specify the design matrix in such a way that it is orthogonal.
Analysis of covariance is used for models in which some of the explanatory variables are dummy vari-
ables representing factors and others are continuous measurements called covariates. We are interested
in comparing means of subgroups by factor levels after adjustment for covariate effects. The saturated
model is
E(Yjk ) = µj + γxjk

The reduced model is
E(Yjk ) = µ + γxjk
The term general linear model refers to Normal linear models with any combination of categorical and
continuous explanatory variables.
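A sketch of the corresponding comparison in R, assuming a factor column group and a covariate x.

full    <- lm(y ~ group + x, data = dat)   # separate means by factor level, common slope
reduced <- lm(y ~ x, data = dat)           # single mean
anova(reduced, full)                       # F test for differences among adjusted means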

2.0.5 Logistic Regression


Suppose that Y1 , Y2 , · · · , YN are independent random variables such that Yi ∼ Bin(ni , πi ). Suppose
that

logit πi = log[πi / (1 − πi)] = xi^T β
The estimation process is the same whether the data are grouped as frequencies for each covariate
pattern or each observation is coded 0 or 1 and its covariate pattern is listed separately. The deviance
is
D = 2 Σ_{i=1}^{N} [ yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) ]

This has the form

D = 2 Σ o log(o/e)
where o denotes observed successes and failures and e denotes fitted values. Since D does not involve
nuisance parameters, hypotheses can be directly tested with the approximation

D ∼ χ2N −p

where p is the number of parameters estimated and N the number of covariate patterns.
For small studies or situations where there are few observations for each covariate pattern, asymptotic
results are poor approximations. Software has been developed for exact methods.
Instead of MLE, we can estimate the parameters by minimizing the weighted sum of squares
Sw = Σ_{i=1}^{N} (yi − ni πi)² / [ni πi (1 − πi)]

This is equivalent to minimizing the Pearson chi-squared statistic


X² = Σ (o − e)²/e
When evaluated at the estimated expected frequencies, X² is asymptotically equivalent to the deviance D
(this can be shown using the delta method). By analogy with R², we have the

pseudo-R² = [l(π̃ | y) − l(π̂ | y)] / l(π̃ | y)

where π̃ = Σ yi / Σ ni is the fitted probability under the minimal model. The pseudo-R² represents the
proportional improvement in the log-likelihood due to the terms in the model of interest.

AIC = −2l(π̂|y) + 2p

BIC = −2l(π̂|y) + p log n

The Pearson or chi-squared residual is
Xk = (yk − nk π̂k) / √(nk π̂k (1 − π̂k))

The standardized Pearson residuals are


rPk = Xk / √(1 − hk)
where hk is the leverage, obtained from the hat matrix. Deviance residuals are defined as
dk = sgn(yk − nk π̂k) {2[ yk log(yk/(nk π̂k)) + (nk − yk) log((nk − yk)/(nk − nk π̂k)) ]}^{1/2}

Note that Σ_{k=1}^{n} dk² = D, the deviance. Standardized deviance residuals are defined by

rDk = dk / √(1 − hk)
Pearson and deviance residuals can be used for checking the adequacy of a model. They should
be plotted against each continuous explanatory variable to check the assumption of linearity and
against other explanatory variables not included in the model. They should be plotted in the order
of measurements to check for serial correlation. Normal probability plots can also be used for large
enough n since the standardized residuals should be approximately standard normal. If the data is
binary or if nk is small, then the plots may be uninformative and aggregated goodness of fit statistics
as well as other diagnostics should be used.
For binary data, there may be overdispersion which may be due to inadequate specification of the
model or more complex structure. One approach is to add an extra parameter ϕ so that var(Yi ) =
ni πi (1 − πi )ϕ, called the quasibinomial distribution.
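A sketch in R for grouped binomial data with possible overdispersion; column names are illustrative.

fit  <- glm(cbind(y, n - y) ~ x1 + x2, family = binomial, data = dat)
fitq <- glm(cbind(y, n - y) ~ x1 + x2, family = quasibinomial, data = dat)
summary(fitq)$dispersion   # estimate of phi; standard errors are scaled by sqrt(phi)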
Consider a random variable Y with J categories. Let π1 , π2 , · · · , πJ denote their respective probabili-
ties. If there are n independent observations of Y resulting in y1 , y2 , · · · , yJ outcomes in each category,
the multinomial distribution is
f(y | n) = [n! / (y1! y2! · · · yJ!)] π1^{y1} π2^{y2} · · · πJ^{yJ}

It can be shown that E(Yj ) = nπj , var(Yj ) = nπj (1 − πj ) and cov(Yj , Yk ) = −nπj πk . Nominal logistic
regression models are used when there is no natural order among the response categories. One category
is arbitrarily chosen as the reference category. Then,
 
logit(πj) = log(πj/π1) = xj^T βj

The (J − 1) logit equations are used simultaneously to estimate the parameters βj .

π̂j = π̂1 exp(xTj bj )

π̂1 = 1 / [1 + Σ_{j=2}^{J} exp(xj^T bj)]

π̂j = exp(xj^T bj) / [1 + Σ_{j=2}^{J} exp(xj^T bj)]

The Pearson chi-squared residuals are

ri = (oi − ei)/√ei

The chi-squared statistic is

X² = Σ_{i=1}^{N} ri²

Deviance is D = 2[l(bmax ) − l(b)] and likelihood ratio chi-squared statistic is

C = 2[l(b) − l(bmin )]

pseudo-R² = [l(bmin) − l(b)] / l(bmin)

AIC = −2l(π̂ | y) + 2p
The odds ratio for category j is

ORj = (πjp/πja) ÷ (π1p/π1a)

where πjp and πja denote the probabilities of response category j according to whether the exposure is
present or absent, respectively.
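Nominal logistic regression can be fitted in R with the nnet package's multinom function; the sketch below assumes a factor response category whose first level is the reference.

library(nnet)
fit <- multinom(category ~ x1 + x2, data = dat)
summary(fit)
exp(coef(fit))   # odds ratios relative to the reference category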
In some situations, there may be a continuous variable z that is difficult to measure and may be
assessed by cut points for the latent variable so that small values are classified as none, larger values
as moderate, and high values as severe. The cutpoints C1 , C2 , · · · , CJ define J ordinal categories with
associated probabilities.
The cumulative odds for the jth category are

P(z ≤ Cj) / P(z > Cj) = (π1 + π2 + · · · + πj) / (πj+1 + · · · + πJ)

The cumulative logit model is


log[(π1 + · · · + πj)/(πj+1 + · · · + πJ)] = xj^T βj
The proportional odds model is
log[(π1 + · · · + πj)/(πj+1 + · · · + πJ)] = β0j + β1 x1 + · · · + βp−1 xp−1

It is based on the assumption that the effects of the covariates x1 , · · · , xp−1 are the same for all
categories on the logarithmic scale. If some of the categories are amalgamated, this does not change
the parameter estimates β1 , · · · , βp−1 but the terms β0j will be affected. This is the collapsibility
property. The proportional odds model is also not affected by a reversal of labelling of the categories.
The proportional odds model is the usual form of ordinal logistic regression.
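The proportional odds model can be fitted in R with MASS::polr; the response must be an ordered factor, and the names below are illustrative.

library(MASS)
dat$severity <- ordered(dat$severity, levels = c("none", "moderate", "severe"))
fit <- polr(severity ~ x1 + x2, data = dat, Hess = TRUE)
summary(fit)   # common slopes beta_1, ..., beta_{p-1} and the J - 1 cutpoint intercepts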

2.0.6 Poisson Regression


Assume Yi ∼ Poisson(µi ) with
log µi = log ni + xTi β
This differs from the usual specification of the linear component due to the log ni term, called the
offset. It is a known constant.
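A sketch of a Poisson rate model in R, with the exposure entering as an offset; the column names are illustrative.

fit <- glm(claims ~ age_group + region + offset(log(exposure)),
           family = poisson(link = "log"), data = dat)
exp(coef(fit))   # estimated rate ratios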

The rate ratio RR for presence versus absence of a factor is

RR = E(Yi | present) / E(Yi | absent) = e^{βj}

The Pearson residuals are

ri = (oi − ei)/√ei

The chi-squared goodness of fit statistic is related by

X² = Σ_i ri² = Σ_i (oi − ei)²/ei

The deviance for a Poisson model is

D = 2 Σ_i [oi log(oi/ei) − (oi − ei)]

For most models Σ_i oi = Σ_i ei, so this simplifies to

D = 2 Σ_i oi log(oi/ei)

The deviance residuals are the components of D,

di = sign(oi − ei) √(2[oi log(oi/ei) − (oi − ei)])

so that D = Σ_i di².

2.0.7 Q-Q and Box Plots


A five-number summary of the data consists of the minimum, the first quartile, the median, the third
quartile, and the maximum.
The boxplot encloses the middle 50% of the data, with a line segment to indicate the median, and
extending lines to the minimum and maximum. Box and whisker plots also have the middle box but
extend to a lower fence and upper fence where

LF = Q1 − 1.5(Q3 − Q1 ), UF = Q3 + 1.5(Q3 − Q1 )

Points that lie outside the fences are called potential outliers. The most extreme observations that
still lie inside the fences, where the whiskers end, are called adjacent points.
Q-Q plots plot sample quantiles against quantiles from a theoretical CDF. If the plot is mostly on the
identity line y = x, then we see the two distributions as being similar. Otherwise, we suspect that the
two distributions are not the same.
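In R, assuming x is a numeric sample, these summaries and plots are available directly.

summary(x)             # five-number summary plus the mean
boxplot(x)             # box and whiskers; points beyond the fences are flagged
qqnorm(x); qqline(x)   # sample quantiles against normal quantiles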

3 Time Series
3.1 Basics and Box-Jenkins
Consider a time series model that is stationary in the mean and variance. The model is second-order
stationary if the correlation between variables only depends on the number of time steps separating
them. The number of time steps between the variables is called the lag. A correlation of a variable

with itself at different times is called autocorrelation or serial correlation. If a time series model is
second-order stationary, we define the autocovariance function (acvf), γk, as

γk = E[(xt − µ)(xt+k − µ)]

The autocorrelation function (acf), ρk, is

ρk = γk/σ²
The sample acvf, ck, is calculated as

ck = (1/n) Σ_{t=1}^{n−k} (xt − x̄)(xt+k − x̄)

The sample acf is defined as

rk = ck/c0

The correlogram is a plot of rk against k. If ρk = 0, the sampling distribution of rk is approximately
normal with mean −1/n and variance 1/n. The dotted lines on a correlogram are drawn at

−1/n ± 1.96/√n

If rk falls outside the lines, there is evidence against the null hypothesis ρk = 0, though we expect 5%
of the estimates rk to fall outside the lines even under the null hypothesis.
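In R, the correlogram of a series x is produced by acf, which draws the dashed bounds described above.

acf(x)                        # sample acf r_k against lag k
acf(x, type = "covariance")   # sample acvf c_k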
Suppose there are time series models for variables x and y that are stationary in the mean and variance.
Their combined model is second-order stationary if correlations depend only on the lag, and we may
define the cross-covariance function (ccvf) as

γk (x, y) = E[(xt+k − µx )(yt − µy )]

This is not a symmetric relationship, and x is lagging y by k.

γk (x, y) = γ−k (y, x)

The lag k cross-correlation function (ccf) is defined by

ρk(x, y) = γk(x, y) / (σx σy)

The sample ccvf is defined as

ck(x, y) = (1/n) Σ_{t=1}^{n−k} (xt+k − x̄)(yt − ȳ)

The sample ccf is defined as

rk(x, y) = ck(x, y) / √(c0(x, x) c0(y, y))
A time series {wt | t ∈ [n]} is discrete white noise (DWN) if variables w1 , w2 , · · · , wn are i.i.d. with a
mean of zero. If in addition, the variables are normal, the series is called Gaussian white noise.
Let {xt } be a time series. It is a random walk if

xt = xt−1 + wt

where {wt} is a white noise series. Back substitution gives

xt = Σ_{i=1}^{t} wi

A random walk satisfies µx = 0, γk (t) = cov(xt , xt+k ) = tσ 2 .


The backward shift operator is defined by

Bxt = xt−1

Also called the lag operator. The difference operator ∇ is defined by

∇xt = xt − xt−1

Differencing can be a useful filtering procedure for non-stationary time series. The first-order differences
of a random walk is a white noise series so the correlogram of the series of differences can be used to
assess whether a given series is a random walk.
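For example, in R:

acf(diff(x))   # if x is a random walk, the differenced series should resemble white noise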
A random walk model with a drift parameter δ is

xt = xt−1 + δ + wt

The series {xt } is an autoregressive process of order p, abbreviated AR(p) if

xt = α1 xt−1 + α2 xt−2 + · · · + αp xt−p + wt

where {wt } is white noise, αi are model parameters with αp ̸= 0 for an order p process. It can also be
expressed as
θp (B)xt = (1 − α1 B − α2 B 2 − · · · − αp B p )xt = wt
The equation θp (B) = 0 is called the characteristic equation. The roots of the characteristic equation
must all exceed unity in absolute value for the process to be stationary. To derive the second-order
properties for AR(1), note that

(1 − αB)xt = wt  =⇒  xt = (1 − αB)^{-1} wt = Σ_{i=0}^{∞} αⁱ w_{t−i}

Hence E(xt) = Σ_{i=0}^{∞} αⁱ E(w_{t−i}) = 0 and

γk = cov(xt, xt+k) = αᵏ σ² / (1 − α²)

The autocorrelation function follows as

ρk = αᵏ
From this equation we see that autocorrelations are nonzero for all lags even though the underlying
model xt only depends on the previous value xt−1 . The partial autocorrelation at lag k is the correlation
that results after removing the effect of any correlations due to the terms at shorter lags.
An AR(p) model can be fitted in R using the ar function. The method is generally fit with MLE and
the order p of the process is chosen with AIC.
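For example, in R:

fit <- ar(x, method = "mle")   # order chosen by AIC by default
fit$order                      # selected p
fit$ar                         # estimated alpha coefficients
pacf(x)                        # partial autocorrelations, useful for suggesting p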
A time series model {xt } is linear if it can be expressed as

xt = α0 + α1 u1,t + · · · + αm um,t + zt

where ui,t is the value of the ith predictor at time t, zt is the error at time t, and α0 , α1 , · · · , αm are
model parameters, estimated by least squares.
Linear models for time series are non-stationary when they include functions of time. Differencing
can often transform a non-stationary series with a deterministic trend to a stationary series. If the
underlying trend is a polynomial of order m, then the mth order differencing is required to remove the
trend.
If {xt } is a stationary time series with Ext = µ, var(xt ) = σ 2 , cor(xt , xt+k ) = ρk , then the variance of
the sample mean is

var(x̄) = (σ²/n) [1 + 2 Σ_{k=1}^{n−1} (1 − k/n) ρk]

Thus if ρk > 0, then the variance of the sample mean is greater than the independent case and if
ρk < 0, the variance of the sample mean is smaller than the independent case. Since in time series
regression, the residual series will be autocorrelated, we should use generalized least squares (GLS)
instead for better estimates of the standard errors of the regression parameters.
Suppose a time series contains s seasons. A seasonal indicator model for a time series {xt } containing
s seasons and a trend mt is given by
xt = mt + st + zt
where st = βi when t is in the ith season and {zt } is the residual error series, which may be autocor-
related. This may be also written as

xt = mt + β1+(t−1) mod s + zt

By treating the seasonal term st as a factor, the parameters for the model can be estimated by GLS.
If seasonal effects vary smoothly over the seasons, it may be more parameter efficient to use a smooth
function instead of separate indices. The harmonic seasonal model is
xt = mt + Σ_{i=1}^{⌊s/2⌋} [si sin(2πit/s) + ci cos(2πit/s)] + zt

where mt is the trend with a constant term, si and ci are unknown parameters.
The logarithm can be used to transform a multiplicative model into an additive model. If a time series
is given by
xt = m′t s′t zt′
then applying the logarithm gives

yt = log xt = log m′t + log s′t + log zt′ = mt + st + zt

The process of using a transformation, fitting a model, then applying an inverse transformation in-
troduces a bias in the forecasts. For the logarithm, an empirical correction factor may be applied for
forecasting.
x̂′t = e^{ŷt} · (1/n) Σ_{t=1}^{n} e^{zt}

where ŷt is the forecast of yt = log xt and the zt are the residuals from the model fitted on the log scale.
A time series model {xt } is said to be strictly stationary if the joint distribution of {xt1 , · · · , xtn } is
the same as the joint distribution of {xt1 +m , · · · , xtn +m } for all t1 , t2 , · · · , tn and m.
Regression can allow us to decompose a non-stationary series to a trend, seasonal components, and
residual series. The residual series can be treated as a realization of a stationary error series.

A moving average (MA) process of order q is a linear combination of current white noise and q most
recent past white noise terms.

xt = wt + β1 w_{t−1} + · · · + βq w_{t−q}

where {wt} is white noise with variance σw². Equivalently,

xt = (1 + β1 B + β2 B² + · · · + βq B^q) wt = ϕq(B) wt

where ϕq is a polynomial of order q. The mean is clearly zero and the variance is σw²(1 + β1² + · · · + βq²).
The autocorrelation function is (taking β0 = 1)

ρk = [Σ_{i=0}^{q−k} βi β_{i+k}] / [Σ_{i=0}^{q} βi²],  k ∈ [q]
and ρ0 = 1, ρk = 0 for k > q. An MA process is invertible if the white noise term can be recovered from
the observations; for example, the MA(1) process xt = (1 + βB)wt can be inverted as

wt = (1 + βB)^{-1} xt = xt − β x_{t−1} + β² x_{t−2} − · · ·

provided |β| < 1. In general, an MA(q) process is invertible when the roots of ϕq(B) all exceed unity in
absolute value.
When AR and MA terms are added together, it is an ARMA process. An ARMA(p, q) is

xt = α1 xt−1 + α2 xt−2 + · · · + αp xt−p + wt + β1 wt−1 + β2 wt−2 + · · · + βq wt−q

where {wt } is white noise.


θp (B)xt = ϕq (B)wt
Some properties of an ARMA(p, q) process:
1. The process is stationary when all the roots of θ exceed unity in absolute value.
2. The process is invertible when all roots of ϕ exceed unity in absolute value.
3. The AR(p) model is the special case ARMA(p, 0)
4. The MA(q) model is the special case ARMA(0, q).
5. When fitting data, an ARMA model will often be more parameter efficient than a single MA or
AR model.
6. When θ and ϕ share a common factor, a stationary model can be simplified.
A series {xt } is integrated of order d, denoted I(d), if ∇d xt = wt . Since ∇ = 1 − B, {xt } is integrated
of order d iff
(1 − B)d xt = wt
A time series {xt } is an ARIMA(p, d, q) process if the dth differences of {xt } series are an ARMA(p, q)
process.
θp (B)(1 − B)d xt = ϕq (B)wt
A seasonal ARIMA model uses differencing at a lag equal to the number of seasons s to remove
additional seasonal effects. The seasonal ARIMA(p, d, q)(P, D, Q)s model is expressed as

Θ_P(B^s) θ_p(B) (1 − B^s)^D (1 − B)^d xt = Φ_Q(B^s) ϕ_q(B) wt
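In R, ARIMA and seasonal ARIMA models can be fitted with arima; the orders and monthly period below are illustrative.

fit1 <- arima(x, order = c(1, 1, 1))
fit2 <- arima(x, order = c(1, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 12))
AIC(fit1, fit2)             # compare candidate models
predict(fit2, n.ahead = 12)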

3.2 Smoothing
The basic running average estimate is defined by
ŝt = (1/k) Σ_{i=1}^{k} y_{t−i+1}

where k is the running average length. The choice of k depends on the smoothing desired: the larger
the k, the smoother ŝt is. To forecast the series,

ŝt = ŝ_{t−1} + (yt − y_{t−k})/k
If there are no trends in the data, then the second term is zero. For a series that can be expressed as

yt = β0 + β1 t + ϵt ,

we can use a double smoothing procedure:


1. Create a smoothed series ŝt^(1) = (1/k) Σ_{i=1}^{k} y_{t−i+1}.
2. Create a doubly smoothed series ŝt^(2) = (1/k) Σ_{i=1}^{k} ŝ^(1)_{t−i+1}.
This procedure smooths out the effect of a linear trend in time. The estimate of the trend is

b_{1,T} = 2 (ŝ_T^(1) − ŝ_T^(2)) / (k − 1)

The resulting forecasts are

ŷ_{T+l} = ŝ_T + b_{1,T} l
Running averages can be expressed as weighted least squares estimates. By taking a weight of one for
t = T − k + 1, · · · , T , the weighted least squares estimator returns the running average estimate. This
model is called a locally constant mean model.
Exponential smoothing takes the form

ŝt = (1 − w) Σ_{i=0}^{∞} wⁱ y_{t−i}

Because observations are not available in the infinite past, we take the truncated version

ŝt = (1 − w) Σ_{i=0}^{t} wⁱ y_{t−i}

This is the exponentially smoothed estimate of the series. Like running average estimates, the smoothed
estimates give greater weight to more recent observations. Exponential smoothing has the recursive
equation

ŝt = ŝ_{t−1} + (1 − w)(yt − ŝ_{t−1}) = (1 − w)yt + w ŝ_{t−1}
To forecast, we should use ŷ_{T+l} = ŝ_T. To decide w, we minimize the sum of squared one-step prediction
errors

SS(w) = Σ_{t=1}^{T} (yt − ŝ_{t−1})²

For a linear trend in time, we can use a double exponential smoothing:
1. Take ŝt^(1) = (1 − w)yt + w ŝ^(1)_{t−1}.
2. Take ŝt^(2) = (1 − w)ŝt^(1) + w ŝ^(2)_{t−1}.
The estimate of the trend is b_{1,T} = [(1 − w)/w](ŝ_T^(1) − ŝ_T^(2)). The forecasts are given by
ŷ_{T+l} = b_{0,T} + b_{1,T} l, where the intercept is b_{0,T} = 2ŝ_T^(1) − ŝ_T^(2).
Similar to running average smoothing, exponentially smoothed estimates are WLS estimates with weights
given by wt = w^{T−t}. Exponential smoothing estimates are also called discounted least squares estimates.
Holt introduced the following generalization of the double exponential smoothing method for seasonal
data. Let w1 and w2 be smoothing parameters and calculate recursively the estimates:

b0,t = (1 − w1 )yt + w1 (b0,t−1 + b1,t−1 )

b1,t = (1 − w2 )(b0,t − b0,t−1 ) + w2 b1,t−1


These estimates forecast the linear trend model, yt = β0 + β1 t + ϵt . This is a generalization since
the same weight need not be used for both β0 and β1 . Winters extended the Holt procedure to
accommodate seasonal trends. The Holt-Winters seasonal additive model is

yt = β0 + β1 t + St + ϵt

where St = S_{t−g} and Σ_{i=1}^{g} Si = 0. We employ three smoothing parameters. The parameter estimates for
the model are determined recursively by

b0,t = (1 − w1 )(yt − Ŝt−g ) + w1 (b0,t−1 + b1,t−1 )

b1,t = (1 − w2 )(b0,t − b0,t−1 ) + w2 b1,t−1


Ŝt = (1 − w3 )(yt − b0,t ) + w3 Ŝt−g
Forecasts are determined using

ŷ_{T+l} = b_{0,T} + b_{1,T} l + Ŝ_T(l)

where Ŝ_T(l) = Ŝ_{T+l1−g} with l ≡ l1 (mod g) and 1 ≤ l1 ≤ g, so the most recent estimate of the relevant
seasonal factor is used. To compute the recursive estimates, we must decide on initial starting values
and a choice of smoothing parameters. To determine the initial starting values, it is recommended to fit
a regression equation to the first portion of the data. The regression equation includes a linear trend
in time and g − 1 binary variables for seasonal variation. Only g + 1 observations are required to
determine the initial estimates b_{0,0}, b_{1,0}, Ŝ_{1−g}, Ŝ_{2−g}, · · · , Ŝ_0. On the choice of smoothing
parameters, analysts rely on rules of thumb; Cryer and Miller recommend w1 = w2 = 0.9, w3 = 0.6.
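In R, HoltWinters carries out this procedure, estimating the smoothing parameters by minimizing the squared one-step prediction errors unless they are supplied; x is assumed to be a ts object with frequency g.

hw <- HoltWinters(x)
hw$alpha; hw$beta; hw$gamma   # level, trend, and seasonal smoothing weights
predict(hw, n.ahead = 12)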

3.3 Unit Root Test


Consider the model
yt = µ0 + ϕ(yt−1 − µ0 ) + µ1 (ϕ + (1 − ϕ)t) + ϵt
When ϕ = 1, this is a random walk model with yt = µ1 + yt−1 + ϵt . When ϕ < 1 and µ1 = 0, this is an
AR(1) model. When ϕ = 0, this is a linear trend in time model. It is customary to use least squares
on the model
yt − yt−1 = β0 + (ϕ − 1)yt−1 + β1 t + ϵt
where β0 = µ0 (1 − ϕ) + ϕµ1 and β1 = µ1 (1 − ϕ). From this regression, let tDF denote the t-statistic
associated with the yt−1 variable. We wish to use the t-statistic to test the null hypothesis H0 : ϕ = 1
versus H1 : ϕ < 1. Because {yt−1 } is a random walk under the null-hypothesis, the distribution of tDF

is not the usual t-distribution. This test is called the Dickey-Fuller test. A criticism of the Dickey-Fuller
test is that the disturbance term is assumed to be serially uncorrelated. The augmented Dickey-Fuller
test statistic is the t-statistic associated with the yt−1 variable using OLS on the equation
yt − y_{t−1} = β0 + (ϕ − 1)y_{t−1} + β1 t + Σ_{j=1}^{p} ϕj (y_{t−j} − y_{t−j−1}) + ϵt

In this equation, the disturbance term is augmented by the autoregressive terms in the differences
yt−j −yt−j−1 . The idea is that these terms capture serial correlation in the disturbance term. Consensus
is not reached on choosing the number of lags p. Analysts provide results of the test statistic for a
number of choices of lags and hope that conclusions are similar.
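In R, the augmented Dickey-Fuller test is available in the tseries package; the number of lags below is illustrative.

library(tseries)
adf.test(y, k = 4)   # k is the number of lagged difference terms p
adf.test(y)          # default k grows with the sample size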

3.4 ARCH/GARCH models


Let Ωt denote the information set, the collection of knowledge about the process up to and including
time t. We allow the variance to depend on time t by conditioning on the past,

σt2 = Var(ϵt | Ωt−1 )

We first present the autoregressive conditional heteroscedasticity model of order q, ARCH(q). Assume
ϵt given Ωt−1 is normally distributed with mean zero and variance σt2 . Further assume that the
conditional variance is determined recursively by

σt2 = w + γ1 ϵ2t−1 + · · · + γq ϵ2t−q = w + γ(B)ϵ2t

The term w > 0 is the long-run volatility parameter and γ1, γ2, · · · , γq are coefficients such that γj ≥ 0
and γ(1) = Σ_{j=1}^{q} γj < 1. If q = 1, a large change in ϵ²_{t−1} can induce a large conditional variance σt².
Higher orders of q capture longer-term effects. Despite having a changing conditional variance, the
unconditional variance remains constant over time.
Next, we discuss the generalized ARCH model of order p, GARCH(p, q). As with the ARCH model,
we assume the distribution of ϵt given Ωt−1 is normally distributed with mean zero and variance σt2 .
It is determined recursively by

σt2 − δ1 σt−1
2 2
− · · · − δp σt−p = w + γ1 ϵ2t−1 + · · · + γq ϵ2t−q

or σt2 = w + γ(B)ϵ2t + δ(B)σt2 . In addition to ARCH(q) requirements, we also need δj ≥ 0 and


γ(1) + δ(1) < 1. The GARCH(p, q) model is also weakly stationary, with mean zero and variance
w
Varϵt = 1−γ(1)−δ(1) .
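A GARCH model can be fitted in R with the tseries package; the sketch below fits a GARCH(1, 1) to a series of residuals or returns named eps.

library(tseries)
fit <- garch(eps, order = c(1, 1))   # order = c(p, q): GARCH part first, then ARCH part
summary(fit)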
