
Lecture notes for

MA317-6-AU

Dr. Danilo Petti


Department of Mathematical Sciences
University of Essex

2023-2024

Declaration
These lecture notes do not represent original content; their content is extracted from the references
cited in the appropriate section. Furthermore, this material is intended to provide a comprehensive
overview of the topics covered in the course and is the result of research and analysis. While
they may not cover every detail of the course material, they are designed to give students a solid
understanding of the key concepts and principles. It should be noted that the lecture notes are
not a substitute for attending lectures and actively participating in class discussions, but rather a
complement to them. The notes serve as a tool for students to review and reinforce their under-
standing of the material presented in class, and the exercises assigned on a weekly basis are part
of the lecture material and should be attempted by students to fully grasp the concepts covered.
Students are encouraged to consult the references provided and conduct further research to deepen
their knowledge on the topics covered in the course. The instructor and teaching assistants are
available to provide guidance and support throughout the course. Finally, it is important to adhere
to academic integrity principles and properly cite any sources used in research and assignments.

Module assessment
Lab test: 30%; final exam: 70%.

References
• Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to linear regression
analysis (6th ed.). Wiley.

– Note for students: Sections 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13.

• Verbeek, M. (2017). A guide to modern econometrics (5th ed.). Wiley.

– Note for students: Sections 1, 2, 3, 6, Appendix A & B.

• Faraway, J. J. (2015). Linear Models with R (2nd ed.). CRC Press.

– Note for students: Covers the material discussed in the labs.

• Faraway, J. J. (2016). Extending the linear model with R: generalized linear, mixed effects
and nonparametric regression models (2nd ed.). CRC Press.

• Data used in the Labs are mainly extracted from: Faraway, J. (2022). faraway: Functions and Datasets for Books by Julian Faraway. R package version 1.0.8, <[Link]/package=faraway>.

Syllabus
Simple linear regression

• Explain what is meant by response and explanatory variables.

• Explain what is meant by a variable, a factor taking categorical values and an interaction
term.

• Use simple linear regression to describe the stages of conducting a data analysis to solve
real-world problems in a scientific manner and describe tools suitable for each stage.

• Use simple linear regression to describe sources of data and explain the characteristics of
different sources, including extremely large data sets.

• Explain the meaning and value of reproducible research and describe the elements required
to ensure a data analysis is reproducible.

• Maximum likelihood estimation.

• Link between maximum likelihood and least squares. OLS for linear regression.

• Use a fitted linear relationship for prediction. Define the linear predictor.

• Confidence intervals for parameters and prediction intervals for future observations.

• ANOVA

Multiple linear regression

• Multiple regression. Subdividing the regression sum of squares. Lack of fit and pure error.

• Regression diagnostics, violation of the assumptions, leverage points, outliers and collinearity, and methods to deal with them.

• Apply statistical tests to determine the acceptability of a fitted model: Testing one linear
restriction, joint test of significance of regression, general case.

• Principal Component Regression

• Principal Component Analysis

• Model selection via stepwise methods. Cp, AIC, BIC

• Ridge regression

• Lasso Regression

Logistic regression

• Define the deviance and scaled deviance and state how the parameters of a logistic model
may be estimated.

• Define the Pearson and deviance residuals and describe how they may be used.

• Use R to implement the methods above on a data set and interpret the output.
Chapter 1

The Method of Maximum Likelihood

Suppose that the random variables X1 , . . . , Xn have a joint density or frequency function f (x1 , . . . , xn | θ). Given the observed sample (x1 , . . . , xn ), the likelihood function of θ is defined as

L(θ | x) = f (x1 , . . . , xn | θ),   θ ∈ Θ,    (1.1)

where Θ denotes the parameter space. Note that the function just presented is not the joint density: it is no longer a function of the random variables X, because we have observed a sample and we are evaluating (1.1) at (x1 , . . . , xn ). The maximum likelihood estimate (MLE) of θ is the value of θ that maximizes the likelihood function, i.e. the value that makes the observed data most likely to have been observed.
If the Xi are assumed to be i.i.d., then the joint density can be written as the product of the
marginal densities, and the likelihood function is
L(θ | x) = ∏_{i=1}^n f_{Xi}(xi | θ).

Rather than maximizing the likelihood, it is usually easier (and equivalent) to maximize its
natural logarithm. The log-likelihood function is
ℓ(θ) = Σ_{i=1}^n log f_{Xi}(xi | θ).

Let us find the MLE for the Gaussian distribution

Example 1. (Gaussian Distribution)


If X1 , . . . , Xn are i.i.d. N (µ, σ 2 ), their joint density is the product of their marginal densities:
f (x1 , . . . , xn | µ, σ²) = ∏_{i=1}^n (2πσ²)^{−1/2} exp( −(xi − µ)² / (2σ²) ),


the log-likelihood is
ℓ(µ, σ²) = −n log σ − (n/2) log 2π − (1/(2σ²)) Σ_{i=1}^n (xi − µ)²,

the partial derivatives with respect to µ and σ are

∂ℓ(µ, σ²)/∂µ = (1/σ²) Σ_{i=1}^n (xi − µ),
∂ℓ(µ, σ²)/∂σ = −n/σ + σ^{−3} Σ_{i=1}^n (xi − µ)²,

Setting the two partial derivatives equal to zero and solving for µ and σ, we obtain the MLEs

µ̂ = X̄,
σ̂ = √( (1/n) Σ_{i=1}^n (xi − X̄)² ),

where X̄ = (1/n) Σ_{i=1}^n xi.
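As a quick numerical check of these closed-form expressions, the following R sketch (on simulated data, so all numbers and object names are purely illustrative) compares the analytic MLEs with a direct numerical maximisation of the log-likelihood via optim().

set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)          # simulated sample, illustrative only

# Closed-form MLEs derived above
mu_hat    <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))    # divides by n, not n - 1

# Negative log-likelihood of an i.i.d. N(mu, sigma^2) sample
negloglik <- function(par) {
  mu <- par[1]; sigma <- par[2]
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Numerical maximisation of the likelihood (sigma kept positive)
fit <- optim(c(0, 1), negloglik, method = "L-BFGS-B", lower = c(-Inf, 1e-6))
fit$par                                    # numerical MLEs
c(mu_hat, sigma_hat)                       # analytic MLEs, essentially identical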

Example 2. If X follows a Poisson distribution with parameter λ, then

P(X = x) = λ^x e^{−λ} / x!.

If X1 , . . . , Xn are i.i.d. and Poisson, their joint distribution function is the product of the marginal
distributions. The log-likelihood is

ℓ(λ) = Σ_{i=1}^n (xi log λ − λ − log xi!)
     = log λ Σ_{i=1}^n xi − nλ − Σ_{i=1}^n log xi!.

Setting the first derivative equal to zero and solving for λ, we end up with
∂ℓ(λ)/∂λ |_{λ=λ̂} = λ̂^{−1} Σ_{i=1}^n xi − n = 0,

the MLE is
λ̂ = X̄.
A very useful quantity in the context of maximum likelihood estimation is the Fisher information matrix, whose jk-th entry (1 ≤ j, k ≤ d) is defined as

I_{jk}(θ) = −E[ ∂² ℓ(θ) / (∂θj ∂θk) ].

This can be thought of as a measure of how much information the sample carries about θ when it is the true parameter value: the larger the information, the easier θ is to estimate. The Cramér-Rao lower bound states that if θ̂ is an unbiased estimator of θ, then under regularity conditions

Var(θ̂) − I(θ)^{−1}

is a positive semi-definite matrix. In other words, no unbiased estimator can have a variance smaller than the inverse of the Fisher information; the MLE asymptotically achieves this lowest possible variance.
The main asymptotic properties of maximum likelihood estimators are

• Under regularity conditions, the MLE θ̂ is a strongly consistent estimator of θ:

P( lim_{n→∞} θ̂ = θ ) = 1;

it follows that the MLE is asymptotically unbiased, since

E(θ̂ − θ) → 0 as n → ∞.

• The maximum likelihood estimator is asymptotically efficient, that is,

lim_{n→∞} Var(θ̂) = I(θ)^{−1};

• The MLE is asymptotically normally distributed and efficient:

√n (θ̂ − θ) ∼ N(0, I(θ)^{−1}),

where I(θ) is the Fisher information matrix.

By regularity conditions we mean the following:


• The likelihood function is continuous in θ;

• The density/mass function is such that, for all x where f (x; θ) > 0, the derivative

∂ log f (x; θ) / ∂θ

exists and is finite;

• The order of differentiation with respect to θ and integration over the sample space may be reversed. In practice, this regularity condition is satisfied when the support of f (x; θ) does not depend on θ.

Theorem 1. Let θ̂ be the MLE for the parameter θ and let ψ(·) be a one-to-one function; then ψ(θ̂) is the MLE for ψ(θ).

Exercises

1. Let x1 , x2 , . . . , xn be a simple random sample drawn from a random variable X with density
function:

f (x; λ) ∝ x² e^{−x/λ} / (2λ³),

for x > 0 and λ > 0.

The candidate should:

(a) Define the likelihood function and the log-likelihood function;

(b) Calculate the maximum likelihood estimate λ̂ of λ.

2. Suppose we have a single observation X = x from an exponential distribution with parameter λ and pdf

f (x | λ) = λ e^{−λx},  x > 0.

Find the maximum likelihood estimate (MLE) λ̂ of λ. Now reparametrize using θ = 1/λ and find the MLE θ̂ of θ. Comment on the relationship between λ̂ and θ̂. Show that θ̂ is unbiased for θ and find its variance. Is λ̂ unbiased for λ?

Useful integrals: for n ≥ 1,

∫_0^∞ y^{n−1} e^{−y} dy = Γ(n) = (n − 1)!    (1)

and

∫_0^∞ y^{−1} e^{−y} dy = ∞.    (2)

3. With θ as the parameter, so that

f (x | θ) = (1/θ) e^{−x/θ},  x > 0,

suppose we now have n independent observations x1 , x2 , . . . , xn. Find the MLE of θ, show it is unbiased and find its variance.

4. The truncated exponential distribution has pdf

f (x | θ, α) = (1/θ) e^{−(x−α)/θ},  x > α.

Sketch the pdf and check that it satisfies the two conditions for being a pdf. For known α
and n independent observations x1 , x2 , . . . , xn , what is the MLE of θ? Now suppose both α
and θ need to be estimated. Find their MLEs.
Hint: You need to be careful with α; think before you differentiate.
Chapter 2

Simple Linear Regression

The most operative part of statistics concerns model building. The main goals of an analyst are to describe, understand, predict, simulate and control a phenomenon. For these purposes, it is crucial to understand the logical and formal structure of a regression model, whose final goal is to define an explicit relationship between what is to be explained, Y, and its cause X.
A statistical model is, by definition, a simplified representation of reality, derived from sample observations and logical deductions. The construction of a statistical model is characterized by three main phases: the specification, estimation and validation phases.
i) Model specification. In this phase our aim is to specify a functional form between the variables of interest, in the following way

Y = f (X1 , . . . , Xp ),

where Y is the dependent variable (outcome variable, response variable) and X1 , . . . , Xp are the
independent variables (predictor variables, explanatory variables). We are going to use X’s to
explain Y through f (·).
The relationship Y = f (X1 , . . . , Xp ) can usually be derived from two main factors i) prior
knowledge of the phenomenon ii) experimental results. In the specification phase, the variables of
interest and their role must be identified, because this helps to formulate the functional link f (·)
correctly.
Any statistical model that we are going to encounter in this module represents an approximation of reality. As there is no scientific field in which it is legitimate to hypothesize a deterministic relationship among random variables, we need another component in our specification, one that takes the uncertainty into account. Therefore

Y = f (X1 , . . . , Xp ) + ε,

where ε is a random variable (error term) that represents our real or supposed ignorance about


the relationship among Y and X1 , . . . , Xp . In real-world applications, the specification of the functional form f (·) often comes directly from the nature of the problem or the underlying theory (examples: macro- and micro-economics, medicine, biology, chemistry).

• If Y is the weight and X is the height, a reasonable functional form is Y = βX + ε, where β ∈ R;

• If Y is the weight of a tile and X1 and X2 represent its length and width, knowing that the weight of a tile is proportional to its area, a functional form is Y = βX1X2 + ε;

• Micro-economic theory teaches us that the output Y depends on the capital K and the labour force L through the Cobb–Douglas production function Y = β0 K^{β1} L^{β2} + ε, where β0 , β1 , β2 ∈ R.

In each of the proposed specifications, one or more parameters β need to be estimated in order to use these models. Even if these parameters assume specific meanings (e.g., constants of proportionality, basis weight, elasticity of demand, rate of change), this is not necessary for the purpose of the specification phase.
A statistical model can be classified using different criteria: according to the data employed, the estimation method, the nature of the random variables, the functional form used, and so on.

• Simple: when only one variable Y is related to a single variable X; multiple: when the variables used to explain Y are more than one;

• Linear: when Y is expressed using a linear combination of the variables (X1 , . . . , Xp ) and parameters β1 , . . . , βp; linearizable: when originally non-linear models can be linearized through a transformation (e.g., applying log(·) to the Cobb–Douglas production function in the previous example).

For the sake of completeness, it is fair to say that there are other classifications that are out of the scope of this course (e.g., spatial models, panel models, time series models, non-linear models).
Assuming that X is the daily precipitation in a reservoir and Y is the water level of the river, it is natural to assume that there is a relationship of the type X → Y, and not vice versa. If X is the speed of a vehicle and Y the braking distance, it makes sense to assume only a one-way relationship X → Y.
If Y is a household's consumption and X is total income, economic theory proposes various formulations for specifying a functional form Y = f (X) + ε. However, for the purposes of a tax investigation, studying a relationship Y → X and a model X = f (Y) + ε would make sense: evaluating income through a consumption function would allow verifying the credibility of individual income tax returns.

ii) Estimation phase. The estimation phase can be approached as follows. Assume we plan an experiment, determine a random sample of size n, and on each of the statistical units observe both the phenomenon to be explained and its causes. Thus we have the following data

(yi , xi1 , xi2 , . . . , xip ), for i = 1, . . . , n,

and the model specified is the following:

Yi = f (xi1 , xi2 , . . . , xip ; β) + εi , for i = 1, . . . , n.

However, this model specification has a problem. The variable εi is not observable, and in the observed data εi is realized through the number ei , which can be deduced from the following relation:

yi = f (xi1 , xi2 , . . . , xip ; β) + ei , for i = 1, . . . , n.

Therefore, the ei = yi − f (xi1 , xi2 , . . . , xip ; β) are the sample estimates of the εi , and the yi are the observed realizations of the r.v. Yi , i = 1, . . . , n.

iii) Validation Phase consists of a series of inferential decisions, formalized through hypoth-
esis tests, with the ultimate aim of criticizing and discussing the results obtained. If the validation
does not lead to the rejection of the estimated model then this model can be used. Otherwise it has
to be reformulated and the information that led to that functional form revised.

A simple linear regression model can be expressed using the following form

Yi = β0 + β1 xi + εi ,   i = 1, . . . , n,    (2.1)

where β0 is the intercept and β1 is the slope. A model is defined to be linear as long as it is linear in the parameters. If β0 = 0 the straight line passes through the origin. If β1 > 0 (β1 < 0), the straight line has positive (negative) slope and as X grows Y grows (decreases). Finally, if β1 = 0 the line is parallel to the x-axis. Note that the form of the model implies that the only random variable is the response variable Y. The explanatory variable X is usually assumed to be fixed, i.e. measured without error.

For each statistical unit (yi , xi ), our regression model is based on a deterministic component, a function of the variable X (namely β0 + β1 xi ), plus a stochastic component (given by the error term εi ). Assuming that E(εi ) = 0, i = 1, . . . , n, and that all the xi are deterministic, the expected value and the variance of Y can be proved to be

E(Yi ) = E(β0 + β1 xi + εi ) = β0 + β1 xi + E(εi ) = β0 + β1 xi ,

V(Yi ) = V(β0 + β1 xi + εi ) = V(εi ),

When the X's are not fixed by the design of the study it is more appropriate to write E(Yi | Xi = xi ) in order to make clear that the x-values are themselves random variables, but what we are interested in is not the distribution of the x-values, only the way Y depends on them. Therefore

E(Yi | Xi = xi ) = E(β0 + β1 xi + εi | Xi = xi ) = β0 + β1 xi + E(εi | Xi = xi ) = β0 + β1 xi ,

V(Yi | Xi = xi ) = V(β0 + β1 xi + εi | Xi = xi ) = V(εi | Xi = xi ).

We are now ready to discuss some inferential results. Assume we have a sample of size n such that (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) are realizations of X and Y. For the i-th statistical unit, we can write E(Yi ) = β0 + β1 xi , i = 1, . . . , n. Having observed the value yi of the r.v. Yi , the difference ei = yi − (β0 + β1 xi ) represents the i-th realization of the random variable (r.v.) εi , provided that we know the values of β0 and β1 . If the model were a perfect representation of reality for Y, we would observe exactly the value β0 + β1 xi ; instead we observe yi . The difference between the observed value yi of the r.v. Yi and the expected value β0 + β1 xi computed using the linear model when Xi = xi is defined as the realization ei (the residual) of the r.v. εi (the error):

ei = yi − (β0 + β1 xi )    for i = 1, . . . , n.

For a correct understanding, we must fix the following elements:

• Model specification
  Y = β0 + β1 X + ε;

• Population
  Yi = β0 + β1 xi + εi ,  i = 1, 2, . . . ;

• Observed sample
  yi = β0 + β1 xi + ei ,  i = 1, . . . , n.

In practice, we wish to minimize the difference between the actual value of y (yi ) and the
predicted value of y that we will denote with ŷi . This difference is called the residual, ei , that is,

ei = yi − ŷi .

A scatterplot of a hypothetical situation based on six data points would show a line of best fit, ŷi , along with the residuals.
A very popular method of choosing β0 and β1 is called the method of least squares. As the name suggests, β0 and β1 are chosen to minimize the sum of squared residuals (or residual sum of squares, RSS).

2.1 Estimation and Inference


We can now introduce some assumptions. These conditions are not strictly needed to justify the use of the ordinary least squares (OLS) estimator; they just constitute a simple case in which the small-sample properties of the estimators of β0 and β1 are easily derived.

• (A0) Yi = β0 + β1 xi + εi , for i = 1, . . . , n;

• (A1) E(εi ) = 0, for i = 1, . . . , n;

• (A2) V(εi ) = σ² < +∞, for i = 1, . . . , n;

• (A3) Cov(εi , εj ) = 0, for i ≠ j, i, j = 1, . . . , n;

• (A4) X is deterministic, and it is observed for at least two distinct values.

Condition (A0) imposes that the model we are specifying has the same parameters for each observation, i.e. there are no structural breaks in the data; we are also assuming that the model is linear. Condition (A1) states that the expected value of the error terms is zero: on average, the regression line should be correct. Assumption (A2) states that the error terms have equal variance, which is referred to as homoskedasticity. Assumption (A3) imposes zero correlation between different error terms. (A4) requires knowledge of the variable X, so we are assuming that the observed sample is the result of a controlled experiment where X is a deterministic variable.

Figure 2.1: 3D plot of the Residual Sum of Squares as a function of the parameters β0 and β1 , with the red dot
representing the minimum of the function, which corresponds to the values that minimize the RSS.

However, we have already discussed how the model can be interpreted as a relationship between the conditional mean value of Y and the observed value of X.

We are seeking the values of β0 and β1 that minimize the residual sum of squares (RSS)

RSS(β0 , β1 ) = Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − (β0 + β1 xi ))²,    (2.2)

This is a quadratic function of the variables β0 and β1 . Figure 2.1 depicts a 3D representation of the objective function expressed in Equation (2.2).

By differentiating with respect to β0 and β1 , we obtain the normal equations

2(−1) Σ_{i=1}^n (yi − (β̂0 + β̂1 xi )) = 0
2(−1) Σ_{i=1}^n (yi − (β̂0 + β̂1 xi )) xi = 0

and then

Σ_{i=1}^n (yi − (β̂0 + β̂1 xi )) = 0
Σ_{i=1}^n (yi − (β̂0 + β̂1 xi )) xi = 0

from which we obtain the following solutions:

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sxy / Sx²,
β̂0 = ȳ − β̂1 x̄,

where

x̄ = n^{−1} Σ_{i=1}^n xi ,   ȳ = n^{−1} Σ_{i=1}^n yi ,   Sx² = n^{−1} Σ_{i=1}^n (xi − x̄)²,   Sxy = n^{−1} Σ_{i=1}^n (xi − x̄)(yi − ȳ).

From a computational point of view, the quantities needed to compute the estimates are

Σ_{i=1}^n xi ,   Σ_{i=1}^n yi ,   Σ_{i=1}^n xi² ,   Σ_{i=1}^n yi² ,   Σ_{i=1}^n xi yi .

It is useful to recall some alternative formulations:

Σ_{i=1}^n (xi − x̄)(yi − ȳ) = Σ_{i=1}^n (xi − x̄) yi − ȳ Σ_{i=1}^n (xi − x̄) = Σ_{i=1}^n (xi − x̄) yi ,

and the expression of the slope can be written as

β̂1 = Σ_{i=1}^n (xi − x̄) yi / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n δi yi ,

where the coefficients δi are such that

δi = (xi − x̄) / Σ_{j=1}^n (xj − x̄)² ,  i = 1, . . . , n;    Σ_{i=1}^n δi = 0;    Σ_{i=1}^n δi² = 1 / Σ_{i=1}^n (xi − x̄)².

Furthermore, we can rewrite the slope as

β̂1 = Σ_{i=1}^n (xi − x̄)² [(yi − ȳ)/(xi − x̄)] / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n wi bi ,

where

bi = (yi − ȳ)/(xi − x̄);    wi = (xi − x̄)² / Σ_{j=1}^n (xj − x̄)² ,  i = 1, . . . , n;    wi ≥ 0,  Σ_{i=1}^n wi = 1.

These formulations show how the OLS estimate of the slope is a weighted average of all the
slopes bi , determined by the straight lines joining the single observed data points (xi , yi ) with the
point (x̄, ȳ). We note that the weights wi are an increasing function of (xi − x̄)². This means that
data points far from the sample mean will have a greater impact in determining the slope. Another
way to conceptualize this is by regarding the dispersion of xi around the mean as equivalent to
information. The greater the dispersion among the xi values, the more information we possess,
and consequently, we will have a richer source of information for estimating our parameters.
Conversely, when the data cluster closely around the mean, we have less information available,
leading to less accurate parameter estimates.
Finally, it is important to discuss what happens when dealing with standardized data points. In this case

(ŷi − ȳ)/Sy = β* (xi − x̄)/Sx   →   (ŷi − ȳ) = β* (Sy/Sx)(xi − x̄),  i = 1, . . . , n,

from which we have

β̂1 = β* Sy/Sx   →   β* = β̂1 Sx/Sy = (Sxy/Sx²)(Sx/Sy) = rxy ,

where rxy is the correlation coefficient. Then, using the standardized variables, the slope reduces to the correlation coefficient (and the intercept vanishes). So by estimating β* we are not adding more information beyond computing the correlation coefficient.
Using the results just discussed, we can prove some important properties about the regression
line.

• The regression line is unique, because the minimum of the convex function RSS(β0 , β1 ) is
unique;

• The regression line always passes through the point (x̄, ȳ): from the relation β̂0 = ȳ − β̂1 x̄, we can deduce that ȳ = β̂0 + β̂1 x̄;

• The sample mean of the yi is equal to the sample mean of the ŷi . In fact

Σ_{i=1}^n (yi − β̂0 − β̂1 xi ) = 0   →   Σ_{i=1}^n (yi − ŷi ) = 0   →   Σ_{i=1}^n ŷi = Σ_{i=1}^n yi .

Note that the presence of the intercept in the model formulation implies that the sum of the
residuals is always zero. Usually, if not for theoretical reasons or specialized knowledge, it is a
mistake to omit the intercept from the estimation.

Example 3. Let’s assume to have the following data

x y
9,705 3,816
7,267 3,130
8,459 2,955
12,476 4,809
10,296 4,269
8,424 3,291
7,910 2,274
8,879 3,308
11,160 4,340
5,295 1,948
8,421 3,715
12,232 5,340
5,422 2,212
9,900 2,512
12,441 5,277
Table 2.1: Data, Example

we can compute

Σ xi = 138,287;   Σ yi = 53,196;   Σ xi² = 1,347,881,959;   Σ xi yi = 521,674,875;   Σ yi² = 205,383,190,

and

x̄ = 9,219.133;   ȳ = 3,546.400;   Sxy = 2,083,590.5467;   Sx² = 4,866,377.8489.

Therefore,

β̂1 = 2,083,590.5467 / 4,866,377.8489 = 0.4282;   β̂0 = 3,546.400 − 0.4282 × 9,219.133 = −401.2328.

The linear regression will be


ŷ = −401.2328 + 0.4282x
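The computations of Example 3 can be checked in R with a few lines; the sketch below re-enters the data of Table 2.1 as plain numbers (the thousands separators are dropped) and compares the hand-derived estimates with lm().

x <- c(9705, 7267, 8459, 12476, 10296, 8424, 7910, 8879,
       11160, 5295, 8421, 12232, 5422, 9900, 12441)
y <- c(3816, 3130, 2955, 4809, 4269, 3291, 2274, 3308,
       4340, 1948, 3715, 5340, 2212, 2512, 5277)

# Estimates from the formulas of this section
Sxy   <- mean((x - mean(x)) * (y - mean(y)))
Sx2   <- mean((x - mean(x))^2)
beta1 <- Sxy / Sx2
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)          # approximately -401.23 and 0.4282

# The same fit obtained with lm()
coef(lm(y ~ x))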

2.1.1 Gauss Markov Theorem


The estimates (β̂0 , β̂1 ) obtained with the OLS method using the observed sample {(xi , yi ), i = 1, . . . , n} vary as the random sample {(xi , Yi ), i = 1, . . . , n} varies, thus generating a pair of random variables (B0 , B1 ), i.e. the corresponding estimators of the parameters. In the regression model the random component is given by εi , and therefore Y is also a random variable. The estimators are then


B1 = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)²,
B0 = Ȳ − B1 x̄.

In the linear regression model we denote by (β0 , β1 ) the population parameters, by (β̂0 , β̂1 ) the estimates obtained from OLS (numbers), and by (B0 , B1 ) the estimators (random variables) generated by the random sampling process. We are now ready to discuss one of the most important results in inferential statistics.

Theorem 2. (Gauss-Markov Theorem). Under the classical assumptions of the simple regression
model, OLS estimators (B0 , B1 ) for the parameters (β0 , β1 ) are linear, unbiased, and the most
efficient in the class of linear and unbiased estimators (BLUE) (A proof based on the multivariate
version of this theorem can be found in the following section.)

To employ the least squares estimators for making meaningful inferences, we need to discuss their statistical properties. We have derived the expressions of B1 and B0 ; it can be shown that the least squares estimators are unbiased:

E(B1) = E[ Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)² ] = Σ_{i=1}^n (xi − x̄) E(Yi) / Σ_{i=1}^n (xi − x̄)²
      = Σ_{i=1}^n (xi − x̄)(β0 + β1 xi) / Σ_{i=1}^n (xi − x̄)²
      = β0 Σ_{i=1}^n (xi − x̄) / Σ_{i=1}^n (xi − x̄)² + β1 Σ_{i=1}^n (xi − x̄) xi / Σ_{i=1}^n (xi − x̄)²
      = 0 + β1 Σ_{i=1}^n (xi − x̄)² / Σ_{i=1}^n (xi − x̄)² = β1 .

To find V(B1) we need to recall that Y1 , Y2 , . . . , Yn are independent random variables, therefore

V(B1) = V[ Σ_{i=1}^n (xi − x̄) Yi / Σ_{i=1}^n (xi − x̄)² ] = (1/Sxx²) V[ Σ_{i=1}^n (xi − x̄) Yi ] = (1/Sxx²) Σ_{i=1}^n (xi − x̄)² V(Yi),

where Sxx = Σ_{i=1}^n (xi − x̄)². Recalling that V(Yi) = σ², i = 1, . . . , n,

V(B1) = σ²/Sxx .

To study the properties of B0 we should first discuss some results:

E(Ȳ) = E( n^{−1} Σ_{i=1}^n Yi ) = n^{−1} Σ_{i=1}^n E(β0 + β1 xi + εi) = β0 + β1 x̄;

it can be proved that the covariance between Ȳ and B1 is zero,

Cov(Ȳ, B1) = 0,

and we have that

E(B0) = E(Ȳ) − E(B1) x̄ = β0 + β1 x̄ − β1 x̄ = β0 .

As we know the expressions for V(Ȳ), V(B1) and Cov(Ȳ, B1), we can derive the variance of B0 ,

V(B0) = (σ²/n) [ 1 + n x̄² / Σ_{i=1}^n (xi − x̄)² ];

finally, we can prove that B0 and B1 are correlated:

Cov(B0, B1) = −x̄ σ² / Sxx .
Unfortunately, the value of σ² is unknown, so we have to use our sample information to find a reasonable estimator. Because we are using Ŷi to estimate E(Yi), it seems natural to base an estimate of σ² upon RSS = Σ_{i=1}^n (Yi − Ŷi)². Indeed, it can be shown that

S² = (1/(n − 2)) Σ_{i=1}^n (Yi − Ŷi)² = RSS/(n − 2)

is an unbiased estimator for σ². We note how the variability of the estimates decreases as the quantity Σ_{i=1}^n (xi − x̄)² increases, and therefore as the variance of the sample increases. This variability must be seen as a source of information: only by having an adequate variability in the observations will it be possible to check whether this variability has an effect on Y.
It can also be proved that B0 → β0 and B1 → β1 in probability, meaning that both B0 and B1 are consistent estimators. If we assume that the εi are independent, then S² is a consistent estimator for σ². Finally, B0 and B1 are also asymptotically normal.
Plugging the expression of S² into V(B0) and V(B1) we obtain the estimators of the variances, i.e. the squared standard errors

se²(B0) = (S²/n) [ 1 + n x̄² / Σ_{i=1}^n (xi − x̄)² ],
se²(B1) = S² / Σ_{i=1}^n (xi − x̄)².

Where the square root of the quantities just introduced constitutes the standard errors of B0
and B1 , respectively. The standard error is a measure of the variability or uncertainty associated
with the estimated coefficients of the regression equation. It provides an indication of how reliable
or precise the estimated coefficients are in representing the true relationships between the inde-
pendent and dependent variables. The distinction between the variance and the standard errors of
B0 and B1 lies in the fact that the variance cannot be directly calculated, as it relies on σ 2 (the
variance of the errors).
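In R these quantities are reported directly in the coefficient table of summary(); a minimal sketch, continuing with the x and y vectors entered for Example 3, computes them by hand and compares.

fit <- lm(y ~ x)                 # x and y as in the Example 3 sketch

n   <- length(y)
S2  <- sum(residuals(fit)^2) / (n - 2)        # unbiased estimate of sigma^2
Sxx <- sum((x - mean(x))^2)
se_b1 <- sqrt(S2 / Sxx)
se_b0 <- sqrt(S2 * (1 / n + mean(x)^2 / Sxx))
c(se_b0, se_b1)

# The same quantities as reported by R
summary(fit)$sigma^2                          # S^2
summary(fit)$coefficients[, "Std. Error"]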

2.1.2 Maximum likelihood estimators for β1 and β0

Assuming that the r.v. εi are i.i.d. (independent and identically distributed) with εi ∼ N(0, σ²), the observed sample (y1 , . . . , yn ) is the realization of the random sample Y = (Y1 , . . . , Yn )> whose components are distributed as Yi ∼ N(β0 + β1 xi , σ²). Therefore the likelihood function can be written as

L(β0 , β1 , σ² | y) = (2πσ²)^{−n/2} exp( −Σi (yi − β0 − β1 xi)² / (2σ²) ),

and the log-likelihood function is

ℓ(β0 , β1 , σ²) ∝ −(n/2) log(σ²) − Σi (yi − β0 − β1 xi)² / (2σ²),

where ∝ means proportional to. Taking the derivatives with respect to β0 , β1 and σ², we end up with the maximum likelihood estimators (MLE)

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² ;   β̂0 = ȳ − β̂1 x̄ ;   σ̂² = (1/n) Σ_{i=1}^n (yi − β̂0 − β̂1 xi)².

We note that the ML estimates and the corresponding estimators coincide with the expressions we found for OLS. However, the estimator of the variance is biased.
The MLE estimators therefore enjoy all the properties we have discussed, and the Gauss-Markov theorem still holds for them; but because we are using maximum likelihood to obtain those expressions, our estimators are also sufficient and efficient among all unbiased estimators (not just among linear and unbiased ones). In fact, it can be proved that the MLE estimators reach the Cramér-Rao lower bound. Furthermore, for any value of n these estimators are also normally distributed, whereas in the case of OLS we had to let n → ∞ to obtain an asymptotically normal distribution.
In this setting, the normality hypothesis plays a crucial role in the efficiency of these estimators. Therefore, the hypothesis of normality of the residuals will have to be carefully assessed.
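A short R sketch, again based on the Example 3 data, makes the bias of the ML variance estimate concrete: the MLE divides the RSS by n, while the unbiased estimator S² divides it by n − 2; logLik() applied to an lm fit returns the Gaussian log-likelihood evaluated at these ML estimates.

fit <- lm(y ~ x)                          # x and y as in the Example 3 sketch
n   <- length(y)

rss        <- sum(residuals(fit)^2)
sigma2_mle <- rss / n                     # biased MLE of sigma^2
S2         <- rss / (n - 2)               # unbiased estimator
c(sigma2_mle, S2)

logLik(fit)                               # Gaussian log-likelihood at the ML estimates
-n / 2 * (log(2 * pi) + log(sigma2_mle) + 1)   # same value computed by hand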

2.1.3 Inference about the parameters βi

Now that we have estimated the model using the ordinary least squares (OLS) method and max-
imum likelihood, we want to evaluate the significance of the parameters and then move on to the
validation phase. Using the assumption of error normality, the random variable (B0 , B1 )> is essen-
tially a bivariate normal random variable with mean vector β = (β0 , β1 ) and variance-covariance
matrix σ²V, where

σ²V = σ² [ (1/n)(1 + n x̄²/Sxx)    −x̄/Sxx
            −x̄/Sxx                 1/Sxx ],

Sxx = Σ_{i=1}^n (xi − x̄)², σ² is the variance of the errors εi , and x̄ is the mean of the xi values; the denominator Sxx is the sum of the squared deviations of the predictor variable from its mean.

Suppose we have fitted a linear model and we want to test

H0: β0 = b0  vs  H1: β0 ≠ b0,
H0: β1 = b1  vs  H1: β1 ≠ b1,

where b0 and b1 are the values we want to test. In order to test this set of hypotheses, we derive a test statistic. Recall from the previous paragraph that the unbiased estimator for σ² is S².

Given that Z is a standard normal random variable and χ²ᵥ is an independent chi-squared random variable with v degrees of freedom, the ratio

T = Z / √(χ²ᵥ / v)

follows a t-distribution with v degrees of freedom. If the hypothesized model is true, then

[ (Bi − βi) / √V(Bi) ] / √(S²/σ²) ∼ T_{n−2},   for i = 0, 1.

We note that these random variables do not depend on σ², as it cancels in the ratio. Thus, the following testing scheme applies:

H0: β0 = b0 vs H1: β0 ≠ b0; we reject H0 if |T0| ≥ t_{α/2; n−2},
H0: β1 = b1 vs H1: β1 ≠ b1; we reject H0 if |T1| ≥ t_{α/2; n−2},

where

T0 = (β̂0 − b0)/se(B0),   T1 = (β̂1 − b1)/se(B1).
Similarly, confidence intervals (CI) for the individual parameters can be constructed:

Confidence interval for β0 : β̂0 − tα/2;n−2 se(B0 ) ≤ β0 ≤ β̂0 + tα/2;n−2 se(B0 )


Confidence interval for β1 : β̂1 − tα/2;n−2 se(B1 ) ≤ β1 ≤ β̂1 + tα/2;n−2 se(B1 ).

The value t_{α/2; n−2} from the Student's t-distribution with n − 2 degrees of freedom is defined as the value that leaves an area of α/2 in the upper tail; by symmetry, its negative −t_{α/2; n−2} leaves an area of α/2 in the lower tail.
In simpler terms, for a two-tailed test:

• The area to the right of t_{α/2; n−2} is α/2.

• The area to the left of −t_{α/2; n−2} is α/2.

These quantiles are commonly used in hypothesis testing and in the construction of confidence intervals when the population variance is unknown and the sample size is small. For a two-tailed test at the 5% significance level, where α = 0.05 and α/2 = 0.025, the 2.5% and 97.5% quantiles of the t_{n−2} distribution provide the critical values for the test and the bounds of a 95% confidence interval.
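In R the t statistics and the p-values for H0: βi = 0 are part of summary(), while confint() returns the confidence intervals above; a minimal sketch, for any simple regression fit, is the following.

fit <- lm(y ~ x)                  # any simple linear regression fit
n   <- length(y)

summary(fit)$coefficients         # Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95)        # intervals based on t quantiles with n - 2 df

# The interval for the slope rebuilt by hand
b1    <- coef(fit)[2]
se1   <- summary(fit)$coefficients[2, "Std. Error"]
tcrit <- qt(0.975, df = n - 2)
c(b1 - tcrit * se1, b1 + tcrit * se1)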

2.2 Goodness of fit


When estimating parameters for a regression model, it is always advisable to propose an indicator
capable of summarizing the explanatory power of the model in relation to the sample data. This
can be done by calculating the R2 index, which can be derived from the decomposition of the total
deviance. Let’s see how.
Dev(y) = Dev(ŷ) + Dev(e),

where

• Dev(y) = Σ_{i=1}^n (yi − ȳ)² = Total Deviance = TSS;

• Dev(ŷ) = Σ_{i=1}^n (ŷi − ȳ)² = Regression Deviance = SSR;

• Dev(e) = Σ_{i=1}^n (ei − 0)² = Σ_{i=1}^n (yi − ŷi)² = Residual Deviance = RSS.

Figure 2.2: Probability density function (PDF) of a t-distribution with n − 1 degrees of freedom, representative of a
sample size of n. The symmetric curve highlights the central tendency and variability of the t-distribution. Two critical
quantiles, tα/2,n−1 and t1−α/2,n−1 , are marked by red dashed lines, indicating the tail regions of the distribution for
a significance level of α. The shaded areas under the curve on either side represent the cumulative probabilities in
the tails beyond these critical quantiles. For the given example, these quantiles are approximately −2.26 and 2.26
respectively, providing boundaries for hypothesis testing at the 5% significance level. The area between these two
quantiles contains 95% of the distribution’s total area, leaving 2.5% in each tail.

The proposed decomposition holds only when the intercept is included in the model. Now, if the variability
of Y is largely explained by the regression line, the regression deviance (SSR) will be high relative
to the total deviance (TSS), and consequently, the residual deviance (RSS) will be low. It follows
that a measure of the goodness of the regression line is
R² = Dev(ŷ)/Dev(y) = Σ_{i=1}^n (ŷi − ȳ)² / Σ_{i=1}^n (yi − ȳ)².

Equivalently, thanks to the decomposition of the deviance, we can write


R² = 1 − Dev(e)/Dev(y) = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)².

The index represents the proportion of the total deviance explained by the regression. Therefore, 0 ≤ R² ≤ 1.
It is possible to prove that, in the case of simple linear regression, the coefficient R² is the square of the correlation coefficient between the yi and the xi :

R² = 1 − (n − 2)s² / (n sy²) = 1 − n[sy² − β̂1² sx²] / (n sy²) = β̂1² (sx²/sy²) = (sxy/sx²)² (sx²/sy²) = ( sxy/(sx sy) )² = (rxy)².
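The deviance decomposition and the identity R² = r²xy are easy to verify numerically; a short R sketch for a fitted simple regression:

fit  <- lm(y ~ x)
yhat <- fitted(fit)

TSS <- sum((y - mean(y))^2)
SSR <- sum((yhat - mean(y))^2)
RSS <- sum((y - yhat)^2)

TSS - (SSR + RSS)                 # decomposition: essentially zero
SSR / TSS                         # R^2 from the decomposition
1 - RSS / TSS                     # equivalent form
cor(x, y)^2                       # squared correlation coefficient
summary(fit)$r.squared            # value reported by R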

Figure 2.3: Anscombe’s quartet is a famous statistical example that consists of four datasets with nearly identical
descriptive statistics but vastly different visual representations each having an equal R2 (0.67). The quartet was
created by the statistician Francis Anscombe in 1973.

Despite the R-squared index being a powerful tool for assessing goodness of fit, in real-world applications everything becomes more complex, and relying solely on summary statistics can be misleading. An example of this is Anscombe's quartet in Figure 2.3, where four datasets are plotted, each having the same coefficient of determination R² (0.67). It is clear, therefore, that before estimating any model, every analyst must explore the data, ask questions, and test hypotheses.

Figure 2.4: Another example of why we should never rely solely on summary statistics is that the datasets all have the
same mean, standard deviation, and correlation.

Example 4. (Easy way to compute β̂1 and β̂0 )


A rocket motor is manufactured by bonding an igniter propellant and a sustainer propellant
together inside a metal housing. The shear strength of the bond between the two types of propellant
is a critical quality characteristic. It is hypothesized that the shear strength may be influenced by
the age (in weeks) of the sustainer propellant batch. A set of twenty observations concerning
shear strength and the age of the corresponding propellant batch have been gathered and are
presented in Table 2.2. Figure 2.5 depicts a scatter diagram, indicating a strong statistical relationship between shear strength and propellant age. The preliminary assumption of the linear model y = β0 + β1 x + ε seems justifiable.

Table 2.2: Observations on Shear Strength and Age of Propellant


Age of Propellant, X (weeks)        Shear Strength, Y (psi)
15.50 2158.70
23.75 1678.15
8.00 2316.00
17.00 2061.30
5.50 2207.50
19.00 1708.30
24.00 1784.70
2.50 2575.00
7.50 2357.90
11.00 2256.70
13.00 2165.20
3.75 2399.55
25.00 1779.80
9.75 2336.75
22.00 1765.30
18.00 2053.50
6.00 2414.40
12.50 2200.50
2.00 2654.20
21.50 1753.70

In simple linear regression we must calculate both β̂0 and β̂1 . While the former is relatively straightforward to compute, the latter can become cumbersome, especially with a large number of data points. In this exercise we propose an easier way to compute this quantity:

β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sxy / Sxx .

With a bit of algebra, we can rewrite this expression; starting from Sxy we have

Sxy = Σ_{i=1}^n yi (xi − x̄) = Σ_{i=1}^n xi yi − x̄ Σ_{i=1}^n yi
    = Σ_{i=1}^n xi yi − n x̄ ȳ
    = Σ_{i=1}^n xi yi − n (Σ_{i=1}^n xi / n)(Σ_{i=1}^n yi / n)
    = Σ_{i=1}^n xi yi − (Σ_{i=1}^n xi)(Σ_{i=1}^n yi) / n.
While Sxx = Σ_{i=1}^n (xi − x̄)² can be rewritten in this way:

Sxx = Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n (xi² − 2 xi x̄ + x̄²)
    = Σ_{i=1}^n xi² − 2 x̄ Σ_{i=1}^n xi + n x̄²
    = Σ_{i=1}^n xi² − 2 (Σ_{i=1}^n xi / n) Σ_{i=1}^n xi + n (Σ_{i=1}^n xi / n)²
    = Σ_{i=1}^n xi² − 2 (Σ_{i=1}^n xi)² / n + (Σ_{i=1}^n xi)² / n
    = Σ_{i=1}^n xi² − (Σ_{i=1}^n xi)² / n.

Now we are ready to use these formulas in the exercise. Let’s see how

xi        yi        xi²        xi yi
15.50 2158.70 240.25 33459.85
23.75 1678.15 564.06 39856.06
8.00 2316.00 64.00 18528.00
17.00 2061.30 289.00 35042.10
5.50 2207.50 30.25 12141.25
19.00 1708.30 361.00 32457.70
24.00 1784.70 576.00 42832.80
2.50 2575.00 6.25 6437.50
7.50 2357.90 56.25 17684.25
11.00 2256.70 121.00 24823.70
13.00 2165.20 169.00 28147.60
3.75 2399.55 14.06 8998.31
25.00 1779.80 625.00 44495.00
9.75 2336.75 95.06 22783.31
22.00 1765.30 484.00 38836.60
18.00 2053.50 324.00 36963.00
6.00 2414.40 36.00 14486.40
12.50 2200.50 156.25 27506.25
2.00 2654.20 4.00 5308.40
21.50 1753.70 462.25 37704.55
Column sum 267.250 42627.150 4677.688 528492.637

Table 2.3: Calculations needed for the estimation of the regression line.

We have that x̄ = 267.250/20 = 13.3625, ȳ = 42627.150/20 = 2131.358 and Σ_{i=1}^n xi yi = 528492.637, then

Sxy = 528492.637 − (267.250)(42627.150)/20 = −41112.65,

while, knowing that Σ_{i=1}^n xi² = 4677.688 and (Σ_{i=1}^n xi)² = 267.250² = 71422.56, we have

Sxx = 4677.688 − 267.250²/20 = 1106.56.

Finally

β̂1 = Sxy/Sxx = −41112.65/1106.56 = −37.15,

and

β̂0 = ȳ − β̂1 x̄ = 2131.3575 − (−37.15 × 13.3625) = 2627.82.

The least-squares fit is


ŷ = 2627.82 − 37.15x
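The same estimates can be reproduced in R; the sketch below re-enters the propellant data of Table 2.2 as two vectors and repeats the shortcut formulas before calling lm().

age      <- c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
              13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
strength <- c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
              2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
              1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70)

n   <- length(age)
Sxy <- sum(age * strength) - sum(age) * sum(strength) / n
Sxx <- sum(age^2) - sum(age)^2 / n
c(Sxy, Sxx)                        # about -41112.65 and 1106.56

coef(lm(strength ~ age))           # intercept 2627.82, slope -37.15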

Figure 2.5: Scatterplot of Shear Strength vs Age of Propellant with fitted regression line.

Figure 2.6: Exploratory analysis of propellant data, illustrating the relationship between the shear strength and the age of the propellant.

Figure 2.7: Scatterplot of the production data

2.3 Regression Output from R


In this example, we consider some data taken from Foster, Stine and Waterman (1997, pages 191-199). The original data are in the form of the time taken (in minutes) for a production run, Y, and the number of items produced, X, for 20 randomly selected orders as supervised by three managers. We consider the data for one of the managers (see the table below and Figure 2.7). We wish to develop an equation to model the relationship between Y, the run time, and X, the run size.

Case Run time Run size Case Run time Run size
1 195 175 11 220 337
2 215 189 12 168 58
3 243 344 13 207 146
4 162 88 14 225 277
5 185 114 15 169 123
6 231 338 16 215 227
7 234 271 17 147 63
8 166 173 18 230 337
9 253 284 19 208 146
10 196 277 20 172 68

Figure 2.7 shows a scatter plot of the production data. The least squares estimates for the production data were calculated using R, giving the following results:

The intercept is β̂0 = 149.74 and the slope is β̂1 = 0.26, while the standard errors of the intercept and of the slope are se(β̂0) = 8.33 and se(β̂1) = 0.04. The remaining entries of the R output are interpreted as follows.

• t value: the estimated coefficient divided by its standard error; it measures the significance of the coefficient.

• Pr(>|t|): the p-value associated with the t-value; it indicates the statistical significance of the coefficient. Generally, a p-value below 0.05 suggests that the coefficient is statistically significant.

• Residual standard error: the standard deviation of the residuals, which measures the average distance between the observed values and the values predicted by the model. A lower residual standard error indicates a better fit of the model to the data.

• R-squared: the coefficient of determination, which indicates the proportion of the total variance in the dependent variable explained by the regression model. It ranges from 0 to 1, where 0 indicates no explanatory power and 1 represents a perfect fit.

• Adjusted R-squared: a modified version of R-squared that takes into account the number of predictors in the model. It penalizes the addition of unnecessary predictors, providing a more reliable measure of model fit.

• F-statistic: tests the overall significance of the regression model, i.e. whether at least one of the independent variables has a significant effect on the dependent variable. The F-statistic is accompanied by its associated p-value.

• Signif. codes: asterisks next to the p-values in the coefficients table indicate the level of significance; for example, a p-value less than 0.001 is represented as "***".
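The output described above can be reproduced with a few lines of R; the sketch below re-enters the production data from the table and fits the model (object names are arbitrary).

run_time <- c(195, 215, 243, 162, 185, 231, 234, 166, 253, 196,
              220, 168, 207, 225, 169, 215, 147, 230, 208, 172)
run_size <- c(175, 189, 344, 88, 114, 338, 271, 173, 284, 277,
              337, 58, 146, 277, 123, 227, 63, 337, 146, 68)

fit <- lm(run_time ~ run_size)
summary(fit)                             # coefficients about 149.74 and 0.26,
                                         # with standard errors 8.33 and 0.04
plot(run_size, run_time); abline(fit)    # scatterplot with the fitted line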

2.4 Exercises
1. Invent examples of data analytic problems, one each for the three major aims of modelling:

• prediction,

• explanation,

• causal inference.

By "invent" it is meant that you describe the variables, what scientists are interested in, and the
background of the situation as far as necessary to understand to which of the three aims you
classify the example. You do not need to make up the datasets.

2. Express the sum of squares of the errors in terms of the three unknown parameters β0 , β1 , β2 ,
and then obtain the normal equations by differentiation with respect to each parameter.

3. Write down the matrix X and the vectors β and Y and verify that the equations obtained in
the previous question are of the form XT Xβ̂ = XT Y.

4. Suppose that we fit the straight-line regression model ŷ = β̂0 + β̂1 x1, but the response is affected by a second variable x2 such that the true regression function is

E(y) = β0 + β1 x1 + β2 x2 .

a) Is the OLS estimator of the slope in the original simple linear regression model unbiased?
b) Show the bias in B1.

5. Consider the simple linear regression model

y = β0 + β1 x + ε,

where the intercept is known.

a) Find the OLS estimator of β1 for this model. Does this answer seem reasonable?
b) What is the variance of the slope estimator β̂1 found in part a)?
c) Find a 100(1 − α)% CI for β1. Is this interval narrower than the corresponding interval for the case where both slope and intercept are unknown?

6. Convert the following non-linear relationships into linear relationships by making transformations and defining new variables:

a) y = a/(b + cx),
b) y = a e^{−bx},
c) y = a b^x,
d) y = x/(a + bx),
e) y = 1/(1 + e^{bx}),

where a, b, c are constants.

7. Suppose that in the model yi = β0 + β1 xi + εi the errors have mean zero and are independent, but Var(εi) = ρi² σ², where the ρi are known constants, so the errors do not have equal variance. This situation arises when the yi are averages of several observations at xi; in this case, if yi is an average of ni independent observations, ρi² = 1/ni (why?). Because the variances are not equal, the theory developed in this chapter does not apply; intuitively, it seems that the observations with large variability should influence the estimates of β0 and β1 less than the observations with small variability.
The problem may be transformed as follows:

ρi^{−1} yi = ρi^{−1} β0 + ρi^{−1} β1 xi + ρi^{−1} εi ,

or

zi = ui β0 + vi β1 + δi ,

where ui = ρi^{−1}, vi = ρi^{−1} xi , δi = ρi^{−1} εi .

a) Show that the new model satisfies the assumptions of the standard statistical model,
b) Find the least squares estimates for β0 and β1,
c) Show that performing the least squares analysis on the new model, as was done in part b), is equivalent to minimizing

Σ_{i=1}^n (yi − β0 − β1 xi)² ρi^{−2}.

This is a weighted least squares criterion; the observations with large variances are weighted
less.

8. Suppose that a line is fit by the method of least squares to n points, that the standard statistical model holds, and that we want to estimate the line at a new point, x0. Denoting the value on the line by µ0, the estimate is

µ̂0 = β̂0 + β̂1 x0 .

a) Derive the expression for the variance of µ̂0 ,


b) Sketch the standard deviation of µ̂0 as a function of x0 − x̄. The shape of the curve should
be intuitively plausible,
c) Derive a 95% confidence interval for µ0 = β0 + β1 x0 under the assumption of normality.
Chapter 3

Multiple Linear Regression

In general, let there be p explanatory variables labelled x1 , . . . , xp . We can consider the following model

Yi = β0 + β1 xi1 + · · · + βp xip + εi ,   i = 1, . . . , n,    (3.1)

and note that in this expression we have p explanatory variables and p + 1 parameters (the p slopes plus the intercept). The model expressed in (3.1) can still be considered a linear model, as its predictor is linear in the unknown parameters β0 , β1 , . . . , βp .
The formulation introduced in (3.1) can be generalized using matrix notation. Let Y = (Y1 , Y2 , . . . , Yn )> denote the response vector and y = (y1 , y2 , . . . , yn )> its realization. The explanatory variables are contained in the matrix X of dimension n × (p + 1). Finally, we denote by ε = (ε1 , ε2 , . . . , εn )> the vector containing the error terms. The parameters are p + 1 and are contained in the vector β = (β0 , β1 , . . . , βp )> . The multiple linear model in matrix notation takes the following form

Y = Xβ + ε,    (3.2)

where Xβ is the deterministic component and ε is the stochastic component, with

Y = (Y1 , Y2 , . . . , Yn )>   (n × 1),
X = the n × (p + 1) matrix whose i-th row is (1, xi1 , xi2 , . . . , xip ),
β = (β0 , β1 , . . . , βp )>   ((p + 1) × 1),
ε = (ε1 , ε2 , . . . , εn )>   (n × 1).

For the i-th statistical unit we have

Yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + εi ,   i = 1, 2, . . . , n.

After having collected a sample, we have the observed values yi of the r.v. Yi and {xij , i = 1, 2, . . . , n; j = 1, 2, . . . , p} for the explanatory variables Xj , j = 1, 2, . . . , p. Therefore, we have

y = Xβ + e,    (3.3)

and for the i-th statistical unit

yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ei ,   i = 1, . . . , n.

As already discussed in the previous section, the vector e = (e1 , . . . , en )> contains the realizations of the r.v. ε (the stochastic error), which in general is not observed. The realizations of εi ¹ can be determined once the vector of parameters β has been estimated and β̂ is obtained:

e = y − Xβ̂.

3.1 Estimation
As in the simple linear model case, here too we have hypotheses, which can be viewed as a matrix generalization of the ones discussed for the simple linear model.

• (A1) Y = Xβ + ε,

• (A2) E(ε) = 0,

• (A3) Var(ε) = E(εε>) = σ² In ,

• (A4) X is a deterministic matrix of full rank, rank(X) = p + 1.

In other words, we are assuming: i) Linearity of the model: the relationship between the dependent variable Y and the independent variables X is linear and constant for all observations; this means that there are no structural breaks in the data (an example of a structural break is given by the third plot in Figure 3.1). ii) Zero mean of errors: the errors ε have zero mean, that is, E(ε) = 0. This assumption means that, on average, the model is correct, and any deviations from the true values are random and cancel each other out. iii) Homoscedasticity: the errors have constant variance, that is, V(ε) = E(εε>) = σ² In , where σ² is the constant variance of the errors and In is the identity matrix of size n; an example of a violation is given by the second plot in Figure 3.1.

¹The distinction between error terms εi and residuals ei is important. Error terms are unobservable, and distributional assumptions about them are necessary to derive sampling properties of estimators for β. The residuals are obtained after estimation, and their values depend upon the estimated values for β and therefore upon the sample and estimation method.

[Figure 3.1 shows three residuals-vs-fitted panels, titled "No problem", "Heteroschedasticity" and "Non-linear"; the horizontal axis is Fitted and the vertical axis is Residuals.]

Figure 3.1: Examples of violation of the OLS assumptions. Residuals vs Fitted Values Plot obtained after estimating
a multiple linear regression model. The first plot depicts residuals with no issues, as they are centered around zero and
exhibit homoscedasticity. The second plot illustrates an instance of heteroscedastic residuals, which remain centered
around zero, but their dispersion is no longer constant. The final plot, on the other hand, demonstrates an example of
structural break. In practical terms, especially when dealing with economic data, it is possible for the estimated beta
value to vary across different data points.

iv) No multicollinearity: there is no perfect multicollinearity among the independent variables; that is, the columns of X are not linearly dependent on each other, as this would lead to problems with the estimation of the regression coefficients. This is discussed further in a later section.
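Residuals-versus-fitted plots of the kind shown in Figure 3.1 are easy to produce once a model has been estimated; a minimal R sketch on simulated data (all names and values illustrative) is:

set.seed(2)
x1 <- runif(100); x2 <- runif(100)
y  <- 1 + 2 * x1 - x2 + rnorm(100)       # simulated data, no violations built in
fit <- lm(y ~ x1 + x2)

plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
plot(fit, which = 1)                     # built-in residuals-vs-fitted diagnostic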

We wish to find the vector of least-squares estimators β̂ that minimizes

RSS(β) = S(β) = (y − Xβ)>(y − Xβ) = Σ_{i=1}^n ei² = Σ_{i=1}^n (yi − xi>β)²,    (3.4)

where S(β) can be expressed as


S(β) = y>y − β>X>y − y>Xβ + β>X>Xβ
     = y>y − 2β>X>y + β>X>Xβ;

note that since β>X>y is a scalar, its transpose y>Xβ is the same scalar. Taking the derivative with respect to the β vector (see the Appendix for details about matrix differentiation) and setting it equal to zero,

∂S/∂β = −2X>y + 2X>Xβ̂ = 0,

which can be simplified to

X>Xβ̂ = X>y.    (3.5)

Equation (3.5) gives the least squares normal equations. To solve this expression, X>X has to be invertible (this will always be the case if the covariates in our model are linearly independent). We then have

β̂ = (X>X)^{−1} X>y.    (3.6)

It might happen that X is not of full rank. This occurs if, for example, two of the explanatory variables are perfectly correlated (e.g., xi2 = 3 × xi4). This means that X>X is singular and the least squares coefficients β̂ are not uniquely defined; in other words, we are incorporating the same information twice. This problem is usually resolved by recoding or dropping redundant columns in the design matrix X, or by applying variable selection methods. However, it is advisable to carry out a preliminary exploratory analysis in order to identify interactions between explanatory variables and possible correlations.
The fitted values can be obtained by

p
X
ŷi = x>
i β̂ = β̂0 + β̂j xij ,
j=1

The vector of fitted values can be obtained considering the entire design matrix

ŷ = Xβ̂ = X(X> X)−1 X> y = Hy, (3.7)

We can notice that H ∈ R^{n×n} is usually called the hat matrix. It maps the vector of observed values into the vector of fitted values.

Figure 3.2: (a) 3D scatterplot; (b) regression hyperplane. A linear model fitted using least squares: the observations are shown in red and the blue plane indicates the least squares fit to the data.


The difference between the observed values yi and the corresponding fitted values ŷi is the residual ei = yi − ŷi . The vector of residuals is

e = y − Xβ̂ = y − ŷ.    (3.8)

Figure 3.3: Geometric representation of linear regression. The figure illustrates the relationship between the observed
outcomes Y , the fitted outcomes Ŷ , and the residuals e. The column space of X represents all possible fitted values,
and the residuals are orthogonal to this space, indicating the difference between the observed and predicted outcomes.
Source: Link
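The closed-form expressions (3.6)-(3.8) can be verified directly with a few matrix operations in R; a minimal sketch on simulated data (all names and values illustrative) follows.

set.seed(3)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                          # n x (p + 1) design matrix

beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
H        <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
y_hat    <- H %*% y                            # fitted values
e        <- y - y_hat                          # residuals

cbind(beta_hat, coef(lm(y ~ x1 + x2)))         # identical estimates
max(abs(t(X) %*% e))                           # residuals orthogonal to col(X): ~ 0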

3.2 TODO(Optional) Gradient Descent Estimation
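As a rough illustration of the idea behind this optional topic: gradient descent minimises RSS(β) iteratively by repeatedly stepping against the gradient ∂S/∂β = −2X>(y − Xβ) derived above, instead of solving the normal equations in closed form. A minimal R sketch on simulated data (step size and iteration count are arbitrary illustrative choices):

set.seed(4)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)

beta <- rep(0, ncol(X))          # starting value
eta  <- 1e-3                     # step size (must be small enough for convergence)
for (iter in 1:5000) {
  grad <- -2 * t(X) %*% (y - X %*% beta)   # gradient of RSS(beta)
  beta <- beta - eta * grad                # move against the gradient
}
cbind(gd = as.vector(beta), ols = coef(lm(y ~ x1 + x2)))   # the two should agree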



3.2.1 Properties
Denoting by B the estimator of β and assuming that E(ε) = 0, we can prove that B is an unbiased estimator of β.

Proof.

E(B) = E[(X>X)^{−1} X>Y]
     = E[(X>X)^{−1} X>(Xβ + ε)]
     = E[(X>X)^{−1} X>Xβ + (X>X)^{−1} X>ε]
     = β + (X>X)^{−1} X> E[ε]
     = β.

We notice that when E(ε) ≠ 0 our estimator is bound to be biased, and the amount of bias, (X>X)^{−1} X> E[ε], will depend on the design matrix and on the expected value of the error terms.
The covariance matrix of B, assuming that the error terms are uncorrelated and that Var(ε) = σ² I, is

Var(B) = σ² (X>X)^{−1}.

Proof.

Var(B) = Var[(X>X)^{−1} X>Y]
       = Var[(X>X)^{−1} X>(Xβ + ε)]
       = Var[(X>X)^{−1} X>Xβ + (X>X)^{−1} X>ε]
       = (X>X)^{−1} X> Var(ε) X (X>X)^{−1}
       = σ² (X>X)^{−1}.

To obtain the sampling distribution of B we need to assume that $\varepsilon \sim N_n(0, \sigma^2 I)$. So far the
expressions of our estimators have been obtained by purely algebraic means; no normality assumption
is required to obtain B. The assumptions A1–A3 are only needed to derive the distributional
properties of the estimator B. Assuming that the errors are normally distributed, we obtain
\[
B \sim N_p(\beta, \sigma^2(X^\top X)^{-1}),
\]



where $N_p$ denotes a p-variate normal distribution. Assuming that the errors ε are normally
distributed ensures that B is normal as well. This result follows from noting that B is a
linear transformation of Y; in fact
\[
B = (X^\top X)^{-1}X^\top(X\beta + \varepsilon).
\]



Observation Number    Delivery Time, y (min)    Number of Cases, x1    Distance, x2 (ft)
1 16.68 7 560
2 11.50 3 220
3 12.03 3 340
4 14.88 4 80
5 13.75 6 150
6 18.11 7 330
7 8.00 2 110
8 17.83 7 210
9 79.24 30 1460
10 21.50 5 605
11 40.33 16 688
12 21.00 10 215
13 13.50 4 255
14 19.75 6 462
15 24.00 9 448
16 29.00 10 776
17 15.35 6 200
18 19.00 7 132
19 9.50 3 36
20 35.10 17 770
21 17.90 10 140
22 52.32 26 810
23 18.75 9 450
24 19.83 8 635
25 10.75 4 150
Table 3.1: Delivery Time Data.

Example 5. A soft drink bottler is analyzing the vending machine service routes in his distribution
system. He is interested in predicting the amount of time required by the route driver to service the
vending machines in an outlet. This service activity includes stocking the machine with beverage
products and minor maintenance or housekeeping. The industrial engineer responsible for the
study has suggested that the two most important variables affecting the delivery time (y) are the
number of cases of product stocked (x1 ) and the distance walked by the route driver (x2 ). The
engineer has collected 25 observations on delivery time, which are shown in Table 3.1. We will fit
the multiple linear regression model

y = β0 + β1 x1 + β2 x2 + ε,

to the delivery time data in Table 3.1. When there are only two regressors, sometimes a three-
dimensional scatter diagram is useful in visualizing the relationship between the response and the
regressors. By spinning these plots, some software packages permit different views of the point

Figure 3.4: Example 5 data scatterplot. We observe how there seems to be a linear relationship between time and the
two covariates (cases and distance). It seems reasonable to estimate a linear model.

cloud. This view provides an indication that a multiple linear regression model may provide a
reasonable fit to the data.

To fit the multiple regression model we first form the X matrix and y vector:
\[
X = \begin{pmatrix}
1 & 7 & 560\\
1 & 3 & 220\\
1 & 3 & 340\\
\vdots & \vdots & \vdots\\
1 & 8 & 635\\
1 & 4 & 150
\end{pmatrix},
\qquad
y = \begin{pmatrix}
16.68\\ 11.50\\ 12.03\\ \vdots\\ 19.83\\ 10.75
\end{pmatrix},
\]
with one row per observation of Table 3.1 (a column of ones, the number of cases $x_1$ and the distance $x_2$).

The $X^\top X$ matrix is
\[
X^\top X =
\begin{pmatrix}
1 & 1 & \cdots & 1\\
7 & 3 & \cdots & 4\\
560 & 220 & \cdots & 150
\end{pmatrix}
\begin{pmatrix}
1 & 7 & 560\\
1 & 3 & 220\\
\vdots & \vdots & \vdots\\
1 & 4 & 150
\end{pmatrix}
=
\begin{pmatrix}
25 & 219 & 10{,}232\\
219 & 3{,}055 & 133{,}899\\
10{,}232 & 133{,}899 & 6{,}725{,}688
\end{pmatrix},
\]

and the $X^\top y$ vector is
\[
X^\top y =
\begin{pmatrix}
1 & 1 & \cdots & 1\\
7 & 3 & \cdots & 4\\
560 & 220 & \cdots & 150
\end{pmatrix}
\begin{pmatrix}
16.68\\ 11.50\\ \vdots\\ 10.75
\end{pmatrix}
=
\begin{pmatrix}
559.60\\ 7{,}375.44\\ 337{,}072.00
\end{pmatrix}.
\]
The least-squares estimator of β is
\[
\hat{\beta} = (X^\top X)^{-1}X^\top y,
\]
or
\[
\begin{pmatrix} \hat{\beta}_0\\ \hat{\beta}_1\\ \hat{\beta}_2 \end{pmatrix}
=
\begin{pmatrix}
25 & 219 & 10{,}232\\
219 & 3{,}055 & 133{,}899\\
10{,}232 & 133{,}899 & 6{,}725{,}688
\end{pmatrix}^{-1}
\begin{pmatrix} 559.60\\ 7{,}375.44\\ 337{,}072.00 \end{pmatrix}
\]
\[
=
\begin{pmatrix}
0.11321518 & -0.00444859 & -0.00008367\\
-0.00444859 & 0.00274378 & -0.00004786\\
-0.00008367 & -0.00004786 & 0.00000123
\end{pmatrix}
\begin{pmatrix} 559.60\\ 7{,}375.44\\ 337{,}072.00 \end{pmatrix}
=
\begin{pmatrix} 2.34123115\\ 1.61590712\\ 0.01438483 \end{pmatrix}.
\]
The least-squares fit (with the regression coefficients reported to five decimals) is

ŷ = 2.34123 + 1.61591x1 + 0.01438x2 .
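The same computation can be reproduced in R directly from the summaries above. The snippet below is a minimal sketch: it solves the normal equations with the X⊤X and X⊤y values just shown, and the commented lm() call assumes a hypothetical data frame called delivery with columns time, cases and dist holding the data of Table 3.1.

XtX <- matrix(c(   25,    219,   10232,
                  219,   3055,  133899,
                10232, 133899, 6725688), nrow = 3, byrow = TRUE)
Xty <- c(559.60, 7375.44, 337072.00)
beta_hat <- solve(XtX, Xty)   # solves (X'X) beta = X'y
round(beta_hat, 5)            # approx. 2.34123, 1.61591, 0.01438

# Equivalent fit via lm(), assuming the data of Table 3.1 are stored in `delivery`:
# fit <- lm(time ~ cases + dist, data = delivery)
# coef(fit)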



3.2.2 Gauss Markov Theorem


We can now discuss one of the key results of this course: the Gauss–Markov theorem.
Theorem 3. (Gauss- Markov Theorem). If the assumptions of the linear regression model hold,
then the ordinary least squares (OLS) estimator B is the Best Linear Unbiased Estimator (BLUE)
of β. This means that among all linear and unbiased estimators of the coefficients, the OLS
estimator has the smallest variance. Hence the most efficient estimator.
Proof. First of all we can notice that
\[
\begin{aligned}
B &= (X^\top X)^{-1}X^\top Y \\
&= (X^\top X)^{-1}X^\top(X\beta + \varepsilon) \\
&= (X^\top X)^{-1}X^\top X\beta + (X^\top X)^{-1}X^\top\varepsilon \\
&= \beta + A\varepsilon,
\end{aligned}
\]
where $A = (X^\top X)^{-1}X^\top$, so that $AA^\top = (X^\top X)^{-1}$. The OLS estimator can therefore be written as
\[
B = AY = \beta + A\varepsilon.
\]
The estimator B of the parameters β is linear, as B = AY is a linear combination of the random
variables Y. Its variance can be expressed as
\[
Var(B) = Var(AY) = A\,Var(Y)\,A^\top = \sigma^2 AA^\top = \sigma^2(X^\top X)^{-1}.
\]
Now we need to show that the OLS estimator is the most efficient in the class of linear
unbiased estimators. We introduce, for a given matrix C, a generic linear and unbiased estimator
of β:
\[
\tilde{B} = (A + C)Y = AY + C(X\beta + \varepsilon) = B + CX\beta + C\varepsilon.
\]
As we are assuming that $\tilde{B}$ is unbiased, $E(\tilde{B}) = \beta$, which is equivalent to imposing
\[
E(\tilde{B}) = E(B) + CX\beta + C\,E(\varepsilon) = \beta + CX\beta = [I + CX]\beta = \beta \iff CX = O.
\]
The constraint CX = O implies
\[
CA^\top = C\left[(X^\top X)^{-1}X^\top\right]^\top = CX(X^\top X)^{-1} = O\,(X^\top X)^{-1} = O,
\]
and therefore
\[
O = (CA^\top)^\top = AC^\top \;\Rightarrow\; AC^\top = O.
\]

The variance and covariance matrix of B̃ is:

V ar(B̃) = V ar[(A + C)Y] = (A + C)V ar(Y)(A + C)>


= (A + C)σ 2 I(A + C)> =
= σ 2 AA> + CA> + AC> + CC>

−1
= σ 2 AA> + CC> = σ 2 X> X + σ 2 CC> =
 

= V ar(B) + σ 2 CC> .


Therefore, $Var(\tilde{B}) - Var(B) = \sigma^2 CC^\top$ is a positive semi-definite matrix. We can conclude
that the variance of the OLS estimator of any regression parameter will not exceed the variance
of any other linear and unbiased estimator of β. The two variances coincide if and only if
C ≡ O, i.e., when B̃ ≡ B.
Hence, we have shown that the OLS estimators are BLUE.

3.2.3 Maximum Likelihood Estimation


The least squares estimator is also the maximum likelihood estimator (MLE) when the responses are
independent and normally distributed. When $Y_i \sim N(x_i^\top\beta, \sigma^2)$, the density function is
\[
f_{Y_i}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right\},
\]
and once a realization of $(Y_1, \ldots, Y_n)$ has been observed, the likelihood function is
\[
L(\beta, \sigma^2\mid y) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n}\exp\left\{-\frac{(y - X\beta)^\top(y - X\beta)}{2\sigma^2}\right\}.
\]
Maximizing $L(\beta, \sigma^2\mid y)$ with respect to β for given σ² is equivalent to maximizing
the exponential part, which is largest when $(y - X\beta)^\top(y - X\beta)$ is minimum. Please attempt
to obtain the MLE for β and for σ² as an exercise.

3.2.4 Residuals
In assessing the fit of our model we need to consider the behaviour of the residuals. The fitted
values are the estimates of the mean response for each observation, $\hat{y}_i = x_i^\top\hat{\beta}$ for $i =
1, \ldots, n$. So the residuals can be defined as
\[
e_i = y_i - \hat{y}_i = y_i - x_i^\top\hat{\beta},
\]

in matrix notation

e = y − Xβ̂ = y − X(X> X)−1 X> y


= y − Hy
= (In − H)y,

where In is a n × n identity matrix and H = X(X> X)−1 X> is the hat matrix. H is called in
this way as it puts the hat on y. In fact if we multiply y by H we obtain ŷ = Hy.
We can denote e> = (e1 , . . . , en ) the vector containing the r.v. residuals generated by e =
y − Xβ̂. Let H = X(X> X)−1 X> , M = I − H = I − XA and A = (X> X)−1 X> then

e = y − Xβ̂ = y − X(Ay) = (I − H)y = My,

M is an idempotent matrix with the following properties

M = M> ; MM = M; M> M = M; rank(M) = tr(M) = n − p − 1.



Furthermore, MX = 0 and X⊤M = 0, so that
\[
e = y - X\hat{\beta} = My = M(X\beta + \varepsilon) = 0\beta + M\varepsilon = M\varepsilon.
\]

The residual random vector is thus a linear transformation of the error terms:
\[
E(e) = E(M\varepsilon) = M\,E(\varepsilon) = 0,
\]
and
\[
\begin{aligned}
E(e^\top e) &= E[(M\varepsilon)^\top(M\varepsilon)] \\
&= E[\varepsilon^\top M^\top M\varepsilon] \\
&= E[\mathrm{tr}(\varepsilon^\top M^\top M\varepsilon)] \\
&= E[\mathrm{tr}(M^\top M\varepsilon\varepsilon^\top)] \\
&= \mathrm{tr}(E[M^\top M\varepsilon\varepsilon^\top]) \\
&= \mathrm{tr}(M^\top M\,E[\varepsilon\varepsilon^\top]) \\
&= \sigma^2\,\mathrm{tr}(M^\top M) \\
&= \sigma^2\,\mathrm{tr}(M) \\
&= \sigma^2(n - p - 1).
\end{aligned}
\]

3.2.5 Inference on σ 2
Once β has been estimated, we can obtain the residual sum of squares (RSS),
\[
RSS = e^\top e = (y - X\hat{\beta})^\top(y - X\hat{\beta}).
\]
We have proved that
\[
E(RSS) = (n - p - 1)\sigma^2.
\]
Hence an unbiased estimator of σ² is
\[
\hat{\sigma}^2 = \frac{RSS}{n - p - 1},
\]
since $E(\hat{\sigma}^2) = E(RSS)/(n - p - 1) = (n - p - 1)\sigma^2/(n - p - 1) = \sigma^2$. Assuming
that $\varepsilon_i \sim N(0, \sigma^2)$, it can be shown that $RSS/\sigma^2 \sim \chi^2_{n-p-1}$, where χ² denotes a chi-squared
random variable.
If we assume that the error vector $\varepsilon \sim N(0, \sigma^2 I)$ is normally distributed, then the residuals

will be distributed as
e ∼ N (0, σ 2 (I − H)).
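As a practical aside, these quantities are easy to extract from a fitted lm object in R. The sketch below uses the built-in cars data set purely as a stand-in model.

fit <- lm(dist ~ speed, data = cars)   # stand-in model; replace with your own fit

rss    <- sum(residuals(fit)^2)        # RSS = e'e
sigma2 <- rss / df.residual(fit)       # unbiased estimate RSS/(n - p - 1)
c(sigma2, sigma(fit)^2)                # sigma() returns the same estimate

h <- hatvalues(fit)                    # diagonal elements h_ii of the hat matrix H
head(sqrt(sigma2 * (1 - h)))           # estimated standard deviation of each residual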

3.2.6 Interpretation
In multiple linear regression, the coefficients represent the estimated effect of each predictor vari-
able (xi ’s) on the outcome variable (yi ), while holding all other predictors constant. Specifically,
the intercept (β0 ) represents the estimated mean value of the outcome variable when all predictor
variables are equal to zero. In many cases, this intercept may not have a meaningful interpretation,
particularly if the predictor variables do not take on meaningful zero values.
The coefficients for the predictor variables (β1 , β2 , . . . , βp ) represent the estimated change in
the outcome variable for a one-unit increase in each predictor variable, holding all other predictors
constant. These coefficients can be interpreted as the "effect size" of each predictor variable on
the outcome variable. For example, if we have a multiple linear regression model that predicts the
salary of employees based on their years of experience and level of education, we can interpret the
coefficient for years of experience as the estimated increase in salary associated with a one-year
increase in experience, holding education constant. Similarly, we can interpret the coefficient for
education as the estimated increase in salary associated with a one-level increase in education,
holding experience constant.
We can assume the following linear model:
\[
y_i = x_i^\top\beta + \varepsilon_i, \qquad i = 1, \ldots, n,
\]
where $x_i^\top = (1, x_{i1}, \ldots, x_{ip})$ and $\beta^\top = (\beta_0, \beta_1, \ldots, \beta_k, \ldots, \beta_p)$. In this case $\beta_k$ measures
the expected change in $y_i$ if $x_{ik}$ changes by one unit, whereas the other variables in $x_i$ do not
change. Formally,
\[
\frac{\partial E(y_i)}{\partial x_{ik}} = \beta_k.
\]
It is important to realize that we had to state explicitly that the other variables in $x_i$ do not
change. In a multiple regression model, single coefficients can only be interpreted under ceteris
paribus conditions. Suppose $x_i^\top\beta$ contains, say, $x_i\beta_2 + x_i^2\beta_3$. With a slight abuse of notation, taking
the derivative we have
\[
\frac{\partial E(y_i)}{\partial x_i} = \beta_2 + 2x_i\beta_3,
\]
which can be interpreted as the marginal effect of a change in $x_i$ if the other variables (excluding
$x_i^2$) are kept constant. This shows how the marginal effects of explanatory variables can be
allowed to vary over the observations by including additional terms involving these variables (in
this case $x_i^2$).


In the case of transformations, interpretations change slightly. Here are some examples:

Model: $y_i = \beta_0 + \beta_1\log(x_{i1}) + \varepsilon_i$ — a 1% change in x is associated with an expected change in y of $0.01\times\beta_1$.

Model: $\log(y_i) = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$ — a change in x by one unit ($\Delta x = 1$) is associated with an expected change in y of $(\exp(\beta_1) - 1)\times 100\%$.

Model: $\log(y_i) = \beta_0 + \beta_1\log(x_{i1}) + \varepsilon_i$ — a change of 1% in x is associated with a $\beta_1\%$ change in y.

Table 3.2: Models and Their Interpretations.

3.3 The Geometry of Transformations


Response transformations can induce normality of a distribution and stabilize variances because
they can stretch apart data in one region and push observations together in other regions. Figure 3.5
illustrates this behaviour. On the horizontal axis is a sample of data from a right-skewed lognormal
distribution. The transformation h(y) is the logarithm. The symmetry is achieved because the log
transformation stretches apart data with small values and shrinks together data with large values.
This becomes evident by observing the derivative of the log function: the derivative of log(y) is 1/y,
which is a decreasing function of y.
Consider an arbitrary increasing function h(y). If $x$ and $x'$ are two nearby data points that are
transformed to $h(x)$ and $h(x')$ respectively, then the distance between the transformed values can be
obtained by expanding $h(x)$ around $x'$:
\[
h(x) \approx h(x') + h'(x')(x - x').
\]
Rearranging the terms and taking absolute values,
\[
|h(x) - h(x')| \approx h'(x')\,|x - x'|,
\]
therefore $h(x)$ and $h(x')$ are stretched apart where $h'$ is large and pushed together where $h'$ is small.
A function $h(\cdot)$ is called concave if $h'(x)$ is a decreasing function of $x$.

[Figure: log-transformed values (ln Y, vertical axis) plotted against the original data Y (horizontal axis).]

Figure 3.5: A symmetrizing transformation. The skewed lognormal data on the horizontal axis are transformed to
symmetry by the log transformation.

3.4 Inference

3.4.1 Confidence Intervals


In case we use the assumption $\varepsilon \sim N(0, \sigma^2 I_n)$ (this assumption should be checked carefully
using the residuals), and we denote by $\hat{\beta}_j$ the $j$th element of $\hat{\beta}$, we can derive that
\[
\hat{\beta}_j \sim N(\beta_j, \sigma^2 v_j),
\]
where $v_j$ is the $(j, j)$th diagonal element of $(X^\top X)^{-1}$. Recalling that
\[
RSS/\sigma^2 \sim \chi^2_{n-(p+1)},
\]
we have
\[
T = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 v_j}} \sim t_{n-(p+1)},
\]
where $\hat{\sigma}^2 = RSS/(n - (p + 1))$.
An exact 100(1 − α)% confidence interval for $\beta_j$ is
\[
\hat{\beta}_j \pm t_{n-(p+1),\,\frac{1}{2}\alpha}\; se(\hat{\beta}_j),
\]
where $se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 v_j}$ and $t_{n-(p+1),\,\frac{1}{2}\alpha}$ is the upper $100\times\frac{1}{2}\alpha\%$ quantile of the $t_{n-(p+1)}$ distribution.
This result can be used to test hypotheses of the form $H_0: \beta_j = b_0$ for a given $j$, in particular
$H_0: \beta_j = 0$. The test statistic under $H_0$ reduces to
\[
\frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \sim t_{n-(p+1)},
\]
from which we can obtain p-values. Here we are testing whether the response depends on
the associated explanatory variable, given the inclusion of the other explanatory variables in the
model.
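In R these t-tests and confidence intervals are reported by summary() and confint(); the following minimal sketch again uses the built-in cars data as a stand-in model.

fit <- lm(dist ~ speed, data = cars)

summary(fit)$coefficients    # estimates, se(beta_j), t values and p-values
confint(fit, level = 0.95)   # beta_j_hat +/- t_{n-(p+1), alpha/2} * se(beta_j_hat)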

3.4.2 Tests

3.4.3 Testing one linear restriction


Often it is of interest to test a restriction involving more than one coefficient, such as $\beta_1 + \beta_2 + \cdots + \beta_p = 1$.
In general, such a linear hypothesis can be formulated, without loss of generality, in the following way:
\[
H_0: r_1\beta_1 + r_2\beta_2 + \cdots + r_p\beta_p = r^\top\beta = q,
\]
where $q \in \mathbb{R}$ (a scalar value) and $r \in \mathbb{R}^p$. Using the result of the Gauss–Markov theorem,
$r^\top B$ is also the Best Linear Unbiased Estimator (BLUE) of $r^\top\beta$, with variance
$V(r^\top B) = r^\top Var(B)\,r$ and hence $se(r^\top B) = \sqrt{V(r^\top B)}$. Replacing σ² with its estimate and
noting that B has a p-variate normal distribution, $r^\top B$ is normal as well. Therefore
\[
T = \frac{r^\top B - r^\top\beta}{se(r^\top B)} \sim t_{n-p}.
\]
Under the null we have
\[
T\mid H_0 = \frac{r^\top B - q}{se(r^\top B)},
\]
and this test statistic has a Student's t distribution with n − p degrees of freedom (d.o.f.).

3.4.4 Test about multiple parameters

Consider the null hypothesis that a subset of (p + 1) − q parameters out of the (p + 1) parameters is zero. Let $H_1$
denote the alternative hypothesis that these parameters are not all 0, and let $RSS_0$ and $RSS_1$ denote
the residual sum of squares under $H_0$ and $H_1$, respectively. Under the full model,
\[
RSS_1/\sigma^2 \sim \chi^2_{n-(p+1)}.
\]
This result, combined with the fact that $RSS_0 - RSS_1$ and $RSS_1$ are independent and that, under
the null,
\[
\frac{RSS_0 - RSS_1}{\sigma^2} \sim \chi^2_{(p+1)-q},
\]
yields the F-test
\[
F = \frac{(RSS_0 - RSS_1)/(p + 1 - q)}{RSS_1/(n - p - 1)},
\]
which under the null follows a Fisher (F) distribution with p + 1 − q and n − p − 1 d.o.f.
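This F-test can be carried out in R by comparing nested models with anova(); the sketch below uses the built-in mtcars data and an arbitrary choice of regressors purely for illustration.

fit0 <- lm(mpg ~ wt, data = mtcars)               # restricted model (under H0)
fit1 <- lm(mpg ~ wt + hp + disp, data = mtcars)   # full model (under H1)
anova(fit0, fit1)   # F = [(RSS0 - RSS1)/df] / [RSS1/(n - p - 1)] and its p-value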

3.4.5 Testing: further topics

We can use the results of the previous sections to formulate an even more general test. The most
general linear null hypothesis is a combination of the previous two cases and comprises a set of J
linear restrictions on the coefficients; this can be formulated as
\[
R\beta = q,
\]
where $R \in \mathbb{R}^{J\times(p+1)}$, assumed to be of full rank, and $q \in \mathbb{R}^J$. If, for example, we want to test

$\beta_1 + \beta_2 + \cdots + \beta_p = 1$ and $\beta_2 = \beta_3$, then R and q have to be specified in the following way:
\[
R = \begin{pmatrix}
0 & 1 & 1 & 1 & \cdots & 1\\
0 & 0 & 1 & -1 & \cdots & 0
\end{pmatrix},
\qquad
q = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.
\]

Using the assumption that ε is normally distributed, we conclude that RB is normally
distributed with mean Rβ and variance $R\,V(B)\,R^\top$. This means that, under the null, the quadratic
form
\[
(RB - q)^\top V(RB)^{-1}(RB - q)
\]
has a chi-squared distribution with J degrees of freedom. Unfortunately, σ² is unknown, so
we need to substitute it with an unbiased estimate σ̂². The resulting test statistic is
\[
\psi = (RB - q)^\top[R\hat{V}(B)R^\top]^{-1}(RB - q), \tag{3.9}
\]
where $\hat{V}(B) = \hat{\sigma}^2(X^\top X)^{-1}$. In large samples, the difference between σ² and its estimate has
limited impact, and the test statistic in (3.9) is approximately distributed as a chi-squared distribution.
Under the usual assumptions we can also derive an exact statistic,
\[
F = \frac{(RB - q)^\top[R(X^\top X)^{-1}R^\top]^{-1}(RB - q)}{J\hat{\sigma}^2},
\]
which under the null follows a Fisher (F) distribution with J and n − (p + 1) d.o.f. As before, large values
of F lead to rejection of the null (can you see why?).
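A Wald-type version of this test can be computed directly from a fitted model in base R. The sketch below uses the mtcars stand-in model and two arbitrary restrictions chosen only for illustration.

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
b   <- coef(fit)
V   <- vcov(fit)                      # V_hat(B) = sigma_hat^2 (X'X)^{-1}

R <- rbind(c(0, 1, 1, 1),             # wt + hp + disp = 1
           c(0, 0, 1, -1))            # hp = disp
q <- c(1, 0)
J <- nrow(R)

Fstat <- t(R %*% b - q) %*% solve(R %*% V %*% t(R)) %*% (R %*% b - q) / J
pval  <- pf(Fstat, J, df.residual(fit), lower.tail = FALSE)
c(F_value = Fstat, p_value = pval)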

3.4.6 Prediction

So far we have seen how to estimate a linear model and how to make inference about it. The model
we have estimated can also be used for prediction. Let us assume we have observed a new vector
of covariates $x_0 = (x_{01}, x_{02}, \ldots, x_{0p})^\top$; then a point estimate of the future observation $y_0$ at the
point $x_0$ is
\[
\hat{y}_0 = x_0^\top\hat{\beta},
\]
where β̂ has been estimated without including the new information $x_0$, i.e. $\hat{\beta} = (X^\top X)^{-1}X^\top y$.
However, in this case a point estimate is not sufficient; we also need to assess the uncertainty in this
prediction. A 100(1 − α)% prediction interval for this future observation is
\[
\hat{y}_0 \pm t_{n-(p+1),\,\frac{1}{2}\alpha}\sqrt{\hat{\sigma}^2\left(1 + x_0^\top(X^\top X)^{-1}x_0\right)}.
\]
We can also be interested in obtaining a prediction of the mean response. In this case the
100(1 − α)% confidence interval is
\[
\hat{y}_0 \pm t_{n-(p+1),\,\frac{1}{2}\alpha}\sqrt{\hat{\sigma}^2\, x_0^\top(X^\top X)^{-1}x_0},
\]

further application details will be discussed in the labs.
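In R, both intervals are returned by predict() on a fitted lm object; the sketch below is again based on the built-in cars data as a stand-in.

fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 21)                                        # the new covariate vector x0

predict(fit, newdata = new, interval = "prediction", level = 0.95)   # future observation y0
predict(fit, newdata = new, interval = "confidence", level = 0.95)   # mean response at x0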

3.5 Regression with qualitative variables

The linear regression theory that we have studied so far can be generalized to the case of qualitative
variables. Sometimes we may need to employ qualitative variables in the analysis, such as
operators, employment status (employed or unemployed), shifts (day, evening, or night), and sex
(male or female). In general, a qualitative variable does not have a natural scale of measurement.
We assign a set of levels to a qualitative variable to account for the effect that the variable may
have on the response. This is done through the use of indicator variables. In econometrics, indicator
variables are sometimes referred to as dummy variables. Let's assume we have the following
dummy variable:
\[
1_{\{Female\}} =
\begin{cases}
0 & \text{if the observation is Male},\\
1 & \text{if the observation is Female}.
\end{cases}
\]
$1_{\{Female\}}$ is a random variable that assumes the value 1 when the ith observation is female and 0
otherwise. The use of the values 0 and 1 is arbitrary. We can assume the following model:

y = β0 + β1 x1 + β2 1{F emale} + ε. (3.10)

To interpret this model we need to consider the case in which the observation is a male.
Therefore,

y = β0 + β1 x1 + β2 (0) + ε
= β0 + β1 x1 + ε. (3.11)

Thus, the relationship between y and x1 when the observation is male turns out to be a straight
line with intercept β0 and slope β1 . For a female we have

y = β0 + β1 x1 + β2 (1) + ε
= (β0 + β2 ) + β1 x1 + ε. (3.12)

Figure 3.6: Response function for different values of 1{Female}.

That is, for females, the relationship between y and x1 is also a straight line with a slope of β1
but an intercept of β0 + β2 .
The two response functions are shown in Figure 3.6. Models (3.11) and (3.12) describe two parallel
regression lines, which means they have a common slope of β1 but different intercepts. Additionally,
the variance of the errors ε is assumed to be the same for both female and male observations.
The parameter β2 represents the difference in heights between the two regression lines, indicating
the measure of the difference in y when transitioning from male to female.
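In R a dummy variable of this kind is created automatically by lm() when the predictor is a factor. The sketch below uses simulated (hypothetical) data and sets "Male" as the baseline level to match model (3.10).

set.seed(1)
n   <- 100
sex <- factor(sample(c("Male", "Female"), n, replace = TRUE),
              levels = c("Male", "Female"))               # baseline level: Male
x1  <- runif(n, 0, 20)
y   <- 20 + 1.5 * x1 + 3 * (sex == "Female") + rnorm(n)   # true beta2 = 3

fit <- lm(y ~ x1 + sex)
coef(fit)   # "sexFemale" estimates beta2, the Female-minus-Male shift in intercept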

3.6 Checking Model Adequacy


Here are the crucial model assumptions of the multiple linear regression model:
(i) Linearity of the relationship between predictors and Y .
(ii) Normality of errors ei . Though the normal distribution theoretically can generate ar-
bitrary real numbers (including very large ones), very extreme values occur under the normal
distribution with a very small probability. For example, the probability that a point is generated
which is further than 5σ away from the mean value (the residuals are centered around 0) is about
5 × 10−7 which means that only one such point can be expected in more than 1.5 millions of
points generated from the same normal distribution. Because outliers have a very strong influence
on least squares regression, the detection of outliers is the most important task connected to the
normality assumption about the i . Another possible important deviation from normality could be
skewness of the distributional shape (many points one one side but much fewer points, some of
them appearing somewhat outlying, on the other side).
Some other deviations from normality are less dangerous. For example, under uniformly dis-
tributed random variation with restricted value range (this may hold, for example, if the y-variable
consists of percentages), the normal theory still works quite well. Generally, you should have
in mind that normality (as well as all the other model assumptions) is an idealization that never
holds precisely in practice. The important question is not whether a distribution is really normal
but whether there are deviations from normality that may lead the applied statistical methodology
(here linear regression) to misleading conclusions about the data.
(iii) Homogeneity of variances("homoscedasticity" as opposed to "heteroscedasticity") of ei .
This implies particularly that the variances do not depend on any of the predictor variables.
(iv) Independence of errors ei (of each other).
It is important to note that these assumptions are related to the residuals. Hence, we are not
able to detect departures from the underlying assumptions by examining the standard summary
statistics (e.g., R2 , RSS, t and F statistics). Since the residuals play a significant role, it is natural
to assume that most of the diagnostic techniques will be based on examining residuals. Some
useful plots are:
(i-a) Matrix plot: The so-called matrix plot (command in R is pairs()) consists of all scat-
terplots of any pair of predictor variables and predictors vs. response, arranged in matrix form.
The matrix plot can (and should) be plotted without having fitted a linear regression (or any other
model) before. The plots of predictor variables vs. the response can be used to assess linear-
ity, outliers and homoscedasticity. Note, however, that there is some danger of over-interpreting
impressions from these plots, because to see the response plotted against every single predictor
variable does not give a full impression. For example, it is possible that an apparently nonlinear
shape of the plot of a single predictor vs. the response is rather caused by values of the other
predictors than by a real violation of linearity (though strong nonlinear or heteroscedastic shapes

hint at real violations of model assumptions in practice in most cases).


The plots of pairs of predictors can reveal collinearity and leverage points (see below), which
are not violations of the model assumptions (there are no model assumptions about the predictor
variables in linear regression), but still problematic.

Figure 3.7: This is an example of a scatterplot matrix based on the mtcars dataset. Where mpg= Miles per
gallon (fuel efficiency), disp= Displacement (cubic inches), hp= Horsepower, drat= Rear axle ratio, wt=
Weight (in thousands of pounds). This type of plot is highly useful for identifying univariate relationships
between the dependent variable (Y) and the independent variables (X’s). Additionally, the matrix helps
in spotting potential outliers or influential data points. While these graphs allow us to identify possible
univariate relationships, they do not reveal multivariate relationships.

(i-b) Residuals and standardized residuals. The residuals are the deviations between the
data points and the corresponding fitted values. They are defined as

ei = yi − ŷi i = 1, . . . , n.

They can be viewed as estimates of the model errors εi , represented by the ei . This
implies that the residuals can serve as a means to evaluate the model's assumptions concerning the
errors. Standardized residuals are obtained by dividing the residuals by their estimated standard
deviation, $\sqrt{\sum_{i=1}^{n}(e_i - \bar{e})^2/(n - p - 1)}$, such that
\[
\tilde{r}_i = \frac{e_i}{\sqrt{\sum_{i=1}^{n}(e_i - \bar{e})^2/(n - p - 1)}}, \qquad i = 1, \ldots, n,
\]

effectively normalizing them to a standard deviation of 1. This normalization assists in gauging


their magnitude. If there are numerous standardized residuals with an absolute value exceeding 2,
it suggests that the error distribution possesses heavier tails than a normal distribution. However,
it’s important to note that approximately 5% of standardized residuals are expected to exceed
a magnitude of 2 even if the errors follow a normal distribution. This is due to the fact that
2 is approximately equal to 1.96, which corresponds to the 97.5-th percentile of the standard
normal distribution. However, theoretically speaking, residuals do not exhibit constant variance
like errors. In fact, it is trivial to show that the covariance matrix of the residuals is

$Var(e) = (I_n - H)\,Var(y)\,(I_n - H)^\top = \sigma^2[I_n - H],$

where
\[
H = X(X^\top X)^{-1}X^\top
\]
is called the hat matrix.


In detail, the variance–covariance matrix of the residual vector has two main elements of
interest:
\[
Var(e_i) = \sigma^2(1 - h_{ii}),
\]
and
\[
Cov(e_i, e_j) = -\sigma^2 h_{ij}, \qquad i \neq j,
\]
where $h_{ij}$ is the $(i, j)$th element of H.


While the errors are assumed to be uncorrelated, the above result shows that the residuals are,
in general, correlated. This also suggests a formula for the standardised residuals. For the ith
observation, the standardised residual is
\[
r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}},
\]
and it follows that these standardised (studentised) residuals have constant variance $Var(r_i) = 1$.
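Both quantities are available in R without computing H explicitly; the sketch below again uses the cars data as a stand-in model.

fit <- lm(dist ~ speed, data = cars)

h <- hatvalues(fit)                 # leverages h_ii
r <- rstandard(fit)                 # standardised residuals e_i / (sigma_hat * sqrt(1 - h_ii))
sum(abs(r) > 2)                     # about 5% are expected beyond |2| under normality

plot(fitted(fit), r, xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = 0, lty = 2)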
(i-c) Residual plots: Using standardized residuals for residual plots is logical, especially when
seeking heteroscedasticity, as they are expected to display consistent variance. However, examin-
ing raw residuals can also be informative.
There are several possible residual plots. For instance in the predictor vs. residuals plot,
the errors are assumed to be independent from the predictor variables, and therefore the residuals
should look randomly scattered when plotted against every single predictor variable. Otherwise,
the plot can reveal non-linearity, heteroscedasticity, autocorrelation (i.e. dependence among them-
selves) of residuals with neighboring values of the predictor and outliers.
(v) Fitted values vs. residuals. If the model is true, the correlation between the residuals and

fitted values is zero . The plot should look randomly scattered. The fitted values are, mathemat-
ically, a linear combinations of the predictors. Therefore, this plot can reveal the same kind of
problems as the predictor vs. residuals plot.
(vi) Observation order vs. residuals. This makes sense if the observation order is informative
(i.e. time). As the errors are assumed to be i.i.d., this plot should look randomly scattered as well.
It often reveals autocorrelation among the residuals, but may show heteroscedasticity as well. (As
long as the observation order is not a predictor in itself, there is no linearity assumption to be
checked here, but strongly nonlinear patterns may still be worth investigation.)

Figure 3.8: The residuals in this plot don’t appear to be randomly distributed around zero; instead, they seem
to follow a specific seasonal pattern. This plot could indicate the presence of potential autocorrelation in
the residuals. The presence of autocorrelation violates assumption (A3) of the model.

(vii) Normal probability plot of residuals. The normal probability plot plots the sorted (standardized)
residuals $r_{(i)}$ (denoting the $i$th smallest residual) against the theoretical quantiles of
the normal distribution, $\Phi^{-1}\left(\frac{i - 0.5}{n}\right)$, which are the "ideal" locations of sorted realizations of a
standard normal distribution. This should look roughly like a straight line. Otherwise, it indicates
deviations from normality, including outliers (which can be seen at the extreme ends of the plot).
If a nonlinearity is observed in a normal probability plot, its interpretation involves examining
where the plot demonstrates convex or concave characteristics. A convex curve signifies that as
one progresses from left to right, the slope of the tangent line increases (as depicted in Figure
3.9(a)). Conversely, if the slope decreases as one moves from left to right, the curve is concave (as
illustrated in Figure 3.9(b)). A curve that is convex-concave combines convexity on the left and
concavity on the right, while a concave-convex curve presents concavity on the left and convexity
on the right (shown in Figure 3.9(c) and (d), respectively).
A normal plot demonstrating convexity, concavity, convex-concave, or concave-convex char-
acteristics indicates left skewness, right skewness, heavier tails (in comparison to the normal dis-
tribution), or lighter tails (in comparison to the normal distribution), respectively. It’s important to
note that these interpretations assume the sample quantiles are on the horizontal axis, and adjust-
ments are required if the sample quantiles are plotted on the vertical axis. The tails of a distribution
refer to the regions located far from its center.

Figure 3.9: As one moves from (a) to (d), the curves are convex, concave, convex- concave, and concave-
convex. Normal plots with these patterns indicate left skewness, right skewness, heavier tails than a normal
distribution, and lighter tails than a normal distribution, respectively, assuming that the data are on the x-
axis and the normal quantiles on the y-axis.

3.6.1 Remedies for violated model assumptions.


(i) Non-linearity. Occasionally, employing transformations on the predictors and/or the response
variables can be beneficial. A linear model might be applicable to certain nonlinear functions
derived from the observed variables. Common initial choices involve taking logarithms, square
roots, squares, or exponentials. (Further details can be found in Chapter 5: Introduction to linear
regression analysis, Montgomery)
(ii) Non-normality of the error distribution. Robust linear regression (not treated in this
course Chapter 5: Introduction to linear regression analysis, Montgomery) may help, particularly
with outliers. In case of skewness, transformations could improve the situation. A possibility is to
apply the same transformation to response and predictors.
(iii) Heteroscedasticity (as opposed to "homoscedasticity"). Possible remedies include weighted
least squares (or iterative reweighting) and transformations.
(iv) Dependence of errors. Sometimes this does not affect regression parameter estimators,
but it does affect standard deviations and confidence intervals. If assumptions about the nature of
dependence can be made, time series models may apply.

• Coefficient of determination, R². This is defined by
\[
R^2 = 1 - \frac{Dev(\hat{e})}{Dev(y)} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \frac{RSS}{TSS},
\]

which is the proportion of the total variation explained by the regression model. We have seen
that the R2 is the square of the correlation between the observed and fitted values . This correlation
is known as the multiple correlation coefficient.
The coefficient of determination (with a maximum value of 1 ) measures how well the model
accounts for the data. Note, however, that it is not directly related to the model assumptions. A
small value of R2 may occur in case that assumptions are violated or that some crucial information
(further predictors) is missing in the data or that the model is fine but the error variance is large, so
that the predictors don’t have a large explanatory or predictive strength. Violations of the model
assumptions are still possible if R2 is relatively high. If the true relationship between predictors
and response is strong and monotone but slightly nonlinear, or in case of heteroscedasticity with
comparably small error variances, a linear regression may still yield a relatively good fit and a high
R2 .
(v) Outliers. Outliers are observations that, in some way, behave differently from the bulk
of the data. They can for instance be extreme in the x-direction or in the y-direction. Hence we
distinguish between the following:

• Regression outliers: these are observations that have an unusual y-value compared to other
observations with similar x-values. If there are only few regression outliers these may show
up in residual plots as having large residuals.

• Leverage points: these are observations that have unusual x-values compared to the bulk
of the data. In a designed experiment they can be prevented but they are very common in
observational data. Note that linear regression does not assume normality for the predictors,
and therefore leverage points do not violate the model assumptions (except if they are "bad",
see below). But they cause instability of the regression in the sense that small modifications
of the data may lead to large changes in the least squares regression estimator.

Depending on whether the y-value is also unusual we further distinguish:


a) Good leverage points: if the y-value is 'in line' with the other y-values. Typically, for a good
leverage point the fitted value $x_i^\top\hat{\beta}$ is similar to the observed value $y_i$, and omitting this observation
will not change the fitted model dramatically.
b) Bad leverage points: if the y-value is also unusual. For a bad leverage point the fitted value
$x_i^\top\hat{\beta}$ can be close to or very different from the observed value $y_i$, depending on the extent of the
leverage effect. An example is presented in Figure 3.10.
Hence leverage describes the potential for affecting the model fit. An intuitive way to measure
the leverage of an observation $(x_i, y_i)$ is to calculate how far away $x_i$ is from the centre of the x-values.
As this is scale dependent we should standardise this distance, which leads to the following
notion of the Mahalanobis distance $MD_i$:
\[
MD_i = \sqrt{(x_i - \bar{x})^\top\hat{\Sigma}_x^{-1}(x_i - \bar{x})},
\]

Figure 3.10: Outliers and leverage points in a simple regression analysis Y ∼ X.

where Σ̂x is the empirical covariance matrix of the x-observations and x̄ is the vector contain-
ing the means. It can be shown that there is a one-to-one relation with the diagonal elements of
the hat matrix:
\[
MD_i^2 = (n - 1)\left(h_{ii} - \frac{1}{n}\right),
\]
where n is the sample size; the leverage of the ith observation is therefore often just measured
by $h_{ii}$. Under multivariate normality of the predictors (which is usually not assumed in regression),
$MD_i^2 \sim \chi^2_{p-1}$, which can be used to assess whether a Mahalanobis distance is unusually large.
Note that leverage points can be prevented in designed experiments.
Least squares is heavily affected by outliers and it is therefore a good idea to check, before
even fitting the data, if there are any obvious unusual observations, e.g., by looking at the matrix
plot (leverage points can be found in the plots of pairs of the predictor variables). However, in
higher dimensions (i.e. more than 2 explanatory variables) we might not be able to spot such
unusual observations by looking at 2-dimensional plots. The residual plots can help, but it has to
be kept in mind that the residuals are computed from the fitted model which might itself already
be affected and distorted by outliers.
The problem with influential observations such as bad leverage points is that they may have
such a large influence on the least squares regression that they actually produce a smaller residual
than other points and cannot be revealed by residual plots.
Another possibility of finding out about the influence of an observation is Cook’s statistic.
This is probably the most popular measure of influence among those proposed. Fit the model
repeatedly, always omitting one observation, and see how that changes the fitted values: the change
in fitted values is $X(\hat{\beta} - \hat{\beta}_{(i)})$, where $\hat{\beta}_{(i)}$ denotes the least squares estimator of β without the ith
observation. Cook's statistic is
\[
D_i = \frac{1}{p\hat{\sigma}^2}\left(\hat{\beta} - \hat{\beta}_{(i)}\right)^\top X^\top X\left(\hat{\beta} - \hat{\beta}_{(i)}\right).
\]
A large value of Di indicates that the ith observation is influential.
The numerator of Di is just the sum of the squared differences of the fitted values with and
without the ith observation. It can also be shown that
\[
D_i = \frac{1}{p}\, r_i^2\,\frac{h_{ii}}{1 - h_{ii}},
\]
which means that Di can be computed easily for all i without the need to actually fit the
regression model dropping one observation at a time.
Unfortunately, neither the Mahalanobis distance nor Cook’s statistic can reliably find all lever-
age points. A reason for this is the so-called masking effect: if there is more than one leverage
point (particularly at about the same location), together they can prevent every single one of them
from producing an unusual value on any of these statistics.
The only possibility to cope with this is to use robust estimators which are less sensitive even
to groups of leverage points. Residual plots with residuals computed from robust estimators are
more informative about outliers.
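Cook's statistic and related influence diagnostics are readily available in R; the following sketch uses the cars data as a stand-in model.

fit <- lm(dist ~ speed, data = cars)

d <- cooks.distance(fit)            # Cook's statistic D_i for each observation
head(sort(d, decreasing = TRUE))    # most influential observations first

plot(fit, which = 4)                # Cook's distance against observation index
plot(fit, which = 5)                # standardised residuals vs leverage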

3.6.2 Multicollinearity
A serious problem that can affect a regression model is multicollinearity. Multicollinearity implies
near linear dependence among the regressors. The regressors are nothing more than the columns
of the design matrix X. The problem can be split into two scenarios:

• Exact multicollinearity means that there is linear dependence among the predictor vari-
ables. In that case, X> X is not invertible and the least squares estimator does not exist.

• Near multicollinearity means that the predictors are nearly linearly dependent. This hap-
pens particularly if some of the predictors are strongly correlated (not exactly). This may be
detected from the matrix plot (though sometimes the interplay of more than two variables
may produce collinearity, and situations with a high number of variables and a relatively
low number of data points are again, dangerous). Even though the least squares estimator
can be computed in this case and it is not a violation of the model assumptions, approximate
collinearity is problematic. The fact that X> X is close to singularity means that some of the
regression parameter estimators (particularly those belonging to strongly correlated predic-
tors) may be very unstable and should be interpreted with care (the covariance matrix of β̂,
may contain some very large variance entries).
An intuitive reason is that if two predictors measure more or less “the same thing” (i.e.
are highly correlated), it cannot be clearly separated what any of them contributes to the

explanation of the response.
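A common way to quantify near multicollinearity is the variance inflation factor, VIF_j = 1/(1 − R_j²), where R_j² is the R² from regressing the j-th predictor on the remaining predictors. A minimal R sketch, using three mtcars columns purely as stand-in predictors:

vif_for <- function(j, data) {
  others <- setdiff(names(data), j)
  r2 <- summary(lm(reformulate(others, response = j), data = data))$r.squared
  1 / (1 - r2)                       # VIF_j = 1 / (1 - R_j^2)
}

X <- mtcars[, c("wt", "hp", "disp")]
sapply(names(X), vif_for, data = X)  # large values flag strongly correlated predictors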



3.7 ANOVA: One-Way Layout


One-way analysis of variance

Suppose there are N experimental units in total and they are allocated to the treatments at random.
Let $y_{ij}$ be the $j$th observation in the $i$th treatment group, where $j = 1, \ldots, n_i$, $i = 1, \ldots, r$, and where
$n_i$ is the sample size of the $i$th group. We must have $N = \sum_i n_i$. We wish to decide whether or
not the group means $\mu_1, \mu_2, \ldots, \mu_r$ are equal, i.e. to test the hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_r$
against $H_1$: at least one pair of group means is unequal.
In this situation we might start by considering the data as observed values of random variables
$Y_{ij}$ satisfying the linear model
\[
Y_{ij} = \mu_i + \varepsilon_{ij} \qquad (i = 1, \ldots, r;\; j = 1, \ldots, n_i), \tag{3.13}
\]
where the $\varepsilon_{ij}$ are independent $N(0, \sigma^2)$ r.v.'s, and where we are assuming a common variance
within groups. Equivalently, if we define $\alpha_i = \mu_i - \mu$, where $\mu = \sum_i n_i\mu_i/N$ is the weighted
average of the $\mu_i$, we can rewrite the model in additive form as
\[
Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}.
\]

Here μ represents the 'overall mean', and $\alpha_i$ is an 'additional' effect of the $i$th treatment. However,
there is a problem, since the model now has r + 1 parameters to describe just r individual
group means: we are estimating the mean of each of the r groups plus the overall mean.
Clearly, these parameters are not all uniquely defined. To overcome this problem we must impose
a constraint on the parameters. The constraint that ensures the intended interpretation is the
sum-to-zero parameterisation defined by $\sum_i n_i\alpha_i = 0$ (other constraints are also possible; their
treatment is out of the scope of this course). In terms of the $\alpha_i$, the null hypothesis can now be
stated as $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$.
Analysis: partition of the total sum of squares. A very convenient way to study this model, at
least for the purpose of testing $H_0$, is to write each observation as a sum of deviations:
\[
y_{ij} = \bar{y}_{..} + (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.}), \tag{3.14}
\]
where $\bar{y}_{1.}, \ldots, \bar{y}_{r.}$ denote the individual group means and $\bar{y}_{..} = \sum_{i=1}^{r} n_i\bar{y}_{i.} / \sum_{i=1}^{r} n_i$ is the
overall mean. Notice the notational convention here: a dot is used to indicate a suffix over which
averaging has taken place.
The three terms on the right-hand side of (3.14) may be thought of as the 'sample' analogues of
$\mu$, $\alpha_i$ and $\varepsilon_{ij}$ respectively - indeed, it can be proved that the first two are the least-squares estimates

of µ and αi . We now consider the total variation of the data about the overall mean ȳ.. .

\[
\sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{r}\sum_{j=1}^{n_i}\left\{(\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.})\right\}^2.
\]
Expanding out the bracket on the right-hand side here, it can be shown (see exercises) that the
cross-product term vanishes to yield
\[
\sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{..})^2 = \sum_{i=1}^{r} n_i(\bar{y}_{i.} - \bar{y}_{..})^2 + \sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2.
\]

The first term on the right-hand side in some sense represents variation among the different
group means: it is called the between-groups sum of squares (s.s) and denoted below by SG :

\[
S_G = \sum_{i=1}^{r} n_i(\bar{y}_{i.} - \bar{y}_{..})^2.
\]

The second term is a sum of contributions representing the variation of observations within
each individual group about the group means: it is called the residual or error or within-groups
sum of squares (s.s.):

\[
S_E = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2 = \sum_{i=1}^{r}(n_i - 1)s_i^2.
\]

Denoting the total variation by ST therefore, we have

ST = SG + SE

It seems intuitively reasonable to assess the plausibility of H0 : α1 = . . . = αr = 0 by


considering the value of SG relative to that of SE . This can be done by computing the variance
ratio, or mean-square ratio

SG /(r − 1)
F = (3.15)
SE /(N − r)

which should be approximately 1 under H0 but larger than 1 otherwise.


The decomposition of the total variation is an algebraic result that does not depend on the as-
sumed probability model. To formulate a test of H0 , however, we need to be able to say something
about the distributions of SE and SG . To do this, we need to make use of the normality assumption.
With a bit of algebra and using some well-known theorems, it can be proved that $S_E/\sigma^2 \sim \chi^2_{(N-r)}$,
that under $H_0$ $S_G/\sigma^2 \sim \chi^2_{(r-1)}$, and that $S_E$ and $S_G$ are independent, where χ² denotes a chi-squared distribution.
Therefore, under $H_0$, the mean-square ratio (3.15) has a Snedecor's F distribution with r − 1 and
N − r degrees of freedom; we will denote this by $F_{(r-1, N-r)}$.
The analysis of variance (ANOVA) table includes the sum of squares, the degrees of freedom (d.f.), the mean
square (m.s.), the mean-square ratio (m.s.r.) and the expected m.s.:

Between groups:  $S_G = \sum_{i=1}^{r} n_i(\bar{y}_{i.} - \bar{y}_{..})^2$,  d.f. r − 1,  m.s. $S_G/(r-1)$,  m.s.r. $F = \frac{S_G/(r-1)}{S_E/(N-r)}$,  expected m.s. $\sigma^2 + \frac{\sum_{i=1}^{r} n_i\alpha_i^2}{r-1}$.
Within groups:   $S_E = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i.})^2$,  d.f. N − r,  m.s. $S_E/(N-r)$,  expected m.s. $\sigma^2$.
Total:           $S_T = \sum_{i=1}^{r}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{..})^2$,  d.f. N − 1.
The expected m.s. column is not included in practice. To perform the test, compute the m.s.r.
F and reject H0 at level α if F ≥ Fα,(r−1,N −r) , the upper 100α% point of the F distribution.
Suppose that we want to construct a confidence interval for the difference $\mu_i - \mu_j$, or equivalently
$(\alpha_i - \alpha_j)$, where $\mu_i$ is the mean of the $i$th group and $\mu_j$ is the mean of the $j$th group. An unbiased
estimator is $\bar{y}_{i.} - \bar{y}_{j.}$, with standard error $S\sqrt{n_i^{-1} + n_j^{-1}}$, where $S = \sqrt{S_E/(N - r)}$ is the estimate
of σ. Therefore a 100(1 − α)% confidence interval for $\mu_i - \mu_j$ is
\[
\bar{y}_{i.} - \bar{y}_{j.} \pm t_{\alpha/2,(N-r)}\, S\sqrt{\frac{1}{n_i} + \frac{1}{n_j}}.
\]

Example

(Clarke & Kempson) In an experiment to compare four different fuel mixtures, mixtures were
randomly assigned to each of 25 motor vehicles and the fuel consumption of each (in km/L ) was
measured under controlled conditions. The results were as follows:
A 13.31 14.04 13.68 13.75 13.12 14.11 13.96
B 14.28 14.47 14.03 15.62 15.10
C 15.04 14.77 15.13 15.45 14.98 15.51
D 14.66 13.93 15.05 14.21 14.42 14.30 14.25
The calculations are most easily presented in tabular form:

Treatments    n_i     y_i.      Σ_j y_ij²     y_i.²/n_i
A             7       95.97     1316.5907     1315.7487
B             5       73.50     1082.1346     1080.4500
C             6       90.88     1376.9344     1376.5291
D             7       100.82    1452.8760     1452.0961
              N = 25  y.. = 361.17  S = 5228.5357  5224.8239

Total s.s. = S − y..²/N = 5228.5357 − (361.17)²/25 = 5228.5357 − 5217.7508 = 10.7849.

Between-groups s.s. = $\sum_{i=1}^{r} y_{i.}^2/n_i$ − y..²/N = 5224.8239 − 5217.7508 = 7.0731.

The ANOVA table is therefore as follows:



Source of variation s.s. d.f. m.s. m.s.r.


Between groups 7.0731 3 2.3577 13.339
Within groups 3.7118 21 0.1768
Total 10.7849 24

The 0.1% point of F with 3 and 21 degrees of freedom is 7.94 . Since the observed value of
F is greater than this, the result is significant at the 0.1% level. There is very strong evidence that
mean fuel consumption differs under the four different mixtures. The four mean values are 13.71,
14.70, 15.15 and 14.40, indicating that treatment C is the best (km/L).
A 99% confidence interval for the difference between treatments C and A is
\[
(15.14 - 13.71) \pm 2.831 \times 0.4204\sqrt{\tfrac{1}{6} + \tfrac{1}{7}},
\]
giving the interval (0.774, 2.099).

Example in R

# Fuel-mixture example (variable names chosen here for readability)
fuel.cons <- c(13.31, 14.04, 13.68, 13.75, 13.12, 14.11, 13.96,
               14.28, 14.47, 14.03, 15.62, 15.10,
               15.04, 14.77, 15.13, 15.45, 14.98, 15.51,
               14.66, 13.93, 15.05, 14.21, 14.42, 14.30, 14.25)

mixture <- as.factor(c(rep('A', 7), rep('B', 5), rep('C', 6), rep('D', 7)))

fit <- aov(fuel.cons ~ mixture)


summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
mixture 3 7.073 2.3577 13.34 4.29e-05 ***
Residuals 21 3.712 0.1768
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Checking model assumptions

The assumptions of the additive model (3.13) are firstly that the observations are from r independent
normal populations, and secondly that the r population variances are equal. Some comments on
these:

1. The F test is fairly robust to departures from normality if the sample sizes are sufficiently
large (central limit effect), but not for very small samples. To check, plot the residuals
$r_{ij} = y_{ij} - \bar{y}_{i.}$ in a Q-Q plot (see the earlier plots for the Example). Remedies: possibly transform the
data or use a different distributional model or use a distribution-free analysis (nonparametric
method).

Figure 3.11: Diagnostic plots ANOVA

2. Independence, and homogeneity within groups, can be controlled to some extent using ran-
domisation. However, even in this case it is useful to plot residuals against other variables
in case of unexpected problems. For example, if an experiment is carried out by taking
measurements over the course of a day then it may be useful to plot the residuals in time
sequence - it may turn out, for instance, that observations taken in the morning are generally
less than those taken in the afternoon. If such an effect is identifiable, a more sophisticated
analysis may be required - for example, using a randomised blocks layout (out of the scope
of this course).

3. Modest failure of the ’constant variance’ assumption is not too serious if the group sizes
are similar. The assumption can be checked by inspection of the within-groups sample
variances s21 , . . . , s2r , or using residual plots. If necessary, the hypothesis H0 : σ12 = · · · = σr2
can be tested formally using Bartlett’s test (be careful though - this test is very sensitive to
departures from normality).

If the variances appear non-constant, a possible remedy is to transform the data using a variance-
stabilising transformation. Such a transformation may also improve normality. For
example, if the model is multiplicative, $Y_{ij} = \mu_i\varepsilon_{ij}$ (so that s.d. ∝ group mean), then transform to
an additive model by taking log $Y_{ij}$.
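A minimal sketch of these checks in R, using the objects created in the fuel-mixture example above (fuel.cons, mixture and fit, whose names were chosen there for readability):

res <- residuals(fit)

qqnorm(res); qqline(res)              # Q-Q plot of the residuals
plot(fitted(fit), res,
     xlab = "Group means", ylab = "Residuals")

tapply(fuel.cons, mixture, var)       # within-group sample variances s_i^2
bartlett.test(fuel.cons ~ mixture)    # formal test of equal group variances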

3.8 Missing Data


In this section, we will discuss some fundamental concepts related to missing data. However, since
the topic covered here is primarily theoretical, we will provide only essential definitions. Practical
applications will be extensively addressed in the labs.

3.8.1 Missing data mechanism


Missing data is a common issue in statistical analysis, and it can cause bias or inaccurate results.
The three types of missing data are missing completely at random (MCAR), missing at random
(MAR), and missing not at random (MNAR).

• MCAR happens when the probability that an observation is missing is the same for all
cases. In other words, the missing data mechanism is completely unrelated to the data. Let's
say you’re conducting a survey about people’s income and education level. You collect data
from a random sample of individuals, but some respondents leave certain fields blank. In
an MCAR scenario, the missing data is not related to the respondents’ income or education
level or any other factor. For instance, consider a situation where you have collected the
following data:

Respondent Income ($) Education Level


1 50000 Bachelor’s
2 75000 Master’s
3 High School
4 60000 Doctorate
5 42000
6 Master’s
Table 3.3: Example of MCAR Data.

In this example, the missing data points (e.g., Respondent 3’s income and Respondent 5’s
education level) are seemingly random and don’t show any particular pattern. The missing
data points occur regardless of the income or education level, demonstrating that the data is
missing completely at random (MCAR).

• MAR happens when the probability of being missing is the same only within groups. Con-
sider a study where researchers are investigating the relationship between income and health
status. They collect data from a sample of participants and ask them to provide their annual
income and self-reported health status on a scale from 1 to 10. However, due to privacy
concerns, some participants choose not to disclose their income.
In this example, the missing income data is related to the health status of the participants.
Those with lower health status scores (like Participant 3) might be less inclined to disclose

Participant Income ($) Health Status


1 45000 7
2 62000 6
3 4
4 55000 8
5 38000 5
6 9
Table 3.4: Example of MAR data.

their income. On the other hand, those with higher health status scores (like Participant 6)
might be more comfortable sharing their income. The missingness of income depends on the
observed variable (health status), but it’s not directly related to the values of the missing data
(income). This scenario demonstrates "Missing At Random" (MAR) because the probability
of income being missing can be explained by other observed variables, making it possible
to handle the missingness through statistical techniques that consider these relationships.
MAR is more general and realistic than MCAR

• MNAR happens when neither the MCAR nor the MAR assumption holds; in this case the
probability that the i-th case is missing varies for reasons that are unknown to us. Imagine a study
that aims to understand the relationship between income and job satisfaction. Participants
are asked to provide their annual income and rate their job satisfaction on a scale from 1 to
10. However, in this scenario, the missingness in income is not only related to participants’
job satisfaction, but it’s also influenced by the actual income itself. Participants with higher
incomes might be less willing to disclose their income, regardless of their job satisfaction
level.

Participant Income ($) Job Satisfaction


1 60000 8
2 6
3 48000 5
4 9
5 55000 7
6 4
Table 3.5: Example of MNAR data.

In this example, the missing income data is influenced by both the job satisfaction level
and the actual income. Participants with higher incomes might choose not to disclose their
income regardless of their job satisfaction, while those with lower incomes might be more
inclined to share their income.
This type of missingness is referred to as "Missing Not At Random" (MNAR) because the
missingness is not solely dependent on observed variables like job satisfaction; it’s influ-

enced by the values of the missing data (income) itself. Handling MNAR data can be more
challenging since the missingness mechanism is not fully explained by observable factors,
making it necessary to carefully consider the potential biases introduced by the missing data
in any analysis.

Overall, identifying the type of missingness in the data is crucial in selecting appropriate methods
to handle missing data. MCAR is the easiest to handle, while MAR can be dealt with using various
methods. MNAR is more challenging and requires a cautious approach to address the missingness
mechanism.

3.8.2 Imputation Methods


Complete case analysis (CCA)

Complete case analysis (CCA) involves excluding all cases (rows) that have at least one missing
entry. If the data is influenced by MCAR (which can be challenging to ascertain in most cases),
CCA yields standard errors and confidence intervals that are accurate for the diminished subset of
data, although relatively larger compared to using all available data. However, if the data is not
MCAR, utilizing CCA introduces significant bias into the estimates of the regression coefficients.
One straightforward method to implement CCA is by utilizing the na.omit() function in R.

Mean Imputation

Mean imputation (MI) involves replacing the missing entries of a random variable Xi (the i-th
column in our dataset) with its sample mean computed from the observed values. It is important to
recognize that through MI, we alter the probabilistic characteristics of the underlying distribution.
Mean imputation offers a quick and straightforward approach to handle missing data; however,
it tends to underestimate the variance of the imputed random variable. Additionally, it disrupts
the interrelations between random variables and introduces bias into all estimates except for the
sample mean.

Regression Imputation

Regression imputation (RI) integrates information from other variables to create more informed
imputed values. The initial stage entails constructing a model based on the available data. Subse-
quently, predictions for the incomplete cases are computed using the established model, and these
predictions are used to substitute the missing data.
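The three approaches can be sketched in R on a small simulated (hypothetical) data set; this is only an illustration of the mechanics, not a recommendation of any particular method.

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x2[sample(n, 20)] <- NA                       # 20 values of x2 are missing
dat <- data.frame(x1 = x1, x2 = x2)

# (1) Complete case analysis: drop every row with at least one NA
cca <- na.omit(dat)

# (2) Mean imputation: replace missing x2 by the observed mean
mi <- dat
mi$x2[is.na(mi$x2)] <- mean(dat$x2, na.rm = TRUE)

# (3) Regression imputation: predict the missing x2 from x1 using the complete cases
imp_fit <- lm(x2 ~ x1, data = dat)            # lm() drops the incomplete rows itself
ri <- dat
ri$x2[is.na(ri$x2)] <- predict(imp_fit, newdata = dat[is.na(dat$x2), ])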

3.9 Exercises
1. Suppose the model for simple linear regression is written as

\[
Y_i = \alpha_0 + \alpha_1(x_i - \bar{x}) + e_i \qquad (i = 1, \ldots, N),
\]
where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
(a) Obtain the least squares estimators of α0 and α1 . (You may obtain these from the matrix
form of the normal equations.)
(b) Obtain the covariance matrix of these least squares estimators under the usual assumptions
about the errors.
(c) Show that
\[
RSS = C_{YY} - \frac{C_{xY}^2}{C_{xx}},
\]
where $C_{xx} = \sum_{i=1}^{N}(x_i - \bar{x})^2$, $C_{xY} = \sum_{i=1}^{N}(x_i - \bar{x})(Y_i - \bar{Y})$ and $C_{YY} = \sum_{i=1}^{N}(Y_i - \bar{Y})^2$.
(d) Refer to the R output below (Figure 3.12). If x denotes log depth and y denotes log flow, you
are given that x̄ = −0.88686, ȳ = 0.21475, Cxx = 1.1482, Cxy = 3.1738, Cyy = 9.3516 using the
notation above.
State how the model defined above is related to the model fitted in the R output. Use the alge-
braic results obtained above and a hand calculator to verify the numerical results for the following
in the R output:

• the least squares estimates of β0 and β1 ,

• the residual standard error,

• estimated covariance matrix of β̂0 and β̂1 .

2. The data can be obtained as a text file [Link] from the course webpage.

Since the volume of a cylinder is the product of its height, the square of its diameter and a
constant, it could make sense to assume a multiplicative model and to fit log VOL
from log HT and log D16. Use R or any other statistics software to fit such a
model. Discuss the results and compare them to the models in handout 3.
Compute a predicted value and 95% prediction and confidence intervals for VOL for a tree with
D16=10 and HT=100 from the log-transformed model fitted here (you need to take into account
the transformation for this). Compare the prediction interval to the one that you get from a model
that fits VOL as a linear function of D16 and HT. Here are some useful R-commands:

Figure 3.12: R output Ex 4

3. Explain in your own words and understandable to a non-statistician (with some solid school
background in maths), what the model assumption of i.i.d. error terms in the general linear
model means. (Don’t assume that the reader knows what an “error term” is.)

4. The matrix formulation of the multiple linear model is given

y = Xβ + ε

You may assume that y and ε have multivariate normal distributions and that the components
of the vector ε are independently and identically distributed (i.i.d.).

(a) Carefully define each of the terms y, Xβ, ε appearing in the model.
(b) Given that X is an n × (p + 1) matrix, state the dimensions of y, β and ε. Explain what
the values n and p refer to.
(c) Define the hat matrix H for an associated general linear model and show that it has the
property H⊤ = H. Clearly state any assumptions about the matrix X and state the
matrix algebra results which are being used.
(d) Define the terms outlier and influential observation. Explain the term leverage with
respect to a multiple linear regression, and explain how it can be used to identify
influential observations.

5. Suppose we are estimating the regression coefficients in a linear regression model by mini-
mizing

$$\min_{\beta\in\Theta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le s, \qquad (3.16)$$

where yi is the response variable for the i-th observation, xij is the value of the j-th predic-
tor for the i-th observation, β0 is the intercept of the model, βj are the coefficients of the

predictors, $\beta^\top = (\beta_0, \beta_1, \ldots, \beta_p) \in \mathbb{R}^{p+1}$ is the coefficient vector including the intercept,
$\Theta$ is its parametric space, $s \ge 0$ is the constraint budget, $n$ denotes the number of
observations, and $p$ the number of predictors (excluding the intercept). Finally, the Residual
Sum of Squares (RSS) is given by $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\big)^2$.
(a) Discuss the role of the constraint $\sum_{j=1}^{p}|\beta_j| \le s$.
(b) Discuss the advantages and disadvantages of the optimization in Equation (3.16).
(c) Provide a geometric interpretation and draw the figure associated with the optimization
in Equation (3.16).
(d) Discuss what happens to the parameters and RSS when s increases.
(e) Discuss what happens to the parameters and RSS when s decreases.

6. Suppose we are estimating the regression coefficients in a linear regression model by mini-
mizing the Ridge Regression cost function. The cost function is defined as:

$$\min_{\beta\in\Theta} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2, \qquad (3.17)$$

where yi is the response variable for the i-th observation, xij is the value of the j-th predic-
tor for the i-th observation, β0 is the intercept of the model, βj are the coefficients of the
predictors, $\beta^\top = (\beta_0, \beta_1, \ldots, \beta_p) \in \mathbb{R}^{p+1}$ is the coefficient vector including the intercept,
$\Theta$ is its parametric space, $\lambda \ge 0$ is the regularization parameter, $n$ denotes the number of
observations, and $p$ the number of predictors (excluding the intercept). Finally, the Residual
Sum of Squares (RSS) is given by $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\big)^2$.

(a) Discuss the role of the penalty term $\lambda\sum_{j=1}^{p}\beta_j^2$.
(b) Discuss the advantages and disadvantages of the optimization in Equation (3.17).
(c) Provide a geometric interpretation and draw the figure associated with the optimization
in Equation (3.17).
(d) Discuss what happens to the parameters and RSS when λ increases.
(e) Discuss what happens to the parameters and RSS when λ decreases.

7. This exercise considers hospital expenditures data provided by the U.S. Agency for Health-
care Research and Quality (AHRQ). The filename is “HospitalCosts”.

(a) Produce a scatterplot, correlation, and linear regression of LNTOTCHG on AGE. Is
AGE a significant predictor of LNTOTCHG?

(b) You are concerned that newborns follow a different pattern than other ages do. Create
a binary variable that indicates whether AGE equals zero. Run a regression using this
binary variable and AGE as explanatory variables. Is the binary variable statistically
significant?
(c) Now examine the sex effect, using the binary variable FEMALE, which is one if the
patient is female and zero otherwise. Run a regression using AGE and FEMALE as
explanatory variables. Run a second regression running the two variables with an
interaction term. Comment on whether the gender effect is important in either model.
(d) Now consider the type of admission, APRDRG, an acronym for “ all patient refined
diagnostic related group.” This is a categorical explanatory variable that provides in-
formation on the type of hospital admission. There are several hundred levels of this
category. For example, level 640 represents admission for a normal newborn, with
neonatal weight greater than or equal to 2.5 kilograms. As another example, level 225
represents admission resulting in an appendectomy.

8. This exercise considers state of Wisconsin lottery sales data, Filename is “WiscLottery”.
You decide to examine the relationship between SALES (y) and all eight explanatory vari-
ables (PERPERHH, MEDSCHYR, MEDHVL, PRCRENT, PRC55P, HHMEDAGE, MED-
INC, and POP).

(a) Fit a regression model of SALES on all eight explanatory variables.


(b) Find R2 . Use it to calculate the correlation coefficient between the observed and fitted
values.
(c) Test whether POP, MEDSCHYR, and MEDHVL are jointly important explanatory
variables for understanding SALES.

After the preliminary analysis, you decide to examine the relationship between SALES (y)
and POP, MEDSCHYR, and MEDHVL.

• Fit a regression model of SALES on these three explanatory variables.


• Has the coefficient of determination decreased from the eight-variable regression model
to the three-variable model? Does this mean that the model is not improved or does it
provide little information? Explain your response.
• To state formally whether one should use the three- or eight-variable model, use a
partial F -test. State your null and alternative hypotheses, decision-making criterion,
and decision-making rules.

Figure 3.13: R output Ex 1


Chapter 4

Variable Selection and Regularization

4.1 Principal Components Regression


There is another alternative strategy to standard least squares (LS) regression that addresses scenar-
ios where multicollinearity among predictors affects the stability and interpretability of a standard
linear regression model. The idea is to reduce dimensionality before applying LS, which involves
transforming the original predictors into a new set of orthogonal components through Principal
Components Regression (PCR).
Principal Components Regression: PCR simplifies the predictor space by utilizing q trans-
formed variables which are linear combinations of the original p predictors, where q < p. These
transformations are defined as
$$Z_k = \sum_{j=1}^{p} \phi_{jk} X_j,$$

for some constants φ1k, φ2k, . . . , φpk, and k = 1, . . . , q. The transformed variables are orthogonal
components, chosen to retain the directions that explain the most variance, which often
leads to more reliable predictions in the presence of high-dimensional data.
Model Formulation: A LS regression model is then fitted using these transformed variables:
$$y_i = \theta_0 + \sum_{k=1}^{q}\theta_k z_{ik}, \quad i = 1, \ldots, n.$$

Instead of estimating p + 1 coefficients, we estimate only q + 1. This reduces the dimensionality


and simplifies the model by focusing on the most significant aspects of the data.
Trade-offs: However, the convenience of PCR comes with a trade-off. By prioritizing variance
over the direct relationship with the response variable, PCR might exclude components that have
less variance but more predictive power regarding the outcome. This results in a loss of some
explanatory power, as the selected principal components may not fully capture the nuances of how
specific predictors influence the response. Consequently, PCR can obscure the interpretative link


between original predictors and the response, making it less useful for understanding the exact
nature of these relationships.
Interestingly, with some re-ordering, the relationship between the transformed predictors and
the response can be expressed as:

$$\sum_{k=1}^{q}\theta_k z_{ik} = \sum_{k=1}^{q}\theta_k\left(\sum_{j=1}^{p}\phi_{jk}x_{ij}\right) = \sum_{j=1}^{p}\beta_j x_{ij},$$
where $\beta_j = \sum_{k=1}^{q}\theta_k\phi_{jk}$, and this is essentially a constraint. Like penalised regression, the co-

efficients are constrained, but now the constraint is quite different—increasing bias but decreasing
variance, yielding benefits for settings that are not “least-squares friendly”.
Desired Properties of Transformations: In determining the φ constants, it is crucial to con-
sider what properties the transformed variables Z should have. Ideally, if the new variables Z are
uncorrelated, this would provide a solution to the problem of multicollinearity. Furthermore, if
the new variables Z capture most of the variability of the X’s, and assuming that this variability
is predictive of the response, the Z variables will be reliable predictors of the response. This is
precisely what PCR aims to achieve.
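
A minimal sketch of PCR in base R, assuming a numeric predictor matrix X, a response y, and q = 2 retained components (all hypothetical choices):

pca <- prcomp(X, center = TRUE, scale. = TRUE)  # principal components of the predictors
q   <- 2
Z   <- pca$x[, 1:q]                             # scores of the first q components
pcr_fit <- lm(y ~ Z)                            # least squares on the transformed variables
summary(pcr_fit)
# the pls package offers this directly, e.g. pls::pcr(y ~ X, ncomp = q, scale = TRUE)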

4.1.1 Principal Component Analysis (PCA)


PCR utilises Principal Component Analysis (PCA) of the data matrix. Now, we wish to change
the perspective: while maintaining the same information of the original p covariates, our aim is to
create new features that encapsulate all the information of the original variables, but in a reduced
number M < p.
Let Z1 , . . . , ZM be linear combinations of the p predictors, where M < p. If this were not the
case, considering a transformation of the covariates would make little sense; in such a scenario, it
would be more logical to directly use the original covariates. Then

Zm = φ1m X1 + φ2m X2 + · · · + φpm Xp

for m = 1, . . . , M, where φ1m, . . . , φpm are constants (they are fixed in our regression
problem, i.e. they are not estimated via OLS).
Assuming we have observed the values of φ1m , . . . , φpm , we can estimate the model by now
considering z1 , . . . , zM . Therefore, we can write

$$y_i = \theta_0 + \theta_1 z_{i1} + \cdots + \theta_M z_{iM} + \epsilon_i, \quad i = 1, \ldots, n, \qquad (4.1)$$

Observing Equation (4.1), we notice that we are no longer using the X's and that, in order to
emphasize the change of variable, we have used another notation for the coefficients, which
are now $\theta_0, \theta_1, \ldots, \theta_M$.
It is well-known in the literature that if the constants φ1m, . . . , φpm are chosen appropriately,
then the model (4.1) performs better than the OLS method applied to X1, . . . , Xp. However, this
is particularly true if there is a strong linear relationship among the p covariates. Indeed, if the
p covariates are independent, it can be shown that estimating (4.1) or applying the OLS
method to X1, . . . , Xp is completely equivalent.
But why are we discussing dimensionality reduction? The reason is that while previously our
problem was to estimate p + 1 coefficients, now we need to estimate ’only’ M + 1. The advantage
of this approach is particularly evident in high-dimensional contexts, namely when $p \gg n$ or p is
very close to n. In these situations, there is a risk of overfitting the model, leading to imprecise
and highly variable estimates. By shifting from p+1 to M +1, we are reducing the dimensionality
while preserving the information of the original covariates.
As previously discussed, PCA (Principal Component Analysis) is a data reduction technique.
When analyzing large datasets in which one or more covariates are strongly correlated, instead of
discarding these correlated covariates due to potential multicollinearity issues, PCA allows for the
condensation of the information contained in p covariates into M < p new features.
Now, we are introducing PCA (Principal Component Analysis) in a regression context, but it
can also be useful in visualization scenarios. Indeed, in data visualization, representing more than
3 dimensions becomes challenging. Techniques to enhance visual dimensionality are well-known
(such as using colors, different shapes, various sizes, etc.), but sometimes they are not sufficient.
In these cases, PCA can serve as a valuable tool to compress dimensions and facilitate easier
visualization. If we think about it, this is the same mechanism by which, when we have to handle a
large file, we prefer to compress the file before sending it via email.
For example, take the first principal component based on the set X1 , . . . , Xp ; this can be written
as

Z1 = φ11 X1 + φ21 X2 + · · · + φp1 Xp ,

where $(\phi_{11}, \ldots, \phi_{p1})$ are the loadings of the first principal component. In order to uniquely
identify $Z_1$, we must impose the following constraint: $\sum_{j=1}^{p}\phi_{j1}^2 = 1$. This constraint limits the
variance of $Z_1$, preventing it from growing indefinitely.


Given that X is an n × p matrix, the first step before calculating the principal components
is to center the data matrix X. This step is reasonable since we are constructing the principal
components to capture as much information as possible. Since variability is synonymous with
information in data, we are interested in variance. Let

z1 = φ11 x1 + φ21 x2 + · · · + φp1 xp ,



the first principal component based on the sample observations. Our goal is to identify the
coefficients φ11 , . . . , φp1 (the loadings) such that they result in the maximum possible variance for
$Z_1$, given that $\sum_{j=1}^{p}\phi_{j1}^2 = 1$. Therefore
$$\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}}\; n^{-1}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{j1}x_{ij}\right)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\phi_{j1}^2 = 1 \qquad (4.2)$$

In (4.2) we can write the objective function as $n^{-1}\sum_{i=1}^{n} z_{i1}^2$, because the X's have been
centered and therefore have a mean equal to zero. We refer to $z_{11}, \ldots, z_{n1}$ as the scores of the first
principal component.
Upon determining Z1 , our next step is to compute Z2 , which represents the second principal
component. This component is derived as a linear combination of X1 , . . . , Xp that maximizes
variance under specific conditions: the loadings must adhere to the constraint $\sum_{j=1}^{p}\phi_{j2}^2 = 1$.
Additionally, it is imperative that $Z_2$ is orthogonal to $Z_1$, ensuring that the two components are
uncorrelated.
The requirement for Z2 to be uncorrelated with Z1 makes perfect sense, as we do not want to
risk incorporating redundant information into Z2 that has already been captured in Z1 . We can
express the second principal component as follows:

zi2 = φ12 xi1 + φ22 xi2 + · · · + φp2 xip

where φ12 , . . . , φp2 are the loadings of the second principal component. This same reasoning
is then extended to Z3 , which, in addition to the usual constraint on the loadings, should also be
uncorrelated with both Z1 and Z2 .
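
A short sketch checking these properties with prcomp(), assuming a numeric data matrix X (hypothetical object):

pca <- prcomp(X, center = TRUE, scale. = FALSE)
phi <- pca$rotation                 # columns are the loading vectors
z   <- pca$x                        # columns are the scores z_1, z_2, ...
colSums(phi^2)                      # each loading vector satisfies sum(phi_j^2) = 1
round(cor(z[, 1], z[, 2]), 10)      # essentially zero: Z_1 and Z_2 are uncorrelated
apply(z, 2, var)                    # component variances, in decreasing order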

4.2 Ridge regression


In multiple regression it is shown that parameter estimates based on minimum residual sum of
squares have a high probability of being unsatisfactory, if not incorrect, if the prediction vectors
are not orthogonal. A way to reduce the variance of β̂ OLS (i.e., the parameters obtained via the
ordinary least squares method) is to shrink some of the estimated coefficients towards zero. Ridge
Regression (Hoerl and Kennard, 1970) solves the following optimization problem:
 
$$\left(\hat{\mu}^R_\lambda, \hat{\beta}^R_\lambda\right) = \underset{(\mu,\beta)\in\mathbb{R}\times\mathbb{R}^p}{\arg\min}\; \left\{\|Y - \mu\mathbf{1} - X\beta\|_2^2 + \lambda\|\beta\|_2^2\right\}$$

Where 1 = (1, . . . , 1)T ∈ Rn and β ∈ Rp is a vector of dimension p, X ∈ Rn×p is our data


matrix and λ is a non-negative scalar. The first thing we notice is that the optimization problem
in Ridge is very similar to that of linear regression. The difference is the addition of an extra

penalty term $\|\beta\|_2^2$, which geometrically can be interpreted as a $p$-sphere. The parameter $\lambda \ge 0$,
controls the degree of penalty towards zero of the parameters. With λ ≡ 0 we return to a classic
linear regression problem (unpenalized estimation problem) while a value of λ tending to infinity
would force all coefficients to assume very small values but never exactly zero. λ is often called
the tuning parameter or regularization parameter. In the optimization problem, we explicitly
included a non-penalized intercept term.
Consider the case where we are working with a temperature variable, which may be recorded in Kelvin
or in degrees Celsius: with ordinary least squares, such a change of units simply rescales the corresponding
coefficient and leaves the fitted values unchanged. The ridge solution, however, is not invariant to scale
transformations of the variables, hence it is good and correct practice to center each column of X (making
them orthogonal to the intercept) and then scale them to have an $\ell_2$ norm equal to $\sqrt{n}$.
It is straightforward to show that, after having standardized the data in the design matrix $X$,
$\hat{\mu}^R_\lambda = \bar{Y} := \sum_{i=1}^{n} Y_i/n$; replacing $Y_i$ with $Y_i - \bar{Y}$ then gives $\sum_{i=1}^{n} Y_i = 0$,
thus removing $\mu$ from the objective function. Ridge estimates have the following form:

$$\hat{\beta}^R_\lambda = \left(X^\top X + \lambda I\right)^{-1} X^\top Y$$

The addition of the term $\lambda I$ stabilizes the inversion of $X^\top X$. Remember indeed that when $X$
does not have full rank, the estimates obtained with the least squares method are unstable.
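
A minimal sketch of this estimator, assuming a centred and scaled predictor matrix X and a centred response y (hypothetical objects):

lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)  # (X'X + lambda*I)^{-1} X'y
# glmnet fits the same family of models: library(glmnet); glmnet(X, y, alpha = 0)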
An important theorem now comes into play.
Theorem. Assume that X is a full rank matrix. Let β̂ OLS be the estimates obtained with the
ordinary least squares method (OLS), β̂ R those obtained with the Ridge Regression method and
$\beta^0$ the true parameter vector. For sufficiently small $\lambda$,
$$E\left[\left(\hat{\beta}^{OLS}-\beta^0\right)\left(\hat{\beta}^{OLS}-\beta^0\right)^\top\right] - E\left[\left(\hat{\beta}^R_\lambda-\beta^0\right)\left(\hat{\beta}^R_\lambda-\beta^0\right)^\top\right]$$

is positive definite.
Proof. First, let us calculate the bias of β̂λR . Ignoring for the moment the subscript λ and the
subscript R for convenience.
$$E(\hat{\beta}) - \beta^0 = \left(X^\top X + \lambda I\right)^{-1}X^\top X\beta^0 - \beta^0 = \left(X^\top X + \lambda I\right)^{-1}\left(X^\top X + \lambda I - \lambda I\right)\beta^0 - \beta^0 = -\lambda\left(X^\top X + \lambda I\right)^{-1}\beta^0.$$

Now let’s think about the variance of β̂.

$$\mathrm{Var}(\hat{\beta}) = E\left[\left\{\left(X^\top X+\lambda I\right)^{-1}X^\top\varepsilon\right\}\left\{\left(X^\top X+\lambda I\right)^{-1}X^\top\varepsilon\right\}^\top\right] = \sigma^2 \left(X^\top X+\lambda I\right)^{-1}X^\top X\left(X^\top X+\lambda I\right)^{-1}.$$

So,
$$E\left[\left(\hat{\beta}^{OLS}-\beta^0\right)\left(\hat{\beta}^{OLS}-\beta^0\right)^\top\right] - E\left[\left(\hat{\beta}-\beta^0\right)\left(\hat{\beta}-\beta^0\right)^\top\right]$$
is equal to
$$\sigma^2\left(X^\top X\right)^{-1} - \sigma^2\left(X^\top X+\lambda I\right)^{-1}X^\top X\left(X^\top X+\lambda I\right)^{-1} - \lambda^2\left(X^\top X+\lambda I\right)^{-1}\beta^0\beta^{0\top}\left(X^\top X+\lambda I\right)^{-1}.$$
After some simplifications, we have that
$$\lambda\left(X^\top X+\lambda I\right)^{-1}\left[\sigma^2\left\{2I + \lambda\left(X^\top X\right)^{-1}\right\} - \lambda\beta^0\beta^{0\top}\right]\left(X^\top X+\lambda I\right)^{-1}.$$

Therefore, we have that
$$E\left[\left(\hat{\beta}^{OLS}-\beta^0\right)\left(\hat{\beta}^{OLS}-\beta^0\right)^\top\right] - E\left[\left(\hat{\beta}-\beta^0\right)\left(\hat{\beta}-\beta^0\right)^\top\right]$$
is positive definite for $\lambda > 0$ if and only if
$$\sigma^2\left\{2I + \lambda\left(X^\top X\right)^{-1}\right\} - \lambda\beta^0\beta^{0\top}$$
is positive definite, which turns out to be true for sufficiently small values of $\lambda > 0$ (we can
take $0 < \lambda < 2\sigma^2/\|\beta^0\|_2^2$). The theorem therefore assures us that $\hat{\beta}^R_\lambda$ performs better than $\hat{\beta}^{OLS}$
provided that λ is chosen appropriately. In order to be able to use ridge regression efficiently, we
must define a way to select a reasonable value for λ (this will be the subject of further studies).
What this theorem does not tell us is when we expect Ridge to perform well. To discuss this point,
we need to explore the relationship between ridge regression and SVD.

4.2.1 Ridge and Singular Value Decomposition (optional reading)


The Singular Value Decomposition (SVD) is a generalization of the eigenvalue-based decompo-
sition of a square matrix (square meaning that the number of rows is equal to the number of
columns). The SVD allows this idea to be generalized and to factorize any X ∈ Rn×p , even in
contexts where one does not work with square matrices. Thus, we have

$$X = UDV^\top$$

where U ∈ Rn×n and V ∈ Rp×p are orthogonal matrices and D ∈ Rn×p has diagonal elements
such that D11 ≥ D22 ≥ · · · ≥ Dmm ≥ 0, where m := min(n, p), and all other elements of D are
zero. The r-th columns of U and V are known as the r-th left and right singular vectors of X, and
Drr is the r-th singular value.
When n > p, we can replace U with its first p columns and D with its first p rows to obtain
another version of the SVD (sometimes known as the thin SVD). Then $X = UDV^\top$, where $U \in \mathbb{R}^{n\times p}$
has orthonormal columns (but is no longer square) and D is a square diagonal matrix. There is an
equivalent version when p > n.

Let X ∈ Rn×p be our matrix of predictors and suppose n ≥ p. Using the (thin) SVD, we can
write the fitted values of ridge regression in the following way:

$$X\hat{\beta}^R_\lambda = X\left(X^\top X+\lambda I\right)^{-1}X^\top Y = UDV^\top\left(VD^2V^\top+\lambda I\right)^{-1}VDU^\top Y = UD\left(D^2+\lambda I\right)^{-1}DU^\top Y = \sum_{j=1}^{p}U_j\frac{D_{jj}^2}{D_{jj}^2+\lambda}U_j^\top Y,$$

where Uj is the j-th column of U . For comparison, the fitted values from ordinary least squares
regression (OLS) (when X has full rank) are

$$X\hat{\beta}^{OLS} = X\left(X^\top X\right)^{-1}X^\top Y = UU^\top Y$$

Both OLS regression and ridge regression compute the coordinates of Y with respect to the

columns of $U$. Ridge regression then shrinks these coordinates by the factors $D_{jj}^2/(D_{jj}^2+\lambda)$; if
$D_{jj}$ is small, the amount of shrinkage will be greater. Adding another layer, observe that the SVD is
intimately connected to Principal Component Analysis (PCA). Consider $v \in \mathbb{R}^p$ with $\|v\|_2 = 1$.
Since the columns of X have had their means subtracted, the sample variance of Xv ∈ Rn is

$$\frac{1}{n}v^\top X^\top Xv = \frac{1}{n}v^\top VD^2V^\top v.$$

Writing $a = V^\top v$, thus $\|a\|_2 = 1$, we have
$$\frac{1}{n}v^\top VD^2V^\top v = \frac{1}{n}a^\top D^2 a = \frac{1}{n}\sum_j a_j^2 D_{jj}^2 \le \frac{1}{n}D_{11}^2\sum_j a_j^2 = \frac{1}{n}D_{11}^2.$$

Since $\|XV_1\|_2^2/n = D_{11}^2/n$, $V_1$ determines the linear combination of the columns of $X$ that has
the greatest sample variance, when the coefficients of the linear combination are constrained to
have an $\ell_2$ norm of 1. $XV_1 = D_{11}U_1$ is known as the first principal component of $X$. The subsequent
principal components $D_{22}U_2, \ldots, D_{pp}U_p$ have a maximum variance of $D_{jj}^2/n$, provided
they are orthogonal to all those preceding; see the list of examples 1 for details.

Returning to ridge regression, we see that it further reduces Y in the smaller principal compo-
nents of X. Therefore, it will perform well when most of the signal is found in the larger principal
components of X.
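
A small sketch verifying this representation numerically, assuming a centred predictor matrix X and response y (hypothetical objects):

sv <- svd(X)                                             # thin SVD: X = U D V'
lambda <- 1
shrink <- sv$d^2 / (sv$d^2 + lambda)                     # factors D_jj^2 / (D_jj^2 + lambda)
fitted_ridge <- sv$u %*% (shrink * crossprod(sv$u, y))   # ridge fitted values
fitted_ols   <- sv$u %*% crossprod(sv$u, y)              # OLS fitted values U U' y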

4.3 Variable Selection


Considering again the multiple regression model in the form

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, \ldots, n.$$

In many applied fields such as Economics, Biology, Psychology, and Social Sciences, there
is often an accumulation of large datasets, which may not all be informative. Specifically, it is
common to handle datasets with many columns (where the number of variables p is very large)
and there arises a need to select only a subset of relevant variables. The reasons for this might
include:
1. Theoretical reasons might suggest that only a few covariates explain the dependent variable
Y . For example, an advertising company, having collected a myriad of consumer data,
may want to focus on a few important characteristics to optimize and channel advertising
expenditures more efficiently.

2. Often, situations arise where p > n. In these cases, the problem of variable selection
takes on a completely different aspect. It is not feasible to estimate a regression model (or
any model in general) when the number of parameters to be estimated exceeds the number
of observations. For instance, to estimate a mean, at least two observations are required.
Estimating a mean with just one observation does not make sense and is not even possible.
This reasoning applies to all statistics. In general, to estimate p parameters, at least the same
number of observations is needed. In other words, if the number of regression parameters
p is large compared to the number of observations n, X> X is often close to collinearity
and the estimation of the parameters can be very unstable. Getting rid of some variables
(and therefore some parameters) can improve the situation. (A rule of thumb is that 10-
20 observations are needed per estimated regression parameter in order to estimate them
accurately enough.)

3. Sometimes, if the aim is prediction, it may be expensive (or, e.g. in some medical appli-
cations, risky or painful) to observe some of the explanatory variables and it would be bene-
ficial if it turned out to be unnecessary to measure some of the variables in future in order to
still achieve an acceptable predictive accuracy. (Note, however, that in such a situation often
not all variables are equally costly. Individual tests whether the most expensive variables
are needed may then be more sensible than the application of general methods of variable
selection.)
There are also arguments against variable selection:

1. In terms of predictive accuracy, as long as enough observations are available, it is usually


much worse to leave out variables that are really important than to keep variables in the

model of which the true coefficients are (about) zero. The reason is that with enough ob-
servations the true zero coefficients will be estimated to be about zero anyway. There- fore,
variable selection should not be done unless there is a real need for it (i.e. one of the reasons
given above).

2. If some of the variables in the full model are highly correlated (i.e. they measure more or
less the same thing), usually not all of these variables are needed, but the decision about
precisely which variable should be kept in the model and which should be left out is quite
arbitrary. Therefore, variable selection is often unstable and the chosen variables have to
be interpreted with care concerning explanation and causal inference. Particularly, it cannot
be taken for granted that a variable left out by variable selection is “really unimportant”,
because it may just be “represented” by another variable still in the model.

4.3.1 Criteria to Compare Models

Several criteria have been suggested in the literature that take into account the number of
explanatory variables in the model. In the present course, only two of them are introduced
explicitly. More can be found, e.g., in Hastie, Tibshirani, and Friedman (2001).

Akaike Information Criterion (AIC)

For quite general models fitted by maximum likelihood, the Akaike Information Criterion
(AIC) is defined as:
$$\mathrm{AIC} = -2\,\hat{\ell}(\text{model}) + 2p$$
where $\hat{\ell}(\text{model})$ is the maximum of the log-likelihood function, and $p$ is the number of
regression parameters in the model (including the intercept), i.e., $p = k + 1$ for a model with
$k$ explanatory variables.
AIC is reported in R-software. Because a good model has a high likelihood and (hopefully)
not many parameters, we look for models with the smallest AIC.
Under the assumptions of this chapter, $\hat{\ell}(\text{model})$ can be described by:
$$\hat{\ell}(\text{model}) = -\frac{n}{2}\log(\sigma^2) - \frac{\mathrm{RSS}}{2\sigma^2} + \text{constant}.$$

Here, $\sigma^2$ is usually unknown and can be estimated by its Maximum Likelihood Estimator
(MLE) $\hat{\sigma}^2_{ML} = \mathrm{RSS}/n$. Hence, the AIC becomes:
$$\mathrm{AIC} = n\log\left(\frac{\mathrm{RSS}}{n}\right) + 2p.$$

The AIC is a so-called "penalised likelihood-method", because the term 2p can be inter-
preted as a penalty for too large models, correcting for the fact that the RSS is always
decreased by increasing the model. For fixed k, the AIC just chooses the model with the
smallest RSS.

Another criterion that is often used is the Bayesian Information Criterion (BIC)

$$\mathrm{BIC} = -2\,\hat{\ell}(\text{model}) + p\log(n)$$
where $\hat{\ell}(\text{model})$ is the maximized log-likelihood, $p$ is the number of parameters in the model
including the intercept, and $n$ is the number of observations. The term $p\log(n)$ acts as a penalty
for complexity, particularly effective in larger samples, helping to avoid overfitting by penalizing
models with more parameters.

Other functions of the RSS penalising large models have been suggested, such as Mallows
$C_p$:
$$C_p = \frac{\mathrm{RSS}}{\hat{\sigma}^2} - n + 2p$$

and the so-called “adjusted R2 ”, which delivers a number between 0 and 1 like R2 but is
maximised if σ̂ 2 is minimised. This is, however, a very “soft” criterion which often does not
reduce the number of variables enough (or even not at all).
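
In R, AIC() and BIC() can be applied directly to fitted lm (or glm) objects; a minimal sketch comparing two candidate models (formulas and data are hypothetical):

fit1 <- lm(y ~ x1, data = dat)
fit2 <- lm(y ~ x1 + x2, data = dat)
AIC(fit1, fit2)  # smaller values are preferred
BIC(fit1, fit2)  # the p*log(n) penalty is harsher for the larger model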

In the following, you will explore various covariate selection techniques that have been
proposed in the literature; those presented are just a small part. The important thing is that
model selection should be done using appropriate criteria based on the likelihood function
(e.g., AIC, BIC) and should never be done by comparing the p-value of the coefficients.

4.3.2 Best subset selection

In regression analysis, particularly when dealing with multiple explanatory variables, a cru-
cial task is selecting the most effective model from a potentially vast number of possibilities.
With $p$ explanatory variables, you can construct $2^p$ different regression models, each repre-
senting a unique combination of variables.

Best Subset Selection: Best subset selection involves evaluating all $2^p$ possible models to
identify the one that performs best according to certain criteria. We can break down the
algorithm into the following steps:

(a) Evaluate Models for Each Subset Size:



• For each possible number of explanatory variables, k, ranging from 1 to p, identify


the best model.
• The “best” model for a given k is typically defined by the smallest Residual Sum of
Squares (RSS). A smaller RSS indicates that the model fits the data more closely.
• Minimizing RSS is mathematically equivalent to minimizing the estimated vari-
ance of the errors (σ̂ 2 ), maximizing the coefficient of determination (R2 ), and
minimizing the p-value of the F-test. The F-test checks the null hypothesis that all
regression coefficients are zero against the alternative that at least one is not.
(b) Comparing Across Different k:
• Adding more variables to a model will always decrease the RSS and increase R2
because each additional variable can explain a part of the variance in the data,
regardless of whether it is statistically significant.
• However, the challenge arises when comparing models with different numbers
of variables (k). An increase in k usually means an improved fit (lower RSS) but
doesn’t necessarily indicate a better model due to potential overfitting or inclusion
of irrelevant variables.
(c) Statistical Challenges:
• The model with l > k variables does not necessarily include all the variables of the
best model with k variables. This non-nesting of models complicates comparisons
using standard t- and F-tests, which assume model nesting.

Advanced Model Comparison Techniques:

• Since comparing models of different sizes directly using RSS, R2 , or F-tests is prob-
lematic, other criteria such as Akaike Information Criterion (AIC) or Bayesian Infor-
mation Criterion (BIC) are used. These criteria help balance model fit with complexity,
penalizing the addition of extraneous variables.

Considering Multiple Good Models:

• It’s often advisable to identify several models that perform well rather than selecting
a single “best” model. This approach acknowledges that data can support multiple
plausible explanatory frameworks, especially in complex real-world scenarios.

This structured approach ensures that you not only find models that fit the data well but
also guard against overfitting, thereby enhancing the robustness and interpretability of your
results.

Algorithm 1 Best Subset Selection Algorithm for Implementation in a Programming Language (e.g., in R or Python).
Input: Set of p explanatory variables, dataset D
Output: Best model for each subset size
for k = 1 to p do
    Initialize BestModel_k ← null
    Initialize BestScore_k ← ∞
    for all subsets S of size k of the explanatory variables do
        Fit model M_S using variables in subset S
        Calculate RSS or σ̂² for model M_S
        if RSS of M_S < BestScore_k then
            BestModel_k ← M_S
            BestScore_k ← RSS of M_S
        end if
    end for
    Print “Best model for k variables: ”, BestModel_k
end for
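
In R, this exhaustive search is implemented, for example, by the regsubsets() function in the leaps package; a minimal sketch, assuming a data frame dat with response y (hypothetical names):

library(leaps)
best <- regsubsets(y ~ ., data = dat, nvmax = 8)  # best model of each size, up to 8 variables
summary(best)$which   # which variables enter the best model of each size
summary(best)$bic     # BIC of the best model of each size, to compare across k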

4.3.3 Stepwise Methods


Stepwise methods are utilized to address the challenge of variable selection, particularly when the
number of explanatory variables is large, making best subset selection computationally unfeasible.
These methods streamline the process by significantly reducing the number of candidate models
that need to be evaluated.

Backward Elimination
Backward elimination begins with a full model that includes all explanatory variables and system-
atically reduces the number of variables one at a time based on certain criteria:

Algorithm 2 Backward Elimination
Input: Dataset D, number of variables p
Output: Optimal model
Start with the full model using all p variables
Set k = p
while k > 0 do
    Fit all k models, each excluding one of the variables from the current model
    Select the model with the minimal Residual Sum of Squares (RSS)
    Update the current model to this new model
    k = k − 1
end while
return the final model

The total number of models evaluated is reduced to $p(p+1)/2$. After completing the backward
sequence, selection criteria such as AIC, BIC, or cross-validation are typically used to determine the
optimal model. Nested model comparison is feasible due to the sequential nature of the method,
allowing the use of t-tests or F-tests.

Forward Selection
Forward selection starts with no variables in the model and adds one variable at a time, assessing
each new model’s performance:

Algorithm 3 Forward Selection
Input: Dataset D, number of variables p
Output: Optimal model
Begin with a model with no variables
Set k = 0
while k < p do
    Fit all p − k models, each adding one new variable to the current model
    Select the model with the lowest RSS
    Update the current model to this new model
    k = k + 1
end while
return the final model

This method also generates a nested sequence of models, with the total number of models
evaluated again being $p(p+1)/2$. The same selection criteria used in backward elimination can be applied
to choose the optimal model. However, less stringent significance levels might be considered to
prevent the exclusion of informative variables prematurely.

Some General Remarks About Stepwise Methods


Stepwise methods generate heuristically reasonable sequences of models; however, they are not
guaranteed to identify the best subset model. This is due to the fact that not all possible subsets of
the covariates are evaluated. Consequently, best subset selection is superior as long as it remains
computationally feasible.
Backward elimination and forward selection may not converge on the same sequence of mod-
els, nor necessarily identify the same "best" model. These methods are often quite unstable, mean-
ing that minor changes in the data can result in significantly different model selections. This issue
is somewhat less pronounced in best subset selection but still present.
Empirical evidence often shows that backward elimination tends to outperform forward selec-
tion in terms of prediction quality. However, if the number of observations n is very small relative
to the number of covariates p (specifically, if n < 2p), the initial model in backward elimination
becomes very unstable, making forward selection a preferable approach.
In particular fields such as genetics, where datasets may have n < p, backward elimination
is not feasible (nor is fitting any regression model with more than n − 2 variables), but forward
selection can still be effectively applied for sufficiently small values of k.

Implementation of Stepwise Methods in Statistical Software


Implementing stepwise selection methods can be efficiently done using popular statistical software
such as R and Python. Below, we outline the functions and packages commonly used to apply
backward elimination and forward selection.

Implementation in R
In R, the step function from the stats package is typically used for both backward elimination
and forward selection. This function performs stepwise model selection by AIC.

• Backward Elimination: You can start with a full model and use the direction="backward"
argument to perform backward elimination.

full.model <- lm(Y ~ ., data=dataset)
backward.model <- step(full.model, direction="backward")

• Forward Selection: Start with a model with no predictors, and use direction="forward"
to add variables one by one.

null.model <- lm(Y ~ 1, data=dataset)
forward.model <- step(null.model, direction="forward",
                      scope=formula(lm(Y ~ ., data=dataset)))

4.3.4 The Lasso


The Lasso (Least Absolute Shrinkage and Selection Operator) method is a regression technique
used for variable selection and regularization, particularly effective in scenarios involving collinear-
ity or a high number of predictors. Compared to other methods, Lasso offers greater stability and
typically results in lower prediction errors. However, it introduces a moderate bias and may not
perform optimally when true regression coefficients are either very close to zero or very large.

Standardization of Variables
To prepare for Lasso regression (similarly as we do for the Ridge), original variables (denoted as
z1 , . . . , zm ) must first be standardized. Each variable zj is transformed into xij such that the new
variables have zero mean and unit standard deviation, calculated using:

$$x_{ij} = \frac{z_{ij} - \bar{z}_j}{s_j}$$
where $\bar{z}_j$ and $s_j$ represent the mean and standard deviation of the $j$-th variable, respectively. This
standardization is crucial as it ensures the comparability of regression coefficients, reflecting their
relative contributions to the model without implying their absolute importance due to potential
dependencies among variables.

Lasso Estimation
The Lasso estimator β̂ L is derived by minimizing the sum of squared errors:

$$S(\beta) = \sum_{i=1}^{n}\left(Y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$

subject to the constraint that the sum of the absolute values of the coefficients does not exceed a
predefined constant t:
$$\sum_{j=1}^{p}|\beta_j| \le t$$

Notice the similarity with the ridge regression problem: in this case the $\ell_2$ penalty $\sum_{j=1}^{p}\beta_j^2$ is
replaced with the $\ell_1$ penalty $\sum_{j=1}^{p}|\beta_j|$, where $|\cdot|$ denotes the absolute value. This latter constraint
makes the solution nonlinear in the $y_i$, hence there is no closed-form expression.


Computing the lasso solution is a quadratic programming problem; efficient algorithms are
available for computing the entire path of solutions as $\lambda$ is varied, with the same computational
cost as for ridge regression. Because of the nature of the constraint, making $t$ sufficiently small
will cause some of the coefficients to be exactly zero. Thus, the lasso does a kind of continuous
subset selection. If $t$ is chosen larger than $t_0 = \sum_{j=1}^{p}|\hat{\beta}_j^{OLS}|$ (where $\hat{\beta}_j^{OLS}$ are the least squares
estimates), then the lasso estimates coincide with the least squares estimates. On the other hand, for $t = t_0/2$ say, the
least squares coefficients are shrunk by about 50% on average. The problem is that $t$ should be
adaptively chosen to minimize an estimate of expected prediction error.
Lasso is known as a shrinkage method because it reduces the absolute values of the coef-
ficients. This reduction can significantly improve model accuracy by mitigating overfitting and
biases associated with variable selection, common in methods like stepwise or best subset selec-
tion. However, the optimal degree of shrinkage depends on the true coefficient values and the error
variance σ 2 , posing challenges in setting a definitive guideline for t. To perform Lasso regression
in R, one of the most widely used packages is glmnet. This package not only supports Lasso (`1
regularization) but also Elastic Net models, which combine `1 and `2 regularization techniques.
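
A minimal sketch with glmnet, assuming a numeric predictor matrix x and a response vector y (hypothetical objects); cross-validation is used to choose the tuning parameter:

library(glmnet)
fit <- glmnet(x, y, alpha = 1)       # alpha = 1 gives the lasso; alpha = 0 gives ridge
plot(fit, xvar = "lambda")           # coefficient paths as the penalty varies
cvfit <- cv.glmnet(x, y, alpha = 1)  # cross-validated choice of lambda
coef(cvfit, s = "lambda.min")        # note that some coefficients are exactly zero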

Figure 4.1: The estimation approaches for the lasso (left) and ridge regression (right) are depicted in the figures below.
Shown are the contours of the error functions and the constraint functions. The solid blue areas represent the constraint
regions for lasso and ridge regression, defined by |β1 | + |β2 | ≤ t and β12 + β22 ≤ t2 , respectively. Meanwhile, the
red ellipses illustrate the contours of the least squares error function. source: Hastie, T., Tibshirani, R., Friedman, J.
H., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (Vol. 2, pp.
1-758). New York: springer.
Chapter 5

Logistic Regression

Despite the simple linear model and its multivariate variant being very powerful tools for predicting
quantitative variables (when Y assumes real values), they are less effective when our goal is to
predict probabilities. This is because the linear model is ill-suited to this task: $x^\top\beta$, once
estimated, assumes real values. Therefore, we run the risk of predicting probabilities that are less
than zero or greater than one, which is absolutely unacceptable, as it is well known that
probabilities must lie between zero and one; see Figure 5.1 and its caption. For this reason,
we need to introduce a new class of models, Generalized Linear Models. In this chapter we will
present a specific type, the logistic model. Suppose that N responses Y1 , . . . , YN (in this section
the sample size will be denoted with N , while with n we denote the number of trials of a Binomial
random variable) are independent such that

Yi ∼ Bin(ni , πi )

where $\mathrm{Bin}(n, p)$ denotes a Binomial distribution, and where
$$g(\mu_i) = x_i^\top\beta.$$
The link function $g(\cdot)$ relates the linear predictor $x_i^\top\beta$ to the mean $\mu_i = E(Y_i) = n_i\pi_i$ for $i =
1, 2, \ldots, N$. This formulation includes the binary case in which $n_i = 1$ for all $i$. Note that the
models below are given in terms of $\pi_i$; this is equivalent to expressing them via $\mu_i$, since $\pi_i = \mu_i/n_i$.
The link function is the logit link, which gives the linear logistic model
 
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \mathrm{logit}(\pi_i) = x_i^\top\beta$$

In all these cases, the expression for the probability of success, πi , obtained by inverting the
equation for the model, has been chosen to be a cumulative distribution function (cdf):


$$\pi_i = \frac{1}{1+e^{-\eta_i}}, \quad \text{the cdf of a logistic distribution}.$$
Here, $\eta_i = x_i^\top\beta$ denotes the linear predictor for any observation. Note that these equations can
be used for prediction, after $\eta_i$ has been estimated by $\hat{\eta}_i$, by plugging in $\hat{\eta}_i = x_i^\top\hat{\beta}$.

The above choices ensure that 0 ≤ πi ≤ 1, as required. Also, the resulting link function has
the property that −∞ < g(µi ) < ∞, with the consequence that there are no constraints on the
unknown parameters in the linear predictor.
Indeed, what would happen if we wanted to predict πi using a linear model of the same class
introduced in the previous chapters? We would end up obtaining probabilities that are negative
and greater than one.

Figure 5.1: In this example, we wanted to predict the probability of default for some bank customers. On the right, we
notice how the linear model fails to capture the probability. Not only does the fitted line seem implausible, but it also
yields probabilities that are negative or greater than 1. On the left, however, we have a logistic model that correctly
captures the probability levels for different values of the balance.

If $\pi$ denotes the probability of a success, then the logit link function $\log\frac{\pi}{1-\pi}$ is the logarithm
of the odds on a success, or ’log odds’ for short, which we will denote by logit(π). So the coeffi-
cient βj of an explanatory variable xj in a logistic regression model measures the rate of change of
the log odds with xj , holding constant the values of the other explanatory variables in the model.
In particular, suppose that x1 is an indicator variable with just two levels, 0 and 1, as would be

the case if these values represented 'absence' and 'presence' of a factor. Then at $x_1 = 0$,
$$\mathrm{logit}(\pi) = \beta_0 + \beta_2 x_2 + \ldots + \beta_p x_p,$$
and at $x_1 = 1$,
$$\mathrm{logit}(\pi') = \beta_0 + \beta_1 + \beta_2 x_2 + \ldots + \beta_p x_p,$$
where the probability of success in the second case has been denoted by $\pi'$.
Subtracting these equations gives
$$\beta_1 = \mathrm{logit}(\pi') - \mathrm{logit}(\pi) = \log\left(\frac{\pi'}{1-\pi'}\right) - \log\left(\frac{\pi}{1-\pi}\right) = \log\left(\frac{\pi'/(1-\pi')}{\pi/(1-\pi)}\right),$$

which is the logarithm of the ratio of the odds on a success at the two values of x1 , or logarithm of
the odds ratio, holding constant the values of the other explanatory variables in the model.
So, the odds on a success when x1 = 1 is eβ1 times the odds on a success when x1 = 0, holding
constant the values of the other explanatory variables in the model.

Example 6. To study the onset of cardiovascular diseases Y = 1 in relation to smoking habits


(FU), the logit model is
$$\pi = P(Y = 1) = \frac{1}{1+e^{-\beta_0-\beta_1 \mathrm{FU}}}$$
which means
$$\mathrm{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \mathrm{FU}.$$
to Smoking=YES results in a multiplicative increase in the log-odds of 0.693, which corresponds
to an odds ratio of e0.693 . This implies that for smokers, the probability of becoming ill compared
to not becoming ill is in a 2 to 1 ratio compared to non-smokers.
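
A sketch of how such a model could be fitted in R, assuming a data frame heart with a binary response Y and a 0/1 smoking indicator FU (both names hypothetical):

fit <- glm(Y ~ FU, data = heart, family = binomial)
coef(fit)              # beta_0 and beta_1 on the log-odds scale
exp(coef(fit)["FU"])   # odds ratio e^{beta_1}; about 2 when beta_1 = 0.693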

5.1 Estimation
The log-likelihood $\ell(\pi)$ for a binomial logistic regression model is given by:
$$\ell(\pi) = \sum_{i=1}^{N}\left[y_i\log(\pi_i) + (n_i-y_i)\log(1-\pi_i) + \log\binom{n_i}{y_i}\right],$$

where $n_i = 1$ if $Y$ is distributed according to a Bernoulli variable. The parameters are obtained by solving the estimation equations
$$\frac{\partial\ell(\pi)}{\partial\pi_i} = 0, \quad i = 1, \ldots, N.$$

For example, if you want to estimate the parameters of a model with a single explanatory
variable logit(πi ) = β0 + β1 xi the log-likelihood becomes

$$\ell(\pi) = \sum_{i=1}^{N}\left[y_i\log\left(\frac{\pi_i}{1-\pi_i}\right) + n_i\log(1-\pi_i) + \log\binom{n_i}{y_i}\right]$$

it results in

$$\ell(\pi) = \sum_{i=1}^{N}\left[y_i(\beta_0+\beta_1 x_i) - n_i\log\left(1+e^{\beta_0+\beta_1 x_i}\right) + \log\binom{n_i}{y_i}\right]$$

therefore, it follows

$$\frac{\partial\ell(\pi)}{\partial\beta_0} = \sum_{i=1}^{N}\left(y_i - n_i\frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0+\beta_1 x_i}}\right) = \sum_{i=1}^{N}\left(y_i - n_i\pi_i\right)$$
and
$$\frac{\partial\ell(\pi)}{\partial\beta_1} = \sum_{i=1}^{N}\left(y_i x_i - n_i\frac{e^{\beta_0+\beta_1 x_i}x_i}{1+e^{\beta_0+\beta_1 x_i}}\right) = \sum_{i=1}^{N}\left(y_i x_i - n_i\pi_i x_i\right)$$

Deriving the score function yields the information matrix and thus the asymptotic covariance
matrix, from which the standard error estimates are derived:

$$\mathrm{se}(\beta_0) = \left[\sum_{i=1}^{N} n_i\pi_i(1-\pi_i)\right]^{-1/2}, \qquad \mathrm{se}(\beta_1) = \left[\sum_{i=1}^{N} n_i x_i^2\pi_i(1-\pi_i)\right]^{-1/2}$$
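
In practice these score equations are solved numerically (glm() in R uses iteratively reweighted least squares); a minimal sketch, assuming a data frame dat with a binary response y and a single predictor x (hypothetical names):

fit <- glm(y ~ x, data = dat, family = binomial)
summary(fit)$coefficients   # estimates, standard errors, z values and p-values
vcov(fit)                   # estimated asymptotic covariance matrix of the coefficients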

5.2 Inference
i Sampling Distribution of β̂:
$\hat{\beta} \sim N_{p+1}\big(\beta, \mathrm{Var}(\hat{\beta})\big)$ approximately, with $\mathrm{se}(\hat{\beta}_j)$ the square root of the $j$-th diagonal element of $\mathrm{Var}(\hat{\beta})$.

ii Deviance for Testing the Goodness-of-Fit of a Model: The deviance for testing the goodness-
of-fit of a particular model is given by

$$D = 2\sum_{i=1}^{N}\left[y_i\log\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right],$$

where the µ̂i ’s are the fitted µi ’s under the model (using the convention that 0 log 0 = 0). If
the model is true, $D \sim \chi^2_{N-(p+1)}$, which provides a test statistic for a goodness-of-fit test.

iii Test for H0 : βj = 0:


$$\frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \sim N(0,1) \quad \text{under } H_0.$$
This is a test for the omission of the j-th explanatory variable given the other explanatory
variables in the model.

iv Test for H0 : ν of the Regression Parameters β1 , . . . , βp are 0: Here we are assuming that
the linear predictor consists of a constant term and terms from p explanatory variables and
H0 tests the omission of ν explanatory variables where ν ≤ p. Let D0 and D denote the
deviances under H0 and the maximal (full) models, respectively. The likelihood ratio test
for H0 is
$$D_0 - D \sim \chi^2_\nu \quad \text{under } H_0.$$

v An alternative test for assessing goodness-of-fit is the use of the Pearson chi-squared statis-
tic, denoted by X 2 . This test is based on the comparison of observed and fitted frequencies
(the latter are usually referred to as expected frequencies in introductory texts, but are es-
sentially estimates known as fitted values or, in this context, fitted frequencies):

Observation              1          2         ...    N
Observed # successes     Y1         Y2        ...    YN
Observed # failures      n1 − Y1    n2 − Y2   ...    nN − YN
Table 5.1: Table of Observed and Fitted Frequencies

A similar table of fitted values is used, where Yi is replaced by µ̂i = ni π̂i .


Aside: Using common notation o for observed frequency and e for fitted (expected) fre-
quency, D and X 2 have the forms:

$$D = 2\sum o\log\left(\frac{o}{e}\right) \qquad \text{and} \qquad X^2 = \sum\frac{(o-e)^2}{e}.$$

This form of the deviance D is often denoted by G2 .


The Pearson chi-squared statistic, after some algebraic manipulations, is given by:

$$X^2 = \sum_{i=1}^{N}\frac{(Y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i(1-\hat{\pi}_i)}$$

which, if the model is correctly fitted, has the same large sample distribution as the deviance
D, i.e., $\chi^2_{N-(p+1)}$. By using the Taylor series expansion of $s\log(s/t)$ about $s = t$ up to the
quadratic term, it can be shown that D ≈ X 2 .
The large sample distribution for D and X 2 is likely to be poor if any of the fitted values in
the 2 × N table are small.

Besides considering the residual deviance D, model adequacy should also be checked by appropriate
plots (essentially the same that we have seen for the linear model).
There are several forms of residuals in the binomial case; an R sketch computing them is given after the list below.

• Raw residuals: êi = Yi − ni π̂i


Interpretation: these show how well the raw data are fitted.

• Pearson or chi-squared residuals:

$$X_i = \frac{\hat{e}_i}{\sqrt{n_i\hat{\pi}_i(1-\hat{\pi}_i)}},$$
so that the chi-squared statistic $X^2 = \sum_{i=1}^{N}X_i^2$.
Interpretation: these standardise the raw residuals by the estimated standard deviation of Yi ,
making them comparable in size.

• Standardised Pearson residuals:

$$r_{Pi} = \frac{X_i}{\sqrt{1 - h_{ii}}}$$

where hii , the leverage for the ith observation, is the ith diagonal element of the hat matrix,
these are standardised so that their variance is 1. They are comparable in size as well but
adjust for the location in x-space. For mathematical comparison these are better than the
Pearson residuals, but the Pearson residuals are more naturally interpreted in terms of which
points are "well fitted".

• Deviance residual:
$$d_i = \mathrm{sign}(\hat{e}_i)\left\{2\left[y_i\log\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i-y_i)\log\left(\frac{n_i-y_i}{n_i-\hat{\mu}_i}\right)\right]\right\}^{1/2},$$
so that the deviance $D = \sum_{i=1}^{N}d_i^2$. The term $\mathrm{sign}(\hat{e}_i)$ gives $d_i$ the same sign as $\hat{e}_i$.
Interpretation: these formalise how strongly the observation contributes to the deviance, i.e.,
to the standard way of measuring the quality of the overall fit (or rather misfit; the higher,
the worse). They show to what extent the observation indicates that the model is violated
and rather a saturated model is needed.

• Standardised Deviance Residual:

$$r_{Di} = \frac{d_i}{\sqrt{1 - h_{ii}}}$$

Interpretation: This modification makes the di directly mathematically comparable by uni-


fying their variance and adjusting for the location in x-space.
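
A minimal sketch computing the residual types above for a fitted binomial model; fit is a hypothetical glm object:

e_raw <- residuals(fit, type = "response")  # raw residuals (R reports them on the proportion scale)
X_i   <- residuals(fit, type = "pearson")   # Pearson residuals
d_i   <- residuals(fit, type = "deviance")  # deviance residuals
h     <- hatvalues(fit)                     # leverages h_ii
r_P   <- X_i / sqrt(1 - h)                  # standardised Pearson residuals
r_D   <- rstandard(fit)                     # standardised deviance residuals
sum(X_i^2)                                  # Pearson chi-squared statistic X^2
deviance(fit)                               # deviance D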

5.2.1 Penalized Logistic Regression


Consider the case of logistic regression where $Y_i \sim \mathrm{Bernoulli}(\pi_i)$, i.e. $\mathrm{Binomial}(1, \pi_i)$, with
$$\log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \sum_{j=1}^{p}\beta_j x_{ij} = f_{\beta_0,\beta_1,\ldots,\beta_p}(x_i).$$

The negative log-likelihood equals
$$-\ell(\pi) = \sum_{i=1}^{N}\left\{-y_i f_{\beta_0,\beta_1,\ldots,\beta_p}(x_i) + \log\left(1 + \exp\left(f_{\beta_0,\beta_1,\ldots,\beta_p}(x_i)\right)\right)\right\} = \sum_{i=1}^{N}\rho(x_i, y_i),$$
where $\rho(x_i, y_i)$ denotes the contribution of the $i$-th observation.

The $\ell_1$-norm penalized Lasso estimator is then defined as
$$\left(\hat{\beta}_0(\lambda), \hat{\beta}_1(\lambda), \ldots, \hat{\beta}_p(\lambda)\right) = \underset{\beta_0,\beta_1,\ldots,\beta_p}{\arg\min}\left\{n^{-1}\sum_{i=1}^{N}\rho(x_i, y_i) + \lambda\sum_{j=1}^{p}|\beta_j|\right\}.$$

As can be inferred, the $\ell_2$ penalty can also be applied in this context, resulting in
$$\left(\hat{\beta}_0(\lambda), \hat{\beta}_1(\lambda), \ldots, \hat{\beta}_p(\lambda)\right) = \underset{\beta_0,\beta_1,\ldots,\beta_p}{\arg\min}\left\{n^{-1}\sum_{i=1}^{N}\rho(x_i, y_i) + \lambda\sum_{j=1}^{p}\beta_j^2\right\},$$

since it can easily be demonstrated that the negative log-likelihood of the binomial model is a convex
function. This convexity property arises because the model belongs to the exponential family.
Therefore, applying $\ell_1$ and $\ell_2$ penalties to logistic regression does not pose significant computa-
tional problems. The effects of these penalties are essentially similar to those observed for the
multiple linear regression model, as described in previous paragraphs.
The glmnet function in R supports various types of models specified by the family param-
eter. These include:

• "gaussian" for linear regression,

• "binomial" for logistic regression,

• "poisson" for Poisson regression,

• "multinomial" for multinomial logistic regression,

• "cox" for Cox proportional hazards regression,

• "mgaussian" for multiple Gaussian output regression.

For implementing penalized logistic regression, we use family=’binomial’, which is suited


for binary outcomes.
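
A brief sketch, assuming a numeric predictor matrix x and a binary response y (hypothetical objects):

library(glmnet)
fit_l1 <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # l1 (lasso) penalised logistic regression
fit_l2 <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # l2 (ridge) penalised logistic regression
coef(fit_l1, s = "lambda.min")  # some coefficients shrunk exactly to zero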
Chapter 6

Useful R commands

Reading Data

dta <- read.csv(file=file.choose())
# Chooses the csv file directly by clicking

dta <- read.table(file=file.choose())
# Chooses the txt file directly by clicking

dta <- read.csv("/danilo/desktop/mydata.csv")
# Reading a csv file using a file path
# (the file name here is just a placeholder);
# I recommend to set your working directory first

Manipulate dataframes

dta
# Shows the data set called dta

head(mydata)
# Shows the first 6 rows (Default)
# The parameter n controls how many rows to show

tail(mydata)
# Shows the last 6 rows (Default)
# The parameter n controls how many rows to show


str(mydata)
# Shows structure of the dataset
# Variable name, type, size of the data frame etc..

names(mydata)
# Returns the variable names

ls()
# Shows a list of objects that are available

Descriptive Statistics

Assume that x ∈ Rn and y ∈ Rn are two data vectors

mean(x)
# Computes the mean of x
# If you are dealing with NAs, the option na.rm=TRUE could be an option
# (Not exclusive of mean(), just check the R documentation)

median(x)
# Computes the median of x

var(x)
# Computes the unbiased variance of x

sd(x)
# Computes the unbiased standard deviation of x

IQR(x)
# Computes the Inter Quartile Range (IQR) which is IQR= Q3-Q1 of x, where
# Q1 is the first quartile and Q3 the third quartile

summary(x)
# Computes the 5-number summary and the mean of x

cor(x,y)

# Computes the unbiased correlation coefficient between x and y

cor(mydata)
# Computes the unbiased correlation matrix

Basic Graphs

hist(x)
# Returns a histogram

boxplot(x)
# Returns a boxplot

boxplot(y~x)
# Returns a side-by-side boxplots
# ( one of the variable have to be factor)

plot(y~x)
# Returns a scatterplot of y versus x

plot(dta)
# Returns a scatterplot matrix of the data.frame/matrix called dta

abline(lm(y~x))
#adds regression line to a scatterplot

Linear Regression

###########################################
# Fitting
###########################################

model_fit=lm(y~x)
# Fit a regression model

summary(model_fit)
# Get results from fitting the regression model
# (residual distribution, coefficients, t tests, p-values,
# F test, R^2, Adjusted R^2)

anova(model_fit)
# get the ANOVA table

plot(model_fit)
# Get four plots, including normal probability plot, of residuals

confint(model_fit)
# CIs for all parameters

predict(model_fit, interval="confidence")
# Make prediction and give confidence interval for the mean response

predict(model_fit, interval="prediction")
# Make prediction and give prediction interval for a new response

newx=data.frame(X=4)
# Create a new data frame with one new x* value of 4

predict(model_fit, newx, interval="confidence")


# Get a CI for the mean at the value x*

################################################
# Tests
################################################

bptest(model_fit)
# Get the Breusch-Pagan test
(lmtest package must be installed)

ad.test(resids)
# Get Anderson-Darling test for normality
(nortest package must be installed)

cvm.test(resids)
# Get Cramer-von Mises test for normality
(nortest package must be installed)

lillie.test(resids)
# Get Lilliefors (Kolmogorov-Smirnov) test for normality
(nortest package must be installed)

pearson.test(resids)
# Get Pearson chi-square test for normality
(nortest package must be installed)

sf.test(resids)
# Get Shapiro-Francia test for normality
(nortest package must be installed)

boxcox(model_fit)
# Evaluate possible Box-Cox transformations
(MASS package must be installed)

################################################
# Variable Selection
################################################

stepAIC()
# Choose the best model using AIC; the direction argument can be:
# "both" (for stepwise regression, both forward and backward selection);
# "backward" (for backward selection);
# "forward" (for forward selection).
(MASS package must be installed)

step()
# Choose a model by AIC in a Stepwise Algorithm
Chapter 7

Lab in R

In this chapter, all the laboratories covered in class are compiled. The following material does
not replace but complements the study of the lecture notes. To successfully pass the exam, we
recommend a thorough study of the following material, as well as completing all the exercises
provided.

7.1 Lab 1 Exploratory Data Analysis

Setup

# Most of the data we are going to work with are taken from the faraway package
library(faraway)  # call the package to use its built-in functions and data
library(ggplot2)  # an amazing package to create graphs

Load and Examine Data

data(pima)
head(pima)  # just print the first 6 rows of the dataset

To ensure a comprehensive understanding of the data, it is crucial to generate numerical sum-


maries such as means, quantiles, standard deviations (SDs), maximum and minimum values. This
process is essential in determining the integrity of the data and identifying any potential outliers
or inconsistencies. As a statistician or data scientist, exploring the data should be the first step in
problem-solving.


The dataset under consideration is derived from a study conducted by The National Institute
of Diabetes and Digestive and Kidney Disease, which involved 768 adult female Pima Indians
residing near Phoenix. The variables within the dataset include:

Variable Description
pregnant Number of times pregnant
glucose Concentration of plasma glucose at 2 hours in an oral glucose tolerance test
diastolic Diastolic blood pressure in mmHg
triceps Triceps skin fold thickness in mm
insulin 2-hour serum insulin in µU/ml
bmi Body mass index (BMI): weight in kg divided by the square of height in m (kg/m2)
diabetes Diabetes pedigree function
age Age in years
test Test about signs of diabetes, coded zero if negative and one if positive

Table 7.1: National Institute of Diabetes and Digestive and Kidney Disease: Data Description.

To initiate the exploration of this dataset, we can use the function summary().
1 summary(pima)

Is there something that doesn’t make sense to you? It is virtually impossible for a patient to
have no blood pressure. Blood pressure can be zero or near-zero for certain medical conditions,
such as shock or cardiac arrest, and can also vary depending on the position of the body. However,
it is highly unlikely for a healthy individual to have a blood pressure of zero.
Regarding the dataset description, it is possible that the value of zero has been used to indicate
missing data, as it is a common practice to represent missing values as zeros in some datasets. It
is also possible that the researchers did not obtain the blood pressures for some patients due to
certain limitations or errors in data collection. It is important to check the data documentation and
consult with experts in the field to gain a better understanding of the data and potential issues.
At this point, it makes sense to denote the zero values as NA.
1 pima$diastolic[pima$diastolic==0] <- NA
2 pima$glucose[pima$glucose==0] <- NA
3 pima$triceps[pima$triceps==0] <- NA
4 pima$insulin[pima$insulin==0] <- NA
5 pima$bmi[pima$bmi==0] <- NA

The variable test is a categorical variable, also called a factor. Therefore, we need to be sure that R treats qualitative variables as factors. Sometimes people (even professional statisticians) forget this and compute statistics such as the ’average zip code’.
Formatting variables is not only important for summary statistics but also when we move on
to modeling. Don’t neglect the formatting phase, please.

1 str(pima$test)

In this case, test is stored as an integer; this does not make sense, so it is better to format it as a factor.
1 pima$test <- factor(pima$test)
2 summary(pima$test)

Now that it’s coded correctly, summary(test) makes more sense to all of us. We see that
500 cases were negative, and 268 were positive. A way to make this clearer is to use descriptive
labels.
1 levels(pima$test) <- c(’negative’, ’positive’)
2 summary(pima)

Now we are ready to explore a little bit further with some plots.
1 hist(pima$diastolic,
2 xlab=’Diastolic’,
3 main=’’)

1 plot(density(pima$diastolic, na.rm=TRUE),
2      main=’Diastolic kernel plot’)

If you are uncomfortable with kernel methods, this is an extraordinary resource to explore the topic further: CLICK HERE. The kernel plot effectively avoids the blockiness that can be
distracting in a histogram. However, it is important to ensure appropriate bin specifications for the
histogram and bandwidths for the kernel density plot. To understand the effect of bandwidths, we
can play with it.
1 plot(density(pima$diastolic, na.rm=TRUE, bw=1),
2 main=’This is a too wiggly plot ’)

1 plot(density(pima$diastolic, na.rm=TRUE, bw=3),


2 main=’This looks better... more smooth ’)

1 plot(density(pima$diastolic, na.rm=TRUE, bw=10),


2 main=’This is perfect ’)

The higher the bandwidth, the smoother the density estimate will be. An in-depth discussion of kernel methods is out of the scope of this course. If you feel like you don’t know enough about kernels, please visit this resource: CLICK HERE.
1 plot(diabetes ~ diastolic, data=pima)

1 plot(diabetes ~ test, data=pima)

An alternative to the base plots in R is the ggplot2 package. The essential elements of a plot
made using this package are:

• Data: The data that is being visualized is passed to ggplot2 as a data frame.

• Aesthetic mapping: The mapping of variables in the data to visual properties of the plot,
such as the x and y axis or color and shape of points.

• Geometric objects: The geometric objects that define the type of plot, such as points, lines,
bars, or histograms.

More information about how to visualize data properly is discussed in MA304-7.


1 ggplot(data= pima,
2 aes(x=diastolic))+
3 geom_histogram()

1 ggplot(data= pima,
2 aes(x=diastolic))+
3 geom_density()+
4 ggtitle(’Diastolic kernel density’)

1 ggplot(data=pima,
2 aes(x=diastolic, y=diabetes, shape=test))+
3 geom_point()

This is an example of a bivariate scatterplot (two dimensions) to which a third has been added.
Can you think of ways to add dimension to this scatterplot? To add dimensions to a scatterplot in
R, various techniques can be employed. Some of the most common approaches are:

• Adding color: You can add color to the points in the scatterplot to represent a third variable.
For example, you can assign different colors to the points based on a categorical variable.

• Adding size: You can adjust the size of the points to indicate a numerical variable. For
instance, you can make the points larger or smaller based on a variable’s value.

• Adding shape: You can change the shape of the points to indicate a categorical variable.
For example, you can use different shapes, such as circles, triangles, or squares, to represent
different categories.

• Adding facets: You can create a grid of scatterplots, each representing a subset of the data
based on one or more variables. This approach is useful for visualizing complex relation-
ships in the data.

1 ggplot(data=pima,
2 aes(x=diastolic, y=diabetes))+
3 geom_point(size=1)+
4 facet_grid(~ test)
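As a complementary illustration of the first approach above (adding colour), here is a small sketch, assuming the pima data prepared as in this lab; the choice of aesthetic is an added example, not part of the original notes:

1 ggplot(data=pima,
2        aes(x=diastolic, y=diabetes, colour=test))+
3   geom_point() # colour now encodes the test result as a third dimension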

Exercises
Please attempt the following exercises. Recall that to attempt these questions, you need to install
the faraway package first.

Exercise 1
The dataset teengamb concerns a study of teenage gambling in Britain (?teengamb, for further
details about the data). Make a numerical and graphical summary of the data, commenting on any
features you find interesting. Limit the output you present to a quantity that a busy reader would
find sufficient to get a basic understanding of the data.

Exercise 2
The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a
numerical and graphical summary of the data.

References
- Faraway, J. J. (2015). Linear Models with R (2nd ed.). CRC Press.

7.2 Lab 2: Estimation

Setup

1 # Most of the data we are going to work with are taken from faraway package
2 library(faraway) # call the package to use its built-in functions and data
3 library(ggplot2) # an amazing package to create graphs

Estimation
We can start by recalling some basic notions from the lecture notes; a multiple linear model can be defined as:

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ε,

where βi , i = 0, . . . , p, are unknown parameters and ε = (ε1 , . . . , εn )> can be assumed to follow a Gaussian distribution (or not) with E(ε) = 0 and Var(ε) = σ 2 In , where In is an identity matrix (more details in the lecture notes).
In statistical modeling, specifically under linear model(s) domain, we are assuming a linear
relationship between the dependent variable and one or more independent variables or predictors,
also known as covariates. The term "linear" in linear model refers to the fact that the parameters
of the model enter linearly, meaning that the effect of each predictor on the dependent variable
is assumed to be a linear function of that predictor. However, it is important to note that the
predictors themselves do not have to be linear. In fact, the linear model can be used with non-linear
transformations of the predictors, as long as the relationship between the transformed predictors
and the dependent variable is still linear. This flexibility makes the linear model a powerful tool
for analyzing a wide range of data sets, from simple to complex, and for making predictions based
on those data. For example:

Y = β0 + β1 X1 + β2 log X2 + β3 X1 X2 + ε.

Despite the presence of a logarithmic term and an interaction term, the given equation is still
considered a linear model. This is because the model is linear in its parameters, which are the
coefficients (β0 , β1 , β2 , β3 ) that multiply each of the predictor variables. The dependent variable
Y is a linear combination of the parameters and the predictor variables, where the coefficients
are constant and do not depend on any function or transformation of the predictor variables. The

logarithmic term and interaction term do not violate the linearity assumption of the model. But
what about this:

Y = β0 + β1 X1^β2 + ε.

The given equation is not a linear model because it is not linear in the parameters. Specifically,
the exponent β2 applied to X1 means that the coefficient β1 is not constant, but rather varies as a
power function of X1 . This violates the linearity assumption of the model, which states that the
coefficients must be constant and not depend on any function or transformation of the predictor
variables.
Nonlinear models require more complex techniques for estimation and inference.
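To make the distinction concrete, here is a small hedged sketch (not part of the original notes) using simulated data: a model that is linear in the parameters, even with log and interaction terms, can be fitted with lm(), whereas a power model such as Y = β0 + β1 X1^β2 + ε needs a nonlinear fitter such as nls().

1 set.seed(1)
2 x1 <- runif(50, 1, 10); x2 <- runif(50, 1, 10)
3 y <- 1 + 2*x1 + 0.5*log(x2) + 0.1*x1*x2 + rnorm(50)
4 lm(y ~ x1 + log(x2) + x1:x2) # linear in the parameters: OLS applies
5 y2 <- 1 + 2*x1^1.5 + rnorm(50)
6 nls(y2 ~ b0 + b1*x1^b2, start=list(b0=0.5, b1=1.5, b2=1.2)) # nonlinear in b2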
Now let’s take a look at an example to understand how things work in the R language. The dataset we are going to examine today concerns the number of species found on the various Galapagos Islands.
1 library(faraway)
2 data(gala)
3 head(gala)

The gala dataset includes several variables that describe different aspects of the Galapagos
Islands. These variables are:

Variable Description
Species Number of plant species found on each island
Endemics Number of endemic species on each island
Area Size of each island in square kilometers
Elevation Highest elevation on each island in meters
Nearest Distance to the nearest neighboring island in kilometers
Scruz Distance to Santa Cruz island in kilometers
Adjacent Area of the adjacent island in square kilometers

Table 7.2: Description of Variables in Galapagos Islands Dataset

To learn more about these variables, you can access the data description by running the com-
mand ?gala in R.
1 str(gala)

Everything seems to be formatted correctly. We can fit a linear model in R using the lm()
function. lm() is a function in R that fits linear regression models to data. In brief, you need two
ingredients:
1. You need to create a formula that specifies the relationship between the response variable
(the variable you want to predict) and one or more predictor variables. The formula takes
the form

response_variable ~ predictor_variable_1 + predictor_variable_2

For example, if you have a dataset with a response variable called y and two predictor
variables called x1 and x2, the formula might be

y ~ x1 + x2

2. Once you have the formula, you can use the lm() function to fit the linear regression model.
The syntax for lm() is lm(formula, data). The first argument is the formula, and
the second argument is the name of the data frame that contains your variables.

1 lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala) # fit the model
2 summary(lmod)

From this output, we can extract the regression quantities we need:

R Function Description
residuals() Residuals (ei ): The differences between observed values (yi ) and predicted values (ŷi )
fitted() Fitted Values (ŷi ): The predicted values for each observation
df.residual() Degrees of Freedom (DF): The degrees of freedom associated with the residuals
deviance() Deviance (RSS): The residual sum of squares, a measure of model fit
coef() Coefficients (β̂i ): The estimated coefficients of the regression model

Table 7.3: Description of Quantities Returned by R Functions

1 residuals(lmod)
2 fitted(lmod)
3 # Please make sure that (species - fitted values) gives the residuals
4 df.residual(lmod)
5 # degrees of freedom are 24; does it make sense to you?
6 deviance(lmod)
7 coef(lmod)

You can even extract other quantities by examining the model object and its summary:
1 names(lmod)

1 lmodsum <- summary(lmod)


2 names(lmodsum)

Now we can check the assumptions:


1 plot(lmod$residuals)

1 qqnorm(lmod$residuals)

1 plot(lmod, 1:2)

The plot() function in R can be used to generate four diagnostic plots for linear regression
models:

1. Residuals vs Fitted Values Plot: This plot shows the residuals (the differences between the
observed values and the predicted values) on the y-axis and the fitted values (the predicted
values) on the x-axis. The plot is used to check for non-linear relationships between the
predictor variables and the response variable. If there is a clear pattern in the residuals, such
as a curve or a funnel shape, it may indicate a non-linear relationship.

2. Normal Q-Q Plot: This plot shows the residuals on the y-axis and their expected values
under normality on the x-axis. The plot is used to check for violations of the assumption
that the residuals are normally distributed. If the residuals are normally distributed, they will
follow a straight line on the plot.

3. Scale-Location Plot: This plot shows the square root of the absolute values of the standard-
ized residuals on the y-axis and the fitted values on the x-axis. The plot is used to check for
violations of the assumption that the variance of the residuals is constant across the range
of the predictor variables. If the variance is constant, the points on the plot will be evenly
spread out around a horizontal line.

4. Residuals vs Leverage Plot: This plot shows the leverage (a measure of how much influence
an observation has on the model) on the x-axis and the standardized residuals on the y-axis.
The plot is used to check for influential outliers. If an observation has high leverage and a
large standardized residual, it may be an influential outlier.
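A quick way to view all four diagnostic plots described above in a single window (standard R usage, added here for convenience) is:

1 par(mfrow=c(2,2)) # 2 x 2 grid of plots
2 plot(lmod) # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
3 par(mfrow=c(1,1)) # reset the plotting layout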

In linear regression, it is also useful to have some measure of goodness of fit. One common
choice is the R2 . We recall here its formulation:

R^2 = 1 - \frac{\sum_i (\hat{y}_i - y_i)^2}{\sum_i (y_i - \bar{y})^2}

R2 is a statistical measure that represents the proportion of the variance in the dependent
variable that is predictable from the independent variables in a regression model. It is commonly
used as a goodness-of-fit measure for linear regression models.
The R2 value ranges from 0 to 1, where 0 indicates that the model explains none of the vari-
ability in the dependent variable, and 1 indicates that the model explains all of the variability.
However, it is rare to see an R2 value of 1 in practice, as it would mean that the model perfectly
predicts the dependent variable with no error.

In general, a higher R2 value indicates a better fit of the model to the data. However, the
interpretation of what constitutes a "good" R2 value depends on the specific context and the goals
of the analysis.
For example, in some fields, such as physics or engineering, models with R2 values of 0.9 or
higher may be considered good. In contrast, in social sciences or economics, where human behav-
ior is much more complex and difficult to predict, an R2 value of 0.2 to 0.3 may be considered a
good fit.
It’s important to keep in mind that R2 should not be the only criterion for evaluating a model’s
performance, and it should be used in conjunction with other metrics such as residual plots, cross-
validation, and statistical significance tests to ensure that the model is valid and reliable for the
specific analysis at hand.
An alternative measure of fit is σ̂, the residual standard error. Its advantage is that it is measured in the units of the response Y. The regression summary returns both values, and it is worth paying attention to both of them. It is good practice to explore your data before fitting any model you have in mind!
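Both quantities can be extracted directly from the summary object created above; a minimal sketch, assuming lmod and lmodsum from the gala fit:

1 lmodsum$r.squared # R^2 reported by summary()
2 lmodsum$adj.r.squared # adjusted R^2
3 lmodsum$sigma # residual standard error, in the units of the response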

Diagnostics
Now we have estimated our linear model, it is time to check the assumptions seen in the lecture
notes. It is a good practice to check regression diagnostics before using the model for any kind of
inference purposes you have in mind (e.g., Tests, Confidence Intervals, Predictions).
We can divide our problems into two main categories:
1. We have assumed that ε ∼ N (0, σ 2 In ). With this assumption we are not only saying that the error terms are all centered at zero and have constant variance, we are also saying that they are uncorrelated (Do you see why?).
2. We are assuming that the structural part of the model, E(Y ) = Xβ, is correct and is the same for all the observations; there is no structural break in the data.
Let’s start by checking the independence, constant variance, and normality assumptions of the errors ε. As has already been discussed, the error terms are not observable, but we can estimate them using the residuals e.
Merely examining the residuals by plotting them does not suffice to verify the assumption of
constant variance, as other factors may be involved. To ensure accuracy, additional factors must be
checked. One of the most crucial diagnostic plots is the plot of residuals versus predicted values.
If all assumptions are valid, a symmetrical, constant variation in the vertical direction should be
evident. This plot not only identifies heteroskedasticity but also detects non-linearity.
1 res1 <- rnorm(100)
2 res2 <- rnorm(100, sd=1:50)
3 res3.1 <- rnorm(33, mean=-8, sd=1)
4 res3.2 <- rnorm(33, mean=0, sd=1)

5 res3.3 <- rnorm(33, mean=-4, sd=1)


6 res3 <- c(res3.1, res3.2, res3.3)
7 par(mfrow=c(1,3))
8 {
9 plot(res1, xlab=’Fitted’, ylab=’Residuals’, main=’No problem’)
10 abline(h=0)
11 plot(res2, xlab=’Fitted’, ylab=’Residual’, main=’Heteroskedasticity’)
12 abline(h=0)
13 plot(res3, xlab=’Fitted’, ylab=’Residuals’, main=’Non-linearity’)
14 abline(h=0)
15 }

Creating a plot of residuals against a predictor variable that is not included in the model, such
as (e, xi ), can also be helpful. If any patterns or structures are apparent in this plot, it may suggest
that this predictor should be incorporated into the model.
It is often challenging to check the residuals without prior knowledge or experience with data, so we can generate some artificial plots just to get an idea of what good (and bad) residual plots look like.
1 par(mfrow=c(1,3))
2 n <- 200
3 for (i in 1:3) {
4 x <- runif(n)
5 plot(x, rnorm(x), main=’Constant Variance’)
6 }
7 for (i in 1:3) {
8 x <- runif(n)
9 plot(x, x * rnorm(n), main=’Non-Constant Variance’)
10 }
11 for (i in 1:3) {
12 x <- runif(n)
13 plot(x, cos(x * pi) + rnorm(n, sd=1), main=’Nonlinearity’)
14 }

Since tests and confidence intervals rely heavily on the normality assumption, it is advisable
to confirm its validity. The normality of residuals can be evaluated by examining a Q-Q plot,
which compares the quantiles of the residuals with those of a normal distribution. Formally, a
Q-Q plot, or quantile-quantile plot, is a graphical method used to compare the distribution of a
sample to a theoretical distribution, such as a normal distribution. The plot displays the sample
quantiles on the vertical axis and the theoretical quantiles on the horizontal axis. If the sample
and theoretical distributions are identical, the points on the Q-Q plot will fall along a straight line.

Departures from a straight line indicate deviations from the theoretical distribution. Q-Q plots can
be used to assess the normality of a distribution or to compare the distribution of a sample to other
distributions, such as a uniform or exponential distribution.
1 data(savings)
2 lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
3 {qqnorm(residuals(lmod))
4 qqline(residuals(lmod))}

Normal residuals should follow the 45-degree line approximately. Just to get an idea, we can
simulate some non-normal distributions and see what a Q-Q plot looks like:
1 par(mfrow=c(1,3))
2 n <- 200
3

4 for (i in 1:3) {
5 x <- rnorm(n)
6 qqnorm(x)
7 qqline(x)
8 } # Normal distribution
9

10 for (i in 1:3) {
11 x <- exp(rnorm(n))
12 qqnorm(x)
13 qqline(x)
14 } # Lognormal distribution
15

16 for (i in 1:3) {
17 x <- rcauchy(n)
18 qqnorm(x)
19 qqline(x)
20 } # Cauchy distribution
21

22 for (i in 1:3) {
23 x <- runif(n)
24 qqnorm(x)
25 qqline(x)
26 } # Uniform distribution

Addressing non-normality is not a one-size-fits-all approach; the appropriate solution will de-
pend on the nature of the problem encountered. In the case of skewed errors, transforming the
response variable may be sufficient to resolve the issue. If non-normality persists despite a transformation, an alternative approach may be to acknowledge the non-normality and make inferences
based on a different distribution, or alternatively, employ resampling methods.
If you would like to test if the residuals follow a normal distribution or not, you can use the
Shapiro-Wilk test.
1 shapiro.test(residuals(lmod))

Under the null, we have that the residuals are normally distributed. Since the p-value is large,
we do not reject the null.

Exercises
Please attempt the following exercises. Recall that to attempt these questions, you need to install
the faraway package first.

Exercise 1
1. Try to obtain the vector of coefficient estimates by using the formula in the lecture notes:
β̂ = (X T X)−1 X T y.

Exercise 2
Try to obtain σ̂ 2 from lm() output and compare it to the result obtained using the formula from
the lecture notes.

Exercise 3
The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model
with the expenditure on gambling as the response and the sex, status, income, and verbal score as
predictors.

• a) Try to replicate the output of the summary() function by computing all the quantities
using the formulas seen in the lecture notes.

• b) What percentage of variation in the response is explained by these predictors?

• c) Which observation has the largest (positive) residual? Give the case number.

• d) Compute the mean and the median of the residuals. Are they symmetric and centered
around zero?

• e) Compute the correlation of the residuals with the fitted values.



• f) Compute the correlation of the residuals with the income.

• g) For all other predictors held constant, what would be the difference in predicted expendi-
ture on gambling for a male compared to a female?

Exercise 4
In this question, we investigate the relative merits of methods for computing the coefficients.
Generate some artificial data:
1 x <- 1:20
2 y <- x + rnorm(20)

Fit a polynomial in x for predicting y. Compute β̂ in two ways: by using lm() and by using
direct calculation described in the lecture notes. At what degree of polynomial does the direct
calculation method fail? (Hint: you need to use the I() function when fitting the polynomial, i.e.,

lm(y ~ x + I(x^2))).

References
- Faraway, J. J. (2015). Linear Models with R (2nd ed.). CRC Press.

7.3 Lab 3: Inference

Setup

1 # Most of the data we are going to work with are taken from faraway package
2 library(faraway) # call the package to use its built-in functions and data
3 library(ggplot2) # an amazing package to create graphs

Inference
With our model estimation complete, we can now move on to making inferences. It’s important
to note that to estimate the intercept β0 and the parameters βi for i = 1, . . . , p, no particular
assumptions are required, as OLS is simply an algebraic tool. However, if we wish to compute
confidence intervals (CI) or perform hypothesis tests, we must make the assumption that the error
terms are normally distributed. This is an important consideration when drawing conclusions from
our model and ensuring the reliability of our results.

Tests about Multiple Parameters


Given several predictors, you might want to test whether all are needed. Consider a larger model, Ω, of dimension p, and a smaller model, ω, of dimension q, which consists of a subset of the predictors that are in Ω. We can obtain the following F statistic:

F = \frac{(RSS_\omega - RSS_\Omega)/(p - q)}{RSS_\Omega/(n - p)} \sim F_{p-q,\, n-p}.

Details about the derivation of this statistic are out of the scope of this course. However, F
is obtained by a Likelihood ratio test, therefore by assuming that the error terms are Normally
distributed. Hence, this test and most of the tests we are going to see in this section cannot be used
if there is a strong violation of the Normality assumption.
In R, many tasks can be simplified. To illustrate this point, we will use the Galapagos Islands
dataset to fit the same model as before, with the number of species as the response variable and
the geographic variables as predictors.
1 lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)

We then fit the model with only the intercept. The function that we are going to use is
anova(); this will do the job for us!
1 nullmodel <- lm(Species ~ 1, data=gala)
2 anova(nullmodel, lmod)

We can see directly the result of the test, where the null here is

H0 : β1 = β2 = β3 = β4 = β5 = 0

H1 : at least one is different from zero,

we have a p-value of 6.838e−07, so we reject the null. This can be verified using the formula derived above:
1 rss0 <- deviance(nullmodel)
2 rss <- deviance(lmod)
3 df0 <- df.residual(nullmodel)
4 df <- df.residual(lmod)
5

6 (fstatistic <- ((rss0 - rss) / (df0 - df)) / (rss / df))

we can see that this result is consistent with the one obtained using anova().
1 1 - pf(fstatistic, df0 - df, df)

the p-value is consistent as well.

Testing one predictor


Let’s suppose we want to test

H0 : βi = 0 vs H1 : βi ≠ 0.

Let Ω be the model with all the predictors of interest which has p parameters and let ω be
the model with all the same predictors except predictor i (the one we are testing). Let’s test
whether Area can be dropped from the full model by testing the hypothesis that the corresponding
parameter is zero.
1 lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
2 lmods <- lm(Species ~ Elevation + Nearest + Scruz + Adjacent, data=gala)
3 anova(lmods, lmod)

The p-value of 0.3 tells us that we cannot reject the null. An alternative (equivalent) approach is to use the t-statistic for testing the hypothesis:

t_i = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)},

which can be shown to follow a Student's t distribution with n − (p + 1) degrees of freedom.
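As a minimal sketch (not in the original notes), the t-value reported by summary() for Area can be reproduced from the coefficient table:

1 cf <- summary(lmod)$coefficients
2 cf["Area", "Estimate"] / cf["Area", "Std. Error"] # equals cf["Area", "t value"]

Note that the square of this t-statistic equals the F-statistic returned by anova(lmods, lmod), which is why the two approaches are equivalent.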

Testing a pair of predictors


In statistical terms, if we want to investigate whether the area of either the current island or the
adjacent island has any relationship with the response, we can phrase the problem as follows:

H0 : βArea = βAdjacent = 0,

This test can also be carried out with anova():


1 lmods <- lm(Species ~ Elevation + Nearest + Scruz , data=gala)
2 anova(lmod, lmods)

The null is rejected as the p-value is small.

Testing a subspace
Let’s suppose that we want to test

H0 : βArea = βAdjacent ,

we can test this hypothesis again with anova()


1 lmods <- lm(Species ~ I(Area + Adjacent) + Elevation + Nearest + Scruz, data=gala)
2 anova(lmods, lmod)

The function I() ensures that the argument is evaluated rather than interpreted as part of the
model formula. The p-value suggests that the null can be rejected.
Now, assume we want to test

H0 : βElevation = 0.5.

This can be set up in R by using an offset:


1 lmods <- lm(Species ~ Area + offset(0.5 * Elevation) + Nearest + Scruz + Adjacent, data=gala)

2 anova(lmod, lmods)

we see that the p-value is small, so the null has to be rejected.

Confidence Intervals
In our domain, Confidence Intervals (CIs) are an incredibly powerful tool that allow us to quantify
the uncertainty in our estimates of β. They provide us with a range of plausible values for the true
value of the parameter, which can help us make informed decisions and draw accurate conclusions
from our data.

β̂i ± tn−(p+1),α/2 se(β̂i ),

1 lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
2 summary(lmod)

We can manually construct a 95% CI for βArea , for which we need the 2.5% and 97.5% percentiles of the t-distribution with 30 − 6 = 24 degrees of freedom.
1 qt(0.975, 24)

1 -0.02394 + c(-1, 1) * qt(0.975, 24) * 0.02242

We can note that the CI contains zero; this indicates that the null hypothesis H0 : βArea = 0 would not be rejected at the 5% level.
A convenient way to obtain all the univariate intervals is
1 confint(lmod)

Exercises
Please attempt the following exercises. Recall that to attempt these questions, you need to install
the faraway package first.

Exercise 1
Using the prostate data, fit a model with lpsa as the response and the other variables as
predictors.

1. Compute 90 and 95% CIs for the parameter associated with age.

2. Using just these intervals, what could we have deduced about the p-value for age in the
regression summary?

Exercise 2
Using the sat data

1. Fit a model with total SAT score as the response and expend, ratio, and salary as predictors. Test the hypothesis that βsalary = 0. Test that βsalary = βratio = βexpend = 0. Do
any of these predictors have an effect on the response?

2. Now add takers to the model. Test the hypothesis that βtakers = 0. Compare this model
to the previous one using an F -test. Demonstrate that the F -test and the t-test here are
equivalent.

Exercise 3
Find a formula relating R2 and the F -test for the regression (please try to attempt this alone, not
using external help or AI!).

References
- Faraway, J. J. (2015). Linear Models with R (2nd ed.). CRC Press.

7.4 Lab 4: Prediction & Transformation

Predictions

Setup

1 # Most of the data we are going to work with are taken from faraway package
2 library(faraway) # call the package to use its built-in functions and data
3 library(ggplot2) # an amazing package to create graphs

Up until now, we’ve covered the process of estimating a linear model and making inferences
based on that model. In this lab, our focus will shift towards the topic of prediction. The problem
at hand can be summarized as follows: given a new set of predictors x0 , what is the predicted
response? It can be proved to be:

ŷ0 = xT0 β̂.

Here, β̂ is estimated without taking into account the new information provided by x0 , hence
we can express β̂ = (XT X)−1 XT y. However, for this scenario, a point estimate is not sufficient
to make reliable predictions. We need to assess the uncertainty associated with this prediction. For
instance, if we want to predict the high water mark of a river, we may need to construct barriers
that are high enough to withstand floods that could potentially exceed the predicted maximum.
Financial projections are similarly not very useful without a realistic estimate of the associated
uncertainty.
Just before introducing you to confidence intervals, we need to define what we mean by the
predicted mean response and a prediction of a future observation. Let’s suppose we have built a
regression model that predicts the rental prices of houses in a given area based on predictors such
as the number of bedrooms and proximity to a major highway. We have two types of predictions
that can be made:

1. Suppose that a specific house comes to the market where its characteristics are in the vector x0 . Its rental price will be Y0 = xT0 β + ε0 ; without loss of generality we can still assume that ε0 ∼ N (0, σ 2 ). It is easy to prove that E[Ŷ0 − Y0 ] = 0 and Var[Ŷ0 − Y0 ] = σ 2 [xT0 (XT X)−1 x0 + 1] (please attempt to do this). In assessing the variance of the prediction xT0 β̂ we need to take into account the variance of ε0 .

2. On the other hand, suppose our manager asks the question, "What would a house with characteristics x0 rent for on average?" This average rental price is xT0 β and is predicted by xT0 β̂. In this case, it makes sense to take into account only the variance of β̂. Recall that in the first case we are looking for a prediction of a future value, while in the second case we are predicting the mean response.

For (1) a 100(1 − α)% CI for a single future response is


\hat{y}_0 \pm t_{n-(p+1),\alpha/2}\, \hat{\sigma}\, \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.

So far, CIs have been for parameters; under our Frequentist framework, we assume that a parameter θ is a fixed quantity, not a random one. However, a future observation is by definition random. For this reason, the interval above is called a prediction interval. We say that there is
a 95% chance that the future value falls within this interval. This would be incorrect and may be
outrageous for some statisticians (including me) to say for a parameter. Please take care about
what you write in your lab exam.
For (2) a 100(1 − α)% CI is
\hat{y}_0 \pm t_{n-(p+1),\alpha/2}\, \hat{\sigma}\, \sqrt{x_0^T (X^T X)^{-1} x_0}.

The CI for (2) is usually much narrower than the prediction interval. It is tempting to use this
version to make intervals for predicted values; please do not do that. This is a serious mistake,
especially if done for a report in a firm.
Let’s see how to compute these quantities in R:
1 library(faraway)
2 data(fat)
3 lmod <- lm(brozek ~ age + weight + height + neck + chest + abdom + hip
+ thigh + knee + ankle + biceps + forearm + wrist, data= fat)

Measuring body fat is not an easy task. One method requires submerging the body underwater in a tank and measuring the increase in the water level. As many people would prefer not to be submerged underwater, researchers recorded age, weight, height, and 10 body circumference measurements for 252 men with the aim of building a model able to predict body fat.
In the model just fitted, we use brozek as the response (Brozek’s equation estimates percent
body fat from density). Let’s consider a typical man, this is exemplified by the median value of all
the predictors:
1 x <- model.matrix(lmod) # this extracts the design matrix
2 x0 <- apply(x, 2, median)
3 x0

We obtained a vector with the same columns as the design matrix but with just one row. We can obtain ŷ0 using predict:

1 (y0 <- predict(lmod, new=data.frame(t(x0))))

Please take care, as the predict function requires the new data argument to be in the form of a data frame with the same format as the data used to fit the model. Now if we want a 95% interval, we just have to decide whether we are predicting the body fat for one particular man or the mean body fat for all men with these same characteristics:
1 predict(lmod, new=data.frame(t(x0)), interval=’prediction’)

1 predict(lmod, new=data.frame(t(x0)), interval=’confidence’)

Transformation
Transformations can be an incredibly useful tool for improving the fit and correcting some violations (e.g., heteroscedasticity can often be mitigated by applying a log() transformation to the Y values). Another option is to add additional predictors that are functions of the existing predictors, like quadratic or cross-product terms.
Suppose that we are considering the following model:

log(Y ) = β0 + β1 X + ε,

In the original scale of the response, this model becomes (by applying exp() to both sides):

Y = exp(β0 + β1 X) × exp(ε).

Under this framework, the error enters the model in a multiplicative way. Take-home message: using log() can save your model and your marks during the lab exam.
In practice, we may not know how the error enters the model. The best approach is to try
different transformations to get the structural form of the model right and worry about the error
component later. Non-linearity can often be detected by exploring your data through plotting (y
vs xi ). Once a transformation is applied and the model is fitted, we can check the residuals to see
whether they satisfy the conditions required for linear regression. When transforming, you have
to take care when interpreting your data (more details in the lecture notes).
When you use the log on the response variable, the regression coefficients can be interpreted as follows:

log(ŷ) = β̂0 + β̂1 x1 + · · · + β̂p xp ,

ŷ = eβ̂0 eβ̂1 x1 . . . eβ̂p xp .



An increase of one unit in x1 would multiply ŷ by eβ̂1 . Thus, when a log scale is used, the
regression coefficients can be interpreted in a multiplicative rather than an additive manner.
We can recall that log(1 + x) ≈ x (this is obtained by a Taylor expansion around zero). So, for example, suppose we have β̂1 = 0.09; then an increase of one in x1 would lead to about a 0.09 increase in log y, which is approximately a 9% increase in y.
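A hedged sketch of this multiplicative reading, using the gala data purely as an illustration (the choice of predictors here is arbitrary and not part of the original notes):

1 lmod_log <- lm(log(Species) ~ Elevation + Nearest, data=gala)
2 exp(coef(lmod_log)) # multiplicative change in Species for a one-unit increase in each predictor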

Box Cox Transformation

A useful tool you can employ to decide which transformation to apply is the Box-Cox method. This method is designed for strictly positive responses and chooses the transformation to find the best fit
to the data. This method transforms the response y → gλ (y) where the family of transformations
is indexed by λ:

g_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log(y) & \lambda = 0, \end{cases}

For fixed y > 0, gλ (y) is continuous in λ. The idea is to choose λ using the maximum likelihood method. Assuming normality of the errors, the log-likelihood (up to a constant) is:

L(\lambda) = -\frac{n}{2} \log(RSS_\lambda / n) + (\lambda - 1) \sum_i \log(y_i),

where RSSλ is the residual sum of squares when gλ (y) is the response. Transforming the
response can make our model hard to interpret, so we want to be sure to do it only if necessary.
One way to check this is to form a confidence interval of λ. A 100(1 − α)% CI is:

\{\lambda : L(\lambda) > L(\hat{\lambda}) - \tfrac{1}{2}\chi^2_{1,(1-\alpha)}\}.

For each value of λ, a transformation is proposed. Some are represented in the table below:

Lambda   Transformation
-3       Y^{-3}
-2       Y^{-2}
-1       Y^{-1}
-0.5     Y^{-1/2}
0        log(Y)
1        Y
2        Y^2
3        Y^3

Let’s move into R to see how things work. To use the Box-Cox transformation, we need to
install and load the MASS library (I suggest you explore the documentation of MASS as it is an
interesting package with tons of statistical tools implemented).


1 library(MASS)
2 library(faraway)
3 data(savings)
4 lmod <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
5

6 boxcox(lmod, plotit = T, lambda = seq(0.5, 1.5, by = 0.1))

We can see from the plot that there seems to be no good reason to transform our data.
Let’s consider the Galapagos Islands:
1 data(gala)
2

3 lmod <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent,


gala)
4 summary(lmod)
5 plot(lmod)
6

7 boxcox(lmod, lambda = seq(-0.25, 0.75, by = 0.05), plotit = T)

A possible transformation is the square root, as this falls just within the confidence intervals.
1 lmod <- lm(sqrt(Species) ~ Area + Elevation + Nearest + Scruz +
Adjacent, gala)
2 plot(lmod)

1 summary(lmod)

The residuals from the fit with the square-root transformed response seem to be approximately normal (see the min, max, and quartiles), and the residual standard error decreased.
Just some words of warning about Box-Cox transformation:

1. The Box-Cox method is not robust against outliers, so if you get a value of λ̂ = 5, the reason could
be that you are working with data affected by outliers. (Do you see why it’s so important to
explore your data first?)

2. If some yi < 0, it is customary (not an elegant solution) to add a constant to all y’s. Note
that the constant to add should be small (e.g., 0.05, 0.10).

3. There is doubt about whether the estimation of λ counts as an extra parameter to be con-
sidered in the degrees of freedom. This doubt comes from the fact that λ is not a linear
parameter, and its estimation is not part of the least squares fit.

Exercises
Please attempt the following exercises. Recall that to attempt these questions, you need to install
the faraway package first.

Exercise 1
For the prostate data, fit a model with lpsa as the response and the other variables as predic-
tors.

• (a) Suppose a new patient with the following values arrives:

lcavol = 1.44692,
lweight = 3.62301,
age = 65,
lbph = 0.30010,
svi = 0.00000,
lcp = −0.79851,
gleason = 7.00000,
pgg45 = 15.00000.

Predict the lpsa for this patient along with an appropriate 95% CI.

• (b) Repeat the last question for a patient with the same values except that he is age 20.
Explain why the CI is wider.

• (c) For the model of the previous question, remove all the predictors that are not significant at the 5% level. Now recompute the predictions. Which predictions would you prefer? Explain.

Exercise 2
Using the teengamb data, fit a model with gamble as the response and the other variables as
predictors.

• (a) Predict the amount that a male with average (given these data) status, income, and verbal
score would gamble along with an appropriate 95% CI.

• (b) Repeat the prediction for a male with maximal values (for this data) of status, income,
and verbal score. Which CI is wider, and why is this result expected?


• (c) Fit a model with sqrt(gamble) as the response but with the same predictors. Now predict the response and give a 95% prediction interval for the individual in (a). Take care to give your answer in the original units of the response.

• (d) Repeat the prediction for the model in (c) for a female with status=20, income=1,
verbal=10. Comment on the credibility of the result.

References
- Faraway, J. J. (2015). Linear Models with R (2nd ed.). CRC Press.
Chapter 8

Appendix A: Linear Algebra

8.1 Terminology

In these lecture notes a vector¹ is always a column vector; this will be denoted as

\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}

The transpose of a column vector is a row vector and is denoted as a> = (a1 , a2 , . . . , an ). A matrix can be defined as a rectangular array. A matrix A ∈ Rn×k can be written as

 
\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \dots & a_{1k} \\ a_{21} & a_{22} & \dots & a_{2k} \\ a_{31} & a_{32} & \dots & a_{3k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nk} \end{pmatrix}

The first index of the element aij refers to the ith row, and the second index to the jth column. The symbol > denotes the transpose; for A ∈ Rn×k , the transpose A> ∈ Rk×n is

¹ Vectors and matrices are denoted in bold.


 
\mathbf{A}^\top = \begin{pmatrix} a_{11} & a_{21} & \dots & a_{n1} \\ a_{12} & a_{22} & \dots & a_{n2} \\ a_{13} & a_{23} & \dots & a_{n3} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1k} & a_{2k} & \dots & a_{nk} \end{pmatrix}

where the columns of A are the rows of A> , and vice versa. A matrix is said to be a square matrix if n = k. A square matrix is symmetric if A = A> ; this implies that aij = aji . An example of a symmetric square matrix in statistics is the covariance (or correlation) matrix. Such a matrix S looks like

\mathbf{S} = \begin{pmatrix} 1 & 0.5\;(a_{12}) & 0.7\;(a_{13}) \\ 0.5\;(a_{21}) & 1 & 0.3\;(a_{23}) \\ 0.7\;(a_{31}) & 0.3\;(a_{32}) & 1 \end{pmatrix}


where in brackets we denoted the ijth element of the matrix. It follows that its transpose is
 
\mathbf{S}^\top = \begin{pmatrix} 1 & 0.5\;(a_{21}) & 0.7\;(a_{31}) \\ 0.5\;(a_{12}) & 1 & 0.3\;(a_{32}) \\ 0.7\;(a_{13}) & 0.3\;(a_{23}) & 1 \end{pmatrix}


which is the exact same matrix but with indexes swapped.
A square matrix is called diagonal, denoted by D, if aij = 0 for all i ≠ j:

\mathbf{D} = \begin{pmatrix} a_{11} & 0 & \dots & 0 \\ 0 & a_{22} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & a_{nn} \end{pmatrix}
The identity matrix I is a diagonal matrix with all diagonal elements equal to one:

\mathbf{I} = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix}

8.2 Properties
Let us consider a number of vectors a1 to ak ; we can take a linear combination of these vectors. With scalar weights c1 , . . . , ck this produces the vector c1 a1 + c2 a2 + · · · + ck ak , which we can write compactly as Ac, where A ∈ Rn×k can be written as A = (a1 , . . . , ak ) and c = (c1 , . . . , ck )> .
A set of vectors is linearly dependent if any of the vectors can be written as a linear

combination of the others. That is, if there exist values for c1 , . . . , ck not all zero, such that c1 a1 +
· · · + ck ak = 0. Equivalently, a set of vectors is linearly independent if the only solution to

c1 a1 + · · · + ck ak = 0

is the trivial solution


c1 = c2 = · · · = ck = 0

That is, if the only solution to Ac = 0 is c = 0, where 0 = (0, . . . , 0)> is the null vector.
If we consider all possible vectors that can be obtained as linear combinations of the vectors
a1 , . . . , ak , these vectors form a vector space. If the vectors a1 , . . . , ak are linearly dependent,
we can reduce the number of vectors without changing this vector space. The minimal number
of vectors needed to span a vector space is called the dimension of that space. This way, we can
define the column space of a matrix as the space spanned by its columns, and the column rank
of a matrix as the dimension of its column space. Clearly, the column rank can never exceed
the number of columns. A matrix is of full column rank if the column rank equals the number
of columns. The row rank of a matrix is the dimension of the space spanned by the rows of the
matrix. In general, it holds that the row rank and the column rank of a matrix are equal, so we
can unambiguously define the rank of a matrix. Note that this does not imply that a matrix that is
of full column rank is automatically of full row rank ( this is true only if the matrix is square). A
useful result in regression analysis is that for any matrix A

rank(A) = rank(AA> ) = rank(A> A).

8.3 Inverse Matrix


A matrix B, if it exists, is the inverse of a matrix A if AB = BA = I. A necessary requirement for this is that A is a square matrix and has full rank. In this case A is also called invertible or nonsingular, and we define B = A−1 , so that

AA−1 = I and A−1 A = I.

Note that the definition implies that A = B−1 . Thus we have (A−1 )−1 = A. If A−1 does not
exist, we say that A is singular.
Suppose we are asked to solve Ac = d where A ∈ Rn×n , c ∈ Rn and d ∈ Rn . This is a
system of n linear equations with n unknowns. If A−1 exists, we can write

A−1 Ac = c = A−1 d

to obtain the solution. If A is not invertible, the system of linear equations has linear dependencies. Two possibilities can happen. Either more than one vector c satisfies Ac = d, so no
unique solution exists, or the solutions are inconsistent, so there is no solution to the system. If
d ≡ 0, only the first possibility remains.
It is straightforward to derive that

(A−1 )> = (A> )−1

and

(AB)−1 = B−1 A−1
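As a small worked example (added for illustration, not in the original notes), for a general 2 × 2 matrix the inverse can be written explicitly:

A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad A^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}, \qquad \text{provided } ad - bc \neq 0,

and one can check directly that AA^{-1} = I.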

8.4 Idempotent Matrices


A special class of matrices is that of symmetric and idempotent matrices. A matrix P is symmetric if P> = P and idempotent if PP = P. A symmetric idempotent matrix P has the interpretation of a projection matrix. This means that the projection vector Px is in the column space of P, while the residual vector x − Px is orthogonal to any vector in the column space of P.
A projection matrix that projects upon the column space of a matrix A can be constructed as P = A(A> A)−1 A> . Clearly, if we try to project twice onto the same space, the second projection has no additional effect: PPx = Px (try to prove this result as an exercise). The residual is obtained through the residual-maker matrix M = I − P, which satisfies MP = PM = 0 and MM = M = M> . Thus the vectors Mx and Px are orthogonal. The only nonsingular projection matrix is the identity matrix. All other projection matrices are singular, each having rank equal to the dimension of the space upon which they project.

8.5 Eigenvalues and Eigenvectors


Let A be a symmetric n × n matrix. Consider the problem of finding combinations of a vector c (other than the trivial solution) and a scalar λ that satisfy

Ac = λc.

At first glance, this could seem a useless problem to pose for a statistician. However, what we are trying to do is to compress all the information contained in A into λ. This means that the solutions of the system, if they exist, are the only ingredients needed to reconstruct A. In general there are n solutions λ1 , . . . , λn , called the eigenvalues of A, corresponding to n vectors c1 , . . . , cn , called eigenvectors. If ci is a solution, so is any scalar multiple of it, so the eigenvectors are defined only up to a constant. The eigenvectors of a symmetric matrix are orthogonal, that is, ci> cj = 0 for all i ≠ j. A singular matrix has at least one zero eigenvalue. In general, the rank of a symmetric
matrix corresponds to the number of nonzero eigenvalues.
A symmetric matrix is called positive definite if all its eigenvalues are positive. It is called positive semi-definite if all its eigenvalues are non-negative. A positive definite matrix can be inverted. If A is positive definite, it holds for any nonzero vector x that

x> Ax > 0

The determinant of a symmetric matrix equals the product of its n eigenvalues. The determi-
nant of a positive definite matrix is positive. A symmetric matrix is singular if the determinant is
zero.
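A small worked example (added for illustration): the symmetric matrix

A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

has eigenvalues λ1 = 3 and λ2 = 1, with orthogonal eigenvectors proportional to (1, 1)> and (1, −1)> . Both eigenvalues are positive, so A is positive definite, and its determinant equals 3 × 1 = 3, the product of the eigenvalues.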

8.6 Differentiation
Let x ∈ Rn and c ∈ Rn ; then c> x is a scalar. Let us consider c> x as a function of x. Then we can take the derivative with respect to each of the elements in x:

\frac{\partial\, c^\top x}{\partial x} = c,

which is a column vector of n derivatives. More generally, for a vector-valued function Ax, where A is a matrix,

\frac{\partial\, Ax}{\partial x} = A^\top,

where the element in column i, row j of this matrix is the derivative of the jth element of the function Ax with respect to xi . Further,

\frac{\partial\, x^\top A x}{\partial x} = 2Ax

for a symmetric matrix A. If A is not symmetric, we have

\frac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x.
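A short worked check (added for illustration) for n = 2 and a symmetric A:

x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad A = \begin{pmatrix} a & b \\ b & c \end{pmatrix}, \qquad x^\top A x = a x_1^2 + 2b x_1 x_2 + c x_2^2,

so that ∂(x> Ax)/∂x = (2a x1 + 2b x2 , 2b x1 + 2c x2 )> = 2Ax, in agreement with the rule above.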

8.7 Expectation and Covariance Operators


Let Xij , i = 1, 2, . . . , m; j = 1, 2, . . . , n be a set of random variables with expectations E(Xij ), which in matrix form become

E(\mathbf{X}) = \begin{pmatrix} E(X_{11}) & E(X_{12}) & \dots & E(X_{1n}) \\ E(X_{21}) & E(X_{22}) & \dots & E(X_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_{m1}) & E(X_{m2}) & \dots & E(X_{mn}) \end{pmatrix}

Theorem 4. If A ∈ Rl×m , B ∈ Rn×p and C ∈ Rl×p are matrices of constants, then

E[AXB + C] = AE[X]B + C

(Proof in Seber (2003).) This important result also holds if X ∈ Rm (an m-dimensional vector); then E[AX] = AE[X]. Let A and B be m × n matrices of constants, and X and Y be n × 1
vectors of random variables, then

E[AX + BY] = AE[X] + BE[Y].

In a similar manner we can generalize the notions of covariance and variance for vectors. If X ∈ Rm and Y ∈ Rn are vectors of random variables, then we define the generalized covariance operator Cov(·, ·) as follows.

Theorem 5. If E(X) = µX and E(Y) = µY then

Cov[X, Y] = E[(X − µX )(Y − µY )> ]

When X = Y, Cov[X, X], written as Var[X], is called the variance (variance-covariance or dispersion) matrix of X:
 
Var(\mathbf{X}) = \begin{pmatrix} Var(X_1) & Cov(X_1, X_2) & \dots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \dots & Cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \dots & Var(X_n) \end{pmatrix}
Since Cov[Xi , Xj ] = Cov[Xj , Xi ], the matrix above is symmetric.
From Theorem 5, with Y = X we have

V ar[X] = E[(X − µX )(X − µX )> ]

which, after an expansion, leads to

Var[X] = E[XX> ] − µX µX>

These last two equations are natural generalizations of the univariate results.



Theorem 6. If X ∈ Rm and Y ∈ Rn are vectors of random variables, and A and B are l × m and p × n matrices of constants, then

Cov[AX, BY] = ACov[X, Y]B>

(Proof in Seber (2003).) From the theorem just stated we have the special cases

Cov(AX, Y) = A Cov(X, Y)

Cov(X, BY) = Cov(X, Y) B>

Of particular importance is the following result, obtained by setting B = A and Y = X:

Var[AX] = Cov[AX, AX] = A Cov[X, X] A> = A Var[X] A>

Theorem 7. If X is a vector of random variables such that no element of X is a linear combination of the remaining elements [i.e. there do not exist a (≠ 0) and b such that a> X = b for all values of X = x], then Var[X] is a positive definite matrix.

(Proof in Seber (2003)).


Quadratic forms play a major role in multivariate linear regression. In particular, we will
frequently need to find the expected value of a quadratic form using the following theorem.

Theorem 8. Let X ∈ Rn be a vector of random variables, and let A ∈ Rn×n be a symmetric matrix. If E[X] = µ and Var[X] = Σ then

E[X> AX] = tr(AΣ) + µ> Aµ

Proof.

E[X> AX] = tr(E[X> AX])


= E[tr(X> AX)]
= E[tr(AXX> )]
= tr(E[AXX> ])
= tr(AE[XX> ])
= tr(A(V ar[X] + µµ> ))
= tr(AΣ) + tr(Aµµ> )
= tr(AΣ) + µ> Aµ

We can deduce two special cases. First, by setting Y = X − B and noting that Var[Y] = Var[X], we have

E[(X − B)> A(X − B)] = tr(AΣ) + (µ − B)> A(µ − B)

Second, if Σ = σ 2 I then tr(AΣ) = σ 2 tr(A). Thus

E[X> AX] = σ 2 (sum of the coefficients of Xi2 ) + (x> Ax)\big|_{x=µ}
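For instance (an added illustration, not part of Seber's statement), taking A = I and Σ = σ 2 I gives

E[X> X] = \sum_{i=1}^{n} E[X_i^2] = n\sigma^2 + \mu^\top \mu,

which matches the familiar univariate fact that E[Xi2 ] = σ 2 + µi2 for each component.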


Chapter 9

Appendix B: Multivariate Normal Distribution

The multivariate normal distribution plays a fundamental role in linear regression. While real data
are never exactly multivariate normal, the normal density is often a useful approximation to the
"true" population distribution.
The main advantage of the multivariate normal distribution lies in the fact that it is mathematically tractable and "nice" results can be derived and studied from it. This is frequently not the case for other distributions, especially in the multivariate domain. Of course, mathematical attractiveness
per se is of little use to the practitioner. It turns out, however, that normal distributions are useful
in practice for two reasons:

• First, the normal distribution serves as a bona fide population model in some instances;

• Second, the sampling distributions of many multivariate statistics are approximately normal,
regardless of the form of the parent population, because of a central limit effect.

The multivariate normal density is a generalization of the univariate normal density when
p ≥ 2. Recall that the univariate normal distribution, with mean µ and variance σ 2 , can be written
as

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x-\mu)/\sigma]^2/2}, \qquad -\infty < x < \infty
It is convenient to denote the normal density function with mean µ and variance σ 2 by N (µ, σ 2 ).
The term
\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)\,(\sigma^2)^{-1}\,(x-\mu)
in the exponent of the univariate normal density function measures the square of the distance
from x to µ in standard deviation units. This can be generalized for a p×1 vector x of observations


on several variables as

(x − µ)> Σ−1 (x − µ)

The p × 1 vector µ represents the expected value of the random vector X, and the p × p matrix
Σ is the variance-covariance matrix of X. We shall assume that the symmetric matrix Σ is positive
definite, so the expression above is the square of the generalized distance from x to µ.
The multivariate normal density is obtained by replacing the univariate distance in the density function by the multivariate generalized distance above. When this replacement is made, the univariate normalizing constant (2π)^{-1/2} (σ^2)^{-1/2} must be changed to a more general constant that makes
the volume under the surface of the multivariate density function unity for any p. This is necessary
because, in the multivariate case, probabilities are represented by volumes under the surface over
regions defined by intervals of the xi values. It can be shown that this constant is (2π)−p/2 |Σ|−1/2 ,
and consequently, a p-dimensional normal density for the random vector X> = [X1 , X2 , . . . , Xp ]
has the form

f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})/2}
where −∞ < xi < ∞, i = 1, 2, . . . , p. We shall denote this p-dimensional normal density by
Np (µ, Σ), which is analogous to the normal density in the univariate case.
The key properties of a Multivariate normal random variable X are:

1. Linear combinations of the components of X are normally distributed.

2. All subsets of the components of X have a (multivariate) normal distribution.

3. Zero covariance implies that the corresponding components are independently distributed.

4. The conditional distributions of the components are (multivariate) normal.
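As a concrete special case (added for illustration), for p = 2 with Σ = diag(σ12 , σ22 ) the density factorizes,

f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\!\left\{ -\frac{(x_1-\mu_1)^2}{2\sigma_1^2} - \frac{(x_2-\mu_2)^2}{2\sigma_2^2} \right\} = f(x_1)\, f(x_2),

which is exactly the content of property 3: zero covariance between the components implies independence.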
