
MATH5806 Applied Regression Analysis

Lecture 1 - Introduction to Regression Analysis

Boris Beranger
Term 2, 2021

1/70
Some comments about the course

Lecturer: Dr Boris Beranger

Lecturer, School of Mathematics and Statistics


Associate Investigator, ARC Centre of Excellence for Mathematical and Statistical
Frontiers

Previously

• PhD at Université Pierre et Marie Curie (Paris) & UNSW Sydney
• Postdoc at UNSW Sydney

2/70
Contact

You can use any of the following to reach me:

E-mail [email protected]
Office 4103
Webpage www.borisberanger.com
Consultation hours Thursday, 15:00-16:00

3/70
Lectures: Wednesday, 12:00 - 15:00

• One lecture per week: live, video recording available on Blackboard Collaborate

• Venue: online (Blackboard Collaborate)

• Tutorial: Thursday, 16:00-17:00 (online)

All lecture and tutorial material, assignments and communication will be posted on the
course Moodle page.

Please read the course outline at the start of the course.

4/70
Learning support

Textbooks:

• A. J. Dobson & A. G. Barnett (2018), An Introduction to Generalised Linear Models,


Chapman and Hall.
• G. James, D. Witten, T. Hastie & R. Tibshirani (2013) An Introduction to Statistical
Learning with Applications in R, Springer.
• T. Hastie, R. Tibshirani & J. Friedman (2009), The Elements of Statistical Learning,
Springer.

5/70
Learning support

Software:

We will use the R software. Download and install it.

We will also use RStudio. Download and install it.

6/70
Assessment Details

Task               When due   Weight   Duration

Assignment 1       Week 3     15%      1 week
Mid-session test   Week 5     15%      1 hour
Assignment 2       Week 9     15%      2 weeks
Final examination  ?          55%      3 hours

These will have a mathematical and computational component.


Late submissions of assignments will not be accepted.

7/70
Skills to be developed

At the end of this course, you will be able to:

• choose the appropriate regression technique to analyse a given data set;


• choose the appropriate R package to apply this technique effectively;
• interpret the output and the results;
• understand the theoretical foundations on which these techniques rely;
• understand the models, hypotheses, intuitions, and strengths and weaknesses of
the various approaches;
• communicate the results effectively, in written or oral form.

8/70
My philosophy

• Understand the theory deeply enough to use any statistical tool efficiently and to
choose the most appropriate one.
• Become able to find or create the appropriate R function to apply these tools to real
data.
• Understand the output of these functions and its link to the mathematical formulas.
• Become able to interpret the output.

That being said, I will go into great detail on the theory only for the linear model, in
order to leave time to play with modern and more sophisticated nonlinear models.

Simulations will also play an important role in order to understand several concepts.
9/70
Notation

• Random variables: Y1, Y2, ..., Yn (not italic, not bold, capital letters)
• Realisations of random variables: y1, y2, ..., yn (italic, not bold, small letters)
• Greek letters: parameters β, θ, etc.
• Estimators: β̂, θ̂
• Vectors: y = [Y1, Y2, ..., Yn]⊤, y = [y1, y2, ..., yn]⊤ (not italic / italic, bold, small letters)
• Matrices: X, the n × p matrix with (i, j) entry Xij, i = 1, ..., n, j = 1, ..., p (not italic, bold, capital letters)
• Transpose: y⊤
10/70
Chapter 1 - Introduction to Regression Analysis

1.1 - What is Regression Analysis?

1.2 - Estimation of parametric models

1.3 - Model Fitting

1.4 - Assessing Model Accuracy

11/70
Chapter 1 - Introduction to Regression Analysis

1.1 - What is Regression Analysis?

1.2 - Estimation of parametric models

1.3 - Model Fitting

1.4 - Assessing Model Accuracy

12/70
What is Regression Analysis?

Regression Analysis investigates the functional relationship between statistical


variables. Data are usually multiple observations of a random vector (Y, x).

• x = (X1, ..., Xp)⊤ is a p-vector of variables termed: explanatory variables,


regressors, predictors, input variables or independent variables.

• Y is called: response variable, target variable, output variable, outcome variable or


dependent variable. It may be continuous (∈ R), discrete (∈ {1, . . . , K }) or ordinal
(ordered discrete).

The response Y is treated as a random variable, while the explanatory variables are
usually treated as fixed, observed values.
13/70
Response and explanatory variables

Response and explanatory variables are measured on one of the following scales:

• nominal: when Y is classified into categories, which can be only two (binary
outcome) or several (multinomial outcome)
• ordinal: when Y is recorded in ordered classes
• continuous: when Y is measured on a continuous scale, at least in theory.

Nominal and ordinal data are discrete variables and can be qualitative or quantitative
(e.g. counts). Continuous data are quantitative.

x can also be quantitative or qualitative. In particular, when an explanatory variable is
qualitative, it is often called a factor. A quantitative explanatory variable is called a
covariate.
14/70
Response and explanatory variables

Response                 Explanatory                Method

Continuous               Binary                     t-test
                         Nominal, >2 categories     ANOVA
                         Ordinal                    ANOVA
                         Continuous                 Multiple regression
Binary                   Categorical                Contingency tables
                         Continuous                 Logistic or probit regression
Nominal, >2 categories   Nominal                    Contingency tables
                         Categorical & Continuous   Nominal logistic regression
Ordinal                  Categorical & Continuous   Ordinal logistic regression
Counts                   Categorical                Log-linear models
                         Categorical & Continuous   Poisson regression
15/70
Regression

We aim to find a “good” functional relationship of the form Y = f (x) + ε, where ε is a


random error term independent of x with mean zero.

Figure 1: Regression of Y (vertical, continuous) on (X1 , X2 ) (horizontal).

16/70
“Applied” Regression Analysis

Applied means: “If there is no way to calculate it, we won’t talk about it.”

On the other hand, we want to understand the underlying computational methods and
algorithms. This will be impossible without understanding the theory.

“There is nothing more practical than a good theory.” (Kurt Lewin)

17/70
General Framework of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. It splits into
supervised and unsupervised methods. All the methods presented in this course are
within the framework of supervised learning.

Regression fits into the framework of supervised methods, which require a statistical
model for predicting or estimating an output based on one or more inputs.

In contrast, unsupervised methods cover situations where there are inputs but no
supervising output. In this type of analysis we learn about the relationships and structure
of the data; cluster analysis is an example of unsupervised analysis.

18/70
Knowledge assumed

You need to have a fairly good understanding of linear algebra:


• vector spaces,
• linear independence,
• matrix multiplication,
• diagonalisation,
• projections,
• ...
You also need to know multivariate calculus:
• partial derivatives,
• critical points,
• integrals,
• ...
19/70
Knowledge assumed

It’s also good to have some previous exposure to computational software (R):

• data types,
• manipulation of arrays,
• some idea of optimisation,
• ...

And finally, you need to know some probability theory and statistics:

• important distributions (normal, Poisson, ...),


• conditional probability,
• conditional expectation,
• covariance matrix,
• asymptotic normality of maximum likelihood estimators, ...
20/70
Scope of the Course

1. Introduction: concepts such as prediction, inference, estimation methods,


maximum likelihood estimation, least squares estimation, simple linear regression,
exponential family of distributions, introduction to R;
2. Estimation and Inference: methods of classical estimation and model fitting;
3. General Linear Models: multiple regression; ANOVA (Analysis of Variance), used
for a continuous response variable and categorical explanatory variables (factors);
ANCOVA (Analysis of Covariance), used when at least one of the explanatory
variables is continuous;
4. Generalized Linear Models (GLMs): a flexible general framework that includes linear
regression and logistic regression as special cases; methods of estimation, model
fitting and statistical inference are discussed;

21/70
Scope of the Course

5. Generalized Linear Models (GLMs), continued: Poisson regression as a further
special case; methods of estimation, model fitting and statistical inference are
discussed; linear discriminant analysis;
6. Resampling Methods: how to estimate the test error; cross-validation and the
bootstrap;
7. Shrinkage Methods: by shrinking the coefficient estimates towards zero, their
variance is reduced. Ridge regression and the lasso are the two best-known
examples of shrinkage methods;

22/70
Scope of the Course

8. Nonlinear Regression: polynomial regression, step functions, regression splines,
smoothing splines, local regression and generalized additive models. Nonlinear
models are more complex in terms of interpretation and inference, but their
advantage lies in their predictive power. Polynomial regression extends the linear
model by raising the original predictors to a power; for instance, cubic regression
uses the three variables X, X² and X³ as predictors. Step functions cut the range of
a variable into K distinct regions, which produces a piecewise constant fit.
Regression splines are an extension of polynomials and step functions and involve
dividing the range of X into K distinct regions; within each region a polynomial
function is fit to the data. Smoothing splines are similar to regression splines and
result from minimising the residual sum of squares subject to a smoothness penalty.
In local regression the regions are allowed to overlap.
9. Generalized Additive Models: extend the methods above to multiple predictors.
23/70
The Purpose of Regression Analysis

Regression analysis is performed for two reasons: prediction or inference.

Prediction
In many situations X is available but the output Y cannot be easily obtained. Since
the error term averages to zero, we can predict Y using

Ŷ = fˆ(X ), (1)

where fˆ is the estimate for f , and Ŷ represents the resulting prediction for Y .
The estimate fˆ is characterised by a reducible error and by an irreducible error.

E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²] = [f(X) − f̂(X)]² + Var(ε),   (2)

where [f(X) − f̂(X)]² is the reducible error (it can be decreased by choosing a better
estimate f̂) and Var(ε) is the irreducible error.
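
As a quick illustration (a minimal sketch, not from the slides; the linear f and the
simulated data are assumptions), the decomposition in (2) can be checked by simulation in R:

# Simulated data with known f(X) = 1 + 2*X and irreducible noise variance 0.25.
set.seed(1)
f <- function(x) 1 + 2 * x
x_train <- runif(200); y_train <- f(x_train) + rnorm(200, sd = 0.5)
fit <- lm(y_train ~ x_train)

x0 <- runif(10000); y0 <- f(x0) + rnorm(10000, sd = 0.5)
y0_hat <- predict(fit, newdata = data.frame(x_train = x0))

mean((y0 - y0_hat)^2)            # estimated test MSE
mean((f(x0) - y0_hat)^2) + 0.25  # approximately reducible error + Var(epsilon)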
24/70
Inference

The goal of inference is to understand the relationship between X and Y . How Y


changes as a function of X1 , X2 , . . . , Xp .

1. Which predictors are associated with the response?


2. What is the relationship between the response and each predictor?
3. Is the relationship between Y and each predictor linear or more complex?

Depending on the purpose of the analysis, different methods of estimating f may be
preferable. For prediction, nonlinear models may be more appropriate (although in some
cases overfitting causes problems), while for inference, linear models may be more
useful.

25/70
Methods of estimation

Two main types of approaches

• parametric
1. Assumption on the functional form of f

f (x) = β0 + β1 X1 + . . . + βp Xp (3)

2. Method to fit the model (estimating β0 , β1 , . . . , βp )


• nonparametric: no explicit assumptions about the functional form of f, at the price
of not having a small, fixed number of parameters (i.e. larger sample sizes are
needed)
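
To make the contrast concrete, here is a minimal sketch (simulated data, not from the
slides) fitting a parametric linear model and a nonparametric local-regression fit to the
same data:

set.seed(2)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

fit_param    <- lm(y ~ x)        # assumes f(x) = beta0 + beta1 * x
fit_nonparam <- loess(y ~ x)     # no explicit assumption on the form of f

plot(x, y, col = "grey")
ord <- order(x)
lines(x[ord], fitted(fit_param)[ord], lwd = 2)
lines(x[ord], fitted(fit_nonparam)[ord], lwd = 2, lty = 2)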

26/70
Chapter 1 - Introduction to Regression Analysis

1.1 - What is Regression Analysis?

1.2 - Estimation of parametric models

1.3 - Model Fitting

1.4 - Assessing Model Accuracy

27/70
Maximum likelihood estimation

y = [Y1, Y2, ..., Yn]⊤ denotes a random vector.

f(y; θ) denotes the joint probability density function of the Yi, which depends on the
vector of parameters θ = [θ1, θ2, ..., θp]⊤.

The likelihood function L(θ; y ) is algebraically the same as f (y ; θ) but the emphasis
shifts to parameters θ while y stays fixed.

Remark: L is itself a random variable!

Do not mix up f(·; θ) above with the function f(·) of the model y = f(x) + ε!

28/70
Maximum likelihood estimation

Definition
The maximum likelihood estimator of θ is the value θ̂ which maximizes the likelihood
function, that is

L(θ̂, y ) ≥ L(θ, y ) for all θ∈Θ (parameter space). (4)

Equivalently, θ̂ is the value that maximizes the log-likelihood function

l(θ, y ) = ln L(θ, y ), (5)

since the logarithmic function is monotonic.

29/70
Maximum likelihood estimation

The estimate θ̂ is obtained by differentiating the log-likelihood function with respect to
each element θj of θ and solving the simultaneous equations

    ∂l(θ, y)/∂θj = 0,   j = 1, ..., p.   (6)

If the matrix of second derivatives

    ∂²l(θ, y)/(∂θj ∂θk)   (7)

evaluated at θ = θ̂ is negative definite, then θ̂ maximizes the log-likelihood function in
the interior of Θ.
Note that it is also necessary to check if there are any values of θ at the edges of the
parameter space Θ that give local maxima of l(θ, y): when all local maxima have been
identified, the value of θ̂ corresponding to the largest one is the MLE.
30/70
Invariance property of MLE

Invariance: If g (θ) is any function of the parameters θ, then the maximum likelihood
estimator of g (θ) is g (θ̂).
Other properties: consistency, sufficiency, asymptotic efficiency, asymptotic normality.

Figure 2: Example of a log-likelihood function. Source: Platen & Rendek (2008)
31/70
Example - Poisson distribution

Example
Let Y1, Y2, ..., Yn be independent random variables with Poisson distribution

    f(yi; θ) = θ^yi e^(−θ) / yi!,   yi = 0, 1, 2, ...,

with the same parameter θ.
Find the MLE of θ.
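
A minimal sketch of a numerical check (not part of the slides; the simulated counts are
an assumption): the analytic answer, θ̂ = ȳ, can be compared with a numerical
maximisation of the log-likelihood.

set.seed(3)
y <- rpois(50, lambda = 2)     # hypothetical sample with true theta = 2

negloglik <- function(theta) -sum(dpois(y, lambda = theta, log = TRUE))

optimize(negloglik, interval = c(0.01, 10))$minimum   # numerical MLE
mean(y)                                               # analytic MLE (sample mean)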

32/70
Least Squares Estimation

Let Y1 , . . . , Yn be independent r.v. with expected values µ1 , . . . , µn which are


functions of the parameter vector (that we want to estimate) β = [β1, ..., βp]⊤,
p < n. Thus E(Yi ) = µi (β).

The simplest method of least squares consists of finding the estimator β̂ that
minimizes the sum of squares of the differences between the Yi and their expected values,

    SS = Σ_{i=1}^{n} [Yi − μi(β)]².   (8)

β̂ is obtained by differentiating SS with respect to each element βj and solving

    ∂SS/∂βj = 0,   j = 1, ..., p.   (9)
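
For a simple linear mean μi(β) = β1 + β2 xi, the minimisation of (8) can be carried out
numerically and compared with the closed-form answer from lm() (a minimal sketch; the
model and simulated data are assumptions):

set.seed(4)
x <- runif(100)
y <- 1 + 3 * x + rnorm(100, sd = 0.5)

SS <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)   # sum of squares (8)

optim(c(0, 0), SS)$par   # numerical least squares estimates of (beta1, beta2)
coef(lm(y ~ x))          # closed-form least squares estimates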
33/70
Weighted Least Squares

If the Yi have variances σi² that are not all equal, it may be desirable to minimize the
weighted sum of squared differences

    WSS = Σ_{i=1}^{n} wi [Yi − μi(β)]²,   (10)

where the weights are wi = (σi²)^{−1}. In this way the observations that are less reliable
have less influence on the estimates.
More generally, if y = [Y1, ..., Yn]⊤ is a random vector with mean vector
μ = [μ1, ..., μn]⊤ and variance-covariance matrix V, then

    WSS = (y − μ)⊤ V^{−1} (y − μ).   (11)
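
In R, weighted least squares with wi = 1/σi² can be obtained through the weights
argument of lm(). A minimal sketch on simulated heteroscedastic data (the data are an
assumption):

set.seed(5)
x <- runif(100)
sigma <- 0.2 + x                        # standard deviation increases with x
y <- 1 + 3 * x + rnorm(100, sd = sigma)

fit_ols <- lm(y ~ x)                          # ordinary least squares
fit_wls <- lm(y ~ x, weights = 1 / sigma^2)   # weighted least squares, w_i = 1/sigma_i^2
rbind(ols = coef(fit_ols), wls = coef(fit_wls))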

34/70
Comments

1. The method of least squares, in contrast to maximum likelihood estimation, can be
used without making assumptions about the distribution of the response variables Yi.
2. For many situations maximum likelihood and least squares estimates are identical.
3. In many cases numerical methods are used for parameter estimation.

Read Dobson pages 18–22.

35/70
Chapter 1 - Introduction to Regression Analysis

1.1 - What is Regression Analysis?

1.2 - Estimation of parametric models

1.3 - Model Fitting

1.4 - Assessing Model Accuracy

36/70
Model Fitting

Model fitting involves four steps

• Model specification: an equation linking the response and the explanatory


variables and a probability distribution for the response variable
• Estimation of the parameters of the model
• Checking the adequacy of the model
• Inference: calculating confidence intervals, testing hypotheses

37/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

1. Observation: women living in country areas tend to have fewer consultations with
GPs than women who live near a wide range of health services
2. Hypotheses: is this because they are healthier or because of structural factors?

38/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

Group of study:

• Women living in country towns (town group) or in rural areas (country group) in
NSW
• Women aged 70-75 years
• Same socio-economic status
• ≤ 3 GP visits in 1996

Which distribution is suitable for these data?

39/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

Let Yjk be a r.v. representing the number of conditions for woman k in group j (j = 1
town group, j = 2 country group).

Yjk ∼ Pois(θj ) k = 1, . . . , Kj

The question of interest can be formalised as

H0 : θ1 = θ2 = θ −→ E[Yjk] = θ
H1 : θ1 ≠ θ2 −→ E[Yjk] = θj

40/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

If H0 is true:

    l0(θ; y) = Σ_{j=1}^{2} Σ_{k=1}^{Kj} (yjk log θ − θ − log yjk!)

and the MLE is

    θ̂ = (1/N) Σ_{j=1}^{2} Σ_{k=1}^{Kj} yjk,

where N = K1 + K2. Here θ̂ = 1.184, with l̂0 = l0(θ̂; y) = −68.3868.

Implement these calculations in R
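
A minimal sketch of these calculations (the vectors y_town and y_country holding the
observed counts for the two groups are assumptions; the actual data are in Dobson &
Barnett):

# Assumed input: y_town and y_country contain the observed counts for each group.
y <- c(y_town, y_country)
theta0_hat <- mean(y)                                     # MLE of theta under H0
l0_hat <- sum(dpois(y, lambda = theta0_hat, log = TRUE))  # maximised log-likelihood l0
c(theta0_hat = theta0_hat, l0_hat = l0_hat)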

41/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

If H1 is true:

    l1(θ1, θ2; y) = Σ_{k=1}^{K1} (y1k log θ1 − θ1 − log y1k!)
                  + Σ_{k=1}^{K2} (y2k log θ2 − θ2 − log y2k!)

and the MLEs are

    θ̂1 = (1/K1) Σ_{k=1}^{K1} y1k,   θ̂2 = (1/K2) Σ_{k=1}^{K2} y2k,

with θ̂1 = 1.423 and θ̂2 = 0.913, and l̂1 = l1(θ̂1, θ̂2; y) = −67.0230.

Implement these calculations in R
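
Continuing the previous sketch (same assumed vectors y_town and y_country):

theta1_hat <- mean(y_town)      # MLE of theta1 under H1
theta2_hat <- mean(y_country)   # MLE of theta2 under H1
l1_hat <- sum(dpois(y_town,    lambda = theta1_hat, log = TRUE)) +
          sum(dpois(y_country, lambda = theta2_hat, log = TRUE))
c(theta1_hat = theta1_hat, theta2_hat = theta2_hat, l1_hat = l1_hat)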


42/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

Is the difference statistically significant?

Remark: l1 ≥ l0 because one more parameter has been fitted

The adequacy of the model is usually evaluated by looking at the (standardised)
residuals

    r = (Y − θ̂) / √θ̂.   (12)

43/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

Reproduce these plots in R
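
The original plots are not reproduced here; one plausible version (a sketch, assuming the
plots show the standardised residuals under each model and continuing the previous
chunks) is:

# Standardised residuals under H0 (common theta) and H1 (separate thetas).
r0 <- (y - theta0_hat) / sqrt(theta0_hat)
r1 <- c((y_town    - theta1_hat) / sqrt(theta1_hat),
        (y_country - theta2_hat) / sqrt(theta2_hat))

par(mfrow = c(1, 2))
plot(r0, main = "Residuals under H0", ylab = "standardised residual")
plot(r1, main = "Residuals under H1", ylab = "standardised residual")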


44/70
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)

Assuming independent Poisson data, the standardised residuals are approximately
distributed as a standard normal N(0, 1); then ri² ∼ χ²_1 and

    Σ_{i=1}^{n} ri² = Σ_{i=1}^{n} (Yi − θ̂i)² / θ̂i ∼ χ²_m,   (13)

where χ²_m is a χ² distribution with m degrees of freedom and m = n − p.

Under H0, Σ_{i=1}^{N} r0i² = 46.8457 and m = K1 + K2 − 1 = 26 + 23 − 1 = 48.
Under H1, Σ_{i=1}^{N} r1i² = 43.6304 with m = K1 + K2 − 2 = 26 + 23 − 2 = 47.

For now, it is enough to say that 46.8457 − 43.6304 = 3.2153 seems small, but later
we will quantify this interpretation.
45/70
Relating income to years of education

Let’s analyse the dataset Income available in the R package ISLR

[Figure: two scatter plots of Income (vertical axis) against Years of Education
(horizontal axis).]

46/70
Relating income to years of education

Predicting

• Input: years of education (X1 ) and seniority (X2 )


• Output: income (Y)

Hypothesis: there is some relationship between Y and x = (X1 , X2 )

Y = f (x) + ε (14)

• f is an unknown function (systematic part)


• ε is an error term

47/70
Relating income to years of education

Figure 3: Plot of income as a function of years of education and seniority. The blue
surface represents the true underlying relationship (simulated data).
48/70
Relating income to years of education

We can fit the linear model: income = β0 + β1 × education + β2 × seniority + ε
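
In R this fit can be obtained with lm(). A minimal sketch (the data frame income_df and
its column names are assumptions):

# Assumed input: a data frame income_df with columns income, education, seniority.
fit <- lm(income ~ education + seniority, data = income_df)
coef(fit)      # estimates of beta0, beta1, beta2
summary(fit)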


Figure 4: The yellow surface represents β̂0 + β̂1 × education + β̂2 × seniority
49/70
Relating income to years of education

Nonparametric Model: Thin-Plate Spline


How to choose the level of smoothness?
[Figure: two thin-plate spline fits of Income as a function of Years of Education and
Seniority, with different levels of smoothness.]
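
One way to fit such a surface in R is a thin-plate regression spline via the mgcv package,
with the smoothness chosen automatically by penalised likelihood. A minimal sketch
(income_df is the same assumed data frame as above):

library(mgcv)

# Thin-plate regression spline in education and seniority; the smoothing
# parameter is selected by REML rather than fixed in advance.
fit_tps <- gam(income ~ s(education, seniority, bs = "tp"),
               data = income_df, method = "REML")
summary(fit_tps)
# vis.gam(fit_tps, view = c("education", "seniority"), theta = 30)  # surface plot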

50/70
What have we learnt from these examples?

1. What is the scale of measurement?


2. What is a reasonable distribution to model the data?
3. What is the relationship with other variables?
E[Y] = α + βx
log[E(Y)] = α + β sin(γx)
4. What is the best parameter estimation process? MLE, Least Squares, Bayesian
methods
5. Why choose a restrictive (parametric) method instead of a very flexible
(nonparametric) approach?
6. Model checking: residual checking, plots, checking of the assumptions, Occam's
razor; overfitting is not good for inference, and it can also be bad for
prediction!
51/70
Don’t under-evaluate exploratory statistics!

Wage dataset: wages for a group of males from the Atlantic region of the US (available
in the R package ISLR)

Figure 5: Graphical representations of wage as a function of age, year and education


52/70
Don’t under-evaluate exploratory statistics!

Few comments

1. Many statistical learning methods are relevant and useful in a wide range of
disciplines, beyond just statistical sciences.
2. Statistical learning should not be viewed as a series of black boxes.
3. While it is important to know what job is performed by each tool, it is not
necessary to have the skills to construct the machine inside the box.
4. We will work on real-world problems.

53/70
Chapter 1 - Introduction to Regression Analysis

1.1 - What is Regression Analysis?

1.2 - Estimation of parametric models

1.3 - Model Fitting

1.4 - Assessing Model Accuracy

54/70
Measuring the quality of fit

We need to measure how well the model predictions match the data.

In a regression setting, this is done by the mean squared error (MSE)

    MSE = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))².   (15)

However, a low MSE can hide problems of overfitting on the dataset at hand. What we
really care about is the accuracy of the predictions when the method is applied to
unseen data.
Suppose we have (training) observations {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} on which we
estimate f (x); then we obtain estimates fˆ(x1 ), fˆ(x2 ), . . . , fˆ(xn ).
55/70
Measuring the quality of fit

We want to measure the accuracy of the model on new input (test) variables x0, to
obtain the test MSE

    Ave[(f̂(x0) − y0)²],

which is the average squared prediction error over the test observations (x0, y0).
Then we can

• select the model which minimises the test MSE, if test observations are available
• select the model which minimises the training MSE, if test observations are not
available
• use an estimation method for the test MSE, such as cross-validation
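
The gap between training and test MSE is easy to see by simulation. A minimal sketch
(not from the slides; the data and the polynomial fits are assumptions):

set.seed(6)
x <- runif(200, -2, 2)
y <- sin(x) + rnorm(200, sd = 0.3)
train <- 1:100; test <- 101:200

for (d in c(1, 3, 10)) {                     # polynomial fits of increasing flexibility
  fit <- lm(y ~ poly(x, d), subset = train)
  mse_train <- mean((y[train] - fitted(fit))^2)
  mse_test  <- mean((y[test] - predict(fit, data.frame(x = x[test])))^2)
  cat("degree", d, ": train MSE", round(mse_train, 3),
      ", test MSE", round(mse_test, 3), "\n")
}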

56/70
The Bias-Variance trade-off

There are two competing properties of statistical learning methods.


The expected test MSE, for a given value x0, can be decomposed into three quantities:

    MSE(x0) = E[(y0 − f̂(x0))²]
            = E[(f(x0) + ε − f̂(x0))²]
            = E[(f(x0) − E[f̂(x0)] + E[f̂(x0)] − f̂(x0))²] + Var[ε]
            = (f(x0) − E[f̂(x0)])² + 2 (f(x0) − E[f̂(x0)]) E[f̂(x0) − E[f̂(x0)]]
              + E[(f̂(x0) − E[f̂(x0)])²] + Var[ε]
            = Bias²[f̂(x0)] + Var[f̂(x0)] + Var[ε],

since E[f̂(x0) − E[f̂(x0)]] = 0, so the cross term vanishes.

57/70
The Bias-Variance trade-off

Comments

• The test MSE can never be lower than Var[ε]


• Variance: the amount by which fˆ would change when changing the training
dataset (in general, flexible methods have larger variance)
• Bias: the error that is introduced by approximating a potentially complicated
relationship between Y and X with a simpler model (in general, restrictive
methods have a larger bias)
• in practice, as we increase the flexibility of the model, the bias tends to decrease
faster than the variance increases; however, at some point, increasing flexibility
has little impact on the bias and significantly increases the variance
• in practice, we cannot explicitly compute the test MSE
58/70
The Classification setting

Model accuracy can be also applied for categorical outputs, with slight modifications.

In a classification setting, model accuracy is measured by the (training) error rate

    ER = (1/n) Σ_{i=1}^{n} I(yi ≠ ŷi),   (16)

where ŷi is the predicted label for the i-th observation, using the estimate f̂, and
I(yi ≠ ŷi) is an indicator function which is equal to

• 1 if yi ≠ ŷi (misclassification)
• 0 if yi = ŷi (correct classification)

59/70
The Classification setting

As in the case of regression, we are more interested in the test error rate, which is the
error rate that results from applying the classifier to test observations, not available in
the training dataset
Ave(I(y0 ≠ ŷ0))

where ŷ0 is the prediction obtained by applying the classifier on input x0 .

60/70
The Bayes classifier

The test error rate is minimised, on average, by a simple classifier - the Bayes classifier
- that assigns each observation to the most likely class given its predictor values, i.e.
we should simply assign a test observation with predictor x0 to the class j for which

Pr(Y = j|X = x0 )

is maximised.
If there are only 2 classes, the Bayes classifier predicts class 1 if

Pr(Y = 1|X = x0 ) > 0.5

and class 2 otherwise.

61/70
The Bayes classifier

The expected prediction error is

    EPE = E[I(y0 ≠ ŷ0)]
        = E_X [ Σ_{j=0}^{1} I(ŷ0 ≠ j) Pr(y0 = j | X = x0) ]

and to minimise the expected error, you need to minimise, for each x0, the probability of
being wrong, i.e.

    ŷ0 = 1 if Pr(y0 = 1 | X = x0) = max_{j∈{0,1}} Pr(y0 = j | X = x0).

The expected Bayes error rate is then

    1 − E[ max_{j∈{0,1}} Pr(Y0 = j | X = x0) ],

where the expectation is with respect to the distribution of X.
62/70
The Bayes classifier

Figure 6: 100 observations from two groups in the (X1, X2) plane. The purple line indicates the
Bayes decision boundary. The grid colours indicate the group to which a test observation would
be allocated.
63/70
K -nearest neighbours

In practice, we do not know the conditional distribution of Y given X, so we can’t


compute the Bayes classifier. The Bayes classifier is an unattainable gold standard.

The K-nearest neighbours (KNN) classifier estimates the conditional probability of Y


given X and classifies a given observation to the class with the highest estimated
probability.

Despite its simplicity, the KNN classifier produces classifications which are often close
to the Bayes classifier.

64/70
K -nearest neighbours

Given a positive integer K (the choice of K is essential!) and a test input observation
x0 , the KNN classifier

• Identifies the K points in the training dataset closest to x0 - a neighbourhood of


x0 , N0
• Computes the estimated conditional probabilities

    Pr(Y = j | X = x0) = (1/K) Σ_{i∈N0} I(yi = j)

• Applies the Bayes rule to these estimated probabilities
• Classifies the test observation x0 to the class with the largest estimated probability
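
A minimal sketch of KNN classification in R using the knn() function from the class
package (the simulated data are an assumption, not the data shown in the figures):

library(class)

set.seed(7)
x <- matrix(rnorm(400), ncol = 2)   # 200 observations, 2 predictors
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(200, sd = 0.5) > 0, "A", "B"))
train <- 1:100; test <- 101:200

for (K in c(1, 10, 50)) {
  pred <- knn(train = x[train, ], test = x[test, ], cl = y[train], k = K)
  cat("K =", K, " test error rate =", mean(pred != y[test]), "\n")
}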
65/70
K -nearest neighbours

Figure 7: KNN approach using K = 3. Left: a test observation. Right: decision boundary.
66/70
K -nearest neighbours

[Figure omitted: two panels titled "KNN: K=1" and "KNN: K=100".]

Figure 8: KNN with K = 1. KNN decision boundary compared with the Bayes decision boundary.
67/70
K -nearest neighbours

[Figure omitted: panel titled "KNN: K=10", axes X1 and X2.]

Figure 9: KNN with K = 10. KNN decision boundary compared with the Bayes decision boundary.
68/70
K -nearest neighbours

[Figure omitted: two panels titled "KNN: K=1" and "KNN: K=100".]

Figure 10: KNN with K = 100. KNN decision boundary compared with the Bayes decision boundary.
69/70
References

Sources and recommended reading:

1. A. J. Dobson & A. G. Barnett (2018), An Introduction to Generalised Linear
Models, Chapman and Hall. Chapters 1-2.
2. G. James, D. Witten, T. Hastie & R. Tibshirani (2013), An Introduction to
Statistical Learning with Applications in R, Springer. Chapters 1-2.

Some of the figures in this presentation are taken from “An Introduction to Statistical
Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James,
D. Witten, T. Hastie and R. Tibshirani

70/70
