Week01 Lecture BB
Boris Beranger
Term 2, 2021
Some comments about the course
Previously
• PhD at Université Pierre and Marie Curie (Paris) & UNSW Sydney
• Postdoc at UNSW Sydney
Contact
E-mail [email protected]
Office 4103
Webpage www.borisberanger.com
Consultation hours Thursday, 15:00-16:00
Lectures: Wednesday, 12:00 - 15:00
• One lecture per week: live, video recording available on Blackboard Collaborate
All lecture and tutorial material, assignments, and communication will be posted on the course Moodle page.
Learning support
Textbooks:
Learning support
Software:
Assessment Details
Skills to be developed
My philosophy
• Understand the theory deeply enough to use any statistical tool efficiently and to choose the most appropriate one.
• Become able to find or create the appropriate R function to apply these tools to real data.
• Understand the output of these functions and its link to the mathematical formulas.
• Become able to interpret the output.
That being said, I will go into great detail on the theory only for the Linear Model, in order to leave time to play with modern and more sophisticated nonlinear models. Simulations will also play an important role in understanding several concepts.
Notation
Chapter 1 - Introduction to Regression Analysis
What is Regression Analysis?
Response variables are treated as random variables, while explanatory variables (predictors) are usually treated as fixed observations.
Response and explanatory variables
Response and explanatory variables are measured on one of the following scales:
• nominal: when Y is classified into categories, which can be only two (binary
outcome) or several (multinomial outcome)
• ordinal: when Y is recorded in ordered classes
• continuous: when Y is measured on a continuous scale, at least in theory.
Nominal and ordinal data are discrete variables and can be qualitative or quantitative
(e.g. counts). Continuous data are quantitative.
“Applied” Regression Analysis
Applied means: “If there is no way to calculate it, we won’t talk about it.”
On the other hand, we want to understand the underlying computational methods and
algorithms. This will be impossible without understanding the theory.
General Framework of Statistical Learning
Statistical learning refers to a vast set of tools for understanding data. It splits into
supervised and unsupervised methods. All the methods presented in this course are
within the framework of supervised learning.
Regression fits into the framework of supervised methods, which require a statistical
model for predicting or estimating an output based on one or more inputs.
In contrast, unsupervised methods cover situations where there are inputs but no supervising output. In this type of analysis we learn about the relationships and structure of the data. An example of an unsupervised analysis is cluster analysis.
Knowledge assumed
It’s also good to have some previous exposure to computational software (R):
• data types,
• manipulation of arrays,
• some idea of optimisation,
• ...
And finally, you need to know some probability theory and statistics:
Scope of the Course
Scope of the Course
Prediction
In many situations X is available but the output Y cannot be easily obtained. Since
the error term averages to zero, we can predict Y using
Ŷ = fˆ(X ), (1)
where fˆ is the estimate for f , and Ŷ represents the resulting prediction for Y .
The accuracy of Ŷ as a prediction for Y depends on two components: a reducible error (coming from the estimation of f ) and an irreducible error (coming from the error term).
Methods of estimation
• parametric
1. Assumption on the functional form of f
f (x) = β0 + β1 X1 + . . . + βp Xp (3)
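As a rough sketch in R (the data and variable names below are made up for illustration), estimating a parametric f as in (3) and predicting as in (1) could look like this:

```r
## Sketch: fit a parametric (linear) f and predict Y at a new input
set.seed(1)
dat <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
dat$Y <- 1 + 2 * dat$X1 - dat$X2 + rnorm(100)   # hypothetical data-generating model

fhat <- lm(Y ~ X1 + X2, data = dat)             # estimates beta_0, beta_1, beta_2
coef(fhat)

newX <- data.frame(X1 = 0.5, X2 = -1)           # a new input X
predict(fhat, newdata = newX)                   # Yhat = fhat(X), as in (1)
```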
Chapter 1 - Introduction to Regression Analysis
Maximum likelihood estimation
The likelihood function L(θ; y) is algebraically the same as f(y; θ), but the emphasis shifts to the parameters θ while y stays fixed.
Do not mix up f(·; θ) above with the function f(·) of the model y = f(x) + ε!
Maximum likelihood estimation
Definition
The maximum likelihood estimator of θ is the value θ̂ which maximizes the likelihood
function, that is
Maximum likelihood estimation
Invariance: If g (θ) is any function of the parameters θ, then the maximum likelihood
estimator of g (θ) is g (θ̂).
Other properties: consistency, sufficiency, asymptotic efficiency, asymptotic normality.
Figure 2: Example of a log-likelihood function. Source: Platen & Rendek (2008)
Example - Poisson distribution
Example
Let Y1, Y2, . . . , Yn be independent random variables with Poisson distribution
f(yi; θ) = θ^{yi} e^{−θ} / yi!,   yi = 0, 1, 2, . . . ,
all with the same parameter θ.
Find the MLE of θ.
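A quick numerical check in R, as a sketch only (on simulated data): maximise the Poisson log-likelihood numerically and compare with the sample mean, which is the closed-form MLE.

```r
## Sketch: numerical MLE for a Poisson sample, compared with the sample mean
set.seed(2)
y <- rpois(50, lambda = 2)      # hypothetical sample, true theta = 2

loglik <- function(theta) sum(dpois(y, lambda = theta, log = TRUE))

optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum   # numerical MLE
mean(y)                                                            # the closed-form MLE
```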
Least Squares Estimation
The simplest method of least squares consists of finding the estimator β̂ that minimises the sum of squares of the differences between the Yi's and their expected values,
SS = Σ_{i=1}^{n} [Yi − μi(β)]².   (8)
If the Yi's have variances σi² that are not all equal, it may be desirable to minimise the weighted sum of squared differences,
WSS = Σ_{i=1}^{n} wi [Yi − μi(β)]²,   (10)
where the weights are wi = (σi²)⁻¹. In this way the observations that are less reliable will have less influence on the estimates.
More generally, if y = [Y1, . . . , Yn]⊤ is a random vector with mean vector μ = [μ1, . . . , μn]⊤ and variance-covariance matrix V, then the least squares criterion to minimise is (y − μ)⊤ V⁻¹ (y − μ).
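A minimal sketch in R (simulated data; all names are illustrative): ordinary and weighted least squares via lm(), where the weights argument plays the role of wi = 1/σi².

```r
## Sketch: ordinary vs weighted least squares with lm()
set.seed(3)
x     <- runif(100, 0, 10)
sigma <- 0.5 + 0.3 * x                          # hypothetical non-constant error sd
y     <- 1 + 2 * x + rnorm(100, sd = sigma)

fit_ss  <- lm(y ~ x)                            # minimises SS as in (8)
fit_wss <- lm(y ~ x, weights = 1 / sigma^2)     # minimises WSS as in (10)

coef(fit_ss)
coef(fit_wss)
```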
Comments
1. The method of least squares can be used without making assumptions about the distribution of the response variables Yi, in contrast to maximum likelihood estimation.
2. For many situations maximum likelihood and least squares estimates are identical.
3. In many cases numerical methods are used for parameter estimation.
Chapter 1 - Introduction to Regression Analysis
Model Fitting
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
1. Observation: women living in country areas tend to have fewer consultations with
GPs than women who live near a wide range of health services
2. Hypotheses: is this because they are healthier or because of structural factors?
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
Group of study:
• Women living in country towns (town group) or in rural areas (country group) in
NSW
• Women aged 70-75 years
• Same socio-economic status
• ≤ 3 GP visits in 1996
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
Let Yjk be a r.v. representing the number of conditions for woman k in group j (j = 1
town group, j = 2 country group).
Yjk ∼ Pois(θj ) k = 1, . . . , Kj
H0 : θ1 = θ2 = θ −→ E[Yjk ] = θ
H1 : θ1 ≠ θ2 −→ E[Yjk] = θj
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
If H0 is true:
l0(θ; y) = Σ_{j=1}^{2} Σ_{k=1}^{Kj} (yjk log θ − θ − log yjk!)
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
If H1 is true:
l1(θ1, θ2; y) = Σ_{k=1}^{K1} (y1k log θ1 − θ1 − log y1k!) + Σ_{k=1}^{K2} (y2k log θ2 − θ2 − log y2k!)
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
For now, it is enough to say that 46.8457 − 43.6304 = 3.2153 seems small, but later we will quantify this interpretation.
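As an illustration in R (the counts below are made up, not the study data), the maximised log-likelihoods under H0 and H1 can be computed directly:

```r
## Sketch: maximised Poisson log-likelihoods under H0 and H1 (hypothetical counts)
y1 <- c(0, 1, 1, 0, 2, 3, 0, 1, 1, 1)   # town group (made-up data)
y2 <- c(2, 0, 3, 0, 4, 1, 3, 2, 0, 2)   # country group (made-up data)

theta_hat  <- mean(c(y1, y2))           # MLE of the common theta under H0
theta1_hat <- mean(y1)                  # MLEs of theta_1 and theta_2 under H1
theta2_hat <- mean(y2)

l0 <- sum(dpois(c(y1, y2), theta_hat, log = TRUE))
l1 <- sum(dpois(y1, theta1_hat, log = TRUE)) + sum(dpois(y2, theta2_hat, log = TRUE))
l1 - l0   # improvement in fit from allowing separate group means
```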
Relating income to years of education
[Two panels: scatter plots of Income versus Years of Education.]
Relating income to years of education
Predicting
Y = f (x) + ε (14)
Relating income to years of education
Figure 3: Plot of income as a function of years of education and seniority. The blue
surface represents the true underlying relationship (simulated data).
Relating income to years of education
Figure 4: The yellow surface represents β̂0 + β̂1 × education + β̂2 × seniority
Relating income to years of education
[Two surface plots of income as a function of years of education and seniority.]
What have we learnt from these examples?
Wage dataset: wages for a group of males from the Atlantic region of the US (available
in the R package ISLR)
[Three panels: Wage versus age, Wage versus year, and Wage versus education level.]
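The three panels above can be reproduced in R along the following lines (a sketch; it assumes the ISLR package is installed):

```r
## Sketch: wage against age, year and education level in the ISLR Wage data
library(ISLR)

par(mfrow = c(1, 3))
plot(Wage$age,       Wage$wage, xlab = "Age",       ylab = "Wage")
plot(Wage$year,      Wage$wage, xlab = "Year",      ylab = "Wage")
plot(Wage$education, Wage$wage, xlab = "Education", ylab = "Wage")  # boxplots: education is a factor
```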
A few comments
1. Many statistical learning methods are relevant and useful in a wide range of
disciplines, beyond just statistical sciences.
2. Statistical learning should not be viewed as a series of black boxes.
3. While it is important to know what job is performed by each tool, it is not
necessary to have the skills to construct the machine inside the box.
4. We will work on real-world problems.
Chapter 1 - Introduction to Regression Analysis
Measuring the quality of fit
We need to measure how well the model predictions match the data.
MSE = (1/n) Σ_{i=1}^{n} (yi − fˆ(xi))²   (15)
However, a low MSE on the dataset at hand can hide problems of overfitting. What we really want is accurate predictions when we apply the method to unseen data.
Suppose we have (training) observations {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} on which we
estimate f (x); then we obtain estimates fˆ(x1 ), fˆ(x2 ), . . . , fˆ(xn ).
Measuring the quality of fit
We want to measure the accuracy of the model on new input (test) variables x0 , to
obtain the test MSE
Ave[(fˆ(x0) − y0)²],
which is the average squared prediction error over the test observations (x0, y0).
Then we can
• select the model which minimises the test MSE, if test observations are available
• select the model which minimises the training MSE, if test observations are not available
• use an estimation method for the test MSE, such as cross-validation.
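A sketch in R on simulated data, computing the training MSE and the test MSE of a polynomial fit (all names below are illustrative):

```r
## Sketch: training vs test MSE on simulated data
set.seed(4)
n   <- 200
dat <- data.frame(x = runif(n, 0, 5))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.3)

train <- sample(n, 100)                                   # indices of the training set
fit   <- lm(y ~ poly(x, 5), data = dat, subset = train)   # a fairly flexible fit

mse <- function(obs, pred) mean((obs - pred)^2)
mse(dat$y[train],  predict(fit, newdata = dat[train, ]))   # training MSE
mse(dat$y[-train], predict(fit, newdata = dat[-train, ]))  # test MSE
```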
The Bias-Variance trade-off
The Bias-Variance trade-off
Comments
Model accuracy can also be assessed for categorical outputs, with slight modifications: the training error rate becomes
(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi),
where ŷi is the predicted label for the i-th observation, obtained using the estimate fˆ, and I(yi ≠ ŷi) is an indicator function which is equal to
• 1 if yi ≠ ŷi (misclassification)
• 0 if yi = ŷi (correct classification)
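For instance, in R the error rate is just the proportion of mismatches between observed and predicted labels (the labels below are made up):

```r
## Sketch: misclassification (error) rate from observed and predicted labels
y_obs  <- c("a", "b", "a", "a", "b", "a")   # hypothetical observed classes
y_pred <- c("a", "b", "b", "a", "b", "b")   # hypothetical predicted classes
mean(y_obs != y_pred)                       # average of I(y_i != yhat_i)
```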
The Classification setting
As in the case of regression, we are more interested in the test error rate, which is the error rate that results from applying the classifier to test observations, not available in the training dataset:
Ave(I(y0 ≠ ŷ0))
The Bayes classifier
The test error rate is minimised, on average, by a simple classifier - the Bayes classifier
- that assigns each observation to the most likely class given its predictor values, i.e.
we should simply assign a test observation with predictor x0 to the class j for which
Pr(Y = j|X = x0 )
is maximised.
If there are only 2 classes, the Bayes classifier predicts class 1 if Pr(Y = 1|X = x0) > 0.5,
The Bayes classifier
and to minimise the expected error, you need to minimise the probability of being wrong, i.e.
ŷ0 = 1 if Pr(y0 = 1|X = x0) = max_{j∈{0,1}} Pr(y0 = j|X = x0),
where the expectation is with respect to the probability over all possible values of X.
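A tiny sketch in R, assuming the conditional probability Pr(Y = 1|X = x) were known (the function below is made up for illustration); the Bayes classifier is then just a threshold at 0.5:

```r
## Sketch: the Bayes classifier when Pr(Y = 1 | X = x) is known (two classes, 0 and 1)
p1 <- function(x) 1 / (1 + exp(-2 * x))   # an arbitrary, assumed-known conditional probability
x0 <- c(-1.5, 0.2, 3)                     # some test inputs
ifelse(p1(x0) > 0.5, 1, 0)                # predict class 1 wherever Pr(Y = 1 | X = x0) > 0.5
```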
The Bayes classifier
Figure 6: 100 observations from two groups. The purple line indicates the Bayes decision boundary. The grid colour indicates the group to which a test observation will be allocated.
K -nearest neighbours
Despite its simplicity, the KNN classifier often produces classifications that are close to those of the Bayes classifier.
K -nearest neighbours
Given a positive integer K (the choice of K is essential!) and a test input observation x0, the KNN classifier
• Estimates the conditional probability Pr(Y = j|X = x0) for each class j, using the K training observations closest to x0
• Classifies the test observation x0 to the class with the largest probability
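A sketch in R using the knn() function from the class package (simulated two-class data; the train/test split and the choice K = 3 are arbitrary):

```r
## Sketch: KNN classification with class::knn on simulated two-class data
library(class)

set.seed(5)
n  <- 200
X  <- matrix(rnorm(2 * n), ncol = 2)                     # two predictors
cl <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "A", "B"))

train <- sample(n, 150)
pred  <- knn(train = X[train, ], test = X[-train, ], cl = cl[train], k = 3)

mean(pred != cl[-train])   # test error rate for K = 3
```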
K -nearest neighbours
Figure 7: KNN approach using K = 3. Left: a test observation. Right: decision boundary.
K -nearest neighbours
Figure 8: KNN with K = 1. KNN decision boundary compared with Bayes decision boundary.
K -nearest neighbours
Figure 9: KNN with K = 10. KNN decision boundary compared with Bayes decision boundary.
K -nearest neighbours
Figure 10: KNN with K = 100. KNN decision boundary compared with Bayes decision boundary.
References
Some of the figures in this presentation are taken from “An Introduction to Statistical
Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James,
D. Witten, T. Hastie and R. Tibshirani.