Week01 Lecture BB
Boris Beranger
Term 2, 2021
Some comments about the course
Previously
• PhD at Université Pierre and Marie Curie (Paris) & UNSW Sydney
• Postdoc at UNSW Sydney
Contact
E-mail [email protected]
Office 4103
Webpage www.borisberanger.com
Consultation hours Thursday, 15:00-16:00
Lectures: Wednesday, 12:00 - 15:00
• One lecture per week: live, video recording available on Blackboard Collaborate
All lecture and tutorial material, assignments, and communication will be posted on the course Moodle page.
Learning support
Textbooks:
Learning support
Software:
Assessment Details
Skills to be developed
My philosophy
• Understand the theory deeply enough to use any statistical tool efficiently and to choose the most appropriate one.
• Become able to find or create the appropriate R function to apply these tools to real data.
• Understand the output of these functions and its link to the mathematical formulas.
• Become able to interpret the output.
That being said, I will go into great detail on the theory only for the Linear Model, in order to leave time to play with modern and more sophisticated nonlinear models. Simulations will also play an important role in understanding several concepts.
Notation
Chapter 1 - Introduction to Regression Analysis
What is Regression Analysis?
Response variables are treated as random variables, while explanatory variables (predictors) are usually treated as fixed observations.
Response and explanatory variables
Response and explanatory variables are measured on one of the following scales:
• nominal: when Y is classified into categories, which can be only two (binary
outcome) or several (multinomial outcome)
• ordinal: when Y is recorded in ordered classes
• continuous: when Y is measured on a continuous scale, at least in theory.
Nominal and ordinal data are discrete variables and can be qualitative or quantitative
(e.g. counts). Continuous data are quantitative.
“Applied” Regression Analysis
Applied means: “If there is no way to calculate it, we won’t talk about it.”
On the other hand, we want to understand the underlying computational methods and
algorithms. This will be impossible without understanding the theory.
General Framework of Statistical Learning
Statistical learning refers to a vast set of tools for understanding data. It splits into
supervised and unsupervised methods. All the methods presented in this course are
within the framework of supervised learning.
Regression fits into the framework of supervised methods, which require a statistical
model for predicting or estimating an output based on one or more inputs.
In contrast, unsupervised methods cover situations where there are inputs but no supervising output. In this type of analysis we learn about the relationships and structure of the data. An example of an unsupervised analysis is cluster analysis.
Knowledge assumed
It’s also good to have some previous exposure to computational software (R):
• data types,
• manipulation of arrays,
• some idea of optimisation,
• ...
And finally, you need to know some probability theory and statistics:
Scope of the Course
Scope of the Course
Prediction
In many situations X is available but the output Y cannot be easily obtained. Since
the error term averages to zero, we can predict Y using
Ŷ = fˆ(X ), (1)
where fˆ is the estimate for f , and Ŷ represents the resulting prediction for Y .
The accuracy of Ŷ as a prediction for Y depends on two components: a reducible error (coming from the estimation of f ) and an irreducible error (coming from the error term).
Methods of estimation
• parametric
1. Assumption on the functional form of f
f (x) = β0 + β1 X1 + . . . + βp Xp (3)
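As a rough sketch in R (the data and variable names below are made up for illustration), estimating a parametric f as in (3) and predicting as in (1) could look like this:

```r
## Sketch: fit a parametric (linear) f and predict Y at a new input
set.seed(1)
dat <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
dat$Y <- 1 + 2 * dat$X1 - dat$X2 + rnorm(100)   # hypothetical data-generating model

fhat <- lm(Y ~ X1 + X2, data = dat)             # estimates beta_0, beta_1, beta_2
coef(fhat)

newX <- data.frame(X1 = 0.5, X2 = -1)           # a new input X
predict(fhat, newdata = newX)                   # Yhat = fhat(X), as in (1)
```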
Chapter 1 - Introduction to Regression Analysis
Maximum likelihood estimation
The likelihood function L(θ; y) is algebraically the same as f(y; θ), but the emphasis shifts to the parameters θ while y stays fixed.
Do not mix up f(·; θ) above with the function f(·) of the model y = f(x) + ε!
Maximum likelihood estimation
Definition
The maximum likelihood estimator of θ is the value θ̂ which maximizes the likelihood
function, that is
Maximum likelihood estimation
Invariance: If g (θ) is any function of the parameters θ, then the maximum likelihood
estimator of g (θ) is g (θ̂).
Other properties: consistency, sufficiency, asymptotic efficiency, asymptotic normality.
Figure 2: Example of a log-likelihood function. Source: Platen & Rendek (2008)
Example - Poisson distribution
Example
Let Y1, Y2, . . . , Yn be independent random variables with Poisson distribution
f(yi; θ) = θ^{yi} e^{−θ} / yi!,   yi = 0, 1, 2, . . . ,
all with the same parameter θ.
Find the MLE of θ.
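A quick numerical check in R, as a sketch only (on simulated data): maximise the Poisson log-likelihood numerically and compare with the sample mean, which is the closed-form MLE.

```r
## Sketch: numerical MLE for a Poisson sample, compared with the sample mean
set.seed(2)
y <- rpois(50, lambda = 2)      # hypothetical sample, true theta = 2

loglik <- function(theta) sum(dpois(y, lambda = theta, log = TRUE))

optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum   # numerical MLE
mean(y)                                                            # the closed-form MLE
```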
Least Squares Estimation
The simplest method of least squares consists of finding the estimator β̂ that minimises the sum of squares of the differences between the Yi's and their expected values,
SS = Σ_{i=1}^{n} [Yi − μi(β)]².   (8)
If the Yi's have variances σi² that are not all equal, it may be desirable to minimise the weighted sum of squared differences,
WSS = Σ_{i=1}^{n} wi [Yi − μi(β)]²,   (10)
where the weights are wi = (σi²)⁻¹. In this way the observations that are less reliable will have less influence on the estimates.
More generally, if y = [Y1, . . . , Yn]⊤ is a random vector with mean vector μ = [μ1, . . . , μn]⊤ and variance-covariance matrix V, then the least squares criterion to minimise is (y − μ)⊤ V⁻¹ (y − μ).
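A minimal sketch in R (simulated data; all names are illustrative): ordinary and weighted least squares via lm(), where the weights argument plays the role of wi = 1/σi².

```r
## Sketch: ordinary vs weighted least squares with lm()
set.seed(3)
x     <- runif(100, 0, 10)
sigma <- 0.5 + 0.3 * x                          # hypothetical non-constant error sd
y     <- 1 + 2 * x + rnorm(100, sd = sigma)

fit_ss  <- lm(y ~ x)                            # minimises SS as in (8)
fit_wss <- lm(y ~ x, weights = 1 / sigma^2)     # minimises WSS as in (10)

coef(fit_ss)
coef(fit_wss)
```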
Comments
1. The method of least squares can be used without making assumptions about the distribution of the response variables Yi, in contrast to maximum likelihood estimation.
2. For many situations maximum likelihood and least squares estimates are identical.
3. In many cases numerical methods are used for parameter estimation.
Chapter 1 - Introduction to Regression Analysis
Model Fitting
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
1. Observation: women living in country areas tend to have fewer consultations with
GPs than women who live near a wide range of health services
2. Hypotheses: is this because they are healthier or because of structural factors?
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
Group of study:
• Women living in country towns (town group) or in rural areas (country group) in
NSW
• Women aged 70-75 years
• Same socio-economic status
• ≤ 3 GP visits in 1996
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
Let Yjk be a r.v. representing the number of conditions for woman k in group j (j = 1
town group, j = 2 country group).
Yjk ∼ Pois(θj ) k = 1, . . . , Kj
H0 : θ1 = θ2 = θ −→ E[Yjk ] = θ
H1 : θ1 ≠ θ2 −→ E[Yjk] = θj
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
If H0 is true:
l0(θ; y) = Σ_{j=1}^{2} Σ_{k=1}^{Kj} (yjk log θ − θ − log yjk!)
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
If H1 is true:
l1(θ1, θ2; y) = Σ_{k=1}^{K1} (y1k log θ1 − θ1 − log y1k!) + Σ_{k=1}^{K2} (y2k log θ2 − θ2 − log y2k!)
Australian Longitudinal Study on Women’s Health, Lee et al. (2005)
For now, it is enough to say that 46.8457 − 43.6304 = 3.2153 seems small, but later we will quantify this interpretation.
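As an illustration in R (the counts below are made up, not the study data), the maximised log-likelihoods under H0 and H1 can be computed directly:

```r
## Sketch: maximised Poisson log-likelihoods under H0 and H1 (hypothetical counts)
y1 <- c(0, 1, 1, 0, 2, 3, 0, 1, 1, 1)   # town group (made-up data)
y2 <- c(2, 0, 3, 0, 4, 1, 3, 2, 0, 2)   # country group (made-up data)

theta_hat  <- mean(c(y1, y2))           # MLE of the common theta under H0
theta1_hat <- mean(y1)                  # MLEs of theta_1 and theta_2 under H1
theta2_hat <- mean(y2)

l0 <- sum(dpois(c(y1, y2), theta_hat, log = TRUE))
l1 <- sum(dpois(y1, theta1_hat, log = TRUE)) + sum(dpois(y2, theta2_hat, log = TRUE))
l1 - l0   # improvement in fit from allowing separate group means
```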
Relating income to years of education
[Two panels: scatter plots of Income versus Years of Education.]
Relating income to years of education
Predicting
Y = f (x) + ε (14)
Relating income to years of education
Figure 3: Plot of income as a function of years of education and seniority. The blue
surface represents the true underlying relationship (simulated data).
Relating income to years of education
Figure 4: The yellow surface represents β̂0 + β̂1 × education + β̂2 × seniority
Relating income to years of education
[Two surface plots of income as a function of years of education and seniority.]
What have we learnt from these examples?
Wage dataset: wages for a group of males from the Atlantic region of the US (available
in the R package ISLR)
[Three panels: Wage versus age, Wage versus year, and Wage versus education level.]
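The three panels above can be reproduced in R along the following lines (a sketch; it assumes the ISLR package is installed):

```r
## Sketch: wage against age, year and education level in the ISLR Wage data
library(ISLR)

par(mfrow = c(1, 3))
plot(Wage$age,       Wage$wage, xlab = "Age",       ylab = "Wage")
plot(Wage$year,      Wage$wage, xlab = "Year",      ylab = "Wage")
plot(Wage$education, Wage$wage, xlab = "Education", ylab = "Wage")  # boxplots: education is a factor
```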
A few comments
1. Many statistical learning methods are relevant and useful in a wide range of
disciplines, beyond just statistical sciences.
2. Statistical learning should not be viewed as a series of black boxes.
3. While it is important to know what job is performed by each tool, it is not
necessary to have the skills to construct the machine inside the box.
4. We will work on real-world problems.
Chapter 1 - Introduction to Regression Analysis
Measuring the quality of fit
We need to measure how well the model predictions match the data.
MSE = (1/n) Σ_{i=1}^{n} (yi − fˆ(xi))²   (15)
However, a low MSE on the dataset at hand can hide problems of overfitting. What we really want is accurate predictions when we apply the method to unseen data.
Suppose we have (training) observations {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} on which we
estimate f (x); then we obtain estimates fˆ(x1 ), fˆ(x2 ), . . . , fˆ(xn ).
Measuring the quality of fit
We want to measure the accuracy of the model on new input (test) variables x0 , to
obtain the test MSE
Ave[(fˆ(x0) − y0)²],
which is the average squared prediction error over the test observations (x0, y0).
Then we can
• select the model which minimises the test MSE, if test observations are available
• select the model which minimises the training MSE, if test observations are not available
• use an estimation method for the test MSE, such as cross-validation.
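A sketch in R on simulated data, computing the training MSE and the test MSE of a polynomial fit (all names below are illustrative):

```r
## Sketch: training vs test MSE on simulated data
set.seed(4)
n   <- 200
dat <- data.frame(x = runif(n, 0, 5))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.3)

train <- sample(n, 100)                                   # indices of the training set
fit   <- lm(y ~ poly(x, 5), data = dat, subset = train)   # a fairly flexible fit

mse <- function(obs, pred) mean((obs - pred)^2)
mse(dat$y[train],  predict(fit, newdata = dat[train, ]))   # training MSE
mse(dat$y[-train], predict(fit, newdata = dat[-train, ]))  # test MSE
```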
The Bias-Variance trade-off
The Bias-Variance trade-off
Comments
Model accuracy can also be assessed for categorical outputs, with slight modifications: the training error rate becomes
(1/n) Σ_{i=1}^{n} I(yi ≠ ŷi),
where ŷi is the predicted label for the i-th observation, obtained using the estimate fˆ, and I(yi ≠ ŷi) is an indicator function which is equal to
• 1 if yi ≠ ŷi (misclassification)
• 0 if yi = ŷi (correct classification)
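For instance, in R the error rate is just the proportion of mismatches between observed and predicted labels (the labels below are made up):

```r
## Sketch: misclassification (error) rate from observed and predicted labels
y_obs  <- c("a", "b", "a", "a", "b", "a")   # hypothetical observed classes
y_pred <- c("a", "b", "b", "a", "b", "b")   # hypothetical predicted classes
mean(y_obs != y_pred)                       # average of I(y_i != yhat_i)
```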
The Classification setting
As in the case of regression, we are more interested in the test error rate, which is the error rate that results from applying the classifier to test observations, not available in the training dataset:
Ave(I(y0 ≠ ŷ0))
The Bayes classifier
The test error rate is minimised, on average, by a simple classifier - the Bayes classifier
- that assigns each observation to the most likely class given its predictor values, i.e.
we should simply assign a test observation with predictor x0 to the class j for which
Pr(Y = j|X = x0 )
is maximised.
If there are only 2 classes, the Bayes classifier predicts class 1 if Pr(Y = 1|X = x0) > 0.5,
The Bayes classifier
and to minimise the expected error, you need to minimise the probability of being wrong, i.e.
ŷ0 = 1 if Pr(y0 = 1|X = x0) = max_{j∈{0,1}} Pr(y0 = j|X = x0),
where the expectation is with respect to the probability over all possible values of X.
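A tiny sketch in R, assuming the conditional probability Pr(Y = 1|X = x) were known (the function below is made up for illustration); the Bayes classifier is then just a threshold at 0.5:

```r
## Sketch: the Bayes classifier when Pr(Y = 1 | X = x) is known (two classes, 0 and 1)
p1 <- function(x) 1 / (1 + exp(-2 * x))   # an arbitrary, assumed-known conditional probability
x0 <- c(-1.5, 0.2, 3)                     # some test inputs
ifelse(p1(x0) > 0.5, 1, 0)                # predict class 1 wherever Pr(Y = 1 | X = x0) > 0.5
```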
The Bayes classifier
Figure 6: 100 observations from two groups. The purple line indicates the Bayes decision boundary. The grid colour indicates the group to which a test observation will be allocated.
K -nearest neighbours
Despite its simplicity, the KNN classifier often produces classifications that are close to those of the Bayes classifier.
K -nearest neighbours
Given a positive integer K (the choice of K is essential!) and a test input observation x0, the KNN classifier
• Estimates the conditional probability Pr(Y = j|X = x0) for each class j, using the K training observations closest to x0
• Classifies the test observation x0 to the class with the largest probability
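A sketch in R using the knn() function from the class package (simulated two-class data; the train/test split and the choice K = 3 are arbitrary):

```r
## Sketch: KNN classification with class::knn on simulated two-class data
library(class)

set.seed(5)
n  <- 200
X  <- matrix(rnorm(2 * n), ncol = 2)                     # two predictors
cl <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "A", "B"))

train <- sample(n, 150)
pred  <- knn(train = X[train, ], test = X[-train, ], cl = cl[train], k = 3)

mean(pred != cl[-train])   # test error rate for K = 3
```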
K -nearest neighbours
Figure 7: KNN approach using K = 3. Left: a test observation. Right: decision boundary.
K -nearest neighbours
Figure 8: KNN with K = 1. KNN decision boundary compared with Bayes decision boundary.
K -nearest neighbours
Figure 9: KNN with K = 10. KNN decision boundary compared with Bayes decision boundary.
K -nearest neighbours
Figure 10: KNN with K = 100. KNN decision boundary compared with Bayes decision boundary.
References
Some of the figures in this presentation are taken from “An Introduction to Statistical
Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James,
D. Witten, T. Hastie and R. Tibshirani.