Chapter 2: Overview of Supervised Learning
DD3364
March 9, 2012
Introduction and Notation
Problem 1: Regression
[Figure: a scatter plot of training points (x, y). Given a new x, predict its y value?]
Problem 2: Classification
Is it a bike or a face?
Some Terminology
In Machine Learning, outputs are predicted from measured inputs.
In the Statistical literature, responses are predicted from measured predictors.
In Pattern Recognition, responses are predicted from measured features.
The goal of supervised learning is to predict the value of the output(s) given an input and lots of labelled training examples
$\{(\text{input}_1, \text{output}_1), (\text{input}_2, \text{output}_2), \ldots, (\text{input}_n, \text{output}_n)\}$
Variable types
Outputs can be
discrete (categorical, qualitative),
continuous (quantitative) or
ordered categorical (order is important)
Predicting a discrete output is referred to as classification.
Predicting a continuous output is referred to as regression.
Notation of the book
Denote an input variable by $X$.
If $X$ is a vector, its components are denoted by $X_j$.
Quantitative (continuous) outputs are denoted by $Y$.
Qualitative (discrete) outputs are denoted by $G$.
Observed values are written in lower case:
$x_i$ is the $i$th observed value of $X$. If $X$ is a vector then $x_i$ is a vector of the same length.
$g_i$ is the $i$th observed value of $G$.
Matrices are represented by bold uppercase letters.
I will try to stick to these conventions in the slides.
More Notation
The prediction of the output for a given value of the input vector $X$ is denoted by $\hat{Y}$.
It is presumed that we have labelled training data for regression problems,
$$\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
It is presumed that we have labelled training data for classification problems,
$$\mathcal{T} = \{(x_1, g_1), \ldots, (x_n, g_n)\}$$
with each $x_i \in \mathbb{R}^p$ and $g_i \in \{1, \ldots, G\}$.
Prediction via Least Squares and
Nearest Neighbours
Linear Model
Have an input vector $X = (X_1, \ldots, X_p)^t$.
A linear model predicts the output $Y$ as
$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$$
where $\hat\beta_0$ is known as the intercept and also as the bias.
Let $X = (1, X_1, \ldots, X_p)^t$ and $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_p)^t$; then
$$\hat{Y} = X^t \hat\beta$$
Linear Models and Least Squares
How is a linear model fit to a set of training data?
The most popular approach is least squares: $\hat\beta$ is chosen to minimize
$$\text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^t \beta)^2$$
As this is quadratic, a minimum always exists, but it may not be unique.
In matrix notation we can write $\text{RSS}(\beta)$ as
$$\text{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^t (\mathbf{y} - \mathbf{X}\beta)$$
where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is a matrix with each row being an input vector and $\mathbf{y} = (y_1, \ldots, y_n)^t$.
Linear Models and Least Squares
The solution to
$$\hat\beta = \arg\min_{\beta}\, (\mathbf{y} - \mathbf{X}\beta)^t (\mathbf{y} - \mathbf{X}\beta)$$
is given by
$$\hat\beta = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{y}$$
if $\mathbf{X}^t \mathbf{X}$ is non-singular.
This is easy to show by differentiating $\text{RSS}(\beta)$.
This model has $p + 1$ parameters.
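A minimal numpy sketch of this fit (the helper names are mine; a column of ones supplies the intercept, and `np.linalg.lstsq` is used instead of forming $(\mathbf{X}^t\mathbf{X})^{-1}$ explicitly, which is numerically safer):

```python
import numpy as np

def fit_linear_least_squares(X, y):
    """Least squares fit of a linear model with an intercept.

    X : (n, p) array of inputs, y : (n,) array of outputs.
    Returns beta_hat of length p + 1, intercept first.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend a constant 1 for the intercept
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes ||y - X1 beta||^2
    return beta_hat

def predict_linear(beta_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ beta_hat
```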
Linear Models, Least Squares and Classification
Assume one has training data $\{(x_i, y_i)\}_{i=1}^{n}$ with each $y_i \in \{0, 1\}$ (it is really categorical data).
A linear regression model is fit to the data and
$$\hat{G}(x) = \begin{cases} 0 & \text{if } x^t\hat\beta \le 0.5 \\ 1 & \text{if } x^t\hat\beta > 0.5 \end{cases}$$
This is not the best way to perform binary classification with a linear discriminant function...
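A self-contained sketch of this rule (the 0/1 labels are regressed on directly, as above, and the prediction thresholded at 0.5; helper names are mine):

```python
import numpy as np

def fit_linear_classifier(X, g):
    """Regress directly on the 0/1 class labels g (least squares with an intercept)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    beta_hat, *_ = np.linalg.lstsq(X1, g.astype(float), rcond=None)
    return beta_hat

def classify_linear(beta_hat, X):
    """G_hat(x) = 1 if the regression prediction exceeds 0.5, else 0."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X1 @ beta_hat > 0.5).astype(int)
```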
Example: binary classification with a linear model
[Figure: the training points of the two classes and the fitted linear decision boundary.]
The linear classifier mis-classifies quite a few of the training examples.
The linear model may be too rigid.
By inspection it seems the two classes cannot be separated by a line.
Points from each class are generated from a GMM with 10 mixture components.
k-Nearest Neighbour regression fitting
The k-nearest neighbour fit for $Y$ is
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where $N_k(x)$ is the neighbourhood of $x$ defined by the $k$ closest points $x_i$ in the training data.
Closeness is defined by some metric. For this lecture assume it is the Euclidean distance.
k-nearest neighbours in words:
Find the $k$ observations $x_i$ closest to $x$ and average their responses.
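A brute-force numpy sketch of this fit for a single query point (illustrative helper names; no attempt at an efficient neighbour search):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """k-nn regression fit: average the responses of the k closest training points.

    x : (p,) query point, X_train : (n, p) inputs, y_train : (n,) responses.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nbrs = np.argsort(dists)[:k]                  # indices of the neighbourhood N_k(x)
    return y_train[nbrs].mean()
```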
k-Nearest Neighbour binary classification
Training data: $\{(x_i, g_i)\}$ with each $g_i \in \{0, 1\}$.
The k-nearest neighbour estimate for $G$ is
$$\hat{G}(x) = \begin{cases} 0 & \text{if } \frac{1}{k} \sum_{x_i \in N_k(x)} g_i \le 0.5 \\ 1 & \text{otherwise} \end{cases}$$
where $N_k(x)$ is the neighbourhood of $x$ defined by the $k$ closest points $x_i$ in the training data.
k-nearest neighbours in words:
Find the $k$ observations $x_i$ closest to $x$ and estimate the class of $x$ as the majority class amongst the neighbours.
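The corresponding majority-vote sketch (same caveats as the regression version above):

```python
import numpy as np

def knn_classify(x, X_train, g_train, k=15):
    """k-nn binary classification: majority vote amongst the k nearest neighbours.

    x : (p,) query point, X_train : (n, p) inputs, g_train : (n,) 0/1 labels.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nbrs = np.argsort(dists)[:k]
    return int(g_train[nbrs].mean() > 0.5)        # class 1 iff more than half the neighbours are 1
```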
Example: k-Nearest Neighbour classification
k = 15
Example: k-Nearest Neighbour classification
k=1
Example: k-Nearest Neighbour classification (k = 1)
For k = 1 all the training examples are correctly classified. This is always the case!
But how well will it perform on test data drawn from the same distribution?
Effective number of parameters of k-nn
There are two parameters that control the behaviour of k-nn: $k$ and $n$, the number of training samples.
The effective number of parameters of k-nn is $n/k$.
Intuitively: say the neighbourhoods were non-overlapping. There would be $n/k$ neighbourhoods, and we would need to fit one parameter (a mean) to each neighbourhood.
k-nn Vs Linear decision boundaries
The linear decision boundary is
smooth,
stable to fit,
but assumes a linear decision boundary is suitable.
In statistical learning lingo: it has low variance and high bias.
The k-nn decision boundary
can adapt to any shape of the data,
is unstable to fit (for small k),
is not smooth but wiggly (for small k).
In statistical learning lingo: it has high variance and low bias.
Optimal Bayes decision boundary
This is the optimal decision boundary computed from the known
pdfs for the two classes.
Mis-classification rate for the simulation experiment
[Figure: training and test error of k-nn and of the linear classifier plotted against $\log(n/k)$, together with the Bayes error rate; $n_{\text{train}} = 200$ and $n_{\text{test}} = 10{,}000$.]
Statistical Decision Theory
Some statistical theory
How do we measure how well $f(X)$ predicts $Y$?
Statisticians would compute the Expected Prediction Error w.r.t. some loss function:
$$\text{EPE}(f) = E_{X,Y}[L(Y, f(X))] = \int\!\!\int L(y, f(x))\, p(x, y)\, dx\, dy$$
A common loss function is the squared error loss
$$L(y, f(x)) = (y - f(x))^2$$
By conditioning on $X$ we can write
$$\text{EPE}(f) = E_{X,Y}[(Y - f(X))^2] = E_X\big[\, E_{Y|X}[(Y - f(X))^2 \mid X]\, \big]$$
Some statistical theory
At a point $x$ we can minimize EPE to get the best prediction of $y$:
$$f(x) = \arg\min_{c}\; E_{Y|X}[(Y - c)^2 \mid X = x]$$
The solution is
$$f(x) = E[Y \mid X = x]$$
This is known as the regression function.
Only one problem with this: one rarely knows the pdf $p(Y|X)$.
The regression methods we encounter can be viewed as ways to approximate $E[Y \mid X = x]$.
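A one-line check of why the conditional mean is the minimizer: add and subtract $E[Y \mid x]$ inside the square, so that
$$E_{Y|X}\big[(Y - c)^2 \mid X = x\big] = E\big[(Y - E[Y\mid x])^2 \mid x\big] + \big(E[Y\mid x] - c\big)^2,$$
since the cross term $2\,E\big[Y - E[Y\mid x] \mid x\big]\big(E[Y\mid x] - c\big)$ vanishes. The first term does not depend on $c$, so the minimizer is $c = E[Y \mid X = x]$.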
Local Methods in High Dimensions
Intuition and k-nearest neighbour averaging
Example:
Training data $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
Predict the response at $x \in \mathcal{X}$ using the training data and 3-nn averaging.
Intuition and k-nearest neighbour averaging
Let
$\mathcal{X} = [-1, 1]^2$ and
the training $x_i$'s be uniformly sampled from $\mathcal{X}$.
[Figure: the $x_i$'s from training sets of different size, plotted in the $(x_1, x_2)$ plane.]
As $n$ increases, the expected area of the neighbourhood containing the 3 nearest neighbours decreases
$\Longrightarrow$ the accuracy of $\hat y$ increases.
Intuition and k-nearest neighbour averaging
Therefore intuition says:
Lots of training data $\Longrightarrow$ k-nearest neighbour produces an accurate, stable prediction.
More formally, as $n$ increases,
$$\hat y = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \;\longrightarrow\; E[Y \mid X = x]$$
However for large p...
The Curse of Dimensionality (Bellman, 1961)
The k-nearest neighbour averaging approach, and our intuition, break down in high dimensions.
Manifestations of this problem, for large $p$:
Nearest neighbours are not so close!
The k-nn of $x$ are closer to the boundary of $\mathcal{X}$.
A prohibitive number of training samples is needed to densely sample $\mathcal{X} \subset \mathbb{R}^p$.
Curse of Dimensionality: Problem 1
For large p nearest neighbours are not so close
Scenario:
Estimate a regression function, $f: \mathcal{X} \to \mathbb{R}$, using a k-nn regressor. Have
$\mathcal{X} = [0, 1]^p$ (the unit hyper-cube) and
training inputs uniformly sampled from $\mathcal{X}$.
Question:
Let $k = r\, n$ where $r \in [0, 1]$, and let $x = 0$.
What is the expected length of the side of the minimal hyper-cube containing the k nearest neighbours of $x$?
Solution:
The volume of a hyper-cube of side $a$ is $a^p$. We are looking for $a$ such that $a^p$ equals a fraction $r$ of the unit hyper-cube's volume. Therefore
$$a^p = r \;\Longrightarrow\; a = r^{1/p}$$
For large p nearest neighbours are not close
To recap, the expected edge length of the hyper-cube containing a fraction $r$ of the training data is
$$e_p(r) = r^{1/p}$$
Plug in some numbers. Let $p = 10$; then
$$e_{10}(0.01) = 0.63, \qquad e_{10}(0.1) = 0.80$$
The entire range of each input is 1. Therefore in this case the 1% and 10% nearest-neighbour estimates are not local estimates.
[Figure: $e_p(r)$ plotted against $r$ for $p = 1, 2, 3, 10$.]
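These numbers are quick to reproduce (a tiny script evaluating $e_p(r) = r^{1/p}$):

```python
# Expected edge length e_p(r) = r**(1/p) of the sub-cube capturing a fraction r of the data.
for p in (1, 2, 3, 10):
    for r in (0.01, 0.1):
        print(f"p = {p:2d}, r = {r:4.2f}:  e_p(r) = {r ** (1.0 / p):.2f}")
```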
Curse of Dimensionality: Problem 2
For large p nearest neighbours are not close II
Scenario:
Estimate a regression function, $f: \mathcal{X} \to \mathbb{R}$, using a k-nearest neighbour regressor. Have
$\mathcal{X}$ is the unit hyper-sphere (ball) in $\mathbb{R}^p$ centred at the origin, and
$n$ training inputs uniformly sampled from $\mathcal{X}$.
Question:
Let $k = 1$ and $x = 0$.
What is the median distance of the nearest neighbour to $x$?
Solution:
This median distance is given by the expression
$$d(p, n) = \left(1 - 0.5^{1/n}\right)^{1/p}$$
Median distance of nearest neighbour to the origin
[Figure: plot of $d(p, n)$ against $p$ for $n = 500$; the distance grows to roughly 0.5 by $p = 10$.]
Note: for $p = 10$ the closest training point is closer to the boundary of $\mathcal{X}$ than to $x$.
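A Monte Carlo sanity check of this formula (a sketch; points are drawn uniformly in the unit ball via normalized Gaussian directions and radii distributed as $U^{1/p}$):

```python
import numpy as np

def median_nn_distance(p, n, n_trial=200, seed=0):
    """Monte Carlo estimate of the median distance from the origin to the nearest
    of n points drawn uniformly from the unit ball in R^p."""
    rng = np.random.default_rng(seed)
    nearest = []
    for _ in range(n_trial):
        g = rng.standard_normal((n, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.uniform(size=(n, 1)) ** (1.0 / p)        # radius law for uniformity in the ball
        nearest.append(np.linalg.norm(directions * radii, axis=1).min())
    return np.median(nearest)

p, n = 10, 500
print(median_nn_distance(p, n), (1 - 0.5 ** (1.0 / n)) ** (1.0 / p))  # simulation vs d(p, n)
```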
Consequence of this expression
For large $p$ most of the training data points are closer to the boundary of $\mathcal{X}$ than to $x$.
This is bad because, to make a prediction at $x$, you will use training samples near the edge of the training data.
Therefore you perform extrapolation as opposed to interpolation between neighbouring samples.
Curse of Dimensionality: Problem 3
Dense sampling in high dimensions is prohibitive
Explanation:
Say $n_1 = 100$ samples represents a dense sampling for a single-input problem.
Then $n_{10} = 100^{10}$ samples are required to sample equally densely with 10 such inputs.
Therefore in high dimensions all feasible training sets sparsely sample the input space.
Simulated Example
The Set-up
[Figure: plot of $e^{-8x^2}$ against $x$ on $[-1, 1]$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = e^{-8\|X\|^2}$$
The regression method
[Figure: the test point $x_0$, its nearest training point $x_{(1)}$ and the corresponding estimate $\hat y_0$ on the curve $e^{-8x^2}$.]
Use the 1-nearest-neighbour rule to predict $\hat y_0$ at a test point $x_0$.
Histogram of the position of the nearest neighbour
[Figure: histogram of the nearest-neighbour location $x_{(1)}$ over repeated trials; $p = 1$, $n = 20$.]
Average estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 1$, $n = 20$, $n_{\text{trial}} = 400$.]
Note: the true value is $y_0 = 1$.
p = 2
[Figure: the training inputs now live in the square $[-1, 1]^2$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = e^{-8\|X\|^2}$$
p = 2
[Figure: in two dimensions, the nearest neighbour $x_{(1)}$ of the test point $x_0$ is typically further away.]
Use the 1-nearest-neighbour rule to predict $\hat y_0$ at a test point $x_0$.
1-nn estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 2$, $n = 20$, $n_{\text{trial}} = 600$.]
Note: the true value is $y_0 = 1$.
1-nn estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 2$, $n = 40$, $n_{\text{trial}} = 600$.]
Note: the true value is $y_0 = 1$.
As p increases
[Figure: average distance to the nearest neighbour, and average value of $\hat y_0$, plotted against $p$ for $p = 2, \ldots, 10$; $n_{\text{train}} = 1000$, $n_{\text{trial}} = 400$.]
The average distance to the nearest neighbour increases rapidly with $p$,
thus the average estimate of $\hat y_0$ also degrades rapidly.
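A simulation in this spirit (a sketch rather than the slides' exact experiment, but using the same $f$, $n = 1000$ and a 1-nn estimate at the origin):

```python
import numpy as np

def one_nn_at_origin(p, n=1000, n_trial=400, seed=1):
    """Average 1-nn distance to x0 = 0 and average 1-nn estimate of f(0) = 1,
    for f(X) = exp(-8 ||X||^2) with inputs uniform on [-1, 1]^p."""
    rng = np.random.default_rng(seed)
    dists, ests = [], []
    for _ in range(n_trial):
        X = rng.uniform(-1.0, 1.0, size=(n, p))
        norms = np.linalg.norm(X, axis=1)
        i = norms.argmin()                          # nearest neighbour of the origin
        dists.append(norms[i])
        ests.append(np.exp(-8.0 * norms[i] ** 2))   # its (noise-free) response y_i
    return np.mean(dists), np.mean(ests)

for p in (1, 2, 5, 10):
    d, y_hat = one_nn_at_origin(p)
    print(f"p = {p:2d}: avg nn distance = {d:.3f}, avg 1-nn estimate of f(0) = {y_hat:.3f}")
```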
Bias-Variance Decomposition
For the simulation experiment we have a completely deterministic relationship:
$$Y = f(X) = e^{-8\|X\|^2}$$
The Mean Squared Error for estimating $f(0)$ is
$$\begin{aligned}
\text{MSE}(x_0) &= E_{\mathcal{T}}[(f(x_0) - \hat y_0)^2] \\
 &= E_{\mathcal{T}}[(\hat y_0 - E_{\mathcal{T}}[\hat y_0])^2] + (E_{\mathcal{T}}[\hat y_0] - f(x_0))^2 \\
 &= \text{Var}_{\mathcal{T}}(\hat y_0) + \text{Bias}^2(\hat y_0)
\end{aligned}$$
Bias-Variance Decomposition for this example
[Figure: MSE, squared bias and variance of $\hat y_0$ plotted against $p$.]
The bias dominates the MSE as $p$ increases.
Why?
As $p$ increases the nearest neighbour is never close to $x_0 = 0$,
hence the estimate $\hat y_0$ tends to 0.
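The decomposition used above follows by adding and subtracting $E_{\mathcal{T}}[\hat y_0]$:
$$(f(x_0) - \hat y_0)^2 = \big(\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big)^2 + \big(E_{\mathcal{T}}[\hat y_0] - f(x_0)\big)^2 + 2\big(\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big)\big(E_{\mathcal{T}}[\hat y_0] - f(x_0)\big)$$
and taking $E_{\mathcal{T}}$ of both sides kills the cross term, since $E_{\mathcal{T}}\big[\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big] = 0$ and $E_{\mathcal{T}}[\hat y_0] - f(x_0)$ is a constant.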
Another Simulated Example
where variance dominates the MSE
The Set-up
[Figure: plot of $\tfrac{1}{2}(x + 1)^3$ against $x$ on $[-1, 1]$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3$$
The regression method
[Figure: the test point $x_0$, its nearest training point $x_{(1)}$ and the corresponding estimate $\hat y_0$ on the curve $\tfrac{1}{2}(x + 1)^3$.]
Use a 1-nn regressor to estimate $f(x_0)$ where $x_0 = 0$.
Variance dominates the MSE as p increases
[Figure: MSE, squared bias and variance plotted against $p$.]
The variance dominates the MSE as $p$ increases.
Why?
As the deterministic function only involves one dimension, the bias doesn't explode as $p$ increases!
Comparison of Linear and NN predictors
Case 1
[Figure: samples of $Y = 0.5(X_1 + 1)^3 + \epsilon$ plotted against $x_1$.]
$$Y = 0.5(X_1 + 1)^3 + \epsilon, \qquad \epsilon \sim N(0, 1)$$
Case 2
[Figure: samples of $Y = X_1 + \epsilon$ plotted against $x_1$.]
$$Y = X_1 + \epsilon, \qquad \epsilon \sim N(0, 1)$$
Linear predictor Vs 1-NN predictor
[Figure: EPE against $p$ for the four combinations: $f(x)$ linear with a 1-nn predictor, $f(x)$ linear with a linear predictor, $f(x)$ cubic with a 1-nn predictor, and $f(x)$ cubic with a linear predictor.]
EPE refers to the expected prediction error at the point $x_0 = 0$:
$$\text{EPE}(x_0) = E_{y_0|x_0}\big[\, E_{\mathcal{T}}[(y_0 - \hat y_0)^2]\, \big]$$
Linear predictor Vs 1-NN predictor
[Figure: the same EPE curves as above.]
The noise level destroys the 1-nn predictor.
The linear predictor has a biased estimate of the cubic function.
The linear predictor fits well even in the presence of noise and high dimension for the linear $f$:
the linear model beats the curse of dimensionality.
Words of Caution
A case of horses for courses:
In the previous example the linear predictor out-performed the 1-nn regression function as
bias of the linear predictor $\ll$ variance of the 1-nn predictor.
But one could easily manufacture an example where
bias of the linear predictor $\gg$ variance of the 1-nn predictor.
More predictors than linear and NN
There is a whole host of models in between the rigid linear model and the extremely flexible 1-nn method.
Each one has its own assumptions and biases.
Many are specifically designed to avoid the exponential growth in complexity of functions in high dimensions.
Statistical models,
Supervised learning and
Function approximation
Goal
We know there is a function $f(x)$ relating inputs to outputs:
$$Y \approx f(X)$$
We want to find an estimate $\hat f(x)$ of $f(x)$ from labelled training data.
This is difficult when $X$ is high dimensional.
In this case we need to incorporate special structure to
reduce the bias and variance of the estimates and
help combat the curse of dimensionality.
A Statistical Model for Regression
[Figure: noisy samples $(x, y)$ scattered around the curve $f(x)$.]
$$\underbrace{Y}_{\text{output}} = \underbrace{f(X)}_{\text{deterministic relationship}} + \underbrace{\epsilon}_{\text{random variable independent of the input } X}$$
Additive Error Model
$$Y = f(X) + \epsilon$$
where
the random variable $\epsilon$ has $E[\epsilon] = 0$,
$\epsilon$ is independent of $X$, and
$f(x) = E[Y \mid X = x]$;
any departures from the deterministic relationship are mopped up by $\epsilon$.
Statistical model for binary classification
[Figure: a probability curve $p(x)$ taking values between 0 and 1.]
$p(G \mid X = x)$ is modelled as a Bernoulli distribution with
$$p(x) = p(G = 1 \mid X = x)$$
Therefore
$$E[G \mid X = x] = p(x) \quad \text{and} \quad \text{Var}[G \mid X = x] = p(x)(1 - p(x))$$
Supervised Learning - Function Approximation
Have training data
$$\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
where each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
[Figure: the training points $(x_i, y_i)$ in the plane.]
Learn the deterministic relationship $f$ between $X$ and $Y$ from $\mathcal{T}$.
In the book, supervised learning is viewed as a problem in function approximation.
Common approach
Decide on a parametric form of $f_\theta$, i.e. a linear expansion of basis functions
$$f_\theta(x) = \sum_{m=1}^{M} h_m(x)\, \theta_m$$
Use least squares to estimate $\theta$ by minimizing
$$\text{RSS}(\theta) = \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2$$
[FIGURE 2.10 from the book: least squares fitting of a function of two inputs; the parameters of $f_\theta(x)$ are chosen so as to minimize the sum of squared vertical errors.]
Don't have to always use least squares
Can find $\theta$ by optimizing other criteria.
Another option is Maximum Likelihood Estimation.
For the additive model, $Y = f_\theta(X) + \epsilon$, we have
$$P(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)$$
The log-likelihood of the training data is
$$L(\theta) = \sum_{i=1}^{n} \log P(Y = y_i \mid X = x_i, \theta) = \sum_{i=1}^{n} \log N(y_i;\, f_\theta(x_i), \sigma^2)$$
Find the $\theta$ that maximizes $L(\theta)$.
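Writing out the Gaussian log-density makes the connection to least squares explicit:
$$L(\theta) = \sum_{i=1}^{n} \left( -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) = \text{const} - \frac{1}{2\sigma^2}\, \text{RSS}(\theta),$$
so maximizing $L(\theta)$ over $\theta$ is the same as minimizing $\text{RSS}(\theta)$: under the additive Gaussian error model, least squares is maximum likelihood.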
Structured Regression Models
Why do we need structure?
Consider the Residual Sum of Squares for a function $f$:
$$\text{RSS}(f) = \sum_{i=1}^{n} (y_i - f(x_i))^2$$
There are infinitely many $\hat f$ with
$$\hat f = \arg\min_{f}\, \text{RSS}(f) \quad \text{and} \quad \text{RSS}(\hat f) = 0$$
Why do we need structure?
Any function $\hat f$ passing through the training points $(x_i, y_i)$ is a solution.
Obviously not all such $\hat f$ will be equally good at predicting the value of unseen test points...
Must restrict the class of f considered
Don't consider an arbitrary function $\hat f$;
instead restrict ourselves to $f \in \mathcal{F}$:
$$\hat f = \arg\min_{f \in \mathcal{F}}\, \text{RSS}(f)$$
But what restrictions should be used?
The initial ambiguity in choosing $\hat f$ has just been transferred to the choice of constraint.
Options to restrict the class of f
Have a parametric representation of $f$:
Linear model: $f(x) = \beta_1^t x + \beta_0$
Quadratic: $f(x) = x^t \mathbf{B}\, x + \beta_1^t x + \beta_0$ (for a symmetric matrix $\mathbf{B}$)
Impose complexity restrictions on the function,
i.e. $f$ must have some regular behaviour in small neighbourhoods of the input space, but then
What size should the neighbourhood be?
What form should $f$ have in the neighbourhood?
There is no unique way to impose complexity constraints.
Complexity and Neighbourhood size
Large neighbourhood $\Longrightarrow$ strong constraint
Small neighbourhood $\Longrightarrow$ weak constraint
Classes of Restricted Estimators
How to restrict the predictor f
The techniques used to restrict the regression or classification function learned fall loosely into several classes.
Each class has parameter(s), termed smoothing parameters, which control the effective size of the local neighbourhood.
Some examples from each class follow.
Note:
It is assumed we have training examples $\{(x_i, y_i)\}_{i=1}^{n}$, and
we present the energy functions or functionals which are minimised in order to find $\hat f$.
Class 1: Roughness Penalty
$$\text{PRSS}(f, \lambda) = \underbrace{\sum_{i=1}^{n} (y_i - f(x_i))^2}_{\text{ensure } f \text{ predicts the training values}} \;+\; \lambda\, \underbrace{J(f)}_{\text{functional measuring smoothness of } f}$$
where $\lambda$ is the penalty parameter.
One such penalty functional is
$$J(f) = \int [f''(x)]^2\, dx$$
For wiggly $f$'s this functional will have a large value, while for linear $f$'s it is zero.
Regularization methods express our belief that the $f$ we are trying to approximate has certain smoothness properties.
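A crude discretized sketch of this idea (my own construction, not from the slides): represent $f$ by its values on a grid, map each training $x_i$ to its nearest grid point, and penalize squared second differences as a stand-in for $\int [f''(x)]^2\, dx$.

```python
import numpy as np

def penalized_fit(x, y, lam, grid_size=50):
    """Roughness-penalized 1-D fit on a grid.

    Minimizes sum_i (y_i - f(x_i))^2 + lam * sum_j (second difference of f)^2.
    Returns the grid and the fitted values on it.
    """
    grid = np.linspace(x.min(), x.max(), grid_size)
    idx = np.argmin(np.abs(x[:, None] - grid[None, :]), axis=1)  # nearest grid point of each x_i
    A = np.zeros((len(x), grid_size))
    A[np.arange(len(x)), idx] = 1.0              # A maps grid values to predictions at the x_i
    D = np.diff(np.eye(grid_size), n=2, axis=0)  # D computes second differences along the grid
    f_hat = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ y)
    return grid, f_hat
```

Large `lam` pushes the solution towards a straight line; small `lam` lets it track the data closely.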
Class 2: Kernel Methods and Local Regression
Estimate the regression or classification function in a local
neighbourhood.
Need to specify
the nature of local neighbourhood
the class of functions used in local fit
Kernel Methods and Local Regression
Can define a local regression estimate of $f(x_0)$, from training data $\{(x_i, y_i)\}$, as $f_{\hat\theta}(x_0)$ where $\hat\theta$ minimizes
$$\text{RSS}(f_\theta, x_0) = \sum_{i=1}^{n} K_\lambda(x_0, x_i)\, (y_i - f_\theta(x_i))^2$$
where
Kernel function: $K_\lambda(x_0, x_i)$ assigns a weight to $x_i$ depending on its closeness to $x_0$.
Base regression function: $f_\theta$ is a parameterized function such as a low-order polynomial.
A common kernel is the Gaussian kernel
$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left( -\frac{\|x_0 - x\|^2}{2\lambda} \right)$$
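A sketch of the simplest member of this family, the local constant fit (Nadaraya-Watson): with $f_\theta(x) = c$, the weighted RSS above is minimized by the kernel-weighted mean of the responses.

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """Gaussian kernel weights of training points X (n, p) relative to a query x0 (p,)."""
    return np.exp(-np.sum((X - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

def local_constant_fit(x0, X, y, lam=0.2):
    """Nadaraya-Watson estimate of f(x0): the constant c minimizing
    sum_i K_lam(x0, x_i) (y_i - c)^2, i.e. the kernel-weighted mean of the y_i."""
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)
```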
Class 3: Basis functions and Dictionary methods
$f$ is modelled as a linear expansion of basis functions
$$f_\theta(x) = \sum_{m=1}^{M} \theta_m\, h_m(x)$$
Each $h_m$ is a function of the input $x$.
"Linear" refers to the action of the parameters $\theta$.
Example 1: Radial Basis Functions
$$f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m$$
where
$K_{\lambda_m}(\mu_m, x)$ is a symmetric kernel centred at location $\mu_m$, and
the Gaussian kernel is a popular kernel to use:
$$K_\lambda(\mu_m, x) = \exp\left( -\|\mu_m - x\|^2 / (2\lambda) \right)$$
If the $\lambda_m$'s and $\mu_m$'s are pre-defined $\Longrightarrow$ estimating $\theta$ is a linear problem.
However, if the $\lambda_m$'s and $\mu_m$'s are not pre-defined $\Longrightarrow$ estimating $\theta$, the $\lambda_m$'s and the $\mu_m$'s is a hard non-linear problem.
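A sketch of the easy case, with centres and width fixed in advance so only the $\theta_m$ are estimated (by ordinary least squares on the basis matrix):

```python
import numpy as np

def rbf_design_matrix(X, centers, lam):
    """H[i, m] = exp(-||mu_m - x_i||^2 / (2 lam)) for inputs X (n, p) and centres (M, p)."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * lam))

def fit_rbf_linear(X, y, centers, lam=0.2):
    """Least squares estimate of theta in f(x) = sum_m theta_m K_lam(mu_m, x)."""
    theta, *_ = np.linalg.lstsq(rbf_design_matrix(X, centers, lam), y, rcond=None)
    return theta

def predict_rbf(theta, X, centers, lam=0.2):
    return rbf_design_matrix(X, centers, lam) @ theta
```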
Example 2: Adaptive basis function method
$$f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^t x + b_m)$$
where
$\theta = (\beta_1, \ldots, \beta_M, \alpha_1, \ldots, \alpha_M, b_1, \ldots, b_M)^t$ and
$\sigma(z) = 1/(1 + e^{-z})$ is the activation function.
The directions $\alpha_m$ and bias terms $b_m$ have to be determined, and estimating them is the core of the estimation.
Dictionary methods
Adaptively chosen basis function methods are also known as dictionary methods.
The challenge is to choose a number of basis functions from a dictionary set $\mathcal{D}$ of candidate basis functions (possibly infinite).
Models are built up by employing some kind of search mechanism.
Model Selection and the Bias-Variance Trade-off
The complexity of the learnt function
Many models have a parameter which controls their complexity. We have seen examples of this:
$k$ - the number of nearest neighbours (nearest neighbour classifier)
$\lambda$ - the width of the kernel (radial basis functions)
$M$ - the number of basis functions (dictionary methods)
$\lambda$ - the weight of the penalty term (spline fitting)
How does increasing or decreasing the complexity of the model affect its predictive behaviour?
Consider the nearest neighbour regression fit
[Figure: the true curve $f(x)$ and the 1-nn regression fit $\hat f(x)$.]
Approximate $f(x)$ with the 1-nn regression fit $\hat f_1(x)$ given $\{(x_i, y_i)\}_{i=1}^{n}$ and $n = 100$.
Each training example is $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ and $\sigma = 0.1$.
Expected predictor when k = 1
[Figure: $E[\hat f(x)]$ for the 1-nn fit, with one standard deviation of the estimate shown at each $x$.]
Shown above is the expected prediction of the 1-nn regression fit given $n = 100$ and $\sigma = 0.1$.
$E[\hat f_1(x)]$ is a good approximation to $f(x)$. There is no bias!
At each $x$ one standard deviation of the estimate is shown. Note its magnitude.
15-nn regression fit
[Figure: the true curve $f(x)$ and the 15-nn regression fit $\hat f(x)$.]
Approximate $f(x)$ with the 15-nn regression fit $\hat f_{15}(x)$ given $\{(x_i, y_i)\}_{i=1}^{n}$ and $n = 100$.
Each training example is $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ and $\sigma = 0.1$.
Expected predictor when k = 15
[Figure: $E[\hat f(x)]$ for the 15-nn fit.]
$E[\hat f_{15}(x)]$ is smooth but biased.
Compare the peak of $f(x)$ and $E[\hat f_{15}(x)]$!
Note the variance of the estimate is much smaller than when $k = 1$.
Have illustrated the Bias-Variance trade-off
[Figure: $E[\hat f(x)]$ for high complexity ($k = 1$) and lower complexity ($k = 15$).]
As model complexity is increased, the variance tends to increase and the squared bias tends to decrease.
As model complexity is decreased, the variance tends to decrease, but the squared bias tends to increase.
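A small Monte Carlo sketch of this trade-off for k-nn regression at a single point (the test function and the evaluation point are my own choices, not the curve on the slides):

```python
import numpy as np

def knn_fit_at(x0, X, y, k):
    """k-nn regression estimate at a single point x0 for 1-D inputs X."""
    nbrs = np.argsort(np.abs(X - x0))[:k]
    return y[nbrs].mean()

def bias_variance_at(x0, f, k, n=100, sigma=0.1, n_trial=500, seed=2):
    """Variance and squared bias of the k-nn estimate at x0 over repeated training sets."""
    rng = np.random.default_rng(seed)
    ests = []
    for _ in range(n_trial):
        X = rng.uniform(0.0, 1.0, size=n)
        y = f(X) + sigma * rng.standard_normal(n)
        ests.append(knn_fit_at(x0, X, y, k))
    ests = np.array(ests)
    return ests.var(), (ests.mean() - f(x0)) ** 2

f = lambda x: np.sin(4 * np.pi * x)   # an assumed test curve, peaked at x = 0.125
for k in (1, 15):
    var, bias2 = bias_variance_at(0.125, f, k)
    print(f"k = {k:2d}: variance = {var:.4f}, squared bias = {bias2:.4f}")
```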
How to choose the model complexity?
What not to do:
We want to choose the model complexity which minimizes test error.
The training error is one estimate of the test error.
So we could choose the model complexity that produces the predictor which minimizes the training error.
Not a good idea!
Why?
The training error decreases whenever the model complexity increases.
Overfitting
[FIGURE 2.11 from the book: test and training error as a function of model complexity; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
Too much fitting $\Longrightarrow$ we adapt too closely to the training data and obtain a high-variance predictor.
This scenario is termed overfitting.
In such cases the predictor loses the ability to generalize.
Underfitting
[FIGURE 2.11 from the book, again: test and training error as a function of model complexity.]
A low complexity model $\Longrightarrow$ the predictor may have a large bias.
Therefore the predictor has poor generalization.
Later on in the course we will discuss how to overcome these problems.