Chapter 2: Overview of Supervised Learning
DD3364
March 9, 2012
Introduction and Notation
Problem 1: Regression
[Figure: a scatter plot of training points (x, y). Given a new x, predict its y value?]
Problem 2: Classification
Is it a bike or a face?
Some Terminology
In Machine Learning, outputs are predicted from measured inputs.
In the Statistical literature, responses are predicted from measured predictors.
In Pattern Recognition, responses are predicted from measured features.
The goal of supervised learning is to predict the value of the output(s) given an input and lots of labelled training examples
$\{(\text{input}_1, \text{output}_1), (\text{input}_2, \text{output}_2), \ldots, (\text{input}_n, \text{output}_n)\}$
Variable types
Outputs can be
discrete (categorical, qualitative),
continuous (quantitative) or
ordered categorical (order is important)
Predicting a discrete output is referred to as classification.
Predicting a continuous output is referred to as regression.
Notation of the book
Denote an input variable by $X$.
If $X$ is a vector, its components are denoted by $X_j$.
Quantitative (continuous) outputs are denoted by $Y$.
Qualitative (discrete) outputs are denoted by $G$.
Observed values are written in lower case:
$x_i$ is the $i$th observed value of $X$. If $X$ is a vector then $x_i$ is a vector of the same length.
$g_i$ is the $i$th observed value of $G$.
Matrices are represented by bold uppercase letters.
I will try to stick to these conventions in the slides.
More Notation
The prediction of the output for a given value of the input vector $X$ is denoted by $\hat{Y}$.
It is presumed that we have labelled training data for regression problems,
$$\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
It is presumed that we have labelled training data for classification problems,
$$\mathcal{T} = \{(x_1, g_1), \ldots, (x_n, g_n)\}$$
with each $x_i \in \mathbb{R}^p$ and $g_i \in \{1, \ldots, G\}$.
Prediction via Least Squares and
Nearest Neighbours
Linear Model
Have an input vector $X = (X_1, \ldots, X_p)^t$.
A linear model predicts the output $Y$ as
$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$$
where $\hat\beta_0$ is known as the intercept and also as the bias.
Let $X = (1, X_1, \ldots, X_p)^t$ and $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_p)^t$; then
$$\hat{Y} = X^t \hat\beta$$
Linear Models and Least Squares
How is a linear model fit to a set of training data?
The most popular approach is least squares: $\hat\beta$ is chosen to minimize
$$\text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^t \beta)^2$$
As this is quadratic, a minimum always exists, but it may not be unique.
In matrix notation we can write $\text{RSS}(\beta)$ as
$$\text{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^t (\mathbf{y} - \mathbf{X}\beta)$$
where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is a matrix with each row being an input vector and $\mathbf{y} = (y_1, \ldots, y_n)^t$.
Linear Models and Least Squares
The solution to
$$\hat\beta = \arg\min_{\beta}\, (\mathbf{y} - \mathbf{X}\beta)^t (\mathbf{y} - \mathbf{X}\beta)$$
is given by
$$\hat\beta = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{y}$$
if $\mathbf{X}^t \mathbf{X}$ is non-singular.
This is easy to show by differentiating $\text{RSS}(\beta)$.
This model has $p + 1$ parameters.
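A minimal numpy sketch of this fit (the helper names are mine; a column of ones supplies the intercept, and `np.linalg.lstsq` is used instead of forming $(\mathbf{X}^t\mathbf{X})^{-1}$ explicitly, which is numerically safer):

```python
import numpy as np

def fit_linear_least_squares(X, y):
    """Least squares fit of a linear model with an intercept.

    X : (n, p) array of inputs, y : (n,) array of outputs.
    Returns beta_hat of length p + 1, intercept first.
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend a constant 1 for the intercept
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes ||y - X1 beta||^2
    return beta_hat

def predict_linear(beta_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return X1 @ beta_hat
```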
Linear Models, Least Squares and Classification
Assume one has training data $\{(x_i, y_i)\}_{i=1}^{n}$ with each $y_i \in \{0, 1\}$ (it is really categorical data).
A linear regression model is fit to the data and
$$\hat{G}(x) = \begin{cases} 0 & \text{if } x^t\hat\beta \le 0.5 \\ 1 & \text{if } x^t\hat\beta > 0.5 \end{cases}$$
This is not the best way to perform binary classification with a linear discriminant function...
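A self-contained sketch of this rule (the 0/1 labels are regressed on directly, as above, and the prediction thresholded at 0.5; helper names are mine):

```python
import numpy as np

def fit_linear_classifier(X, g):
    """Regress directly on the 0/1 class labels g (least squares with an intercept)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    beta_hat, *_ = np.linalg.lstsq(X1, g.astype(float), rcond=None)
    return beta_hat

def classify_linear(beta_hat, X):
    """G_hat(x) = 1 if the regression prediction exceeds 0.5, else 0."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return (X1 @ beta_hat > 0.5).astype(int)
```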
Example: binary classification with a linear model
[Figure: the training points of the two classes and the fitted linear decision boundary.]
The linear classifier mis-classifies quite a few of the training examples.
The linear model may be too rigid.
By inspection it seems the two classes cannot be separated by a line.
Points from each class are generated from a GMM with 10 mixture components.
k-Nearest Neighbour regression fitting
The k-nearest neighbour fit for $Y$ is
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where $N_k(x)$ is the neighbourhood of $x$ defined by the $k$ closest points $x_i$ in the training data.
Closeness is defined by some metric. For this lecture assume it is the Euclidean distance.
k-nearest neighbours in words:
Find the $k$ observations $x_i$ closest to $x$ and average their responses.
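A brute-force numpy sketch of this fit for a single query point (illustrative helper names; no attempt at an efficient neighbour search):

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    """k-nn regression fit: average the responses of the k closest training points.

    x : (p,) query point, X_train : (n, p) inputs, y_train : (n,) responses.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nbrs = np.argsort(dists)[:k]                  # indices of the neighbourhood N_k(x)
    return y_train[nbrs].mean()
```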
k-Nearest Neighbour binary classification
Training data: $\{(x_i, g_i)\}$ with each $g_i \in \{0, 1\}$.
The k-nearest neighbour estimate for $G$ is
$$\hat{G}(x) = \begin{cases} 0 & \text{if } \frac{1}{k} \sum_{x_i \in N_k(x)} g_i \le 0.5 \\ 1 & \text{otherwise} \end{cases}$$
where $N_k(x)$ is the neighbourhood of $x$ defined by the $k$ closest points $x_i$ in the training data.
k-nearest neighbours in words:
Find the $k$ observations $x_i$ closest to $x$ and estimate the class of $x$ as the majority class amongst the neighbours.
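The corresponding majority-vote sketch (same caveats as the regression version above):

```python
import numpy as np

def knn_classify(x, X_train, g_train, k=15):
    """k-nn binary classification: majority vote amongst the k nearest neighbours.

    x : (p,) query point, X_train : (n, p) inputs, g_train : (n,) 0/1 labels.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nbrs = np.argsort(dists)[:k]
    return int(g_train[nbrs].mean() > 0.5)        # class 1 iff more than half the neighbours are 1
```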
Example: k-Nearest Neighbour classification
k = 15
Example: k-Nearest Neighbour classification
k=1
Example: k-Nearest Neighbour classification (k = 1)
For k = 1 all the training examples are correctly classified. This is always the case!
But how well will it perform on test data drawn from the same distribution?
Effective number of parameters of k-nn
There are two parameters that control the behaviour of k-nn: $k$ and $n$, the number of training samples.
The effective number of parameters of k-nn is $n/k$.
Intuitively: say the neighbourhoods were non-overlapping. There would be $n/k$ neighbourhoods, and we would need to fit one parameter (a mean) to each neighbourhood.
k-nn Vs Linear decision boundaries
The linear decision boundary is
smooth,
stable to fit,
but assumes a linear decision boundary is suitable.
In statistical learning lingo: it has low variance and high bias.
The k-nn decision boundary
can adapt to any shape of the data,
is unstable to fit (for small k),
is not smooth but wiggly (for small k).
In statistical learning lingo: it has high variance and low bias.
Optimal Bayes decision boundary
This is the optimal decision boundary computed from the known
pdfs for the two classes.
Mis-classification rate for the simulation experiment
[Figure: training and test error of k-nn and of the linear classifier plotted against $\log(n/k)$, together with the Bayes error rate; $n_{\text{train}} = 200$ and $n_{\text{test}} = 10{,}000$.]
Statistical Decision Theory
Some statistical theory
How do we measure how well $f(X)$ predicts $Y$?
Statisticians would compute the Expected Prediction Error w.r.t. some loss function:
$$\text{EPE}(f) = E_{X,Y}[L(Y, f(X))] = \int\!\!\int L(y, f(x))\, p(x, y)\, dx\, dy$$
A common loss function is the squared error loss
$$L(y, f(x)) = (y - f(x))^2$$
By conditioning on $X$ we can write
$$\text{EPE}(f) = E_{X,Y}[(Y - f(X))^2] = E_X\big[\, E_{Y|X}[(Y - f(X))^2 \mid X]\, \big]$$
Some statistical theory
At a point $x$ we can minimize EPE to get the best prediction of $y$:
$$f(x) = \arg\min_{c}\; E_{Y|X}[(Y - c)^2 \mid X = x]$$
The solution is
$$f(x) = E[Y \mid X = x]$$
This is known as the regression function.
Only one problem with this: one rarely knows the pdf $p(Y|X)$.
The regression methods we encounter can be viewed as ways to approximate $E[Y \mid X = x]$.
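A one-line check of why the conditional mean is the minimizer: add and subtract $E[Y \mid x]$ inside the square, so that
$$E_{Y|X}\big[(Y - c)^2 \mid X = x\big] = E\big[(Y - E[Y\mid x])^2 \mid x\big] + \big(E[Y\mid x] - c\big)^2,$$
since the cross term $2\,E\big[Y - E[Y\mid x] \mid x\big]\big(E[Y\mid x] - c\big)$ vanishes. The first term does not depend on $c$, so the minimizer is $c = E[Y \mid X = x]$.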
Local Methods in High Dimensions
Intuition and k-nearest neighbour averaging
Example:
Training data $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
Predict the response at $x \in \mathcal{X}$ using the training data and 3-nn averaging.
Intuition and k-nearest neighbour averaging
Let
$\mathcal{X} = [-1, 1]^2$ and
the training $x_i$'s be uniformly sampled from $\mathcal{X}$.
[Figure: the $x_i$'s from training sets of different size, plotted in the $(x_1, x_2)$ plane.]
As $n$ increases, the expected area of the neighbourhood containing the 3 nearest neighbours decreases
$\Longrightarrow$ the accuracy of $\hat y$ increases.
Intuition and k-nearest neighbour averaging
Therefore intuition says:
Lots of training data $\Longrightarrow$ k-nearest neighbour produces an accurate, stable prediction.
More formally, as $n$ increases,
$$\hat y = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \;\longrightarrow\; E[Y \mid X = x]$$
However for large p...
The Curse of Dimensionality (Bellman, 1961)
The k-nearest neighbour averaging approach, and our intuition, break down in high dimensions.
Manifestations of this problem, for large $p$:
Nearest neighbours are not so close!
The k-nn of $x$ are closer to the boundary of $\mathcal{X}$.
A prohibitive number of training samples is needed to densely sample $\mathcal{X} \subset \mathbb{R}^p$.
Curse of Dimensionality: Problem 1
For large p nearest neighbours are not so close
Scenario:
Estimate a regression function, $f: \mathcal{X} \to \mathbb{R}$, using a k-nn regressor. Have
$\mathcal{X} = [0, 1]^p$ (the unit hyper-cube) and
training inputs uniformly sampled from $\mathcal{X}$.
Question:
Let $k = r\, n$ where $r \in [0, 1]$, and let $x = 0$.
What is the expected length of the side of the minimal hyper-cube containing the k nearest neighbours of $x$?
Solution:
The volume of a hyper-cube of side $a$ is $a^p$. We are looking for $a$ such that $a^p$ equals a fraction $r$ of the unit hyper-cube's volume. Therefore
$$a^p = r \;\Longrightarrow\; a = r^{1/p}$$
For large p nearest neighbours are not close
To recap, the expected edge length of the hyper-cube containing a fraction $r$ of the training data is
$$e_p(r) = r^{1/p}$$
Plug in some numbers. Let $p = 10$; then
$$e_{10}(0.01) = 0.63, \qquad e_{10}(0.1) = 0.80$$
The entire range of each input is 1. Therefore in this case the 1% and 10% nearest-neighbour estimates are not local estimates.
[Figure: $e_p(r)$ plotted against $r$ for $p = 1, 2, 3, 10$.]
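These numbers are quick to reproduce (a tiny script evaluating $e_p(r) = r^{1/p}$):

```python
# Expected edge length e_p(r) = r**(1/p) of the sub-cube capturing a fraction r of the data.
for p in (1, 2, 3, 10):
    for r in (0.01, 0.1):
        print(f"p = {p:2d}, r = {r:4.2f}:  e_p(r) = {r ** (1.0 / p):.2f}")
```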
Curse of Dimensionality: Problem 2
For large p nearest neighbours are not close II
Scenario:
Estimate a regression function, $f: \mathcal{X} \to \mathbb{R}$, using a k-nearest neighbour regressor. Have
$\mathcal{X}$ is the unit hyper-sphere (ball) in $\mathbb{R}^p$ centred at the origin, and
$n$ training inputs uniformly sampled from $\mathcal{X}$.
Question:
Let $k = 1$ and $x = 0$.
What is the median distance of the nearest neighbour to $x$?
Solution:
This median distance is given by the expression
$$d(p, n) = \left(1 - 0.5^{1/n}\right)^{1/p}$$
Median distance of nearest neighbour to the origin
[Figure: plot of $d(p, n)$ against $p$ for $n = 500$; the distance grows to roughly 0.5 by $p = 10$.]
Note: for $p = 10$ the closest training point is closer to the boundary of $\mathcal{X}$ than to $x$.
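A Monte Carlo sanity check of this formula (a sketch; points are drawn uniformly in the unit ball via normalized Gaussian directions and radii distributed as $U^{1/p}$):

```python
import numpy as np

def median_nn_distance(p, n, n_trial=200, seed=0):
    """Monte Carlo estimate of the median distance from the origin to the nearest
    of n points drawn uniformly from the unit ball in R^p."""
    rng = np.random.default_rng(seed)
    nearest = []
    for _ in range(n_trial):
        g = rng.standard_normal((n, p))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.uniform(size=(n, 1)) ** (1.0 / p)        # radius law for uniformity in the ball
        nearest.append(np.linalg.norm(directions * radii, axis=1).min())
    return np.median(nearest)

p, n = 10, 500
print(median_nn_distance(p, n), (1 - 0.5 ** (1.0 / n)) ** (1.0 / p))  # simulation vs d(p, n)
```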
Consequence of this expression
For large $p$ most of the training data points are closer to the boundary of $\mathcal{X}$ than to $x$.
This is bad because, to make a prediction at $x$, you will use training samples near the edge of the training data.
Therefore you perform extrapolation as opposed to interpolation between neighbouring samples.
Curse of Dimensionality: Problem 3
Dense sampling in high dimensions is prohibitive
Explanation:
Say $n_1 = 100$ samples represents a dense sampling for a single-input problem.
Then $n_{10} = 100^{10}$ samples are required to sample equally densely with 10 such inputs.
Therefore in high dimensions all feasible training sets sparsely sample the input space.
Simulated Example
The Set-up
[Figure: plot of $e^{-8x^2}$ against $x$ on $[-1, 1]$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = e^{-8\|X\|^2}$$
The regression method
[Figure: the test point $x_0$, its nearest training point $x_{(1)}$ and the corresponding estimate $\hat y_0$ on the curve $e^{-8x^2}$.]
Use the 1-nearest-neighbour rule to predict $\hat y_0$ at a test point $x_0$.
Histogram of the position of the nearest neighbour
[Figure: histogram of the nearest-neighbour location $x_{(1)}$ over repeated trials; $p = 1$, $n = 20$.]
Average estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 1$, $n = 20$, $n_{\text{trial}} = 400$.]
Note: the true value is $y_0 = 1$.
p = 2
[Figure: the training inputs now live in the square $[-1, 1]^2$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = e^{-8\|X\|^2}$$
p = 2
[Figure: in two dimensions, the nearest neighbour $x_{(1)}$ of the test point $x_0$ is typically further away.]
Use the 1-nearest-neighbour rule to predict $\hat y_0$ at a test point $x_0$.
1-nn estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 2$, $n = 20$, $n_{\text{trial}} = 600$.]
Note: the true value is $y_0 = 1$.
1-nn estimate of $\hat y_0$
[Figure: histogram of the 1-nn estimate $\hat y_0$; $p = 2$, $n = 40$, $n_{\text{trial}} = 600$.]
Note: the true value is $y_0 = 1$.
As p increases
[Figure: average distance to the nearest neighbour, and average value of $\hat y_0$, plotted against $p$ for $p = 2, \ldots, 10$; $n_{\text{train}} = 1000$, $n_{\text{trial}} = 400$.]
The average distance to the nearest neighbour increases rapidly with $p$,
thus the average estimate of $\hat y_0$ also degrades rapidly.
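A simulation in this spirit (a sketch rather than the slides' exact experiment, but using the same $f$, $n = 1000$ and a 1-nn estimate at the origin):

```python
import numpy as np

def one_nn_at_origin(p, n=1000, n_trial=400, seed=1):
    """Average 1-nn distance to x0 = 0 and average 1-nn estimate of f(0) = 1,
    for f(X) = exp(-8 ||X||^2) with inputs uniform on [-1, 1]^p."""
    rng = np.random.default_rng(seed)
    dists, ests = [], []
    for _ in range(n_trial):
        X = rng.uniform(-1.0, 1.0, size=(n, p))
        norms = np.linalg.norm(X, axis=1)
        i = norms.argmin()                          # nearest neighbour of the origin
        dists.append(norms[i])
        ests.append(np.exp(-8.0 * norms[i] ** 2))   # its (noise-free) response y_i
    return np.mean(dists), np.mean(ests)

for p in (1, 2, 5, 10):
    d, y_hat = one_nn_at_origin(p)
    print(f"p = {p:2d}: avg nn distance = {d:.3f}, avg 1-nn estimate of f(0) = {y_hat:.3f}")
```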
Bias-Variance Decomposition
For the simulation experiment we have a completely deterministic relationship:
$$Y = f(X) = e^{-8\|X\|^2}$$
The Mean Squared Error for estimating $f(0)$ is
$$\begin{aligned}
\text{MSE}(x_0) &= E_{\mathcal{T}}[(f(x_0) - \hat y_0)^2] \\
 &= E_{\mathcal{T}}[(\hat y_0 - E_{\mathcal{T}}[\hat y_0])^2] + (E_{\mathcal{T}}[\hat y_0] - f(x_0))^2 \\
 &= \text{Var}_{\mathcal{T}}(\hat y_0) + \text{Bias}^2(\hat y_0)
\end{aligned}$$
Bias-Variance Decomposition for this example
[Figure: MSE, squared bias and variance of $\hat y_0$ plotted against $p$.]
The bias dominates the MSE as $p$ increases.
Why?
As $p$ increases the nearest neighbour is never close to $x_0 = 0$,
hence the estimate $\hat y_0$ tends to 0.
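The decomposition used above follows by adding and subtracting $E_{\mathcal{T}}[\hat y_0]$:
$$(f(x_0) - \hat y_0)^2 = \big(\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big)^2 + \big(E_{\mathcal{T}}[\hat y_0] - f(x_0)\big)^2 + 2\big(\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big)\big(E_{\mathcal{T}}[\hat y_0] - f(x_0)\big)$$
and taking $E_{\mathcal{T}}$ of both sides kills the cross term, since $E_{\mathcal{T}}\big[\hat y_0 - E_{\mathcal{T}}[\hat y_0]\big] = 0$ and $E_{\mathcal{T}}[\hat y_0] - f(x_0)$ is a constant.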
Another Simulated Example
where variance dominates the MSE
The Set-up
[Figure: plot of $\tfrac{1}{2}(x + 1)^3$ against $x$ on $[-1, 1]$.]
Let $\mathcal{X} = [-1, 1]^p$ and have $n = 1000$ training examples $x_i$ uniformly sampled from $\mathcal{X}$.
The relationship between the inputs and the output is defined by
$$Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3$$
The regression method
[Figure: the test point $x_0$, its nearest training point $x_{(1)}$ and the corresponding estimate $\hat y_0$ on the curve $\tfrac{1}{2}(x + 1)^3$.]
Use a 1-nn regressor to estimate $f(x_0)$ where $x_0 = 0$.
Variance dominates the MSE as p increases
[Figure: MSE, squared bias and variance plotted against $p$.]
The variance dominates the MSE as $p$ increases.
Why?
As the deterministic function only involves one dimension, the bias doesn't explode as $p$ increases!
Comparison of Linear and NN predictors
Case 1
[Figure: samples of $Y = 0.5(X_1 + 1)^3 + \epsilon$ plotted against $x_1$.]
$$Y = 0.5(X_1 + 1)^3 + \epsilon, \qquad \epsilon \sim N(0, 1)$$
Case 2
[Figure: samples of $Y = X_1 + \epsilon$ plotted against $x_1$.]
$$Y = X_1 + \epsilon, \qquad \epsilon \sim N(0, 1)$$
Linear predictor Vs 1-NN predictor
[Figure: EPE against $p$ for the four combinations: $f(x)$ linear with a 1-nn predictor, $f(x)$ linear with a linear predictor, $f(x)$ cubic with a 1-nn predictor, and $f(x)$ cubic with a linear predictor.]
EPE refers to the expected prediction error at the point $x_0 = 0$:
$$\text{EPE}(x_0) = E_{y_0|x_0}\big[\, E_{\mathcal{T}}[(y_0 - \hat y_0)^2]\, \big]$$
Linear predictor Vs 1-NN predictor
[Figure: the same EPE curves as above.]
The noise level destroys the 1-nn predictor.
The linear predictor has a biased estimate of the cubic function.
The linear predictor fits well even in the presence of noise and high dimension for the linear $f$:
the linear model beats the curse of dimensionality.
Words of Caution
A case of horses for courses:
In the previous example the linear predictor out-performed the 1-nn regression function as
bias of the linear predictor $\ll$ variance of the 1-nn predictor.
But one could easily manufacture an example where
bias of the linear predictor $\gg$ variance of the 1-nn predictor.
More predictors than linear and NN
There is a whole host of models in between the rigid linear model and the extremely flexible 1-nn method.
Each one has its own assumptions and biases.
Many are specifically designed to avoid the exponential growth in complexity of functions in high dimensions.
Statistical models,
Supervised learning and
Function approximation
Goal
We know there is a function $f(x)$ relating inputs to outputs:
$$Y \approx f(X)$$
We want to find an estimate $\hat f(x)$ of $f(x)$ from labelled training data.
This is difficult when $X$ is high dimensional.
In this case we need to incorporate special structure to
reduce the bias and variance of the estimates and
help combat the curse of dimensionality.
A Statistical Model for Regression
[Figure: noisy samples $(x, y)$ scattered around the curve $f(x)$.]
$$\underbrace{Y}_{\text{output}} = \underbrace{f(X)}_{\text{deterministic relationship}} + \underbrace{\epsilon}_{\text{random variable independent of the input } X}$$
Additive Error Model
$$Y = f(X) + \epsilon$$
where
the random variable $\epsilon$ has $E[\epsilon] = 0$,
$\epsilon$ is independent of $X$, and
$f(x) = E[Y \mid X = x]$;
any departures from the deterministic relationship are mopped up by $\epsilon$.
Statistical model for binary classification
[Figure: a probability curve $p(x)$ taking values between 0 and 1.]
$p(G \mid X = x)$ is modelled as a Bernoulli distribution with
$$p(x) = p(G = 1 \mid X = x)$$
Therefore
$$E[G \mid X = x] = p(x) \quad \text{and} \quad \text{Var}[G \mid X = x] = p(x)(1 - p(x))$$
Supervised Learning - Function Approximation
Have training data
$$\mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
where each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
[Figure: the training points $(x_i, y_i)$ in the plane.]
Learn the deterministic relationship $f$ between $X$ and $Y$ from $\mathcal{T}$.
In the book, supervised learning is viewed as a problem in function approximation.
Common approach
Decide on a parametric form of $f_\theta$, i.e. a linear expansion of basis functions
$$f_\theta(x) = \sum_{m=1}^{M} h_m(x)\, \theta_m$$
Use least squares to estimate $\theta$ by minimizing
$$\text{RSS}(\theta) = \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2$$
[FIGURE 2.10 from the book: least squares fitting of a function of two inputs; the parameters of $f_\theta(x)$ are chosen so as to minimize the sum of squared vertical errors.]
Don't have to always use least squares
Can find $\theta$ by optimizing other criteria.
Another option is Maximum Likelihood Estimation.
For the additive model, $Y = f_\theta(X) + \epsilon$, we have
$$P(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)$$
The log-likelihood of the training data is
$$L(\theta) = \sum_{i=1}^{n} \log P(Y = y_i \mid X = x_i, \theta) = \sum_{i=1}^{n} \log N(y_i;\, f_\theta(x_i), \sigma^2)$$
Find the $\theta$ that maximizes $L(\theta)$.
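Writing out the Gaussian log-density makes the connection to least squares explicit:
$$L(\theta) = \sum_{i=1}^{n} \left( -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) = \text{const} - \frac{1}{2\sigma^2}\, \text{RSS}(\theta),$$
so maximizing $L(\theta)$ over $\theta$ is the same as minimizing $\text{RSS}(\theta)$: under the additive Gaussian error model, least squares is maximum likelihood.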
Structured Regression Models
Why do we need structure?
Consider the Residual Sum of Squares for a function $f$:
$$\text{RSS}(f) = \sum_{i=1}^{n} (y_i - f(x_i))^2$$
There are infinitely many $\hat f$ with
$$\hat f = \arg\min_{f}\, \text{RSS}(f) \quad \text{and} \quad \text{RSS}(\hat f) = 0$$
Why do we need structure?
Any function $\hat f$ passing through the training points $(x_i, y_i)$ is a solution.
Obviously not all such $\hat f$ will be equally good at predicting the value of unseen test points...
Must restrict the class of f considered
Don't consider an arbitrary function $\hat f$;
instead restrict ourselves to $f \in \mathcal{F}$:
$$\hat f = \arg\min_{f \in \mathcal{F}}\, \text{RSS}(f)$$
But what restrictions should be used?
The initial ambiguity in choosing $\hat f$ has just been transferred to the choice of constraint.
Options to restrict the class of f
Have a parametric representation of $f$:
Linear model: $f(x) = \beta_1^t x + \beta_0$
Quadratic: $f(x) = x^t \mathbf{B}\, x + \beta_1^t x + \beta_0$ (for a symmetric matrix $\mathbf{B}$)
Impose complexity restrictions on the function,
i.e. $f$ must have some regular behaviour in small neighbourhoods of the input space, but then
What size should the neighbourhood be?
What form should $f$ have in the neighbourhood?
There is no unique way to impose complexity constraints.
Complexity and Neighbourhood size
Large neighbourhood $\Longrightarrow$ strong constraint
Small neighbourhood $\Longrightarrow$ weak constraint
Classes of Restricted Estimators
How to restrict the predictor f
The techniques used to restrict the regression or classification function learned fall loosely into several classes.
Each class has parameter(s), termed smoothing parameters, which control the effective size of the local neighbourhood.
Some examples from each class follow.
Note:
It is assumed we have training examples $\{(x_i, y_i)\}_{i=1}^{n}$, and
we present the energy functions or functionals which are minimised in order to find $\hat f$.
Class 1: Roughness Penalty
$$\text{PRSS}(f, \lambda) = \underbrace{\sum_{i=1}^{n} (y_i - f(x_i))^2}_{\text{ensure } f \text{ predicts the training values}} \;+\; \lambda\, \underbrace{J(f)}_{\text{functional measuring smoothness of } f}$$
where $\lambda$ is the penalty parameter.
One such penalty functional is
$$J(f) = \int [f''(x)]^2\, dx$$
For wiggly $f$'s this functional will have a large value, while for linear $f$'s it is zero.
Regularization methods express our belief that the $f$ we are trying to approximate has certain smoothness properties.
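A crude discretized sketch of this idea (my own construction, not from the slides): represent $f$ by its values on a grid, map each training $x_i$ to its nearest grid point, and penalize squared second differences as a stand-in for $\int [f''(x)]^2\, dx$.

```python
import numpy as np

def penalized_fit(x, y, lam, grid_size=50):
    """Roughness-penalized 1-D fit on a grid.

    Minimizes sum_i (y_i - f(x_i))^2 + lam * sum_j (second difference of f)^2.
    Returns the grid and the fitted values on it.
    """
    grid = np.linspace(x.min(), x.max(), grid_size)
    idx = np.argmin(np.abs(x[:, None] - grid[None, :]), axis=1)  # nearest grid point of each x_i
    A = np.zeros((len(x), grid_size))
    A[np.arange(len(x)), idx] = 1.0              # A maps grid values to predictions at the x_i
    D = np.diff(np.eye(grid_size), n=2, axis=0)  # D computes second differences along the grid
    f_hat = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ y)
    return grid, f_hat
```

Large `lam` pushes the solution towards a straight line; small `lam` lets it track the data closely.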
Class 2: Kernel Methods and Local Regression
Estimate the regression or classification function in a local
neighbourhood.
Need to specify
the nature of local neighbourhood
the class of functions used in local fit
Kernel Methods and Local Regression
Can define a local regression estimate of $f(x_0)$, from training data $\{(x_i, y_i)\}$, as $f_{\hat\theta}(x_0)$ where $\hat\theta$ minimizes
$$\text{RSS}(f_\theta, x_0) = \sum_{i=1}^{n} K_\lambda(x_0, x_i)\, (y_i - f_\theta(x_i))^2$$
where
Kernel function: $K_\lambda(x_0, x_i)$ assigns a weight to $x_i$ depending on its closeness to $x_0$.
Base regression function: $f_\theta$ is a parameterized function such as a low-order polynomial.
A common kernel is the Gaussian kernel
$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left( -\frac{\|x_0 - x\|^2}{2\lambda} \right)$$
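A sketch of the simplest member of this family, the local constant fit (Nadaraya-Watson): with $f_\theta(x) = c$, the weighted RSS above is minimized by the kernel-weighted mean of the responses.

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    """Gaussian kernel weights of training points X (n, p) relative to a query x0 (p,)."""
    return np.exp(-np.sum((X - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

def local_constant_fit(x0, X, y, lam=0.2):
    """Nadaraya-Watson estimate of f(x0): the constant c minimizing
    sum_i K_lam(x0, x_i) (y_i - c)^2, i.e. the kernel-weighted mean of the y_i."""
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)
```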
Class 3: Basis functions and Dictionary methods
$f$ is modelled as a linear expansion of basis functions
$$f_\theta(x) = \sum_{m=1}^{M} \theta_m\, h_m(x)$$
Each $h_m$ is a function of the input $x$.
"Linear" refers to the action of the parameters $\theta$.
Example 1: Radial Basis Functions
$$f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m$$
where
$K_{\lambda_m}(\mu_m, x)$ is a symmetric kernel centred at location $\mu_m$, and
the Gaussian kernel is a popular kernel to use:
$$K_\lambda(\mu_m, x) = \exp\left( -\|\mu_m - x\|^2 / (2\lambda) \right)$$
If the $\lambda_m$'s and $\mu_m$'s are pre-defined $\Longrightarrow$ estimating $\theta$ is a linear problem.
However, if the $\lambda_m$'s and $\mu_m$'s are not pre-defined $\Longrightarrow$ estimating $\theta$, the $\lambda_m$'s and the $\mu_m$'s is a hard non-linear problem.
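A sketch of the easy case, with centres and width fixed in advance so only the $\theta_m$ are estimated (by ordinary least squares on the basis matrix):

```python
import numpy as np

def rbf_design_matrix(X, centers, lam):
    """H[i, m] = exp(-||mu_m - x_i||^2 / (2 lam)) for inputs X (n, p) and centres (M, p)."""
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * lam))

def fit_rbf_linear(X, y, centers, lam=0.2):
    """Least squares estimate of theta in f(x) = sum_m theta_m K_lam(mu_m, x)."""
    theta, *_ = np.linalg.lstsq(rbf_design_matrix(X, centers, lam), y, rcond=None)
    return theta

def predict_rbf(theta, X, centers, lam=0.2):
    return rbf_design_matrix(X, centers, lam) @ theta
```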
Example 2: Adaptive basis function method
$$f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^t x + b_m)$$
where
$\theta = (\beta_1, \ldots, \beta_M, \alpha_1, \ldots, \alpha_M, b_1, \ldots, b_M)^t$ and
$\sigma(z) = 1/(1 + e^{-z})$ is the activation function.
The directions $\alpha_m$ and bias terms $b_m$ have to be determined, and estimating them is the core of the estimation.
Dictionary methods
Adaptively chosen basis function methods are also known as dictionary methods.
The challenge is to choose a number of basis functions from a dictionary set $\mathcal{D}$ of candidate basis functions (possibly infinite).
Models are built up by employing some kind of search mechanism.
Model Selection and the Bias-Variance Trade-off
The complexity of the learnt function
Many models have a parameter which controls their complexity. We have seen examples of this:
$k$ - the number of nearest neighbours (nearest neighbour classifier)
$\lambda$ - the width of the kernel (radial basis functions)
$M$ - the number of basis functions (dictionary methods)
$\lambda$ - the weight of the penalty term (spline fitting)
How does increasing or decreasing the complexity of the model affect its predictive behaviour?
Consider the nearest neighbour regression fit
[Figure: the true curve $f(x)$ and the 1-nn regression fit $\hat f(x)$.]
Approximate $f(x)$ with the 1-nn regression fit $\hat f_1(x)$ given $\{(x_i, y_i)\}_{i=1}^{n}$ and $n = 100$.
Each training example is $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ and $\sigma = 0.1$.
Expected predictor when k = 1
[Figure: $E[\hat f(x)]$ for the 1-nn fit, with one standard deviation of the estimate shown at each $x$.]
Shown above is the expected prediction of the 1-nn regression fit given $n = 100$ and $\sigma = 0.1$.
$E[\hat f_1(x)]$ is a good approximation to $f(x)$. There is no bias!
At each $x$ one standard deviation of the estimate is shown. Note its magnitude.
15-nn regression fit
[Figure: the true curve $f(x)$ and the 15-nn regression fit $\hat f(x)$.]
Approximate $f(x)$ with the 15-nn regression fit $\hat f_{15}(x)$ given $\{(x_i, y_i)\}_{i=1}^{n}$ and $n = 100$.
Each training example is $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$ and $\sigma = 0.1$.
Expected predictor when k = 15
[Figure: $E[\hat f(x)]$ for the 15-nn fit.]
$E[\hat f_{15}(x)]$ is smooth but biased.
Compare the peak of $f(x)$ and $E[\hat f_{15}(x)]$!
Note the variance of the estimate is much smaller than when $k = 1$.
Have illustrated the Bias-Variance trade-off
[Figure: $E[\hat f(x)]$ for high complexity ($k = 1$) and lower complexity ($k = 15$).]
As model complexity is increased, the variance tends to increase and the squared bias tends to decrease.
As model complexity is decreased, the variance tends to decrease, but the squared bias tends to increase.
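A small Monte Carlo sketch of this trade-off for k-nn regression at a single point (the test function and the evaluation point are my own choices, not the curve on the slides):

```python
import numpy as np

def knn_fit_at(x0, X, y, k):
    """k-nn regression estimate at a single point x0 for 1-D inputs X."""
    nbrs = np.argsort(np.abs(X - x0))[:k]
    return y[nbrs].mean()

def bias_variance_at(x0, f, k, n=100, sigma=0.1, n_trial=500, seed=2):
    """Variance and squared bias of the k-nn estimate at x0 over repeated training sets."""
    rng = np.random.default_rng(seed)
    ests = []
    for _ in range(n_trial):
        X = rng.uniform(0.0, 1.0, size=n)
        y = f(X) + sigma * rng.standard_normal(n)
        ests.append(knn_fit_at(x0, X, y, k))
    ests = np.array(ests)
    return ests.var(), (ests.mean() - f(x0)) ** 2

f = lambda x: np.sin(4 * np.pi * x)   # an assumed test curve, peaked at x = 0.125
for k in (1, 15):
    var, bias2 = bias_variance_at(0.125, f, k)
    print(f"k = {k:2d}: variance = {var:.4f}, squared bias = {bias2:.4f}")
```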
How to choose the model complexity?
What not to do:
We want to choose the model complexity which minimizes test error.
The training error is one estimate of the test error.
So we could choose the model complexity that produces the predictor which minimizes the training error.
Not a good idea!
Why?
The training error decreases whenever the model complexity increases.
Overfitting
[FIGURE 2.11 from the book: test and training error as a function of model complexity; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
Too much fitting $\Longrightarrow$ we adapt too closely to the training data and obtain a high-variance predictor.
This scenario is termed overfitting.
In such cases the predictor loses the ability to generalize.
Underfitting
[FIGURE 2.11 from the book, again: test and training error as a function of model complexity.]
A low complexity model $\Longrightarrow$ the predictor may have a large bias.
Therefore the predictor has poor generalization.
Later on in the course we will discuss how to overcome these problems.