Introduction to Machine Learning
Linear Models for Regression
林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University
Some slides are modified from Prof. Sheng-Jyh Wang
and Prof. Hwang-Tzong Chen
Regression
• Given a training data set comprising $N$ observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and the corresponding target values $\{t_n\}_{n=1}^{N}$, the goal of regression is to predict the value of $t$ for a new value of $\mathbf{x}$
[Link]
2
A simple regression model
• A simple linear model: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$
➢ Each observation lies in a $D$-dimensional space: $\mathbf{x} = (x_1, \ldots, x_D)^{\mathrm{T}}$
➢ $y$ is a regression model parametrized by $\mathbf{w} = (w_0, \ldots, w_D)^{\mathrm{T}}$
➢ The output is a linear combination of the input variables
➢ It is also a linear function of the parameters
➢ Its fitting power is quite limited, so we seek a nonlinear extension in the input variables
3
An example
• A regressor of the form $y(x, \mathbf{w}) = w_0 + w_1 x$
➢ A straight line in this case → insufficient fitting power
➢ Apply nonlinear feature transforms before linear regression
4
Linear regression with nonlinear basis functions
• Simple linear model: $y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D$
• A linear model with nonlinear basis functions:
$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$
where $\{\phi_j\}_{j=1}^{M-1}$: nonlinear basis functions
𝑀: the number of parameters
𝑤0 : the bias parameter allowing a fixed offset
• The regression output is a linear combination of nonlinear
basis functions of the inputs
5
Linear regression with nonlinear basis functions
• A linear model with nonlinear basis functions:
$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x})$
• Let $\phi_0(\mathbf{x}) = 1$ be a dummy basis function. The regression function is equivalently expressed as
$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$
where $\mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}$ and $\boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^{\mathrm{T}}$
6
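As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the compact form $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$, assuming a simple polynomial basis for one-dimensional inputs; the function names are illustrative.

```python
import numpy as np

def phi(x, M):
    """Basis vector (phi_0(x), ..., phi_{M-1}(x))^T with phi_0(x) = 1
    (the dummy basis) and, for illustration, phi_j(x) = x**j."""
    return np.array([x ** j for j in range(M)])

def y(x, w):
    """Regression output y(x, w) = w^T phi(x)."""
    return float(w @ phi(x, len(w)))

w = np.array([0.5, -1.0, 2.0])   # example parameters (w_0, w_1, w_2)
print(y(1.5, w))                 # prediction at x = 1.5 -> 0.5 - 1.5 + 4.5 = 3.5
```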
Examples of basis functions
• Polynomial basis function: taking the form of powers of $x$, i.e., $\phi_j(x) = x^j$
• Gaussian basis function: $\phi_j(x) = \exp\!\left(-\dfrac{(x - \mu_j)^2}{2s^2}\right)$, governed by $\mu_j$ and $s$
➢ $\mu_j$ governs the location while $s$ governs the scale
• Sigmoidal basis function: $\phi_j(x) = \sigma\!\left(\dfrac{x - \mu_j}{s}\right)$, governed by $\mu_j$ and $s$,
where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
7
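A small NumPy sketch (added here, not in the slides) of the three basis-function families above; the arguments `mu` and `s` correspond to $\mu_j$ and $s$.

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gauss_basis(x, mu, s):
    """Gaussian basis: exp(-(x - mu)^2 / (2 s^2)).
    mu controls the location, s controls the scale."""
    return np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))

def sigmoid_basis(x, mu, s):
    """Sigmoidal basis: sigma((x - mu) / s), with sigma(a) = 1 / (1 + exp(-a))."""
    a = (x - mu) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1, 1, 5)
print(poly_basis(x, 2))
print(gauss_basis(x, mu=0.0, s=0.3))
print(sigmoid_basis(x, mu=0.0, s=0.1))
```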
How basis functions work
• Take Gaussian basis functions as an example
$y = w_0 + w_1\phi_1(x) + w_2\phi_2(x) + \cdots + w_{M-1}\phi_{M-1}(x)$
[Figure: Gaussian basis functions $\phi_1(x), \ldots, \phi_8(x)$ placed at different locations along the input axis]
8
Maximum likelihood and least squares
• Assume each target value is generated by a deterministic function with additive Gaussian noise:
$t = y(\mathbf{x}, \mathbf{w}) + \varepsilon$
where $\varepsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$
• Thus, we have the conditional probability
$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
9
Maximum likelihood and least squares
• Given a data set of inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ with corresponding target values $t_1, \ldots, t_N$, we have the likelihood function
$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$
• The log likelihood function is
$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$
where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$ is the sum-of-squares error
10
Maximum likelihood and least squares
• Maximizing the likelihood under Gaussian noise ⟺ minimizing the sum-of-squares error function
• Maximum likelihood solution: optimize $\mathbf{w}$ by maximizing the log likelihood function
• Step 1: Compute the gradient of the log likelihood w.r.t. $\mathbf{w}$
$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$
• Step 2: Set the gradient to zero, which gives
$0 = \sum_{n=1}^{N} t_n\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} - \mathbf{w}^{\mathrm{T}}\left(\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\right)$
12
Maximum likelihood and least squares
• Define the design matrix $\boldsymbol{\Phi}$ in this task, with entries $\Phi_{nj} = \phi_j(\mathbf{x}_n)$:
$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}$
➢ It has 𝑁 rows, one for each training sample
➢ It has 𝑀 columns, one for each basis function
13
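One possible way to build the design matrix in NumPy (an illustrative sketch; the helper names and the particular Gaussian basis are choices made here, not taken from the slides):

```python
import numpy as np

def design_matrix(X, basis_fns):
    """Build the N x M design matrix Phi with Phi[n, j] = phi_j(x_n).
    basis_fns is a list of M callables; the first should be the dummy
    basis phi_0(x) = 1, which absorbs the bias term w_0."""
    return np.array([[phi(x) for phi in basis_fns] for x in X])

# Example: dummy basis plus three Gaussian basis functions (M = 4)
mus = np.linspace(-1.0, 1.0, 3)
basis_fns = [lambda x: 1.0] + [
    lambda x, mu=mu: np.exp(-(x - mu) ** 2 / (2 * 0.3 ** 2)) for mu in mus
]
X = np.array([-0.8, -0.2, 0.4, 0.9])
Phi = design_matrix(X, basis_fns)
print(Phi.shape)  # (N, M) = (4, 4)
```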
Maximum likelihood and least squares
• Setting the gradient to zero,
$0 = \sum_{n=1}^{N} t_n\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} - \mathbf{w}^{\mathrm{T}}\left(\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}\right),$
we have
$\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$
• How to derive?
➢ Hint 1: $\sum_{n=1}^{N}\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} = \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$
➢ Hint 2: $\sum_{n=1}^{N} t_n\boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$
14
Maximum likelihood and least squares
• The ML solution: $\mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$
• $\boldsymbol{\Phi}^{\dagger} \equiv (\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}$ is known as the Moore-Penrose pseudo-inverse of the design matrix
• Suppose $\boldsymbol{\Phi}$ has linearly independent columns. Why is $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ invertible?
15
Maximum likelihood and least squares
• Similarly, $\beta$ is optimized by maximizing the log likelihood
$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})$
where $E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
• We get
$\dfrac{1}{\beta_{\mathrm{ML}}} = \dfrac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2$
16
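A sketch of the ML solution in NumPy (not part of the slides): `np.linalg.lstsq` computes the pseudo-inverse solution $\mathbf{w}_{\mathrm{ML}} = \boldsymbol{\Phi}^{\dagger}\mathbf{t}$ in a numerically stable way, and $\beta_{\mathrm{ML}}$ follows from the mean squared residual. The toy data are invented for illustration.

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum likelihood fit: w_ML = (Phi^T Phi)^{-1} Phi^T t
    (via a least-squares solver) and 1/beta_ML = mean squared residual."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)
    return w_ml, beta_ml

# toy data: noisy line, with design matrix columns (1, x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
w_ml, beta_ml = fit_ml(Phi, t)
print(w_ml, 1.0 / beta_ml)  # fitted weights and estimated noise variance
```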
Regression for a new data point
• The conditional probability (likelihood function): $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
• After learning, we set $\mathbf{w} \leftarrow \mathbf{w}_{\mathrm{ML}}$ and $\beta \leftarrow \beta_{\mathrm{ML}}$
• The prediction for a new data point $\mathbf{x}$ is then a Gaussian distribution with mean $y(\mathbf{x}, \mathbf{w}_{\mathrm{ML}})$ and variance $\beta_{\mathrm{ML}}^{-1}$
17
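A tiny illustrative sketch (with hypothetical fitted values, not from the slides) of forming this predictive Gaussian for a new input:

```python
import numpy as np

def predict_ml(phi_x, w_ml, beta_ml):
    """Predictive Gaussian for a new input x:
    mean = y(x, w_ML) = w_ML^T phi(x), variance = 1 / beta_ML."""
    return float(w_ml @ phi_x), 1.0 / beta_ml

# example with the basis phi(x) = (1, x)^T and illustrative fitted values
phi_x = np.array([1.0, 0.3])
mean, var = predict_ml(phi_x, w_ml=np.array([0.5, 2.0]), beta_ml=100.0)
print(mean, var)
```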
Regularized least squares
• Adding a regularization term helps alleviate over-fitting
• The simplest form of the regularization term is $E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}$
• The total error function becomes
$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}$
• Setting the gradient of this function w.r.t. $\mathbf{w}$ to 0, we have
$\mathbf{w} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$
18
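A minimal NumPy sketch of the regularized solution $(\lambda\mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$; the function name and toy data are invented for illustration.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# toy example with design matrix columns (1, x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_ridge(Phi, t, lam=0.1))  # weights shrink toward zero as lam grows
```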
Regularized least squares
• A more general regularizer:
$\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}|w_j|^q$
• q=2 → quadratic regularizer
• q=1 → the lasso in the statistics literature
• Contours of the regularization term
19
Multiple outputs
• In some applications, we wish to predict 𝐾 > 1 target values
➢ One target value: Income -> Happiness
➢ Multiple target values: Income -> Happiness, Hours of duty, Health
• Recall the one-dimensional case: $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$
• With the same basis functions, the regression model becomes
$\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$
where $\mathbf{W}$ is an $M \times K$ matrix, $M$ is the number of basis functions, and $K$ is the number of target values
20
Multiple outputs
• The conditional probability of a single observation is
$p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I})$
➢ An isotropic Gaussian, i.e., the covariance matrix is a scaled identity
➢ Each pair of target variables is independent
• The log likelihood function is
$\ln p(\mathbf{T} \mid \mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}(\mathbf{t}_n \mid \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\mathbf{I}) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$
21
Multiple outputs: Maximum likelihood solution
• Setting the gradient of the log likelihood function w.r.t. $\mathbf{W}$ to 0, we have
$\mathbf{W}_{\mathrm{ML}} = (\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{T}$
• Consider the $k$th column of $\mathbf{W}_{\mathrm{ML}}$:
$\mathbf{w}_k = (\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}_k$
where $\mathbf{t}_k$ is an $N$-dimensional vector with components $t_{nk}$
• It leads to $K$ independent regression problems
22
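A short sketch (not from the slides) of the multi-output solution; it relies on the fact that a least-squares solver applied to an $N \times K$ target matrix solves the $K$ columns independently.

```python
import numpy as np

def fit_ml_multi(Phi, T):
    """Multi-output ML solution W_ML = (Phi^T Phi)^{-1} Phi^T T.
    Phi is N x M and T is N x K, so W_ML is M x K; column k of W_ML is
    the ordinary ML solution for the k-th target vector t_k."""
    W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return W_ml

# toy example: N = 4 samples, M = 2 basis functions, K = 3 targets
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
T = Phi @ np.array([[0.5, -0.3, 1.0], [2.0, 0.7, -1.0]])  # noise-free targets
print(fit_ml_multi(Phi, T))  # recovers the 2 x 3 weight matrix
```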
Sequential learning
• The maximum likelihood derivation is a batch technique
➢ It takes all training data into account at the same time
➢ Case 1: The training data set is so large that processing it all at once is costly
➢ Case 2: Data points are arriving in a continuous stream
• For the two cases, it is worthwhile to use sequential
algorithms, or on-line algorithms, in which the data points are
considered one by one, and the model parameters are
updated incrementally
23
Sequential learning
• Stochastic gradient descent
➢ The error function comprises a sum over data points: $E = \sum_n E_n$
➢ Given data point $\mathbf{x}_n$, the parameter vector $\mathbf{w}$ is updated by
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_n$
where $\tau$ is the iteration number and $\eta$ is the learning rate
➢ In the case of the sum-of-squares error, this becomes
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\left(t_n - \mathbf{w}^{(\tau)\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)$
24
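A minimal sketch (added here) of this stochastic update for the sum-of-squares error, run over a synthetic data stream; the toy data and learning rate are illustrative choices.

```python
import numpy as np

def sgd_update(w, phi_n, t_n, eta):
    """One stochastic gradient step for the sum-of-squares error:
    w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)."""
    return w + eta * (t_n - w @ phi_n) * phi_n

# sequential pass over a toy data stream with basis phi(x) = (1, x)^T
rng = np.random.default_rng(1)
w = np.zeros(2)
for _ in range(1000):
    x = rng.uniform(0, 1)
    t = 0.5 + 2.0 * x + rng.normal(scale=0.1)
    w = sgd_update(w, np.array([1.0, x]), t, eta=0.05)
print(w)  # approaches roughly (0.5, 2.0)
```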
Maximum a posteriori
• Likelihood function: $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$
• Let's consider a prior function, which is a Gaussian
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
where $\mathbf{m}_0$ is the mean and $\mathbf{S}_0$ is the covariance matrix
• The posterior function is also a Gaussian
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
where $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t})$ is the mean
and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ gives the covariance
25
How to derive the mean and covariance of the posterior
• According to the marginal and conditional Gaussians on page 93 of the PRML textbook: given $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ and $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$, the conditional distribution is
$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma}\right)$, where $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}$
26
A zero-mean isotropic Gaussian prior
• A general Gaussian prior function: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$,
where $\mathbf{m}_0$ is the mean and $\mathbf{S}_0$ is the covariance matrix
• A widely used Gaussian prior: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$
• Mean and covariance of the resulting posterior function:
$\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$
27
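A minimal sketch (not from the slides) of computing this posterior for the zero-mean isotropic prior; the function name is illustrative.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior over w for the prior p(w) = N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha I + beta Phi^T Phi,  m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# toy usage with design matrix columns (1, x)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
m_N, S_N = posterior(Phi, t, alpha=2.0, beta=25.0)
print(m_N)
```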
Sequential Bayesian learning: An example
• Data, including observations and target values, are given one by one
• Data are in a one-dimensional space
• Data are sampled from the function $f(x, \mathbf{a}) = a_0 + a_1 x$, where $a_0 = -0.3$ and $a_1 = 0.5$, with additive Gaussian noise
➢ Note that the function is unknown
➢ We have just the observations and the target values
28
An example
• Regression function: $y(x, \mathbf{w}) = w_0 + w_1 x$
29
An example
• Regression function: $y(x, \mathbf{w}) = w_0 + w_1 x$
• In the beginning, no
data are available
• Constant likelihood
• Prior = posterior
• Sample 6 curves of the regression function according to the posterior distribution
30
An example
• Regression function: $y(x, \mathbf{w}) = w_0 + w_1 x$
• One data sample (blue circle) is given
• Likelihood for this sample
• White cross: the true parameter values
• Posterior proportional to likelihood × prior
• Sample 6 curves
according to posterior
31
An example
• Regression function: $y(x, \mathbf{w}) = w_0 + w_1 x$
• A second data sample (blue circle) is given
• Likelihood for the second sample
• White cross: the true parameter values
• Posterior proportional to likelihood × prior
• Sample 6 curves
according to posterior
32
An example
• Regression function: $y(x, \mathbf{w}) = w_0 + w_1 x$
• 20 data samples (blue circles) are given
• Likelihood for the 20th sample
• White cross: the true parameter values
• Posterior proportional to likelihood × prior
• Sample 6 curves
according to posterior
33
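To mirror the sequential example above, here is an illustrative sketch (not from the slides) in which the current posterior serves as the prior for the next data point. The toy data follow the $a_0 = -0.3$, $a_1 = 0.5$ setup above; the noise standard deviation 0.2 (hence $\beta = 25$) and $\alpha = 2.0$ are values assumed here for illustration.

```python
import numpy as np

def sequential_update(m, S, phi_n, t_n, beta):
    """Bayesian update after observing one (x_n, t_n): the current
    posterior N(w | m, S) acts as the prior, giving
    S_new^{-1} = S^{-1} + beta * phi phi^T,
    m_new = S_new (S^{-1} m + beta * phi * t_n)."""
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi_n, phi_n))
    m_new = S_new @ (S_inv @ m + beta * phi_n * t_n)
    return m_new, S_new

# stream of 20 points from f(x, a) = -0.3 + 0.5 x plus noise (std 0.2 assumed)
rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
m, S = np.zeros(2), np.eye(2) / alpha      # prior N(0, alpha^{-1} I)
for _ in range(20):
    x = rng.uniform(-1, 1)
    t = -0.3 + 0.5 * x + rng.normal(scale=0.2)
    m, S = sequential_update(m, S, np.array([1.0, x]), t, beta)
print(m)  # posterior mean approaches (-0.3, 0.5)
```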
Predictive distribution
• Recall the posterior function
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
where $\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$ and $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$
• Given $\mathbf{w}$, we regress a data sample via $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$
• In the Bayesian treatment, the predictive distribution is
$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$
• Then we have
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\!\left(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})\right)$
• where $\sigma_N^2(\mathbf{x}) = \dfrac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$
34
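A small illustrative sketch (added here) of evaluating the predictive mean and variance for a new input, given a posterior mean and covariance; the numerical values below are hypothetical.

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive distribution N(t | m_N^T phi(x), sigma_N^2(x)),
    where sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x)."""
    mean = float(m_N @ phi_x)
    var = 1.0 / beta + float(phi_x @ S_N @ phi_x)
    return mean, var

# usage with a 2-parameter model and basis phi(x) = (1, x)^T (illustrative values)
m_N, S_N = np.array([-0.28, 0.46]), 0.01 * np.eye(2)
print(predictive(np.array([1.0, 0.5]), m_N, S_N, beta=25.0))
```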
• Green curve: the underlying function used to sample data; it is unknown to the learner
• Blue circles: the sampled data points
• After learning, we obtain the predictive distribution $p(t \mid x, \mathbf{t}, \alpha, \beta)$
• Red curve: the mean of the predictive Gaussian
• Red shaded region: one standard deviation on either side of the mean
35
36
• Sample 5 parameter vectors $\mathbf{w}$ according to the posterior distribution
• Plot the corresponding regression functions
37
References
• Sections 3.1 and 3.3 of the PRML textbook: C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
38
Thank You for Your Attention!
Yen-Yu Lin (林彥宇)
Email: lin@[Link]
URL: [Link]
39