Generalized Linear Model
Outline
• Linear regression
– Regression: predicting a continuous value
• Logistic regression
– Classification: predicting a discrete value
• Gradient descent
– Very general optimization technique
Regression wants to predict a continuous-
valued output for an input.
• Data: observed pairs {(x1, y1), …, (xn, yn)}
• Goal: predict the output y for a new input x
Linear Regression
Linear regression assumes a linear
relationship between inputs and outputs.
• Data: observed pairs {(x1, y1), …, (xn, yn)}
• Goal: find the line f(x) = w0 + w1x that best fits the data
You collected data about commute times. Now you want to predict the commute time for a new person, who lives 1.1 miles from campus.
[Figure: scatter plot of commute time vs. distance from campus with a fitted line; reading the line at x = 1.1 miles gives a prediction of roughly 23 minutes.]
How can we find this line?
• Define
– xi: input, distance from campus
– yi: output, commute time
• We want to predict y for an
unknown x
• Assume
– In general, assume
y = f(x) + ε
– For 1-D linear regression, assume
f(x) = w0 + w1x
• We want to learn the
parameters w
We can learn w from the observed data by
maximizing the conditional likelihood.
• Recall: y = f(x) + ε, with ε ~ N(0, σ²), so P(y | x, w) is Gaussian with mean f(x) = w0 + w1x
• Introducing some new notation: write the conditional likelihood of the data as ∏i P(yi | xi, w)
• For this Gaussian noise model, maximizing the conditional likelihood is equivalent to
minimizing the least-squares error Σi (yi − f(xi))²
For the 1-D case…
• Two values define this line
– w0: intercept
– w1: slope
– f(x) = w0 + w1x
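As a concrete sketch (the data values here are made up for illustration, not taken from the commute figure), the 1-D least-squares fit can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical commute data (miles from campus, minutes) -- made up for illustration.
x = np.array([0.5, 1.0, 1.5, 2.3, 3.0, 4.2])
y = np.array([12.0, 20.0, 25.0, 35.0, 42.0, 60.0])

# Least-squares fit of f(x) = w0 + w1*x: choose w to minimize sum_i (y_i - f(x_i))^2.
X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of 1s for the intercept, then x
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution [w0, w1]
w0, w1 = w

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
print(f"predicted commute time at 1.1 miles: {w0 + w1 * 1.1:.1f} minutes")
```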
Linear Regression
So far, we’ve been interested in learning P(Y|X) where Y has
discrete values (called ‘classification’)
What if Y is continuous? (called ‘regression’)
• predict weight from gender, height, age, …
• predict Google stock price today from Google, Yahoo,
MSFT prices yesterday
• predict each pixel intensity in robot’s current camera
image, from previous image and previous action
Regression
Wish to learn f: X → Y, where Y is real-valued, given {<x1,y1>, …, <xn,yn>}
Approach:
1. choose some parameterized form for P(Y|X; θ)
( θ is the vector of parameters)
2. derive learning algorithm as MCLE or MAP estimate for θ
1. Choose parameterized form for P(Y|X; θ)
Assume Y is some deterministic f(X), plus random noise:
    Y = f(X) + ε,  where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution
    P(Y | X = x) = N(f(x), σ²)
and the expected value of Y for any given x is f(x)
Consider Linear Regression
E.g., assume f(x) is a linear function of x:
    f(x) = w0 + w1x1 + … + wnxn
Notation: to make our parameters explicit, let's write
    P(Y | X = x; W) = N(w0 + Σi wi xi, σ²)
Training Linear Regression
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
    W_MCLE = arg max_W ∏_l P(y_l | x_l; W)
where
    P(y | x; W) = N(f(x; W), σ²),  with f(x; W) = w0 + Σi wi xi
so:
    W_MCLE = arg max_W Σ_l ln P(y_l | x_l; W) = arg min_W Σ_l ( y_l − f(x_l; W) )²
i.e., maximizing the conditional likelihood is equivalent to minimizing the sum of squared prediction errors.
Can we derive a gradient descent rule for training?
How about a MAP estimate instead of MLE?
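To make the gradient-descent question concrete, here is a minimal sketch (function name, step size, and iteration count are illustrative choices) of batch gradient descent on the sum of squared errors:

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, n_iters=5000):
    """Batch gradient descent on the sum of squared errors (a sketch; lr and n_iters are illustrative).

    X: (n, d) feature matrix, y: (n,) targets.  Returns weights [w0, w1, ..., wd].
    """
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s so w0 is learned like any weight
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = y - Xb @ w                    # y_l - f(x_l; w) for every example
        grad = -2.0 * Xb.T @ residual / len(y)   # gradient of the mean squared error
        w -= lr * grad                           # step opposite the gradient
    return w
```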
Regression – What you should know
Under general assumption
1. MLE corresponds to minimizing sum of squared prediction errors
2. MAP estimate minimizes SSE plus sum of squared weights
3. Again, learning is an optimization problem once we choose our
objective function
• maximize data likelihood
• maximize posterior prob of W
4. Again, we can use gradient descent as a general learning algorithm
• as long as our objective fn is differentiable wrt W
• though we might find local optima instead of the global optimum
5. Almost nothing we said here required that f(x) be linear in x
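Point 2 above (MAP = sum of squared errors plus a sum-of-squared-weights penalty) has a convenient closed form; a minimal sketch, with `lam` standing in for the strength of the zero-mean Gaussian prior:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """MAP / ridge estimate: minimize sum_i (y_i - x_i . w)^2 + lam * ||w||^2  (a sketch).

    X: (n, d) features, y: (n,) targets; intercept handling is omitted for brevity.
    Setting the gradient to zero gives (X^T X + lam*I) w = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```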
Bias/Variance Decomposition of Error
Bias and Variance
• Given some estimator Y for some parameter θ, we define
  – the bias of estimator Y: E[Y] − θ
  – the variance of estimator Y: E[(Y − E[Y])²]
• E.g., define Y as the MLE estimator for the probability of heads, based on n independent coin flips
Bias–Variance decomposition of error
Reading: Bishop chapter 9.1, 9.2
• Consider a simple regression problem f: X → Y
    y = f(x) + ε,  where the noise ε ~ N(0, σ²) and f is deterministic
• We learn an estimate h(x) of f(x)
What are the sources of prediction error?
Sources of error
• What if we have a perfect learner and infinite data?
  – Our learned h(x) satisfies h(x) = f(x)
  – Still have remaining, unavoidable error σ² (the variance of the noise)
Sources of error
• What if we have only n training examples?
• What is our expected error?
  – Taken over random training sets of size n, drawn from distribution D = p(x, y)
• The expected error decomposes into:
    unavoidable error (σ²) + bias² + variance
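A quick way to see these terms is a Monte Carlo simulation: repeatedly draw training sets of size n, fit a model, and look at how the predictions at one test point spread around the truth. Everything below (the true function, noise level, and cubic polynomial fit) is an illustrative choice, not the slides' example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not the slides' example): true function, noise level, and model class.
def f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.3          # standard deviation of the unavoidable noise
x_test, n, trials = 0.25, 20, 500

preds = []
for _ in range(trials):
    x_train = rng.uniform(0, 1, n)                      # a fresh training set of size n
    y_train = f(x_train) + rng.normal(0, sigma, n)      # y = f(x) + noise
    coeffs = np.polyfit(x_train, y_train, deg=3)        # learned estimate h of f (cubic fit)
    preds.append(np.polyval(coeffs, x_test))            # h(x_test) for this training set
preds = np.array(preds)

bias_sq = (preds.mean() - f(x_test)) ** 2               # (E_D[h(x)] - f(x))^2
variance = preds.var()                                  # E_D[(h(x) - E_D[h(x)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, unavoidable error = {sigma**2:.4f}")
```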
Logistic Regression
Logistic regression is a discriminative
approach to classification.
• Classification: predicts discrete-valued output
– E.g., is an email spam or not?
Logistic regression is a discriminative
approach to classification.
• Discriminative: directly estimates P(Y|X)
– Only concerned with discriminating (differentiating)
between classes Y
– In contrast, naïve Bayes is a generative classifier
• Estimates P(Y) & P(X|Y) and uses Bayes’ rule to calculate
P(Y|X)
• Explains how data are generated, given class label Y
• Both logistic regression and naïve Bayes use their
estimates of P(Y|X) to assign a class to an input
X—the difference is in how they arrive at these
estimates.
The assumptions of logistic regression
• Given: training data {(x1, y1), …, (xn, yn)}, with x ∈ R^d and y ∈ {0, 1}
• Want to learn: P(Y=1 | X=x)
The logistic function is appropriate for
making probability estimates.
    logistic(z) = 1 / (1 + e^(−z)), which maps any real z to a value in (0, 1)
Logistic regression models
probabilities with the logistic function.
    P(Y=1 | X=x) = 1 / (1 + exp(−(w0 + Σj wj xj)))
• Want to predict Y=1 for X when P(Y=1|X) ≥ 0.5
[Figure: the logistic curve P(Y=1|X) plotted against w0 + Σj wj xj; predict Y=1 where the curve is at or above 0.5 and Y=0 where it is below.]
Therefore, logistic regression is
a linear classifier.
• Use the logistic function to estimate the
probability of Y given X
• Decision boundary: w0 + Σj wj xj = 0  (predict Y=1 when w0 + Σj wj xj ≥ 0, which is exactly when P(Y=1|X) ≥ 0.5)
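A minimal sketch of this probability estimate and the resulting linear decision rule (the weights below are made up for illustration):

```python
import numpy as np

def predict_proba(w, x):
    """P(Y=1 | x) = 1 / (1 + exp(-(w0 + sum_j w_j x_j))); w[0] is the intercept w0."""
    z = w[0] + np.dot(w[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(w, x):
    """Predict Y=1 exactly when P(Y=1|x) >= 0.5, i.e. when w0 + sum_j w_j x_j >= 0."""
    return int(w[0] + np.dot(w[1:], x) >= 0)

w = np.array([-1.0, 2.0, 0.5])     # made-up weights [w0, w1, w2]
x = np.array([0.8, 0.3])
print(predict_proba(w, x), predict_label(w, x))
```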
Maximize the conditional likelihood to
find the weights w = [w0,w1,…,wd].
How can we optimize this function?
• The conditional log likelihood is concave in w [check its Hessian]
• No closed-form solution for w
Logistic Regression
Idea:
• Naïve Bayes allows computing P(Y|X) by
learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 … Xn>
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik, σi)
  • model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?
Derive the form of P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi.
Very convenient! These assumptions imply that P(Y|X) is the logistic function applied to a linear function of X:
    P(Y=1 | X=<X1,…,Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
which implies
    P(Y=0|X) / P(Y=1|X) = exp(−(w0 + Σi wi Xi))
which implies
    ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wi Xi
which implies a linear classification rule!
Logistic function: logistic(z) = 1 / (1 + e^(−z))
Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued).
• Now y ∈ {y1, …, yR}: learn R−1 sets of weights
    for k < R:  P(Y=yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))
    for k = R:  P(Y=yR | X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))
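A small sketch of these probabilities with R−1 weight vectors (array shapes and names are my own; class R serves as the reference class, matching the formulas above):

```python
import numpy as np

def multiclass_logistic_probs(W, b, x):
    """P(Y = y_k | x) using R-1 weight vectors; class R is the reference class (a sketch).

    W: (R-1, d) weights, b: (R-1,) intercepts w_k0, x: (d,) features.
    Returns an array of R probabilities [P(y_1|x), ..., P(y_{R-1}|x), P(y_R|x)].
    """
    scores = np.exp(b + W @ x)                   # exp(w_k0 + sum_i w_ki x_i) for k < R
    denom = 1.0 + scores.sum()                   # shared denominator from the formulas above
    return np.append(scores / denom, 1.0 / denom)
```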
Training Logistic Regression: MCLE
• We have L training examples: {<X^1, Y^1>, …, <X^L, Y^L>}
• Maximum likelihood estimate for parameters W:
    W_MLE = arg max_W P(<X^1, Y^1>, …, <X^L, Y^L> | W)
• Maximum conditional likelihood estimate:
    W_MCLE = arg max_W ∏_l P(Y^l | X^l, W)
• Choose parameters W = <w0, …, wn> to maximize the conditional likelihood of the training data, where
  – Training data D = {<X^1, Y^1>, …, <X^L, Y^L>}
  – Data likelihood = ∏_l P(X^l, Y^l | W)
  – Data conditional likelihood = ∏_l P(Y^l | X^l, W)
Expressing the Conditional Log Likelihood
    l(W) ≡ ln ∏_l P(Y^l | X^l, W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
Maximizing the Conditional Log Likelihood
Good news: l(W) is a concave function of W
Bad news: no closed-form solution to maximize l(W)
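For concreteness, l(W) can be written as a short function; this is a sketch under the convention P(Y=1|x) = 1/(1 + exp(−(w0 + Σi wi xi))) used above, with np.logaddexp for numerical stability:

```python
import numpy as np

def conditional_log_likelihood(W, X, y):
    """l(W) = sum_l [ Y^l ln P(Y^l=1|X^l,W) + (1-Y^l) ln P(Y^l=0|X^l,W) ]  (a sketch).

    X: (L, n) features, y: (L,) labels in {0, 1}, W: (n+1,) weights with W[0] = w0.
    """
    z = W[0] + X @ W[1:]
    # With P(Y=1|x) = 1/(1 + exp(-z)):  ln P(Y=1|x) = z - ln(1+e^z),  ln P(Y=0|x) = -ln(1+e^z).
    log1pez = np.logaddexp(0.0, z)               # ln(1 + e^z), computed stably
    return np.sum(y * (z - log1pez) + (1 - y) * (-log1pez))
```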
Gradient Descent:
Batch gradient: use the error over the entire training set D
Do until satisfied:
1. Compute the gradient ∇_W Error_D(W)
2. Update the vector of parameters: W ← W − η ∇_W Error_D(W)
Stochastic gradient: use the error over single examples
Do until satisfied:
1. Choose (with replacement) a random training example <X^l, Y^l>
2. Compute the gradient just for <X^l, Y^l>: ∇_W Error_l(W)
3. Update the vector of parameters: W ← W − η ∇_W Error_l(W)
Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use the error over subsets (mini-batches) of D
Maximize Conditional Log Likelihood: Gradient Ascent
    ∂l(W)/∂wi = Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
Gradient ascent algorithm: iterate until change < ε
For all i, repeat
    wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
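A minimal batch gradient-ascent sketch of this update rule (the step size, iteration count, and per-example averaging are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_mcle(X, y, eta=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood (a sketch).

    X: (L, n) real-valued features, y: (L,) labels in {0, 1}.
    Returns W = [w0, w1, ..., wn].
    """
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend 1s so w0 is updated like the other weights
    W = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(Xb @ W)                      # P_hat(Y^l = 1 | X^l, W) for every example
        W += eta * Xb.T @ (y - p1) / len(y)       # w_i += eta * mean_l X_i^l (Y^l - P_hat); averaged for a stable step
    return W
```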
That’s all for M(C)LE. How about MAP?
• One common approach is to define priors on W
– Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate: W_MAP = arg max_W ln [ P(W) ∏_l P(Y^l | X^l, W) ]
• let's assume a Gaussian prior: W ~ N(0, σI)
MLE vs MAP
• Maximum conditional likelihood estimate:
    wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
• Maximum a posteriori estimate with prior W ~ N(0, σI):
    wi ← wi + η [ −λ wi + Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) ) ]
MAP estimates and Regularization
Maximum a posteriori estimate with prior W ~ N(0, σI):
    W_MAP = arg max_W [ ln ∏_l P(Y^l | X^l, W) − (λ/2) Σi wi² ]
The −(λ/2) Σi wi² penalty (with λ ∝ 1/σ²) is called a "regularization" term
• helps reduce overfitting
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in Logistic Regression
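The same gradient-ascent loop with the regularization term included; `lam` is an assumed penalty strength corresponding to the Gaussian prior, and for simplicity the intercept is penalized along with the other weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_map(X, y, eta=0.1, lam=0.01, n_iters=1000):
    """Gradient ascent on [log conditional likelihood - (lam/2) * sum_i w_i^2]  (a sketch).

    lam corresponds to the strength of the zero-mean Gaussian prior on W.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    W = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(Xb @ W)
        grad = Xb.T @ (y - p1) / len(y) - lam * W   # -lam*W is the gradient of the -(lam/2)*||W||^2 penalty
        W += eta * grad
    return W
```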
The Bottom Line
• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 … Xn>
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik, σi)
  • model P(Y) as Bernoulli(π)
• Then P(Y|X) has the logistic form P(Y=1|X) = 1 / (1 + exp(−(w0 + Σi wi Xi))), and we can directly estimate W
• Furthermore, the same holds if the Xi are boolean
  • try proving that to yourself
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes' rule to calculate P(Y|X = xi)
Discriminative classifiers (e.g., Logistic regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data
Gradient Descent
Gradient descent can optimize
differentiable functions.
• Suppose you have a differentiable function f(x)
• Gradient descent
  – Choose a starting point x(0)
  – Repeat until no change:
      x(t+1) = x(t) − η ∇f(x(t))
    where x(t+1) is the updated value for the optimum, x(t) the previous value, η the step size, and ∇f(x(t)) the gradient of f evaluated at the current x
Here is the trajectory of gradient
descent on a quadratic function.
How does step size affect the result?
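A small sketch of gradient descent on a quadratic (the matrix and step sizes below are illustrative, not the figure's exact function), showing how a too-large step size can make the iterates diverge:

```python
import numpy as np

# f(x) = 0.5 * x^T A x, an illustrative quadratic bowl (not the figure's exact function).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])

def grad_f(x):
    return A @ x                                  # gradient of f at x

def gradient_descent(x0, step_size, n_iters=50):
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step_size * grad_f(x)             # x(t+1) = x(t) - eta * grad f(x(t))
    return x

print(gradient_descent([2.0, 2.0], step_size=0.1))   # small step: converges toward the minimum at the origin
print(gradient_descent([2.0, 2.0], step_size=0.7))   # step too large for the steep direction: oscillates and blows up
```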