Generalized Linear Model
Outline
• Linear regression
– Regression: predicting a continuous value
• Logistic regression
– Classification: predicting a discrete value
• Gradient descent
– Very general optimization technique
Regression wants to predict a continuous-
valued output for an input.
• Data: observed pairs {(x1, y1), …, (xn, yn)}
• Goal: predict the output y for a new input x
Linear Regression
Linear regression assumes a linear
relationship between inputs and outputs.
• Data: observed pairs {(x1, y1), …, (xn, yn)}
• Goal: find the line f(x) = w0 + w1x that best fits the data
You collected data about commute times. Now you want to predict the commute time for a new person, who lives 1.1 miles from campus.
[Figure: scatter plot of commute time vs. distance from campus with a fitted line; reading the line at x = 1.1 miles gives a prediction of roughly 23 minutes.]
How can we find this line?
• Define
– xi: input, distance from campus
– yi: output, commute time
• We want to predict y for an
unknown x
• Assume
– In general, assume
y = f(x) + ε
– For 1-D linear regression, assume
f(x) = w0 + w1x
• We want to learn the
parameters w
We can learn w from the observed data by
maximizing the conditional likelihood.
• Recall: y = f(x) + ε, with ε ~ N(0, σ²), so P(y | x, w) is Gaussian with mean f(x) = w0 + w1x
• Introducing some new notation: write the conditional likelihood of the data as ∏i P(yi | xi, w)
• For this Gaussian noise model, maximizing the conditional likelihood is equivalent to
minimizing the least-squares error Σi (yi − f(xi))²
For the 1-D case…
• Two values define this line
– w0: intercept
– w1: slope
– f(x) = w0 + w1x
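As a concrete sketch (the data values here are made up for illustration, not taken from the commute figure), the 1-D least-squares fit can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical commute data (miles from campus, minutes) -- made up for illustration.
x = np.array([0.5, 1.0, 1.5, 2.3, 3.0, 4.2])
y = np.array([12.0, 20.0, 25.0, 35.0, 42.0, 60.0])

# Least-squares fit of f(x) = w0 + w1*x: choose w to minimize sum_i (y_i - f(x_i))^2.
X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of 1s for the intercept, then x
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution [w0, w1]
w0, w1 = w

print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
print(f"predicted commute time at 1.1 miles: {w0 + w1 * 1.1:.1f} minutes")
```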
Linear Regression
So far, we’ve been interested in learning P(Y|X) where Y has
discrete values (called ‘classification’)
What if Y is continuous? (called ‘regression’)
• predict weight from gender, height, age, …
• predict Google stock price today from Google, Yahoo,
MSFT prices yesterday
• predict each pixel intensity in robot’s current camera
image, from previous image and previous action
Regression
Wish to learn f: X → Y, where Y is real-valued, given {<x1,y1>, …, <xn,yn>}
Approach:
1. choose some parameterized form for P(Y|X; θ)
( θ is the vector of parameters)
2. derive learning algorithm as MCLE or MAP estimate for θ
1. Choose parameterized form for P(Y|X; θ)
Assume Y is some deterministic f(X), plus random noise:
    Y = f(X) + ε,  where ε ~ N(0, σ²)
Therefore Y is a random variable that follows the distribution
    P(Y | X = x) = N(f(x), σ²)
and the expected value of Y for any given x is f(x)
Consider Linear Regression
E.g., assume f(x) is a linear function of x:
    f(x) = w0 + w1x1 + … + wnxn
Notation: to make our parameters explicit, let's write
    P(Y | X = x; W) = N(w0 + Σi wi xi, σ²)
Training Linear Regression
How can we learn W from the training data?
Learn the Maximum Conditional Likelihood Estimate!
    W_MCLE = arg max_W ∏_l P(y_l | x_l; W)
where
    P(y | x; W) = N(f(x; W), σ²),  with f(x; W) = w0 + Σi wi xi
so:
    W_MCLE = arg max_W Σ_l ln P(y_l | x_l; W) = arg min_W Σ_l ( y_l − f(x_l; W) )²
i.e., maximizing the conditional likelihood is equivalent to minimizing the sum of squared prediction errors.
Can we derive a gradient descent rule for training?
How about a MAP estimate instead of MLE?
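To make the gradient-descent question concrete, here is a minimal sketch (function name, step size, and iteration count are illustrative choices) of batch gradient descent on the sum of squared errors:

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, n_iters=5000):
    """Batch gradient descent on the sum of squared errors (a sketch; lr and n_iters are illustrative).

    X: (n, d) feature matrix, y: (n,) targets.  Returns weights [w0, w1, ..., wd].
    """
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of 1s so w0 is learned like any weight
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        residual = y - Xb @ w                    # y_l - f(x_l; w) for every example
        grad = -2.0 * Xb.T @ residual / len(y)   # gradient of the mean squared error
        w -= lr * grad                           # step opposite the gradient
    return w
```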
Regression – What you should know
Under general assumption
1. MLE corresponds to minimizing sum of squared prediction errors
2. MAP estimate minimizes SSE plus sum of squared weights
3. Again, learning is an optimization problem once we choose our
objective function
• maximize data likelihood
• maximize posterior prob of W
4. Again, we can use gradient descent as a general learning algorithm
• as long as our objective fn is differentiable wrt W
• though we might find local optima instead of the global optimum
5. Almost nothing we said here required that f(x) be linear in x
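Point 2 above (MAP = sum of squared errors plus a sum-of-squared-weights penalty) has a convenient closed form; a minimal sketch, with `lam` standing in for the strength of the zero-mean Gaussian prior:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """MAP / ridge estimate: minimize sum_i (y_i - x_i . w)^2 + lam * ||w||^2  (a sketch).

    X: (n, d) features, y: (n,) targets; intercept handling is omitted for brevity.
    Setting the gradient to zero gives (X^T X + lam*I) w = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```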
Bias/Variance Decomposition of Error
Bias and Variance
• Given some estimator Y for some parameter θ, we define
  – the bias of estimator Y: E[Y] − θ
  – the variance of estimator Y: E[(Y − E[Y])²]
• E.g., define Y as the MLE estimator for the probability of heads, based on n independent coin flips
Bias–Variance decomposition of error
Reading: Bishop chapter 9.1, 9.2
• Consider a simple regression problem f: X → Y
    y = f(x) + ε,  where the noise ε ~ N(0, σ²) and f is deterministic
• We learn an estimate h(x) of f(x)
What are the sources of prediction error?
Sources of error
• What if we have a perfect learner and infinite data?
  – Our learned h(x) satisfies h(x) = f(x)
  – Still have remaining, unavoidable error σ² (the variance of the noise)
Sources of error
• What if we have only n training examples?
• What is our expected error?
  – Taken over random training sets of size n, drawn from distribution D = p(x, y)
• The expected error decomposes into:
    unavoidable error (σ²) + bias² + variance
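A quick way to see these terms is a Monte Carlo simulation: repeatedly draw training sets of size n, fit a model, and look at how the predictions at one test point spread around the truth. Everything below (the true function, noise level, and cubic polynomial fit) is an illustrative choice, not the slides' example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not the slides' example): true function, noise level, and model class.
def f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.3          # standard deviation of the unavoidable noise
x_test, n, trials = 0.25, 20, 500

preds = []
for _ in range(trials):
    x_train = rng.uniform(0, 1, n)                      # a fresh training set of size n
    y_train = f(x_train) + rng.normal(0, sigma, n)      # y = f(x) + noise
    coeffs = np.polyfit(x_train, y_train, deg=3)        # learned estimate h of f (cubic fit)
    preds.append(np.polyval(coeffs, x_test))            # h(x_test) for this training set
preds = np.array(preds)

bias_sq = (preds.mean() - f(x_test)) ** 2               # (E_D[h(x)] - f(x))^2
variance = preds.var()                                  # E_D[(h(x) - E_D[h(x)])^2]
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, unavoidable error = {sigma**2:.4f}")
```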
Logistic Regression
Logistic regression is a discriminative
approach to classification.
• Classification: predicts discrete-valued output
– E.g., is an email spam or not?
Logistic regression is a discriminative
approach to classification.
• Discriminative: directly estimates P(Y|X)
– Only concerned with discriminating (differentiating)
between classes Y
– In contrast, naïve Bayes is a generative classifier
• Estimates P(Y) & P(X|Y) and uses Bayes’ rule to calculate
P(Y|X)
• Explains how data are generated, given class label Y
• Both logistic regression and naïve Bayes use their
estimates of P(Y|X) to assign a class to an input
X—the difference is in how they arrive at these
estimates.
The assumptions of logistic regression
• Given: training data {(x1, y1), …, (xn, yn)}, with x ∈ R^d and y ∈ {0, 1}
• Want to learn: P(Y=1 | X=x)
The logistic function is appropriate for
making probability estimates.
    logistic(z) = 1 / (1 + e^(−z)), which maps any real z to a value in (0, 1)
Logistic regression models
probabilities with the logistic function.
    P(Y=1 | X=x) = 1 / (1 + exp(−(w0 + Σj wj xj)))
• Want to predict Y=1 for X when P(Y=1|X) ≥ 0.5
[Figure: the logistic curve P(Y=1|X) plotted against w0 + Σj wj xj; predict Y=1 where the curve is at or above 0.5 and Y=0 where it is below.]
Therefore, logistic regression is
a linear classifier.
• Use the logistic function to estimate the
probability of Y given X
• Decision boundary: w0 + Σj wj xj = 0  (predict Y=1 when w0 + Σj wj xj ≥ 0, which is exactly when P(Y=1|X) ≥ 0.5)
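A minimal sketch of this probability estimate and the resulting linear decision rule (the weights below are made up for illustration):

```python
import numpy as np

def predict_proba(w, x):
    """P(Y=1 | x) = 1 / (1 + exp(-(w0 + sum_j w_j x_j))); w[0] is the intercept w0."""
    z = w[0] + np.dot(w[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(w, x):
    """Predict Y=1 exactly when P(Y=1|x) >= 0.5, i.e. when w0 + sum_j w_j x_j >= 0."""
    return int(w[0] + np.dot(w[1:], x) >= 0)

w = np.array([-1.0, 2.0, 0.5])     # made-up weights [w0, w1, w2]
x = np.array([0.8, 0.3])
print(predict_proba(w, x), predict_label(w, x))
```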
Maximize the conditional likelihood to
find the weights w = [w0,w1,…,wd].
How can we optimize this function?
• The conditional log likelihood is concave in w [check its Hessian]
• No closed-form solution for w
Logistic Regression
Idea:
• Naïve Bayes allows computing P(Y|X) by
learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?
• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 … Xn>
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik, σi)
  • model P(Y) as Bernoulli(π)
• What does that imply about the form of P(Y|X)?
Derive the form of P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi.
Very convenient! These assumptions imply that P(Y|X) is the logistic function applied to a linear function of X:
    P(Y=1 | X=<X1,…,Xn>) = 1 / (1 + exp(−(w0 + Σi wi Xi)))
which implies
    P(Y=0|X) / P(Y=1|X) = exp(−(w0 + Σi wi Xi))
which implies
    ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wi Xi
which implies a linear classification rule!
Logistic function: logistic(z) = 1 / (1 + e^(−z))
Logistic regression more generally
• Logistic regression when Y is not boolean (but still discrete-valued).
• Now y ∈ {y1, …, yR}: learn R−1 sets of weights
    for k < R:  P(Y=yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))
    for k = R:  P(Y=yR | X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))
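A small sketch of these probabilities with R−1 weight vectors (array shapes and names are my own; class R serves as the reference class, matching the formulas above):

```python
import numpy as np

def multiclass_logistic_probs(W, b, x):
    """P(Y = y_k | x) using R-1 weight vectors; class R is the reference class (a sketch).

    W: (R-1, d) weights, b: (R-1,) intercepts w_k0, x: (d,) features.
    Returns an array of R probabilities [P(y_1|x), ..., P(y_{R-1}|x), P(y_R|x)].
    """
    scores = np.exp(b + W @ x)                   # exp(w_k0 + sum_i w_ki x_i) for k < R
    denom = 1.0 + scores.sum()                   # shared denominator from the formulas above
    return np.append(scores / denom, 1.0 / denom)
```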
Training Logistic Regression: MCLE
• We have L training examples: {<X^1, Y^1>, …, <X^L, Y^L>}
• Maximum likelihood estimate for parameters W:
    W_MLE = arg max_W P(<X^1, Y^1>, …, <X^L, Y^L> | W)
• Maximum conditional likelihood estimate:
    W_MCLE = arg max_W ∏_l P(Y^l | X^l, W)
• Choose parameters W = <w0, …, wn> to maximize the conditional likelihood of the training data, where
  – Training data D = {<X^1, Y^1>, …, <X^L, Y^L>}
  – Data likelihood = ∏_l P(X^l, Y^l | W)
  – Data conditional likelihood = ∏_l P(Y^l | X^l, W)
Expressing the Conditional Log Likelihood
    l(W) ≡ ln ∏_l P(Y^l | X^l, W) = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 − Y^l) ln P(Y^l = 0 | X^l, W) ]
Maximizing the Conditional Log Likelihood
Good news: l(W) is a concave function of W
Bad news: no closed-form solution to maximize l(W)
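For concreteness, l(W) can be written as a short function; this is a sketch under the convention P(Y=1|x) = 1/(1 + exp(−(w0 + Σi wi xi))) used above, with np.logaddexp for numerical stability:

```python
import numpy as np

def conditional_log_likelihood(W, X, y):
    """l(W) = sum_l [ Y^l ln P(Y^l=1|X^l,W) + (1-Y^l) ln P(Y^l=0|X^l,W) ]  (a sketch).

    X: (L, n) features, y: (L,) labels in {0, 1}, W: (n+1,) weights with W[0] = w0.
    """
    z = W[0] + X @ W[1:]
    # With P(Y=1|x) = 1/(1 + exp(-z)):  ln P(Y=1|x) = z - ln(1+e^z),  ln P(Y=0|x) = -ln(1+e^z).
    log1pez = np.logaddexp(0.0, z)               # ln(1 + e^z), computed stably
    return np.sum(y * (z - log1pez) + (1 - y) * (-log1pez))
```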
Gradient Descent:
Batch gradient: use the error over the entire training set D
Do until satisfied:
1. Compute the gradient ∇_W Error_D(W)
2. Update the vector of parameters: W ← W − η ∇_W Error_D(W)
Stochastic gradient: use the error over single examples
Do until satisfied:
1. Choose (with replacement) a random training example <X^l, Y^l>
2. Compute the gradient just for <X^l, Y^l>: ∇_W Error_l(W)
3. Update the vector of parameters: W ← W − η ∇_W Error_l(W)
Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use the error over subsets (mini-batches) of D
Maximize Conditional Log Likelihood: Gradient Ascent
    ∂l(W)/∂wi = Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
Gradient ascent algorithm: iterate until change < ε
For all i, repeat
    wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
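A minimal batch gradient-ascent sketch of this update rule (the step size, iteration count, and per-example averaging are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_mcle(X, y, eta=0.1, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood (a sketch).

    X: (L, n) real-valued features, y: (L,) labels in {0, 1}.
    Returns W = [w0, w1, ..., wn].
    """
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend 1s so w0 is updated like the other weights
    W = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(Xb @ W)                      # P_hat(Y^l = 1 | X^l, W) for every example
        W += eta * Xb.T @ (y - p1) / len(y)       # w_i += eta * mean_l X_i^l (Y^l - P_hat); averaged for a stable step
    return W
```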
That’s all for M(C)LE. How about MAP?
• One common approach is to define priors on W
– Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate: W_MAP = arg max_W ln [ P(W) ∏_l P(Y^l | X^l, W) ]
• let's assume a Gaussian prior: W ~ N(0, σI)
MLE vs MAP
• Maximum conditional likelihood estimate:
    wi ← wi + η Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) )
• Maximum a posteriori estimate with prior W ~ N(0, σI):
    wi ← wi + η [ −λ wi + Σ_l Xi^l ( Y^l − P̂(Y^l = 1 | X^l, W) ) ]
MAP estimates and Regularization
Maximum a posteriori estimate with prior W ~ N(0, σI):
    W_MAP = arg max_W [ ln ∏_l P(Y^l | X^l, W) − (λ/2) Σi wi² ]
The −(λ/2) Σi wi² penalty (with λ ∝ 1/σ²) is called a "regularization" term
• helps reduce overfitting
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in Logistic Regression
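The same gradient-ascent loop with the regularization term included; `lam` is an assumed penalty strength corresponding to the Gaussian prior, and for simplicity the intercept is penalized along with the other weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_map(X, y, eta=0.1, lam=0.01, n_iters=1000):
    """Gradient ascent on [log conditional likelihood - (lam/2) * sum_i w_i^2]  (a sketch).

    lam corresponds to the strength of the zero-mean Gaussian prior on W.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    W = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p1 = sigmoid(Xb @ W)
        grad = Xb.T @ (y - p1) / len(y) - lam * W   # -lam*W is the gradient of the -(lam/2)*||W||^2 penalty
        W += eta * grad
    return W
```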
The Bottom Line
• Consider learning f: X → Y, where
  • X is a vector of real-valued features, <X1 … Xn>
  • Y is boolean
  • assume all Xi are conditionally independent given Y
  • model P(Xi | Y = yk) as Gaussian N(μik, σi)
  • model P(Y) as Bernoulli(π)
• Then P(Y|X) has the logistic form P(Y=1|X) = 1 / (1 + exp(−(w0 + Σi wi Xi))), and we can directly estimate W
• Furthermore, the same holds if the Xi are boolean
  • try proving that to yourself
Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(Y)
• Estimate parameters of P(X|Y), P(Y) directly from training data
• Use Bayes' rule to calculate P(Y|X = xi)
Discriminative classifiers (e.g., Logistic regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data
Gradient Descent
Gradient descent can optimize
differentiable functions.
• Suppose you have a differentiable function f(x)
• Gradient descent
  – Choose a starting point x(0)
  – Repeat until no change:
      x(t+1) = x(t) − η ∇f(x(t))
    where x(t+1) is the updated value for the optimum, x(t) the previous value, η the step size, and ∇f(x(t)) the gradient of f evaluated at the current x
Here is the trajectory of gradient
descent on a quadratic function.
How does step size affect the result?
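A small sketch of gradient descent on a quadratic (the matrix and step sizes below are illustrative, not the figure's exact function), showing how a too-large step size can make the iterates diverge:

```python
import numpy as np

# f(x) = 0.5 * x^T A x, an illustrative quadratic bowl (not the figure's exact function).
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])

def grad_f(x):
    return A @ x                                  # gradient of f at x

def gradient_descent(x0, step_size, n_iters=50):
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step_size * grad_f(x)             # x(t+1) = x(t) - eta * grad f(x(t))
    return x

print(gradient_descent([2.0, 2.0], step_size=0.1))   # small step: converges toward the minimum at the origin
print(gradient_descent([2.0, 2.0], step_size=0.7))   # step too large for the steep direction: oscillates and blows up
```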