02 - Logistic Regression
François Pitié
Assistant Professor in Media Signal Processing
Department of Electronic & Electrical Engineering, Trinity College Dublin
[4C16/5C16] Deep Learning and its Applications — 2022/2023
1
Motivation
With Linear Regression, we looked at linear models, where the output of the
problem was a continuous variable (e.g. height, car price, temperature, …).
Very often you need to design a classifier that can answer questions
such as: what car type is it? is the person smiling? is a solar flare
going to happen? In such problems the output is a categorical variable.
Logistic Regression (David Cox, 1958) considers the case of a binary
variable, where the outcome is 0/1 or true/false.
2
There is a whole zoo of classifiers out there. Why are we covering
logistic regression in particular?
Because logistic regression is the building block of Neural Nets.
3
Introductory Example
We’ll start with an example from Wikipedia:
A group of 20 students spend between 0 and 6 hours studying for
an exam. How does the number of hours spent studying affect the
probability that the student will pass the exam?
4
Introductory Example
The collected data looks like so:
Studying Hours : 0.75 1.00 2.75 3.50 ...
result (1=pass,0=fail) : 0 0 1 0 ...
Figure: exam outcome 𝑦 (1 = pass, 0 = fail) plotted against hours of study 𝑥.
5
Regression?
Although the output 𝑦 is binary, we could still attempt to fit a linear
model via least squares:
ℎw (x) = x⊤ w = 𝑤1 𝑥1 + ⋯ + 𝑤𝑝 𝑥𝑝
where ℎw (x) is the prediction given model parameters w and input
features x.
6
Regression?
This is what the least squares estimate ℎw (x) looks like:
ℎ𝑤 (𝑥) ≈ 0.18 × 𝑥 + 0.08
Figure: least squares fit ℎ𝑤(𝑥) ≈ 0.18𝑥 + 0.08 overlaid on the exam outcome data.
7
Regression?
The model prediction ℎw (x) = x⊤ w is continuous, but we could apply
a threshold to obtain the binary classifier as follows:
𝑦 = [x⊤w > 0.5] =  0  if x⊤w ≤ 0.5,
                   1  if x⊤w > 0.5
and the output would be 0 or 1.
Numerically on our example we would have:
𝑦 =  0  if 0.18 × 𝑥 + 0.08 ≤ 0.5,
     1  if 0.18 × 𝑥 + 0.08 > 0.5
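As a quick illustration, the thresholded rule can be written in a few lines of Python (a minimal sketch, reusing the rounded least-squares coefficients 0.18 and 0.08 quoted above):

```python
import numpy as np

# Rounded least-squares coefficients quoted above.
w1, w0 = 0.18, 0.08

def ls_threshold_classifier(x):
    """Predict pass (1) / fail (0) by thresholding the linear prediction at 0.5."""
    h = w1 * x + w0                  # continuous least-squares prediction h_w(x)
    return (h > 0.5).astype(int)     # apply the 0.5 threshold

hours = np.array([0.75, 2.75, 3.50, 100.0])
print(ls_threshold_classifier(hours))   # -> [0 1 1 1]
```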
Obviously, we have some issues with that approach...
8
Regression?
Example: a student studied 100 hours and is successful:
ℎ𝑤(𝑥) = 100 × 0.18 + 0.08 = 18.08 > 0.5
But in terms of Least Squares, the error 𝜀² = (1 − ℎ𝑤(𝑥))² ≈ 17.1² ≈ 292 is large,
when everything is in fact perfectly fine.
9
Regression?
The issue is that our Least Squares loss is defined as:

𝐸(w) = ∑ᵢ (w⊤x𝑖 − 𝑦𝑖)²

But we should include the threshold and have something like:

𝐸(w) = ∑ᵢ ([w⊤x𝑖 > 0.5] − 𝑦𝑖)²
So Least Squares doesn’t really work...
Let’s see what can be done.
10
General Linear Model
The problem addressed by general linear models can be presented as
follows. We are trying to find a linear combination of the data x⊤w,
such that the sign of x⊤w tells us about the outcome 𝑦:
𝑦 = [x⊤ w + 𝜖 > 0]
11
General Linear Model
The quantity x⊤w is sometimes called the risk score. It is a scalar
value that grades the certainty of belonging to one class or the other:
x⊤w ≫ 0  ⇒  𝑦 = 1
x⊤w ≪ 0  ⇒  𝑦 = 0
x⊤w ≈ 0  ⇒  undecided
The risk score performs a dimensionality reduction: it combines multiple
input features into a single number.
12
General Linear Model
The error term in this model is represented by the random variable 𝜖. Multiple
choices are possible for the distribution of 𝜖.
13
In logistic regression, the error 𝜖 is assumed to follow a logistic
distribution and the risk score x⊤w is also called the logit.
Figure: p.d.f. of the logistic distribution
14
In probit regression, the error 𝜖 is assumed to follow a normal
distribution, and the risk score x⊤w is also called the probit.
Figure: p.d.f. of the normal distribution
15
For our purposes, there is not much difference between logistic and
probit regression. The main difference is that logistic regression is
numerically easier to solve.
From now on, we'll only look at the logistic model, but note that similar
derivations could be made for any other model.
16
Logistic Regression Model
Consider 𝑝(𝑦 = 1|x, w), the likelihood that the output is a success:
𝑝(𝑦 = 1|x, w) = 𝑝(x⊤ w + 𝜖 > 0)
= 𝑝(𝜖 > −x⊤ w)
since 𝜖 is symmetrically distributed around 0, it follows that
𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤ w)
Because we have made some assumptions about the distribution of
𝜖, we are able to derive a closed-form expression for the likelihood.
17
The Logistic Function
The function 𝑓 ∶ 𝑡 ↦ 𝑓(𝑡) = 𝑝(𝜖 < 𝑡) is the c.d.f. of the logistic
distribution and is also called the logistic function or sigmoid:
𝑓(𝑡) = 1/(1 + exp(−𝑡))

Figure: plot of the logistic function (sigmoid) 𝑓(𝑡).
18
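A direct NumPy transcription of the logistic function (a minimal sketch):

```python
import numpy as np

def sigmoid(t):
    """Logistic function f(t) = 1 / (1 + exp(-t)), the c.d.f. of the logistic distribution."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))    # 0.5   : undecided
print(sigmoid(6.0))    # ~0.998: strongly suggests y = 1
print(sigmoid(-6.0))   # ~0.002: strongly suggests y = 0
```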
Logistic Regression Model
Thus we have a simple model for the likelihood of success:
𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤w) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w))
The likelihood of failure is simply given by:
𝑝(𝑦 = 0|x, w) = 1 − 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(+x⊤w))
Exercise:
show that 𝑝(𝑦 = 0|x, w) = ℎw (−x)
19
Logistic Regression Model
Below is the plot of 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(−(𝑤0 + 𝑤1 𝑥))) for our
problem (using optimal values of 𝑤0 and 𝑤1 ):
Figure: fitted logistic curve 𝑝(𝑦 = 1|x, w) overlaid on the exam data (x: Hours studying, y: Exam Outcome).
The results are easy to interpret: there is about a 60% chance of passing
the exam after studying for 3 hours.
20
Logistic Regression vs. Least Squares
In linear regression, the model prediction ℎw(x) was a direct prediction
of the outcome:
ℎw(x) = ŷ
In logistic regression, the model prediction ℎw (x) is an estimate of
the likelihood of the outcome:
ℎw (x) = 𝑝(𝑦 = 1|x, w)
Thus whereas in linear regression we try to answer the question:
What is the expected value of 𝑦 given x?
In logistic regression (and any other general linear model), we try
instead to answer the question:
What is the probability that 𝑦 = 1 given x?
21
Maximum Likelihood
To estimate the weights w, we will again use the concept of Maximum
Likelihood.
22
Maximum Likelihood
As we’ve just seen, for a particular observation x𝑖 and model w, the
likelihood is given by:
𝑝(𝑦 = 𝑦𝑖|x𝑖, w) =  𝑝(𝑦 = 1|x𝑖, w) = ℎw(x𝑖)       if 𝑦𝑖 = 1,
                   𝑝(𝑦 = 0|x𝑖, w) = 1 − ℎw(x𝑖)   if 𝑦𝑖 = 0
As 𝑦𝑖 ∈ {0, 1}, this can be written in a slightly more compact form:
𝑝(𝑦 = 𝑦𝑖|x𝑖, w) = ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)
This works because 𝑧⁰ = 1 and 𝑧¹ = 𝑧 for any 𝑧.
Assuming independent observations, the likelihood over all observations is:

𝑝(y|X, w) = ∏ᵢ₌₁ⁿ ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)
23
Maximum Likelihood
We want to find w that maximises the likelihood 𝑝(y|X, w). As always, it
is equivalent but more convenient to minimise the negative log likelihood:

𝐸(w) = −ln(𝑝(y|X, w))
     = ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )

The loss function we need to minimise is called the cross-entropy.
Note that we can also consider the average cross-entropy:

𝐸(w) = (1/𝑛) ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )
24
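The cross-entropy translates almost line by line into NumPy. This is a sketch under the assumption that X stores one observation x𝑖 per row (including a constant column if an intercept is used) and y holds the binary labels:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    """E(w) = sum_i [ -y_i ln h_w(x_i) - (1 - y_i) ln(1 - h_w(x_i)) ]."""
    h = sigmoid(X @ w)            # h_w(x_i) for every observation
    eps = 1e-12                   # guards against log(0) for saturated predictions
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Example: two observations with an intercept column and a single feature.
X = np.array([[1.0, 2.0], [1.0, 5.0]])
y = np.array([0.0, 1.0])
print(cross_entropy(np.zeros(2), X, y))   # 2 * ln(2) ≈ 1.386 at w = 0
```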
We could have considered optimising the parameters w using other
loss functions. For instance, we could have tried to minimise the least
squares error as we did in linear regression:

𝐸𝐿𝑆(w) = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖)²

The solution would not maximise the likelihood, as the cross-entropy
loss does, but maybe that would still be a reasonable thing to do?
The problem is that, because of the sigmoid, ℎw is not a convex function
of w, which makes 𝐸𝐿𝑆(w) non-convex and its minimisation much harder
than when using the cross-entropy.
This is in fact a mistake that the Neural Net community made for a
number of years before switching to the cross-entropy loss function.
25
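To get a feel for this, one can compare the two per-sample losses for a positive example (𝑦𝑖 = 1) as a function of the risk score 𝑡 = x⊤w. The sketch below is only illustrative (the range of 𝑡 is an arbitrary choice); it shows that the squared error saturates for very negative scores, whereas the cross-entropy keeps a useful gradient:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10.0, 10.0, 9)
ls_loss = (sigmoid(t) - 1.0) ** 2     # squared error for a sample with y = 1
ce_loss = -np.log(sigmoid(t))         # cross-entropy for a sample with y = 1

# For t << 0 the squared error flattens out near 1 (vanishing gradient, non-convex),
# while the cross-entropy grows roughly like -t (and is convex in t).
for ti, l_ls, l_ce in zip(t, ls_loss, ce_loss):
    print(f"t = {ti:+5.1f}   LS = {l_ls:.4f}   CE = {l_ce:.4f}")
```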
Optimisation: gradient descent
To minimise the error function, we need to resort to gradient descent,
which is a general method for nonlinear optimisation and which will
be at the core of neural network optimisation.
We start at w(0) and take steps along the steepest direction v using a
fixed size step as follows:
w(𝑛+1) = w(𝑛) + 𝜂v(𝑛)
𝜂 is called the learning rate and controls the speed of the descent.
What is the steepest slope v?
26
Optimisation: gradient descent
Without loss of generality we set v to be a unit vector (i.e. ‖v‖ = 1).
Then, moving w to w + 𝜂v yields a new error as follows:

𝐸(w + 𝜂v) = 𝐸(w) + 𝜂 (𝜕𝐸/𝜕w)⊤ v + 𝑂(𝜂²)

which reaches a minimum when

v = − (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖
27
Optimisation: gradient descent
Now, it is hard to find a good value for the learning rate 𝜂 and we
usually adopt an adaptive step instead. Thus, instead of using

w(𝑛+1) = w(𝑛) − 𝜂 (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖

we usually use the following update step:

w(𝑛+1) = w(𝑛) − 𝜂 𝜕𝐸/𝜕w
28
Optimisation: gradient descent
Recall that the cross-entropy loss function is:

𝐸(w) = ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )

and that ℎw(x) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w))
Exercise:
Given that the derivative of the sigmoid 𝑓 is 𝑓′ (𝑡) = (1 − 𝑓(𝑡))𝑓(𝑡),
show that

𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖) x𝑖
29
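One way to sanity-check the gradient expression in the exercise is to compare it against finite differences on a small synthetic problem (a sketch; the data are random and purely illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w, X, y):
    """dE/dw = sum_i (h_w(x_i) - y_i) x_i, written as a matrix product."""
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# Central finite-difference approximation of the gradient.
eps = 1e-6
num_grad = np.array([
    (cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, analytic_gradient(w, X, y), atol=1e-5))   # True
```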
Optimisation: gradient descent
The overall gradient descent method looks like so:
1. Set an initial weight vector w(0).
2. For 𝑡 = 0, 1, 2, … repeat until convergence:
3. Compute the gradient:
   𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ ( 1/(1 + exp(−x𝑖⊤w)) − 𝑦𝑖 ) x𝑖
4. Update the weights:
   w(𝑡+1) = w(𝑡) − 𝜂 𝜕𝐸/𝜕w
30
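A minimal NumPy implementation of this loop might look as follows. The learning rate, iteration count and the synthetic data are arbitrary illustrative choices, not values from the slides; the data are generated exactly according to the model 𝑦 = [x⊤w + 𝜖 > 0] with logistic noise:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, eta=0.01, n_iters=5000):
    """Gradient descent on the cross-entropy loss.

    X : (n, p) design matrix (include a column of ones for an intercept)
    y : (n,)   binary labels in {0, 1}
    """
    w = np.zeros(X.shape[1])                  # 1. initial weight vector w(0)
    for _ in range(n_iters):                  # 2. iterate until convergence
        grad = X.T @ (sigmoid(X @ w) - y)     # 3. gradient of the cross-entropy
        w = w - eta * grad                    # 4. update w(t+1) = w(t) - eta * dE/dw
    return w

# Toy usage: generate data from y = [x^T w_true + eps > 0] with logistic noise eps.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ w_true + rng.logistic(size=200) > 0).astype(float)
print(fit_logistic_regression(X, y))   # close to w_true, up to sampling noise
```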
Example
Below is an example with 2 features.
Figure: scatter plot of the two classes in the (𝑥1, 𝑥2) feature plane.
31
Example
The estimate for the probability of success is
ℎw(x) = 1/(1 + exp(−(−1.28 − 1.09 𝑥1 + 1.89 𝑥2)))
Below are drawn the lines that correspond to ℎw (x) = 0.05, ℎw (x) =
0.5 and ℎw (x) = 0.95.
Figure: the two classes together with the lines corresponding to ℎw(x) = 5%, 50% and 95%.
32
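With the fitted weights quoted above, the predicted probability of class 1 at any point (𝑥1, 𝑥2), and the 50% decision boundary, can be computed directly (a small sketch using the rounded values from the slide):

```python
import numpy as np

w0, w1, w2 = -1.28, -1.09, 1.89          # rounded weights quoted above

def h(x1, x2):
    """Estimated probability of belonging to class 1."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

print(h(0.0, 0.0))     # ~0.22, below 0.5: the origin is assigned to class 0

# The 50% boundary is where the logit is zero: w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2 = (1.28 + 1.09*x1) / 1.89.
x1 = np.linspace(-4.0, 4.0, 5)
print(-(w0 + w1 * x1) / w2)
```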
Multiclass Classification
33
In many applications you have to deal with more than 2 classes.
In these cases, we need to use multinomial logistic regression, which
is an extension of logistic regression to more than 2 classes.
34
Multinomial Logistic Regression
In Multinomial Logistic Regression, each class 𝐶𝑘 has its own weight
vector w𝑘 and the classifier is based on the following likelihood model:

𝑝(𝑦 = 𝐶𝑘|x, w) = softmax(x⊤w)𝑘 = exp(w𝑘⊤x) / ∑ⱼ₌₁ᴷ exp(w𝑗⊤x)

𝐶𝑘 is the class 𝑘 and softmax ∶ ℝᴷ → ℝᴷ is the function defined as

softmax(t)𝑘 = exp(𝑡𝑘) / ∑ⱼ₌₁ᴷ exp(𝑡𝑗)
In other words, softmax takes as an input the vector of logits for all
classes and returns the vector of corresponding likelihoods.
35
Multinomial Logistic Regression
For instance, say we have 3 classes A, B, C, with logits

x⊤w𝐴 = −1.2,   x⊤w𝐵 = +3.1,   x⊤w𝐶 = −0.9

Applying softmax gives the likelihoods

𝑝(𝐴|x) = 0.0131,   𝑝(𝐵|x) = 0.9691,   𝑝(𝐶|x) = 0.0177

where 𝑝(𝐴|x) = exp(−1.2)/(exp(−1.2) + exp(3.1) + exp(−0.9)) = 0.0131
36
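A small NumPy softmax reproduces these numbers (a sketch; subtracting the maximum logit is a standard trick to avoid overflow and does not change the result):

```python
import numpy as np

def softmax(t):
    """softmax(t)_k = exp(t_k) / sum_j exp(t_j)."""
    e = np.exp(t - np.max(t))     # subtract max(t) for numerical stability
    return e / np.sum(e)

logits = np.array([-1.2, 3.1, -0.9])     # x^T w_A, x^T w_B, x^T w_C
print(softmax(logits).round(4))          # [0.0131 0.9691 0.0177]
```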
Multinomial Cross Entropy
To optimise for the parameters, we can again take the maximum likelihood
approach.
Combining the likelihood for all possible classes gives us:

𝑝(𝑦|x) = 𝑝(𝑦 = 𝐶1|x)^[𝑦=𝐶1] × ⋯ × 𝑝(𝑦 = 𝐶𝐾|x)^[𝑦=𝐶𝐾]

where [𝑦 = 𝐶1] is 1 if 𝑦 = 𝐶1 and 0 otherwise.
The total likelihood over all observations is:

𝑝(y|X) = ∏ᵢ₌₁ⁿ 𝑝(𝑦𝑖 = 𝐶1|x𝑖)^[𝑦𝑖=𝐶1] × ⋯ × 𝑝(𝑦𝑖 = 𝐶𝐾|x𝑖)^[𝑦𝑖=𝐶𝐾]
37
Multinomial Cross Entropy
Taking the negative log likelihood yields the cross-entropy error function
for the multiclass problem:

𝐸(w1, ⋯, w𝐾) = −ln(𝑝(y|X)) = − ∑ᵢ₌₁ⁿ ∑ₖ₌₁ᴷ [𝑦𝑖 = 𝐶𝑘] ln(𝑝(𝑦𝑖 = 𝐶𝑘|x𝑖))

Similarly to logistic regression, we can use a gradient descent approach
to find the 𝐾 weight vectors w1, ⋯, w𝐾 that minimise this cross-entropy
expression.
38
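For completeness, here is a sketch of the multiclass cross-entropy in NumPy. The names are mine, not from the slides: W stacks the 𝐾 weight vectors as columns, X stores one observation per row, and Y holds the one-hot indicators [𝑦𝑖 = 𝐶𝑘]:

```python
import numpy as np

def softmax_rows(T):
    """Row-wise softmax of an (n, K) matrix of logits."""
    e = np.exp(T - T.max(axis=1, keepdims=True))   # stabilised exponentials
    return e / e.sum(axis=1, keepdims=True)

def multinomial_cross_entropy(W, X, Y):
    """E(w_1,...,w_K) = - sum_i sum_k [y_i = C_k] ln p(y_i = C_k | x_i)."""
    P = softmax_rows(X @ W)                  # (n, K) class probabilities
    return -np.sum(Y * np.log(P + 1e-12))    # small constant guards against log(0)

# Tiny usage example: 2 observations, 2 features, 3 classes.
X = np.array([[1.0, 0.5], [0.2, -1.0]])
W = np.zeros((2, 3))
Y = np.array([[1, 0, 0], [0, 0, 1]])
print(multinomial_cross_entropy(W, X, Y))    # 2 * ln(3) ≈ 2.197 at W = 0
```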
Take Away
With Logistic Regression, we look at linear models, where the output
of the problem is a binary categorical response.
Instead of directly predicting the actual outcome as in least squares,
the model proposed in logistic regression makes a prediction about
the likelihood of belonging to a particular class.
Finding the maximum likelihood parameters is equivalent to minimising
the cross-entropy loss function. The minimisation can be done using the
gradient descent technique.
The extension of Logistic Regression to more than 2 classes is called
Multinomial Logistic Regression.
39