02 - Logistic Regression
François Pitié
Assistant Professor in Media Signal Processing
Department of Electronic & Electrical Engineering, Trinity College Dublin
[4C16/5C16] Deep Learning and its Applications — 2022/2023
1
Motivation
With Linear Regression, we looked at linear models, where the output of the
problem was a continuous variable (e.g. height, car price, temperature, …).
Very often you need to design a classifier that can answer questions
such as: what car type is it? is the person smiling? is a solar flare
going to happen? In such problems the output is a categorical variable.
Logistic Regression (David Cox, 1958) considers the case of a binary
variable, where the outcome is 0/1 or true/false.
2
There is a whole zoo of classifiers out there. Why are we covering
logistic regression in particular?
Because logistic regression is the building block of Neural Nets.
3
Introductory Example
We’ll start with an example from Wikipedia:
A group of 20 students spend between 0 and 6 hours studying for
an exam. How does the number of hours spent studying affect the
probability that the student will pass the exam?
4
Introductory Example
The collected data looks like so:
Studying Hours : 0.75 1.00 2.75 3.50 ...
result (1=pass,0=fail) : 0 0 1 0 ...
Figure: exam outcome 𝑦 (1 = pass, 0 = fail) plotted against hours of study 𝑥.
5
Regression?
Although the output 𝑦 is binary, we could still attempt to fit a linear
model via least squares:
ℎw (x) = x⊤ w = 𝑤1 𝑥1 + ⋯ + 𝑤𝑝 𝑥𝑝
where ℎw (x) is the prediction given model parameters w and input
features x.
6
Regression?
This is what the least squares estimate ℎw (x) looks like:
ℎ𝑤 (𝑥) ≈ 0.18 × 𝑥 + 0.08
Figure: least squares fit ℎ𝑤(𝑥) ≈ 0.18𝑥 + 0.08 overlaid on the exam outcome data.
7
Regression?
The model prediction ℎw (x) = x⊤ w is continuous, but we could apply
a threshold to obtain the binary classifier as follows:
𝑦 = [x⊤w > 0.5] =  0  if x⊤w ≤ 0.5,
                   1  if x⊤w > 0.5
and the output would be 0 or 1.
Numerically on our example we would have:
𝑦 =  0  if 0.18 × 𝑥 + 0.08 ≤ 0.5,
     1  if 0.18 × 𝑥 + 0.08 > 0.5
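As a quick illustration, the thresholded rule can be written in a few lines of Python (a minimal sketch, reusing the rounded least-squares coefficients 0.18 and 0.08 quoted above):

```python
import numpy as np

# Rounded least-squares coefficients quoted above.
w1, w0 = 0.18, 0.08

def ls_threshold_classifier(x):
    """Predict pass (1) / fail (0) by thresholding the linear prediction at 0.5."""
    h = w1 * x + w0                  # continuous least-squares prediction h_w(x)
    return (h > 0.5).astype(int)     # apply the 0.5 threshold

hours = np.array([0.75, 2.75, 3.50, 100.0])
print(ls_threshold_classifier(hours))   # -> [0 1 1 1]
```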
Obviously, we have some issues with that approach...
8
Regression?
Example: a student studied 100 hours and is successful:
ℎ𝑤(𝑥) = 100 × 0.18 + 0.08 = 18.08 > 0.5
But in terms of Least Squares, the error 𝜀² = (1 − ℎ𝑤(𝑥))² ≈ 17.1² ≈ 292 is large,
when everything is in fact perfectly fine.
9
Regression?
The issue is that our Least Squares loss is defined as:

𝐸(w) = ∑ᵢ (w⊤x𝑖 − 𝑦𝑖)²

But we should include the threshold and have something like:

𝐸(w) = ∑ᵢ ([w⊤x𝑖 > 0.5] − 𝑦𝑖)²
So Least Squares doesn’t really work...
Let’s see what can be done.
10
General Linear Model
The problem addressed by general linear models can be presented as
follows. We are trying to find a linear combination of the data x⊤w,
such that the sign of x⊤w tells us about the outcome 𝑦:
𝑦 = [x⊤ w + 𝜖 > 0]
11
General Linear Model
The quantity x⊤w is sometimes called the risk score. It is a scalar
value that grades the certainty of belonging to one class or the other:
x⊤w ≫ 0  ⇒  𝑦 = 1
x⊤w ≪ 0  ⇒  𝑦 = 0
x⊤w ≈ 0  ⇒  undecided
The risk score performs a dimensionality reduction: it combines multiple
input features into a single number.
12
General Linear Model
The error term in this model is represented by the random variable 𝜖. Multiple
choices are possible for the distribution of 𝜖.
13
In logistic regression, the error 𝜖 is assumed to follow a logistic
distribution and the risk score x⊤w is also called the logit.
Figure: p.d.f. of the logistic distribution
14
In probit regression, the error 𝜖 is assumed to follow a normal
distribution, and the risk score x⊤w is also called the probit.
Figure: p.d.f. of the normal distribution
15
For our purposes, there is not much difference between logistic and
probit regression. The main difference is that logistic regression is
numerically easier to solve.
From now on, we'll only look at the logistic model, but note that similar
derivations could be made for any other model.
16
Logistic Regression Model
Consider 𝑝(𝑦 = 1|x, w), the likelihood that the output is a success:
𝑝(𝑦 = 1|x, w) = 𝑝(x⊤ w + 𝜖 > 0)
= 𝑝(𝜖 > −x⊤ w)
since 𝜖 is symmetrically distributed around 0, it follows that
𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤ w)
Because we have made some assumptions about the distribution of
𝜖, we are able to derive a closed-form expression for the likelihood.
17
The Logistic Function
The function 𝑓 ∶ 𝑡 ↦ 𝑓(𝑡) = 𝑝(𝜖 < 𝑡) is the c.d.f. of the logistic
distribution and is also called the logistic function or sigmoid:
𝑓(𝑡) = 1/(1 + exp(−𝑡))

Figure: plot of the logistic function (sigmoid) 𝑓(𝑡).
18
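A direct NumPy transcription of the logistic function (a minimal sketch):

```python
import numpy as np

def sigmoid(t):
    """Logistic function f(t) = 1 / (1 + exp(-t)), the c.d.f. of the logistic distribution."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))    # 0.5   : undecided
print(sigmoid(6.0))    # ~0.998: strongly suggests y = 1
print(sigmoid(-6.0))   # ~0.002: strongly suggests y = 0
```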
Logistic Regression Model
Thus we have a simple model for the likelihood of success:
𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤w) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w))
The likelihood of failure is simply given by:
𝑝(𝑦 = 0|x, w) = 1 − 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(+x⊤w))
Exercise:
show that 𝑝(𝑦 = 0|x, w) = ℎw (−x)
19
Logistic Regression Model
Below is the plot of 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(−(𝑤0 + 𝑤1 𝑥))) for our
problem (using optimal values of 𝑤0 and 𝑤1 ):
Figure: fitted logistic curve 𝑝(𝑦 = 1|x, w) overlaid on the exam data (x: Hours studying, y: Exam Outcome).
The results are easy to interpret: there is about a 60% chance of passing
the exam after studying for 3 hours.
20
Logistic Regression vs. Least Squares
In linear regression, the model prediction ℎw(x) was a direct prediction
of the outcome:
ℎw(x) = ŷ
In logistic regression, the model prediction ℎw (x) is an estimate of
the likelihood of the outcome:
ℎw (x) = 𝑝(𝑦 = 1|x, w)
Thus whereas in linear regression we try to answer the question:
What is the expected value of 𝑦 given x?
In logistic regression (and any other general linear model), we try
instead to answer the question:
What is the probability that 𝑦 = 1 given x?
21
Maximum Likelihood
To estimate the weights w, we will again use the concept of Maximum
Likelihood.
22
Maximum Likelihood
As we’ve just seen, for a particular observation x𝑖 and model w, the
likelihood is given by:
𝑝(𝑦 = 𝑦𝑖|x𝑖, w) =  𝑝(𝑦 = 1|x𝑖, w) = ℎw(x𝑖)       if 𝑦𝑖 = 1,
                   𝑝(𝑦 = 0|x𝑖, w) = 1 − ℎw(x𝑖)   if 𝑦𝑖 = 0
As 𝑦𝑖 ∈ {0, 1}, this can be written in a slightly more compact form:
𝑝(𝑦 = 𝑦𝑖|x𝑖, w) = ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)
This works because 𝑧⁰ = 1 and 𝑧¹ = 𝑧 for any 𝑧.
Assuming independent observations, the likelihood over all observations is:

𝑝(y|X, w) = ∏ᵢ₌₁ⁿ ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)
23
Maximum Likelihood
We want to find w that maximises the likelihood 𝑝(y|X, w). As always, it
is equivalent but more convenient to minimise the negative log likelihood:

𝐸(w) = −ln(𝑝(y|X, w))
     = ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )

The loss function we need to minimise is called the cross-entropy.
Note that we can also consider the average cross-entropy:

𝐸(w) = (1/𝑛) ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )
24
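The cross-entropy translates almost line by line into NumPy. This is a sketch under the assumption that X stores one observation x𝑖 per row (including a constant column if an intercept is used) and y holds the binary labels:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    """E(w) = sum_i [ -y_i ln h_w(x_i) - (1 - y_i) ln(1 - h_w(x_i)) ]."""
    h = sigmoid(X @ w)            # h_w(x_i) for every observation
    eps = 1e-12                   # guards against log(0) for saturated predictions
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Example: two observations with an intercept column and a single feature.
X = np.array([[1.0, 2.0], [1.0, 5.0]])
y = np.array([0.0, 1.0])
print(cross_entropy(np.zeros(2), X, y))   # 2 * ln(2) ≈ 1.386 at w = 0
```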
We could have considered optimising the parameters w using other
loss functions. For instance, we could have tried to minimise the least
squares error as we did in linear regression:

𝐸𝐿𝑆(w) = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖)²

The solution would not maximise the likelihood, as the cross-entropy
loss does, but maybe that would still be a reasonable thing to do?
The problem is that, because of the sigmoid, ℎw is not a convex function
of w, which makes 𝐸𝐿𝑆(w) non-convex and its minimisation much harder
than when using the cross-entropy.
This is in fact a mistake that the Neural Net community made for a
number of years before switching to the cross-entropy loss function.
25
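To get a feel for this, one can compare the two per-sample losses for a positive example (𝑦𝑖 = 1) as a function of the risk score 𝑡 = x⊤w. The sketch below is only illustrative (the range of 𝑡 is an arbitrary choice); it shows that the squared error saturates for very negative scores, whereas the cross-entropy keeps a useful gradient:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10.0, 10.0, 9)
ls_loss = (sigmoid(t) - 1.0) ** 2     # squared error for a sample with y = 1
ce_loss = -np.log(sigmoid(t))         # cross-entropy for a sample with y = 1

# For t << 0 the squared error flattens out near 1 (vanishing gradient, non-convex),
# while the cross-entropy grows roughly like -t (and is convex in t).
for ti, l_ls, l_ce in zip(t, ls_loss, ce_loss):
    print(f"t = {ti:+5.1f}   LS = {l_ls:.4f}   CE = {l_ce:.4f}")
```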
Optimisation: gradient descent
To minimise the error function, we need to resort to gradient descent,
which is a general method for nonlinear optimisation and which will
be at the core of neural network optimisation.
We start at w(0) and take steps along the steepest direction v using a
fixed size step as follows:
w(𝑛+1) = w(𝑛) + 𝜂v(𝑛)
𝜂 is called the learning rate and controls the speed of the descent.
What is the steepest slope v?
26
Optimisation: gradient descent
Without loss of generality we set v to be a unit vector (i.e. ‖v‖ = 1).
Then, moving w to w + 𝜂v yields a new error as follows:

𝐸(w + 𝜂v) = 𝐸(w) + 𝜂 (𝜕𝐸/𝜕w)⊤ v + 𝑂(𝜂²)

which reaches a minimum when

v = − (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖
27
Optimisation: gradient descent
Now, it is hard to find a good value for the learning rate 𝜂 and we
usually adopt an adaptive step instead. Thus, instead of using

w(𝑛+1) = w(𝑛) − 𝜂 (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖

we usually use the following update step:

w(𝑛+1) = w(𝑛) − 𝜂 𝜕𝐸/𝜕w
28
Optimisation: gradient descent
Recall that the cross-entropy loss function is:

𝐸(w) = ∑ᵢ₌₁ⁿ ( −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖)) )

and that ℎw(x) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w))
Exercise:
Given that the derivative of the sigmoid 𝑓 is 𝑓′ (𝑡) = (1 − 𝑓(𝑡))𝑓(𝑡),
show that

𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖) x𝑖
29
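One way to sanity-check the gradient expression in the exercise is to compare it against finite differences on a small synthetic problem (a sketch; the data are random and purely illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w, X, y):
    """dE/dw = sum_i (h_w(x_i) - y_i) x_i, written as a matrix product."""
    return X.T @ (sigmoid(X @ w) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# Central finite-difference approximation of the gradient.
eps = 1e-6
num_grad = np.array([
    (cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, analytic_gradient(w, X, y), atol=1e-5))   # True
```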
Optimisation: gradient descent
The overall gradient descent method looks like so:
1. Set an initial weight vector w(0).
2. For 𝑡 = 0, 1, 2, … repeat until convergence:
3. Compute the gradient:
   𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ ( 1/(1 + exp(−x𝑖⊤w)) − 𝑦𝑖 ) x𝑖
4. Update the weights:
   w(𝑡+1) = w(𝑡) − 𝜂 𝜕𝐸/𝜕w
30
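A minimal NumPy implementation of this loop might look as follows. The learning rate, iteration count and the synthetic data are arbitrary illustrative choices, not values from the slides; the data are generated exactly according to the model 𝑦 = [x⊤w + 𝜖 > 0] with logistic noise:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, eta=0.01, n_iters=5000):
    """Gradient descent on the cross-entropy loss.

    X : (n, p) design matrix (include a column of ones for an intercept)
    y : (n,)   binary labels in {0, 1}
    """
    w = np.zeros(X.shape[1])                  # 1. initial weight vector w(0)
    for _ in range(n_iters):                  # 2. iterate until convergence
        grad = X.T @ (sigmoid(X @ w) - y)     # 3. gradient of the cross-entropy
        w = w - eta * grad                    # 4. update w(t+1) = w(t) - eta * dE/dw
    return w

# Toy usage: generate data from y = [x^T w_true + eps > 0] with logistic noise eps.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = (X @ w_true + rng.logistic(size=200) > 0).astype(float)
print(fit_logistic_regression(X, y))   # close to w_true, up to sampling noise
```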
Example
Below is an example with 2 features.
Figure: scatter plot of the two classes in the (𝑥1, 𝑥2) feature plane.
31
Example
The estimate for the probability of success is
ℎw(x) = 1/(1 + exp(−(−1.28 − 1.09 𝑥1 + 1.89 𝑥2)))
Below are drawn the lines that correspond to ℎw (x) = 0.05, ℎw (x) =
0.5 and ℎw (x) = 0.95.
Figure: the two classes together with the lines corresponding to ℎw(x) = 5%, 50% and 95%.
32
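With the fitted weights quoted above, the predicted probability of class 1 at any point (𝑥1, 𝑥2), and the 50% decision boundary, can be computed directly (a small sketch using the rounded values from the slide):

```python
import numpy as np

w0, w1, w2 = -1.28, -1.09, 1.89          # rounded weights quoted above

def h(x1, x2):
    """Estimated probability of belonging to class 1."""
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))

print(h(0.0, 0.0))     # ~0.22, below 0.5: the origin is assigned to class 0

# The 50% boundary is where the logit is zero: w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2 = (1.28 + 1.09*x1) / 1.89.
x1 = np.linspace(-4.0, 4.0, 5)
print(-(w0 + w1 * x1) / w2)
```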
Multiclass Classification
33
In many applications you have to deal with more than 2 classes.
In these cases, we need to use multinomial logistic regression, which
is an extension of logistic regression to more than 2 classes.
34
Multinomial Logistic Regression
In Multinomial Logistic Regression, each class 𝐶𝑘 has its own weight
vector w𝑘 and the classifier is based on the following likelihood model:

𝑝(𝑦 = 𝐶𝑘|x, w) = softmax(x⊤w)𝑘 = exp(w𝑘⊤x) / ∑ⱼ₌₁ᴷ exp(w𝑗⊤x)

𝐶𝑘 is the class 𝑘 and softmax ∶ ℝᴷ → ℝᴷ is the function defined as

softmax(t)𝑘 = exp(𝑡𝑘) / ∑ⱼ₌₁ᴷ exp(𝑡𝑗)
In other words, softmax takes as an input the vector of logits for all
classes and returns the vector of corresponding likelihoods.
35
Multinomial Logistic Regression
For instance, say we have 3 classes A, B, C, with logits

x⊤w𝐴 = −1.2,   x⊤w𝐵 = +3.1,   x⊤w𝐶 = −0.9

Applying softmax gives the likelihoods

𝑝(𝐴|x) = 0.0131,   𝑝(𝐵|x) = 0.9691,   𝑝(𝐶|x) = 0.0177

where 𝑝(𝐴|x) = exp(−1.2)/(exp(−1.2) + exp(3.1) + exp(−0.9)) = 0.0131
36
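A small NumPy softmax reproduces these numbers (a sketch; subtracting the maximum logit is a standard trick to avoid overflow and does not change the result):

```python
import numpy as np

def softmax(t):
    """softmax(t)_k = exp(t_k) / sum_j exp(t_j)."""
    e = np.exp(t - np.max(t))     # subtract max(t) for numerical stability
    return e / np.sum(e)

logits = np.array([-1.2, 3.1, -0.9])     # x^T w_A, x^T w_B, x^T w_C
print(softmax(logits).round(4))          # [0.0131 0.9691 0.0177]
```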
Multinomial Cross Entropy
To optimise for the parameters, we can again take the maximum likelihood
approach.
Combining the likelihood for all possible classes gives us:

𝑝(𝑦|x) = 𝑝(𝑦 = 𝐶1|x)^[𝑦=𝐶1] × ⋯ × 𝑝(𝑦 = 𝐶𝐾|x)^[𝑦=𝐶𝐾]

where [𝑦 = 𝐶1] is 1 if 𝑦 = 𝐶1 and 0 otherwise.
The total likelihood over all observations is:

𝑝(y|X) = ∏ᵢ₌₁ⁿ 𝑝(𝑦𝑖 = 𝐶1|x𝑖)^[𝑦𝑖=𝐶1] × ⋯ × 𝑝(𝑦𝑖 = 𝐶𝐾|x𝑖)^[𝑦𝑖=𝐶𝐾]
37
Multinomial Cross Entropy
Taking the negative log likelihood yields the cross-entropy error function
for the multiclass problem:

𝐸(w1, ⋯, w𝐾) = −ln(𝑝(y|X)) = − ∑ᵢ₌₁ⁿ ∑ₖ₌₁ᴷ [𝑦𝑖 = 𝐶𝑘] ln(𝑝(𝑦𝑖 = 𝐶𝑘|x𝑖))

Similarly to logistic regression, we can use a gradient descent approach
to find the 𝐾 weight vectors w1, ⋯, w𝐾 that minimise this cross-entropy
expression.
38
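For completeness, here is a sketch of the multiclass cross-entropy in NumPy. The names are mine, not from the slides: W stacks the 𝐾 weight vectors as columns, X stores one observation per row, and Y holds the one-hot indicators [𝑦𝑖 = 𝐶𝑘]:

```python
import numpy as np

def softmax_rows(T):
    """Row-wise softmax of an (n, K) matrix of logits."""
    e = np.exp(T - T.max(axis=1, keepdims=True))   # stabilised exponentials
    return e / e.sum(axis=1, keepdims=True)

def multinomial_cross_entropy(W, X, Y):
    """E(w_1,...,w_K) = - sum_i sum_k [y_i = C_k] ln p(y_i = C_k | x_i)."""
    P = softmax_rows(X @ W)                  # (n, K) class probabilities
    return -np.sum(Y * np.log(P + 1e-12))    # small constant guards against log(0)

# Tiny usage example: 2 observations, 2 features, 3 classes.
X = np.array([[1.0, 0.5], [0.2, -1.0]])
W = np.zeros((2, 3))
Y = np.array([[1, 0, 0], [0, 0, 1]])
print(multinomial_cross_entropy(W, X, Y))    # 2 * ln(3) ≈ 2.197 at W = 0
```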
Take Away
With Logistic Regression, we look at linear models, where the output
of the problem is a binary categorical response.
Instead of directly predicting the actual outcome as in least squares,
the model proposed in logistic regression makes a prediction about
the likelihood of belonging to a particular class.
Finding the maximum likelihood parameters is equivalent to minimising
the cross-entropy loss function. The minimisation can be done using the
gradient descent technique.
The extension of Logistic Regression to more than 2 classes is called
Multinomial Logistic Regression.
39