
02 - Logistic Regression

François Pitié

Assistant Professor in Media Signal Processing


Department of Electronic & Electrical Engineering, Trinity College Dublin

[4C16/5C16] Deep Learning and its Applications — 2022/2023

1
Motivation

With Linear Regression, we looked at linear models, where the output of the problem was a continuous variable (e.g. height, car price, temperature, …).

Very often you need to design a classifier that can answer questions such as: what car type is it? is the person smiling? is a solar flare going to happen? In such problems the outcome is a categorical variable.

Logistic Regression (David Cox, 1958) considers the case of a binary variable, where the outcome is 0/1 or true/false.

2
There is a whole zoo of classifiers out there. Why are we covering
logistic regression in particular?
Because logistic regression is the building block of Neural Nets.

3
Introductory Example

We’ll start with an example from Wikipedia:


A group of 20 students spend between 0 and 6 hours studying for
an exam. How does the number of hours spent studying affect the
probability that the student will pass the exam?

4
Introductory Example

The collected data looks like so:


Studying Hours : 0.75 1.00 2.75 3.50 ...
result (1=pass,0=fail) : 0 0 1 0 ...

Figure: scatter plot of exam outcome y (0 = fail, 1 = pass) against x (hours studying)
5
Regression?

Although the output 𝑦 is binary, we could still attempt to fit a linear model via least squares:

ℎw(x) = x⊤w = 𝑤1𝑥1 + ⋯ + 𝑤𝑝𝑥𝑝

where ℎw(x) is the prediction given model parameters w and input features x.

6
Regression?

This is what the least squares estimate ℎw (x) looks like:

ℎ𝑤 (𝑥) ≈ 0.18 × 𝑥 + 0.08

Figure: least squares fit ℎ𝑤(𝑥) ≈ 0.18𝑥 + 0.08 overlaid on the exam outcome data (y vs. hours studying x)
7
Regression?

The model prediction ℎw(x) = x⊤w is continuous, but we could apply a threshold to obtain a binary classifier as follows:

𝑦 = [x⊤w > 0.5] =
    0 if x⊤w ≤ 0.5
    1 if x⊤w > 0.5

and the output would be 0 or 1.

Numerically, on our example we would have:

𝑦 =
    0 if 0.18 × 𝑥 + 0.08 ≤ 0.5
    1 if 0.18 × 𝑥 + 0.08 > 0.5

Obviously, we have some issues with that approach...
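Before turning to those issues, here is an illustrative sketch (not from the handout) of this least-squares-plus-threshold classifier in NumPy; the data values below are made up in the spirit of the exam example.

```python
import numpy as np

# Made-up (hours studied, pass/fail) data in the spirit of the exam example.
hours  = np.array([0.75, 1.0, 2.0, 2.75, 3.0, 3.5, 4.25, 5.0, 5.5, 6.0])
passed = np.array([0.0,  0.0, 0.0, 1.0,  0.0, 1.0, 1.0,  1.0, 1.0, 1.0])

# Least squares fit of y ≈ w0 + w1·x (design matrix with a bias column).
X = np.column_stack([np.ones_like(hours), hours])
w, *_ = np.linalg.lstsq(X, passed, rcond=None)

# Threshold the continuous prediction at 0.5 to get a 0/1 classifier.
y_hat = (X @ w > 0.5).astype(int)
print("fitted w:", w)
print("predicted labels:", y_hat)
```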

8
Regression?

Example: a student studied 100 hours and is successful:

ℎ𝑤(𝑥) = 100 × 0.18 + 0.08 = 18.08 > 0.5

But in terms of LS, the error 𝜀² = (1 − ℎ𝑤(𝑥))² ≈ 17.1² ≈ 292 is large, when everything is in fact perfectly fine.

9
Regression?

The issue is that our Least Squares loss is defined as:

𝐸(w) = ∑ᵢ (w⊤x𝑖 − 𝑦𝑖)²

But we should include the threshold and have something like:

𝐸(w) = ∑ᵢ ([w⊤x𝑖 > 0.5] − 𝑦𝑖)²

So Least Squares doesn’t really work...

Let’s see what can be done.


10
General Linear Model

The general problem of general linear models can be presented as follows. We are trying to find a linear combination of the data x⊤w, such that the sign of x⊤w tells us about the outcome 𝑦:

𝑦 = [x⊤ w + 𝜖 > 0]

11
General Linear Model

The general problem of general linear models can be presented as follows. We are trying to find a linear combination of the data x⊤w, such that the sign of x⊤w tells us about the outcome 𝑦:

𝑦 = [x⊤w + 𝜖 > 0]

The quantity x⊤w is sometimes called the risk score. It is a scalar value that grades the certainty of belonging to one class or the other:

x⊤w ≫ 0 ⇒ 𝑦 = 1
x⊤w ≪ 0 ⇒ 𝑦 = 0
x⊤w ≈ 0 ⇒ undecided

The risk score performs a dimensionality reduction: it combines multiple input features into a single number.

12
General Linear Model

The general problem of general linear models can be presented as follows. We are trying to find a linear combination of the data x⊤w, such that the sign of x⊤w tells us about the outcome 𝑦:

𝑦 = [x⊤w + 𝜖 > 0]

The error term is represented by the random variable 𝜖. Multiple choices are possible for the distribution of 𝜖.

13
In logistic regression, the error 𝜖 is assumed to follow a logistic distribution, and the risk score x⊤w is also called the logit.

Figure: pdf of the logistic distribution

14
In probit regression, the error 𝜖 is assumed to follow a normal distribution, and the risk score x⊤w is also called the probit.

Figure: pdf of the normal distribution

15
For our purposes, there is not much difference between logistic and probit regression. The main difference is that logistic regression is numerically easier to solve.

From now on, we’ll only look at the logistic model, but note that similar derivations could be made for any other model.

16
Logistic Regression Model

Consider 𝑝(𝑦 = 1|x, w), the likelihood that the output is a success:

𝑝(𝑦 = 1|x, w) = 𝑝(x⊤w + 𝜖 > 0) = 𝑝(𝜖 > −x⊤w)

Since 𝜖 is symmetrically distributed around 0, it follows that

𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤w)

Because we have made some assumptions about the distribution of 𝜖, we are able to derive a closed-form expression for the likelihood.

17
The Logistic Function

The function 𝑓 ∶ 𝑡 ↦ 𝑓(𝑡) = 𝑝(𝜖 < 𝑡) is the c.d.f. of the logistic distribution and is also called the logistic function or sigmoid:

𝑓(𝑡) = 1/(1 + exp(−𝑡))

Figure: plot of the logistic function 𝑓(𝑡)
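As a quick illustrative sketch (not part of the handout), the sigmoid is one line of NumPy:

```python
import numpy as np

def sigmoid(t):
    """Logistic function f(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# The sigmoid maps any real logit to a probability in (0, 1).
print(sigmoid(np.array([-6.0, 0.0, 6.0])))  # ~[0.0025, 0.5, 0.9975]
```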

18
Logistic Regression Model

Thus we have a simple model for the likelihood of success:

𝑝(𝑦 = 1|x, w) = 𝑝(𝜖 < x⊤w) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w))

The likelihood of failure is simply given by:

𝑝(𝑦 = 0|x, w) = 1 − 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(+x⊤w))

Exercise:

show that 𝑝(𝑦 = 0|x, w) = ℎw(−x)
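A quick numerical illustration of the identity behind this exercise (a check, not a proof; the grid of test values is arbitrary):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Check that 1 - f(z) = f(-z) on a few test points.
z = np.linspace(-5.0, 5.0, 11)
print(np.allclose(1.0 - sigmoid(z), sigmoid(-z)))  # expect True
```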

19
Logistic Regression Model

Below is the plot of 𝑝(𝑦 = 1|x, w) = 1/(1 + exp(−(𝑤0 + 𝑤1 𝑥))) for our
problem (using optimal values of 𝑤0 and 𝑤1 ):

Figure: fitted logistic curve 𝑝(𝑦 = 1|x, w) overlaid on the exam outcome data (y vs. hours studying x)

The results are easy to interpret: there is about a 60% chance of passing the exam if you study for 3 hours.
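As a hedged numerical check: the handout does not list the fitted coefficients, so the values below (roughly those of the Wikipedia studying-hours example) are used purely for illustration.

```python
import numpy as np

# Assumed coefficients, approximately those of the Wikipedia example;
# they are NOT quoted in the handout.
w0, w1 = -4.08, 1.50

p_pass = 1.0 / (1.0 + np.exp(-(w0 + w1 * 3.0)))  # probability after 3 hours
print(round(p_pass, 2))                           # ~0.6, about a 60% chance
```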
20
Logistic Regression vs. Least Squares

In linear regression, the model prediction ℎw(x) was a direct prediction of the outcome:

ℎw(x) = ŷ

In logistic regression, the model prediction ℎw(x) is an estimate of the likelihood of the outcome:

ℎw(x) = 𝑝(𝑦 = 1|x, w)

Thus, whereas in linear regression we try to answer the question:

What is the expected value of 𝑦 given x?

In logistic regression (and any other general linear model), we try instead to answer the question:

What is the probability that 𝑦 = 1 given x?

21
Maximum Likelihood

To estimate the weights w, we will again use the concept of Maximum Likelihood.

22
Maximum Likelihood

As we’ve just seen, for a particular observation x𝑖 and model w, the likelihood is given by:

𝑝(𝑦 = 𝑦𝑖|x𝑖, w) =
    𝑝(𝑦 = 1|x𝑖, w) = ℎw(x𝑖)       if 𝑦𝑖 = 1
    𝑝(𝑦 = 0|x𝑖, w) = 1 − ℎw(x𝑖)   if 𝑦𝑖 = 0

As 𝑦𝑖 ∈ {0, 1}, this can be written in a slightly more compact form:

𝑝(𝑦 = 𝑦𝑖|x𝑖, w) = ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)

This works because 𝑧⁰ = 1 and 𝑧¹ = 𝑧.

Assuming independent observations, the likelihood over all observations is:

𝑝(y|X, w) = ∏ᵢ₌₁ⁿ ℎw(x𝑖)^𝑦𝑖 (1 − ℎw(x𝑖))^(1−𝑦𝑖)

23
Maximum Likelihood

We want to find w that maximises the likelihood 𝑝(y|X, w). As always, it is equivalent but more convenient to minimise the negative log likelihood:

𝐸(w) = −ln(𝑝(y|X, w)) = ∑ᵢ₌₁ⁿ −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖))

This loss function we need to minimise is called the cross-entropy.

Note that we can also consider the average cross-entropy:

𝐸(w) = (1/𝑛) ∑ᵢ₌₁ⁿ −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖))
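As an illustrative sketch of this loss in NumPy (the small clipping constant is a numerical safeguard, not part of the handout's derivation):

```python
import numpy as np

def cross_entropy(w, X, y, eps=1e-12):
    """Cross-entropy E(w) for logistic regression.

    X is an (n, p) design matrix, y an (n,) vector of 0/1 labels.
    eps clips the probabilities to avoid log(0).
    """
    h = 1.0 / (1.0 + np.exp(-X @ w))        # h_w(x_i) for every sample
    h = np.clip(h, eps, 1.0 - eps)
    return np.sum(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
```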

24
We could have considered optimising the parameters w using other loss functions. For instance, we could have tried to minimise the least squares error as we did in linear regression:

𝐸_LS(w) = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖)²

The solution would not maximise the likelihood, as the cross-entropy loss does, but maybe that would still be a reasonable thing to do?

The problem is that ℎw is non-convex (in w), which makes 𝐸_LS(w) non-convex and its minimisation much harder than with the cross-entropy loss, which is convex.

This is in fact a mistake that the Neural Net community made for a number of years before switching to the cross-entropy loss function.

25
Optimisation: gradient descent

To minimise the error function, we need to resort to gradient descent, which is a general method for nonlinear optimisation and which will be at the core of neural network optimisation.

We start at w(0) and take steps along the steepest descent direction v using a fixed step size, as follows:

w(𝑛+1) = w(𝑛) + 𝜂v(𝑛)

𝜂 is called the learning rate and controls the speed of the descent.

What is the steepest direction v?

26
Optimisation: gradient descent

Without loss of generality, we set v to be a unit vector (i.e. ‖v‖ = 1). Then, moving w to w + 𝜂v yields a new error as follows:

𝐸(w + 𝜂v) = 𝐸(w) + 𝜂 (𝜕𝐸/𝜕w)⊤v + 𝑂(𝜂²)

which reaches a minimum when

v = − (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖

27
Optimisation: gradient descent

Now, it is hard to find a good value for the learning rate 𝜂, and we usually adopt an adaptive step instead. Thus, instead of using

w(𝑛+1) = w(𝑛) − 𝜂 (𝜕𝐸/𝜕w) / ‖𝜕𝐸/𝜕w‖

we usually use the following update step:

w(𝑛+1) = w(𝑛) − 𝜂 𝜕𝐸/𝜕w

28
Optimisation: gradient descent

Recall that the cross-entropy loss function is:

𝐸(w) = ∑ᵢ₌₁ⁿ −𝑦𝑖 ln(ℎw(x𝑖)) − (1 − 𝑦𝑖) ln(1 − ℎw(x𝑖))

and that ℎw(x) = 𝑓(x⊤w) = 1/(1 + exp(−x⊤w)).

Exercise:

Given that the derivative of the sigmoid 𝑓 is 𝑓′(𝑡) = (1 − 𝑓(𝑡))𝑓(𝑡), show that

𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ (ℎw(x𝑖) − 𝑦𝑖) x𝑖
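A hedged sanity check of this gradient formula (not from the handout: the data here are random placeholders, and a finite-difference comparison is just one way to verify the expression):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 random samples, 3 features
y = (rng.random(20) < 0.5).astype(float)     # random 0/1 labels
w = rng.normal(size=3)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
E = lambda w: np.sum(-y * np.log(sigmoid(X @ w))
                     - (1 - y) * np.log(1 - sigmoid(X @ w)))

analytic = X.T @ (sigmoid(X @ w) - y)        # sum_i (h_w(x_i) - y_i) x_i

# Central finite differences, one coordinate at a time.
eps, numeric = 1e-6, np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w); e[j] = eps
    numeric[j] = (E(w + e) - E(w - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # expect True
```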

29
Optimisation: gradient descent

The overall gradient descent method looks like so:

1. set an initial weight vector w(0)
2. for 𝑡 = 0, 1, 2, … do until convergence:
3.   compute the gradient 𝜕𝐸/𝜕w = ∑ᵢ₌₁ⁿ (1/(1 + exp(−x𝑖⊤w)) − 𝑦𝑖) x𝑖
4.   update the weights: w(𝑡+1) = w(𝑡) − 𝜂 𝜕𝐸/𝜕w
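An illustrative sketch of this loop in NumPy (assumptions not in the handout: placeholder data, a fixed learning rate, and a fixed iteration count instead of a convergence test):

```python
import numpy as np

def fit_logistic(X, y, eta=0.01, n_iters=5000):
    """Batch gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])                     # step 1: initial weights w(0)
    for _ in range(n_iters):                     # step 2: iterate
        h = 1.0 / (1.0 + np.exp(-X @ w))         # h_w(x_i) for every sample
        grad = X.T @ (h - y)                     # step 3: gradient of E(w)
        w = w - eta * grad                       # step 4: update the weights
    return w

# Toy usage on exam-style data (values are placeholders):
hours  = np.array([0.75, 1.0, 2.0, 2.75, 3.0, 3.5, 4.25, 5.0, 5.5, 6.0])
passed = np.array([0.0,  0.0, 0.0, 1.0,  0.0, 1.0, 1.0,  1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(hours), hours])  # bias column + feature
print("w0, w1 =", fit_logistic(X, passed))
```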

30
Example

Below is an example with 2 features.

Figure: scatter plot of the two-feature example data, showing the two classes
31
Example

The estimate for the probability of success is

ℎw(x) = 1/(1 + exp(−(−1.28 − 1.09𝑥₁ + 1.89𝑥₂)))

Below are drawn the lines that correspond to ℎw(x) = 0.05, ℎw(x) = 0.5 and ℎw(x) = 0.95.

Figure: the two classes (class 0, class 1) with the 5%, 50% and 95% probability lines of the fitted model
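As a small hedged sketch using only the coefficients printed above (the test points are arbitrary): the 50% line is where the logit −1.28 − 1.09𝑥₁ + 1.89𝑥₂ equals 0.

```python
import numpy as np

w = np.array([-1.28, -1.09, 1.89])            # [bias, x1, x2] from the slide

def h(x1, x2):
    """Estimated probability of class 1 at the point (x1, x2)."""
    logit = w[0] + w[1] * x1 + w[2] * x2
    return 1.0 / (1.0 + np.exp(-logit))

# A point above the 50% line vs. one below it (arbitrary test points).
print(h(-1.0, 2.0))   # logit = -1.28 + 1.09 + 3.78 = 3.59  -> p ~ 0.97
print(h( 2.0, -1.0))  # logit = -1.28 - 2.18 - 1.89 = -5.35 -> p ~ 0.005
```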

32
Multiclass Classification

33
In many applications you have to deal with more than 2 classes.
In these cases, we need to use multinomial logistic regression, which
is an extension of logistic regression to more than 2 classes.

34
Multinomial Logistic Regression

In Multinomial Logistic Regression, each class 𝐶𝑘 gets its own weight vector w𝑘, and the likelihood model is:

𝑝(𝑦 = 𝐶𝑘|x, w) = softmax(w₁⊤x, …, w𝐾⊤x)𝑘 = exp(w𝑘⊤x) / ∑ⱼ₌₁ᴷ exp(w𝑗⊤x)

𝐶𝑘 is the class 𝑘 and softmax ∶ ℝᴷ → ℝᴷ is the function defined as

softmax(t)𝑘 = exp(𝑡𝑘) / ∑ⱼ₌₁ᴷ exp(𝑡𝑗)

In other words, softmax takes as input the vector of logits for all classes and returns the vector of corresponding likelihoods.
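A minimal sketch of softmax in NumPy (the max-subtraction is a standard numerical-stability trick, not something stated in the handout):

```python
import numpy as np

def softmax(t):
    """Map a vector of logits t to a vector of class probabilities."""
    t = t - np.max(t)             # subtract the max for numerical stability
    e = np.exp(t)
    return e / np.sum(e)

print(softmax(np.array([-1.2, 3.1, -0.9])))  # ~[0.013, 0.969, 0.018]
```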

35
Multinomial Logistic Regression

For instance, say we have 3 classes A, B, C. The logits (x⊤w𝐴, x⊤w𝐵, x⊤w𝐶) = (−1.2, +3.1, −0.9) are mapped by softmax to the probabilities 𝑝(𝐴|x) = 0.0131, 𝑝(𝐵|x) = 0.9691, 𝑝(𝐶|x) = 0.0177,

where 𝑝(𝐴|x) = exp(−1.2)/(exp(−1.2) + exp(3.1) + exp(−0.9)) = 0.0131

36
Multinomial Cross Entropy

To optimise the parameters, we can again take the maximum likelihood approach.

Combining the likelihood for all possible classes gives us:

𝑝(𝑦|x) = 𝑝(𝑦 = 𝐶₁|x)^[𝑦=𝐶₁] × ⋯ × 𝑝(𝑦 = 𝐶𝐾|x)^[𝑦=𝐶𝐾]

where [𝑦 = 𝐶₁] is 1 if 𝑦 = 𝐶₁ and 0 otherwise.

The total likelihood is:

𝑝(y|X) = ∏ᵢ₌₁ⁿ 𝑝(𝑦𝑖 = 𝐶₁|x𝑖)^[𝑦𝑖=𝐶₁] × ⋯ × 𝑝(𝑦𝑖 = 𝐶𝐾|x𝑖)^[𝑦𝑖=𝐶𝐾]

37
Multinomial Cross Entropy

Taking the negative log likelihood yields the cross-entropy error function for the multiclass problem:

𝐸(w₁, ⋯, w𝐾) = −ln(𝑝(y|X)) = − ∑ᵢ₌₁ⁿ ∑ₖ₌₁ᴷ [𝑦𝑖 = 𝐶𝑘] ln(𝑝(𝑦𝑖 = 𝐶𝑘|x𝑖))

Similarly to logistic regression, we can use a gradient descent approach to find the 𝐾 weight vectors w₁, ⋯, w𝐾 that minimise this cross-entropy expression.
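A hedged sketch of this multiclass cross-entropy (assumptions: labels are given as integer class indices, the probabilities come from a softmax over the logits XW, and a small clipping constant is added for numerical stability):

```python
import numpy as np

def multiclass_cross_entropy(W, X, y, eps=1e-12):
    """E(w_1, ..., w_K) for integer labels y in {0, ..., K-1}.

    X has shape (n, p) and W has shape (p, K), one column per class.
    """
    logits = X @ W                                # (n, K) scores x_i^T w_k
    logits -= logits.max(axis=1, keepdims=True)   # stability shift
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax row by row
    probs = np.clip(probs, eps, 1.0)
    # [y_i = C_k] selects the probability of the true class of each sample.
    return -np.sum(np.log(probs[np.arange(len(y)), y]))
```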

38
Take Away

With Logistic Regression, we look at linear models where the output of the problem is a binary categorical response.

Instead of directly predicting the actual outcome as in least squares, the model proposed in logistic regression predicts the likelihood of belonging to a particular class.

Finding the maximum likelihood parameters is equivalent to minimising the cross-entropy loss function. The minimisation can be done using the gradient descent technique.

The extension of Logistic Regression to more than 2 classes is called Multinomial Logistic Regression.

39
