Logistic Regression
Probabilistic classification
• Most real-life prediction scenarios with discrete
outputs, such as Yes/No, are probabilistic:
• Will you carry an umbrella if it is raining?
• Will you carry an umbrella if it is sunny?
• Will you carry an umbrella if it drizzles?
• Logistic Regression gives the probability of an
event occurring, given historical data on which to
train and test the model
[Figure: historical Carry Umbrella outcomes (YES/NO, Y2 axis) and the predicted probability of carrying an umbrella (Y1 axis), both plotted against the probability of rain (X axis)]
Decide a Threshold to Classify: the Decision Boundary
[Figure: the same plot with candidate thresholds P(Y=1|X) = 0.3, 0.5 and 0.8 marked on the predicted-probability curve as possible decision boundaries between Carry Umbrella = YES and NO]
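A minimal sketch of how a chosen threshold turns predicted probabilities into class labels; the probability values are the ones shown in the figure, and 0.5 is just the illustrative midpoint threshold:

```python
import numpy as np

# Predicted probabilities P(Y=1|X) for a few examples (values from the figure)
probs = np.array([0.8, 0.5, 0.3])

# A chosen decision threshold: predictions at or above it are classified YES (1)
threshold = 0.5
labels = (probs >= threshold).astype(int)

print(labels)  # [1 1 0]
```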
Logistic Regression
• $P(y=1 \mid x, w) = \dfrac{1}{1 + e^{-(w_0 + \sum_i w_i x_i)}}$
• w are the adjustable weight parameters
• This is the Sigmoid function
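A small sketch of the sigmoid in code, assuming the linear score $w_0 + \sum_i w_i x_i$ from the formula above; the example weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, w0):
    """P(y=1 | x, w): the sigmoid of the linear score w0 + w . x."""
    return sigmoid(w0 + np.dot(w, x))

# Example with hypothetical weights and one feature vector
print(predict_proba(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1))
```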
A Comparison
Linear Regression:
• Predicted value is not limited between 0 and 1
• Predicted and actual outputs have the same units
• Constant slope = w1
• Used for regression
Logistic Regression:
• Predicted output is a probability
• Predicted output is unitless
• Slope varies from 0 up to its maximum at the centre
• Used typically for classification
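A brief illustrative sketch of the contrast, using hypothetical weights: the linear model's output grows without bound, while the logistic output always stays in (0, 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = 0.0, 2.0                    # hypothetical intercept and slope
x = np.array([-5.0, 0.0, 5.0])

linear_out = w0 + w1 * x             # unbounded: [-10.  0.  10.]
logistic_out = sigmoid(w0 + w1 * x)  # always in (0, 1): ~[0.00005 0.5 0.99995]

print(linear_out, logistic_out)
```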
Midpoint & Slope
Performance Tallies
             Predicted NO    Predicted YES
Actual YES   False -ve       True +ve
Actual NO    True -ve        False +ve
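A minimal sketch of tallying these four counts from actual labels and thresholded predictions; the label vectors here are illustrative:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0])   # actual labels (illustrative)
y_pred = np.array([1, 0, 0, 1, 1, 0])   # predicted labels after thresholding

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

print(tp, fp, fn, tn)
```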
Log odds or Logit
• Assume there are two classes, y = 0 and y = 1, and $P(y=1 \mid x, w) = \dfrac{1}{1 + e^{-(w_0 + \sum_i w_i x_i)}}$
• Odds: $\dfrac{P(y=1 \mid x, w)}{P(y=0 \mid x, w)}$
• Log Odds: $\ln \dfrac{P(y=1 \mid x, w)}{P(y=0 \mid x, w)} = w_0 + \sum_i w_i x_i$
• That is, the log odds of class 1 w.r.t. class 0 is a linear function
of x
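A short derivation of why the log odds come out linear in x, assuming only the sigmoid form given above:

```latex
\begin{align*}
P(y=1 \mid x) &= \frac{1}{1 + e^{-z}}, \qquad z = w_0 + \textstyle\sum_i w_i x_i \\
P(y=0 \mid x) &= 1 - P(y=1 \mid x) = \frac{e^{-z}}{1 + e^{-z}} \\
\ln \frac{P(y=1 \mid x)}{P(y=0 \mid x)} &= \ln \frac{1}{e^{-z}} = z = w_0 + \textstyle\sum_i w_i x_i
\end{align*}
```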
Model Fitting
Let p1 be P(y=1|x,w); then P(y=0|x,w) = 1 - p1
Sequence n: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Actual Data y: 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1
Prediction p: p1 p1 p1 1-p1 1-p1 1-p1 p1 p1 p1 p1 p1 1-p1 1-p1 1-p1 1-p1 p1 p1 p1
Likelihood of a match?
Let p1 be P(y=1|x,w)
Sequence n: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Actual Data y: 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1
Prediction p: p1 p1 p1 1-p1 1-p1 1-p1 p1 p1 p1 p1 p1 1-p1 1-p1 1-p1 1-p1 p1 p1 p1
Likelihood of a match? Note $y_n$ can be either 1 or 0
• Likelihood: $L(w) = \prod_{n} p_1^{\,y_n} (1 - p_1)^{\,1 - y_n}$
• Log Likelihood: $l(w) = \sum_{n} \left[\, y_n \ln p_1 + (1 - y_n) \ln (1 - p_1) \,\right]$
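A small sketch, using the 18-point sequence above and treating p1 as a single number (the value 0.6 is made up), of how the likelihood and log-likelihood would be computed:

```python
import numpy as np

# Actual labels from the example sequence
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1])

p1 = 0.6   # illustrative value of P(y=1 | x, w)

# Per-point probability of the observed label: p1 if y=1, (1 - p1) if y=0
per_point = np.where(y == 1, p1, 1.0 - p1)

likelihood = np.prod(per_point)                                        # product over all points
log_likelihood = np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))   # its logarithm

print(likelihood, log_likelihood)
```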
Training
• Maximum Likelihood Estimation (MLE)
• Note: $w \leftarrow \arg\max_{w} \prod_l P(y^l \mid x^l, w)$
• Here $x^l$ and $y^l$ are pre-determined from the training data.
• The intercept $w_0$ and coefficients $w_i$ are calculated so as to maximize
this probability
• So, how many w should we try out?
Computing the Log-Likelihood
• We can re-express the log of the conditional
likelihood as:
$l(w) = \sum_l \left[\, y^l \left( w_0 + \textstyle\sum_i w_i x_i^l \right) - \ln\!\left( 1 + e^{\,w_0 + \sum_i w_i x_i^l} \right) \right]$
• Need to maximize l(w)
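A sketch, under the formula above, of computing l(w) both directly and in the re-expressed form; the two should agree up to floating-point error. The data and weights here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(18, 3))                 # made-up feature matrix (18 points, 3 features)
y = rng.integers(0, 2, size=18)              # made-up 0/1 labels
w0, w = 0.1, np.array([0.5, -0.3, 0.2])      # made-up weights

z = w0 + X @ w                               # linear scores
p = 1.0 / (1.0 + np.exp(-z))                 # sigmoid probabilities

direct = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
reexpressed = np.sum(y * z - np.log(1 + np.exp(z)))

print(direct, reexpressed)                   # should match
```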
Fitting LogR by Gradient Ascent
• Unfortunately, there is no closed-form solution to maximizing l(w)
with respect to w. Therefore, one common approach is to use
gradient ascent
• The i-th component of the gradient vector has the form
$\dfrac{\partial l(w)}{\partial w_i} = \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right)$
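A minimal sketch of that gradient computation (same X, y, w, w0 conventions as the previous snippet; all values illustrative):

```python
import numpy as np

def gradient(X, y, w, w0):
    """i-th component: sum over points of x_i * (y - P(y=1|x,w))."""
    p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # predicted probabilities
    residual = y - p                           # (y^l - P_hat) for every point
    grad_w = X.T @ residual                    # gradient w.r.t. each weight w_i
    grad_w0 = residual.sum()                   # gradient w.r.t. the intercept
    return grad_w, grad_w0
```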
Fitting LogR by Gradient Ascent
• Use standard gradient ascent to optimize w. Begin
with initial weights set to zero, then repeatedly apply
$w_i \leftarrow w_i + \eta \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right)$
where $\eta$ is a small learning-rate constant
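A compact sketch of the resulting training loop; the learning rate eta and the iteration count are illustrative choices, not values from the slides:

```python
import numpy as np

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    """Gradient ascent on the conditional log-likelihood, starting from zero weights."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # current predictions
        residual = y - p
        w += eta * (X.T @ residual)                # update each w_i
        w0 += eta * residual.sum()                 # update the intercept
    return w, w0
```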
Regularization in Logistic Regression
• Overfitting the training data is a problem that can arise
in Logistic Regression, especially when the data is very
high-dimensional and sparse.
• One approach to reducing overfitting is regularization,
in which we create a modified “penalized log likelihood
function,” which penalizes large values of w.
$w \leftarrow \arg\max_{w} \sum_l \ln P(y^l \mid x^l, w) \;-\; \dfrac{\lambda}{2} \lVert w \rVert^2$
Regularization in Logistic Regression
• The derivative of this penalized log likelihood function is similar to our
earlier derivative, with one additional penalty term:
$\dfrac{\partial l(w)}{\partial w_i} = \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right) - \lambda w_i$
• which gives us the modified gradient ascent rule
$w_i \leftarrow w_i + \eta \left[ \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right) - \lambda w_i \right]$
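A brief sketch of the regularized update, differing from the earlier loop only in the extra -lambda * w term; the lambda value is illustrative:

```python
import numpy as np

def fit_logistic_l2(X, y, eta=0.1, lam=0.01, n_iters=1000):
    """Gradient ascent on the L2-penalized conditional log-likelihood."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
        residual = y - p
        w += eta * (X.T @ residual - lam * w)   # penalty term shrinks large weights
        w0 += eta * residual.sum()              # intercept left unpenalized here
    return w, w0
```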
Summary of Logistic Regression
• Learns the Conditional Probability Distribution P(y|x)
• Local Search.
• Begins with initial weight vector.
• Modifies it iteratively to maximize an objective function.
• The objective function is the conditional log likelihood of the data – so the
algorithm seeks the probability distribution P(y|x) that is most likely given the
data.
What you should know: LogR
• In general, NB (Naive Bayes) and LR (Logistic Regression) make different assumptions
• NB: Features independent given class -> assumption on P(X|Y)
• LR: Functional form of P(Y|X), no assumption on P(X|Y)
• LogR can be used as a linear classifier
• decision rule is a hyperplane
• LogR optimized by conditional likelihood
• no closed-form solution
• concave -> global optimum with gradient ascent