Logistic regression model
Case studies for choice models
Choice models cater to cases where the response variable is categorical:
- Home loan / credit card / consumer loan defaults {default vs. no default}
- Fraud detection {fraud vs. no fraud}
- Customer churn analysis {churn vs. no churn}
- Propensity-to-buy models {buy vs. no buy}
Linear regression is a bad choice when the response variable is categorical
- Clearly the simplest model could be y = 1 when tumor size is greater than 5.
- In the first model, one could do that by saying y = 1 when y_predicted > 0.5.
- Adding a few more grey points should not result in a new model or a new line, because in reality the cut has not changed. A least-squares line does shift when such points are added, so its 0.5 cut moves as well.
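The effect described in the bullets above can be reproduced with a small sketch. The tumor sizes and labels are made-up illustration data, not taken from the slides, and `fit_line` and `cut_point` are hypothetical helper names. The sketch fits a least-squares line, finds where it crosses 0.5, then adds a few large tumors (all y = 1) and refits.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cut_point(slope, intercept, level=0.5):
    """Tumor size at which the fitted line reaches `level`."""
    return (level - intercept) / slope


# Made-up data: tumors larger than 5 are labelled y = 1.
sizes = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
base_cut = cut_point(*fit_line(sizes, labels))

# Add three very large tumors, all y = 1. The true cut is still 5.
sizes_more = sizes + [20, 25, 30]
labels_more = labels + [1, 1, 1]
slope_more, intercept_more = fit_line(sizes_more, labels_more)
shifted_cut = cut_point(slope_more, intercept_more)

print(base_cut, shifted_cut)
```

With this data the refitted line crosses 0.5 at a size above 6, so a tumor of size 6 would now be predicted as y = 0 even though nothing about the real cut changed.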
General structure for choice models
[Figure: home loan default example. Inputs X (Income, Debt to Income, Default on other loans, Salaried vs. Business, Expense to Income, Credit Score) feed the model, whose output is the Probability of default.]
Logistic regression model
Instead of predicting an absolute value, we predict the probability of an event.
Sigmoid function: P(z) = 1/(1+exp(-z))
[Figure: sigmoid curve. x-axis: Tumor Size; y-axis: Probability of Cancer.]
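A minimal sketch of the sigmoid P(z) = 1/(1+exp(-z)) in plain Python. The intercept and slope that map tumor size to z are hypothetical values chosen only so that the probability is 0.5 at size 5; they do not come from the slides.

```python
import math


def sigmoid(z):
    """P(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients (assumption): z = intercept + slope * tumor_size
intercept, slope = -5.0, 1.0

for tumor_size in (0, 2, 5, 8, 12):
    probability = sigmoid(intercept + slope * tumor_size)
    print(tumor_size, round(probability, 3))
```

The output always lies between 0 and 1, which is what lets it be read as a probability.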
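Error function (analogy)
- If Y = 0, the error is (p - 0) = p.
- If Y = 1, the error is (1 - p).
- Combined error, to be minimized: $p^{1-y}(1-p)^{y}$
- Equivalently, maximize: $p^{y}(1-p)^{1-y}$
Maximizing the second expression is, roughly, MLE (Maximum Likelihood).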
Estimate parameters using Maximum Likelihood
$$\max_{\beta} \sum_{i} \left[ y_i \ln p(z_i) + (1 - y_i) \ln\left(1 - p(z_i)\right) \right], \quad \text{where } z_i = \beta^{\top} x_i$$
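The maximization above can be sketched with plain gradient ascent on the log-likelihood. This is only one possible optimizer and not necessarily the one the slides have in mind. The data, learning rate, and step count are assumptions for illustration, and each x_i carries a leading 1.0 so that the first coefficient acts as an intercept.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(beta, xs, ys):
    """Sum over i of y_i ln p(z_i) + (1 - y_i) ln(1 - p(z_i))."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Gradient ascent: the gradient of the log-likelihood is sum_i (y_i - p_i) x_i."""
    beta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for j, v in enumerate(x):
                grad[j] += (y - p) * v
        beta = [b + learning_rate * g / len(xs) for b, g in zip(beta, grad)]
    return beta


# Made-up, non-separable data: x_i = [1.0, tumor_size]
xs = [[1.0, float(size)] for size in range(1, 9)]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

beta_hat = fit(xs, ys)
print(beta_hat, log_likelihood(beta_hat, xs, ys))
```

The fitted slope comes out positive, and the log-likelihood at `beta_hat` is higher than at the all-zero starting point.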
Churn Model Example
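Setting threshold for classification
[Figure: a threshold separating Positive from Negative predictions.]
- High threshold -> high accuracy among the cases flagged positive, low capture (few of the actual positives are flagged)
- Low threshold -> low accuracy among the cases flagged positive, high capture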
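The trade-off can be made concrete with a short sketch. It reads "accuracy" as the share of flagged cases that are truly positive and "capture" as the share of actual positives that get flagged; that reading is an assumption, as are the scores and labels below.

```python
def accuracy_and_capture(scores, labels, threshold):
    """Flag a case as positive when its score is at or above the threshold."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    true_positives = sum(flagged)
    # Share of flagged cases that are truly positive (0.0 if nothing is flagged).
    accuracy = true_positives / len(flagged) if flagged else 0.0
    # Share of all actual positives that were flagged.
    capture = true_positives / sum(labels)
    return accuracy, capture


# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

high = accuracy_and_capture(scores, labels, threshold=0.75)
low = accuracy_and_capture(scores, labels, threshold=0.35)
print("high threshold:", high)
print("low threshold:", low)
```

On this data the high threshold flags only true positives but misses two of the five, while the low threshold catches all five at the cost of two false alarms.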
Picking a threshold: KS Chart
- Divide the population into deciles.
- Take the upper limit of all deciles and plot the cumulative percentage of good and bad examples.
- Pick the score/threshold of the decile where the separation between good and bad is the maximum.
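The three steps above can be sketched as follows. The sketch assumes the population is ranked by model score in ascending order and that label 1 means "bad". The 20 scores and labels are made up, and `ks_table` and `ks_threshold` are hypothetical helper names.

```python
def ks_table(scores, labels, n_bins=10):
    """Rows of (decile upper limit, cumulative share of good, cumulative share of bad)."""
    ranked = sorted(zip(scores, labels))  # ascending by score
    n = len(ranked)
    total_bad = sum(y for _, y in ranked)
    total_good = n - total_bad
    rows, cum_good, cum_bad = [], 0, 0
    for k in range(n_bins):
        chunk = ranked[k * n // n_bins:(k + 1) * n // n_bins]
        if not chunk:
            continue
        bad = sum(y for _, y in chunk)
        cum_bad += bad
        cum_good += len(chunk) - bad
        rows.append((chunk[-1][0], cum_good / total_good, cum_bad / total_bad))
    return rows


def ks_threshold(rows):
    """Upper limit of the decile with the widest good/bad separation, and that gap."""
    upper, good, bad = max(rows, key=lambda r: abs(r[1] - r[2]))
    return upper, abs(good - bad)


# Made-up data: 20 cases, label 1 = bad, 0 = good.
scores = [i / 20 for i in range(1, 21)]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1]

threshold, ks = ks_threshold(ks_table(scores, labels))
print(threshold, round(ks, 4))
```

Here the separation peaks at the fifth decile, so its upper limit becomes the cut: cases scoring above it would be treated as bad.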
Truth table to measure accuracy
True Negative Rate = True Negative / Total Actual False (specificity)
True Positive Rate = True Positive / Total Actual True (sensitivity)

                   Actual True       Actual False
Predicted True     True Positive     False Positive
Predicted False    False Negative    True Negative
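A small sketch of how the four cells and the two rates can be computed. The actual and predicted labels are made up for illustration, and the function names are hypothetical.

```python
def truth_table(actual, predicted):
    """Count the four cells of the truth table (1 = True, 0 = False)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # predicted True, actually True
        elif p and not a:
            fp += 1  # predicted True, actually False
        elif a:
            fn += 1  # predicted False, actually True
        else:
            tn += 1  # predicted False, actually False
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """True Positive Rate = True Positive / Total Actual True."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True Negative Rate = True Negative / Total Actual False."""
    return tn / (tn + fp)


# Made-up labels.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = truth_table(actual, predicted)
print(tp, fp, fn, tn, sensitivity(tp, fn), specificity(tn, fp))
```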
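Max sensitivity and specificity
Raising the threshold lowers sensitivity and raises specificity, so the two cannot both be at their maximum. Choose the threshold where they are jointly highest, e.g. where their sum peaks.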
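One way to read this rule is to scan candidate thresholds and keep the one with the largest sensitivity + specificity. That criterion is an assumption; the slides do not say exactly how the two are combined. The scores and labels are made up.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, labels):
    """Candidate threshold with the largest sensitivity + specificity."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: sum(sens_spec(scores, labels, t)))


# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

chosen = best_threshold(scores, labels)
print(chosen, sens_spec(scores, labels, chosen))
```

If false negatives and false positives carry different costs, a weighted sum may be a better criterion than the plain sum used here.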
Goodness of fit: ROC Curve
- The dotted line represents the case where the model has not learnt anything, i.e. it picks the same percentage of false positives and true positives.
- The area under the blue curve therefore represents the goodness of fit (0.5 < Area < 1).
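A minimal sketch of how the curve and its area can be computed: sweep the threshold from high to low, record (false positive rate, true positive rate) at each step, and integrate with the trapezoid rule. The scores and labels are made up, and `roc_points` and `area_under_curve` are hypothetical helper names.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, from strictest threshold down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
auc = area_under_curve(roc_points(scores, labels))

# A model that gives every case the same score lands on the dotted diagonal.
uninformed_auc = area_under_curve(roc_points([0.5] * 4, [1, 0, 1, 0]))
print(auc, uninformed_auc)
```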