Logistic Regression
Why do we need Logistic Regression at all?
Fitting a linear regression to a binary outcome violates the assumptions of Linear Regression!
One assumption says that the residuals should be normally distributed.
With a binary outcome, the error term can only take on two values, so it is impossible for it
to have a normal distribution.
It also violates the assumption of Homoscedasticity!
Homoscedasticity describes a situation in which the variance of the error term is the
same across all values of the independent variables; with a binary outcome, that
variance changes with the predicted probability.
Logistic Regression
Odds
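The definitions behind this slide (standard ones; the formula itself did not survive extraction): the odds of an event with probability p, and the logit that logistic regression models as a linear function of the predictors.

\[
\text{odds} = \frac{p}{1-p},
\qquad
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]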
Weight of Evidence (WoE) and Information Value (IV)
Weight of Evidence
The Weight of Evidence or WoE value is a widely used measure of the “strength” of a
grouping for separating good and bad risk (default). It is computed from the basic
odds ratio:
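In the standard formulation (where group i is one bin of the predictor, and %Goods_i / %Bads_i are the shares of all Goods / Bads falling into that group):

\[
\mathrm{WoE}_i = \ln\!\left(\frac{\%\,\mathrm{Goods}_i}{\%\,\mathrm{Bads}_i}\right)
\]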
Information Value (IV)
The Information Value (IV) of a predictor is a weighted sum of its WoE values
over all groups; it summarizes the predictor's overall separating power.
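Concretely, the usual definition weights each group's WoE by the gap between its Goods and Bads shares:

\[
\mathrm{IV} = \sum_i \left(\%\,\mathrm{Goods}_i - \%\,\mathrm{Bads}_i\right) \times \mathrm{WoE}_i
\]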
Weight of Evidence (WoE) and Information Value (IV)
According to Siddiqi (2006), by convention the values of the IV statistic can be interpreted as follows. If
the IV statistic is:
• Less than 0.02, then the predictor is not useful for modeling (separating the Goods from the Bads)
• 0.02 to 0.1, then the predictor has only a weak relationship to the Goods/Bads odds ratio
• 0.1 to 0.3, then the predictor has a medium-strength relationship to the Goods/Bads odds ratio
• 0.3 or higher, then the predictor has a strong relationship to the Goods/Bads odds ratio.
An IV in the 0.02 to 0.1 range, for example, indicates only a weak relationship to the binary dependent variable.
What are Dummy Variables, Design Variables, Boolean
Indicators, and Proxies?
These are all synonyms for dummy variables.
They encode categorical variables – Male/Female, High/Low bank balance, etc.
They are coded as 1 and 0, as in the table below.
Class   Class_Dummy1   Class_Dummy2
  1          1              0
  1          1              0
  1          1              0
  2          0              1
  2          0              1
  2          0              1
  3          0              0
  3          0              0
  3          0              0
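A minimal sketch of this coding in Python with pandas (the data below just mirror the table above):

```python
import pandas as pd

# Toy data matching the table above: a 3-level Class variable
df = pd.DataFrame({"Class": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# One dummy per level, then drop the Class 3 dummy so that
# Class 3 becomes the reference (all-zero) category
dummies = pd.get_dummies(df["Class"], prefix="Class_Dummy", prefix_sep="")
dummies = dummies.drop(columns="Class_Dummy3").astype(int)
print(pd.concat([df, dummies], axis=1))
```

With k levels you keep k - 1 dummies; the dropped level becomes the baseline that the coefficients are compared against.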
Results and Interpretation
Interpreting the p-values of the independent variables: a predictor with a p-value less than 0.05 (alpha)
should be retained in the model; otherwise, remove it from the model!
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -2.6516           0.6748           15.4424       <.0001
blackd       1     0.5952           0.3939            2.2827       0.1308
whitvic      1     0.2565           0.4002            0.4107       0.5216
serious      1     0.1871           0.0612            9.3342       0.0022
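A hedged sketch of how output like this could be produced in Python with statsmodels; the file name penalty.csv and the outcome column death are assumptions, not given on the slide:

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and outcome column names (hypothetical)
df = pd.read_csv("penalty.csv")

X = sm.add_constant(df[["blackd", "whitvic", "serious"]])
model = sm.Logit(df["death"], X).fit()

# Prints estimates, standard errors, z statistics, and p-values,
# analogous to the SAS table above
print(model.summary())
```

By the p-value rule above, only serious (p = 0.0022) clears the 0.05 threshold; blackd and whitvic would be candidates for removal.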
Baseline, R Square, Max-rescaled R Square, and the C Statistic
What is R square, and what does it mean in Logistic Regression?
Logistic regression has no true R square; pseudo-R-square measures (such as the
max-rescaled R square) tell you how much the goodness of fit improves over the
baseline, intercept-only model!
C statistic – based on the area under the receiver operating characteristic (ROC)
curve
Ranges from 0.5 to 1; the closer to 1, the better the model
Gini – 2 × C statistic - 1
Ranges from 0 to 1; the closer to 1, the better the model
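A small sketch computing the C statistic (area under the ROC curve) and the Gini from it with scikit-learn; the labels and probabilities below are made up:

```python
from sklearn.metrics import roc_auc_score

# Made-up observed outcomes and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.4, 0.7, 0.6, 0.3, 0.8]

c_stat = roc_auc_score(y_true, y_prob)  # C statistic = area under ROC curve
gini = 2 * c_stat - 1                   # Gini = 2*C - 1
print(c_stat, gini)
```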
Check for Multicollinearity!!
Check the VIF / Tolerance to detect multicollinearity
(a VIF above 10, i.e., a Tolerance below 0.1, is a common warning threshold)!!
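One way to compute VIF (and hence Tolerance = 1/VIF) in Python, assuming statsmodels; the toy predictors are made up, with x2 nearly duplicating x1 to trigger a high VIF:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up predictors; x2 is almost a copy of x1, so both get high VIFs
X = add_constant(pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "x3": [3, 1, 4, 1, 5, 9],
}))

for i, col in enumerate(X.columns):
    # Tolerance is simply 1 / VIF
    print(col, variance_inflation_factor(X.values, i))
```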
Results and Interpretation – Classification Table
               Correct            Incorrect                        Percentages
Prob Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False POS  False NEG
0.05            30         47      23          0        77        100.0         67.1       43.4        0.0
0.10            30         53      17          0        83        100.0         75.7       36.2        0.0
0.15            30         55      15          0        85        100.0         78.6       33.3        0.0
0.20            30         60      10          0        90        100.0         85.7       25.0        0.0
0.25            29         61       9          1        90         96.7         87.1       23.7        1.6
0.30            25         62       8          5        87         83.3         88.6       24.2        7.5
0.35            23         62       8          7        85         76.7         88.6       25.8       10.1
0.40            23         63       7          7        86         76.7         90.0       23.3       10.0
0.45            23         63       7          7        86         76.7         90.0       23.3       10.0
0.50            23         63       7          7        86         76.7         90.0       23.3       10.0
Higher sensitivity and specificity at a given cutoff indicate a better fit; a computation sketch follows.
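A sketch of how one row of such a table is computed at a single probability cutoff; the data are made up:

```python
import numpy as np

# Made-up observed outcomes and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.4, 0.7, 0.6, 0.3, 0.8])

cutoff = 0.5
y_hat = (y_prob >= cutoff).astype(int)  # predict an event when prob >= cutoff

sensitivity = ((y_hat == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
specificity = ((y_hat == 0) & (y_true == 0)).sum() / (y_true == 0).sum()
print(sensitivity, specificity)
```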
Results and Interpretation – Predicted Probability
Obs   CURED   INTERVENTION   DURATION   _LEVEL_   pred
  1       0              0          7         1   0.42812
  2       0              0          7         1   0.42812
  3       0              0          6         1   0.43004
  4       1              0          8         1   0.42621
  5       1              1          7         1   0.71991
  6       1              0          6         1   0.43004
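Each pred value is the inverse logit of the linear predictor. A sketch with hypothetical coefficients (chosen here to roughly reproduce the pred column; the slide does not give the actual estimates):

```python
import numpy as np

# Hypothetical coefficients: intercept, INTERVENTION, DURATION
beta = np.array([-0.22, 1.23, -0.01])

# Observation 5 above: INTERVENTION = 1, DURATION = 7
x = np.array([1, 1, 7])          # leading 1 for the intercept
eta = beta @ x                   # linear predictor x'beta
pred = 1 / (1 + np.exp(-eta))    # inverse logit -> predicted probability
print(pred)                      # ~0.72, close to the 0.71991 shown above
```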
Logistic Regression – KS Statistic
KS lies between 0 and 1
The closer to 1, the better the model separates events from non-events
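A sketch of the KS statistic via the ROC curve, using the identity KS = max(TPR - FPR); the data are made up:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up observed outcomes and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.2, 0.4, 0.7, 0.6, 0.3, 0.8]

# KS = maximum gap between the cumulative score distributions of
# events and non-events, i.e. max(TPR - FPR) along the ROC curve
fpr, tpr, _ = roc_curve(y_true, y_prob)
print(np.max(tpr - fpr))
```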