Logistic Regression Insights

Classification problems involve predicting discrete class outcomes and are common in analytics. Logistic regression is widely used for classification problems to predict the probability of class membership based on explanatory variables. It is well-suited for binary and multi-class dependent variables and makes fewer assumptions than other techniques like linear regression. Examples of classification problems include customer profiling, credit risk assessment, and fraud detection.


CLASSIFICATION PROBLEMS

• Classification problems are an important category of problems in analytics in which the response variable (Y) takes a discrete value.

• The primary objective is to predict the class of a customer (or the class probability) based on the values of explanatory variables or predictors.
Classification Problems

• Examples of classification problems:
  - Customer profiling (customer segmentation)
  - Customer churn
  - Credit classification (low, medium and high risk)
  - Employee attrition
  - Fraud detection (classification of transactions as fraud/no-fraud)
  - Stress levels
  - Text classification (sentiment analysis)
Types of Classification Techniques

• Logistic Regression
• Discriminant Analysis
• Support Vector Machine
• Naïve Bayes
• Stochastic Gradient Descent
• Decision Tree
• Random Forest
Why do we need logistic regression when we have linear regression?

• Think about an important metric in marketing: customer retention.

• If Keepmoney Bank wants to use a regression analysis to examine whether it will retain a customer, it will set retention as its dependent variable.
Why do we need logistic regression when we have linear regression?

• Rather than being normally distributed in a bell curve in the manner of continuous variables, however, a 1 will be assigned to represent customer retention and a 0 will represent customer loss. Only those two outcomes are possible.

• This is a situation wherein what you are trying to predict is one of two options.
Why do we need logistic regression when we have linear regression?

• But why can’t Keepmoney use trusted linear regression to determine the likelihood of customer retention given a set of independent variables?

• Linear regressions assume a bell-curve distribution of outcomes (what is known as a normal distribution) from negative infinity to infinity. For a binary variable such as customer retention, there is no curve across a range of outcomes. The outcome can only be 1 or 0.
Why do we need logistic regression when we have linear regression?

• If Keepmoney attempts to use a linear regression to examine customer retention, nonsensical predictions may result. The bank may find its chances of customer retention are greater than 1, meaning it has better than a 100% chance of retaining a customer. Or the bank may find its chances are less than 0.
Customer Choice Behaviour

• Logistic regression is used to represent consumers’ choice behavior as accurately as possible.

• When individual consumers choose products, the value they place on the product does not typically increase linearly with increases in a preferred feature of the product.

• Instead, research indicates consumer valuation of a product typically follows an S-shaped curve with increases in the levels of a preferred attribute.
S-Shaped Curve
• Imagine that on the x-axis we have the level of discount on an INR 10000 plane ticket from Bangalore to New Delhi.

• Ask a group of your friends how many of them would purchase the flight. Then offer a discount of INR 500. How many additional people said they would buy the ticket? Probably not many.

• Increase the discount to INR 1000. Maybe one person half-heartedly jumps in.

• At an INR 3000-4000 discount, you are likely to see a spike in purchasers.

• After that, the number of additional purchasers will taper off, as you have reached the upper threshold.
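This tapering pattern is the logistic (S-shaped) function itself. A minimal sketch in Python that plots such a curve; the coefficients b0 and b1 are made up for illustration, since the slides fit no model to this example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical coefficients chosen only for illustration:
# the slides do not fit a model to the discount example.
b0, b1 = -6.0, 0.002  # intercept and per-rupee discount effect

discount = np.linspace(0, 6000, 200)                 # discount in INR
p_buy = 1 / (1 + np.exp(-(b0 + b1 * discount)))      # logistic response

plt.plot(discount, p_buy)
plt.xlabel("Discount (INR)")
plt.ylabel("Probability of purchase")
plt.title("S-shaped (logistic) response to discount")
plt.show()
```

With these illustrative values the curve is nearly flat up to about INR 1000, rises steeply around INR 3000, and levels off thereafter, matching the story above.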
Logistic Regression: Regression
with a Binary Dependent Variable

Logistic Regression Defined
Logistic Regression . . . is a specialized form of regression that is designed to predict and explain a binary (two-group) categorical variable rather than a metric dependent measure.

• It is less affected than discriminant analysis when the basic assumptions, particularly normality of the independent variables, are not met.
Logistic regression is best suited to address
two research objectives . . .
• Identifying the independent variables that
impact group membership in the dependent
variable.
• Establishing a classification system based on
the logistic model for determining group
membership.

Why Logistic Regression, not Linear Regression
• The binary nature of the dependent variable (0 – 1)
means the error term has a binomial distribution
instead of a normal distribution, and it thus invalidates
all testing based on the assumption of normality.
• The variance of the dichotomous variable is not
constant, creating instances of heteroscedasticity as
well.
• Neither of the above violations can be remedied
through transformations of the dependent or
independent variables. Logistic regression was
developed to specifically deal with these issues.

Limited Assumptions in
Logistic Regression

• The advantages of logistic regression are primarily the result of the general lack of assumptions.
• Logistic regression does not require any specific
distributional form for the independent variables.
• Linear relationships between the dependent and
independent variables are not required.

Logistic Regression - Introduction

• The name logistic regression derives from the logistic distribution function:

$$\frac{e^Z}{1 + e^Z}$$

• Mathematically, logistic regression attempts to estimate the conditional probability of an event (or class probability).
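As a quick sketch, the function can be written in Python (numpy assumed available; the example values echo the rain example used later):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return np.exp(z) / (1 + np.exp(z))

print(logistic(0))       # 0.5: even odds at z = 0
print(logistic(-1.386))  # ~0.2, matching the 20%-chance-of-rain example below
```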
Logistic Regression

• Logistic regression models estimate how the probability of an event may be affected by one or more explanatory variables.

• Logistic regression is a technique used for predicting “class probability”, that is, the probability that the case belongs to a particular class.
Binomial & Multinomial Logistic Regression

• Binomial (or binary) logistic regression is a model in which the dependent variable is dichotomous.

• In a multinomial logistic regression model, the dependent variable can take more than two values.

• The independent variables may be of any type.

Mathematics of Logistic Regression
Concept of Odds

• Probabilities are simply the likelihood that something will happen.

• A probability of rain of .2 means that there is a 20% chance of rain.
Mathematics of Logistic Regression
• Odds are the ratio of the probability that an event will occur divided by the probability that the event will not occur.

• If there is a 20% chance of rain, there is an 80% chance of no rain.

• Odds = Prob(rain)/Prob(no rain) = .2/.8 = .25
Mathematics of Logistic Regression
• Unlike probability, odds can take any non-negative value.

• An 80% chance of rain has odds of .8/.2 = 4.

• ODDS RATIO: the ratio of two odds.

Logit Function

• The logit function is defined as the natural logarithm of the odds.

• The logit of a probability $\pi$ (with value between 0 and 1) is given by:

$$\text{Logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1$$

If there is a 20% chance of rain, the logit is ln(.25) = -1.386.
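A quick check of these conversions, using only the Python standard library:

```python
import math

p = 0.2                  # probability of rain
odds = p / (1 - p)       # 0.2 / 0.8 = 0.25
logit = math.log(odds)   # ln(0.25) = -1.386

print(odds, logit)       # 0.25 -1.3862943611198906
```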


Logistic Transformation

• The logistic regression model is given by:

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_i$$

The left-hand side is a function with linear properties (the link function). Equivalently,

$$\frac{\pi_i}{1-\pi_i} = e^{\beta_0 + \beta_1 X_i}$$

$$\pi_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}$$
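As a sketch of how such a model is typically estimated in practice, assuming the statsmodels package is available; the data here are simulated for illustration and are not from the slides:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: X is a single explanatory variable,
# y is a binary (0/1) outcome such as retention.
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * X)))).astype(int)

# Fit pi_i = exp(b0 + b1*X_i) / (1 + exp(b0 + b1*X_i)) by maximum likelihood.
model = sm.Logit(y, sm.add_constant(X)).fit()
print(model.summary())                          # coefficients, z-stats, p-values
print(model.predict(sm.add_constant(X))[:5])    # estimated class probabilities
```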



Estimation of Logistic Regression Model
and Assessing Overall Fit

• Transforming the dependent variable
• Estimating the coefficients
• Transforming a probability into odds and logit values
• Model estimation
• Assessing the goodness of fit
Transforming a Probability into
Odds and Logit Values

o The logistic transformation has two basic steps:
  - Restating a probability as odds, and
  - Calculating the logit values.
o Instead of using ordinary least squares to estimate the model, the maximum likelihood method is used.
o The basic measure of how well the maximum likelihood estimation procedure fits is the likelihood value.
Logistic Regression Output

[Slide figure: logistic regression output table for the admission example; not reproduced here.]
Test for Significance of the Coefficients

- We use hypothesis testing to see whether a coefficient is significantly different from 0, i.e., whether it has any impact or not.
- Like the t-value in linear regression, here we use the Wald statistic.
- GRE and shopping do not impact the probability of getting admission.
Wald’s Test
Wald’s test is used for checking the statistical significance of individual predictor variables (equivalent to the t-test in the MLR model). The null and alternative hypotheses for Wald’s test are:

H0: $\beta_i = 0$
H1: $\beta_i \neq 0$

Wald’s test statistic is given by

$$W = \left(\frac{\hat{\beta}_i}{S_e(\hat{\beta}_i)}\right)^2$$
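A minimal sketch of the computation, assuming a coefficient estimate and its standard error are already in hand (both values below are made up for illustration):

```python
from scipy import stats

beta_hat = 0.828   # illustrative coefficient estimate
se_beta = 0.31     # illustrative standard error

W = (beta_hat / se_beta) ** 2      # Wald statistic: squared z-ratio
p_value = stats.chi2.sf(W, df=1)   # W ~ chi-square with 1 df under H0

print(W, p_value)
```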
Interpretation of the Coefficients

- This model develops coefficients for the independent variables, similar to linear regression.

- But the interpretations are different.

- Here the equation is:

$$\ln\left(\frac{P(Y=1)}{P(Y=0)}\right) = -4.087909 + 0.827991 \cdot \text{gpa} - 0.13602527 \cdot \text{tier3} - 1.500575 \cdot \text{tier4}$$

where tier3 = 1 if the applicant is from a tier-3 institute (else 0) and tier4 = 1 if from a tier-4 institute (else 0).

- The coefficients $\beta_0, \beta_1, \ldots$ are actually measures of the change in the ratio of the probabilities (the odds).
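Plugging the fitted values into the logistic function gives the class probability. A sketch; the coefficients are from the slide, while the example applicant is hypothetical:

```python
import math

def admission_probability(gpa, tier3, tier4):
    """Class probability from the fitted admission model on this slide."""
    log_odds = -4.087909 + 0.827991 * gpa - 0.13602527 * tier3 - 1.500575 * tier4
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical applicant: GPA of 4.0 from a tier-3 institute.
print(admission_probability(gpa=4.0, tier3=1, tier4=0))  # ~0.29
```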
Directionality of the Relationship
A positive relationship means an increase in the independent variable is associated with an increase in the predicted probability, and vice versa. But the direction of the relationship is reflected differently for the original and exponentiated logistic coefficients.
• Original coefficient signs indicate the direction of the relationship.
• Exponentiated coefficients are interpreted differently since they are the antilogs (exponentials) of the original coefficients and cannot be negative. Thus, exponentiated coefficients above 1.0 represent a positive relationship and values less than 1.0 represent negative relationships.
Magnitude of the Relationship . . .

The magnitude of metric independent variables is interpreted differently for original and exponentiated logistic coefficients:
• Original logistic coefficients are less useful in determining the magnitude of the relationship since they reflect the change in the logit (logged odds) value.
• For every one-unit increase in brand attitude score, we expect a 1.274 increase in the log odds of brand loyalty in the positive direction.
Magnitude of the Relationship . . .
Exponentiated coefficients directly reflect the magnitude of the change in the odds value.
- An exponentiated coefficient of 1.0 denotes no change (1.0 times the old odds = no change).
- The exponentiated coefficient minus 1.0 equals the percentage change in the odds.
- An exponentiated coefficient of .2 indicates a negative 80 percent change in the odds (.20 - 1) for each unit change in the independent variable.
Magnitude of the Relationship . . .

Percentage change in odds = (Exponentiated coefficient - 1) × 100

Exponentiated coeff (e^b):          .2     1.0     1.7
Exponentiated coeff (e^b) - 1:     -.8     0.0      .7
Percent change in odds:           -80%      0%     70%
Calculating the New Odds . . .

New odds value = old odds value × exponentiated coefficient × change in independent variable

• At present the odds are 1, the exponentiated coefficient is 2.35, and the independent variable changes from 5.5 to 7. What would be the new odds?
Calculating the New Odds . . .

• New odds = 1 × 2.35 × (7 - 5.5) = 3.525
• Probability = odds / (1 + odds)
• Odds of 3.525 indicate a probability of 0.779, i.e. 3.525/(1 + 3.525).
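The same arithmetic as a quick check in Python:

```python
old_odds = 1.0
exp_coeff = 2.35
delta_x = 7 - 5.5                            # change in the independent variable

new_odds = old_odds * exp_coeff * delta_x    # 3.525, per the slide's formula
probability = new_odds / (1 + new_odds)      # 0.779

print(new_odds, probability)
```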
Assessing Mobile App Purchasers . . .
Output of Logistic Regression . . .
Model Parameters . . .

[Slide figures: logistic regression output and model parameters for the mobile app purchaser example; not reproduced here.]
Problem . . .

When the customer review average is 3, the odds of a game being a best seller are 3.310. Please answer the two questions below.

• What is the probability of the game being a best seller when the customer review average is 3?

• If the customer review average increases from 3 to 4, what is the increase in the probability of the game being a best seller?
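A sketch of a solution. The first part follows directly from probability = odds/(1 + odds); the second part needs the model's exponentiated coefficient for customer review average, which appears in the output slides not reproduced here, so a placeholder value is used:

```python
odds_at_3 = 3.310
p_at_3 = odds_at_3 / (1 + odds_at_3)   # probability = odds / (1 + odds) = 0.768

# Part 2: new_odds = old_odds * exp_coeff for a one-unit increase.
exp_coeff = 2.0                        # PLACEHOLDER: read off the model output
odds_at_4 = odds_at_3 * exp_coeff
p_at_4 = odds_at_4 / (1 + odds_at_4)

print(p_at_3, p_at_4 - p_at_3)
```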
Classification Accuracy

- This shows how well the group memberships are predicted.

- It develops a hit ratio, which is the percentage of cases correctly classified. This is known as CA (Classification Accuracy).
Accuracy Paradox
• Assume an example of insurance fraud. Past data has revealed that out of 1000 claims in the past, 950 are true claims and 50 are fraudulent claims.

• The classification table using a logistic regression model is given below:

Observed    Predicted 0    Predicted 1    % accuracy
0               900             50          94.73%
1                 5             45          90.00%

• The overall accuracy is 94.5%. Classifying all of them as true claims would give 95% accuracy!
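The paradox can be verified from the table (fraud is treated as the positive class):

```python
# Confusion matrix from the slide: rows = observed, columns = predicted.
tn, fp = 900, 50   # observed 0 (true claims)
fn, tp = 5, 45     # observed 1 (fraudulent claims)

total = tn + fp + fn + tp
model_accuracy = (tn + tp) / total   # 0.945
naive_accuracy = (tn + fp) / total   # 0.95: always predict "true claim"

print(model_accuracy, naive_accuracy)
```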
Sensitivity, Specificity and Precision
• The ability of the model to correctly classify positives and negatives is called sensitivity and specificity, respectively.
• The terminologies sensitivity and specificity originated in medical diagnostics.
• In the generic case:

Sensitivity = P(model classifies Yi as positive | Yi is positive)

Sensitivity is calculated using the following equation:

$$\text{Sensitivity} = \frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Negative (FN)}}$$

where True Positive (TP) is the number of positives correctly classified as positives by the model and False Negative (FN) is the number of positives incorrectly classified as negatives.
Specificity
• Specificity is the ability of the diagnostic test to correctly classify the test as negative when the disease is not present. That is:

Specificity = P(diagnostic test is negative | patient has no disease)

• In general:

Specificity = P(model classifies Yi as negative | Yi is negative)

Specificity can be calculated using the following equation:

$$\text{Specificity} = \frac{\text{True Negative (TN)}}{\text{True Negative (TN)} + \text{False Positive (FP)}}$$

where True Negative (TN) is the number of negatives correctly classified as negatives by the model and False Positive (FP) is the number of negatives incorrectly classified as positives.
• The decision maker has to consider the tradeoff between sensitivity and specificity to arrive at an optimal cut-off probability.
• Precision measures the accuracy of the positives classified by the model:

Precision = P(patient has disease | diagnostic test is positive)

$$\text{Precision} = \frac{\text{True Positive (TP)}}{\text{True Positive (TP)} + \text{False Positive (FP)}}$$

• F-Score (F-Measure) is another measure used in binary logistic regression that combines both precision and recall (recall is another name for sensitivity) and is given by:

$$F\text{-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
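Applied to the insurance-fraud table from the accuracy paradox slide (fraud = positive class), these formulas give:

```python
tp, fn = 45, 5     # fraudulent claims: correctly / incorrectly classified
tn, fp = 900, 50   # true claims: correctly / incorrectly classified

sensitivity = tp / (tp + fn)   # recall = 0.90
specificity = tn / (tn + fp)   # 0.947
precision = tp / (tp + fp)     # 0.474
f_score = 2 * precision * sensitivity / (precision + sensitivity)  # 0.621

print(sensitivity, specificity, precision, f_score)
```

Note how a model with 94.5% overall accuracy still has a precision below 0.5 on the rare positive class.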
Concordant and Discordant Pairs

• Discordant Pairs: A pair of positive and negative observations for which the model has no cut-off probability to classify both of them correctly is called a discordant pair.

• Concordant Pairs: A pair of positive and negative observations for which the model has a cut-off probability to classify both of them correctly is called a concordant pair.
Receiver Operating Characteristics (ROC)
Curve
• The ROC curve is a plot of sensitivity (true positive rate) on the vertical axis against 1 - specificity (false positive rate) on the horizontal axis.
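A minimal sketch of drawing the ROC curve and computing AUC with scikit-learn; the labels and probabilities below are illustrative, not from the slides:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: observed 0/1 labels; y_prob: model's estimated class probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]                      # illustrative
y_prob = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # sweep over cut-offs
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")           # chance line
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()
```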
Area Under ROC Curve (AUC), Lorenz Curve and Gini Coefficient

$$\text{Gini coefficient} = \frac{\text{Area A}}{\text{Area A} + \text{Area B}}$$

[Slide figure: Lorenz curve with both axes from 0 to 1. Area A lies between the ideal wealth distribution (the diagonal) and the actual wealth distribution curve; Area B lies below the actual wealth distribution curve.]
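For a classifier, the Gini coefficient is related to AUC by Gini = 2 × AUC - 1, a standard identity (not stated on the slides). Continuing the illustrative values from the ROC sketch above:

```python
from sklearn.metrics import roc_auc_score

# Same illustrative labels and probabilities as the ROC sketch.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

auc = roc_auc_score(y_true, y_prob)
gini = 2 * auc - 1        # Gini coefficient of the classifier
print(auc, gini)
```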
