Telecom Customer Churn Prediction Model

The document summarizes a project to build a predictive model for customer churn using logistic regression, KNN, and Naive Bayes models. It includes: 1) Exploratory data analysis of a telecom customer dataset to understand patterns and relationships between variables. 2) Building an initial logistic regression model and refining it by addressing multicollinearity. 3) Evaluating the logistic regression model on test data and interpreting the results.
Telecom

Customer Churn Prediction Modelling

R Venkataraman

24th May 2020



PGP BABI

Group 5
Index

Project Description & Objective

Project Report

Reference
Telecom Customer Churn Prediction Assessment

Project Description

Customer churn is a pressing problem for telecom companies. In this project, we simulate
one such case using data on postpaid customers with a contract. The data covers customer
usage behavior, contract details and payment details, and indicates which customers
canceled their service. Based on this historical data, we need to build a model that
predicts whether a customer will cancel their service in the future.

Project Objective

• EDA - Basic data summary, Univariate, Bivariate analysis, graphs
• EDA - Check for Outliers and missing values and check the summary of the dataset
• EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
• EDA - Summarize the insights you get from EDA
• Applying Logistic Regression
• Interpret Logistic Regression
• Applying KNN Model
• Interpret KNN Model
• Applying Naive Bayes Model
• Interpret Naive Bayes Model
• Confusion matrix interpretation for all models
• Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
• Remarks on Model validation exercise <Which model performed the best>
• Actionable Insights and Recommendations


Project Report

EDA - Basic data summary, Univariate, Bivariate analysis, graphs

With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data is as follows:

The data set has 10 independent variables and 1 dependent variable (‘Churn’).
It has 3333 rows, which are split into training and test sets for model building.

Data Description:

Churn             1 if customer cancelled service, 0 if not
AccountWeeks      number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
DataUsage         gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
MonthlyCharge     average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes

Initial summary of the data


Data structure just after loading

Missing values checked and results as below:

No missing values
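
A check of this kind can be sketched as follows. The data frame `toy` and its values are made up for illustration; in the report the same calls would be applied to `mycell`.

```r
# Made-up data frame standing in for the churn data (illustration only)
toy <- data.frame(Churn    = c(0, 1, 0, NA),
                  DayMins  = c(120.5, NA, 98.2, 175.0),
                  DayCalls = c(100, 88, 110, 95))

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only when no value is missing anywhere in the data frame
no_missing <- !anyNA(toy)
print(no_missing)
```

A result of all zeros from `colSums(is.na(...))` corresponds to the "No missing values" finding above.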

The binary numeric variables are converted to factors:

mycell$Churn=as.factor(mycell$Churn)
mycell$ContractRenewal=as.factor(mycell$ContractRenewal)
mycell$DataPlan=as.factor(mycell$DataPlan)

Final summary of the dataset


Univariate Analysis

14% of the customers (483) have cancelled service while 86% (2850) have continued.

Customers by account weeks:

Weeks        #Customers   %
<25          94           3
25-49        232          7
50-74        531          16
75-99        770          23
100-150      1350         41
>150         356          11

90% of the customers have renewed their service recently.


72% of the customers do not have a data plan, while 28% have one.

Data Usage:
Mean = 0.81 with Std.dev=1.2727

No of outliers: 11

Service calls:
Mean = 1.56 with Std.dev=1.3154

No of outliers: 267


Daytime usage:
Mean = 179.8 with Std.dev=54.467

No of outliers: 25

Daycalls:
Mean =100.4 with Std.dev=20.069

No of outliers: 23

Monthly charges:
Mean=56.31 with Std.dev=16.426

No of outliers: 34


Overage Fee:
Mean = 10.05 with Std.dev=2.535

No of outliers: 24

Roaming minutes:
Mean=10.24 with Std.dev=2.791

No of outliers: 46

Outliers are not treated in this exercise, as this was not explicitly asked for.
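
The report does not state how the outliers were counted. One common convention is the boxplot rule, which flags points lying more than 1.5 × IQR beyond the quartiles; the sketch below assumes that rule and uses made-up values.

```r
# Made-up usage values (illustration only)
x <- c(1.2, 0.0, 0.3, 2.7, 0.0, 0.4, 3.1, 0.2, 9.5, 0.0)

# Boxplot rule (an assumption): points beyond 1.5 * IQR from the quartiles
q <- quantile(x, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(x)
fence_high <- q[[2]] + 1.5 * IQR(x)

outliers   <- x[x < fence_low | x > fence_high]
n_outliers <- length(outliers)
print(n_outliers)
```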

Bivariate Analysis

We will analyze how the independent variables relate to the dependent variable
(Customer Churn).

Accountweeks Vs Customer Churn

No major trend in this


Contract Renewal Vs Churn

42% probability of customer churn if Contract is not renewed.
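
A churn share per group, such as the figure above, can be read from a row-normalized cross-tabulation. This is a minimal sketch on made-up records, not the report's data.

```r
# Made-up records (illustration only): 1 = renewed / churned, 0 = not
toy <- data.frame(
  ContractRenewal = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  Churn           = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Row-wise proportions: share of churners within each renewal group
churn_rate <- prop.table(table(toy$ContractRenewal, toy$Churn), margin = 1)
print(round(churn_rate, 2))
```

With `margin = 1` each row sums to 1, so the column for Churn = 1 gives the churn share within each ContractRenewal group.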

Dataplan Vs Churn

Customers without a data plan have a higher probability of churn than those with one.


Datausage Vs Churn

High probability of churn if data usage is very low or 0.

Customer service calls Vs Churn

If the number of calls is between 3 and 7, the churn % is higher.

Daytime usage Vs Churn

As daytime usage rises above 120 minutes, the probability of churn also increases.


Daycalls Vs Churn

Higher probability of customer churn if the day calls are between 66 and 132.

Monthly charge Vs Churn

Higher probability of customer churn when the monthly charge is between 33 and 78.



Overage Fee Vs Churn

Probability of churn is higher in the overage fee range of 7.28 to 14.6.

Roaming Vs Churn

Probability of churn is higher in the range of 8 to 14 roaming minutes.


Multicollinearity:

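library(psych)     # cor.plot
library(corrplot)  # corrplot
# mycellcor: correlation matrix of the variables, computed beforehand
cor.plot(mycellcor)
corrplot(mycellcor,method="number")
corrplot(mycellcor,method="ellipse")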

Multicollinearity exists between DataUsage and DataPlan, and between MonthlyCharge and
both DataPlan and DataUsage. It will be treated after confirming the VIF values in the
logistic regression using vif_func(). The variables dropped as a result are also excluded
from the KNN and Naïve Bayes models.
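
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on made-up predictors (`x1` to `x4` are hypothetical names); it is not the vif_func script used in this report.

```r
set.seed(1)
n <- 200

# Made-up predictors: x3 is built to be nearly a linear mix of x1 and x2
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.7 * x1 + 0.7 * x2 + rnorm(n, sd = 0.1)
x4 <- rnorm(n)
X  <- data.frame(x1, x2, x3, x4)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 1))
```

Predictors that are close to a linear combination of the others get a large VIF, while an unrelated predictor such as `x4` stays near 1.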


Model Building

First logistic regression model will be built on the dataset.

Train and Test data sets are created with a split of 70% & 30%.

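library(caTools)  # sample.split
set.seed(101)
# sample.split expects the label vector, so that the split is stratified by Churn
mysplit=sample.split(mycell$Churn,SplitRatio = 0.7)
mycell_train=subset(mycell[,c(-12)],mysplit==TRUE)
mycell_test=subset(mycell[,c(-12)],mysplit==FALSE)
str(mycell_train)
str(mycell_test)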
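
The idea behind a 70/30 split that preserves the churn share in both sets can be sketched in base R. The label vector below is made up, with a churn share chosen to be roughly that of this dataset.

```r
set.seed(101)

# Made-up labels (illustration only): 850 non-churners, 150 churners
churn <- rep(c(0, 1), c(850, 150))

# Draw 70% of the row indices within each class, so both sets keep the class mix
idx_by_class <- split(seq_along(churn), churn)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, round(0.7 * length(i)))))
test_idx  <- setdiff(seq_along(churn), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
print(mean(churn[train_idx]))  # churn share in the training set
```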

First draft run of the logistic regression done on the training dataset with all the
columns included in the regression and the corresponding VIF is checked to act upon
the multicollinearity.

library(car)  # vif(); assumed source package, the report does not name it
myglm1=glm(Churn~.,data=mycell_train,family="binomial")
summary(myglm1)
vif(myglm1)


Vif output:

As indicated by the correlation plot, the DataPlan, DataUsage, DayMins and MonthlyCharge
variables have high VIF values.

We will use stepwise reduction with vif_func to remove variables:

vif_func=dget("vif_func.R")
myvif=vif_func(in_frame=mycell_train,thresh=5,trace=TRUE)

Based on the above output, we will drop MonthlyCharge and DataUsage from the regression.


New regression built by removing the above two variables(multicollinearity effect)

myglm2=glm(Churn~. -MonthlyCharge -DataUsage,data=mycell_train,family="binomial")
summary(myglm2)
vif(myglm2)

The VIF values are now within accepted limits.

The variables AccountWeeks and DayCalls are not significant in the regression and can be
dropped from the equation.

Final regression is built on the training set by removing these two variables.

myglm3=glm(Churn~. -MonthlyCharge -DataUsage -AccountWeeks -DayCalls,
           data=mycell_train,family="binomial")
summary(myglm3)
vif(myglm3)


Summary of the final regression in the training dataset:

VIF output

We will plot the predicted probabilities on the training dataset to choose the threshold.


Based on the plot above, we will use a cutoff of 0.16 to predict.

pred2.churn=ifelse(pred.test>0.16,1,0)

Confusion matrix on the training data set:

As the confusion matrix measures look acceptable after the threshold adjustment, we will
apply the model to the test dataset with the same threshold of 0.16.
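
How a cutoff turns predicted probabilities into a confusion matrix can be sketched as follows. The probabilities and labels are made up, and churn (1) is treated as the positive class, which is an assumption; the report's own confusion matrices may use the other class as positive.

```r
# Made-up predicted probabilities and true labels (illustration only)
prob   <- c(0.05, 0.10, 0.12, 0.30, 0.45, 0.08, 0.22, 0.60, 0.15, 0.02)
actual <- c(0,    0,    0,    1,    1,    0,    0,    1,    1,    0)
cutoff <- 0.16

# Classify as churn when the predicted probability exceeds the cutoff
pred <- ifelse(prob > cutoff, 1, 0)
cm <- table(Predicted = factor(pred,   levels = c(0, 1)),
            Actual    = factor(actual, levels = c(0, 1)))
print(cm)

# Churn (1) is taken as the positive class here (an assumption)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
specificity <- cm["0", "0"] / sum(cm[, "0"])
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Lowering the cutoff flags more customers as churners, which raises sensitivity for the churn class at the cost of specificity.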

ROC Plot:

AUC : 81.25%

GINI: 62.5%
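
The two figures are linked by GINI = 2 × AUC - 1. A minimal base R sketch, using the rank (Mann-Whitney) formula for AUC on made-up scores rather than the package used for the ROC plot in this report:

```r
# Made-up scores and labels (illustration only)
prob   <- c(0.05, 0.10, 0.12, 0.30, 0.45, 0.08, 0.22, 0.60, 0.15, 0.02)
actual <- c(0, 0, 0, 1, 1, 0, 0, 1, 1, 0)

# AUC via the rank (Mann-Whitney) formula
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(prob)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Gini coefficient derived from AUC
gini <- 2 * auc - 1
print(c(auc = auc, gini = gini))

# The same relation applied to the AUC reported above
gini_report <- 2 * 0.8125 - 1   # 0.625
```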


Confusion Matrix

KS Plots

KS Value: 0.5268
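
The KS statistic is the largest gap between the true positive rate and the false positive rate over all cutoffs. The sketch below computes it directly on made-up scores; it is not the routine used to produce the plot in this report.

```r
# Made-up scores and labels (illustration only)
prob   <- c(0.05, 0.10, 0.12, 0.30, 0.45, 0.08, 0.22, 0.60, 0.15, 0.02)
actual <- c(0, 0, 0, 1, 1, 0, 0, 1, 1, 0)

# Evaluate every observed score as a candidate cutoff
cutoffs <- sort(unique(prob))
tpr <- sapply(cutoffs, function(t) mean(prob[actual == 1] >= t))
fpr <- sapply(cutoffs, function(t) mean(prob[actual == 0] >= t))

# KS = maximum separation between the two cumulative rates
ks <- max(tpr - fpr)
best_cutoff <- cutoffs[which.max(tpr - fpr)]
print(c(ks = ks, cutoff = best_cutoff))
```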


Summary of the final logistic regression model

Odds Ratio

Interpretation of Logistic regression model / Odds ratio:

Independent variables with high positive Z values (CustServCalls, DayMins, OverageFee and
RoamMins) are highly significant and increase the likelihood of churn.

The odds ratios show a negative relationship with churn for ContractRenewal and DataPlan,
and a positive relationship for CustServCalls, DayMins, OverageFee and RoamMins. For the
variables with a positive relationship, each one-unit increase multiplies the odds of
churn by the factor shown in the OR table.
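
Odds ratios are the exponentiated logistic regression coefficients. A minimal sketch on made-up data, where `calls` and `churn` are hypothetical stand-ins for CustServCalls and Churn:

```r
set.seed(42)
n <- 500

# Made-up data: more service calls raise the odds of churn (illustration only)
calls <- rpois(n, 1.5)
churn <- rbinom(n, 1, plogis(-2.5 + 0.5 * calls))

fit <- glm(churn ~ calls, family = "binomial")

# Exponentiated coefficients are odds ratios per one-unit increase
odds_ratio <- exp(coef(fit))
print(round(odds_ratio, 2))
```

An odds ratio above 1 means the odds of churn rise with the variable; a value below 1 means they fall.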

Variable importance:

The model has an accuracy of 82.6%, with sensitivity of 87% and specificity of 60%.
Area under the curve: 81.25% and GINI: 62.5%
KS score of the model: 53%


KNN Model

Before creating the model, the dataset is normalized.

# Min-max scaling of each column to the range [0, 1]
norm = function(x) { (x- min(x))/(max(x) - min(x)) }
mycell_orig.data = as.data.frame(lapply(mycell_orig, norm))
mycell_norm.data = cbind(mycell_orig[,1], mycell_orig.data)

Then from the normalized dataset, training and test split is done.

The best value of K is found using the train function. We use the same independent
variables that were retained in the logistic regression (after the multicollinearity and
significance checks).

This returns a best fit of K=5.

We will use K=5 for the model building.
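
The search for K can be illustrated with leave-one-out cross-validation. This sketch swaps in `knn.cv` from the class package for the train function used in this report, and runs on made-up two-class data.

```r
library(class)  # knn.cv: leave-one-out cross-validated k-NN

set.seed(7)
n <- 100

# Made-up two-class data with two numeric predictors (illustration only)
x <- rbind(matrix(rnorm(n * 2, mean = 0), ncol = 2),
           matrix(rnorm(n * 2, mean = 2), ncol = 2))
y <- factor(rep(c(0, 1), each = n))

# Cross-validated accuracy for each candidate K
ks <- c(1, 3, 5, 7, 9)
cv_acc <- sapply(ks, function(k) mean(knn.cv(x, y, k = k) == y))
best_k <- ks[which.max(cv_acc)]

print(data.frame(k = ks, accuracy = round(cv_acc, 3)))
print(best_k)
```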

library(class)  # knn
mypred1 = knn(mycellknn_train[,c(3,4,6,7,10,11)], mycellknn_test[,c(3,4,6,7,10,11)],
              mycellknn_train$Churn, k = 5)

Confusion matrix


ROC Plots

AUC : 73.73%

GINI : 47.45%

KS Plots:

KS Statistic : 47.45


Interpretation of KNN Model

Variable importance of the model can be found using varImp as below

Variable importance:

The model has an accuracy of 91% with sensitivity of 99% and specificity of 48%.
AUC=73.73, GINI = 47.45 and KS =47.45

With a small value of K, more noise is built into the classification (overfitting),
whereas a large K oversmooths the class boundary and can underfit.

Naïve Bayes Model

We use the same test and training split done for the logistic regression.

Plotting the prediction on the training dataset.


After a few iterations, a threshold of 0.15 was found to give the best results.

nb_train.churn=ifelse(nbpred1[,2]>.15,1,0)
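
The effect of the threshold on Naive Bayes posteriors can be sketched by hand. The example assumes a Gaussian class-conditional density and a single made-up predictor (`mins`); the report does not state which Naive Bayes implementation produced `nbpred1`.

```r
set.seed(3)

# Made-up data: one numeric predictor, binary churn label (illustration only)
mins  <- c(rnorm(85, 170, 40), rnorm(15, 220, 40))
churn <- rep(c(0, 1), c(85, 15))

# Class priors from the label frequencies
p1 <- mean(churn == 1)
p0 <- 1 - p1

# Gaussian class-conditional densities (an assumption)
lik0 <- dnorm(mins, mean(mins[churn == 0]), sd(mins[churn == 0]))
lik1 <- dnorm(mins, mean(mins[churn == 1]), sd(mins[churn == 1]))

# Posterior probability of churn by Bayes' rule
post1 <- p1 * lik1 / (p0 * lik0 + p1 * lik1)

# A cutoff below 0.5 flags at least as many customers as churners
flag_15 <- ifelse(post1 > 0.15, 1, 0)
flag_50 <- ifelse(post1 > 0.50, 1, 0)
print(c(flagged_at_0.15 = sum(flag_15), flagged_at_0.50 = sum(flag_50)))
```

Because churners are the minority class, their posterior probabilities tend to be low, which is why a cutoff well below 0.5 can give a better balance.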

ROC & AUC on the training dataset

Confusion matrix

AUC = 87%

As the model performance measures are good on the training set, we will apply the model
to the test data set.


Prediction plot: threshold of > .15

ROC Plot

Confusion Matrix

AUC = 85.69%
GINI = 71.38%


We will calculate the KS statistic.

KS= 67.23

Interpretation of Naïve Bayes model:

This model has an accuracy of 86%, with sensitivity and specificity of 90% and 60%.
AUC = 86%, GINI = 71% and KS = 67%

Model performance Measures chart:

Measures            Logistic Regression   KNN Model   Naïve Bayes
Confusion Matrix
Accuracy            80%                   91%         85%
Sensitivity         82%                   99%         86%
Specificity         67%                   48%         81%
Balanced Accuracy   74%                   74%         84%
AUC                 81%                   74%         86%
GINI                63%                   47%         71%
KS                  53%                   47%         67%
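
Balanced accuracy is the mean of sensitivity and specificity. A short sketch using the logistic regression figures from the chart above; the function name `balanced_accuracy` is introduced here for illustration only.

```r
# Balanced accuracy = average of sensitivity and specificity
balanced_accuracy <- function(sensitivity, specificity) {
  (sensitivity + specificity) / 2
}

# Logistic regression figures from the chart: sensitivity 82%, specificity 67%
ba_logistic <- balanced_accuracy(0.82, 0.67)
print(ba_logistic)
```

Unlike plain accuracy, this measure is not inflated by the large share of non-churners in the data.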


On the confusion matrix measures, KNN scores highest on accuracy and sensitivity but
lowest on specificity, while NB is the most balanced overall, closely followed by
logistic regression.

On the other measures (AUC, GINI and KS), NB outscores both logistic regression and KNN.

Result: NB is the best model for this dataset based on the above measures.

Actionable insights & Recommendations:

Based on the variable importance parameters and bi-variate analysis, it is evident that:

Customers with higher daytime usage have a higher probability of churning. They may be
looking for faster or better service; they should be identified and offered free
additional data plans or other features.

Customers who make more calls to customer service have a higher probability of churning.
The company can look into better service models, self-service portals/apps, and incident
analysis of the calls to pre-empt customers with solutions.

Customers who have not renewed their contract recently have a higher probability of
churning. The company can look into attractive new packages, longer-duration contracts,
etc. to lock in these customers.


References:

Great Learning Videos & Course Materials

CRAN package documentation
