Telecom
Customer Churn Prediction Modelling
R Venkataraman
24th May 2020
—
PGP BABI
—
Group 5
Index
Project Description & Objective
Project Report
Reference
Telecom Customer Churn Prediction Assessment
Project Description
Customer churn is a pressing problem for telecom companies. In this project, we simulate
one such case using data on postpaid customers with a contract. The data covers customer
usage behavior, contract details and payment details, and indicates which customers
canceled their service. Based on this past data, we need to build a model that predicts
whether a customer will cancel their service in the future.
Project Objective
• EDA - Basic data summary, Univariate, Bivariate analysis, graphs
• EDA - Check for Outliers and missing values and check the summary of the dataset
• EDA - Check for Multicollinearity - Plot the graph based on Multicollinearity & treat it.
• EDA - Summarize the insights you get from EDA
• Applying Logistic Regression
• Interpret Logistic Regression
• Applying KNN Model
• Interpret KNN Model
• Applying Naive Bayes Model
• Interpret Naive Bayes Model
• Confusion matrix interpretation for all models
• Interpretation of other Model Performance Measures for logistic <KS, AUC, GINI>
• Remarks on Model validation exercise <Which model performed the best>
• Actionable Insights and Recommendations
Project Report
EDA - Basic data summary, Univariate, Bivariate analysis, graphs
With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data follows:
We have 10 independent variables and 1 dependent variable (‘Churn’) in the given dataset.
We have 3333 rows, which can be split into training and test datasets for model building.
Data Description:
Variable          Description
Churn             1 if customer cancelled service, 0 if not
AccountWeeks      number of weeks customer has had active account
ContractRenewal   1 if customer recently renewed contract, 0 if not
DataPlan          1 if customer has data plan, 0 if not
DataUsage         gigabytes of monthly data usage
CustServCalls     number of calls into customer service
DayMins           average daytime minutes per month
DayCalls          average number of daytime calls
MonthlyCharge     average monthly bill
OverageFee        largest overage fee in last 12 months
RoamMins          average number of roaming minutes
Initial summary of the data
Data structure just after loading
Missing values were checked, with results as below:
No missing values
The categorical variables were converted from numeric to factor:
mycell$Churn=as.factor(mycell$Churn)
mycell$ContractRenewal=as.factor(mycell$ContractRenewal)
mycell$DataPlan=as.factor(mycell$DataPlan)
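The missing-value check and the factor conversion above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame `toy` standing in for `mycell`; the values are invented and are not project data.

```r
# Toy stand-in for mycell; values are invented for illustration only.
toy <- data.frame(
  Churn           = c(0, 1, 0, 0, 1),
  ContractRenewal = c(1, 0, 1, 1, 1),
  DataPlan        = c(0, 0, 1, 1, 0),
  DayMins         = c(180.5, 240.1, NA, 150.0, 210.7)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Convert the 0/1 indicator columns to factors in one pass.
factor_cols <- c("Churn", "ContractRenewal", "DataPlan")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)
str(toy)
```

On the project data, a result of all zeros from `colSums(is.na(mycell))` would correspond to the "No missing values" finding.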
Final summary of the dataset
Univariate Analysis
14% of the customers (483) have
cancelled service while 86% (2850)
have continued.
Customers by account weeks:
Weeks        #Customers   %
<25 weeks    94           3
25-49        232          7
50-74        531          16
75-99        770          23
100-150      1350         41
>150         356          11
90% of the customers have
renewed their service recently
72% of the customers do not have a data plan,
while 28% have one.
Data usage:
Mean = 0.81 with Std. dev = 1.2727
No. of outliers: 11
Service calls:
Mean = 1.56 with Std. dev = 1.3154
No. of outliers: 267
Daytime usage:
Mean = 179.8 with Std. dev = 54.467
No. of outliers: 25
Day calls:
Mean = 100.4 with Std. dev = 20.069
No. of outliers: 23
Monthly charges:
Mean = 56.31 with Std. dev = 16.426
No. of outliers: 34
Overage fee:
Mean = 10.05 with Std. dev = 2.535
No. of outliers: 24
Roaming minutes:
Mean = 10.24 with Std. dev = 2.791
No. of outliers: 46
Outliers are not treated in this exercise, as treatment was not explicitly asked
for.
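The report does not state how the outlier counts were obtained. A common choice, assumed here, is the boxplot rule (points more than 1.5 times the interquartile range beyond the hinges), which base R exposes through `boxplot.stats`. The vector below is illustrative only.

```r
# Count outliers with the boxplot rule (beyond 1.5 * IQR from the hinges).
# Assumption: the report's counts came from a rule like this; it is not stated.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Illustrative vector, not project data: 1..20 plus two extreme values.
x <- c(1:20, 60, 100)
n_out <- count_outliers(x)
print(n_out)

# The same helper can be applied to every numeric column of a data frame.
toy <- data.frame(a = x, b = c(rep(5, 21), 50))
print(sapply(toy[sapply(toy, is.numeric)], count_outliers))
```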
Bivariate Analysis
We will analyze how the independent variables relate to the dependent variable
(Customer Churn)
Accountweeks Vs Customer Churn
No major trend in this
Contract Renewal Vs Churn
42% probability of customer churn if Contract is not renewed.
Dataplan Vs Churn
Customers without a data plan have a higher probability of churn than those with one.
Datausage Vs Churn
High probability of churn if data usage is very low or zero.
Customer service calls Vs Churn
If the number of calls is between 3 and 7, the churn percentage is higher.
Daytime usage Vs Churn
As daytime usage rises above 120, the probability of churn also increases.
Daycalls Vs Churn
Higher probability of customer churn if the day calls are between 66 and 132.
Monthly charge Vs Churn
Higher probability of customer churn when the monthly charge is between 33 and 78.
There is
a
Overage Fee Vs Churn
The probability of churn is higher for overage fees between 7.28 and 14.6.
Roaming Vs Churn
The probability of churn is higher for roaming minutes between 8 and 14.
Multicollinearity:
cor.plot(mycellcor)
corrplot(mycellcor,method="number")
corrplot(mycellcor,method="ellipse")
Multicollinearity exists between DataUsage and DataPlan, and between MonthlyCharge and both
DataPlan and DataUsage. It will be treated after confirming the VIF values in the logistic
regression using vif_func(). The variables dropped as a result will also be excluded from
the KNN and Naïve Bayes models.
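As a rough sketch of how highly correlated pairs can be listed from a correlation matrix, the base R snippet below uses simulated columns in which a charge is driven by data usage. The data, the 0.7 cut-off and the names `toy` and `high_pairs` are assumptions for illustration; they are not taken from the project.

```r
set.seed(1)
n <- 200
data_usage <- runif(n, 0, 5)

# Simulated stand-ins: the charge depends strongly on usage, roaming does not.
toy <- data.frame(
  DataUsage     = data_usage,
  MonthlyCharge = 30 + 10 * data_usage + rnorm(n, sd = 2),
  RoamMins      = rnorm(n, mean = 10, sd = 3)
)

cor_mat <- cor(toy)

# Keep each pair once (upper triangle) and list those above the cut-off.
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```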
Model Building
First logistic regression model will be built on the dataset.
Train and Test data sets are created with a split of 70% & 30%.
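library(caTools)  # provides sample.split()
set.seed(101)
# sample.split() expects the outcome vector, so the split is stratified on Churn
mysplit=sample.split(mycell$Churn,SplitRatio = 0.7)
mycell_train=subset(mycell[,c(-12)],mysplit==TRUE)
mycell_test=subset(mycell[,c(-12)],mysplit==FALSE)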
str(mycell_train)
str(mycell_test)
A first draft of the logistic regression is run on the training dataset with all
columns included, and the corresponding VIF is checked to address
multicollinearity.
myglm1=glm(Churn~.,data=mycell_train,family="binomial")
summary(myglm1)
vif(myglm1)
Vif output:
As indicated by the correlation plot, the DataPlan, DataUsage, DayMins and MonthlyCharge
variables have high VIF values.
We will use stepwise reduction with vif_func to remove variables:
vif_func=dget("vif_func.R")
myvif=vif_func(in_frame=mycell_train,thresh=5,trace=TRUE)
Based on the above output, we will drop MonthlyCharge and
DataUsage from the regression.
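The contents of vif_func.R are not shown in this report. The underlying quantity, however, is standard: the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated data; the helper `vif_manual` and the data are hypothetical and are not the author's function.

```r
# VIF of each column: regress it on the remaining columns, then 1 / (1 - R^2).
# vif_manual is a hypothetical helper, not the vif_func used in the report.
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
n <- 200
data_usage <- runif(n, 0, 5)
toy <- data.frame(
  DataUsage     = data_usage,
  MonthlyCharge = 30 + 10 * data_usage + rnorm(n, sd = 2),
  RoamMins      = rnorm(n, mean = 10, sd = 3)
)

vifs <- vif_manual(toy)
print(round(vifs, 2))

# Variables above a chosen threshold (5, as in the call above) are candidates
# for removal, typically one at a time with the VIFs recomputed after each drop.
print(names(vifs)[vifs > 5])
```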
A new regression is built after removing the above two variables to address multicollinearity:
myglm2=glm(Churn~. -MonthlyCharge -DataUsage,data=mycell_train,family="binomial")
summary(myglm2)
vif(myglm2)
The VIF is also checked and is within accepted values.
The variables AccountWeeks and DayCalls are not significant in the regression and can be
dropped from the equation.
The final regression is built on the training set after removing these two variables.
myglm3=glm(Churn~. -MonthlyCharge -DataUsage -AccountWeeks -DayCalls,
data=mycell_train,family="binomial")
summary(myglm3)
vif(myglm3)
Summary of the final regression in the training dataset:
VIF output
We will plot the prediction to check the threshold on the training dataset.
Based on the plot above, we will use a cut-off of 0.16 to predict.
pred2.churn=ifelse(pred.test>0.16,1,0)
Confusion matrix on the training data set:
As the confusion matrix measures look acceptable after the threshold adjustment, we
apply the same threshold of 0.16 to the test dataset.
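To make the thresholding step concrete, the sketch below turns predicted probabilities into classes at a 0.16 cut-off and derives accuracy, sensitivity and specificity from the resulting confusion matrix. The probabilities and labels are invented. Churn = 1 is treated as the positive class here, which is an assumption; the report does not state which class its sensitivity figures refer to.

```r
# Invented predicted probabilities and true labels (1 = churn).
prob   <- c(0.05, 0.10, 0.18, 0.30, 0.12, 0.40, 0.08, 0.22)
actual <- c(0,    0,    0,    1,    1,    1,    0,    0)

cutoff <- 0.16
pred <- ifelse(prob > cutoff, 1, 0)

# Fix the factor levels so the table is always 2 x 2.
cm <- table(Predicted = factor(pred,   levels = c(0, 1)),
            Actual    = factor(actual, levels = c(0, 1)))
print(cm)

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["1", "0"]; fn <- cm["0", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of churners caught
specificity <- tn / (tn + fp)   # share of non-churners correctly left alone
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Lowering the cut-off raises sensitivity at the cost of specificity, which is why a value well below 0.5 can be reasonable when only about 14% of customers churn.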
ROC Plot:
AUC : 81.25%
GINI: 62.5%
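The two figures above are linked by GINI = 2 × AUC − 1, so 2 × 0.8125 − 1 = 0.625. As a sketch, AUC can be computed without any package from the ranks of the scores (the Mann-Whitney form); the helper `auc_rank` and the data below are illustrative assumptions, not the code used for the report.

```r
# AUC from ranks: probability that a random positive outranks a random negative.
# auc_rank is a hypothetical helper written for this illustration.
auc_rank <- function(score, label) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Invented scores and labels (1 = churn).
score <- c(0.05, 0.10, 0.18, 0.30, 0.12, 0.40, 0.08, 0.22)
label <- c(0,    0,    0,    1,    1,    1,    0,    0)

auc  <- auc_rank(score, label)
gini <- 2 * auc - 1
print(c(AUC = auc, GINI = gini))

# Check of the reported pair: AUC 81.25% gives GINI 62.5%.
print(2 * 0.8125 - 1)
```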
Confusion Matrix
KS Plots
KS Value: 0.5268
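The KS value is the largest gap between the cumulative true positive rate and false positive rate over all score cut-offs. A minimal base R sketch on invented data follows; `ks_stat` is a hypothetical helper and not the routine used for the plot above.

```r
# KS statistic: maximum of (TPR - FPR) across all candidate cut-offs.
# ks_stat is a hypothetical helper written for this illustration.
ks_stat <- function(score, label) {
  cuts <- sort(unique(score))
  tpr <- sapply(cuts, function(ct) mean(score[label == 1] >= ct))
  fpr <- sapply(cuts, function(ct) mean(score[label == 0] >= ct))
  max(tpr - fpr)
}

# Invented scores and labels (1 = churn).
score <- c(0.05, 0.10, 0.18, 0.30, 0.12, 0.40, 0.08, 0.22)
label <- c(0,    0,    0,    1,    1,    1,    0,    0)

ks <- ks_stat(score, label)
print(ks)
```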
Summary of the final logistic regression model / Odds ratio
Interpretation of Logistic regression model / Odds ratio:
Independent variables with large positive z values (CustServCalls, DayMins, OverageFee and
RoamMins) are highly significant and increase the likelihood of churn.
The odds ratios show a negative relationship with churn for ContractRenewal and DataPlan and
a positive relationship for CustServCalls, DayMins, OverageFee and RoamMins. For the
positively related variables, each one-unit increase multiplies the odds of churn by the
factor shown in the odds ratio table.
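Odds ratios are obtained by exponentiating the logistic regression coefficients. The sketch below fits a small model on simulated data to show the mechanics; the variables `calls` and `renewed` are stand-ins for CustServCalls and ContractRenewal, and the simulated effect sizes are assumptions, not estimates from the project data.

```r
set.seed(42)
n <- 1000

# Simulated predictors (stand-ins, not project data).
calls   <- rpois(n, 1.5)        # like CustServCalls
renewed <- rbinom(n, 1, 0.9)    # like ContractRenewal

# Simulated outcome: more calls raise churn odds, renewal lowers them.
p     <- plogis(-1 + 0.5 * calls - 1.5 * renewed)
churn <- rbinom(n, 1, p)

fit <- glm(churn ~ calls + renewed, family = "binomial")

# Odds ratio = exp(coefficient): above 1 raises the odds, below 1 lowers them.
odds_ratio <- exp(coef(fit))
print(round(odds_ratio, 3))
```

An odds ratio of, say, 1.6 for a variable means each one-unit increase multiplies the odds of churn by 1.6, holding the other variables fixed.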
Variable importance:
The model has an accuracy of 82.6% with a sensitivity of 87% and a specificity of 60%.
Area under the curve: 81.25% and GINI: 62.5%
KS score of the model: 53%
KNN Model
Before creating the model, the dataset is normalized.
norm = function(x) { (x- min(x))/(max(x) - min(x)) }
mycell_orig.data = as.data.frame(lapply(mycell_orig, norm))
mycell_norm.data = cbind(mycell_orig[,1], mycell_orig.data)
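Min-max scaling is only defined for numeric columns, so factor columns such as Churn need to be left out of the `lapply` call. The sketch below, on an invented data frame, scales the numeric columns and keeps the factor untouched; how `mycell_orig` was prepared in the project is not shown, so this is an illustration rather than the author's exact procedure.

```r
# Same min-max function as above.
norm <- function(x) { (x - min(x)) / (max(x) - min(x)) }

# Invented data: one factor column and two numeric columns.
toy <- data.frame(
  Churn         = factor(c(0, 1, 0, 1)),
  DayMins       = c(100, 150, 200, 300),
  CustServCalls = c(0, 1, 4, 2)
)

# Scale only the numeric columns; the factor column is carried over as-is.
num_cols <- sapply(toy, is.numeric)
toy_norm <- toy
toy_norm[num_cols] <- lapply(toy[num_cols], norm)
print(toy_norm)
```

Scaling matters for KNN because distances would otherwise be dominated by variables with large ranges, such as daytime minutes.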
Then from the normalized dataset, training and test split is done.
The best value of K is found using the train function. We use the same independent variables
that qualified in the logistic regression (after accounting for multicollinearity and significance).
This returns a best fit of
K=5
We will use K=5 for the model building.
mypred1 = knn(mycellknn_train[,c(3,4,6,7,10,11)], mycellknn_test[,c(3,4,6,7,10,11)],
mycellknn_train$Churn, k = 5)
Confusion matrix
ROC Plots
AUC : 73.73%
GINI : 47.45%
KS Plots:
KS Statistic : 47.45
Interpretation of KNN Model
Variable importance of the model can be found using varImp as below
Variable importance:
The model has an accuracy of 91% with sensitivity of 99% and specificity of 48%.
AUC=73.73, GINI = 47.45 and KS =47.45
With a small K, the classification is sensitive to noise in the training data and tends to
overfit, whereas with a large K the decision boundary is over-smoothed and the model tends to underfit.
Naïve Bayes Model
We use the same test and training split done for the logistic regression.
Plotting the prediction on the training dataset.
After a few iterations, a threshold of 0.15 was found to give the best results.
nb_train.churn=ifelse(nbpred1[,2]>.15,1,0)
ROC & AUC on the training dataset / Confusion matrix
AUC = 87%
As the model performance measures are good on the training set, we apply the model to the
test dataset.
Prediction plot: threshold of > .15
ROC Plot / Confusion Matrix
AUC =85.69
GINI = 71.38
We will calculate the KS statistic
KS= 67.23
Interpretation of Naïve Bayes model:
This model has an accuracy of 86% with sensitivity of 90% and specificity of 60%.
AUC = 86 , GINI = 71 and KS = 67
Model performance measures chart:
Measures            Logistic Regression   KNN Model   Naïve Bayes
Confusion Matrix
Accuracy            80%                   91%         85%
Sensitivity         82%                   99%         86%
Specificity         67%                   48%         81%
Balanced Accuracy   74%                   74%         84%
AUC                 81%                   74%         86%
GINI                63%                   47%         71%
KS                  53%                   47%         67%
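Balanced accuracy is the mean of sensitivity and specificity. As a quick check, the base R sketch below recomputes it from the rounded sensitivity and specificity figures in the chart; the vector names are chosen for illustration.

```r
# Rounded sensitivity and specificity per model, as listed in the chart.
sens <- c(Logistic = 0.82, KNN = 0.99, NaiveBayes = 0.86)
spec <- c(Logistic = 0.67, KNN = 0.48, NaiveBayes = 0.81)

# Balanced accuracy = (sensitivity + specificity) / 2.
balanced <- (sens + spec) / 2
print(round(100 * balanced, 1))
```

For KNN this gives 73.5%, in line with its AUC of 73.73%. The two agree because, for a classifier that outputs hard class labels, AUC reduces to the mean of sensitivity and specificity.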
On the confusion matrix measures, KNN scores highest on accuracy and sensitivity but lowest
on specificity, while NB is balanced overall, closely followed by logistic regression.
On AUC, GINI and KS, NB outscores both logistic regression and KNN.
Result: NB is the best model for this dataset based on the above measures.
Actionable insights & Recommendations:
Based on the variable importance measures and the bivariate analysis, it is evident that:
Customers with higher daytime usage have a higher probability of churning. They may be
looking for faster or better service, and should be identified and offered free
additional data plans or other features.
Customers who make more calls to customer service have a higher probability of churning.
The company can look into better service models, self-service portals or apps, and incident
analysis of the calls to pre-empt customer issues with solutions.
Customers who have not renewed their contract recently have a higher probability of churning.
The company can look into attractive new packages and longer-duration contracts to lock in
these customers.
References:
Great Learning Videos & Course Materials
CRAN package documentation