PREDICTIVE MODELLING ASSIGNMENT
-PREDICTION OF CUSTOMER CHURN
By:
Pranav Viswanathan
1. Initial Discovery
1.1. Initial analysis
Customer churn is a major concern for telecom companies and an important indicator of what the future holds for the business. A company that once grew exponentially for a long period can be left behind once its customers start leaving. Studying churn helps to identify the major reasons a customer leaves a particular operator.
One of the hardest problems in this situation is recognising when and why the churn occurred: a customer who was once regular and loyal is now a customer of a competitor.
Analysing the pattern of churn helps to understand which customers are leaving, their reasons for leaving, and how to balance the loss with new customers.
1.2. Business problem
In this project, we simulate one such case of customer churn, working with data on postpaid customers who are on a contract. The data contain information about customer usage behaviour, contract details and payment details, and also indicate which customers cancelled their service. Based on this past data, we need to build a model that can predict whether a customer will cancel their service in the future or not.
1.2.1. Data in hand
Variable          Description
Churn             1 if customer cancelled the service, 0 if not
AccountWeeks      Number of weeks the customer has had an active account
ContractRenewal   1 if customer recently renewed the contract, 0 if not
DataPlan          1 if the customer has a data plan, 0 if not
DataUsage         Gigabytes of monthly data usage
CustServCalls     Number of calls into customer service
DayMins           Average daytime minutes per month
DayCalls          Average number of daytime calls
MonthlyCharge     Average monthly bill
OverageFee        Largest overage fee in the last 12 months
RoamMins          Average number of roaming minutes
Table 1: Data variables and description
From the above table we can see that there are a total of 11 variables, of which 10 are predictors and Churn is the dependent variable.
1.3. Initial Hypothesis
NULL hypothesis (H0): No predictor is available to predict churn.
ALTERNATE hypothesis (HA): There is at least one independent variable that predicts churn.
2. Basic Data Preparation
2.1. Setting up the Environment, Importing Libraries and the Dataset:
This step ensures that:
the environment is set up with a working path,
the libraries needed for certain functions are imported,
the dataset is read in for exploration.
2.1.A Setting up Environment
Before exploring the given dataset, we first set up an environment, i.e. we specify the location from which the data will be read and to which outputs will be saved. This is done with setwd(), which sets the working directory.
getwd() is a function that returns the working directory that has been set.
FIG 1: Setting up environment
FIG 2: Output-Setting up environment
The above Fig 1 and Fig 2 represent the step and the output of setting up the working environment in RStudio.
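A minimal sketch of this step (the path below is only an example and mirrors the appendix):
# set the working directory (example path) and confirm it
setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')
getwd()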
2.1.B Importing Libraries
This step imports the libraries that provide the functions needed for data processing.
Libraries contain built-in functions for various manipulation and statistical tasks.
FIG 3: Importing Libraries
Details on certain important libraries:
corrplot: used for the correlation plot of the dataset.
ROCR: used for finding the AUC of models.
e1071: for Naïve Bayes.
car: for regression diagnostics such as VIF.
ggplot2: for plotting different charts.
ineq: for performance metrics.
FIG 4: output of Loaded Libraries
2.1.C Importing Dataset
The dataset is then imported using read.csv(), which reads the csv file into the R environment.
FIG 5 :Reading dataset
FIG 6 : output of Reading dataset
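A minimal sketch of the import step, using the file name from the appendix:
# read the comma-separated file into a data frame
Data <- read.csv('Cellphone.csv', header = TRUE, sep = ',')
dim(Data)   # 3333 rows, 11 columns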
3.Exploratory Data Analysis – Step by step approach:
In statistics, exploratory data analysis (EDA) is an approach to
analyzing data sets to summarize their main characteristics,
often with visual methods.
The objectives of EDA are to:
suggest hypotheses about the causes of observed phenomena;
assess assumptions on which statistical inference will be based;
support the selection of appropriate statistical tools and techniques;
provide a basis for further data collection through surveys or experiments.
3.1 Data preparation
3.1.1 Initial data exploration
3.1.1 A.Top and Bottom Rows
Here the data is read and characteristics such as its structure, its summary and the top and bottom six rows are viewed for further analysis.
FIG 7 : Top and Bottom rows dataset
FIG 8 : Output- Top and Bottom rows dataset
From Fig 7 and Fig 8 we can see the top and bottom six rows of the dataset. This alone does not give much insight, so we explore further to understand the characteristics of the dataset.
3.1.1 B.Structure of Dataset
str() is a compact way to display the structure of an R object. It can be used as a diagnostic function and as an alternative to summary(), printing one line of information for each component, and it is particularly useful for displaying the contents of lists and data frames.
FIG 9: Code- Structure of dataset
FIG 10: Output- Structure of dataset
From the above Fig 10 we see that there are 11 variables with 3333 observations.
We can also see that Churn, ContractRenewal and CustServCalls are categorical variables stored as numbers, i.e. they take only a small set of repeated values:
Churn takes 0 and 1,
ContractRenewal takes 0 and 1,
and CustServCalls takes small counts such as 1, 2, 3, 4, 5.
So it is better to convert them to factors.
3.1.1 C. Variable Conversion
Here three variables, namely Churn, ContractRenewal and CustServCalls, are categorical variables stored as numeric codes, so we convert them to factors.
FIG 11: Variable conversion
To check whether they have been converted, we once again use str() to see the structure of the dataset.
FIG 12:Output: Variable conversion
3.1.1 D . Statistical parameters of Dataset
Summary is a generic function used to produce result summaries of
the results of various model fitting functions. The function invokes
particular methods which depend on the class of the first argument.
FIG 13:Output: Summary 1
For further statistical parameters such as skewness and range, we use describe() (from the psych package).
FIG 13:Output: Summary 2
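A short sketch of these two calls (describe() here is assumed to be psych::describe, which reports skewness, kurtosis and range):
summary(Data)           # min, quartiles, median, mean and max for each variable
psych::describe(Data)   # adds skewness, kurtosis, range and standard errors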
3.1.2 Univariate Analysis:
Histogram and boxplot Distributions of Dataset:
In univariate analysis we look at each variable on its own, plotting its distribution as a frequency or count (a condensed code sketch is given at the end of this subsection).
FIG 14 :Code : Univariate 1
FIG 15 : Output : Univariate 1
INSIGHTS:
From the histograms we see that most variables are approximately normally distributed, except Churn, ContractRenewal and CustServCalls, which are factors with a small number of levels.
We see that MonthlyCharge has a great number of outliers.
We investigate further by looking at separate plots of the above variables.
NUMERICAL DATA
FIG 16 :Code : Univariate 2
FIG 17 :Output : Univariate 2
INSIGHTS:
AccountWeeks is approximately normally distributed, with a slight right skew.
DataUsage is distributed according to the amount of data customers use, rather than symmetrically around a single mean.
The other variables are roughly normally distributed around their means.
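A condensed sketch of the univariate plots shown above (plot_histogram() comes from the DataExplorer package loaded in the appendix; the hist/boxplot pair is shown for one variable only):
library(DataExplorer)
plot_histogram(Data)   # histograms of all numeric variables at once
par(mfrow = c(1, 2))
hist(Data$MonthlyCharge, main = 'MonthlyCharge', xlab = 'Monthly charge')
boxplot(Data$MonthlyCharge, horizontal = TRUE, main = 'Boxplot of MonthlyCharge')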
3.1.3 Bi-variate Analysis:
INDEPENDENT: AccountWeeks, CustServCalls, ContractRenewal, DataUsage, DataPlan, DayMins, DayCalls, MonthlyCharge, OverageFee, RoamMins.
DEPENDENT: Churn
Bivariate analysis plots each independent variable against the dependent variable.
NUMERICAL VARIABLE:
FIG 18 :Code : Bivariate 1
FIG 19 :Output : Bivariate 1
INSIGHTS:
From the histograms, almost all continuous predictors such as AccountWeeks, DayMins, OverageFee and RoamMins have roughly normal distributions.
MonthlyCharge has a slightly left-skewed distribution, which can be ignored.
Customers who churn and those who do not have similar distributions of AccountWeeks, with a mean of 102.6 (~103) weeks for Churn = 1 and 100.7 (~101) weeks for Churn = 0.
On average, customers who churn use more day minutes (207 mins) than those who do not (175 mins).
On the other hand, the average data usage of churning customers (0.54 GB) is lower than that of non-churning customers (0.86 GB).
Churning customers call customer service more often, typically in the bracket of 5-10 calls versus 0-5 calls.
Monthly charges are also higher for churned customers than for non-churned ones.
CATEGORICAL VARIABLE:
FIG 20 Code : Bivariate 2
Here the plot is between Churn and the other categorical variables: ContractRenewal, DataPlan and CustServCalls.
This is a categorical vs categorical comparison and helps to find the relationship between these variables for further analysis.
FIG 21 Output : Bivariate 2
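A condensed sketch of the bivariate plots (ggplot2 and gridExtra, as loaded in the appendix; Churn is assumed to already be a factor):
library(ggplot2)
library(gridExtra)
p1 <- ggplot(Data, aes(DayMins, fill = Churn)) + geom_density(alpha = 0.4)   # numeric vs Churn
p2 <- ggplot(Data, aes(x = ContractRenewal)) + geom_bar(aes(fill = Churn))   # categorical vs Churn
grid.arrange(p1, p2, ncol = 2)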
3.2 OUTLIERS, MISSING VALUES AND THEIR TREATMENT:
OUTLIERS:
Outliers are data points that lie far outside the interquartile range (typically more than 1.5 times the IQR beyond the quartiles). An outlier may be due to variability in the measurement or may indicate experimental error; the latter are sometimes excluded from the dataset. An outlier can cause serious problems in statistical analyses.
Outliers can have many anomalous causes. A physical apparatus for taking
measurements may have suffered a transient malfunction. There may have
been an error in data transmission or transcription. Outliers arise due to
changes in system behaviour, fraudulent behaviour, human error,
instrument error or simply through natural deviations in populations.
FIG 22 Code : OUTLIER 1
FIG 23 Output : OUTLIER 1
INSIGHTS:
From the above plot we can see that DataUsage has a large number of outliers, followed by MonthlyCharge and DayMins.
For further analysis, we visualise DataUsage, MonthlyCharge and DayMins separately to see their positions and values.
DataUsage has many outliers for the churners (class 1).
DayMins and MonthlyCharge have many outliers in the non-churner category (class 0).
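A minimal sketch for listing the outlying points of one variable, using the same 1.5 x IQR boxplot rule as above:
out_du <- boxplot(Data$DataUsage, plot = FALSE)$out   # values flagged by the boxplot rule
length(out_du)                                        # how many outliers there are
which(Data$DataUsage %in% out_du)                     # their row positions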
DATAUSAGE:
FIG 22 Code : OUTLIER 2
FIG 23 Output : OUTLIER 2
DAYMINS:
FIG 24 Code : OUTLIER 3
FIG 25 Output : OUTLIER 3
MONTHLY CHARGE:
FIG 26 Code : OUTLIER 4
FIG 27 Output : OUTLIER 4
INSIGHTS:
For Churn = 0, MonthlyCharge has a lot of outliers compared to DayMins.
For Churn = 1, DataUsage has a lot of outliers.
The best way to treat these outliers is to apply scaling and normalisation, so that the models created are less prone to error and do not overfit.
MISSING VALUES:
Missing data, or missing values, occur when no data value is stored for the
variable in an observation.
Missing data are a common occurrence and can have a significant effect
on the conclusions that can be drawn from the data.
FIG 28 Code : Missing values
FIG 29 Output : Missing values
INSIGHTS:
From the above figure we see that the data is free from missing values, hence there is no need for any treatment on this front.
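A minimal sketch of the missing-value check, mirroring the appendix:
anyNA(Data)            # TRUE if any value is missing anywhere
colSums(is.na(Data))   # count of missing values per column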
3.3 MULTICOLLINEARITY and its TREATMENT:
Multicollinearity occurs when the independent variables of a regression model are correlated. If the degree of collinearity between the independent variables is high, it becomes difficult to estimate the relationship between each independent variable and the dependent variable, and the overall precision of the estimated coefficients suffers.
FIG 30 Code : Corrplot
FIG 31 Output : Corrplot
INSIGHTS:
The data suggest a very strong correlation between MonthlyCharge and DataUsage, which is expected. So we can replace one of these variables with the other after evaluation.
TREATMENT:
FIG 32 Code : Multicollinearity treatment
Since MonthlyCharge and DataUsage are strongly correlated, DataUsage is dropped from the modelling dataset; the information it carries is largely captured by MonthlyCharge.
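A short sketch of the correlation check and the treatment (the dplyr/corrplot calls follow the appendix; dropping DataUsage by name is an illustrative equivalent of removing the fifth column):
library(dplyr)
library(corrplot)
num_vars <- Data %>% select_if(is.numeric)
corrplot(round(cor(num_vars), 2))           # visual check for strongly correlated pairs
Data <- subset(Data, select = -DataUsage)   # drop the redundant variable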
3.4 Summary from EDA
From the initial analysis we learn that there are 11 variables with 3333 observations.
Out of the 11 variables, 10 are predictors (AccountWeeks, CustServCalls, ContractRenewal, DataUsage, DataPlan, DayMins, DayCalls, MonthlyCharge, OverageFee, RoamMins) and one is the dependent variable (Churn).
From the structure function (str()) we learnt the data type of each variable.
Churn, ContractRenewal and DataPlan were categorical with numeric values and were converted to factors.
Outliers: DataUsage has many outliers for the churners (class 1), while DayMins and MonthlyCharge have many outliers in the non-churner category (class 0).
The dataset had zero missing values.
Univariate analysis:
It helped to see the distribution of each individual variable.
Most variables are roughly normally distributed about their means.
Bivariate analysis:
From the histograms, almost all continuous predictors such as AccountWeeks, DayMins, OverageFee and RoamMins have roughly normal distributions.
MonthlyCharge has a slightly left-skewed distribution, which can be ignored.
Customers who churn and those who do not have similar distributions of AccountWeeks, with a mean of 102.6 (~103) weeks for Churn = 1 and 100.7 (~101) weeks for Churn = 0.
4.LOGISTIC REGRESSION
4.1 LOGISTIC REGRESSION MODEL:
DATA PREPARATION:
The dataset is split in a 70:30 ratio for building and validating the model.
FIG 33 Code : Splitting the dataset
Checking the dimension and structure of dataset.
FIG 34 Code : Dimension of the dataset
FIG 35 Output : Split and Dimension of the dataset
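A minimal sketch of the split step (createDataPartition() from caret, with the seed value from the appendix):
library(caret)
set.seed(332)
idx <- createDataPartition(Data$Churn, p = 0.7, list = FALSE)
train_Data <- Data[idx, ]
test_Data  <- Data[-idx, ]
dim(train_Data); dim(test_Data)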
APPLYING LOGISTIC REGRESSION
FIG 36 Code : LR 1
FIG 37 Output : LR 1
From the above figure we can see that ContractRenewal, DataPlan, CustServCalls, OverageFee and RoamMins are significant.
Moreover, we need to compute the VIF (Variance Inflation Factor) to further decide on the variables that matter to the model.
VARIANCE INFLATION FACTOR (VIF):
FIG 38 : VIF
From the above we can see that DataPlan, DayMins, MonthlyCharge and OverageFee have VIF > 2.
Hence we carry out a chi-square test for further analysis.
FIG 39 : Chisq 1
FIG 40 : Chisq 2
INSIGHTS:
From Fig 37 we can see that ContractRenewal, DataPlan, CustServCalls, OverageFee and RoamMins are significant.
From Fig 38 we can see that DataPlan, DayMins, MonthlyCharge and OverageFee have VIF > 2, hence we carry out a chi-square test for further analysis.
After carrying out the chi-square test to identify the significant predictors, we see that OverageFee and DayCalls can be left out of the model.
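A condensed sketch of the model fit and the two diagnostics discussed above (vif() comes from the car package):
Model1 <- glm(Churn ~ ., data = train_Data, family = 'binomial')
summary(Model1)                # coefficients, standard errors and p-values
car::vif(Model1)               # variance inflation factors of the predictors
anova(Model1, test = 'Chisq')  # sequential chi-square test of each predictor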
4.2 LOGISTIC REGRESSION INTERPRETATION:
From the model created (Fig 37) and the chi-square test (Fig 40) we see that:
ContractRenewal, DataPlan, CustServCalls, OverageFee, RoamMins, MonthlyCharge and DayMins are significant.
ContractRenewal and DataPlan have a negative influence on Churn.
Let us find the odds ratios and the probabilities of the variables impacting customer churn.
FIG 41 : Odds Ratio
FIG 42 : Probability
Variable          Odds Ratio   Probability
AccountWeeks      1.0018       0.5005
ContractRenewal   0.1365       0.1201
DataPlan          0.2501       0.2000
CustServCalls     1.7153       0.6317
DayMins           1.0097       0.5024
DayCalls          1.0035       0.5009
MonthlyCharge     1.0156       0.5039
OverageFee        1.1149       0.5272
RoamMins          1.0773       0.5186
Fig 43: Odds ratio and probability table (probability = odds / (1 + odds))
From the above table, the variables with an odds ratio below 1 (ContractRenewal and DataPlan) have a negative impact on Churn; the remaining variables increase the odds of churning.
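A short sketch of the conversion from coefficients to odds ratios and then to probabilities; since a probability must lie between 0 and 1, the standard conversion odds / (1 + odds) is used:
odds <- exp(coef(Model1))    # odds ratios
prob <- odds / (1 + odds)    # implied probabilities, always between 0 and 1
round(cbind(odds, prob), 4)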
4.3 PREDICTION:
Since we have confirmed the significant variables, let us check the performance of our model using a classification table / confusion matrix.
Fig 43:Code :Prediction
Fig 44 :Output :Prediction
CONFUSION MATRIX ON TEST DATASET:
Fig 45 :Code: CM-LR
Fig 46 :Output: CM-LR
INTERPRETATION:
31 out of (31+30) customers predicted to churn actually churned; this is the positive prediction rate (0.51).
825 out of (825+113) customers predicted not to churn did not churn; this is the negative prediction rate (0.88).
Logistic regression performs poorly as a general model, with a positive prediction rate of 0.51 and a sensitivity of just 0.22.
Of course this model can be improved through better selection of predictors and their interaction effects, but the general case is the worst performer.
The accuracy seems good at 0.86. This is the accuracy paradox: accuracy is high even though the positive prediction rate is low, because the classes are imbalanced.
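A minimal sketch of the classification table at the 0.5 cut-off (confusionMatrix() from caret, as in the appendix):
pred_prob  <- predict(Model1, newdata = test_Data, type = 'response')
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
caret::confusionMatrix(pred_class, test_Data$Churn, positive = '1')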
4.4 Interpretation of other Model Performance Measures for
logistic <KS, AUC, GINI>
ROC PLOT:
It is a plot of the true positive rate against the false positive rate for the different possible cut-points of a diagnostic test.
An ROC curve demonstrates several things:
1. It shows the trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
4. The slope of the tangent line at a cut-point gives the likelihood ratio (LR) for that value of the test.
5. The area under the curve (AUC) is a measure of test accuracy.
Fig 47 :AUC
The AUC (area under the curve) is 0.786, i.e. the dataset has 78.6% concordant pairs.
Fig 48 :ROC
INTERPRETATION:
At a threshold of 0.5, the TPR (true positive rate) is only about 0.22 and the FPR (false positive rate) about 0.04, as seen from the confusion matrix above.
So if the threshold is decreased from 0.5 to 0.3 or 0.2, more cases fall under class Churn = 1, which helps to increase the TPR (at the cost of a higher FPR).
From the plot we see that the AUC is around 0.786, which implies 78.6% concordant pairs in the entire dataset.
Of course this model can be improved through better selection of predictors and their interaction effects, but the general case is the worst performer.
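A short sketch of the ROC curve and AUC using the ROCR package, as in the appendix (pred_prob is the vector of predicted probabilities from the previous step):
library(ROCR)
rocr_pred <- prediction(pred_prob, test_Data$Churn)
perf <- performance(rocr_pred, 'tpr', 'fpr')
plot(perf, colorize = TRUE)                          # ROC curve
as.numeric(performance(rocr_pred, 'auc')@y.values)   # area under the curve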
KS chart and interpretation:
Fig 49 :Code: KS
Fig 50 :Output: KS
The two-sample Kolmogorov-Smirnov test is a nonparametric test that compares the cumulative distributions of the predicted scores for Churn = 0 and Churn = 1.
The KS statistic is the maximum difference between the two cumulative distributions (equivalently, between TPR and FPR), and a p-value can be calculated from it and the sample sizes.
Here this maximum difference is found to be 52%.
GINI and interpretation:
Gini is measured on a scale from 0 to 1, where a score of 1 means that the model is 100% accurate in predicting the outcome.
A higher Gini is beneficial to the bottom line because cases can be assessed more accurately, which means acceptance can be increased at less risk.
Gini = 2 x AUC - 1; with AUC = 0.78,
Gini = (2 x 0.78) - 1 = 0.56.
We see that the Gini is 0.56, which indicates a moderate ability to separate churners from non-churners.
5.Data Normalization/Scaling for KNN and Naïve bayes:
Data normalization (feature scaling) is a process in which the numeric attributes of a dataset are rescaled to a common range. This matters for distance-based methods such as KNN, where variables measured on large scales would otherwise dominate the distance calculation, and it also puts the predictors on a comparable footing for Naïve Bayes.
Here we use min-max scaling.
Min-max normalization is often known as feature scaling: the values of a numeric feature are reduced to a scale between 0 and 1. Therefore, to calculate z, i.e. the normalized value of a member x of the set of observed values, we employ the following formula:
z = (x - min(x)) / (max(x) - min(x))
Fig 51 :Code: Data normalization
Fig 52 :Output: Data normalization
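A minimal sketch of the min-max scaling (Data1 is the raw dataset re-read from the csv, as in the appendix; renaming the response to Churn is an illustrative simplification, since the appendix keeps the default column name produced by cbind()):
norm_minmax <- function(x) (x - min(x)) / (max(x) - min(x))
data_normalised <- as.data.frame(lapply(Data1[, -1], norm_minmax))   # predictors scaled to [0, 1]
data <- cbind(Churn = as.factor(Data1[, 1]), data_normalised)        # re-attach the response as a factor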
6.Data split for KNN and Naïve Bayes:
Fig 53 :Code: Data Split
Fig 54 :Output: Data Split
7. KNN (K Nearest Neighbours):
7.1 Applying KNN :
The KNN or k-nearest neighbors algorithm is one of the simplest machine
learning algorithms and is an example of instance-based learning, where
new data are classified based on stored, labeled instances.
More specifically, the distance between the stored data and the new
instance is calculated by means of some kind of a similarity measure. This
similarity measure is typically expressed by a distance measure such as
the Euclidean distance, cosine similarity or the Manhattan distance.
In other words, the similarity to the data that was already in the system is
calculated for any new data point that you input into the system.
Fig 55 :Code: KNN
INTERPRETATION:
trainControl: controls the computational nuances of the train() function.
repeatedcv: the repeated cross-validation method.
repeats: the number of times the cross-validation is repeated.
Optimum method for finding the value of K:
Fig 56 :Output: method to find K
From Fig 56 we can see that the training dataset is cross-validated with 10 folds, repeated 3 times.
Resampling the dataset gives the optimum value of K, with an accuracy of 90.28%.
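A condensed sketch of the tuning step (caret's train() with repeated cross-validation) and the final fit with class::knn at the chosen k; the response column is assumed to be named Churn here, whereas the appendix keeps the default name produced by cbind():
library(caret)
library(class)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
knn.fit <- train(Churn ~ ., data = Train_Data, method = 'knn',
                 trControl = ctrl, preProcess = c('center', 'scale'), tuneLength = 10)
knn.fit   # accuracy for each candidate k; the best one is picked automatically
# data.matrix() turns any factor predictors into numeric codes so knn() can compute distances
pred_knn <- class::knn(data.matrix(Train_Data[, -1]), data.matrix(Test_Data[, -1]),
                       Train_Data[, 1], k = 9)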
7.2 INTERPRETATION-KNN MODEL:
Fig 57 :Code: KNN-Prediction
Fig 58 :Output: KNN-Prediction
From the above prediction we can see that, for k = 9, 920 customers are predicted as Churn = 0 (they will not leave) and 79 as Churn = 1 (they will churn out).
CONFUSION MATRIX INTERPRETATION:
Fig 59 :Code: KNN-CM
Fig 60 :Output: KNN-CM
The positive prediction rate is found to be 0.73, which is lower than the accuracy rate of 0.89.
The negative prediction rate is found to be 0.90.
Of course this model can be improved through better selection of predictors and their interaction effects, but the general case is the worst performer.
The accuracy seems good at 0.89. Again this is the accuracy paradox: accuracy is high while the positive prediction rate is lower.
8. Naïve Bayes:
8.1 Applying Naïve Bayes:
Naïve Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong
(naïve) independence assumptions between the features. They are
among the simplest Bayesian network models.
It is based upon Bayes theorem
Using Bayes theorem, we can find the probability of A happening,
given that B has occurred. Here, B is the evidence and A is the
hypothesis. The assumption made here is that the predictors/features
are independent. That is presence of one particular feature does not
affect the other. Hence it is called naive.
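In symbols, with P(A) the prior, P(B | A) the likelihood and P(B) the evidence:
P(A | B) = P(B | A) * P(A) / P(B)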
BUILDING NAÏVE BAYES MODEL:
The question is whether a Naïve Bayes model can be built with the given dataset. The given dataset consists of both categorical and numerical variables.
We know that the Naïve Bayes classifier works best on categorical values.
Naïve Bayes works best with categorical values but can be made to work on mixed datasets having continuous as well as categorical predictors, as in the cellphone dataset.
Since this algorithm runs on conditional probabilities, it becomes hard to bin the continuous variables, as they have no natural frequencies but lie on a continuous scale.
Moreover, a model can be created with a mixture of categorical and numerical values, but its accuracy is lower than that of a model created with categorical values only.
Fig 61 :Code: Naïve Bayes
Fig 62 :Output: Naïve Bayes
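A minimal sketch of the fit and prediction (naiveBayes() from e1071; as above, the response column is assumed to be named Churn, whereas the appendix keeps the default name produced by cbind()):
library(e1071)
NB.fit <- naiveBayes(Churn ~ ., data = Train_Data)   # class-conditional distributions per predictor
NB.pred <- predict(NB.fit, Test_Data)                # predicted class labels on the test set
caret::confusionMatrix(NB.pred, Test_Data$Churn, positive = '1')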
8.2 INTERPRETATION:
Naïve Bayes works best with categorical values but can be made to work on mixed datasets having continuous as well as categorical predictors, as in the cellphone dataset.
Since this algorithm runs on conditional probabilities, continuous variables are harder to handle, as they have no natural frequencies but lie on a continuous scale.
For continuous variables, what NB does is take their class-conditional mean and standard deviation (variability) and use these to describe a distribution for each class, from which the likelihood of an observed value is computed.
This works directly for a binary classifier; for a multinomial response the same idea applies per category, and coarser approaches partition the data into quantiles, deciles or other n-tiles and assign probabilities accordingly.
Because of the above, NB's performance on a mixed dataset is always somewhat questionable, and its findings and predictions need to be supported by other classifiers before any actionable operations.
The output of the NB model displays, in matrix format for each predictor, its mean [,1] and standard deviation [,2] for class 1 and class 0.
The independence of predictors (no multicollinearity) has been assumed for the sake of simplicity.
CONFUSION MATRIX INTERPRETATION:
Fig 63 :Code: Naïve Bayes-CM
Fig 64: Output: Naïve Bayes-CM
INTERPRETATION:
Positive prediction value: 62.3%
Negative prediction value: 89.1%
Accuracy: 87.9%
9.CONFUSION MATRIX INTERPRETATION OF ALL MODELS:
Parameter             Logistic Regression   KNN (K Nearest Neighbours)   Naïve Bayes
Accuracy              0.8569                0.8929                       0.8729
Positive prediction   0.5082                0.7341                       0.6231
Negative prediction   0.8795                0.9065                       0.8914
Sensitivity           0.2152                0.4027                       0.2986
Specificity           0.9649                0.9754                       0.9695
From the above table we can see that the KNN model has good accuracy (0.89) with a positive prediction rate of 0.73, which is better than the other models for real-world use.
10.INSIGHTS OF EVERY MODEL’s(LR,KNN,NB) VALIDATION:
From building and evaluating the various models (logistic regression, KNN and Naïve Bayes) we see that KNN has a good accuracy rate of 89.2% and a positive prediction rate of about 73.4%, better than the other models (Naïve Bayes and logistic regression).
Of course the KNN model can be improved further through better selection of predictors and their interaction effects, even though it is already the best of the three in the general case.
The logistic regression model also suffers from the accuracy paradox: if the threshold probability is decreased from 0.5 to, say, 0.2 or 0.1, more cases will fall into the churner category (1).
Logistic regression also performs poorly as a general model, with a positive prediction rate of 50.8% and a sensitivity of just 21.5%.
Naïve Bayes works best with categorical values but can be made to work on mixed datasets having continuous as well as categorical predictors, as in the cellphone dataset.
Since this algorithm runs on conditional probabilities, continuous variables are harder to handle; they are summarised by their class-conditional means and standard deviations rather than by frequencies.
11.ACTIONABLE INSIGHTS AND RECOMMENDATIONS:
For Naïve Bayes:
Ideally, all predictors should be categorical in nature.
If continuous variables are present, proper methods need to be applied to normalize and scale them.
Because of this, NB's accuracy on a mixed dataset is always somewhat questionable.
Naïve Bayes has essentially no hyperparameters to tune.
For KNN:
k-NN performs the best, with a positive prediction rate of 73.4% in the general-case model, where the formula takes all 10 predictors irrespective of whether they are continuous or categorical.
Further, the model can be tuned to obtain better prediction and accuracy.
For logistic regression:
The predictors should be independent of each other (no multicollinearity).
The model can be tuned using performance metrics for better predictions.
12.CONCLUSION:
MODEL                 POSITIVE PREDICTION   ACCURACY
Logistic Regression   50.8%                 85.6%
KNN                   73.4%                 89.2%
Naïve Bayes           62.3%                 87.2%
k-NN performs the best, with a positive prediction rate of 73.4% in the general-case model, where the formula takes all 10 predictors irrespective of whether they are continuous or categorical.
The intended (or any refined / tuned) target model should be able to catch the churners based on the data provided. Of course, the dataset is lopsided in favour of non-churners rather than our intended target of finding churners based on the behaviour hidden in the dataset.
Naïve Bayes has no parameters to tune, but k-NN and logistic regression can be improved by fine-tuning the train-control parameters and also by deploying an up/down-sampling approach for logistic regression to counteract the class imbalance.
APPENDIX
#### Setting up environment
setwd('C:\\Users\\Viswanathan\\Desktop\\pgp-babi')
getwd()
#### Importing the important Libraries
library(DataExplorer)
library(readxl)
library(corrplot)
library(caTools)
library(gridExtra)
library(rpart)
library(rpart.plot)
library(randomForest)
library(data.table)
library(ROCR)
library(ineq)
library(InformationValue)
library(caret)
library(e1071)
library(car)
library(class)
library(devtools)
library(ggplot2)
library(Hmisc)
library(klaR)
library(MASS)
library(nnet)
library(plyr)
library(pROC)
library(psych)
library(scatterplot3d)
library(SDMTools)
library(dplyr)
library(ElemStatLearn)
library(neuralnet)
library(rms)
library(gridExtra)
#### Reading in the files
Data <-read.csv('Cellphone.csv',header = TRUE,sep=',')
attach(Data)
#### To see the first 6 and last 6 data
head(Data)
tail(Data)
#### Basic data summary
describe(Data)
summary(Data)
str(Data)
#### Converting Churn,Contract Renewal and Data plan as factors
Data$Churn=as.factor(Data$Churn)
Data$ContractRenewal=as.factor(Data$ContractRenewal)
Data$DataPlan=as.factor(Data$DataPlan)
str(Data)
#### Checking for missing data and removing them
any(is.na(Data))
colSums(is.na(Data))
anyNA(Data)
#### Uni-variate Analysis
#### Histogram distribution of dataset
par(mfrow=c(3,3))
plot_histogram(Data) #Numerical
#Categorical
plot_bar(Churn)
plot_bar(ContractRenewal)
plot_bar(DataPlan)
names(Data)
par(mar=c(2,2,2,2))
par(mfrow=c(4,4))
hist(Churn,xlab='Churn')
boxplot(Churn,horizontal = TRUE,main='Boxplot of churn',xlab='Churn')
hist(AccountWeeks,xlab='AccountWeeks')
boxplot(AccountWeeks,horizontal = TRUE,main='Boxplot of Accountweeks',xlab='Accountweeks')
hist(ContractRenewal,xlab='Contract renewal')
boxplot(ContractRenewal,horizontal = TRUE,main='Boxplot of Contract renewal',xlab='Contract renewal')
hist(DataPlan,xlab='Data plan')
boxplot(DataPlan,horizontal = TRUE,main='Boxplot of Dataplan',xlab='Dataplan')
hist(DataUsage,xlab='Datausage')
boxplot(DataUsage,horizontal = TRUE,main='Boxplot of Data usage',xlab='Data usage')
hist(CustServCalls,xlab='Custservcalls')
boxplot(CustServCalls,horizontal = TRUE,main='Boxplot of Custservcall',xlab='Custservcall')
hist(DayMins,xlab='Daymins')
boxplot(DayMins,horizontal = TRUE,main='Boxplot of Daymins',xlab='Daymins')
hist(DayCalls,xlab='Daycalls')
boxplot(DayCalls,horizontal = TRUE,main='Boxplot of Daycalls',xlab='Daycalls')
hist(MonthlyCharge,xlab='Monthlycharge')
boxplot(MonthlyCharge,horizontal = TRUE,main='Boxplot of monthly charge',xlab='Monthly charge')
hist(OverageFee,xlab='overagefee')
boxplot(OverageFee,horizontal = TRUE,main='Boxplot of overagefee',xlab='Overagefee')
hist(RoamMins,xlab='RoamMins')
boxplot(RoamMins,horizontal = TRUE,main='Boxplot of Roam mins',xlab='Roam mins')
#### Density plots of variables
plot_density(Data,geom_density_args = list(fill='gold',alpha=0.4))
#### Bivariate
library(gridExtra)
p1 = ggplot(Data, aes(AccountWeeks, fill=Churn)) + geom_density(alpha=0.4)
p2 = ggplot(Data, aes(MonthlyCharge, fill=Churn)) + geom_density(alpha=0.4)
p3 = ggplot(Data, aes(CustServCalls, fill=Churn))+geom_bar(position = "dodge")
p4 = ggplot(Data, aes(RoamMins, fill=Churn)) + geom_histogram(bins = 50, color=c("red"))
grid.arrange(p1,p2,p3,p4)
### In depth analysis of Bi-variate:
## AccountWeeks Vs Churn
d1=Data$AccountWeeks[Data$Churn==1]
mean(d1)
d2=Data$AccountWeeks[Data$Churn==0]
mean(d2)
## DayMinutes Vs Churn
d3=Data$DayMins[Data$Churn==1]
mean(d3)
d4=Data$DayMins[Data$Churn==0]
mean(d4)
## DataUsage Vs Churn
d5=Data$DataUsage[Data$Churn==1]
mean(d5)
d6=Data$DataUsage[Data$Churn==0]
mean(d6)
names(Data)
#### Categorical
p6=ggplot(Data,aes(x=ContractRenewal))+geom_bar(aes(fill=Churn))
p7=ggplot(Data,aes(x=DataPlan))+geom_bar(aes(fill=Churn))
p8=ggplot(Data,aes(x=CustServCalls))+geom_bar(aes(fill=Churn))
grid.arrange(p6,p7,p8)
#### Outlier and its treatment
plot_boxplot(Data,by='Churn',geom_boxplot_args = list('outlier.color'='red',fill='blue'))
## Data usage has many outliers for churn(class 1)
## Day mins and Monthly charge has outliers for churn(class 0)
outlier1=boxplot(DataUsage)$out
print(outlier1)
which(DataUsage %in% outlier1)
outlier2=boxplot(DayMins)$out
print(outlier2)
which(DayMins %in% outlier2)
outlier3=boxplot(MonthlyCharge)$out
print(outlier3)
which(MonthlyCharge %in% outlier3)
#### Insight for Multicollinearity
cell_numeric=Data %>% select_if(is.numeric)
a=round(cor(cell_numeric),2)
corrplot(a)
##Data suggests there is very strong correlation between Monthly charges
#and data usage which is quite obvious .
##So we can replace one variable with another after evaluation
names(Data)
### VIF
lr=read.csv('Cellphone.csv',header = TRUE,sep=',')
LR=lm(Churn~., data=lr)
summary(LR)
vif(LR)
### Removing Datausage
Data=Data[,-5]
str(Data)
#### Splitting dataset - Train and Test
set.seed(332)
split=createDataPartition(Data$Churn,p=0.7,list=FALSE)
train_Data=Data[split,]
test_Data=Data[-split,]
dim(train_Data)
dim(test_Data)
#### LOGISTIC REGRESSION ####
Model1=glm(Churn~., data=train_Data,family='binomial') # response referenced by name so that '.' excludes it from the predictors
summary(Model1)
## Contract renewal and data plan has negative impact on customer churn
## checking for variance inflation factor
vif(Model1)
?anova
### CHI SQUARED TEST-to check significant predictors
anova(Model1,test='Chisq')
#Dataplan,Daymins,Monthlycharge need to be cured as vif is greater than 5
Model1$coefficients
## Likelihood ratio
lh=exp(Model1$coefficients)
print(lh) ## odds ratio of AccountWeeks is 1.0018: a one-unit increase multiplies the odds of churn by about 1.0018 (~0.18% increase)
prob=exp(coef(Model1))/(1+exp(coef(Model1))) # odds/(1+odds); parentheses needed so the division happens before the addition
prob
## Interpretation
Model1_pred = predict(Model1,newdata = test_Data,type='response')
Model1_predicted=ifelse(Model1_pred>0.5,1,0)
#Factor conversion
Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))
head(Model1_predicted_factor)
## Confusion matrix
Model1.CM=confusionMatrix(Model1_predicted_factor,test_Data$Churn,positive='1')
Model1.CM
## ROC curve
LR_pred=predict(Model1,newdata = test_Data,type='response')
rocr_pred=prediction(LR_pred,test_Data$Churn)
perf=performance(rocr_pred,'tpr','fpr')
plot(perf)
plot(perf,colorize=TRUE,print.cutoffs.at=seq(0,1,0.05),text.adj=c(-0.2,1.7))
as.numeric(performance(rocr_pred,'auc')@y.values)
## KS
library(blorr)
ks=blr_gains_table(Model1)
blr_ks_chart(ks,title='KS chart',ks_line_color='black')
##GINI (computed from the AUC, as Gini = 2*AUC - 1)
auc_value=as.numeric(performance(rocr_pred,'auc')@y.values)
LR_gini=2*auc_value-1
LR_gini
names(Data)
##### NORMALIZED DATA FOR KNN AND NAIVE BAYES #####
Data1=read.csv('Cellphone.csv',header = TRUE,sep=',')
norm=function(x){(x-min(x))/(max(x)-min(x))}
data_normalised=as.data.frame(lapply(Data1[,-1],norm))
library(tibble)
view(data_normalised)
data=cbind(Data1[,1],data_normalised)
str(data)
data$`Data1[, 1]`=as.factor(data$`Data1[, 1]`)
data$ContractRenewal=as.factor(data$ContractRenewal)
data$DataPlan=as.factor(data$DataPlan)
str(data)
#### Splitting dataset - Train and Test
set.seed(332)
split=createDataPartition(data$`Data1[, 1]`,p=0.7,list=FALSE)
Train_Data=data[split,]
Test_Data=data[-split,]
dim(Train_Data)
dim(Test_Data)
str(Train_Data)
attach(Train_Data)
### KNN ###
set.seed(2020)
ctrl=trainControl(method='repeatedcv',repeats = 3)
knn.fit=train(`Data1[, 1]`~.,data=Train_Data,method='knn',trControl=ctrl
,preProcess=c('center','scale'),tuneLength=10)
knn.fit
Model2=knn(data.matrix(Train_Data[,-1]),data.matrix(Test_Data[,-1]),Train_Data[,1],k=9) # data.matrix() converts the factor predictors to numeric codes, which class::knn requires
summary(Model2)
knn_table=table(Test_Data[,1],Model2)
knn_table
sum(diag(knn_table)/sum(knn_table))
knn.cm=confusionMatrix(Model2,Test_Data$`Data1[, 1]`,positive='1')
knn.cm
#### Naive bayes ####
NB.fit=naiveBayes(`Data1[, 1]`~.,data=Train_Data) # response referenced by its column name so that '.' excludes it from the predictors
NB.fit
NB.pred=predict(NB.fit,Test_Data)
NB.pred
NB.cm=confusionMatrix(NB.pred,Test_Data$`Data1[, 1]`,positive='1')
NB.cm