CART-RF-ANN
PREPARED BY
MURALIDHARAN N
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task to
make a model which predicts the claim status and provide recommendations to management.
Use CART, RF & ANN and compare the models' performances in train and test sets.
Data Dictionary
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it.
Reading the dataset:
The data was read successfully.
The shape of the dataset is (3000, 10).
The info function shows that the dataset has object, integer and float columns, so the object
columns must be converted to numeric values before modelling.
There are no missing values in the dataset.
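The ingestion checks described above can be sketched as follows. This is a minimal illustration, not the original notebook: the three rows and the column names are made-up stand-ins, and the real data would be loaded from a file (for example with pd.read_csv, file name unknown).

```python
import pandas as pd

# Hypothetical stand-in rows; column names follow the data dictionary
# but their exact spelling in the real file is an assumption.
df = pd.DataFrame({
    "Age": [48, 36, 39],
    "Agency_Code": ["C2B", "EPX", "CWT"],
    "Type": ["Airlines", "Travel Agency", "Travel Agency"],
    "Claimed": ["No", "No", "Yes"],
    "Commission": [0.70, 0.00, 5.94],
    "Channel": ["Online", "Online", "Online"],
    "Duration": [7, 34, 3],
    "Sales": [2.51, 20.00, 9.90],
    "Product_Name": ["Customised Plan", "Customised Plan", "Bronze Plan"],
    "Destination": ["ASIA", "ASIA", "Americas"],
})

print(df.shape)                     # (rows, columns)
df.info()                           # dtypes: object, int64, float64
null_counts = df.isnull().sum()     # missing values per column
print(null_counts)
print(df.describe(include="all").T) # summary for numeric and categorical columns
```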
Summary of the dataset:
There are 4 numeric variables and 6 categorical variables.
Agency code EPX is the most frequent, with 1365 records.
The most common type is Travel Agency.
The most common channel is Online.
The Customised Plan is the plan most sought by customers.
ASIA is the destination most sought by customers.
The distribution of each variable is examined further in the univariate and bivariate analysis.
Checking for duplicates in the dataset:
As there is no unique identifier, I am not dropping the duplicates; they may be different customers' data.
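A duplicate check of this kind might look like the sketch below. The three rows are invented for illustration; the point is that duplicated rows are counted but kept, because identical values can belong to different customers.

```python
import pandas as pd

# Toy rows (assumed values): the first two are identical on every column.
df = pd.DataFrame({
    "Age": [36, 36, 45],
    "Type": ["Travel Agency", "Travel Agency", "Airlines"],
    "Sales": [20.0, 20.0, 9.9],
})

dup_mask = df.duplicated()        # True for each repeat of an earlier row
n_dups = int(dup_mask.sum())
print("Number of duplicate rows:", n_dups)

# No drop_duplicates() call: without a customer ID the rows are retained.
```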
Outliers exist in almost all the numeric variables.
These outliers are treated before building the random forest classifier.
AGENCY_CODE: 4
JZI 239
CWT 472
C2B 924
EPX 1365
TYPE: 2
Airlines 1163
Travel Agency 1837
CLAIMED: 2
Yes 924
No 2076
CHANNEL: 2
Offline 46
Online 2954
PRODUCT NAME: 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
DESTINATION: 3
EUROPE 215
Americas 320
ASIA 2465
Univariate / Bivariate analysis
The box plot of the Age variable shows outliers.
Age is positively skewed (skewness 1.149713).
The dist plot shows the data ranging from 20 to 80.
The majority of the distribution lies in the range of 30 to 40.
The box plot of the Commission variable shows outliers.
Commission is positively skewed (skewness 3.148858).
The dist plot shows the data ranging from 0 to 30.
The box plot of the Duration variable shows outliers.
Duration is positively skewed (skewness 13.784681).
The dist plot shows the data ranging from 0 to 100.
The box plot of the Sales variable shows outliers.
Sales is positively skewed (skewness 2.381148).
The dist plot shows the data ranging from 0 to 300.
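The skewness figures quoted for the numeric variables are the kind of value pandas returns from skew(), which computes an adjusted sample skewness. A small sketch on invented numbers (not the report's data) shows how a long right tail produces a positive value while a symmetric column gives zero:

```python
import pandas as pd

# Invented values: "Age" has one large value in the right tail,
# "Symmetric" is evenly spaced and has no skew.
df = pd.DataFrame({
    "Age": [30, 32, 35, 36, 38, 40, 45, 80],
    "Symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

skewness = df.skew()   # one skewness value per numeric column
print(skewness)
```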
Categorical Variables
Agency Code
The distribution of agency code shows that EPX has the highest frequency.
The box plot shows the split of sales across agency codes, with the Claimed column as hue.
C2B appears to have more claims than the other agencies.
The box plot shows the split of sales across types, with the Claimed column as hue.
The Airlines type appears to have more claims.
The majority of customers used the online channel; very few used the offline channel.
The box plot shows the split of sales across channels, with the Claimed column as hue.
The Customised Plan is the most popular plan among customers when compared with all other plans.
The box plot shows the split of sales across product names, with the Claimed column as hue.
Asia is the destination customers choose most often when compared with the other destinations.
The box plot shows the split of sales across destinations, with the Claimed column as hue.
Checking pairwise distribution of the continuous variables
Checking for Correlations
Not much multicollinearity is observed.
There are no negative correlations; all the correlations are positive.
To build our models, we convert the object columns to numeric codes.
Feature: Agency Code
[C2B, EPX, CWT, JZI]
Categories (4, object): [C2B, CWT, EPX, JZI]
[0 2 1 3]
Feature: Type
[Airlines, Travel Agency]
Categories (2, object): [Airlines, Travel Agency]
[0 1]
Feature: Claimed
[No, Yes]
Categories (2, object): [No, Yes]
[0 1]
Feature: Channel
[Online, Offline]
Categories (2, object): [Offline, Online]
[1 0]
Feature: Product Name
[Customised Plan, Cancellation Plan, Bronze Plan, Silver Plan, Gold Plan]
Categories (5, object): [Bronze Plan, Cancellation Plan, Customised Plan, Gold Plan, Silver Plan]
[2 1 0 4 3]
Feature: Destination
[ASIA, Americas, EUROPE]
Categories (3, object): [ASIA, Americas, EUROPE]
[0 1 2]
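The printed categories and codes above are consistent with encoding each object column through pd.Categorical, which sorts the categories alphabetically and assigns integer codes in that order. The sketch below reproduces the pattern on a few toy rows; the column names and the loop are assumptions, not the original code.

```python
import pandas as pd

# Toy rows; the order of first appearance matches the printout above.
df = pd.DataFrame({
    "Agency_Code": ["C2B", "EPX", "CWT", "JZI", "EPX"],
    "Channel": ["Online", "Offline", "Online", "Online", "Online"],
})

first_seen_codes = {}
for feature in ["Agency_Code", "Channel"]:
    cats = pd.Categorical(df[feature].unique())
    print("Feature:", feature)
    print(cats)          # values in order of appearance, plus sorted categories
    print(cats.codes)    # code of each value in order of appearance
    first_seen_codes[feature] = cats.codes.tolist()
    # Replace the text column with its integer codes.
    df[feature] = pd.Categorical(df[feature]).codes
```

Because the codes follow alphabetical order, they carry no business meaning; tree-based models tolerate this, but the ordering is arbitrary for a neural network.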
Checking the info
Checking the proportion of 0s and 1s in the dataset. That is our target column (Claimed).
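Using the class counts reported earlier (No 2076, Yes 924), the target proportion can be checked as in this sketch. The Series is rebuilt from those counts only; it is not the original data.

```python
import pandas as pd

# Rebuild the encoded target from the reported counts: 0 = No, 1 = Yes.
claimed = pd.Series([0] * 2076 + [1] * 924, name="Claimed")

proportions = claimed.value_counts(normalize=True)
print(proportions)   # share of each class in the target column
```

Roughly 69% of policies have no claim and 31% have a claim, so the classes are moderately imbalanced and accuracy alone should be read with care.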
2.2 Data Split: Split the data into test and train, build classification
model CART, Random Forest, Artificial Neural Network
For training and testing purposes, we split the dataset into train and test data in the ratio
70:30.
We have also taken the target column out of the train and test data into a separate vector for
evaluation purposes.
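A 70:30 split with the target held in its own vector could be written as below. The feature values are random placeholders and random_state=1 is an assumption; only the row count and class counts come from the report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Placeholder features; only the 3000 rows and the 2076/924 class counts
# are taken from the report.
df = pd.DataFrame({
    "Age": rng.integers(20, 80, 3000),
    "Sales": rng.gamma(2.0, 30.0, 3000).round(2),
    "Claimed": [0] * 2076 + [1] * 924,
})

X = df.drop("Claimed", axis=1)   # predictors
y = df["Claimed"]                # target vector

X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```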
MODEL 1
CHECKING THE FEATURE
GRID SEARCH FOR FINDING THE OPTIMAL VALUES FOR THE DECISION TREE
FITTING THE OPTIMAL VALUES TO THE TRAINING DATASET
BEST GRID
Regularising the Decision Tree
Adding Tuning Parameters
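A regularised decision tree tuned by grid search might be set up as in this sketch. The synthetic data from make_classification and every value in param_grid are assumptions chosen for illustration; they are not the grid used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 9 encoded predictors and the binary target.
X, y = make_classification(n_samples=600, n_features=9, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Illustrative tuning parameters that limit tree growth (regularisation).
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [10, 20],
    "min_samples_split": [30, 60],
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                           param_grid, cv=3)
grid_search.fit(X_train, y_train)

best_tree = grid_search.best_estimator_   # refitted on the full training set
print(grid_search.best_params_)
importances = best_tree.feature_importances_
print(importances)                        # relative importance of each feature
```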
MODEL 2
TREATING OUTLIERS FOR RANDOM FOREST
BOX PLOT TO CHECK PRESENCE OF OUTLIERS
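The report does not state which outlier treatment was applied. One common choice, shown here purely as an assumption, is to cap values at the box plot whiskers (1.5 times the interquartile range beyond the quartiles). The helper name cap_outliers and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers(col: pd.Series) -> pd.Series:
    """Clip values to the box plot whiskers (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Invented sales values with one extreme point.
sales = pd.Series([10.0, 20.0, 20.0, 30.0, 30.0, 30.0, 40.0, 500.0])
capped = cap_outliers(sales)
print(capped.tolist())
```

Capping keeps every row, which matters here because no records are being dropped elsewhere in the analysis.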
RANDOM FOREST CLASSIFIER
GRID SEARCH TO FIND THE OPTIMAL VALUES
FITTING THE RFCL MODEL USING THE VALUES OBTAINED BY THE OPTIMAL GRID
SEARCH METHOD
BEST GRID VALUES
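For the random forest, a comparable grid search could look like the following sketch. The name rfcl echoes the heading above; the synthetic data and the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=400, n_features=9,
                                       n_informative=4, random_state=2)

# Illustrative grid; the report's actual values are not available.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6],
    "min_samples_leaf": [5, 10],
}
rfcl = RandomForestClassifier(random_state=1)
grid_search = GridSearchCV(rfcl, param_grid, cv=3)
grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_
print(grid_search.best_params_)
train_accuracy = best_rf.score(X_train, y_train)
print(train_accuracy)
```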
MODEL 3
Building a Neural Network Classifier
BEFORE BUILDING THE MODEL,
WE SCALE THE TRAINING VALUES TO A COMMON SCALE USING MINMAXSCALER.
AFTER FITTING THE SCALER, WE APPLY THE SAME TRANSFORMATION TO THE TEST DATA.
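The scaling step can be sketched as below: the scaler learns the minimum and maximum from the training data only, and the same fitted scaler is then applied to the test data. The arrays are made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up train and test values for two features.
X_train = np.array([[20.0, 0.0], [40.0, 50.0], [60.0, 100.0]])
X_test = np.array([[30.0, 25.0], [80.0, 120.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train min/max

print(X_train_scaled)
print(X_test_scaled)
```

Test values outside the training range map outside [0, 1] (the second test row above), which is expected and avoids leaking test information into the scaler.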
MLP CLASSIFIER
TRAINING THE MODEL
GRID SEARCH
FITTING THE MODEL USING THE OPTIMAL VALUES FROM GRID SEARCH
BEST GRID VALUES,
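A grid search over MLPClassifier settings might be written as in this sketch. The data are synthetic, and the hidden layer sizes, tolerance values, and max_iter are illustrative assumptions rather than the report's grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled training data.
X_train, y_train = make_classification(n_samples=300, n_features=9,
                                       n_informative=4, random_state=3)
X_train = MinMaxScaler().fit_transform(X_train)

# Illustrative grid: network width and stopping tolerance.
param_grid = {
    "hidden_layer_sizes": [(20,), (50,)],
    "tol": [0.01, 0.001],
}
mlp = MLPClassifier(max_iter=1000, solver="adam", random_state=1)
grid_search = GridSearchCV(mlp, param_grid, cv=3)
grid_search.fit(X_train, y_train)

best_nn = grid_search.best_estimator_
print(grid_search.best_params_)
probs = best_nn.predict_proba(X_train)   # class probabilities, used for ROC/AUC
```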
2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model
DECISION TREE PREDICTION
ACCURACY
CONFUSION MATRIX
Model Evaluation for Decision Tree
AUC and ROC for the training data for Decision Tree
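The accuracy, confusion matrix, and ROC/AUC checks applied to each model can be sketched on a toy example. The ten labels and predicted probabilities below are invented so the arithmetic is easy to follow; they are not results from any of the three models.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)

# Invented true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.7, 0.4, 0.2, 0.9, 0.1]
y_pred = [int(p >= 0.5) for p in y_prob]   # default 0.5 cut-off

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)      # rows: actual, columns: predicted
auc = roc_auc_score(y_true, y_prob)        # uses probabilities, not labels
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

print(accuracy)
print(cm)
print(auc)
```

Plotting tpr against fpr gives the ROC curve; the same calls are repeated on the train and test sets of each model to compare them.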
MODEL 2 PREDICTION RANDOM FOREST
ACCURACY
CONFUSION MATRIX
Model Evaluation for Random Forest
AUC and ROC for the training data for Random Forest
ACCURACY
CONFUSION MATRIX
MODEL 3
ANN
CONFUSION MATRIX
ACCURACY
Model Evaluation for Neural Network Classifier
ACCURACY
CONFUSION MATRIX
2.4 Final Model: Compare all the models and write an inference on which
model is best/optimized.
CONCLUSION:
I am selecting the RF model, as it has better
accuracy, precision, recall, and F1 score than
the other two models, CART and NN.
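The four metrics used in this comparison all derive from a model's confusion matrix. The sketch below shows the arithmetic with a hypothetical helper, summarize, applied to toy counts; the counts are placeholders and not the results of CART, RF, or NN.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (claimed) class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # of predicted claims, how many were real
    recall = tp / (tp + fn)      # of real claims, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round((tp + tn) / total, 3),
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3),
    }

# Placeholder counts for illustration only.
toy_counts = {"tn": 5, "fp": 1, "fn": 1, "tp": 3}
summary = summarize(**toy_counts)
print(summary)
```

Applying the same function to each model's test confusion matrix gives a like-for-like table; for claim prediction, recall on the claimed class is usually the figure to watch alongside accuracy.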
2.5 Inference: Based on the whole Analysis, what are the business
insights and recommendations?
Looking at the models, more data would help us understand the claim patterns and predict better.
Streamlining online experiences benefits customers, leading to an increase in conversions,
which subsequently raises profits.
• As per the data, about 98% of policies (2,954 of 3,000) are sold through the online channel.
• Another interesting fact is that almost all of the offline business has a claim associated with it.
• The JZI agency is at the bottom in sales, so we need to train its resources to pick up sales, run
a promotional marketing campaign, or evaluate whether to tie up with an alternate agency.
• The model gives about 80% accuracy, so when a customer books airline tickets or plans, we can
cross-sell insurance based on the claim data pattern.
• More sales happen via travel agencies than airlines, yet the trend shows that more claims are
processed through airlines. We may need to deep dive into the process to understand the
workflow and why this happens.
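Observations such as "more claims at airlines" can be checked by comparing claim rates per group rather than raw counts. A minimal sketch with a handful of invented rows (not the report's data) follows; the same pattern applies to Channel or Agency Code.

```python
import pandas as pd

# Invented rows: Claimed is the encoded target (1 = claim made).
df = pd.DataFrame({
    "Type": ["Airlines", "Airlines", "Airlines",
             "Travel Agency", "Travel Agency", "Travel Agency",
             "Travel Agency", "Travel Agency"],
    "Claimed": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Policies sold, claims made, and claim rate for each type.
claim_rate = df.groupby("Type")["Claimed"].agg(
    policies="count", claims="sum", claim_rate="mean"
)
print(claim_rate)
```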
Key performance indicators (KPIs)
The KPIs for insurance claims are:
• Increase customer satisfaction, which in turn brings more revenue.
• Combat fraud: deploy measures to detect fraudulent transactions at the earliest.
• Optimize the claims recovery method.
• Reduce claim handling costs.