Ola Case Study

Introduction
Ola is a leading ride-sharing platform, aiming to provide reliable, affordable, and convenient urban
transportation for everyone. The constant challenge Ola faces is the churn rate of its drivers. Ensuring driver
loyalty and reducing attrition are crucial to the company's operation. Analyzing driver data can reveal
patterns in driver behavior, performance, and satisfaction. This would help in foreseeing potential churn,
allowing proactive measures. By leveraging data science and ensemble learning, Ola can predict driver
churn, which would be pivotal in its driver retention strategy.

What is expected
Assuming you are a data scientist at Ola, you are entrusted with the responsibility of analyzing the dataset
to predict driver attrition. Your primary goal is to utilize ensemble learning techniques, evaluate the
performance of your models, and provide actionable insights to reduce driver churn.

1. Data
The analysis was done on the data located at -
https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/492/original/ola_driver_scaler.csv

2. Libraries
Below are the libraries required

In [1]: # libraries to analyze data
import numpy as np
import pandas as pd

# libraries to visualize data
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, label_binarize

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (classification_report, accuracy_score, confusion_matrix,
                             ConfusionMatrixDisplay, roc_curve, auc,
                             precision_recall_curve, average_precision_score)
from sklearn.multiclass import OneVsRestClassifier

from imblearn.over_sampling import SMOTE

from scipy.stats import randint

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

3. Data Loading
Loading the data into a Pandas DataFrame for easier handling.

In [2]: # read the file into a pandas dataframe


df = pd.read_csv('ola_driver_scaler.csv')
# look at the datatypes of the columns
print('*************************************************')
print(df.info())
print('*************************************************\n')
print('*************************************************')
print(f'Shape of the dataset is {df.shape}')
print('*************************************************\n')
print('*************************************************')
print(f'Number of nan/null values in each column: \n{df.isna().sum()}')
print('*************************************************\n')
print('*************************************************')
print(f'Number of unique values in each column: \n{df.nunique()}')
print('*************************************************\n')
print('*************************************************')
print(f'Duplicate entries: \n{df.duplicated().value_counts()}')

*************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 19104 non-null int64
1 MMM-YY 19104 non-null object
2 Driver_ID 19104 non-null int64
3 Age 19043 non-null float64
4 Gender 19052 non-null float64
5 City 19104 non-null object
6 Education_Level 19104 non-null int64
7 Income 19104 non-null int64
8 Dateofjoining 19104 non-null object
9 LastWorkingDate 1616 non-null object
10 Joining Designation 19104 non-null int64
11 Grade 19104 non-null int64
12 Total Business Value 19104 non-null int64
13 Quarterly Rating 19104 non-null int64
dtypes: float64(2), int64(8), object(4)
memory usage: 2.0+ MB
None
*************************************************

*************************************************
Shape of the dataset is (19104, 14)
*************************************************

*************************************************
Number of nan/null values in each column:
Unnamed: 0 0
MMM-YY 0
Driver_ID 0
Age 61
Gender 52
City 0
Education_Level 0
Income 0
Dateofjoining 0
LastWorkingDate 17488
Joining Designation 0
Grade 0
Total Business Value 0
Quarterly Rating 0
dtype: int64
*************************************************

*************************************************
Number of unique values in each column:
Unnamed: 0 19104
MMM-YY 24
Driver_ID 2381
Age 36
Gender 2
City 29
Education_Level 3
Income 2383
Dateofjoining 869
LastWorkingDate 493
Joining Designation 5
Grade 5
Total Business Value 10181
Quarterly Rating 4
dtype: int64
*************************************************

*************************************************
Duplicate entries:
False 19104
Name: count, dtype: int64

In [3]: # look at the top 20 rows


df.head(20)

Out[3]:
   Unnamed: 0    MMM-YY  Driver_ID   Age  Gender City  Education_Level  Income Dateofjoining LastWorkingDate  ...

0 0 01/01/19 1 28.0 0.0 C23 2 57387 24/12/18 NaN

1 1 02/01/19 1 28.0 0.0 C23 2 57387 24/12/18 NaN

2 2 03/01/19 1 28.0 0.0 C23 2 57387 24/12/18 03/11/19

3 3 11/01/20 2 31.0 0.0 C7 2 67016 11/06/20 NaN

4 4 12/01/20 2 31.0 0.0 C7 2 67016 11/06/20 NaN

5 5 12/01/19 4 43.0 0.0 C13 2 65603 12/07/19 NaN

6 6 01/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN

7 7 02/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN

8 8 03/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN

9 9 04/01/20 4 43.0 0.0 C13 2 65603 12/07/19 27/04/20

10 10 01/01/19 5 29.0 0.0 C9 0 46368 01/09/19 NaN

11 11 02/01/19 5 29.0 0.0 C9 0 46368 01/09/19 NaN

12 12 03/01/19 5 29.0 0.0 C9 0 46368 01/09/19 03/07/19

13 13 08/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN


14 14 09/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN

15 15 10/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN

16 16 11/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN

17 17 12/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN

18 18 09/01/20 8 34.0 0.0 C2 0 70656 19/09/20 NaN

19 19 10/01/20 8 34.0 0.0 C2 0 70656 19/09/20 NaN

In [4]: df.describe()

Out[4]:
         Unnamed: 0     Driver_ID           Age        Gender  Education_Level        Income  Joining Designation  ...

count 19104.000000 19104.000000 19043.000000 19052.000000 19104.000000 19104.000000 19104.000000 191

mean 9551.500000 1415.591133 34.668435 0.418749 1.021671 65652.025126 1.690536

std 5514.994107 810.705321 6.257912 0.493367 0.800167 30914.515344 0.836984

min 0.000000 1.000000 21.000000 0.000000 0.000000 10747.000000 1.000000

25% 4775.750000 710.000000 30.000000 0.000000 0.000000 42383.000000 1.000000

50% 9551.500000 1417.000000 34.000000 0.000000 1.000000 60087.000000 1.000000

75% 14327.250000 2137.000000 39.000000 1.000000 2.000000 83969.000000 2.000000

max 19103.000000 2788.000000 58.000000 1.000000 2.000000 188418.000000 5.000000

In [5]: df.describe(include='object')

Out[5]: MMM-YY City Dateofjoining LastWorkingDate

count 19104 19104 19104 1616

unique 24 29 869 493

top 01/01/19 C20 23/07/15 29/07/20

freq 1022 1008 192 70

Insight
There are 19104 entries with 14 columns
There are 61 null/missing values in Age, 52 in Gender and 17488 in LastWorkingDate
There are 2381 unique drivers
There are no duplicates
The column Unnamed: 0 can be dropped as it doesn't provide any new information
The columns Gender, City, Education_Level, Joining Designation, Grade and Quarterly Rating can
be converted to categorical datatype
The columns MMM-YY, Dateofjoining and LastWorkingDate can be converted to datetime datatype
Drivers who have a valid LastWorkingDate can be considered as churned

In [6]: # Drop "Unnamed: 0" column


df.drop(columns=['Unnamed: 0'], inplace=True)

# Convert to category
categorical_columns = ['Gender', 'City', 'Education_Level', 'Joining Designation', 'Grade']
df[categorical_columns] = df[categorical_columns].astype('category')
df['Gender'].replace({0.0:'Male', 1.0: 'Female'}, inplace=True)
df['Education_Level'].replace({0:'10+', 1:'12+', 2:'Graduate'}, inplace=True)

# Convert to datetime
df['MMM-YY'] = pd.to_datetime(df['MMM-YY'], format='%m/%d/%y')
df['Dateofjoining'] = pd.to_datetime(df['Dateofjoining'], format='%d/%m/%y')
df['LastWorkingDate'] = pd.to_datetime(df['LastWorkingDate'], format='%d/%m/%y')

# Rename 'MMM-YY' to 'ReportingMonthYear'


df.rename(columns={'MMM-YY':'ReportingMonthYear'}, inplace=True)
df['ReportingMonthYear'] = df['ReportingMonthYear'].dt.to_period('M')
df['ReportingYear'] = df['ReportingMonthYear'].dt.year

# Extract month and year from 'Dateofjoining'


df['Monthofjoining'] = df['Dateofjoining'].dt.month
df['Yearofjoining'] = df['Dateofjoining'].dt.year

# Find drivers who have churned


df['Churn'] = df.groupby('Driver_ID')['LastWorkingDate'].transform('last')
df['Churn'] = df['Churn'].apply(lambda x: 0 if pd.isnull(x) else 1)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ReportingMonthYear 19104 non-null period[M]
1 Driver_ID 19104 non-null int64
2 Age 19043 non-null float64
3 Gender 19052 non-null category
4 City 19104 non-null category
5 Education_Level 19104 non-null category
6 Income 19104 non-null int64
7 Dateofjoining 19104 non-null datetime64[ns]
8 LastWorkingDate 1616 non-null datetime64[ns]
9 Joining Designation 19104 non-null category
10 Grade 19104 non-null category
11 Total Business Value 19104 non-null int64
12 Quarterly Rating 19104 non-null int64
13 ReportingYear 19104 non-null int64
14 Monthofjoining 19104 non-null int32
15 Yearofjoining 19104 non-null int32
16 Churn 19104 non-null int64
dtypes: category(5), datetime64[ns](2), float64(1), int32(2), int64(6), period[M](1)
memory usage: 1.7 MB

In [7]: # look at the top 5 rows


df.head(5)

Out[7]:
ReportingMonthYear Driver_ID Age Gender City Education_Level Income Dateofjoining LastWorkingDate

0 2019-01 1 28.0 Male C23 Graduate 57387 2018-12-24 NaT

1 2019-02 1 28.0 Male C23 Graduate 57387 2018-12-24 NaT

2 2019-03 1 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03

3 2020-11 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT

4 2020-12 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT


4. Exploratory Data Analysis

4.1. Univariate analysis


In [8]: plt.figure(figsize=(10,4))
temp_df = df.groupby('ReportingMonthYear')['Driver_ID'].nunique()
ax = temp_df.plot(kind='bar')
ax.bar_label(ax.containers[0])
plt.ylabel('Number of unique drivers')
plt.title('Number of drivers reporting each month')
plt.show()

Insight
The month with the maximum number of reporting drivers is January 2019; a total of 1022 drivers
reported in January 2019
The count then dropped every month after January and has stayed stagnant at around 800 drivers
reporting each month

In [9]: fig, axs = plt.subplots(1,2,figsize=(10,4))


temp_df = df.groupby('Driver_ID').agg({'Age':'last'})['Age']
temp_df.plot(ax=axs[0], kind='hist', bins=35)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Age distribution of drivers')
plt.show()
Insight
There are drivers from different age groups ranging from 21 to 58 years
Most of the drivers are in the age group of 30 to 35
The distribution is approximately normal with a slight right skew

In [10]: temp_df = df.groupby('Driver_ID').agg({'Gender':'first'})


temp_df['Gender'].value_counts().plot(kind='pie', autopct='%.1f')
plt.title('Gender % distribution of drivers')
plt.ylabel('')
plt.show()

Insight
59% of the drivers are male and the remaining 41% are female
In [11]: plt.figure(figsize=(10,5))
temp_df = df.groupby('Driver_ID').agg({'City':'first'})
ax = temp_df['City'].value_counts().plot(kind='bar')
ax.bar_label(ax.containers[0])
plt.title('Distribution of drivers in different cities')
plt.show()

Insight
City C20 has the maximum number of drivers followed by city C15

In [12]: temp_df = df.groupby('Driver_ID').agg({'Education_Level':'first'})


temp_df['Education_Level'].value_counts().plot(kind='pie', autopct='%.1f')
plt.ylabel('')
plt.title('Education level % distribution of drivers')
plt.show()
Insight
Drivers are split almost equally across the three education levels

In [13]: fig, axs = plt.subplots(1,2,figsize=(12,4))


temp_df = df.groupby('Driver_ID').agg({'Income':'last'})['Income']
temp_df.plot(ax=axs[0], kind='hist', bins=35, rot=45)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Income distribution of drivers')
plt.show()

Insight
Most of the drivers have an average monthly income of 40k to 75k
The distribution is right skewed
In [14]: temp_df = df.groupby('Driver_ID').agg({'Dateofjoining':'first'})['Dateofjoining']
fig, axs = plt.subplots(1,2,figsize=(10,5))
temp_df.dt.year.value_counts().plot(kind='bar', ax=axs[0], xlabel='Year of joining', title='Drivers by year of joining')
temp_df.dt.month.value_counts().plot(kind='bar', ax=axs[1], xlabel='Month of joining', title='Drivers by month of joining')
plt.show()

Insight
The maximum number of drivers joined in the year 2020
The maximum number of drivers joined in the month of July

In [15]: ax = df.groupby('Driver_ID').agg({'Joining Designation':'first'})['Joining Designation'] \
    .value_counts().plot(kind='bar', title='Driver distribution based on joining designation')
ax.bar_label(ax.containers[0])
plt.show()
Insight
The largest group of drivers, 1026, has a joining designation of 1

In [16]: temp_df_1 = df.groupby('Driver_ID').agg({'Grade':'first'}).reset_index()
temp_df_1.rename(columns = {'Grade':'Grade_First'}, inplace=True)
temp_df_2 = df.groupby('Driver_ID').agg({'Grade':'last'}).reset_index()
temp_df_2.rename(columns = {'Grade':'Grade_Last'}, inplace=True)
temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
temp_df['Grade_Change'] = temp_df['Grade_Last'].astype('int') - temp_df['Grade_First'].astype('int')
fig, axs = plt.subplots(1,2,figsize=(10,4))
ax = temp_df['Grade_First'].value_counts().plot(kind='bar', ax=axs[0], title='Driver distribution based on first grade')
ax.bar_label(ax.containers[0])
ax = temp_df['Grade_Last'].value_counts().plot(kind='bar', ax=axs[1], title='Driver distribution based on last grade')
ax.bar_label(ax.containers[0])
plt.show()
ax = sns.countplot(data=temp_df, x = 'Grade_Change')
ax.set_title('Driver distribution based on change in grade')
ax.bar_label(ax.containers[0])
plt.show()
Insight
Most drivers have a grade of 2, and the grade doesn't change for the majority of drivers

In [17]: fig, axs = plt.subplots(1,2,figsize=(12,4))


temp_df = df.groupby('Driver_ID').agg({'Total Business Value':'sum'})['Total Business Value']
temp_df.plot(ax=axs[0], kind='hist', bins=100)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Distribution of drivers as per total business value')
plt.show()
Insight
It is very evident that many drivers have a total business value of 0, and there are also a few drivers
with a negative business value
The distribution is extremely right skewed

In [18]: temp_df_1 = df.groupby('Driver_ID').agg({'Quarterly Rating':'first'}).reset_index()


temp_df_1.rename(columns = {'Quarterly Rating':'QR_First'}, inplace=True)
temp_df_2 = df.groupby('Driver_ID').agg({'Quarterly Rating':'last'}).reset_index()
temp_df_2.rename(columns = {'Quarterly Rating':'QR_Last'}, inplace=True)
temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
temp_df['QR_Change'] = temp_df['QR_Last'].astype('int') - temp_df['QR_First'].astype('int')
fig, axs = plt.subplots(1,2,figsize=(10,4))
ax = temp_df['QR_First'].value_counts().plot(kind='bar', ax=axs[0], title='Driver distribution based on first QR')
ax.bar_label(ax.containers[0])
ax = temp_df['QR_Last'].value_counts().plot(kind='bar', ax=axs[1], title='Driver distribution based on last QR')
ax.bar_label(ax.containers[0])
plt.show()
ax = sns.countplot(data=temp_df, x = 'QR_Change')
ax.set_title('Driver distribution based on change in QR')
ax.bar_label(ax.containers[0])
plt.show()
Insight
The majority of drivers have a very low quarterly rating of 1
The change-in-QR plot shows that the majority of drivers see no change in their QR, but there is a
decent number of drivers with a positive change in QR and an equally decent number with a negative
change
There are no drivers with a QR of 5

In [19]: temp_df = df.groupby('Driver_ID').agg({'Churn':'first'})['Churn']


ax = temp_df.value_counts().plot(kind='bar', title='Driver distribution based on the churn status')
ax.bar_label(ax.containers[0])
plt.show()
(temp_df.value_counts(normalize=True)*100).round(0)
Out[19]: Churn
1    68.0
0    32.0
Name: proportion, dtype: float64

Insight
1616 drivers have churned, which is around 68%

4.2. Bivariate analysis


In [22]: driver_df = df.groupby('Driver_ID').agg({
'ReportingMonthYear' : len,
'Age' : 'last',
'Gender' : 'first',
'City' : 'first',
'Education_Level' : 'first',
'Income' : 'last',
'Dateofjoining' : 'first',
'LastWorkingDate' : 'last',
'Joining Designation' : 'first',
'Grade' : 'last',
'Total Business Value' : 'sum',
'Quarterly Rating' : 'last',
'Churn':'last'
}).reset_index()
driver_df.rename(columns={'ReportingMonthYear': 'Months of Service'}, inplace=True)
driver_df.head(10)

Out[22]:
   Driver_ID  Months of Service   Age  Gender City  Education_Level  Income Dateofjoining LastWorkingDate  Joining Designation  ...

0 1 3 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03 1


1 2 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT 2

2 4 5 43.0 Male C13 Graduate 65603 2019-07-12 2020-04-27 2

3 5 3 29.0 Male C9 10+ 46368 2019-09-01 2019-07-03 1

4 6 5 31.0 Female C11 12+ 78728 2020-07-31 NaT 3

5 8 3 34.0 Male C2 10+ 70656 2020-09-19 2020-11-15 3

6 11 1 28.0 Female C19 Graduate 42172 2020-07-12 NaT 1

7 12 6 35.0 Male C23 Graduate 28116 2019-06-29 2019-12-21 1

8 13 23 31.0 Male C19 Graduate 119227 2015-05-28 2020-11-25 1

9 14 3 39.0 Female C26 10+ 19734 2020-10-16 NaT 3

In [23]: drivers_with_2_year_service = driver_df[driver_df['Months of Service'] == 24]['Driver_ID']

In [24]: def calculate_change(df, column_name):
    temp_df_1 = df.groupby('Driver_ID').agg({column_name:'first'}).reset_index()
    first_column_name = column_name+'_First'
    temp_df_1.rename(columns = {column_name:first_column_name}, inplace=True)
    temp_df_2 = df.groupby('Driver_ID').agg({column_name:'last'}).reset_index()
    last_column_name = column_name+'_Last'
    temp_df_2.rename(columns = {column_name:last_column_name}, inplace=True)
    temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
    temp_df[column_name+'_Change'] = temp_df[last_column_name].astype('int') - temp_df[first_column_name].astype('int')
    temp_df.drop(columns=[first_column_name, last_column_name], inplace=True)
    return temp_df

In [25]: column_name = 'Income'


temp_df1 = calculate_change(df, 'Income')
driver_df = pd.merge(driver_df, temp_df1, on='Driver_ID')
temp_df2 = calculate_change(df, 'Grade')
driver_df = pd.merge(driver_df, temp_df2, on='Driver_ID')
temp_df3 = calculate_change(df, 'Quarterly Rating')
driver_df = pd.merge(driver_df, temp_df3, on='Driver_ID')
driver_df['Quarterly Rating Improved'] = driver_df['Quarterly Rating_Change'].apply(lambda x: 1 if x > 0 else 0)
driver_df.head()

Out[25]:
   Driver_ID  Months of Service   Age  Gender City  Education_Level  Income Dateofjoining LastWorkingDate  Joining Designation  ...

0 1 3 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03 1

1 2 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT 2

2 4 5 43.0 Male C13 Graduate 65603 2019-07-12 2020-04-27 2

3 5 3 29.0 Male C9 10+ 46368 2019-09-01 2019-07-03 1

4 6 5 31.0 Female C11 12+ 78728 2020-07-31 NaT 3

In [26]: driver_df['Income_Raise'] = driver_df['Income_Change'].apply(lambda x: 1 if x>0 else 0)

In [27]: fig, axs = plt.subplots(2,1,figsize=(10,6))


sns.histplot(ax = axs[0], data=driver_df, x='Months of Service', hue='Churn', stat="proportion", multiple="fill")
sns.histplot(ax = axs[1], data=driver_df, x='City', hue='Churn', stat="proportion", multiple="fill")
plt.tight_layout()
plt.show()
Insight
The churn rate is generally higher among drivers with fewer months of service and lower among drivers
with longer service, with the exception of 21, 22 and 23 months of service, where the churn rate seems
very high
The city C13 has the highest churn rate and city C29 has the lowest churn rate

In [28]: fig, axs = plt.subplots(1,3,figsize=(10,5))


sns.boxplot(ax=axs[0], data=driver_df, y='Age', hue='Churn', width=0.5, gap=0.2)
sns.boxplot(ax=axs[1], data=driver_df, y='Income', hue='Churn', width=0.5, gap=0.2)
sns.boxplot(ax=axs[2], data=driver_df, y='Total Business Value', hue='Churn', gap=0.2)
plt.tight_layout()
plt.show()
Insight
The median age of drivers who have churned is slightly lower than that of drivers who have not
churned
The median income of drivers who have churned is lower than that of drivers who have not churned
The median Total Business Value of drivers who have churned is lower than that of drivers who have
not churned
Some drivers who have churned also had a negative Total Business Value

In [29]: fig, axs = plt.subplots(2,2,figsize=(8,4),sharey=True)


sns.histplot(ax=axs[0,0], data=driver_df, x='Gender', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[0,1], data=driver_df, x='Education_Level', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1,0], data=driver_df, x='Joining Designation', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1,1], data=driver_df, x='Grade', hue='Churn', stat='proportion', multiple='fill')
plt.tight_layout()
plt.show()

Insight
The churn rate is almost equal for male and female drivers
The churn rate is almost equal for 10+ and Graduate drivers and slightly lower for 12+
The churn rate is lower for joining designation 3
The churn rate is lower for higher grades

In [30]: fig, axs = plt.subplots(1,2,figsize=(8,2),sharey=True)


sns.histplot(ax=axs[0], data=driver_df, x='Income_Raise', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1], data=driver_df, x='Grade_Change', hue='Churn', stat='proportion', multiple='fill')
plt.tight_layout()
plt.show()
plt.figure(figsize=(9,2))
sns.histplot(data=driver_df, x='Quarterly Rating_Change', hue='Churn', stat='proportion', multiple='fill')
plt.show()
plt.figure(figsize=(9,2))
sns.histplot(data=driver_df, x='Quarterly Rating Improved', hue='Churn', stat='proportion', multiple='fill')
plt.show()
Insight
The churn rate is much lower among drivers whose income has increased
The churn rate is much lower among drivers whose grade has increased
The churn rate is much lower among drivers whose Quarterly Rating has improved

In [31]: temp_df = df.groupby(['City', 'ReportingYear']).agg({'Quarterly Rating': 'mean'}).reset_index()
temp_df1 = pd.pivot_table(data=temp_df, index='City', columns='ReportingYear', values='Quarterly Rating').reset_index()
temp_df1.rename(columns={'ReportingYear':'index', 2019:'2019', 2020:'2020'}, inplace=True)
temp_df1['%change'] = (((temp_df1['2020'] - temp_df1['2019'])/temp_df1['2019'])*100).round(2)
plt.figure(figsize=(10,5))
sns.barplot(data=temp_df1, x='City', y='%change')
plt.tight_layout()
plt.grid(True)
plt.show()
Insight
The city C29 showed the most improvement in Quarterly Rating in 2020 compared to 2019

In [32]: plt.figure(figsize=(10,3))
sns.barplot(data=driver_df, x='Grade', y='Total Business Value', estimator='mean')
plt.show()
print('Mean of Total Business Value of drivers with grade 5:', driver_df[driver_df['Grade']==5]['Total Business Value'].mean())

Mean of Total Business Value of drivers with grade 5: 565760460

Insight
The mean of Total Business Value of drivers with grade 5 is higher than those with other grades

In [33]: def convert_to_year_quarter(x):
    year = str(x.year)
    month = x.month
    if(month >= 1 and month <= 3):
        return year+'-Q1'
    elif(month >= 4 and month <= 6):
        return year+'-Q2'
    elif(month >= 7 and month <= 9):
        return year+'-Q3'
    else:
        return year+'-Q4'

temp_df = df.copy()
temp_df['ReportingYearQuarter'] = temp_df['ReportingMonthYear'].apply(convert_to_year_quarter)
temp_df.head()
temp_driver_full_service_df = temp_df[temp_df['Driver_ID'].isin(drivers_with_2_year_service)]
plt.figure(figsize=(8,4))
sns.lineplot(data=temp_driver_full_service_df, x='ReportingYearQuarter', y='Quarterly Rating')
plt.xticks(rotation=90)
plt.show()

Insight
There is a dip in the quarterly rating in Q2, after which it increases in Q3
This pattern can be observed in both years

In [34]: temp_driver_full_service_df = temp_df[temp_df['Driver_ID'].isin(drivers_with_2_year_service)]

num_of_drivers = 20
count = 0
for driver_id in temp_driver_full_service_df['Driver_ID'].unique():
    if(count < num_of_drivers):
        count = count + 1
        sample_df = temp_driver_full_service_df[temp_driver_full_service_df['Driver_ID'] == driver_id]
        fig, axs = plt.subplots(2,1,figsize=(10, 5), sharex=True)
        sns.barplot(ax=axs[0], data=sample_df, x='ReportingMonthYear', y='Quarterly Rating')
        axs[0].tick_params(axis='x', rotation=90)
        sns.barplot(ax=axs[1], data=sample_df, x='ReportingMonthYear', y='Total Business Value')
        axs[1].tick_params(axis='x', rotation=90)
        fig.suptitle(f'Driver Id: {driver_id}')
        plt.tight_layout()
        plt.show()
    else:
        break
Insight
It can be observed that a significant drop in rating impacts the Total Business Value. A drop in rating
demotivates drivers, leading them to accept only a few rides or, in some cases, no rides at all, which in
turn reduces the Total Business Value

4.3. Multivariate analysis


In [35]: driver_df['Gender'].replace({'Male':0, 'Female':1}, inplace=True)
driver_df['Education_Level'].replace({'Graduate':0, '10+':1, '12+':2}, inplace=True)
driver_df['City'] = driver_df['City'].str[1:]

In [36]: plt.figure(figsize=(15,10))
sns.heatmap(driver_df.drop(columns=['Driver_ID', 'Dateofjoining', 'LastWorkingDate', 'Income_Change']).corr(), annot=True)
plt.tight_layout()
plt.show()
Insight
Months of Service and Total Business Value are highly correlated
Income and Grade are highly correlated
Joining Designation and Grade are highly correlated
Quarterly Rating and Months of Service are highly correlated
Churn is moderately correlated with Quarterly Rating, Total Business Value and Months of Service

5. Data Preprocessing
In [37]: driver_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Driver_ID 2381 non-null int64
1 Months of Service 2381 non-null int64
2 Age 2381 non-null float64
3 Gender 2381 non-null category
4 City 2381 non-null object
5 Education_Level 2381 non-null category
6 Income 2381 non-null int64
7 Dateofjoining 2381 non-null datetime64[ns]
8 LastWorkingDate 1616 non-null datetime64[ns]
9 Joining Designation 2381 non-null category
10 Grade 2381 non-null category
11 Total Business Value 2381 non-null int64
12 Quarterly Rating 2381 non-null int64
13 Churn 2381 non-null int64
14 Income_Change 2381 non-null int32
15 Grade_Change 2381 non-null int32
16 Quarterly Rating_Change 2381 non-null int32
17 Quarterly Rating Improved 2381 non-null int64
18 Income_Raise 2381 non-null int64
dtypes: category(4), datetime64[ns](2), float64(1), int32(3), int64(8), object(1)
memory usage: 261.2+ KB

Insight
The columns Driver_ID, Gender, City, Education_Level, Dateofjoining and LastWorkingDate can be
dropped as they do not contribute towards predicting driver churn

In [38]: driver_df.drop(columns=['Driver_ID', 'Gender', 'City', 'Education_Level', 'Dateofjoining',
                                 'LastWorkingDate', 'Income_Change'], inplace=True)

driver_df['Quarterly Rating'] = driver_df['Quarterly Rating'].astype('category')
driver_df['Churn'] = driver_df['Churn'].astype('category')
driver_df['Grade_Change'] = driver_df['Grade_Change'].astype('category')
driver_df['Quarterly Rating_Change'] = driver_df['Quarterly Rating_Change'].astype('category')
driver_df['Income_Raise'] = driver_df['Income_Raise'].astype('category')
driver_df.head()

Out[38]:
   Months of Service   Age  Income  Joining Designation  Grade  Total Business Value  Quarterly Rating  Churn  Grade_Change  Quarterly Rating_Change  ...
0 3 28.0 57387 1 1 1715580 2 1 0 0

1 2 31.0 67016 2 2 0 1 0 0 0

2 5 43.0 65603 2 2 350000 1 1 0 0

3 3 29.0 46368 1 1 120360 1 1 0 0

4 5 31.0 78728 3 3 1265000 2 0 0 1

In [39]: driver_df.duplicated().value_counts()

Out[39]: False    2381
Name: count, dtype: int64

Insight
There are no duplicates

5.1. Handling null values


In [40]: driver_df.isna().sum()

Out[40]:
Months of Service            0
Age                          0
Income                       0
Joining Designation          0
Grade                        0
Total Business Value         0
Quarterly Rating             0
Churn                        0
Grade_Change                 0
Quarterly Rating_Change      0
Quarterly Rating Improved    0
Income_Raise                 0
dtype: int64

Insight
There are no missing data or null values

5.2. Outlier Treatment


In [41]: # helper function to detect outliers using IQR method
def detectOutliers_iqr(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower_outliers = df[df < (q1 - 1.5*iqr)]
    higher_outliers = df[df > (q3 + 1.5*iqr)]
    return lower_outliers, higher_outliers

# helper function to detect outliers using standard deviation method
def detectOutliers_std(df):
    mean = df.mean()
    std = df.std()
    upper_limit = mean + (3*std)
    lower_limit = mean - (3*std)
    lower_outliers = df[df < lower_limit]
    higher_outliers = df[df > upper_limit]
    return lower_outliers, higher_outliers

In [42]: numerical_columns = driver_df.select_dtypes(include=np.number).columns

column_outlier_dictionary = {}
for column in numerical_columns:
    lower_outliers, higher_outliers = detectOutliers_iqr(driver_df[column])
    column_outlier_dictionary[column] = [lower_outliers, higher_outliers]
    #print('*'*50)
    #print(f'Outliers of \'{column}\' column are:')
    #print("Lower outliers:\n", lower_outliers)
    #print("Higher outliers:\n", higher_outliers)
    #print('*'*50, end="\n")

In [43]: for key, value in column_outlier_dictionary.items():
    print(f'The column \'{key}\' has {len(value[0]) + len(value[1])} outliers')

The column 'Months of Service' has 249 outliers


The column 'Age' has 25 outliers
The column 'Income' has 48 outliers
The column 'Total Business Value' has 336 outliers
The column 'Quarterly Rating Improved' has 358 outliers

In [44]: numerical_columns = driver_df.select_dtypes(include=np.number).columns

column_outlier_dictionary = {}
for column in numerical_columns:
    lower_outliers, higher_outliers = detectOutliers_std(driver_df[column])
    column_outlier_dictionary[column] = [lower_outliers, higher_outliers]
    #print('*'*50)
    #print(f'Outliers of \'{column}\' column are:')
    #print("Lower outliers:\n", lower_outliers)
    #print("Higher outliers:\n", higher_outliers)
    #print('*'*50, end="\n")

In [45]: for key, value in column_outlier_dictionary.items():
    print(f'The column \'{key}\' has {len(value[0]) + len(value[1])} outliers')

The column 'Months of Service' has 0 outliers


The column 'Age' has 14 outliers
The column 'Income' has 16 outliers
The column 'Total Business Value' has 64 outliers
The column 'Quarterly Rating Improved' has 0 outliers

Insight
I will keep the outliers in the Age and Income columns as they are few in number
I will cap the outliers in the Total Business Value column, since drivers with a very high business value
do not usually churn

In [46]: mean = driver_df['Total Business Value'].mean()
std = driver_df['Total Business Value'].std()
upper_limit = mean + (3*std)
# cap outliers at mean + 3*std
driver_df['Total Business Value'] = driver_df['Total Business Value'].apply(lambda x: x if x <= upper_limit else upper_limit)

5.3. Multicollinearity Check


In [47]: features_df = driver_df.drop(columns=['Churn'])  # Drop target column
features_df = features_df.drop(columns=features_df.select_dtypes(include='category').columns)  # Keep only numerical features
features_df = sm.add_constant(features_df)  # Adding a constant column for the intercept
vif_df = pd.DataFrame()
vif_df['Features'] = features_df.columns
vif_df['VIF'] = [variance_inflation_factor(features_df.values, idx) for idx in range(len(features_df.columns))]
vif_df['VIF'] = round(vif_df['VIF'], 2)
vif_df = vif_df.sort_values(by='VIF', ascending=False)
vif_df

Out[47]: Features VIF

0 const 36.42

4 Total Business Value 3.99

1 Months of Service 3.83

3 Income 1.19

2 Age 1.12

5 Quarterly Rating Improved 1.09

Insight
Based on the above VIF scores, I can conclude that there are no multicollinear numerical features

5.4. Encode categorical variables


In [48]: final_df = driver_df.copy()

Separate out target and feature columns

In [49]: X = final_df.drop(columns=['Churn'])
y = final_df['Churn']
X.shape, y.shape

Out[49]: ((2381, 11), (2381,))

Encode target variable

In [50]: y = y.astype(int)

Encode features with just 2 classes as 0 or 1

In [51]: X[['Grade_Change','Quarterly Rating_Change', 'Income_Raise']] = X[['Grade_Change','Quarterly Rating_Change', 'Income_Raise']].astype('int8')

One-Hot-Encoding for remaining categorical features

In [52]: X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Months of Service 2381 non-null int64
1 Age 2381 non-null float64
2 Income 2381 non-null int64
3 Joining Designation 2381 non-null category
4 Grade 2381 non-null category
5 Total Business Value 2381 non-null float64
6 Quarterly Rating 2381 non-null category
7 Grade_Change 2381 non-null int8
8 Quarterly Rating_Change 2381 non-null int8
9 Quarterly Rating Improved 2381 non-null int64
10 Income_Raise 2381 non-null int8
dtypes: category(3), float64(2), int64(3), int8(3)
memory usage: 107.7 KB

In [53]: categorical_columns = X.select_dtypes(include='category').columns
categorical_columns

Out[53]: Index(['Joining Designation', 'Grade', 'Quarterly Rating'], dtype='object')

In [54]: encoder = OneHotEncoder(sparse_output=False)


encoded_data = encoder.fit_transform(X[categorical_columns])
encoded_df = pd.DataFrame(encoded_data, columns = encoder.get_feature_names_out(categorical_columns))
X = pd.concat([X, encoded_df], axis=1)
X.drop(columns = categorical_columns, inplace=True)
X.head()

Out[54]:
   Months of Service   Age  Income  Total Business Value  Grade_Change  Quarterly Rating_Change  Quarterly Rating Improved  Income_Raise  Joining Designation_1  ...

0 3 28.0 57387 1715580.0 0 0 0 0 1.0

1 2 31.0 67016 0.0 0 0 0 0 0.0

2 5 43.0 65603 350000.0 0 0 0 0 0.0

3 3 29.0 46368 120360.0 0 0 0 0 1.0

4 5 31.0 78728 1265000.0 0 1 1 0 0.0

5 rows × 22 columns
6. Model building

6.1. Train-test split


In [55]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Out[55]: ((1904, 22), (477, 22), (1904,), (477,))

6.2. Perform data normalization/standardization


Data normalization/standardization is required so that features with larger scales do not dominate the
model; hence all features should be brought to the same scale.
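As a rough worked example (not from the original notebook), min-max scaling maps each value to (x - x_min) / (x_max - x_min). Using the overall Income minimum and maximum from the describe() output in section 3 as an approximation of the training-set range:

# Illustrative min-max calculation for the first training row shown below (Income = 57164)
income = 57164
income_min, income_max = 10747, 188418   # from df.describe() in section 3 (approximates the train range)
print((income - income_min) / (income_max - income_min))   # ~0.261, matching the scaled table further below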

Data before scaling

In [56]: X_train.head()

Out[56]:
      Months of Service   Age  Income  Total Business Value  Grade_Change  Quarterly Rating_Change  Quarterly Rating Improved  Income_Raise  Joining Designation_1  ...

2236 7 28.0 57164 1092560.0 0 0 0 0 0.0

6 1 28.0 42172 0.0 0 0 0 0 1.0

1818 1 29.0 43989 0.0 0 0 0 0 1.0

1534 7 40.0 59636 2589640.0 0 0 0 0 0.0

2123 6 25.0 29052 2172260.0 0 0 0 0 1.0

5 rows × 22 columns

In [57]: min_max_scaler = MinMaxScaler()


# Fit min_max_scaler to training data
min_max_scaler.fit(X_train)
# Scale the training and testing data
X_train = pd.DataFrame(min_max_scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(min_max_scaler.transform(X_test), columns=X_test.columns)

Data after scaling

In [58]: X_train.head()

Out[58]:
   Months of Service   Age  Income  Total Business Value  Grade_Change  Quarterly Rating_Change  Quarterly Rating Improved  Income_Raise  Joining Designation_1  ...

0 0.260870 0.205882 0.261253 0.074297 0.0 0.5 0.0 0.0 0.0

1 0.000000 0.205882 0.176872 0.041541 0.0 0.5 0.0 0.0 1.0

2 0.000000 0.235294 0.187099 0.041541 0.0 0.5 0.0 0.0 1.0

3 0.260870 0.558824 0.275166 0.119183 0.0 0.5 0.0 0.0 0.0

4 0.217391 0.117647 0.103028 0.106669 0.0 0.5 0.0 0.0 1.0


5 rows × 22 columns

Check for imbalance in target class

In [59]: y_train.value_counts(normalize=True)*100

Out[59]: Churn
1    68.644958
0    31.355042
Name: proportion, dtype: float64

We can see a clear imbalance in the target class with 1 being ~69% and 0 being ~31%. Hence, I will use
SMOTE to fix this imbalance

In [60]: sm = SMOTE(random_state=0)
X_train, y_train = sm.fit_resample(X_train, y_train)
y_train.value_counts(normalize=True)*100

Out[60]: Churn
1    50.0
0    50.0
Name: proportion, dtype: float64

6.3. Ensemble Learning: Bagging - RandomForestClassifier


A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of
the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The
hyper-parameters of the random forest classifier will be selected using grid search cross validation.

In [61]: # Define parameter grid


param_grid = {
'n_estimators': list(range(100, 1000, 100)),
'max_features': ['sqrt', 'log2'],
'max_depth': list(range(10, 100, 10)),
'min_samples_split': list(range(2, 10, 1))
}

# Initialize classifier and GridSearchCV
rf = RandomForestClassifier()
rf_random = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)

# Fit the model


rf_random.fit(X_train, y_train)

# Evaluate best parameters


print("Best parameters found: ", rf_random.best_params_)

Fitting 3 folds for each of 1296 candidates, totalling 3888 fits


Best parameters found: {'max_depth': 50, 'max_features': 'sqrt', 'min_samples_split':
2, 'n_estimators': 200}

In [62]: color = '\033[91m'


bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = rf_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = rf_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))

Train data:
Accuracy: 1.0
Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 1307


1 1.00 1.00 1.00 1307

accuracy 1.00 2614


macro avg 1.00 1.00 1.00 2614
weighted avg 1.00 1.00 1.00 2614

Test data:
Accuracy: 0.7777777777777778
Classification Report:
precision recall f1-score support

0 0.70 0.65 0.67 168


1 0.82 0.85 0.83 309

accuracy 0.78 477


macro avg 0.76 0.75 0.75 477
weighted avg 0.77 0.78 0.78 477

Insight
The training accuracy is 1 whereas testing accuracy is 0.778. This is a case of overfitting.
The best parameters found are well within the provided range

6.3.1 Feature Importance Plot


In [63]: def plot_feature_importance(estimator, features):
    # Extract feature importances
    importances = estimator.feature_importances_

    # Create a DataFrame for plotting
    feature_importance_df = pd.DataFrame({'Feature':features, 'Importance':importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

    # Plot feature importance
    plt.figure(figsize=(8,5))
    sns.barplot(data=feature_importance_df, x='Importance', y='Feature')
    plt.title('Feature Importance')
    plt.show()

In [64]: plot_feature_importance(rf_random.best_estimator_, X_train.columns)


6.3.2 Confusion Matrix
In [65]: def display_confusion_matrix(y_test, y_pred):
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Plot confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(cmap=plt.cm.Blues)
    plt.title('Confusion Matrix')
    plt.show()

In [66]: display_confusion_matrix(y_test, y_pred)


6.3.3 ROC Curve
In [67]: def plot_roc_curve(estimator, X_train, X_test, y_train, y_test):
    # Binarize the output
    y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])
    n_classes = y_test_binarized.shape[1]-1

    # Compute ROC curve and ROC area for each class
    classifier = OneVsRestClassifier(estimator)
    y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Plot ROC curve for each class
    plt.figure(figsize=(5, 5))
    for i in range(n_classes):
        plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.show()

In [68]: plot_roc_curve(rf_random.best_estimator_, X_train, X_test, y_train, y_test)


6.3.4 Precision-Recall Curve
In [69]: def plot_pr_curve(estimator, X_train, X_test, y_train, y_test):
    # Binarize the output
    y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])
    n_classes = y_test_binarized.shape[1]-1

    # Compute Precision-Recall curve and average precision for each class
    classifier = OneVsRestClassifier(estimator)
    y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

    # For each class
    precision = dict()
    recall = dict()
    average_precision = dict()
    for i in range(n_classes):
        precision[i], recall[i], _ = precision_recall_curve(y_test_binarized[:, i], y_score[:, i])
        average_precision[i] = average_precision_score(y_test_binarized[:, i], y_score[:, i])

    # Plot Precision-Recall curve for each class
    plt.figure(figsize=(5, 5))
    for i in range(n_classes):
        plt.plot(recall[i], precision[i], label='PR curve of class {0} (area = {1:0.2f})'.format(i, average_precision[i]))

    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall Curve')
    plt.legend(loc='lower right')
    plt.show()

In [70]: plot_pr_curve(rf_random.best_estimator_, X_train, X_test, y_train, y_test)


Insight
The top 5 features as per the RandomForestClassifier are: Months of Service, Income, Total Business
Value, Age and Quarterly Rating_1
Both classes 0 and 1 have a decent area under the ROC curve of 0.85
The area under the PR curve is 0.77 for class 0 and 0.90 for class 1

6.4. Ensemble Learning: Boosting - GradientBoostingClassifier

This algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of
arbitrary differentiable loss functions. In each stage, n_classes regression trees are fit on the negative
gradient of the loss function, e.g. binary or multiclass log loss. The hyper-parameters of the
GradientBoostingClassifier will be selected using random search cross validation.

In [71]: # Define parameter grid


param_grid = {
'n_estimators': np.arange(100, 1001, 100),
'learning_rate': np.logspace(-3, 0, 10),
'max_depth': np.arange(3, 11, 1),
'min_samples_split': np.arange(2, 21, 2),
'min_samples_leaf': np.arange(1, 21, 2),
'subsample': np.linspace(0.5, 1.0, 6)
}

# Initialize classifier and RandomizedSearchCV


gb = GradientBoostingClassifier()
gb_random = RandomizedSearchCV(estimator=gb, param_distributions=param_grid,
n_iter=300, cv=3, verbose=2, random_state=42, n_jobs=-1)

# Fit the model


gb_random.fit(X_train, y_train)

# Evaluate best parameters


print("Best parameters found for GradientBoostingClassifier: ", gb_random.best_params_)

Fitting 3 folds for each of 300 candidates, totalling 900 fits


Best parameters found for GradientBoostingClassifier: {'subsample': 0.8, 'n_estimator
s': 500, 'min_samples_split': 20, 'min_samples_leaf': 3, 'max_depth': 10, 'learning_rat
e': 0.021544346900318832}

In [72]: color = '\033[91m'


bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = gb_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = gb_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))

Train data:
Accuracy: 1.0
Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 1307


1 1.00 1.00 1.00 1307

accuracy 1.00 2614


macro avg 1.00 1.00 1.00 2614
weighted avg 1.00 1.00 1.00 2614

Test data:
Accuracy: 0.7861635220125787
Classification Report:
precision recall f1-score support

0 0.73 0.63 0.68 168


1 0.81 0.87 0.84 309

accuracy 0.79 477


macro avg 0.77 0.75 0.76 477
weighted avg 0.78 0.79 0.78 477

Insight
The training accuracy is 1 whereas testing accuracy is 0.786. This is also a case of overfitting.

6.4.1 Performance
In [73]: plot_feature_importance(gb_random.best_estimator_, X_train.columns)
display_confusion_matrix(y_test, y_pred)
plot_roc_curve(gb_random.best_estimator_, X_train, X_test, y_train, y_test)
plot_pr_curve(gb_random.best_estimator_, X_train, X_test, y_train, y_test)
Insight
The top 5 features as per the GradientBoostingClassifier are: Quarterly Rating_1, Months of Service,
Income, Total Business Value and Age
Both classes 0 and 1 have a decent area under the ROC curve of 0.85
The area under the PR curve is 0.79 for class 0 and 0.91 for class 1

6.5. Ensemble Learning: Boosting - XGBClassifier


XGBClassifier is a highly optimized version of GBM. It includes regularization to prevent overfitting and
various other enhancements. The hyper-parameters of the XGBClassifier will be selected using random
search cross validation.

In [74]: # Define parameter grid


param_grid = {
'n_estimators': np.arange(100, 1001, 100),
'learning_rate': np.logspace(-3, 0, 10),
'max_depth': np.arange(3, 11, 1),
'min_child_weight': np.arange(1, 11, 1),
'gamma': np.logspace(-3, 1, 10),
'subsample': np.linspace(0.5, 1.0, 6),
'colsample_bytree': np.linspace(0.5, 1.0, 6)
}

# Initialize classifier and RandomizedSearchCV


xgb = XGBClassifier(eval_metric='mlogloss')
xgb_random = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid,
n_iter=300, cv=3, verbose=2, random_state=42, n_jobs=-1)

# Fit the model


xgb_random.fit(X_train, y_train)

# Evaluate best parameters


print("Best parameters found for XGBoost: ", xgb_random.best_params_)

Fitting 3 folds for each of 300 candidates, totalling 900 fits


Best parameters found for XGBoost: {'subsample': 0.8, 'n_estimators': 200, 'min_child_w
eight': 1, 'max_depth': 7, 'learning_rate': 0.046415888336127774, 'gamma': 0.00774263682
6811269, 'colsample_bytree': 0.5}

In [75]: color = '\033[91m'


bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = xgb_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = xgb_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))

Train data:
Accuracy: 0.9391736801836267
Classification Report:
precision recall f1-score support

0 0.96 0.92 0.94 1307


1 0.92 0.96 0.94 1307

accuracy 0.94 2614


macro avg 0.94 0.94 0.94 2614
weighted avg 0.94 0.94 0.94 2614

Test data:
Accuracy: 0.8134171907756813
Classification Report:
precision recall f1-score support

0 0.76 0.68 0.72 168


1 0.84 0.88 0.86 309

accuracy 0.81 477


macro avg 0.80 0.78 0.79 477
weighted avg 0.81 0.81 0.81 477

Insight
The training accuracy has reduced to 0.939 whereas testing accuracy has slightly increased to 0.813.
This is still a case of overfitting but better than all the previous models.
This model is also faster than the previous models

6.5.1 Performance
In [76]: plot_feature_importance(xgb_random.best_estimator_, X_train.columns)
display_confusion_matrix(y_test, y_pred)
plot_roc_curve(xgb_random.best_estimator_, X_train, X_test, y_train, y_test)
plot_pr_curve(xgb_random.best_estimator_, X_train, X_test, y_train, y_test)
Insight
The top 5 features as per the XGBClassifier are: Quarterly Rating_1, Quarterly Rating Improved,
Quarterly Rating_4, Months of Service and Quarterly Rating_Change
Both classes 0 and 1 have a decent area under the ROC curve of 0.86
The area under the PR curve is 0.81 for class 0 and 0.90 for class 1

7. Insights
Most of the drivers are in the age group of 30 to 35
59% of the drivers are male and the remaining 41% are female
City C20 has the maximum number of drivers
The maximum number of drivers joined in the year 2020 and in the month of July
1026 drivers have a joining designation of 1
Most drivers have a grade of 2
The majority of drivers have a very low quarterly rating of 1
There are no drivers with a quarterly rating of 5
1616 drivers have churned, which is around 68%
The median income of drivers who have churned is lower than that of drivers who have not churned
The churn rate is much lower among drivers whose income has increased
The churn rate is much lower among drivers whose grade has increased
The churn rate is much lower among drivers whose quarterly rating has improved
8. Recommendation
The quarterly rating has been the top contributor in deciding whether a driver will churn. As the ratings
are given by customers, Ola should urge all customers to rate drivers on time and should provide
incentives/points to customers to encourage timely rating.
Ola should make sure that the income of deserving drivers is increased every 6 months, if not every
quarter, to encourage drivers to stay.
Long-service awards/bonuses should be given to drivers to keep them motivated.
Special training should be given to drivers on how to handle different customers and situations so that
customers consistently provide positive ratings.
A short scoring sketch showing how the trained model could support these measures follows below.
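As a hedged illustration (not part of the original analysis), the tuned model from section 6.5 could be used to rank the still-active drivers by predicted churn probability, so that retention measures are targeted at the highest-risk drivers first. This sketch assumes the objects X, y, min_max_scaler and xgb_random from section 6 are still in scope; churn_risk and at_risk are hypothetical names.

# Score all drivers with the tuned XGBoost model and list the active (not-yet-churned)
# drivers with the highest predicted churn probability.
X_scaled = pd.DataFrame(min_max_scaler.transform(X), columns=X.columns)
churn_risk = pd.Series(xgb_random.predict_proba(X_scaled)[:, 1],
                       index=X.index, name='Churn_Probability')
at_risk = churn_risk[y == 0].sort_values(ascending=False)   # drivers still active
print(at_risk.head(10))   # top 10 drivers to prioritise for retention measures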

9. Questionnaire
9.1 What percentage of drivers have received a quarterly rating of 5?
Ans: No drivers have received a quarterly rating of 5

9.2 Comment on the correlation between Age and Quarterly Rating.


Ans: Age and Quarterly Rating are not strongly correlated; the correlation is only about 0.15
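As a minimal sketch (not from the original notebook) of how this figure could be checked, the per-driver correlation can be computed directly; the per_driver name is illustrative.

# Illustrative check of the Age vs. Quarterly Rating correlation at the driver level
per_driver = df.groupby('Driver_ID').agg({'Age': 'last', 'Quarterly Rating': 'last'})
print(per_driver['Age'].corr(per_driver['Quarterly Rating']))   # roughly 0.15 per the answer above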

9.3 Name the city which showed the most improvement in Quarterly
Rating over the past year
Ans: The city C29 showed the most improvement in Quarterly Rating in 2020 compared to 2019

9.4 Drivers with a Grade of ‘A’ are more likely to have a higher Total
Business Value. (T/F)
Ans: True. The mean Total Business Value of drivers with grade 5 (the highest grade, interpreted as 'A') is
higher than that of drivers with other grades

9.5 If a driver's Quarterly Rating drops significantly, how does it impact
their Total Business Value in the subsequent period?
Ans: A significant drop in rating leads to a dip in the Total Business Value in the subsequent period. A drop
in rating demotivates the driver, leading to accepting only a few rides or, in some cases, no rides at all, and
hence impacting the Total Business Value

9.6 From Ola's perspective, which metric should be the primary focus for
driver retention? 1. ROC AUC, 2. Precision, 3. Recall, 4. F1 Score
Ans: Recall. It is acceptable if the model flags some drivers who would stay as Churn, but it should not
misclassify drivers who will actually churn as Not Churn, since those drivers would then be missed by
retention measures
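A minimal sketch (not in the original notebook) of how recall on the churn class can be read off the test-set predictions, assuming the fitted xgb_random, the test split from section 6 and the confusion_matrix import from section 2:

# Recall on the churn class = share of actual churners the model catches
y_pred = xgb_random.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Recall (churn class):   ', tp / (tp + fn))
print('Precision (churn class):', tp / (tp + fp))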

9.7 How does the gap in precision and recall affect Ola's relationship
with its drivers and customers?
Ans: A gap between precision and recall implies that the numbers of False Negatives and False Positives are
very different. If more instances of Churn are misclassified as Not Churn, customers may get drivers who
are unmotivated or unsatisfied, leading to a bad customer experience. On the other hand, if more instances
of Not Churn are misclassified as Churn, then well-performing drivers may be neglected, leading to driver
dissatisfaction.

9.8 Besides the obvious features like "Number of Rides", which lesser-
discussed features might have a strong impact on a driver's Quarterly
Rating?
Ans: 1) Customers not providing timely ratings, or providing unfair ratings, have a strong impact on
high-performing drivers and their quarterly rating. 2) Lack of training for the driver on handling different
situations can also impact their quarterly rating. Not all customers are the same, so the driver needs to
adapt their behaviour to the customer.

9.9 Will the driver's performance be affected by the City they operate in?
(Yes/No)
Ans: Yes. It might be the case that the customers of a city share a particular mindset: the people of one
city could be more accommodating and consistently provide good ratings, while the people of a different
city could get irritated easily and provide bad ratings

9.10 Analyze any seasonality in the driver's ratings. Do certain times of
the year correspond to higher or lower ratings, and why might that be?
Ans: Yes, there is seasonality in the drivers' ratings. The ratings dip in Q2 and then rise in Q3. This
could be because of the holiday season in Q2, when many people move out of the cities for vacation and
hence use cabs less.
