Ola Case Study
Introduction
Ola is a leading ride-sharing platform that aims to provide reliable, affordable, and convenient urban
transportation for everyone. A constant challenge Ola faces is the churn rate of its drivers: ensuring driver
loyalty and reducing attrition are crucial to the company's operations. Analyzing driver data can reveal
patterns in driver behavior, performance, and satisfaction, which helps in foreseeing potential churn and
allows proactive measures. By leveraging data science and ensemble learning, Ola can predict driver
churn, which would be pivotal in its driver retention strategy.
What is expected
Assuming you are a data scientist at Ola, you are entrusted with the responsibility of analyzing the dataset
to predict driver attrition. Your primary goal is to utilize ensemble learning techniques, evaluate the
performance of your models, and provide actionable insights to reduce driver churn.
1. Data
The analysis was done on the data located at -
https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/492/original/ola_driver_scaler.csv
2. Libraries
Below are the libraries required
In [1]: # libraries to analyze data
import numpy as np
import pandas as pd
# libraries to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, label_binarize
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc, precision_recall_curve, average_precision_score
from sklearn.multiclass import OneVsRestClassifier
from imblearn.over_sampling import SMOTE
from scipy.stats import randint
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
3. Data Loading
Loading the data into a pandas DataFrame for easy handling of the data
In [2]: # read the file into a pandas dataframe
df = pd.read_csv('ola_driver_scaler.csv')
# look at the datatypes of the columns
print('*************************************************')
print(df.info())
print('*************************************************\n')
print('*************************************************')
print(f'Shape of the dataset is {df.shape}')
print('*************************************************\n')
print('*************************************************')
print(f'Number of nan/null values in each column: \n{df.isna().sum()}')
print('*************************************************\n')
print('*************************************************')
print(f'Number of unique values in each column: \n{df.nunique()}')
print('*************************************************\n')
print('*************************************************')
print(f'Duplicate entries: \n{df.duplicated().value_counts()}')
*************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 19104 non-null int64
1 MMM-YY 19104 non-null object
2 Driver_ID 19104 non-null int64
3 Age 19043 non-null float64
4 Gender 19052 non-null float64
5 City 19104 non-null object
6 Education_Level 19104 non-null int64
7 Income 19104 non-null int64
8 Dateofjoining 19104 non-null object
9 LastWorkingDate 1616 non-null object
10 Joining Designation 19104 non-null int64
11 Grade 19104 non-null int64
12 Total Business Value 19104 non-null int64
13 Quarterly Rating 19104 non-null int64
dtypes: float64(2), int64(8), object(4)
memory usage: 2.0+ MB
None
*************************************************
*************************************************
Shape of the dataset is (19104, 14)
*************************************************
*************************************************
Number of nan/null values in each column:
Unnamed: 0 0
MMM-YY 0
Driver_ID 0
Age 61
Gender 52
City 0
Education_Level 0
Income 0
Dateofjoining 0
LastWorkingDate 17488
Joining Designation 0
Grade 0
Total Business Value 0
Quarterly Rating 0
dtype: int64
*************************************************
*************************************************
Number of unique values in each column:
Unnamed: 0 19104
MMM-YY 24
Driver_ID 2381
Age 36
Gender 2
City 29
Education_Level 3
Income 2383
Dateofjoining 869
LastWorkingDate 493
Joining Designation 5
Grade 5
Total Business Value 10181
Quarterly Rating 4
dtype: int64
*************************************************
*************************************************
Duplicate entries:
False 19104
Name: count, dtype: int64
In [3]: # look at the top 20 rows
df.head(20)
Out[3]:
Unnamed: 0   MMM-YY   Driver_ID   Age   Gender   City   Education_Level   Income   Dateofjoining   LastWorkingDate
0 0 01/01/19 1 28.0 0.0 C23 2 57387 24/12/18 NaN
1 1 02/01/19 1 28.0 0.0 C23 2 57387 24/12/18 NaN
2 2 03/01/19 1 28.0 0.0 C23 2 57387 24/12/18 03/11/19
3 3 11/01/20 2 31.0 0.0 C7 2 67016 11/06/20 NaN
4 4 12/01/20 2 31.0 0.0 C7 2 67016 11/06/20 NaN
5 5 12/01/19 4 43.0 0.0 C13 2 65603 12/07/19 NaN
6 6 01/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN
7 7 02/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN
8 8 03/01/20 4 43.0 0.0 C13 2 65603 12/07/19 NaN
9 9 04/01/20 4 43.0 0.0 C13 2 65603 12/07/19 27/04/20
10 10 01/01/19 5 29.0 0.0 C9 0 46368 01/09/19 NaN
11 11 02/01/19 5 29.0 0.0 C9 0 46368 01/09/19 NaN
12 12 03/01/19 5 29.0 0.0 C9 0 46368 01/09/19 03/07/19
13 13 08/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN
14 14 09/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN
15 15 10/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN
16 16 11/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN
17 17 12/01/20 6 31.0 1.0 C11 1 78728 31/07/20 NaN
18 18 09/01/20 8 34.0 0.0 C2 0 70656 19/09/20 NaN
19 19 10/01/20 8 34.0 0.0 C2 0 70656 19/09/20 NaN
In [4]: df.describe()
Out[4]:          Unnamed: 0     Driver_ID           Age        Gender  Education_Level        Income  Joining Designation
count  19104.000000  19104.000000  19043.000000  19052.000000     19104.000000  19104.000000         19104.000000
mean 9551.500000 1415.591133 34.668435 0.418749 1.021671 65652.025126 1.690536
std 5514.994107 810.705321 6.257912 0.493367 0.800167 30914.515344 0.836984
min 0.000000 1.000000 21.000000 0.000000 0.000000 10747.000000 1.000000
25% 4775.750000 710.000000 30.000000 0.000000 0.000000 42383.000000 1.000000
50% 9551.500000 1417.000000 34.000000 0.000000 1.000000 60087.000000 1.000000
75% 14327.250000 2137.000000 39.000000 1.000000 2.000000 83969.000000 2.000000
max 19103.000000 2788.000000 58.000000 1.000000 2.000000 188418.000000 5.000000
In [5]: df.describe(include='object')
Out[5]: MMM-YY City Dateofjoining LastWorkingDate
count 19104 19104 19104 1616
unique 24 29 869 493
top 01/01/19 C20 23/07/15 29/07/20
freq 1022 1008 192 70
Insight
There are 19104 entries with 14 columns
There are 61 null/missing values in Age, 52 in Gender and 17488 in LastWorkingDate
There are 2381 unique drivers
There are no duplicates
The column Unnamed: 0 can be dropped as it doesn't provide any new information
The columns Gender, City, Education_Level, Joining Designation, Grade and Quarterly Rating can
be converted to categorical datatype
The columns MMM-YY, Dateofjoining and LastWorkingDate can be converted to datetime datatype
Drivers who have valid LastWorkingDate can be considered as churned
In [6]: # Drop "Unnamed: 0" column
df.drop(columns=['Unnamed: 0'], inplace=True)
# Convert to category
categorical_columns = ['Gender', 'City', 'Education_Level', 'Joining Designation', 'Grade']
df[categorical_columns] = df[categorical_columns].astype('category')
df['Gender'].replace({0.0:'Male', 1.0: 'Female'}, inplace=True)
df['Education_Level'].replace({0:'10+', 1:'12+', 2:'Graduate'}, inplace=True)
# Convert to datetime
df['MMM-YY'] = pd.to_datetime(df['MMM-YY'], format='%m/%d/%y')
df['Dateofjoining'] = pd.to_datetime(df['Dateofjoining'], format='%d/%m/%y')
df['LastWorkingDate'] = pd.to_datetime(df['LastWorkingDate'], format='%d/%m/%y')
# Rename 'MMM-YY' to 'ReportingMonthYear'
df.rename(columns={'MMM-YY':'ReportingMonthYear'}, inplace=True)
df['ReportingMonthYear'] = df['ReportingMonthYear'].dt.to_period('M')
df['ReportingYear'] = df['ReportingMonthYear'].dt.year
# Extract month and year from 'Dateofjoining'
df['Monthofjoining'] = df['Dateofjoining'].dt.month
df['Yearofjoining'] = df['Dateofjoining'].dt.year
# Find drivers who have churned
df['Churn'] = df.groupby('Driver_ID')['LastWorkingDate'].transform('last')
df['Churn'] = df['Churn'].apply(lambda x: 0 if pd.isnull(x) else 1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19104 entries, 0 to 19103
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ReportingMonthYear 19104 non-null period[M]
1 Driver_ID 19104 non-null int64
2 Age 19043 non-null float64
3 Gender 19052 non-null category
4 City 19104 non-null category
5 Education_Level 19104 non-null category
6 Income 19104 non-null int64
7 Dateofjoining 19104 non-null datetime64[ns]
8 LastWorkingDate 1616 non-null datetime64[ns]
9 Joining Designation 19104 non-null category
10 Grade 19104 non-null category
11 Total Business Value 19104 non-null int64
12 Quarterly Rating 19104 non-null int64
13 ReportingYear 19104 non-null int64
14 Monthofjoining 19104 non-null int32
15 Yearofjoining 19104 non-null int32
16 Churn 19104 non-null int64
dtypes: category(5), datetime64[ns](2), float64(1), int32(2), int64(6), period[M](1)
memory usage: 1.7 MB
In [7]: # look at the top 5 rows
df.head(5)
Out[7]:
ReportingMonthYear Driver_ID Age Gender City Education_Level Income Dateofjoining LastWorkingDate
0 2019-01 1 28.0 Male C23 Graduate 57387 2018-12-24 NaT
1 2019-02 1 28.0 Male C23 Graduate 57387 2018-12-24 NaT
2 2019-03 1 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03
3 2020-11 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT
4 2020-12 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT
4. Exploratory Data Analysis
4.1. Univariate analysis
In [8]: plt.figure(figsize=(10,4))
temp_df = df.groupby('ReportingMonthYear')['Driver_ID'].nunique()
ax = temp_df.plot(kind='bar')
ax.bar_label(ax.containers[0])
plt.ylabel('Number of unique drivers')
plt.title('Number of drivers reporting each month')
plt.show()
Insight
The month with the maximum number of reporting drivers is January 2019, with a total of 1022 drivers
It then dropped every month after January 2019 and has stayed roughly stable at around 800 reporting drivers per month
In [9]: fig, axs = plt.subplots(1,2,figsize=(10,4))
temp_df = df.groupby('Driver_ID').agg({'Age':'last'})['Age']
temp_df.plot(ax=axs[0], kind='hist', bins=35)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Age distribution of drivers')
plt.show()
Insight
There are drivers from different age groups ranging from 21 to 58 years
Most of the drivers are in the age group of 30 to 35
The distribution is roughly normal with a slight right skew
In [10]: temp_df = df.groupby('Driver_ID').agg({'Gender':'first'})
temp_df['Gender'].value_counts().plot(kind='pie', autopct='%.1f')
plt.title('Gender % distribution of drivers')
plt.ylabel('')
plt.show()
Insight
59% of the drivers are Male and the remaining 41% are Female
In [11]: plt.figure(figsize=(10,5))
temp_df = df.groupby('Driver_ID').agg({'City':'first'})
ax = temp_df['City'].value_counts().plot(kind='bar')
ax.bar_label(ax.containers[0])
plt.title('Distribution of drivers in different cities')
plt.show()
Insight
City C20 has the maximum number of drivers followed by city C15
In [12]: temp_df = df.groupby('Driver_ID').agg({'Education_Level':'first'})
temp_df['Education_Level'].value_counts().plot(kind='pie', autopct='%.1f')
plt.ylabel('')
plt.title('Education level % distribution of drivers')
plt.show()
Insight
An almost equal proportion of drivers comes from each of the 3 education levels
In [13]: fig, axs = plt.subplots(1,2,figsize=(12,4))
temp_df = df.groupby('Driver_ID').agg({'Income':'last'})['Income']
temp_df.plot(ax=axs[0], kind='hist', bins=35, rot=45)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Income distribution of drivers')
plt.show()
Insight
Most of the drivers have an average monthly income of 40k to 75k
The distribution is right skewed
In [14]: temp_df = df.groupby('Driver_ID').agg({'Dateofjoining':'first'})['Dateofjoining']
fig, axs = plt.subplots(1,2,figsize=(10,5))
temp_df.dt.year.value_counts().plot(kind='bar', ax=axs[0], xlabel='Year of joining', title='Drivers by year of joining')
temp_df.dt.month.value_counts().plot(kind='bar', ax=axs[1], xlabel='Month of joining', title='Drivers by month of joining')
plt.show()
Insight
Maximum number of drivers joined in the year 2020
Maximum number of drivers joined in the month of July
In [15]: ax = df.groupby('Driver_ID').agg({'Joining Designation':'first'})['Joining Designation'].value_counts().plot(kind='bar', title='Driver distribution by joining designation')
ax.bar_label(ax.containers[0])
plt.show()
Insight
Maximum number of drivers, 1026, have a joining designation of 1
In [16]: temp_df_1 = df.groupby('Driver_ID').agg({'Grade':'first'}).reset_index()
temp_df_1.rename(columns = {'Grade':'Grade_First'}, inplace=True)
temp_df_2 = df.groupby('Driver_ID').agg({'Grade':'last'}).reset_index()
temp_df_2.rename(columns = {'Grade':'Grade_Last'}, inplace=True)
temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
temp_df['Grade_Change'] = temp_df['Grade_Last'].astype('int') - temp_df['Grade_First'].astype('int')
fig, axs = plt.subplots(1,2,figsize=(10,4))
ax = temp_df['Grade_First'].value_counts().plot(kind='bar', ax=axs[0], title='Driver distribution by first grade')
ax.bar_label(ax.containers[0])
ax = temp_df['Grade_Last'].value_counts().plot(kind='bar', ax=axs[1], title='Driver distribution by last grade')
ax.bar_label(ax.containers[0])
plt.show()
ax = sns.countplot(data=temp_df, x = 'Grade_Change')
ax.set_title('Driver distribution based on change in grade')
ax.bar_label(ax.containers[0])
plt.show()
Insight
Maximum number of drivers have a grade of 2 and it doesn't change for the majority of the drivers
In [17]: fig, axs = plt.subplots(1,2,figsize=(12,4))
temp_df = df.groupby('Driver_ID').agg({'Total Business Value':'sum'})['Total Business Value']
temp_df.plot(ax=axs[0], kind='hist', bins=100)
temp_df.plot(ax=axs[1], kind='box')
fig.suptitle('Distribution of drivers as per total business value')
plt.show()
Insight
It is very evident that many drivers have a total business value of 0 and there are also a few drivers who have a negative business value
The distribution is extremely right skewed
In [18]: temp_df_1 = df.groupby('Driver_ID').agg({'Quarterly Rating':'first'}).reset_index()
temp_df_1.rename(columns = {'Quarterly Rating':'QR_First'}, inplace=True)
temp_df_2 = df.groupby('Driver_ID').agg({'Quarterly Rating':'last'}).reset_index()
temp_df_2.rename(columns = {'Quarterly Rating':'QR_Last'}, inplace=True)
temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
temp_df['QR_Change'] = temp_df['QR_Last'].astype('int') - temp_df['QR_First'].astype('int')
fig, axs = plt.subplots(1,2,figsize=(10,4))
ax = temp_df['QR_First'].value_counts().plot(kind='bar', ax=axs[0], title='Driver distribution by first quarterly rating')
ax.bar_label(ax.containers[0])
ax = temp_df['QR_Last'].value_counts().plot(kind='bar', ax=axs[1], title='Driver distribution by last quarterly rating')
ax.bar_label(ax.containers[0])
plt.show()
ax = sns.countplot(data=temp_df, x = 'QR_Change')
ax.set_title('Driver distribution based on change in QR')
ax.bar_label(ax.containers[0])
plt.show()
Insight
Majority of the drivers have a very low quarterly rating of 1
The change in QR plot shows that the majority of the drivers don't see a change in their QR, but there is a fair number of drivers with a positive change in QR and an equally fair number with a negative change
There are no drivers with QR of 5
In [19]: temp_df = df.groupby('Driver_ID').agg({'Churn':'first'})['Churn']
ax = temp_df.value_counts().plot(kind='bar', title='Driver distribution based on the churn status')
ax.bar_label(ax.containers[0])
plt.show()
(temp_df.value_counts(normalize=True)*100).round(0)
Out[19]: Churn
1    68.0
0    32.0
Name: proportion, dtype: float64
Insight
1616 drivers have churned, which is around 68%
4.2. Bivariate analysis
In [22]: driver_df = df.groupby('Driver_ID').agg({
'ReportingMonthYear' : len,
'Age' : 'last',
'Gender' : 'first',
'City' : 'first',
'Education_Level' : 'first',
'Income' : 'last',
'Dateofjoining' : 'first',
'LastWorkingDate' : 'last',
'Joining Designation' : 'first',
'Grade' : 'last',
'Total Business Value' : 'sum',
'Quarterly Rating' : 'last',
'Churn':'last'
}).reset_index()
driver_df.rename(columns={'ReportingMonthYear': 'Months of Service'}, inplace=True)
driver_df.head(10)
Out[22]:   Driver_ID   Months of Service   Age   Gender   City   Education_Level   Income   Dateofjoining   LastWorkingDate   Joining Designation
0 1 3 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03 1
1 2 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT 2
2 4 5 43.0 Male C13 Graduate 65603 2019-07-12 2020-04-27 2
3 5 3 29.0 Male C9 10+ 46368 2019-09-01 2019-07-03 1
4 6 5 31.0 Female C11 12+ 78728 2020-07-31 NaT 3
5 8 3 34.0 Male C2 10+ 70656 2020-09-19 2020-11-15 3
6 11 1 28.0 Female C19 Graduate 42172 2020-07-12 NaT 1
7 12 6 35.0 Male C23 Graduate 28116 2019-06-29 2019-12-21 1
8 13 23 31.0 Male C19 Graduate 119227 2015-05-28 2020-11-25 1
9 14 3 39.0 Female C26 10+ 19734 2020-10-16 NaT 3
In [23]: drivers_with_2_year_service = driver_df[driver_df['Months of Service'] == 24]['Driver_ID']
In [24]: def calculate_change(df, column_name):
temp_df_1 = df.groupby('Driver_ID').agg({column_name:'first'}).reset_index()
first_column_name = column_name+'_First'
temp_df_1.rename(columns = {column_name:first_column_name}, inplace=True)
temp_df_2 = df.groupby('Driver_ID').agg({column_name:'last'}).reset_index()
last_column_name = column_name+'_Last'
temp_df_2.rename(columns = {column_name:last_column_name}, inplace=True)
temp_df = pd.merge(temp_df_1, temp_df_2, on='Driver_ID')
temp_df[column_name+'_Change'] = temp_df[last_column_name].astype('int') - temp_df[first_column_name].astype('int')
temp_df.drop(columns=[first_column_name, last_column_name], inplace=True)
return temp_df
In [25]: column_name = 'Income'
temp_df1 = calculate_change(df, 'Income')
driver_df = pd.merge(driver_df, temp_df1, on='Driver_ID')
temp_df2 = calculate_change(df, 'Grade')
driver_df = pd.merge(driver_df, temp_df2, on='Driver_ID')
temp_df3 = calculate_change(df, 'Quarterly Rating')
driver_df = pd.merge(driver_df, temp_df3, on='Driver_ID')
driver_df['Quarterly Rating Improved'] = driver_df['Quarterly Rating_Change'].apply(lambda x: 1 if x > 0 else 0)
driver_df.head()
Out[25]:   Driver_ID   Months of Service   Age   Gender   City   Education_Level   Income   Dateofjoining   LastWorkingDate   Joining Designation
0 1 3 28.0 Male C23 Graduate 57387 2018-12-24 2019-11-03 1
1 2 2 31.0 Male C7 Graduate 67016 2020-06-11 NaT 2
2 4 5 43.0 Male C13 Graduate 65603 2019-07-12 2020-04-27 2
3 5 3 29.0 Male C9 10+ 46368 2019-09-01 2019-07-03 1
4 6 5 31.0 Female C11 12+ 78728 2020-07-31 NaT 3
In [26]: driver_df['Income_Raise'] = driver_df['Income_Change'].apply(lambda x: 1 if x>0 else 0)
In [27]: fig, axs = plt.subplots(2,1,figsize=(10,6))
sns.histplot(ax = axs[0], data=driver_df, x='Months of Service', hue='Churn', stat="proportion", multiple='fill')
sns.histplot(ax = axs[1], data=driver_df, x='City', hue='Churn', stat="proportion", multiple='fill')
plt.tight_layout()
plt.show()
Insight
The churn rate is generally higher for drivers with fewer months of service and lower for drivers with longer service, with the exception of 21, 22 and 23 months of service, where the churn rate seems to be very high
The city C13 has the highest churn rate and city C29 has the lowest churn rate
In [28]: fig, axs = plt.subplots(1,3,figsize=(10,5))
sns.boxplot(ax=axs[0], data=driver_df, y='Age', hue='Churn', width=0.5, gap=0.2)
sns.boxplot(ax=axs[1], data=driver_df, y='Income', hue='Churn', width=0.5, gap=0.2)
sns.boxplot(ax=axs[2], data=driver_df, y='Total Business Value', hue='Churn', gap=0.2)
plt.tight_layout()
plt.show()
Insight
The median age of drivers who have churned is slightly lower than that of the drivers who have not churned
The median income of drivers who have churned is lower than that of the drivers who have not churned
The median Total Business Value of drivers who have churned is lower than that of the drivers who have not churned
The drivers who have churned also had negative Total Business Value
In [29]: fig, axs = plt.subplots(2,2,figsize=(8,4),sharey=True)
sns.histplot(ax=axs[0,0], data=driver_df, x='Gender', hue='Churn', stat="proportion", multiple='fill')
sns.histplot(ax=axs[0,1], data=driver_df, x='Education_Level', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1,0], data=driver_df, x='Joining Designation', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1,1], data=driver_df, x='Grade', hue='Churn', stat='proportion', multiple='fill')
plt.tight_layout()
plt.show()
Insight
The churn rate is almost equal in both male and female drivers
The churn rate is almost equal for 10+ and Graduate drivers and slightly lower for 12+
The churn rate is lower for joining designation 3
The churn rate is lower for higher grades
In [30]: fig, axs = plt.subplots(1,2,figsize=(8,2),sharey=True)
sns.histplot(ax=axs[0], data=driver_df, x='Income_Raise', hue='Churn', stat='proportion', multiple='fill')
sns.histplot(ax=axs[1], data=driver_df, x='Grade_Change', hue='Churn', stat='proportion', multiple='fill')
plt.tight_layout()
plt.show()
plt.figure(figsize=(9,2))
sns.histplot(data=driver_df, x='Quarterly Rating_Change', hue='Churn', stat='proportion', multiple='fill')
plt.show()
plt.figure(figsize=(9,2))
sns.histplot(data=driver_df, x='Quarterly Rating Improved', hue='Churn', stat='proportion', multiple='fill')
plt.show()
Insight
The churn rate is much lower among drivers whose income has increased
The churn rate is much lower among drivers whose grade has increased
The churn rate is much lower among drivers whose Quarterly Rating has increased
In [31]: temp_df = df.groupby(['City', 'ReportingYear']).agg({'Quarterly Rating': 'mean'}).reset_index()
temp_df1 = pd.pivot_table(data=temp_df, index='City', columns='ReportingYear', values='Quarterly Rating').reset_index()
temp_df1.rename(columns={'ReportingYear':'index', 2019:'2019', 2020:'2020'}, inplace=True)
temp_df1['%change'] = (((temp_df1['2020'] - temp_df1['2019'])/temp_df1['2019'])*100).round(2)
plt.figure(figsize=(10,5))
sns.barplot(data=temp_df1, x='City', y='%change')
plt.tight_layout()
plt.grid(True)
plt.show()
Insight
The city C29 shows the most improvement in Quarterly Rating in 2020 compared to 2019
In [32]: plt.figure(figsize=(10,3))
sns.barplot(data=driver_df, x='Grade', y='Total Business Value', estimator='mean')
plt.show()
print('Mean of Total Business Value of drivers with grade 5:', driver_df[driver_df['Grade']==5]['Total Business Value'].mean())
Mean of Total Business Value of drivers with grade 5: 565760460
Insight
The mean of Total Business Value of drivers with grade 5 is higher than those with other grades
In [33]: def convert_to_year_quarter(x):
year = str(x.year)
month = x.month
if(month >=1 and month <=3):
return year+'-Q1'
elif(month >=4 and month <=6):
return year+'-Q2'
elif(month >=7 and month <=9):
return year+'-Q3'
else:
return year+'-Q4'
temp_df = df.copy()
temp_df['ReportingYearQuarter']=temp_df['ReportingMonthYear'].apply(convert_to_year_quarter)
temp_df.head()
temp_driver_full_service_df = temp_df[temp_df['Driver_ID'].isin(drivers_with_2_year_service)]
plt.figure(figsize=(8,4))
sns.lineplot(data=temp_driver_full_service_df, x='ReportingYearQuarter', y='Quarterly Rating')
plt.xticks(rotation=90)
plt.show()
Insight
There is a dip in the quarterly rating in Q2 and then it increases in Q3
This pattern can be observed for both the years
In [34]: temp_driver_full_service_df = temp_df[temp_df['Driver_ID'].isin(drivers_with_2_year_service)]
num_of_drivers = 20
count=0
for driver_id in temp_driver_full_service_df['Driver_ID'].unique():
if(count < num_of_drivers):
count = count + 1
sample_df = temp_driver_full_service_df[temp_driver_full_service_df['Driver_ID'] == driver_id]
fig, axs = plt.subplots(2,1,figsize=(10, 5), sharex=True)
sns.barplot(ax=axs[0], data=sample_df, x = 'ReportingMonthYear', y='Quarterly Rating')
axs[0].tick_params(axis='x', rotation=90)
sns.barplot(ax=axs[1], data=sample_df, x = 'ReportingMonthYear', y='Total Business Value')
axs[1].tick_params(axis='x', rotation=90)
fig.suptitle(f'Driver Id: {driver_id}')
plt.tight_layout()
plt.show()
else:
break
Insight
It can be observed that a significant drop in rating impacts the Total Business Value. A drop in rating demotivates the drivers, leading them to accept only a few rides or, in some cases, no rides at all, which in turn impacts the Total Business Value
4.3. Multivariate analysis
In [35]: driver_df['Gender'].replace({'Male':0, 'Female':1}, inplace=True)
driver_df['Education_Level'].replace({'Graduate':0, '10+':1, '12+':2}, inplace=True)
driver_df['City'] = driver_df['City'].str[1:]
In [36]: plt.figure(figsize=(15,10))
sns.heatmap(driver_df.drop(columns=['Driver_ID', 'Dateofjoining', 'LastWorkingDate', 'Income_Change']).corr(numeric_only=True), annot=True)
plt.tight_layout()
plt.show()
Insight
Months of Service and Total Business Value are highly correlated
Income and Grade are highly correlated
Joining Designation and Grade are highly correlated
Quarterly Rating and Months of Service are highly correlated
Churn is moderately correlated with Quarterly Rating, Total Business Value and Months of Service
5. Data Preprocessing
In [37]: driver_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Driver_ID 2381 non-null int64
1 Months of Service 2381 non-null int64
2 Age 2381 non-null float64
3 Gender 2381 non-null category
4 City 2381 non-null object
5 Education_Level 2381 non-null category
6 Income 2381 non-null int64
7 Dateofjoining 2381 non-null datetime64[ns]
8 LastWorkingDate 1616 non-null datetime64[ns]
9 Joining Designation 2381 non-null category
10 Grade 2381 non-null category
11 Total Business Value 2381 non-null int64
12 Quarterly Rating 2381 non-null int64
13 Churn 2381 non-null int64
14 Income_Change 2381 non-null int32
15 Grade_Change 2381 non-null int32
16 Quarterly Rating_Change 2381 non-null int32
17 Quarterly Rating Improved 2381 non-null int64
18 Income_Raise 2381 non-null int64
dtypes: category(4), datetime64[ns](2), float64(1), int32(3), int64(8), object(1)
memory usage: 261.2+ KB
Insight
The columns Driver_ID, Gender, City, Education_Level, Dateofjoining, LastWorkingDate can be
dropped as they do not contribute towards the driver churn rate
In [38]: driver_df.drop(columns=['Driver_ID', 'Gender', 'City', 'Education_Level', 'Dateofjoining', 'LastWorkingDate', 'Income_Change'], inplace=True)
driver_df['Quarterly Rating'] = driver_df['Quarterly Rating'].astype('category')
driver_df['Churn'] = driver_df['Churn'].astype('category')
driver_df['Grade_Change'] = driver_df['Grade_Change'].astype('category')
driver_df['Quarterly Rating_Change'] = driver_df['Quarterly Rating_Change'].astype('category')
driver_df['Income_Raise'] = driver_df['Income_Raise'].astype('category')
driver_df.head()
Out[38]:   Months of Service   Age   Income   Joining Designation   Grade   Total Business Value   Quarterly Rating   Churn   Grade_Change   Quarterly Rating_Change   Quarterly Rating Improved
0 3 28.0 57387 1 1 1715580 2 1 0 0
1 2 31.0 67016 2 2 0 1 0 0 0
2 5 43.0 65603 2 2 350000 1 1 0 0
3 3 29.0 46368 1 1 120360 1 1 0 0
4 5 31.0 78728 3 3 1265000 2 0 0 1
In [39]: driver_df.duplicated().value_counts()
Out[39]: False    2381
Name: count, dtype: int64
Insight
There are no duplicates
5.1. Handling null values
In [40]: driver_df.isna().sum()
Out[40]: Months of Service    0
Age 0
Income 0
Joining Designation 0
Grade 0
Total Business Value 0
Quarterly Rating 0
Churn 0
Grade_Change 0
Quarterly Rating_Change 0
Quarterly Rating Improved 0
Income_Raise 0
dtype: int64
Insight
There are no missing data or null values
5.2. Outlier Treatment
In [41]: # helper function to detect outliers using IQR method
def detectOutliers_iqr(df):
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3-q1
lower_outliers = df[df<(q1-1.5*iqr)]
higher_outliers = df[df>(q3+1.5*iqr)]
return lower_outliers, higher_outliers
# helper function to detect outliers using standard deviation method
def detectOutliers_std(df):
mean = df.mean()
std = df.std()
upper_limit = mean+(3*std)
lower_limit = mean-(3*std)
lower_outliers = df[df<lower_limit]
higher_outliers = df[df>upper_limit]
return lower_outliers, higher_outliers
In [42]: numerical_columns = driver_df.select_dtypes(include=np.number).columns
column_outlier_dictionary = {}
for column in numerical_columns:
lower_outliers, higher_outliers = detectOutliers_iqr(driver_df[column])
column_outlier_dictionary[column] = [lower_outliers, higher_outliers]
#print('*'*50)
#print(f'Outliers of \'{column}\' column are:')
#print("Lower outliers:\n", lower_outliers)
#print("Higher outliers:\n", higher_outliers)
#print('*'*50, end="\n")
In [43]: for key, value in column_outlier_dictionary.items():
print(f'The column \'{key}\' has {len(value[0]) + len(value[1])} outliers')
The column 'Months of Service' has 249 outliers
The column 'Age' has 25 outliers
The column 'Income' has 48 outliers
The column 'Total Business Value' has 336 outliers
The column 'Quarterly Rating Improved' has 358 outliers
In [44]: numerical_columns = driver_df.select_dtypes(include=np.number).columns
column_outlier_dictionary = {}
for column in numerical_columns:
lower_outliers, higher_outliers = detectOutliers_std(driver_df[column])
column_outlier_dictionary[column] = [lower_outliers, higher_outliers]
#print('*'*50)
#print(f'Outliers of \'{column}\' column are:')
#print("Lower outliers:\n", lower_outliers)
#print("Higher outliers:\n", higher_outliers)
#print('*'*50, end="\n")
In [45]: for key, value in column_outlier_dictionary.items():
print(f'The column \'{key}\' has {len(value[0]) + len(value[1])} outliers')
The column 'Months of Service' has 0 outliers
The column 'Age' has 14 outliers
The column 'Income' has 16 outliers
The column 'Total Business Value' has 64 outliers
The column 'Quarterly Rating Improved' has 0 outliers
Insight
I will keep the outliers in the Age and Income columns as they are few in number
I will cap the outliers in the Total Business Value column as drivers with a higher business value usually do not churn
In [46]: mean = driver_df['Total Business Value'].mean()
std = driver_df['Total Business Value'].std()
upper_limit = mean+(3*std)
driver_df['Total Business Value'] = driver_df['Total Business Value'].apply(lambda x: x if x <= upper_limit else upper_limit)
5.3. Multicollinearity Check
In [47]: features_df = driver_df.drop(columns=['Churn']) # Drop target column
features_df = features_df.drop(columns=features_df.select_dtypes(include='category').col
features_df = sm.add_constant(features_df) # Adding a constant column for the intercept
vif_df = pd.DataFrame()
vif_df['Features'] = features_df.columns
vif_df['VIF'] = [variance_inflation_factor(features_df.values, idx) for idx in range(len(features_df.columns))]
vif_df['VIF'] = round(vif_df['VIF'], 2)
vif_df = vif_df.sort_values(by='VIF', ascending=False)
vif_df
Out[47]: Features VIF
0 const 36.42
4 Total Business Value 3.99
1 Months of Service 3.83
3 Income 1.19
2 Age 1.12
5 Quarterly Rating Improved 1.09
Insight
Based on the above VIF scores, I can conclude that there are no multicollinear numerical features
5.4. Encode categorical variables
In [48]: final_df = driver_df.copy()
Separate out target and feature columns
In [49]: X = final_df.drop(columns=['Churn'])
y = final_df['Churn']
X.shape, y.shape
Out[49]: ((2381, 11), (2381,))
Encode target variable
In [50]: y = y.astype(int)
Encode features with just 2 classes as 0 or 1
In [51]: X[['Grade_Change','Quarterly Rating_Change', 'Income_Raise']] = X[['Grade_Change','Quarterly Rating_Change', 'Income_Raise']].astype('int8')
One-Hot-Encoding for remaining categorical features
In [52]: X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2381 entries, 0 to 2380
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Months of Service 2381 non-null int64
1 Age 2381 non-null float64
2 Income 2381 non-null int64
3 Joining Designation 2381 non-null category
4 Grade 2381 non-null category
5 Total Business Value 2381 non-null float64
6 Quarterly Rating 2381 non-null category
7 Grade_Change 2381 non-null int8
8 Quarterly Rating_Change 2381 non-null int8
9 Quarterly Rating Improved 2381 non-null int64
10 Income_Raise 2381 non-null int8
dtypes: category(3), float64(2), int64(3), int8(3)
memory usage: 107.7 KB
In [53]: categorical_columns = X.select_dtypes(include='category').columns
categorical_columns
Out[53]: Index(['Joining Designation', 'Grade', 'Quarterly Rating'], dtype='object')
In [54]: encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(X[categorical_columns])
encoded_df = pd.DataFrame(encoded_data, columns = encoder.get_feature_names_out(categori
X = pd.concat([X, encoded_df], axis=1)
X.drop(columns = categorical_columns, inplace=True)
X.head()
Out[54]:   Months of Service   Age   Income   Total Business Value   Grade_Change   Quarterly Rating_Change   Quarterly Rating Improved   Income_Raise   Joining Designation_1
0 3 28.0 57387 1715580.0 0 0 0 0 1.0
1 2 31.0 67016 0.0 0 0 0 0 0.0
2 5 43.0 65603 350000.0 0 0 0 0 0.0
3 3 29.0 46368 120360.0 0 0 0 0 1.0
4 5 31.0 78728 1265000.0 0 1 1 0 0.0
5 rows × 22 columns
6. Model building
6.1. Train-test split
In [55]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[55]: ((1904, 22), (477, 22), (1904,), (477,))
6.2. Perform data normalization/standardization
Data normalization/standardization is required so that features with larger scales do not dominate the model's performance. Hence, all features should be brought to the same scale
Data before scaling
In [56]: X_train.head()
Out[56]:   Months of Service   Age   Income   Total Business Value   Grade_Change   Quarterly Rating_Change   Quarterly Rating Improved   Income_Raise   Joining Designation_1
2236 7 28.0 57164 1092560.0 0 0 0 0 0.0
6 1 28.0 42172 0.0 0 0 0 0 1.0
1818 1 29.0 43989 0.0 0 0 0 0 1.0
1534 7 40.0 59636 2589640.0 0 0 0 0 0.0
2123 6 25.0 29052 2172260.0 0 0 0 0 1.0
5 rows × 22 columns
In [57]: min_max_scaler = MinMaxScaler()
# Fit min_max_scaler to training data
min_max_scaler.fit(X_train)
# Scale the training and testing data
X_train = pd.DataFrame(min_max_scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(min_max_scaler.transform(X_test), columns=X_test.columns)
Data after scaling
In [58]: X_train.head()
Out[58]:   Months of Service   Age   Income   Total Business Value   Grade_Change   Quarterly Rating_Change   Quarterly Rating Improved   Income_Raise   Joining Designation_1
0 0.260870 0.205882 0.261253 0.074297 0.0 0.5 0.0 0.0 0.0
1 0.000000 0.205882 0.176872 0.041541 0.0 0.5 0.0 0.0 1.0
2 0.000000 0.235294 0.187099 0.041541 0.0 0.5 0.0 0.0 1.0
3 0.260870 0.558824 0.275166 0.119183 0.0 0.5 0.0 0.0 0.0
4 0.217391 0.117647 0.103028 0.106669 0.0 0.5 0.0 0.0 1.0
5 rows × 22 columns
Check for imbalance in target class
In [59]: y_train.value_counts(normalize=True)*100
Out[59]: Churn
1    68.644958
0    31.355042
Name: proportion, dtype: float64
We can see a clear imbalance in the target class, with class 1 at ~69% and class 0 at ~31%. Hence, I will use SMOTE to address this imbalance
In [60]: sm = SMOTE(random_state=0)
X_train, y_train = sm.fit_resample(X_train, y_train)
y_train.value_counts(normalize=True)*100
Out[60]: Churn
1    50.0
0    50.0
Name: proportion, dtype: float64
6.3. Ensemble Learning: Bagging - RandomForestClassifier
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of
the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The
hyperparameters of the random forest classifier will be selected using grid search cross-validation
In [61]: # Define parameter grid
param_grid = {
'n_estimators': list(range(100, 1000, 100)),
'max_features': ['sqrt', 'log2'],
'max_depth': list(range(10, 100, 10)),
'min_samples_split': list(range(2, 10, 1))
}
# Initialize classifier and GridSearchCV
rf = RandomForestClassifier()
rf_random = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, verbose=2, n_jobs=-1)
# Fit the model
rf_random.fit(X_train, y_train)
# Evaluate best parameters
print("Best parameters found: ", rf_random.best_params_)
Fitting 3 folds for each of 1296 candidates, totalling 3888 fits
Best parameters found: {'max_depth': 50, 'max_features': 'sqrt', 'min_samples_split':
2, 'n_estimators': 200}
In [62]: color = '\033[91m'
bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = rf_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = rf_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
Train data:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 1307
1 1.00 1.00 1.00 1307
accuracy 1.00 2614
macro avg 1.00 1.00 1.00 2614
weighted avg 1.00 1.00 1.00 2614
Test data:
Accuracy: 0.7777777777777778
Classification Report:
precision recall f1-score support
0 0.70 0.65 0.67 168
1 0.82 0.85 0.83 309
accuracy 0.78 477
macro avg 0.76 0.75 0.75 477
weighted avg 0.77 0.78 0.78 477
Insight
The training accuracy is 1 whereas testing accuracy is 0.778. This is a case of overfitting.
The best parameters found are well within the provided range
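One additional, hedged check on this overfitting diagnosis is the cross-validated score that the grid search has already computed for its best candidate; the sketch below assumes the fitted rf_random object from the cell above.
In [ ]: # GridSearchCV stores the mean 3-fold cross-validated accuracy of the best
# parameter combination in best_score_; it is computed on the resampled training
# folds and is usually less optimistic than the perfect training accuracy above
print("Best mean CV accuracy:", round(rf_random.best_score_, 3))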
6.3.1 Feature Importance Plot
In [63]: def plot_feature_importance(estimator, features):
# Extract feature importances
importances = estimator.feature_importances_
# Create a DataFrame for plotting
feature_importance_df = pd.DataFrame({'Feature':features, 'Importance':importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Plot feature importance
plt.figure(figsize=(8,5))
sns.barplot(data=feature_importance_df, x='Importance', y='Feature')
plt.title('Feature Importance')
plt.show()
In [64]: plot_feature_importance(rf_random.best_estimator_, X_train.columns)
6.3.2 Confusion Matrix
In [65]: def display_confusion_matrix(y_test, y_pred):
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
In [66]: display_confusion_matrix(y_test, y_pred)
6.3.3 ROC Curve
In [67]: def plot_roc_curve(estimator, X_train, X_test, y_train, y_test):
# Binarize the output
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])
n_classes = y_test_binarized.shape[1]-1
# Compute ROC curve and ROC area for each class
classifier = OneVsRestClassifier(estimator)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test_binarized[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Plot ROC curve for each class
plt.figure(figsize=(5, 5))
for i in range(n_classes):
plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
In [68]: plot_roc_curve(rf_random.best_estimator_, X_train, X_test, y_train, y_test)
6.3.4 Precision-Recall Curve
In [69]: def plot_pr_curve(estimator, X_train, X_test, y_train, y_test):
# Binarize the output
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2])
n_classes = y_test_binarized.shape[1]-1
# Compute precision-recall curve and average precision for each class
classifier = OneVsRestClassifier(estimator)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
precision[i], recall[i], _ = precision_recall_curve(y_test_binarized[:, i], y_score[:, i])
average_precision[i] = average_precision_score(y_test_binarized[:, i], y_score[:, i])
# Plot Precision-Recall curve for each class
plt.figure(figsize=(5, 5))
for i in range(n_classes):
plt.plot(recall[i], precision[i], label='PR curve of class {0} (area = {1:0.2f})'.format(i, average_precision[i]))
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower right')
plt.show()
In [70]: plot_pr_curve(rf_random.best_estimator_, X_train, X_test, y_train, y_test)
Insight
The top 5 features as per the RandomForestClassifier are Months of Service, Income, Total Business Value, Age and Quarterly Rating 1
Both the classes 0 and 1 have a decent Area Under the ROC curve of 0.85
The Area Under the PR curve for class 0 is 0.77 and class 1 is 0.90
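As a quick cross-check of the curve-based areas reported above, the test-set ROC AUC can also be computed as a single number from the predicted probabilities; this is a sketch assuming the fitted rf_random object and an extra roc_auc_score import that is not part of the imports in Section 2.
In [ ]: # single-number ROC AUC on the held-out test set, using the positive-class
# probabilities of the tuned random forest
from sklearn.metrics import roc_auc_score
rf_test_auc = roc_auc_score(y_test, rf_random.predict_proba(X_test)[:, 1])
print('Test ROC AUC:', round(rf_test_auc, 3))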
6.4. Ensemble Learning: Boosting -
GradientBoostingClassifier
This algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of
arbitrary differentiable loss functions. In each stage, n_classes regression trees are fit on the negative gradient
of the loss function, e.g. binary or multiclass log loss. The hyperparameters of the
GradientBoostingClassifier will be selected using randomized search cross-validation
In [71]: # Define parameter grid
param_grid = {
'n_estimators': np.arange(100, 1001, 100),
'learning_rate': np.logspace(-3, 0, 10),
'max_depth': np.arange(3, 11, 1),
'min_samples_split': np.arange(2, 21, 2),
'min_samples_leaf': np.arange(1, 21, 2),
'subsample': np.linspace(0.5, 1.0, 6)
}
# Initialize classifier and RandomizedSearchCV
gb = GradientBoostingClassifier()
gb_random = RandomizedSearchCV(estimator=gb, param_distributions=param_grid,
n_iter=300, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the model
gb_random.fit(X_train, y_train)
# Evaluate best parameters
print("Best parameters found for GradientBoostingClassifier: ", gb_random.best_params_)
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best parameters found for GradientBoostingClassifier: {'subsample': 0.8, 'n_estimator
s': 500, 'min_samples_split': 20, 'min_samples_leaf': 3, 'max_depth': 10, 'learning_rat
e': 0.021544346900318832}
In [72]: color = '\033[91m'
bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = gb_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = gb_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
Train data:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 1307
1 1.00 1.00 1.00 1307
accuracy 1.00 2614
macro avg 1.00 1.00 1.00 2614
weighted avg 1.00 1.00 1.00 2614
Test data:
Accuracy: 0.7861635220125787
Classification Report:
precision recall f1-score support
0 0.73 0.63 0.68 168
1 0.81 0.87 0.84 309
accuracy 0.79 477
macro avg 0.77 0.75 0.76 477
weighted avg 0.78 0.79 0.78 477
Insight
The training accuracy is 1 whereas testing accuracy is 0.786. This is also a case of overfitting.
6.4.1 Performance
In [73]: plot_feature_importance(gb_random.best_estimator_, X_train.columns)
display_confusion_matrix(y_test, y_pred)
plot_roc_curve(gb_random.best_estimator_, X_train, X_test, y_train, y_test)
plot_pr_curve(gb_random.best_estimator_, X_train, X_test, y_train, y_test)
Insight
The top 5 features as per the GradientBoostingClassifier are Quarterly Rating 1, Months of Service, Income, Total Business Value and Age
Both the classes 0 and 1 have a decent Area Under the ROC curve of 0.85
The Area Under the PR curve for class 0 is 0.79 and class 1 is 0.91
6.5. Ensemble Learning: Boosting - XGBClassifier
XGBClassifier is a highly optimized implementation of gradient boosting. It includes regularization to prevent overfitting and
various other enhancements. The hyperparameters of the XGBClassifier will be selected using randomized
search cross-validation
In [74]: # Define parameter grid
param_grid = {
'n_estimators': np.arange(100, 1001, 100),
'learning_rate': np.logspace(-3, 0, 10),
'max_depth': np.arange(3, 11, 1),
'min_child_weight': np.arange(1, 11, 1),
'gamma': np.logspace(-3, 1, 10),
'subsample': np.linspace(0.5, 1.0, 6),
'colsample_bytree': np.linspace(0.5, 1.0, 6)
}
# Initialize classifier and RandomizedSearchCV
xgb = XGBClassifier(eval_metric='mlogloss')
xgb_random = RandomizedSearchCV(estimator=xgb, param_distributions=param_grid,
n_iter=300, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the model
xgb_random.fit(X_train, y_train)
# Evaluate best parameters
print("Best parameters found for XGBoost: ", xgb_random.best_params_)
Fitting 3 folds for each of 300 candidates, totalling 900 fits
Best parameters found for XGBoost: {'subsample': 0.8, 'n_estimators': 200, 'min_child_w
eight': 1, 'max_depth': 7, 'learning_rate': 0.046415888336127774, 'gamma': 0.00774263682
6811269, 'colsample_bytree': 0.5}
In [75]: color = '\033[91m'
bold = '\033[1m'
end = '\033[0m'
# Predict and evaluate performance
y_true = y_train
y_pred = xgb_random.predict(X_train)
print(color + bold + "Train data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
y_true = y_test
y_pred = xgb_random.predict(X_test)
print(color + bold + "Test data:" + color + end)
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))
Train data:
Accuracy: 0.9391736801836267
Classification Report:
precision recall f1-score support
0 0.96 0.92 0.94 1307
1 0.92 0.96 0.94 1307
accuracy 0.94 2614
macro avg 0.94 0.94 0.94 2614
weighted avg 0.94 0.94 0.94 2614
Test data:
Accuracy: 0.8134171907756813
Classification Report:
precision recall f1-score support
0 0.76 0.68 0.72 168
1 0.84 0.88 0.86 309
accuracy 0.81 477
macro avg 0.80 0.78 0.79 477
weighted avg 0.81 0.81 0.81 477
Insight
The training accuracy has reduced to 0.939 whereas the testing accuracy has slightly increased to 0.813.
This is still a case of overfitting, but less severe than in the previous models.
This model also trains faster than the previous models
6.5.1 Performance
In [76]: plot_feature_importance(xgb_random.best_estimator_, X_train.columns)
display_confusion_matrix(y_test, y_pred)
plot_roc_curve(xgb_random.best_estimator_, X_train, X_test, y_train, y_test)
plot_pr_curve(xgb_random.best_estimator_, X_train, X_test, y_train, y_test)
Insight
The top 5 features as per the XGBClassifier are Quarterly Rating 1, Quarterly Rating Improved, Quarterly Rating 4, Months of Service and Quarterly Rating_Change
Both the classes 0 and 1 have a decent Area Under the ROC curve of 0.86
The Area Under the PR curve for class 0 is 0.81 and class 1 is 0.90
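LGBMClassifier was imported in Section 2 but not tuned above; the same randomized-search recipe used for XGBoost could be applied to it as well. The sketch below is illustrative only: the parameter ranges are assumptions rather than tuned values, and it reuses the SMOTE-resampled X_train/y_train from Section 6.2.
In [ ]: # Illustrative LightGBM search following the same pattern as the XGBoost cell;
# the parameter ranges here are assumptions, not tuned values
lgbm = LGBMClassifier(random_state=42)
param_grid = {
    'n_estimators': np.arange(100, 1001, 100),
    'learning_rate': np.logspace(-3, 0, 10),
    'num_leaves': np.arange(15, 64, 8),
    'subsample': np.linspace(0.5, 1.0, 6),
    'colsample_bytree': np.linspace(0.5, 1.0, 6)
}
lgbm_random = RandomizedSearchCV(estimator=lgbm, param_distributions=param_grid,
                                 n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
lgbm_random.fit(X_train, y_train)
print("Best parameters found for LightGBM: ", lgbm_random.best_params_)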
7. Insights
Most of the drivers are in the age group of 30 to 35
59% of the drivers are Male and the remaining 41% are Female
City C20 has the maximum number of drivers
Maximum number of drivers joined in the year 2020 and in the month of July
1026 drivers have a joining designation of 1
Maximum number of drivers have a grade of 2
Majority of the drivers have a very low quarterly rating of 1
There are no drivers with quarterly rating of 5
1616 drivers have churned, which is around 68%
The median income of drivers who have churned is lower than that of the drivers who have not churned
The churn rate is much lower among drivers whose income has increased
The churn rate is much lower among drivers whose grade has increased
The churn rate is much lower among drivers whose Quarterly Rating has increased
8. Recommendation
The quarterly rating has been the top contributor in deciding whether a driver will churn or not. As the ratings are given by the customers to the driver, Ola should urge all customers to rate the drivers on time. Ola should provide incentives/points to the customers to encourage timely ratings.
Ola should make sure that the income of deserving drivers is increased every 6 months, if not every quarter, to encourage drivers to stay
Long-service awards/bonuses should be given to drivers to keep them motivated
Special training should be given to drivers on how to handle different customers and different situations so that customers always provide positive ratings
9. Questionnaire
9.1 What percentage of drivers have received a quarterly rating of 5?
Ans: No drivers have received a quarterly rating of 5
9.2 Comment on the correlation between Age and Quarterly Rating.
Ans: Age and Quarterly Rating do not have much correlation; they have a small correlation value of about 0.15
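A minimal way to check this number, assuming the aggregated driver_df from Section 4.2 at a point where Quarterly Rating is still numeric (it is cast to a categorical type later in Section 5):
In [ ]: # Pearson correlation between a driver's age and their latest quarterly rating
print(driver_df['Age'].corr(driver_df['Quarterly Rating'].astype('int')))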
9.3 Name the city which showed the most improvement in Quarterly
Rating over the past year
Ans: The city C29 shows the most improvement in Quarterly Rating in 2020 compared to 2019
9.4 Drivers with a Grade of ‘A’ are more likely to have a higher Total
Business Value. (T/F)
Ans: Yes, the mean Total Business Value of drivers with grade 5 (or A) is higher than that of drivers with other grades
9.5 If a driver's Quarterly Rating drops significantly, how does it impact
their Total Business Value in the subsequent period?
Ans: A significant drop in rating leads to a dip in the Total Business Value in the subsequent period. A drop in rating demotivates the drivers, leading them to accept only a few rides or, in some cases, no rides at all, which in turn impacts the Total Business Value
9.6 From Ola's perspective, which metric should be the primary focus for
driver retention? 1. ROC AUC, 2. Precision, 3. Recall, 4. F1 Score
Ans: Recall. It is acceptable if the model flags some drivers who will not churn as Churn, but it should not misclassify drivers who will churn as Not Churn
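If recall on the churn class is the priority, one practical lever is to lower the probability threshold used to flag churn rather than relying on the default 0.5. A minimal sketch, assuming the tuned xgb_random search object from Section 6.5 and an illustrative threshold of 0.35:
In [ ]: # trade some precision for higher recall by flagging churn at a lower
# probability cut-off; the 0.35 threshold is illustrative, not a tuned value
from sklearn.metrics import recall_score, precision_score
churn_probability = xgb_random.predict_proba(X_test)[:, 1]
y_pred_low_threshold = (churn_probability >= 0.35).astype(int)
print('Recall   :', round(recall_score(y_test, y_pred_low_threshold), 3))
print('Precision:', round(precision_score(y_test, y_pred_low_threshold), 3))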
9.7 How does the gap in precision and recall affect Ola's relationship
with its drivers and customers?
Ans: A gap between precision and recall implies that the numbers of False Negatives and False Positives are very
different. If more instances of Churn are misclassified as Not Churn, then customers may get drivers who
are unmotivated/unsatisfied, leading to a bad customer experience. On the other hand, if more instances of
Not Churn are misclassified as Churn, then well-performing drivers will be neglected, leading to driver
dissatisfaction.
9.8 Besides the obvious features like "Number of Rides", which lesser-
discussed features might have a strong impact on a driver's Quarterly
Rating?
Ans: 1) Customers not providing timely ratings, or providing unfair ratings, has a strong impact on high-performing
drivers and their quarterly rating. 2) Lack of training for the driver on handling different situations
can also impact their quarterly rating. Not all customers are the same, so the driver needs to adapt their
behaviour to the customer.
9.9 Will the driver's performance be affected by the City they operate in?
(Yes/No)
Ans: Yes, it might be the case that the people (customers) of a city share a particular mindset. The people of
one city could be more accommodating and always provide good ratings, while the people of a different city could
get irritated easily and provide bad ratings
9.10 Analyze any seasonality in the driver's ratings. Do certain times of
the year correspond to higher or lower ratings, and why might that be?
Ans: Yes, there is seasonality in the drivers' ratings. The ratings dip in Q2 and then rise in Q3. This
could be because of the holiday season in Q2, when many people move out of the cities for vacation and
hence use cabs less.