Task: MINI PROJECT ON MACHINE LEARNING
Name: Prof Shantanu Chakraborty
Reg.No: GO_STP_7458
Date: 08-06-2021
Logistic Regression Model on Why Employees Leave | Predicting Employee Attrition Using Machine Learning
Predict the retention of an employee within an organization, i.e., whether the employee will leave the company or continue with it. An
organization is only as good as its employees, and these people are the true source of its competitive advantage.
Kaggle Link: https://www.kaggle.com/giripujar/hr-analytics
First perform data exploration and visualization, then build a logistic regression model to predict employee attrition
using machine learning and Python.
Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline
df=pd.read_csv("/content/HR_comma_sep.csv")
df.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left
0 0.38 0.53 2 157 3 0 1
1 0.80 0.86 5 262 6 0 1
2 0.11 0.88 7 272 4 0 1
3 0.72 0.87 5 223 5 0 1
4 0.37 0.52 2 159 3 0 1
df.shape
(14999, 10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
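The info() output shows 14999 non-null entries in every column. As an optional sanity check (a small sketch, not part of the original run), missing values and duplicate rows can be confirmed explicitly:
# Optional data-quality checks on the loaded dataframe
df.isnull().sum()      # per-column count of missing values (expected: all zeros)
df.duplicated().sum()  # number of exact duplicate rows, if any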
df.dtypes
satisfaction_level float64
last_evaluation float64
number_project int64
average_montly_hours int64
time_spend_company int64
Work_accident int64
left int64
promotion_last_5years int64
Department object
salary object
dtype: object
df.describe()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000
df.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
'promotion_last_5years', 'Department', 'salary'],
dtype='object')
df.groupby('left').count()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident
left
0 11428 11428 11428 11428 11428 11428
1 3571 3571 3571 3571 3571 3571
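The counts above show the classes are imbalanced: 11428 employees stayed and 3571 left. A one-line sketch (not from the original notebook) expresses this as an attrition rate:
# Overall attrition rate: roughly 3571 / 14999 ≈ 0.24
df['left'].value_counts(normalize=True)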
sns.set(rc={'figure.figsize':(9,7)})
correlation_matrix = df.corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True ,cmap="YlGnBu")
<matplotlib.axes._subplots.AxesSubplot at 0x7ffaf6b30790>
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values, cmap='PuRd')
plt.title('Heatmap of Correlation Matrix')
corr
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
satisfaction_level 1.000000 0.105021 -0.142970 -0.020048 -0.100866
last_evaluation 0.105021 1.000000 0.349333 0.339742 0.131591
number_project -0.142970 0.349333 1.000000 0.417211 0.196786
average_montly_hours -0.020048 0.339742 0.417211 1.000000 0.127755
time_spend_company -0.100866 0.131591 0.196786 0.127755 1.000000
Work_accident 0.058697 -0.007104 -0.004741 -0.010143 0.002120
left -0.388375 0.006567 0.023787 0.071287 0.144822
promotion_last_5years 0.025605 -0.008684 -0.006064 -0.003544 0.067433
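In the matrix above, satisfaction_level has the strongest relationship with left (about -0.39). A short sketch (reusing the corr frame from the previous cell, not part of the original run) ranks the target correlations directly:
# Correlation of each numeric feature with the target 'left'
corr['left'].sort_values()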
df.groupby('salary').mean()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident
salary
high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214
low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154
medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361
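The group means hint that low-salary employees are slightly less satisfied. A complementary hedged sketch (not from the original run) shows the attrition proportion within each salary band:
# Share of leavers (left = 1) within each salary level
pd.crosstab(df.salary, df.left, normalize='index')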
pd.crosstab(df.Department, df.left)
left 0 1
Department
IT 954 273
RandD 666 121
accounting 563 204
hr 524 215
management 539 91
marketing 655 203
product_mng 704 198
sales 3126 1014
support 1674 555
technical 2023 697
emp_population_satisfaction = df['satisfaction_level'].mean()
emp_turnover_satisfaction = df[df['left'] == 1]['satisfaction_level'].mean()
print('The mean satisfaction for the employee population is: ' + str(emp_population_satisfaction))
print('The mean satisfaction for the employees who left is: ' + str(emp_turnover_satisfaction))
The mean satisfaction for the employee population is: 0.6128335222348166
The mean satisfaction for the employees who left is: 0.44009801176140917
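Leavers report a noticeably lower mean satisfaction (about 0.44 vs. 0.61 for the whole population). The same comparison extends to every numeric column with a one-line sketch (not part of the original run; pandas silently skips the text columns here):
# Mean of each numeric feature, split by whether the employee left
df.groupby('left').mean()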
f, axes = plt.subplots(ncols=3, figsize=(15, 6))
sns.distplot(df.satisfaction_level, kde=False, color="y", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')
sns.distplot(df.last_evaluation, kde=False, color="b", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')
sns.distplot(df.average_montly_hours, kde=False, color="r", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version.
  warnings.warn(msg, FutureWarning)
Text(0, 0.5, 'Employee Count')
color_types = ['#78c850', '#F08030', '#6890F0', '#ABB820', '#A8A878', '#A040A0', '#F8D030',
               '#E0C068', '#EE99AC', '#C03028', '#F85888', '#B8A038', '#705898', '#98D8D8', '#7038F8']
sns.countplot(x='Department', data=df, palette=color_types).set_title('Employee Department Distribution');
%matplotlib inline
# Bar chart of the department each employee works for and the frequency of turnover
pd.crosstab(df.Department, df.left).plot(kind='bar')
plt.title('Turnover Frequency for Department')
plt.xlabel('Department')
plt.ylabel('Frequency of Turnover')
plt.savefig('department_bar_chart')
#Bar chart for employee salary level and the frequency of turnover
table=pd.crosstab(df.salary, df.left)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Salary Level vs Turnover')
plt.xlabel('Salary Level')
plt.ylabel('Proportion of Employees')
plt.savefig('salary_bar_chart')
fig = plt.figure(figsize=(15, 5))
ax = sns.kdeplot(df.loc[(df['left'] == 0), 'last_evaluation'], color='blue', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'last_evaluation'], color='black', shade=True)
plt.title('Employee Evaluation Distribution Left vs retained')
Text(0.5, 1.0, 'Employee Evaluation Distribution Left vs retained')
fig = plt.figure(figsize=(15, 5))
ax = sns.kdeplot(df.loc[(df['left'] == 0), 'average_montly_hours'], color='green', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'average_montly_hours'], color='red', shade=True)
plt.title('Employee Average Monthly Hours Distribution Left vs retained')
Text(0.5, 1.0, 'Employee Average Monthly Hours Distribution Left vs retained')
fig = plt.figure(figsize=(15, 5))
ax = sns.kdeplot(df.loc[(df['left'] == 0), 'satisfaction_level'], color='red', shade=True)
ax = sns.kdeplot(df.loc[(df['left'] == 1), 'satisfaction_level'], color='black', shade=True)
plt.title('Employee Satisfaction Distribution Left vs retained')
Text(0.5, 1.0, 'Employee Satisfaction Distribution Left vs retained')
data = df[['satisfaction_level', 'average_montly_hours', 'promotion_last_5years', 'salary']]
data.head()
satisfaction_level average_montly_hours promotion_last_5years salary
0 0.38 157 0 low
1 0.80 262 0 medium
2 0.11 272 0 medium
3 0.72 223 0 low
4 0.37 159 0 low
salary = pd.get_dummies(data['salary'], prefix='salary')
salary
salary_high salary_low salary_medium
0 0 1 0
1 0 0 1
2 0 0 1
3 0 1 0
4 0 1 0
... ... ... ...
14994 0 1 0
14995 0 1 0
14996 0 1 0
14997 0 1 0
14998 0 1 0
14999 rows × 3 columns
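The notebook drops the original salary column and salary_high in a later cell to avoid the dummy-variable trap. An equivalent, slightly more compact alternative (a sketch, not what was run here) lets pandas drop one level at encoding time:
# Alternative: drop one dummy level directly; drops salary_high here
# because 'high' is first alphabetically
salary_alt = pd.get_dummies(data['salary'], prefix='salary', drop_first=True)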
new_df = pd.concat([data,salary],axis=1)
new_df
satisfaction_level average_montly_hours promotion_last_5years salary
0 0.38 157 0 low
1 0.80 262 0 medium
2 0.11 272 0 medium
3 0.72 223 0 low
4 0.37 159 0 low
... ... ... ... ...
14994 0.40 151 0 low
14995 0.37 160 0 low
14996 0.37 143 0 low
new_df.drop(['salary','salary_high'], axis=1, inplace=True)
new_df
satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1
X = new_df.copy()
X
satisfaction_level average_montly_hours promotion_last_5years salary_low
0 0.38 157 0 1
1 0.80 262 0 0
2 0.11 272 0 0
3 0.72 223 0 1
4 0.37 159 0 1
... ... ... ... ...
14994 0.40 151 0 1
14995 0.37 160 0 1
14996 0.37 143 0 1
y = df['left']
y
0 1
1 1
2 1
3 1
4 1
..
14994 1
14995 1
14996 1
14997 1
14998 1
Name: left, Length: 14999, dtype: int64
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y,
test_size=0.30, random_state=99)
train_x.shape, test_x.shape, train_y.shape, test_y.shape
((10499, 5), (4500, 5), (10499,), (4500,))
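Because only about 24% of employees left, a stratified split keeps that class ratio identical in the train and test partitions. A hedged variant of the split above (different variable names, not what was run for the results below):
# Variant of the split that preserves the left/stayed ratio in both partitions
train_x_s, test_x_s, train_y_s, test_y_s = train_test_split(
    X, y, test_size=0.30, random_state=99, stratify=y)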
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
lr.fit(train_x,train_y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
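One advantage of logistic regression is interpretability. A short sketch (not part of the original notebook) pairs each fitted coefficient with its feature name using the fitted lr and train_x from above:
# Fitted coefficients per feature (log-odds scale) and the intercept
print(pd.Series(lr.coef_[0], index=train_x.columns))
print('intercept:', lr.intercept_)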
y_pred = lr.predict(test_x)
y_pred
array([0, 0, 0, ..., 0, 0, 0])
lr.score(test_x,test_y)
0.7724444444444445
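An accuracy of about 0.77 should be read against the majority-class baseline: always predicting "stays" would already be right roughly 76% of the time (11428 / 14999). A quick check, added here as a sketch:
# Baseline: accuracy of always predicting the majority class (0 = stays)
(test_y == 0).mean()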
from sklearn.metrics import accuracy_score, confusion_matrix, plot_confusion_matrix
plot_confusion_matrix
<function sklearn.metrics._plot.confusion_matrix.plot_confusion_matrix>
accuracy_score(test_y,y_pred)
0.7724444444444445
confusion_matrix(test_y,y_pred)
array([[3202, 229],
[ 795, 274]])
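The confusion matrix shows many leavers predicted as stayers (795 false negatives vs. 274 true positives), so recall on the "left" class is low. Per-class precision and recall make this explicit; a small sketch:
from sklearn.metrics import classification_report
print(classification_report(test_y, y_pred))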
plot_confusion_matrix(lr, test_x, test_y,cmap=plt.cm.PuBu)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7ffaf2863710>
from sklearn import metrics
y_true = test_y    # true labels
y_probas = y_pred  # predicted labels
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_probas, pos_label=0)
# Plot ROC curve
plt.plot(fpr, tpr, linewidth=4, color='black')
plt.show()
# Print AUC
auc = np.trapz(tpr, fpr)
print('AUC:', auc)
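The curve above is built from hard 0/1 predictions. A more conventional ROC uses the predicted probability of the positive class (left = 1); a hedged sketch reusing the fitted lr:
# ROC/AUC from predicted probabilities rather than hard labels
y_score = lr.predict_proba(test_x)[:, 1]           # probability of class 1 (left)
fpr_p, tpr_p, _ = metrics.roc_curve(test_y, y_score, pos_label=1)
print('AUC (probability-based):', metrics.auc(fpr_p, tpr_p))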