Assiut University
Faculty of Computers and Information
Report in Data Mining
Dataset name:
Employee Attrition and Factors
Student name:
Esmail Mohamed Esmail Elkot
Section: CS 2
Data Modifications
Dropped Irrelevant Columns:
o Removed EmployeeCount, Over18, StandardHours, and EmployeeNumber as
these columns did not add analytical value (e.g., StandardHours was the same for
all entries).
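
The column removal described above can be sketched roughly as follows. This is a minimal illustration on a made-up three-row sample, not the report's actual code; the sample values and the constant-column check are assumptions added for demonstration.

```python
import pandas as pd

# Tiny made-up sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "MonthlyIncome": [5993, 5130, 2090],
})

# A column with a single unique value carries no information for analysis.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)

# EmployeeNumber is only an identifier, so it is listed explicitly.
to_drop = ["EmployeeCount", "Over18", "StandardHours", "EmployeeNumber"]
df = df.drop(columns=to_drop)
print(list(df.columns))
```

Checking `nunique()` first is one way to confirm that a column really is constant before dropping it.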
Handling Missing Values:
o Used NumPy to fill any missing values in numerical columns with the column
mean.
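
One possible way to do this mean imputation is sketched below. The report does not state which NumPy function was used, so `np.nanmean` and the sample values here are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value in each numerical column.
df = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0],
    "MonthlyIncome": [2000.0, 4000.0, np.nan],
    "Department": ["Sales", "HR", "Sales"],
})

# Only numerical columns are imputed; text columns are left untouched.
num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    col_mean = np.nanmean(df[col].to_numpy())  # mean that ignores NaN
    df[col] = df[col].fillna(col_mean)

print(df)
```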
Encoding Categorical Variables:
o Converted binary categorical columns (Attrition and Gender) into numerical
values for analysis, mapping them to binary values (e.g., 1 for Yes/Male, 0 for
No/Female).
o Applied one-hot encoding to columns with multiple categories, such as
BusinessTravel, Department, EducationField, JobRole, MaritalStatus, and
OverTime, to convert them into numerical features.
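
The two encoding steps above might look like the following sketch. The sample rows are made up, only two of the one-hot columns are shown, and the use of `Series.map` and `pd.get_dummies` is an assumption about how the encoding was done.

```python
import pandas as pd

# Made-up sample with two binary columns and two columns to one-hot encode.
df = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "Gender": ["Male", "Female", "Male"],
    "Department": ["Sales", "Research & Development", "Sales"],
    "OverTime": ["Yes", "No", "Yes"],
})

# Binary mapping: 1 for Yes/Male, 0 for No/Female.
df["Attrition"] = df["Attrition"].map({"Yes": 1, "No": 0})
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

# One-hot encoding: each category becomes its own 0/1 column.
df = pd.get_dummies(df, columns=["Department", "OverTime"], dtype=int)
print(df.columns.tolist())
```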
Outlier Handling:
o Capped the MonthlyIncome column at the 95th percentile to manage
outliers, limiting the impact of extreme values on analysis.
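
A minimal sketch of percentile capping, assuming `Series.quantile` and `Series.clip` are used (the report does not name the exact calls) and using made-up income values:

```python
import pandas as pd

# Made-up incomes with one extreme value.
df = pd.DataFrame({"MonthlyIncome": [2000.0, 3000.0, 4000.0, 5000.0, 100000.0]})

# 95th percentile of the column (pandas interpolates linearly by default).
cap = df["MonthlyIncome"].quantile(0.95)

# Values above the cap are replaced by the cap; all others are unchanged.
df["MonthlyIncome"] = df["MonthlyIncome"].clip(upper=cap)
print(cap)
print(df)
```

Capping keeps every row in the dataset, unlike dropping outliers, which would reduce the sample size.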
Feature Engineering:
o Created a new feature called YearsInRoleRatio, which is the ratio of
YearsInCurrentRole to (YearsAtCompany + 1). Adding 1 to the denominator keeps
the ratio defined when YearsAtCompany is 0. This feature helps to understand
how long employees stay in their roles relative to their total time in the
company.
o Created a binary column HighEarner to indicate whether an employee earns
above a certain threshold (mean + standard deviation), helping identify high
earners.
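
Both engineered features can be sketched as below. The sample values are made up, and two details are assumptions: that the threshold is computed on MonthlyIncome, and that the standard deviation is the pandas default (sample standard deviation, `ddof=1`).

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "YearsInCurrentRole": [0, 2, 7],
    "YearsAtCompany": [0, 3, 7],
    "MonthlyIncome": [2000.0, 3000.0, 10000.0],
})

# Ratio of time in current role to (time at company + 1).
df["YearsInRoleRatio"] = df["YearsInCurrentRole"] / (df["YearsAtCompany"] + 1)

# Assumed threshold: mean plus one standard deviation of MonthlyIncome.
threshold = df["MonthlyIncome"].mean() + df["MonthlyIncome"].std()
df["HighEarner"] = (df["MonthlyIncome"] > threshold).astype(int)
print(df)
```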
Using DataFrame functions:
o Shape
o Size
o Head
o Tail
o Describe
o Series plot with kind="box" (box plot)
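
The inspection functions listed above can be tried on any DataFrame. A short sketch on a made-up four-row sample:

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

print(df.shape)    # (number of rows, number of columns)
print(df.size)     # total number of cells = rows * columns
print(df.head(2))  # first 2 rows
print(df.tail(2))  # last 2 rows

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```

A box plot of a single column can be drawn with `df["MonthlyIncome"].plot(kind="box")`, which requires matplotlib to be installed.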
Visualizations and Insights
Visualization 1: Attrition Rate by Department
o Created a bar chart to show the attrition rate for each department.
o Explanation: This visualization helps understand which departments
have the highest attrition rates, allowing the organization to identify
areas that may need attention for employee retention.
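
A bar chart like the one described could be produced roughly as follows. This is a sketch on made-up data, assuming Attrition is already encoded as 0/1 and that matplotlib is the plotting library; the department labels and axis text are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; Attrition is assumed to be encoded as 1 (Yes) / 0 (No).
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR", "R&D", "R&D", "R&D", "R&D"],
    "Attrition": [1, 0, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of employees who left.
rate = df.groupby("Department")["Attrition"].mean().sort_values(ascending=False)

fig, ax = plt.subplots()
ax.bar(list(rate.index), rate.values)
ax.set_xlabel("Department")
ax.set_ylabel("Attrition rate")
ax.set_title("Attrition Rate by Department")
print(rate)
```

Plotting the rate rather than the raw count of leavers keeps departments of different sizes comparable.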
Visualization 2: Monthly Income Distribution
o Created a histogram with capped values for MonthlyIncome to view
the distribution of income across employees.
o Explanation: This helps analyze salary distribution patterns and
identify salary ranges. Capping at the 95th percentile provides a
clearer view without distortion from outliers.
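
The capped histogram can be sketched as below. The incomes are synthetic (drawn from a lognormal distribution purely for illustration), and the bin count of 20 is an assumption; the report does not state the plotting parameters.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic right-skewed incomes, for illustration only.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=500),
                   name="MonthlyIncome")

# Cap at the 95th percentile before plotting.
cap = income.quantile(0.95)
capped = income.clip(upper=cap)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(capped, bins=20)
ax.set_xlabel("MonthlyIncome (capped at 95th percentile)")
ax.set_ylabel("Number of employees")
ax.set_title("Monthly Income Distribution")
```

With capping, the top 5% of values collect in the last bin instead of stretching the x-axis.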