Week 11 Notes

INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Lesson 1: Principal Component Analysis
Module: Dimension Reduction and Clustering
Introduction
• PCA background and motivation, with a mobile phone survey example
• Understanding the data in one and two dimensions (1-d, 2-d)
• Understanding the PCA extraction process and eigenvalue analysis
• PCA extraction with 3-d data
• Introduction to the K-means clustering algorithm
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Background and Motivation
Let us start with a mobile phone example
• You have features such as Price, RAM, Brand, and other specs
• Using the buyer survey data, the following questions need to be answered:
(1) Which parameter (price, etc.) is important for each individual?
(2) How can we cluster the individuals using this data?
(3) Is there one particular parameter that is most important?
(4) Are there any hidden dimensions in the data (e.g., value-for-money)?
Background and Motivation
Data has two dimensions: (1) individuals and (2) variables (the mobile phone features)
• Either of these dimensions can describe the full variation in the data
• In practice, the variable dimension is usually the smaller one and hence determines the maximum possible number of PCs
• Let us start with a 1-d plot, where price is the only dimension; there, clustering is easy
Background and Motivation

What if the data has two attributes (RAM, Price), that is, a 2-d plot?
• Clustering becomes a little more difficult
• Complexity increases with more dimensions
• The idea is to extract a smaller number of dimensions and group the data using them
• Specifically, those dimensions that explain the variability in the data
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Finding PCA Part 1


Finding PCA

First, find the centroid of the data


• Project each point on the X and Y axes
• Find the averages; these averages along the X and Y axes are the coordinates of the centroid
• Now shift the data/axes so that the centroid becomes the new origin
• The relative positions of the points remain intact
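As a minimal sketch of this centring step (NumPy assumed; the 2-d data is made up):

```python
import numpy as np

# Hypothetical 2-d data: rows are individuals, columns are two features
X = np.array([[10., 6.], [11., 4.], [8., 5.], [3., 3.], [2., 2.], [1., 1.]])

centroid = X.mean(axis=0)       # averages along the X and Y axes
X_centred = X - centroid        # shift so the centroid becomes the new origin

print(centroid)                 # coordinates of the centroid
print(X_centred.mean(axis=0))   # now ~[0, 0]; relative positions are intact
```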
Finding PCA

Once the data is centred at the origin, we need to find the PCA dimension by fitting a line, the best-fit line
• What should be the right criterion?
• Recall the case where price was the only dimension needed
• All the variation in the data was explained by price; in fact, all the points lay on the price axis
Finding PCA

Once the data is centred at the origin, we need to find the PCA dimension by fitting the best-fit line:
• Choose a random line through the origin
• Project the data points onto it and aim to minimize the residuals, the $a_i$'s
• Recall the Pythagorean theorem: $a^2 + d^2 = b^2$
• Here, $b$ (each point's distance from the origin) is fixed, so minimizing $a$ is the same as maximizing $d$ (the projection's distance from the origin)
Finding PCA

• So we aim to maximize the Sum of Squared Distances, $\mathrm{SSD} = d_1^2 + d_2^2 + \cdots + d_n^2$, over all $n$ points
• Keep rotating the line until we find the best fit
• This line is our PC1

• For example, suppose PC1 has slope α = 1/3: as we move in the direction of PC1, we travel 3 units on the x-axis for each unit on the y-axis
• Thus, PC1 is a linear combination of the X and Y dimensions
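To make the rotate-and-compare idea concrete, here is a minimal brute-force sketch (toy data assumed): it sweeps candidate directions through the origin and keeps the one that maximizes the SSD of the projections. In practice PC1 is obtained from an eigen-decomposition or SVD; the sweep is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[3., 1.], [1., 1.]])  # toy correlated data
X = X - X.mean(axis=0)                                         # centre first

best_ssd, best_dir = -np.inf, None
for theta in np.linspace(0.0, np.pi, 1800):         # "keep rotating the line"
    u = np.array([np.cos(theta), np.sin(theta)])    # unit direction of the candidate line
    d = X @ u                                       # projected distances from the origin
    ssd = np.sum(d ** 2)                            # sum of squared distances
    if ssd > best_ssd:
        best_ssd, best_dir = ssd, u

print("PC1 loadings:", best_dir)
print("eigenvalue estimate, SSD/(n-1):", best_ssd / (len(X) - 1))
```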
Finding PCA

• Also, $b^2 = a^2 + d^2 = 1^2 + 3^2 = 10$, so $b = \sqrt{10} \approx 3.162$
• Scale the length $b$ to 1; this converts $a$ to $1/\sqrt{10} \approx 0.316$ and $d$ to $3/\sqrt{10} \approx 0.949$
• These are the loadings of PC1 on the two dimensions

• Recall the SSD: SSD/(n−1) is the eigenvalue of PC1
• Eigenvalue of PC1 / (sum of the eigenvalues of all the maximum possible vectors) = proportion of variation explained by PC1
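A minimal sketch of the loadings arithmetic above, plus a check that the eigenvalues of the covariance matrix of centred data equal SSD/(n−1) along the PCs (NumPy assumed):

```python
import numpy as np

# Loadings: normalize the PC1 direction vector (3, 1) from the slope-1/3 example
v = np.array([3.0, 1.0])
print(v / np.linalg.norm(v))          # -> [0.949, 0.316]

# For centred data X, the eigenvalues of the covariance matrix are the
# SSD/(n-1) values of the projections onto the corresponding PCs
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[3., 1.], [1., 1.]])
X = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(X.T @ X / (len(X) - 1))
print("eigenvalues (largest first):", eigvals[::-1])
print("share of variation explained by PC1:", eigvals[-1] / eigvals.sum())
```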
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Finding PCA Part 2


Finding PCA

• Suppose there are only 2 PCs here (why? With only two variables, at most two PCs are possible)
• Suppose the eigenvalue of PC1 = 20 and of PC2 = 5
• Then PC1 explains 20/25 = 80% and PC2 explains 5/25 = 20% of the overall variation

• Now let us estimate PC2; it must pass through the origin and be perpendicular to PC1
• So, slope of PC1 × slope of PC2 = −1
• Hence the slope of PC2 = −3, that is, a loading of 0.316 on the x-axis and −0.949 on the y-axis
Finding PCA

• The last step is to rotate so that PC1 is horizontal and PC2 is vertical
• Then, using the projections on PC1 and PC2, trace the original data points on the new plot, where PC1 is the X axis and PC2 is the Y axis

• Similarly, one can add a third factor
• How many factors are needed can be determined from a scree plot, a plot of each PC's explained-variance share in decreasing order (a sketch follows below)
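A minimal scree-plot sketch (scikit-learn and matplotlib assumed; the 3-d data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # toy 3-d data

pca = PCA().fit(X)   # centres the data internally

# Scree plot: explained-variance share of each PC, in decreasing order
plt.plot([1, 2, 3], pca.explained_variance_ratio_, "o-")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.title("Scree plot")
plt.show()
```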
Finding PCA

• Effectively, PCA converts the original variables into new dimensions that are perpendicular and uncorrelated, which can then be used to form clusters
• PC1 is more important than PC2; accordingly, distances between points along PC1 carry more weight than those along PC2
• The maximum number of PCs is the lower of the two counts: the number of individuals or the number of variables
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Lesson 2: K-Means Clustering


Module: Dimension Reduction and Clustering
K-means algorithm

• Randomly choose K centroids, one for each cluster
• Create K clusters by assigning each observation to its nearest centroid
• Compute the new centroids and again assign the observations to their nearest centroids
• Keep repeating until the centroids no longer change (a from-scratch sketch follows below)
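As a minimal from-scratch sketch of these steps (2-d data and Euclidean distance assumed; in practice one would use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K centroids from among the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (a robust version would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5], [5, 0])])
labels, centroids = kmeans(X, k=4)
print(centroids)
```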
K-means algorithm: A simple Example

[Figure sequence: starting from K = 4 initial centers, the slides show the cluster assignments and centroid updates over iterations 1 through 5.]

Example designed by Prof. John Guttag (MIT), Introduction to Computational Thinking and Data Science.
K-means algorithm

• The clustering result depends heavily on the choice of the number of clusters and the initial cluster centers
• If K is not chosen well, the result may be poor clustering
• Choose the initial K using a priori knowledge
K-means algorithm: Distance measure

• To cluster observations, one needs a distance measure
• Consider two features, x and y, on a two-dimensional chart
• Also consider two observations, A $(x_a, y_a)$ and B $(x_b, y_b)$, on this chart
• A natural measure of distance is the Euclidean distance between A and B: $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$
K-means algorithm: Distance measure

• A natural measure of distance is the Euclidean distance between A and B: $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$

[Figure source: John C. Hull, Machine Learning in Business, 2nd Edition]
K-means algorithm: Distance measure

• This distance can be extended to many dimensions; e.g., suppose there are m features, and the jth feature of the ith observation is $v_{ij}$
• Then the distance between the pth and qth observations is $\sqrt{\sum_{j=1}^{m} (v_{pj} - v_{qj})^2}$

• For example, with three features:
• $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$
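A minimal sketch of this distance computation (NumPy assumed):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two observations with m features each."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([1, 2], [4, 6]))        # 2-d case: sqrt(3^2 + 4^2) = 5.0
print(euclidean([1, 2, 3], [4, 6, 3]))  # 3-d case
```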
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Summary and Concluding Remarks

• Suppose you are a mobile phone manufacturer, and you survey various target populations about the different mobile offerings in competing segments.
• You would like to position your offering suitably in the target segment.
• You would also like to analyze the target customer segment using your data, facilitated by PCA and clustering analysis.
• Lastly, you want to examine latent dimensions of your data.
Summary and Concluding Remarks
• We noted that with one-dimensional (1-d) or two-dimensional (2-d) data, this kind of analysis is easier.
• However, as the number of dimensions increases, so does the complexity of the exercise. Thus, we need PCA.
• The following steps are needed in the PCA extraction process. First, we find the centre of the data and shift it to the origin in a manner that keeps the relative positions of all the points on the XY axes unchanged.
• In the shifted data, we find the PC as the best-fit line that maximizes the sum of squared distances (SSD) from the origin.
Summary and Concluding Remarks
• Next, we find the second PC as the line that goes through the origin, is orthogonal to PC1, and maximizes the SSD in a similar manner.
• Now we project the data points onto PC1 and PC2. Next, we rotate PC1 and PC2 so that PC1 is horizontal and PC2 is vertical.
• We recover the data points in the new coordinates using the projections on PC1 and PC2. One can find PC3, if required, in a similar manner: a vector orthogonal to PC1 and PC2 that passes through the origin and maximizes the SSD.
• Here SSD/(n−1) is the eigenvalue, which helps us examine the share of total variation in the data explained by each individual PC.
Summary and Concluding Remarks
• Finally, we conduct the K-means clustering exercise. For this, we randomly assign a predetermined number of centroids to the data plotted on the XY plane.
• Next, we compute the distance of each point to these centroids, and each point is assigned to its nearest centre. We then compute the revised centres of the clusters.
• We keep repeating this exercise until the centroids are steady and no switching of observations across clusters occurs. This leads us to the final set of clusters.
Thanks!
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Case Study: HR and Marketing Analytics
Introduction

• HR Analytics introduction and background


• Recruitment Analysis
• Employee engagement analysis
• Performance evaluation
• Employee safety and accident data analysis
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Case Study: HR Analytics


Case Study
• As an HR consultant, you have been invited by a Fortune 500 company to understand employees' needs, improve employee safety, understand and improve fairness and diversity, identify the drivers of employee attrition, identify the best recruiting source, and check whether employee workloads are appropriate.

• The company has collected data through the following surveys: recruitment firm, employee engagement, salary, gender, performance, accident, and location.
Part 1: Recruitment data analysis

• You are expected to analyze the data by performing the following tasks:
• Import the "recruitment_data.csv" file and find the number of hires per recruiting source
• Find which recruiting source provides the best-performing salespersons; also visualize the data
• Find which recruiting source's salespersons have the lowest attrition rates; also visualize the data
• Provide precise and sharp inferences from these results (a sketch follows below)
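A minimal pandas sketch of these tasks (the column names recruiting_source, sales_quota_pct, and attrition are assumptions; adjust to the actual file):

```python
import pandas as pd

df = pd.read_csv("recruitment_data.csv")

# Number of hires per recruiting source
print(df["recruiting_source"].value_counts())

# Average sales performance and attrition rate by source (column names assumed)
summary = df.groupby("recruiting_source")[["sales_quota_pct", "attrition"]].mean()
print(summary.sort_values("sales_quota_pct", ascending=False))

# Visualize both measures by source
summary.plot(kind="bar", subplots=True)
```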
Part 2: Employee engagement data analysis
• You are expected to analyze the data by performing the following tasks:
• Import the "survey_data.csv" file and find the number of employees in each department
• Find the average engagement score for each department, and classify employees with scores of 0, 1, or 2 as disengaged
• Find the department-wise average salary and vacation days for the disengaged employees; also visualize the data
• Examine whether engagement and vacation days are statistically different in the sales department vis-à-vis other departments
• Using these results, provide precise and sharp inferences about the link between engagement and business outcomes (a sketch follows below)
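A minimal sketch (the column names department, engagement, salary, and vacation_days_taken are assumptions):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv")

print(df["department"].value_counts())                # employees per department
print(df.groupby("department")["engagement"].mean())  # average engagement score

# Flag disengaged employees (scores 0, 1, or 2) and summarize their pay and vacations
df["disengaged"] = df["engagement"] <= 2
disengaged = df[df["disengaged"]]
print(disengaged.groupby("department")[["salary", "vacation_days_taken"]].mean())

# t-test: sales department vs. all others on engagement
sales = df.loc[df["department"] == "Sales", "engagement"]
others = df.loc[df["department"] != "Sales", "engagement"]
print(stats.ttest_ind(sales, others, equal_var=False))
```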
Part 3: Fair pay data analysis
• You are expected to analyze the data by performing the following tasks:
• Import the "fair_pay.csv" file and find the average salary of new hires and older employees; is the difference statistically significant?
• Summarize and visualize the salary data across job levels for new hires versus old employees
• Perform a multiple linear regression of salary on new-hire status and job levels (as categorical variables), and make inferences from the results
• Perform a multiple linear regression of salary on new-hire status and departments (as categorical variables), and make inferences from the results (a sketch of both regressions follows below)
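A minimal sketch using statsmodels formulas (the column names salary, new_hire, job_level, and department are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("fair_pay.csv")

# Average salary and t-test: new hires vs. older employees (new_hire assumed 0/1)
print(df.groupby("new_hire")["salary"].mean())
new = df.loc[df["new_hire"] == 1, "salary"]
old = df.loc[df["new_hire"] == 0, "salary"]
print(stats.ttest_ind(new, old, equal_var=False))

# Salary on new-hire status, with job level as a categorical control
print(smf.ols("salary ~ new_hire + C(job_level)", data=df).fit().summary())

# Salary on new-hire status, with department as a categorical control
print(smf.ols("salary ~ new_hire + C(department)", data=df).fit().summary())
```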
Part 4: Gender and performance data analysis
• You are expected to analyze the data by performing the following tasks:
• Import and merge the "hr_data.csv" and "performance_data.csv" files and examine whether average performance ratings differ by gender; also visualize the data
• Analyze the gender and performance data across job levels and make inferences; also visualize the data (a sketch follows below)
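A minimal sketch (the merge key employee_id and the column names gender, rating, and job_level are assumptions):

```python
import pandas as pd
from scipy import stats

hr = pd.read_csv("hr_data.csv")
perf = pd.read_csv("performance_data.csv")
df = hr.merge(perf, on="employee_id")  # merge key assumed

# Average performance rating by gender, with a t-test
print(df.groupby("gender")["rating"].mean())
male = df.loc[df["gender"] == "Male", "rating"]
female = df.loc[df["gender"] == "Female", "rating"]
print(stats.ttest_ind(male, female, equal_var=False))

# Break the comparison down by job level, and visualize
by_level = df.groupby(["job_level", "gender"])["rating"].mean().unstack()
print(by_level)
by_level.plot(kind="bar")
```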
Part 5: Location and accident data analysis
• You are expected to analyze the data by performing the following tasks:
• Import and merge the "hr_2.csv" and "accident_data.csv" data files and examine whether accident rates have increased from 2016 to 2017; also visualize the data
• Examine whether excessive employee overtime has any relationship with the accidents
• Using descriptive statistics, visualization techniques, and logit regression, examine whether employee engagement has changed over the years and whether it has any relation with the accidents
• Conclude the exercise with sharp and brief inferences (a sketch follows below)
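A minimal sketch (the merge key employee_id and the column names year, had_accident, overtime_hours, and engagement are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

hr = pd.read_csv("hr_2.csv")
acc = pd.read_csv("accident_data.csv")
df = hr.merge(acc, on="employee_id", how="left")   # merge key assumed
df["had_accident"] = df["had_accident"].fillna(0)  # no accident record -> 0

# Accident rate by year: did it increase from 2016 to 2017?
print(df.groupby("year")["had_accident"].mean())

# Engagement by year, and a logit of accidents on overtime and engagement
print(df.groupby("year")["engagement"].mean())
model = smf.logit("had_accident ~ overtime_hours + engagement", data=df).fit()
print(model.summary())
```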
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Summary and Concluding Remarks

• We started the case study implementation by setting the working directory, loading the relevant packages, and reading the data.
• First, we examined the recruitment source data. We found that employees coming from search firms perform poorly on sales-quota percentage achievement and also have higher attrition rates.
• In contrast, those coming from NA sources, that is, lateral hires such as LinkedIn or online direct-mail applications, do well on their sales quota targets and also have lower attrition rates.
Summary and Concluding Remarks

• Then we examined the employee engagement data.
• We found that engagement levels of sales department employees are quite low compared to other departments.
• One possible reason could be low vacation days, as the number of vacation days is significantly lower for this department than for the others.
Summary and Concluding Remarks
• In the next part, we examined pay fairness across the job mix and across departments.
• We found that the pay of new hires is higher than that of old employees. When we analyzed across job-mix levels (hourly, salaried, and managerial), we found that new hires are less likely to be hourly workers and more likely to be salaried and managerial employees.
• So the difference in pay is not on account of an old-vs-new bias, but simply because the composition of new employees is tilted more towards salaried workers and managers, who have higher salaries, and away from hourly wage earners, who have lower salaries.
Summary and Concluding Remarks
• We also examined the pay differences across departments.
• We found that pay differences between new and old employees are statistically significant only in the Engineering department.
• The pay is not statistically significantly different in the sales and finance departments.
• For these inferences, we employed linear regression analysis and t-tests of means.
Summary and Concluding Remarks
• In part 4, we examined performance ratings across gender and department. We found that there is a significant difference in performance ratings between male and female employees.
• Initial evidence suggested that male employees have higher ratings than female employees.
• Subsequently, we analyzed the male vs. female performance evaluations across different job levels, i.e., hourly, salaried, and managerial.
• We found that the systematic bias or difference appears only among hourly wage earners; in contrast, at the salaried and managerial levels there is no statistically significant difference or bias in the results.
Summary and Concluding Remarks
• In part 5 of the case, we examined the relationship between employee safety, disengagement, and vacation days taken.
• First, we found that there is a significant increase in accident rates in the Southfield locality, with a considerable rise from 2016 to 2017.
• We found that vacation days taken and disengagement levels may have some role to play in demotivating employees and causing accidents.
• In general, there is an increase in accidents from 2016 to 2017, which needs further examination.
Thanks!
