Week 11 Notes

INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Lesson 1: Principal Component Analysis
Module: Dimension Reduction and Clustering
Introduction
• PCA background and motivation, with a mobile phone survey example
• Understanding the data in one and two dimensions (1-d, 2-d)
• Understanding the PCA extraction process and eigenvalue analysis
• PCA extraction with 3-d data
• Introduction to the K-means clustering algorithm
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Background and Motivation
Let us start with a mobile phone example
• You have features such as Price, RAM, Brand, and other specs
• Using the buyer survey data, the following questions need to be answered:
(1) Which parameter (price, etc.) is important for each individual?
(2) How can we cluster the individuals using this data?
(3) Is there one particular parameter that is most important?
(4) Are there any hidden dimensions in the data (e.g., value-for-money)?
Background and Motivation
Data has two dimensions: (1) individuals and (2) variables (the mobile phone features)
• Either of these dimensions can describe the full variation in the data
• In practice, the variable dimension is usually the smaller one and hence determines the maximum possible number of PCs
• Let us start with a 1-d plot, where price is the only dimension; there, clustering is easy
Background and Motivation

What if the data has two attributes (RAM, Price), that is, a 2-d plot?
• Clustering becomes a little more difficult
• Complexity increases with more dimensions
• The idea is to extract a smaller number of dimensions and group the data using them
• Specifically, those dimensions that explain the variability in the data
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Finding PCA Part 1


Finding PCA

First, find the centroid of the data


• Project each point on the X and Y axes
• Find the averages; these averages along the X and Y axes are the coordinates of the centroid
• Now shift the data/axes so that the centroid becomes the new origin
• The relative positions of the points remain intact
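As a minimal sketch of this centring step (NumPy assumed; the 2-d data is made up):

```python
import numpy as np

# Hypothetical 2-d data: rows are individuals, columns are two features
X = np.array([[10., 6.], [11., 4.], [8., 5.], [3., 3.], [2., 2.], [1., 1.]])

centroid = X.mean(axis=0)       # averages along the X and Y axes
X_centred = X - centroid        # shift so the centroid becomes the new origin

print(centroid)                 # coordinates of the centroid
print(X_centred.mean(axis=0))   # now ~[0, 0]; relative positions are intact
```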
Finding PCA

Once the data is centred at the origin, we need to find the PCA dimension by fitting a line, the best-fit line
• What should be the right criterion?
• Recall the case where price was the only dimension needed
• All the variation in the data was explained by price; in fact, all the points lay on the price axis
Finding PCA

Once the data is centred at the origin, we need to find the PCA dimension by fitting the best-fit line:
• Choose a random line through the origin
• Project the data points onto it and aim to minimize the residuals, the $a_i$'s
• Recall the Pythagorean theorem: $a^2 + d^2 = b^2$
• Here, $b$ (each point's distance from the origin) is fixed, so minimizing $a$ is the same as maximizing $d$ (the projection's distance from the origin)
Finding PCA

• So we aim to maximize the Sum of Squared Distances, $\mathrm{SSD} = d_1^2 + d_2^2 + \cdots + d_n^2$, over all $n$ points
• Keep rotating the line until we find the best fit
• This line is our PC1

• For example, suppose PC1 has slope α = 1/3: as we move in the direction of PC1, we travel 3 units on the x-axis for each unit on the y-axis
• Thus, PC1 is a linear combination of the X and Y dimensions
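To make the rotate-and-compare idea concrete, here is a minimal brute-force sketch (toy data assumed): it sweeps candidate directions through the origin and keeps the one that maximizes the SSD of the projections. In practice PC1 is obtained from an eigen-decomposition or SVD; the sweep is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[3., 1.], [1., 1.]])  # toy correlated data
X = X - X.mean(axis=0)                                         # centre first

best_ssd, best_dir = -np.inf, None
for theta in np.linspace(0.0, np.pi, 1800):         # "keep rotating the line"
    u = np.array([np.cos(theta), np.sin(theta)])    # unit direction of the candidate line
    d = X @ u                                       # projected distances from the origin
    ssd = np.sum(d ** 2)                            # sum of squared distances
    if ssd > best_ssd:
        best_ssd, best_dir = ssd, u

print("PC1 loadings:", best_dir)
print("eigenvalue estimate, SSD/(n-1):", best_ssd / (len(X) - 1))
```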
Finding PCA

• Also, $b^2 = a^2 + d^2 = 1^2 + 3^2 = 10$, so $b = \sqrt{10} \approx 3.162$
• Scale the length $b$ to 1; this converts $a$ to $1/\sqrt{10} \approx 0.316$ and $d$ to $3/\sqrt{10} \approx 0.949$
• These are the loadings of PC1 on the two dimensions

• Recall the SSD: SSD/(n−1) is the eigenvalue of PC1
• Eigenvalue of PC1 / (sum of the eigenvalues of all the maximum possible vectors) = proportion of variation explained by PC1
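A minimal sketch of the loadings arithmetic above, plus a check that the eigenvalues of the covariance matrix of centred data equal SSD/(n−1) along the PCs (NumPy assumed):

```python
import numpy as np

# Loadings: normalize the PC1 direction vector (3, 1) from the slope-1/3 example
v = np.array([3.0, 1.0])
print(v / np.linalg.norm(v))          # -> [0.949, 0.316]

# For centred data X, the eigenvalues of the covariance matrix are the
# SSD/(n-1) values of the projections onto the corresponding PCs
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ np.array([[3., 1.], [1., 1.]])
X = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(X.T @ X / (len(X) - 1))
print("eigenvalues (largest first):", eigvals[::-1])
print("share of variation explained by PC1:", eigvals[-1] / eigvals.sum())
```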
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

PCA: Finding PCA Part 2


Finding PCA

• Suppose there are only 2 PCs here (why? With only two variables, at most two PCs are possible)
• Suppose the eigenvalue of PC1 = 20 and of PC2 = 5
• Then PC1 explains 20/25 = 80% and PC2 explains 5/25 = 20% of the overall variation

• Now let us estimate PC2; it must pass through the origin and be perpendicular to PC1
• So, slope of PC1 × slope of PC2 = −1
• Hence the slope of PC2 = −3, that is, a loading of 0.316 on the x-axis and −0.949 on the y-axis
Finding PCA

• The last step is to rotate so that PC1 is horizontal and PC2 is vertical
• Then, using the projections on PC1 and PC2, trace the original data points on the new plot, where PC1 is the X axis and PC2 is the Y axis

• Similarly, one can add a third factor
• How many factors are needed can be determined from a scree plot, a plot of each PC's explained-variance share in decreasing order (a sketch follows below)
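A minimal scree-plot sketch (scikit-learn and matplotlib assumed; the 3-d data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # toy 3-d data

pca = PCA().fit(X)   # centres the data internally

# Scree plot: explained-variance share of each PC, in decreasing order
plt.plot([1, 2, 3], pca.explained_variance_ratio_, "o-")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.title("Scree plot")
plt.show()
```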
Finding PCA

• Effectively, PCA converts the original variables into new dimensions that are perpendicular and uncorrelated, which can then be used to form clusters
• PC1 is more important than PC2; accordingly, distances between points along PC1 carry more weight than those along PC2
• The maximum number of PCs is the lower of the two counts: the number of individuals or the number of variables
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Lesson 2: K-Means Clustering


Module: Dimension Reduction and Clustering
K-means algorithm

• Randomly choose K centroids, one for each cluster
• Create K clusters by assigning each observation to its nearest centroid
• Compute the new centroids and again assign the observations to their nearest centroids
• Keep repeating until the centroids no longer change (a from-scratch sketch follows below)
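As a minimal from-scratch sketch of these steps (2-d data and Euclidean distance assumed; in practice one would use a library implementation such as scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose K centroids from among the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        # (a robust version would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in ([0, 0], [5, 5], [0, 5], [5, 0])])
labels, centroids = kmeans(X, k=4)
print(centroids)
```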
K-means algorithm: A simple Example

[Figure sequence: starting from K = 4 initial centers, the slides show the cluster assignments and centroid updates over iterations 1 through 5.]

Example designed by Prof. John Guttag (MIT), Introduction to Computational Thinking and Data Science.
K-means algorithm

• The clustering result depends heavily on the choice of the number of clusters and the initial cluster centers
• If K is not chosen well, the result may be poor clustering
• Choose the initial K using a priori knowledge
K-means algorithm: Distance measure

• To cluster observations, one needs a distance measure
• Consider two features, x and y, on a two-dimensional chart
• Also consider two observations, A $(x_a, y_a)$ and B $(x_b, y_b)$, on this chart
• A natural measure of distance is the Euclidean distance between A and B: $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$
K-means algorithm: Distance measure

• A natural measure of distance is the Euclidean distance between A and B: $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$

[Figure source: John C. Hull, Machine Learning in Business, 2nd Edition]
K-means algorithm: Distance measure

• This distance can be extended to many dimensions; e.g., suppose there are m features, and the jth feature of the ith observation is $v_{ij}$
• Then the distance between the pth and qth observations is $\sqrt{\sum_{j=1}^{m} (v_{pj} - v_{qj})^2}$

• For example, with three features:
• $\sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$
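A minimal sketch of this distance computation (NumPy assumed):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two observations with m features each."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([1, 2], [4, 6]))        # 2-d case: sqrt(3^2 + 4^2) = 5.0
print(euclidean([1, 2, 3], [4, 6, 3]))  # 3-d case
```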
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Summary and Concluding Remarks

• Suppose you are a mobile phone manufacturer, and you survey various target populations about the different mobile offerings in competing segments.
• You would like to position your offering suitably in the target segment.
• You would also like to analyze the target customer segment using your data, facilitated by PCA and clustering analysis.
• Lastly, you want to examine latent dimensions of your data.
Summary and Concluding Remarks
• We noted that with one-dimensional (1-d) or two-dimensional (2-d) data, this kind of analysis is easier.
• However, as the number of dimensions increases, so does the complexity of the exercise. Thus, we need PCA.
• The following steps are needed in the PCA extraction process. First, we find the centre of the data and shift it to the origin in a manner that keeps the relative positions of all the points on the XY axes unchanged.
• In the shifted data, we find the PC as the best-fit line that maximizes the sum of squared distances (SSD) from the origin.
Summary and Concluding Remarks
• Next, we find the second PC as the line that goes through the origin, is orthogonal to PC1, and maximizes the SSD in a similar manner.
• Now we project the data points onto PC1 and PC2. Next, we rotate PC1 and PC2 so that PC1 is horizontal and PC2 is vertical.
• We recover the data points in the new coordinates using the projections on PC1 and PC2. One can find PC3, if required, in a similar manner: a vector orthogonal to PC1 and PC2 that passes through the origin and maximizes the SSD.
• Here SSD/(n−1) is the eigenvalue, which helps us examine the share of total variation in the data explained by each individual PC.
Summary and Concluding Remarks
• Finally, we conduct the K-means clustering exercise. For this, we randomly assign a predetermined number of centroids to the data plotted on the XY plane.
• Next, we compute the distance of each point to these centroids, and each point is assigned to its nearest centre. We then compute the revised centres of the clusters.
• We keep repeating this exercise until the centroids are steady and no switching of observations across clusters occurs. This leads us to the final set of clusters.
Thanks!
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Case Study: HR and Marketing Analytics
Introduction

• HR Analytics introduction and background


• Recruitment Analysis
• Employee engagement analysis
• Performance evaluation
• Employee safety and accident data analysis
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Case Study: HR Analytics


Case Study
• As an HR consultant, you have been invited by a Fortune 500 company to understand employees' needs, improve employee safety, understand and improve fairness and diversity, identify the drivers of employee attrition, identify the best recruiting source, and check whether employee workloads are appropriate.

• The company has collected data through the following surveys: recruitment firm, employee engagement, salary, gender, performance, accident, and location.
Part 1: Recruitment data analysis

• You are expected to analyze the data by performing the following tasks:
• Import the "recruitment_data.csv" file and find the number of hires per recruiting source
• Find which recruiting source provides the best-performing salespersons; also visualize the data
• Find which recruiting source's salespersons have the lowest attrition rates; also visualize the data
• Provide precise and sharp inferences from these results (a sketch follows below)
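A minimal pandas sketch of these tasks (the column names recruiting_source, sales_quota_pct, and attrition are assumptions; adjust to the actual file):

```python
import pandas as pd

df = pd.read_csv("recruitment_data.csv")

# Number of hires per recruiting source
print(df["recruiting_source"].value_counts())

# Average sales performance and attrition rate by source (column names assumed)
summary = df.groupby("recruiting_source")[["sales_quota_pct", "attrition"]].mean()
print(summary.sort_values("sales_quota_pct", ascending=False))

# Visualize both measures by source
summary.plot(kind="bar", subplots=True)
```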
Part 2: Employee engagement data analysis
• You are expected to analyze the data by performing the following tasks:
• Import the "survey_data.csv" file and find the number of employees in each department
• Find the average engagement score for each department, and classify employees with scores of 0, 1, or 2 as disengaged
• Find the department-wise average salary and vacation days for the disengaged employees; also visualize the data
• Examine whether engagement and vacation days are statistically different in the sales department vis-à-vis other departments
• Using these results, provide precise and sharp inferences about the link between engagement and business outcomes (a sketch follows below)
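A minimal sketch (the column names department, engagement, salary, and vacation_days_taken are assumptions):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv")

print(df["department"].value_counts())                # employees per department
print(df.groupby("department")["engagement"].mean())  # average engagement score

# Flag disengaged employees (scores 0, 1, or 2) and summarize their pay and vacations
df["disengaged"] = df["engagement"] <= 2
disengaged = df[df["disengaged"]]
print(disengaged.groupby("department")[["salary", "vacation_days_taken"]].mean())

# t-test: sales department vs. all others on engagement
sales = df.loc[df["department"] == "Sales", "engagement"]
others = df.loc[df["department"] != "Sales", "engagement"]
print(stats.ttest_ind(sales, others, equal_var=False))
```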
Part 3: Fair pay data analysis
• You are expected to analyze the data by performing the following tasks:
• Import the "fair_pay.csv" file and find the average salary of new hires and older employees; is the difference statistically significant?
• Summarize and visualize the salary data across job levels for new hires versus old employees
• Perform a multiple linear regression of salary on new-hire status and job levels (as categorical variables), and make inferences from the results
• Perform a multiple linear regression of salary on new-hire status and departments (as categorical variables), and make inferences from the results (a sketch of both regressions follows below)
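A minimal sketch using statsmodels formulas (the column names salary, new_hire, job_level, and department are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("fair_pay.csv")

# Average salary and t-test: new hires vs. older employees (new_hire assumed 0/1)
print(df.groupby("new_hire")["salary"].mean())
new = df.loc[df["new_hire"] == 1, "salary"]
old = df.loc[df["new_hire"] == 0, "salary"]
print(stats.ttest_ind(new, old, equal_var=False))

# Salary on new-hire status, with job level as a categorical control
print(smf.ols("salary ~ new_hire + C(job_level)", data=df).fit().summary())

# Salary on new-hire status, with department as a categorical control
print(smf.ols("salary ~ new_hire + C(department)", data=df).fit().summary())
```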
Part 4: Gender and performance data analysis
• You are expected to analyze the data by performing the following tasks:
• Import and merge the "hr_data.csv" and "performance_data.csv" files and examine whether average performance ratings differ by gender; also visualize the data
• Analyze the gender and performance data across job levels and make inferences; also visualize the data (a sketch follows below)
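A minimal sketch (the merge key employee_id and the column names gender, rating, and job_level are assumptions):

```python
import pandas as pd
from scipy import stats

hr = pd.read_csv("hr_data.csv")
perf = pd.read_csv("performance_data.csv")
df = hr.merge(perf, on="employee_id")  # merge key assumed

# Average performance rating by gender, with a t-test
print(df.groupby("gender")["rating"].mean())
male = df.loc[df["gender"] == "Male", "rating"]
female = df.loc[df["gender"] == "Female", "rating"]
print(stats.ttest_ind(male, female, equal_var=False))

# Break the comparison down by job level, and visualize
by_level = df.groupby(["job_level", "gender"])["rating"].mean().unstack()
print(by_level)
by_level.plot(kind="bar")
```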
Part 5: Location and accident data analysis
• You are expected to analyze the data by performing the following tasks:
• Import and merge the "hr_2.csv" and "accident_data.csv" data files and examine whether accident rates have increased from 2016 to 2017; also visualize the data
• Examine whether excessive employee overtime has any relationship with the accidents
• Using descriptive statistics, visualization techniques, and logit regression, examine whether employee engagement has changed over the years and whether it has any relation with the accidents
• Conclude the exercise with sharp and brief inferences (a sketch follows below)
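A minimal sketch (the merge key employee_id and the column names year, had_accident, overtime_hours, and engagement are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

hr = pd.read_csv("hr_2.csv")
acc = pd.read_csv("accident_data.csv")
df = hr.merge(acc, on="employee_id", how="left")   # merge key assumed
df["had_accident"] = df["had_accident"].fillna(0)  # no accident record -> 0

# Accident rate by year: did it increase from 2016 to 2017?
print(df.groupby("year")["had_accident"].mean())

# Engagement by year, and a logit of accidents on overtime and engagement
print(df.groupby("year")["engagement"].mean())
model = smf.logit("had_accident ~ overtime_hours + engagement", data=df).fit()
print(model.summary())
```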
INDIAN INSTITUTE OF TECHNOLOGY KANPUR

Summary and Concluding Remarks

• We started the case study implementation by setting the working directory, loading the relevant packages, and reading the data.
• First, we examined the recruitment source data. We found that employees coming from search firms perform poorly on sales-quota percentage achievement and also have higher attrition rates.
• In contrast, those coming from NA sources, that is, lateral hires such as LinkedIn or online direct-mail applications, do well on their sales quota targets and also have lower attrition rates.
Summary and Concluding Remarks

• Then we examined the employee engagement data.
• We found that engagement levels of sales department employees are quite low compared to other departments.
• One possible reason could be low vacation days, as the number of vacation days is significantly lower for this department than for the others.
Summary and Concluding Remarks
• In the next part, we examined pay fairness across the job mix and across departments.
• We found that the pay of new hires is higher than that of old employees. When we analyzed across job-mix levels (hourly, salaried, and managerial), we found that new hires are less likely to be hourly workers and more likely to be salaried and managerial employees.
• So the difference in pay is not on account of an old-vs-new bias, but simply because the composition of new employees is tilted more towards salaried workers and managers, who have higher salaries, and away from hourly wage earners, who have lower salaries.
Summary and Concluding Remarks
• We also examined the pay differences across departments.
• We found that pay differences between new and old employees are statistically significant only in the Engineering department.
• The pay is not statistically significantly different in the sales and finance departments.
• For these inferences, we employed linear regression analysis and t-tests of means.
Summary and Concluding Remarks
• In part 4, we examined performance ratings across gender and department. We found that there is a significant difference in performance ratings between male and female employees.
• Initial evidence suggested that male employees have higher ratings than female employees.
• Subsequently, we analyzed the male vs. female performance evaluations across different job levels, i.e., hourly, salaried, and managerial.
• We found that the systematic bias or difference appears only among hourly wage earners; in contrast, at the salaried and managerial levels there is no statistically significant difference or bias in the results.
Summary and Concluding Remarks
• In part 5 of the case, we examined the relationship between employee safety, disengagement, and vacation days taken.
• First, we found that there is a significant increase in accident rates in the Southfield locality, with a considerable rise from 2016 to 2017.
• We found that vacation days taken and disengagement levels may have some role to play in demotivating employees and causing accidents.
• In general, there is an increase in accidents from 2016 to 2017, which needs further examination.
Thanks!
