Section 09 - Data Processing

The document provides an overview of data summarization techniques, including descriptive statistics, handling missing values, and calculating measures like mean, variance, and skewness. It also covers methods for detecting outliers, normalizing data, and using Principal Component Analysis (PCA) for dimensionality reduction. Additionally, it explains the MICE method for imputing missing data through multivariate approaches.


© Jitesh Khurkhuriya – Azure ML Online Course

Summarize Data



Summarize Data Module
• Generates basic descriptive statistics for the columns in a dataset

• Identifies all columns with missing values

• Gets a count of categorical values for a column

• Computes numerical statistics such as the mean and standard deviation of a column
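
The kind of per-column summary listed above can be sketched in plain Python. This is only an illustration and not the Azure ML module itself; the toy rows, the column names and the `summarize` helper are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"department": "IT", "salary": 5000},
    {"department": "IT", "salary": 5200},
    {"department": "Marketing", "salary": None},
    {"department": None, "salary": 6000},
    {"department": "Finance", "salary": 5800},
]

def summarize(rows):
    """Return missing counts plus category counts or numeric statistics per column."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        info = {"count": len(present), "missing": len(values) - len(present)}
        if all(isinstance(v, (int, float)) for v in present):
            # Numeric column: mean and sample standard deviation.
            info["mean"] = mean(present)
            info["stdev"] = stdev(present)
        else:
            # Categorical column: how often each value occurs.
            info["value_counts"] = dict(Counter(present))
        summary[column] = info
    return summary

summary = summarize(rows)
for column, info in summary.items():
    print(column, info)
```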



Some Additional Terms



Mean Deviation
Row Number   Salary
1            $ 3,725
2            $ 4,155
3            $ 4,627
4            $ 5,147
5            $ 5,718
6            $ 6,347
7            $ 7,039
8            $ 7,210
9            $ 7,423
10           $ 7,556
11           $ 8,369
12           $ 8,810
13           $ 8,940
14           $ 9,200
15           $ 9,458

\[ \text{Mean} = \frac{\text{Sum of Salary}}{\text{Number of Observations}} = \frac{\$\,103{,}724}{15} \approx \$\,6{,}915 \]
Mean Deviation
Row Number   Salary     Distance from Mean
1            $ 3,725    $3,190
2            $ 4,155    $2,760
3            $ 4,627    $2,288
4            $ 5,147    $1,768
5            $ 5,718    $1,197
6            $ 6,347    $568
7            $ 7,039    $124
8            $ 7,210    $295
9            $ 7,423    $508
10           $ 7,556    $641
11           $ 8,369    $1,454
12           $ 8,810    $1,895
13           $ 8,940    $2,025
14           $ 9,200    $2,285
15           $ 9,458    $2,543

Mean = $ 6,915
Mean Deviation = $ 1,569
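
As a rough check of the figures on this slide, the mean and the mean deviation (the average absolute distance from the mean) can be computed directly. The helper name `mean_deviation` is an assumption for this sketch. It uses the unrounded mean, so the results match the slide only after rounding.

```python
from statistics import mean

salaries = [3725, 4155, 4627, 5147, 5718, 6347, 7039, 7210,
            7423, 7556, 8369, 8810, 8940, 9200, 9458]

def mean_deviation(values):
    """Average absolute distance of each value from the mean."""
    centre = mean(values)
    return sum(abs(v - centre) for v in values) / len(values)

salary_mean = mean(salaries)
salary_mean_deviation = mean_deviation(salaries)
print(round(salary_mean), round(salary_mean_deviation))
```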
Sample Variance & Standard Deviation
Mean = $ 6,915

Salary (X)   Distance from Mean   Square of the distance
$ 3,725      $3,190               10,176,100
$ 4,155      $2,760               7,617,600
$ 4,627      $2,288               5,234,944
$ 5,147      $1,768               3,125,824
$ 5,718      $1,197               1,432,809
$ 6,347      $568                 322,624
$ 7,039      $124                 15,376
$ 7,210      $295                 87,025
$ 7,423      $508                 258,064
$ 7,556      $641                 410,881
$ 8,369      $1,454               2,114,116
$ 8,810      $1,895               3,591,025
$ 8,940      $2,025               4,100,625
$ 9,200      $2,285               5,221,225
$ 9,458      $2,543               6,466,849

\[ \text{Variance } (S^2) = \frac{\text{Sum of squared distances}}{N - 1} \]

\[ \text{Sample Standard Deviation} = \sqrt{\text{Variance}} \]
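
The two formulas above can be applied to the salary column as a minimal sketch. The function name `sample_variance` is an assumption; the standard library's `statistics.variance` and `statistics.stdev` use the same N - 1 denominator and are shown only as a cross-check.

```python
import math
from statistics import mean, stdev, variance

salaries = [3725, 4155, 4627, 5147, 5718, 6347, 7039, 7210,
            7423, 7556, 8369, 8810, 8940, 9200, 9458]

def sample_variance(values):
    """Sum of squared distances from the mean, divided by N - 1."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / (len(values) - 1)

salary_variance = sample_variance(salaries)
salary_stdev = math.sqrt(salary_variance)

# Cross-check against the standard library implementations.
print(salary_variance, variance(salaries))
print(salary_stdev, stdev(salaries))
```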
Quartile
Row Number   Salary
1            $ 3,725
2            $ 4,155
3            $ 4,627
4            $ 5,147    1st Quartile
5            $ 5,718
6            $ 6,347
7            $ 7,039
8            $ 7,210    Median
9            $ 7,423
10           $ 7,556
11           $ 8,369
12           $ 8,810    3rd Quartile
13           $ 8,940
14           $ 9,200
15           $ 9,458

Inter Quartile Range (IQR) = Q3 – Q1
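
The quartile positions marked on the slide can be reproduced with `statistics.quantiles` (Python 3.8+). Its default "exclusive" method happens to land on rows 4, 8 and 12 for 15 observations; other quartile conventions can give slightly different cut points, so treat this as one possible sketch.

```python
from statistics import quantiles

salaries = [3725, 4155, 4627, 5147, 5718, 6347, 7039, 7210,
            7423, 7556, 8369, 8810, 8940, 9200, 9458]

# n=4 splits the data into four groups and returns the three cut points.
q1, median, q3 = quantiles(salaries, n=4)
iqr = q3 - q1
print(q1, median, q3, iqr)
```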
Skewness
• Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean – Wikipedia

[Figure: two distributions, one with Positive Skew and one with Negative Skew]
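
The slide gives no formula, so the sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment). The function name and the choice of definition are assumptions. The two samples are salary columns from elsewhere in this deck.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Salary column from the Mean Deviation slides (longer tail on the low side).
left_tailed = [3725, 4155, 4627, 5147, 5718, 6347, 7039, 7210,
               7423, 7556, 8369, 8810, 8940, 9200, 9458]
# Salary column from the Outliers slide (two very large values).
right_tailed = [4000, 4500, 8000, 5300, 5700, 7200, 7400, 7900,
                6400, 21000, 24000, 6200]

print(skewness(left_tailed))   # negative value: negative skew
print(skewness(right_tailed))  # positive value: positive skew
```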


Outliers



Outliers
• Observation that is distant from other observations
• Impacts the predictions or estimates

Salary
$ 4,000
$ 4,500
$ 8,000
$ 5,300
$ 5,700
$ 7,200
$ 7,400
$ 7,900
$ 6,400
$ 21,000
$ 24,000
$ 6,200

Mean with all 12 observations = $ 107,600 / 12 = $ 8,967
Mean without the two outliers ($ 21,000 and $ 24,000) = $ 62,600 / 10 = $ 6,260
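
The effect of the two large salaries on the mean can be verified with a few lines. The threshold of 20,000 used to drop them is an assumption chosen only to separate the two values in this example.

```python
from statistics import mean

salaries = [4000, 4500, 8000, 5300, 5700, 7200, 7400, 7900,
            6400, 21000, 24000, 6200]

# Assumed cut-off for this illustration only.
without_outliers = [s for s in salaries if s < 20000]

mean_all = mean(salaries)
mean_trimmed = mean(without_outliers)
print(round(mean_all), mean_trimmed)
```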
Outliers – Occurrences and Causes
• Human error

• Malfunction of the measurement equipment

• Data transmission or transcription error

• System Behaviour

• Fraudulent behaviour

• Natural Outliers

• Sampling error



Types of Outliers
Univariate:
Salary
$ 4,000
$ 4,500
$ 8,000
$ 5,300
$ 5,700
$ 7,200
$ 7,400
$ 7,900
$ 6,400
$ 21,000
$ 24,000
$ 6,200

Multivariate:
[Figure: scatter plot of Salary against Years of Experience]
Impact of Outliers



How to Detect Outliers?
• Most common method is visualisation

• Box Plot, Histogram, Scatter plot

• Percentile measures

[Figure: distribution with the 10% and 90% percentile cut-offs marked]
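
One way to turn the box plot idea into code is the usual 1.5 × IQR fence rule. The slides do not state this rule, so the multiplier of 1.5 and the function name `iqr_outliers` are assumptions for this sketch.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

salaries = [4000, 4500, 8000, 5300, 5700, 7200, 7400, 7900,
            6400, 21000, 24000, 6200]
outliers = iqr_outliers(salaries)
print(outliers)
```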


Normalize Data



What is Normalization?
• A method to standardise the range of independent variables or features of data

• Variables are fitted within a certain range (Generally between 0 and 1)

• Applied on numeric columns



Why Normalise the Data?

\[ Y = a + bX \]

\[ Y = a + b_1 X_1 + b_2 X_2 \]

X1 = 1, 2, 3, ..., 20

X2 = 1000, 2000, ..., 20000

X1 and X2 differ in scale by a factor of 1,000.


Normalize data – Transformation Methods
Most commonly used transformation methods:

ZScore
\[ Z = \frac{x - \text{mean}(x)}{\text{stdev}(x)} \]

MinMax
\[ Z = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Logistic
\[ Z = \frac{1}{1 + \exp(-x)} \]
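
The three transformations can be sketched as small Python functions. The function names are assumptions, and `zscore` uses the sample standard deviation, which the slide does not specify. The `x2` values follow the X2 = 1000, 2000, ..., 20000 example from the previous slide.

```python
import math
from statistics import mean, stdev

def zscore(values):
    """(x - mean(x)) / stdev(x), using the sample standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]

def minmax(values):
    """(x - min(x)) / (max(x) - min(x)), which maps the column onto [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def logistic(values):
    """1 / (1 + exp(-x)), which maps any real value into (0, 1)."""
    return [1 / (1 + math.exp(-v)) for v in values]

x2 = [1000 * i for i in range(1, 21)]
x2_minmax = minmax(x2)
x2_zscore = zscore(x2)
small_logistic = logistic([-2.0, 0.0, 2.0])
print(x2_minmax[:3], x2_zscore[:3], small_logistic)
```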


Principal Component Analysis



Curse of dimensionality
• 100s or 1000s of variables in a dataset

• Data becomes sparse as the available space increases multi-fold

• Sparse data can result in lower accuracy

• Requires higher run-time

• May lead to overfitting

[Figure: Performance plotted against Dimensionality (Number of Features), with the Optimal Number of Features marked]


What is a Principal Component?

• Creates a new set of coordinates for the data

• Reveals the internal structure of the data that best explains the variance in data

• Reduces the dimensionality of the multivariate dataset


What is PCA?

Reveals the internal structure of the data that best explains the variance in data

[Figure: scatter plot of the observations on axes X1 and X2]

[Figure: plotting all observations on X1]

[Figure: plotting all observations on X2]

[Figure: two scatter plots on axes X1 and X2, side by side]

[Figure: scatter plot on axes X1 and X2]


Understanding the PCA
[Figure: spread of data on axes X1 and X2, with eigenvectors ev1 and ev2 drawn through the data and eigenvalue1 marked along ev1]

ev1 has the higher eigenvalue. Hence drop ev2, as it explains much less variation than ev1.
PCA

[Figure: the data projected onto the first principal component, PC1]
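
For two variables, the eigenvectors and eigenvalues described above can be worked out in closed form from the 2×2 covariance matrix. The following is a minimal sketch under that assumption; the data points and the function name `principal_axes_2d` are hypothetical, and a real workflow would normally use a linear algebra library instead.

```python
import math

def principal_axes_2d(x1, x2):
    """Eigen-decompose the 2x2 sample covariance matrix of two variables."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)

    # Closed-form eigenvalues of the symmetric matrix [[s11, s12], [s12, s22]].
    half_trace = (s11 + s22) / 2
    root = math.hypot((s11 - s22) / 2, s12)
    eigenvalue1, eigenvalue2 = half_trace + root, half_trace - root

    # Direction of the eigenvector with the larger eigenvalue (ev1);
    # ev2 is perpendicular to it.
    angle = 0.5 * math.atan2(2 * s12, s11 - s22)
    ev1 = (math.cos(angle), math.sin(angle))
    ev2 = (-math.sin(angle), math.cos(angle))
    return (m1, m2), (eigenvalue1, ev1), (eigenvalue2, ev2)

# Hypothetical, strongly correlated observations.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.0]

(m1, m2), (eigenvalue1, ev1), (eigenvalue2, ev2) = principal_axes_2d(x1, x2)

# Dropping ev2: each observation is reduced to one coordinate along ev1 (PC1).
pc1_scores = [(a - m1) * ev1[0] + (b - m2) * ev1[1] for a, b in zip(x1, x2)]
explained_by_pc1 = eigenvalue1 / (eigenvalue1 + eigenvalue2)
print(eigenvalue1, eigenvalue2, explained_by_pc1)
```

The variance of `pc1_scores` equals `eigenvalue1`, which is why the eigenvalue is read as the amount of variation explained by that direction.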


Clean Missing Data with MICE



MICE
• Replacing with mean, mode or a custom value is a Single Imputation Method

• MICE stands for Multivariate Imputation using Chained Equations, also called Multiple Imputation by Chained Equations

• Each variable with missing data is modelled conditionally using the other variables in the data

• Assumes that the data is Missing at Random

• Uses regression for predicting continuous variables and classification for categorical missing values


Simple example
Original Dataset

Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         $ 5,500
38         $ 7,000
42         $ 7,500
33         $ 6,200
46         $ 7,800
48         $ 8,000
51         $ 8,500
43         $ 7,600
55         $ 8,500


Simple example
Missing Values

Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         (missing)
38         $ 7,000
42         (missing)
33         $ 6,200
(missing)  $ 7,800
48         $ 8,000
(missing)  $ 8,500
43         $ 7,600
55         $ 8,500


MICE Steps

Step 1 – Calculate the Mean based on the available values

Step 2 – Replace all missing values with mean

Step 3 – Choose Dependent column and restore original

Step 4 – Apply transformation and create prediction model

Step 5 – Predict Missing values and repeat steps 3 to 5



Step 1 – Calculate the Mean based on the available values
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         (missing)
38         $ 7,000
42         (missing)
33         $ 6,200
(missing)  $ 7,800
48         $ 8,000
(missing)  $ 8,500
43         $ 7,600
55         $ 8,500

Age Mean = 38.1
Salary Mean = $ 7,080


Step 2 – Replace all missing values with mean
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         $ 7,080
38         $ 7,000
42         $ 7,080
33         $ 6,200
38.1       $ 7,800
48         $ 8,000
38.1       $ 8,500
43         $ 7,600
55         $ 8,500

Age Mean = 38.1
Salary Mean = $ 7,080


Step 3 – Choose Dependent column and restore original
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         (missing)
38         $ 7,000
42         (missing)
33         $ 6,200
38.1       $ 7,800
48         $ 8,000
38.1       $ 8,500
43         $ 7,600
55         $ 8,500


Step 4 – Apply transformation and create prediction model
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         (missing)
38         $ 7,000
42         (missing)
33         $ 6,200
38.1       $ 7,800
48         $ 8,000
38.1       $ 8,500
43         $ 7,600
55         $ 8,500


Step 5 – Predict Missing values
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         (missing)
38         $ 7,000
42         (missing)
33         $ 6,200
38.1       $ 7,800
48         $ 8,000
38.1       $ 8,500
43         $ 7,600
55         $ 8,500

For Age = 29: Salary = 132.07 (29) + 1979.3 = $ 5,809.33 (original salary $ 5,500)
For Age = 42: Salary = 132.07 (42) + 1979.3 = $ 7,526.24 (original salary $ 7,500)
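
Steps 1 to 5 for the Salary column can be sketched with an ordinary least-squares line. This assumes the prediction model on the slide is a simple linear regression of Salary on Age, which is consistent with its coefficients; the helper `fit_line` and the variable names are assumptions. Unrounded coefficients are used, so the predictions differ from the slide by a few cents.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# None marks a missing value, as in the Missing Values table.
ages = [23, 34, 36, 29, 38, 42, 33, None, 48, None, 43, 55]
salaries = [4000, 6500, 6700, None, 7000, None, 6200, 7800, 8000, 8500, 7600, 8500]

# Steps 1 and 2: fill the missing ages with the mean of the available ages.
known_ages = [a for a in ages if a is not None]
age_mean = sum(known_ages) / len(known_ages)
filled_ages = [age_mean if a is None else a for a in ages]

# Steps 3 and 4: Salary is the dependent column; train on rows where it is known.
train = [(a, s) for a, s in zip(filled_ages, salaries) if s is not None]
slope, intercept = fit_line([a for a, _ in train], [s for _, s in train])

# Step 5: predict the missing salaries.
predicted = {a: slope * a + intercept
             for a, s in zip(filled_ages, salaries) if s is None}
print(round(slope, 2), round(intercept, 1), predicted)
```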


Repeat for Age with new values of Salary


New Prediction Model
Age        Salary
23         $ 4,000
34         $ 6,500
36         $ 6,700
29         $ 5,809.33
38         $ 7,000
42         $ 7,526.24
33         $ 6,200
(missing)  $ 7,800
48         $ 8,000
(missing)  $ 8,500
43         $ 7,600
55         $ 8,500

For Salary = $ 7,800: Age = 0.007 (7800) – 9.1214 = 45.48 (original age 46)
For Salary = $ 8,500: Age = 0.007 (8500) – 9.1214 = 50.38 (original age 51)
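
The second pass can be sketched the same way, taking the two predicted salaries from the slide as given and regressing Age on Salary. The helper `fit_line` is again an assumption. Note that the slide rounds the slope to 0.007; with the unrounded slope (about 0.00696) the predicted ages come out near 45.2 and 50.0 rather than 45.48 and 50.38.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Ages with the originally missing entries restored to None; salaries now
# include the first-pass predictions 5,809.33 and 7,526.24 from the slide.
ages = [23, 34, 36, 29, 38, 42, 33, None, 48, None, 43, 55]
salaries = [4000, 6500, 6700, 5809.33, 7000, 7526.24, 6200,
            7800, 8000, 8500, 7600, 8500]

# Age is now the dependent column; train on rows where it is known.
train = [(s, a) for s, a in zip(salaries, ages) if a is not None]
slope, intercept = fit_line([s for s, _ in train], [a for _, a in train])

age_for_7800 = slope * 7800 + intercept
age_for_8500 = slope * 8500 + intercept
print(round(slope, 5), round(intercept, 4), age_for_7800, age_for_8500)
```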


MICE Result – 2 iterations

Original              Replace with MICE       Replace with Mean
Age    Salary         Age      Salary         Age     Salary
23     $ 4,000        23       $ 4,000        23      $ 4,000
34     $ 6,500        34       $ 6,500        34      $ 6,500
36     $ 6,700        36       $ 6,700        36      $ 6,700
29     $ 5,500        29       $ 5,809.33     29      $ 7,080
38     $ 7,000        38       $ 7,000        38      $ 7,000
42     $ 7,500        42       $ 7,526.24     42      $ 7,080
33     $ 6,200        33       $ 6,200        33      $ 6,200
46     $ 7,800        45.48    $ 7,800        38.1    $ 7,800
48     $ 8,000        48       $ 8,000        48      $ 8,000
51     $ 8,500        50.38    $ 8,500        38.1    $ 8,500
43     $ 7,600        43       $ 7,600        43      $ 7,600
55     $ 8,500        55       $ 8,500        55      $ 8,500
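
The comparison in this table can be summarised as a mean absolute error over the four imputed cells. The numbers below are copied from the table; the use of mean absolute error as the measure and the helper name are assumptions for this sketch.

```python
def mean_absolute_error(actual, imputed):
    """Average absolute gap between the original and imputed values."""
    return sum(abs(a - i) for a, i in zip(actual, imputed)) / len(actual)

# Original values of the four cells that were removed.
original_salary, original_age = [5500, 7500], [46, 51]

# Values filled in by each method, taken from the table above.
mice_salary, mice_age = [5809.33, 7526.24], [45.48, 50.38]
mean_salary, mean_age = [7080, 7080], [38.1, 38.1]

salary_error_mice = mean_absolute_error(original_salary, mice_salary)
salary_error_mean = mean_absolute_error(original_salary, mean_salary)
age_error_mice = mean_absolute_error(original_age, mice_age)
age_error_mean = mean_absolute_error(original_age, mean_age)

print(salary_error_mice, salary_error_mean)
print(age_error_mice, age_error_mean)
```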


SMOTE



Dealing with Imbalanced Dataset
• Presence of a minority class in the dataset

• Challenges related to an imbalanced dataset
  • Biased predictions
  • Misleading accuracy

• Some Examples
  • Credit card frauds
  • Manufacturing defects
  • Rare diseases diagnosis
  • Natural disasters
  • Enrolment to premier institutes

Two Class Classification
No-Fraud: 99.5%
Fraud: 0.5%


Re-Sample the Dataset
• Balance the classes by increasing the minority or decreasing the majority

• Random Under-Sampling
  • Randomly remove majority class observations
  • Helps balance the dataset
  • Discarded observations could have important information
  • May lead to bias

  Total Observations = 1,000
  Fraudulent = 10 or 1%
  Normal = 990 or 99%
  Reduce normal to 90
  Fraudulent = 10 or 10%

• Random Over-Sampling
  • Randomly add more minority observations by replication
  • No information loss
  • Prone to overfitting due to copying same information

  Total Observations = 1,000
  Fraudulent = 10 or 1%
  Normal = 990 or 99%
  Increase fraudulent by 100
  Fraudulent = 110 or 10%
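
Both re-sampling strategies can be sketched with the standard `random` module, using the counts from the slide (10 fraudulent and 990 normal observations). The function names, the fixed seed and the tuple representation of an observation are assumptions for this illustration.

```python
import random

def random_under_sample(majority, keep, rng):
    """Keep a random subset of the majority class (without replacement)."""
    return rng.sample(majority, keep)

def random_over_sample(minority, extra, rng):
    """Add `extra` randomly chosen copies of minority observations."""
    return minority + rng.choices(minority, k=extra)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
normal = [("normal", i) for i in range(990)]
fraud = [("fraud", i) for i in range(10)]

# Under-sampling: reduce normal to 90, giving 10 fraud out of 100.
under = random_under_sample(normal, 90, rng) + fraud
# Over-sampling: add 100 replicated fraud rows, giving 110 fraud out of 1,100.
over = normal + random_over_sample(fraud, 100, rng)

def fraud_share(rows):
    return sum(1 for label, _ in rows if label == "fraud") / len(rows)

print(len(under), fraud_share(under))
print(len(over), fraud_share(over))
```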


SMOTE
• Synthetic Minority Oversampling Technique

• Creates new “Synthetic” observations

• SMOTE Process
  • Identify a feature vector and its nearest neighbour
  • Take the difference between the two
  • Multiply the difference by a random number between 0 and 1
  • Identify a new point on the line segment by adding the scaled difference to the feature vector
  • Repeat the process for the identified feature vectors
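
The process above can be sketched for a single nearest neighbour. This is a simplified illustration, not a full SMOTE implementation (which usually picks among several nearest neighbours); the minority points, the function names and the fixed seed are assumptions. `math.dist` requires Python 3.8 or later.

```python
import math
import random

def nearest_neighbour(point, others):
    """Closest other point by Euclidean distance."""
    return min(others, key=lambda other: math.dist(point, other))

def smote_sample(minority, rng):
    """Create one synthetic point between a minority point and its neighbour."""
    base = rng.choice(minority)
    neighbour = nearest_neighbour(base, [p for p in minority if p is not base])
    gap = rng.random()  # random number between 0 and 1
    # new point = feature vector + gap * (neighbour - feature vector)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Hypothetical minority-class feature vectors.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 2.5), (3.0, 2.0)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
synthetic = [smote_sample(minority, rng) for _ in range(5)]
print(synthetic)
```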


SMOTE



Join Data



What is Join Data?
• Information is provided in two or more datasets
• Different sources
• Created at different times

• Datasets are related by key columns

• Different types of Join supported by AzureML


• Inner Join
• Left Outer Join
• Full Outer Join
• Left Semi-join



Inner Join
Left dataset:
EmpID    Salary
EMP001   $ 5,000
EMP002   $ 5,500
EMP003   $ 5,200
EMP004   $ 6,000
EMP007   $ 5,800
EMP008   $ 6,700

Right dataset:
EmpID    Department
EMP001   IT
EMP003   IT
EMP004   Marketing
EMP007   Finance
EMP009   Marketing
EMP010   Finance

Result:
EmpID    Salary    Department
EMP001   $ 5,000   IT
EMP003   $ 5,200   IT
EMP004   $ 6,000   Marketing
EMP007   $ 5,800   Finance


Full Outer Join
Left dataset:
EmpID    Salary
EMP001   $ 5,000
EMP002   $ 5,500
EMP003   $ 5,200
EMP004   $ 6,000
EMP007   $ 5,800
EMP008   $ 6,700

Right dataset:
EmpID    Department
EMP001   IT
EMP003   IT
EMP004   Marketing
EMP007   Finance
EMP009   Marketing
EMP010   Finance

Result:
EmpID    Salary    Department
EMP001   $ 5,000   IT
EMP002   $ 5,500
EMP003   $ 5,200   IT
EMP004   $ 6,000   Marketing
EMP007   $ 5,800   Finance
EMP008   $ 6,700
EMP009             Marketing
EMP010             Finance


Left Outer Join
Left dataset:
EmpID    Salary
EMP001   $ 5,000
EMP002   $ 5,500
EMP003   $ 5,200
EMP004   $ 6,000
EMP007   $ 5,800
EMP008   $ 6,700

Right dataset:
EmpID    Department
EMP001   IT
EMP003   IT
EMP004   Marketing
EMP007   Finance
EMP009   Marketing
EMP010   Finance

Result:
EmpID    Salary    Department
EMP001   $ 5,000   IT
EMP002   $ 5,500
EMP003   $ 5,200   IT
EMP004   $ 6,000   Marketing
EMP007   $ 5,800   Finance
EMP008   $ 6,700


Left Semi Join
Left dataset:
EmpID    Salary
EMP001   $ 5,000
EMP002   $ 5,500
EMP003   $ 5,200
EMP004   $ 6,000
EMP007   $ 5,800
EMP008   $ 6,700

Right dataset:
EmpID    Department
EMP001   IT
EMP003   IT
EMP004   Marketing
EMP007   Finance
EMP009   Marketing
EMP010   Finance

Result:
EmpID    Salary
EMP001   $ 5,000
EMP003   $ 5,200
EMP004   $ 6,000
EMP007   $ 5,800
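
The four join types can be sketched with plain dictionaries keyed by EmpID. This assumes the key column is unique in each dataset, which holds for the example tables but not for joins in general; the function names are assumptions and this is not how the Azure ML module is implemented. `None` stands for an empty cell.

```python
salaries = {"EMP001": 5000, "EMP002": 5500, "EMP003": 5200,
            "EMP004": 6000, "EMP007": 5800, "EMP008": 6700}
departments = {"EMP001": "IT", "EMP003": "IT", "EMP004": "Marketing",
               "EMP007": "Finance", "EMP009": "Marketing", "EMP010": "Finance"}

def inner_join(left, right):
    """Rows whose key appears in both datasets."""
    return [(k, left[k], right[k]) for k in left if k in right]

def left_outer_join(left, right):
    """Every left row; right columns are None when there is no match."""
    return [(k, left[k], right.get(k)) for k in left]

def full_outer_join(left, right):
    """Every key from either dataset; unmatched columns are None."""
    keys = sorted(left.keys() | right.keys())
    return [(k, left.get(k), right.get(k)) for k in keys]

def left_semi_join(left, right):
    """Left rows that have a match, keeping only the left columns."""
    return [(k, left[k]) for k in left if k in right]

for name, join in [("inner", inner_join), ("left outer", left_outer_join),
                   ("full outer", full_outer_join), ("left semi", left_semi_join)]:
    print(name, join(salaries, departments))
```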


Thank You..!
